
Image Mapping Options

In principle, direct cortical stimulation [7, 20] would appear to be the best approach, as it need not interfere with other senses. However, the major interfacing problems and the invasive nature of the approach, together with the low-resolution results obtained so far, probably make it a long-term research item.

As an immediate second choice, when considering interference, a vibrotactile or electrocutaneous skin-stimulating system seems the most suitable, because much of the skin normally plays only a subordinate role as an objective communication channel, e.g., under mobility conditions [1, 12, 16, 19, 21, 22]. That very same aspect might indirectly also be its major disadvantage, because there is as yet no strong indication, nor an evolutionary basis, to expect the existence of all the biological ``hardware'' for a high-bandwidth communication channel, not just at the skin interface, but all the way up to the cognitive brain centres. In order to save on neural wiring, nature could have provided us with coding tricks that give reasonable resolution when discerning a few skin positions, but that lead to severe information loss when stimulating a large matrix of skin positions. Extrapolation of local resolution experiments to global resolution estimates is therefore risky. Pending decisive experimental results, these considerations, which are themselves debatable, should not be held as an argument against the interesting work on tactile representations, but as an argument to concurrently investigate the merits of other sensory representations.

Therefore, knowing the importance of bandwidth in communicating data, in our case detailed environmental information, an obvious alternative is to exploit the capabilities of the human hearing system, which is the choice made in this paper. Although we cannot claim that this is the best possible choice, it is known that the human hearing system is quite capable of learning to process and interpret extremely complicated and rapidly changing (sound) patterns, such as speech or music in a noisy environment. The available effective bandwidth, on the order of 10 kHz, could correspond to a channel capacity of many thousands of bits per second. Still, there may be restrictions, imposed by nonlinearities in the mechanics of the cochlea and the details of information encoding in the neural architecture [8]. In spite of these uncertainties, the known capabilities of the human hearing system in learning and understanding complicated acoustical patterns provided the basic motivation for developing an image-to-sound mapping system. For experimental work under laboratory conditions, the interference with normal hearing need not bother us yet. Moreover, it is not necessarily an insurmountable problem, provided the system does not block the hearing system to such an extent that it becomes impossible to use it for speech recognition or to perceive warning signals. If the system only blocks subtle clues about the environment, normally perceived and used only by blind persons, while at the same time replacing those with much more accurate and reliable ``messages'', it will be worth it. As an example of a subtle clue, one may think of the changes in the perceived sound of one's footsteps when walking past a doorway.
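
As a rough illustration of this bandwidth argument (not part of the original analysis), the Shannon-Hartley bound C = B log2(1 + SNR) can be evaluated for the 10 kHz effective bandwidth mentioned above; the 30 dB signal-to-noise ratio in the sketch below is purely an illustrative assumption, and actual perceptual capacity will lie far below this physical upper bound:

    # Rough Shannon-Hartley upper bound on auditory channel capacity.
    # The 10 kHz effective bandwidth comes from the text above; the
    # 30 dB signal-to-noise ratio is an illustrative assumption.
    import math

    bandwidth_hz = 10e3                # effective hearing bandwidth
    snr_linear = 10 ** (30 / 10)       # assumed 30 dB SNR, linear scale

    capacity_bps = bandwidth_hz * math.log2(1 + snr_linear)
    print(f"{capacity_bps / 1e3:.0f} kbit/s")   # ~100 kbit/s upper bound

Even a small fraction of such a bound is consistent with the ``many thousands of bits per second'' estimate above.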

Concerning the input from the environment, true visual (camera) input has some major advantages over the use of sonar beam reflection patterns. The possibilities of sonar have been investigated rather extensively in the past [3, 5, 15, 23]. The short range of sonar makes it impossible to perceive far-away objects, e.g., buildings, the perception of which is essential for mobility guidance. Furthermore, visual input matches some of the most important information providers in our surroundings, such as road signs and markers. Contrary to sonar, visual input could also provide access to other important information sources like magazines and television. Technically, it is difficult to obtain unambiguous high-resolution input using a scanning sonar, while any commercially available low-cost camera will do. It should be added, however, that this statement only holds if the camera input is supplemented by depth clues through changes in perspective, to resolve ambiguities in distance. (Relative) depth information can be derived from the evolving relative positions of the viewer and his environment, combined with knowledge about the typical real size of recognized objects, as one readily verifies by looking with one eye covered. Therefore, an equivalent of binocular vision is no strict prerequisite, and doubling of the more expensive hardware parts (such as the camera) can be avoided. Nevertheless, it should be noted that the approach described in this paper, presently involving one camera, lends itself quite well to a future binocular extension, by mapping left-eye images to sound patterns for the left ear and right-eye images to sound patterns for the right ear.
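
To make the known-size depth clue concrete (this example is not from the paper), the standard pinhole-camera relation gives an object's distance from its known real size and its apparent image size; the focal length and object dimensions below are illustrative assumptions:

    # Monocular distance estimate from a recognized object of known size,
    # using the pinhole relation image_size / focal = real_size / distance.
    # All numeric values are illustrative assumptions.

    def distance_from_known_size(real_size_m, image_size_px, focal_px):
        """Estimate the viewer-to-object distance in metres."""
        return real_size_m * focal_px / image_size_px

    # Example: a doorway of about 2 m height, imaged at 50 pixels with a
    # 500-pixel focal length, is roughly 20 m away.
    print(distance_from_known_size(2.0, 50, 500))   # 20.0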

In order to increase the image resolution obtainable via an auditory representation, a time-multiplexed mapping is performed to distribute an image in time. A somewhat related method has been applied by Kaczmarek et al. [16] in the design of tactile displays. In this paper, however, the two-dimensional spatial brightness map of a visual image is scanned one-to-one and transformed into a two-dimensional map of oscillation amplitude as a function of frequency and time. The same basic functionality was considered by Dallas and Erickson [6]. It is likely that subdividing only one of the two spatial dimensions of the visual image into individual scan lines will be more amenable to a later mental integration or synthesis than subdividing both dimensions into rectangular patches. Only a one-dimensional mental synthesis will then be required for image reconstruction, in contrast to the two-dimensional scanning scheme proposed by Fish [11] and the sequencing scheme used in [16].
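
As a minimal sketch of such a time-multiplexed scan (the actual mapping and its parameters are specified in the next section; the image size, frequency range and scan time below are illustrative assumptions), each image column becomes one time slot, each row one oscillator frequency, and pixel brightness sets the oscillation amplitude:

    # Sketch of a time-multiplexed image-to-sound scan: the image is read
    # column by column (left to right in time); within a column, the row
    # index sets the oscillator frequency and pixel brightness its amplitude.
    # Image size, frequency range and scan time are illustrative assumptions.
    import numpy as np

    def image_to_sound(image, fs=44100, scan_time=1.0,
                       f_lo=500.0, f_hi=5000.0):
        n_rows, n_cols = image.shape             # brightness values in [0, 1]
        freqs = np.linspace(f_hi, f_lo, n_rows)  # top row -> highest pitch
        col_len = int(fs * scan_time / n_cols)   # samples per column
        t = np.arange(col_len) / fs
        signal = np.concatenate([
            sum(image[r, c] * np.sin(2 * np.pi * freqs[r] * t)
                for r in range(n_rows))
            for c in range(n_cols)
        ])
        return signal / max(1.0, np.abs(signal).max())  # normalize to [-1, 1]

    # Example: a 16x16 image with a bright diagonal yields a pitch sweep
    # over the one-second scan.
    audio = image_to_sound(np.eye(16))

Phase continuity across columns and the exact frequency distribution are treated more carefully in the actual system; this sketch only shows the time-multiplexing structure.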

The human brain is far superior to most if not all existing computer systems in rapidly extracting relevant information from blurred, noisy and redundant images. Therefore, no attempt was made to deliberately filter the redundancy out of the visual images being transformed into sound patterns. From a theoretical viewpoint, this means that the available bandwidth is not exploited in an optimal way. However, by keeping the mapping as direct and simple as possible, we reduce the risk of accidentally filtering out important clues. After all, a perfectly non-redundant sound representation is particularly prone to loss of relevant information in the imperfect human hearing system. Also, a complicated non-redundant image-to-sound mapping may well be far more difficult to learn and comprehend than a straightforward mapping, while the mapping system would increase in complexity and cost.

