Need for Psychologists, Linguists, ...

The vOICe seeing-with-sound technology generates unique identifiable soundscapes for any image or video frame. This "sonification" is done in a perfectly general and systematic manner, provably preserving much of the visual content, and largely within the main known psychophysical limitations of human hearing. Technically, the vision substitution and synthetic vision approach provides access to any visual information through an auditory display.

However, learning to see through sound is not unlike learning to understand a foreign language. When hearing speech in a foreign tongue, one immediately hears lots of complicated sound variations, but at first these seem to make no sense at all. Moreover, in languages such as French, word boundaries are largely absent from the physical signal (there are no pauses between subsequent words) and must therefore be mentally reconstructed (segmented) through auditory pattern recognition, using knowledge about which words do or do not exist in a given language. Learning to do this obviously requires extensive training, because it involves the application of a large "dictionary" of words plus a number of grammar templates.

[Figure: 3D spectrogram of a parked car soundscape.]

``Meijer (1992) has developed a device (The vOICe) that converts 2-D spatial images into time-varying auditory signals. While based on the natural correspondence between pitch and height in a 2-D figure, it seems unlikely that the higher-level interpretive mechanisms of hearing are suited to handling complex 2-D spatial images usually associated with vision. Still, it is possible that if such a device were used by a blind person from very early in life, the person might develop the equivalent of rudimentary vision.''

Source: "Sensory Replacement and Sensory Substitution: Overview and Prospects for the Future", by Jack Loomis, in "Converging Technologies for Improving Human Performance: Nanotechnology, Biotechnology, Information Technology and Cognitive Science" (US NBIC 2002), NSF-DOC Converging Technologies Report, p. 221, June 2002.

Similarly, with The vOICe one will need to hear out visual objects from complex visual scenes in the form of auditory objects, which implies a mental segmentation at a more abstract level than what is commonly meant by the terms auditory grouping and segregation. The required segmentation for seeing with sound will have to be learnt, as it is not innate. An advantage over natural language is that the basic elements in The vOICe mapping are very simple, logically and directly linked to the visual information: the sound of a bright open circle can be rationally and consciously "auralized" from knowledge of what constitutes a circle and of how The vOICe mapping acts on it to create two simultaneous tones, one going up and then down and the other going down and then up, with a two-tone split on the left and a matching tone merger on the right. This is different from natural language, where there is no reason why a circle should result in the word "circle"; indeed, different languages use different words to denote (the imageable concept of) a circle. In this sense The vOICe language is universal as well as culturally neutral.
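The column-by-column image-to-sound mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not The vOICe's actual implementation: all parameter values (scan duration, sample rate, frequency range, image resolution) are assumptions chosen for the example.

```python
import numpy as np

def sonify(image, duration=1.0, sample_rate=8000, f_lo=500.0, f_hi=5000.0):
    """Sketch of a vOICe-style mapping: scan a 2-D image (rows x cols,
    brightness 0..1) column by column from left to right; each pixel row
    maps to a tone frequency (higher rows = higher pitch) and pixel
    brightness maps to tone loudness. Parameter values are illustrative."""
    rows, cols = image.shape
    samples_per_col = int(duration * sample_rate / cols)
    # Exponential frequency spacing; row 0 is the top of the image.
    freqs = f_lo * (f_hi / f_lo) ** (np.arange(rows)[::-1] / (rows - 1))
    t = np.arange(samples_per_col) / sample_rate
    audio = []
    for c in range(cols):
        col = image[:, c]
        # One sinusoid per pixel row, weighted by that pixel's brightness.
        tones = col[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)
        audio.append(tones.sum(axis=0))
    return np.concatenate(audio)

# A bright open circle on a dark background: sonified, it yields two
# simultaneous frequency sweeps, one rising then falling (upper arc) and
# one falling then rising (lower arc), splitting on the left and merging
# on the right.
n = 32
y, x = np.mgrid[0:n, 0:n]
r = np.hypot(x - n / 2, y - n / 2)
circle = ((r > n / 3 - 1) & (r < n / 3 + 1)).astype(float)
signal = sonify(circle)
```

Writing `signal` to an audio file and inspecting its spectrogram would show the two arcs of the circle as the two crossing frequency sweeps the text describes.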

Still, to deal efficiently with visual scenery, one will have to learn to process visual information largely subconsciously, just as in normal visual reading one no longer spells out all the characters one by one for reasons of speed, but rather matches "word images" on-the-fly or even picks out the main keywords from multiple lines of text as in speed-reading. Many acquired human abilities share this characteristic of largely subconscious processing: riding a bike by consciously turning left or right as the bike begins to lean left or right almost guarantees a fall, even though the "algorithm" by itself is correct and simple. The early learning phase can be sped up through conscious analysis, but beyond that, subconscious neural processing should take over sooner or later.

Note: To convince yourself how poor the performance of rational, conscious control actually is in real-time tasks, you can subject yourself again to an early learning phase by trying to ride your bike with your arms crossed. Disclaimer: be very careful, because you are likely to fall!
In addition to the "dictionary" of real-world objects, there is the "grammar" of how objects can interact and have a (visual) relation to each other, how visual perspective and shading can affect the appearance of one and the same object, correlations with multimodal sensory-motor interactions, and so on. In other words, the mental task becomes to learn to hear out the key invariants, particularly objects, in a scene despite the various transformations that change the appearance of objects through lighting, position and orientation, (partial) occlusion by other objects, parallax and visual perspective in general. On the one hand the human brain thus has to somehow compensate for these transformations to know what shapes, objects and textures are present in a scene, but on the other hand these transformations also provide information about (relative) positions and other properties, as well as "circumstantial evidence" determining the probability that any particular item is indeed part of a given scene. Many visual illusions are in fact based on concocting circumstantial evidence to support a wrong interpretation. Pattern recognition is to a large extent an implicit process where a priori knowledge about the world, plus many soft and hard constraints from the sensory input, together yield a probably-correct interpretation of what makes up a given visual scene.

Obviously, there are many issues that go well beyond the technical and psychophysical issues that have been covered so far. Devising and applying a good training schedule, perhaps in an educational setting, may make the difference between success and failure. Many things are taught at school, and seeing with sound may become part of school training for blind children, because it offers them a certain degree of access to the visual information that plays such an important role among the sighted. Psychologists, neuroscientists, linguists and representatives from other relevant scientific disciplines are therefore invited to join and participate in establishing the possibilities and limitations of seeing with sound. From the analogies with language, one might predict a critical age beyond which seeing with sound will no longer become second nature in spite of extensive training, but for now this is only a conjecture.

In the conclusions of the IEEE BME paper on An Experimental System for Auditory Image Representations (1992), the need for complementary work beyond the demonstration of technical feasibility was outlined as follows:

``However, the further development towards a practical application of the system still awaits a thorough evaluation, with blind persons. This is needed to determine and quantify the limits of attainable image perception, before any firm conclusions about the usefulness of the system can be drawn, because only the technical feasibility has now been established. Such an evaluation should involve both (young) children and adults, while distinguishing the congenitally blind from the late-blinded, because neural adaptability may drop rapidly with age, and the neural development may strongly depend on prior visual experiences.''

In the Cross-Modal Sensory Streams presentation and demonstration given at SIGGRAPH 98, Florida, the need for follow-up was reiterated in terms of the need for multidisciplinary research: ``Cooperation between engineers, neuroscientists and psychologists to further evaluate the options would be a logical next step.''

Limitations in The vOICe approach?

The implications of some issues are fairly well understood, such as the mathematical and technical limitations imposed by the frequency-time uncertainty relation, and the perceptual limitations imposed by the just noticeable difference (JND) for frequency discrimination with pure tones or by the role of critical bands in discriminating complex sounds. Unclear is the role of informational masking and (innate) auditory streaming or segregation, and little is known about the cross-modal neural pathways and their equivalent bandwidth for transferring information, or about the neural processing and (age-dependent) neural plasticity for processing cross-modal sensory streams. Also to be clarified are the implications of the audio-visual semantic congruence imposed by The vOICe mapping.

In the area of psychology and education, it is unknown how serious a problem it is that people often, at least initially, consider the soundscapes unpleasant, or whether this judgement subsides as one learns to master the interpretation of soundscapes and comes to appreciate the content, just as the ugly blobs of ink called letters can turn out to make a beautiful poem or book once one has mastered reading through years of practice. Visual sounds seem to be criticized for their unpleasantness mainly by sighted people, probably because they have nothing to gain from the visual information in the soundscapes and therefore only get irritated by the sounds of the auditory display. Yet it is also unknown what minimum results are required for the majority of blind users to consider seeing with sound valuable or useful. Some specialized options such as color identification are easily applied, while other dedicated features such as the math function plot may readily serve blind students with an interest in learning mathematics, but the potential of seeing with sound goes far beyond that if people can manage to master it.
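The frequency-time uncertainty relation mentioned above can be illustrated with back-of-the-envelope arithmetic. The scan time and column count below are hypothetical example values, not fixed properties of The vOICe:

```python
# Illustration of the frequency-time uncertainty relation df * dt >= ~1
# for a column-scanning sonification. The values (1 s scan, 64 columns)
# are illustrative assumptions.

scan_time = 1.0    # seconds per image scan (assumed)
columns = 64       # horizontal image resolution (assumed)

dt = scan_time / columns   # time available per image column: 15.6 ms
df = 1.0 / dt              # best-case frequency resolution: 64 Hz

print(f"time per column: {dt * 1000:.1f} ms")
print(f"frequency resolution limit: {df:.0f} Hz")
```

Doubling the horizontal resolution halves the time per column and doubles the frequency resolution limit: finer detail along the time (horizontal) axis necessarily costs coarser detail along the frequency (vertical) axis, and vice versa.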
So for the general application of seeing with sound by blind people, a major source of concern is how to devise and execute a training program with proper guidance, to help blind students stay motivated while they are dealing with the steep learning curve, possibly using exercises that are simple enough to show perceivable progress, interspersed with immersive exposure to real-life situations to get acquainted with real-life visual complexity and structure. Teaching a foreign language may again form a useful analogy for devising and executing a good and balanced long-term training program. Psychologists might also wish to consider the proposed immersive usage of The vOICe during training in the context of James Gibson's ecological theory of perception (Gibson, J.J., 1979, "The ecological approach to visual perception," Boston: Houghton Mifflin). For some of the more philosophical issues involved with synthetic vision via an image-to-sound mapping, see the page on sound-induced mental imagery. For some user accounts, see the page on what blind users say about The vOICe. One experienced late-blind user of The vOICe described the learning-to-see effort like this: "Most people want that magic bullet giving instant sight. While this does occur over time, it is a developed thing."

Clearly, covering all of this spans many scientific disciplines. It is hoped that in the future there will be many more scientific publications, from various disciplines such as psychology, linguistics and neuroscience. Apart from the practical uses, it is hoped that The vOICe may also play a role in alleviating the psychological pain of those who go blind through trauma or disease, perhaps helping to prevent or limit the deep depression that often afflicts those who go blind or otherwise crave to gain or regain visual input. Those who have work under way in this context, or already published, are welcome to report it. In the meantime, one may check out some literature references.

Copyright © 1996 - 2024 Peter B.L. Meijer