The vOICe Mini-Tutorial

The following "mini-tutorial" on The vOICe for Windows is about experimental software for blind people. The software maps graphical views from a PC camera, computer screen, scanner or image file into closely corresponding (complex) sounds. With respect to applications in higher education, a built-in auditory graphing calculator is available for users of a modern screen reader. Color identification is also included. The software is available free of charge for personal or academic (non-commercial) use.

Blind person forms mental image from soundscape of arbitrary image
Sensory substitution using soundscapes for artificial synesthesia through mental imagery

The vOICe technology, spelled with capitals O I C in "vOICe" (Oh I see!), is a very powerful and general approach towards seeing with sound. It is not based on sonar or echolocation, but uses real visual input from a PC camera or "webcam" (other input options will be discussed further down). PC cameras are quite cheap nowadays, typically in the $50 to $100 range. Through special software this lets blind people carrying a portable PC and a PC camera hear live views from their environment through their stereo headphones, thus hearing the very same shapes and things that their sighted friends see with their eyes. The software translates images from the camera on-the-fly into closely corresponding sounds. For instance, a bright speck of light gives a short beep. If this bright speck is on your left, you will hear the beep on your left side, and if it is on your right, you will hear it on your right side. If the speck moves up, you will hear the pitch of the beep move up, and if the speck moves down, so does its pitch. With two specks you get two beeps, with three specks three beeps, and so on. A horizontal bright line yields a long tone, because the specks that make up this line give a corresponding concatenation of beeps in time, sounding as a pure tone. Again, if the whole line moves up or down, so does its pitch. A vertical line is a stack of specks, sounding all at the same time but all with different pitches since they are at different heights. Together this sounds like a brief noise burst. An instructive experiment would be to move your cane in front of the camera, because the cane appears visually simply as a bright white line, and you can play with the orientation of the cane and how that affects its sound.

Simple line and dot images

The above image to sound mapping allows for sounding any visual scene, but the more complex the view, the more complex the sound will be. It takes about a second to sound the entire content of a view, and every second the sound will be "refreshed" to reflect any changes in the visual content of scenery as captured by the camera. These one-second sounds that contain the whole view are called "soundscapes", and they sound the visual content via a left to right scan with pitch indicating elevation and loudness indicating brightness. Please be warned though that real-life visual scenery is typically so very complex that you will be totally and utterly confused by what you hear! Vision just happens to be extremely complicated, and hearing real images and scenes is something totally new to your brain, whether you are congenitally blind or not. To the sighted it is just as confusing because they have learnt to process images via their eyes, not via their ears. One of the biggest and most important open questions and concerns is in fact how proficient users can become through extensive and immersive use of this technology. One cannot find out without trying though. Hopefully you will have a lot of fun experiencing all the visual input even if most of it does not make much sense to you yet. It may be a bit like listening to Chinese: seemingly entirely meaningless until you begin to master it through extensive practice and training. Got lost or bored with the above explanation? Then go try and experience The vOICe seeing-with-sound software for yourself! After all, a soundscape says more than a thousand words.
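The mapping described above (a left to right scan, with pitch for elevation and loudness for brightness) can be sketched in a few lines of Python. This is only an illustrative sketch and not the actual implementation of The vOICe: the function name `soundscape`, the frequency range, the exponential pitch scale, the sample rate and the simple linear stereo panning are all assumptions chosen for the example.

```python
import math

def soundscape(image, duration=1.0, sample_rate=11025,
               f_low=500.0, f_high=4000.0):
    """Turn a grayscale image into stereo samples by scanning its columns
    from left to right. `image` is a list of rows (top row first), each a
    list of brightness values between 0.0 (dark) and 1.0 (bright).
    All default parameter values are assumptions for this sketch."""
    rows, cols = len(image), len(image[0])
    # One sine frequency per row; the top row gets the highest pitch.
    freqs = [f_low * (f_high / f_low) ** ((rows - 1 - r) / max(rows - 1, 1))
             for r in range(rows)]
    n = int(duration * sample_rate)
    left, right = [], []
    for i in range(n):
        t = i / sample_rate
        col = min(i * cols // n, cols - 1)   # column being scanned right now
        # Sum the tones of all rows, each weighted by its pixel brightness.
        s = sum(image[r][col] * math.sin(2 * math.pi * freqs[r] * t)
                for r in range(rows)) / rows
        pan = col / max(cols - 1, 1)         # 0 = far left, 1 = far right
        left.append(s * (1 - pan))
        right.append(s * pan)
    return left, right

# A single bright speck in the top-left corner of a dark 4 x 4 image.
speck = [[0.0] * 4 for _ in range(4)]
speck[0][0] = 1.0
left, right = soundscape(speck)
```

With this sketch, the speck in the top-left corner gives a short high-pitched beep at the start of the scan in the left channel only, a bright row would give a sustained tone, and a bright column would give many pitches at once for a brief moment.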

The vOICe for Windows, for Windows-95, 98, ME, NT, 2000, XP, Vista, Windows 7, 8, 10 and Windows 11

The executable "voice.exe" for Microsoft Windows can be downloaded from The vOICe for Windows web page. The software was designed to look for and operate with a PC camera or webcam, which you may not have yet. So when you start the program, chances are that you get the spoken message saying "I'm sorry, The vOICe cannot find your webcam", after which a built-in complex sound starts repeating itself (the demo sound is a photograph of a parked car, with a building in the background and street surface and poles in the foreground). You should hear the stereo sound pan from left to right, or else you are wearing your headphones backwards or your loudspeakers are interchanged. By the way, don't let the weird demo sound scare you off, because you can still do various things with the software even if you do not have a webcam yet.

For instance, simply press function key F11, which toggles the built-in exercise mode on (and off). The demo sound disappears and you will hear an image containing two bright filled rectangles at random positions. You can press the space bar to enter the manual update mode for hearing a new random image with two rectangles placed at new random positions. You can even select a shape with the + and - keys on your numeric keypad and move the selected shapes around with your arrow keys. There exist various options in the Edit menu to change the type and number of visual shapes, but we will skip that for now. Some details of use can depend on what screen reader and sound card you have. In case the audio from The vOICe blocks the speech from your screen reader, toggling Control F2 will release the sound card to your screen reader again. This is a workaround for older Windows versions, and it should not be needed for Windows 2000 or XP and later.

Now how are all these shapes sounded? Well, in line with what was said above, every image gets scanned from left to right in about a second, while indicating elevation by pitch and brightness by loudness. This procedure is completely general and can sound any image. Of course, many images will at this stage simply be way too complex to make sense of. However, with the two bright rectangles of the exercise mode, you get two sound bursts, with duration depending on their width, pitch depending on their height and elevation, and loudness depending on brightness. For bright lines you would get tones that go up or down as the lines go up or down when viewed from left to right, and for bright specks you would get short beeps, again with pitch denoting their elevation.
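How the position and size of one bright rectangle translate into the timing and pitch range of its sound burst can be worked out as follows. This is a hedged sketch only: the helper name `burst_for_rectangle`, the 64 by 64 image size, the one-second scan and the 500 to 4000 Hz exponential pitch scale are assumptions for illustration, not documented settings of the software.

```python
def burst_for_rectangle(left_col, right_col, top_row, bottom_row,
                        cols=64, rows=64, scan_seconds=1.0,
                        f_low=500.0, f_high=4000.0):
    """Describe the sound burst of a bright rectangle in a cols x rows image.
    Columns count from the left, rows count from the top (row 0 = top).
    Image size and frequency range are assumed values."""
    col_time = scan_seconds / cols            # time spent on each column
    onset = left_col * col_time               # when the burst starts
    duration = (right_col - left_col + 1) * col_time

    def pitch(row):
        # Assumed exponential scale: bottom row -> f_low, top row -> f_high.
        return f_low * (f_high / f_low) ** ((rows - 1 - row) / (rows - 1))

    return {"onset_s": onset,
            "duration_s": duration,
            "lowest_hz": pitch(bottom_row),
            "highest_hz": pitch(top_row)}

# A rectangle covering the left half of the view, from top to bottom.
print(burst_for_rectangle(0, 31, 0, 63))
```

A wider rectangle gives a longer burst, a rectangle further to the right starts later in the scan, and a taller rectangle spans a wider range of pitches.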

Note that the stereo panning in scanning from left to right may initially seem like a motion in the scene, while in reality it is merely the effect of the scanning: the stereo panning is there also when nothing in the visual scene changes. It is like repeatedly sliding the side of your hand from left to right over a surface, say a piece of Braille print, thus feeling any changes in texture. Just as with sliding your hand, the left-to-right "movement" that you hear is not part of the scene that you view, but merely a way to scan it sequentially.

Want to hear some demo sounds before trying this software? Then listen to some WAV sound samples first. One 88K WAV sound sample illustrates how you can simultaneously hear lines and other basic visual shapes. This two-second sound sample is available directly from the audio file link voiscopebw2.wav.

Can you hear the bright curve and the ten little squares on the dark grey tiled background? Too easy? Then try the same with the one-second 44K sound sample from the audio file link voiscopebw.wav.

Note: your audio player should preferably be set to autorepeat in order to hear the same sound a number of times in succession for best conscious mental analysis. Sighted people can compare the above sound sample to the original GIF image as available right from the image file link voiscopebw.gif.

The sound samples were automatically synthesized from this image by importing the image into The vOICe for Windows software. Blind users of The vOICe for Windows can do this too: pressing Control O gives you an image file requester. Once an image is sounding, you can toggle function key F3 to slow down the image sound from one second to two seconds. Want to zoom into an image and move around? Then toggle function key F4 and use your arrow keys.

If you happen to have a flatbed scanner, chances are that you can use that as an input device as well! Just press Control Q to acquire a scan from your scanner and listen to whatever you had put on your scanner, even your hand. Use negative video via function key F5 to hear out any small dark objects on a bright background, as is typical for line drawings and printed material. Another simple experiment: if you had put colored clothing on your scanner, you only need to toggle function key F10 after the scan to hear its color named: The vOICe can easily double as a cheap color identifier, telling the color name of whatever is at the center of the view.
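Naming the color at the center of the view can be illustrated with a small Python sketch. It is not how The vOICe itself identifies colors: the function name `name_center_color`, the short list of color names and all thresholds are assumptions made up for this example.

```python
import colorsys

def name_center_color(image):
    """Return a rough color name for the center pixel of an image.
    `image` is a list of rows, each a list of (r, g, b) tuples with
    channel values from 0 to 255. Thresholds are illustrative guesses."""
    r, g, b = image[len(image) // 2][len(image[0]) // 2]
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if v < 0.2:                      # very dark, whatever the hue
        return "black"
    if s < 0.15:                     # hardly any color: white or gray
        return "white" if v > 0.8 else "gray"
    degrees = h * 360                # hue as an angle on the color wheel
    for upper, name in [(30, "red"), (90, "yellow"), (150, "green"),
                        (210, "cyan"), (270, "blue"), (330, "magenta")]:
        if degrees < upper:
            return name
    return "red"                     # hues from 330 to 360 wrap around to red

# A 3 x 3 scan of blue cloth.
print(name_center_color([[(0, 0, 255)] * 3 for _ in range(3)]))
```

Only the single center pixel is inspected here; a more robust version would probably average a small region around the center to reduce the effect of noise and fabric texture.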

Want to hear what's around your mouse pointer on your screen? Then toggle function key F9 and move your mouse or use your arrow keys to move around the graphical user interface as it shows on the screen. You may hear borders of windows and icons and such. Anything graphical.

As you see, there is a myriad of options to play with, depending on what you want to do. There is even a built-in scientific graphing calculator for doing math: toggle function key F8, which by default sounds a graph of a sine wave with a horizontal and vertical axis, but with your screen reader you may set other functions and ranges via the standard Windows dialog box. The auditory graphing calculator option is described in more detail on the accessible graphing calculator web page, camera-based access to graphs and function plots is outlined on the printed graph web page, and there is also a web page with self-training notes. Just follow the links for topics that interest you.

For full and independent access to all the features and options of The vOICe for Windows software, blind users will need a screen reader for reading menus and dialogs, although some very basic speech feedback is included in The vOICe. Free screen readers for starters are available as the third-party applications Thunder, eSpeak and NVDA (The vOICe is not affiliated with these products in any way).

Going mobile for truly immersive visual experiences.

Once you have a webcam and its drivers installed on a notebook PC, netbook or tablet, you can put the PC in a backpack and you can tape or strap the webcam to your stereo headphones, thereby making for a simple mobile setup with a head-mounted camera. For convenient and affordable head-mounted webcam options, check out the USB camera glasses web page. (Note: The vOICe is not affiliated with these products in any way.)

Home-Made Setup (wearable webcam): Simple home-made setup for The vOICe with a tiny USB webcam, in this case a Creative Webcam NoteBook webcam (CMOS) glued on top of regular stereo headphones.

Improvised Setup (wearable webcam): This blind user wears a simple home-made setup for The vOICe with a head-mounted CCD webcam on his cap.

Whatever you do with a mobile setup, make sure that your personal safety and that of others is ensured, because most of the visual information will only be very confusing and distracting at first, and the sounds will inevitably mask the environmental sounds a bit. It is strongly advised to start playing with it in a very familiar and safe environment, such as your home. That also makes it easier to relate what you see/hear to what you already know is there. In any case, it is assumed here that you use a cane and/or a guide dog: The vOICe can supplement but not replace a long cane or guide dog. Now we will briefly discuss some aspects of vision that may be useful to know for congenitally blind people who may be only partly familiar with the various visual concepts.

To some extent, moving around totally blind is a bit like hopping from object to object with perceptual gaps in between - unless the objects themselves emit sounds. This is clearly an oversimplification, but you get the idea. The next object or obstacle often comes into view, by touch or echo, only after losing contact, again in terms of touch or echo, with the former object. With vision, there is a greater continuity in relating to objects because often several objects are in view at the same time, and the next object comes into view before the former object vanishes from the view. So there is a greater amount of perceptual overlap, even if the distance between objects or landmarks is fairly large. That also helps in following a route. By the way, everything said here about vision also applies to the soundscapes from the camera, because the soundscapes contain the same visual information.

In addition to often having several objects in view that are more or less in the foreground, there is also a visual background. That background is anything distant that is not occluded by the objects in the foreground. It could be a skyline of houses or buildings or whatever there is at a distance. Since distant things appear smaller visually, this background is often highly cluttered with tiny details, and one of the things that takes a lot of brain processing is to figure out what is in the foreground and what is in the background. That is difficult, although the sighted have had a lifetime of training to do it without apparent effort, but we can say something about what clues the brain gets and uses for that purpose. One of the key things is apparent change. A door at ten meters distance may appear visually the same as the side of a building at a hundred meters. Both can look like shaded rectangles filling part of the view. However, as you make a few strides towards the door, the door will appear to grow, while the building view barely changes. This is because your relative distance to the nearby door changes much more rapidly than the relative distance to the distant building. So although the visual shapes may at some point even be identical, the amount of change as you move tells something about which items are distant and which items are nearby.
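The door and building example can be made concrete with a little arithmetic on visual angles. The following Python sketch assumes a 2 meter tall door and a 20 meter tall building side (neither size is given in the text above), and the function names are likewise just illustrative.

```python
import math

def angular_size_deg(size_m, distance_m):
    """Visual angle in degrees of an object of a given size seen head-on."""
    return math.degrees(2 * math.atan(size_m / (2 * distance_m)))

def growth_after_walking(size_m, distance_m, step_m):
    """Factor by which the apparent size grows after walking step_m closer."""
    return (angular_size_deg(size_m, distance_m - step_m)
            / angular_size_deg(size_m, distance_m))

# Assumed sizes: a 2 m tall door at 10 m and a 20 m tall wall at 100 m
# fill the same visual angle, so they can look alike from a standstill.
door = (2.0, 10.0)
building = (20.0, 100.0)

for name, (size, dist) in (("door", door), ("building", building)):
    print(name, round(angular_size_deg(size, dist), 1), "degrees,",
          "growth after 3 m:", round(growth_after_walking(size, dist, 3.0), 2))
```

Under these assumptions both shapes start out with the same apparent size, but after three meters of walking the door appears roughly 40 percent larger while the building grows by only about 3 percent.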

It is the constant patterns within the soundscapes as you move that are part of the distant background (not counting the overall changes in pitch as you look up or down or the lateral shifts as you look left or right). So if you pay attention to those non-changing parts of the soundscape view, that will give you a frame of reference about your heading, somewhat like a "visual compass". The background too will change as you move along, but much more slowly.

Now imagine that you are walking towards a parked car. Don't try this kind of thing without proper safety precautions, such as having a sighted person with you, preferably an orientation and mobility instructor. With the camera pointing to this car, chances are that the car is still distant enough to let it appear fairly small. So much of the remaining view will be filled with whatever is around the car, including its background. Only when you get close to the car will it seem to grow until it completely fills the soundscape, but if you walk along it then it will simply drop out of view on your left or right while other things have already come into view, maybe the next parked car. So roughly speaking, apart from simple horizontal and vertical shifts, it is the perceived amount of change as you move that tells which things must be nearby. Knowing this does not make it easy, far from that, but it is useful to be at least aware of such visual principles while experiencing and learning about live soundscapes. Again, if you go mobile, make sure that whatever you do is in a safe environment, because your brain will be overwhelmed with new and confusing input that it cannot quite handle yet and your normal hearing will also be partly masked by the soundscapes.

In case you are new to vision, an analogy with natural hearing can be helpful to grasp those visual concepts about foreground and background. Suppose there is a distant road with traffic on your left and a distant school with the murmur of children playing on your right. As with the visual background, this auditory background won't change much as you make a few strides, although it will turn left as you turn right and vice versa. However, with this constant background continuously present, you may suddenly touch or start hearing the echo of a parked car or wall, so here too it is the nearby things that give the more rapid perceptual changes.

It is the distant background, be it from natural sounds or from the camera soundscapes, that helps you maintain your heading. The nearby objects are only confusing in that particular respect. Again to oversimplify, the nearby objects are of course important in mobility because they can be obstacles, while the distant ones, if perceived, are also important, but rather in orientation, in maintaining your overall heading, and in avoiding veering.

If you know of some metal fence or security door made of regularly spaced vertical bars, that can be a useful visual object to learn about the effects of visual perspective. Let's assume it is a fence. In using The vOICe, such a fence would give a very strong regular rhythm of noise bursts corresponding to the sequence of bars being automatically scanned from left to right. Hard to miss those kinds of gratings. At a distance the rhythm will be very fast. As you approach the fence, the "optical" rhythm slows down because the apparent visual spacing between the bars increases, until there are typically only a few bars in view when the fence gets within arm's reach. You will also find that when you are close to the fence and look along it, you will hear the optical rhythm move fast for the more distant parts of the fence and slowly for the nearby parts of the fence. It is all the consequence of distant things appearing smaller in a visual view. Once you start experiencing the live soundscapes immersively, you will find that some things can be made sense of even while you do not recognize the individual visual objects, and some rational background knowledge about the rules of visual perspective and what to pay attention to can be helpful in the beginning to make at least a little sense of the typical very high complexity of the camera sounds.
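The slowing rhythm of the fence can be estimated with simple geometry. The sketch below assumes a camera with a 60 degree horizontal field of view, bars spaced 12 centimeters apart, a one-second scan and a fence viewed head-on. None of these numbers come from the text above, and the function names are made up for the example.

```python
import math

def bars_in_view(distance_m, spacing_m=0.12, fov_deg=60.0):
    """Approximate number of evenly spaced bars across the camera view
    when facing a fence head-on. Spacing and field of view are assumed."""
    view_width_m = 2 * distance_m * math.tan(math.radians(fov_deg) / 2)
    return view_width_m / spacing_m

def bursts_per_second(distance_m, scan_seconds=1.0, **fence):
    """Rhythm of noise bursts heard while the view is scanned once."""
    return bars_in_view(distance_m, **fence) / scan_seconds

for d in (10.0, 3.0, 0.7):
    print(f"{d} m: about {bursts_per_second(d):.0f} bursts per second")
```

In this simplified model the rhythm is proportional to distance, so a fence ten times further away sounds ten times faster. In practice the bars of a very distant fence would at some point become too fine for the camera to resolve individually.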

What impact this general and affordable technology will have is largely up to you to decide and determine. Right now no one knows how proficient people can get at it. The learning curve is known to be extremely steep. Obviously, you don't have to use it. However, it is an option now for those who want it or need it. It is your choice whether or not to explore it.

Blind people using or interested in using The vOICe can also join The vOICe user group (mailing list).

You can also listen to a user report in RealOne/RealPlayer streaming audio format. That user report is titled "Seeing with sound: A journey into sight", and it was presented at the Tucson 2002 conference on Consciousness, April 8, 2002. The MP3 streaming audio link is tucson2002f.m3u.

Or you can listen to the feature in the CBC Radio One science program Quirks and Quarks of April 2005, at the MP3 streaming audio link qq-2005-04-02a.m3u.

You may consider starting with borrowed or second-hand components first, and generally it is advised to start cheap such that you can make up your own mind about whether the seeing with sound approach suits you, because it is not something that you will master overnight, and there are inconveniences and no guarantees that it will serve your goals.

Another tutorial for The vOICe has been prepared by Pranav Lal. An Italian manual (PDF file) for using The vOICe for Windows has been prepared by Giuseppe Masciopinto, a Russian manual (PDF file) for using The vOICe for Windows has been prepared by Radion Mynayev, and a Chinese tutorial and manual (PDF file) has been prepared by 石蟾. The latter is also available as a Chinese web page.

Copyright © 1996 - 2024 Peter B.L. Meijer