Real-Time Closed Captioning for the Blind

Uncensored Captioning of All Video Content by The vOICe


Closed captioning for the blind is a technique for annotating video, television broadcasts and movies such that blind people get a description of what is going on visually in a scene. Normally this is done by transmitting additional hidden information along with the broadcast, which is then decoded into, for instance, a spoken description. Not only does this technique require special hardware and (modified) transmission channels plus significant human resources for creating the captioning content, it is also inevitably incomplete in its descriptions of scenes, and it shows significant latencies in providing access to live events and news reports. This page outlines an alternative or complementary approach, based on The vOICe technology. Beware though that applicability hinges on the human ability to learn to see with sound.

Video frame source: CNN television. Single video frame captioned by The vOICe (MP3 audio sample).

The vOICe technology may complement regular closed captioning and alleviate several of its fundamental weaknesses, by offering a form of closed captioning that is visually complete, as well as extremely concise, while operating in real-time (or very close to that, with typical auditory description latencies of less than one second). The synthesized audio is automatically synchronized to the visual content from which it is derived on-the-fly. Moreover, it does not require any technical changes to the video source, the studio or the transmission channel, and it functions in situations where no regular closed captions are available, because the non-speech captioning is derived and encoded directly and automatically from the original video content upon arrival.

The digital audio rendering applies to any video input source, by mapping brightness to loudness, elevation to pitch and lateral position to time in each image scan (using stereo panning for enhanced perception). Thus the greyscale content of any image or visual scene can be represented in sound. Moreover, this form of digital captioning is completely independent of language through its use of non-speech audio, which adds to its major economic advantages.
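For readers who want a concrete feel for this kind of mapping, the following Python sketch renders a greyscale image as a one-second stereo soundscape. It is only an illustration under assumed parameters (a 1-second scan, a 500 to 5000 Hz pitch range, exponential frequency spacing, and a soundscape() helper written for this example); it is not claimed to reproduce The vOICe's actual settings or implementation.

import numpy as np
from scipy.io import wavfile

def soundscape(image, scan_time=1.0, f_lo=500.0, f_hi=5000.0, rate=44100):
    """image: 2D array with values in 0..1, row 0 = top of the image."""
    n_rows, n_cols = image.shape
    # Exponentially spaced row frequencies: top row -> highest pitch.
    freqs = f_lo * (f_hi / f_lo) ** (np.arange(n_rows)[::-1] / (n_rows - 1))
    samples_per_col = int(rate * scan_time / n_cols)
    left, right = [], []
    for c in range(n_cols):
        # Continuous time axis so each row's sinusoid stays phase-coherent.
        t = (np.arange(samples_per_col) + c * samples_per_col) / rate
        # One sinusoid per row, weighted by that pixel's brightness (loudness).
        chunk = (image[:, c][:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0)
        pan = c / (n_cols - 1)               # 0 = far left, 1 = far right
        left.append(chunk * (1.0 - pan))
        right.append(chunk * pan)
    stereo = np.stack([np.concatenate(left), np.concatenate(right)], axis=1)
    stereo /= np.abs(stereo).max() + 1e-9    # normalize to avoid clipping
    wavfile.write("soundscape.wav", rate, (stereo * 32767).astype(np.int16))

# Example: a diagonal line from lower left to upper right is heard as a
# tone sweeping from left/low-pitched to right/high-pitched over one second.
img = np.zeros((64, 64))
img[np.arange(63, -1, -1), np.arange(64)] = 1.0
soundscape(img)

The example image of a rising diagonal line makes the mapping audible in its simplest form: later columns arrive later in time and further to the right in the stereo field, while higher rows sound at higher pitches.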

This approach, as currently implemented by The vOICe for Windows, is extremely general, and provably preserves much of the visual content, at least from a technical perspective, while it also meets several known psychoacoustic constraints that can further limit human hearing. However, there still exist many open questions about the human ability, and willingness, to learn to perceive, comprehend and make use of this seeing-with-sound technology: the sounds of real-life visual scenes are often extremely complicated. Yet, pending outcomes of further research in that area, the approach is in principle attractive for economic reasons, as well as for the independence it offers to blind viewers. Also, some events, such as scene changes due to jumps in camera position, or the scrolling text at the end of a movie, are easily heard and noticed. No special technical infrastructure is needed other than provisions for the client-side audio rendering of video streams as offered by The vOICe. Blind people no longer need to trust and rely solely on the interpretation of sighted people to tell them what is happening visually through a narrow selection of items of interest, and this avoids any form of censorship. With human-made verbal or textual video descriptions, censorship would be inevitable, even with the best of intentions, as a consequence of the required information reduction.

Verbal or textual descriptions of video are in practice always very incomplete, because only by skipping many of the visual items and textures can one attempt to keep up with the rapidly changing scenery in typical television broadcasts.
New York skyline before/after the WTC attack of Sept 11, 2001. Download the animated GIF and import it into The vOICe (Control O). Best heard in negative video (F5).
CNN logo captured by The vOICe from a live CNN television broadcast (Control S). You can download the GIF image and import it into The vOICe (Control O).
In contrast, the use of soundscapes as created by The vOICe offers ultra-compact audio descriptions of visual scenes that can keep up with all but the fastest changes while preserving most of the visual content. This is what real-time closed captioning is all about. An MP3 audio sample is included here to illustrate how one second of sound captures even complex abstract visual scenes: by activating the following link, you can listen to an 18K MP3 soundscape of a video frame from a CNN television broadcast, namely from the graphically rich introduction to one of their news shows. A single second of 3D-enhanced complex sound represents the entire video snapshot, sounding the CNN logo, several concentric circles around a globe, and other visual details. The soundscape was automatically generated in a split second for the television snapshot image shown above, using a regular Video-for-Windows compliant PC television card. Instead of video from the cable TV input that was used in this case, the video input could just as easily have been captured from your home video camera, VCR, DVD player or DSS (Digital Satellite Service). If your capture card supports Digital Television (DTV) standards, this allows for automatic DTV captioning as well, because the whole approach is based on back-end rendering of original video content. Note that for the above single-frame example, your media player is best set to auto-repeat in order to hear the soundscape again and again, making it easier for the untrained ear to mentally focus on hearing different parts within the soundscape, for instance on the left and on the right, and at a low and at a high pitch.
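As a rough sketch of how such client-side captioning could run on arriving video without any changes to the source or transmission channel, the loop below grabs frames from a capture device and renders each one with the illustrative soundscape() helper from the earlier sketch. The use of OpenCV, the device index 0 and the 64 by 64 frame size are assumptions made for this example only, not a description of The vOICe's actual pipeline.

import cv2
import numpy as np

cap = cv2.VideoCapture(0)                        # first available capture device
while True:
    ok, frame = cap.read()
    if not ok:
        break                                    # no more frames from the source
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # greyscale content only
    small = cv2.resize(gray, (64, 64)) / 255.0       # coarse grid, values 0..1
    soundscape(small, scan_time=1.0)             # ~1 s of audio per captured frame
    # A real implementation would stream this audio to the sound card while
    # already grabbing the next frame, keeping latency near one scan period.
cap.release()

The same loop structure applies whether the frames come from a TV tuner card, a camera, a VCR or a DVD player, which is the point of back-end rendering: the captioning is computed from whatever video actually arrives.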

Another television-based MP3 sound sample, and links to still other MP3 samples concerning different visual contexts, can be found on the Television for the Blind page. For more general information about closed captioning and video description, visit the CPB/WGBH National Center for Accessible Media (NCAM) website - including information on the DTV Access Project on closed captioning and video description services for DTV, as well as on the Motion Picture Access Project (MoPix) on Rear Window Captioning (RWC) and DVS Theatrical Systems.

Copyright © 1996 - 2024 Peter B.L. Meijer