Mobile OCR, Face and Object Recognition for the Blind


The main goal of The vOICe vision technology is to offer an equivalent of "raw" visual input to blind people, via complex soundscapes, thus leaving the recognition tasks to the human brain. However, complementary to that it would be useful to have options for automatic recognition through computer vision technology. This page challenges object recognition engine developers to demonstrate the applicability of computer vision on mobile devices in real-life situations. It is an open invitation - with an open interface - to deliver convincing demonstrations for use with The vOICe for Windows. (Note that The vOICe for Android nowadays includes live mobile OCR for short texts.)

Mobile OCR

When walking around while wearing The vOICe with a head-mounted camera and stereo headphones - preferably all integrated in video sunglasses - it would be convenient if the blind user could occasionally have any text in the camera view automatically recognized and spoken by The vOICe, using speech recognition and speech synthesis for the user interface. The vOICe now supports all of this functionality by invoking an external OCR engine for optical character recognition (OCR) of any text in the camera view. This could in principle help with reading large print (headlines), street signs (names), name tags and labels on or beside office doors, LCD and LED displays of digital clocks, calculators, VCRs, microwave ovens, elevators, etcetera.

It must be stressed that right now we only present a proof of concept by integrating free OCR engines with The vOICe: the actual text recognition results with most OCR engines will generally be very poor, because these engines were not designed for use with low resolution live video from a PC camera. More development in this area is needed to arrive at more robust text recognition.

In other words, if you are a blind user hoping to find a reliable way of reading text with a wearable camera, you will most likely be very disappointed by what can be achieved right now, but we have to start somewhere or else there will be no progress.

Some OCR challenges
Headline view
Webcam closeup view of printed text. Video resolution is 176 by 144 pixels.
GOCR 0.50 returned: Neural _ Device a for
Tesseract 2.01 returned: N eural I Device a for

Clock display
Webcam closeup view of a clock radio LED display, shown both normally and in inverse video. Video resolution is 176 by 144 pixels. With The vOICe running in inverse video mode (function key F5):
GOCR 0.50 returned: - _ /3:_4
Tesseract 2.01 returned: l:`}!.;!

Screen area capture
Screen capture of a 176 by 144 pixel screen area (mouse area sonification, function key F9).
GOCR 0.50 returned: _%_t ___ _ _;_.._,. My Computer Recycle Bin _l ___ ._ ,.,! _ '_ Network DirectCable Neighborhood Connection
Tesseract 2.01 returned: E], E] W Lmg. REDM Bw ll glo Jl! zi. EJ

Looking for a restaurant?
Brick wall with a board naming an Indian restaurant.
GOCR 0.50 returned: _l_,__ __ _ _ _ _ _ _ _ _,,N,_D.,.,__A,,,,RH,xU,.,.U, uS.,,l_,,, , 0 :. ;, _. __.. . .. _ :!_,1, _ ;
Tesseract 2.01 returned: .» ,!.'“?'&¢ S.W¥‘§"L. *
State-of-the-art OCR engines like ABBYY FineReader and Nuance OmniPage are so far generally meant for use with high resolution scans of black print on uniformly white paper, using a flatbed scanner or digital camera as the input device. With webcam input, resolution is far lower (by default 176 by 144 pixels with The vOICe), the text background rarely has uniform brightness and is often visually cluttered (in natural scenes) and/or noisy, and the text is often misaligned and skewed due to the manual camera orientation and the effects of visual perspective and lens distortion. Until OCR engines have been improved to deal with such harsh visual conditions - as sighted people can - the results obtained with the procedures described on this page will often be very disappointing. Still, we need to create this proof of concept in order to stimulate academic research on character extraction and recognition algorithms, while enabling third-party developers of OCR engines to readily test their latest and still experimental OCR engines with The vOICe. The combination with The vOICe lets blind users easily aim and align the camera with print through the characteristic sound texture of print, thereby greatly reducing the burden for OCR engines to correct for misaligned and skewed camera views.

Installation involves the following steps, assuming that you had already downloaded the latest version of The vOICe for Windows executable (voice.exe):

  1. Download the zipped DLL file vOICeJPG.zip (100 K), and unzip this file to obtain the vOICeJPG.dll file.

  2. Download the image format converter program djpeg.exe (60 K).

  3. Download the GOCR version 0.50 Windows executable gocr.exe (about 159 K).

  4. Move the three files (vOICeJPG.dll, djpeg.exe and gocr.exe) to the same directory where The vOICe for Windows executable (voice.exe) is stored.

Purpose: the DLL file vOICeJPG.dll allows The vOICe program to save images in JPEG format (in addition to the BMP format), while the djpeg.exe program will be used to convert the JPEG image files to the PNM format image files as supported by the GOCR program gocr.exe (Win32 binary). The vOICeJPG.dll and djpeg.exe modules are based on the work of the Independent JPEG Group. The free OCR engine GOCR is an open source OCR project, with the GOCR homepage at  jocr.sourceforge.net and  www.gocr.de.

Now while running The vOICe, press Control R. The vOICe will then save your current view as an image snapshot file vOICe.jpg (as well as vOICe.bmp), and run the batch file recognize.bat. If this batch file does not exist (in the same directory as voice.exe), it will first be automatically created by The vOICe. By default, this batch file will just contain the two command lines

djpeg -greyscale -dither none vOICe.jpg vOICe.pnm
gocr -i vOICe.pnm -o vOICe.ocr

Note that any non-console programs must be prefixed by "start /w" to ensure that Windows first waits for a program to finish before starting the next program in the batch file, or else crashes may result if the next program attempts to read results written by the previous program. Sometimes it may also be useful to put a small delay in between commands. This can be done with extra command lines like "ping -n 1 127.0.0.1>NUL" that use a dummy timed ping to cause a delay, in this case a one second delay.
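As an illustrative sketch only, a modified recognize.bat that launches a hypothetical non-console (windowed) recognition program - here called myrecognizer.exe, which is a placeholder name and not an actual download - could combine these elements as follows:

djpeg -greyscale -dither none vOICe.jpg vOICe.pnm
REM dummy timed ping, used here only to insert a small delay between steps
ping -n 1 127.0.0.1>NUL
REM "start /w" makes Windows wait for the (hypothetical) windowed program to finish
start /w myrecognizer.exe vOICe.pnm vOICe.ocr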

So the JPEG snapshot will be converted to PNM format by the djpeg.exe program, and the resulting vOICe.pnm file will form the input for the OCR engine GOCR. The plain text results from GOCR are saved in an ASCII file vOICe.ocr. Once the batch job has finished, The vOICe will take control again and print the (filtered) contents of the vOICe.ocr file to a dialog. When done, The vOICe will resume normal live soundscape generation. In case the batch file window does not automatically disappear once the batch job has finished, check the "Close on exit" checkbox in the Properties | Program tab of the recognize.bat batch file. This should make the window automatically disappear in all later runs.

Simple command line based interfaces like the one used above are also commonly used in benchmarking studies and in competitions, such that efficient reuse is easily accomplished through very minor modifications, while file I/O is rarely a performance bottleneck as compared to the CPU time spent on the recognition.

Moreover, if you want to keep a growing history of snapshots that are not overwritten, you can add lines like

set d=%DATE%
set t=%TIME%
set dstr=%d:~10,4%-%d:~4,2%-%d:~7,2%
set tstr=%t:~0,2%h%t:~3,2%m%t:~6,2%s
copy vOICe.jpg "vOICe %dstr% %tstr%.jpg"

to automatically add a timestamp to a copy of every saved vOICe.jpg file (e.g., "vOICe 2006-07-25 21h12m30s.jpg"). One can view this as related to Microsoft's  SenseCam (MyLifeBits) project.
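Note that the %DATE% and %TIME% substring offsets in the lines above depend on the Windows regional date and time format, so on systems with a different locale the offsets may need adjustment to obtain a correct timestamp.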

Mobile face and object recognition

Third-party developers can simply modify the contents of the batch file recognize.bat, because The vOICe will not overwrite it once it exists. This open interface makes it very easy to replace the invoked OCR engine and to include other types of visual object and visual pattern recognition engines for use in a cognitive vision system (to implement an auto-tagging "virtual commentator", "virtual reporter" or "virtual sighted guide"). Artificial cognitive systems in general need to be fed with real-life data for training purposes.

The vOICe saves a camera snapshot on request, and the third-party recognition engine then processes the image file and writes a text file.
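For illustration only, a third-party recognize.bat could wrap any command line driven recognition engine that reads the snapshot image and writes a plain text result; myobjectrecognizer.exe below is a hypothetical placeholder, not an existing program:

REM The vOICe has just saved the snapshot as vOICe.jpg (and vOICe.bmp) before running this batch file
start /w myobjectrecognizer.exe vOICe.jpg results.txt
REM hand the plain text result back to The vOICe for display or speech output
move /Y results.txt vOICe.ocr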
WMV movie clip: The vOICe sounding COIL-20 object 7 (200 KB); COIL-20 sample.
Even with perfect object recognition, giving a verbal description can never match the "raw" vision as provided by The vOICe, but it can offer useful complementary information to help the blind user in interpreting the visual scene.
One can think of integration with other more or less specialized applications for image analysis, such as face recognition, object recognition and object categorization, and automatic interpretation of bar codes, currency, signs, or logos. For instance, one may consider using the Foveola command line tools for shape recognition, as developed by Patrick Andrews of Break-Step Productions Ltd, which form the basis of the SceneReader sign reader for extracting text from images, or the face recognition and sign reading technology from Riya and the Microsoft Photo2Search photograph-based search project - approaches that build adaptive (trainable) vision systems through the use of many visual training examples. Startups such as Numenta, founded by Jeff Hawkins, Dileep George and Donna Dubinsky, can test and demonstrate their artificial intelligence capabilities on practical real-world recognition tasks that would supplement The vOICe's direct visual mapping approach. One may also consider building a database of feature signatures for everyday visual object views based on David Lowe's SIFT approach (Scale-Invariant Feature Transform), which formed the basis of the ViPR (visual pattern recognition) technology of Evolution Robotics. Another method is SURF (Speeded Up Robust Features, by Herbert Bay and others). A starting point for testing can be the use of public image databases such as COIL-20 or ETH-80.

364 keypoints found: David Lowe's SIFT demo applied to The vOICe sample image

Related projects and comments

Related "Mobile OCR for the Blind" projects exist in the form of the  knfbReader Mobile from K-NFB Reading Technology, codeveloped by Kurzweil and NFB (KNFB, a Nokia N82 with OCR engine and TTS), the ITEX SiSystem  SiRecognizer netbook and the AdvantEdge Reader. The Kurzweil reader is a handheld Pocket PC based device under development in a cooperation between Kurzweil Technologies, Inc. (KTI) and the National Federation of the Blind (NFB). When used with the same OCR engine and resolution settings, recognition performance of The vOICe should be similar to that of commercial portable readers. Other related projects include the Google-sponsored  OCRopus OCR project of the IUPR research group, the  Sypole project of the Faculté Polytechnique de Mons (FPM), TCTS and Université Libre de Bruxelles in Belgium,  kooaba AG, the  Trinetra project of Priya Narasimhan of Carnegie Mellon University, as well as the  "DORA project" (Digital Object Recognition Audio-Assistant) for the visually impaired by Wolfgang Fink and Mark Tarbell of California Institute of Technology and James Weiland and Mark Humayun of Doheny Eye Institute, University of Southern California, the pedestrian crossing "electronic eye" project by Tadayoshi Shioyama and Mohammad Shorif Uddin at Kyoto Institute of Technology, Japan, on single camera detection of pedestrian crosswalks and traffic lights, the machine vision work by Mark Nitzberg, and Alan Yuille and others at  Blindsight Corporation, and Simon Thorpe's  SpikeNet Technology approach of biologically inspired object recognition (SNVision) through neural networks consisting of asynchronous spiking neurons.

NEC and NAIST are working on OCR for mobile camera phones, according to the New Scientist article  "Camera phones will be high-precision scanners". Google is working on  Google Goggles, targeting the possibility of using a camera phone or equivalent to recognize objects and texts from the environment in order to search for information (Google Visual Search).

Where is the camera? And the microphone?
Blind user of The vOICe with a "hatcam": a camera hidden inside a hat, with the microphone also hidden. Photography courtesy of Barbara Schweizer, November 2004.

"Sound of a zebra crossing"
The vOICe soundscape for pedestrian crossing stripes (18K MP3 stereo sound sample). Video resolution is 176 by 144 pixels; audio is generated with 176 by 64 voicels. Note the characteristic tonality of the parallel white stripes. No need for a recognition engine here?
After reading about camera-based object recognition engines that give verbal feedback to the blind, one blind male participant in The vOICe project, JJ, commented:

"This is more like having a sighted guide than vision. I prefer to absorb and interpret data myself rather than being fed pre-digested perceptions."

Reports from participants in The vOICe project suggest that even if automatic visual recognition becomes technically feasible and reliable, many blind people would whenever possible still prefer to learn to "see for themselves" through a more direct non-interpreted visual view such as provided by The vOICe. A robust recognition engine could then serve a useful secondary role as a training tool or assist with special types of patterns.

If your camera supports it, The vOICe automatically temporarily switches to a higher resolution (up to VGA) when taking a snapshot, such that recognition engines can work with that higher resolution snapshot rather than the default 176 by 144 pixel view. The vOICe can also acquire images from a TWAIN compliant flatbed scanner or digital still image camera (Control Q) for subsequent OCR analysis (Control R). Better still, The vOICe can directly acquire a high resolution image from a TWAIN compliant device and apply OCR when pressing Control Alt R (or using the spoken "recognize" command when no video capture device is connected to the computer).

Bar code identification
Bar code example. With video resolution set to 320 by 240 pixels, GOCR 0.40 returned:
<barcode type="UPC" chars="13" code="0042000062008" crc="0" error="0 028" /> 4 OOO 0620O
Open Source OCR 1: GOCR project

Jörg Schulenburg at LinuxTag 2005
The GOCR project is seeking volunteers for the further development of the GOCR engine and software library. For use of GOCR with The vOICe, it would be particularly welcome if work started on image preprocessing to improve the accuracy in extracting text embedded in video scenes (including captioning with TV broadcasts). This means dealing with complex non-uniform visual backgrounds, contrast enhancement and noise reduction for poor lighting conditions, deskewing of text to deal with camera misalignment, refinements for working with characters in low resolution bitmaps (which would also cover computer screen snapshots captured from within the mouse area sonification mode as toggled by function key F9!), and automatic color or greyscale inversion to suit OCR engines that can only handle dark print on a bright background (as GOCR currently does). Any results in this area can be readily tested with The vOICe through the open OCR engine interface described on this page, which makes it very easy to replace the OCR engine or insert other programs in the image processing and analysis chain.
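For example, such a preprocessing step could be slotted into recognize.bat between the format conversion and the OCR stage; preprocess.exe below is a hypothetical placeholder for a deskewing, contrast enhancement or inversion tool, not an existing program:

djpeg -greyscale -dither none vOICe.jpg vOICe.pnm
REM hypothetical preprocessing tool: deskew, enhance contrast, and invert where needed
preprocess.exe vOICe.pnm vOICe-clean.pnm
gocr -i vOICe-clean.pnm -o vOICe.ocr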

Also note that new command line driven and file-based image processing engines can, if desired, be first developed and tested under Linux, and subsequently ported to Microsoft Windows for combination with The vOICe. This is in fact what happened with the GOCR engine.

GOCR can also recognize barcodes, and Rob Fugina's  Internet UPC Database (upcdatabase.com) can be used to retrieve product information associated with recognized barcode numbers.

Open Source OCR 2: Tesseract OCR

Another OCR project is  Tesseract OCR. Microsoft Windows executables for Tesseract are available as free downloads at  code.google.com. Due to lack of documentation, it is still somewhat unclear what types of BMP files Tesseract supports, but it does appear to support greyscale BMP files. Tesseract is run through command lines like "tesseract phototest.tif output", usually applied in a batch file. One can easily integrate it with The vOICe (which does not generate greyscale BMP files itself) via its open interface, by using the following command lines in recognize.bat,

djpeg -greyscale -bmp vOICe.jpg vOICe.bmp
tesseract vOICe.bmp vOICe
move /Y vOICe.txt vOICe.ocr

where the JPEG output from The vOICe is first converted to greyscale BMP, after which the Tesseract OCR engine is invoked to yield a plain text output file vOICe.txt, which is finally moved to a vOICe.ocr file as expected by The vOICe for further processing (dialog popup or synthetic speech output). For the examples on this web page, GOCR appears to perform slightly better than Tesseract.

Yet another open source OCR project is  GNU Ocrad, but its use in combination with The vOICe for Windows has not yet been investigated.

Commercial OCR: TopOCR

The  TopOCR product now includes a command line interface and one can very easily integrate it with The vOICe via its open interface by using the following single command line in recognize.bat,

"C:\Program Files\TopOCR-Demo\topocr24-demo.exe" vOICe.jpg -THRESH 20 35 92 vOICe.ocr

Of course one must apply an appropriate path change for the TopOCR executable when using a version other than the free trial version 2.4 that was used here (which yielded "Neural Device _!~a^1" for the first test image on this web page): the executable path may be "C:\Program Files\TopOCR\topocr.exe" for the fully registered version. The TopOCR command line parameters may also need adjustment for best results.
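For instance, assuming the default installation path of the fully registered version, the single command line in recognize.bat would become something like the following (with the same threshold parameters as above, which will likely need tuning):

"C:\Program Files\TopOCR\topocr.exe" vOICe.jpg -THRESH 20 35 92 vOICe.ocr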

Copyright © 1996 - 2024 Peter B.L. Meijer