
Image-to-Sound Mapping

In order to achieve high resolution, an image is transformed into a time-multiplexed auditory representation. Whenever the previous, (k-1)th, image-to-sound conversion has finished, a new, kth, image is sampled, digitized and stored as an M (height, rows) × N (width, columns) pixel matrix P^(k). This takes tau seconds. During this period, a recognizable synchronization click is generated to mark the beginning of the new image or, equivalently, the ending of the previous one. The value of any pixel matrix element p^(k)_ij is one out of G grey-tones, i.e.,

  \mathbf{P}^{(k)} = \left[ p^{(k)}_{ij} \right], \quad p^{(k)}_{ij} \in \{0, 1, \ldots, G-1\}, \quad i = 1, \ldots, M, \;\; j = 1, \ldots, N    (1)

Subsequently, the conversion into sound (re)starts, taking one of the N columns at a time, and starting with the leftmost one at j=1.

Fig. 1. Principles of the image-to-sound mapping.

Fig. 1 illustrates the principles of the conversion procedure for the simple example of an 8 × 8, 3 grey-tone image (M = N = 8, G = 3). The mapping translates, for each pixel, vertical position into frequency, horizontal position into time-after-click, and brightness into oscillation amplitude. It takes T seconds to convert the whole, N-column, pixel matrix into sound. For a given column j, every pixel in this column is used to excite an associated sinusoidal oscillator in the audible frequency range. Different sinusoidal oscillators form an orthogonal basis, assuming they have frequencies that are integer multiples of some basic frequency. This ensures that information is preserved when going from a geometrical space to a Hilbert space of sinusoidal oscillators. A pixel at a more elevated position i corresponds to an oscillator of higher frequency f_i. The larger the brightness of a pixel, represented by grey-tone matrix element p^(k)_ij, the larger the amplitude (``loudness'') of its associated oscillator.

The M oscillator signals for a single column are superimposed, and the corresponding sound pattern s(t) is presented to the ear during T/N seconds. Then the next, (j+1)th, column is converted into sound. This procedure continues until the rightmost, Nth, column has been converted, T seconds after the start of the conversion. Subsequently, a new pixel matrix P^(k+1) is obtained, which again takes tau seconds. Meanwhile, the image-separating synchronization click is generated, which is essential for proper lateral orientation. We should have tau much smaller than T, in order to continue as soon as possible with the conversion of a new image into sound. Once a new pixel matrix is stored, the image-to-sound conversion starts all over at the leftmost column. The conversion therefore returns to any particular column every tau + T seconds.
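The column-by-column scan just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the original hardware implementation: the function name, sample rate and frequency list are assumptions introduced here.

```python
import math

def image_to_sound(P, freqs, T=1.05, sample_rate=8000):
    """Convert an M x N pixel matrix into a list of audio samples,
    one column at a time, left to right.

    P     : M rows of N grey values (0 = dark); row i drives the
            oscillator of frequency freqs[i]
    freqs : M oscillator frequencies f_i in Hz
    T     : total conversion time in seconds for all N columns,
            so each column sounds for T/N seconds
    """
    M, N = len(P), len(P[0])
    col_samples = int(sample_rate * T / N)  # samples per column slot
    samples = []
    for j in range(N):                      # leftmost column first
        for n in range(col_samples):
            t = (j * col_samples + n) / sample_rate
            # Superposition of M brightness-weighted sinusoids
            s = sum(P[i][j] * math.sin(2 * math.pi * freqs[i] * t)
                    for i in range(M))
            samples.append(s)
    return samples
```

A 2 × 2 matrix with two frequencies, for instance, yields a buffer of two consecutive T/N slots, and an all-dark matrix yields silence.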

It must be noted that sinusoidal oscillation of finite duration - here at least T/N seconds - is not sinusoidal oscillation in the strict sense. This has significant consequences for obtainable resolution, as will be discussed in a later section.
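This point can be illustrated numerically (an illustration only; no such computation is part of the system itself): the discrete Fourier transform of a short tone burst shows that its energy is not confined to a single spectral line.

```python
import cmath
import math

def dft_mag(x):
    """Magnitude spectrum of a real signal via a direct DFT."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A 64-sample burst of a sinusoid making 4.5 cycles per window:
# because the oscillation is truncated, its energy smears over
# neighbouring frequency bins instead of forming one sharp line.
n = 64
burst = [math.sin(2 * math.pi * 4.5 * t / n) for t in range(n)]
mag = dft_mag(burst)
```

The spectrum peaks around bins 4 and 5, but leakage remains clearly visible in bins well away from the nominal frequency.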

Using omega_i = 2 pi f_i, the general transformation can be concisely written as

  s(t) = \sum_{i=1}^{M} p^{(k)}_{ij} \, \sin\!\left( \omega_i t + \phi^{(k)}_i \right)    (2)

during times t that satisfy the condition that some column j of a pixel matrix P^(k) is being transformed, i.e., with t, j and k related by

  t_k + (j-1)\,\frac{T}{N} \;\le\; t \;<\; t_k + j\,\frac{T}{N}    (3)

  j = 1, 2, \ldots, N; \qquad k = 1, 2, \ldots

where t_k is the starting moment for the transformation of the first column j=1 of matrix P^(k). Therefore, if the first image (k=1) was grabbed starting at t=0, we have

  t_k = (k-1)(\tau + T) + \tau    (4)
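Relations (3) and (4) can be combined into a small timing helper; the function name below is invented here, purely to make the bookkeeping concrete.

```python
def column_interval(k, j, tau, T, N):
    """Half-open time window [start, end) during which column j of
    image k sounds, following t_k = (k-1)*(tau+T) + tau and a
    duration of T/N per column."""
    t_k = (k - 1) * (tau + T) + tau       # start of image k's conversion
    start = t_k + (j - 1) * T / N
    return start, start + T / N
```

Comparing the same column in successive images confirms that the conversion returns to it every tau + T seconds.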

Additional requirements must be fulfilled for monotonicity (to avoid row permutation) and separability:

  0 < \omega_1 < \omega_2 < \cdots < \omega_M    (5)

According to (2)-(4), s(t) is not defined during the image renewal periods of duration tau, i.e., for t_k - tau <= t < t_k. As indicated before, a recognizable synchronization click is generated within each of these periods. In (2), the pixel brightnesses are used directly as amplitude values, which is always possible by choosing an appropriate scale for measuring brightness. The phases phi^(k)_i are just arbitrary constants during the image-to-sound conversion, but they may change during the generation of the synchronization clicks.

The sound patterns corresponding to simple shapes are easily imagined. For example, a straight bright line on a dark background, running from the bottom left to the top right, will sound like a single tone steadily increasing in pitch, until the pitch jumps down after the click that separates subsequent images. Similarly, a bright rectangle standing on its side will sound like bandwidth-limited noise, with a duration corresponding to its width and a frequency band corresponding to its height and elevation. The simplicity of interpretation for simple shapes is very important, because it means that there is no major, motivation-diminishing learning barrier to overcome before one begins to understand the new image representation. More realistic images yield more complicated sound patterns, but learning them can be gradual and, because of the simplicity of the mapping, accessible to conscious analysis. Of course, in the end the interpretation should become ``natural'' and ``automatic'' (i.e., largely subconscious) for reasons of speed.
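The diagonal-line example can be made explicit for the 8 × 8 case of Fig. 1. In this sketch, row index 0 is taken as the bottom row (the lowest oscillator frequency); that indexing convention is an assumption made here for illustration.

```python
M = N = 8
# Bright diagonal from bottom left to top right: column j lights
# only row j (row 0 = bottom = lowest oscillator frequency).
P = [[1 if i == j else 0 for j in range(N)] for i in range(M)]

# Which oscillator rows sound in each successive T/N time slot?
active_rows = [[i for i in range(M) if P[i][j]] for j in range(N)]
```

Each time slot excites exactly one oscillator, one step higher than the last, which is heard as a tone of steadily rising pitch.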

It is worth noticing that the above image-to-sound mapping does not exploit the ability of the human hearing system to detect time delays (causing phase shifts) between sounds arriving at the two ears. Because this is the natural way of detecting sound source directions, the perception of the jth column could be further supported by delaying the sound output. Using a monotonically increasing delay function d(.) of column position, with d(0) = 0, one can take d(j-1) for the left ear and d(N-j) for the right ear. Although CCDs allow for a straightforward implementation, such a delay function has not been implemented, because the time-after-click method seemed sufficient and could be realized without additional hardware cost.
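The proposed (but unimplemented) binaural extension amounts to the following rule; the linear delay of 0.1 ms per column step is only an illustrative choice for d(.), not a value from the original design.

```python
def stereo_delays(j, N, d):
    """Onset delays (left_ear, right_ear) for column j out of N,
    using a monotone delay function d with d(0) = 0: the left ear
    gets d(j-1), the right ear d(N-j)."""
    return d(j - 1), d(N - j)

def d(steps):
    """Illustrative linear delay: 0.1 ms per column step."""
    return steps * 1e-4
```

With this choice, the leftmost column (j = 1) reaches the left ear first and the rightmost column (j = N) reaches the right ear first, reinforcing the time-after-click cue for lateral position.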

In the following section, the values for M, N, G and T will be discussed. In order to prevent (significant) loss of information, a (virtually) bijective - i.e., invertible - mapping must be ensured through proper values for these parameters.


Peter B.L. Meijer, ``An Experimental System for Auditory Image Representations,'' IEEE Transactions on Biomedical Engineering, Vol. 39, No. 2, pp. 112-121, Feb 1992.