
Basic Design Considerations

Communicating M × N brightness values, each one taken from a set of G possible values, within T seconds through a communication channel, amounts to sending data at a rate of I bits per second, where I is given by

  $$I = \frac{M N \log_2 G}{T} \tag{6}$$
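As a quick arithmetic check of the rate formula, the following minimal sketch evaluates I for one set of values. The function name `information_rate` and the example figures (a 64 × 64 image, 16 brightness levels, a 1 s conversion time) are illustrative assumptions chosen for easy arithmetic; they are not taken from this section.

```python
import math


def information_rate(m: int, n: int, g: int, t: float) -> float:
    """Bits per second needed to send an m x n image with g brightness levels in t seconds."""
    if min(m, n, g) < 1 or t <= 0:
        raise ValueError("m, n, g must be >= 1 and t must be positive")
    # Each pixel carries log2(g) bits; there are m * n pixels per image.
    return m * n * math.log2(g) / t


# Assumed example values, for illustration only.
rate = information_rate(64, 64, 16, 1.0)
print(f"I = {rate:.0f} bits per second")
```

With these assumed values each pixel carries 4 bits, so the rate is 4096 × 4 = 16384 bits per second.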

Several restrictions on the values of M, N, G and T must be considered to prevent an unacceptable loss of information within the communication channel, which in this case includes the human hearing system that has to deal with the output of the image-to-sound mapping system.

In the transformation described by (1) through (5), the G allowed brightness values correspond to the amplitudes of individual components of a superposition of (in principle) periodic signals. The amplitude values convey the image information. The periodic signals themselves, as well as their superposition, may have a large, even infinite, number of signal levels for a given amplitude. However, when monitoring the amplitudes of the Fourier components, as the human hearing system does to some extent, the G levels reappear. An upper bound on the obtainable resolution can be estimated from the cross-talk among Fourier components, caused by the finite duration of the information-carrying signals in the image-to-sound conversion. (In this paper, the term ``cross-talk'' is used to denote the spectral contribution, or leakage, to a particular frequency that is caused by limiting the duration of a sinusoidal oscillation of another frequency.) To facilitate the analysis, a very simple image sequence will be considered: all images are dark, except for one, the $k'$-th image, in which only a single pixel $(i', j')$ is bright, i.e. $p^{(k)}_{ij} = 0$ for all $i, j, k$ except for $i = i'$, $j = j'$, $k = k'$. By dropping the primes, (2) now becomes equivalent to

  Expression for s(t)

When neglecting the contributions of the horizontal synchronization clicks, one can derive, e.g., via Fourier transformation, that to avoid significant cross-talk with sinusoidal signals of duration T/N s, the frequency separation $\Delta f$ should be on the order of $2/(T/N) = 2N/T$ Hz. In that case, it can be shown that the oscillator frequency associated with a vertically neighbouring pixel on row $i \pm 1$ already lies beyond the maximum of the first spectral side lobe corresponding to the pixel on row $i$. For an equidistant frequency distribution over a bandwidth of B Hz we would have $\Delta f = B/(M-1)$, so that requiring $B/(M-1) \ge 2N/T$ turns the cross-talk restricting criterion into

  $$(M-1)\,N \le \frac{B\,T}{2} \tag{8}$$
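The side-lobe statement behind this criterion can be checked numerically. The sketch below assumes an ideal rectangular envelope for a sinusoidal burst of duration T/N and keeps only the positive-frequency term, so the relative spectral magnitude at a frequency offset δ is |sin(πx)/(πx)| with x = δ · T/N; the synchronization clicks are neglected, as in the text. The function names `leakage` and `first_side_lobe_peak` are hypothetical helpers, not part of the original system.

```python
import math


def leakage(x: float) -> float:
    """Relative spectral magnitude |sin(pi x) / (pi x)| of a rectangular-windowed sinusoid.

    x is the frequency offset multiplied by the burst duration T/N.
    """
    if x == 0.0:
        return 1.0
    return abs(math.sin(math.pi * x) / (math.pi * x))


def first_side_lobe_peak(steps: int = 100_000):
    """Scan 1 < x < 2 (between the first and second spectral nulls) for the maximum."""
    best_x, best_val = 1.0, 0.0
    for k in range(1, steps):
        x = 1.0 + k / steps
        val = leakage(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val


peak_x, peak_val = first_side_lobe_peak()
print(f"first side-lobe maximum at x = {peak_x:.3f}, relative level {peak_val:.3f}")
print(f"relative level at x = 2 (separation 2N/T): {leakage(2.0):.1e}")
```

Under these assumptions the first side-lobe maximum lies near x ≈ 1.43, at a relative level of about 0.22, while a separation of 2N/T corresponds to x = 2, which coincides with the second spectral null and therefore lies beyond that maximum.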

In our application, using sequenced ``frozen'' images, the allowed conversion time T is restricted by the fact that the information content of an image soon becomes obsolete within a changing environment. Furthermore, the capacity of the human brain to assimilate and interpret information distributed in time is also limited. One may expect that biological systems are well-adapted to the processing and anticipation of major environmental changes to an extent that corresponds to the typical time constants of these changes and the time constants of physical response. Although humans have a minimum response time of a few tenths of a second, most events, significant motor responses, etc., take place on a time scale of a few seconds. One may think of events like a door being opened, or a cup of coffee being picked up. For longer time scales, humans have the option to think and contemplate an evolving situation, but the brain still seems highly dedicated and optimized towards the time scale of seconds, as when dealing with, for example, speech or mobility. Therefore, we should have a conversion time T on the order of one second or less. Similarly, we should have $\tau \ll 1$ s, to avoid blurred images, as well as to have most of the time available for sending image information through the auditory channel. Furthermore, the total available bandwidth of the human hearing system amounts to some 20 kHz, but the useful bandwidth B of the human hearing system is not much more than 5 or 6 kHz. Above these frequencies, the sensitivity drops rapidly, especially for elderly people. Therefore, taking B = 5 kHz and T = 1 s, criterion (8) for a square matrix M = N requires $(M-1)M \le 2500$, from which we find $M \le 50$.
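The bound on M can be reproduced with a few lines of arithmetic. The following is a minimal sketch that searches for the largest square resolution satisfying criterion (8) with N = M; the function name `max_square_resolution` is a hypothetical helper introduced only for this illustration.

```python
def max_square_resolution(bandwidth_hz: float, conversion_time_s: float) -> int:
    """Largest M with (M - 1) * M <= B * T / 2, i.e. criterion (8) with N = M."""
    limit = bandwidth_hz * conversion_time_s / 2.0
    m = 1  # M = 1 always satisfies the criterion for a non-negative limit
    # Test whether the next size, M = m + 1, still satisfies (M - 1) * M <= limit.
    while m * (m + 1) <= limit:
        m += 1
    return m


print(max_square_resolution(5000.0, 1.0))  # B = 5 kHz, T = 1 s
print(max_square_resolution(6000.0, 1.0))  # B = 6 kHz, T = 1 s
```

For B = 5 kHz and T = 1 s this gives M = 50, since 49 × 50 = 2450 satisfies the limit of 2500 while 50 × 51 = 2550 does not. Taking the upper end of the useful bandwidth, B = 6 kHz, raises the result only to M = 55.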

It should be noted that the obtainable resolution also depends on the particular image being transformed. The criterion (8) for obtainable resolution holds for situations with a few bright pixels per column on a dark background. The strongest cross-talk occurs between nearest neighbours. However, many bright pixels positioned at larger distances could together also contribute significantly. Therefore, the obtainable resolution for detecting a small bright object against a dark background will be somewhat higher than for a small dark object against a bright background.

Criterion (8) is the consequence of an uncertainty relation between frequency and time; see also [18]. Yet even for signals of long duration, the ability of humans to distinguish different sound spectra is known to be limited. It must be verified that the separability of some 50 frequency components, estimated from uncertainty principles, does not (significantly) exceed other known limitations of human perception. The limits of frequency discrimination in human auditory perception are highly dependent on the particular sound patterns involved in the experiments. From the width of the critical bands of the human hearing system one would predict the separability of at most several tens (20 to 30) of components [13, 24]. On the other hand, the difference limen or just noticeable difference (jnd), determined for a single sinusoidal tone with slowly varying frequency, would suggest the separability of hundreds of components [24]. The image-to-sound mapping can, in principle, and depending on the particular artificial image being transformed, yield almost any sound pattern used in the literature for frequency discrimination experiments (as long as phase relations are neglected). Moreover, it is virtually impossible to characterize what constitutes a ``normal'' image, and thereby to characterize the corresponding sound patterns. Therefore, it seems reasonable that our present resolution estimate with 50 frequencies is not much more optimistic than expected from measurements of critical bands. Lowering the number of frequencies to the number suggested by critical bands would have disadvantages. It would enhance the undesirable perception of discontinuities in frequencies obtained with images containing, for example, widely separated continuous slanting bright lines on a dark background. In such cases the difference limen becomes the relevant limit for perception.


Peter B.L. Meijer, ``An Experimental System for Auditory Image Representations,'' IEEE Transactions on Biomedical Engineering, Vol. 39, No. 2, pp. 112-121, Feb 1992.