Communicating M × N brightness values, each one taken from a set of G possible values, within T seconds through a communication channel, amounts to sending data at a rate of I bits per second, where I is given by
\[ I = \frac{M N \log_2 G}{T} \qquad (6) \]
Several restrictions on the values of M, N, G and T must be considered to prevent an unacceptable loss of information within the communication channel, which in this case includes the human hearing system that has to deal with the output of the image-to-sound mapping system.
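As a rough illustration of the data rate I, the sketch below evaluates M N log2(G) / T, which is the rate that follows from sending M × N values of log2(G) bits each within T seconds. The function name and the example values (64 × 64 pixels, 16 brightness levels, 1 s) are assumptions chosen for illustration only and are not taken from the text.

```python
import math


def data_rate_bits_per_second(rows, columns, grey_levels, conversion_time_s):
    """Data rate I = M * N * log2(G) / T in bits per second."""
    bits_per_pixel = math.log2(grey_levels)
    return rows * columns * bits_per_pixel / conversion_time_s


# Illustrative values only (assumed, not from the text):
# 64 x 64 pixels, 16 brightness levels, one image per second.
rate = data_rate_bits_per_second(64, 64, 16, 1.0)
print(f"I = {rate:.0f} bit/s")  # → I = 16384 bit/s
```

Halving T or doubling the number of pixels doubles I, whereas doubling G adds only one bit per pixel.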
In the transformation described by
(1) through (5), the G allowed
brightness values correspond to the amplitudes of individual
components of a superposition of (in principle) periodic signals.
The amplitude values convey the image information.
The periodic signals themselves, as well as their superposition,
may have a large, even infinite, number
of signal levels for a given amplitude. However, when
monitoring the amplitudes of the Fourier components, as the human
hearing system does to some extent, the G levels reappear.
An estimate of the upper bound on the obtainable
resolution can be derived from the cross-talk
among Fourier components, caused by the finite duration of the
information carrying signals in the image-to-sound conversion.
(In this paper, the term ``cross-talk'' is used to denote the spectral
contribution, or leakage, to a particular frequency that is caused by
limiting the duration of a sinusoidal oscillation of another frequency.)
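To facilitate the analysis, a very simple image sequence will be
considered: all images are dark, except for one, the $k'$-th image, in which
only a single pixel $(i', j')$ is bright, i.e. the pixel brightness is
nonzero only for $(i, j, k) = (i', j', k')$.
By dropping the accents, (2) now becomes equivalent to a single sinusoid at
the oscillator frequency of row i, present only during the T/N s time slot
of column j in image k:
\[ s(t) = \begin{cases} \sin(\omega_i t + \phi_i^{(k)}) & \text{within the time slot of column } j \\ 0 & \text{elsewhere} \end{cases} \qquad (7) \]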
When neglecting the contributions of the horizontal synchronization clicks,
one can derive, e.g., via Fourier transformation, that to avoid significant
cross-talk with sinusoidal signals of duration T/N s, the frequency
separation should be on the order of 2/(T/N) Hz.
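The separation estimate can be checked numerically with a minimal sketch. It assumes an ideal rectangular envelope of duration T/N, ignores the synchronization clicks and the negative-frequency image of the sinusoid, and therefore uses the standard |sin(πx)/(πx)| spectral envelope with x equal to the frequency offset times the duration. The function names and the values T = 1 s and N = 50 are illustrative assumptions.

```python
import math


def relative_leakage(offset_hz, duration_s):
    """Spectral magnitude of a sinusoid truncated to duration_s seconds,
    evaluated offset_hz away from its own frequency, relative to the
    main-lobe peak (rectangular envelope assumed)."""
    x = offset_hz * duration_s
    if x == 0.0:
        return 1.0
    return abs(math.sin(math.pi * x) / (math.pi * x))


def first_side_lobe_offset(duration_s, steps=10000):
    """Offset (Hz) of the first side-lobe maximum, found by scanning between
    the first spectral null (1/duration) and the second null (2/duration)."""
    offsets = [(1.0 + n / steps) / duration_s for n in range(steps + 1)]
    return max(offsets, key=lambda f: relative_leakage(f, duration_s))


# Assumed example values: T = 1 s conversion time, N = 50 columns.
T, N = 1.0, 50
duration = T / N               # duration of one pixel's sinusoid, in seconds
separation = 2.0 / duration    # proposed frequency separation, in Hz
peak = first_side_lobe_offset(duration)

print(f"first side-lobe maximum at about {peak:.1f} Hz offset")
print(f"proposed separation: {separation:.1f} Hz")
print(f"relative leakage at the side-lobe maximum: {relative_leakage(peak, duration):.3f}")
```

Under these assumptions the first side-lobe maximum lies at roughly 1.43 N/T Hz, so a separation of 2 N/T Hz places a neighbouring frequency beyond it, close to the second spectral null.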
With a frequency separation of 2N/T Hz, it can be shown that the
oscillator frequency associated with a vertically neighbouring pixel on row
$i \pm 1$ already lies
beyond the maximum of the first spectral side lobe corresponding to
the pixel on row i. For an equidistant frequency distribution we
would have a frequency spacing
$\Delta f = B/(M-1)$, using a bandwidth of B Hz,
such that the cross-talk restricting criterion becomes
\[ \frac{B}{M-1} \geq \frac{2N}{T} \qquad (8) \]
In our application, using sequenced ``frozen'' images, the allowed
conversion time T is restricted by the fact that the information
content of an image soon becomes obsolete within a changing environment.
Furthermore, the capacity of the human brain to assimilate and
interpret information distributed in time is also limited.
One may expect that biological systems are well-adapted to
the processing and anticipation of major environmental
changes to an extent that corresponds to the typical time constants
of these changes and the time constants of physical response.
Although humans have a minimum response time of a few tenths
of a second, most events and significant motor responses take
place on a time scale of a few seconds. One may think of
events like a door being opened, or a cup of coffee being picked up.
For longer time scales,
humans have the option to think and contemplate an evolving
situation, but the brain still seems highly dedicated and optimized
towards the time scale of seconds, as when dealing with, for example,
speech or mobility. Therefore, we should have a conversion time
T on the order of one second or less. Similarly, we should have
s, to avoid blurred images, as well
as to have most of the time available for sending image information
through the auditory channel. Furthermore, the total available
bandwidth of the human hearing system amounts to some 20 kHz,
but the useful bandwidth B of the human hearing system is not much more
than 5 or 6 kHz. Above these frequencies, the sensitivity drops rapidly,
especially for elderly people.
Therefore, taking B = 5 kHz and T = 1 s, we find from
(8) for a square matrix M = N that
$M = N \leq 50$.
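The arithmetic behind this estimate can be reproduced with a short sketch. It assumes that the criterion is read as requiring the equidistant frequency spacing B/(M-1) to be at least the separation 2N/T Hz derived above; this reading and the function names are assumptions made for illustration.

```python
def meets_criterion(rows, columns, bandwidth_hz, conversion_time_s):
    """True if the equidistant spacing B/(M-1) is at least 2N/T
    (assumed reading of the cross-talk criterion)."""
    if rows < 2:
        return True  # a single row has no neighbouring frequency
    spacing_hz = bandwidth_hz / (rows - 1)
    required_hz = 2.0 * columns / conversion_time_s
    return spacing_hz >= required_hz


def largest_square_size(bandwidth_hz, conversion_time_s):
    """Largest M = N that still satisfies the criterion."""
    size = 1
    while meets_criterion(size + 1, size + 1, bandwidth_hz, conversion_time_s):
        size += 1
    return size


# B = 5 kHz and T = 1 s, as in the text.
print(largest_square_size(5000.0, 1.0))  # → 50
```

For M = N = 50 the spacing is 5000/49 ≈ 102 Hz against a required 100 Hz, while M = N = 51 would give a spacing of 100 Hz against a required 102 Hz.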
It should be noted that the obtainable resolution also depends on the particular image being transformed. The criterion (8) for obtainable resolution holds for situations with a few bright pixels per column on a dark background. The strongest cross-talk occurs between nearest neighbours. However, many bright pixels positioned at larger distances could together also contribute significantly. Therefore, the obtainable resolution for detecting a small bright object against a dark background will be somewhat higher than for a small dark object against a bright background.
Criterion (8) is the consequence of an uncertainty relation between frequency and time, see also [18]. Yet even for signals of long duration, the ability of humans to distinguish different sound spectra is known to be limited. It must be verified that the separability of some 50 frequency components, estimated from uncertainty principles, does not (significantly) exceed other known limitations of human perception. The limits of frequency discrimination in human auditory perception are highly dependent on the particular sound patterns involved in the experiments. From the width of the critical bands of the human hearing system one would predict the separability of at most several tens (20 to 30) of components [13, 24]. On the other hand, the difference limen or just noticeable difference (jnd), determined for a single sinusoidal tone with slowly varying frequency, would suggest the separability of hundreds of components [24]. The image-to-sound mapping can, in principle, and dependent on the particular artificial image being transformed, yield almost any sound pattern used in the literature for frequency discrimination experiments (as long as phase relations are neglected). Moreover, it is virtually impossible to characterize what constitutes a ``normal'' image, and thereby characterize the corresponding sound patterns. Therefore, it seems reasonable that our present resolution estimate with 50 frequencies is not much more optimistic than expected from measurements of critical bands. Lowering the number of frequencies to the number suggested by critical bands would have disadvantages. It would enhance the undesirable perception of discontinuities in frequencies obtained with images containing, for example, widely separated continuous slanting bright lines on a dark background. In such cases the difference limen becomes the relevant limit for perception.