In order to achieve high resolution, an image is transformed
into a time-multiplexed auditory representation. Whenever the
previous, $(k-1)$th, image-to-sound conversion has finished, a new,
$k$th, image is sampled, digitized and stored as an $M$ (height, rows) $\times$ $N$ (width, columns)
pixel matrix $P^{(k)}$. This takes $T_r$ seconds.
During this period, a recognizable synchronization
click is generated to mark the beginning of the new image,
or, equivalently, the ending of the previous image.
The value of any pixel matrix element $p^{(k)}_{ij}$
is one out of $G$ grey-tones, i.e.,
$$ p^{(k)}_{ij} \in \{g_1, g_2, \ldots, g_G\}. \quad (1) $$
Subsequently, the conversion into sound (re)starts, taking one of the $N$ columns at a time, and starting with the leftmost one at $j=1$.
[Fig. 1. Principles of the image-to-sound mapping.]
Fig. 1 illustrates the principles of the
conversion procedure for the simple example
of an 8 × 8, 3 grey-tone image (M = N = 8, G = 3). The mapping
translates, for each pixel, vertical position into frequency, horizontal
position into time-after-click, and brightness into oscillation amplitude.
It takes T seconds to convert the whole, N-column,
pixel matrix into sound. For a given column j, every pixel in this
column is used to excite an associated sinusoidal oscillator in the
audible frequency range. Different
sinusoidal oscillators form an orthogonal basis, assuming they have
frequencies that are integer multiples of some basic frequency. This
ensures that information is preserved when going from
a geometrical space to a Hilbert space of sinusoidal oscillators.
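This orthogonality is easy to verify numerically: over an interval spanning a whole number of periods of the base frequency, sinusoids at distinct integer multiples of it have vanishing inner product. A minimal check, with arbitrary illustrative values for the base frequency and sample rate:

```python
import numpy as np

# Numerical check of the orthogonality argument: over a span covering a
# whole number of periods of the base frequency f0, discretely sampled
# sinusoids at distinct integer multiples of f0 are orthogonal.
# f0 and fs are arbitrary illustrative values, not values from the text.
f0, fs = 500.0, 44100
n = int(fs * 10 / f0)                 # exactly 10 base periods (882 samples)
t = np.arange(n) / fs
x1 = np.sin(2 * np.pi * 1 * f0 * t)   # fundamental
x2 = np.sin(2 * np.pi * 2 * f0 * t)   # second harmonic
inner = float(x1 @ x2)                # ~ 0: the two signals are orthogonal
energy = float(x1 @ x1)               # ~ n/2: each signal is non-degenerate
```

Because the two frequencies fall on exact analysis bins of the chosen interval, the inner product vanishes up to floating-point error, so each oscillator's amplitude remains individually recoverable from the summed signal.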
A pixel at a more elevated position $i$
corresponds to an oscillator of higher frequency $\omega_i$.
The larger the brightness of a pixel, represented by grey-tone
$p^{(k)}_{ij}$, the larger the amplitude (``loudness'')
of its associated oscillator. The M oscillator signals for the
single column are superimposed, and the corresponding
sound pattern s(t) is presented to the ear during T/N seconds.
Then the next, $(j+1)$th, column
is converted into sound. This procedure continues until the
rightmost, Nth, column has been converted into sound, T
seconds after the start of the conversion.
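As an illustration (not the system's actual hardware implementation), the column-by-column synthesis just described can be sketched in a few lines of Python; the sample rate, base frequency, conversion time and the zero phases are assumptions chosen for the sketch:

```python
import numpy as np

def image_to_sound(P, T=1.05, f0=500.0, fs=22050):
    """Sketch of the column-by-column mapping of Fig. 1: row i of the
    M x N pixel matrix P (top row first) drives a sinusoid at frequency
    (M - i) * f0, so higher image positions give higher tones, with
    pixel brightness as amplitude.  T, f0 and fs are illustrative
    assumptions, not values given in the text."""
    M, N = P.shape
    spc = int(round(fs * T / N))          # samples per column (T/N seconds)
    freqs = f0 * np.arange(M, 0, -1)      # integer multiples of f0
    s = np.zeros(N * spc)
    for j in range(N):                    # leftmost column first
        t = np.arange(j * spc, (j + 1) * spc) / fs
        s[j * spc:(j + 1) * spc] = P[:, j] @ np.sin(2 * np.pi * np.outer(freqs, t))
    return s

# 8 x 8, 3 grey-tone example: a bright diagonal line running from the
# bottom left to the upper right of the image.
P = np.zeros((8, 8))
for j in range(8):
    P[7 - j, j] = 2.0
s = image_to_sound(P)
```

With this diagonal-line input, the dominant frequency rises by one multiple of the base frequency per column, i.e., the signal is a tone of steadily increasing pitch.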
Subsequently, a new pixel matrix $P^{(k+1)}$ is obtained,
which again takes $T_r$ seconds. Meanwhile, the image-separating
synchronization click is generated, which is essential for a proper
lateral orientation. We should have
$$ T_r \ll T, $$
to continue as soon as possible with the conversion into sound of
a new image. Once a new pixel matrix is stored, the image-to-sound
conversion starts all over at the leftmost column.
The conversion therefore returns to any particular column every
$T + T_r$ seconds.
It must be noted that sinusoidal oscillation of finite duration - here at least T/N seconds - is not sinusoidal oscillation in the strict sense. This has significant consequences for obtainable resolution, as will be discussed in a later section.
Using (1), the general transformation
can be concisely written as
$$ s(t) = \sum_{i=1}^{M} p^{(k)}_{ij} \sin(\omega_i t + \phi_i) \quad (2) $$
during times $t$ that satisfy the condition that some
column $j$ of a pixel matrix $P^{(k)}$ is being transformed,
i.e., with $t$, $j$ and $k$ related by
$$ t_k + (j-1)\,\frac{T}{N} \le t < t_k + j\,\frac{T}{N} \quad (3) $$
where $t_k$ is the starting moment for the transformation of
the first column $j=1$ of $P^{(k)}$. Therefore, if the first
image ($k=1$) was grabbed starting at $t=0$, we have
$$ t_k = k\,T_r + (k-1)\,T. \quad (4) $$
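This timing can be sketched as a pair of small helpers (hypothetical names, with illustrative values T = 1.05 s, T_r = 0.05 s, N = 64; assuming start times t_k = k*T_r + (k-1)*T and column j of image k sounding during [t_k + (j-1)T/N, t_k + jT/N)):

```python
def t_start(k, T=1.05, Tr=0.05):
    """Start of the conversion of image k when the first image grab
    begins at t = 0: t_k = k*Tr + (k-1)*T.  T and Tr are illustrative."""
    return k * Tr + (k - 1) * T

def column_at(t, N=64, T=1.05, Tr=0.05):
    """Return (k, j) when column j of image k is sounding at time t,
    or None when t falls inside an image-renewal (click) period."""
    k = 1
    while t_start(k + 1, T, Tr) <= t:
        k += 1
    dt = t - t_start(k, T, Tr)
    if dt < 0 or dt >= T:                # renewal period: s(t) undefined
        return None
    return (k, int(dt * N / T) + 1)      # columns numbered 1..N
```

For example, the first image's conversion runs from t = 0.05 s to t = 1.10 s, followed by a 0.05 s renewal period during which no column sounds.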
Additional requirements must be fulfilled for monotonicity (to avoid row permutation), $\omega_{i+1} > \omega_i$ for all $i$, and for separability of the contributions of the individual oscillators.
According to (2)-(4), $s(t)$ is
not defined during the image renewal periods of duration $T_r$,
i.e., for $t$ in
$$ t_k + T \le t < t_{k+1}. $$
As indicated before, a recognizable synchronization
click is generated within each of these periods.
The pixel brightnesses $p^{(k)}_{ij}$ are in (2) directly used
as amplitude values, which is always possible
by using an appropriate scale for measuring brightness.
The phases $\phi_i$ are just
arbitrary constants during the image-to-sound conversion, but they
may change during the generation of the synchronization clicks.
The sound patterns corresponding to simple shapes are easily imagined. For example, a straight bright line on a dark background, running from the bottom left to the upper right, will sound like a single tone steadily increasing in pitch, until the pitch jumps down after the click that separates subsequent images. Similarly, a bright rectangle standing on its side will sound like bandwidth-limited noise, with a duration corresponding to its width and a frequency band corresponding to its height and elevation. The simplicity of interpretation for simple shapes is very important, because it means that there is no major, motivation-diminishing learning barrier to overcome before one begins to understand the new image representation. More realistic images yield more complicated sound patterns, but learning more complicated patterns can be gradual and, because of the simplicity of the mapping, accessible to conscious analysis. Of course, in the end the interpretation should become ``natural'' and ``automatic'' (i.e., largely subconscious) for reasons of speed.
It is worth noticing that the above image-to-sound mapping does not exploit the ability of the human hearing system to detect time delays (causing phase shifts) between sounds arriving at the two ears. Because this is the natural way of detecting sound source directions, the perception of the $j$th column could be further supported by applying column-dependent time delays to the sound output for each ear. Using a monotonically increasing delay function $d(\cdot)$ of column position, with $d(0)=0$, one can take $d(j-1)$ for the left ear and $d(N-j)$ for the right ear. Although CCDs allow for a straightforward implementation, a delay function has not been implemented, because the time-after-click method seemed sufficient and could be realized without additional hardware cost.
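Such a delay scheme could be sketched as follows; the linear form of d(.) and the roughly 0.6 ms maximum interaural delay are assumptions for illustration only (the text requires only that d be monotonically increasing with d(0) = 0):

```python
def stereo_delays(j, N=64, d_max=0.0006):
    """Hypothetical column-dependent interaural delays (in seconds):
    d(j-1) for the left ear and d(N-j) for the right ear, using an
    assumed linear d(x) = d_max * x / (N - 1), which satisfies d(0) = 0.
    d_max ~ 0.6 ms is an assumed maximum interaural time difference."""
    d = lambda x: d_max * x / (N - 1)
    return d(j - 1), d(N - j)
```

For the leftmost column (j = 1) the left-ear signal is undelayed while the right-ear signal lags maximally, so the sound appears to come from the left; the situation reverses at j = N.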
In the following section, the values for M, N, G and T will be discussed. In order to prevent (significant) loss of information, a (virtually) bijective - i.e., invertible - mapping must be ensured through proper values for these parameters.