Auditory Wavelets?



A continuous wavelet transform (CWT) is a mathematical mapping that is in a number of ways similar to the classic Fourier transform: it is linear and invertible, and in discretized form with suitably chosen wavelets it can also be orthogonal. However, unlike the Fourier basis functions, the sines and cosines, which extend to infinity in time (and hence are not localized in time), wavelet basis functions drop towards zero outside a finite domain. This allows for an effective localization in both time and frequency, within the limits imposed by the time-frequency uncertainty relation: the product of the uncertainty in frequency and the uncertainty in time cannot fall below a fixed constant of the order of one over two pi (about 0.15915). The exact value depends on how the two uncertainties are defined; for root-mean-square widths it is one over four pi (about 0.0796). That same relation lies at the heart of quantum mechanics, where the Heisenberg uncertainty relation applies to energy and time, because energy is, via Planck's constant, proportional to the frequency of probability waves.

Because pure sines and cosines form single points in the frequency domain, they must extend to infinity in the time domain. Of course, physical events almost never take that long, nor do infinite frequencies play a role, and for practical use a good trade-off between localization in frequency and localization in time is often desired. An obvious way to get that is to force the (co)sines to zero outside a finite domain by applying a weighting function, or window function, to them. A number of these have been named after their proponents or inventors, like the Bartlett window (triangular), the Hann window (raised cosine), the Welch window (parabolic), the Hamming window, etc. Basically, these windows make various compromises between smooth onset and decay in the time domain on the one hand, and the main lobe width and sidelobe fall-off rate in the frequency domain on the other.

Windowing techniques based on these choices are used in creating spectrograms, which aim to give good simultaneous resolution in time and frequency. A spectrogram is made by sliding a weighting time window along the signal of interest and applying a short-time Fourier transform (STFT, or windowed Fourier transform, WFT) to each windowed segment, which finally gives a plot of slices of time-varying spectra. Such spectrograms, or sonograms in the case of sound, are often applied in speech analysis, acoustics, music and auditory display research, seismic research and many other activities dealing with time-varying phenomena. See the figure for two examples of sound spectrograms that were made using moving and overlapping rectangular time windows. One is a spectrogram of a second of human speech, and the other is a spectrogram of a second of sound generated by an experimental auditory display, The vOICe, developed for applications in the area of vision substitution and synthetic vision.

[Figure: Spectrogram of `Philips Research'. Spectrogram and waveform for human voice and The vOICe auditory display.]
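
The sliding-window procedure described above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library and a direct (slow) discrete Fourier transform. The function names `hann`, `dft_magnitudes` and `spectrogram`, the test signal, the frame length and the hop size are all assumptions made for this example; they are not the settings used for the spectrograms in the figure.

```python
import cmath
import math


def hann(n):
    """Hann (raised cosine) window of n samples."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of DFT bins 0 .. n/2 of a real-valued frame (direct O(n^2) sum)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def spectrogram(signal, frame_len, hop, window=None):
    """Slide a window along the signal; return one magnitude spectrum per position."""
    if window is None:
        window = [1.0] * frame_len  # rectangular window
    slices = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        slices.append(dft_magnitudes(frame))
    return slices


# Assumed test signal: a 500 Hz tone in the first half, 1500 Hz in the second half.
fs = 8000
n_total = 2048
signal = [math.sin(2.0 * math.pi * (500 if i < n_total // 2 else 1500) * i / fs)
          for i in range(n_total)]

frame_len, hop = 128, 64  # overlapping windows (50 percent overlap)
slices = spectrogram(signal, frame_len, hop, hann(frame_len))

# Strongest bin in each time slice, converted to Hz (bin spacing is fs / frame_len).
peaks_hz = [max(range(len(s)), key=s.__getitem__) * fs / frame_len for s in slices]
print(peaks_hz[0], peaks_hz[-1])
```

Passing `window=None` gives the rectangular windows mentioned in the text, and a longer `frame_len` trades time resolution for frequency resolution, in line with the uncertainty relation.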

Now wavelets have characteristic scales for localization in time and frequency more or less ``built-in,'' but unlike sines and cosines, wavelets are not uniquely defined, and different definitions of these wave packets may be preferred in different situations. Some examples of established wavelet classes are the Daubechies wavelets, Lemarié wavelets, Haar wavelets, Gabor wavelets and spline wavelets. However, we will not go into the details of these here.

A major qualitative difference between wavelets and windowed (co)sines is that for a fixed-width time window the characteristic number of dominant oscillations in the windowed (co)sine is proportional to frequency (or rather to basis function count), while for wavelets the number of dominant oscillations tends to be approximately constant. Simply put, more periods of a (co)sine fit into a given time interval at higher frequencies, whereas the wavelet scales its own ``time interval'' (effective width) with its period, narrowing at higher frequencies and widening at lower ones, so that the number of periods in this interval stays about constant. Of course, there is no basic reason why one could not apply this same trick of variable window widths to windowed (co)sines, to get the kind of constant quality factor (constant-Q) analysis offered by wavelets in applications where this is appropriate: just make the effective width of the time window for a sinusoid inversely proportional to its frequency.
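
The difference between the two windowing schemes comes down to simple arithmetic, illustrated below. The sketch counts how many periods of a sinusoid fit inside a fixed-width window, and computes the width of a window that is inversely proportional to frequency. The 20 ms fixed width and the 8 periods per window are arbitrary values assumed for illustration.

```python
def cycles_in_fixed_window(freq_hz, width_s):
    """Periods inside a window of fixed duration: proportional to frequency."""
    return freq_hz * width_s


def constant_q_width(freq_hz, cycles):
    """Window duration that always holds the same number of periods (width ~ 1/f)."""
    return cycles / freq_hz


FIXED_WIDTH_S = 0.020  # assumed 20 ms fixed-width window
CYCLES = 8             # assumed number of periods per constant-Q window

for f in (250.0, 500.0, 1000.0, 2000.0, 4000.0):
    fixed = cycles_in_fixed_window(f, FIXED_WIDTH_S)
    width = constant_q_width(f, CYCLES)
    print(f"{f:6.0f} Hz: fixed window holds {fixed:5.1f} periods; "
          f"constant-Q window is {1000.0 * width:6.2f} ms wide "
          f"and holds {f * width:.1f} periods")
```

Over these four octaves the period count in the fixed window grows sixteenfold, while the constant-Q window shrinks by the same factor and keeps its period count unchanged.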

[Figure: Time windows]

Consequently, unless the mathematical properties of an exact, lossless and/or orthogonal mapping and its inverse reconstruction are considered essential, there is no really convincing reason to use wavelets in sound synthesis and analysis. For instance, in auditory perception almost nothing is exact, even though the localization properties in time and frequency are very important. There is also no reason why the resonance properties of the basilar membrane would be best described using wavelets, since the constant-Q property approximates only part of the membrane. Constant bandwidth, as suited to fixed-width time windowing, better approximates the, admittedly often less important, low-frequency part. Therefore, provided we consider using variable-width time windows, it makes sense to give up on wavelets and their exact orthogonality and other properties, in favour of alternative wave packets that may be better and more easily ``tuned'' to (exploiting) the nonidealities of the human hearing system, just as it can be useful to make an auditory spectrogram by logarithmic compression of the frequency scale to reflect differences in frequency sensitivity along the auditory spectrum.

Furthermore, the human hearing system is known to be nonlinear in a number of ways, so even without knowing how to exploit that, one can argue that a linear mapping is most likely not an optimal mapping for human auditory scene analysis, although it may be quite good for most practical purposes. Alternatively, one may prefer the windowed (co)sines because of their conceptual simplicity, flexibility in exploring various time-frequency envelopes, and efficient implementation in software and hardware.
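
As a rough illustration of the point about the basilar membrane, one could let the window width follow a constant-bandwidth rule (fixed width) at low frequencies and a constant-Q rule (width inversely proportional to frequency) at high frequencies. The sketch below does this with a hypothetical helper `window_width`. The 500 Hz corner frequency and the 8 periods per window are placeholder values chosen for illustration only; they are neither measured properties of hearing nor parameters of The vOICe.

```python
def window_width(freq_hz, corner_hz=500.0, cycles=8.0):
    """Window duration in seconds: fixed below corner_hz, proportional to 1/f above.

    Both corner_hz and cycles are illustrative placeholders, not tuned values.
    """
    return cycles / max(freq_hz, corner_hz)


for f in (100.0, 250.0, 500.0, 1000.0, 4000.0):
    w = window_width(f)
    print(f"{f:6.0f} Hz: width {1000.0 * w:5.1f} ms, "
          f"{f * w:4.1f} periods in window")
```

This kind of piecewise rule is trivial to apply to windowed (co)sines, which is part of the flexibility argued for above.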

For reasons like these, The vOICe auditory display applies windowed (co)sines rather than wavelets, both in its real-time hardware and in The vOICe Web App, The vOICe for Windows and The vOICe for Android. The windowed (co)sines are here called ``voicels,'' because these little tonebursts act as the auditory counterparts of pixels. In the implementation of voicels, variable-order B-spline time windows were chosen, because

  1. B-splines are piecewise polynomial, allowing for very efficient function evaluation.

  2. B-splines have compact support (they are zero-valued outside a finite region), which contributes to their evaluation efficiency as a weighting function because of the limited overlap. For example, with quadratic B-splines, only three shifted B-splines contribute at any given point along the (time) axis.

  3. B-splines allow for any desired degree of continuity by selecting an appropriate B-spline order: beyond C0 piecewise linear (PWL) and C1 piecewise quadratic we get C2 piecewise cubic, etc., where Ck means that the function is continuously differentiable up to and including the k-th derivative.

  4. B-splines are variation diminishing, meaning that there are no more (local) extreme points in the superposition of B-splines than there are in the data that set the amplitudes for B-spline weighting.

  5. B-splines can without any problem be used with non-equidistant points while maintaining their variation diminishing and continuity properties. This makes it very easy to work with arbitrary sets of variable-width time windows.

  6. B-splines elegantly encompass rectangular (piecewise constant), triangular (piecewise linear) and piecewise quadratic time windows in a single mathematical framework, relating these popular low-order (as well as all higher order) variants through recursive definitions.

  7. B-splines are easily extended to an arbitrary number of dimensions, while preserving the above properties. This generalization employs tensor products of B-splines, being linear superpositions of products of scalar (univariate) B-splines in different directions. Such tensor products of quadratic variation diminishing (QVD) B-splines have been successfully applied in multivariate semiconductor device modelling for analog circuit simulation.

    [Figure: QVD B-spline weighted]

    References:
    W. M. Coughran, E. Grosse and D. J. Rose, ``Variation Diminishing Splines in Simulation,'' SIAM J. Sci. Stat. Comput., Vol. 7, pp. 696-705, April 1986. This is an excellent paper highlighting the merits of QVD splines for device modelling.
    P.B.L. Meijer, ``Fast and Smooth Highly Nonlinear Table Models for Device Modeling,'' IEEE Transactions on Circuits and Systems, Vol. 37, pp. 335-346, March 1990. This paper shows alternatives for tensor products of QVD splines for highly nonlinear multivariate data modelling.
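
Several of the B-spline properties listed above can be checked numerically. The following sketch evaluates B-splines with the standard Cox-de Boor recursion on a non-equidistant knot sequence, confirms that exactly three quadratic B-splines overlap at a given time and that they sum to one there, and forms a windowed cosine from one of them. The function names `bspline` and `voicel`, the knot values and the 2 Hz frequency are assumptions made for illustration; this is not the implementation used in The vOICe, and the plain recursion is written for clarity rather than speed.

```python
import math


def bspline(i, order, knots, t):
    """Value at t of the i-th B-spline of the given order (Cox-de Boor recursion).

    Order 1 is piecewise constant (rectangular), order 2 piecewise linear
    (triangular), order 3 piecewise quadratic. The knots may be non-equidistant
    but must be strictly increasing.
    """
    if order == 1:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = (t - knots[i]) / (knots[i + order - 1] - knots[i])
    right = (knots[i + order] - t) / (knots[i + order] - knots[i + 1])
    return (left * bspline(i, order - 1, knots, t)
            + right * bspline(i + 1, order - 1, knots, t))


def voicel(t, freq_hz, i, knots, order=3):
    """Hypothetical windowed cosine: a B-spline time window times a cosine."""
    return bspline(i, order, knots, t) * math.cos(2.0 * math.pi * freq_hz * t)


# Assumed non-equidistant knot sequence (variable-width time windows), in seconds.
knots = [0.0, 1.0, 1.5, 2.5, 4.0, 4.5, 6.0, 7.0]
t = 3.0

# All quadratic (order 3) B-splines that can be defined on these knots.
values = [bspline(i, 3, knots, t) for i in range(len(knots) - 3)]
nonzero = [i for i, v in enumerate(values) if v > 0.0]
print("overlapping quadratic B-splines at t = 3.0:", nonzero)
print("sum of their weights:", sum(values))

# The triangular window is the order-2 member of the same family.
print("triangular window at its midpoint:", bspline(0, 2, [0.0, 1.0, 2.0], 1.0))

# A windowed cosine built on B-spline number 2, sampled inside its support.
print("windowed cosine at t = 3.0:", voicel(t, 2.0, 2, knots))
```

The sum of one holds for any time between the third knot and the third-to-last knot, where a full set of three quadratic B-splines overlaps.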


The wavelet transform is a valuable tool for a wide variety of applications, as well as being a beautiful mathematical gem in its own right. However, the wavelet transform does not seem particularly advantageous in psychoacoustics when compared to a properly windowed Fourier transform.

For more information on sonification (auralization) by The vOICe, visit The vOICe Home Page.


Copyright © 1996 - 2024 Peter B.L. Meijer