Local invariance to time-shifting and stability to time-warping are necessary when representing acoustic scenes for similarity measurement. The scattering transform is designed to satisfy these properties while retaining high discriminative power. It is computed by applying auditory and modulation wavelet filter banks alternated with complex modulus nonlinearities.
Invariance and stability in audio signals
The notion of invariance to time-shifting plays an essential role in acoustic scene similarity retrieval. Indeed, recordings may be shifted locally in time without affecting their similarity to other recordings. To discard this superfluous source of variability, signals are first mapped into a time-shift invariant feature space. These features are then used to calculate similarities. Since the features themselves ensure invariance, it does not have to be learned when constructing the desired similarity measure.
Formally, given a signal x(t), we would like its translation xc(t)=x(t−c) to be mapped to the same feature vector provided that |c|≪T for some maximum duration T that specifies the extent of the time-shifting invariance. We can also define more complicated transformations by letting c vary with t. In this case, we have xτ(t)=x(t−τ(t)) for some function τ, which performs a time-warping of x(t) to obtain xτ(t). Time-warpings model various changes, such as small variations in pitch, reverberation, and rhythmic organization of events. These make up an important part of intra-class variability among natural sounds, so representations must be robust with respect to such transformations.
The wavelet scattering transform, described below, has both of these desired properties: invariance to time-shifting and stability to time-warping. The stability condition can be formulated as a Lipschitz continuity property, which guarantees that the feature transforms of x(t) and xτ(t) are close together if |τ′(t)| is bounded by a small constant [22].
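The effect of local averaging on time-shift invariance can be illustrated numerically. The sketch below is not the scattering transform itself: a simple moving average stands in for the lowpass filter ϕ(t), and white noise stands in for an audio signal. It shows that modulus-plus-averaging features barely move under a shift |c|≪T, while the raw waveforms move substantially.

```python
import numpy as np

def features(x, T):
    # Pointwise modulus followed by a moving average of length T:
    # a crude stand-in for the lowpass filter phi(t) of the text.
    kernel = np.ones(T) / T
    return np.convolve(np.abs(x), kernel, mode="same")

rng = np.random.default_rng(0)
n, T, c = 8192, 1024, 16      # signal length, invariance scale T, shift c with |c| << T
x = rng.standard_normal(n)
x_shift = np.roll(x, c)       # circular shift, to sidestep boundary effects

# Relative distances: the raw waveforms are far apart, the averaged features are not.
raw_dist = np.linalg.norm(x - x_shift) / np.linalg.norm(x)
feat_dist = np.linalg.norm(features(x, T) - features(x_shift, T)) / np.linalg.norm(features(x, T))
```

For uncorrelated noise the raw relative distance is close to √2, whereas the feature distance is orders of magnitude smaller, which is the invariance property exploited throughout this section.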
Wavelet scalogram
Our convention for the Fourier transform of a continuous-time signal x(t) is \(\boldsymbol {\hat {x}}(\omega) = \int _{-\infty }^{+\infty } x(t) \exp (- \mathrm {i} 2\pi \omega t) \, \mathrm {d}t\). Let ψ(t) be a complex-valued analytic bandpass filter of central frequency ξ1 and bandwidth ξ1/Q1, where Q1 is the quality factor of the filter. A filter bank of wavelets is built by dilating ψ(t) according to a geometric sequence of scales \(\phantom {\dot {i}\!}2^{\gamma _{1}/Q_{1}}\), obtaining
$$ \boldsymbol{\psi_{\gamma_{1}}}(t) = 2^{-\gamma_{1}/Q_{1}} \boldsymbol{\psi}\left(2^{-\gamma_{1}/Q_{1}} t\right)\text{.} $$
(1)
The variable γ1 is a scale (an inverse log-frequency) taking integer values between 0 and (J1Q1−1), where J1 is the number of octaves spanned by the filter bank. For each γ1, the wavelet \(\boldsymbol {\psi _{\gamma _{1}}}(t)\) has a central frequency of \(\phantom {\dot {i}\!}2^{-\gamma _{1}/Q_{1}}\xi _{1}\) and a bandwidth of \(\phantom {\dot {i}\!}2^{-\gamma _{1}/Q_{1}}\xi _{1}/Q_{1}\), resulting in the same quality factor Q1 as ψ. In the following, we set ξ1 to 20 kHz, J1 to 10, and the quality factor Q1, which is also the number of wavelets per octave, to 8. This results in the wavelet filters covering the whole range of human hearing, from 20 Hz to 20 kHz. Setting Q1=8 results in filters whose bandwidth approximates an equivalent rectangular bandwidth (ERB) scale [41].
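The geometry of this filter bank follows directly from Eq. (1). The short sketch below computes the central frequencies and bandwidths of all J1Q1 wavelets with the parameter values given in the text, and checks that the lowest channel lands near the bottom of human hearing.

```python
import numpy as np

# Filter-bank geometry implied by Eq. (1): central frequencies and bandwidths
# of the dilated wavelets psi_gamma1, with xi1, J1, Q1 as in the text.
xi1, J1, Q1 = 20_000.0, 10, 8          # Hz, octaves, wavelets per octave

gamma1 = np.arange(J1 * Q1)            # gamma1 = 0 .. J1*Q1 - 1
center_freqs = 2.0 ** (-gamma1 / Q1) * xi1
bandwidths = center_freqs / Q1         # constant quality factor Q1

lowest = center_freqs[-1]              # lowest channel, about 21 Hz
```

The ratio `center_freqs / bandwidths` is constant and equal to Q1, which is the constant-Q property used throughout this section.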
The wavelet transform of an audio signal x(t) is obtained by convolution with all wavelet filters. Applying a pointwise complex modulus, the transform yields the wavelet scalogram
$$ \boldsymbol{x_{1}}(t, \gamma_{1}) = \vert \boldsymbol{x} \ast \boldsymbol{\psi_{\gamma_{1}}} \vert (t). $$
(2)
The scalogram bears resemblance to the constant-Q transform (CQT), which is derived from the short-term Fourier transform (STFT) by averaging the frequency axis into constant-Q subbands with central frequencies \(\phantom {\dot {i}\!}2^{-\gamma _{1}/Q_{1}}\xi _{1}\). Indeed, both time-frequency representations are indexed by time t and log-frequency γ1. Unlike the CQT, however, the scalogram achieves good time-frequency localization across the whole frequency range, whereas the temporal resolution of the traditional CQT is fixed by the support of the STFT analysis window. The scalogram therefore has better temporal localization at high frequencies than the CQT, at the expense of a greater computational cost, since the inverse fast Fourier transform routine must be called for each wavelet \(\boldsymbol {\psi _{\gamma _{1}}}\) in the filter bank. In return, the scalogram reveals amplitude modulations at fine temporal scales, down to 2Q1/ξ1 for γ1=0, of the order of 1 ms given the aforementioned values of Q1 and ξ1.
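Eq. (2) can be sketched with FFT-domain filtering. In the toy implementation below, one-sided Gaussian bandpass filters stand in for the Gammatone wavelets of the paper, frequencies are expressed in cycles per sample, and the parameter values are illustrative rather than those used in the text.

```python
import numpy as np

def scalogram(x, xi1=0.45, J1=4, Q1=4):
    """Wavelet modulus |x * psi_gamma1|(t) as in Eq. (2), computed in the
    Fourier domain. Analytic (one-sided) Gaussian bandpass filters stand in
    for the Gammatone wavelets; xi1 is in cycles/sample."""
    n = len(x)
    omega = np.fft.fftfreq(n)                        # cycles/sample
    X = np.fft.fft(x)
    rows = []
    for g in range(J1 * Q1):
        fc = 2.0 ** (-g / Q1) * xi1                  # central frequency
        bw = fc / Q1                                 # constant-Q bandwidth
        H = np.exp(-0.5 * ((omega - fc) / bw) ** 2)  # Gaussian bandpass
        H[omega < 0] = 0.0                           # analytic: drop negative freqs
        rows.append(np.abs(np.fft.ifft(X * H)))      # complex modulus
    return np.stack(rows)                            # shape (J1*Q1, n)

sr = 8000
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 800 * t)                      # pure tone at 800 Hz
S = scalogram(x)
peak_band = int(np.argmax(S.mean(axis=1)))           # band with most energy
```

As expected for a pure tone, the energy concentrates in the channel whose central frequency is closest to 800 Hz.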
To obtain the desired invariance and stability properties, the scalogram is averaged in time using a lowpass filter ϕ(t) with cut-off frequency 1/T (and approximate duration T), to get
$$ \mathbf{S_{1}}\boldsymbol{x}(t, \gamma_{1}) = \boldsymbol{x_{1}}(\cdot, \gamma_{1}) \ast \boldsymbol{\phi}(t), $$
(3)
which is known as the set of first-order scattering coefficients. These coefficients capture the average spectral envelope of x(t) over time scales of duration T, with a spectral resolution that varies at constant Q. In this way, they are closely related to the mel-frequency spectrogram and derived features such as MFCCs.
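Eq. (3) can be sketched as a lowpass filtering of each scalogram row followed by subsampling. In the sketch below, ϕ is approximated by a moving average of length T, and the hop size T/2 is an illustrative choice, not prescribed by the text.

```python
import numpy as np

def first_order_scattering(x1, T, hop=None):
    """Average a scalogram x1 (shape: bands x time) with a lowpass filter of
    duration T, as in Eq. (3). Here phi is a simple moving average and the
    smoothed output is subsampled with an assumed hop of T // 2."""
    hop = hop or T // 2
    phi = np.ones(T) / T
    S1 = np.stack([np.convolve(row, phi, mode="same") for row in x1])
    return S1[:, ::hop]

rng = np.random.default_rng(1)
x1 = np.abs(rng.standard_normal((16, 4096)))   # stand-in scalogram (nonnegative)
S1 = first_order_scattering(x1, T=512)         # shape (16, 16)
```

Since the averaging removes structure faster than 1/T, the subsampling loses no information while reducing the feature rate, which is what makes S1 locally invariant to time-shifting.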
Extracting modulations with second-order scattering
In auditory scenes, short-time amplitude modulations may be caused by a variety of rapid mechanical interactions, including collision, friction, and turbulent flow. At longer time scales, they also account for higher-level attributes of sound, such as prosody in speech or rhythm in music. Although they are discarded when filtering x1(t,γ1) into the time-shift invariant representation S1x(t,γ1), they can be recovered from x1(t,γ1) by a second wavelet transform and another complex modulus.
We define second-order wavelets \(\boldsymbol {\psi _{\gamma _{2}}}(t)\) in the same way as the first-order wavelets, but with parameters ξ2, J2, and Q2. Consequently, they have central frequencies \(\phantom {\dot {i}\!}2^{-\gamma _{2}/Q_{2}}\xi _{2}\) for γ2 taking values between 0 and (J2Q2−1). While this abuses notation slightly, the identity of the wavelets should be clear from context. The amplitude modulation spectrum resulting from a wavelet modulus decomposition using these second-order wavelets is then
$$ \boldsymbol{x_{2}}(t,\gamma_{1},\gamma_{2}) = \vert \boldsymbol{x_{1}} \ast \boldsymbol{\psi_{\gamma_{2}}} \vert(t,\gamma_{1}). $$
(4)
In the following, we set ξ2 to 2.5 kHz, Q2 to 1, and J2 to 12. Lastly, the low-pass filter ϕ(t) is applied to x2(t,γ1,γ2) to guarantee local invariance to time-shifting, which yields the second-order scattering coefficients
$$ \mathbf{S_{2}}\boldsymbol{x}(t,\gamma_{1},\gamma_{2}) = \boldsymbol{x_{2}}(\cdot,\gamma_{1},\gamma_{2}) \ast \boldsymbol{\phi}(t). $$
(5)
The scattering transform Sx(t,γ) consists of the concatenation of first-order coefficients S1x(t,γ1) and second-order coefficients S2x(t,γ1,γ2) into a feature matrix Sx(t,γ), where γ denotes either γ1 or (γ1,γ2). While higher-order scattering coefficients can be calculated, the first and second orders are sufficient for the purposes of our current work. Indeed, higher-order scattering coefficients have been shown to contain little energy and are therefore of limited use [36].
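The full two-layer cascade of Eqs. (2)-(5) can be sketched end to end. As in the previous sketches, one-sided Gaussian filters stand in for the Gammatone wavelets, ϕ is a moving average, and all parameter values are small toy values chosen for readability, not those of the text.

```python
import numpy as np

def analytic_bandpass_bank(n, xi, J, Q):
    """One-sided Gaussian bandpass filters in the Fourier domain, standing in
    for the Gammatone wavelet filter bank; frequencies in cycles/sample."""
    omega = np.fft.fftfreq(n)
    bank = []
    for g in range(J * Q):
        fc = 2.0 ** (-g / Q) * xi
        h = np.exp(-0.5 * ((omega - fc) / (fc / Q)) ** 2)
        h[omega < 0] = 0.0
        bank.append(h)
    return np.stack(bank)

def scattering(x, T=512, xi1=0.4, J1=4, Q1=2, xi2=0.2, J2=3, Q2=1):
    """First- and second-order scattering coefficients, Eqs. (2)-(5),
    concatenated along the path axis; phi is a moving average of length T."""
    n = len(x)
    phi = np.ones(T) / T
    avg = lambda u: np.convolve(u, phi, mode="same")[::T // 2]
    X = np.fft.fft(x)
    H1 = analytic_bandpass_bank(n, xi1, J1, Q1)
    H2 = analytic_bandpass_bank(n, xi2, J2, Q2)
    paths = []
    for h1 in H1:
        x1 = np.abs(np.fft.ifft(X * h1))           # Eq. (2): scalogram row
        paths.append(avg(x1))                      # Eq. (3): S1 coefficients
        X1 = np.fft.fft(x1)
        for h2 in H2:
            x2 = np.abs(np.fft.ifft(X1 * h2))      # Eq. (4): modulation spectrum
            paths.append(avg(x2))                  # Eq. (5): S2 coefficients
    return np.stack(paths)                         # shape (paths, frames)

x = np.random.default_rng(2).standard_normal(4096)
Sx = scattering(x)                                 # 8 * (1 + 3) = 32 paths
```

Each first-order path γ1 contributes one row of S1 and J2Q2 rows of S2, so the feature matrix has J1Q1(1+J2Q2) rows in total.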
Gammatone wavelets
Wavelets \(\boldsymbol {\psi _{\gamma _{1}}}(t)\) and \(\boldsymbol {\psi _{\gamma _{2}}}(t)\) are designed as fourth-order Gammatone wavelets with one vanishing moment [35] and are shown in Fig. 1. In the context of auditory scene analysis, the asymmetric envelopes of Gammatone wavelets are more biologically plausible than the symmetric, Gaussian envelopes of the more widely used Morlet wavelets. Indeed, this asymmetry reproduces two important psychoacoustic effects in the mammalian cochlea: the asymmetry of temporal masking and the asymmetry of spectral masking [41]. The asymmetry of temporal masking refers to the fact that a masking noise has to be louder if placed after the onset of a stimulus rather than before. Likewise, because critical bands are skewed towards higher frequencies, a masking tone has to be louder if it is above the stimulus in frequency rather than below. It should also be noted that Gammatone wavelets follow the typical amplitude profile of natural sounds, beginning with a relatively sharp attack and ending with a slower decay. As such, they resemble the filters discovered automatically by unsupervised encoding of natural sounds [30, 31]. In addition, Gammatone wavelets have been shown to outperform Morlet wavelets on a benchmark of supervised musical instrument classification from scattering coefficients [20]. This suggests that, despite being hand-crafted rather than learned, Gammatone wavelets provide a sparser time-frequency representation of acoustic scenes than other variants. More information can be found in Additional file 2.
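The asymmetric attack-decay profile of a fourth-order Gammatone filter is easy to verify numerically. The sketch below generates the classical complex Gammatone impulse response t³e^(−2πbt)e^(i2πf_c t) with an ERB-rate bandwidth; it omits the one-vanishing-moment correction of the wavelets in [35], and the ERB constants follow the standard Glasberg-Moore formula rather than anything specified in the text.

```python
import numpy as np

def gammatone(fc, sr, dur=0.05, order=4, erb_scale=1.019):
    """Complex fourth-order Gammatone impulse response centered at fc (Hz).
    Bandwidth follows the Glasberg-Moore ERB rule; the one-vanishing-moment
    correction of the paper's wavelets is omitted in this sketch."""
    t = np.arange(int(dur * sr)) / sr
    erb = 24.7 + fc / 9.265                  # equivalent rectangular bandwidth (Hz)
    b = erb_scale * erb                      # exponential decay rate
    g = (t ** (order - 1)
         * np.exp(-2 * np.pi * b * t)
         * np.exp(2j * np.pi * fc * t))
    return g / np.linalg.norm(g)             # unit-energy normalization

g = gammatone(fc=1000.0, sr=16000)
env = np.abs(g)
peak = int(np.argmax(env))    # envelope peaks early: sharp attack, slow decay
```

The envelope t³e^(−2πbt) peaks at t = 3/(2πb), a few milliseconds after onset, then decays slowly, which is exactly the asymmetry discussed above.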