Physiologically Inspired Algorithm
The Physiologically Inspired Algorithm (PA) is a sound-processing algorithm based on the auditory system. It receives binaural speech inputs and transforms the speech signals into neural spikes for processing. After processing, it reconstructs the neural spikes back into the acoustic domain. The PA is composed of four key stages: a cochlear filter bank, a midbrain spatial localization model, a cortical network model, and a stimulus reconstruction step. These components are implemented in MATLAB (MathWorks, Natick, MA) and Python, and are illustrated in Fig. 1. Below, we describe each of these components in detail.
Cochlear Filter Bank
The cochlear filter bank represents a fundamental stage in the processing of sounds at the periphery of the auditory system, where sounds are first decomposed into different frequency bands (Patterson et al. 1992). This is implemented using an equivalent rectangular bandwidth (ERB) gammatone filter bank (Slaney 1998), a widely used representation in computational models of the auditory system (Fig. 1A). The filter bank consists of 36 frequency channels with center frequencies ranging from 300 to 5000 Hz. The PA uses 36 frequency channels because this number provides a good balance between physiological accuracy and computational complexity; additional frequency channels provided minimal benefit to the model. Subsequent stages of the model assume that each frequency channel is processed independently, so the processing in each subsequent stage is repeated for each frequency channel.
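As a concrete illustration, the sketch below builds 36 ERB-spaced center frequencies between 300 and 5000 Hz and filters a signal with simple FIR gammatone impulse responses. It is a minimal stand-in for the Slaney (1998) toolbox implementation rather than the exact code used; the bandwidth constant and impulse-response length are illustrative assumptions.

```python
import numpy as np

def erb_space(f_low, f_high, n_channels):
    """Center frequencies equally spaced on the ERB-rate scale
    (Glasberg & Moore), from f_low to f_high in Hz."""
    to_erb = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    from_erb = lambda e: (10 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    return from_erb(np.linspace(to_erb(f_low), to_erb(f_high), n_channels))

def gammatone_ir(cf, fs, dur=0.025, order=4):
    """FIR impulse response of a gammatone filter centered at cf (Hz)."""
    t = np.arange(0.0, dur, 1.0 / fs)
    erb = 24.7 * (4.37 * cf / 1000.0 + 1.0)   # equivalent rectangular bandwidth
    b = 1.019 * erb                           # bandwidth scaling (Patterson et al.)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * cf * t)

def cochlear_filter_bank(signal, fs, cfs):
    """Decompose a waveform into one band-limited signal per channel."""
    return np.stack([np.convolve(signal, gammatone_ir(cf, fs), mode="same")
                     for cf in cfs])

# 36 channels spanning 300-5000 Hz, as in the PA's cochlear stage
cfs = erb_space(300.0, 5000.0, 36)
```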
Midbrain Spatial Localization Network
To identify the location of the sound source, the auditory system exploits two important spatial cues: the interaural time difference (ITD) and the interaural level difference (ILD). ITD is created when a sound arrives at the more proximal ear earlier than the more distal ear, while ILD is created when the head shadows the more distal ear, decreasing the loudness compared with the more proximal ear. There are many models for binaural cue computation (Dietz et al. 2018). We elected to adapt a physiology-based model of the spatial localization network of the barn owl midbrain (Fischer et al. 2009), because it is one of the most accurate and best understood physiological systems for sound localization.
The model is illustrated in Fig. 1B. It calculates ITD using a cross-correlation-like operation, and calculates ILD by taking the difference in the energy envelopes between the left and right signals. Readers are referred to the original work by Fischer and colleagues for detailed mathematical descriptions of binaural cue extraction (Fischer et al. 2009). In a subsequent stage of processing in the inferior colliculus model, the ITD cues are combined with ILD cues via a multiplication-like operation (Peña and Konishi 2001; Fischer et al. 2007). The model we adapted from Fischer et al. operates on discretized waveforms, while the next stage of our work (Fig. 1C) operates on neural spikes; therefore, we implement model neurons at this stage to encode the input waveforms into neural spikes. Five model neurons were implemented, with preferred ITDs and ILDs corresponding to −90°, −45°, 0°, 45°, and 90° azimuth, respectively. The firing probability of each model neuron is calculated by adding the ITD and ILD signals at the subthreshold level, followed by an input-output nonlinearity given by a thresholded sigmoid function; this overall operation effectively represents a multiplication of ITD and ILD cues, as observed physiologically (Peña and Konishi 2001). Spikes are generated from the calculated firing probabilities using a Poisson spike generator. The calculations described above are carried out independently for each frequency channel, corresponding to the frequency channels of the cochlear filter bank model.
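For clarity, the following sketch shows how one model neuron's firing probability can be computed by summing the ITD and ILD signals at the subthreshold level and passing the result through a thresholded sigmoid, followed by Poisson-like spike generation. The threshold and slope values are illustrative placeholders, not the parameters fitted to the KEMAR HRTFs, and the upstream ITD/ILD extraction (Fischer et al. 2009) is assumed to have been computed already.

```python
import numpy as np

def midbrain_firing_probability(itd_signal, ild_signal, threshold=0.2, slope=10.0):
    """Additive subthreshold combination of ITD and ILD cues followed by a
    thresholded sigmoid nonlinearity (threshold/slope are illustrative)."""
    drive = itd_signal + ild_signal
    p = 1.0 / (1.0 + np.exp(-slope * (drive - threshold)))
    p[drive < threshold] = 0.0        # silent below the hard threshold
    return p

def generate_spikes(prob, rng=None):
    """Draw 0/1 spikes per time bin from per-bin firing probabilities
    (a Bernoulli approximation to a Poisson spike generator)."""
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(prob.shape) < prob).astype(int)
```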
We tuned the specific parameters of the model neuron to match the ITDs and ILDs for a human head. We calculated the azimuth-specific ITD and azimuth- and frequency-specific ILD of KEMAR head-related transfer functions (HRTFs) for the five azimuth locations. For each preferred azimuth, we adjusted the ITD and ILD tuning parameters of the model neuron to match the ITD and ILD calculated for that azimuth and frequency.
Cortical Network Model: Inhibition Across Spatial Channels
The cortical network implements the critical computation of inhibiting off-target spatial channels. The network implemented here uses neural spikes as both input and output, and its architecture is illustrated in Fig. 1C. The five azimuth locations on the left side of Fig. 1C represent inputs from the midbrain model neurons. The inputs excite both relay neurons (R) and interneurons (I). The relay neurons for each azimuth excite a single cortical neuron (C) (i.e., the cortical neuron integrates information across spatial channels), while the interneurons inhibit the relay neurons of the other spatial channels. Each node of the network is composed of leaky integrate-and-fire neurons. For all neurons, the resting potential was −60 mV, the spiking threshold was −40 mV, and the reversal potential for excitatory currents was 0 mV. In relay neurons, the reversal potential for inhibitory currents was −70 mV. In interneurons, the excitatory synaptic conductance was modeled as an alpha function with a time constant of 1 ms. In relay neurons, the excitatory and inhibitory synaptic conductances were modeled as the difference of a rising and a falling exponential, with rise and fall time constants of 1 and 3 ms, and 4 and 1000 ms, respectively. Time constants were chosen based on the study by Dong et al. (2016), with the exception of the fall time of the relay neurons' inhibitory synapses. Since the goal of this network is to optimize the reconstruction of the auditory target, the fall time of the relay neuron's inhibitory synapse was increased to 1000 ms to produce a strong, sustained suppression of maskers in off-target channels and, in turn, the best reconstruction of targets. An absolute refractory period of 3 ms was enforced in all neurons. Synaptic strengths were uniform across all spatial channels for the same type of synapse. The synaptic conductances from the inputs to the interneurons and to the relay neurons were 0.11 and 0.07 nS, respectively. The synaptic conductance from relay to cortical neurons was 0.07 nS. The conductance for the cross-spatial-channel inhibition was 0.2 nS, the minimum value required to effectively suppress off-target spatial channels. The network connectivity was set to select sounds originating from 0° azimuth, as shown by the blue inhibitory pathways in Fig. 1C. A separate cortical network model is instantiated for each frequency channel, with identical structure across all frequency channels. There are no interactions between frequency channels unless otherwise specified.
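As an illustration of the relay-neuron dynamics described above, the sketch below implements a conductance-based leaky integrate-and-fire neuron with difference-of-exponentials synapses. The resting potential, threshold, reversal potentials, synaptic time constants, peak conductances, and refractory period follow the values in the text; the leak conductance and membrane capacitance are assumed values that the text does not specify.

```python
import numpy as np

def syn_conductance(spikes, dt, tau_rise, tau_fall, g_peak):
    """Convolve a 0/1 spike train with a difference-of-exponentials kernel
    (normalized to unit peak) to obtain a synaptic conductance time course."""
    t = np.arange(0.0, 5.0 * tau_fall, dt)
    k = np.exp(-t / tau_fall) - np.exp(-t / tau_rise)
    k /= k.max()
    return g_peak * np.convolve(spikes, k)[:len(spikes)]

def lif_relay_neuron(exc_spikes, inh_spikes, dt=1e-4,
                     g_exc_peak=0.07e-9, g_inh_peak=0.2e-9,   # 0.07 nS input excitation, 0.2 nS cross-channel inhibition
                     E_rest=-60e-3, E_exc=0.0, E_inh=-70e-3,  # resting and reversal potentials (V)
                     V_thresh=-40e-3, t_refrac=3e-3,
                     g_leak=10e-9, C_m=200e-12):              # assumed; not specified in the text
    """Conductance-based leaky integrate-and-fire relay neuron (sketch)."""
    # Excitatory synapse: 1 ms rise / 3 ms fall; inhibitory: 4 ms rise / 1000 ms fall
    g_e = syn_conductance(exc_spikes, dt, 1e-3, 3e-3, g_exc_peak)
    g_i = syn_conductance(inh_spikes, dt, 4e-3, 1.0, g_inh_peak)

    V = np.full(len(exc_spikes), E_rest)
    out_spikes = np.zeros(len(exc_spikes), dtype=int)
    refrac_until = 0
    for n in range(1, len(V)):
        if n < refrac_until:               # absolute refractory period: clamp to rest
            V[n] = E_rest
            continue
        I = (g_leak * (E_rest - V[n - 1])
             + g_e[n] * (E_exc - V[n - 1])
             + g_i[n] * (E_inh - V[n - 1]))
        V[n] = V[n - 1] + dt * I / C_m
        if V[n] >= V_thresh:               # emit a spike and reset
            out_spikes[n] = 1
            V[n] = E_rest
            refrac_until = n + int(t_refrac / dt)
    return out_spikes, V
```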
Stimulus Reconstruction
The output of the cortical network is a set of processed neural spikes for each frequency channel. To evaluate model performance, the neural response is "translated" back into an acoustic waveform that humans can understand via a "stimulus reconstruction." Here, we develop a novel stimulus reconstruction technique based on the estimation of a linear reconstruction filter (Fig. 1D) (Bialek et al. 1991; Stanley et al. 1999; Mesgarani et al. 2009; Mesgarani and Chang 2012). The basic idea is to first convolve a neural spike train with a reconstruction filter to estimate the envelopes of the acoustic waveform (see "Optimal Filter"). Since each frequency channel has a distinct set of neural spikes, this process is carried out independently for each channel. Then, the envelopes are used to modulate carrier signals to obtain narrowband signals. Finally, the narrowband signals across frequency channels are summed (without weighting) to obtain the reconstructed stimulus. We tested two different carrier signals for the reconstruction algorithm: (1) pure tones with frequencies equal to the center frequencies of each channel, and (2) band-limited noise restricted to the frequency range of each channel. In this manuscript, we present the results for pure-tone carriers, which achieved the highest quantitative scores on the short-time objective intelligibility (STOI) measure (see "Measures of Reconstruction Quality and Segregation Performance").
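A minimal sketch of this envelope-and-carrier scheme, assuming the per-channel reconstruction filters have already been derived (see "Optimal Filter"), is:

```python
import numpy as np

def reconstruct_stimulus(spikes, filters, cfs, fs):
    """Reconstruct an acoustic waveform from per-channel spike trains.
    spikes: (n_channels, n_samples) 0/1 arrays; filters: per-channel
    reconstruction filters h_f(t); cfs: channel center frequencies (Hz)."""
    n_channels, n_samples = spikes.shape
    t = np.arange(n_samples) / fs
    out = np.zeros(n_samples)
    for f in range(n_channels):
        env = np.convolve(spikes[f], filters[f], mode="same")  # estimated envelope
        env = np.maximum(env, 0.0)                              # envelopes are non-negative
        carrier = np.sin(2 * np.pi * cfs[f] * t)                # pure-tone carrier at the CF
        out += env * carrier                                    # narrowband signal
    return out                                                  # unweighted sum across channels
```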
Optimal Filter
Commonly used, analytically derived reconstruction filters assume that each frequency channel is independent of the others (Theunissen et al. 2001; Mesgarani et al. 2009): for a set of stimuli and responses from frequency channel \( f \), the stimulus waveform \( s_f(t) \) can be reconstructed from a spike train \( x_f(t) \) with spike times \( t_i \) (\( i = 1, 2, \cdots, n \)) by convolving \( x_f(t) \) with a linear reconstruction filter \( h_f(t) \) to obtain an estimate of the original stimulus: \( s_{est,f}(t)=\sum_{i=1}^{n} h_f\left(t-{t}_i\right) \). We derive \( h_f(t) \) in the frequency domain: \( H\left(\omega \right)=\frac{S_{sx}\left(\omega \right)}{S_{xx}\left(\omega \right)} \), where \( S_{sx}(\omega) \) is the cross-spectral density of a training stimulus \( s(t) \) and the corresponding spike train \( x(t) \), and \( S_{xx}(\omega) \) is the power spectral density of the neural training response (Rieke et al. 1997; Gabbiani and Koch 1998). We restricted the length of \( h_f(t) \) to 51.2 ms (2048 taps). The estimated original stimulus is then found by taking the unweighted sum across individual frequency channels: \( s_{est}(t) = \sum_f s_{est,f}(t) \).
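A sketch of this per-channel derivation is given below, using Welch-style spectral estimates as stand-ins for the exact spectral estimation procedure, which the text does not specify; the FFT length, regularization constant, and filter centering are illustrative choices.

```python
import numpy as np
from scipy.signal import csd, welch

def optimal_filter(stimulus, spikes, fs, n_taps=2048):
    """Estimate the linear reconstruction filter h_f(t) for one channel as
    H(w) = S_sx(w) / S_xx(w), then inverse-transform and truncate to n_taps."""
    nfft = 2 * n_taps
    # Cross-spectral density between the spike train and the stimulus (S_sx in the text)
    _, S_sx = csd(spikes, stimulus, fs=fs, nperseg=nfft, return_onesided=False)
    # Power spectral density of the neural response (S_xx in the text)
    _, S_xx = welch(spikes, fs=fs, nperseg=nfft, return_onesided=False)
    H = S_sx / (S_xx + 1e-12)          # small constant guards against division by zero
    h = np.real(np.fft.ifft(H))
    h = np.roll(h, n_taps // 2)        # center the (possibly acausal) filter before truncation
    return h[:n_taps]
```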
In contrast to the analytical approach described above, we introduce an additional frequency dimension, \( \omega \), into the optimal linear filter to address potential interactions across frequency channels: \( h_f(t, \omega) \). Such interactions may exist because of the relatively wide bandwidths of the gammatone filters, which allow energy from one channel to be picked up by adjacent channels. The estimated stimulus is obtained via a two-dimensional convolution computed without zero-padding ("valid" mode in MATLAB or Python): \( s_{est,f} = h_f(t, \omega) \ast x(t, \omega) \), where \( x(t, \omega) \) is the set of response spike trains for all frequency channels over time \( t \). Since the convolution is only computed on elements that do not require zero-padding, the result is a one-dimensional signal of length \( t - 2048 \). To calculate \( h_f(t, \omega) \), we initialized a zero matrix and set \( h_f(t, \omega)|_{\omega = f} = h_f(t) \). We then used gradient descent to minimize the mean-squared error (MSE) between the original signal's envelopes and the reconstructed envelopes, treating the values of \( h_f(t, \omega) \) as free parameters. Initial one-dimensional reconstruction filters \( h_f(t) \) were calculated in MATLAB, and the two-dimensional filters were optimized using the Theano toolbox in Python. The same process is repeated for each frequency channel \( f \). We found that the optimal two-dimensional filter improved the reconstructions by 26 % relative to the one-dimensional filter, from 0.58 to 0.73, as assessed by the STOI measure (see "Measures of Reconstruction Quality and Segregation Performance").
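The two-dimensional filter optimization can be sketched in plain numpy as follows (the original work used the Theano toolbox). The learning rate and iteration count are illustrative, the lagged-input view is memory-hungry for realistic signal lengths, and the target is one channel's envelope, as in the text.

```python
import numpy as np

def fit_2d_filter(x, target_env, n_taps=2048, lr=1e-3, n_iter=500):
    """Fit a 2-D reconstruction filter h_f(t, w) by gradient descent on the MSE
    between the true and estimated envelopes of one frequency channel.
    x: (n_samples, n_channels) spike trains for all channels."""
    n_samples, n_channels = x.shape
    n_valid = n_samples - n_taps + 1                  # "valid" output length
    h = np.zeros((n_taps, n_channels))                # h[:, f] may be initialized to the 1-D filter
    target = target_env[:n_valid]
    # Lagged view of x: X[n, k, w] = x[n + k, w]  (large for long signals)
    X = np.stack([x[k:k + n_valid] for k in range(n_taps)], axis=1)
    for _ in range(n_iter):
        est = np.einsum("nkw,kw->n", X, h)            # estimated envelope
        err = est - target
        grad = 2.0 * np.einsum("n,nkw->kw", err, X) / n_valid
        h -= lr * grad                                # gradient-descent update
    return h
```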
Reconstruction Filter Training
We constructed a training waveform by extracting one instance of every call-sign and color-number combination from the CRM corpus (see "Speech Stimuli") and concatenating these into one continuous sound waveform. To derive the optimal filter, the training waveform was presented to the PA at 0° as the training input stimulus, and the corresponding cortical response was used as the training target. Since the optimal filter is a mapping between the clean CRM utterances (prior to introducing the HRTF) and neural spikes, the effects of the HRTF are removed from the reconstructed stimuli. After deriving the reconstruction filter, we tested the algorithm on other CRM sentences and their corresponding neural responses. Note that the training phase requires only clean speech. The filter did not need to be re-trained as long as the frequency channels remained independent of one another.
Cross Validation and Overfitting
We ran simulations with randomly selected TIMIT corpus sentences (Victor et al. 1990) while reconstructing with the CRM-corpus-trained reconstruction filter. The reconstruction performance (see "Measures of Reconstruction Quality and Segregation Performance") did not differ significantly from that of the simulations run with the CRM corpus, differing by 4 % on average. Based on this result, we determined that overfitting was not an issue.
Code Accessibility
The code for the PA is available upon request.
Simulations
Speech Stimuli
The coordinated response measure (CRM) corpus (Bolia et al. 2000) was used to train and test the novel stimulus reconstruction technique, as well as to test the segregation and reconstruction performance of our physiologically inspired model. The CRM corpus is a large set of recorded sentences of the form [Ready CALLSIGN go to COLOR NUMBER now], where the call sign, color, and number have 8, 4, and 8 variations, respectively. All recordings were stored as 40-kHz binary sound files. Directionality was added to the recordings by convolving each recording with the KEMAR (Burkhard and Sachs 1975) head-related transfer functions (HRTFs) corresponding to the appropriate location (Gardner and Martin 1995; Kim and Choi 2005). For each simulation, we randomly selected three sentences from the CRM corpus and designated one to be the "target" and the remaining two to be the "maskers." Since the result of each simulation is quantified using entire sentences, repeats of any of the three keywords could artificially inflate the results; therefore, sentences in each trio were not allowed to share the same call sign, color, or number. For simulations with a single talker, only the "target" sentence was used.
Simulation Scenarios
To test the segregation and reconstruction quality of the PA, we configured the model network to "attend to" 0° azimuth (Fig. 1C) by activating only the inhibitory neuron in the 0° spatial channel. This is achieved by setting the I-R connection matrix such that the only non-zero values in the matrix correspond to connections from the 0° I neuron to the R neurons of the other spatial channels.
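A minimal sketch of this connectivity, assuming the I-to-R weights are stored as a matrix indexed by (source channel, target channel):

```python
import numpy as np

# Spatial channels correspond to preferred azimuths of -90, -45, 0, 45, 90 degrees
azimuths = [-90, -45, 0, 45, 90]
attend_idx = azimuths.index(0)            # "attend to" 0 degrees

# I-to-R connection matrix: entry [i, j] is the inhibitory conductance from the
# interneuron of channel i onto the relay neuron of channel j.  Only the row of
# the attended channel is non-zero, and a channel does not inhibit itself.
g_inh = 0.2e-9                            # 0.2 nS, from the text
W_IR = np.zeros((len(azimuths), len(azimuths)))
W_IR[attend_idx, :] = g_inh
W_IR[attend_idx, attend_idx] = 0.0
```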
We designed three simulations to demonstrate that the PA is capable of: (1) monitoring the entire azimuth in quiet, (2) selectively encoding a preferred location while suppressing a non-preferred location when competition arises, and (3) robustly encoding a preferred location when maskers become louder than targets. Each simulation was repeated 20 times, each time using a different trio of CRM sentences. In the first simulation, we presented the PA with a single target at locations between 0° and 90° azimuth, at 5° intervals. We then calculated assessment measures (see "Measures of Reconstruction Quality and Segregation Performance") of the quality and intelligibility of the reconstructed signal compared with the original target signal. In the second simulation, we presented one sentence at the target location (0°) and two masker sentences at symmetrical locations, in 5° intervals from 0° to ±90°. The sentences had a target-to-masker ratio (TMR) of 0 dB, defined as the energy difference between the target and an individual masker. We then calculated the speech intelligibility of the reconstruction relative to the target and masker sentences, respectively, for all masker locations. We then swapped the locations of the two masker sentences and repeated the simulation. The third simulation was designed to test the robustness of the PA at low SNRs. In this simulation, the target was fixed at 0° and the maskers were fixed at ±90°. The TMR was then varied between −13 and 13 dB, which equates to signal-to-noise ratios (SNRs) of −16 to 10 dB.
Measures of Reconstruction Quality and Segregation Performance
We compared several objective measures of speech intelligibility and quality, including the STOI (Taal et al. 2010), the normalized covariance metric (NCM) (Chen and Loizou 2011), and the PESQ (Rix et al. 2001), each of which quantifies the intelligibility or quality of a processed signal relative to its original unprocessed form (i.e., a reference signal). A higher score indicates better intelligibility or quality of the processed signal for human listeners. All CRM sentences contain the words "ready," "go to," and "now," and these repetitions could inflate individual objective scores. Therefore, relative segregation performance is quantified by the score difference obtained when using the target versus the maskers as the reference signal, which we call Δ. In our analyses, all three objective measures performed qualitatively similarly, so we present only the STOI results here. The STOI is designed to measure the intelligibility of speech in the presence of added noise, which makes it an appropriate measure of the quality of the reconstructed speech. STOI scores have been shown to be well correlated with subjective intelligibility (Taal et al. 2010):
$$ \mathrm{Predicted\ subjective\ intelligibility}\ \left(\%\right)=\frac{100}{1+\exp \left(-13.1903\cdot \mathrm{STOI}+6.5192\right)} $$
where subjective intelligibility is measured as the percentage of words correctly recognized by human listeners. By this measure, an STOI score above 0.7 is highly intelligible, corresponding to more than 90 % of words correct. MATLAB functions for computing the intelligibility measures were generously provided by Stefano Cosentino at the University of Maryland.
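For reference, the mapping can be evaluated directly; for example, an STOI of 0.7 predicts roughly 94 % of words correct:

```python
import numpy as np

def predicted_intelligibility(stoi_score):
    """Map an STOI score to predicted percent of words correct (Taal et al. 2010)."""
    return 100.0 / (1.0 + np.exp(-13.1903 * stoi_score + 6.5192))

print(predicted_intelligibility(0.7))   # ~93.8 % words correct
```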
Frequency Tuning
The model network we used assumes sharp frequency tuning, in which frequency channels do not interact with one another. Cortical neurons, however, have been found to have varying sharpness of frequency tuning (Sen et al. 2001), and model performance may depend on the frequency tuning width. For these reasons, we explored the effects of frequency tuning on network performance for single-target reconstructions. We modeled the spread of information across frequency channels with a Gaussian-shaped weighting function centered on the center frequency (CF) of each frequency channel:
$$ {w}_{i,j}=\exp \left(-\frac{{\left(\mathrm{CF}_j-\mathrm{CF}_i\right)}^2}{2{\sigma}_i^2}\right) $$
where \( i \) and \( j \) are the indices of frequency channels and \( \sigma_i \) is the standard deviation. The spread of information is modeled by having the relay neurons centered at \( \mathrm{CF}_i \) receive inputs from neighboring frequency channels, centered at \( \mathrm{CF}_j \), weighted by \( w_{i,j} \). The values of \( \sigma_i \) used in this simulation were determined by introducing the variable \( Q \), defined as the ratio of CF to the full width at half maximum (FWHM) of a tuning curve (Sen et al. 2001). Here, we formulate \( Q \) in terms of the Gaussian weighting function's FWHM, which can then be related to \( \sigma_i \): \( Q=\frac{\mathrm{CF}_i}{\mathrm{FWHM}}=\frac{\mathrm{CF}_i}{2\sqrt{2\ln 2}\,{\sigma}_i} \). We tested values of \( Q \) ranging from \( Q = 0.85 \) (broad tuning) to \( Q = 23 \) (sharp tuning). For reference, \( Q \) values from field L in the zebra finch have been reported to range from 0.4 to 7.8 (Sen et al. 2001). This is the only simulation in which there are interactions between frequency channels. Because of this cross-frequency interaction, we re-trained the reconstruction filter for each \( Q \), using the same training sentences described previously as the training stimulus and the corresponding spike trains for each \( Q \) as the training target.
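A sketch of this weighting, computing the full matrix \( w_{i,j} \) across the channel CFs for a given \( Q \):

```python
import numpy as np

def cross_frequency_weights(cfs, Q):
    """Gaussian weighting matrix w[i, j] spreading input from channel j onto the
    relay neurons of channel i, for a given sharpness Q = CF / FWHM."""
    cfs = np.asarray(cfs, dtype=float)
    sigma = cfs / (2.0 * np.sqrt(2.0 * np.log(2.0)) * Q)   # FWHM = 2*sqrt(2 ln 2)*sigma
    diff = cfs[None, :] - cfs[:, None]                     # CF_j - CF_i
    return np.exp(-diff ** 2 / (2.0 * sigma[:, None] ** 2))

# Broad tuning (Q = 0.85) vs. sharp tuning (Q = 23), using the 36 CFs defined earlier
# w_broad = cross_frequency_weights(cfs, 0.85)
# w_sharp = cross_frequency_weights(cfs, 23.0)
```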
Robustness to Frequency Tuning
We processed 20 target sentences, placed at 0° azimuth, with our model network for \( Q \) values ranging from 0.85 to 23. Performance of the model at each \( Q \) was evaluated by the intelligibility of the reconstructions relative to the targets alone, quantified by the STOI score (Table 1).
Table 1 Effect of Q on single-target reconstructions
Engineering Algorithms
Although our main goal here was to develop a physiologically inspired model, we were curious to compare the segregation performance of the PA with that of cutting-edge engineering algorithms. Some engineering algorithms, notably beamformers, rely on increasing the number of sound inputs with the number of sources (see the comparative discussion in Mandel et al. 2010), and/or rely on monaural features such as pitch (Krishnan et al. 2014). In contrast, the PA is a binaural algorithm requiring only two inputs (the left- and right-ear signals), as in the human auditory system, and does not use any additional information from monaural features. Thus, for a controlled comparison, we compared the segregation performance of the PA with that of two cutting-edge engineering algorithms that are essentially binaural: model-based expectation-maximization source separation and localization (MESSL), and a deep neural network (DNN) trained with binaural cues. All algorithms were evaluated with the same STOI metric.
MESSL
The MESSL algorithm from the Ellis group is freely available on GitHub (https://github.com/mim/messl) (Mandel et al. 2010). MESSL uses binaural cues to localize and separate multiple sound sources. Specifically, it uses a Gaussian mixture model to compute probabilistic spectrogram masks for sound segregation. To obtain the best possible performance, the correct number of sound sources in each scenario was provided to the algorithm as a parameter. MESSL also requires a parameter tau, an array of candidate source ITDs expressed in numbers of samples. For this parameter, we used ITDs of 0 to 800 μs and omitted the negative taps. MESSL does not require any training, but it does require careful selection of these parameters to optimize performance. We systematically searched over a range of tau values and selected those that yielded the highest STOI and PESQ scores for this study.
DNN
The DNN algorithm is also freely available online (http://web.cse.ohio-state.edu/pnl/DNN_toolbox/) (Wang et al. 2014). The DNN isolates a target sound from noisy backgrounds by learning a mapping between a set of sound "features" and an ideal spectrogram mask. For a controlled comparison with the other algorithms, we replaced the monaural features in the DNN algorithm with three binaural features: ITD, ILD, and the interaural cross-correlation coefficient (IACC). ITD was calculated by finding the peak location of the time-domain cross-correlation function, and IACC was the corresponding peak value. To be consistent with the features used in the DNN model reported by Jiang et al. (2014), 64 frequency channels were used, and the features were calculated for each time-frequency unit. We trained the DNN with sentences from the CRM corpus. A critical factor in optimizing DNN performance is the amount of training data used; we therefore trained the DNN with the number of training sentences needed for its performance to plateau in the two-masker scenario described above.
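An illustrative sketch of how the three binaural features might be computed for a single time-frequency unit is shown below; the exact windowing and normalization used in the toolbox are not specified in the text, and the circular shift used here is a simplification.

```python
import numpy as np

def binaural_features(left, right, fs, max_itd=1e-3):
    """Extract ITD, ILD, and IACC from one time-frequency unit (one channel, one frame).
    ITD is the lag of the normalized cross-correlation peak, IACC is the peak value,
    and ILD is the left/right energy ratio in dB."""
    max_lag = int(max_itd * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    denom = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + 1e-12
    # Normalized cross-correlation over the allowed lag range (circular shift for brevity)
    xcorr = np.array([np.sum(left * np.roll(right, lag)) for lag in lags]) / denom
    peak = np.argmax(xcorr)
    itd = lags[peak] / fs                                # seconds
    iacc = xcorr[peak]                                   # peak correlation value
    ild = 10.0 * np.log10((np.sum(left ** 2) + 1e-12) / (np.sum(right ** 2) + 1e-12))
    return itd, ild, iacc
```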
Stimuli for Comparing Algorithms
Sentences from the CRM corpus described above were used to form the speech stimuli for all three algorithms. The input sound mixtures had target-to-masker ratios (TMRs) of 0 dB, where TMR is defined as the sound level of the target relative to a single masker, regardless of the number of maskers present.
Scenarios for Comparing Algorithms
Two scenarios were simulated for all three algorithms. In the first scenario, a target stimulus was placed at 0° while two symmetrical maskers were varied between 0° and ±90°. In the second scenario, a target stimulus was placed at 0° while 2, 3, or 4 maskers were placed at ±45° and/or ±90° in all possible combinations. The STOI score was then calculated for each condition. For MESSL and the DNN, STOI was calculated by comparing the output of each simulation against each of the original individual stimuli.
Segregation Comparison
Here, we describe the measure of segregation performance in more detail. As mentioned previously, we quantify segregation performance as the difference between STOI scores computed using either the target or the masker as the reference waveform, or ΔSTOI. The output of the PA is noisy due to the stimulus reconstruction step, resulting in an artifact that is absent in MESSL and the DNN. The use of ΔSTOI allows comparison of true segregation ability, independent of reconstruction artifacts.
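As a concrete illustration, assuming a Python STOI implementation such as the pystoi package as a stand-in for the MATLAB functions used in the paper, ΔSTOI is simply:

```python
from pystoi import stoi   # assumed third-party STOI implementation

def delta_stoi(reconstruction, target, masker, fs):
    """Segregation performance: STOI with the target as reference
    minus STOI with the masker as reference."""
    return stoi(target, reconstruction, fs) - stoi(masker, reconstruction, fs)
```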
Algorithm Training
MESSL did not require any training. The amount of training needed for the DNN algorithm was determined experimentally using scenario 2. The DNN required training on targets in the presence of maskers, with roughly 100 sentence mixtures required to reach peak performance; this amount of training was used for the DNN in the simulations. Given the small number of sentences available in the CRM corpus, we recognize that we have most likely overfitted the DNN. However, we do not believe this to be an issue, because our aim was to give the DNN the best possible performance. The PA was trained as described in "Reconstruction Filter Training."