Quantifying Envelope and Fine-Structure Coding in Auditory Nerve Responses to Chimaeric Speech
Any sound can be separated mathematically into a slowly varying envelope and rapidly varying fine-structure component. This property has motivated numerous perceptual studies to understand the relative importance of each component for speech and music perception. Specialized acoustic stimuli, such as auditory chimaeras with the envelope of one sound and fine structure of another have been used to separate the perceptual roles for envelope and fine structure. Cochlear narrowband filtering limits the ability to isolate fine structure from envelope; however, envelope recovery from fine structure has been difficult to evaluate physiologically. To evaluate envelope recovery at the output of the cochlea, neural cross-correlation coefficients were developed that quantify the similarity between two sets of spike-train responses. Shuffled auto- and cross-correlogram analyses were used to compute separate correlations for responses to envelope and fine structure based on both model and recorded spike trains from auditory nerve fibers. Previous correlogram analyses were extended to isolate envelope coding more effectively in auditory nerve fibers with low center frequencies, which are particularly important for speech coding. Recovered speech envelopes were present in both model and recorded responses to one- and 16-band speech fine-structure chimaeras and were significantly greater for the one-band case, consistent with perceptual studies. Model predictions suggest that cochlear recovered envelopes are reduced following sensorineural hearing loss due to broadened tuning associated with outer-hair cell dysfunction. In addition to the within-fiber cross-stimulus cases considered here, these neural cross-correlation coefficients can also be used to evaluate spatiotemporal coding by applying them to cross-fiber within-stimulus conditions. 
Thus, these neural metrics can be used to quantitatively evaluate a wide range of perceptually significant temporal coding issues relevant to normal and impaired hearing.
Keywords: neural cross correlation, temporal coding, recovered envelopes, cochlear implants
Numerous perceptual studies have addressed fundamental questions about the relative contributions of the slowly varying envelope and rapidly varying fine-structure components of speech and music (Smith et al. 2002; Xu and Pfingst 2003; Zeng et al. 2005b). Envelope information is important for speech perception and supports robust speech identification in quiet when provided in as few as four frequency bands (Shannon et al. 1995). This finding has important implications for cochlear implants, which currently only provide envelope information over a small number (eight to 16) of electrodes and is consistent with the observation that many cochlear-implant patients understand speech remarkably well in quiet (Wilson et al. 1991). The relative roles of envelope and fine structure have recently been evaluated using specialized acoustic stimuli called auditory chimaeras, which have the envelope of one sound and the fine structure of another (Smith et al. 2002). Chimaeric speech constructed from two sentences is generally perceived as the sentence that provided envelope, whereas chimaeric music is perceived as the melody that contributed fine structure. The perceptual salience of acoustic fine structure for music perception and sound localization (Smith et al. 2002), lexical-tone perception (Xu and Pfingst 2003), and speech perception in noise (Qin and Oxenham 2003; Lorenzi et al. 2006) has been given as motivation for efforts to develop cochlear-implant strategies to provide fine structure in addition to envelope cues (e.g., Rubinstein et al. 1999; Nie et al. 2005).
Interpretation of perceptual studies that utilize auditory chimaeras relies on the assumption that envelope and fine structure can be isolated. However, signal-processing theorems state that the envelope and fine structure of band-limited signals are not independent, and information about one can be recovered mathematically from the other, e.g., envelope can be recovered from fine-structure by narrowband filtering (e.g., Voelcker 1966; Rice 1973; Logan 1977). Thus, narrowband cochlear filtering imposes constraints on the ability to isolate a sound’s fine structure from its envelope (Ghitza 2001, also see Saberi and Hafter 1995). Although envelope is clearly more salient than fine-structure for eight- and 16-band speech chimaeras, a reversal occurs for one- and two-band chimaeras for which fine structure supports robust speech recognition rather than envelope (Smith et al. 2002). Perceptual studies have suggested that recovered envelopes at the output of the cochlea may explain the reversal in these conditions for which the chimaeric analysis bands were much broader than cochlear filters (Zeng et al. 2004; Gilbert and Lorenzi 2006). However, these considerations were limited to perceptually based filter-bank models, which capture the basic effects of cochlear filtering but exclude many physiological factors (e.g., adaptation, phase-locking roll-off, two-tone suppression) that may affect envelope and fine-structure coding in neural responses to complex sounds.
The present study provides physiological evidence for the presence of recovered envelopes in auditory nerve (AN) responses to chimaeric speech. Neural cross-correlation coefficients were developed to quantify the similarity between envelope (or fine structure) components in two sets of spike-train responses. Auto- and cross-correlogram analyses were used to separate the contributions of envelope and fine structure (Joris 2003). These neural metrics can also be used to evaluate fundamental questions related to across-fiber temporal coding, which was recently hypothesized to be involved in the difficulties that hearing-impaired listeners have in understanding speech in complex acoustic backgrounds (Lorenzi et al. 2006; Moore 2008).
Auditory nerve model
Spike-train data from a computational model of AN responses (Zilany and Bruce 2006, 2007) were used to evaluate systematically the dependence of neural cross correlation on both stimulus parameters (e.g., number of chimaeric analysis bands) and AN-fiber parameters (e.g., characteristic frequency (CF), the frequency at which the fiber responds to the lowest sound level). The phenomenological AN model represents the most recent extension of a well-established model that has been rigorously tested against physiological AN responses to both simple and complex stimuli, including tones, broadband noise, and speech-like sounds (Carney 1993; Heinz et al. 2001a; Zhang et al. 2001; Bruce et al. 2003; Tan and Carney 2003; Zilany and Bruce 2006, 2007). Model threshold tuning curves have been well fit to the CF-dependent variation in bandwidth for normal-hearing cats (Miller et al. 1997), which is comparable to that of chinchillas (Shera et al. 2007; Temchin et al. 2008a, b). Many of the physiological properties associated with nonlinear cochlear tuning are captured by this model, including compression, suppression, and broadened tuning and best-frequency shifts with increases in sound level. The stochastic nature of AN responses is accounted for by a nonhomogeneous Poisson process that was modified to include the effects of both absolute and relative refractory periods. Although the Poisson-based model does not capture all of the detailed stochastic properties of AN responses (e.g., Heil et al. 2007), the major statistical properties relevant to this work are captured by this model (e.g., Young and Barta 1986). Although the Zilany and Bruce (2006, 2007) model was chosen for this study, the results presented here do not depend on this choice, and several other AN models exist that would be expected to produce similar results (reviewed by Lopez-Poveda 2005).
The AN-model input is the sound stimulus waveform, while the output is a set of spike times for a single AN fiber with a specified CF. All model simulations were for high-spontaneous-rate (50 spikes/s) fibers, for which this AN model was designed and tested. Similar results were obtained for both broadband noise and speech when the model was extended to simulate higher-threshold, low-spontaneous-rate fibers (not shown). Stimuli were re-sampled to 100 kHz prior to presentation to the model.
Surgical procedures and neurophysiological recording techniques
Several model predictions were verified by computing neural cross-correlation coefficients from spike trains recorded from 28 AN fibers during other experiments in the lab. All methods of animal care and use were approved by the Purdue Animal Care and Use Committee. Single-unit recordings were made from AN fibers from four adult chinchillas using standard techniques (e.g., Kiang et al. 1965; Heinz and Young 2004). All four chinchillas had AN-fiber thresholds within normal limits (i.e., lowest thresholds were 5–15 dB SPL). The animals weighed 400–600 g and were initially anesthetized with xylazine (1–1.5 mg/kg im) and ketamine (50–65 mg/kg im). Supplemental doses of fluids and barbiturate anesthesia (sodium pentobarbital, ~7.5 mg/kg/h iv) were given to maintain an areflexic state. Rectal temperature was maintained between 37°C and 38°C. A tracheotomy was performed to facilitate quiet breathing, and the bulla was vented with a polyethylene tube to equalize middle-ear pressure.
During the recordings, the animals were held in place with a stereotaxic apparatus within a double-walled, sound-attenuating chamber (Industrial Acoustics, Bronx, NY, USA). Sound was delivered monaurally through a custom closed-field acoustic system, with dynamic speakers (DT-48, Beyer Dynamic, Farmingdale, NY, USA) connected to a hollow ear bar inserted into the right ear canal to deliver calibrated acoustic stimuli near the tympanic membrane. The acoustic system was calibrated at the beginning of the experiment using a probe-tube microphone (ER-7C, Etymōtic, Elk Grove Village, IL, USA) placed within a few millimeters of the tympanic membrane. Single-unit recordings were made with 10–30 MΩ glass micropipettes filled with 3 M NaCl. The electrode signal was amplified (Dagan, Minneapolis, MN, USA) and filtered prior to timing (10-μs resolution) the action potentials based on a time–amplitude window discriminator (Bak Electronics, Mount Airy, MD, USA). Synchronous presentation of acoustic stimuli and data recording was controlled by custom software running in MATLAB (The Mathworks, Natick, MA, USA) integrated with custom and commercial hardware (Tucker-Davis Technologies, Alachua, FL, USA; National Instruments, Austin, TX, USA). A broadband noise search stimulus was used to isolate AN fibers. Fibers were characterized using an automated tuning-curve algorithm that was used to determine the fiber’s CF, threshold, and Q10 (Chintanpalli and Heinz 2007), as well as by CF-tone rate-level functions and PST histograms.
Two independently generated frozen noise waveforms (A and B) were used for testing basic properties of the neural cross-correlation metrics. Noises A and B were both Gaussian and broadband, with a 2-s duration for model data and a 1.7-s duration for AN-fiber data (10-ms rise-fall times in both cases). Chimaeric speech stimuli were created from the original speech utterance “A boy fell from the window,” which had a duration of 1.7 s (Shen et al. 2001). Chimaeric stimuli were created from this utterance and a spectrally matched broadband noise using the chimaerizing algorithm and MATLAB code developed by Smith and colleagues (Shen et al. 2001; Smith et al. 2002). A variable number of FIR band-pass filters were equally spaced on a cochlear frequency map and spanned the frequency range from 80 to 8820 Hz. Envelope and fine structure were separated from the output of each filter for each sound using a Hilbert transform. The envelope from the first sound was multiplied by the fine structure from the second sound within each filter, prior to adding the resulting individual band-pass signals to create the final chimaera. The FIR group delay resulting from the chimaeric filtering was manually compensated for prior to presentation of the chimaeric stimuli. The speech fine-structure chimaera consisted of a combination of the fine structure from the original speech token and the envelope from the noise, whereas the speech envelope chimaera was created from the opposite contributions.
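The envelope and fine-structure exchange within a single analysis band can be sketched as follows. This is a minimal single-band illustration in Python with NumPy, not the MATLAB code of Smith and colleagues: the function names `analytic_signal` and `single_band_chimaera` and the two test tones are hypothetical, and the band-pass filter bank and group-delay compensation described above are omitted.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal computed with the FFT (discrete Hilbert-transform method)."""
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)
    weights[0] = 1.0                       # keep the DC term
    if n % 2 == 0:
        weights[n // 2] = 1.0              # keep the Nyquist term
        weights[1:n // 2] = 2.0            # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def single_band_chimaera(env_source, tfs_source):
    """Hilbert envelope of env_source multiplied by the fine structure of tfs_source."""
    envelope = np.abs(analytic_signal(env_source))
    fine_structure = np.cos(np.angle(analytic_signal(tfs_source)))
    return envelope * fine_structure

# Hypothetical test signals (1 s at 8 kHz), chosen so the result is easy to check.
fs = 8000
t = np.arange(fs) / fs
modulator = 1.0 + 0.5 * np.cos(2 * np.pi * 4 * t)
sound_a = modulator * np.cos(2 * np.pi * 1000 * t)   # supplies the envelope
sound_b = np.cos(2 * np.pi * 500 * t)                # supplies the fine structure
chimaera = single_band_chimaera(sound_a, sound_b)
```

For these tones the chimaera is the 4-Hz modulator of the first sound imposed on the 500-Hz carrier of the second. A multi-band version would apply the same exchange to each filter output and sum the bands.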
For every stimulus condition, spike trains were obtained in response to the original stimulus (A+) and its polarity-inverted pair (A−). The polarity inversion introduces a 180° phase shift of all frequency components, thereby inverting the fine structure of the stimulus, while not affecting the stimulus envelope (Joris 2003). Model and AN-fiber responses were obtained for 16–25 repetitions of each stimulus, which was sufficient to collect roughly 3,500 spikes for each stimulus condition. All stimuli presented to AN fibers were 1.7 s in duration, and a new stimulus was presented every 2.5 s. All stimuli within a given set (e.g., noises A+, A−, B+, and B−) were presented in an interleaved manner until the desired number of repetitions was completed. For all analyses, spikes within the first 50 ms of the response were excluded to avoid onset effects.
For both types of stimuli, and for both model and AN-fiber data, sound levels were chosen for each AN fiber to maximize the number of spikes, while minimizing the degradation in envelope coding at high levels due to saturation (e.g., Joris and Yin 1992; Louage et al. 2004). Data from model simulations were collected typically at the best modulation level for each stimulus type (i.e., noise A, or the original speech token). Best modulation level was determined for each model fiber as the sound level that produced the maximum amount of envelope coding, as quantified by the sumcor peak height (defined below; also see Louage et al. 2004). The sound levels used in the neurophysiological experiments were chosen typically to be within the upper one third of the fiber’s dynamic range based on a measured rate-level function for each stimulus type. Although these criteria were slightly different, the sound levels relative to fiber threshold were only slightly higher for the recorded AN data than for the model data. Larger differences in absolute sound level between model simulations and experimental data (up to 30 dB) arose due to higher (~10 dB) AN thresholds in chinchillas relative to cats (Miller et al. 1997; Temchin et al. 2008b) and because model thresholds were designed to match the lowest AN fiber threshold within the span of AN thresholds (~40 dB) at each CF, rather than the mean (Miller et al. 1997; Zilany and Bruce 2006). Noise spectrum levels ranged from −27 to −7 dB for model data, and from 2 to 26 dB for recorded AN data. Overall speech levels were 35 dB SPL for model data (normal hearing) and ranged from 62 to 72 dB SPL for recorded AN data.
Envelope and fine-structure cross-correlation metrics computed from spike trains
The first two columns of Figure 1 illustrate the computations involved in quantifying within-fiber temporal coding for noises A and B. Shuffled autocorrelograms (SACs, thick lines in panels A and B) were computed from single sets of spike trains recorded from a chinchilla AN fiber (CF = 827 Hz) responding to repeated presentations of the same stimulus (Louage et al. 2004). The shuffling involved comparing spike times between all possible pairs of stimulus repetitions within a given set of spike trains, i.e., for N repetitions, there are N × (N − 1) pairs. For each pair, intervals between every spike in the first spike train and every spike in the second spike train were tallied. The SAC is the shuffled all-order interval histogram obtained by tallying all intervals from all pairs with a 50-μs binwidth.
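The tallying step described above can be sketched in a few lines of Python. This is a simplified illustration rather than the analysis code used in the study: the function name `shuffled_autocorrelogram`, the 5-ms delay range, and the toy spike trains are assumptions, and intervals are binned by rounding to the nearest 50-μs bin.

```python
def shuffled_autocorrelogram(spike_trains, binwidth=50e-6, max_delay=5e-3):
    """Tally all-order intervals across all ordered pairs of different repetitions.

    spike_trains: list of lists of spike times (s), one list per repetition.
    Returns (delays, counts), where counts are raw, unnormalized tallies.
    """
    n_bins = int(round(max_delay / binwidth))
    counts = [0] * (2 * n_bins + 1)
    for i, train_i in enumerate(spike_trains):
        for j, train_j in enumerate(spike_trains):
            if i == j:
                continue  # shuffling: a repetition is never compared with itself
            for t_i in train_i:
                for t_j in train_j:
                    k = int(round((t_j - t_i) / binwidth))
                    if -n_bins <= k <= n_bins:
                        counts[k + n_bins] += 1
    delays = [(k - n_bins) * binwidth for k in range(2 * n_bins + 1)]
    return delays, counts

# Toy data: three repetitions with identical spike times (perfect locking).
trains = [[0.010, 0.012], [0.010, 0.012], [0.010, 0.012]]
delays, counts = shuffled_autocorrelogram(trains)
```

With N = 3 repetitions there are N × (N − 1) = 6 ordered pairs. Each pair contributes two intervals at zero delay and one at each of ±2 ms, so the tallies are 12 at zero delay and 6 at each of ±2 ms.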
By computing inter-spike intervals across repetitions (rather than within), shuffled correlograms provide a more robust analysis of temporal responses than the all-order interval histograms that have often been used to estimate autocorrelation functions from neural responses to complex sounds (e.g., Ruggero 1973; Cariani and Delgutte 1996a, b). Temporal correlations at very small delays are accurately captured by shuffling because the deleterious effects of refractoriness are avoided. Also, shuffling produces significantly smoother estimates of correlation functions based on the same amount of data because the number of computed intervals is proportional to N², rather than N. Previous studies of the temporal coding of envelope and fine structure have often used periodic stimuli (e.g., Johnson 1980; Joris and Yin 1992). However, the shuffled correlogram analyses are applicable to any repeatable stimulus (Joris 2003; Louage et al. 2004), e.g., broadband noise or chimaeric speech.
SACs are typically normalized [by N(N − 1)r²ΔτD, where r is the average discharge rate, Δτ is the binwidth, and D is the duration of the response window] to allow a more intuitive interpretation of temporal coding. With this normalization, a baseline value of 1 represents the absence of any temporal correlation (envelope or fine structure). A value greater than 1 represents positive correlation, and a value less than 1 represents negative correlation. SACs are plotted as a function of delay (or inter-spike interval) and are therefore much like autocorrelation functions. As such, the SACs in Figures 1A, B have a peak at zero delay, with a symmetric damped oscillatory shape similar to an autocorrelation function for band-limited noise. The prominent oscillation occurs with a period equal to the reciprocal of CF, and the rate of decay of the side lobes is determined by the fiber’s bandwidth (Louage et al. 2004; Joris et al. 2008a). Because the statistical properties of the noise do not change with stimulus polarity, the SACs shown in Figure 1 are the average of the SACs for both polarities [e.g., SAC(A+) and SAC(A−)].
Joris and colleagues have demonstrated that the coding of fine structure and envelope can be separated by considering the neural responses to a stimulus and its polarity-inverted pair (Joris 2003; Louage et al. 2004; Joris et al. 2006a). The polarity inversion acts to invert the fine-structure of the stimulus, while not affecting the stimulus envelope. To facilitate the separation of envelope and fine-structure coding, Joris and colleagues computed what they referred to as a cross-stimulus autocorrelogram (XAC). They used the term auto to indicate the same fiber and the term cross-stimulus applied to the original stimulus and its polarity-inverted pair. To allow a general consideration of auto- and cross-correlation analyses, the term autocorrelogram was restricted in the present study to correlograms computed from a single set of spike trains, for which classic autocorrelation properties are expected (e.g., a symmetric shape with a peak at 0 delay). The term cross-correlogram was used for any correlogram computed across two separate sets of spike trains, for which classic autocorrelation properties are not expected (including XACs). Shuffled cross-correlograms [e.g., SCC(A,B)] were computed by comparing all possible repetition pairs (N_A × N_B) between two separate sets of spike trains in response to conditions A and B. Similar to the SAC computation, all-order intervals between every spike in the first spike train and every spike in the second spike train were tallied, with the SCC representing the complete all-order interval histogram based on all possible pairs (normalized by N_A N_B r_A r_B ΔτD). In this generalized notation, the XAC (Joris 2003; Louage et al. 2004) used to facilitate the separation of envelope and fine structure is a cross-polarity correlogram [e.g., SCC(A+,A−)]. Each thin line in Figures 1A, B is the average of two statistically identical cross-polarity correlograms [e.g., SCC(A+,A−) and SCC(A−,A+)].
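A normalized shuffled cross-correlogram along these lines might look like the following Python sketch. The function name `shuffled_cross_correlogram`, the 5-ms delay range, and the random test spike trains are assumptions made for illustration. The normalization is the product of the two repetition counts, the two average rates, the binwidth, and the response duration, as stated above.

```python
import random

def shuffled_cross_correlogram(trains_a, trains_b, duration,
                               binwidth=50e-6, max_delay=5e-3):
    """Normalized SCC between two sets of spike trains (times in seconds).

    Raw tallies are divided by N_A * N_B * r_A * r_B * binwidth * duration,
    so unrelated responses give values near 1.
    """
    n_bins = int(round(max_delay / binwidth))
    counts = [0] * (2 * n_bins + 1)
    for train_a in trains_a:              # all N_A x N_B repetition pairs
        for train_b in trains_b:
            for t_a in train_a:
                for t_b in train_b:
                    k = int(round((t_b - t_a) / binwidth))
                    if -n_bins <= k <= n_bins:
                        counts[k + n_bins] += 1
    n_a, n_b = len(trains_a), len(trains_b)
    rate_a = sum(len(t) for t in trains_a) / (n_a * duration)
    rate_b = sum(len(t) for t in trains_b) / (n_b * duration)
    norm = n_a * n_b * rate_a * rate_b * binwidth * duration
    return [c / norm for c in counts]

# Hypothetical unrelated responses: spike times drawn uniformly at random.
rng = random.Random(0)
duration = 1.0
trains_a = [sorted(rng.uniform(0, duration) for _ in range(50)) for _ in range(10)]
trains_b = [sorted(rng.uniform(0, duration) for _ in range(50)) for _ in range(10)]
scc = shuffled_cross_correlogram(trains_a, trains_b, duration)
mean_scc = sum(scc) / len(scc)            # expected to lie close to the baseline of 1
```

Passing the A+ and A− responses as the two sets would give the cross-polarity correlogram SCC(A+,A−) in the notation used here.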
To emphasize the contribution of fine structure to temporal coding, difcors (second row of Fig. 1) were computed as the difference between the SAC (original envelope, original fine structure) and the cross-polarity correlogram [e.g., SCC(A+,A−), with original envelope and inverted fine structure]. By taking the difference, the common envelope contributions are minimized, thus emphasizing the fine-structure coding. Temporal fine-structure coding within a single AN fiber was quantified as the peak height of the difcor (Joris 2003; Louage et al. 2004). Note that while difcors clearly emphasize fine-structure coding, the isolation from envelope coding is not perfect (e.g., the difcor of a low-level sinusoidally amplitude-modulated (SAM) tone contains small oscillations at the modulation frequency superimposed on the prominent oscillations at the carrier frequency). However, the influence of envelope coding on difcor estimates of fine-structure coding can be argued to be small based on the small effect of sound level on difcor peak heights. While AN-fiber envelope coding typically degrades significantly as sound level increases (Joris and Yin 1992; Louage et al. 2004), difcor peak heights are only slightly reduced (Louage et al. 2004). Some of this decrease in difcor peak height with level represents a true decrease in phase locking (Johnson 1980). Nonetheless, this small drop provides an upper limit on the contribution of envelope coding to difcor peak height. For the well-defined SAM-tone stimulus, a more precise measure of fine-structure coding would be the synchrony coefficient at the carrier frequency; however, this analysis requires knowledge of the stimulus. For more complex and nonperiodic stimuli, such as broadband noise or speech, the difcor provides a useful estimate of the neural coding of fine structure that does not require knowledge of the stimulus.
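The cancellation of the common envelope term can be illustrated with idealized correlograms. In the Python sketch below, the SAC is assumed to be a baseline of 1 plus an envelope term plus a fine-structure term, and the cross-polarity correlogram is assumed to be identical except that the fine-structure term changes sign. This decomposition, the 1-kHz CF, and the amplitudes are illustrative assumptions, not recorded data.

```python
import math

def difcor(sac, cross_polarity):
    """Bin-by-bin difference between the SAC and the cross-polarity correlogram."""
    return [s - x for s, x in zip(sac, cross_polarity)]

binwidth = 50e-6
cf = 1000.0                                   # hypothetical fiber CF (Hz)
delays = [k * binwidth for k in range(-100, 101)]

# Idealized components: a slow envelope term and a damped oscillation at CF.
env = [0.3 * math.exp(-abs(d) / 2e-3) for d in delays]
tfs = [0.8 * math.exp(-abs(d) / 2e-3) * math.cos(2 * math.pi * cf * d)
       for d in delays]

sac = [1 + e + f for e, f in zip(env, tfs)]   # original envelope, original fine structure
xpc = [1 + e - f for e, f in zip(env, tfs)]   # original envelope, inverted fine structure

dif = difcor(sac, xpc)
peak_height = max(dif)                        # fine-structure coding estimate
```

In this idealized case the envelope term cancels exactly and the difcor equals twice the fine-structure term, so its peak sits at zero delay. Real responses depart from this, which is why the isolation from envelope coding is described above as imperfect.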
To emphasize envelope coding, sumcors were computed as the average of the SAC and the cross-polarity correlogram [e.g., SCC(A+,A−)]. By taking the average, the common contribution of envelope coding was emphasized, and the contribution of fine-structure coding was minimized. Joris and colleagues quantified temporal envelope coding within a single fiber as the peak height of the sumcor (Joris 2003; Louage et al. 2004). Their quantification works well at high CFs, where fine structure is minimal and envelope is the primary source of temporal coding. However, at low CFs, responses to broadband stimuli have significant fine structure and envelope contributions, and the fine-structure contributions do not always cancel out completely in the sumcor (thin lines, row 3 of Fig. 1; also see Fig. 1D in Joris 2003). This leakage of fine structure into the sumcor reflects distortion that arises from rectification associated with neural responses. Thus, sumcors for low-CF fibers have undesirable high-frequency oscillations that are not associated with the slow envelope response. The fine-structure leakage was more easily isolated in the spectral domain, where it was apparent in the Fourier transform of the sumcor as a high-frequency spectral peak centered at 2 × CF (row 4, Fig. 1). The Fourier transform of the sumcor was taken using a 25-ms rectangular window centered on zero delay, with the sumcor baseline value of 1 subtracted prior to computing the Fourier transform. To correct for the triangular shape imposed on shuffled correlograms by the limited stimulus duration, an inverted triangular compensation ranging from 1 to 0 was added to the SACs and SCCs prior to computation of the sumcors.
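The spectral view of fine-structure leakage can be sketched with a synthetic sumcor. The Python example below (NumPy assumed) builds a sumcor from a baseline of 1, a slow envelope oscillation, and a small component at 2 × CF, then applies the steps described above: subtract the baseline, take a 25-ms rectangular window centered on zero delay, and compute the Fourier transform. The 800-Hz CF, the 80-Hz envelope rate, and the amplitudes are hypothetical, and the correction applied to remove the leakage is not shown.

```python
import numpy as np

binwidth = 50e-6
cf = 800.0                                    # hypothetical low-CF fiber (Hz)
n = 500                                       # 500 bins x 50 us = 25-ms window
delays = (np.arange(n) - n // 2) * binwidth   # window centered on zero delay

# Synthetic sumcor: baseline + slow envelope term + leakage at 2 x CF.
sumcor = (1.0
          + 0.5 * np.cos(2 * np.pi * 80.0 * delays)
          + 0.1 * np.cos(2 * np.pi * 2 * cf * delays))

# Subtract the baseline of 1 before transforming (rectangular window).
spectrum = np.abs(np.fft.rfft(sumcor - 1.0))
freqs = np.fft.rfftfreq(n, d=binwidth)

# The largest component above CF marks the fine-structure leakage.
above_cf = freqs > cf
leak_freq = freqs[above_cf][np.argmax(spectrum[above_cf])]
env_freq = freqs[np.argmax(spectrum)]
```

Here the dominant spectral peak is the 80-Hz envelope term, and the largest component above CF falls at 1600 Hz, i.e., 2 × CF.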
The third column of Figure 1 illustrates the use of shuffled cross-correlograms to evaluate the similarity between two recorded spike trains from one AN fiber responding to two different stimuli (uncorrelated noises A and B); however, the same analyses are applicable to the condition of two different fibers responding to the same stimulus. From the spike-train data collected for A+, A−, B+, and B−, two statistically identical SCCs [i.e., SCC(A+,B+) and SCC(A−,B−)] were computed and averaged to obtain the cross-stimulus correlogram (thick line in Fig. 1C). Cross-stimulus, cross-polarity correlograms [e.g., SCC(A+,B−)] were computed to facilitate the separation of fine-structure and envelope cross-correlations by using difcors and sumcors, respectively. The thin line in Figure 1C represents the average of SCC(A+,B−) and SCC(A−,B+). The flat cross-stimulus correlograms (Figs. 1C, F, and I) indicate that there were very few common temporal fluctuations between the responses to these two uncorrelated noises.
More generally, shuffled correlograms quantify cross-correlation as a function of delay and demonstrate a peak at the characteristic delay (CD) between the two responses. Nonzero characteristic delays can arise both for within-fiber, cross-stimulus conditions [e.g., due to a delay or phase shift between two stimuli, or a change in stimulus level (Joris et al. 2008b)] and for cross-fiber, within-stimulus conditions [e.g., due to a traveling wave delay between different CFs (Joris et al. 2006b)]. Thus, the CDs in SCCs provide relative phase information across conditions and therefore avoid a commonly cited limitation of SAC analysis, i.e., the elimination of phase information.
A limitation of these stationary cross-correlation analyses for the present study (and similar perceptual approaches, e.g., Sheft et al. 2008) is that they only estimate the overall strength of recovered envelopes averaged across time-varying speech stimuli, i.e., they preclude the analysis of temporal variations in recovered envelopes that may occur. Short-time analyses could be developed to explore the nonstationarity of recovered envelopes, but such analyses would be limited by the large amount of data required to provide robust estimates of auto- and cross-correlations within short temporal windows. Alternative approaches could be explored, such as the compound PST histogram (e.g., Pfeiffer and Kim 1972), which is similar to the difcor in that it combines responses to stimuli of opposite polarity and could be extended to estimate envelope in addition to fine-structure coding. However, the reliability and noise floor of such an approach would need to be compared to the present shuffled-correlogram-based approach, which benefits from efficient smoothing provided by the shuffling procedure.
Spike trains were obtained from 28 chinchilla AN fibers as well as from a computational AN model to test the robustness of the neural cross-correlation metrics across a variety of conditions. The dynamic range of the neural cross-correlation coefficients was tested first in several conditions involving broadband noise. The neural cross-correlation metrics were used to provide physiological evidence for the recovery of speech envelope cues from speech fine structure at the output of the cochlea. Model responses were used to evaluate systematically the dependence of temporal envelope recovery on the number of analysis bands used to generate chimaeric speech and to predict the effects of sensorineural hearing loss (SNHL) on envelope recovery. In all cases, separate cross-correlation coefficients were computed for temporal envelope and fine-structure responses.
Separating envelope and fine-structure cross-correlations in broadband noise responses
Broadband noise does not have a particularly interesting envelope or fine-structure pattern; however, when passed through a narrowband cochlear filter, well-defined temporal envelope and fine-structure waveforms are created that depend on the characteristics of the filter (Ruggero 1973; Ghitza 2001; Joris 2003; Louage et al. 2004). Here, broadband noise was used as a stimulus for the development of neural cross-correlation coefficients for envelope and fine structure. Figure 1 illustrates the steps involved in computing these metrics using two sets of spike trains recorded from a single chinchilla AN fiber responding to two uncorrelated broadband noises (see “Methods” section). The neural cross-correlation coefficients (ρ_TFS and ρ_ENV) were computed as the amount of common temporal coding between the two responses (column 3) relative to the degree of temporal coding within each response individually (columns 1 and 2). The neural cross-correlation coefficients for fine structure (ρ_TFS = −0.03), computed from the difcors (Eq. 1), and for envelope (ρ_ENV = 0.01), computed from the corrected sumcors (Eq. 2), were both very close to zero for these responses to two uncorrelated noises.
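Eqs. 1 and 2 are not reproduced in this text. One form consistent with the verbal description (common coding relative to the coding within each response) is a geometric-mean normalization of peak heights, sketched below in Python. The exact form, the function names `rho_tfs` and `rho_env`, and the example peak heights are assumptions. The envelope version subtracts the sumcor baseline of 1 so that an absence of correlation maps to 0.

```python
import math

def rho_tfs(difcor_ab, difcor_a, difcor_b):
    """Fine-structure coefficient from difcor peak heights (assumed form)."""
    return difcor_ab / math.sqrt(difcor_a * difcor_b)

def rho_env(sumcor_ab, sumcor_a, sumcor_b):
    """Envelope coefficient from sumcor peak heights, baseline of 1 removed (assumed form)."""
    return (sumcor_ab - 1.0) / math.sqrt((sumcor_a - 1.0) * (sumcor_b - 1.0))

# Hypothetical peak heights, not values from the recordings.
identical_tfs = rho_tfs(2.0, 2.0, 2.0)      # same response on both sides
identical_env = rho_env(1.4, 1.4, 1.4)
unrelated_env = rho_env(1.0, 1.4, 1.6)      # cross-sumcor stays at the baseline
```

With this form, identical responses give coefficients of 1 and a cross-correlogram that stays at baseline gives 0. Because the denominator shrinks as within-response coding weakens, the ratio becomes unreliable when there is little temporal coding to begin with.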
Cross-stimulus correlations in chimaeric speech responses
The recovery of speech-envelope cues in cochlear responses to speech fine-structure stimuli has been predicted using perceptually based models and a general cross-correlation framework (Zeng et al. 2004; Gilbert and Lorenzi 2006; Sheft et al. 2008). In this framework, the presence of recovered speech-envelope cues was indicated by a large cross-correlation between the envelope responses (extracted from the model output waveforms) to the original speech and to the corresponding chimaeric stimulus made up of the fine-structure from speech and the envelope from a spectrally matched noise. The neural cross-correlation coefficients developed here were applied in the same general framework to provide a physiologically based evaluation of recovered envelope cues in both recorded and model AN spike trains.
The physiological model predictions suggest that recovered envelopes are greater at low CFs than at high CFs for both the one- and 16-band conditions. This result is consistent with recent perceptual modeling predictions for 16-band TFS speech and has been suggested to result from narrower cochlear filters at low CFs or from the presence of fundamental-frequency information at low CFs (Sheft et al. 2008). However, perceptual modeling predictions for one-band TFS speech showed recovered envelopes were small at low CFs and maximal around 1 kHz (Gilbert and Lorenzi 2006), which is inconsistent with the physiological results (Figs. 6 and 7). It was noted that the peak in recovered envelopes for the one-band condition corresponded with the maximum acoustic energy being near 1 kHz for the set of VCV stimuli used in the perceptual study; however, this correspondence did not exist for the 16-band condition (Gilbert and Lorenzi 2006). The sentence used in the present study had maximum spectral energy near 550 Hz.
Model predictions of fine-structure coding (Fig. 7A) were also consistent with the recorded AN data (Fig. 6A). Fine-structure coding was similar between both fine-structure chimaeras for low CFs, with ρ_TFS dropping for the one-band case at higher CFs. The other “cross-over” condition (i.e., speech-fine-structure coding for the speech-envelope chimaera, Fig. 7B) generally showed minimal fine-structure coding at all CFs for both chimaeras. Speech envelope coding was robust for the 16-band speech envelope chimaera at all CFs and for the one-band chimaera for CFs up to 800 Hz, above which point ρ_ENV decreased (Fig. 7D). Thus, auditory chimaeras were predicted to be effective in isolating speech envelope cues in AN responses to both one- and 16-band chimaeras, whereas the isolation of fine-structure cues was difficult to achieve, particularly in the one-band case.
The impaired predictions shown in Figure 9 represent a 30-dB threshold elevation due to either selective OHC or IHC damage. Overall sound level was increased by 25 dB for the impaired conditions in order to match the best-modulation level of the impaired 550-Hz CF model fibers. Both neural envelope (ρ_ENV) and fine-structure (ρ_TFS) coding were predicted to be degraded in the case of OHC damage (open triangles), but not in the case of IHC damage (open circles) relative to the normal-hearing predictions (filled squares). The degree of degradation was larger for recovered envelopes (for the 1–4 band conditions) than for fine-structure coding, representing a ~38% decrease in envelope coding compared to a 23% decrease in fine-structure coding. The lack of a predicted degradation for the 30-dB IHC loss provided a useful control to suggest that threshold shift alone does not account for these predicted degradations. Rather, the reduction in cochlear nonlinearity that occurs for OHC damage and not for IHC damage is likely to be the basis for the predicted degradations in recovered envelope cues and fine-structure coding of speech fine-structure chimaeras.
The neural cross-correlation coefficients developed in this study have broad applications to studying temporal coding in that they provide a general framework for computing the similarity between either envelope or fine-structure components of two sets of spike-train responses. These metrics have a wide dynamic range in both within-fiber and across-fiber conditions, ranging from near 0 for uncorrelated conditions to near 1 for correlated conditions. Although this study focused on temporal coding at the output of the cochlea, the neural cross-correlation coefficients are generally applicable to auditory spike-train responses from any location within the auditory pathway. More generally, these analyses may be useful for studying the perceptual relevance and neural coding of stimulus information across different time scales in various sensory modalities (e.g., Gamzu and Ahissar 2001; Lu et al. 2001; Vickers et al. 2001; van Boxtel et al. 2006).
A normalized representation of neural cross-correlation
The neural cross-correlation metrics developed here represent an extension of shuffled auto- and cross-correlogram analyses recently developed by Joris and colleagues (Joris 2003; Louage et al. 2004, 2006; Joris et al. 2006a, b). These neural metrics provide normalized representations of correlated temporal coding computed as the degree of common temporal fluctuations in two spike trains (cross-correlograms) relative to the degree of temporal coding within each spike-train response individually (autocorrelograms). Although this normalization is beneficial for a similarity metric, it eliminates the overall degree of temporal coding and can produce misleading results if used when there is very little baseline temporal coding (e.g., ρ TFS was not computed in Fig. 3 for fiber CFs > 5 kHz due to the rolloff in phase locking).
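The normalization described above can be illustrated with a minimal sketch. It assumes that each coefficient is formed as the above-baseline cross-correlogram peak divided by the geometric mean of the two above-baseline autocorrelogram peaks; this section does not restate the exact formula, so that form is an assumption here. The function name, the `baseline` argument (assumed to be 1 for a normalized sumcor and 0 for a difcor), and the `min_coding` floor are hypothetical and serve only to show why the metric becomes unreliable when baseline temporal coding is weak.

```python
import math


def neural_cross_corr_coef(cross_peak, auto_peak_a, auto_peak_b,
                           baseline, min_coding=0.1):
    """Normalize a cross-correlogram peak by two autocorrelogram peaks.

    baseline   -- correlogram value expected with no temporal coding
                  (assumed: 1.0 for a normalized sumcor, 0.0 for a difcor).
    min_coding -- hypothetical floor on the above-baseline auto peaks;
                  below it the ratio is dominated by noise, so None is
                  returned instead of a coefficient.
    """
    coding_a = auto_peak_a - baseline
    coding_b = auto_peak_b - baseline
    if coding_a < min_coding or coding_b < min_coding:
        return None  # too little baseline temporal coding to normalize
    return (cross_peak - baseline) / math.sqrt(coding_a * coding_b)


# Envelope-style example with hypothetical sumcor peak heights.
rho_env = neural_cross_corr_coef(1.5, 2.0, 1.25, baseline=1.0)
# Fine-structure-style example with hypothetical difcor peak heights.
rho_tfs = neural_cross_corr_coef(0.0, 2.0, 2.0, baseline=0.0)
```

Returning `None` rather than a number is one possible way to mirror the choice not to compute ρ TFS where phase locking has rolled off; the threshold value itself is arbitrary.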
The most significant benefit of this within-fiber normalization is reduced variability in quantifying cross-correlation based on neural responses. The degree of envelope and fine-structure coding (sumcor and difcor peak heights) varies greatly across neurons (e.g., different CFs) and across stimulus conditions (e.g., sound levels; Louage et al. 2004). A population study that quantified correlation based solely on cross-correlogram peak heights (i.e., without normalization) would likely have too much variability to accurately quantify the small correlations identified with the neural cross-correlation coefficients.
Quantifying envelope coding in low-CF neural responses
An extension of previous correlogram analyses (Joris 2003; Louage et al. 2004) was also needed to improve quantification of envelope coding of chimaeric speech for low CFs. Sumcors, which nominally represent envelope coding as the common temporal aspects of responses to a stimulus and its polarity-inverted pair, do not perfectly eliminate fine structure (Fig. 1G) at the low frequencies of primary interest for speech. A more accurate isolation of envelope information was obtained by eliminating fine-structure contributions from a spectral representation of the sumcor (Fig. 2). The Fourier transform of the sumcor (Figs. 1J–L) estimates the envelope spectral density (or cross-spectral density), which could in fact be analyzed in more detail if certain envelope frequencies were of particular interest.
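The spectral isolation of envelope information can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code used in the study: it assumes a sumcor sampled at uniform lags with a baseline of 1, and it removes fine-structure leakage by zeroing all spectral components above a cutoff frequency. The function name and the `cutoff_hz` parameter are hypothetical, and the choice of cutoff (for example, the fiber CF) is an assumption.

```python
import numpy as np


def envelope_only_sumcor(sumcor, bin_width_s, cutoff_hz):
    """Zero the spectral components of a sumcor above cutoff_hz.

    sumcor      -- correlogram values at uniformly spaced lags
                   (assumed baseline of 1 for uncorrelated firing).
    bin_width_s -- lag spacing in seconds.
    cutoff_hz   -- hypothetical upper limit for envelope frequencies.
    """
    sumcor = np.asarray(sumcor, dtype=float)
    n = sumcor.size
    # Spectrum of the above-baseline part of the sumcor.
    spectrum = np.fft.rfft(sumcor - 1.0)
    freqs = np.fft.rfftfreq(n, d=bin_width_s)
    # Discard components attributed to fine-structure leakage.
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=n) + 1.0


# Synthetic example: a 100-Hz envelope term plus 1-kHz leakage.
dt = 5e-5
lags = np.arange(400) * dt
envelope_part = 1.0 + 0.5 * np.cos(2 * np.pi * 100.0 * lags)
leaky_sumcor = envelope_part + 0.2 * np.cos(2 * np.pi * 1000.0 * lags)
cleaned = envelope_only_sumcor(leaky_sumcor, dt, cutoff_hz=500.0)
```

The intermediate `spectrum` array is the quantity that could be inspected directly if particular envelope frequencies were of interest.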
Envelope coding in the AN model
Although current AN models capture the important qualitative trends in responses to amplitude-modulated sounds, they quantitatively underestimate the degree of envelope coding (Nelson and Carney 2004). The underestimation is due to limitations of synapse adaptation in the model (Nelson and Carney 2004; Zhang and Carney 2005). Most computational AN models of this type would likely have the same limitation (Smith and Brachman 1980; Hewitt and Meddis 1991). This limitation can be seen in high-CF model sumcor peak heights (Fig. 2B), which were lower than corresponding values from AN data (e.g., Fig. 2A; also see Fig. 15A in Louage et al. 2004). However, qualitative trends in model sumcors matched very well with those from AN data. The effect of underestimated envelope coding was likely minimized by the normalized nature of the neural cross-correlation coefficients, as supported by similar findings from model and recorded responses (Figs. 3, 6, 7). Nonetheless, limitations in model predictions further motivate the usefulness of developing neural cross-correlation coefficients that can be applied directly to spike trains recorded from normal-hearing or hearing-impaired animals.
Quantifying envelope recovery at the output of the cochlea
The present results based on ρ ENV provide physiological evidence for recovered envelopes in AN responses to speech-noise chimaeras (with noisy true envelope cues created by cochlear filtering). Model predictions confirmed that recovered envelopes also exist for TFS speech (without true envelope cues) and speech–speech chimaeras (with meaningful true envelope cues prior to cochlear filtering). Thus, salient physiological recovered envelopes can exist in a variety of conditions with different types of true-envelope cues (Fig. 10).
The existence of physiological recovered envelopes is consistent with perceptual studies demonstrating intelligible recovered envelope cues at the output of gammatone filterbank models (Zeng et al. 2004; Gilbert and Lorenzi 2006). The finding that physiological recovered envelopes were larger for one-band than for 16-band chimaeras, but were not completely eliminated for eight- or 16-band conditions, is also consistent with perceptual results (Gilbert and Lorenzi 2006). Although generally consistent, some important differences and caveats must be considered, since these perceptual results were interpreted as suggesting that recovered envelopes were “essentially abolished” for eight or more analysis bands (Gilbert and Lorenzi 2006; Lorenzi et al. 2006). The model of cat AN tuning likely underestimates envelope recovery in humans, for which tuning was estimated to be two to three times sharper than in cats (Shera et al. 2002; but see Ruggero and Temchin 2005). Also, the prominence of physiological recovered envelopes at low CFs (below ~500 Hz) was consistent with perceptual modeling for 16-band TFS speech (below ~340 Hz; Sheft et al. 2008), but not for one-band conditions (Gilbert and Lorenzi 2006). The exact CF limit for recovered envelopes likely depends on both the species and specific stimuli used; however, it is unclear why the CF dependence for 1- and 16-band conditions would differ in perceptual predictions and not in physiological predictions.
More recent perceptual modeling has confirmed that it is extremely difficult to completely eliminate recovered envelopes even with complex signal processing schemes (Sheft et al. 2008). However, the lack of a significant correlation between the degree of predicted recovered envelopes and measured speech identification across a variety of conditions was taken by Sheft et al. as evidence that recovered envelopes do not contribute substantially to TFS-speech perception. High-pass filtered (at 340 Hz) TFS speech eliminated the predicted prominent recovered envelopes in the fundamental-frequency region. Equivalent identification performance and phonetic-feature reception were obtained for high-pass and unfiltered TFS speech; however, physiological predictions suggest that significant recovered envelopes exist above 340 Hz. Identification for 32-band TFS speech was lower than for 16 bands, which was inconsistent with higher predicted recovered envelopes for 32-band than for 16-band conditions for many CFs below 1,000 Hz. However, predicted physiological recovered envelopes for 32-band chimaeras (not shown) were not different from those for 16-band conditions (Fig. 7). Further evidence that TFS speech perception is not solely based on recovered envelopes was provided by Gilbert et al. (2007), who demonstrated a larger effect of modulation masking on 16-band envelope speech than on 16-band TFS speech. Although these studies suggest recovered envelopes are not the basis for TFS-speech perception, the discrepancies and caveats discussed suggest a better integration of physiological metrics with perceptual studies would be beneficial.
Implications for the effects of sensorineural hearing loss on TFS cues
Recent demonstrations that listeners with SNHL have a reduced ability to use TFS cues (Lorenzi et al. 2006; Hopkins et al. 2008; Moore 2008) have motivated the idea that novel hearing-aid amplification strategies should be developed to enhance TFS coding. Given that acoustic TFS produces two types of neural cues (Fig. 10), it is important to consider the effects of SNHL on recovered envelope cues as well as true TFS cues. Outer-hair-cell damage, and associated loss of cochlear nonlinearities, was predicted to degrade recovered envelope cues (Fig. 9), which if perceptually relevant could contribute to an acoustic-TFS deficit. Recent perceptual studies have suggested that reduced frequency selectivity is not the only cause of degraded TFS processing. Listeners with high-frequency SNHL, but with normal low-frequency thresholds (and assumed normal low-CF frequency selectivity), were unable to identify low-pass filtered TFS speech (Lorenzi et al. 2009). Also, listeners with SNHL and only mildly degraded low-CF frequency selectivity had significant deficits in TFS processing for binaural pitch perception (Santurette and Dau 2007). However, the physiological bases for these perceptual deficits remain unknown.
There are several physiological mechanisms hypothesized to underlie the perceptual TFS deficit in SNHL listeners (Moore 2008). Conflicting evidence exists as to whether within-fiber encoding of fine-structure (i.e., phase locking) is degraded following SNHL (Harrison and Evans 1979; Woolf et al. 1981; Miller et al. 1997). Alternatively, a significant effect of SNHL on fine-structure coding may occur in terms of across-fiber encoding. The neural metric ρ TFS provides an intuitive representation of across-fiber fine-structure coding. Across-fiber correlation decreases as the CF separation increases between two AN fibers. Broader tuning from SNHL was predicted to increase the range of CF separations over which correlated activity existed (Heinz and Swaminathan 2008). This degradation would functionally reduce the number of independent information channels in the AN-fiber population. Broader tuning was also predicted to reduce the traveling-wave delay between different CFs, which was quantified as the characteristic delay in across-fiber cross-correlograms. Thus, SNHL is predicted to degrade normal spatiotemporal response patterns, which have been hypothesized to provide robust neural cues for a range of perceptual phenomena, including speech, pitch, and intensity perception, and tone detection in noise (Loeb et al. 1983; Shamma 1985; Heinz et al. 2001b; Carney et al. 2002; Cedolin and Delgutte 2007; Heinz 2007).
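As an illustration of how a characteristic delay might be read from an across-fiber cross-correlogram, the following minimal sketch histograms all spike-time differences between two sets of spike trains and takes the lag of the largest bin. It is not the analysis code used in the study: the function names are hypothetical, the correlogram is left as raw counts (the normalization applied in the correlogram literature is omitted), and taking the single largest bin as the characteristic delay is a simplifying assumption.

```python
from collections import Counter


def cross_correlogram_counts(trains_a, trains_b, bin_width_s, max_lag_s):
    """Count spike-time differences (t_b - t_a) over all train pairs.

    trains_a, trains_b -- lists of spike-time lists (seconds), one list
                          per stimulus repetition, for fibers A and B.
    Returns a Counter mapping lag-bin index to count (unnormalized).
    """
    counts = Counter()
    max_bin = int(round(max_lag_s / bin_width_s))
    for train_a in trains_a:
        for train_b in trains_b:
            for t_a in train_a:
                for t_b in train_b:
                    k = int(round((t_b - t_a) / bin_width_s))
                    if abs(k) <= max_bin:
                        counts[k] += 1
    return counts


def characteristic_delay(counts, bin_width_s):
    """Lag (s) of the largest bin; positive means fiber B lags fiber A."""
    peak_bin = max(counts, key=counts.get)
    return peak_bin * bin_width_s


# Synthetic example: fiber B fires 1 ms after fiber A on every spike.
fiber_a = [[0.010, 0.020, 0.030], [0.010, 0.020, 0.030]]
fiber_b = [[t + 0.001 for t in train] for train in fiber_a]
xc = cross_correlogram_counts(fiber_a, fiber_b, 1e-4, 5e-3)
delay_s = characteristic_delay(xc, 1e-4)
```

Under this sign convention, a reduction in traveling-wave delay between two CFs would appear as a peak lag closer to zero.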
Cochlear implants and other applications
In addition to addressing fundamental neural coding questions about normal and impaired hearing, the neural cross-correlation metrics have direct applications in a number of other areas. Recent perceptual findings suggesting an important role for fine structure have led to an effort to provide fine-structure information to cochlear-implant patients, in addition to envelope information currently supplied (Rubinstein et al. 1999; Litvak et al. 2001; Nie et al. 2005). Neural cross-correlation coefficients are useful metrics for evaluating novel cochlear-implant stimulation (or hearing-aid amplification) strategies because they provide a quantitative physiological framework to test the ability to deliver complex-stimulus-related envelope and/or fine-structure information to the AN. Likewise, audio-coding strategies designed to compress the representation of sound without affecting perception could be evaluated in a physiological framework using neural cross-correlation coefficients. Depending on the acoustic material to be compressed (e.g., speech or music), a varying emphasis on envelope or fine-structure coding could be applied based on neural responses. Thus, the ability of neural cross-correlation coefficients to quantify recovery of stimulus-related temporal cues can be applied in cases for which recovery is undesirable (e.g., psychoacoustic stimuli to isolate fine structure or envelope) and in cases where recovery is desirable (e.g., cochlear implants or audio coding).
This research was supported by the National Institutes of Health (NIH)/National Institute on Deafness and Other Communication Disorders (NIDCD) grant R03DC07348 and by Weinberg funds from the Department of Speech, Language, and Hearing Sciences at Purdue University. Sushrut Kale helped with the collection of the neurophysiological data. The authors thank Laurel Carney and Edward Bartlett for helpful comments on a previous version of the manuscript. Also, we gratefully acknowledge the thorough and very helpful reviews from Associate Editor Andrew Oxenham, Christian Lorenzi, and two other anonymous reviewers.
- Carney LH, Heinz MG, Evilsizer ME, Gilkey RH, Colburn HS. Auditory phase opponency: a temporal model for masked detection at low frequencies. Acustica-Acta Acustica. 88:334–347, 2002.
- Cedolin L, Delgutte B. Spatio–temporal representation of the pitch of complex tones in the auditory nerve. In: Kollmeier B, Klump G, Hohmann V, Langemann U, Mauermann M, Uppenkamp S and Verhey J (eds) Hearing—From Sensory Processing to Perception. Springer-Verlag, Berlin, pp. 61–70, 2007.
- Heinz MG. Spatiotemporal encoding of vowels in noise studied with the responses of individual auditory nerve fibers. In: Kollmeier B, Klump G, Hohmann V, Langemann U, Mauermann M, Uppenkamp S and Verhey J (eds) Hearing—From Sensory Processing to Perception. Springer-Verlag, Berlin, pp. 107–115, 2007.
- Kiang NYS, Watanabe T, Thomas EC, Clark LF. Discharge patterns of single fibers in the cat’s auditory nerve. Cambridge, MA, MIT, 1965.
- Logan BF, Jr. Information in the zero crossings of bandpass signals. Bell Syst. Tech. J. 56:487–510, 1977.
- Rice SO. Distortion produced by band limitation of an FM wave. Bell Syst. Tech. J. 52:605–626, 1973.
- Shen C, Smith ZM, Oxenham AJ, Delgutte B. Auditory Chimera Demo. http://research.meei.harvard.edu/Chimera/, 2001.
- Shera CA, Guinan JJ, Oxenham AJ. Otoacoustic estimates of cochlear tuning: validation in the chinchilla. Assoc Res Otolaryngol Abs 30:519, 2007.