The Role of Temporal Fine Structure Processing in Pitch Perception, Masking, and Speech Perception for Normal-Hearing and Hearing-Impaired People
Complex broadband sounds are decomposed by the auditory filters into a series of relatively narrowband signals, each of which can be considered as a slowly varying envelope (E) superimposed on a more rapid temporal fine structure (TFS). Both E and TFS information are represented in the timing of neural discharges, although TFS information as defined here depends on phase locking to individual cycles of the stimulus waveform. This paper reviews the role played by TFS in masking, pitch perception, and speech perception and concludes that cues derived from TFS play an important role for all three. TFS may be especially important for the ability to “listen in the dips” of fluctuating background sounds when detecting nonspeech and speech signals. Evidence is reviewed suggesting that cochlear hearing loss reduces the ability to use TFS cues. The perceptual consequences of this, and reasons why it may happen, are discussed.
Keywords: temporal processing, temporal fine structure, phase locking, cochlear hearing loss, speech perception, pitch perception
When a complex broadband sound is analyzed in the cochlea of a normal ear, the result is a series of bandpass-filtered signals, each corresponding to one position on the basilar membrane. This aspect of auditory analysis is often modeled (crudely) by short-term Fourier analysis, which expresses the signal in terms of the magnitude and phase of its spectral components. Traditionally, the spectral magnitudes have been regarded as of primary importance for perception, although under some conditions, the phases of the components play an important role (Moore 2002).
Both E and TFS information are represented in the timing of neural discharges, although TFS information depends on phase locking to individual cycles of the stimulus waveform (Young and Sachs 1979). In most mammals, phase locking weakens for frequencies above 4–5 kHz, although some useful phase locking information may persist for frequencies up to at least 10 kHz (Heinz et al. 2001). The upper limit of phase locking in humans is not known. Although TFS in the stimulus on the basilar membrane is present up to the highest audible frequencies, this paper is especially concerned with TFS information as represented in the patterns of phase locking in the auditory nerve. This information probably weakens at high frequencies, and so one way of exploring the use of TFS information is to examine changes in performance on various tasks as a function of frequency. Other ways will be described later in this paper.
The role of TFS in pitch perception
Evidence accrued over many years suggests that TFS plays a role in the perception of pitch for both pure and complex tones; for reviews, see Moore (2003) and Plack and Oxenham (2005). For steady pure tones, information from TFS seems to be necessary to account for the way that frequency discrimination varies with frequency (Heinz et al. 2001) and to account for the fact that frequency discrimination for very short tones is better than would be predicted from excitation-pattern (place) models, based on the broadening of the spectrum that occurs with decreasing duration (Moore 1973). For steady complex tones, information from TFS may be important for coding the frequencies of individual resolved partials (Moore et al. 2006b) and also for coding the temporal structure of the waveform evoked on the basilar membrane by unresolved harmonics with rank below about 14 (Moore et al. 2006a; Moore and Moore 2003b). For complex tones containing only harmonics above the 14th, the pitch seems to be determined by E rather than by TFS cues (Moore and Moore 2003b) and the perceived pitch is relatively weak (Houtsma and Smurzynski 1990).
Information from TFS may also play a role in the detection of frequency modulation (FM) at low rates. Moore and Sek (1992, 1994) have shown that a place model based on excitation patterns can account for the detection of FM, or mixtures of FM and amplitude modulation (AM) when the FM rate is medium or high (10 Hz and above). However, when the FM rate is low (5 Hz or less), the model fails to predict the data (Moore and Sek 1995; Sek and Moore 1995). Moore and Sek proposed that, for FM rates below 5 Hz, FM is detected by virtue of the changes in phase locking to the carrier that occur over time. Note that the carrier frequency itself can be rather high (e.g., 4,000 Hz). Moore and Sek suggested that the mechanism for decoding the phase-locking information was “sluggish” and became less effective when the oscillations in frequency were rapid. Hence, it played little role for high modulation rates. This sluggishness may be similar to that observed for binaural processing of interaural phase differences or interaural correlation, which also depends on sensitivity to TFS (Blauert 1972; Grantham and Wightman 1978, 1979); however, one recent study has shown that rapid changes in interaural timing can be heard for certain complex stimuli (Siveke et al. 2008).
In the experiments of Moore and Sek on FM detection, subjects could always fall back on the use of place cues when TFS cues became less effective (at high modulation rates). However, this does not imply that TFS cues become completely unusable for modulation rates above 5 Hz. When the amount of modulation is well above the detection threshold, TFS cues may play a role for higher modulation rates, especially when place cues are of limited benefit.
Masking and the role of TFS in dip listening
It is often easier to detect a signal in a fluctuating background sound than in a steady background sound, especially when the frequency of the signal is different from the center frequency of the masker. This effect has usually been ascribed to the ability to “listen in the dips” of the fluctuating background sound. I describe next an experiment that supports the idea that dip listening depends partly on the use of TFS information.
These results are consistent with the idea that TFS provides a cue that allows effective dip listening when the masker and signal frequencies fall in the range where phase locking is relatively precise. The decrease in masking release with increasing masker beat rate is consistent with the idea that the mechanism that “decodes” TFS information is sluggish and is less effective when there are rapid changes in TFS. The amount of sluggishness may depend somewhat on center frequency, being greater for low frequencies; it may be that the period needs to be reasonably stable over a certain number of stimulus cycles for TFS to be extracted effectively. This could explain the reduced masking release for the masker centered at 250 Hz. When the masker and signal frequencies are too high to support precise phase locking, some masking release still occurs, but it is smaller and depends only slightly on the masker beat rate. Presumably, some other mechanism leads to masking release in this case, for example, comparison of short-term levels across frequency channels or a shift in the position of the excitation pattern as the masker envelope passes through a minimum; this mechanism appears to be only slightly sluggish. The results of other masking experiments have also been interpreted as indicating a role of TFS in dip listening (Schooneveldt and Moore 1987).
The role of TFS in speech perception
Several researchers have investigated the role of E and TFS cues in speech perception by processing speech sounds in such a way that they contain mainly E or TFS cues. This has been done using different forms of vocoders (Dudley 1939). The speech is filtered into a number of contiguous frequency bands. To preserve E cues, the envelope is extracted from the signal at the output of each band, and the envelope is used to modulate the amplitude of a noise band (noise vocoder) or a sinusoid (tone vocoder) centered at the frequency of the band from which the envelope was derived. The modulated carriers are then combined (usually after a second stage of filtering to restrict the spectrum of the modulated carriers to the original bandwidths). Speech processed in this way will be referred to as “E-speech”. Experiments using E-speech have shown that only a few bands are required to give good intelligibility for speech in quiet (Shannon et al. 1995), although more bands are required when background sounds are present (Qin and Oxenham 2003; Stone and Moore 2003); for a review, see Lorenzi and Moore (2008). Overall, the results suggest that E cues are sufficient to give good intelligibility for speech in quiet, but they are not sufficient when background sounds are present, especially when the background is fluctuating. This may be the case because E cues alone are not sufficient to allow the perceptual segregation of mixtures of sounds, especially when information about the sounds is conveyed only by envelope fluctuations in different frequency bands.
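The tone-vocoder steps described above can be sketched as follows. This is a minimal illustration rather than the processing used in any of the cited studies: the brick-wall FFT filter, the geometrically spaced band edges, the carrier placed at the geometric center of each band, and the names `fft_bandpass`, `hilbert_envelope`, and `tone_vocode` are all assumptions made for the example, and the input is a noise placeholder rather than speech.

```python
import numpy as np

def fft_bandpass(x, fs, lo, hi):
    """Brick-wall bandpass: zero FFT bins outside [lo, hi) Hz (a simplification)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def hilbert_envelope(x):
    """Magnitude of the analytic signal, built with the usual FFT construction."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * weights))

def tone_vocode(x, fs, edges):
    """Replace each band by a sinusoid modulated with that band's envelope."""
    t = np.arange(len(x)) / fs
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = fft_bandpass(x, fs, lo, hi)
        envelope = hilbert_envelope(band)
        carrier = np.sin(2 * np.pi * np.sqrt(lo * hi) * t)
        # Second filtering stage: confine the modulated carrier to its band.
        out += fft_bandpass(envelope * carrier, fs, lo, hi)
    return out

fs = 32000
rng = np.random.default_rng(0)
x = rng.standard_normal(fs // 2)        # 0.5 s of noise, standing in for speech
edges = np.geomspace(80.0, 8020.0, 9)   # 8 contiguous bands (assumed spacing)
e_speech = tone_vocode(x, fs, edges)
```

Replacing the sinusoidal carrier with a band of noise would give the noise-vocoder variant; studies of this kind typically use filters modeled on auditory frequency resolution rather than the brick-wall filter assumed here.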
Processing speech to remove E cues but preserve TFS cues is more difficult. In attempts to do this, each bandpass signal is divided by the envelope magnitude (usually the Hilbert envelope). As a result, each band signal becomes like an FM sinusoidal carrier, with a constant amplitude. The processed band signals are scaled in long-term amplitude so that they have the same root-mean-square amplitude as the original band signals and are then combined. Speech processed in this way will be referred to as “TFS-speech”. It should be emphasized that, although such processing preserves TFS information to some extent, the TFS information is nevertheless distorted and different from the information in the unprocessed speech.
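For a single band, the operation described above (division by the Hilbert envelope followed by restoration of the original root-mean-square level) might look like the sketch below. The function names, the small floor `eps` that guards against division by zero, and the synthetic amplitude-modulated tone used in place of a filtered speech band are all assumptions made for illustration.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal of a real 1-D array via the standard FFT construction."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def remove_envelope(band, eps=1e-12):
    """Divide a band signal by its Hilbert envelope, then restore the input RMS."""
    envelope = np.abs(analytic_signal(band))
    flat = band / np.maximum(envelope, eps)   # eps is an assumed safety floor
    rms_in = np.sqrt(np.mean(band ** 2))
    rms_out = np.sqrt(np.mean(flat ** 2))
    return flat * (rms_in / rms_out)

fs = 16000
t = np.arange(fs) / fs                        # 1 s
# Stand-in for one band of speech: a 1-kHz carrier with 4-Hz amplitude modulation.
band = (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 1000 * t)
tfs_band = remove_envelope(band)
env_after = np.abs(analytic_signal(tfs_band))  # close to constant after processing
```

In this idealized case the envelope after processing is flat. As discussed in the text that follows, envelope cues can nevertheless be partly reconstructed once such a signal passes through the auditory filters.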
A potential problem with TFS-speech was identified by Ghitza (2001). He showed that, although E cues are physically removed from TFS-speech, they are reconstructed at the outputs of the auditory filters (called cochlear filters by Ghitza) and may therefore contribute to intelligibility, especially when only a few broad analysis bands are used in the processing (Zeng et al. 2004). Thus, the intelligibility of TFS-speech may be influenced by reconstructed E cues. However, Gilbert and Lorenzi (2006) presented evidence indicating that the reconstructed E cues supported only minimal speech identification when the bandwidth of the analysis filters was less than 4 ERBN, or equivalently, when the number of analysis bands was equal to or greater than eight over the frequency range 0.08 to 8.02 kHz. Using stimuli processed using a large number of bands, it has been shown that, after training, normal-hearing listeners achieve high levels of intelligibility for TFS-speech in quiet (Gilbert and Lorenzi 2006; Lorenzi et al. 2006). Thus, TFS cues do seem to convey information for intelligibility. However, it may be that reconstructed envelope cues make a contribution to the intelligibility of TFS-speech, even though the envelope cues alone are not sufficient to give good intelligibility. The fact that learning is required to achieve high intelligibility with TFS-speech may indicate that the auditory system normally uses TFS cues in conjunction with envelope cues; when envelope cues are minimal, TFS information may be difficult to interpret. Alternatively, the learning may reflect the fact that TFS cues are distorted in the TFS-speech (relative to unprocessed speech), and it may require some training to overcome the effects of the distortion.
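The equivalence between "less than 4 ERBN" and "eight or more bands over 0.08 to 8.02 kHz" can be checked with a short calculation. The sketch below assumes the widely used ERBN-number scale of Glasberg and Moore (1990), 21.4 log10(4.37F + 1) with F in kHz, and assumes that the analysis bands are equally wide on that scale; neither assumption is stated in the passage above, and the function name is hypothetical.

```python
import math

def erb_number(f_khz):
    """ERB_N-number (Cam) for a frequency in kHz; formula assumed, not from the text."""
    return 21.4 * math.log10(4.37 * f_khz + 1.0)

# Width of the 0.08-8.02 kHz range in ERB_N units.
span = erb_number(8.02) - erb_number(0.08)

# Width of each band if the range is divided equally on the ERB_N-number scale.
for n_bands in (4, 8, 16):
    print(n_bands, "bands ->", round(span / n_bands, 2), "ERB_N per band")
```

Under these assumptions the range spans roughly 30.5 ERBN, so eight bands are each about 3.8 ERBN wide, just under the 4-ERBN limit.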
This section has emphasized the role of TFS information for speech perception in fluctuating background sounds. However, it should be noted that TFS information also plays a role in speaker identification and the understanding of tonal languages, in which differences in fundamental frequency (F0) or changes in F0 over time affect word meanings (Zeng et al. 2005).
The effect of hearing loss on the ability to use TFS information
Experimental evidence suggests that hearing-impaired listeners have a reduced ability to use TFS information for: (1) detecting FM at low rates (Lacher-Fougère and Demany 1998; Moore and Skrodzka 2002), (2) lateralization based on interaural phase differences (Lacher-Fougère and Demany 2005), and (3) discrimination of the (missing) F0 of complex tones (Moore and Moore 2003a).
Hopkins and Moore (2007) directly assessed the ability of subjects with moderate cochlear hearing loss to use TFS information. They measured the ability to discriminate a harmonic complex tone, with F0 = 100, 200, or 400 Hz, from a similar tone in which all components had been shifted up by the same amount in Hertz, ΔF. To reduce cues relating to differences in the excitation patterns of the two tones, the tones contained many components, and they were passed through a fixed bandpass filter centered on the upper (unresolved) harmonics. To prevent components outside the passband of the filter from being audible, a background noise was added. In the presence of this noise, the differences in excitation patterns between the harmonic and frequency-shifted tones were very small when the bandpass filter was centered on the 11th harmonic. People with normal hearing perceive the shifted tone as having a higher pitch than the harmonic tone (de Boer 1956; Moore and Moore 2003b). The envelope repetition rate of the two sounds is the same (equal to F0), so the difference in pitch is assumed to occur because of a difference in the TFS of the two sounds (Moore and Moore 2003b; Schouten et al. 1962).
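The claim that the harmonic and frequency-shifted tones share the same envelope but differ in TFS can be illustrated numerically. The sketch below is a simplification and not the stimulus generation used by Hopkins and Moore: it assumes equal-amplitude components in cosine phase, uses only harmonics 9 to 13, and omits the bandpass filter and background noise. The function name and all parameter values are hypothetical.

```python
import numpy as np

def complex_tone(f0, ranks, shift_hz, fs, dur):
    """Return (waveform, envelope) for components at n*f0 + shift_hz, n in ranks.

    The analytic signal is built directly as a sum of complex exponentials,
    so its magnitude is the Hilbert envelope of the real waveform.
    """
    t = np.arange(int(round(fs * dur))) / fs
    analytic = sum(np.exp(2j * np.pi * (n * f0 + shift_hz) * t) for n in ranks)
    return analytic.real, np.abs(analytic)

fs, f0 = 16000, 100.0
ranks = range(9, 14)                           # components around the 11th harmonic
harm, env_h = complex_tone(f0, ranks, 0.0, fs, 0.1)
shifted, env_s = complex_tone(f0, ranks, 0.2 * f0, fs, 0.1)

period = int(fs / f0)                          # samples per envelope cycle (160)
same_envelope = np.allclose(env_h, env_s)      # envelopes coincide
same_waveform = np.allclose(harm, shifted)     # fine structure does not
```

Shifting every component by the same ΔF multiplies the analytic signal by a unit-magnitude phase term, which leaves the envelope (and its repetition rate, F0) unchanged while altering the fine structure under it.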
Trained normally hearing subjects were able to perform this task well, presumably reflecting the ability to discriminate changes in the TFS of the harmonic and frequency-shifted tones. The smallest detectable frequency shift (corresponding to a detectability index d′ = 1) was typically about 0.05F0. Even untrained normally hearing subjects achieved thresholds of 0.2F0 or better (Moore and Sek 2008). However, subjects with moderate cochlear hearing loss generally performed very poorly. For most subjects and F0s, performance was not significantly above chance even for the maximum frequency shift of 0.5F0. Above-chance performance occurred only when there was little or no hearing loss at the center frequency of the filter passband. The results suggest that moderate cochlear hearing loss results in a reduced ability, or no ability, to discriminate harmonic from frequency-shifted tones based on TFS.
This conclusion applies only to relatively high center frequencies, since the lowest center frequency tested by Hopkins and Moore was 1,100 Hz. Some hearing-impaired subjects can process TFS for lower center frequencies, since, for example, they can recognize melodies played as a pattern of binaural pitches (Santurette and Dau 2006). However, many hearing-impaired subjects do show a poorer than normal ability to process binaural TFS information for frequencies below 1,000 Hz (Lacher-Fougère and Demany 2005; Santurette and Dau 2006).
Speech perception studies
People with cochlear hearing loss usually have difficulty in understanding speech when background sounds are present, especially when the background is fluctuating (Duquesnoy 1983). Recent evidence supports the idea that the difficulty stems partly from a reduced ability to process TFS cues. Lorenzi et al. (2006) measured identification scores for unprocessed, E, and TFS-speech in quiet for three groups of listeners: young with normal hearing and young and elderly with moderate “flat” hearing loss. After training, normal-hearing listeners scored perfectly with unprocessed speech and about 90% correct with E- and TFS-speech. Both young and elderly listeners with hearing loss performed almost as well as normal with unprocessed and E-speech but performed very poorly with TFS-speech, indicating a greatly reduced ability to use TFS. For the younger hearing-impaired group, scores for TFS-speech were highly correlated (r = 0.83) with the ability to take advantage of temporal dips in a background noise when identifying unprocessed speech. These results support the idea that the ability to use TFS is important for listening in the dips.
Lorenzi et al. (2008) measured the identification of E- and TFS-speech in quiet for normal-hearing listeners and listeners with high-frequency mild-to-severe hearing loss and normal (<20 dB HL) audiometric thresholds below 2 kHz. The stimuli were lowpass filtered at 1.5 kHz to restrict their spectrum to the region where audiometric thresholds were normal, and a condition with complementary highpass noise was included, to restrict “off-frequency listening”. Only a few of the hearing-impaired listeners were able to score above chance (6.25%) for the TFS-speech, whereas normal-hearing listeners achieved scores of 20–50% (relatively low scores would be expected given the limited audible frequency range of the stimuli). The results indicate that, for listeners with cochlear hearing loss, deficits in the ability to use TFS cues in speech can occur even when audiometric thresholds are within the normal range.
In the experiment of Hopkins et al. (2008), listeners with moderate cochlear hearing loss were also tested. The mean results are shown by the filled circles in Figure 3. The hearing-impaired listeners performed more poorly than the normal-hearing listeners in all conditions, and the difference in performance was greater when J was large. The improvement in performance going from completely vocoded stimuli (J = 0) to completely unprocessed stimuli (J = 32) was only about 5 dB for the hearing-impaired listeners, compared to about 15 dB for the normal-hearing listeners. For the hearing-impaired group, the improvement varied across listeners, with some listeners showing near-normal benefit and others showing no benefit at all. The reasons for the individual differences are not at present clear. Hopkins et al. (2008) reported that the benefit gained from the addition of TFS information was not correlated with the average audiometric threshold over the range 250 to 4,000 Hz.
Current cochlear implant systems convey mainly envelope information in different frequency bands, and this may partly account for the relatively poor ability of cochlear implantees to understand speech when background sounds are present. Nie et al. (2005) evaluated the potential contribution of TFS information to speech recognition in noise via acoustic simulations of a cochlear implant. They transformed the rapidly-varying TFS into a slowly varying FM signal which was applied to the carrier in each band (which was already amplitude modulated by the speech envelope in that band). They found that, for sentence recognition in the presence of a competing voice, adding this FM signal improved performance by as much as 71 percentage points. This illustrates the potentially large benefit of providing TFS information in a cochlear implant.
Possible reasons for the effect of cochlear hearing loss on sensitivity to TFS
- Reduced precision of phase locking. Physiological studies disagree about whether or not this occurs for pure-tone stimuli (Harrison and Evans 1979; Miller et al. 1997; Woolf et al. 1981). For complex sounds, such as synthesized vowels, noise-induced hearing loss can lead to reduced synchrony capture (phase locking to the formant peaks), possibly as a result of diminished two-tone suppression (Miller et al. 1997).
- A change in the relative phase of response at different points along the basilar membrane (Ruggero 1994). This would affect mechanisms for decoding TFS based on correlation of the outputs of adjacent places (Carney et al. 2002; Deng and Geisler 1987; Loeb et al. 1983; Shamma and Klein 2000; Shamma 1985).
- More complex and more rapidly varying TFS resulting from broader auditory filters, which might make it more difficult for central mechanisms to decode the TFS information.
- A mismatch between TFS information and the place on the basilar membrane that would “normally” respond to that information. TFS information may be decoded on a place-specific basis (Huss and Moore 2005; Moore 1982; Oxenham et al. 2004; Srulovicz and Goldstein 1983). For example, the TFS at a 1-kHz rate may be decoded best by the central neurons that are tuned (in a normal ear) close to 1 kHz. Hearing loss may produce a shift in frequency-place mapping (Liberman and Dodds 1984; Sellick et al. 1982) and disrupt the decoding process. Note that this explanation depends on the assumption that the central decoding mechanism is, to some extent, “hard wired”.
- Central changes that occur following cochlear hearing loss, such as loss of inhibition, which might disrupt the mechanisms for decoding the TFS.
The work of the author was supported by the MRC, Deafness Research UK, and RNID. I thank Kathryn Hopkins, Christian Lorenzi, and Michael Stone for their collaboration in some of the work reported here. I thank Kathryn Hopkins for helpful comments on an earlier version of this paper. I also thank Associate Editor Fan-Gang Zeng and three anonymous reviewers for their helpful comments.
- Bracewell RN. The Fourier transform and its applications. New York, McGraw-Hill, 1986.
- Carney LH, Heinz MG, Evilsizer ME, Gilkey RH, Colburn HS. Auditory phase opponency: A temporal model for masked detection at low frequencies. Acta Acust. Acust. 88:334–346, 2002.
- Harrison RV, Evans EF. Some aspects of temporal coding by single cochlear fibres from regions of cochlear hair cell degeneration in the guinea pig. Arch. Otolaryngol. 224:71–78, 1979.
- Liberman MC, Dodds LW. Single neuron labeling and chronic cochlea pathology. III. Stereocilia damage and alterations in threshold tuning curves. Hear Res. 16:54–74, 1984.
- Lorenzi C, Debruille L, Garnier S, Fleuriot P, Moore BCJ. Abnormal processing of temporal fine structure in speech for frequencies where absolute thresholds are normal. J. Acoust. Soc. Am. (in press), 2008.
- Lorenzi C, Moore BCJ. Role of temporal envelope and fine structure cues in speech perception: A review. In: Dau T, Buchholz JM, Harte JM, Christiansen TU (eds) Auditory Signal Processing in Hearing-Impaired Listeners. 1st International Symposium on Auditory and Audiological Research (ISAAR 2007). Denmark, Centertryk A/S, pp. 263–272, 2008.
- Moore BCJ. An introduction to the psychology of hearing. 2nd edn. London, Academic, 1982.
- Moore BCJ. An introduction to the psychology of hearing. 5th edn. San Diego, Academic, 2003.
- Moore BCJ, Sek A. Development of a fast method for determining sensitivity to temporal fine structure. Int J Audiol. (in press), 2008.
- Plack CJ, Oxenham AJ. The psychophysics of pitch. In: Plack CJ, Oxenham AJ, Fay RR, Popper AN (eds) Pitch perception. New York, Springer, pp. 7–55, 2005.