Across-Channel Timing Differences as a Potential Code for the Frequency of Pure Tones
Carlyon, R.P., Long, C.J. & Micheyl, C. JARO (2012) 13: 159. doi:10.1007/s10162-011-0305-0
When a pure tone or low-numbered harmonic is presented to a listener, the resulting travelling wave in the cochlea slows down at the portion of the basilar membrane (BM) tuned to the input frequency due to the filtering properties of the BM. This slowing is reflected in the phase of the response of neurons across the auditory nerve (AN) array. It has been suggested that the auditory system exploits these across-channel timing differences to encode the pitch of both pure tones and resolved harmonics in complex tones. Here, we report a quantitative analysis of previously published data on the response of guinea pig AN fibres, of a range of characteristic frequencies, to pure tones of different frequencies and levels. We conclude that although the use of across-channel timing cues provides an a priori attractive and plausible means of encoding pitch, many of the most obvious metrics for using that cue produce pitch estimates that are strongly influenced by the overall level and therefore are unlikely to provide a straightforward means for encoding the pitch of pure tones.
Keywords: auditory nerve, pitch, frequency, pure tone, phase transitions
Traditionally, theories of how the auditory system encodes pitch have focused on either the variation in firing rate across auditory nerve (AN) fibres having different characteristic frequencies (CFs; Zwicker 1970), or on the temporal pattern of firing (“phase locking”) of neurons responding to the sound (Schouten 1940; Wever 1949; Siebert 1970; Cariani and Delgutte 1996). Proponents of each class of theory have pointed to potential weaknesses of the other. Rate–place representations may suffer from the facts that the firing rates of the majority of AN fibres are saturated at high stimulation levels, and that the locus of maximum firing rate may change with level (e.g. Kim et al. 1980; Chatterjee and Zwislocki 1997; Versteegh et al. 2011). As a result, rate–place profiles may become degraded and/or shifted at high levels, even though psychophysical experiments indicate good performance and strong pitch at those levels (e.g. Wier et al. 1977). Phase-locking cues are immune to these distortions, but the method for processing this temporal information, without requiring the long neural delay lines assumed by some influential “autocorrelation”-based models (Licklider 1951; Meddis and Hewitt 1991; Meddis and O’Mard 1997), remains a matter of debate (de Cheveigné and Pressnitzer 2006; Meddis and O’Mard 2006; Schnupp et al. 2010). In addition, although most temporal theorists assume that the frequency of a pure tone is encoded primarily by phase locking for frequencies up to at least 2,000 Hz (the frequency at which phase locking usually starts to drop in the mammalian species studied to date), there is evidence that neurons in the auditory cortex phase lock only up to much lower frequencies (e.g. Eggermont 1991; Wang et al. 2008), leading to the suggestion that phase locking must be recoded into some other representation at, or below, the auditory cortex.
A number of authors have described neural mechanisms by which the auditory system might exploit these across-channel phase differences, for the perception of pitch, the detection of tones in noise, the coding of sound level, and the enhancement of spectral representations of complex sounds (Loeb et al. 1983; Shamma 1985; Carney 1994; Heinz et al. 2001; Carney et al. 2002; Colburn et al. 2003; Loeb 2005; Cedolin and Delgutte 2010). Examples include the subtraction of the outputs of nearby channels (Shamma 1985; Cedolin and Delgutte 2010), the processing of channels whose outputs are de-correlated due to the presence of a narrowband sound (Carney et al. 2002), and the detection of co-incidences between the outputs of channels having different CFs (Loeb et al. 1983; Loeb 2005). In each case, the proposed mechanisms allow for phase locking to be recoded, at an early stage of auditory processing, into some other (e.g. rate–place) representation and obviate the need for processes, such as autocorrelation, that may require long central delay lines for computation. At the same time, these across-channel timing cues are largely immune to the effects of firing rate saturation and could therefore combine the strengths of temporal and rate–place models whilst avoiding some of the drawbacks of each (Cedolin and Delgutte 2010).
Another appealing feature, noted by Shamma (1985) and by Cedolin and Delgutte (2010), is that phase transition (PT) cues may be level-invariant. This is illustrated in Figure 1, in which the sets of curves for each tone frequency contain individual curves for input levels of between 4 and 74 dB SPL, which substantially overlap. However, as discussed above, PTs arise from the filtering properties of the BM; these are known to be nonlinear, and so one might expect some level dependence in PTs too. Furthermore, it has been known for some time that the phase response of the BM and of individual neurons varies with the input level (Anderson et al. 1971; Rhode 1971); in fact, the slopes of the lines in Figure 1 are slightly shallower at higher levels, reflecting the broader frequency tuning at those levels (Kim, personal communication). Moreover, changes in this slope with level, arising from the operation of the cochlear amplifier and the consequent change in tuning, have been proposed as a means by which the sound level is encoded by the auditory system (Carney 1994; Heinz et al. 2001; Colburn et al. 2003), suggesting that PT cues may not provide a level-independent cue to pitch. The present study investigates this issue in an attempt to determine whether PTs can support a level-independent code for pure-tone pitch and evaluates four possible ways that phase transitions could be processed in order to provide such a level-independent code.
Level dependence of phase transition cues for pure-tone pitch
The phase transition data reported by Kim et al. (1980), and shown in Figure 1, provide an important demonstration of the presence of across-channel timing information in the AN. Here, we report the results of a quantitative treatment of more recent data reported by Palmer and Shackleton (2009), which those authors generously shared with us. They measured the amplitude and phase responses of individual AN fibres of the anaesthetized guinea pig (GP) in response to tones of different frequencies and levels. Fortunately, those data were obtained for a range of fibre CFs, permitting a reanalysis in terms of the response across CFs to a tone of a given frequency and level. Our analysis differs from previous approaches that were based on simulated neural responses produced using models of the auditory periphery (Shamma 1985; Carney et al. 2002) or that estimated the response of neurons of different CFs to a given stimulus by recording from an individual neuron in response to a range of stimuli (Cedolin and Delgutte 2010). This allowed us to obtain direct measurements of the variation of firing phase with CF and to avoid having to make some of the simplifying assumptions associated with those alternative approaches.
Another feature of our approach is that we do not focus on trying to model the magnitude of behaviourally obtained frequency difference limens (FDLs), for two reasons, both of which apply quite generally to attempts to model perception using neural responses. First, the absolute size of predicted FDLs will be influenced by the noise present in the physiological data on which the predictions are based, by the need to combine measurements from several animals, and by assumptions concerning the number of neurons involved and the degree of statistical independence of their responses. Second, the neural data are from a species whose behavioural FDLs are larger than in humans, may vary with frequency in a different way, and may be influenced by non-sensory factors such as attention (Heffner et al. 1971; Hienz et al. 1993). Instead, we concentrate on determining whether across-channel timing cues could provide a level-independent code for the frequency of a pure tone. We do so because, although the pitch of pure tones in humans does vary with sound level, this variation is small and is of the order of only a few per cent (Stevens 1935; Walliser 1969; Verschuure and van Meeteren 1975). A cue that varied substantially with the overall level would not, we assume, provide a useful means of encoding pitch, and we make the additional assumption that this will be true for the GP as well as for humans. Specifically, we compare the extent to which the predictions of each potential coding scheme vary with input level and with input frequency, fs. Note that although “noise” in the data may obscure the effects of either fs or level, it is unlikely to cause effects that are not significant to appear so. Note also that we are comparing the size of two effects—fs and level—both of which are subject to the same measurement noise. Hence, the noise is unlikely to distort our estimate of the relative size of the two effects. 
Our analysis shows that many ways of processing PT cues—described in detail in “Behaviour of four cross-channel phase difference models”—are, in fact, strongly level-dependent. As such, those methods are unlikely, by themselves, to provide a robust level-independent code for the pitch of pure tones. A possible exception is a scheme that compares AN fibres that have different CFs and that respond out of phase to each other (cf. Carney et al. 2002); the plausibility of this scheme is discussed.
The analyses that follow use data obtained by Palmer and Shackleton (2009), and the reader is referred to that original article for experimental details. Stimuli were 50-ms sinusoids presented at a rate of 5 Hz to urethane-anaesthetized adult pigmented GPs. The phases and firing rates of the responses to these stimuli were then extracted, as a function of AN fibre CF, for a range of input levels. The data were typically obtained at 10-dB intervals; in the majority of cases, these levels corresponded (within 1 dB) to those specified in our analyses. The stimulus frequency spacing was in steps of 1.5 semitones from the CF of the considered AN fibre. Where the levels and frequencies specified in an analysis did not correspond exactly to those available in the data, phases and firing rates for a given input were estimated by linear interpolation from the values at the next-highest and next-lowest levels and frequencies at which measurements were obtained.
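The interpolation step can be sketched as follows. This is a minimal illustration, not the authors' code; the level and phase values are invented, and NumPy's `np.interp` performs the piecewise-linear estimate between the bracketing measured levels:

```python
import numpy as np

# Hypothetical phase measurements (in cycles) for one AN fibre at the levels
# actually recorded; all names and values here are invented for illustration.
measured_levels = np.array([40.0, 50.0, 60.0, 70.0])      # dB SPL
measured_phases = np.array([-1.20, -1.05, -0.95, -0.88])  # cycles

def interp_phase(target_level, levels, phases):
    """Linearly interpolate the response phase between the next-lowest and
    next-highest measured levels, as in the analysis described above."""
    return float(np.interp(target_level, levels, phases))

# 55 dB SPL falls midway between the 50- and 60-dB measurements:
print(interp_phase(55.0, measured_levels, measured_phases))
```

The same one-dimensional interpolation would be applied along the frequency axis where the stimulus frequency of interest fell between measured values.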
As noted above, estimates of the accuracy of any code will be affected by details specific to the method of data collection and to assumptions made in the modelling process. Hence, although plots of our model predictions are accompanied by error bars, these provide a guide to the statistical significance of the effects that we discuss, but do not indicate how small a frequency difference should be detectable by the GP.
Behaviour of four cross-channel phase difference models
Local subtraction model
Our implementation of the local subtraction scheme is a simplification for at least two reasons. First, the neural responses in any one AN fibre will not occur at exactly the same phase on every cycle, and, even on the same cycle, it is likely that different neurons with the same CF will fire at slightly different times. Second, any second-order neuron that subtracts the inputs from two adjacent channels will impose some smoothing due to variability in synaptic transmission times. This does not negate our assumption that the output of such a subtraction will on average increase monotonically with the phase difference between the two input AN channels. Furthermore, assuming that, for a given fs, the smoothing is the same at all CFs, the output of the local subtraction algorithm will have peaks in the same locations as those shown in the bottom row of Figure 4. It does, however, have one potential implication for our evaluation of the local subtraction scheme, which is based on the subtraction of phases rather than of simulated spike trains. This is that if any component of “neural smoothing” has a time constant that is constant in milliseconds rather than in radians, its effects would be greater for higher-frequency than for lower-frequency tones. As a result, the output of the subtraction operation would have peaks that, compared with those shown in the bottom row of Figure 4, would be relatively higher for low than for high fs.
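As a rough illustration of the local subtraction idea applied to phases rather than spike trains, the sketch below (with invented CF and phase values, not the authors' implementation) differences neighbouring channels and locates the steepest part of the phase transition:

```python
import numpy as np

# Sketch of "local subtraction" applied directly to a phase-vs-CF profile.
# phases[i] is the response phase (in cycles) of the channel with CF cfs[i];
# all values are invented for illustration.
cfs = np.array([500, 600, 720, 860, 1030, 1240, 1490])          # Hz
phases = np.array([-2.4, -2.1, -1.6, -0.9, -0.5, -0.45, -0.43])  # cycles

# Differences between neighbouring channels are large where the phase
# transition is steep (near the input frequency) and small where it is flat.
local_diff = np.abs(np.diff(phases))
peak = int(np.argmax(local_diff))   # index of the steepest neighbouring pair
print(cfs[peak], cfs[peak + 1])     # the pair straddling the transition
```

In a spike-based implementation, the smoothing discussed above would blur `local_diff`, but for a fixed input frequency the location of its peak should be unchanged.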
A model based on the local subtraction idea was recently applied by Cedolin and Delgutte (2010) to the responses of cat AN fibers to harmonic complex tones. An advantage of this approach is that the across-channel timing comparisons are performed locally, and, where a complex tone consists of low-numbered harmonics, can be performed between AN fibers responding to the same harmonic (Cedolin and Delgutte 2010). This is important because it generates predictions consistent with the fact that the pitch of low-numbered harmonics is independent of their relative phase (Houtsma and Smurzynski 1990; Shackleton and Carlyon 1994) and because it would allow phase comparisons to be performed even when energy from other sound sources is present elsewhere in the spectrum. We compare Cedolin and Delgutte’s findings to the results of our analyses in “Comparison with other data”. Although neither we, nor they, could specify exactly which neurons might perform this subtraction, Cedolin and Delgutte suggested the dorsal cochlear nucleus as one possible site.
CF-at-knee point model
Phase transition slope model
Loeb et al. (1983) proposed a model consisting of an array of detectors in the medial superior olive, with each detector receiving inputs from pairs of AN fibers that were separated along the BM by a fixed amount. They pointed out that the maximum output would occur in those detectors whose input AN fibers responded with phases that differed by one wavelength (although they also noted that optimal discrimination might be based on other phase differences). This “critical distance” depends on the slope of the function relating phase in cycles to log10(CF). As shown in Figures 1, 4, and 6, this slope is not constant as a function of CF but, rather, increases markedly for CFs below the knee point. Here, we focus on the slope of the steep portion as a possible cue for fs. This slope will depend on the bandwidths of the peripheral filters through which the stimulus is passed (Heinz et al. 2001; Shera et al. 2010); a filter introduces a delay as the frequency of the input passes through its resonance, and the size of this delay is greater for narrow than for broad filters. As the slope is measured in terms of the logarithm of CF (which is roughly proportional to the distance along the BM), it will vary with the bandwidth of the analysing filters relative to their centre frequencies. This last point can be explained by considering a 1,000-Hz tone passed through a filter centred on, say, 1,200 Hz and a 2,000-Hz tone passed through a filter centred on 2,400 Hz. If the bandwidths of the filters are a constant proportion of their centre frequencies, then each tone will be at the same relative position on the filter’s transfer function and the phase delays (in cycles) will be equal.
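The slope cue discussed above could be estimated by a straight-line fit of phase against log10(CF) over the steep portion of the curve. The sketch below uses invented values purely to illustrate the computation; it is not fitted to the actual data:

```python
import numpy as np

# Least-squares estimate of the phase-transition slope, in cycles per
# log10(CF), over the steep portion of the curve; all values are invented.
cfs = np.array([400.0, 500.0, 630.0, 800.0, 1000.0])    # Hz, steep region
phases = np.array([-0.20, -0.55, -0.90, -1.27, -1.60])  # cycles

# Degree-1 polynomial fit of phase against log10(CF):
slope, intercept = np.polyfit(np.log10(cfs), phases, 1)
print(round(slope, 2))  # cycles/log10(CF), same units as the slopes quoted in the text
```

A shallower fitted slope at higher levels, reflecting broader tuning, is exactly the level dependence at issue for this cue.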
Model involving a comparison of phase above and below the knee point
A matter that would still need to be resolved is how the auditory system would extract this metric. A possibly important fact is that the phase at and above the knee point is fairly constant over a wide range of CFs, and so is the phase at which most AN fibers are firing. Therefore, “all” that the auditory system has to do is to find those fibers that are firing out of phase (by π radians) with the “most common” phase. However, phase is only “fairly” constant at CFs above the knee point. The data of Kim et al. (Fig. 1) show that, if one looks over a much wider range of CFs than in the data of Palmer and Shackleton, the phase delay does decrease gradually as one moves to more basal regions. Therefore, the issue arises as to which portion of the AN array is used as the “fairly constant” reference. An alternative would be for the system to measure the CF corresponding to the knee point, perhaps using a simple metric as in Figure 3C–E, and to exploit this information to identify the π shift point. An important drawback is that the “π shift point” and the knee point are often far apart (more than an octave difference in CF; Fig. 9), raising the issue of how the brain would extract this cue when more than one sound is present. Furthermore, when more than one tone is present, basal fibers will respond to a mixture of both, and the auditory system would have to extract the temporal pattern corresponding to each one. Evidence from the perception of spectrally overlapping mixtures of unresolved harmonics suggests that the brain is poor at extracting two periodicities from the same set of AN fibers (Carlyon 1996a, b).
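A toy illustration of the π-shift idea, with invented phases: take a "fairly constant" reference phase from basal CFs and find the CF whose response lags it by half a cycle. The choice of reference region is exactly the unresolved issue noted above:

```python
import numpy as np

# Invented phase-vs-CF profile (phases in cycles); not measured data.
cfs = np.array([300, 380, 480, 600, 760, 950, 1200, 1500])              # Hz
phases = np.array([-2.9, -2.55, -2.1, -1.6, -1.1, -0.62, -0.55, -0.53])

# Reference: the "fairly constant" phase at basal (high-CF) channels.
reference = np.median(phases[cfs >= 950])
target = reference - 0.5     # half a cycle, i.e. pi radians, away

# CF whose response phase lies closest to the target:
pi_shift_cf = cfs[np.argmin(np.abs(phases - target))]
print(pi_shift_cf)           # CF firing out of phase with the basal region
```

Note that the cutoff of 950 Hz defining the "basal" region is arbitrary here; in practice the gradual decrease of phase delay towards more basal regions makes this choice non-trivial.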
Use of firing rate profiles
Comparison with other data
van der Heijden and Joris (2006) measured the amplitude and phase characteristics at different apical locations of the cat cochlea by recording the responses to inharmonic complex tones from cat AN fibers having a range of CFs. Their technique was designed to remove nonlinear contributions to the AN response arising from the transduction process (e.g. half-wave rectification and firing rate saturation). Measurements were obtained for stimuli at relatively low levels (within 35 dB of each fibre’s threshold) and were not presented separately for different levels. Their data, like those presented here, reveal phase-vs.-CF functions that are quite flat for CFs much higher than the input frequency, combined with a steeply sloping portion around CF. As also shown here, they found that the slope is steeper at higher frequencies: approximately −2.1 and −3.5 cycles/log10(CF) at 200 and 1,000 Hz, respectively. These slopes are somewhat steeper than those shown in Figure 8, probably due to the lower level used by them and/or to species differences. A visual inspection of some cat data presented by Kim et al. (1979) reveals a slope of about −2 cycles/log10(CF) at Ls = 45 dB SPL for fs = 620 Hz. The data of Kim et al. (1980), plotted in Figure 1, show substantially steeper slopes of about 6 cycles/log10(CF), presumably reflecting the higher fs of 2,100 Hz and the marked increase in the Q10 dB values of cat AN fibers as CF is increased above about 1,000 Hz (Palmer 1995; Shera et al. 2002). In general, the results of Kim et al. and of van der Heijden and Joris agree reasonably well with those described here despite the fact that both of those earlier data sets were obtained in the cat, whereas Palmer and Shackleton’s data were obtained in the GP. In addition, van der Heijden and Joris noted that they deliberately restricted their measures to fairly low levels because increasing level could have a “drastic” effect on the phase measurements.
This is consistent with evidence from recordings of individual neurons innervating the cochlear apex showing that level affects group delay in a frequency-dependent manner (Versteegh et al. 2011) and with the present evidence that PTs are strongly affected by level. Hence, although there will undoubtedly be quantitative differences between species, the general form of phase transition data is similar.
Cedolin and Delgutte (2010) measured the response of individual AN fibers to harmonic complex tones whose fundamental frequency (F0) ranged from about 0.2 to about 0.7 times the neuron’s CF. They invoked the principle of “scaling invariance” (Zweig 1976) to calculate, for each set of such measurements, the predicted amplitude and phase response of an array of neurons, with different CFs, to a complex of a given F0. They then calculated not only the predicted rate–place profile but also a metric similar to the simple “local subtraction” method described above; the response of each “channel” was instantaneously subtracted from that of its neighbour and the absolute value of this difference then integrated over time. The variation in this summary statistic across CF, termed the “mean absolute spatial derivative (MASD)”, produced peaks at CFs corresponding to the low-numbered harmonics. They found that for CFs between 1,350 and 2,800 Hz, the MASD provided a better representation of the F0 than did a simple rate–place profile, but that the reverse was true at higher CFs, presumably due to the roll-off in phase locking at high frequencies. In addition, they estimated the best frequency (BF) of each neuron, based both on the rate–place and MASD profiles, and reported that these BFs varied less consistently with level for the MASD-based than for the rate-based measure. They also concluded that the accuracy of the F0 representation did deteriorate slightly with increasing stimulus level, but that the significance of this deterioration was less than that observed for the rate–place model.
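A toy version of the MASD computation, run on simulated sinusoidal channel responses with invented phase lags (not Cedolin and Delgutte's actual stimuli or data), might look like:

```python
import numpy as np

# Simulated responses of four channels to a 1,000-Hz tone; each channel is
# given an invented phase lag, with a large jump at the "transition".
t = np.linspace(0.0, 0.05, 500)                   # 50 ms of time samples
cfs = np.array([800.0, 1000.0, 1250.0, 1600.0])   # Hz
channel_phase = np.array([0.0, 0.4, 1.8, 2.0])    # radians, invented
resp = np.cos(2 * np.pi * 1000.0 * t[None, :] - channel_phase[:, None])

# Subtract each channel from its neighbour instant by instant, then average
# the absolute difference over time (a stand-in for integration over time).
masd = np.mean(np.abs(np.diff(resp, axis=0)), axis=1)
peak = int(np.argmax(masd))
print(cfs[peak], cfs[peak + 1])   # pair straddling the largest phase jump
```

The key property is that the metric peaks where neighbouring channels respond most out of phase, i.e. at the steep portion of the phase transition, rather than where firing rate is highest.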
Cedolin and Delgutte’s conclusion that their MASD model provided a relatively level-independent cue to pitch seems, at least at first sight, to be at variance with the substantial level dependence observed in our analyses of Palmer and Shackleton’s GP data and with van der Heijden and Joris’s observation that level can have a large effect on the phase transitions measured in the cat. One way of reconciling these two sets of findings would be if the response of the BM were more linear for complex tones than for the pure tones used by Palmer and Shackleton. In addition, the range of sound levels studied by Cedolin and Delgutte was generally lower than ours; 80% of the data they reported were obtained at levels between 25 and 65 dB SPL per component, whereas we report data for levels between 50 and 90 dB SPL. It is also worth noting that Cedolin and Delgutte focused on the finding that the PT cue was less level-dependent than the rate–place representation, a finding that is not inconsistent with our analysis. For example, Figure 11 shows that the rate vs. CF profile at 90 dB SPL is flat, whereas the “local subtraction” metric shown in Figure 5 does show some variation with input frequency at that level. Our focus is on whether there is an effect of level at all and on whether this effect is substantially less than the effect of input frequency.
Level dependence of across-channel phase cues
In the Introduction, we noted that the nonlinearity of the mechanical filtering properties of the BM made it plausible that PTs, which arise from those properties, would also be nonlinear and vary with input level. The analyses described in “Level dependence of phase transition cues for pure-tone pitch” generally support this prediction. It is possible that the phase effects we observed had not only a mechanical but also a neural basis. One way in which the amplitude response of the BM to a pure tone could influence the PT arises from a possible effect of level on the synaptic delay between inner hair cells (IHCs) and the AN. Recently, Versteegh et al. (2011) interpreted some of their data on the level dependence of AN phase responses in terms of evidence that larger input amplitudes may cause AN fibres to fire earlier, perhaps as a result of the effects of sound level on vesicle release at the IHC synapse. If this is true then, when, for a given input level, the inputs to fibres with CFs remote from fs are reduced by BM filtering, those AN fibres may be subject to slightly different synaptic delays, thereby contributing to the variation in the AN response phase as a function of CF. Furthermore, because BM filtering itself depends on the input level, the variation in these “synaptic” effects across CF may also be level-dependent.
Summary of PT cue analysis
For a given fs, the function relating AN response phase to log10(CF) is very shallow for CFs much higher than fs and substantially steeper for CFs around and below fs. This PT curve shows a local maximum, the position of which varies monotonically with fs. However, this position also varies with level, and the average effect of a 40-dB level change on its position is more than half that produced by changing input frequency over a 2.24-octave range.
The “knee point”—the position on the PT curve at which there is a marked increase in slope—also varies monotonically with fs and is also level-dependent. The average effect of a 40-dB level change is more than half that produced by changing input frequency over a 2.24-octave range.
The slope of the PT curve varies slightly with fs, presumably as a result of the variation in relative tuning with CF in the GP. It becomes shallower with increasing input level, presumably reflecting the worsening of frequency selectivity with increasing level observed here. The average effect of a 40-dB level change approached that produced by changing input frequency over a 2.24-octave range.
Models based on co-incidence detection are dependent on the slope of the phase-vs.-CF function. Over the steep part of that function, this slope is level-dependent and its variation with fs is slight and depends on variations in relative bandwidth with CF—a variation that may be absent above 500 Hz in humans (Baker and Rosen 2006). Therefore, to the extent that such models assume that phase differences are compared over the steep part of the function, they do not predict a robust, level-independent code for frequency.
The CF at which the phase of the AN response is exactly out of phase with that at the knee point varies monotonically with fs. Unlike the other potential codes described here, its variation with level was not significant overall. The average effect of a 40-dB level change was only about one tenth that produced by changing input frequency over a 2.24-octave range. However, this code requires the auditory system to compare channels with well-separated CFs, and it is not obvious how it could be used when more than one tonal component is present.
Overall summary and conclusions
Our analysis of Palmer and Shackleton’s (2009) GP data suggests that straightforward methods for extracting pure-tone frequency from PTs—such as the “local subtraction” and “knee point” metrics—are likely to be too strongly affected by level to provide a robust code. Perhaps the strongest candidate code that we observed was the CF at which neurons fire out of phase with those at and basal to the knee point; this metric varied substantially and monotonically with fs and was not substantially influenced by level. However, this code requires a comparison of AN fibres having quite different CFs, and it is less than clear how it could be implemented by the auditory system, particularly when more than one sound is present. A caveat that the present analyses share with most other published studies is that the neural responses were obtained in an anaesthetized preparation.
Overall, our conclusion is therefore that although across-channel timing information provides a theoretically attractive neural code for pure-tone frequency (subjectively, pitch), the way in which it is extracted is unlikely to be straightforward. One possible conclusion is that across-channel timing cues are not used at all. At present, the most likely alternatives are either that the auditory system can estimate a quantity akin to the π-shift point, which entails comparing phase information across quite well-separated portions of the AN array, or that it uses one of the more “local” codes described here whilst controlling for the sometimes substantial effects of overall level, for example, by estimating total firing rate.
We are very grateful to Alan Palmer and Trevor Shackleton for providing us with their raw data and with some MATLAB scripts. Thanks are also due to Chris van den Honert, Associate Editor Laurel Carney and the anonymous reviewers for insightful comments on an earlier version of the manuscript.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.