Traditionally, theories of how the auditory system encodes pitch have focused on either the variation in firing rate across auditory nerve (AN) fibres having different characteristic frequencies (CFs; Zwicker 1970), or on the temporal pattern of firing (“phase locking”) of neurons responding to the sound (Schouten 1940; Wever 1949; Siebert 1970; Cariani and Delgutte 1996). Proponents of each class of theory have pointed to potential weaknesses of the other. Rate–place representations may suffer from the facts that the firing rates of the majority of AN fibres are saturated at high stimulation levels, and that the locus of maximum firing rate may change with level (e.g. Kim et al. 1980; Chatterjee and Zwislocki 1997; Versteegh et al. 2011). As a result, rate–place profiles may become degraded and/or shifted at high levels, even though psychophysical experiments indicate good performance and strong pitch at those levels (e.g. Wier et al. 1977). Phase-locking cues are immune to these distortions, but the method for processing this temporal information, without requiring the long neural delay lines assumed by some influential “autocorrelation”-based models (Licklider 1951; Meddis and Hewitt 1991; Meddis and O’Mard 1997), remains a matter of debate (de Cheveigné and Pressnitzer 2006; Meddis and O’Mard 2006; Schnupp et al. 2010). In addition, although most temporal theorists assume that the frequency of a pure tone is encoded primarily by phase locking for frequencies up to at least 2,000 Hz (the frequency at which phase locking usually starts to drop in the mammalian species studied to date), there is evidence that neurons in the auditory cortex phase lock only up to much lower frequencies (e.g. Eggermont 1991; Wang et al. 2008), leading to the suggestion that phase locking must be recoded into some other representation at, or below, the auditory cortex.

The pros and cons of the above arguments have been debated extensively elsewhere (Moore 2003; de Cheveigné 2005; Plack 2005). Here, we simply note that the various potential weaknesses of the two traditional approaches have led to considerable interest in an alternative class of explanation, whereby the auditory system performs an instantaneous comparison of the temporal pattern of firing in auditory neurons having different CFs (Loeb et al. 1983; Shamma 1985; Carney 1994; Heinz et al. 2001; Carney et al. 2002; Colburn et al. 2003; Oxenham et al. 2004; Loeb 2005; Moore and Carlyon 2005; Cedolin and Delgutte 2010). This idea stems from the fact that the phase of AN responses varies across CF in a manner that varies with the input frequency. It can be illustrated with reference to Figure 1, reprinted from an article by Kim et al. (1980), who recorded the responses of cat AN fibres having a wide range of CFs to a two-tone complex having component frequencies, f 1 = 2,100 Hz and f 2 = 2,700 Hz. The figure shows two sets of curves, each of which represents the phase of the AN response to one of the components. In each case, the phase is roughly constant over a range of high-frequency CFs, and there is a steep phase transition (PT) at CFs near the frequency of each tonal component (shown by the small arrows); similar results have been more recently obtained by van der Heijden and Joris (2006) using a different technique. To a first approximation, this pattern of results can be explained by the facts that a filter introduces a delay as the frequency of the input passes through its resonance, that the size of this delay is greater for narrow than for broad filters, and that the bandwidths of basilar membrane (BM) filters (as estimated from, e.g. tuning curves), when expressed in hertz, are narrower at the apex than at the base. 
A consequence of the variation in phase with CF is that neurons with CFs much higher than the input frequency fire at approximately the same time as each other, whereas those close to the input frequency fire out of phase. This is illustrated in Figure 2 which shows schematic “neurograms” to 1,000- and 1,200-Hz pure tones (these neurograms were not derived from the cat data obtained by Kim et al. and shown in Figure 1, but were based on fits to data obtained from another species as described in “Modelling approach”).

FIG. 1

From Kim et al. (1980), with permission. Phase of AN response to a 2,100 + 2,700-Hz complex at levels ranging from 4 to 74 dB SPL. There are two sets of curves, one for each component; the curves within each set are for different sound levels and, at the scale shown here, appear to overlap. Note that, unlike all the other figures in this article, CF decreases from high to low along the abscissa.

FIG. 2

Schematic “neurograms” in which the firing times of neurons in each CF channel are estimated from the arc-tangent fits to the phase-vs.-CF curves, described in the text, for 1,000-Hz (A) and 1,200-Hz (B) pure tones. Bright colours (red, yellow) indicate a high instantaneous firing rate; darker colours (dark blue) correspond to instantaneous firing rates close to zero. Following Colburn et al. (2003), the instantaneous firing rate was approximated as a scaled exponential function of the input waveform, where the scaling factor was chosen to yield a synchronization index of 0.81. Because the purpose of this figure is to illustrate specifically the phase effects and the relative timing of neural responses to tones of different frequencies, the effects of peripheral frequency selectivity on the firing rate are not shown. The panel on the right of the plot shows the mean across time of the absolute value of the difference in instantaneous rate between nearby simulated auditory channels, similar to Cedolin and Delgutte (2010). The two resulting mean absolute difference functions (one for the 1,000-Hz pure tone and the other for the 1,200-Hz pure tone) have been normalized so that their peak amplitude equals 1. Note that the peak corresponds to a higher CF for the 1,200-Hz than for the 1,000-Hz tone.
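The choice of scaling factor mentioned in the caption of Figure 2 can be illustrated with a short numerical sketch. It assumes, as one possible reading of the caption, that the instantaneous rate is proportional to exp(k · sin(phase)) for a unit-amplitude tone, and it solves for k by bisection; the function names, the bisection bounds, and the number of integration points are choices made here for illustration and are not taken from the original analysis.

```python
import math

def synchronization_index(k, n=4096):
    """Vector strength of a rate proportional to exp(k * sin(phase)) over one cycle."""
    phases = [2.0 * math.pi * i / n for i in range(n)]
    rates = [math.exp(k * math.sin(p)) for p in phases]
    total = sum(rates)
    c = sum(r * math.cos(p) for r, p in zip(rates, phases)) / total
    s = sum(r * math.sin(p) for r, p in zip(rates, phases)) / total
    return math.hypot(c, s)

def solve_scaling(target=0.81, lo=0.0, hi=20.0, tol=1e-6):
    """Bisection for the exponent scaling k that yields the target index.

    Relies on the index rising monotonically with k (assumed bounds lo, hi).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if synchronization_index(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Scaling factor that gives the synchronization index quoted in the caption
k_fit = solve_scaling()
```

Under this assumed form, the solved scaling factor comes out close to 3; a different definition of the "scaled exponential" would give a different value.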

A number of authors have described neural mechanisms by which the auditory system might exploit these across-channel phase differences for the perception of pitch, the detection of tones in noise, the coding of sound level, and the enhancement of spectral representations of complex sounds (Loeb et al. 1983; Shamma 1985; Carney 1994; Heinz et al. 2001; Carney et al. 2002; Colburn et al. 2003; Loeb 2005; Cedolin and Delgutte 2010). Examples include the subtraction of the outputs of nearby channels (Shamma 1985; Cedolin and Delgutte 2010), the processing of channels whose outputs are de-correlated due to the presence of a narrowband sound (Carney et al. 2002), and the detection of coincidences between the outputs of channels having different CFs (Loeb et al. 1983; Loeb 2005). In each case, the proposed mechanisms allow for phase locking to be recoded, at an early stage of auditory processing, into some other (e.g. rate–place) representation and obviate the need for processes, such as autocorrelation, that may require long central delay lines for computation. At the same time, these across-channel timing cues are largely immune to the effects of firing rate saturation and could therefore combine the strengths of temporal and rate–place models whilst avoiding some of the drawbacks of each (Cedolin and Delgutte 2010).

Another appealing feature, noted by Shamma (1985) and by Cedolin and Delgutte (2010), is that PT cues may be level-invariant. This is illustrated in Figure 1 in which the sets of curves for each tone frequency contain individual curves for input levels of between 4 and 74 dB SPL, which substantially overlap. However, as discussed above, PTs arise from the filtering properties of the BM; these are known to be nonlinear, and so one might expect some level dependence in them too. Furthermore, it has been known for some time that the phase response of the BM and of individual neurons varies with the input level (Anderson et al. 1971; Rhode 1971); in fact, the slopes of the lines in Figure 1 are slightly shallower at higher levels, reflecting the broader frequency tuning at those levels (Kim, personal communication). Moreover, changes in this slope with level, arising from the operation of the cochlear amplifier and the consequent change in tuning, have been proposed as a means by which the sound level is encoded by the auditory system (Carney 1994; Heinz et al. 2001; Colburn et al. 2003), suggesting that PT cues may not provide a level-independent cue to pitch. The present study investigates this issue in an attempt to determine whether PTs can support a level-independent code for pure-tone pitch and evaluates four possible ways that phase transitions could be processed in order to provide such a level-independent code.

Level dependence of phase transition cues for pure-tone pitch

Modelling approach

The phase transition data reported by Kim et al. (1980), and shown in Figure 1, provide an important demonstration of the presence of across-channel timing information in the AN. Here, we report the results of a quantitative treatment of some more recent data, reported by Palmer and Shackleton (2009), and which those authors generously shared with us. They measured the amplitude and phase responses of individual AN fibres of the anaesthetized guinea pig (GP) in response to tones of different frequencies and levels. Fortunately, those data were obtained for a range of fibre CFs, permitting a reanalysis in terms of the response across CFs to a tone of a given frequency and level. Our analysis differs from previous approaches that were based on simulated neural responses produced using models of the auditory periphery (Shamma 1985; Carney et al. 2002) or that estimated the response of neurons of different CFs to a given stimulus by recording from an individual neuron in response to a range of stimuli (Cedolin and Delgutte 2010). This allowed us to obtain direct measurements of the variation of firing phase with CF and to avoid having to make some of the simplifying assumptions associated with those alternative approaches.

Another feature of our approach is that we do not focus on trying to model the magnitude of behaviourally obtained frequency difference limens (FDLs), for two reasons, both of which apply quite generally to attempts to model perception using neural responses. First, the absolute size of predicted FDLs will be influenced by the noise present in the physiological data on which the predictions are based, by the need to combine measurements from several animals, and by assumptions concerning the number of neurons involved and the degree of statistical independence of their responses. Second, the neural data are from a species whose behavioural FDLs are larger than in humans, may vary with frequency in a different way, and may be influenced by non-sensory factors such as attention (Heffner et al. 1971; Hienz et al. 1993). Instead, we concentrate on determining whether across-channel timing cues could provide a level-independent code for the frequency of a pure tone. We do so because, although the pitch of pure tones in humans does vary with sound level, this variation is small and is of the order of only a few per cent (Stevens 1935; Walliser 1969; Verschuure and van Meeteren 1975). A cue that varied substantially with the overall level would not, we assume, provide a useful means of encoding pitch, and we make the additional assumption that this will be true for the GP as well as for humans. Specifically, we compare the extent to which the predictions of each potential coding scheme vary with input level and with input frequency, f s. Note that although “noise” in the data may obscure the effects of either f s or level, it is unlikely to cause effects that are not significant to appear so. Note also that we are comparing the size of two effects—f s and level—both of which are subject to the same measurement noise. Hence, the noise is unlikely to distort our estimate of the relative size of the two effects. 
Our analysis shows that many ways of processing PT cues—described in detail in “Behaviour of four cross-channel phase difference models”—are, in fact, strongly level-dependent. As such, those methods are unlikely, by themselves, to provide a robust level-independent code for the pitch of pure tones. A possible exception is a scheme that compares AN fibres that have different CFs and that respond out of phase to each other (cf. Carney et al. 2002); the plausibility of this scheme is discussed.

The analyses that follow use data obtained by Palmer and Shackleton (2009), and the reader is referred to that original article for experimental details. Stimuli were 50-ms sinusoids presented at a rate of 5 Hz to urethane-anaesthetized adult pigmented GPs. The phases and firing rates of the responses to these stimuli were then extracted, as a function of AN fibre CF, for a range of input levels. The data were typically obtained at 10-dB intervals; in the majority of cases, these levels corresponded (within 1 dB) to those specified in our analyses. Stimulus frequencies were spaced in steps of 1.5 semitones relative to the CF of the AN fibre under study. Where the levels and frequencies specified in an analysis did not correspond exactly to those available in the data, phases and firing rates for a given input were estimated by linear interpolation from the values at the next highest and the next lowest levels and frequencies at which measurements were obtained.
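The interpolation step described above can be sketched as follows. This is a minimal illustration and not the original analysis code: it assumes that phases have already been unwrapped, that interpolation is linear in hertz and in decibels, and that the four bracketing measurements exist; the function names and the example values are hypothetical.

```python
def lerp(x, x0, x1, y0, y1):
    """Linear interpolation of y at x between (x0, y0) and (x1, y1)."""
    if x1 == x0:
        return y0
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def interpolate_phase(level, freq, grid):
    """Interpolate from the measurements bracketing (level, freq).

    grid maps (level_dB, freq_Hz) -> phase in cycles (assumed unwrapped).
    """
    levels = sorted({lv for lv, _ in grid})
    freqs = sorted({f for _, f in grid})
    lv0 = max(lv for lv in levels if lv <= level)   # next lowest level
    lv1 = min(lv for lv in levels if lv >= level)   # next highest level
    f0 = max(f for f in freqs if f <= freq)         # next lowest frequency
    f1 = min(f for f in freqs if f >= freq)         # next highest frequency
    p_lo = lerp(freq, f0, f1, grid[(lv0, f0)], grid[(lv0, f1)])
    p_hi = lerp(freq, f0, f1, grid[(lv1, f0)], grid[(lv1, f1)])
    return lerp(level, lv0, lv1, p_lo, p_hi)

# Hypothetical measurements for one fibre (phase in cycles)
grid = {(50, 900.0): -1.00, (50, 1100.0): -1.40,
        (60, 900.0): -0.90, (60, 1100.0): -1.20}
phase = interpolate_phase(55, 1000.0, grid)
```

The same routine would apply to firing rates; whether interpolation on a logarithmic frequency axis would be more appropriate is left open here.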

As noted above, estimates of the accuracy of any code will be affected by details specific to the method of data collection and to assumptions made in the modelling process. Hence, although plots of our model predictions are accompanied by error bars, these provide a guide to the statistical significance of the effects that we discuss, but do not indicate how small a frequency difference should be detectable by the GP.

Behaviour of four cross-channel phase difference models

Local subtraction model

Perhaps the simplest scheme for estimating the frequency (and therefore the pitch) of pure tones that takes advantage of the PT is one in which an array of “second-order” neurons subtracts the response of each AN fibre from that of a fibre having a “neighbouring” CF. If the function relating phase to CF in the first-order (AN) neurons has a slope that is locally steeper over some region compared with surrounding regions, there will be a corresponding local maximum in the output of the second-order array. This concept is illustrated in Figure 3, where part A shows a schematic phase-vs.-CF curve and part B shows the output of a hypothetical second-order array. It is also illustrated in Figure 2, which shows schematic “neurograms” to 1,000- and 1,200-Hz pure tones, with the results of a local subtraction algorithm shown to the right of the plot. Note that the neurograms in Figure 2 include the effects of peripheral filtering on response phase, but not on its magnitude (firing rate). Note also that the phase-vs.-CF functions shown in Figure 3 and in all remaining plots in this article show CF increasing from low to high along the abscissa, opposite to the convention adopted by Kim et al. and shown in Figure 1.

FIG. 3

The top row shows a schematic implementation of the local subtraction scheme. A Schematic representation of a phase-vs.-log10 CF curve consisting of a shallow portion plus a steeper portion having a region where the slope is locally maximal. B Output of an array of second-order neurons, whereby the instantaneous output of cells at a given CF is subtracted from that of their neighbour. This subtraction is implemented by subtracting the phases of adjacent channels. The bottom row shows the knee point model. C Phase-vs.-CF function consisting of a shallow and a steep portion. D Output of an array of second-order neurons, whereby the instantaneous output of cells at a given CF is subtracted from that of their neighbour. E Output of an array of third-order neurons operating on the difference between adjacent second-order neurons shown in (D).

As mentioned above, for this “local subtraction” scheme to provide a useable measure of input frequency, the phase-vs.-CF function must show a local increase in slope over some fairly narrow range of CFs, and this maximum should vary monotonically with the input frequency. To check whether this was the case, we first calculated arc-tangent fits to the functions relating phase in cycles to log10(CF) for input levels of 50, 70 and 90 dB SPL and for f s ranging from 250 to 2,000 Hz. The fitted functions took the form y = A + B⋅arctan(x/C), where y is the phase of the simulated response of a fibre having CF = x and A, B, and C are free parameters. Those data and fits are shown in the top panels of Figure 4, where the different colours in each panel show data for different signal frequencies at a single level (fits to the 70 dB SPL data were also used to generate the “neurograms” in Figure 2). The bottom panels of Figure 4 show an implementation of the “local subtraction” scheme in which the phases in each “bin” are subtracted from those of its neighbour; the bin widths used here correspond to units of 0.1 log10(CF). It can be seen that these functions do indeed show local maxima, which, although quite broad, generally vary monotonically with frequency, as illustrated for each level by the different curves in Figure 5. However, Figure 5 also shows that these “peak differences” are unfortunately not invariant with level. For example, every estimate at 90 dB SPL is higher than the corresponding value for the same input frequency at 50 or 70 dB SPL. Although the exact size of the level effect is rendered uncertain by the variability in our estimates, both it and the effect of frequency were significant when the peak difference values for three levels (50, 70, 90 dB SPL) and for ten values of f s spaced in units of 0.1 log10 f s were entered into a generalized linear model (level: F 1,28 = 42.02, P < 0.0001; frequency: F 1,28 = 42.02, P < 0.0001).
For this and the other metrics described in this section, we derived a simple summary of the relative effects of f s and level by comparing its variation with f s from 250 to 1,189 Hz, averaged over level, to its variation with level averaged across that same range of f s. For the peak difference measure, the variation with level was 0.7 octaves, more than half the 1.24-octave effect of f s.
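The local subtraction operation can be sketched on a synthetic arc-tangent curve. The sketch uses the fitted form given in the text, but all parameter values are hypothetical, and it makes one extra assumption that should be read with care: the argument of the arc-tangent is taken as log10(CF) minus a hypothetical centre value (TRANSITION_LOG_CF), so that the steep region falls inside the range of CFs considered. That offset is not part of the fitted form as stated in the text.

```python
import math

def arctan_phase(x, A, B, C):
    """Fitted form from the text: y = A + B * arctan(x / C)."""
    return A + B * math.atan(x / C)

def local_subtraction(phases):
    """Absolute phase difference between each CF bin and its neighbour."""
    return [abs(b - a) for a, b in zip(phases, phases[1:])]

BIN = 0.1                    # bin width in log10(CF), as in the text
TRANSITION_LOG_CF = 3.05     # hypothetical centre of the phase transition
A, B, C = -1.0, 0.8, 0.15    # hypothetical fit parameters (phase in cycles)

# CF bins from 100 Hz to 10 kHz in steps of 0.1 log10 units
log_cf = [2.0 + BIN * i for i in range(21)]
phases = [arctan_phase(x - TRANSITION_LOG_CF, A, B, C) for x in log_cf]

# Second-order output and the CF (bin centre) at which it peaks
diffs = local_subtraction(phases)
peak_bin = max(range(len(diffs)), key=diffs.__getitem__)
peak_cf_hz = 10 ** (0.5 * (log_cf[peak_bin] + log_cf[peak_bin + 1]))
```

In this sketch the peak of the difference function simply tracks the centre of the arc-tangent, which is why a level-dependent shift in the fitted curve translates directly into a level-dependent "peak difference".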

FIG. 4

Top panels show the phase of the AN response as a function of characteristic frequency for input frequencies ranging from 250 to 2,000 Hz. Different input frequencies are colour-coded; the dotted lines show the best-fitting arc-tangent functions. Bottom panels are derived from the top panels and show the phase difference between adjacent CF bins, where each bin is 0.1 log10(CF) units wide. The vertical dashed lines intersect the abscissa at CFs equal to the (colour-coded) input frequency. Left, middle, and right columns are for input levels of 50, 70 and 90 dB SPL, respectively.

FIG. 5

Peaks in the “phase difference curves” (bottom panels of Fig. 4) as a function of stimulus frequency for three different levels. The peaks were estimated using the “robust regression” algorithm in Matlab (version 7.7.0, R2008b, The MathWorks, Natick, MA), which uses iteratively reweighted least squares with a bisquare weighting function. Error bars show 95% confidence limits.

Our implementation of the local subtraction scheme is a simplification for at least two reasons. First, the neural responses in any one AN fibre will not occur at exactly the same phase on every cycle, and, even on the same cycle, it is likely that different neurons with the same CF will fire at slightly different times. Second, any second-order neuron that subtracts the inputs from two adjacent channels will impose some smoothing due to variability in synaptic transmission times. This does not negate our assumption that the output of such a subtraction will on average increase monotonically with the phase difference between the two input AN channels. Furthermore, assuming that, for a given f s, the smoothing is the same at all CFs, the output of the local subtraction algorithm will have peaks in the same locations as those shown in the bottom row of Figure 4. It does, however, have one potential implication for our evaluation of the local subtraction scheme, which is based on the subtraction of phases rather than of simulated spike trains. This is that if any component of “neural smoothing” has a time constant that is constant in milliseconds rather than in radians, its effects would be greater for higher-frequency than for lower-frequency tones. As a result, the output of the subtraction operation would have peaks that, compared with those shown in the bottom row of Figure 4, would be relatively higher for low than for high f s.
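The last point can be made concrete with a small worked example. It assumes, purely for illustration, that the neural smoothing acts like a Gaussian kernel with a fixed standard deviation in time; such a kernel attenuates a sinusoidal component at frequency f by exp(−2π²f²σ²). The value of σ used below is hypothetical.

```python
import math

def smoothing_in_cycles(sigma_s, freq_hz):
    """A fixed temporal spread expressed as a fraction of the stimulus period."""
    return sigma_s * freq_hz

def sync_attenuation(sigma_s, freq_hz):
    """Attenuation of a component at freq_hz by Gaussian smoothing of SD sigma_s."""
    return math.exp(-2.0 * (math.pi * freq_hz * sigma_s) ** 2)

SIGMA_S = 1e-4  # hypothetical 0.1-ms smoothing constant

att_250 = sync_attenuation(SIGMA_S, 250.0)     # mild attenuation at low f s
att_2000 = sync_attenuation(SIGMA_S, 2000.0)   # much stronger at high f s
```

With these assumed values the same 0.1-ms spread corresponds to 0.025 cycles at 250 Hz but 0.2 cycles at 2,000 Hz, so the attenuation is far greater for the higher-frequency tone.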

A model based on the local subtraction idea was recently applied by Cedolin and Delgutte (2010) to the responses of cat AN fibers to harmonic complex tones. An advantage of this approach is that the across-channel timing comparisons are performed locally, and, where a complex tone consists of low-numbered harmonics, can be performed between AN fibers responding to the same harmonic (Cedolin and Delgutte 2010). This is important because it generates predictions consistent with the fact that the pitch of low-numbered harmonics is independent of their relative phase (Houtsma and Smurzynski 1990; Shackleton and Carlyon 1994) and because it would allow phase comparisons to be performed even when energy from other sound sources is present elsewhere in the spectrum. We compare Cedolin and Delgutte’s findings to the results of our analyses in “Comparison with other data”. Although neither we, nor they, could specify exactly which neurons might perform this subtraction, Cedolin and Delgutte suggested the dorsal cochlear nucleus as one possible site.

CF-at-knee point model

A second possible code for frequency that makes use of the PT involves estimating f s from the location of the transition between the steep and shallow portions of the functions shown in Figure 1 and in the top panels of Figure 4. This could be implemented by an array of “third-order” neurons that subtract the outputs of adjacent “second-order” neurons (Figure 3C–E). This “third-order” array would show a maximum when the second derivative of the phase-vs.-CF function is maximal. To estimate this “knee point”, we refitted the data shown in Figure 4 with “broken-stick” functions consisting of two straight lines, one of which was constrained to have a zero slope. Examples of this fit for two frequencies and levels are shown in Figure 6. When applied to the same range of f s as shown in Figure 4, this “broken-stick” model yields almost exactly the same RMS error of fit (all frequency and level conditions combined) as the “arc-tangent” model described above: 0.2791 cycles for the broken-stick fit model vs. 0.2796 for the arc-tangent fit. The knee point varies monotonically with f s, but, as Figure 7 shows, is not independent of level. For example, the knee point for a 250-Hz, 90-dB tone occurs at a place in the AN array having a higher CF than that for a 500-Hz, 50-dB tone. The effect of level, averaged across f s, was 0.8 octaves, more than half the 1.33-octave effect of f s.
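A broken-stick fit of the kind described above can be sketched as follows. The text does not state how the fit was optimized, so the grid search over candidate knee points used here is an assumption, as are the function names and the synthetic data; for each candidate knee, the plateau and the slope of the sloping segment are obtained by ordinary least squares.

```python
def broken_stick(x, knee, plateau, slope):
    """Zero-slope segment at `plateau` for x >= knee; straight line below the knee."""
    return plateau + slope * min(x - knee, 0.0)

def fit_broken_stick(xs, ys, n_grid=150):
    """Grid search over the knee; least-squares plateau and slope for each candidate."""
    best = None
    lo, hi = min(xs), max(xs)
    n = len(xs)
    for i in range(1, n_grid):
        knee = lo + (hi - lo) * i / n_grid
        u = [min(x - knee, 0.0) for x in xs]       # regressor: distance below the knee
        mu, my = sum(u) / n, sum(ys) / n
        suu = sum((a - mu) ** 2 for a in u)
        if suu == 0.0:
            continue
        slope = sum((a - mu) * (b - my) for a, b in zip(u, ys)) / suu
        plateau = my - slope * mu
        sse = sum((plateau + slope * a - b) ** 2 for a, b in zip(u, ys))
        if best is None or sse < best[0]:
            best = (sse, knee, plateau, slope)
    sse, knee, plateau, slope = best
    return knee, plateau, slope, (sse / n) ** 0.5   # last value is the RMS error

# Synthetic, noise-free example: x is log10(CF), y is phase in cycles
xs = [2.0 + 0.05 * i for i in range(31)]
ys = [broken_stick(x, 2.8, -0.5, 3.0) for x in xs]
knee, plateau, slope, rms = fit_broken_stick(xs, ys)
```

Here the sign convention (phase falling below the plateau as CF drops below the knee) is a choice made for the sketch; with real data the resolution of the knee estimate is limited by the grid spacing and by the spacing of the available CFs.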

FIG. 6

Broken-stick fits to phase transition data for f s = 250 (blue) and 500 Hz (red) and for input levels of 60 dB SPL (circles) and 90 dB SPL (asterisks). The broken-stick fits are shown by dotted and solid lines for levels of 60 and 90 dB SPL, respectively.

FIG. 7

The “knee point” as a function of level and input frequency. The curves shown are derived from the best-fitting model described in the text for input frequencies spaced 3 semitones apart, starting at 250 Hz, and with levels of 50, 70 and 90 dB SPL.

Phase transition slope model

Loeb et al. (1983) proposed a model consisting of an array of detectors in the medial superior olive, with each detector receiving inputs from pairs of AN fibers that were separated along the BM by a fixed amount. They pointed out that the maximum output would occur in those detectors whose input AN fibers responded with phases that differed by one wavelength (although they also noted that optimal discrimination might be based on other phase differences). This “critical distance” depends on the slope of the function relating phase in cycles to log10(CF). As shown in Figures 1, 4, and 6, this slope is not constant as a function of CF but, rather, increases markedly for CFs below the knee point. Here, we focus on the slope of the steep portion as a possible cue for f s. This slope will depend on the bandwidths of the peripheral filters through which the stimulus is passed (Heinz et al. 2001; Shera et al. 2010); a filter introduces a delay as the frequency of the input passes through its resonance, and the size of this delay is greater for narrow than for broad filters. As the slope is measured in terms of the logarithm of CF (which is roughly proportional to the distance along the BM), it will vary with the bandwidth of the analysing filters relative to their centre frequencies. This last point can be explained by considering a 1,000-Hz tone passed through a filter centred on, say 1,200 Hz and a 2,000-Hz tone passed through a filter centred on 2,400 Hz. If the bandwidths of the filters are a constant proportion of their centre frequencies, then each tone will be at the same relative position on the filter’s transfer function and the phase delays (in cycles) will be equal.

As shown in Figure 8, the slopes vary as functions both of level and of f s. The slopes are slightly shallower (less negative) at lower frequencies, presumably because, in the GP, the relative bandwidths of frequency tuning curves (e.g. as summarized by Shera et al. 2002) are broader at low frequencies. The slopes also become shallower (less negative) with increases in level, as would be expected with the decrease in frequency selectivity at high levels (Rhode 1971; Robles et al. 1986). Both of these effects were sufficiently reliable to produce significant main effects when the slope values for three levels (50, 70, 90 dB SPL) and for ten values of f s spaced in units of 0.1 log10 f s were entered into a generalized linear model (effect of f s: F 1,28 = 28.9, P < 0.00001; effect of level: F 1,28 = 22.1, P < 0.00001). Averaged across f s, slopes varied by a factor of 1.67 across level, approaching the factor of 2.07 for the variation with f s. Hence, any attempt to estimate f s based on a metric related to the slope, for example by estimating the distance along the AN array between fibres that responded in phase, would be strongly influenced by the input level.
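The dependence of a slope-based metric on level can be illustrated by converting a slope into the CF separation over which the phase changes by one cycle. This is a minimal sketch that assumes a locally linear phase-vs.-log10(CF) function; the function name and the slope value are hypothetical, and only the factor of 1.67 is taken from the text.

```python
import math

def one_cycle_separation_octaves(slope_cycles_per_log10cf, phase_diff_cycles=1.0):
    """CF separation (in octaves) over which phase changes by phase_diff_cycles."""
    decades = phase_diff_cycles / abs(slope_cycles_per_log10cf)
    return decades * math.log2(10.0)

LEVEL_FACTOR = 1.67       # variation in slope across level reported in the text
steep_slope = -3.0        # hypothetical slope at a low level, cycles/log10(CF)
shallow_slope = steep_slope / LEVEL_FACTOR   # shallower slope at a high level

sep_low_level = one_cycle_separation_octaves(steep_slope)
sep_high_level = one_cycle_separation_octaves(shallow_slope)
```

Because the separation is inversely proportional to the slope, a 1.67-fold change in slope with level produces a 1.67-fold change in the separation between fibres responding in phase.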

FIG. 8

Slope of the phase transition as a function of level and input frequency. The curves shown are derived from the best-fitting model described in the text for input frequencies spaced 3 semitones apart, starting at 250 Hz, and with levels of 50, 70 and 90 dB SPL.

Model involving a comparison of phase above and below the knee point

One potential decoding method is illustrated in Figure 9 which shows “broken-stick” fits to stimuli at two frequencies and levels (shown in Fig. 6), shifted so that the horizontal portions are aligned. The shift is performed because the animal has, of course, no idea about “absolute” phase and can only compare the relative phase across fibers. It can be seen that the CF at which the phase is π radians lower than that at the knee point (indicated by the black horizontal dotted lines) appears to be similar at the two levels for each frequency. Figure 10 shows this value, which we term the “π shift point”, as a function of f s for input levels of 50, 70 and 90 dB SPL. It increases monotonically with f s and has modest error bars. A generalized linear model revealed a significant effect of f s (F 1,28 = 524, P < 0.0001), but not of level (F 1,28 = 0.23, P = 0.633). Unlike the other measures described here, the effect of level, averaged across f s, was much smaller (0.23 octaves) than the effect of f s, which was 2.19 octaves; this, in turn, was close to the range of input frequencies (250–1,189 Hz) which was 2.24 octaves.
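Given a broken-stick fit, the “π shift point” follows directly from the knee point and the steepness of the sloping segment. The sketch below assumes the sloping segment is a straight line in log10(CF) and expresses steepness as a positive magnitude in cycles per log10(CF); the function name and the two parameter sets are hypothetical and were chosen only to show how a higher knee combined with a shallower slope can leave the π shift point unchanged.

```python
import math

def pi_shift_point_hz(knee_hz, steepness_cycles_per_log10cf):
    """CF at which the phase is half a cycle (pi radians) lower than at the knee."""
    log_cf = math.log10(knee_hz) - 0.5 / steepness_cycles_per_log10cf
    return 10.0 ** log_cf

# Hypothetical low-level fit: lower knee, steeper transition
low_level = pi_shift_point_hz(knee_hz=10 ** 3.0, steepness_cycles_per_log10cf=2.0)

# Hypothetical high-level fit: higher knee, shallower transition
high_level = pi_shift_point_hz(knee_hz=10 ** 3.15, steepness_cycles_per_log10cf=1.25)
```

In this constructed example both parameter sets place the π shift point at the same CF, which is the kind of trade-off between knee point and slope that would be needed for the metric to be level-invariant.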

FIG. 9

“Broken-stick” fits to the phase transition curves for input frequencies of 250 and 500 Hz and levels of 60 and 90 dB SPL. The 90-dB curves have been shifted upwards so that their horizontal portion overlaps with that of the 60-dB curves. The horizontal dashed lines show the phases that are half a cycle lower than at the knee point.

FIG. 10

Value of the “π shift point” as a function of level and input frequency. The curves shown are derived from the best-fitting model described in the text for input frequencies spaced 3 semitones apart, starting at 250 Hz, and with levels of 50, 70 and 90 dB SPL.

A matter that would still need to be resolved is how the auditory system would extract this metric. A possibly important fact is that the phase at and above the knee point is fairly constant over a wide range of CFs, and so is the phase at which most AN fibres are firing. Therefore, “all” that the auditory system has to do is to find those fibres that are firing out of phase (by π radians) with the “most common” phase. However, phase is only “fairly” constant at CFs above the knee point. The data of Kim et al. (Fig. 1) show that, if one looks over a much wider range of CFs than in the data of Palmer and Shackleton, the phase delay does decrease gradually as one moves to more basal regions. Therefore, the issue arises as to which portion of the AN array is used as the “fairly constant” reference. An alternative would be for the system to measure the CF corresponding to the knee point, perhaps using a simple metric as in Figure 3C–E, and to exploit this information to identify the π shift point. An important drawback is that the “π shift point” and the knee point are often far apart (more than an octave difference in CF; Fig. 9), raising the issue of how the brain would extract this cue when more than one sound is present. Furthermore, when more than one tone is present, basal fibres will respond to a mixture of both, and the auditory system would have to extract the temporal pattern corresponding to each one. Evidence from the perception of spectrally overlapping mixtures of unresolved harmonics suggests that the brain is poor at extracting two periodicities from the same set of AN fibres (Carlyon 1996a, b).


Use of firing rate profiles

The analyses described in “Level dependence of phase transition cues for pure-tone pitch” focused solely on across-channel timing information. It is, however, possible that the auditory system uses a coding scheme that combines information on both the timing and the rate of firing across channels. Although it is beyond the scope of this article to perform such a combined analysis, some preliminary considerations suggest that it may not be useful in overcoming the effects of level on the various codes considered above. Figure 11 shows the firing rate profiles, defined as firing rate as a function of CF, normalized so that the spontaneous and maximum rates for each neuron were scaled to 0 and 1, respectively. Those data were then fitted by second-order polynomials. The polynomial fits, although imperfect, capture the main features of the data. Note that although the profile at 60 dB SPL has a bandpass characteristic, the one at 90 dB SPL appears largely flat over the >3 octave range of CFs from which measures were obtained. Saturation of AN rate-level functions almost certainly accounts for at least part of this effect, and it is possible that a bandpass curve could be obtained by considering only those neurons with low spontaneous firing rates, which are relatively resistant to such effects (cf. Sachs et al. 1983). However, this would invoke selective processing of low spontaneous rate fibres, the necessity of which has been a major reason for questioning the validity of purely rate–place accounts (e.g. Carney 1994).
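The normalization and polynomial fit described above can be sketched as follows. The rates are hypothetical, the abscissa is assumed here to be CF in octaves relative to the stimulus frequency, and the fit is an ordinary least-squares quadratic solved through the normal equations; none of these choices is confirmed by the text beyond the use of second-order polynomials.

```python
def normalize_rate(rate, spontaneous, maximum):
    """Scale a firing rate so that spontaneous -> 0 and maximum -> 1."""
    return (rate - spontaneous) / (maximum - spontaneous)

def fit_quadratic(xs, ys):
    """Least-squares fit y = c0 + c1*x + c2*x**2 via the 3x3 normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    for col in range(3):   # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

# Hypothetical fibres: CF in octaves re the stimulus frequency, rates in spikes/s
octaves_re_fs = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
spont = [5.0, 20.0, 60.0, 1.0, 40.0, 10.0, 70.0]
peak = [150.0, 200.0, 250.0, 120.0, 220.0, 180.0, 260.0]
driven = [40.0, 110.0, 210.0, 115.0, 190.0, 100.0, 120.0]

profile = [normalize_rate(d, s0, p) for d, s0, p in zip(driven, spont, peak)]
c0, c1, c2 = fit_quadratic(octaves_re_fs, profile)
```

A negative quadratic coefficient corresponds to a bandpass profile, whereas a coefficient near zero would correspond to the largely flat profile described for the 90 dB SPL data.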

FIG. 11

Normalized firing rate as a function of CF. Input: 1,000 Hz tone with a level of 60 dB SPL (A) or 90 dB SPL (B). The vertical blue line indicates CF = input frequency. The solid black lines are second-order polynomial fits.

Comparison with other data

van der Heijden and Joris (2006) measured the amplitude and phase characteristics at different apical locations of the cat cochlea by recording the responses to inharmonic complex tones from cat AN fibers having a range of CFs. Their technique was designed to remove nonlinear contributions to the AN response arising from the transduction process (e.g. half-wave rectification and firing rate saturation). Measurements were obtained for stimuli at relatively low levels (within 35 dB of each fibre’s threshold) and were not presented separately for different levels. Their data, like those presented here, reveal phase-vs.-CF functions that are quite flat for CFs much higher than the input frequency, combined with a steeply sloping portion around CF. As also shown here, they found that the slope is steeper at higher frequencies: approximately −2.1 and −3.5 cycles/log10(CF) at 200 and 1,000 Hz, respectively. These slopes are somewhat steeper than those shown in Figure 8, probably because of the lower levels that they used and/or species differences. A visual inspection of some cat data presented by Kim et al. (1979) reveals a slope of about −2 cycles/log10(CF) at L s = 45 dB SPL for f s = 620 Hz. The data of Kim et al. (1980), plotted in Figure 1, show substantially steeper slopes of about 6 cycles/log10(CF), presumably reflecting the higher f s of 2,100 Hz and the marked increase in the Q 10 dB values of cat AN fibers as CF is increased above about 1,000 Hz (Palmer 1995; Shera et al. 2002). In general, the results of Kim et al. and of van der Heijden and Joris agree reasonably well with those described here despite the fact that both of those earlier data sets were obtained in the cat, whereas Palmer and Shackleton’s data were obtained in the GP. In addition, van der Heijden and Joris noted that they deliberately restricted their measures to fairly low levels because increasing level could have a “drastic” effect on the phase measurements. This is consistent with evidence from recordings of individual neurons innervating the cochlear apex showing that level affects group delay in a frequency-dependent manner (Versteegh et al. 2011) and with the present evidence that PTs are strongly affected by level. Hence, although there will undoubtedly be quantitative differences between species, the general form of phase transition data is similar.
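For readers who wish to compare slopes across data sets, the unit used above (cycles per log10 unit of CF) can be obtained from a straight-line fit of unwrapped phase against log10(CF). The sketch below is illustrative only. The function name `pt_slope` is hypothetical, and it assumes that only points on the steep portion of the phase-vs.-CF function are supplied and that phase has already been unwrapped and expressed in cycles.

```python
import math

def pt_slope(cfs_hz, phases_cycles):
    """Least-squares slope of phase (cycles) against log10(CF)."""
    x = [math.log10(cf) for cf in cfs_hz]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(phases_cycles) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, phases_cycles))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Synthetic points lying exactly on a line of slope -3.5 cycles/log10(CF).
demo_cfs = [500.0, 700.0, 1000.0, 1400.0]
demo_phases = [-3.5 * math.log10(cf / 1000.0) for cf in demo_cfs]
demo_slope = pt_slope(demo_cfs, demo_phases)
```

The value returned depends strongly on which CFs are treated as belonging to the steep portion, so slopes obtained by visual inspection of published figures should be regarded as approximate.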

Cedolin and Delgutte (2010) measured the response of individual AN fibers to harmonic complex tones whose fundamental frequency (F 0) ranged from about 0.2 to about 0.7 times the neuron’s CF. They invoked the principle of “scaling invariance” (Zweig 1976) to calculate, for each set of such measurements, the predicted amplitude and phase response of an array of neurons, with different CFs, to a complex of a given F 0. They then calculated not only the predicted rate–place profile but also a metric similar to the simple “local subtraction” method described above; the response of each “channel” was instantaneously subtracted from that of its neighbour and the absolute value of this difference then integrated over time. The variation in this summary statistic across CF, termed the “mean absolute spatial derivative (MASD)”, produced peaks at CFs corresponding to the low-numbered harmonics. They found that for CFs between 1,350 and 2,800 Hz, the MASD provided a better representation of the F 0 than did a simple rate–place profile, but that the reverse was true at higher CFs, presumably due to the roll-off in phase locking at high frequencies. In addition, they estimated the best frequency (BF) of each neuron, based both on the rate–place and MASD profiles, and reported that these BFs varied less consistently with level for the MASD-based than for the rate-based measure. They also concluded that the accuracy of the F 0 representation did deteriorate slightly with increasing stimulus level, but that the significance of this deterioration was less than that observed for the rate–place model.
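The verbal description of the MASD given above can be written as a short sketch. It follows only that description (subtract neighbouring channels sample by sample, take the absolute value and average over time) and should not be read as Cedolin and Delgutte’s implementation. The function name `masd`, the representation of each channel as a list of equally spaced samples and the synthetic sinusoidal “responses” are assumptions.

```python
import math

def masd(responses):
    """Mean absolute difference between each pair of neighbouring channels.

    responses: list of channels ordered by CF; each channel is a list of
    response samples taken at the same time points.
    Returns one value per pair of adjacent channels.
    """
    out = []
    for lower, upper in zip(responses, responses[1:]):
        total = sum(abs(u - l) for l, u in zip(lower, upper))
        out.append(total / len(lower))
    return out

# Synthetic example: two channels in phase, and a third half a cycle out of phase.
n_samples = 100
in_phase = [math.sin(2 * math.pi * k / n_samples) for k in range(n_samples)]
anti_phase = [-v for v in in_phase]
demo_masd = masd([in_phase, in_phase, anti_phase])
```

In this toy case, the first pair of channels gives a value of zero and the second pair a large value, illustrating why the metric peaks where the phase changes rapidly across CF.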

Cedolin and Delgutte’s conclusion that their MASD model provided a relatively level-independent cue to pitch seems, at least at first sight, to be at variance with the substantial level dependence observed in our analyses of Palmer and Shackleton’s GP data and with van der Heijden and Joris’s observation that level can have a large effect on the phase transitions measured in the cat. One way of reconciling these two sets of findings would be if the response of the BM were more linear for complex tones than for the pure tones used by Palmer and Shackleton. In addition, the range of sound levels studied by Cedolin and Delgutte was generally lower than ours; 80% of the data they reported were obtained at levels between 25 and 65 dB SPL per component, whereas we report data for levels between 50 and 90 dB SPL. It is also worth noting that Cedolin and Delgutte focused on the finding that the PT cue was less level-dependent than the rate–place representation, a finding that is not inconsistent with our analysis. For example, Figure 11 shows that the rate vs. CF profile at 90 dB SPL is flat, whereas the “local subtraction” metric shown in Figure 5 does show some variation with input frequency at that level. Our focus is on whether there is an effect of level at all and on whether this effect is substantially less than the effect of input frequency.

Scaling invariance

Our analysis also sheds light on the validity of the “scaling invariance” assumption and on the implications of violations of this assumption for phase transition cues. According to the principle of scaling invariance, any PT curve in Figure 4 for a given value of f s can be obtained by taking the curve for a frequency of xf s and shifting it horizontally by log10 x. Figure 12 illustrates this point and shows the same general pattern for signal levels of 50, 70 and 90 dB SPL (shown in different panels). In each panel, the thin red curve shows the arc-tangent fit to the PT curve for f s = 500 Hz, and the thick red curve shows that function shifted to the right by 1.6 octaves. The thin dotted black curve shows the PT response to a pure tone 1.6 octaves higher than 500 Hz (i.e. f s = 1,516 Hz), and the thick dotted black curve shows this function shifted vertically so that it coincides with the red curve at its rightmost point, corresponding to a CF of 2 kHz, in order to facilitate visual comparison. If scaling invariance holds, then the thick red and black curves should be identical, which they are not. Rather, at a point 1.6 octaves below that at which we forced them to have the same phase, they differ by between 0.45 and 0.54 cycles. What this means is that over a frequency range (1.6 octaves) which was the same as that used by Cedolin and Delgutte to study the responses of each cat AN fibre, there is a substantial deviation from scaling invariance as measured in the GP. Although a marked deviation from scaling invariance has been observed when comparing the phase lag at the base and apex of the cat cochlea (van der Heijden and Joris 2006), we cannot be sure that the failure of scaling invariance over the relatively local range of 1.6 octaves, observed here for the GP, would necessarily occur for the cat. However, it is worth noting that the variation in the relative bandwidths of frequency tuning curves with CF is roughly similar in the two species (Shera et al. 2002).
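The comparison illustrated in Figure 12 can be sketched in a few lines. This is a schematic illustration under stated assumptions, not the analysis code used here: the function names are hypothetical, each PT curve is represented as a function of log10(CF) returning phase in cycles, and the arc-tangent curves in the example use made-up parameters rather than the fitted values from Figure 4.

```python
import math

def shift_curve(curve, x):
    """Scaling-invariant prediction for a tone at x*f_s from the curve for f_s.

    curve: function mapping log10(CF) to phase in cycles.
    The prediction is the same curve shifted rightwards by log10(x).
    """
    shift = math.log10(x)
    return lambda log_cf: curve(log_cf - shift)

def invariance_deviation(curve_low, curve_high, x, cf_ref_hz, octaves_below):
    """Phase difference (cycles) between prediction and data below the reference CF.

    The measured curve for x*f_s is first shifted vertically so that it
    coincides with the prediction at cf_ref_hz, as in Figure 12.
    """
    predicted = shift_curve(curve_low, x)
    ref = math.log10(cf_ref_hz)
    offset = predicted(ref) - curve_high(ref)
    probe = ref - octaves_below * math.log10(2.0)
    return predicted(probe) - (curve_high(probe) + offset)

# Hypothetical arc-tangent curves (parameters are illustrative, not fitted values).
ratio = 2.0 ** 1.6
low = lambda u: math.atan(5.0 * (u - math.log10(500.0))) / (2.0 * math.pi)
scaled = lambda u: low(u - math.log10(ratio)) + 0.3   # obeys scaling invariance
steeper = lambda u: math.atan(8.0 * (u - math.log10(500.0 * ratio))) / (2.0 * math.pi)

dev_invariant = invariance_deviation(low, scaled, ratio, 2000.0, 1.6)
dev_violating = invariance_deviation(low, steeper, ratio, 2000.0, 1.6)
```

A deviation near zero indicates that the two curves differ only by a horizontal shift and a vertical offset. A curve whose slope changes with input frequency, as in the second example, produces a non-zero deviation of the kind reported above.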

FIG. 12

Illustration of the failure of scaling invariance. Each panel shows fits to the data obtained at one level (50 dB SPL (A), 70 dB SPL (B) and 90 dB SPL (C)). In each panel, the thin red line and the thin dotted black line show the arc-tangent fits from Figure 4 for input frequencies of 500 and 1,516 Hz, respectively; these frequencies are 1.6 octaves apart. The thick red solid line shows the fit to the 500-Hz data shifted rightwards by 1.6 octaves. The thick dotted black line shows the fit to the 1,516-Hz data shifted upwards so that it coincides with the shifted 500-Hz fit at a CF of 2,000 Hz. If scaling invariance held, these two curves would overlap completely.

Level dependence of across-channel phase cues

In the Introduction, we noted that the nonlinearity of the mechanical filtering properties of the BM made it plausible that PTs, which arise from those properties, would also be nonlinear and vary with input level. The analyses described in “Level dependence of phase transition cues for pure-tone pitch” generally support this prediction. It is possible that the phase effects we observed had not only a mechanical but also a neural basis. One way in which the amplitude response of the BM to a pure tone could affect the PT is through an effect of level on the synaptic delay between inner hair cells (IHCs) and the AN. Recently, Versteegh et al. (2011) interpreted some of their data on the level dependence of AN phase responses as evidence that larger input amplitudes may cause AN fibres to fire earlier, perhaps as a result of the effects of sound level on vesicle release at the IHC synapse. If this is true, then when the inputs to fibres with CFs remote from f s are reduced by BM filtering at a given input level, those AN fibres may be subject to slightly different synaptic delays, thereby contributing to the variation in the AN response phase as a function of CF. Furthermore, because BM filtering itself depends on the input level, the variation in these “synaptic” effects across CF may also be level-dependent.

Summary of PT cue analysis

  1. For a given f s, the function relating AN response phase to log10(CF) is very shallow for CFs much higher than f s and substantially steeper for CFs around and below f s. This PT curve shows a local maximum, the position of which varies monotonically with f s. However, this position also varies with level, and the average effect of a 40-dB level change on its position is more than half that produced by changing input frequency over a 2.24-octave range.

  2. The “knee point” (the position on the PT curve at which there is a marked increase in slope) also varies monotonically with f s and is also level-dependent. The average effect of a 40-dB level change is more than half that produced by changing input frequency over a 2.24-octave range.

  3. The slope of the PT curve varies slightly with f s, presumably as a result of the variation in relative tuning with CF in the GP. It becomes shallower with increasing input level, presumably reflecting the worsening of frequency selectivity with increasing level observed here. The average effect of a 40-dB level change approached that produced by changing input frequency over a 2.24-octave range.

  4. Models based on coincidence detection are dependent on the slope of the phase-vs.-CF function. Over the steep part of that function, this slope is level-dependent, and its variation with f s is slight and depends on variations in relative bandwidth with CF, a variation that may be absent above 500 Hz in humans (Baker and Rosen 2006). Therefore, to the extent that such models assume that phase differences are compared over the steep part of the function, they do not predict a robust, level-independent code for frequency.

  5. The CF at which the phase of the AN response is exactly out of phase with that at the knee point varies monotonically with f s. Unlike the other potential codes described here, its variation with level was not significant overall. The average effect of a 40-dB level change was only about one tenth that produced by changing input frequency over a 2.24-octave range. However, this code requires the auditory system to compare channels with well-separated CFs, and it is not obvious how it could be used when more than one tonal component is present.

Overall summary and conclusions

Our analysis of Palmer and Shackleton’s (2009) GP data suggests that straightforward methods for extracting pure-tone frequency from PTs, such as the “local subtraction” and “knee point” metrics, are likely to be too strongly affected by level to provide a robust code. Perhaps the strongest candidate code that we observed was the CF at which neurons fire out of phase with those at and basal to the knee point; this metric varied substantially and monotonically with f s and was not substantially influenced by level. However, this code requires a comparison of AN fibres having quite different CFs, and it is less than clear how it could be implemented by the auditory system, particularly when more than one sound is present. A caveat that the present analyses share with most other published studies is that the neural responses were obtained in an anaesthetized preparation.

Overall, our conclusion is therefore that although across-channel timing information provides a theoretically attractive neural code for pure-tone frequency (subjectively, pitch), the way in which it is extracted is unlikely to be straightforward. One possible conclusion is that across-channel timing cues are not used at all. At present, the most likely alternatives are either that the auditory system can estimate a quantity akin to the π-shift point, which entails comparing phase information across quite well-separated portions of the AN array, or that it uses one of the more “local” codes described here whilst controlling for the sometimes substantial effects of overall level, for example, by estimating total firing rate.