Modelling approach
The phase transition data reported by Kim et al. (1980), and shown in Figure 1, provide an important demonstration of the presence of across-channel timing information in the AN. Here, we report the results of a quantitative treatment of some more recent data, reported by Palmer and Shackleton (2009), which those authors generously shared with us. They measured the amplitude and phase responses of individual AN fibres of the anaesthetized guinea pig (GP) in response to tones of different frequencies and levels. Fortunately, those data were obtained for a range of fibre CFs, permitting a reanalysis in terms of the response across CFs to a tone of a given frequency and level. Our analysis differs from previous approaches that were based on simulated neural responses produced using models of the auditory periphery (Shamma 1985; Carney et al. 2002) or that estimated the response of neurons of different CFs to a given stimulus by recording from an individual neuron in response to a range of stimuli (Cedolin and Delgutte 2010). Our approach allowed us to obtain direct measurements of the variation of firing phase with CF and to avoid having to make some of the simplifying assumptions associated with those alternative approaches.
Another feature of our approach is that we do not focus on trying to model the magnitude of behaviourally obtained frequency difference limens (FDLs), for two reasons, both of which apply quite generally to attempts to model perception using neural responses. First, the absolute size of predicted FDLs will be influenced by the noise present in the physiological data on which the predictions are based, by the need to combine measurements from several animals, and by assumptions concerning the number of neurons involved and the degree of statistical independence of their responses. Second, the neural data are from a species whose behavioural FDLs are larger than in humans, may vary with frequency in a different way, and may be influenced by non-sensory factors such as attention (Heffner et al. 1971; Hienz et al. 1993). Instead, we concentrate on determining whether across-channel timing cues could provide a level-independent code for the frequency of a pure tone. We do so because, although the pitch of pure tones in humans does vary with sound level, this variation is small, of the order of only a few per cent (Stevens 1935; Walliser 1969; Verschuure and van Meeteren 1975). A cue that varied substantially with the overall level would not, we assume, provide a useful means of encoding pitch, and we make the additional assumption that this will be true for the GP as well as for humans. Specifically, we compare the extent to which the predictions of each potential coding scheme vary with input level and with input frequency, f_s. Note that although “noise” in the data may obscure the effects of either f_s or level, it is unlikely to cause effects that are not significant to appear so. Note also that we are comparing the size of two effects (f_s and level), both of which are subject to the same measurement noise. Hence, the noise is unlikely to distort our estimate of the relative size of the two effects. Our analysis shows that many ways of processing PT cues, described in detail in “Behaviour of four cross-channel phase difference models”, are, in fact, strongly level-dependent. As such, those methods are unlikely, by themselves, to provide a robust level-independent code for the pitch of pure tones. A possible exception is a scheme that compares AN fibres that have different CFs and that respond out of phase to each other (cf. Carney et al. 2002); the plausibility of this scheme is discussed.
The analyses that follow use data obtained by Palmer and Shackleton (2009), and the reader is referred to that original article for experimental details. Stimuli were 50-ms sinusoids presented at a rate of 5 Hz to urethane-anaesthetized adult pigmented GPs. The phases and firing rates of the responses to these stimuli were then extracted, as a function of AN fibre CF, for a range of input levels. The data were typically obtained at 10-dB intervals; in the majority of cases, these levels corresponded (within 1 dB) to those specified in our analyses. The stimulus frequency spacing was in steps of 1.5 semitones from the CF of the considered AN fibre. Where the levels and frequencies specified in an analysis did not correspond exactly to those available in the data, phases and firing rates for a given input were estimated by linear interpolation between the values at the next highest and next lowest levels and frequencies at which measurements were obtained.
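The interpolation step described above can be sketched as follows. This is a minimal illustration, not the code used for the analyses: the function names, the dictionary layout `table[(level_dB, freq_Hz)]`, and the choices to interpolate linearly in Hz and in unwrapped phase are all assumptions made for the example.

```python
from bisect import bisect_right

def lerp(x, x0, x1, y0, y1):
    """Linear interpolation of y at x between (x0, y0) and (x1, y1)."""
    if x1 == x0:
        return y0
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def bracket(grid, value):
    """Nearest measured values at or below and above `value`.

    `grid` must be sorted ascending; values outside the grid are clamped,
    so nothing is extrapolated.
    """
    i = bisect_right(grid, value)
    return grid[max(i - 1, 0)], grid[min(i, len(grid) - 1)]

def interpolate(table, levels, freqs, level, freq):
    """Estimate a phase or firing rate at (level, freq).

    `table` maps (level_dB, freq_Hz) to a measured value (hypothetical
    layout). Interpolates first along frequency, then along level.
    """
    l0, l1 = bracket(levels, level)
    f0, f1 = bracket(freqs, freq)
    at_l0 = lerp(freq, f0, f1, table[(l0, f0)], table[(l0, f1)])
    at_l1 = lerp(freq, f0, f1, table[(l1, f0)], table[(l1, f1)])
    return lerp(level, l0, l1, at_l0, at_l1)
```

When the requested level and frequency coincide with a measured pair, the function returns the measured value unchanged.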
As noted above, estimates of the accuracy of any code will be affected by details specific to the method of data collection and to assumptions made in the modelling process. Hence, the error bars that accompany plots of our model predictions provide a guide to the statistical significance of the effects that we discuss, but do not indicate how small a frequency difference should be detectable by the GP.
Behaviour of four cross-channel phase difference models
Local subtraction model
Perhaps the simplest scheme that takes advantage of the PT to estimate the frequency (and therefore the pitch) of pure tones is one in which an array of “second-order” neurons subtracts the response of each AN fibre from that of a fibre having a “neighbouring” CF. If the function relating phase to CF in the first-order (AN) neurons has a slope that is locally steeper over some region compared with surrounding regions, there will be a corresponding local maximum in the output of the second-order array. This concept is illustrated in Figure 3, where part A) shows a schematic phase-vs.-CF curve and where part B) shows the output of a hypothetical second-order array. It is also illustrated in Figure 2, which shows some schematic “neurograms” for 1,000- and 1,200-Hz pure tones, with the results of a local subtraction algorithm shown to the right of the plot. Note that the neurograms in Figure 2 include the effects of peripheral filtering on response phase, but not on its magnitude (firing rate). Note also that the phase-vs.-CF functions shown in Figure 3 and in all remaining plots in this article show CF increasing from low to high along the abscissa, opposite to the convention adopted by Kim et al. and shown in Figure 1.
As mentioned above, for this “local subtraction” scheme to provide a useable measure of input frequency, the phase-vs.-CF function must show a local increase in slope over some fairly narrow range of CFs, and this maximum should vary monotonically with the input frequency. To check whether this was the case, we first calculated arc-tangent fits to the functions relating phase in cycles to log10(CF) for input levels of 50, 70 and 90 dB SPL and for f_s ranging from 250 to 2,000 Hz. The fitted functions took the form y = A + B⋅arctan(x/C), where y is the phase of the simulated response of a fibre having CF = x and A, B, and C are free parameters. Those data and fits are shown in the top panels of Figure 4, where the different colours in each panel show data for different signal frequencies at a single level (fits to the 70 dB SPL data were also used to generate the “neurograms” in Fig. 2). The bottom panels of Figure 4 show an implementation of the “local subtraction” scheme in which the phases in each “bin” are subtracted from those of its neighbour; the bin widths used here correspond to units of 0.1 log10(CF). It can be seen that these functions do indeed show local maxima, which, although quite broad, generally vary monotonically with frequency, as illustrated for each level by the different curves in Figure 5. However, Figure 5 also shows that these “peak differences” are unfortunately not invariant with level. For example, every estimate at 90 dB SPL is higher than the corresponding value for the same input frequency at 50 or 70 dB SPL. Although the exact size of the level effect is rendered uncertain by the variability in our estimates, both it and the effect of frequency were significant when the peak difference values for three levels (50, 70, 90 dB SPL) and for ten values of f_s spaced in units of 0.1 log10(f_s) were entered into a generalized linear model (level: F(1,28) = 42.02, P < 0.0001; frequency: F(1,28) = 42.02, P < 0.0001). For this and the other metrics described in this section, we derived a simple summary of the relative effects of f_s and level by comparing its variation with f_s from 250 to 1,189 Hz, averaged over level, to its variation with level averaged across that same range of f_s. For the peak difference measure, the variation with level was 0.7 octaves, more than half the 1.24-octave effect of f_s.
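The “local subtraction” operation on a fitted arc-tangent function can be sketched as below. This is an illustration under stated assumptions, not the analysis code: x is taken to be CF in Hz, the parameter values (A = 0, B = -1, C = 10^3.05 Hz) are hypothetical rather than fitted, and the peak is reported at the geometric midpoint of the two bins with the largest absolute phase difference.

```python
import math

def arctan_phase(cf_hz, a, b, c):
    """Phase (cycles) from the fitted form y = A + B*arctan(x/C), x = CF in Hz."""
    return a + b * math.atan(cf_hz / c)

def local_subtraction(phases):
    """Absolute phase difference between each bin and its upper neighbour."""
    return [abs(p1 - p0) for p0, p1 in zip(phases, phases[1:])]

def peak_cf(log_cfs, diffs):
    """CF (Hz) at the geometric midpoint of the bin pair with the largest difference."""
    i = max(range(len(diffs)), key=diffs.__getitem__)
    return 10 ** ((log_cfs[i] + log_cfs[i + 1]) / 2)

# Bins spaced 0.1 log10(CF) apart, from 100 Hz to 10 kHz.
log_cfs = [2.0 + 0.1 * k for k in range(21)]
# Hypothetical parameters; C is placed midway between two bins.
phases = [arctan_phase(10 ** x, a=0.0, b=-1.0, c=10 ** 3.05) for x in log_cfs]
diffs = local_subtraction(phases)
print(f"peak difference near CF = {peak_cf(log_cfs, diffs):.0f} Hz")
```

On a log10(CF) axis the slope of A + B⋅arctan(CF/C) is steepest at CF = C, so in this sketch the peak of the subtraction output tracks the parameter C.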
Our implementation of the local subtraction scheme is a simplification for at least two reasons. First, the neural responses in any one AN fibre will not occur at exactly the same phase on every cycle, and, even on the same cycle, it is likely that different neurons with the same CF will fire at slightly different times. Second, any second-order neuron that subtracts the inputs from two adjacent channels will impose some smoothing due to variability in synaptic transmission times. This does not negate our assumption that the output of such a subtraction will on average increase monotonically with the phase difference between the two input AN channels. Furthermore, assuming that, for a given f_s, the smoothing is the same at all CFs, the output of the local subtraction algorithm will have peaks in the same locations as those shown in the bottom row of Figure 4. It does, however, have one potential implication for our evaluation of the local subtraction scheme, which is based on the subtraction of phases rather than of simulated spike trains. This is that if any component of “neural smoothing” has a time constant that is constant in milliseconds rather than in radians, its effects would be greater for higher-frequency than for lower-frequency tones. As a result, the output of the subtraction operation would have peaks that, compared with those shown in the bottom row of Figure 4, would be relatively higher for low than for high f_s.
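The last point follows from simple arithmetic: a temporal spread that is fixed in milliseconds occupies a larger fraction of the stimulus period as f_s increases. The short sketch below illustrates this; the 0.1-ms value is an arbitrary figure chosen for the example and is not an estimate taken from the data.

```python
def jitter_in_cycles(sigma_ms, freq_hz):
    """Convert a temporal spread in milliseconds to a fraction of one stimulus cycle."""
    return sigma_ms * 1e-3 * freq_hz

SIGMA_MS = 0.1  # hypothetical smoothing time constant, for illustration only

for f_s in (250, 500, 1000, 2000):
    print(f"{f_s:5d} Hz: {jitter_in_cycles(SIGMA_MS, f_s):.3f} cycles")
```

Under this assumption the same smoothing spans eight times as much phase at 2,000 Hz as at 250 Hz.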
A model based on the local subtraction idea was recently applied by Cedolin and Delgutte (2010) to the responses of cat AN fibres to harmonic complex tones. An advantage of this approach is that the across-channel timing comparisons are performed locally, and, where a complex tone consists of low-numbered harmonics, can be performed between AN fibres responding to the same harmonic (Cedolin and Delgutte 2010). This is important because it generates predictions consistent with the fact that the pitch of low-numbered harmonics is independent of their relative phase (Houtsma and Smurzynski 1990; Shackleton and Carlyon 1994) and because it would allow phase comparisons to be performed even when energy from other sound sources is present elsewhere in the spectrum. We compare Cedolin and Delgutte’s findings to the results of our analyses in “Comparison with other data”. Although neither we nor they could specify exactly which neurons might perform this subtraction, Cedolin and Delgutte suggested the dorsal cochlear nucleus as one possible site.
CF-at-knee point model
A second possible code for frequency that makes use of the PT involves estimating f_s from the location of the transition between the steep and shallow portions of the functions shown in Figure 1 and in the top panels of Figure 4. This could be implemented by an array of “third-order” neurons that subtract the outputs of adjacent “second-order” neurons (Figure 3C–E). This “third-order” array would show a maximum when the second derivative of the phase-vs.-CF function is maximal. To estimate this “knee point”, we refitted the data shown in Figure 4 with “broken-stick” functions consisting of two straight lines, one of which was constrained to have a zero slope. Examples of this fit for two frequencies and levels are shown in Figure 6. When applied to the same range of f_s as shown in Figure 4, this “broken-stick” model yields almost exactly the same RMS error of fit (all frequency and level conditions combined) as the “arc-tangent” model described above: 0.2791 cycles for the broken-stick fit vs. 0.2796 cycles for the arc-tangent fit. The knee point varies monotonically with f_s, but, as Figure 7 shows, is not independent of level. For example, the knee point for a 250-Hz, 90-dB tone occurs at a place in the AN array having a higher CF than that for a 500-Hz, 50-dB tone. The effect of level, averaged across f_s, was 0.8 octaves, more than half the 1.33-octave effect of f_s.
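One way to obtain a “broken-stick” fit of this kind is sketched below. The fitting procedure is not specified in the text, so the method shown (a grid search over candidate knee positions with closed-form least squares for the plateau and slope) is an assumption, as are the function name and the placement of the sloping segment at CFs below the knee.

```python
def fit_broken_stick(xs, ys, knee_candidates):
    """Fit y = plateau + slope * min(x - knee, 0) by grid search over the knee.

    xs are log10(CF) values and ys are phases in cycles. The segment above
    the knee is constrained to zero slope. Returns (knee, plateau, slope, rms).
    """
    n = len(xs)
    best = None
    for knee in knee_candidates:
        us = [min(x - knee, 0.0) for x in xs]  # zero on the flat segment
        mu, my = sum(us) / n, sum(ys) / n
        suu = sum((u - mu) ** 2 for u in us)
        if suu == 0.0:
            continue  # no points below this knee, slope undefined
        slope = sum((u - mu) * (y - my) for u, y in zip(us, ys)) / suu
        plateau = my - slope * mu
        sse = sum((plateau + slope * u - y) ** 2 for u, y in zip(us, ys))
        if best is None or sse < best[0]:
            best = (sse, knee, plateau, slope)
    if best is None:
        raise ValueError("no candidate knee had data points below it")
    sse, knee, plateau, slope = best
    return knee, plateau, slope, (sse / n) ** 0.5

# Synthetic, noise-free example (hypothetical values, not measured data).
xs = [2.0 + 0.1 * k for k in range(21)]
ys = [0.5 - 2.0 * min(x - 3.0, 0.0) for x in xs]
candidates = [2.5 + 0.05 * k for k in range(21)]
knee, plateau, slope, rms = fit_broken_stick(xs, ys, candidates)
```

The returned RMS error is the quantity that would be compared with the RMS error of an arc-tangent fit to the same points.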
Phase transition slope model
Loeb et al. (1983) proposed a model consisting of an array of detectors in the medial superior olive, with each detector receiving inputs from pairs of AN fibres that were separated along the BM by a fixed amount. They pointed out that the maximum output would occur in those detectors whose input AN fibres responded with phases that differed by one wavelength (although they also noted that optimal discrimination might be based on other phase differences). This “critical distance” depends on the slope of the function relating phase in cycles to log10(CF). As shown in Figures 1, 4, and 6, this slope is not constant as a function of CF but, rather, increases markedly for CFs below the knee point. Here, we focus on the slope of the steep portion as a possible cue for f_s. This slope will depend on the bandwidths of the peripheral filters through which the stimulus is passed (Heinz et al. 2001; Shera et al. 2010); a filter introduces a delay as the frequency of the input passes through its resonance, and the size of this delay is greater for narrow than for broad filters. As the slope is measured in terms of the logarithm of CF (which is roughly proportional to the distance along the BM), it will vary with the bandwidth of the analysing filters relative to their centre frequencies. This last point can be explained by considering a 1,000-Hz tone passed through a filter centred on, say, 1,200 Hz and a 2,000-Hz tone passed through a filter centred on 2,400 Hz. If the bandwidths of the filters are a constant proportion of their centre frequencies, then each tone will be at the same relative position on the filter’s transfer function and the phase delays (in cycles) will be equal.
As shown in Figure 8, the slopes vary as functions both of level and of f_s. The slopes are slightly shallower (less negative) at lower frequencies, presumably because, in the GP, the relative bandwidths of frequency tuning curves (e.g. as summarized by Shera et al. 2002) are broader at low frequencies. The slopes also become shallower (less negative) with increases in level, as would be expected with the decrease in frequency selectivity at high levels (Rhode 1971; Robles et al. 1986). Both of these effects were sufficiently reliable to produce significant main effects when the slope values for three levels (50, 70, 90 dB SPL) and for ten values of f_s spaced in units of 0.1 log10(f_s) were entered into a generalized linear model (effect of f_s: F(1,28) = 28.9, P < 0.00001; effect of level: F(1,28) = 22.1, P < 0.00001). Averaged across f_s, slopes varied by a factor of 1.67 across level, approaching the factor of 2.07 for the variation with f_s. Hence, any attempt to estimate f_s based on a metric related to the slope (for example, by estimating the difference along the AN array between fibres that responded in phase) would be strongly influenced by the input level.
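The link between the slope and a distance-based metric can be made concrete with a short sketch. It assumes that the steep portion is locally linear, so that two fibres responding in phase (one cycle apart) are separated by 1/|slope| in units of log10(CF); the slope values used are hypothetical and serve only to illustrate the scaling.

```python
import math

def critical_distance_octaves(slope_cycles_per_decade):
    """CF separation (octaves) over which phase changes by one cycle.

    Assumes a locally linear phase-vs.-log10(CF) function with the given
    slope in cycles per unit of log10(CF).
    """
    decades = 1.0 / abs(slope_cycles_per_decade)
    return decades * math.log2(10.0)

steep = -5.0             # hypothetical slope at a low level
shallow = steep / 1.67   # the same slope made shallower by a factor of 1.67
ratio = critical_distance_octaves(shallow) / critical_distance_octaves(steep)
print(f"critical distance grows by a factor of {ratio:.2f}")
```

Because the distance is the reciprocal of the slope, any proportional change in slope with level produces the same proportional change in this metric.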
Model involving a comparison of phase above and below the knee point
One potential decoding method is illustrated in Figure 9, which shows “broken-stick” fits to stimuli at two frequencies and levels (shown in Fig. 6), shifted so that the horizontal portions are aligned. The shift is performed because the animal has, of course, no idea about “absolute” phase and can only compare the relative phase across fibres. It can be seen that the CF at which the phase is π radians lower than that at the knee point (indicated by the black horizontal dotted lines) appears to be similar at the two levels for each frequency. Figure 10 shows this value, which we term the “π shift point”, as a function of f_s for input levels of 50, 70 and 90 dB SPL. It increases monotonically with f_s and has modest error bars. A generalized linear model revealed a significant effect of f_s (F(1,28) = 524, P < 0.0001), but not of level (F(1,28) = 0.23, P = 0.633). Unlike the other measures described here, the effect of level, averaged across f_s, was much smaller (0.23 octaves) than the effect of f_s, which was 2.19 octaves; this, in turn, was close to the range of input frequencies (250–1,189 Hz), which was 2.24 octaves.
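Given a broken-stick fit, the “π shift point” can be located as sketched below. This is an illustrative reading of the definition above rather than the authors' code: it assumes that the steep segment lies at CFs below the knee, that it is linear, and that a shift of π radians corresponds to a phase change of 0.5 cycles in magnitude along that segment. The numerical values are hypothetical.

```python
import math

def pi_shift_point(knee_log_cf, slope_cycles_per_decade):
    """log10(CF) at which phase differs by half a cycle from the knee value.

    Moves from the knee towards lower CFs along a linear steep segment whose
    slope is given in cycles per unit of log10(CF).
    """
    return knee_log_cf - 0.5 / abs(slope_cycles_per_decade)

def octaves_between(log_cf_a, log_cf_b):
    """Separation in octaves between two CFs given as log10 values."""
    return abs(log_cf_a - log_cf_b) * math.log2(10.0)

knee = 3.0    # hypothetical knee at CF = 1,000 Hz
slope = -1.5  # hypothetical slope in cycles per unit of log10(CF)
shift = pi_shift_point(knee, slope)
print(f"pi shift point at CF = {10 ** shift:.0f} Hz, "
      f"{octaves_between(knee, shift):.2f} octaves below the knee")
```

The metric depends jointly on the knee position and the slope, so a level-dependent change in one can be offset by a change in the other.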
A matter that would still need to be resolved is how the auditory system would extract this metric. A possibly important fact is that the phase at and above the knee point is fairly constant over a wide range of CFs, and so is the phase at which most AN fibres are firing. Therefore, “all” that the auditory system has to do is to find those fibres that are firing out of phase (by π radians) with the “most common” phase. However, phase is only “fairly” constant at CFs above the knee point. The data of Kim et al. (Fig. 1) show that, if one looks over a much wider range of CFs than in the data of Palmer et al., the phase delay does decrease gradually as one moves to more basal regions. Therefore, the issue arises as to which portion of the AN array is used as the “fairly constant” reference. An alternative would be for the system to measure the CF corresponding to the knee point, perhaps using a simple metric as in Figure 3C–E, and to exploit this information to identify the π shift point. An important drawback is that the “π shift point” and the knee point are often far apart (more than an octave difference in CF; Fig. 9), raising the issue of how the brain would extract this cue when more than one sound is present. Furthermore, when more than one tone is present, basal fibres will respond to a mixture of both, and the auditory system would have to extract the temporal pattern corresponding to each one. Evidence from the perception of spectrally overlapping mixtures of unresolved harmonics suggests that the brain is poor at extracting two periodicities from the same set of AN fibres (Carlyon 1996a, b).