In a two-interval two-alternative forced-choice task, listeners indicated which of the two stimuli had the higher pitch. No feedback was provided in the experiment proper. Each stimulus consisted of two components whose frequencies were explicitly chosen such that the listener’s pitch judgment would indicate one interval if a residue pitch was perceived (i.e., if the subject listened synthetically and heard the low or virtual pitch), but would indicate the other interval if no residue pitch was perceived (i.e., if the subject listened analytically to the individual components).
The stimuli were complex tones consisting of two “harmonics” of three different F0s (Table 1): a 400-Hz F0 complex with components at 400 and 800 Hz (1st and 2nd harmonics), a 267-Hz F0 complex with components at 533 and 800 Hz (2nd and 3rd harmonics), and a 160-Hz F0 complex with components at 640 and 800 Hz (4th and 5th harmonics). Note that, as the F0 decreases, the frequency of the lower harmonic increases, while that of the upper harmonic is constant. Thus, pitch judgments based on analytic listening would go in the opposite direction to those based on the residue pitch, i.e., based on synthetic listening. There were four conditions (Fig. 1B). In condition HP-HP, both components were HPs. In condition NBN-NBN, both components were NBNs. In condition HP-NBN, the lower component was an HP, while the upper component was an NBN, and in condition NBN-HP, the lower component was an NBN, while the upper component was an HP. The first two and the last two will be referred to as single-mode and mixed-mode conditions, respectively.
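The opposition between the two listening modes can be made explicit with a short sketch. The Python below is illustrative only and is not the authors' code; the function name `predicted_choice` is hypothetical, and the analytic listener is assumed to follow the lower component, since the upper component is fixed at 800 Hz.

```python
# F0 (Hz) -> (lower, upper) component frequencies in Hz, as listed in the text.
COMPLEXES = {
    400: (400, 800),
    267: (533, 800),
    160: (640, 800),
}


def predicted_choice(f0_first, f0_second):
    """Return the interval (1 or 2) that each listening mode would call 'higher'."""
    low_first = COMPLEXES[f0_first][0]
    low_second = COMPLEXES[f0_second][0]
    # Synthetic listening follows the residue pitch, i.e., the F0.
    synthetic = 1 if f0_first > f0_second else 2
    # Analytic listening (assumption) follows the lower component,
    # the only component whose frequency differs between the two stimuli.
    analytic = 1 if low_first > low_second else 2
    return {"synthetic": synthetic, "analytic": analytic}


for first, second in [(400, 267), (400, 160), (267, 160)]:
    print(first, second, predicted_choice(first, second))
```

For every F0 pairing, the two modes select different intervals, which is what allows a single pitch judgment to reveal the listening mode.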
All stimuli were generated digitally in MATLAB. The HP stimuli were generated from a 500-ms Gaussian noise that was low-pass filtered at 2 kHz. They were generated in the spectral domain by first applying a fast Fourier transform (FFT) to the noise and then modifying the phases of one of two matched buffers representing the left and right channels. The modification consisted of adding to the original phase a shift that increased linearly with frequency from 0 to 2π radians between 3% below and 3% above the center frequency of the chosen component(s). Applying an inverse FFT to the two spectral buffers gave the signal waveforms for the left and right channels. Note that after this phase modification, each channel still contains noise because, within each channel, the phases are still random. Thus, there are no monaural spectral cues for pitch available. The HP stimuli had a spectrum level of 35 dB (re 20 μPa). The NBN(s) that were added diotically to the HP stimuli (or to the diotic low-pass noise for the NBN-NBN condition) extended from 3% below to 3% above the center frequency of the chosen component(s) and had a spectrum level of 41 dB (re 20 μPa). This relatively low spectrum level of the NBN was chosen following some informal listening so as to give a pitch percept that was approximately matched in salience, i.e., that seemed equally loud, to that of the HP components.
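The phase manipulation described above can be sketched as follows. This is a minimal Python illustration, not the MATLAB code used in the study; the names `interaural_phase_shift` and `shift_channel` are hypothetical. It assumes that the spectrum of the noise is already available as a list of complex bins with a known bin spacing, and that frequencies outside the transition band receive no shift (a shift of 2π above the band is equivalent to none).

```python
import cmath
import math


def interaural_phase_shift(freq_hz, center_hz, half_width=0.03):
    """Phase shift (radians) applied to one channel at frequency freq_hz.

    The shift rises linearly from 0 to 2*pi between 3% below and 3% above
    center_hz. Outside that band it is taken to be 0.
    """
    lo = center_hz * (1.0 - half_width)
    hi = center_hz * (1.0 + half_width)
    if freq_hz < lo or freq_hz > hi:
        return 0.0
    return 2.0 * math.pi * (freq_hz - lo) / (hi - lo)


def shift_channel(spectrum, bin_hz, centers_hz):
    """Return a copy of a one-sided spectrum with the phase transitions applied.

    spectrum[k] is the complex value of the bin at frequency k * bin_hz.
    Only phases change; the magnitude of every bin is preserved.
    """
    shifted = []
    for k, value in enumerate(spectrum):
        shift = sum(interaural_phase_shift(k * bin_hz, c) for c in centers_hz)
        shifted.append(value * cmath.exp(1j * shift))
    return shifted
```

In this sketch, one channel would keep the original spectrum and the other would receive the output of `shift_channel`, e.g., with `centers_hz=[533, 800]` for the 267-Hz F0 complex. Because only phases are altered, the magnitude spectrum of each channel on its own is unchanged.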
During training, the complex tones had the same F0s as in the main experiment (400, 267, and 160 Hz) but they contained all harmonics up to 2.4 kHz. The harmonics were sinusoids added in random phase and presented in Gaussian noise that was low-pass filtered at 4 kHz. The root mean square (rms) level of the complex tone was 3 dB below that of the noise.
The stimulus duration was 500 ms, including 40-ms raised-cosine onset and offset ramps. The silent gap between the two observation intervals within a trial was 500 ms. The stimuli were played out using a 16-bit digital-to-analog converter (CED 1401 plus), with a sampling rate of 40 kHz. Stimuli were passed through an antialiasing filter (Kemo 21C30) with a cutoff frequency of 15 kHz (slope of 96 dB/oct) and presented via the two channels of Sennheiser HD650 headphones at an rms level of about 68 dB SPL.
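For illustration, a raised-cosine envelope with the durations and sampling rate given above might be computed as in the following Python sketch. The function name `raised_cosine_envelope` is hypothetical, and the exact ramp formula used in the study is not stated, so the common form 0.5(1 − cos(πt/T)) is assumed.

```python
import math


def raised_cosine_envelope(n_samples, n_ramp):
    """Amplitude envelope with raised-cosine onset and offset ramps."""
    env = [1.0] * n_samples
    for i in range(n_ramp):
        # Gain rises smoothly from 0 toward 1 across the ramp.
        gain = 0.5 * (1.0 - math.cos(math.pi * i / n_ramp))
        env[i] = gain                    # onset ramp
        env[n_samples - 1 - i] = gain    # mirrored offset ramp
    return env


FS = 40_000                        # sampling rate in Hz (from the text)
n_total = round(0.500 * FS)        # 500-ms stimulus
n_ramp = round(0.040 * FS)         # 40-ms ramps
envelope = raised_cosine_envelope(n_total, n_ramp)
```

Multiplying each channel sample by the corresponding envelope value would apply the ramps.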
Listeners indicated which of the two stimuli had the higher pitch, without receiving feedback. Each F0 complex was compared with each of the others. Within a given trial, both stimuli were from the same condition. Before each experimental block of 120 trials (ten trials for each condition and F0 comparison in a randomized order across condition and F0 comparison), listeners had a short training block of 30 trials that was intended to help them hear a residue pitch when only two harmonics were present. In the training block, feedback was provided. Overall, in the main experiment, 100 trials were collected in each condition for each F0 comparison for each listener.
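The composition of one experimental block can be sketched as follows. This Python fragment is illustrative only; `make_block` is a hypothetical name, and the random assignment of the two F0s to the first and second interval is an assumption, as the text does not state how presentation order was chosen.

```python
import itertools
import random

CONDITIONS = ["HP-HP", "NBN-NBN", "HP-NBN", "NBN-HP"]
F0S = [400, 267, 160]          # Hz
REPEATS_PER_BLOCK = 10


def make_block(rng):
    """Build one randomized experimental block of trials."""
    trials = []
    for condition in CONDITIONS:
        # Each F0 complex is compared with each of the others: three pairs.
        for f0_a, f0_b in itertools.combinations(F0S, 2):
            for _ in range(REPEATS_PER_BLOCK):
                # Assumption: interval order is randomized on each trial.
                order = (f0_a, f0_b) if rng.random() < 0.5 else (f0_b, f0_a)
                trials.append({"condition": condition, "f0s": order})
    rng.shuffle(trials)        # randomize across condition and F0 comparison
    return trials


block = make_block(random.Random(0))
print(len(block))  # 4 conditions x 3 comparisons x 10 repeats = 120
```

Ten such blocks give the 100 trials per condition and F0 comparison reported for each listener.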
Initially, five subjects were tested without a short training block before each experimental block. At that point, it seemed difficult to find subjects who heard a residue pitch with only two harmonics present in the single-mode conditions; only two of the five were able to do so. As synthetic listening with only two harmonics was a prerequisite for the subjects in the present study, it was considered worthwhile to introduce a short training block that might help subjects to listen synthetically. The initial five subjects were retested using training blocks (subject numbers 1, 2, 11, 13, and 14 in Fig. 2), and the effect of the presence of the training blocks on synthetic vs analytic listening was evaluated on the basis of the data from these five subjects.
Fifteen subjects (mean age = 30 years; range, 20–48 years) with self-reported normal hearing were tested. One of them was the first author. Thirteen of them had some degree of musical training. Informed consent was obtained from all subjects. This study was carried out in accordance with the UK regulations governing biomedical research and was approved by the Cambridge Psychology Research Ethics Committee.
The percentage of trials in which the listeners’ judgments followed the F0 rather than the spectral pitch of the individual components was calculated. These percentages (based on 100 trials) are shown in Figure 2, for each condition and F0 comparison, for all subjects. In the single-mode conditions, ten of the 15 subjects (subject numbers 1–10) consistently heard a residue pitch with only two harmonics present for at least some of the F0 comparisons, i.e., their pitch judgments followed the F0. The remaining five subjects either listened mainly analytically for all F0 comparisons in the single-mode conditions (subject 13) or were inconsistent. The data from these five subjects were excluded from further analysis (indicated by the dashed line in Fig. 2), as they would not allow any meaningful conclusions about the ability to hear a residue pitch in the mixed-mode conditions; one cannot expect subjects to perceive a residue pitch in the mixed-mode conditions if they do not perceive a residue pitch in the single-mode conditions. Next, percentages were averaged across the two single-mode conditions and across the two mixed-mode conditions, respectively. All statistical analyses were conducted after applying the rationalized arcsine units (RAU) transformation (Studebaker 1985) to these percentages. Pearson’s correlation coefficient was calculated between the mixed-mode RAU-transformed percentages of the time that responses went with F0 and the corresponding single-mode RAU-transformed percentages. t tests were conducted, separately on the data for each F0 comparison, to assess the significance of the difference between the RAU-transformed percentages of judgments following the F0 in the single-mode and in the mixed-mode conditions.
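For reference, the RAU transformation of Studebaker (1985) for a score of X out of N trials is θ = arcsin√(X/(N+1)) + arcsin√((X+1)/(N+1)), with RAU = (146/π)θ − 23. A minimal Python sketch is given below; the function name `rau` is hypothetical, and the sketch operates on trial counts, whereas the text does not specify how the averaged percentages were converted before the transformation was applied.

```python
import math


def rau(n_following_f0, n_trials):
    """Rationalized arcsine units (Studebaker 1985) for a score out of n_trials."""
    theta = (math.asin(math.sqrt(n_following_f0 / (n_trials + 1)))
             + math.asin(math.sqrt((n_following_f0 + 1) / (n_trials + 1))))
    return (146.0 / math.pi) * theta - 23.0
```

The transformed values are close to the raw percentages in the middle of the range and extend below 0 and above 100 at the extremes, which stabilizes the variance of scores near floor and ceiling.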
The best fitting linear regression line was calculated for RAU-transformed percentages of judgments following the F0 in the mixed-mode and in the single-mode conditions, taking into account that both the single-mode and the mixed-mode data include measurement errors. This required minimizing the perpendicular distances of the data points from the fitted straight line, i.e., minimizing the horizontal and vertical distances simultaneously. Note that usually, i.e., in the standard way of doing linear regression, the deviations between data points and fitted straight line are minimized in one dimension only. For example, only the deviations between the measured values of the dependent y variable and the predicted y values (the y values on the regression line for given values of the independent variable x) are minimized; the independent variable x is assumed to be error-free.
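A fit that minimizes perpendicular distances has a closed-form solution, sketched below in Python. This is an illustration and not the authors' analysis code; the function name `orthogonal_regression` is hypothetical, and the sketch assumes equal error variances in the two variables, which is what minimizing perpendicular distances implies.

```python
import math


def orthogonal_regression(xs, ys):
    """Fit y = slope * x + intercept by minimizing perpendicular distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxy == 0:
        raise ValueError("slope is undefined when x and y are uncorrelated")
    # Closed-form slope of the major axis (total least squares in 2-D).
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Unlike ordinary least squares, this fit treats the two variables symmetrically: exchanging x and y yields the reciprocal slope, so the fitted line does not depend on which condition is plotted on which axis.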
The effect of the training was tested separately for a subgroup of five listeners. The statistical analysis showed that the training did not significantly increase the tendency to perceive the residue pitch rather than the pitches of individual components. Two t tests, one for the mixed-mode conditions and one for the single-mode conditions (with RAU values averaged across the three F0 comparisons), showed that there were no significant differences between the percentages of judgments following the residue pitch before and after training [single-mode, t(4) = −1.31, p = 0.26; mixed-mode, t(4) = −1.28, p = 0.27]. Thus, the tendency to listen analytically to a given set of two-tone harmonic complexes seems to be a stable aspect of perception that is not overcome by listening to multi-harmonic complexes with the same F0s for which the residue pitch is clearly perceived.
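The 4 degrees of freedom reported for five listeners are consistent with a paired (within-subject) t test. A minimal Python sketch of the test statistic follows; the function name `paired_t` is hypothetical, the sign convention (before minus after) is an assumption, and the numbers in the usage lines are made-up illustrative values, not data from the study.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for a paired t test on before-minus-after differences."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Made-up RAU values for five hypothetical listeners (NOT data from the study).
example_before = [60.0, 70.0, 50.0, 80.0, 65.0]
example_after = [62.0, 75.0, 49.0, 84.0, 70.0]
t_value, df = paired_t(example_before, example_after)
```

With this sign convention, a negative t indicates higher values after training than before.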