This experiment was designed to test the hypothesis that children are more obliged than adults to fuse separate spectral components of speech signals. Gordon’s (1997) procedures were used, and support for the hypothesis would be obtained if children showed greater masking protection than did adults in the full-formant condition, as compared with the F1-only condition.
Ninety-five listeners were tested in this experiment: 25 adults between the ages of 18 and 25 years; 32 children between 8 years, 0 months and 8 years, 11 months; and 37 children between 5 years, 2 months and 5 years, 11 months. All participants (or, in the case of children, their parents on their behalf) reported normal hearing, speech, and language. None of the children had had more than five episodes of otitis media before the age of 3 years. At the time of testing, all participants passed hearing screenings at 0.5, 1.0, 2.0, 4.0, and 6.0 kHz, presented at 25 dB HL to each ear separately.
Equipment and materials
Testing took place in a soundproof booth, with the computer that controlled stimulus presentation and recorded responses located in an adjacent room. Hearing screenings were done with a Welch Allyn TM-262 audiometer and TDH-39 headphones. Stimuli were presented using a Soundblaster digital-to-analog converter, a Samson Q5 headphone amplifier, and AKG-K141 headphones.
Two pictures on cardboard (6 × 6 in.) were used so that listeners could point to the picture representing their response choice after each stimulus presentation. One picture was of a dog biting a woman’s leg (bit), and the other was of a man with playing cards in his hands and stacks of poker chips in front of him (bet).
Synthetic speech stimuli were created with the Sensimetrics “SenSyn” software, a version of the Klatt synthesizer. All stimuli were made at a 10-kHz sampling rate, with low-pass filtering below 5 kHz and 16-bit digitization. All stimuli were 60 ms long, which included 5-ms onset and offset ramps. Stimuli were modeled on the vowels /ɪ/ and /ε/, with three steady-state formants. F2 and F3 were 2200 and 2900 Hz, respectively, for all stimuli. F1 was 375 Hz for /ɪ/ and 625 Hz for /ε/. Formant bandwidths (at 3 dB below peak amplitude) were 50 Hz for F1, 110 Hz for F2, and 170 Hz for F3. Fundamental frequency (f0) was constant at 125 Hz.
To create the F1-only stimuli and the low-frequency portion of the full-formant stimuli, the two stimuli described above were low-pass filtered using a digital filter with attenuation starting at 1000 Hz, a transition band extending to 1250 Hz, and 50-dB attenuation above that. The /ε/ stimuli were used to create the high-pass portion of the full-formant stimuli. This was done by starting attenuation at 1250 Hz, with a transition band extending down to 1000 Hz and 50-dB attenuation below that. For the full-formant stimuli, this high-pass portion was combined with the low-pass F1-only portions, using synchronous onsets and offsets, to create stimuli that had identical high-pass characteristics. Making the stimuli in this way allowed precise control over the amplitude relations of F1 and the higher formants. In the full-formant stimuli, the F2/F3 cosignal was 12 dB lower than the F1 target, which matched the relative levels in Gordon (1997). Our pilot work with 24 adults showed no differences from the outcomes reported here when the amplitude of the F2/F3 cosignal was varied between 16 and 24 dB below F1 in 2-dB steps. Consequently, maintaining relative amplitude across formants precisely as Gordon (1997) had done permitted the cleanest comparison of outcomes between his study and this one, and there was no compelling reason to deviate from his stimulus settings. Figure 2 shows smoothed spectra of the full-formant stimuli. It shows that only the frequency of F1 differs across the /ɪ/ and /ε/ conditions.
For use in training, synthetic versions of the words bit and bet were created from the full-formant stimuli by appending formant trajectories at the start and end of those stimuli. At the start, 40-ms transitions were appended with starting frequencies of 200, 1800, and 2300 Hz for F1, F2, and F3, respectively. Steady-state syllable portions were 100 ms for these word stimuli. At the end, 40-ms transitions were appended, with ending frequencies of 200, 1800, and 2900 Hz for F1, F2, and F3, respectively.
Flat-spectrum white noise was generated for masking purposes with a random-number generator in MATLAB. The noise was 600 ms long and was low-pass filtered below 1000 Hz in the same manner as the F1-only stimuli: with a transition band to 1250 Hz and 50-dB attenuation in the stop band.
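The masker described above is straightforward to reproduce. The following is a minimal Python sketch, not the original MATLAB code; the 10-kHz sampling rate is taken from the stimulus description, and the Kaiser-window FIR design is an assumption that simply satisfies the stated passband (1000 Hz), transition band (to 1250 Hz), and 50-dB stop-band specification.

```python
import numpy as np
from scipy import signal

fs = 10_000                                  # Hz; sampling rate of all stimuli
dur = 0.6                                    # s; the 600-ms masker
rng = np.random.default_rng(0)
noise = rng.standard_normal(int(fs * dur))   # flat-spectrum white noise

# Kaiser-window FIR low-pass meeting the stated specification:
# passband edge 1000 Hz, stop-band edge 1250 Hz, >= 50 dB attenuation.
width = 250 / (fs / 2)                       # transition width, normalized to Nyquist
numtaps, beta = signal.kaiserord(50.0, width)
taps = signal.firwin(numtaps, 1125, window=("kaiser", beta), fs=fs)

masker = signal.lfilter(taps, 1.0, noise)    # the filtered 600-ms masker
```

The cutoff is placed at 1125 Hz, the midpoint of the transition band, so that attenuation begins near 1000 Hz and reaches the 50-dB floor by 1250 Hz.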
Listeners visited the laboratory for a single session and were paid $12 for their participation. As much as possible, procedures replicated those of Gordon (1997), but adjustments had to be made because children were included. In particular, children do not tolerate long periods of testing near threshold, so fewer threshold estimates could be obtained. Partly to compensate for that fact, but also to ensure that children could label the stimuli reliably, extensive training with clear exemplars was provided. These modifications (obtaining fewer threshold estimates but providing extensive training before testing) were first suggested by Aslin and Pisoni (1980) as ways to accommodate the special circumstances of working with children. Finally, feedback was not provided during testing itself, whereas Gordon (1997) had provided feedback throughout testing. Previous work has shown that adults can modify their perceptual strategies by shifting the focus of their selective attention from one signal property to another, but that children cannot (e.g., Nittrouer, Miller, Crowther, & Manhart, 2000). Providing feedback during testing might therefore have caused adults, but not children, to modify their strategies over the course of data collection, and age-related differences in outcomes could have been influenced by this factor. Feedback was provided during training, however, to ensure that listeners labeled stimuli reliably and could perform the task before testing started.
The first general training involved the word stimuli. The experimenter introduced each picture separately and told the listener the name of the word associated with that picture. Listeners practiced pointing to the correct word and saying it after it had been spoken by the experimenter 10 times (5 times for each word). Having listeners both point to the picture and say the word ensured that they were correctly associating the word and the picture. Next, the synthetic words were presented over headphones at 74 dB SPL in random order without noise. The listener had to point to the correct picture and say the correct word. Feedback was provided. Fifty of these words (25 of each) were presented.
Next, the 60-ms full-formant stimuli were introduced, without noise. Listeners were instructed that they would be hearing “a little bit” of the word. They were told to continue pointing to the correct picture and saying the word that the little bit came from. Listeners heard 50 tokens of these samples (25 of each) at 74 dB SPL in random order, with feedback.
Condition-specific training and pre-test
Training for the F1-only condition followed because this condition was always presented first, which is what Gordon (1997) did. This training consisted of presenting 50 of the F1-only stimuli at 74 dB SPL without noise and having listeners point to and say the word associated with that formant pattern. Feedback was provided.
Finally, up to 50 of these stimuli were presented without noise or feedback in the pre-test. As soon as the listener responded correctly to nine out of ten consecutive presentations, the pre-test stopped. If all 50 stimuli were presented without the listener ever responding correctly to nine out of ten consecutive presentations, that listener was not tested in that condition.
The last two training steps (condition-specific training and the pre-test) were repeated with full-formant stimuli before testing in the full-formant condition.
An adaptive procedure (Levitt, 1971) was used to find the signal-to-noise ratio at which each listener could provide the correct vowel label 79.4% of the time. The noise was held constant throughout testing at 62 dB SPL, and the level of the signal varied. The initial signal level was 74 dB SPL. After three consecutive correct responses, the level of the signal decreased by 8 dB. That progression, or run, of decreasing signal level by 8 dB after three correct responses continued until the listener made one labeling error, at which time the level of the signal increased by 8 dB. That shift in direction of amplitude change is termed a reversal. Signal amplitude continued to increase until the listener responded with three correct responses, when another reversal occurred. During the first 2 runs (1 with decreasing amplitude and 1 with increasing), signal level changed by 8 dB on each step. During the next 2 runs, signal level changed by 4 dB. Across the next and final 12 runs, level changed by 2 dB on each step. The mean signal level at the last eight reversals was used as the threshold. No feedback was provided, and the stimuli were presented in an order randomized by the software.
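The 3-down/1-up rule described above converges on the 79.4%-correct point of the psychometric function (Levitt, 1971). The tracking logic can be sketched as follows; this is an illustration, with a simulated listener (`p_correct`) standing in for real responses, not the software used in the experiment.

```python
import random

def staircase_threshold(p_correct, start_level=74.0, seed=0):
    """Simulate the 3-down/1-up adaptive track described in the text.

    p_correct(level) gives the probability that a simulated listener
    labels the vowel correctly at a given signal level (dB SPL).
    """
    rng = random.Random(seed)
    level = start_level
    reversals = []        # signal levels at which direction changed
    direction = -1        # -1 = level decreasing, +1 = level increasing
    correct_in_row = 0

    def step_size(runs_completed):
        # 8-dB steps for the first 2 runs, 4 dB for the next 2, 2 dB after
        if runs_completed < 2:
            return 8.0
        if runs_completed < 4:
            return 4.0
        return 2.0

    while len(reversals) < 16:            # 2 + 2 + 12 runs in all
        correct = rng.random() < p_correct(level)
        if correct:
            correct_in_row += 1
            if correct_in_row == 3:       # three in a row: level goes down
                correct_in_row = 0
                if direction == +1:       # direction change = a reversal
                    reversals.append(level)
                    direction = -1
                level -= step_size(len(reversals))
        else:                             # one error: level goes up
            correct_in_row = 0
            if direction == -1:
                reversals.append(level)
                direction = +1
            level += step_size(len(reversals))

    # threshold = mean signal level at the last eight reversals
    return sum(reversals[-8:]) / 8.0
```

With a perfectly sharp simulated listener (always correct at or above some level, never below it), the track settles into a 2-dB oscillation around that level, and the mean of the last eight reversals lands just below it.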
After testing in each condition was completed, listeners heard ten stimuli at 74 dB SPL without noise and without feedback. They needed to respond correctly to nine of them. If they did not do so, their data were not included in the analysis.
Listeners had to meet the pre- and post-test inclusionary criteria for both conditions in order for their data to be included. This restriction ensured that the adaptive tracking procedure was not affected by listeners’ not reliably knowing the vowel labels.
One adult (4%), six 8-year-olds (19%), and fourteen 5-year-olds (38%) failed to meet either the pre- or post-test criterion described above. In all cases, these listeners failed to meet criterion for the F1-only condition. Failing to meet the criterion in the pre-test trials were the one adult, three 8-year-olds, and eleven 5-year-olds. The other three 8-year-olds and three 5-year-olds labeled F1-only stimuli adequately in the pretest but then failed to meet the criterion for the post-test trials. One of the 5-year-olds additionally failed to meet criterion for the full-formant post-test. That left 24 adults, twenty-six 8-year-olds, and twenty-three 5-year-olds with data to be included in the analyses.
Comparison of present results with Gordon (1997)
Methods for the present experiment differed slightly from those in Gordon (1997) because children participated. Therefore, the first step in analyzing these data was to see whether the magnitude of the CMP effect was similar for adults across the two studies. Table 1 shows labeling thresholds for all groups and both kinds of stimuli used in this experiment. Mean thresholds (and SDs) in Gordon’s (1997) experiment were 58.5 dB (2.3 dB) for the F1-only condition and 55.3 dB (2.1 dB) for the full-formant condition. That means that adults in that earlier experiment showed 3.2 dB of masking protection. Adults in the present experiment showed 3.3 dB of masking protection. Thus, although thresholds were slightly higher in the present experiment, masking protection for the full-formant condition, as compared with the F1-only condition, was equivalent.
A two-way analysis of variance (ANOVA) was performed on the thresholds shown in Table 1, with age as a between-subjects factor and number of formants (F1 only or full formant) as a within-subjects factor. Both main effects were significant: age, F(2, 70) = 39.38, p < .001; formants, F(1, 70) = 208.18, p < .001. These findings reflect the trends seen in Table 1: Thresholds were generally higher for younger than for older listeners and for the F1-only than for the full-formant stimuli. In addition, the age × formants interaction was significant, F(2, 70) = 15.05, p < .001, indicating that the magnitude of the formant effect increased with decreasing age. Means (and SDs) of the differences (in decibels) between thresholds for the F1-only and full-formant stimuli were 3.3 (3.5) for adults, 6.2 (3.8) for 8-year-olds, and 9.2 (3.7) for 5-year-olds. Those differences represent the CMP effect for each age group.
Matched t-tests were performed next on differences in thresholds for the F1-only and full-formant stimulus conditions for each group separately, to see whether CMP effects were significant. In all cases, they were: adults, t(23) = 4.57, p < .001; 8-year-olds, t(25) = 8.28, p < .001; and 5-year-olds, t(22) = 11.96, p < .001.
Finally, the magnitude of the age-related difference in thresholds for each kind of stimulus (F1 only or full formant) was indexed as a way of considering how close to adultlike children’s responses were. Table 2 shows Cohen’s ds (Cohen, 1988) for each possible combination of age groups. All comparisons show effects that are typically considered large (d > 0.60), but they are consistently smaller for full-formant than for F1-only stimuli. In particular, 5-year-olds showed thresholds closest to those of older children and adults for the full-formant speech stimuli. This finding lends support to the hypothesis that children are obliged to perceptually fuse the spectral components of speech signals. Without spectrally broad signals, which children fuse into unitary phonetic objects, children were greatly hampered in their perception.
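The effect-size index used here is simple to compute: the difference between two group means divided by their pooled standard deviation. A minimal sketch follows; the threshold values in the usage example are hypothetical, not the study's data.

```python
import math

def cohens_d(group1, group2):
    """Cohen's d for two independent groups: the difference between group
    means divided by the pooled standard deviation (Cohen, 1988)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)   # sample variances
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical labeling thresholds (dB SPL) for two age groups:
children = [64.0, 66.0, 68.0]
adults = [58.0, 60.0, 62.0]
```

Smaller d for the full-formant comparison than for the F1-only comparison is what indicates, in Table 2, that children's thresholds sit closer to adults' when the full spectrum is available.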
This experiment was conducted to test the hypothesis that young children are more obliged than adults to perceptually fuse the spectral components of speech. On the basis of earlier findings, it was hypothesized that young children are not inclined to perceptually segregate separate acoustic components in the speech signal. Although children have been found to weight formant transitions particularly strongly in their phonetic decisions (e.g., Nittrouer, 1993, 2005; Nittrouer & Studdert-Kennedy, 1987), this apparently does not happen via a process in which separate formant transitions are perceptually segregated from the spectral array and independently examined, with a postperception summation. Rather, children rely on broad spectral forms, perceived as unitary objects, for phonetic recognition. Such a perceptual strategy predicts that children will accrue greater benefit than adults from having broad spectral information available in the speech signal. A paradigm for measuring CMP developed by Gordon (1997) was used to test this hypothesis.
Outcomes clearly supported the hypothesis posed by this study: Children showed significantly stronger CMP effects than did adults, and the younger the children, the stronger the effects. These effects are quantified by the difference in thresholds measured when listeners are presented with F1-only versus full-formant stimuli. Children had elevated thresholds, as compared with adults, for both kinds of stimuli, but their thresholds were disproportionately elevated for the F1-only stimuli. That pattern of results suggests that children benefit greatly from having complete spectral information about the speech signal, which they fuse into unitary percepts. They could have shown raised thresholds for the F1-only condition and CMP effects similar to those of adults, leading to equally elevated thresholds in both conditions. But they did not: They showed enhanced CMP effects. This finding is important because the perceptual feat of integrating spectral components to recover a unitary perceptual object, which provides protection from masking, seems sophisticated. Yet the enhanced performance of children, as compared with that of adults, means that children can perform these perceptual tasks at least as well as adults. The question left unanswered by this first experiment was what explains children's enhanced spectral integration. For that matter, it is not clear from this one experiment what perceptual principle explains CMP for adults. The effect could be based on different principles for listeners of different ages; if so, that might help explain why it was stronger for children than for adults.
In particular, it seemed possible at the conclusion of this first experiment that adults might rely strongly on what Bregman (1990) terms a schema-based principle of auditory grouping but that children might depend strongly on a primitive principle. Primitive principles of auditory grouping are those arising from the structure of the sound source itself, such as the harmonic relationship among spectral components. According to this account, listeners automatically—without learning—group spectral components together if they have the same harmonic structure. Schema-based principles are those that involve knowing which components of a complex auditory scene should be grouped together, perhaps because they are all part of a familiar pattern. According to this account, listeners use this stored knowledge of familiar patterns to integrate related signal components. In the case of the speech signals in this first experiment, the components of the full-formant stimuli might be integrated because they are all recognized as arising from a common generator, a single vocal tract. Because schemas generally require that perceivers know which parts of a complex scene should be grouped together, they are often learned, but they do not need to be. There are innate schemas.
In order to test the hypothesis that children's outcomes might reflect primitive grouping principles while adults' results demonstrate learned schemas, it was necessary to design stimuli that explicitly disrupt one of the primitive principles known to facilitate perceptual grouping of sounds, without diminishing adults' demonstration of CMP. If children then showed drastic reductions in their CMP effects, the hypothesis would be supported that children's demonstration of CMP in this first experiment depended on a primitive principle but adults' did not. Fortunately, earlier work by Gordon (1997) suggested just the right stimulus manipulation.