In Experiment 1, we examined the processing of speech and noise information using a speeded classification paradigm (Garner, 1974). The dimensions of the speech signal were divided into indexical features that are intrinsic to the speech input—namely gender (Exp. 1A) and within-gender talker identity (Exp. 1B). Examining listeners’ classification of these different dimensions of speech in the presence of random noise variations, as well as their classification of noise in the presence of random speech variations along these dimensions (gender or talker), provides insight into which, if any, dimensions of the speech signal are processed independently of extraneous background noise in the context of a task that involves minimal access to the long-term mental lexicon. As such, this task can inform our understanding of speech-in-noise perception at a relatively early stage of processing. Furthermore, in order to investigate whether the extent of acoustic overlap in the spectral domain between two auditory signals affects whether the speech and noise dimensions are processed independently, we also included conditions manipulating the spectral overlap of these dimensions in each experiment. That is, we manipulated whether energetic masking of the speech by the noise was present or absent.
If all concurrently presented perceptual details of a speech event, including those that are intrinsic to the speech signal (gender and talker) and those that are extrinsic to the speech signal (environmental noise), are perceptually integrated, we should observe slower reaction times in the orthogonal condition (which incorporated variation along both dimensions) than in the control condition (in which only one dimension varied). Furthermore, we may hypothesize that the integrality of the dimensions can vary as a function of the ease with which listeners can strip away elements of the auditory context that are extraneous to the speech signal. Under this view, perceptual separation of the speech and noise should be particularly facilitated when the noise and speech are highly acoustically distinct. In the present study, this would predict an asymmetry in the magnitudes of the difference in reaction times between orthogonal and control conditions between the spectrally separated condition and the spectrally overlapped condition. Since it might be more difficult to perceptually segregate the speech and noise dimensions in the spectrally overlapped condition, we might predict greater orthogonal interference in this condition than in the spectrally separated condition.
Finally, previous research has suggested that asymmetric interference effects arise from the relative discriminability of the dimensions, such that more-discriminable dimensions should be more difficult to ignore (Cutler et al., 2011; Garner, 1974). Discriminability is typically indexed by comparing reaction times in the control conditions of the two dimensions, with shorter latencies indicating easier discriminability. However, the prior work on the integration of speech and indexical information discussed above had not made this discriminability comparison within the same study, using the same materials and experimental setup. The present work included two indexical speech features, gender and within-gender talker identity, to further investigate the impact of discriminability on the integration of speech and noise. These indexical dimensions differ in their relative discriminability, with gender (Exp. 1A) being easier to classify than within-gender talker identity (Exp. 1B).
Experiment 1A: Perceptual integration of talker gender and background noise
Method
Participants
Eighty-three American English listeners, who reported having no speech or hearing deficits at the time of testing, participated in this experiment. Participants who had experience with more than one language before the age of 11 were required to have learned English first and not to have been exposed to the other language for more than 5 h per week. Listeners were randomly assigned to either the non-energetic-masking (NEM) or the energetic-masking (EM) condition. In order to be included in the analyses, participants were required to attain at least 90 % classification accuracy for both dimensions of the Garner (1974) task, resulting in the exclusion of 11 participants (eight from the NEM and three from the EM group). This yielded 36 participants in the NEM (22 female, 14 male; mean age = 20 years) and 36 participants in the EM (22 female, 14 male; mean age = 20 years) condition.
Stimuli
The stimulus materials included 96 English disyllabic, initial-stress words produced by one male and one female American English talker. The words were produced in citation form and recorded at a 22,050-Hz sampling rate. Acoustic analyses performed on the 96 stimulus items produced by the two talkers revealed that the mean difference in fundamental frequency between the male and female talkers was 86 Hz (male: M = 144 Hz; female: M = 230 Hz). The average F0 ranges (calculated as the difference between the mean F0 minimum and F0 maximum across words) were 165 Hz for the male talker and 146 Hz for the female talker.
The set of materials was digitally processed in Praat (Boersma & Weenink, 2013) to yield the four different noise and masking stimulus sets (Fig. 1). The stimuli were first normalized for duration, low-pass filtered at 5 kHz, and normalized for root-mean-squared (RMS) amplitude to 65 dB. For the non-energetic-masking (NEM) condition, two sets of stimuli were constructed, with each set including all 96 recorded words: For one set, the speech files were combined with narrow band-pass-filtered white noise from 7 to 10 kHz, and for the other set, they were combined with a 6-kHz pure tone (Fig. 1, right column). Similarly, two sets of stimuli were constructed to produce items for the energetic-masking (EM) condition: The low-pass-filtered speech files were combined with either narrow band-pass-filtered white noise from 3 to 4 kHz (Set 1) or a 3-kHz pure tone (Set 2; Fig. 1, left column). In total, there were four sets of stimuli: NEM-noise, NEM-tone, EM-noise, and EM-tone.
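As a concrete illustration, the four background types can be sketched as follows. This is a minimal Python sketch, not the authors’ Praat procedure; the duration, digital RMS target, and filter order are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 22050   # sampling rate of the recordings (Hz)
DUR = 0.6    # illustrative duration (s); actual word durations varied

def rms_normalize(x, target_rms=0.05):
    """Scale x to a fixed RMS amplitude (a digital stand-in for 65 dB)."""
    return x * (target_rms / np.sqrt(np.mean(x ** 2)))

def bandpass_noise(lo_hz, hi_hz, dur=DUR, fs=FS):
    """White noise band-pass filtered between lo_hz and hi_hz."""
    noise = np.random.default_rng(0).standard_normal(int(dur * fs))
    sos = butter(4, [lo_hz, hi_hz], btype="bandpass", fs=fs, output="sos")
    return sosfilt(sos, noise)

def pure_tone(freq_hz, dur=DUR, fs=FS):
    t = np.arange(int(dur * fs)) / fs
    return np.sin(2 * np.pi * freq_hz * t)

# The four background types used to build the stimulus sets
backgrounds = {
    "NEM-noise": bandpass_noise(7000, 10000),  # spectrally separated from the <5-kHz speech
    "NEM-tone":  pure_tone(6000),
    "EM-noise":  bandpass_noise(3000, 4000),   # overlaps the speech band
    "EM-tone":   pure_tone(3000),
}

def mix(speech, background):
    """Combine RMS-normalized speech and background into one stimulus."""
    return rms_normalize(speech) + rms_normalize(background)
```

The key design point is that both masking conditions use the same two background categories (noise vs. tone); only their spectral placement relative to the low-pass-filtered speech changes.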
Procedure
Whereas masking condition (NEM or EM) was a between-subjects manipulation, stimulus dimension (gender, noise) and stimulus set condition (control, correlated, orthogonal) were within-subjects manipulations. When making classification judgments, all participants were required to attend to either the gender (male vs. female) or the noise (pure tone vs. white noise) dimension. They completed both of these judgments in each of the three stimulus set conditions, blocked by stimulus dimension. This resulted in a total of six sets of trials, and all 96 words were presented in every set with no repetitions of items (i.e., a total of 96 trials per stimulus set condition). The orders of stimulus dimension and stimulus set were counterbalanced across participants, as were response button order and which words were presented with which particular gender–noise combinations (e.g., male talker with pure tone, female talker with white noise).
In the control conditions, the attended dimension varied randomly and the unattended dimension was held constant. One gender control set, for example, included words spoken by both the male and female talkers embedded only in white noise. The control set for the noise condition, on the other hand, presented words with both white-noise and pure-tone backgrounds spoken by a single talker. Each participant completed one gender and one noise control set, which were counterbalanced across participants. In the correlated condition, one value of the gender dimension was consistently paired with one value of the noise dimension. For instance, one set included words with a pure-tone background produced by the female talker and words with a white-noise background produced by the male talker, whereas the other set consisted of words in white noise produced by the male talker and words with a pure tone produced by the female talker. Which correlated set a participant received was also counterbalanced across participants. In the orthogonal condition, the attended and unattended dimensions varied randomly, whereby all items produced by both male and female talkers in both white noise and a pure tone were presented to the participant for classification along the attended dimension. Before each stimulus dimension condition, a brief familiarization phase was presented in order to orient listeners to the task procedures for that particular condition. Each familiarization phase consisted of ten trials (five items for each response option) using stimulus items not contained in the test phase.
Stimuli were presented over Sony MDR-V700 headphones at a comfortable listening volume in sound-attenuated booths. Upon hearing each item, participants were instructed to classify it on the basis of the appropriate attended dimension as quickly and accurately as possible by pressing one of two buttons on a response box.
Results
Percent correct classifications were calculated for each dimension (Table 1). The response latencies for correct trials, measured from the onset of the stimulus to the onset of the buttonpress, were also obtained (Table 1).
Table 1 Mean reaction times (in milliseconds) for masking condition, dimension, and stimulus set with accuracy
Only the latencies of correct responses were submitted for analysis. The data were analyzed using linear mixed-effects regression models (LMER; Baayen, Davidson, & Bates, 2008), with log-transformed reaction times as the dependent variable. Outlier trials that deviated by more than three standard deviations from the mean log reaction time of the condition were excluded from the analysis. The stimulus set was contrast-coded to investigate the following comparisons: control versus correlated (ContCorr) and control versus orthogonal (ContOrtho). Although we included fixed effects in the model to investigate both orthogonal interference (ContOrtho) and redundancy gain (ContCorr), we will report only the orthogonal interference results (see Appendix 1 for the redundancy gain findings), since redundancy gain is not as robust an indicator of perceptual integration as is orthogonal interference (see Footnote 2). Additional contrast-coded fixed effects included dimension (gender vs. noise) and masking condition (EM, NEM), as well as the interactions of the stimulus set contrasts (ContCorr, ContOrtho) with dimension and masking condition. Random intercepts for participants and items were included. The model also contained random slopes by participants for the stimulus set contrasts and dimensions. Random slopes by items for the stimulus set contrasts, dimensions, and masking conditions were also included. Model comparisons were performed to determine whether the inclusion of each of these fixed factors and their interactions made a significant contribution to the model.
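The trial-exclusion and contrast-coding steps described above can be sketched as follows. This is a Python/pandas sketch on simulated data; the column names, the ±0.5 coding values, and the grouping used for outlier trimming are our assumptions, not the authors’ exact implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000

# Illustrative trial-level data; all values are simulated
df = pd.DataFrame({
    "subject": rng.integers(1, 73, n),
    "set": rng.choice(["control", "correlated", "orthogonal"], n),
    "dimension": rng.choice(["gender", "noise"], n),
    "masking": rng.choice(["EM", "NEM"], n),
    "rt": rng.lognormal(mean=6.9, sigma=0.25, size=n),  # RTs in ms
})
df["log_rt"] = np.log(df["rt"])

# Exclude trials deviating by more than 3 SD from the condition's mean log RT
z = df.groupby(["set", "dimension", "masking"])["log_rt"].transform(
    lambda s: (s - s.mean()) / s.std()
)
trimmed = df[z.abs() <= 3.0].copy()

# Contrast codes for the two planned comparisons; the +/-0.5 scheme is one
# common choice, not necessarily the authors' exact coding
trimmed["ContCorr"] = trimmed["set"].map(
    {"control": -0.5, "correlated": 0.5, "orthogonal": 0.0})
trimmed["ContOrtho"] = trimmed["set"].map(
    {"control": -0.5, "orthogonal": 0.5, "correlated": 0.0})
# These predictors would then enter a mixed model of the form
# log_rt ~ (ContCorr + ContOrtho) * dimension * masking,
# with the by-participant and by-item random intercepts and slopes
# described in the text (e.g., fit with lme4 in R)
```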
Figure 2 shows individual participants’ mean difference scores, depicting orthogonal interference (orthogonal – control) for each dimension and masking condition. From this figure, it is evident that the majority of listeners showed some interference in each of the conditions (as indicated by positive values). In all, 67 % (EM) and 58 % (NEM) of the participants showed positive interference values for gender classifications, and 64 % (EM) and 72 % (NEM) of the participants for noise classifications. In line with these observations, the results of the LMER analyses revealed a significant main effect of ContOrtho (β = −0.047, SE β = 0.006, χ²(1) = 40.53, p < .05), whereby reaction times were slower overall in orthogonal than in control conditions.
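In computational terms, each point in Fig. 2 is a per-participant difference of condition means. A minimal Python sketch on simulated reaction times (the RT distribution and the built-in 25-ms effect are illustrative assumptions, not the observed data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_subj, n_items = 36, 96

# Simulated correct-trial RTs (ms) for one dimension, with a built-in
# 25-ms slowdown on orthogonal trials; all numbers are illustrative
trials = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), 2 * n_items),
    "set": np.tile(np.repeat(["control", "orthogonal"], n_items), n_subj),
    "rt": rng.normal(1000, 120, n_subj * 2 * n_items),
})
trials.loc[trials["set"] == "orthogonal", "rt"] += 25

# Orthogonal interference = per-subject mean(orthogonal) - mean(control)
means = trials.groupby(["subject", "set"])["rt"].mean().unstack()
interference = means["orthogonal"] - means["control"]
pct_positive = 100 * (interference > 0).mean()
```

A positive `interference` value for a participant corresponds to a point above zero in Fig. 2, and `pct_positive` corresponds to the percentages reported in the text.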
On the basis of the mean differences between the control and orthogonal conditions, it appears that classification along the noise dimension was subject to greater orthogonal interference than was classification along the gender dimension. Indeed, this was confirmed by a significant ContOrtho × Dimension interaction (β = 0.024, SE β = 0.004), χ²(1) = 39.35, p < .05. Separate LMERs were performed on the gender and noise data with the same fixed- and random-effects structure as above, but with the fixed effect of dimension (and any interactions containing it) removed. For the gender dimension, a significant effect of ContOrtho was found (β = −0.004, SE β = 0.007), χ²(1) = 23.97, p < .05, indicating that listeners were slowed by irrelevant noise variation when classifying gender in the orthogonal condition, as compared to the baseline control. Similarly, for the noise dimension, a significant main effect of ContOrtho (β = −0.059, SE β = 0.011), χ²(1) = 26.33, p < .05, revealed that listeners were slower at classifying the noise dimension when the gender dimension varied randomly. No additional main effects (dimension, masking condition) or any other interactions reached significance (χ² < 2.18, p > .05). In sum, these findings indicate that although significant orthogonal interference was found in both the gender and noise dimensions, the magnitude of this interference differed as a function of the dimension of classification.
In order to determine whether one dimension was inherently easier to discriminate than the other, the reaction times of the control conditions were compared. The LMER model contained fixed effects for dimension and masking condition, as well as random intercepts for participants and items. The model also included random slopes for dimension by participants and by items, as well as a random slope for masking condition by items. The model comparisons revealed no significant effects of dimension, masking condition, or the Masking Condition × Dimension interaction (χ² < 2.63, p > .05). This indicates that neither spectral separation nor the dimension of classification had a substantive impact on the speed with which listeners made their classifications. This lack of a significant difference may seem surprising, given the smaller, but significant, average differences between control and orthogonal reaction times. However, an examination of the individual participants’ data revealed a wide range of individual variation in response speed for classification of the different dimensions in the control conditions, rendering any apparent difference statistically unreliable.
The results of Experiment 1A suggest that the processing of gender and the processing of noise information are interdependent, since irrelevant variation in either dimension resulted in interference in classifying the other dimension. However, these interference effects were asymmetric, with irrelevant variation in the gender dimension causing greater interference when classifying noise than in the reverse direction. A comparison of the response latencies in the control conditions for these dimensions indicated that both dimensions were equally discriminable.
In order to further investigate the relationship between baseline classification speed and susceptibility to orthogonal interference, in Experiment 1B we examined whether a different indexical dimension of the speech signal—namely within-gender talker identity—is perceptually integrated with background noise, since classifying talker identity is purportedly more challenging than making a male–female judgment (Cutler et al., 2011). Prior research has suggested that an asymmetry in the discriminability of the dimensions may result in an asymmetry in the magnitudes of the interference effects (e.g., Cutler et al., 2011), with the slower dimension of classification being more susceptible to interference from the faster dimension than the faster dimension is susceptible to the slower one. Specifically, Cutler et al. (2011) found that within-gender talker classification was both slower in the control condition and subject to greater interference from irrelevant phonetic variation, whereas the reverse was the case for gender–phonetic classifications in Mullennix and Pisoni (1990). However, Experiment 1A showed asymmetric integration of extrinsic noise information and intrinsic gender, even though noise and gender classifications were accomplished with comparable speed. Experiment 1B allowed us to examine whether the asymmetry shown in Experiment 1A would reverse as a result of an asymmetry in discriminability, with greater interference effects in the talker than in the noise dimension, as would be predicted by an account that links lower discriminability in the attended dimension to greater interference effects.
Experiment 1B: Perceptual integration of talker identity and background noise
Method
Participants
Eighty-seven American English listeners participated in this experiment. All satisfied the same participant criteria outlined in Experiment 1A. On the basis of their classification accuracy performance (<90 % correct) in the Garner task, 15 participants were excluded (nine from the NEM and six from the EM group), resulting in 36 listeners in the NEM (23 female, 13 male; mean age = 20 years) and 36 listeners in the EM (29 female, seven male; mean age = 20 years) condition.
Stimuli
The same 96 words from Experiment 1A were used in this experiment, produced by two female American English talkers, one of whom was the same female talker as in Experiment 1A. The talker from Experiment 1A had a mean F0 of 230 Hz, whereas the other female talker had a mean F0 of 197 Hz. The first female talker had a mean F0 range of 146 Hz, and the second talker a range of 99 Hz. Thus, whereas the mean F0 difference between the male and female talkers in Experiment 1A was 86 Hz, the mean F0 difference between the two female talkers in the present experiment was 33 Hz. Identical processing procedures were performed on these speech files, yielding pure-tone- and white-noise-combined stimuli for each of the two masking conditions: NEM and EM.
Procedure
Listeners were required to attend to either talker identity, using arbitrarily assigned names (Sue vs. Carol), or noise (white noise vs. pure tone) in one of the two masking conditions (NEM or EM). All other task procedures were identical to those of Experiment 1A. As in the previous experiment, a brief familiarization phase preceded each stimulus dimension condition. In this case, it not only oriented listeners to the task procedures for that particular condition, but also allowed them to learn the names associated with the two talkers’ voices. The participants completed three stimulus set conditions (control, correlated, orthogonal) for each stimulus dimension (talker, noise), for a total of six sets of trials.
Results
Percent correct classifications were tabulated for each dimension (Table 2). The mean interference effects as well as the reaction times for correct responses for each dimension and masking condition are also presented in Table 2.
Table 2 Mean reaction times (in milliseconds) for masking condition, dimension, and stimulus set with accuracy
Response latencies (Fig. 3) were log-transformed and analyzed using LMER models. Outliers that satisfied the same criteria as in Experiment 1A were excluded from the analysis. These models contained the same fixed- and random-effects structure as in Experiment 1A, whereby stimulus set was contrast-coded to examine control versus correlated (ContCorr) and control versus orthogonal (ContOrtho), with additional fixed effects of dimension (talker vs. noise) and masking condition (EM, NEM), as well as the interactions of the stimulus set contrasts with dimension and masking condition. As in Experiment 1A, only the orthogonal interference results will be reported here (see Appendix 1 for the redundancy gain findings). Figure 3 reveals that the majority of listeners showed orthogonal interference, with 53 % (EM) and 69 % (NEM) of participants for talker classification and 67 % (EM) and 72 % (NEM) of participants for noise classification showing positive interference values.
Consistent with these observations, a significant main effect of ContOrtho was obtained (β = −0.046, SE β = 0.007), χ²(1) = 33.20, p < .05, such that participants produced slower reaction times overall in orthogonal than in control conditions. We also found a significant main effect of dimension (β = 0.051, SE β = 0.010), χ²(1) = 20.80, p < .05, with slower reaction times across conditions when identifying talkers than when identifying noise. The main effect of masking condition did not reach significance (χ² = 0.44, p = .51).
Furthermore, the mean response latencies (Table 2) suggest that the magnitudes of orthogonal interference in the EM condition were asymmetrical, with greater interference from irrelevant talker variation on noise classification (M = 37 ms) than from irrelevant noise variation on talker classification (M = 13 ms). This was reflected in a significant Masking Condition × ContOrtho × Dimension interaction (β = 0.046, SE β = 0.008), χ²(1) = 31.25, p < .05. Similar LMER analyses, as described above, were conducted to further investigate this three-way interaction. A significant effect of ContOrtho was found in both masking conditions of the talker dimension [NEM: β = −0.063, SE β = 0.010, χ²(1) = 27.94, p < .05; EM: β = −0.038, SE β = 0.012, χ²(1) = 8.92, p = .0028]. For the noise dimension, the interference effect was significant in the EM condition (β = −0.053, SE β = 0.013), χ²(1) = 13.31, p < .05, but only marginal in the NEM condition (χ² = 2.78, p = .095).
To determine the relative classification ease for a given dimension, the reaction times of the control conditions were compared. The LMER model contained fixed effects for dimension and masking condition, random intercepts for participants and items, random slopes for dimension by participants and by items, as well as a random slope for masking condition by items. A significant main effect of dimension (β = 0.068, SE β = 0.013), χ²(1) = 22.72, p < .05, was found, whereby listeners were slower at classifying talker identity than at classifying noise in the control condition. No significant effect of masking condition or Dimension × Masking Condition interaction was found (χ² < 0.61, p > .05).
Discussion
The results of Experiment 1 suggest that certain indexical features of speech, such as gender and talker identity, are perceptually integrated with background noise during speech processing, even when the speech and noise signals are spectrally nonoverlapping. Experiment 1A demonstrated mutually dependent processing of gender and noise information, because significant orthogonal interference effects were found for classification along both dimensions. Our findings also revealed a processing asymmetry, whereby listeners were more affected by irrelevant gender variation in the noise classification task than by irrelevant noise variation in the gender classification task. The results from Experiment 1B demonstrated a similar asymmetry with respect to the magnitudes of the interference effect found for classification along each of the two dimensions. However, this interference appeared to be modulated by masking condition, since it was found in both the EM and NEM conditions for the talker dimension, but only in the EM condition for the noise dimension (although there was a trend toward interference in the NEM condition).
Additionally, Experiments 1A and 1B allowed us to examine the role of discriminability in the magnitudes of these interference effects, since talker identity is relatively more difficult to discriminate than gender. Cutler et al. (2011) posited a relationship between the sizes of orthogonal interference effects and how difficult it is to classify a given dimension (as indexed by reaction times), such that more difficult decisions will yield longer reaction times and, subsequently, greater interference effects. However, on the basis of the present experiments, the interference asymmetries found in Experiments 1A and 1B do not appear to be related to a discrepancy in classification difficulty between the noise and indexical dimensions. Indeed, asymmetric orthogonal interference was found in Experiment 1A, in which there was no significant difference in discriminability between the gender and noise dimensions in the control conditions. Furthermore, in Experiment 1B, a greater degree of orthogonal interference was found for the noise dimension, despite the fact that listeners were slower to make talker identity classifications than to make noise classifications in the control conditions. Thus, given these findings, inherent processing difficulty does not appear to be the primary factor influencing the directionality and magnitude of orthogonal interference effects.
The present results extend previous work examining the processing dependencies in speech perception (e.g., Mullennix & Pisoni, 1990). Prior work had reported that indexical and linguistic properties of the speech signal are perceptually integrated during speech processing. The findings of the present study suggest that listeners integrate indexical features of the speech signal with temporally concurrent auditory information—in this case, background noise. However, the processing asymmetries found for noise and both indexical speech properties (gender and talker identity) indicate that although context-specific information and indexical speech information are coupled, they are unevenly weighted during processing.
One possible explanation for the asymmetry between the speech and noise dimensions pertains to the relative salience of these dimensions. The Garner (1974) task involves selective attention, whereby listeners must attend to one dimension while ignoring the other. Tong, Francis, and Gandour (2008), examining the processing dependencies between consonants, vowels, and lexical tones in Mandarin Chinese, found that irrelevant segmental variation led to greater interference for lexical-tone classification than did irrelevant tone variation for segmental classification. Tong et al. posited that the information value of a given dimension could play a substantive role in selective attention, such that listeners may opt to attend to features that are more informative in resource-demanding situations, resulting in an asymmetry between dimensions in their susceptibilities to orthogonal interference. In their study, information value was determined by calculating the probability of the dimension occurring in a communicative system, a criterion by which segmental information is substantially more informative than tone information. In the context of the present findings, noise could be considered to provide less information generally for listeners, and thus to be less salient than linguistic information, making it more susceptible to interference from variation in a more-salient dimension. Indeed, from infancy, humans are purportedly biased toward listening to speech over nonspeech (Vouloumanos & Werker, 2007). Although we cannot quantify the relative saliences of speech versus noise by the same metric used by Tong et al., the relatively greater functional relevance of speech over noise should not be controversial. 
Thus, it could be that the observed processing asymmetries between dimensions that are intrinsic to the speech signal and extraneous background noise result from asymmetries in the information value of the dimensions being processed, with gender and talker features (speech-intrinsic dimensions) having greater information value than noise and pure-tone information (a speech-extrinsic, background dimension).
We note that the sizes of these interference effects are smaller than those in prior work with the Garner task using speech stimuli (e.g., Mullennix & Pisoni, 1990). This occurred despite the fact that the overall response times were relatively long (averages of 979–1081 ms in this study, as compared to 456–657 ms in Mullennix & Pisoni, 1990). One possible explanation is that low variability along the classification dimensions may have led to relatively smaller interference effects. For instance, Mullennix and Pisoni found that increasing the number of talkers for gender classification (up to 16) led to more robust orthogonal interference. In the present study we employed just two talkers and two noise types, which could have contributed to smaller interference effects. Moreover, it is also conceivable that dimensions intrinsic to the speech signal are more robustly perceptually integrated, by virtue of the fact that cues to classification of speech-intrinsic dimensions may be co-present within the same signal. For example, upon hearing the word “pill,” cues that identify the initial consonant can in part be used to identify the talker. However, with speech and noise dimensions, the speech signal does not hold any cues to help identify the noise type, which may result in relatively smaller orthogonal interference effects. With regard to the relatively long average reaction times in the present study, Mullennix and Pisoni noted that as the number of individual items increased, so too did the reaction times. If one considers the amount of item variability in the present experiment, the longer reaction times are perhaps not surprising, given that there were 96 different disyllabic words (relative to between two and 16 different monosyllabic words in Mullennix & Pisoni, 1990). Moreover, none of the 96 items were repeated within a given condition, unlike in Mullennix and Pisoni (1990), in which items were repeated between four and 32 times within a condition.
These factors likely contributed to the overall longer reaction times observed in the present work.
Although the present study suggests integral processing of speech-intrinsic and -extrinsic features, one could also consider an alternative explanation for these findings that appeals to low-level processing mechanisms (see Footnote 3). The information necessary to distinguish between two levels of a particular dimension (e.g., two female talkers) is in part carried by the frequency composition of the signal. In order to make the appropriate classifications, listeners must extract information about the relative frequency characteristics of the two talkers. It is conceivable that the interference of noise with gender and talker classification demonstrated in Experiment 1 may have arisen as a result of masking in the EM condition or of some spread of masking in the NEM condition. For example, in the EM condition, some of the indexical characteristics of the talkers may have been masked, since the noise overlapped with some parts of the spectra that carried the talker information. Moreover, it is possible that even in the NEM condition, despite the noise and the speech signal being spectrally separated, a spread of masking could have in part obscured the frequency composition necessary to make gender or talker classifications.
This explanation would likely predict a differential in gender or talker classification difficulty as a function of the type of noise presented concurrently with the speech signal, such that the presence of band-pass-filtered white noise should yield greater masking, and consequently slower response speeds, than the presence of a pure tone. However, a comparison of the reaction times within both the gender and talker control conditions (in which the noise background was consistent) found no significant differences between classifications made in the band-pass-filtered white-noise versus pure-tone conditions, or any significant interaction with energetic masking (EM or NEM), χ² < 1.32, p > .25, suggesting that listeners were not slower on the gender or talker classification tasks in the more heavily masked band-pass-filtered noise condition than in the single-frequency masking of the pure-tone condition, as would be predicted by a purely low-level explanation for the interference effects found in Experiment 1. It remains possible that the observed interference from irrelevant noise variation on gender and talker classification was due to masker uncertainty in the orthogonal condition, rather than to perceptual integration of the speech and noise signals, but this too would implicate a central (informational masking) rather than a peripheral (energetic masking) locus for the observed interference effect. It remains for future work to determine exactly how speech and background noise interfere with each other during classification along the noise and speech dimensions, respectively, but the presently available evidence seems to implicate some degree of higher-level processing involvement. In Experiment 2, we sought to provide further evidence of the integrality, or at least persistent association, of concurrently presented speech and noise using a task that taps into a later stage of processing—namely, the continuous recognition memory paradigm.