To hear or not to hear: Voice processing under visual load
Adaptation to female voices causes subsequent voices to be perceived as more male, and vice versa. This contrastive aftereffect disappears under spatial inattention to adaptors, suggesting that voices are not encoded automatically. According to Lavie, Hirst, de Fockert, and Viding (2004), the processing of task-irrelevant stimuli during selective attention depends on perceptual resources and working memory. Possibly due to their social significance, faces may be an exceptional domain: That is, task-irrelevant faces can escape perceptual load effects. Here we tested voice processing, to study whether voice gender aftereffects (VGAEs) depend on low or high perceptual (Exp. 1) or working memory (Exp. 2) load in a relevant visual task. Participants adapted to irrelevant voices while either searching digit displays for a target (Exp. 1) or recognizing studied digits (Exp. 2). We found that the VGAE was unaffected by perceptual load, indicating that task-irrelevant voices, like faces, can also escape perceptual-load effects. Intriguingly, the VGAE was increased under high memory load. Therefore, visual working memory load, but not general perceptual load, determines the processing of task-irrelevant voices.
Keywords: Voice, Gender, Aftereffects, Attention, Working memory, Perceptual load
Human voices are rich in social information about a speaker’s identity, age, or gender (Schweinberger, Kawahara, Simpson, Skuk, & Zäske, 2014). Listeners routinely extract such cues even from nonspeech utterances (Skuk & Schweinberger, 2013b) or previously unheard speech (Zäske, Volberg, Kovács, & Schweinberger, 2014). Humans are often exposed to voices while engaging in other tasks, such as reading a newspaper in a busy coffee shop. The challenge for our attentional system is to focus on the task at hand, while monitoring the environment for behaviorally relevant information. The question of whether and to what extent unattended voices are processed while we perform visual tasks is highly relevant for understanding both everyday voice perception and the distribution of attention between modalities.
Recent research on auditory adaptation suggested that exposure to nonlinguistic social cues in voices temporarily alters our perception of subsequent voices. For instance, prolonged listening to female voices causes androgynous test voices to sound more male, and vice versa (Schweinberger et al., 2008), suggesting contrastive coding of voice gender. Subsequent reports of voice aftereffects have revealed the neuronal codings of vocal age, identity, and affective information (Bestelmeyer, Rouger, DeBruine, & Belin, 2010; Skuk & Schweinberger, 2013a; Zäske, Schweinberger, & Kawahara, 2010; Zäske, Skuk, Kaufmann, & Schweinberger, 2013), in analogy to face aftereffects (reviewed in Webster & MacLeod, 2011). However, little is known about the role of attention in voice adaptation (but see Zäske, Fritz, & Schweinberger, 2013).
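To make the contrastive logic concrete, the following minimal sketch shows one plausible way to quantify a voice gender aftereffect: as the difference in the proportion of "male" classifications after female versus male adaptation. The function names and the toy responses are hypothetical illustrations, not the measure or the data of the present study.

```python
from statistics import mean


def proportion_male(responses):
    """Share of 'male' classifications among test-voice responses."""
    return mean(1.0 if r == "male" else 0.0 for r in responses)


def aftereffect(after_female_adaptation, after_male_adaptation):
    """Contrastive aftereffect: positive values mean that test voices
    sounded more male after female adaptors than after male adaptors."""
    return (proportion_male(after_female_adaptation)
            - proportion_male(after_male_adaptation))


# Made-up toy responses for illustration only (not data from the study).
after_female = ["male", "male", "male", "female"]
after_male = ["male", "female", "female", "female"]

effect = aftereffect(after_female, after_male)
print(f"Illustrative aftereffect: {effect:.2f}")
```

Under this convention, a value above zero corresponds to the contrastive pattern described above, and a value near zero to no aftereffect.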
Adaptation has traditionally been conceived of as purely stimulus-driven. Accordingly, linguistic aftereffects were shown to be independent of focused attention to adaptors (Baart & Vroomen, 2010; Mullennix, 1986; Samuel & Kat, 1998; Sussman, 1993). At variance with these findings, the voice gender aftereffect (VGAE; Schweinberger et al., 2008) is abolished when spatial attention is diverted from adaptor voices (Zäske, Fritz, & Schweinberger, 2013). In Zäske, Fritz, and Schweinberger’s study, participants simultaneously adapted to male or female voices in one ear and to gender-neutral (androgynous) voices in the other ear. They attended either the left or the right ear and classified voice gender (Exp. 1) or syllable (Exp. 2) of the adaptor voices. Irrespective of the task during adaptation, gender classifications of the subsequent test voices indicated a VGAE only when gender-specific (male or female) adaptors, but not when androgynous adaptors, had been spatially attended. Although this suggests that voice gender is not processed automatically during selective attention to another voice, it is unclear whether the VGAE is also modulated by selective attention to other stimuli, and to visual stimuli in particular.
Here we explored this question by manipulating visual selective attention during voice adaptation according to load theory (Lavie, Hirst, de Fockert, & Viding, 2004). This theory holds that the extent to which distractors are processed depends on both the availability of perceptual resources and working memory. Specifically, due to limits in attentional capacity, high perceptual load of a relevant task impairs distractor processing by leaving little capacity that automatically spills over to distractors. By contrast, high working memory load promotes distractor processing, by disrupting working memory control over target prioritization.
This account has received substantial support from studies on vision (reviewed in de Fockert, 2013; Lavie, 2005) and audition (Alain & Izenberg, 2003; Conway, Cowan, & Bunting, 2001; Dalton, Santangelo, & Spence, 2009; Fairnie, Moore, & Remington, 2016; Muller-Gass & Schröger, 2007; but see Murphy, Fraenkel, & Dalton, 2013), and from studies probing load theory for crossmodal attention (Berman & Colby, 2002; Brand-D’Abrescia & Lavie, 2008; Jacoby, Hall, & Mattingley, 2012; Macdonald & Lavie, 2011; Molloy, Griffiths, Chait, & Lavie, 2015; Raveh & Lavie, 2015; but see Tellinghuisen & Nowak, 2003). Interestingly, and at variance with load theory, several studies have suggested that faces present a special case, in the sense that they may recruit a domain-specific capacity-limited system (Neumann, Mohamed, & Schweinberger, 2011; Neumann & Schweinberger, 2008, 2009). Here we considered the possibility that voices are also “special” (Belin, Bestelmeyer, Latinus, & Watson, 2011) and might be relatively immune to perceptual load when unattended, similar to faces (Neumann & Schweinberger, 2008). At present, it is unclear whether a similar domain-specific attentional system exists for voices.
Of relevance for crossmodal situations, Moradi, Koch, and Shimojo (2005) showed that face processing is unaffected by auditory working memory load. Specifically, the magnitude of the face identity aftereffect was unaffected by the load of an auditory digit memory task in that study. Similarly, auditory aftereffects of adaptation to linguistic aspects of speech seem unaltered by visual task demands. For instance, Samuel and Kat (1998) reported that auditory aftereffects following adaptation to a phonetic [ba]–[wa] continuum were unaffected by visual attention to arithmetic or rhyming tasks, suggesting that speech adaptation is an automatic low-level process. Furthermore, Baart and Vroomen (2010) found aftereffects for a [b]–[d] continuum, irrespective of visuospatial or verbal working memory load during audiovisual adaptation. However, it is unclear whether nonlinguistic voice aftereffects would be susceptible to different visual task demands.
Here we tested whether irrelevant adaptor voices are processed despite visual selective attention to a perceptual (Exp. 1) or a working memory (Exp. 2) task. Previous findings suggested that spatial attention to androgynous voices abolishes the VGAE induced by unattended gender-specific voices (Zäske, Fritz, & Schweinberger, 2013). It is possible that an unattended voice is filtered in the presence of another attended voice, but would be processed in a standard perceptual-load task with alphanumeric character targets (similar to faces; Neumann & Schweinberger, 2008). Alternatively, and according to load theory, high perceptual load should leave relatively less attentional capacity to spill over to an ignored adaptor voice, thereby impairing its processing, and hence the VGAE (Lavie et al., 2004). Conversely, high working memory load should increase the VGAE, because it interferes with the maintenance of target prioritization. As a result, irrelevant adaptor voices should be increasingly processed.
Thirty-two student participants (mean age = 21.9 years; range: 18–35; 16 female, four left-handed) contributed data. All of the participants were native German speakers, and none was familiar with any of the voices or reported hearing problems. All participants gave written informed consent and received course credit and an additional performance-based incentive of €1 or €2. The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Friedrich Schiller University.
The voice stimuli were audio recordings from five female and five male native German speakers (20–27 years of age) uttering the four vowel–consonant–vowel (VCV) syllables /aba/, /aga/, /ibi/, and /igi/. Voices were recorded by means of a Sennheiser MD 421-II microphone, a CEntrance MicPortPro preamplifier, and a SoundMax HD audio soundcard (16-bit resolution, sampling rate of 44.1 kHz). Recordings were normalized for average amplitude and adjusted to a uniform duration of 886 ms (including 100 ms silence at the beginning and end) using Adobe Audition 1.5 software.
These preprocessed voices were then set into five pairs of female and male voices and were entered into a morphing algorithm (Kawahara & Matsui, 2003). The pairings were matched for similarity in intensity patterns in the spectrogram, to increase morph quality. Four pairs were used for the experimental trials, and a fifth pair was used for practice trials only.
From each morphed pair, three stimuli were chosen as the androgynous test stimuli, corresponding to 40 %/60 %, 50 %/50 %, and 60 %/40 % female/male proportions. Thus, a total of 48 different test stimuli—that is, from each of the four VCV syllables and three morph levels (MLs) for each of the four female–male pairs—were available for the experimental trials. The two types of adaptor stimuli were VCV syllables spoken by the male (0 %/100 %) and female (100 %/0 %) voices from the same pairs as above.
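The count of test stimuli follows from fully crossing speaker pairs, syllables, and morph levels. The sketch below simply enumerates that design; the dictionary keys and variable names are assumptions made for illustration.

```python
from itertools import product

SYLLABLES = ["aba", "aga", "ibi", "igi"]
# Morph levels as (female %, male %) proportions.
MORPH_LEVELS = [(40, 60), (50, 50), (60, 40)]
# The four speaker pairs used in experimental trials (labels are arbitrary).
SPEAKER_PAIRS = [1, 2, 3, 4]

test_stimuli = [
    {"pair": pair, "syllable": syllable, "female_pct": f, "male_pct": m}
    for pair, syllable, (f, m) in product(SPEAKER_PAIRS, SYLLABLES, MORPH_LEVELS)
]

print(len(test_stimuli))  # 4 pairs x 4 syllables x 3 morph levels
```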
Participants were tested individually in a dimly lit, sound-attenuated booth. Instructions and visual stimuli were delivered via a computer screen at a viewing distance of 65 cm. The visual stimuli were white digits presented on a black background. The digit arrays subtended a visual angle of 5.73° × 0.71°.
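For readers who wish to reproduce the display, the reported visual angles can be converted to physical on-screen sizes with the standard relation size = 2 × distance × tan(angle / 2). The sketch below applies it to the values given above; the function name is an assumption, and the resulting sizes (roughly 6.5 cm × 0.8 cm) are derived here rather than reported in the study.

```python
import math


def size_from_visual_angle(angle_deg, distance_cm):
    """Physical extent (cm) that subtends angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


VIEWING_DISTANCE_CM = 65.0

width_cm = size_from_visual_angle(5.73, VIEWING_DISTANCE_CM)
height_cm = size_from_visual_angle(0.71, VIEWING_DISTANCE_CM)

print(f"Digit array: {width_cm:.2f} cm x {height_cm:.2f} cm")
```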
Voice stimuli were presented in mono via Sennheiser HD 212Pro headphones with an approximate peak intensity of 60 dB(A), as determined with a Brüel & Kjær Precision Sound Level Meter, Type 2206. The experimenter did not talk to the participants during the session, to avoid spurious adaptation effects. To keep participants motivated and focused on the selective attention task, they were told that they could receive an additional bonus of €1 or €2, contingent on their accuracy and speed in the visual task.
Trials started with a red fixation cross for 1,000 ms, followed by three identical voice adaptors for 886 ms each (including 100 ms of pre- and poststimulus silence). With the onset of the first adaptor, the fixation cross was replaced with a display of six horizontally arranged digits. Using the “d” and “l” keys of a computer keyboard (German layout), participants indicated as quickly and as accurately as possible whether a 5 was among the digits. There was a 60 % probability that a 5 was present. As soon as a response had been entered, a new display of digits appeared, and so on until the offset of the third voice adaptor. Thus, the number of test displays depended on the individual speed of participants. This was done to ensure constant attention to the digits. Following a black screen (300 ms) and a green question mark (2,000 ms), participants classified the test voice (886 ms) as quickly and as accurately as possible according to gender (female/male). Measured from voice onset, they had 2,886 ms to enter their response via the “d” and “l” keys before the question mark was replaced with a black screen (500 ms). If responses were too slow, the words “Please respond faster” appeared instead (500 ms). Thus, each trial lasted 9,444 ms.
The order of the adaptation blocks and load blocks in both experiments was counterbalanced across male and female participants. Morphed test voices (MLs 40 %/60 %, 50 %/50 %, and 60 %/40 %) were presented according to the method of constant stimuli. For a given trial, the adaptor and test voices always uttered VCV syllables that differed with respect to both vowels and consonants (e.g., /aba/ vs. /igi/). Also, the adaptor and test voices always originated from different speaker pairs. For instance, if a test voice was a morph from speaker pair #4, the preceding adaptor voices originated from speaker pairs #1, #2, or #3. This was done to ensure that any adaptation effects would indeed reflect high-level adaptation to voice quality, rather than low-level stimulus-dependent effects. There were 24 trials for each experimental condition (2 adaptation conditions × 2 load conditions). The nonexperimental factors Adaptor Syllable and Speaker Pair, as well as Test Syllable, Speaker Pair, and ML, were balanced such that all factor levels were equally often represented within each experimental block. After 24 trials, participants received written feedback on their performance in the visual selective-attention task (i.e., number of correctly classified displays and mean reaction time [RT]).
Prior to the experiment, the trial procedure was practiced stepwise with a fifth speaker pair not used in the main experiment. In a first step (four trials), participants practiced the selective attention task without subsequent test voices. In a second step (ten trials), they were acquainted with the complete trial procedure. Overall, Experiment 1 lasted ~25 min.
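The two pairing constraints on adaptor and test voices can be expressed as a simple filter. The following sketch is an illustration under the assumption that syllables are written as three-character vowel, consonant, vowel strings; the function names are hypothetical and do not describe the software actually used.

```python
SYLLABLES = ["aba", "aga", "ibi", "igi"]
SPEAKER_PAIRS = [1, 2, 3, 4]


def differs_in_vowel_and_consonant(a, b):
    """VCV strings: index 0 holds the vowel, index 1 the consonant."""
    return a[0] != b[0] and a[1] != b[1]


def valid_adaptors(test_syllable, test_pair):
    """Adaptor (syllable, speaker pair) options permitted for a test voice."""
    return [
        (syllable, pair)
        for syllable in SYLLABLES
        for pair in SPEAKER_PAIRS
        if differs_in_vowel_and_consonant(syllable, test_syllable)
        and pair != test_pair
    ]


# Example from the text: a test morph of /igi/ from speaker pair #4.
options = valid_adaptors("igi", 4)
print(options)

# 24 trials in each cell of the 2 (adaptation) x 2 (load) design.
total_trials = 24 * 2 * 2
```

With four syllables built from two vowels and two consonants, each test syllable leaves exactly one admissible adaptor syllable, so the remaining freedom lies in the choice among the three other speaker pairs.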
Validation of the load manipulation
Table: Mean performance (M) and standard deviations (SD) in the selective attention task for low and high visual perceptual load, depicted separately for the mean number of displays (correct displays/total number), accuracy, and reaction times (RTs) for correctly classified displays. Column headers: Level of Load; Correct RT (ms).
Voice gender aftereffects
The VGAE was unaffected by the level of visual perceptual load, at variance with load theory (Lavie et al., 2004). According to this theory, high relative to low perceptual load should decrease the processing of task-irrelevant stimuli. For instance, inattentional deafness to simple tone stimuli can be induced by loading visual perceptual task demands (Macdonald & Lavie, 2011; Molloy et al., 2015; Raveh & Lavie, 2015). Accordingly, one might expect larger voice adaptation under low (vs. high) perceptual load in the present study, provided that attentional resources are shared by the target and distractor stimuli. However, resources may not always be shared between stimuli when the targets and distractors belong to different modalities (e.g., Allport, Antonis, & Reynolds, 1972; Duncan, Martens, & Ward, 1997; Keitel, Maess, Schröger, & Müller, 2013) or when target processing and distractor processing are subject to different domain-specific capacity limits. We prefer the latter explanation, because it is more in line with the finding that voice adaptors are filtered out in the presence of another voice (Zäske, Fritz, & Schweinberger, 2013). It also parallels reports that irrelevant face processing is reduced under high load when attending another target face, but not when attending other target objects, such as houses or hands (Neumann, Mohamed, & Schweinberger, 2011; Neumann & Schweinberger, 2009), or letter strings, as in the standard perceptual-load task (Neumann & Schweinberger, 2008). Importantly, the present results are therefore not necessarily inconsistent with studies showing effects of visual perceptual load on auditory processing (Macdonald & Lavie, 2011; Molloy et al., 2015; Raveh & Lavie, 2015), as these studies used simple tones rather than voices as the task-irrelevant stimuli.
Our findings are potentially related to evidence that the duration of visual motion aftereffects is also unaltered by auditory perceptual load (Rees et al., 2001), and to electrophysiological data that the mismatch negativity (MMN) to rare frequency or intensity changes of task-irrelevant tone pips is unaffected by the difficulty of a concurrent visual discrimination task (Muller-Gass, Stelmack, & Campbell, 2006). Future research will be needed to establish in more detail how attentional resources are allocated both between modalities and between specific stimulus domains within modalities. Specifically, we expect that systematic manipulation of different auditory target and distractor domains in the context of selective attention tasks will contribute more detailed information with respect to the existence of domain-specific attentional resources for certain kinds of auditory stimuli (e.g., human voices, similar to what has been proposed in the visual modality for faces).
The finding of a significant VGAE in the absence of attention to voice adaptors suggests that despite being irrelevant for the task at hand, voice gender was sufficiently processed for adaptation to occur. This is in line with the notion that audition is an “early-warning” system (Murphy et al., 2013) that constantly processes auditory input, independent of the attentional focus. However, this may seem at odds with our previous finding that the VGAE is abolished when gender-specific voice adaptors are ignored during dichotic adaptation in the presence of simultaneous androgynous adaptor voices in the attended ear (Zäske, Fritz, & Schweinberger, 2013). A possible explanation for this discrepancy may be the existence of voice-specific attentional resources, as discussed above. Accordingly, in Zäske, Fritz, and Schweinberger’s study, attention to an androgynous voice may have exhausted voice-specific resources, leaving little or no capacity for the processing of irrelevant gender-specific voice adaptors. The voice adaptors in the present study, by contrast, were presented along with task-relevant alphanumeric characters, which presumably spared voice-specific attentional resources, and thereby preserved processing of the voice adaptors (for a similar argument for faces, see Bindemann, Burton, & Jenkins, 2005; Neumann et al., 2011; Neumann & Schweinberger, 2008, 2009).
Note that the test stimuli were overall perceived as slightly more male than female. Although one may expect that morphed voices that are physically intermediate between original male and female speakers should on average be perceived equally often as male and female, deviations from 50 % male/female classifications are common in research using gender-morph continua, both with and without adaptation (e.g., Skuk & Schweinberger, 2014; Zäske, Fritz, & Schweinberger, 2013; Zäske, Schweinberger, Kaufmann, & Kawahara, 2009; Zäske, Skuk, et al., 2013). The present asymmetry could therefore reflect stronger aftereffects following female than following male voice adaptors, or may be the result of a general bias to perceive voices as “male.” The latter notion could be related to recent findings that the perception of gender-morphed voices can vary substantially between listeners and between speaker identities (Skuk, Dammann, & Schweinberger, 2015).
To assess whether the VGAE is susceptible to visual working memory load, we conducted a second experiment. In Experiment 2, the participants adapted to irrelevant voices while recognizing digits they had previously encountered. This was done to test whether high working memory load would increase the processing of adaptor voices (Lavie et al., 2004). Alternatively, the processing of adaptor voices could be unaffected by the load of a crossmodal working memory task, similar to findings in the face domain (Moradi et al., 2005).
Thirty-two student participants (mean age = 22.5 years; range: 18–31; 16 female, one left-handed) contributed data. All of the participants were native German speakers, and none was familiar with any of the voices or reported hearing problems. All participants gave written informed consent and received course credit and an additional performance-based incentive of €1 or €2. The study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of Friedrich Schiller University. None of the participants had taken part in Experiment 1.
The voice stimuli were identical to those in Experiment 1. The single digits presented in the visual working memory task subtended a visual angle of 0.44° × 0.71°.
The procedures were analogous to those of Experiment 1, with the exception of the visual memory task described below (also see Fig. 1). On each trial, participants performed a visual working memory task while hearing three irrelevant adaptor voices. Specifically, participants had to remember six consecutive digits (0–9) that preceded the three male or female voice adaptors. To manipulate working memory load, the digits were either identical (low load) or all different (high load). During voice adaptation, the participants then classified several consecutive test digits as “studied” or “novel.” Trials started with a red fixation cross for 1,000 ms, followed by alternating presentation of a study digit (500 ms) and a black screen (100 ms). After the sixth study digit, a 1,000-ms backward mask (#) announced the upcoming test. The onset of the first test digit coincided with the onset of the first of the three identical voice adaptors. Using the “d” and “l” keys, participants indicated as quickly and as accurately as possible whether or not a given test digit had just been presented. The digits were randomly generated such that studied digits appeared with a 60 % probability at test. As soon as a response was entered, a new test digit appeared, and so on until the offset of the third voice adaptor. Following the offset of the third voice adaptor, the trial procedure was identical to that of Experiment 1 (see Fig. 1), such that the adaptation and test phases had the same timing and duration (9,444 ms) in both experiments. Including the encoding phase of the working memory task (4,500 ms), the overall trial duration was 13,944 ms. Overall, Experiment 2 lasted ~30 min.
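The working memory load manipulation and the trial timing can be sketched as follows. This is a minimal illustration, not the original experimental program: the function names and the seeded random generator are assumptions, and the encoding-phase arithmetic assumes that no blank screen followed the sixth study digit, which is one reading that reproduces the reported 4,500 ms.

```python
import random

DIGITS = list(range(10))


def study_sequence(load, rng):
    """Six study digits: identical under low load, all different under high load."""
    if load == "low":
        return [rng.choice(DIGITS)] * 6
    if load == "high":
        return rng.sample(DIGITS, 6)
    raise ValueError("load must be 'low' or 'high'")


def draw_test_digit(studied, rng, p_studied=0.6):
    """Return (digit, was_studied); studied digits appear with probability 0.6."""
    studied_set = sorted(set(studied))
    if rng.random() < p_studied:
        return rng.choice(studied_set), True
    novel = [d for d in DIGITS if d not in studied_set]
    return rng.choice(novel), False


rng = random.Random(0)  # seeded only to make the sketch repeatable
low = study_sequence("low", rng)
high = study_sequence("high", rng)
probe, was_studied = draw_test_digit(high, rng)

# Timing (ms): six 500-ms digits, five 100-ms blanks (assumed), 1,000-ms mask.
ENCODING_MS = 6 * 500 + 5 * 100 + 1000
ADAPTATION_AND_TEST_MS = 9444  # as reported for both experiments
TRIAL_MS = ENCODING_MS + ADAPTATION_AND_TEST_MS
print(low, high, probe, was_studied, TRIAL_MS)
```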
Validation of the load manipulation
Table: Mean performance (M) and standard deviations (SD) in the selective attention task for low and high visual working memory load, depicted separately for the mean number of displays (correct displays/total number), accuracy, and reaction times (RTs) for correctly classified displays. Column headers: Level of Load; Correct RT (ms).
Voice gender aftereffects
We found significant VGAEs under both low- and high-load conditions, suggesting that despite being irrelevant for the task at hand, voice gender was sufficiently processed for adaptation to occur. Importantly, and in line with load theory (Lavie et al., 2004), the VGAE was increased under high visual working memory load (Exp. 2). Accordingly, high load disrupted stimulus-processing priorities, allowing task-irrelevant voice adaptors to be processed to a larger extent than under low load. Since executive control over task priorities is a high-level cognitive function, these results support the notion that voice adaptation occurs at higher-level processing stages (Schweinberger et al., 2008; Zäske, Fritz, & Schweinberger, 2013). These findings pose an interesting contrast to linguistic aftereffects of speech adaptation, which do not appear to depend on visual working memory and which proceed automatically (Baart & Vroomen, 2010; Samuel & Kat, 1998). How do these findings relate to reports that the face identity aftereffect is not susceptible to an auditory as opposed to a visual working memory task (Moradi et al., 2005)? A tentative explanation may be that intermodal attentional resources are asymmetrically distributed between vision and audition, causing visual working memory tasks to have a higher impact on auditory distractor processing than in the opposite direction. In this context, the possible role of phonological recoding of the visual stimuli for the present working memory task may deserve particular consideration.
Here we demonstrated that voice processing, as reflected in the voice gender aftereffect, is preserved despite selective attention to visual tasks during voice adaptation. Importantly, whereas the magnitude of the VGAE was increased under high relative to low working memory load (Exp. 2), in line with load theory (Lavie et al., 2004), the VGAE was completely unaffected by perceptual load (Exp. 1), at variance with load theory.
Taken together, the present results highlight limitations to the automaticity of voice processing (Zäske, Fritz, & Schweinberger, 2013), thereby pointing to an important difference from more “automatic” linguistic aftereffects (Baart & Vroomen, 2010; Samuel & Kat, 1998). Since the processing of unattended voices is enhanced by high visual working memory load, we suggest that voice adaptation occurs at higher-level processing stages, for which memory load effects would occur independently of target and distractor domains or modalities. By contrast, effects of perceptual load on voice processing depend on the domain of the target stimuli, and thus reflect domain-specific capacity limits. In conclusion, working memory, but not general perceptual capacities, determines the extent of voice processing during visual selective attention.
Note that in contrast to the present study, research on the linguistic aftereffects of speech has often used massed adaptation, with a relatively high number of adaptor stimuli followed by a complete series of test stimuli (e.g., Eimas & Corbit, 1973; Samuel & Kat, 1998). By contrast, we used only three adaptor stimuli preceding one test stimulus per trial. This was done in the tradition of previous studies on nonlinguistic adaptation, which indicated that a few adaptors are sufficient to elicit voice aftereffects for vocal gender, age, and identity (Schweinberger et al., 2008; Zäske, Schweinberger, & Kawahara, 2010; Zäske, Skuk, et al., 2013). Note also that our adaptor conditions were presented in blocks of 24 trials each, such that a much larger number (72 = 24 × 3) of adaptors of the same gender were interrupted only by the test stimuli. This design is more akin to one with “top-up” adaptors before each test stimulus, which is now also common in research on visual perceptual adaptation (e.g., Jenkins, Beaver, & Calder, 2006).
We thank Sascha Müller for assistance with the experimental programming, and Dana Schneider for inspiring discussions. We further thank three anonymous reviewers for helpful comments, and the Deutsche Forschungsgemeinschaft (DFG, ZA 745/1-2) and the Friedrich Schiller University (Amélie Mummendey Young Researchers’ Award) for financial support.
- Kawahara, H., & Matsui, H. (2003). Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing: Vol. I (pp. 256–259). Piscataway, NJ: IEEE Press.
- Mullennix, J. W. (1986). Attentional limitations in the perception of speech. Unpublished doctoral dissertation, State University of New York, Buffalo, NY.
- Schweinberger, S. R., Casper, C., Hauthal, N., Kaufmann, J. M., Kawahara, H., Kloth, N., . . . Zäske, R. (2008). Auditory adaptation in voice perception. Current Biology, 18, 684–688. doi:10.1016/j.cub.2008.04.015