Gender in Voice Perception in Autism
- Cite this article as:
- Groen, W.B., van Orsouw, L., Zwiers, M. et al. J Autism Dev Disord (2008) 38: 1819. doi:10.1007/s10803-008-0572-8
Deficits in the perception of social stimuli may contribute to the characteristic impairments in social interaction in high functioning autism (HFA). Although the cortical processing of voice is abnormal in HFA, it is unclear whether this gives rise to impairments in the perception of voice gender. About 20 children with HFA and 20 matched controls were presented with voice fragments that were parametrically morphed in gender. No differences were found in the perception of gender between the two groups of participants, but response times differed significantly. The results suggest that the perception of voice gender is not impaired in HFA, which is consistent with behavioral findings of an unimpaired voice-based identification of age and identity by individuals with autism. The differences in response times suggest that individuals with HFA use different perceptual approaches from those used by typically developing individuals.
Although impairments in social interaction, verbal and non-verbal communication, and repetitive-restricted behavior are the more conspicuous defining characteristics of autism (American Psychiatric Association 1994), atypical perceptual abilities and responses to stimuli are other characteristic features (Gustafsson 1997; Happe 1999). Perceptual discriminative abilities in the auditory and visual domains have been found to be either enhanced or diminished in autism (Bertone et al. 2005; Samson et al. 2006). Many individuals with autism show aversive reactions to everyday sounds (Kern et al. 2006; Rosenhall et al. 1999) and to tactile (Cascio et al. 2008) and visual stimuli (Talay-Ongan and Wood 2000).
Knowledge of how stimuli are processed in autism is important for both theoretical and clinical reasons. For instance, insight into atypical perceptual features may provide a powerful theoretical framework for the perceptual impairments and their neural etiologies in autism (Bertone and Faubert 2006; Mottron et al. 2006). At a clinical level, social perception, such as perception of voices and faces, is an important channel for non-verbal communication (Boucher et al. 2000) since both voices and faces contain information about a person’s gender, age, and emotional states. Typically developing neonates respond preferentially to voices (Eisenberg 1976) and can recognize the affective content of vocal tones at the age of 6 months (Walker-Andrews 1988), underlining the developmental importance of intact perception of social stimuli. In contrast, children with autism show no preference for their mother’s voices as opposed to other speech stimuli (Klin 1991) and show no preference for speech sounds as opposed to electronic sounds (Kuhl et al. 2005).
Some authors have argued that the impairments of social perception in autism are an extension of an impaired Theory of Mind (ToM) in autism (Golan et al. 2006; Rutherford et al. 2002). The ToM theory states that people with autism have a selective difficulty in inferring the mental states of others, as measured by False Belief tasks (Baron-Cohen et al. 1985), the Reading the Mind in the Eyes Test (Baron-Cohen et al. 2001), and the Reading the Mind in the Voice Test (Rutherford et al. 2002). The latter test requires the affective content of vocalizations to be named, which is more difficult for people with autism. However, these tests do not assess perceptual capabilities but rather test socioemotional and mentalizing skills in autism.
In the visual domain, several studies have found that when individuals with autism process facial expressions (Critchley et al. 2000) or neutral faces (Pierce et al. 2001; Schultz et al. 2000), cortical areas outside the fusiform face area are activated, areas that are normally activated during the processing of non-face objects. In a behavioral study with familiar faces, children with autism were less able to identify familiar faces than their typically developing counterparts (Boucher et al. 1998). Their memory for neutral faces was found to be impaired as well (Hauck et al. 1998). Yet, these studies did not address perceptual abilities per se. That is, these findings may reflect different perceptual approaches rather than perceptual deficits. Support for the theory that individuals with autism have a different perceptual approach comes from the finding that when children with autism look at familiar faces, they pay attention to facial features different from those looked at by typically developing children (Langdell 1978). Moreover, the ability of children with HFA to recognize faces is affected less by face inversion than it is in controls (Hobson et al. 1988). This suggests that faces are processed analytically in autism rather than holistically, as is the case in typically developing children.
While less research attention has been paid to the processing of auditory social stimuli, the studies performed so far have confirmed the predictions of ToM that mental state inferences based on vocalizations are impaired in autism (Golan et al. 2006; Rutherford et al. 2002). Further, the cortical processing of neutral voices (Gervais et al. 2004) and complex voice-like sounds by individuals with autism (Boddaert et al. 2003, 2004) was found to occur outside the superior temporal sulcus area, which is the voice-selective area in normal individuals. In contrast, non-vocal sounds were processed identically in individuals with autism and controls. Thus, the pattern of findings for the cortical processing of voices is remarkably similar to that for the cortical processing of faces in autism. Yet, behavioral studies have not provided clear evidence of an impaired perception of auditory social stimuli that extends beyond mental state related impairments. As with the identification of familiar faces, children with autism are less able than controls to recognize familiar voices (Boucher et al. 1998). Yet, it is not clear whether these differences reflect perceptual-discriminatory impairments or post-sensory high-level processes. Evidence suggesting that different high-level processes are activated in autism comes from research showing that the listening preferences of infants with autism tend to be non-socially directed (Klin 1991; Kuhl et al. 2005). Moreover, children with autism fail to orient to naturally occurring social stimuli, including verbal and non-verbal stimuli (Dawson et al. 1998).
It is not clear to what extent the abnormal cortical processing of voices reflects perceptual impairments, such as gender identification. In the visual domain, gender perception is affected in autism. In a paradigm that required matching videotaped sequences to photographs of men and women, individuals with autism were found to have difficulty identifying a person’s gender from their face (Hobson 1987). In a more direct paradigm, children with autism had greater difficulty identifying the gender of faces in silent movie fragments than controls (Giovannelli 2006). Yet, in the auditory domain, impairments in social perception are mainly due to the inability to recognize emotion in voices (Golan et al. 2006; Rutherford et al. 2002).
The aim of the current study was to investigate whether the abnormal cortical processing of voices in HFA results in an impaired ability to identify the gender of speakers from their voices. Therefore, we designed an auditory discrimination task in which voices were parametrically altered in gender, such that female voices gradually changed to male voices and vice versa. This approach would be very sensitive for detecting differences in the perception of gender, since the parametric manipulation avoids ceiling effects that might arise from using just two categories of natural voices (i.e. male or female) without gradual overlap. We presumed that differences in the perceived gender of a voice between children with autism and controls would reflect perceptual-discriminatory capabilities. Furthermore, we recorded response times and presumed that differences in response times would reflect the underlying processes: that is, we presumed that longer response times would reflect greater task difficulty. Specifically, longer response times for the control group would imply that the task itself is more difficult, while longer reaction times for the HFA group would imply that the participants with HFA find the task more difficult.
Controls (± SD) | Autism (± SD) | Statistic (t or χ2)
13.7 ± 1.3 | 14.1 ± 1.8 |
63.1 ± 42.0 | 62.2 ± 43.9 |
102.5 ± 11.8 | 99.6 ± 17.9 |
102.5 ± 10.1 | 101.8 ± 19.2 |
102.7 ± 16.1 | 97.0 ± 15.2 |
The participants with HFA were recruited from referrals to the outpatient unit of Karakter Child and Adolescent Psychiatry University Center Nijmegen. The clinical diagnosis of autism was established according to the DSM-IV criteria for autistic disorder (American Psychiatric Association 1994) on the basis of a series of clinical assessments which included a detailed developmental history, clinical observation, and medical work-up by a child psychiatrist, and cognitive testing by a clinical child psychologist. Clinical diagnoses were confirmed with the Autism Diagnostic Interview—Revised (Lord et al. 1994), as assessed by a clinical psychologist trained to research standards who had not been involved in the diagnostic process. Exclusion criteria were any general medical condition affecting brain function, neurological disorders, and substance abuse.
Control participants were recruited from local schools. To exclude psychiatric disorders or learning problems, CBCL and TRF questionnaires (Achenbach 1991) were completed by the parents/caretakers and school teachers. None of the control participants had scores on the CBCL and TRF in the clinical range. The study was approved by the Medical Ethical Committee (Commissie Mensgebonden Onderzoek Arnhem Nijmegen). Informed consent was obtained from all participants and their parents.
The second author administered the voice gender perception protocol and performed audiometric screening in one 45-min session. Participants were tested individually. In the perception protocol, sound fragments consisting of single words were presented in a sound-shielded room using the stimulus delivery software package Presentation on a personal computer (Dell 810). A closed circumaural headphone (Sennheiser EH250) delivered the sounds at a fixed normal speech volume of approximately 60 dB. Participants were instructed to listen to the voice fragments and to choose, by pushing a button, whether the fragment was of a male or female voice. Participants were instructed to react as quickly and accurately as they could. Response times and the psychometric function of gender classification were recorded online.
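The psychometric function mentioned above can be illustrated with a minimal sketch. The source does not describe how the function was computed, so the response coding (`"male"` / `"female"`), the function names, and the linear interpolation of the 50% crossover point are all assumptions made for illustration only.

```python
from collections import defaultdict


def psychometric_function(trials):
    """Proportion of 'female' responses per morphing step.

    trials: iterable of (morph_step, response) pairs, where response is
    'male' or 'female' (hypothetical coding, not taken from the study).
    """
    counts = defaultdict(lambda: [0, 0])  # step -> [female responses, total]
    for step, response in trials:
        counts[step][1] += 1
        if response == "female":
            counts[step][0] += 1
    return {step: f / n for step, (f, n) in sorted(counts.items())}


def crossover_step(curve, level=0.5):
    """Step at which the curve first crosses `level`, by linear interpolation.

    Returns None if the curve never crosses the level.
    """
    steps = sorted(curve)
    for lo, hi in zip(steps, steps[1:]):
        p_lo, p_hi = curve[lo], curve[hi]
        if (p_lo - level) * (p_hi - level) <= 0 and p_lo != p_hi:
            return lo + (level - p_lo) * (hi - lo) / (p_hi - p_lo)
    return None
```

Under these assumptions, comparing the crossover step and the overall shape of the curve between groups is one way to express "no difference in perceived gender" numerically.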
Since voice-based gender inferences are usually unambiguous, ceiling effects of natural voice classification were anticipated. Therefore, the acoustic characteristics of the voice fragments were parametrically manipulated to alter the encapsulated gender information using the software package Praat (Boersma, P. and Weenink, D. Praat: doing phonetics by computer. Version 4.4.12 www.praat.org). Perception of gender in human voices is based on two main characteristics: median pitch and formants. The median pitch is predominantly determined by the length of the vocal cords, such that the longer vocal cords of men give rise to lower sounds. The resonant frequencies, or formants, are mainly determined by the size and shape of the vocal tract, including the tongue, pharynx, and laryngeal, oral and nasal cavities. The smaller vocal tract in women yields a different distribution of formants, making it possible to correctly classify a speaker’s gender even when the median pitch is atypical, for example, a man with a high voice or a woman with a low voice.
To create voice fragments that gradually changed from masculine to feminine and vice versa, single word speech fragments were taken from radio plays and transformed into 10 successive categories by shifting the formant ratio and median pitch in equal amounts. Male voices were converted into feminine voices up to a maximum of 1.2 formant-shift-ratio and +250 Hz median-pitch-shift, and female voices were converted into masculine voices up to a maximum of 1/1.2 formant-shift-ratio and −140 Hz median-pitch-shift. Only neutral non-emotional single word speech fragments were selected. The speech fragments had an average duration of 1.5 s, with a 2-s pause between subsequent fragments. All voice fragments were played in random order so that the preceding voice fragment was uninformative for subsequent gender judgments. In total, 400 voice fragments were used, with 40 fragments being played for each morphing category: 20 originally male and 20 originally female fragments. The transformed fragments were tested among 8 psychology students to ensure that the transformed male fragments indeed sounded feminine and vice versa. The transformed male voices were found to sound feminine and vice versa, but as the transformation increased further, the voices tended to sound more computer-like and less human. The more computer-like sound quality likely reflects artifacts that arise from the effects of phase incoherence, unnatural phase dispersion, and high spectral variance (Hui Ye Young 2004).
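The morphing schedule described above can be sketched as a short calculation. This is an illustration under stated assumptions, not the authors' Praat procedure: it assumes that step 10 corresponds to the maximum shift, that steps are linearly spaced in both formant ratio and pitch shift, and that no unmodified step is included. The function name `morph_schedule` is hypothetical.

```python
def morph_schedule(max_formant_ratio, max_pitch_shift_hz, n_steps=10):
    """Return (step, formant_ratio, pitch_shift_hz) for each morphing step.

    Assumption: equal linear increments from the unmodified voice
    (ratio 1.0, shift 0 Hz) up to the stated maximum at the last step.
    """
    schedule = []
    for step in range(1, n_steps + 1):
        fraction = step / n_steps
        ratio = 1.0 + fraction * (max_formant_ratio - 1.0)
        shift = fraction * max_pitch_shift_hz
        schedule.append((step, ratio, shift))
    return schedule


# Maxima taken from the text: 1.2 and +250 Hz for male voices,
# 1/1.2 and -140 Hz for female voices.
male_to_female = morph_schedule(1.2, 250.0)
female_to_male = morph_schedule(1 / 1.2, -140.0)

# Trial count stated in the text: 10 categories x 40 fragments each.
total_fragments = 10 * (20 + 20)
```

With these assumptions, each male-to-female step adds 25 Hz and 0.02 to the formant ratio, and each female-to-male step subtracts 14 Hz.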
This study focused on two outcome parameters: ‘accuracy of gender perception’ and ‘response time for gender perception’. These two dependent variables were combined into one multivariate analysis of variance (MANOVA) to control the alpha error. Independent variables were Participant group as a between-subject variable and Manipulation and Gender as within-subject variables. Manipulation consisted of the 10 increasing steps in which voices typical for one gender were transformed to the other, while Gender represented the transformation of either originally masculine or originally feminine voices. The factor Measure represented the two dependent variables ‘accuracy of gender perception’ and ‘response time for gender perception’. SPSS for Windows (Release 14.0) was used for statistical analysis, and significance tests were two-tailed and evaluated at an alpha level of 0.05.
Summary of doubly MANOVA table (column: degrees of freedom). Effects listed: Manipulation × Gender; Manipulation × Gender × Participant group; Manipulation × Gender × Measure; Manipulation × Gender × Measure × Participant group.
Summary of MANOVA tables (column: degrees of freedom). Effect listed in each table: Manipulation × Participant group. Tables: accuracy of gender perception, male to female voice; response time of gender perception, male to female voice; accuracy of gender perception, female to male voice; response time of gender perception, female to male voice.
In the current study we investigated the auditory social perceptual capabilities of individuals with HFA and age, IQ, and gender-matched typically developing controls, using voice fragments that were parametrically manipulated to change the speaker’s gender. Although cortical voice processing has been found to be abnormal in autism (Gervais et al. 2004), it was not clear whether this reflected an impaired ability to perceive the social characteristics of voices in autism. In our voice gender paradigm, we found no differences in voice gender perception between children and adolescents with HFA and typically developing children and adolescents. Since we used a sensitive parametric study design to avoid ceiling effects, these negative findings indicate that individuals with HFA have an intact ability to discern the gender of a voice. This suggests that the impairments of auditory social perception shown by these individuals are confined to mentalization/emotion related impairments as predicted by impaired ToM in HFA. Extraction of gender, identity, and age information from voices is not impaired in HFA. Rutherford et al. (2002) found that people with autism could adequately infer speakers’ age from vocalizations, although they did have difficulties perceiving the affective content. Boucher and colleagues (2000) reported a comparable ability to discriminate unfamiliar voices between participants with autism and participants without autism. We furthermore found significant differences between the participant groups in the response times for the transformed male voices. While the response time increased linearly with increasing male voice manipulation in the controls, the response time curve of the HFA group resembled a parabola, possibly indicating that different higher-level processes were used to perform the perceptual task.
The involvement of different higher level processes during the performance of social perceptual tasks in autism has been reported, mostly related to directing attention to socially relevant clues (Dawson et al. 1998; Pierce et al. 1997) and analytic or piecemeal rather than holistic processing of social stimuli (Pelphrey et al. 2002) (for a review see Jemel et al. 2006). Thus, people with HFA may use a different, less-socially directed, perceptual approach even though the perception of social stimuli per se is not affected.
Gervais and colleagues proposed that an abnormal processing of voices might be one of the factors underlying the social anomalies in autism because (1) voices provide relevant social information about others, and (2) they found abnormal cortical activation in the voice selective superior temporal sulcus (STS) in autism for voice sounds with neutral affect compared to environmental sounds (Gervais et al. 2004). The STS is part of the hierarchically organized auditory system and is thought to be specialized for extracting auditory object features, such as speaker-related clues, and for transmission of this information to other areas for multimodal integration (Belin et al. 2000). The problems with extracting social information from vocalizations in autism seem to be confined to the perception of affective content, while gender, age, and identity perception seem unimpaired. Since Gervais et al. used voice fragments with neutral affect, it seems unlikely that the cortical processing abnormalities observed reflected an impaired perception of affect in autism. Then, how can the discrepancy between the cortical perceptual pattern and the behavioral perceptual pattern in HFA be explained? First, in general, cortical processing is not equivalent to behavioral performance in a one-to-one manner, as exemplified by the fact that children with a hemispherectomy in early life may show a remarkable degree of sensorimotor function (Holloway et al. 2000). Second, cortical activation may be less strongly correlated with behavioral performance in individuals with autism than in typically developing individuals because different perceptual approaches may activate other cortical areas rather than give rise to perceptual deficits per se (Jemel et al. 2006). 
Evidence for this assumption comes from data on face processing in autism, in which the perceptual approach was studied using partially covered photographs of faces (Joseph and Tanaka 2003), inverted faces (Teunisse and de Gelder 2003) and infrared eye-trackers (Klin et al. 2002; Pelphrey et al. 2002). These studies support the idea that people with autism have a locally oriented perception to facial components and utilize different scan paths that focus more on non-relevant features, such as ears or hair, and on lower regions of the face than controls, while perceptual abilities need not be impaired (Jemel et al. 2006). Further evidence for the idea that perceptual approaches mediate abnormal cortical activation in autism comes from the finding that activation of the fusiform gyrus is correlated with the amount of time spent fixating on the eyes of face stimuli in an fMRI task (Dalton et al. 2005).
The results of the current study, in which different response times for the transformed male voices were found, are consistent with individuals with HFA having a different perceptual approach from typically developing individuals. When performing a social discrimination task, participants with autism were equally able to identify the gender of voice fragments and the response time curve seemed to be a function of task difficulty: fast response times at both ends of the psychometric curve when gender was unambiguous and slower response times halfway along the curve when gender was at its changing point and thus most ambiguous. In contrast, in the control group, the response time seemed to be a function of voice manipulation and increased as the naturalness of the voice fragments decreased.
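The contrast between a linearly increasing response time curve and a parabola-shaped one can be made concrete with linear and quadratic trend coefficients. The sketch below is not the analysis reported in this study (which used MANOVA in SPSS). It assumes equally spaced morphing steps and one mean response time per step, and the function name `trend_coefficients` is hypothetical.

```python
def trend_coefficients(mean_rt):
    """Linear and quadratic trend coefficients for equally spaced steps.

    mean_rt: mean response time per morphing step, in step order.
    Returns (linear, quadratic). A positive linear coefficient indicates
    response times rising with manipulation; a negative quadratic
    coefficient indicates an inverted-U curve (slowest in the middle).
    """
    n = len(mean_rt)
    if n < 3:
        raise ValueError("need at least three steps")
    centre = (n + 1) / 2
    # Centred step values act as the linear contrast weights.
    linear = [step - centre for step in range(1, n + 1)]
    # Squared centred values, re-centred to sum to zero, act as the
    # quadratic weights; they are orthogonal to the linear weights.
    mean_sq = sum(w * w for w in linear) / n
    quadratic = [w * w - mean_sq for w in linear]

    def coefficient(weights):
        return sum(w * y for w, y in zip(weights, mean_rt)) / sum(w * w for w in weights)

    return coefficient(linear), coefficient(quadratic)


# Made-up illustrative curves in milliseconds, NOT data from the study.
rising = [600 + 30 * step for step in range(1, 11)]
inverted_u = [900 - 10 * (step - 5.5) ** 2 for step in range(1, 11)]
```

For the made-up `rising` curve the linear coefficient dominates and the quadratic one is near zero, whereas for `inverted_u` the pattern is reversed, which mirrors the qualitative difference described for the control and HFA groups.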
In the present study, some limitations have to be taken into account. First, the response times for the transformed female voices were more variable than those for male voices in both groups of participants. This could be due to the nature of the stimuli, that is, morphing male to female voices gives a smoother transition than morphing female to male voices. Indeed, there are acoustic differences between male and female voices that could give rise to different ‘morphing characteristics’ (Mendoza et al. 1996). The spectral tilt of female voices is lower than that of male voices as a consequence of greater levels of aspiration noise, which causes the female voice to have a more “breathy” quality than the male voice (Mendoza et al. 1996). Furthermore, male voices show less interspeaker variation in spectral tilt, aspiration noise, and first-formant bandwidth, probably as a consequence of more complete glottal closure in males, leading to less energy loss at the glottis (Hanson and Chuang 1999). Thus, the greater variation in acoustic parameters in female voices may make it more difficult to transform female voices into male voices, which are characterized by a relative absence of spectral tilt, aspiration noise, and first-formant bandwidth variation. The greater variation in response time in both participant groups for the transformed female voice fragments (as opposed to the male voice fragments) may thus be a reflection of the greater variety of acoustic parameters in female voices. Second, future studies might incorporate additional variables, such as measures of Theory of Mind, to examine whether the different perceptual pattern observed in the current study can be explained by a difference in mentalizing ability between the two groups of participants. Third, possible differences in attention between the two groups of participants could give rise to different response patterns.
Yet, potential differences in attention between the two groups of participants in the current study would be evident as differences in response time. Since the average response times did not differ between the groups, overall differences in attention are not likely to have influenced the results.
To conclude, the difference in response times between participants with HFA and typically developing participants could be interpreted as a consequence of different perceptual processes in HFA analogous to the different perceptual processes involved in face recognition in these individuals, in combination with the absence of impairments in extracting social information from voices. The concept that individuals with HFA have intact perceptual capabilities but different perceptual processes has implications for psychological models of HFA, since research should not selectively focus on whether people with autism are able to perceive social stimuli, but rather focus on whether people with autism direct their attention toward relevant features of social stimuli in real-life situations.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.