Introduction

Autism spectrum disorder (ASD) is a common neurodevelopmental disorder1. According to the World Health Organization fact sheet of June 2021, about one in 160 children worldwide has ASD. This neurodevelopmental disorder is characterized by difficulty in social interaction, atypical patterns of behavior, and communication challenges2. In terms of communication difficulties, one of the earliest symptoms of ASD is language impairment3,4, which is manifested in different forms, from being completely nonverbal to having unusual prosody5,6,7.

Prosody is concerned with suprasegmental features of speech and refers to speech rhythm8 as well as affective, pragmatic and syntactic communicative functions9. Therefore, prosody may determine the way that language is perceived by listeners10. Given that cross-linguistic evidence shows early development of prosodic sensitivity to the ambient language in newborns11,12, it is important to investigate whether prosodic characteristics of the human voice can provide useful information about neurodevelopmental disorders such as ASD.

Acoustic characteristics expressing prosody are pitch (also known as fundamental frequency), duration and intensity13, together with pitch attributes such as pitch contour, pitch range and pitch variability. Pitch is related to the vibrational frequency of the vocal cords; the pitch contour shows how pitch varies over time and is related to the information content of the utterance13. The pitch range is defined as the distance between the maximum and minimum pitch values and determines the extent of variation that a speaker uses in a produced utterance. People with ASD show different pitch characteristics compared with typically developing (TD) individuals14. However, inconsistencies are observed between studies. For example, some people with ASD exhibit a wider pitch range, while others exhibit a narrower pitch range during speaking compared with TD individuals14. Furthermore, some studies showed that people with ASD had higher mean pitch and wider pitch range15, while according to other studies, the variation might not be significant16 or even different17. One possible factor behind the inconsistent results might be the experimental condition under which the voice is produced18. It was shown that emotional state affected the acoustic properties of speech19. Furthermore, the mental state during voice production, for example a stress condition, might affect the quality and characteristics of the produced voice20.
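The pitch descriptors defined above can be illustrated with a minimal sketch. It assumes that a pitch tracker has already produced an F0 contour in Hz and that unvoiced frames are coded as 0 or None; the function name and the sample contour are hypothetical and are not taken from any of the reviewed studies.

```python
from statistics import mean, stdev

def pitch_descriptors(f0_hz):
    """Summarise an F0 contour given in Hz.

    Assumption: unvoiced frames are coded as 0 or None and are ignored.
    """
    voiced = [f for f in f0_hz if f]  # drops None and 0
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    return {
        "mean_pitch": mean(voiced),               # mean F0
        "pitch_sd": stdev(voiced),                # sample standard deviation of F0
        "pitch_range": max(voiced) - min(voiced)  # maximum minus minimum F0
    }

# Hypothetical contour: two unvoiced frames at the edges, one in the middle
contour = [0, 210.0, 220.0, 235.0, None, 228.0, 205.0, 0]
descriptors = pitch_descriptors(contour)
print(descriptors)
```

Individual studies may express these measures in semitones or on other perceptual scales rather than in Hz, so the units here are only one possible choice.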

Duration, the time spent producing utterances, also seems to differ between ASD and TD groups, in that people with ASD have longer utterance duration compared with TD individuals10,13. This is not in accordance with other results reported for the duration of the paired syllable18 or contrastive argument17. Furthermore, the durations of stressed and unstressed syllables differ more for TD individuals than for people with ASD16. Although the utterance duration has been reported to be influenced by the emotional state of TD individuals and is significantly longer in the sad emotional state compared with happy or neutral states, such a difference has not been reported for people with ASD21.

The intensity of the produced voice, sometimes referred to as voice loudness or voice pressure level (dB), is another measure that has been widely investigated to capture differences between ASD and TD individuals10,13,22. Diehl et al. reported that following elicitation of question-like speech, there was no difference between the speech intensity of TD and ASD groups13. Drimalla et al. found no difference between ASD and TD groups regarding the intensity of the produced voice22, and Filipe et al. reported that the intensity of voice was not different between ASD and TD groups for both falling and rising intonations10. However, according to Olivati et al., the maximum and minimum intensity of vocal productions were significantly different between ASD and TD individuals, with ASD individuals showing higher values for both minimum and maximum intensities23.
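As a hedged illustration of how an intensity value in dB can be obtained, the sketch below converts the root-mean-square (RMS) sound pressure of a frame to sound pressure level using the conventional 20 µPa reference for air. It assumes calibrated pressure samples in pascals; the reviewed studies may instead have used uncalibrated recordings and relative dB values, and the function names are hypothetical.

```python
import math

P_REF = 20e-6  # Pa, conventional reference pressure for sound pressure level in air

def rms(samples):
    """Root-mean-square of a sequence of pressure samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def spl_db(pressure_samples_pa):
    """Sound pressure level in dB re 20 µPa for calibrated samples in pascals."""
    return 20.0 * math.log10(rms(pressure_samples_pa) / P_REF)

# Hypothetical frame with an RMS pressure of 0.02 Pa
frame = [0.02, -0.02, 0.02, -0.02]
level = spl_db(frame)
print(round(level, 2))
```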

In studies comparing prosodic features between ASD and TD individuals, it is important to consider factors such as gender, age, IQ and the expressive spoken language of participants. It is well known that speech rate, pitch and voice intensity are correlated with speaker age24. It was also reported that the voice fundamental frequency of children decreased with increasing age25. For school-age typically developing children, the pitch variation was shown to be greater than in pre-school samples26. Gender is another important factor that affects prosodic features27. In general, the speech of female speakers shows higher pitch, longer duration and a wider pitch range28. According to Nadig and Shaw, IQ might also be correlated with acoustic pitch range in high-functioning ASD individuals during a structured communication task, while no relationship was found for conversational communication29. Furthermore, a correlation between IQ and accuracy in a linguistic prosody task was reported for children with specific language impairment30. Other factors such as the expressive language of participants or musical background also affect the prosodic features of an utterance. For example, lexical stress is assigned differently in different languages31. In this regard, several studies used prosodic features for identifying the spoken language of participants32. In addition, the relationship between prosodic features and factors such as age, gender and IQ is complex, and interactions between these factors should be considered. As an example of such an interaction, it was shown that fundamental frequency changed differently in male and female samples after adulthood (i.e. a decrease of fundamental frequency in women and an increase in men); however, such change depends on the age span33.

Overall, extensive inconsistencies exist in the literature regarding the main differences in prosodic features between ASD and TD groups. Performing a systematic review that includes previously conducted studies may help to resolve such inconsistencies. By aggregating individual studies, systematic reviews minimize the bias of the obtained results, yield more reproducible results and increase the power of statistical analyses due to larger sample sizes compared with individual studies. They also generate useful conceptual frameworks and guidelines for future studies by estimating the effect of eligible confounding factors34.

To the best of our knowledge, Fusaroli et al. (2017) performed the most recent systematic review of the acoustic features of people with ASD and their differences from TD individuals. According to Fusaroli et al., cumulative results showed that mean pitch and pitch range were the features that differed most significantly between ASD and TD groups (Cohen’s d = 0.4–0.5)35. In their study, between-study heterogeneity for the included studies was negligible except for the mean pitch and pitch range. However, due to the lack of sufficient evidence, a reliable pooled effect of ASD on voice intensity and voice quality was not reported35. Since the study of Fusaroli et al., several new studies have been performed to investigate the acoustic characteristics of speech produced by people with ASD.

The current study updates the most recent systematic review, conducted by Fusaroli et al., increasing the number of included studies from 30 to 39 and the number of ASD participants from 407 to 910. However, it should be noted that Fusaroli et al. also studied other voice characteristics such as voice quality. They also considered multivariate studies of acoustic patterns, which are out of the scope of the current study. The increased number of studies and participants enabled us to obtain more precise estimates. The main purpose of the current systematic review was to investigate which prosodic features could be considered reliable markers for discriminating people with ASD from TD individuals. Performing cumulative research that incorporates newly obtained evidence increases the sample size and hence improves the statistical power of the outcomes. Furthermore, analyses of the impact of confounding factors such as the age of participants, gender and the type of voice elicitation task on the prosodic characteristics were performed. The results of this study can be used by researchers to develop machine-learning approaches for discriminating ASD and TD individuals or for screening people with ASD. Furthermore, the results can be useful for developing rehabilitation intervention strategies for improving the speaking abilities of people with ASD.

Results

The flow diagram for performing the current systematic review according to Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines is shown in Fig. 1.

Figure 1

Flow diagram of the search procedure according to the PRISMA guidelines.

Initial searches in major databases (PubMed, Scopus and Web of Science) provided 395 studies. Furthermore, a manual search of the reference lists of review articles, related meta-analyses and eligible studies yielded 14 additional studies. After title and abstract screening, 66 studies remained, while only 39 of them contained results that enabled us to calculate a reliable effect size (i.e. standardized mean difference, SMD). Although the total sample sizes of the ASD and TD groups were 910 and 850, respectively, the separate analysis for each prosodic feature (pitch, intensity, duration and speech rate) was performed with a smaller sample size.
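For readers unfamiliar with the effect size used here, the following minimal sketch computes a standardized mean difference from group summary statistics. The sign convention (TD minus ASD, so that a negative value means a larger value in the ASD group) follows the forest-plot caption of this review; the pooled-SD formula, the optional Hedges small-sample correction and the variance approximation are standard textbook choices assumed for illustration, not necessarily the exact estimator used in the review. All numbers are hypothetical.

```python
import math

def smd_td_minus_asd(mean_asd, sd_asd, n_asd, mean_td, sd_td, n_td,
                     small_sample_correction=True):
    """Standardized mean difference (TD minus ASD) and its approximate variance."""
    df = n_asd + n_td - 2
    pooled_sd = math.sqrt(((n_asd - 1) * sd_asd ** 2 + (n_td - 1) * sd_td ** 2) / df)
    d = (mean_td - mean_asd) / pooled_sd  # Cohen's d
    var_d = (n_asd + n_td) / (n_asd * n_td) + d ** 2 / (2 * (n_asd + n_td))
    # Hedges' correction factor (assumed here; set the flag to False for plain d)
    j = 1 - 3 / (4 * df - 1) if small_sample_correction else 1.0
    return j * d, j ** 2 * var_d

# Hypothetical mean pitch values in Hz: ASD 230 (SD 20, n = 20), TD 220 (SD 20, n = 20)
d_plain, var_plain = smd_td_minus_asd(230, 20, 20, 220, 20, 20, small_sample_correction=False)
g, var_g = smd_td_minus_asd(230, 20, 20, 220, 20, 20)
print(d_plain, var_plain, g, var_g)
```

The returned variance is what a pooling step needs in order to weight each study by its precision.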

Table 1 shows the retrieved eligible studies that were included in the current systematic review. According to the Newcastle-Ottawa scale (NOS), the quality of all studies was high (NOS ≥ 7) except for Refs.36,37,38,39 (NOS = 5, 4, 3, 6, respectively). Since the prevalence of ASD is higher in males and ASD is a male-biased disorder40, for studies that reported results according to gender, only the results for males were considered. It should be noted that for such studies the sample size for females was relatively small; therefore, including effect sizes for the female group would have increased the risk of a small-study effect in the analyses41.

Table 1 Summary of included studies.

The results of this systematic review are as follows.

Mean pitch value

Twenty-two studies investigated the difference in mean pitch value between ASD and TD individuals. The results of these studies were inconsistent. Two studies14,62 reported lower mean pitch value for vocal productions of people with ASD, while ten other studies10,17,22,38,39,47,61,62,64,65 found higher mean pitch value for ASD individuals. Sheinkopf et al., investigating the acoustic characteristics of infants’ crying sounds, reported two mean pitch values, one for pain-related and one for non-pain-related cries. Furthermore, 11 studies13,15,18,29,38,45,49,58,59,60,63 found no significant difference (p > 0.05) between groups. By analyzing adult male participants (age > 13), Nayak et al. found a higher mean pitch value for the ASD than for the TD group, while for younger male participants (age < 13), the mean pitch value was not different between groups38.

As Fig. 2 shows, the pooled mean difference for the included studies (13 studies, 310 people with ASD and 268 TD individuals) was SMD = −0.4 (95% CI [−0.70, −0.10]), while a moderate to high between-study heterogeneity was observed (I² = 67.4%, p < 0.05). To investigate the source of heterogeneity, two confounding factors (voice elicitation task and the age span of participants) were considered. For this purpose, studies were grouped according to the experimental task that was used for voice elicitation (i.e. Narration, Conversation, Focus and Cry) and the age span of ASD participants (i.e. infancy, childhood, adolescence and adulthood). The results of these subgroup analyses are reported in Table 2.
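How a pooled SMD, its 95% confidence interval and the heterogeneity statistic I² can be derived from per-study effect sizes is sketched below. The sketch assumes a DerSimonian-Laird random-effects model and a normal-approximation confidence interval, which are common choices but are not stated in this section to be the model used in the review; the function name and input values are hypothetical.

```python
import math

def random_effects_pool(effects, variances):
    """Pool study effects with DerSimonian-Laird; returns SMD, 95% CI, Q, tau^2 and I^2 (%)."""
    w = [1.0 / v for v in variances]                  # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # between-study variance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]    # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {"smd": pooled, "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
            "q": q, "tau2": tau2, "i2": i2}

# Hypothetical inputs: three homogeneous studies, then two strongly conflicting ones
homogeneous = random_effects_pool([-0.4, -0.4, -0.4], [0.1, 0.1, 0.1])
conflicting = random_effects_pool([-1.0, 0.0], [0.01, 0.01])
print(homogeneous["smd"], homogeneous["i2"], conflicting["i2"])
```

When studies disagree far more than their sampling variances allow, I² approaches 100% and the random-effects weights become nearly equal across studies, which widens the pooled confidence interval.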

Figure 2

Forest plot for mean pitch value measure. The negative sign shows that the mean pitch value is larger for ASD individuals as compared with TD individuals.

Table 2 Subgroup analyses for mean pitch difference between ASD and TD groups. The elicitation tasks and the age of participants were confounding factors.

Student t-test showed that the mean pitch value for TD and ASD groups was statistically different (p = 0.037; t-value = 1.876). For subgroups (according to the type of voice elicitation task and age span), statistical analysis using a two-way analysis of variance (ANOVA) was performed. The ANOVA model \(y_{ijt} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijt}\) was used, in which \(\mu\) is the overall mean, \(\alpha_i\) is the effect of the first independent variable (task type, i: Cry, Focus, Narration, Conversation), \(\beta_j\) is the effect of the second independent variable (age, j: Infancy, Childhood, Adolescence, Adulthood), \(\gamma_{ij}\) is the interaction between task type and age, and \(\varepsilon_{ijt}\) is the error term. Multiple comparisons correction using the Bonferroni method was applied to the ANOVA outcome. The results showed no main effect of task type (F(3,8) = 0.36, p = 0.79) or age (F(2,8) = 2.07, p = 0.188), and no interaction between task type and age span (F(2,8) = 0.29, p = 0.76).
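As a reading aid for the reported F values, the standard form of the test statistic for the task main effect is sketched below. The sum-to-zero constraints and the SS and df symbols are conventional textbook notation assumed here, not notation taken from the source. In F(3,8), the first number is the numerator degrees of freedom (number of factor levels present minus one) and the second is the error degrees of freedom.

```latex
% Conventional identifiability constraints (assumed)
\sum_i \alpha_i = 0, \qquad \sum_j \beta_j = 0, \qquad
\sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0

% F ratio for the task main effect, referred to F(df_task, df_error)
F_{\text{task}} =
  \frac{SS_{\text{task}} / df_{\text{task}}}{SS_{\text{error}} / df_{\text{error}}}
```

The age and interaction effects are tested in the same way, with their own sums of squares and degrees of freedom in the numerator.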

In order to check the impact of elicitation task and age span simultaneously, a more detailed subgroup analysis was performed and the results were reported in supporting materials (see supporting material file, Supplementary Appendix Sect. B).

Pitch standard deviation

In the current systematic review, 16 studies reported the difference in pitch standard deviation between ASD and TD groups. The total sample size of the retrieved studies that considered pitch standard deviation was 305 and 329 for ASD and TD groups, respectively. Eight studies10,13,15,17,47,52,66,67 reported larger pitch standard deviation for people with ASD, while three studies26,38,46 obtained smaller pitch standard deviation for ASD individuals. Other studies reported no significant differences between ASD and TD individuals26,29,59,60,62,63. In the study of Nakai et al., lower pitch standard deviation was observed for school-aged ASD individuals, while the pre-school aged group did not show such a difference26. Among the included studies, nine reported the necessary information for calculating SMD (yielding 13 effect sizes, since some of the studies reported more than one effect size). The pooled mean difference for acoustic pitch standard deviation between ASD and TD groups was SMD = −0.07 (95% CI [−0.55, 0.42], I² = 83.5%, p < 0.01), which was a very small and non-significant difference. Subgroup analyses considering different types of voice elicitation tasks are shown in Table 3. Since just one study was available for adolescence and adulthood, these age spans were excluded from further analysis.

Table 3 Subgroup analyses for the difference of pitch standard deviation between ASD and TD groups. The elicitation tasks and the age of participants were confounding factors.

The results of this meta-analysis indicated that pitch standard deviation was not significantly different between ASD and TD groups, even when confounding factors were adjusted.

Two-way analysis of variance showed that there was no main effect for task (F(3,6) = 0.4, p = 0.76), age (F(3,6) = 0.92, p = 0.48) or interaction between age and task type factors (F(4,12) = 0.87, p = 0.26).

Pitch range

Pitch range, the distance between the maximum and minimum pitch values68, is one of the measures that along with pitch standard deviation has been used extensively for voice pitch variability assessment10,21,45,53,61. In the current study, pitch range was considered as a measure for the broadness of fundamental frequency used by ASD or TD individuals during voice production.

The systematic search found 21 studies on the pitch range difference between ASD and TD groups. Thirteen studies reported a wider pitch range in the vocal productions of ASD individuals10,13,15,21,29,38,45,52,53,61,62,69,70, while only one study55 reported a narrower pitch range in the ASD group. In seven studies, no significant difference between ASD and TD groups was found for pitch range (Refs.14,18,23,26,39,59 for one-word and 16 for pseudowords). Among the 21 eligible studies, 14 (17 mean difference values) had the necessary information for performing a meta-analysis. The total sample size of these 14 studies was 239 for ASD and 232 for TD individuals. The pooled mean difference for this analysis was SMD = −0.78 (95% CI [−1.34, −0.21], I² = 89.9%, p < 0.001). As this result shows, there was between-study heterogeneity. To investigate the source of heterogeneity, subgroup analyses were performed considering the age of participants and the voice elicitation method. Table 4 reports the results of these subgroup analyses.

Table 4 Subgroup analysis for pitch range difference between ASD and TD groups. The elicitation tasks and the age of participants were confounding factors.

Student t-test showed that the pitch range value for TD and ASD groups was statistically different (p = 0.002; t-value = 3.21). According to the two-way ANOVA test, there was no main effect of age (F(2,9) = 0.2, p = 0.82) or task type (F(2,9) = 0.13, p = 0.88) on the standardized mean difference of pitch range between ASD and TD groups. Furthermore, the two-way ANOVA test showed that there was no interaction between age and task type (F(3,9) = 0.33, p = 0.806).

Pitch variability

Pitch variability is usually computed from the standard deviation of fundamental frequencies or the range of fundamental frequencies, i.e. the distance between the maximum and minimum pitch values18. The pitch standard deviation might explain pitch variability better than pitch range, since the latter is more sensitive to outliers. In order to be consistent with the definition of pitch variability in the literature (i.e. considering both pitch standard deviation and pitch range measures), the results of the studies in the “Pitch standard deviation” and “Pitch range” sections were combined. The systematic search retrieved 22 studies that investigated pitch variability (544 and 561 samples for ASD and TD groups, respectively). These studies reported 30 effect sizes. The pooled mean difference for the pitch variability measure was SMD = −0.462 (95% CI [−0.84, −0.08], I² = 88.7%, p < 0.001). This result showed larger pitch variability for the ASD group, while between-study heterogeneity was observed. To investigate the source of heterogeneity, subgroup analyses were performed considering the age of participants and the voice elicitation method. Table 5 reports the results of these subgroup analyses.
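The remark that pitch range is more sensitive to outliers than pitch standard deviation can be checked with a small numerical sketch. The contour below is synthetic, and the single 440 Hz frame stands in for a hypothetical pitch-tracking error; none of these values come from the reviewed studies.

```python
from statistics import stdev

def variability(f0_hz):
    """Two common pitch variability measures for a voiced F0 contour in Hz."""
    return {"sd": stdev(f0_hz), "range": max(f0_hz) - min(f0_hz)}

# Synthetic contour cycling through 200, 205, 210, 215 and 220 Hz
clean = [200.0 + (i % 5) * 5.0 for i in range(50)]
# Same contour with one spurious frame appended (hypothetical tracking error)
with_error = clean + [440.0]

before, after = variability(clean), variability(with_error)
range_ratio = after["range"] / before["range"]  # inflation of the range
sd_ratio = after["sd"] / before["sd"]           # inflation of the standard deviation
print(range_ratio, sd_ratio)
```

In this synthetic case a single bad frame inflates the range by a much larger factor than the standard deviation, because the range depends only on the two most extreme frames.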

Table 5 Subgroup analyses for the difference of pitch variability between ASD and TD groups. The voice elicitation tasks and the age of participants were confounding factors.

Student t-test showed that the pitch variability for TD and ASD groups was statistically different (p = 0.008; t-value = 2.53). The two-way ANOVA test showed no main effect of age (F(3,19) = 0.22, p = 0.88), but a main effect of task type (F(3,19) = 8.03, p = 0.04) on the standardized mean difference of pitch variability between groups. The post-hoc analysis showed that for narration-type tasks the pitch variability was larger for ASD samples. Furthermore, analysis of variance showed that there was no interaction between age and task type (F(4,19) = 0.6, p = 0.65).

Intensity

The difference in the intensity of vocal productions between ASD and TD individuals was investigated in 12 studies. Two studies21,23 reported higher intensity for ASD individuals, while two others58,60 found lower intensity for ASD individuals. In addition, eight studies10,17,22,29,36,59,62,63 reported no significant differences between ASD and TD groups in terms of voice intensity level. Among the eligible studies, ten had the necessary information for calculating the pooled mean difference between ASD and TD groups for acoustic intensity10,17,21,23,29,58,59,60,62,63. Some of them reported more than one mean difference (for minimum or maximum intensity, during different time spans or due to different elicitation methods); therefore, 14 mean difference values were available for the meta-analysis. The total sample size for this analysis was 222 people with ASD and 182 TD individuals. The pooled SMD for the acoustic intensity difference between ASD and TD groups was SMD = −0.14 (95% CI [−0.58, 0.29], I² = 82.1%, p < 0.001), which indicated a small and non-significant difference between groups (since the confidence interval contained zero). Table 6 shows the results of the subgroup analyses, considering the age span of ASD participants and the vocal production elicitation methods.

Table 6 Subgroup analyses for voice intensity difference between ASD and TD groups. The elicitation tasks and the age of participants were confounding factors.

It should be noted that Ochi et al. (2019) also reported lower variation in the acoustic intensity of people with ASD58, while other studies60,62 reported no significant difference between these groups. In the study of Choi and Lee, it was reported that intensity variation for people with ASD was significantly larger (p < 0.05) compared with TD individuals (SMD = −0.998, 95% CI [−1.61, −0.38])48.

Student t-test showed that the voice intensity for TD and ASD groups was not statistically different (p = 0.305; t-value = 0.524). The two-way ANOVA test on the standardized mean differences of voice intensity between ASD and TD groups showed a main effect of age (F(3,7) = 10.48, p = 0.006), while there were no significant effects of task type (F(3,7) = 0.17, p = 0.911) or of the interaction between age and task type (F(4,7) = 0.2, p = 0.356). The post-hoc Bonferroni-corrected contrast analysis showed that the mean value for voice intensity was higher for adolescent ASD subjects in the conversation-type task.

Speech rate

Patel et al. found a significantly smaller speech rate for people with ASD compared with TD individuals18, while two other studies29,58 found no significant difference between the speech rates of ASD and TD individuals. Sufficient information for calculating SMD was available from these three studies18,29,58. The pooled mean difference for the eligible studies (ASD and TD group sample sizes were 132 and 69, respectively) showed a weak and non-significant difference between the speech rates of ASD and TD groups (SMD = 0.09, 95% CI [−0.44, 0.62], I² = 49.4%, p = 0.115).

Voice duration

For the difference in mean voice duration between ASD and TD groups, 22 eligible studies were retrieved. The total sample size of the included studies was 257 and 234 for ASD and TD groups, respectively. Ten studies16,17,29,42,43,46,49,54,58,62 did not find any statistical difference in voice duration between ASD and TD groups. However, 11 other studies reported longer utterance/word duration for people with ASD10,13,21,29,36,50,52,58,64,71,72. Demouy et al. reported four SMD values for different types of intonations (i.e. descending, falling, floating, rising). Another study, which investigated the crying sounds of infants, reported shorter voice duration for ASD children63. Among the eligible studies, the standardized mean difference could be calculated for 15 studies with 27 mean difference values, since some studies reported several mean difference values. Although between-study heterogeneity was observed, the meta-analysis of the difference in voice duration between ASD and TD groups yielded a pooled difference of SMD = −0.43 (95% CI [−0.72, −0.15], I² = 72.1%, p < 0.01), which indicated a significantly longer duration of vocal productions in people with ASD.

Student t-test showed that the voice duration value for TD and ASD groups was statistically different (p = 0.017; t-value = 2.23). However, the two-way ANOVA test revealed that there was a main effect of age (F(3,20) = 8.68, p = 0.027), while no significant effect was found for task type (F(3,20) = 1.18, p = 0.344) or the interaction between age and task type (F(4,20) = 1.37, p = 0.28) on the standardized mean difference of voice duration between ASD and TD groups. Post-hoc analysis revealed that this difference was mainly due to the statistical difference between the childhood (t-value = 1.78, p = 0.048) and adolescence (t-value = 2.09, p = 0.04) subgroups.

Fig. 3 suggests that one possible source of between-study heterogeneity for voice duration is the type of voice elicitation task. When participants were motivated to produce words and sentences in a word repetition or picture naming task (Force category in Fig. 3), between-study heterogeneity was moderate (I² = 65.8%, p < 0.01), while voice duration was longer for the ASD group (SMD = −0.38, 95% CI [−0.69, −0.08]). For cases in which voice duration was calculated for the crying period, studies were heterogeneous (I² = 75.3%, p = 0.007). In the case of narration, one study23 showed non-significant heterogeneity; however, due to the common sample population, it could not be considered a reliable result. Another source of between-study heterogeneity was the age span of participants (see Fig. 4). According to Fig. 4, in the case of vocal production in infants, a non-significant and negligible heterogeneity (I² = 0.00%, p = 0.579) was obtained, and voice duration for crying sounds was shorter for people with ASD (SMD = 0.38, 95% CI [−0.02, 0.79]). For the adolescence and childhood periods, significant between-study heterogeneity was also observed (I² = 71.3% and 71.6%, respectively, p < 0.01).

Figure 3

Forest plot for the subgroup meta-analysis of the difference of voice duration between ASD and TD groups. The confounding factor for this analysis was the type of voice elicitation task.

Figure 4

Forest plot for the subgroup meta-analysis of the difference of voice duration between ASD and TD groups. The confounding factor for this analysis was the age span of participants.

Other measures

In a limited number of studies, other features such as pitch slope and voice quality were considered. Bone et al. reported that the median pitch slope correlated with ASD severity and that a more negatively sloped pitch was observed in children with severe ASD73. Pitch slope determines intonation73 or the perceived oddness of prosody74, where a more negative pitch slope is related to a flatter intonation73. The performance of participants during speech production was another measure for comparing ASD and TD individuals75. Furthermore, voice quality is used as a measure for comparing ASD and TD individuals; however, there are no universal measures for quantifying voice quality35.

Publication bias

The results for publication bias according to Egger’s and Begg’s tests were reported in Table 7. The threshold level of 0.05 was used to indicate significant publication bias. The related funnel plot for representing publication bias can be found in Supplementary Appendix D in the Supporting Material.
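The idea behind Egger’s test can be sketched as a simple regression: each study’s standardized effect (effect divided by its standard error) is regressed on its precision (one divided by the standard error), and an intercept far from zero suggests funnel-plot asymmetry. The sketch below computes only the intercept and slope by ordinary least squares; the significance test on the intercept (a t-test with n − 2 degrees of freedom) is omitted, and the function name and inputs are hypothetical.

```python
def egger_regression(effects, standard_errors):
    """Ordinary least squares of (effect / SE) on (1 / SE); returns (intercept, slope)."""
    x = [1.0 / se for se in standard_errors]                  # precision
    y = [e / se for e, se in zip(effects, standard_errors)]   # standardized effect
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical symmetric case: every study estimates the same effect, whatever its precision
intercept, slope = egger_regression([-0.4, -0.4, -0.4, -0.4], [0.1, 0.2, 0.25, 0.5])
print(intercept, slope)
```

In this symmetric case the intercept is zero up to rounding and the slope recovers the common effect; small studies that report systematically larger effects would pull the intercept away from zero.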

Table 7 Results for assessing publication bias using the Begg’s and Egger’s tests for included studies for different acoustic measures.

Discussion

Study outcomes

Autism spectrum disorder (ASD) is a frequent neurodevelopmental disorder1. Recognizing common early symptoms and warning signs of ASD leads to early diagnosis and better treatment assessments. Since ASD is associated with language deficits, in this systematic study we reviewed the literature focused on the prosodic features of pitch, duration, speech rate and intensity.

As the results of this study (Fig. 2) revealed, the pooled mean difference for the mean pitch measure between ASD and TD groups was negative (SMD = −0.4). The negative sign indicated a larger mean pitch for people with ASD. The confidence interval (95% CI [−0.7, −0.1]) did not include zero; therefore, the mean difference should be considered significant. Subgroup analysis (Table 2) revealed a higher mean pitch for people with ASD, especially in adulthood (SMD = −0.94, 95% CI [−1.36, −0.52]). This implied that the pitch difference between ASD and TD individuals was most clearly significant during adulthood, whereas studies of other age spans showed inconsistent results.
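The reasoning from a confidence interval to significance can be made explicit with a small sketch. It assumes that the reported 95% intervals are symmetric normal-approximation intervals (estimate ± 1.96 standard errors), which is an assumption made for illustration; the back-calculated z and p values are therefore approximate and are not results reported by the review. The SMD and interval values are those quoted in the text for mean pitch and voice intensity.

```python
import math

def ci_summary(smd, lower, upper, z_crit=1.96):
    """Back-calculate SE, z and a two-sided p-value from an SMD and its 95% CI."""
    se = (upper - lower) / (2 * z_crit)          # assumes a symmetric normal-based CI
    z = smd / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return {"excludes_zero": lower > 0 or upper < 0,
            "se": se, "z": z, "p_two_sided": p_two_sided}

mean_pitch = ci_summary(-0.40, -0.70, -0.10)  # pooled mean pitch difference
intensity = ci_summary(-0.14, -0.58, 0.29)    # pooled voice intensity difference
print(mean_pitch["excludes_zero"], intensity["excludes_zero"])
```

An interval that excludes zero corresponds to a two-sided p-value below 0.05 under this approximation, which is why the mean pitch difference is treated as significant and the intensity difference is not.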

Between-study heterogeneity for the mean pitch difference was moderate (I² = 67.4%, p < 0.001). Subgroup analysis revealed that both age and voice elicitation tasks were sources of heterogeneity (see Table 2). For conversation-type tasks, when social interaction existed during vocal production, between-study heterogeneity was relatively high and significant (I² = 80.7%, p < 0.05). Social interaction problems are hallmarks of ASD76; therefore, a significant difference in the prosodic features of vocal productions of ASD individuals during social interactions in comparison with TD people is not surprising. This might be the reason that conversation-type tasks were an important source of the observed between-study heterogeneity. Furthermore, subgroup analysis according to the age span of participants showed that after the childhood period (i.e. in adolescence and adulthood) the results of studies regarding the acoustic mean pitch difference between ASD and TD individuals were more homogeneous. Previous studies revealed that mean pitch value decreases during development and reaches the mature adult voice pitch value at 13–18 years of age77,78. Furthermore, Lee et al. reported that the acoustic features converged to a canonical level at the age of 1579. The convergence of prosodic features after childhood might be the reason for the more homogeneous differences in acoustic features between ASD and TD groups in the adulthood age span.

When pitch standard deviation was used, the current meta-analysis did not find a significant difference between ASD and TD groups (SMD = −0.07, 95% CI [−0.55, 0.42]). In contrast, for another measure of pitch variability, i.e. pitch range, the current systematic review showed a significantly wider pitch range for the ASD group (SMD = −0.78, 95% CI [−1.34, −0.21]). This can be attributed to the discomfort of people with ASD during speaking37. Lyakso et al. reported that a negative/positive trend in pitch variability (falling/rising pitch contour) usually reflects a state of discomfort/comfort during speaking, while a flat pitch contour is related to natural speaking37. According to Table 4, the pitch range difference decreased during development. Since speech and language abilities improve during development80, people with ASD might feel more comfortable speaking at older ages during social communication, which resulted in the smaller pitch range difference between ASD and TD groups observed in our study. This was also in accordance with the findings of Nakai et al., who reported a negative correlation between pitch variation and social reciprocal interaction in Japanese-speaking pre-school children26.

The current meta-analysis found a weak mean difference in voice intensity between ASD and TD groups (SMD = −0.14, 95% CI [−0.58, 0.29]); however, the difference was not significant. Subglottic pressure, the rate of airflow, the glottal resistance81 and the vocal fold vibration properties82 are factors affecting intensity. Kostyuk et al. reported weakness of the respiratory muscles of people with ASD, and Stewart et al. reported the abnormal lower long airway doublet branching in ASD children83 that might influence the air pressure for vibrating vocal cords. Therefore, the difference in voice intensity and pitch properties of the vocal productions of people with ASD can be attributed to structural abnormalities in the vocal tract of ASD individuals. In addition, the mean pitch is the fundamental frequency (F0) of a voice. The mean value of F0 and its range are mainly determined by vocal cord characteristics such as thickness and length84. X-ray images of the larynx of individuals with ASD and their TD counterparts revealed that the hyoid height of people with ASD was lower compared with controls85. Different larynx anatomy might be the reason for the larger mean pitch and the larger pitch range in ASD individuals.

The pooled voice duration difference between ASD and TD groups showed a significantly longer duration for ASD individuals (SMD = − 0.43, 95% CI [− 0.72, − 0.15]). This result indicated that people with ASD needed more time to produce utterances than TD individuals.

Assessment of publication bias using funnel plots as well as Egger’s and Begg’s rank tests revealed no evidence of publication bias for the performed analyses (see Table 7 and Fig. S3 in the supporting material). According to Table 7, both Egger’s and Begg’s tests yielded p values larger than 0.05; however, for pitch variability and voice duration, Egger’s test had marginal p values (p < 0.1). The result for publication bias obtained by Egger’s test should therefore be interpreted with caution, especially for the pitch variability and voice duration measures.
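To make the logic of Egger’s test concrete, the following is a minimal sketch of the underlying regression, in which the standardized effect (effect divided by its standard error) is regressed on precision (the reciprocal of the standard error) and an intercept far from zero suggests funnel-plot asymmetry. It is not the STATA routine used in this study; the function name `egger_regression` and the example numbers are hypothetical and serve only as an illustration. A p value would additionally require the t distribution with k − 2 degrees of freedom, which is not computed here.

```python
import math


def egger_regression(effects, std_errors):
    """Ordinary least squares of (effect / SE) on (1 / SE).

    Returns (intercept, slope, se_intercept). The intercept is the
    asymmetry term examined by Egger's test; the slope estimates the
    underlying effect. Requires at least three studies whose standard
    errors are not all identical.
    """
    k = len(effects)
    if k < 3 or k != len(std_errors):
        raise ValueError("need matching lists with at least three studies")

    x = [1.0 / se for se in std_errors]                    # precision
    y = [e / se for e, se in zip(effects, std_errors)]     # standardized effect

    x_bar = sum(x) / k
    y_bar = sum(y) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance with k - 2 degrees of freedom
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    resid_var = sse / (k - 2)
    se_intercept = math.sqrt(resid_var * (1.0 / k + x_bar ** 2 / sxx))
    return intercept, slope, se_intercept


# Hypothetical SMDs and standard errors (invented for illustration only)
example_effects = [-0.90, -0.60, -0.50, -0.35, -0.40]
example_ses = [0.45, 0.35, 0.25, 0.15, 0.10]
b0, b1, se_b0 = egger_regression(example_effects, example_ses)
t_statistic = b0 / se_b0  # compare with a t distribution, k - 2 df
```

When every study estimates the same effect regardless of its precision, the fitted intercept is zero and the slope equals that common effect, which is the symmetric-funnel case.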

A closer look at the results of the subgroup analyses revealed that the differences in prosodic features between ASD and TD groups were mainly seen for narration-type tasks. For narration-type tasks, a significantly higher mean pitch, larger pitch range, higher pitch variability and longer voice duration were observed for people with ASD. For focus-type tasks, a significant difference between groups was observed for only a limited number of acoustic features (mean pitch and voice duration). For conversation-type tasks, no statistically significant differences were observed (i.e. the confidence intervals included zero). This outcome should be taken into account when designing future studies.
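The criterion mentioned above (a difference is not significant when its confidence interval includes zero) can be applied directly to the overall pooled estimates quoted earlier in this Discussion. The sketch below does so in Python. The helper names are hypothetical, and the back-calculation of a z value assumes a symmetric normal-approximation interval (a critical value of 1.96), which is an assumption rather than something stated in the text.

```python
# Pooled SMDs with 95% CI bounds as quoted in the Discussion.
# Sign convention from the Methods: TD is the first group, so a
# negative SMD means a larger value for the ASD group.
pooled_estimates = {
    "pitch standard deviation": (-0.07, -0.55, 0.42),
    "pitch range": (-0.78, -1.34, -0.21),
    "voice intensity": (-0.14, -0.58, 0.29),
    "voice duration": (-0.43, -0.72, -0.15),
}


def ci_excludes_zero(lower, upper):
    """True when the interval lies entirely on one side of zero."""
    return lower > 0 or upper < 0


def approx_z(smd, lower, upper, z_crit=1.96):
    """Approximate z value recovered from a symmetric 95% CI.

    Assumption: the interval was built as estimate +/- z_crit * SE.
    """
    standard_error = (upper - lower) / (2 * z_crit)
    return smd / standard_error


significant_features = sorted(
    name
    for name, (smd, lower, upper) in pooled_estimates.items()
    if ci_excludes_zero(lower, upper)
)

for name, (smd, lower, upper) in pooled_estimates.items():
    verdict = "significant" if ci_excludes_zero(lower, upper) else "not significant"
    print(f"{name}: SMD = {smd}, 95% CI [{lower}, {upper}] -> {verdict}")
```

Under these assumptions the check flags pitch range and voice duration, and leaves pitch standard deviation and voice intensity unflagged, in line with the conclusions stated in the text.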

Considering the age span of participants, pitch range, pitch variability and voice duration were significantly different between ASD and TD groups for the adolescent group. The mean pitch value was also significantly higher for adult ASD participants compared with TD peers. These results showed that the main differences were observed in the adolescent and adult age spans. It should be noted that the structural features of the larynx and vocal cords are the most influential factors for the pitch, intensity and variation of the human voice. The majority of structural changes in the larynx and vocal cords begin around puberty, i.e. during adolescence, and the voice finally stabilizes in the early years of adulthood. Voice stabilization reduces the intrinsic voice variability86 of acoustic features and highlights inter-subject differences. This might be the reason why the differences in acoustic features between ASD and TD groups were mainly seen at older ages (adolescence and adulthood).

Comparison with the last performed systematic review

The previous systematic review by Fusaroli et al. found that mean pitch and pitch range were two acoustic features that differed between ASD and TD groups35. According to Fusaroli et al., the standardized mean difference for mean pitch was SMD = − 0.41 (95% CI [− 0.68, − 0.15], 16 studies), which is very close to our estimate (SMD = − 0.4, 95% CI [− 0.7, − 0.1], 22 studies). This supports mean pitch as a reliable feature for distinguishing ASD from TD individuals. For the pitch range measure, the previous meta-analysis obtained a pooled difference of SMD = − 0.5 (95% CI [− 0.77, − 0.24], 17 studies), while the current analysis obtained SMD = − 0.78 (95% CI [− 1.34, − 0.21], 21 studies), i.e. a larger pitch range difference between ASD and TD individuals than that reported by Fusaroli et al.35, suggesting a stronger capability of pitch range for distinguishing the two groups. The inclusion of recently reported studies also showed that voice duration might be another distinguishing measure, being significantly longer for the ASD group (SMD = − 0.43, 95% CI [− 0.72, − 0.15]). In accordance with the systematic review performed by Fusaroli et al., the current systematic review did not find any evidence that voice intensity, pitch standard deviation or speech rate differentiate between ASD and TD individuals.

Study limitations

Even though systematic reviews and meta-analyses provide a framework for combining the results of several studies, the obtained results should be interpreted with caution due to several issues, including heterogeneity between studies and publication bias87. Furthermore, for the topic of the current study, the spoken language of participants and the task used for voice elicitation differed extensively between studies. Such factors affect prosodic features32 and made the studies heterogeneous. Nevertheless, the outcomes of a systematic review/meta-analysis can serve as a starting point for future studies investigating the effect of potential confounding factors. In this respect, Fusaroli et al. adopted a cumulative yet self-correcting approach based on the outcomes of their previous meta-analysis35 in order to propose guidelines for overcoming the inherent shortcomings of a systematic review/meta-analysis study87.

In addition, the current study was performed according to the classical method for meta-analysis. An alternative is Bayesian meta-analysis, which treats both data and model parameters as random variables. It incorporates a priori knowledge into the model and in this way enriches the meta-analysis88,89. Finally, during the systematic search, we found several studies that could not be included in the meta-analysis because required data were not reported. Missing-data imputation strategies can be effective methods for including such studies in a meta-analysis90.

Conclusion

Several studies have reported altered vocal production in people with ASD. In this regard, it is important to investigate whether the prosodic characteristics of the vocal productions of people with ASD are different enough to be used as distinguishing factors between ASD and TD individuals. Because children begin vocal production from the earliest stages of development, such markers are promising for people with ASD: they may support earlier diagnosis and better assessment outcomes. The current systematic review of studies on the prosodic features of vocal productions of people with ASD was conducted to identify the frequently reported, statistically supported differences between ASD and TD individuals. This study showed that some features, namely mean pitch, pitch range, pitch variability and voice duration, were discriminative. However, these findings depended on the age span of participants and the type of task used for voice elicitation. For voice elicitation tasks in which interaction with others was required or subjects engaged in a problem-solving task before voice elicitation, discriminative markers reached lower statistical significance, while prosodic features during general narration showed a statistically significant difference between people with ASD and the TD group. Furthermore, mean differences in discriminative features between ASD and TD groups were usually observed for adolescents and adults. The findings also showed that other prosodic features such as voice intensity, pitch standard deviation and speech rate were not able to distinguish ASD individuals from TD people. The obtained results can be considered when developing intelligent methods for distinguishing people with ASD from TD individuals. For future work, we propose using enhanced statistical methods such as Bayesian meta-analysis frameworks.

Materials and methods

The current systematic review was performed according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

Search procedure

To find eligible sources in line with the purpose of the current study, three major databases, namely PubMed, Web of Science and Scopus, were searched using the advanced search engine of each database. For the systematic search, the following search terms were used: (autism OR “Autism spectrum disorder” OR ASD OR “Asperger syndrome”) AND (“phonological disorder” OR “phonological difficulties” OR “phonological impairment” OR “speech disorder” OR “speech impairment” OR “speech difficulties” OR “voice disorder” OR “voice difficulties” OR “voice impairment” OR “phonology disorder” OR “phonology impairment” OR “phonology difficulties” OR phonology OR phonological OR phonetic) AND (“fundamental frequency” OR formants OR “acoustic energy” OR pitch). No restriction on language or date of publication was applied.
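For readers who wish to reproduce or adapt the search, the Boolean string above can be assembled from its three term groups, as in the following minimal Python sketch. The helper `or_group` and the variable names are hypothetical, straight quotation marks replace the typographic ones, and database-specific field tags or syntax adjustments (which differ between PubMed, Web of Science and Scopus) are not covered here.

```python
def or_group(terms):
    """Join terms with OR inside parentheses, quoting multi-word phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


# Term groups copied from the search strategy described in the text
population_terms = ["autism", "Autism spectrum disorder", "ASD", "Asperger syndrome"]

speech_terms = [
    "phonological disorder", "phonological difficulties", "phonological impairment",
    "speech disorder", "speech impairment", "speech difficulties",
    "voice disorder", "voice difficulties", "voice impairment",
    "phonology disorder", "phonology impairment", "phonology difficulties",
    "phonology", "phonological", "phonetic",
]

acoustic_terms = ["fundamental frequency", "formants", "acoustic energy", "pitch"]

# The three groups are combined with AND, as in the reported query
search_query = " AND ".join(
    or_group(group) for group in (population_terms, speech_terms, acoustic_terms)
)
print(search_query)
```

Keeping the term groups as lists makes it straightforward to add or remove a synonym and regenerate the same query for each database.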

Inclusion and exclusion criteria

As inclusion criteria: (1) Original research articles, conference papers, clinical trial or randomized controlled trial articles were included. (2) Studies were included if the difference in acoustic properties between autism spectrum disorder and typically developing conditions had been investigated. (3) Only studies that had investigated the acoustic features from an articulatory point of view were included. (4) Studies that included participants with hearing loss or neurological disorders other than ASD were excluded.

As exclusion criteria: (1) Review articles or related systematic review studies were excluded, even though their reference lists were searched to find missing related studies. In addition, case reports and letters to the editor were not included. (2) Studies in which the ASD group had been compared with people with language impairment, and studies on ASD individuals without comparison with a TD control group, were excluded. (3) Studies with a very small sample size (n ≤ 3) were excluded from further analysis. (4) Studies that investigated speech perception or focused on brain mechanisms (for example, event-related potentials elicited during speech production) were excluded. (5) Studies that had used specific tests to score the verbal abilities or prosodic capabilities of participants91 were also excluded.

Study selection

The search procedure was performed by two independent authors (S.F. and E.J.) and the retrieved references were transferred to a single EndNote library. Eligible studies were identified after duplicate removal and title, abstract and full-text screening. Any disagreement in study selection between authors was resolved by discussion. A PICO model (Population: people with autism spectrum disorder and a language- or age-matched typically developing group; Intervention: tasks for eliciting voice production in participants, including simple narrative tasks, two-sided interviews, picture-naming tasks and so on; Comparison: intensity, duration and pitch of the produced utterance; and Outcome: the standardized mean difference between ASD and TD groups) was used to select eligible studies. Some studies were excluded from further analyses because they lacked sufficient information to calculate the standardized mean difference between groups.

Data extraction

Using a data extraction form, information such as author name, publication year, type of study design, study sample size (number of ASD or TD individuals included in the study), number of male participants in each group, mean age of participants, procedures adopted in the study, the acoustic features used in the study, the main outcomes of the study and calculated or reported mean differences were extracted. For studies reporting several mean differences, all reported differences were considered for further analysis.

In the current study, we focused on acoustic features like intensity (loudness or pressure level), mean and variability of pitch (or fundamental frequency), duration for utterance production and speech rate. Other measures such as turn-taking44, correct word repetition rate, voice quality73 or pause duration were not considered.

Between-study heterogeneity, quality assessment and statistical analysis

Between-study heterogeneity was assessed using Cochran's Q-test and the I2 statistic92. An I2 value higher than 75% was considered high heterogeneity, a value lower than 25% small heterogeneity, and a value between these two thresholds (i.e. 25% and 75%) moderate heterogeneity. To assess publication bias, a funnel plot was used as a visualization tool, and Begg’s and Egger’s tests were used to quantify the possible bias93. The Newcastle-Ottawa Scale (NOS)94, developed for nonrandomized studies, was used to evaluate the quality of studies. The difference between ASD and TD groups was calculated as the standardized mean difference (SMD) using Cohen’s d formula. During SMD calculation, the first group was TD and the second group was ASD; a negative SMD value therefore implied a larger value for the ASD group. The adopted statistical significance level was 0.05. It should be noted that Cohen’s d is biased upward for small samples. To correct this bias, the corrected d (d*) was calculated using the following formula95.

$${d}^{*}=\frac{{M}_{1}-{M}_{2}}{{SD}_{pooled}}\left(\frac{N-3}{N-2.25}\right)\sqrt{\frac{N-2}{N}}.$$
(1)

In Eq. (1), Mi was the mean value for the i-th group and N was the sample size or number of studies. SDpooled was the pooled standard deviation, calculated according to Eq. (2).

$${SD}_{pooled}=\sqrt{\frac{({n}_{1}-1){SD}_{1}^{2}+({n}_{2}-1){SD}_{2}^{2}}{{n}_{1}+{n}_{2}-2}}.$$
(2)

In Eq. (2), ni was the sample size and SDi was the standard deviation for the i-th group.
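As an illustration of Eqs. (1) and (2), the following minimal Python sketch computes the pooled standard deviation and the corrected effect size for a single two-group comparison. It assumes that N in Eq. (1) is the total sample size n1 + n2, which is one reading of the definition given above; the function names and the example values are hypothetical and are not data from any included study.

```python
import math


def pooled_sd(n1, sd1, n2, sd2):
    """Pooled standard deviation of two groups, following Eq. (2)."""
    numerator = (n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2
    return math.sqrt(numerator / (n1 + n2 - 2))


def corrected_d(m1, sd1, n1, m2, sd2, n2):
    """Bias-corrected standardized mean difference d*, following Eq. (1).

    Assumption: N is taken as the total sample size n1 + n2.
    Group 1 is TD and group 2 is ASD, so a negative result means a
    larger mean in the ASD group.
    """
    n_total = n1 + n2
    d = (m1 - m2) / pooled_sd(n1, sd1, n2, sd2)
    correction = ((n_total - 3) / (n_total - 2.25)) * math.sqrt((n_total - 2) / n_total)
    return d * correction


# Hypothetical mean pitch values in Hz (invented for illustration only)
td_mean, td_sd, td_n = 210.0, 30.0, 20
asd_mean, asd_sd, asd_n = 225.0, 30.0, 20

d_star = corrected_d(td_mean, td_sd, td_n, asd_mean, asd_sd, asd_n)
print(f"d* = {d_star:.3f}")
```

Because both correction factors are smaller than one, d* is always closer to zero than the uncorrected Cohen’s d, and the two converge as the sample size grows.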

Statistical comparison of acoustic features between ASD and TD groups was performed using the Kolmogorov–Smirnov normality test, followed by an independent t-test for normally distributed cases or the non-parametric Mann–Whitney U test for non-normally distributed cases. A significance level of 0.05 was used for the statistical analyses. A two-way analysis of variance (ANOVA) was performed to test the differences between groups in terms of two confounding factors (i.e. age span of participants and type of voice elicitation task), and post-hoc analysis with Bonferroni correction for multiple comparisons was used to find possible significant differences. The meta-analysis was performed in STATA version 14 (StataCorp, College Station, TX, USA), while the t-test and ANOVA analyses were performed in Matlab 2017b (MathWorks, MA, USA).
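The heterogeneity measures and thresholds described in this section can be sketched as follows. This is not the STATA implementation used in the study: it is a minimal Python illustration that assumes inverse-variance (fixed-effect) weights for Cochran's Q and the common definition of I2 as the excess of Q over its degrees of freedom, expressed as a percentage. The function names and example values are hypothetical.

```python
def cochran_q(effects, variances):
    """Cochran's Q with inverse-variance weights (assumed weighting)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q, k):
    """I2 in percent for k studies, floored at zero."""
    degrees_of_freedom = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - degrees_of_freedom) / q) * 100.0


def heterogeneity_label(i2_percent):
    """Thresholds from the text: above 75% high, below 25% small, else moderate."""
    if i2_percent > 75:
        return "high"
    if i2_percent < 25:
        return "small"
    return "moderate"


# Hypothetical study-level SMDs and variances (invented for illustration only)
example_smds = [-0.9, -0.1, -0.5, 0.2]
example_variances = [0.04, 0.05, 0.03, 0.06]

q_value = cochran_q(example_smds, example_variances)
i2_value = i_squared(q_value, len(example_smds))
print(f"Q = {q_value:.2f}, I2 = {i2_value:.1f}% ({heterogeneity_label(i2_value)})")
```

When all studies report the same effect, Q is zero and I2 is reported as 0%, which falls in the small-heterogeneity band.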

Voice production tasks

It is hypothesized that different brain mechanisms are engaged in producing different types of human voice (i.e. unconstrained vs. constrained voice). It has been shown that, in some neurological and neurodevelopmental diseases, the type of voice elicitation task affects the produced prosodic features89. The selected studies were categorized according to the task used to elicit vocal production from participants, and three categories were considered: constrained voice production, unconstrained voice production and voice produced during crying. In the first category, referred to as “Focus”, participants were required to produce voice in response to a question, a request for word imitation, word repetition or a picture-naming task. The unconstrained category consisted of two subcategories: (1) without interaction with others, where vocal production occurred during story-telling or general narration tasks, and (2) during communication with others. Hereafter, the former is referred to as “Narration” and the latter as “Conversation”. Considering these two subcategories separately is important because of the impaired social communication behavior of individuals with ASD96.