Introduction

Autism spectrum disorder (ASD) is a neurobiologically based, highly heritable condition [4, 81], but the brain bases of ASD have proven complex and difficult to characterize. An early neurobiological insight was Leo Kanner’s observation that the majority of his autistic patients had enlarged head sizes [43]. Numerous studies have since confirmed greater head circumference in ASD compared to typically developing control (TDC) samples, especially in studies with large sample sizes (e.g., [17]; for review see [68]). Piven and colleagues were the first to report increased brain volume in ASD compared to controls using MRI [61], a finding that has been replicated in several studies [32, 33, 57]. However, not all imaging studies have found significant enlargement [3, 35, 37, 52, 70, 80].

One highly cited hypothesis—the Early Brain Overgrowth hypothesis [20]—accounts for inconsistencies in volumetric studies of ASD by postulating (1) average brain size at birth, followed by (2) periods of accelerated growth over the next 2 years, and then (3) deceleration of brain growth, equalizing volumes between groups by middle to late childhood. Support for the first two predictions (average size at birth followed by overgrowth) has been strong and consistent. Head circumference, which correlates highly with brain volume in newborns [49], has been observed to be normal at birth in most individuals who go on to develop ASD [21]. Moreover, brain volume appears to remain typical through 6 months of age in infants later diagnosed with ASD [38]. A recent longitudinal MRI study found a significantly greater rate of growth of total brain volume (TBV) between 12 and 24 months in ASD, resulting in significantly greater TBV in the ASD group at age 2 [39]. Consistent reports of larger TBV in young children with ASD [15, 22, 40, 75] also support the second prediction of the early overgrowth hypothesis: accelerated growth in the first 2 years.

The third prediction (arrested growth leading to normalization) has proven controversial. Several studies have found no difference in total brain volume between ASD and TDC groups in school-age children [3, 37, 52, 80] and/or adults [35, 70], and a 2005 meta-analysis supported overgrowth only in 2-to-5-year-old autistic children, not in children older than 6 years [63]. Nevertheless, many studies of school-age to adult samples have identified larger brain volumes in the ASD group [32, 33, 57, 61]. Two more recent meta-analyses (2008 and 2015) support continued enlargement from school age through adulthood [68, 76]. In a study of 1881 families of autistic 4–18-year-olds, affected probands had larger head circumference than their unaffected siblings, with an effect of 0.2 cm [17]. Although head circumference is not a direct measure of brain volume, this study suggests that, when appropriately controlling for sex, age, height, weight, and genetic ancestry, head size (an adequate predictor of brain volume [6]) remains enlarged in ASD.

If overgrowth does persist in ASD, it is not clear which tissue types drive these differences. Some research suggests an imbalance of gray matter (GM) and white matter (WM). However, both increased GM relative to WM [12] and increased WM relative to GM [41] have been noted. Differences in ventricle size have also been noted, including enlarged third ventricles [36], and ventricular enlargement in low-birth-weight neonates has been related to a seven-fold increased risk of developing ASD [54].

Previous work investigating brain volume in school age and beyond is limited by small sample sizes. For example, in the most recent meta-analysis [68], ASD samples ranged from 6 to 121 individuals, with a median of 20. Given the well-known heterogeneity in ASD of core symptom severity, intellectual abilities, and co-occurring psychiatric conditions [5, 51, 72], small samples have decreased statistical power to detect true effects and an increased chance of studying biased groups, producing results that are harder to replicate [14]. Meta-analytic efforts mitigate some concerns related to sample size, but sampling error in small original studies can result in biased meta-analytic estimates [48], and publication bias and selective reporting lead to biased effect size estimates in meta-analysis that are difficult to correct [47]. While multi-site network studies producing publicly available datasets, such as the Autism Brain Imaging Data Exchange (ABIDE [24]), are pushing the field toward ever-larger datasets, inter-site and inter-scanner differences contribute significant, and possibly nonlinear, noise to these data [31]. This between-scanner noise, when random, limits the ability to find group differences [2, 78], and when systematic, biases observed effects.

Small sample sizes also limit the investigation of important individual differences, such as IQ, sex, and ASD symptom severity. Both sex and IQ are known correlates of brain volume (with larger brains in males and in individuals with higher IQ [64, 91]) and have known clinical relationships with ASD. Approximately half of autistic children have IQ more than one standard deviation below the mean [5]. ASD is four times more prevalent in boys than girls, who are often disproportionately under-represented in or excluded from imaging studies. Some studies of brain structure and white matter tracts [8, 10] and of functional connectivity [92] have suggested interactions between sex and diagnosis, but more systematic study of sex differences is needed. Moreover, efforts to associate brain volume with core ASD symptom severity have produced mixed results [1, 68].

To assess the prediction of the Early Brain Overgrowth hypothesis that brain volume normalizes by school age and adolescence in ASD, we investigated the relationships of diagnosis, age, sex, IQ, and core ASD symptom severity with global brain volumes (i.e., total brain, gray matter, white matter, and ventricular volumes) in a large, diverse sample of children, adolescents, and adults. There are several important strengths to the samples examined in the present study. The samples are among the largest of their kind, including 456 individuals in the primary sample and 175 in the replication sample. Crucially, within each sample, all individuals were characterized and imaged at the same site, using the same MRI scanner and scan sequence, eliminating sources of error variance that are present in large samples produced by combining data across research sites. The primary sample is particularly strong in terms of diversity of several key characteristics with potential etiological correlates in ASD, including a large number of females with ASD (43, which to our knowledge represents the largest single-site female structural MRI sample to date), an inclusive IQ range (47–158), and a wide age range (6 to 25 years).

Materials and methods

Participants

Participants in the primary sample were selected from the larger group of individuals who had participated in any imaging study at CHOP’s Center for Autism Research between 2009 and 2015 and from whom a structural anatomical image was acquired. For individuals with ASD, final diagnosis was made by expert clinical judgment using DSM-IV criteria, informed by results from the Autism Diagnostic Observation Schedule [50] and the Autism Diagnostic Interview-Revised [67]. In keeping with DSM-5, all diagnostic subcategories (autism, Asperger’s, PDD-NOS) were pooled into a single ASD group in this study. Four hundred ninety-eight participants had structural scans. Nineteen of these were excluded due to poor scan quality, and 16 were excluded because they received a final diagnosis other than ASD. Seven more individuals were excluded for not having an IQ estimate, leaving a final sample of 456 individuals (see Table 1 and Table S1, Additional file 1 for demographic data, and Figure S1, Additional file 1 for age distributions). Diagnostic groups did not differ significantly on mean age or height (in the subset of 281 individuals for whom height was available at the time of the MRI). Groups differed significantly on proportion of males, reflecting general population differences between ASD and TDC. Racial proportions differed significantly between groups, so sensitivity analyses entailed repeating all analyses within only the White participants.

Table 1 Demographic and clinical information for the primary and replication samples

Cognitive ability

Participants’ cognitive ability (“IQ”) was assessed with one of four standard instruments: the General Cognitive Ability score of the Differential Abilities Scale, Second Edition [26], the Full Scale IQ of the Wechsler Intelligence Scale for Children, Fourth Edition [90], or the Wechsler Abbreviated Scale of Intelligence, First or Second Edition [88, 89]. The distribution of IQ is shown in Figure S1, Additional file 1. IQ in the ASD group (M = 100.9, SD = 20.6) was significantly lower than in controls (M = 113.0, SD = 16.0, t = − 7.0, p < 0.001).

Clinical severity

Clinical severity was assessed with three measures: The Social Responsiveness Scale–2 (SRS-2), a parent questionnaire assessing current ASD traits [19]; the Autism Diagnostic Observation Schedule Calibrated Severity Score (ADOS CSS), an estimate of severity based on clinician ratings [34]; and the Social Communication Questionnaire (SCQ), a parent questionnaire assessing lifetime symptom severity [66].

Parental education

Socio-economic status (parental educational attainment, occupation, and income) is related to children’s neurocognitive functioning, mediated by brain structure, with increased educational attainment of parents predicting increased surface area and volume in children [55]. Because of the well-known relationships between parental education, brain structure, and cognitive functioning, we (1) tested whether these relationships were observed in our TDC sample, and (2) conducted exploratory analyses to examine these relationships in ASD. See Supplementary Methods, Additional file 1 for treatment of this variable.

Replication sample participants and characterization

The replication sample consisted of an all-male cohort collected at Yale University. From 215 available participants, 40 were excluded due to poor scan quality, leaving a final sample of 175. Distributions of age and IQ within the groups are shown in Figure S2, Additional file 1 and Table 1. Yale sample participants were evaluated with the Wechsler Intelligence Scale for Children, Third Edition [86], Wechsler Abbreviated Scale of Intelligence, First Edition [88], or the Wechsler Adult Intelligence Scale, Revised or Third Edition [85, 87]. Diagnostic groups differed significantly on age (t = − 2.92, p < 0.05) and IQ (t = − 4.60, p < 0.001).

Image acquisition

CHOP anatomical images were acquired on a Siemens 3 T wide-bore Magnetom Verio Tim scanner with a 32-channel head coil and a Siemens MPRAGE sequence (0.9 × 0.41 × 0.41 mm, TR = 1900, TE = 2.54, flip angle = 9). Replication sample images were collected at Yale University on a GE Signa 1.5 T using a high resolution SPGR sequence (2 NEX, 1.2 mm3; TR = 24, TE = 5, flip angle = 45, matrix = 192 × 256, FOV = 30 cm, 124 contiguous 1.2 mm thick sagittal images).

Image processing

CHOP images were N3 bias corrected with ANTS [83] and brain extracted with LABEL ([71], see Fig. 1). Brain extractions were visually inspected and manually edited with ITK-SNAP [93] if cortex was removed by the automated extraction. Yale images were intensity normalized with a histogram normalization procedure in the BioImage Suite Package [59]. Brain extraction was performed using BET (Brain Extraction Tool [74]) and conservatively thresholded to remove non-brain pixels only. Manual editing was performed to remove remaining non-brain tissue. Raters demonstrated excellent inter-rater reliability for brain volume (ICC = .99, n = 25).

Fig. 1 Raw T1-weighted images (left) were N3 bias corrected and skull-stripped, with manual corrections to ensure cortex was not removed (middle). Skull-stripped images were processed with Freesurfer with manual corrections (right), producing volume estimates

For both datasets, segmentation of the volumes was performed by the Freesurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/, [23, 27,28,29]), producing total brain volume (TBV), gray matter volume (GMV), white matter volume (WMV), and ventricular volumes. To mitigate concerns that preprocessing techniques instantiated by different statistical packages show differential biases in comparing ASD and TDC volume [44], segmentations were visually inspected slice-by-slice (blind to all subject characteristics). Segmentation errors were manually edited using Freeview (e.g., dura labeled as gray matter, inaccurate identification of the gray/white or pial surface). Final segmentations were visually inspected and excluded if motion artifacts impacted segmentation quality, if a superior image was available for the participant (in the primary dataset), or if correspondence between Freesurfer’s total brain volume estimate and the total brain volume from gold standard manual tracing was exceptionally poor (in the replication dataset). See Supplementary Methods and Figure S3, Additional file 1 for reliability information and details about volume definitions.

For each volume measure, a regression model was tested including IQ, age, sex (CHOP only), diagnosis, the interaction of IQ and diagnosis, the interaction of age and diagnosis, and the interaction of sex and diagnosis (CHOP only). Nonsignificant interaction terms were removed to simplify the models and provide more precise estimates of the effect sizes of the main effects, with full models presented in Additional file 1 to illustrate null interaction findings. In the subset of individuals from the CHOP site for whom accurate height data was available, effects of height were also investigated. Effect sizes are reported as partial eta squared (partial η2) derived from equivalent ANOVAs. Partial η2 measures the proportion of variance in a dependent variable associated with each independent variable, with the effects of other independent variables and interactions partialled out, with suggested interpretive benchmarks of .01, .06, and .14 for small, medium, and large effects [65]. Estimates and standardized estimates from regressions are also included in the tables.
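The effect size definition above can be illustrated with a short sketch. It assumes that the sums of squares (or the F statistic and its degrees of freedom) for a term have already been obtained from an ANOVA table; the function names and the "negligible" label for values below .01 are illustrative choices, not part of the analysis pipeline used in this study.

```python
def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Partial eta squared from sums of squares: SS_effect / (SS_effect + SS_error)."""
    if ss_effect < 0 or ss_error <= 0:
        raise ValueError("sums of squares must be non-negative, with SS_error > 0")
    return ss_effect / (ss_effect + ss_error)


def partial_eta_squared_from_f(f_stat: float, df_effect: int, df_error: int) -> float:
    """Equivalent form using the F statistic: F*df1 / (F*df1 + df2)."""
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


def benchmark(eta_p2: float) -> str:
    """Label an effect using the .01 / .06 / .14 benchmarks cited in the text.

    The "negligible" label for values below .01 is an assumption of this sketch.
    """
    if eta_p2 >= 0.14:
        return "large"
    if eta_p2 >= 0.06:
        return "medium"
    if eta_p2 >= 0.01:
        return "small"
    return "negligible"


if __name__ == "__main__":
    # Hypothetical sums of squares, for illustration only.
    value = partial_eta_squared(ss_effect=6.0, ss_error=94.0)
    print(value, benchmark(value))
```

The two functions agree because F = (SS_effect / df_effect) / (SS_error / df_error), so F × df_effect / df_error equals SS_effect / SS_error.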

Power

In CHOP models including all terms (4 main effects and 3 interactions), there is 80% power to detect effects of f2 = 0.03, where Cohen’s guidelines suggest that effects of f2 > 0.02 are small and f2 > 0.15 are medium. In Yale models with all terms (3 main effects and 2 interactions), there is 80% power to detect effects of f2 = 0.08. Thus, the CHOP sample is powered to detect small effects, and the Yale sample is powered to detect small-to-medium effects.
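Because power is stated here in units of Cohen's f2 while results are reported as partial η2, it can help to convert between the two. The sketch below uses the standard identity f2 = partial η2 / (1 − partial η2); the function names are illustrative, and the conversion is offered as a reading aid rather than as part of the reported power analysis.

```python
def f2_to_partial_eta2(f2: float) -> float:
    """Convert Cohen's f-squared to partial eta squared: f2 / (1 + f2)."""
    if f2 < 0:
        raise ValueError("f2 must be non-negative")
    return f2 / (1.0 + f2)


def partial_eta2_to_f2(eta_p2: float) -> float:
    """Convert partial eta squared to Cohen's f-squared: eta / (1 - eta)."""
    if not 0 <= eta_p2 < 1:
        raise ValueError("partial eta squared must lie in [0, 1)")
    return eta_p2 / (1.0 - eta_p2)


if __name__ == "__main__":
    # Detectable effects at 80% power, as stated in the text.
    for label, f2 in (("CHOP", 0.03), ("Yale", 0.08)):
        print(f"{label}: f2 = {f2} -> partial eta2 = {f2_to_partial_eta2(f2):.3f}")
```

Under this identity, the detectable effects of f2 = 0.03 and f2 = 0.08 correspond to partial η2 of roughly 0.029 and 0.074, respectively.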

Extreme size subgroup analysis

Some prior work has suggested that there is a higher rate of macrocephaly (head circumference above the 98th percentile) among autistic people. A recent meta-analysis found 15.7% of autistic participants had macrocephaly, and 9.1% showed brain overgrowth [68]. Higher rates of microcephaly have also been reported in ASD [30]. We conducted a post hoc analysis to explore the possibility that group-average differences in brain volume were driven by a subgroup of macrocephalic individuals in the ASD group. Within the TDC group, the mean and standard deviation of TBV were calculated within 3-year age bins separately by sex, and the number of individuals within the ASD group whose TBV exceeded 2 SD from the mean of their respective age/sex bin was examined. Individuals were excluded from this analysis when there were fewer than 2 TDC individuals within an age bin, because standard deviation could not be calculated. No individuals were excluded from the CHOP sample for this reason; 7 age bins including a total of 13 individuals were excluded from the Yale sample due to insufficient TDCs in the age bin.
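The binning procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the tuple layout, the bin origin at age 6, and the label names are assumptions, and the example values are invented.

```python
from collections import defaultdict
from statistics import mean, stdev

BIN_WIDTH_YEARS = 3  # 3-year age bins, as described in the text


def age_bin(age: float, start: float = 6.0) -> int:
    """Index of the 3-year bin containing `age` (bin origin is an assumption)."""
    return int((age - start) // BIN_WIDTH_YEARS)


def classify_extremes(tdc, asd, n_sd: float = 2.0, min_tdc: int = 2):
    """Label each ASD participant relative to TDC norms for the same age bin and sex.

    `tdc` and `asd` are lists of (age, sex, tbv) tuples. Returns one label per
    ASD participant: "large", "small", "typical", or "excluded" when the bin
    holds fewer than `min_tdc` TDC participants (no SD can be computed).
    """
    reference = defaultdict(list)
    for age, sex, tbv in tdc:
        reference[(age_bin(age), sex)].append(tbv)

    labels = []
    for age, sex, tbv in asd:
        ref = reference.get((age_bin(age), sex), [])
        if len(ref) < min_tdc:
            labels.append("excluded")
            continue
        m, sd = mean(ref), stdev(ref)  # sample SD (n - 1 denominator)
        if tbv > m + n_sd * sd:
            labels.append("large")
        elif tbv < m - n_sd * sd:
            labels.append("small")
        else:
            labels.append("typical")
    return labels


if __name__ == "__main__":
    # Invented values in arbitrary volume units, for illustration only.
    tdc = [(7.0, "M", 1200.0), (8.0, "M", 1300.0), (7.5, "M", 1250.0), (10.0, "F", 1100.0)]
    asd = [(6.5, "M", 1400.0), (8.2, "M", 1260.0), (7.1, "M", 1100.0), (10.5, "F", 1150.0)]
    print(classify_extremes(tdc, asd))
```

For the all-male Yale sample, the same sketch applies with a single sex value for every participant.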

Results

Group volume differences

Final models are presented in Table 2, with group means, standard deviations, and Cohen’s d effect size estimates presented in Table 3, and models including all non-significant interactions in Table S2, Additional file 1. Figure 2 graphically displays the relationships between TBV, GMV, WMV, and age and IQ, separated by diagnosis and sex in the CHOP sample. Diagnosis significantly predicted all brain tissue variables used in the analyses (TBV, GMV, WMV, cortical GMV, cortical WMV, cerebellar GMV, cerebellar WMV), except the ratio of GM to WM. All significant effects of diagnosis were in the direction of ASD showing larger volume than TDC. There also was a significant diagnosis-by-IQ interaction predicting TBV, GMV, WMV, cortical GMV, cortical WMV, and cerebellar GMV. There was no significant diagnosis-by-IQ interaction for cerebellar WMV.

Table 2 Models for primary sample. Uncorrected p-values are reported. One outlier was removed from the lateral ventricle model.
Table 3 Mean and standard deviation of volumes in each group in the CHOP sample, and Cohen’s d for the difference between groups. Note that groups are not matched for age, sex, and IQ, and that the Cohen’s d effect size estimate does not account for these factors
Fig. 2 Relationships of IQ and age with total brain volume (a, b), gray matter volume (c, d), and white matter volume (e, f) in the primary sample, by diagnosis and sex. IQ shows a significant interaction with diagnosis predicting all three outcome measures (b, d, f). Age did not significantly predict TBV (a), negatively predicted GMV (c), and positively predicted WMV (e). Significant main effects of diagnosis and sex were observed in all 3 measures. Dashed lines indicate regions of significance; the effect of diagnosis is not significant within the shaded region

To further understand this interaction, the regions of significance of the diagnosis-by-IQ interaction were evaluated using the Johnson-Neyman procedure, which indicates at which levels of a moderator an independent variable has a significant effect on the dependent variable [7]. Controlling for age and sex, the effect of diagnosis on TBV was significant for IQ scores less than 115.3 (ASD > TDC) and greater than 140.1 (TDC > ASD). That is, TBV differed significantly between the ASD and TDC groups at IQ scores below about 115 and above about 140, but not in between. Within the TDC group, the semi-partial correlation of IQ with TBV given age and sex was r = 0.38, p < 0.001. Within the ASD group, this correlation was r = 0.045, p = 0.47. Thus, the typical positive correlation between IQ and TBV was absent in the ASD group.
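The Johnson-Neyman boundaries can be sketched as the roots of a quadratic. For a model containing diagnosis, a moderator (here IQ), and their product, the conditional effect of diagnosis at moderator value m is b_d + b_dm × m, with variance var_d + 2m × cov + m² × var_dm; the boundaries are the values of m at which the squared effect equals t_crit² times that variance. The coefficient names, the example values, and the default critical value of 1.96 (a large-sample approximation) are assumptions for illustration and are not estimates from this study.

```python
import math


def conditional_effect(b_d: float, b_dm: float, m: float) -> float:
    """Effect of diagnosis at moderator value m: b_d + b_dm * m."""
    return b_d + b_dm * m


def conditional_se(var_d: float, var_dm: float, cov_d_dm: float, m: float) -> float:
    """Standard error of the conditional effect at moderator value m."""
    return math.sqrt(var_d + 2.0 * m * cov_d_dm + m * m * var_dm)


def johnson_neyman_bounds(b_d, b_dm, var_d, var_dm, cov_d_dm, t_crit=1.96):
    """Moderator values where the conditional effect is exactly at threshold.

    Solves (b_d + b_dm*m)^2 = t_crit^2 * (var_d + 2*m*cov + m^2*var_dm).
    Returns a sorted list of 0, 1, or 2 real roots. When the leading
    coefficient is positive, the effect is non-significant between the roots.
    """
    t2 = t_crit ** 2
    a = b_dm ** 2 - t2 * var_dm
    b = 2.0 * (b_d * b_dm - t2 * cov_d_dm)
    c = b_d ** 2 - t2 * var_d
    if a == 0.0:
        return [] if b == 0.0 else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])


if __name__ == "__main__":
    # Hypothetical coefficient estimates and (co)variances, for illustration only.
    bounds = johnson_neyman_bounds(b_d=10.0, b_dm=-1.0, var_d=1.0,
                                   var_dm=0.0, cov_d_dm=0.0, t_crit=2.0)
    print(bounds)
```

In practice the exact critical t for the model's residual degrees of freedom would replace the default, and the coefficient covariance matrix would come from the fitted regression.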

Across both groups, age negatively predicted GMV, cortical GMV, and cerebellar GMV. Age positively predicted WMV, cortical WMV, and cerebellar WMV. Age did not significantly predict TBV. Notably, there were no significant age-by-diagnosis interactions in any of the models tested.

Sex was a significant predictor in every model, with large effects (male larger than female) on all measures except cerebellar WMV, on which it had a small effect. Notably, there were no significant sex-by-diagnosis interactions. There were significant main effects of IQ in all models except the cerebellar WMV. All significant IQ effects occurred in the presence of an IQ-by-diagnosis interaction.

All models were also tested with height as a predictor in the subset of individuals for whom a measure of height was available within a year of the scan, with few changes to the significance of results. In white matter models (WMV, cortical WMV, and cerebellar WMV), age became non-significant as a predictor, likely due to multicollinearity (age and height were highly correlated, r = 0.86, p < 0.001). Within the TBV model, age became a significant predictor (partial η2 = 0.03, p < 0.01). The only other qualitative change was in the model of cerebellar WMV, in which the effects of sex and diagnosis became non-significant, likely due to less statistical power compared to the full sample (n = 456 versus n = 281).

Ventricles

Controlling for age, sex, and IQ, diagnosis was a significant predictor of lateral ventricular volume (partial η2 = 0.013, p < 0.05). Visual inspection of data indicated one extreme outlier in lateral ventricular volume; removing this outlier reduced the size of the effect (partial η2 = 0.009, p < 0.05, Fig. 3). There was a significant age-by-diagnosis interaction in the third ventricle (partial η2 = 0.013, p < 0.05). In the ventricles, unlike in the majority of the tissue volume measures, there were no significant IQ-by-diagnosis interactions in predicting volume.

Fig. 3 Ventricular volume in the primary sample

Gray matter-to-white matter ratio

To examine the relative contributions of GMV and WMV to differences between the groups, the GMV-to-WMV ratio was examined in a model similar to those used to examine primary volumetric measures. This yielded no significant interaction terms, and no significant main effect of group or IQ. There were significant effects of age (partial η2 = 0.46, p < 0.001) and sex (partial η2 = 0.03, p < 0.05), with a greater gray-to-white ratio in females, and with this ratio decreasing with age (Figure S4, Additional file 1).

Clinical correlates of brain size

When controlling for age, sex, and IQ, neither the ADOS CSS, the SRS, nor the SCQ significantly predicted TBV within the ASD group (Fig. 4, Table 4). That is, none of the three measures of ASD severity correlated with brain volume in the ASD group.

Fig. 4 Relationships of clinical severity measures (a, SRS; b, ADOS CSS; c, SCQ) with TBV within the CHOP ASD group only. No severity measure showed a significant relationship with TBV, controlling for age, sex, and IQ

Table 4 Models showing the relationships of clinical severity measures with TBV within the CHOP ASD group only. No severity measure showed a significant relationship with TBV, controlling for age, sex, and IQ

Parental education

To obtain precise statistics accounting for the rank-order nature of the parental education data, zero-order correlations between parental education and brain volume within each group were examined with Kendall’s Tau. Within the TDC group, the relationship between TBV and parental education was significant and positive (τ = .21, p < 0.001, Figure S5, Additional file 1). This relationship was negative (although non-significant) within the ASD group (τ = − 0.09, p = 0.09). To explore the significance of this apparent disordinal interaction, parental education was added to the model of TBV, such that the full model was TBV ~ diagnosis + age + sex + IQ + parental education + IQ*diagnosis + parental education*diagnosis. In this model, the interaction of parental education and diagnosis was significant (partial η2 = 0.03, p < 0.01). This interaction is also significant in separate models in which father’s education is included as a binary factor (college degree or no college degree), indicating that this interaction effect is robust to the choice of statistical method. When mother’s education is included as a binary factor, it is not significant (p = 0.12).
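For readers unfamiliar with rank correlations on tied ordinal data, the following sketch computes Kendall's tau-b by direct pair counting. The source does not state which tau variant was used; tau-b is assumed here because ordinal education categories produce many ties. The integer education codes and volumes in the example are invented.

```python
import math


def kendall_tau_b(x, y) -> float:
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D count concordant and discordant pairs; Tx and Ty count pairs tied
    only on x or only on y. Pairs tied on both variables are ignored.
    Quadratic in the number of observations, which is adequate for a sketch.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    if denom == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denom


if __name__ == "__main__":
    # Invented data: ordinal education codes and brain volumes (arbitrary units).
    education = [1, 1, 2, 3, 3, 4]
    volume = [1150.0, 1210.0, 1190.0, 1260.0, 1240.0, 1300.0]
    print(round(kendall_tau_b(education, volume), 3))
```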

Race

Because race was imbalanced between groups, all of the models in Table 2 were examined within only the White participants, to rule out the explanation that racial differences accounted for group differences. The significance of terms changed in only three models: in the model predicting cerebellar GMV, IQ, diagnosis, and the IQ-by-diagnosis interaction were no longer significant (possibly due to reduced power); in the model predicting cerebellar WMV, diagnosis was no longer significant; and in the model predicting third ventricle volume, sex and diagnosis became significant. Although diagnosis remained a significant predictor in all other models, the effect size of diagnosis was somewhat reduced.

Extreme size subgroup analysis

To identify ASD participants with extremely large or small brains, we calculated the mean and standard deviation of TBV within 3-year age bins separately by sex within the TDC group, and examined ASD individuals whose TBV exceeded 2 SD from the mean for their age and sex. In the ASD group, there were 10 individuals with brains 2 SD above the mean for their age/sex bin (4.1%), and 10 with brains 2 SD below (4.1%, compared to 2.3% above and 2.8% below in the TDC group). Although a higher proportion of ASD individuals had brains with extreme sizes than TDC individuals, this difference was not statistically significant (χ2 (2, N = 456) = 1.9, p = 0.38). Information on the age, IQ, and sex ratio for the ASD individuals with larger, smaller, and typically sized brains is presented in Table S3, Additional file 1. Comparing these groups statistically, there is a significant difference in age (F (2, 237) = 3.95, p < 0.05), with the mean age of the extreme ASD groups higher than the mean age of the ASD individuals with typically sized brains. There were no significant differences in IQ (F (2, 237) = 0.12, p = 0.88), sex ratio (Fisher’s exact test p = 0.25), or ADOS CSS (F (2, 235) = 1.34, p = 0.26) between the groups. To investigate whether a subgroup of individuals with extremely sized brains drove the between-group diagnostic differences, the TBV model presented in Table 2 was re-examined excluding all ASD individuals with TBV > 2 SD from the mean, and the effect of diagnosis remained significant (partial η2 = 0.062, p < 0.001).
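The comparison of extreme-size proportions between groups is a chi-square test of independence on a 2 × 3 table (group by large, small, or typical), which has 2 degrees of freedom. The sketch below computes the Pearson statistic and uses the closed-form p-value that holds specifically for 2 degrees of freedom, p = exp(−χ2 / 2). The counts in the example are invented for illustration; they are not the cell counts from this study.

```python
import math


def chi_square_statistic(table) -> float:
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes every expected count is positive (no empty row or column).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat: float) -> float:
    """Upper-tail p-value of a chi-square variate with exactly 2 df."""
    return math.exp(-stat / 2.0)


if __name__ == "__main__":
    # Invented counts: rows are groups, columns are large / small / typical.
    table = [[10, 20, 30],
             [20, 20, 20]]
    stat = chi_square_statistic(table)
    print(f"chi2 = {stat:.2f}, p = {p_value_df2(stat):.3f}")
```

Applied to the reported statistic, exp(−1.9 / 2) is approximately 0.39, which agrees with the reported p = 0.38 up to rounding of the statistic.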

Replication results

As in the primary dataset, diagnosis was a significant predictor of TBV, GMV, and WMV in the Yale dataset (Tables 5 and 6, Fig. 5). There was no significant interaction of age and diagnosis in any of the models. There was also no significant interaction of diagnosis and IQ. These interactions were dropped from the models for simplicity, but are presented in Table S4, Additional file 1. In the full models with all interaction terms, the main effects of diagnosis on TBV, GMV, and WMV were in the same direction as in the main-effects-only models (ASD > TDC), but were not significant, potentially due to the loss of degrees of freedom. Although there was not a significant IQ-by-diagnosis interaction, the correlation between IQ and TBV was qualitatively smaller in the ASD sample than the TDC sample, which was the pattern observed in the primary dataset. Within the TDC group, the semi-partial correlation of IQ with TBV given age was r = 0.35, p < 0.001. Within the ASD group, this correlation was r = 0.25, p < 0.05. IQ was a significant predictor of TBV, GMV, and WMV. Age was a significant predictor of TBV and GMV. As in the primary dataset, diagnosis did not predict the ratio of GMV-to-WMV (Figure S6, Additional file 1). Additionally, both lateral ventricles and third ventricles were enlarged in the ASD group (Figure S7, Additional file 1). In the ASD subgroup analysis, there were 7 individuals with brains 2 SD above the mean for their age bin (9.1%), and 2 with brains 2 SD below (2.6%, compared to 2.4% above and 0% below in the TDC group). Fisher’s exact test indicates that the proportion of individuals with extremely-sized brains is different between the ASD and TDC groups (p = 0.035). Within the ASD group, there were no differences between the small, large, and typically-sized subgroups in age (F (2,74) = 0.49, p = 0.62) or IQ (F (2,74) = 0.33, p = 0.72). 
In models testing the main effects of IQ, age, and diagnosis on TBV when excluding the extremely-sized ASD individuals, diagnosis remained significant (partial η2 = 0.027, p = 0.04).

Table 5 Main effects of IQ, Age, and Diagnosis in the Yale sample.
Table 6 Mean and standard deviation of volumes in each group in the Yale sample, and Cohen’s d for the difference between groups. Note that groups are not matched for age and IQ, and that Cohen’s d does not account for these factors
Fig. 5 Relationships of IQ and age with total brain volume (a, b), gray matter volume (c, d), and white matter volume (e, f) in the Yale sample, by diagnosis. Age negatively predicted TBV and GMV (a, c). IQ positively predicted all three measures (b, d, f). ASD status positively predicted all three measures

Discussion

We do not find evidence to support the prediction that brain size in ASD normalizes over development. The Early Overgrowth hypothesis predicts that either (1) there should be no significant main effect of diagnosis on brain size (i.e., that brain size has normalized between the groups aged 6–25 years old in our sample) or (2) there should be a significant interaction of age and diagnosis, with volumetric differences for the youngest autistic children and normalization of brain volume across the age range. Our findings do not support either prediction. In the primary sample, we found a significant main effect of diagnosis for GMV, WMV, and TBV, and no significant interaction with age, with volumes about 2.8–3.2% larger in the ASD group. This finding was replicated in the Yale sample, in which TBV and GMV are 3.1% and 5.3% larger in the ASD group, with no interactions with age. Furthermore, our explorations of sub-groups of ASD individuals with particularly enlarged brains (> 2 SD from the typical mean for their age/sex) suggested that this enlargement was slightly more common in older youth in the CHOP sample and consistent across ages in the Yale sample. This indicates that findings of group-level enlargement in ASD were not driven by a subset of only the youngest children in our samples having enlarged brains. In addition, we observed a significant interaction in the primary sample between diagnosis and IQ, such that the overall brain enlargement effect in ASD was driven by children with IQ scores less than 115. This interaction is due to the stronger correlation in the TDC sample between IQ and brain volume than in the ASD sample, which showed no relationship between IQ and brain volume. The main effect of diagnosis should be interpreted in light of this significant interaction with IQ. 
Nevertheless, this study finds converging evidence in two large datasets that early brain overgrowth persists through adolescence and into early adulthood in ASD, failing to support “normalization” predictions from the Early Overgrowth hypothesis.

Increased brain volume has been one of the most consistently observed biomarkers of ASD in young children. Our results suggest that brain enlargement persists into early adulthood. This is consistent with some prior publications, including MRI studies [32, 33, 57, 61, 76] and a very large study of head circumference [17], but not with others [3, 35, 37, 52, 63, 70, 80]. The current results are noteworthy because of the large samples collected on the same scanner (by sample), inclusion of a broad age and IQ range, and large female representation. The size and quality of the samples we report allow for more generalizable and definitive conclusions about the development of brain size in autism than have previously been possible from smaller studies.

These results support neither the model of GM/WM imbalance predicting increased GM but decreased WM in ASD [12], nor the model predicting greater effect sizes of diagnostic group on WM than GM [41]. Rather, the data suggest that structural differences in WM occur in roughly equal proportion to GM between groups.

There are several potential mechanisms underlying the persistent brain volume difference in ASD. Brain volume is the product of cortical thickness and cortical surface area, which are independently heritable and have unique mechanistic underpinnings [58]. Increased surface area has been identified in some ASD samples [56] but not others [62]. Greater cortical thickness or differential rates of change have also been observed in some studies [25, 46, 62, 73], but not all [56]. Using a subset of the CHOP sample reported in the current study, we recently reported that regional deviations from a normative model of brain development in diffusion metrics, volume, thickness, and surface area can accurately classify diagnostic status, although diffusion metrics out-performed anatomical measures in this age-based approach [82]. We plan to further investigate regional differences in cortical surface area and thickness in future work.

In typical development, dendritic arborization and synaptogenesis occur rapidly in the first year of life, followed by dendritic pruning [42]. The emergence of brain volume differences and clinical symptoms across the first 2 years of life in ASD points to these as candidate mechanisms. One potential mechanism of reduced dendritic pruning is mammalian target of rapamycin (mTOR) kinase, which is regulated by a number of genes associated with ASD, including TSC1/TSC2, NF1, and PTEN [13]. Hyperactive mTOR can produce excessive synaptic proteins and impair autophagy, and has been correlated with increased dendritic spine density in post-mortem brains of autistic individuals [79]. Another potential genetic source of this effect is chromodomain helicase DNA binding protein 8 (CHD8). This regulatory gene with neurodevelopmental targets has been strongly associated with ASD [69], and has been clinically associated with macrocephaly in ASD and in zebrafish models [9]. These cellular processes may be expressed differentially in different regions of the brain. For example, post-mortem studies of neuronal density revealed higher density in some regions of autistic brains compared to controls [16], but lower density in other regions [84].

In the context of significantly larger brains on average in ASD, we find no correlation between any of our severity measures and TBV, consistent with prior findings in preschoolers [1]. The failure to find correlations with symptom severity complicates the clinical implications of the enlarged-brain biomarker, given the conceptualization of autism as a spectrum disorder. If the degree of brain enlargement is not associated with the degree of core ASD symptoms, it is unclear how increased brain volume is functionally important to causal mechanisms of ASD. It may be that increased brain volume is not an underlying source of core ASD symptom differences, but represents a collateral consequence of the true underlying source. If true, increased brain size could be a biomarker of ASD, without being central to the pathophysiology [77]. Alternatively, enlarged brain volume may represent a categorical diathesis for ASD, with dimensional causes and symptoms overlaid. Another alternative is that global volume differences may not entirely reflect localized differences in regions, pathways, and networks, which may correlate more closely with symptom severity. A further important alternative is that group-level volume differences may be driven by a subsample of individuals with both ASD and enlarged brains, who are not distinguished by clinical severity [1]. We find that 4.1% of the CHOP ASD group and 9.1% of the Yale ASD group had brain volume greater than 2 SD above the mean for their age and sex. However, neither IQ nor clinical severity differed between the subgroups with extremely large or small brains and the subgroup falling within the typical range. Importantly, the group-average difference in brain volume remains significant when excluding the autistic individuals with extremely large or small brains, indicating that group-level differences are not entirely driven by a subgroup of individuals. Finally, the absence of a correlation between brain size and ASD severity might indicate that our ASD symptom metrics fail to capture important aspects of ASD heterogeneity, as our three measures are poorly correlated with one another in this sample and in others [11].
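As a point of reference for the proportions exceeding 2 SD, the sketch below computes the share of a normally distributed reference population expected above that cutoff and compares it with the share observed in a set of z-scores. It is illustrative only: the function name `fraction_above` and the z-scores are hypothetical, and it assumes brain volumes have already been standardized against age- and sex-specific norms, which may not match the procedure used in the analyses reported here.

```python
from statistics import NormalDist


def fraction_above(z_scores, threshold=2.0):
    """Return the share of z-scores strictly above the threshold."""
    return sum(z > threshold for z in z_scores) / len(z_scores)


# Share of a normally distributed reference population expected above +2 SD.
expected_tail = 1.0 - NormalDist().cdf(2.0)

# Hypothetical age- and sex-adjusted z-scores (made-up values, not study data).
example_z = [-0.4, 0.3, 2.6, 1.1, 0.8, -1.2, 2.1, 0.5, 1.7, 0.0]
observed = fraction_above(example_z)

print(f"expected share above +2 SD: {expected_tail:.4f}")
print(f"observed share in example:  {observed:.2f}")
```

Under normality roughly 2.3% of a reference population falls above +2 SD, which gives a baseline against which observed proportions can be read.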

Diagnostic group differences in brain volume are also complicated by an interaction between diagnosis and IQ. In humans, the relationship between brain size and intelligence has long been noted [53, 60, 91]. Indeed, in both the primary and replication datasets, we find a significant correlation within the TDC sample between brain size and IQ, while correcting for sex and age. However, correlations within the ASD group are smaller, and in the primary sample they are not significant. The weak or absent correlation with IQ suggests that individual differences in brain size have a different meaning in ASD, and that additional tissue volume does not confer cognitive advantages.

What, mechanistically, might disrupt the relationship between brain volume and IQ in ASD? The relationship may be weakened through a combination of underlying, unmeasured microstructural differences or differences in network organization. Alternatively, if IQ measurement is less reliable in ASD than in TDC, the correlation between IQ and brain volume would be attenuated. While the IQ measures used in this study have evidence of validity and reliability in typical and clinical samples [26] and in ASD-specific samples [90], evidence of test-retest reliability in an ASD sample is lacking, and the factor structure of IQ may be different in ASD [18].
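The attenuation argument can be made concrete with the classical correction-for-attenuation relationship from test theory, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The sketch below is a minimal illustration under that assumption; the function name `attenuated_r` and all numeric values are hypothetical and are not estimates from this study.

```python
import math


def attenuated_r(true_r, reliability_x, reliability_y):
    """Observed correlation implied by a true correlation and two reliabilities.

    Classical attenuation relationship: r_obs = r_true * sqrt(rel_x * rel_y).
    """
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical values: the same true brain-IQ correlation in both groups,
# near-perfect reliability for brain volume, and lower IQ reliability in one group.
TRUE_R = 0.30
VOLUME_RELIABILITY = 0.99

r_high_reliability = attenuated_r(TRUE_R, VOLUME_RELIABILITY, 0.95)
r_low_reliability = attenuated_r(TRUE_R, VOLUME_RELIABILITY, 0.70)

print(f"observed r, IQ reliability 0.95: {r_high_reliability:.3f}")
print(f"observed r, IQ reliability 0.70: {r_low_reliability:.3f}")
```

With these made-up inputs, the group with less reliable IQ scores shows a smaller observed correlation even though the underlying relationship is identical, which is the pattern the reliability explanation would predict.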

Our regions-of-significance analyses suggest that the ability of a study to detect brain enlargement in ASD depends on the IQ of the TDC sample. We expect little between-group difference when TDCs have high IQs, and greater difference when TDCs have lower IQs. Even a sample well-matched on IQ would be expected to show little difference if both groups are high in IQ. Failing to include lower-IQ TDC participants in imaging studies may bias results toward null brain volume differences between ASD and TDC, and contribute to controversy over the persistence of brain enlargement in ASD. Although the IQ-by-diagnosis interaction was not significant in the Yale sample, there are several reasons to believe the CHOP dataset is superior in accuracy and sensitivity (i.e., greater sample size, 3T versus 1.5T scanner, improved scan sequences, diversity of sex, superior matching of demographics). Post hoc power analyses using the effect sizes of the interaction obtained in the CHOP sample indicate that the power to detect this interaction in the Yale sample was 0.53 for TBV, 0.65 for GMV, and 0.29 for WMV. Thus, even the relatively large Yale dataset was likely underpowered to detect these interactions.
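To illustrate how an effect size and a sample size translate into power for a single regression term, the sketch below converts a partial η² to Cohen's f² and applies a large-sample normal approximation to a two-sided test with one degree of freedom. This is an assumption-laden illustration rather than the procedure used for the power values reported above: the function names, the noncentrality convention (f² multiplied by N), and the example inputs are all hypothetical, and an exact calculation would use the noncentral F distribution.

```python
import math
from statistics import NormalDist


def cohens_f2(partial_eta_sq):
    """Convert partial eta squared to Cohen's f squared."""
    return partial_eta_sq / (1.0 - partial_eta_sq)


def approx_power(partial_eta_sq, n, alpha=0.05):
    """Approximate power for one regression term (1 df), two-sided test.

    Uses a normal approximation with noncentrality sqrt(f^2 * n); this is a
    large-sample shortcut, not an exact noncentral-F calculation.
    """
    std_normal = NormalDist()
    noncentrality = math.sqrt(cohens_f2(partial_eta_sq) * n)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    upper = std_normal.cdf(noncentrality - z_crit)
    lower = std_normal.cdf(-noncentrality - z_crit)
    return upper + lower


# Hypothetical inputs: the same interaction effect size at two sample sizes.
for sample_size in (150, 400):
    power = approx_power(partial_eta_sq=0.02, n=sample_size)
    print(f"N = {sample_size}: approximate power = {power:.2f}")
```

The design point is that, for a fixed effect size, power rises with sample size, so an interaction detectable in a larger sample can plausibly be missed in a smaller one.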

Although expected sex effects were observed (i.e., larger brains in males than females), no sex-by-diagnosis interactions were observed in any of the measures. These findings suggest that diagnostic group differences in global brain morphology are not related to sex.

In the TDC sample, higher parental educational attainment (a proxy for socioeconomic status, SES) was associated with increased TBV and higher child IQ. Both findings are consistent with a theoretical model in which brain structure mediates the relationship between parent SES and child cognitive ability [55]. Interestingly, follow-up analyses found that the relationship between parental education and child’s IQ is attenuated in the ASD sample, and that the relationship between parental education and child’s brain volume is weak and reversed in ASD. These findings suggest that the mechanisms that result in enlarged brains in ASD disrupt the typical relationship between SES and neurocognitive development, as well as the relationship between brain size and cognitive ability. Additionally, including the interaction of parental education-by-diagnosis in a regression model predicting TBV increases the partial η² value of the main effect of diagnosis (from 0.05 to 0.11). This finding highlights the importance of obtaining information about cognitive ability and educational attainment of parents. Such information allows for the study of not only the autistic individual’s ability, but also how much that ability deviates from predicted familial relationships in the absence of ASD.

Limitations

Our samples’ age range (6–25 years) significantly limits our ability to fully evaluate the Early Brain Overgrowth hypothesis. Normalization is proposed to occur immediately following the period of overgrowth [20], with brain sizes of ASD and TDC equalizing by approximately age 5. Therefore, it is possible that the magnitude of the group differences observed in our sample would have been larger had participants been scanned as toddlers. A second limitation is that our data are cross-sectional. This would be most problematic if there were a systematic difference in the brain volumes of individuals who chose to participate at different ages. The ideal test of the Early Brain Overgrowth hypothesis would follow a cohort prospectively from diagnosis as a toddler to adulthood. Longitudinal study is particularly important to assess individual differences in growth trajectories. For example, it is possible that a subset of our sample had larger brains relative to peers as toddlers, experienced normalization, and now have average-sized brains, while other individuals’ brain size did not normalize and remained enlarged. The group differences we report demonstrate clearly that brain volume differences persist at a group level in autistic adolescents and adults, and further longitudinal study should investigate the potential clinical implications of differing individual trajectories.

Conclusions

In summary, this work provides evidence that brain volume does not normalize by school age, adolescence, or young adulthood in ASD. While the effect sizes obtained in both samples are somewhat smaller than those often reported in samples of toddlers, enlargement remains. As we do not have MRIs from younger ages, it is possible that some degree of normalization occurred prior to the present measurements; if so, this normalization was not complete. It is important that cellular and molecular researchers consider this developmental context in the search for mechanisms that might account for brain overgrowth in ASD.