Introduction

Autism spectrum disorder (ASD) is a neurodevelopmental condition with diverse manifestations across symptoms including social challenges, repetitive behaviors, and difficulties in verbal and nonverbal communication1. This complex condition emerges in childhood and can affect cognitive abilities, emotional functioning, sensory and motor skills, and social interaction2. ASD often co-occurs with other disorders, such as intellectual disability, seizures, and anxiety3. According to the Centers for Disease Control and Prevention (CDC), the prevalence of ASD has increased; roughly 1 in 36 children are now affected4. ASD has a heritable component, with genetic factors interacting with environmental influences5. While the exact causes are not fully understood, ASD is likely mediated by differential pathways of synaptic and neuronal development, cortical structure, and brain connectivity5. Synaptic alterations may be mediated by genetic factors that impact molecular pathways involved in brain growth and development5.

Early ASD diagnosis is critical to enable early intervention, which can improve social communication, cognitive, and behavioral outcomes in affected children, and provide support for families and caregivers6. Neuroimaging techniques such as functional MRI (fMRI), structural MRI (sMRI), electroencephalography (EEG), and functional near-infrared spectroscopy (fNIRS) are promising tools for understanding the neural underpinnings of ASD. sMRI reveals gray matter differences in ASD7,8,9. Reduced gray matter volume and increased gyrification in the temporal and frontal lobes are linked to language difficulties in autistic children7. Similarly, fMRI has provided valuable insights into the functional brain patterns associated with ASD10,11, such as connectivity alterations in regions involved in social behaviors and other core ASD-related behavioral differences12. Task-based fMRI reveals the "social brain" in ASD, implicating numerous brain regions including the medial prefrontal cortex, amygdala, and superior temporal sulcus13. ASD is associated with hypoactivation of these social brain regions relative to controls14. A meta-analysis of ten resting-state fMRI ASD studies revealed altered functional connectivity in the default mode network15. Other reports show a pattern of predominantly hypo-connectivity in resting-state fMRI (rs-fMRI) and task-based fMRI studies16,17,18, although hyper-connectivity has also been reported in ASD17,19. Collectively, these findings suggest that both hypo- and hyper-connectivity occur in the ASD brain, and that the specific pattern may depend on the brain regions and tasks involved20. Children with ASD often exhibit hyper-connectivity in regions such as the cerebellum and brainstem, which is linked to social interaction deficits19. Adolescents and adults show a more complex pattern, with both hyper- and hypo-connectivity, especially in networks like the default mode network (DMN) and salience-executive network19. Early development in toddlers with ASD is marked by hypo-connectivity between the DMN and visual circuits, associated with early social-communication difficulties21. Into adulthood, ASD is associated with hypo-connectivity in higher-order association areas that are implicated in complex cognitive functions22. Hyper-connectivity is linked to deficits in memory, attention, reasoning, and social interactions, while hypo-connectivity is associated with impairments in vision, execution, and social cognition23.

Machine learning (ML) is a form of artificial intelligence that uses statistical techniques to make predictions or decisions without being explicitly programmed. Deep learning has emerged as a particularly robust approach for automatically extracting features from large, complex data and reducing them to simpler outputs, such as binary classifications24,25. Although rs-fMRI can capture functional connectivity patterns and abnormalities linked to ASD26, and sMRI can reveal anatomical deviations, most prior studies have used single modalities for ASD classification27,28. With a sample of 500 individuals, a logistic regression ML framework demonstrated the feasibility of classifying ASD adults versus controls based on neuroimaging regional connectivity27.

Amplitude of low-frequency fluctuations (ALFF)29 and fractional ALFF (fALFF)30 maps reflect spontaneous brain activity in resting-state fMRI. ALFF and fALFF images are associated with the underlying neuronal activity and metabolic processes in the brain30. Regions with higher ALFF or fALFF values are considered to be more “active” or engaged in intrinsic brain processes during the resting state30.

One potential theory is that combining rs-fMRI and sMRI in a multimodal framework could yield better classification results. For instance, it was possible to classify ASD vs. controls using a fusion of rs-fMRI and sMRI data with an accuracy of 65.6%31. This combined accuracy surpassed the individual accuracies from exclusive use of rs-fMRI (60.6%), gray matter (63.9%), and white matter (59.7%) information. In addition to sMRI and fMRI, other data modalities have been explored for ASD classification, including behavioral, EEG, wearable sensor, eye-tracking, and genetic data. A lightweight convolutional neural network applied to EEG produced promising results by decoding neural signals related to ASD32. Researchers have also used computer-aided diagnosis in ASD, employing behavior signal processing to analyze audio, video, and eye-tracking signals33. By integrating multimodal data, researchers can identify distinctive patterns associated with ASD and achieve promising classification accuracy.

In the current study, we proposed a multimodal deep learning framework for ASD classification using sMRI and rs-fMRI data. We extracted ALFF29 and fALFF30 from rs-fMRI and trained a stacked 3D-DenseNet model with one-channel and two-channel architectures. We hypothesized that employing twinned neuroimaging data sources would improve ASD classification performance relative to single-input classifier approaches. To provide context for the 3D-DenseNet approach, we implemented an extreme gradient boosting (XGBoost)34 decision tree method that relied on region of interest (ROI) neuroimaging estimates to perform the classification task.

Materials and methods

Data

ABIDE I data were accessed on January 17, 202335, and consisted of 1112 potential participants, including 539 with ASD and 573 healthy controls. Data were restricted to individuals between 2 and 30 years of age. Data quality was assessed using visual and empirical methods: scans were excluded due to movement artifacts, ghosting, incomplete brain coverage, and other scanner artifacts. The resulting groups consisted of 351 participants with ASD and 351 controls. Figure 1 shows the cohort details, Table 1 shows age and sex details, and Supplementary Table S1 shows data collection by site and MRI scanner. We selected a balanced number of subjects from each site and within each group (ASD and control).

Fig. 1
figure 1

The CONSORT-style flow diagram demonstrates the quality control process, including screening against exclusion criteria, resulting in the final analyzed participants.

Table 1 Demographics of the participants.

Participants with major comorbidities were excluded across the different sites contributing to our dataset, as comorbid conditions can significantly influence clinical and neurobiological profiles. Participants with major psychiatric disorders (e.g., depression, schizophrenia, bipolar disorder), neurological conditions (e.g., epilepsy, traumatic brain injury), genetic disorders (e.g., Fragile X syndrome), and other significant medical conditions were systematically excluded based on each site's screening and diagnostic criteria.

sMRI data preprocessing

The sMRI data were preprocessed using AFNI36, FSL (FMRIB Software Library)37, and SynthStrip38 implemented in FreeSurfer39. T1-weighted images were downsampled to 3 mm isotropic resolution using trilinear interpolation, making these data more comparable to the fMRI and reducing the deep learning computational requirements. The sMRI data retained site-specific acquisition matrix size differences; specifically, the matrix sizes ranged from 58 × 79 × 66 to 86 × 86 × 49. This diversity was handled by adaptive average pooling layers in the deep learning architecture.
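
As a minimal sketch of how adaptive pooling absorbs these matrix-size differences (the layer placement and output grid here are illustrative assumptions, not the exact architecture):

```python
import torch
import torch.nn as nn

# Adaptive average pooling maps any input grid to a fixed output grid, so volumes
# of e.g. 58 x 79 x 66 or 86 x 86 x 49 voxels can pass through one shared network.
pool = nn.AdaptiveAvgPool3d(output_size=(48, 48, 48))  # output size is illustrative

small = torch.randn(1, 1, 58, 79, 66)  # (batch, channel, D, H, W)
large = torch.randn(1, 1, 86, 86, 49)

print(pool(small).shape)  # torch.Size([1, 1, 48, 48, 48])
print(pool(large).shape)  # torch.Size([1, 1, 48, 48, 48])
```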

rs-fMRI data preprocessing

The first four volumes were discarded to prepare the functional data for the following steps: (1) motion correction and (2) skull stripping using the Brain Extraction Tool (BET) with a fractional intensity threshold of 0.5. Additional steps were performed in line with the ALFF calculations:29 (3) despiking, (4) removing the linear temporal trend, (5) spatial smoothing using a Gaussian kernel with a 6 mm full width at half maximum (FWHM), and (6) time series bandpass filtering between 0.01 and 0.08 Hz to isolate low-frequency fluctuations.
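
For illustration, steps (1)–(6) could be chained from Python roughly as follows. The specific FSL/AFNI commands and intermediate file names are plausible choices consistent with the tools named above, not the exact pipeline used:

```python
import subprocess

def run(cmd):
    """Execute one shell command, raising on failure."""
    subprocess.run(cmd, shell=True, check=True)

f = "func"  # input 4D NIfTI stem (name is illustrative)

run(f"fslroi {f} {f}_trim 4 -1")                                          # discard first 4 volumes
run(f"mcflirt -in {f}_trim -out {f}_mc")                                  # (1) motion correction
run(f"bet {f}_mc {f}_brain -F -f 0.5")                                    # (2) BET, threshold 0.5
run(f"3dDespike -prefix {f}_dsp.nii.gz {f}_brain.nii.gz")                 # (3) despiking
run(f"3dDetrend -polort 1 -prefix {f}_dt.nii.gz {f}_dsp.nii.gz")          # (4) remove linear trend
run(f"3dmerge -1blur_fwhm 6 -doall -prefix {f}_sm.nii.gz {f}_dt.nii.gz")  # (5) 6 mm FWHM smoothing
run(f"3dBandpass -prefix {f}_bp.nii.gz 0.01 0.08 {f}_sm.nii.gz")          # (6) 0.01-0.08 Hz bandpass
```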

The power spectrum of the filtered time series was computed, and its square root was derived to obtain the amplitude across frequencies. ALFF was calculated by summing the amplitude within the 0.01–0.08 Hz low-frequency band. fALFF was computed as the ratio of ALFF to the total power across all frequencies, providing a fractional metric that controls for individual variations in signal strength. Examples of ALFF and fALFF maps are provided for an ASD participant in Fig. 2.
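
The per-voxel calculation can be sketched in a few lines of numpy. In this sketch the fALFF denominator is taken from the time series before bandpass filtering, an assumption that follows the standard fALFF definition30:

```python
import numpy as np

def alff_falff(ts, tr, f_lo=0.01, f_hi=0.08):
    """ALFF/fALFF for one voxel. `ts` is the detrended time series *before*
    bandpass filtering, so the fALFF denominator spans all frequencies."""
    freqs = np.fft.rfftfreq(len(ts), d=tr)
    amp = np.sqrt(np.abs(np.fft.rfft(ts)) ** 2 / len(ts))  # square root of the power spectrum
    band = (freqs >= f_lo) & (freqs <= f_hi)
    alff = amp[band].sum()    # amplitude summed over the low-frequency band
    falff = alff / amp.sum()  # fraction of amplitude across all frequencies
    return alff, falff

# Example: one simulated voxel time series, TR = 2 s, 196 volumes
rng = np.random.default_rng(0)
alff, falff = alff_falff(rng.standard_normal(196), tr=2.0)
```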

Fig. 2
figure 2

Example views of (a) ALFF and (b) fALFF rs-fMRI maps. The data correspond to a 12-year-old male with ASD.

Data augmentation

Data were augmented by rotation and scaling during model training. In each epoch, a volume had a 50% chance of being randomly rotated by up to ± 30 degrees around the z-axis and an independent 50% chance of being zoomed by a factor of 0.7–1.3×.
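
Given the MONAI-based implementation described below, such augmentation might be configured as in this sketch; the specific transform classes are an assumption:

```python
import numpy as np
from monai.transforms import Compose, RandRotate, RandZoom

# Each transform fires independently with probability 0.5 per volume, per epoch.
augment = Compose([
    RandRotate(range_z=np.deg2rad(30), prob=0.5),    # up to +/- 30 degrees about the z-axis
    RandZoom(min_zoom=0.7, max_zoom=1.3, prob=0.5),  # zoom factor drawn from [0.7, 1.3]
])

volume = np.random.rand(1, 61, 73, 61).astype(np.float32)  # channel-first 3D volume
augmented = augment(volume)
```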

Scaling and normalizing data

Image intensity histograms were examined for ALFF maps to evaluate potential site effects. A threshold was selected, and any ALFF voxel intensities beyond it were shifted to the new maximum bin value before all ALFF intensities were min–max normalized to the range (0, 1) using the following approach:

$$Intensity_{New} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
(1)
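
A minimal numpy sketch of the clipping and normalization; the threshold value shown is illustrative:

```python
import numpy as np

def clip_and_minmax(vol, clip_max):
    """Shift intensities beyond the threshold to the new maximum bin,
    then min-max normalize to (0, 1) per Eq. (1)."""
    x = np.minimum(vol, clip_max)
    return (x - x.min()) / (x.max() - x.min())

# The threshold would be chosen from the ALFF intensity histograms.
alff_norm = clip_and_minmax(np.random.rand(61, 73, 61) * 5.0, clip_max=3.0)
```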

3D-DenseNet

The DenseNet40 consisted of successive dense blocks, with the layers within each block connected through multiple feed-forward connections. The one-channel model used 3D image inputs; its details are presented in Supplementary Table S2 (e.g. the 2.44 million trainable parameters, batch normalization, ReLU activation, and convolutions in the final layers). The one-channel 3D DenseNet classifier had an initial convolutional layer followed by four dense blocks, each containing a specific number of dense layers (4, 5, 5, 4). Following the last dense block, there were two fully connected output layers. A growth rate of 32 defined the number of new feature maps added by each layer. The overall flowchart depicts the process, including the preprocessing of rs-fMRI and sMRI (Supplementary Fig. S1(a1)), ALFF and fALFF extraction (Supplementary Fig. S1(a2)), and employment of the one-channel stacked 3D-DenseNet, as illustrated in Supplementary Fig. S1(b).
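
Using MONAI's DenseNet (part of the software stack noted below), a one-channel model with the stated block configuration and growth rate could be instantiated roughly as follows; defaults such as the initial feature count are assumptions, and the exact layer details are in Supplementary Table S2:

```python
import torch
from monai.networks.nets import DenseNet

model = DenseNet(
    spatial_dims=3,             # 3D convolutions throughout
    in_channels=1,              # one-channel input: sMRI, ALFF, or fALFF
    out_channels=2,             # ASD vs. control
    growth_rate=32,             # new feature maps added by each dense layer
    block_config=(4, 5, 5, 4),  # dense layers per dense block
    dropout_prob=0.1,
)

logits = model(torch.randn(8, 1, 58, 79, 66))  # a batch of 8 volumes -> shape (8, 2)
```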

The two-channel DenseNet accommodated twinned 3D MRI inputs: sMRI with ALFF, or sMRI with fALFF. The two-channel model had 3 dense blocks per channel, each with 4 dense layers, as illustrated in Supplementary Fig. S1(c). The channel outputs were flattened, concatenated (464 features), and passed to a fully connected layer (200 neurons) and a final output layer (2 neurons).
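
A minimal sketch of this fusion pattern, with simplified placeholder branches standing in for the dense blocks; only the flatten-concatenate-classify structure follows the description above:

```python
import torch
import torch.nn as nn

class TwoChannelDenseNet(nn.Module):
    """Two parallel 3D feature extractors whose flattened outputs are
    concatenated (464 features) and classified by fully connected layers."""

    def __init__(self, branch_a: nn.Module, branch_b: nn.Module, fused_features: int = 464):
        super().__init__()
        self.branch_a = branch_a  # e.g. 3 dense blocks x 4 dense layers for sMRI
        self.branch_b = branch_b  # matching extractor for ALFF (or fALFF)
        self.fc = nn.Linear(fused_features, 200)
        self.out = nn.Linear(200, 2)

    def forward(self, x_smri: torch.Tensor, x_func: torch.Tensor) -> torch.Tensor:
        a = torch.flatten(self.branch_a(x_smri), start_dim=1)
        b = torch.flatten(self.branch_b(x_func), start_dim=1)
        fused = torch.cat([a, b], dim=1)  # (batch, 464) when each branch yields 232 features
        return self.out(torch.relu(self.fc(fused)))

# Placeholder branches pooling each volume to 232 features (the real branches are dense blocks).
def toy_branch():
    return nn.Sequential(nn.Conv3d(1, 232, kernel_size=3), nn.AdaptiveAvgPool3d(1))

net = TwoChannelDenseNet(toy_branch(), toy_branch())
logits = net(torch.randn(2, 1, 58, 79, 66), torch.randn(2, 1, 61, 73, 61))  # -> (2, 2)
```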

Model training started with a limited number of epochs to assess parameter ranges via grid searches. The learning rate varied from 0.00001 to 0.00005, depending on the epoch, over a total of 330 epochs. The batch size was 8 and the dropout rate was 0.1. The two-channel model had 3.22 million trainable parameters.
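
As a sketch of the training configuration: the optimizer and the exact schedule shape are assumptions, while the learning-rate range, epoch count, batch size, and dropout follow the text above:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in for the 3D-DenseNet sketched above

# Optimizer choice is an assumption (not stated in the text).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# Ramp the base learning rate from 1e-5 toward 5e-5 across 330 epochs (shape is illustrative).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 + 4.0 * epoch / 329
)

for epoch in range(330):
    ...  # iterate over batches of size 8, compute loss, call optimizer.step(), then:
    scheduler.step()
```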

For all models, data were split into 90% training and 10% testing sets, and model performance was evaluated using tenfold cross-validation. Models were implemented in Python 3.10 using PyTorch (version 2.0.0)41 and MONAI42 on a Linux operating system. Model training was performed using a high-performance computing environment (https://docs.alliancecan.ca/wiki/Cedar). The hardware consisted of 2 × Intel Silver 4216 Cascade Lake processors, each operating at 2.1 GHz, 32 GB of RAM, and 4 × NVIDIA V100 Volta GPUs, each with 32 GB HBM2 memory. One-channel networks trained for 9 h and 15 min, while two-channel networks trained for 14 h. Training visualization was done using an aggregator tool43.
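
The cross-validation loop maps onto standard scikit-learn utilities; stratification by diagnosis in this sketch is an assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

labels = np.array([0] * 351 + [1] * 351)  # 351 controls and 351 participants with ASD
subjects = np.arange(len(labels))         # stand-in indices for the imaging volumes

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(subjects, labels)):
    ...  # train a fresh model on the 90% split, evaluate on the held-out 10%
```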

XGBoost

We applied the XGBoost algorithm to the same sMRI data, using the Brainnetome atlas for ROI extraction on T1-weighted images to provide more explainability for the classification task. We used FSL to register the data onto the Brainnetome atlas and extracted the volume means of 246 ROIs to create tabular data for classifying ASD versus controls. Using the XGBoost library in Python, we performed a grid search to find the best hyperparameters. The model was trained with fivefold cross-validation on 80% of the data, with the remaining 20% reserved for testing.
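
A sketch of this workflow with the XGBoost and scikit-learn libraries; the hyperparameter grid shown is illustrative, not the grid actually searched:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X = np.random.rand(702, 246)         # 246 Brainnetome ROI volume means per subject
y = np.array([0] * 351 + [1] * 351)  # control vs. ASD labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 5, 7],          # illustrative search space
                "n_estimators": [100, 300],
                "learning_rate": [0.01, 0.1]},
    cv=5,                                        # fivefold cross-validation on training data
    scoring="accuracy",
)
grid.fit(X_tr, y_tr)
print(grid.best_params_, grid.score(X_te, y_te))
```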

Statistical analysis

Statistical tests were performed to analyze differences in age and sex between the ASD and control groups; there were no age (t-test) or sex (chi-squared) differences (p > 0.05).

Mean classification results are reported as accuracy, sensitivity, specificity, precision, and F1 score values, calculated across the ten folds for each data type (i.e. sMRI, ALFF, ALFF-sMRI). The 95% standard error (SE) confidence interval and standard deviation (SD) are reported for accuracies, sensitivities, specificities, precisions, and F1 scores.

A one-way analysis of variance (ANOVA) and pair-wise t-tests were conducted to assess the significance of differences in mean accuracy, specificity, and sensitivity between the sMRI-ALFF model and the one-channel models (sMRI and ALFF) across the ten folds. The performance of the different classification models was evaluated using receiver operating characteristic (ROC) analysis and area under the curve (AUC) values across folds.
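
These analyses correspond to standard scipy and scikit-learn calls, sketched here over placeholder per-fold scores; the use of paired t-tests across matched folds is an assumption:

```python
import numpy as np
from scipy.stats import f_oneway, ttest_rel
from sklearn.metrics import roc_auc_score

# Per-fold accuracies for the three models (placeholder values, 10 folds each)
acc_smri, acc_alff, acc_fused = (np.random.rand(10) for _ in range(3))

f_stat, p_anova = f_oneway(acc_smri, acc_alff, acc_fused)  # one-way ANOVA across models
t_stat, p_pair = ttest_rel(acc_fused, acc_smri)            # pair-wise t-test across folds

# AUC for one fold, from test labels and predicted class-1 probabilities
auc = roc_auc_score(y_true=np.array([0, 1, 1, 0]), y_score=np.array([0.2, 0.8, 0.6, 0.4]))
```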

Results

A chi-square test for independence compared the distribution of eyes-open and eyes-closed cases between the ASD and control groups; the result was non-significant (χ2 [1] = 1.63, p = 0.201).

Table 2 summarizes the performance metrics for three of the 3D-DenseNet models. The one-way ANOVA indicated significant differences in mean accuracy, specificity, precision, and F1 score between at least two of the models tested. Specifically, these differences were significant for accuracy (F(2, 27) = 15.5, p = 0.00003), specificity (F(2, 27) = 15.5, p = 0.00003), precision (F(2, 27) = 15.12, p = 0.00004), and F1 score (F(2, 27) = 8.27, p = 0.0015), where the degrees of freedom are df = 2, 27. Sensitivity was not significantly different across models (F(2, 27) = 1.38, p = 0.26).

Table 2 Performance metrics for three of the 3D-DenseNet models are provided after tenfold cross-validation.

The results of the pair-wise t-tests are shown in Table 3. The simultaneous use of ALFF and sMRI data in a two-channel DenseNet yielded significantly higher classification accuracy than using only sMRI (t = 5.6, p = 0.0003) or only ALFF (t = 2.8, p = 0.02). The ALFF results, individually and in combination with sMRI, were much better than those of the fALFF model; hence, the fALFF-based classification results are reported in Supplementary Tables S3 and S4.

Table 3 Each of the pairwise post-hoc t-tests showed a significant difference in the model accuracies.

Performance metrics are depicted in Fig. 3. The test accuracies and AUC values for the one-channel ALFF, one-channel sMRI, and two-channel ALFF-sMRI models are shown in Fig. 3a and b, respectively. Notably, the two-channel ALFF-sMRI model outperformed the other models in both mean AUC and mean accuracy. The ROC curves for these three models are shown in Fig. 3c–e.

Fig. 3
figure 3

Evaluation performance. (a) Test accuracy of different models across folds. (b) AUC values for different models across folds. (c) ROC curve of the one-channel sMRI model. (d) ROC curve of the one-channel ALFF model. (e) ROC curve of the two-channel sMRI + ALFF model. Note: fALFF model results are not shown due to lower performance than ALFF.

The results of the sMRI-based XGBoost model are shown in Table 4. To interpret the model's decisions, we generated a feature importance plot (Fig. 4), highlighting the ten most important ROIs contributing to the final decision. The top two ROIs, based on relative importance, were dCa_L (Basal Ganglia—Left dorsal caudate) with a score of 0.026, and IPFtha_L (Thalamus—Left lateral pre-frontal thalamus) with a score of 0.015.

Table 4 Performance metrics for the sMRI XGBoost model.
Fig. 4
figure 4

The ten most important ROIs in the XGBoost decision-tree model, which used sMRI ROI volumes as tabular inputs for classification.

Discussion

In this study, we evaluated whether a multi-modal 3D-DenseNet deep learning network could accurately classify ASD vs. controls. The sample consisted of a range of young people, which constitutes a relevant age window for ASD diagnosis. Data were balanced across sites within each group (ASD and control) to maintain consistency across sites while also creating a diverse total sample drawn from numerous imaging sites. This approach helps mitigate site-specific biases and ensures that the model is exposed to a broad range of data during training. Imaging data underwent modest preprocessing steps before being input to the 3D-DenseNet for model training. The input images were three-dimensional ALFF, fALFF, or sMRI data. The advantage of this 3D model is that all brain voxels could, in theory, contribute to the classification. The two-channel model combining ALFF and sMRI demonstrated the best performance among the implemented networks. Statistical tests further indicated that two-channel networks composed of two different data types achieved significantly higher accuracy in classifying ASD and control individuals, outperforming the use of a single data type.

Different ASD classification methods have been applied to the ABIDE I dataset. Some of these approaches, which used sample sizes of over 700, achieved classification accuracies comparable to the current study (i.e. 66.74% to 71.74%)44,45,46,47,48,49,50,51,52,53,54. A classification accuracy of 83% was reported on the ABIDE I dataset using an ensemble classifier that fused features from conventional functional connectivity networks, low-order dynamic functional connectivity networks, and high-order dynamic functional connectivity networks55. It is noteworthy that this study used a relatively small ABIDE I sample, with 45 ASD and 47 controls. The current framework differs from these approaches in several notable ways. The highest-performing deep learning model used both sMRI and rs-fMRI data, in contrast to the autoencoder and perceptron approaches used previously, which were based on rs-fMRI connectivity features54. Other researchers transformed the fMRI data into temporal features that were used as 3D inputs for CNNs; this approach yielded 64% accuracy49. While they evaluated multiple statistical features from the fMRI time series, their CNN models were only provided with a single type of extracted feature for classification.

The current results revealed that the fALFF-based model produced the poorest classification, which is noteworthy because this finding aligns with a previous study that compared fALFF, ALFF, and regional homogeneity (ReHo) fMRI maps56. fALFF is thought to correlate more strongly with the cerebral metabolic rate of glucose and oxygen utilization than ALFF56. The ALFF map correlates with cerebral blood volume, and future work is needed to characterize the source of physiological contrast in these neuroimaging data.

The accuracy, specificity, precision, and F1 score results underscore the value of combining data channels to drive better classification. The superior classification accuracy of the two-channel model did not come at the expense of worse sensitivity or precision, as reflected by the F1 score, which was highest for the two-channel model. We note that the sMRI-based model had the poorest specificity and precision and is thus likely to produce the highest number of false positives, which could be problematic in a clinical evaluation of this method. Conversely, false negatives are also critical to reconcile, as they represent the scenario in which an individual is falsely assigned a healthy diagnosis.

The sMRI-based XGBoost classification results were inferior to those of the sMRI DenseNet; however, the explainability of the XGBoost model provides some insight, namely the high importance scores of the basal ganglia and thalamus ROIs for the classification. These regions are implicated in ASD. Structural and functional abnormalities in the basal ganglia, which is crucial for motor control, cognition, and social behavior, are common in ASD57,58. These abnormalities include volumetric changes, altered cell density, and increased connectivity with cortical areas, leading to motor delays, sensory processing difficulties, and repetitive behaviors58,59. Similarly, the thalamus is a sensory and motor relay center and shows atypical thalamocortical connectivity in ASD, particularly in sensory regions60,61.

Clinically, ML models applied to neuroimaging data, such as MRI, can enhance diagnostic accuracy by identifying subtle brain patterns associated with ASD. This can lead to earlier diagnosis, intervention, and treatment options62. ML models can also assist in tailoring specific interventions based on the unique neurobiological profiles of patients. Additionally, the integration of ML with traditional diagnostic criteria can streamline the diagnostic process.

Future directions for this research include the development of more comprehensive and diverse datasets that integrate various imaging modalities (e.g., fMRI, sMRI, EEG) and demographic variables to improve the generalizability of ML models. Longitudinal studies tracking the developmental trajectory of ASD-related brain changes are essential for identifying predictive biomarkers and enhancing early detection63. Advanced ML techniques, such as deep learning and explainable AI, can improve the performance and interpretability of models. Furthermore, validating ML-based diagnostic systems in real-world clinical settings is critical, and developing standardized protocols and regulatory frameworks would ensure the safe and effective implementation of these tools. Pediatric and young adult populations present opportunities where brain imaging data can be collected and image analysis could provide assistive decision support. The integration of neuroimaging and ML holds promise, particularly as more patient subgroups become available for study.

There are likely many ways to continue improving the MRI-based classification of ASD. First, because of the wide age range represented in ASD imaging studies, the brain age of individuals may be important, and incorporating a companion brain age element into the current classification approach could be interesting future work64. Deep learning architectures are inherently flexible, and others have demonstrated that it is feasible to pre-train a model to first perform brain age estimation and then fine-tune it for a different task, such as classification65. The current study focused on ASD versus healthy controls, and it would be important to consider other related disorders as intermediate subgroups, such as Attention Deficit Hyperactivity Disorder (ADHD). Such a multi-class approach would be crucial in clinical applications: the inclusion of ADHD would allow for a more comprehensive understanding of neurodevelopmental disorders and more precise ASD identification. It is noted that ASD and ADHD have high comorbidity66, which could necessitate additional subgroups. Another possible future direction could be to explore sex-dependent classification approaches. As shown in a recent study, considering sex differences and developing separate classification pipelines for males and females could potentially improve ASD classification performance67. Hence, sex could be incorporated as an additional model input to account for differences in brain connectivity patterns between males and females.

Limitations

First, one of the challenges in large-scale MRI data sharing for research is the integration of data across multiple independent imaging sites, such as in ABIDE I. We did not explicitly account for scanner and site-related variations because we opted to preprocess images prior to model training. Including preprocessing steps detracts from the generalizability of the current method; however, this was a necessary step to improve image consistency. Second, although the current sample was large and we performed tenfold cross-validation, we did not consider an explicit external data source for model testing. Third, the overall performance scores for the ‘best’ sMRI-ALFF model were high, but there is still room for improvement. In particular, it would be prudent to consider the sensitivity results and the inherent risk of producing false negatives. Fourth, the current focus was on the DenseNet because it is well suited to imaging inputs; however, other approaches such as support vector machines, XGBoost, or ensemble methods have merit. Deep learning models, including the 3D DenseNet used in this study, pose significant challenges for explainable artificial intelligence68; specifically, their black-box nature makes it difficult to interpret which brain regions contribute to classification decisions. Lastly, the current study relied exclusively on MRI, whereas other ASD assessments could be used and/or incorporated into a multi-modality classifier model. By relying solely on MRI data, this work adds to the base of research tools that can be used independently or alongside other measures, such as behavioral assessments, neurodevelopmental evaluations, and cognitive testing.

Conclusion

The findings reveal that research-grade MRI alone can be used to perform automated classification of ASD relative to controls. Two-channel networks used the 3D features from sMRI and ALFF maps to produce performance superior to any one-channel network.