Applying Machine Learning to Facilitate Autism Diagnostics: Pitfalls and Promises

  • Original Paper
  • Journal of Autism and Developmental Disorders

Abstract

Machine learning has immense potential to enhance diagnostic and intervention research in the behavioral sciences, and may be especially useful in investigations involving the highly prevalent and heterogeneous syndrome of autism spectrum disorder. However, use of machine learning in the absence of clinical domain expertise can be tenuous and lead to misinformed conclusions. To illustrate this concern, the current paper critically evaluates and attempts to reproduce results from two studies (Wall et al. in Transl Psychiatry 2(4):e100, 2012a; PLoS ONE 7(8), 2012b) that claim to drastically reduce time to diagnose autism using machine learning. Our failure to generate comparable findings to those reported by Wall and colleagues using larger and more balanced data underscores several conceptual and methodological problems associated with these studies. We conclude with proposed best practices for using machine learning in autism research, and highlight some especially promising areas for collaborative work at the intersection of computational and behavioral science.


Notes

  1. For instance, when training data is included in testing sets, performance estimates become inflated; over-fitting then goes undetected, and the result is unlikely to replicate in independent samples. Cross-validation is a common solution.

  2. The work of Wall et al. (2012a) has been extended in Duda et al. (2014). While some methodological issues are resolved, primary conceptual issues remain.

  3. Analyses we conducted in this paper use these revised ADOS algorithms.

  4. Apart from 4 Non-Spectrum subjects from the Boston Autism Consortium database.

  5. Proper application of machine learning usually entails optimizing parameter settings for a chosen classifier; the peak performance of a classifier on a given dataset cannot be achieved without this step. Since optimizing parameter settings for maximal classification performance can itself lead to over-fitting, an independent test set is required; often a third set, called the Development set, is used, or another layer of cross-validation is performed (see the nested cross-validation sketch following these notes). In our experiments, we use default parameter settings in order to most closely replicate the methodology employed by Wall et al. (2012a).

  6. Recall can be used interchangeably with either sensitivity or specificity, which differ only in the naming convention of the “true” class.

  7. It is advisable to test multiple algorithmic approaches to achieve optimal accuracy; however, since this increases the potential for over-fitting and, consequently, for inflated results, an independent, held-out dataset is valuable.

  8. Note that sensitivity and specificity only differ in the naming convention of the “true” or “positive” class, and thus the term recall applies to any class.
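
To make Notes 1 and 5 concrete, the following is a minimal sketch of nested cross-validation in Python. The synthetic dataset, generic decision-tree classifier, and parameter grid are illustrative assumptions only; they are not the data or the Weka ADTree configuration used in this paper.

```python
# Nested cross-validation sketch: the inner loop tunes parameters on
# training folds only; the outer loop estimates performance on data
# never seen during tuning (hypothetical data and parameter grid).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tuned = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=inner_cv,
)

outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(tuned, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {scores.mean():.3f}")
```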

References

  • Abrahams, B. S., & Geschwind, D. H. (2010). Connecting genes to brain in the autism spectrum disorders. Archives of Neurology, 67(4), 395.

  • AGRE Pedigree Algorithms. (2013). http://www.research.agre.org/agrecatalog/algorithm.cfm.

  • Amaral, D., Dawson, G., & Geschwind, D. (Eds.). (2011). Autism spectrum disorders. Oxford: Oxford University Press.

  • American Psychiatric Association (Ed.). (2013). Diagnostic and statistical manual of mental disorders: DSM-5. American Psychiatric Association.

  • Audhkhasi, K., & Narayanan, S. (2013). A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 769–783.

  • Baldi, P. (2001). Bioinformatics: The machine learning approach. Cambridge, MA: The MIT Press.

  • Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C. C., Lammert, A. C., Christensen, A., et al. (2013). Toward automating a human behavioral coding system for married couples’ interactions using speech acoustic features. Speech Communication, 55(1), 1–21.

  • Bone, D., Black, M. P., Lee, C. C., Williams, M. E., Levitt, P., Lee, S., & Narayanan, S. (2012). Spontaneous-speech acoustic–prosodic features of children with autism and the interacting psychologist. In INTERSPEECH (pp. 1043–1046).

  • Bone, D., Black, M. P., Lee, C. C., Williams, M. E., Levitt, P., Lee, S., & Narayanan, S. (2014, in press). The Psychologist as an Interlocutor in Autism Spectrum Disorder Assessment: Insights from a Study of Spontaneous Prosody. Journal of Speech, Language, and Hearing Research.

  • Bone, D., Lee, C. C., Chaspari, T., Black, M. P., Williams, M. E., Lee, S., Levitt, P. & Narayanan, S. (2013). Acoustic–prosodic, turn-taking, and language cues in child–psychologist interactions for varying social demand. In INTERSPEECH (pp. 2400–2404).

  • Chaspari, T., Bone, D., Gibson, J., Lee, C. C., & Narayanan, S. (2013). Using physiology and language cues for modeling verbal response latencies of children with ASD. In 2013 IEEE International Conference on acoustics, speech and signal processing (ICASSP) (pp. 3702–3706).

  • Constantino, J. N., LaVesser, P. D., Zhang, Y., Abbacchi, A. M., Gray, T., & Todd, R. D. (2007). Rapid quantitative assessment of autistic social impairment by classroom teachers. Journal of the American Academy of Child and Adolescent Psychiatry, 46(12), 1668–1676.

  • Dawson, G., Webb, S., Schellenberg, G. D., Dager, S., Friedman, S., Aylward, E., et al. (2002). Defining the broader phenotype of autism: Genetic, brain, and behavioral perspectives. Development and Psychopathology, 14(3), 581–611.

  • Duda, M., Kosmicki, J. A., & Wall, D. P. (2014). Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational Psychiatry, 4(8), e424.

  • Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In ICML (Vol. 99, pp. 124–133).

  • Geschwind, D. H., Sowinski, J., Lord, C., Iversen, P., Shestack, J., Jones, P., et al. (2001). The autism genetic resource exchange: A resource for the study of autism and related neuropsychiatric conditions. American Journal of Human Genetics, 69(2), 463.

  • Gotham, K., Risi, S., Pickles, A., & Lord, C. (2007). The autism diagnostic observation schedule: Revised algorithms for improved diagnostic validity. Journal of Autism and Developmental Disorders, 37(4), 613–627.

  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.

  • Hu, V. W., & Steinberg, M. E. (2009). Novel clustering of items from the Autism Diagnostic Interview-Revised to define phenotypes within autism spectrum disorders. Autism Research, 2(2), 67–77.

  • Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (Vol. 14, No. 2, pp. 1137–1145).

  • Lai, M. C., Lombardo, M. V., Chakrabarti, B., & Baron-Cohen, S. (2013). Subgrouping the autism “spectrum”: Reflections on DSM-5. PLoS Biology, 11(4).

  • Lee, H., Marvin, A. R., Watson, T., Piggot, J., Law, J. K., Law, P. A., et al. (2010). Accuracy of phenotyping of autistic children based on internet implemented parent report. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 153(6), 1119–1126.

  • Levitt, P., & Campbell, D. B. (2009). The genetic and neurobiologic compass points toward common signaling dysfunctions in autism spectrum disorders. The Journal of Clinical Investigation, 119(4), 747.

  • Lord, C., & Jones, R. M. (2012). Annual Research Review: Re-thinking the classification of autism spectrum disorders. Journal of Child Psychology and Psychiatry, 53(5), 490–509.

  • Lord, C., Risi, S., Lambrecht, L., Cook, E. H., Jr., Leventhal, B. L., DiLavore, P. C., et al. (2000). The Autism Diagnostic Observation Schedule—Generic: A standard measure of social and communication deficits associated with the spectrum of autism. Journal of Autism and Developmental Disorders, 30(3), 205–223.

  • Lord, C., Rutter, M., & Le Couteur, A. (1994). Autism diagnostic interview-revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders, 24(5), 659–685.

  • Narayanan, S., & Georgiou, P. G. (2013). Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5), 1203–1233.

  • Picard, R. W. (2000). Affective computing. Cambridge, MA: MIT press.

  • Rehg, J. M., Abowd, G. D., Rozga, A., Romero, M., Clements, M. A., Sclaroff, S., & Ye, Z. (2013). Decoding children’s social behavior. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 3414–3421). IEEE.

  • Rehg, J. M., Rozga, A., Abowd, G. D., & Goodwin, M. S. (2014). Behavioral imaging and autism. Pervasive Computing, IEEE, 13(2), 84–87.

  • Rosenberg, A. (2012). Classifying skewed data: Importance weighting to optimize average recall. In INTERSPEECH (pp. 2242–2245).

  • Schuller, B., Steidl, S., & Batliner, A. (2009, September). The INTERSPEECH 2009 emotion challenge. In INTERSPEECH (pp. 312–315).

  • Schuller, B., Steidl, S., Batliner, A., Schiel, F., & Krajewski, J. (2011, August). The INTERSPEECH 2011 Speaker State Challenge. In INTERSPEECH (pp. 3201–3204).

  • Wall, D. P., Dally, R., Luyster, R., Jung, J. Y., & DeLuca, T. F. (2012b). Use of artificial intelligence to shorten the behavioral diagnosis of autism. PLoS ONE, 7(8).

  • Wall, D. P., Kosmicki, J. A., DeLuca, T., Harstad, E. B., & Fusaro, V. A. (2012a). Use of machine learning to shorten observation-based screening and diagnosis of autism. Translational Psychiatry, 2(4), e100.

  • Wei, L., Yang, Y., Nishikawa, R. M., & Jiang, Y. (2005). A study on several machine learning methods for classification of malignant and benign clustered microcalcifications. IEEE Transactions on Medical Imaging, 24(3), 371–380.

  • Ye, Z., Li, Y., Fathi, A., Han, Y., Rozga, A., Abowd, G. D., & Rehg, J. M. (2012). Detecting eye contact using wearable eye-tracking glasses. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (pp. 699–704). ACM.

Acknowledgments

This work was supported by funds from NSF Award 1029035, “Computational Behavioral Science: Modeling, Analysis, and Visualization of Social and Communicative Behavior”, NIH grants P50 DC013027 and R01 DC012774, and the Alfred E. Mann Innovation in Engineering Fellowship. The authors are grateful to Shanping Qiu for her efforts in acquiring and preparing the BID data for analysis.

Author information

Correspondence to Daniel Bone.

Appendices

Appendix 1: Additional Methodological Details

Additional Methodological Details for ADOS Module 1 Data Experiments

The AGRE and BID ADOS Module 1 data demographics are provided in Table 3 for the experiments shown in Table 1 and Fig. 3. For the BID data, BCE diagnosis is also available, although we do not utilize it in this paper.

Table 3 Combined table of demographic information for experiments

To replicate the Wall et al. (2012a) proposed 8-code selection as in Table 1, Weka’s ADTree classifier was used. In this case, the algorithm was allowed to tune itself to the given training data, but was limited to making rules using only the proposed 8 codes. Wall et al. (2012a) did not specify whether the code scores were first re-mapped as in the ADOS algorithm (e.g., a score of 3 is mapped to 2). We chose to re-map because: (1) the tree diagram provided by Wall et al. (2012a) suggests the codes were re-mapped; and (2) codes were re-mapped first in similar experiments by Wall et al. (2012b). Additionally, we noticed that the codes selected by the classifier did not match the proposed 8, regardless of whether codes were re-mapped.
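
A minimal sketch of this setup, using a generic scikit-learn decision tree as a stand-in for Weka’s ADTree; the item labels and toy score rows are hypothetical placeholders, not the actual proposed 8 codes or data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labels standing in for the 8 proposed ADOS codes.
PROPOSED_8 = [f"code_{i}" for i in range(1, 9)]

def remap_ados(scores: pd.DataFrame) -> pd.DataFrame:
    # ADOS-algorithm-style re-mapping: a raw score of 3 is treated as 2.
    return scores.replace(3, 2)

# Toy score rows for two children (ADOS codes range over 0-3).
df = pd.DataFrame(
    [[1, 2, 3, 0, 1, 2, 0, 1],
     [0, 0, 1, 0, 0, 1, 0, 0]],
    columns=PROPOSED_8,
)
labels = ["autism", "non-spectrum"]

# The tree tunes itself to the training data but may only build rules
# from the 8 proposed codes, mirroring the restriction described above.
clf = DecisionTreeClassifier(random_state=0).fit(remap_ados(df), labels)
```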

Classification performance of ADOS diagnosis with the ADTree was evaluated (Fig. 3). Tenfold cross-validation was used. Three variations of input feature sets were considered: (1) All 29—all 29 codes are included, as was done in Wall et al. (2012a); (2) Proposed 8—only the 8 codes proposed in Wall et al. (2012a) are input; and (3) Remaining 21—the 21 of 29 codes not in the Proposed 8 are used for classification.

The performance metric is unweighted average recall (UAR), the mean of sensitivity and specificity. Many machine learning algorithms optimize for accuracy—also known as weighted average recall (WAR), since it is a weighted summation of sensitivity and specificity that depends on the class priors—or an approximation thereof. One option for directly optimizing UAR is to balance classes through upsampling or downsampling (Rosenberg 2012). Since the ADOS Autism class was much larger than the ADOS Non-Spectrum class, either the former could be downsampled or the latter upsampled to optimize for UAR; in our experiments, we chose the latter.

Upsampling was performed by adding exact copies of samples from the minority class only within the training data subset, in order to keep training and testing data independent. Other statistical methods exist for upsampling, but they rely on certain assumptions about the data. For example, when randomly sampling from individual code scores to generate the entire set of scores for a simulated instance, it is possible to generate a set of scores that is very unlikely or impossible to occur in the real world. Rather than making such assumptions, we upsampled whole observed data instances from the training data.
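
A minimal sketch of this fold-internal upsampling and of scoring by UAR, on a synthetic imbalanced dataset (all names and data here are illustrative stand-ins):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the ADOS data.
X, y = make_classification(n_samples=300, weights=[0.85], random_state=0)

def upsample_minority(X_tr, y_tr, rng):
    # Duplicate whole observed minority-class instances until both
    # classes are equal in size; no simulated score vectors are created.
    classes, counts = np.unique(y_tr, return_counts=True)
    minority_idx = np.where(y_tr == classes[np.argmin(counts)])[0]
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y_tr)), extra])
    return X_tr[keep], y_tr[keep]

rng = np.random.default_rng(0)
uars = []
for tr, te in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
    # Upsampling touches the training split only, keeping the test fold
    # independent as described above.
    X_tr, y_tr = upsample_minority(X[tr], y[tr], rng)
    pred = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X[te])
    uars.append(recall_score(y[te], pred, average="macro"))  # macro recall = UAR
print(f"Mean UAR: {np.mean(uars):.3f}")
```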

Class imbalance is also present in the ADOS ASD (ADOS Autism plus ADOS Autism Spectrum) versus ADOS Non-Spectrum experiments: the ADOS Autism class has many more samples in the AGRE and BID data than the ADOS Autism Spectrum class. In order to give representative weight to the intermediate, more subtle ADOS Autism Spectrum class, the ADOS Autism class was first randomly downsampled during training to be equal in size to the ADOS Autism Spectrum class. Then, the ADOS Non-Spectrum class was upsampled to the size of the new ADOS ASD class (as was done previously in the Autism versus Non-Spectrum experiments).
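
A sketch of this two-stage balancing, again applied only within each training fold; the string labels are hypothetical stand-ins for the three ADOS classes:

```python
import numpy as np

def balance_asd_training(y_tr: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Hypothetical labels: "AUT" (ADOS Autism), "AS" (ADOS Autism
    # Spectrum), "NS" (ADOS Non-Spectrum).
    aut = np.where(y_tr == "AUT")[0]
    spec = np.where(y_tr == "AS")[0]
    ns = np.where(y_tr == "NS")[0]
    # (1) Downsample ADOS Autism to the size of ADOS Autism Spectrum.
    aut = rng.choice(aut, size=len(spec), replace=False)
    asd = np.concatenate([aut, spec])
    # (2) Upsample Non-Spectrum to the size of the combined ASD class.
    ns = rng.choice(ns, size=len(asd), replace=True)
    return np.concatenate([asd, ns])  # indices into the training split
```

The returned indices would then select the rows of the training split used to fit the classifier.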

Additional Methodological Details for ADI-R Data Experiments

Data demographics for the ADI-R experiments are provided in Table 3. Code re-mapping was performed as in Wall et al. (2012b); in particular, 3 was mapped to 2, and 7 and 8 were mapped to 0 (except for the Onset Age in Hindsight item, which has acceptable values from 0 to 6). Tenfold cross-validation was performed. The upsampling and downsampling for the ADI-R diagnosis experiments mirror those for the ADOS diagnosis experiments described in the “Additional Methodological Details for ADOS Module 1 Data Experiments” section. In particular, when performing classification with two groups, the minority class was upsampled. For the case of Affected Status, the Not Quite Autism (NQA) and Broad Spectrum (BS) categories were first combined into a Broad-ASD (B-ASD) category; the Affected Status category was slightly larger, so it was downsampled to the size of the B-ASD category; then, the minority ADI-R Non-Autism (B-ASD + Not-Met) class was upsampled to be of equal size to the ADI-R Autism group during training.
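
A minimal sketch of this re-mapping (the column name for the Onset Age in Hindsight item is a hypothetical placeholder):

```python
import pandas as pd

# Hypothetical column name; this item keeps its full 0-6 range.
ONSET_AGE_ITEM = "onset_age_in_hindsight"

def remap_adir(scores: pd.DataFrame) -> pd.DataFrame:
    remapped = scores.copy()
    items = [c for c in remapped.columns if c != ONSET_AGE_ITEM]
    # As described above: 3 -> 2, and 7 and 8 -> 0, for all other items.
    remapped[items] = remapped[items].replace({3: 2, 7: 0, 8: 0})
    return remapped
```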

Significance Testing for Unweighted Average Recall

UAR is increasingly popular in the machine learning literature for tasks with unbalanced data in which the recall of each class is equally important. However, while established statistical tests exist for accuracy, sensitivity, and specificity, no established test yet exists for UAR. Some researchers have used the binomial proportions test, as is done with accuracy, although this is not entirely valid: accuracy is an average of individual class recalls weighted by the corresponding class priors, whereas UAR is an unweighted average of individual recalls.

We propose using a slightly modified version of the exact binomial proportion test; we use the exact test since the data are not always sufficiently large for a normal approximation. Since UAR is an unweighted average of individual recalls, it is equally influenced by the recall of either class. The recall of a class with very few samples (e.g., 12) can vary much more than the recall of the majority class (e.g., 942); notably, the machine learning algorithm does not typically consider class size when optimizing for UAR. As such, the minor modification we made was to reduce the sample size N from 954 (12 + 942) to an effective size, N_eff, set to twice the size of the minority class (since there are two classes). In our example, N_eff is consequently 24, compared to the original N of 954. The negative implication is that some of the statistical power from the confidence in recall of the majority class is discarded; the benefit is that the statistical power in the minority-class recall is not grossly exaggerated. Thus, this test is conservative and less likely to create false positives.
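
A minimal sketch of this modified exact binomial test; the chance-level baseline of 0.5 and the example UAR value are assumptions for illustration:

```python
from scipy.stats import binomtest

def uar_significance(uar: float, n_minority: int, chance: float = 0.5):
    # Effective sample size: twice the minority-class size, as proposed.
    n_eff = 2 * n_minority
    successes = round(uar * n_eff)  # observed correct count at the N_eff scale
    return binomtest(successes, n_eff, p=chance, alternative="greater")

# Worked example from the text: a minority class of 12 gives N_eff = 24,
# rather than the full N of 954; the UAR of 0.75 is hypothetical.
print(uar_significance(uar=0.75, n_minority=12).pvalue)
```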

Appendix 2: ADOS Module 1 Behavioral Codes

See Table 4.

Table 4 List of the ADOS Module 1 behavioral codes

Appendix 3: Additional Performance Measures

Here we present additional performance measures from our classification experiments, with the following disclaimer: individual results should not be compared using metrics other than UAR, the mean of sensitivity and specificity, because the machine learning algorithms in our experiments optimize only for UAR and are thus not concerned with measures like sensitivity and specificity individually. That is, an algorithm is only concerned with reaching a peak in UAR. The other statistical measures may be viewed as one random realization that achieves the observed UAR; thus, comparison of, for example, sensitivity between individual results may be inappropriate.

We understand that analysis of each of these measures is standard in diagnostic research. However, our experimental results stand primarily as empirical support for certain methodological flaws present in the experiments of Wall et al. (2012a, b); as such, we compare results using the measure that the machine learning algorithm optimizes, UAR (technically it optimizes accuracy, but it effectively optimizes UAR since we balance classes during training). We also note that analyzing the true diagnostic validity of this approach would be further complicated by the fact that the ADOS has its own diagnostic error.

The following tables present six measures: unweighted average recall (UAR); sensitivity; specificity; positive predictive value (PPV); negative predictive value (NPV); and accuracy. Expanded results for the ADOS (cf. Fig. 3) are presented in Table 5, while expanded results for the ADI-R (cf. Table 2) are displayed in Table 6.

Table 5 Results for classifying ADOS categories using ADOS items
Table 6 Results for classifying “Affected Status” and ADI-R categories using ADI-R items
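
For reference, a minimal sketch of how the six reported measures reduce to binary confusion-matrix counts (tp/fn/tn/fp denote true/false positives/negatives):

```python
def diagnostic_measures(tp: int, fn: int, tn: int, fp: int) -> dict:
    # Sensitivity and specificity are the recalls of the positive and
    # negative classes; UAR is their unweighted mean, whereas accuracy
    # weights each class recall by its prior (hence "WAR").
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "UAR": (sensitivity + specificity) / 2,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "PPV": tp / (tp + fp),  # positive predictive value
        "NPV": tn / (tn + fn),  # negative predictive value
        "Accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```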

Cite this article

Bone, D., Goodwin, M.S., Black, M.P. et al. Applying Machine Learning to Facilitate Autism Diagnostics: Pitfalls and Promises. J Autism Dev Disord 45, 1121–1136 (2015). https://doi.org/10.1007/s10803-014-2268-6
