Abstract
Machine learning has immense potential to enhance diagnostic and intervention research in the behavioral sciences, and may be especially useful in investigations involving the highly prevalent and heterogeneous syndrome of autism spectrum disorder. However, use of machine learning in the absence of clinical domain expertise can be tenuous and lead to misinformed conclusions. To illustrate this concern, the current paper critically evaluates and attempts to reproduce results from two studies (Wall et al. in Transl Psychiatry 2(4):e100, 2012a; PloS One 7(8), 2012b) that claim to drastically reduce time to diagnose autism using machine learning. Our failure to generate comparable findings to those reported by Wall and colleagues using larger and more balanced data underscores several conceptual and methodological problems associated with these studies. We conclude with proposed best-practices when using machine learning in autism research, and highlight some especially promising areas for collaborative work at the intersection of computational and behavioral science.
Notes
For instance, model over-fitting can occur when training data are included in the testing set, inflating confidence in a result that is unlikely to replicate in independent samples. Cross-validation is a common solution.
Analyses we conducted in this paper use these revised ADOS algorithms.
Apart from 4 Non-Spectrum subjects from the Boston Autism Consortium database.
Proper application of machine learning usually entails optimizing parameter settings for a chosen classifier. The peak performance of a classifier for a given dataset cannot be achieved without this step. Since optimizing parameter settings for maximal classification performance can lead to over-fitting, an independent test set is required; often a third set called the Development set is used or another layer of cross-validation is performed. In our experiments, we use default parameter settings in order to most closely replicate the methodology employed by Wall et al. (2012a).
Recall can refer to either sensitivity or specificity, which differ only in the naming convention of the “true” class.
It is advisable to test multiple algorithmic approaches to achieve optimal accuracy; however, since this increases potential for over-fitting and consequently inflating results, an independent, held-out dataset is valuable.
Note that sensitivity and specificity only differ in the naming convention of the “true” or “positive” class, and thus the term recall applies to any class.
References
Abrahams, B. S., & Geschwind, D. H. (2010). Connecting genes to brain in the autism spectrum disorders. Archives of Neurology, 67(4), 395.
AGRE Pedigree Algorithms. (2013). Retrieved from http://www.research.agre.org/agrecatalog/algorithm.cfm.
Amaral, D., Dawson, G., & Geschwind, D. (Eds.). (2011). Autism spectrum disorders. Oxford: Oxford University Press.
American Psychiatric Association (Ed.). (2013). Diagnostic and statistical manual of mental disorders: DSM-5. American Psychiatric Association.
Audhkhasi, K., & Narayanan, S. (2013). A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 769–783.
Baldi, P. (2001). Bioinformatics: The machine learning approach. Cambridge, MA: The MIT Press.
Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C. C., Lammert, A. C., Christensen, A., et al. (2013). Toward automating a human behavioral coding system for married couples’ interactions using speech acoustic features. Speech Communication, 55(1), 1–21.
Bone, D., Black, M. P., Lee, C. C., Williams, M. E., Levitt, P., Lee, S., & Narayanan, S. (2012). Spontaneous-speech acoustic–prosodic features of children with autism and the interacting psychologist. In INTERSPEECH (pp. 1043–1046).
Bone, D., Black, M. P., Lee, C. C., Williams, M. E., Levitt, P., Lee, S., & Narayanan, S. (2014, in press). The Psychologist as an Interlocutor in Autism Spectrum Disorder Assessment: Insights from a Study of Spontaneous Prosody. Journal of Speech, Language, and Hearing Research.
Bone, D., Lee, C. C., Chaspari, T., Black, M. P., Williams, M. E., Lee, S., Levitt, P. & Narayanan, S. (2013). Acoustic–prosodic, turn-taking, and language cues in child–psychologist interactions for varying social demand. In INTERSPEECH (pp. 2400–2404).
Chaspari, T., Bone, D., Gibson, J., Lee, C. C., & Narayanan, S. (2013). Using physiology and language cues for modeling verbal response latencies of children with ASD. In 2013 IEEE International Conference on acoustics, speech and signal processing (ICASSP) (pp. 3702–3706).
Constantino, J. N., LaVesser, P. D., Zhang, Y., Abbacchi, A. M., Gray, T., & Todd, R. D. (2007). Rapid quantitative assessment of autistic social impairment by classroom teachers. Journal of the American Academy of Child and Adolescent Psychiatry, 46(12), 1668–1676.
Dawson, G., Webb, S., Schellenberg, G. D., Dager, S., Friedman, S., Aylward, E., et al. (2002). Defining the broader phenotype of autism: Genetic, brain, and behavioral perspectives. Development and Psychopathology, 14(3), 581–611.
Duda, M., Kosmicki, J. A., & Wall, D. P. (2014). Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational Psychiatry, 4(8), e424.
Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In ICML (Vol. 99, pp. 124–133).
Geschwind, D. H., Sowinski, J., Lord, C., Iversen, P., Shestack, J., Jones, P., et al. (2001). The autism genetic resource exchange: A resource for the study of autism and related neuropsychiatric conditions. American Journal of Human Genetics, 69(2), 463.
Gotham, K., Risi, S., Pickles, A., & Lord, C. (2007). The autism diagnostic observation schedule: Revised algorithms for improved diagnostic validity. Journal of Autism and Developmental Disorders, 37(4), 613–627.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Hu, V. W., & Steinberg, M. E. (2009). Novel clustering of items from the Autism Diagnostic Interview-Revised to define phenotypes within autism spectrum disorders. Autism Research, 2(2), 67–77.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (Vol. 14, No. 2, pp. 1137–1145).
Lai, M. C., Lombardo, M. V., Chakrabarti, B., & Baron-Cohen, S. (2013). Subgrouping the autism “spectrum”: Reflections on DSM-5. PLoS Biology, 11(4).
Lee, H., Marvin, A. R., Watson, T., Piggot, J., Law, J. K., Law, P. A., et al. (2010). Accuracy of phenotyping of autistic children based on internet implemented parent report. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 153(6), 1119–1126.
Levitt, P., & Campbell, D. B. (2009). The genetic and neurobiologic compass points toward common signaling dysfunctions in autism spectrum disorders. The Journal of Clinical Investigation, 119(4), 747.
Lord, C., & Jones, R. M. (2012). Annual Research Review: Re-thinking the classification of autism spectrum disorders. Journal of Child Psychology and Psychiatry, 53(5), 490–509.
Lord, C., Risi, S., Lambrecht, L., Cook, E. H., Jr., Leventhal, B. L., DiLavore, P. C., et al. (2000). The Autism Diagnostic Observation Schedule—Generic: A standard measure of social and communication deficits associated with the spectrum of autism. Journal of Autism and Developmental Disorders, 30(3), 205–223.
Lord, C., Rutter, M., & Le Couteur, A. (1994). Autism diagnostic interview-revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders, 24(5), 659–685.
Narayanan, S., & Georgiou, P. G. (2013). Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5), 1203–1233.
Picard, R. W. (2000). Affective computing. Cambridge, MA: MIT Press.
Rehg, J. M., Abowd, G. D., Rozga, A., Romero, M., Clements, M. A., Sclaroff, S., & Ye, Z. (2013). Decoding children’s social behavior. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 3414–3421). IEEE.
Rehg, J. M., Rozga, A., Abowd, G. D., & Goodwin, M. S. (2014). Behavioral imaging and autism. Pervasive Computing, IEEE, 13(2), 84–87.
Rosenberg, A. (2012). Classifying skewed data: Importance weighting to optimize average recall. In INTERSPEECH (pp. 2242–2245).
Schuller, B., Steidl, S., & Batliner, A. (2009, September). The INTERSPEECH 2009 emotion challenge. In INTERSPEECH (pp. 312–315).
Schuller, B., Steidl, S., Batliner, A., Schiel, F., & Krajewski, J. (2011, August). The INTERSPEECH 2011 Speaker State Challenge. In INTERSPEECH (pp. 3201–3204).
Wall, D. P., Dally, R., Luyster, R., Jung, J. Y., & DeLuca, T. F. (2012b). Use of artificial intelligence to shorten the behavioral diagnosis of autism. PLoS ONE, 7(8).
Wall, D. P., Kosmicki, J. A., DeLuca, T., Harstad, E. B., & Fusaro, V. A. (2012a). Use of machine learning to shorten observation-based screening and diagnosis of autism. Translational Psychiatry, 2(4), e100.
Wei, L., Yang, Y., Nishikawa, R. M., & Jiang, Y. (2005). A study on several machine learning methods for classification of malignant and benign clustered microcalcifications. IEEE Transactions on Medical Imaging, 24(3), 371–380.
Ye, Z., Li, Y., Fathi, A., Han, Y., Rozga, A., Abowd, G. D., & Rehg, J. M. (2012). Detecting eye contact using wearable eye-tracking glasses. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (pp. 699–704). ACM.
Acknowledgments
This work was supported by funds from NSF Award 1029035, “Computational Behavioral Science: Modeling, Analysis, and Visualization of Social and Communicative Behavior”, NIH grants P50 DC013027 and R01 DC012774, and the Alfred E. Mann Innovation in Engineering Fellowship. The authors are grateful to Shanping Qiu for her efforts in acquiring and preparing the BID data for analysis.
Appendices
Appendix 1: Additional Methodological Details
Additional Methodological Details for ADOS Module 1 Data Experiments
The AGRE and BID ADOS Module 1 data demographics are provided in Table 3 for the experiments shown in Table 1 and Fig. 3. For the BID data, BCE diagnosis is also available, although we do not utilize it in this paper.
To replicate the Wall et al. (2012a) proposed 8-code selection as in Table 1, Weka’s ADTree classifier was used. In this case, the algorithm was allowed to tune itself to the given training data, but was limited to making rules using only the proposed 8 codes. Wall et al. (2012a) did not specify whether the code scores were first re-mapped as in the ADOS algorithm (e.g., 3 is mapped to 2). We chose to re-map because: (1) from the tree diagram provided by Wall et al. (2012a), it appears the codes were re-mapped; and (2) codes were re-mapped first in similar experiments by Wall et al. (2012b). Additionally, we noticed that the codes selected by the classifier did not match the proposed 8, regardless of whether re-mapping was applied.
Classification performance of ADOS diagnosis with the ADTree was evaluated (Fig. 3) using tenfold cross-validation. Three variations of the input feature set are considered: (1) All 29, in which all 29 codes are included, as was done in Wall et al. (2012a); (2) Proposed 8, in which only the 8 codes proposed in Wall et al. (2012a) are input; and (3) Remaining 21, in which the 21 of 29 codes not in the Proposed 8 are used for classification.
The performance metric is unweighted average recall (UAR), the mean of sensitivity and specificity. Many machine learning algorithms optimize for accuracy—also known as weighted average recall (WAR), since it is a weighted summation of sensitivity and specificity, dependent on the class priors—or an approximation thereof. One option for directly optimizing UAR is to balance classes through upsampling or downsampling (Rosenberg 2012). Since the ADOS Autism class was much larger than the ADOS Non-Spectrum class, the ADOS Autism class can be downsampled or the ADOS Non-Spectrum class can be upsampled to optimize for UAR. In our experiments, we chose the latter. Upsampling was performed by adding exact copies of samples from the minority class only within the training data subset, in order to keep training and testing data independent. While other statistical methods exist for upsampling, they rely on certain assumptions about the data. For example, when randomly sampling from individual code scores to generate the entire set of scores for a simulated instance, it is possible to generate a set of scores that is very unlikely or impossible to occur in the real-world. Rather than making such assumptions, we upsampled whole observed data instances from the training data.
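The two ideas above, computing UAR rather than accuracy, and upsampling whole minority-class instances only within the training partition, can be sketched as follows. This is an illustrative sketch, not the exact code used in our experiments; the function names are our own.

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """UAR: the unweighted mean of per-class recalls, independent of class priors."""
    recalls = []
    for cls in np.unique(y_true):
        mask = (y_true == cls)
        recalls.append(np.mean(y_pred[mask] == cls))
    return float(np.mean(recalls))

def upsample_training_fold(X_train, y_train, minority_label, rng):
    """Duplicate whole minority-class instances until classes are balanced.
    Applied only within the training fold, so testing data stay independent."""
    minority_idx = np.where(y_train == minority_label)[0]
    # Number of copies needed so minority count equals majority count.
    n_needed = (len(y_train) - len(minority_idx)) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    X_bal = np.vstack([X_train, X_train[extra]])
    y_bal = np.concatenate([y_train, y_train[extra]])
    return X_bal, y_bal
```

A classifier trained on the balanced fold then effectively optimizes UAR even though its internal objective is accuracy.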
Class imbalance is also observed for the ADOS ASD (ADOS Autism and ADOS Autism Spectrum) versus ADOS Non-Spectrum experiments. The ADOS Autism class has many more samples in the AGRE and BID data than the ADOS Autism Spectrum class. In order to show a representative effect from the middle, more subtle ADOS Autism Spectrum class, the ADOS Autism class was first randomly downsampled during training to be equal in size to the ADOS Autism Spectrum class. Then, the ADOS Non-Spectrum class was upsampled to be the same size as the new ADOS ASD class (as was done previously in the Autism/Non-Spectrum experiments).
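The two-step balancing for the ASD versus Non-Spectrum experiments can be sketched as below. The labels and function name are illustrative placeholders, and the sketch operates on training indices only.

```python
import numpy as np

def balance_for_asd_experiment(y, rng):
    """Return balanced training indices: (1) downsample the Autism class
    ('AUT') to the size of the Autism Spectrum class ('SPEC'), then
    (2) upsample Non-Spectrum ('NS') to the size of the combined ASD group."""
    aut = np.where(y == 'AUT')[0]
    spec = np.where(y == 'SPEC')[0]
    ns = np.where(y == 'NS')[0]
    aut_kept = rng.choice(aut, size=len(spec), replace=False)  # downsample
    asd = np.concatenate([aut_kept, spec])                     # combined ASD group
    ns_up = rng.choice(ns, size=len(asd), replace=True)        # upsample
    return np.concatenate([asd, ns_up])
```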
Additional Methodological Details for ADI-R Data Experiments
Data demographics for the ADI-R experiments are provided in Table 3. Code re-mapping was performed as in Wall et al. (2012b); in particular, 3 was mapped to 2, and 7 and 8 were mapped to 0 (except for the Onset Age in Hindsight item, which has acceptable values from 0 to 6). Tenfold cross-validation was performed. The upsampling and downsampling for ADI-R diagnosis experiments mirror those for ADOS diagnosis experiments described in “Additional Methodological Details for ADOS Module 1 Data Experiments” section. In particular, when performing classification with 2-groups, the minority class was upsampled. For the case of Affected Status, categories of Not Quite Autism (NQA) and Broad Spectrum (BS) were first combined into a Broad-ASD (B-ASD) category; the Affected Status category was slightly larger, so it was downsampled to the size of the B-ASD category; then, the minority ADI-R Non-Autism (B-ASD + Not-Met) class was upsampled to be of equal size to the ADI-R Autism group during training.
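The re-mapping rule can be written as a small function; this is a sketch under the rule stated above, with a hypothetical function name and an explicit flag for the Onset Age in Hindsight item.

```python
def remap_adir_score(score, is_onset_age_item=False):
    """Re-map a raw ADI-R code score as in Wall et al. (2012b):
    3 -> 2, and 7 or 8 -> 0. The Onset Age in Hindsight item is
    exempt and keeps its original 0-6 range."""
    if is_onset_age_item:
        return score
    if score == 3:
        return 2
    if score in (7, 8):
        return 0
    return score
```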
Significance Testing for Unweighted Average Recall
UAR is increasingly popular in the machine learning literature for tasks with unbalanced data in which the recalls of all classes are equally important. However, no established technique exists for computing its statistical significance. Some researchers have used the binomial proportions test, as is done with accuracy, although this is not entirely valid: accuracy is a weighted average of individual class recalls, weighted by the corresponding class priors, whereas UAR is an unweighted average of individual recalls. Statistical tests exist for accuracy, sensitivity, and specificity, but no established test yet exists for UAR.
We propose using a slightly modified version of the exact binomial proportion test; we use the exact test since the data are not always sufficiently large for a normal approximation. Since UAR is an unweighted average of individual recalls, it is equally influenced by the recall of either class. The recall of a class with very few samples (e.g., 12) can vary much more than the recall of the majority class (e.g., 942); notably, the machine learning algorithm does not typically consider class size when optimizing for UAR. As such, the minor modification we made was to reduce the sample size N from 954 (12 + 942) to a smaller effective sample size, N_eff. We set N_eff to twice the size of the minority class (since there are two classes). In our example, N_eff is consequently 24, compared to the original N of 954. The negative implication is that some of the statistical power from the confidence in the recall of the majority class is discarded, but the benefit is that the statistical power in the minority-class recall is not grossly exaggerated. Thus, this test is conservative and less likely to produce false positives.
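The modified test can be sketched as follows: a one-sided exact binomial tail probability evaluated at N_eff trials rather than N. The function name is our own, and the rounding of the implied success count is one reasonable choice among several.

```python
from math import comb

def uar_significance(uar, n_minority, chance=0.5):
    """One-sided exact binomial test of an observed UAR against chance,
    using an effective sample size of twice the minority-class count
    (a conservative choice that discards some majority-class power)."""
    n_eff = 2 * n_minority
    k = round(uar * n_eff)  # number of successes implied by the observed UAR
    # Exact upper tail: P(X >= k) under Binomial(n_eff, chance).
    return sum(comb(n_eff, i) * chance**i * (1 - chance)**(n_eff - i)
               for i in range(k, n_eff + 1))
```

For the example above (minority class of 12), a perfect UAR of 1.0 is tested with N_eff = 24 rather than N = 954, yielding a much larger, and therefore more conservative, p value.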
Appendix 2: ADOS Module 1 Behavioral Codes
See Table 4.
Appendix 3: Additional Performance Measures
Here we present additional performance measures from our classification experiments with the following disclaimer: individual results should not be compared using metrics other than UAR, the mean of sensitivity and specificity, because the machine learning algorithms optimize only for UAR in our experiments and are thus not concerned with measures such as sensitivity and specificity individually. That is, an algorithm is only concerned with reaching a peak in UAR. The other statistical measures may be viewed as a random realization that achieves the observed UAR; thus, comparison of, for example, sensitivity between individual results may be inappropriate.
We understand that analysis of each of these measures is standard in diagnostic research. However, our experimental results stand primarily as empirical support of certain methodological flaws present in the experiments of Wall et al. (2012a, b); as such, we compare results using the measure that the machine learning algorithm optimizes, UAR (technically it optimizes accuracy, but it effectively optimizes UAR since we balance classes during training). We also note that analyzing the true diagnostic validity of this approach would be further complicated by the fact that the ADOS has its own diagnostic error.
The following tables present six measures: unweighted average recall (UAR), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy. Expanded results for the ADOS (cf. Fig. 3) are presented in Table 5, while expanded results for the ADI-R (cf. Table 2) are displayed in Table 6.
Cite this article
Bone, D., Goodwin, M.S., Black, M.P. et al. Applying Machine Learning to Facilitate Autism Diagnostics: Pitfalls and Promises. J Autism Dev Disord 45, 1121–1136 (2015). https://doi.org/10.1007/s10803-014-2268-6