Abstract
Machine learning has immense potential to enhance diagnostic and intervention research in the behavioral sciences, and may be especially useful in investigations involving the highly prevalent and heterogeneous syndrome of autism spectrum disorder. However, use of machine learning in the absence of clinical domain expertise can be tenuous and lead to misinformed conclusions. To illustrate this concern, the current paper critically evaluates and attempts to reproduce results from two studies (Wall et al. in Transl Psychiatry 2(4):e100, 2012a; PloS One 7(8), 2012b) that claim to drastically reduce time to diagnose autism using machine learning. Our failure to generate comparable findings to those reported by Wall and colleagues using larger and more balanced data underscores several conceptual and methodological problems associated with these studies. We conclude with proposed best-practices when using machine learning in autism research, and highlight some especially promising areas for collaborative work at the intersection of computational and behavioral science.
Notes
For instance, model over-fitting can occur when training data are included in the testing set, inflating confidence in a result that is unlikely to replicate in independent samples. Cross-validation is a common solution.
Analyses we conducted in this paper use these revised ADOS algorithms.
Apart from 4 Non-Spectrum subjects from the Boston Autism Consortium database.
Proper application of machine learning usually entails optimizing parameter settings for a chosen classifier. The peak performance of a classifier for a given dataset cannot be achieved without this step. Since optimizing parameter settings for maximal classification performance can lead to over-fitting, an independent test set is required; often a third set called the Development set is used or another layer of cross-validation is performed. In our experiments, we use default parameter settings in order to most closely replicate the methodology employed by Wall et al. (2012a).
Recall can refer to either sensitivity or specificity, which differ only in the naming convention of the “true” class.
It is advisable to test multiple algorithmic approaches to achieve optimal accuracy; however, since this increases potential for over-fitting and consequently inflating results, an independent, held-out dataset is valuable.
Note that sensitivity and specificity only differ in the naming convention of the “true” or “positive” class, and thus the term recall applies to any class.
References
Abrahams, B. S., & Geschwind, D. H. (2010). Connecting genes to brain in the autism spectrum disorders. Archives of Neurology, 67(4), 395.
AGRE Pedigree Algorithms. (2013). Retrieved from http://www.research.agre.org/agrecatalog/algorithm.cfm.
Amaral, D., Dawson, G., & Geschwind, D. (Eds.). (2011). Autism spectrum disorders. Oxford: Oxford University Press.
American Psychiatric Association (Ed.). (2013). Diagnostic and statistical manual of mental disorders: DSM-5. American Psychiatric Association.
Audhkhasi, K., & Narayanan, S. (2013). A globally-variant locally-constant model for fusion of labels from multiple diverse experts without using reference labels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 769–783.
Baldi, P. (2001). Bioinformatics: The machine learning approach. Cambridge, MA: The MIT Press.
Black, M. P., Katsamanis, A., Baucom, B. R., Lee, C. C., Lammert, A. C., Christensen, A., et al. (2013). Toward automating a human behavioral coding system for married couples’ interactions using speech acoustic features. Speech Communication, 55(1), 1–21.
Bone, D., Black, M. P., Lee, C. C., Williams, M. E., Levitt, P., Lee, S., & Narayanan, S. (2012). Spontaneous-speech acoustic–prosodic features of children with autism and the interacting psychologist. In INTERSPEECH (pp. 1043–1046).
Bone, D., Black, M. P., Lee, C. C., Williams, M. E., Levitt, P., Lee, S., & Narayanan, S. (2014, in press). The Psychologist as an Interlocutor in Autism Spectrum Disorder Assessment: Insights from a Study of Spontaneous Prosody. Journal of Speech, Language, and Hearing Research.
Bone, D., Lee, C. C., Chaspari, T., Black, M. P., Williams, M. E., Lee, S., Levitt, P. & Narayanan, S. (2013). Acoustic–prosodic, turn-taking, and language cues in child–psychologist interactions for varying social demand. In INTERSPEECH (pp. 2400–2404).
Chaspari, T., Bone, D., Gibson, J., Lee, C. C., & Narayanan, S. (2013). Using physiology and language cues for modeling verbal response latencies of children with ASD. In 2013 IEEE International Conference on acoustics, speech and signal processing (ICASSP) (pp. 3702–3706).
Constantino, J. N., LaVesser, P. D., Zhang, Y., Abbacchi, A. M., Gray, T., & Todd, R. D. (2007). Rapid quantitative assessment of autistic social impairment by classroom teachers. Journal of the American Academy of Child and Adolescent Psychiatry, 46(12), 1668–1676.
Dawson, G., Webb, S., Schellenberg, G. D., Dager, S., Friedman, S., Aylward, E., et al. (2002). Defining the broader phenotype of autism: Genetic, brain, and behavioral perspectives. Development and Psychopathology, 14(3), 581–611.
Duda, M., Kosmicki, J. A., & Wall, D. P. (2014). Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational Psychiatry, 4(8), e424.
Freund, Y., & Mason, L. (1999). The alternating decision tree learning algorithm. In ICML (Vol. 99, pp. 124–133).
Geschwind, D. H., Sowinski, J., Lord, C., Iversen, P., Shestack, J., Jones, P., et al. (2001). The autism genetic resource exchange: A resource for the study of autism and related neuropsychiatric conditions. American Journal of Human Genetics, 69(2), 463.
Gotham, K., Risi, S., Pickles, A., & Lord, C. (2007). The autism diagnostic observation schedule: Revised algorithms for improved diagnostic validity. Journal of Autism and Developmental Disorders, 37(4), 613–627.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10–18.
Hu, V. W., & Steinberg, M. E. (2009). Novel clustering of items from the Autism Diagnostic Interview-Revised to define phenotypes within autism spectrum disorders. Autism Research, 2(2), 67–77.
Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI (Vol. 14, No. 2, pp. 1137–1145).
Lai, M. C., Lombardo, M. V., Chakrabarti, B., & Baron-Cohen, S. (2013). Subgrouping the autism “spectrum”: Reflections on DSM-5. PLoS Biology, 11(4).
Lee, H., Marvin, A. R., Watson, T., Piggot, J., Law, J. K., Law, P. A., et al. (2010). Accuracy of phenotyping of autistic children based on internet implemented parent report. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 153(6), 1119–1126.
Levitt, P., & Campbell, D. B. (2009). The genetic and neurobiologic compass points toward common signaling dysfunctions in autism spectrum disorders. The Journal of Clinical Investigation, 119(4), 747.
Lord, C., & Jones, R. M. (2012). Annual Research Review: Re-thinking the classification of autism spectrum disorders. Journal of Child Psychology and Psychiatry, 53(5), 490–509.
Lord, C., Risi, S., Lambrecht, L., Cook, E. H., Jr., Leventhal, B. L., DiLavore, P. C., et al. (2000). The Autism Diagnostic Observation Schedule—Generic: A standard measure of social and communication deficits associated with the spectrum of autism. Journal of Autism and Developmental Disorders, 30(3), 205–223.
Lord, C., Rutter, M., & Le Couteur, A. (1994). Autism diagnostic interview-revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders, 24(5), 659–685.
Narayanan, S., & Georgiou, P. G. (2013). Behavioral signal processing: Deriving human behavioral informatics from speech and language. Proceedings of the IEEE, 101(5), 1203–1233.
Picard, R. W. (2000). Affective computing. Cambridge, MA: MIT Press.
Rehg, J. M., Abowd, G. D., Rozga, A., Romero, M., Clements, M. A., Sclaroff, S., & Ye, Z. (2013). Decoding children’s social behavior. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 3414–3421). IEEE.
Rehg, J. M., Rozga, A., Abowd, G. D., & Goodwin, M. S. (2014). Behavioral imaging and autism. Pervasive Computing, IEEE, 13(2), 84–87.
Rosenberg, A. (2012). Classifying skewed data: Importance weighting to optimize average recall. In INTERSPEECH (pp. 2242–2245).
Schuller, B., Steidl, S., & Batliner, A. (2009, September). The INTERSPEECH 2009 emotion challenge. In INTERSPEECH (pp. 312–315).
Schuller, B., Steidl, S., Batliner, A., Schiel, F., & Krajewski, J. (2011, August). The INTERSPEECH 2011 Speaker State Challenge. In INTERSPEECH (pp. 3201–3204).
Wall, D. P., Dally, R., Luyster, R., Jung, J. Y., & DeLuca, T. F. (2012b). Use of artificial intelligence to shorten the behavioral diagnosis of autism. PLoS ONE, 7(8).
Wall, D. P., Kosmicki, J. A., DeLuca, T., Harstad, E. B., & Fusaro, V. A. (2012a). Use of machine learning to shorten observation-based screening and diagnosis of autism. Translational Psychiatry, 2(4), e100.
Wei, L., Yang, Y., Nishikawa, R. M., & Jiang, Y. (2005). A study on several machine learning methods for classification of malignant and benign clustered microcalcifications. IEEE Transactions on Medical Imaging, 24(3), 371–380.
Ye, Z., Li, Y., Fathi, A., Han, Y., Rozga, A., Abowd, G. D., & Rehg, J. M. (2012). Detecting eye contact using wearable eye-tracking glasses. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (pp. 699–704). ACM.
Acknowledgments
This work was supported by funds from NSF Award 1029035, “Computational Behavioral Science: Modeling, Analysis, and Visualization of Social and Communicative Behavior”, NIH grants P50 DC013027 and R01 DC012774, and the Alfred E. Mann Innovation in Engineering Fellowship. The authors are grateful to Shanping Qiu for her efforts in acquiring and preparing the BID data for analysis.
Appendices
Appendix 1: Additional Methodological Details
Additional Methodological Details for ADOS Module 1 Data Experiments
The AGRE and BID ADOS Module 1 data demographics are provided in Table 3 for the experiments shown in Table 1 and Fig. 3. For the BID data, BCE diagnosis is also available, although we do not utilize it in this paper.
To replicate the Wall et al. (2012a) proposed 8-code selection as in Table 1, Weka’s ADTree classifier was used. In this case, the algorithm was allowed to tune itself to the given training data, but was limited to making rules using only the proposed 8 codes. Wall et al. (2012a) did not specify whether the code scores were first re-mapped as in the ADOS algorithm (e.g., 3 is mapped to 2). We chose to re-map because: (1) from the tree diagram provided by Wall et al. (2012a), it appears the codes were re-mapped; and (2) codes were re-mapped first in similar experiments by Wall et al. (2012b). Additionally, we noticed that the codes selected by the classifier did not match the proposed 8, regardless of whether re-mapping was applied.
Classification performance of ADOS diagnosis with the ADTree was evaluated (Fig. 3) using tenfold cross-validation. Three variations of the input feature set are considered: (1) All 29, in which all 29 codes are included, as was done in Wall et al. (2012a); (2) Proposed 8, in which only the 8 codes proposed in Wall et al. (2012a) are input; and (3) Remaining 21, in which the 21 of 29 codes not in the Proposed 8 are used for classification.
The performance metric is unweighted average recall (UAR), the mean of sensitivity and specificity. Many machine learning algorithms optimize for accuracy—also known as weighted average recall (WAR), since it is a weighted summation of sensitivity and specificity, dependent on the class priors—or an approximation thereof. One option for directly optimizing UAR is to balance classes through upsampling or downsampling (Rosenberg 2012). Since the ADOS Autism class was much larger than the ADOS Non-Spectrum class, the ADOS Autism class can be downsampled or the ADOS Non-Spectrum class can be upsampled to optimize for UAR. In our experiments, we chose the latter. Upsampling was performed by adding exact copies of samples from the minority class only within the training data subset, in order to keep training and testing data independent. While other statistical methods exist for upsampling, they rely on certain assumptions about the data. For example, when randomly sampling from individual code scores to generate the entire set of scores for a simulated instance, it is possible to generate a set of scores that is very unlikely or impossible to occur in the real-world. Rather than making such assumptions, we upsampled whole observed data instances from the training data.
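The two ideas above, computing UAR rather than accuracy, and upsampling whole minority-class instances only within the training partition, can be sketched as follows. This is an illustrative sketch, not the exact code used in our experiments; the function names are our own.

```python
import numpy as np

def unweighted_average_recall(y_true, y_pred):
    """UAR: the unweighted mean of per-class recalls, independent of class priors."""
    recalls = []
    for cls in np.unique(y_true):
        mask = (y_true == cls)
        recalls.append(np.mean(y_pred[mask] == cls))
    return float(np.mean(recalls))

def upsample_training_fold(X_train, y_train, minority_label, rng):
    """Duplicate whole minority-class instances until classes are balanced.
    Applied only within the training fold, so testing data stay independent."""
    minority_idx = np.where(y_train == minority_label)[0]
    # Number of copies needed so minority count equals majority count.
    n_needed = (len(y_train) - len(minority_idx)) - len(minority_idx)
    extra = rng.choice(minority_idx, size=n_needed, replace=True)
    X_bal = np.vstack([X_train, X_train[extra]])
    y_bal = np.concatenate([y_train, y_train[extra]])
    return X_bal, y_bal
```

A classifier trained on the balanced fold then effectively optimizes UAR even though its internal objective is accuracy.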
Class imbalance is also observed for the ADOS ASD (ADOS Autism and ADOS Autism Spectrum) versus ADOS Non-Spectrum experiments. The ADOS Autism class has many more samples in the AGRE and BID data than the ADOS Autism Spectrum class. In order to show a representative effect from the middle, more subtle ADOS Autism Spectrum class, the ADOS Autism class was first randomly downsampled during training to be equal in size to the ADOS Autism Spectrum class. Then, the ADOS Non-Spectrum class was upsampled to be the same size as the new ADOS ASD class (as was done previously in the Autism/Non-Spectrum experiments).
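The two-step balancing for the ASD versus Non-Spectrum experiments can be sketched as below. The labels and function name are illustrative placeholders, and the sketch operates on training indices only.

```python
import numpy as np

def balance_for_asd_experiment(y, rng):
    """Return balanced training indices: (1) downsample the Autism class
    ('AUT') to the size of the Autism Spectrum class ('SPEC'), then
    (2) upsample Non-Spectrum ('NS') to the size of the combined ASD group."""
    aut = np.where(y == 'AUT')[0]
    spec = np.where(y == 'SPEC')[0]
    ns = np.where(y == 'NS')[0]
    aut_kept = rng.choice(aut, size=len(spec), replace=False)  # downsample
    asd = np.concatenate([aut_kept, spec])                     # combined ASD group
    ns_up = rng.choice(ns, size=len(asd), replace=True)        # upsample
    return np.concatenate([asd, ns_up])
```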
Additional Methodological Details for ADI-R Data Experiments
Data demographics for the ADI-R experiments are provided in Table 3. Code re-mapping was performed as in Wall et al. (2012b); in particular, 3 was mapped to 2, and 7 and 8 were mapped to 0 (except for the Onset Age in Hindsight item, which has acceptable values from 0 to 6). Tenfold cross-validation was performed. The upsampling and downsampling for ADI-R diagnosis experiments mirror those for ADOS diagnosis experiments described in “Additional Methodological Details for ADOS Module 1 Data Experiments” section. In particular, when performing classification with 2-groups, the minority class was upsampled. For the case of Affected Status, categories of Not Quite Autism (NQA) and Broad Spectrum (BS) were first combined into a Broad-ASD (B-ASD) category; the Affected Status category was slightly larger, so it was downsampled to the size of the B-ASD category; then, the minority ADI-R Non-Autism (B-ASD + Not-Met) class was upsampled to be of equal size to the ADI-R Autism group during training.
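The re-mapping rule can be written as a small function; this is a sketch under the rule stated above, with a hypothetical function name and an explicit flag for the Onset Age in Hindsight item.

```python
def remap_adir_score(score, is_onset_age_item=False):
    """Re-map a raw ADI-R code score as in Wall et al. (2012b):
    3 -> 2, and 7 or 8 -> 0. The Onset Age in Hindsight item is
    exempt and keeps its original 0-6 range."""
    if is_onset_age_item:
        return score
    if score == 3:
        return 2
    if score in (7, 8):
        return 0
    return score
```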
Significance Testing for Unweighted Average Recall
UAR is increasingly popular in the machine learning literature for tasks with unbalanced data in which the recalls of all classes are equally important. However, no established technique exists for computing its statistical significance. Some researchers have used the binomial proportions test, as is done with accuracy, although this is not entirely valid: accuracy is a weighted average of individual class recalls, weighted by the corresponding class priors, whereas UAR is an unweighted average of individual recalls. Statistical tests exist for accuracy, sensitivity, and specificity, but no established test yet exists for UAR.
We propose using a slightly modified version of the exact binomial proportion test; we use the exact test since the data are not always sufficiently large for a normal approximation. Since UAR is an unweighted average of individual recalls, it is equally influenced by the recall of either class. The recall of a class with very few samples (e.g., 12) can vary much more than the recall of the majority class (e.g., 942); notably, the machine learning algorithm does not typically consider class size when optimizing for UAR. As such, the minor modification we made was to reduce the sample size N from 954 (12 + 942) to a smaller effective sample size, N_eff. We set N_eff to twice the size of the minority class (since there are two classes). In our example, N_eff is consequently 24, compared to the original N of 954. The negative implication is that some of the statistical power from the confidence in the recall of the majority class is discarded, but the benefit is that the statistical power in the minority-class recall is not grossly exaggerated. Thus, this test is conservative and less likely to produce false positives.
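The modified test can be sketched as follows: a one-sided exact binomial tail probability evaluated at N_eff trials rather than N. The function name is our own, and the rounding of the implied success count is one reasonable choice among several.

```python
from math import comb

def uar_significance(uar, n_minority, chance=0.5):
    """One-sided exact binomial test of an observed UAR against chance,
    using an effective sample size of twice the minority-class count
    (a conservative choice that discards some majority-class power)."""
    n_eff = 2 * n_minority
    k = round(uar * n_eff)  # number of successes implied by the observed UAR
    # Exact upper tail: P(X >= k) under Binomial(n_eff, chance).
    return sum(comb(n_eff, i) * chance**i * (1 - chance)**(n_eff - i)
               for i in range(k, n_eff + 1))
```

For the example above (minority class of 12), a perfect UAR of 1.0 is tested with N_eff = 24 rather than N = 954, yielding a much larger, and therefore more conservative, p value.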
Appendix 2: ADOS Module 1 Behavioral Codes
See Table 4.
Appendix 3: Additional Performance Measures
Here we present additional performance measures from our classification experiments with the following disclaimer: individual results should not be compared using metrics other than UAR, the mean of sensitivity and specificity, because the machine learning algorithms optimize only for UAR in our experiments and are thus not concerned with measures such as sensitivity and specificity individually. That is, an algorithm is only concerned with reaching a peak in UAR. The other statistical measures may be viewed as a random realization that achieves the observed UAR; thus, comparison of, for example, sensitivity between individual results may be inappropriate.
We understand that analysis of each of these measures is standard in diagnostic research. However, our experimental results stand primarily as empirical support of certain methodological flaws present in the experiments of Wall et al. (2012a, b); as such, we compare results using the measure that the machine learning algorithm optimizes, UAR (technically it optimizes accuracy, but it effectively optimizes UAR since we balance classes during training). We also note that analyzing the true diagnostic validity of this approach would be further complicated by the fact that the ADOS has its own diagnostic error.
The following tables present six measures: unweighted average recall (UAR), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy. Expanded results for the ADOS (cf. Fig. 3) are presented in Table 5, while expanded results for the ADI-R (cf. Table 2) are displayed in Table 6.
Cite this article
Bone, D., Goodwin, M.S., Black, M.P. et al. Applying Machine Learning to Facilitate Autism Diagnostics: Pitfalls and Promises. J Autism Dev Disord 45, 1121–1136 (2015). https://doi.org/10.1007/s10803-014-2268-6