Model-Based and Model-Free Techniques for Amyotrophic Lateral Sclerosis Diagnostic Prediction and Patient Clustering
Amyotrophic lateral sclerosis (ALS) is a complex progressive neurodegenerative disorder with an estimated prevalence of about 5 per 100,000 people in the United States. In this study, ALS disease progression is measured by the change in the Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS) score over time. The study aims to provide clinical decision support for timely forecasting of the ALS trajectory as well as accurate and reproducible computable phenotypic clustering of participants. Patient data were extracted from the DREAM-Phil Bowen ALS Prediction Prize4Life Challenge data, most of which come from the Pooled Resource Open-Access ALS Clinical Trials Database (PRO-ACT) archive. We employed model-based and model-free machine-learning methods to predict the change in the ALSFRS score over time. Using training and testing data, we quantified and compared the performance of the different techniques. We also used unsupervised machine-learning methods to cluster the patients into separate computable phenotypes and to interpret the derived subcohorts. Direct prediction of univariate clinical outcomes using model-based methods (linear models) or model-free methods (machine-learning techniques: random forest and Bayesian additive regression trees, BART) was only moderately successful. The correlation coefficients between the clinically observed changes in ALSFRS scores and their predicted counterparts were 0.427 (random forest) and 0.545 (BART). The reliability of these results was assessed using internal statistical cross-validation as well as external data validation. Unsupervised clustering generated very reliable and consistent partitions of the patient cohort into four computable phenotypic subgroups. These clusters were explicated by identifying specific salient clinical features included in the PRO-ACT archive that discriminate between the derived subcohorts.
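To make the evaluation metric concrete, the following is a minimal sketch of how a per-patient ALSFRS change and the observed-versus-predicted correlation could be computed. It assumes that the "change of ALSFRS score over time" is summarized as a least-squares slope (points per month) and that the reported coefficients are Pearson correlations; neither detail is stated in the abstract. The function names and the patient values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def alsfrs_slope(months, scores):
    """Least-squares slope of ALSFRS score against time (points per month).

    Assumption: progression is summarized as a linear slope; the abstract
    only says "change of ALSFRS score over time".
    """
    if len(months) != len(scores) or len(months) < 2:
        raise ValueError("need at least two paired (month, score) visits")
    n = len(months)
    mean_t = sum(months) / n
    mean_s = sum(scores) / n
    sxx = sum((t - mean_t) ** 2 for t in months)
    if sxx == 0:
        raise ValueError("all visits share the same time point")
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in zip(months, scores))
    return sxy / sxx


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted outcomes."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need at least two paired values")
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    soo = sum((o - mean_o) ** 2 for o in observed)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    if soo == 0 or spp == 0:
        raise ValueError("correlation is undefined for constant input")
    sop = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    return sop / sqrt(soo * spp)


if __name__ == "__main__":
    # Hypothetical patient: ALSFRS recorded at months 0, 3, 6, 9, 12.
    print(alsfrs_slope([0, 3, 6, 9, 12], [40, 37, 34, 31, 28]))

    # Hypothetical held-out slopes versus model predictions.
    observed = [-1.0, -0.4, -0.8, -0.2, -1.3]
    predicted = [-0.7, -0.5, -0.6, -0.4, -0.9]
    print(round(pearson_r(observed, predicted), 3))
```

Under these assumptions, a model would be fit on the training patients, and `pearson_r` would be evaluated on the held-out testing patients only, so that the coefficient reflects out-of-sample agreement.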
There are differences between alternative analytical methods in forecasting specific clinical phenotypes. Although predicting univariate clinical outcomes may be challenging, our results suggest that modern data science strategies are useful in clustering patients and generating evidence-based ALS hypotheses about complex interactions of multivariate factors. Predicting univariate clinical outcomes using the PRO-ACT data yields only marginal accuracy (about 70%). However, unsupervised clustering of participants into sub-groups generates stable, reliable and consistent (exceeding 95%) computable phenotypes whose explication requires interpretation of multivariate sets of features.
• Used a large ALS data archive of 8,000 patients consisting of 3 million records, including 200 clinical features tracked over 12 months.
• Employed model-based and model-free methods to predict ALSFRS changes over time, cluster patients into cohorts, and derive computable phenotypes.
• Research findings include stable, reliable, and consistent (95%) patient stratification into computable phenotypes. However, clinical explication of the results requires interpretation of multivariate information.
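One way to quantify the consistency of a patient stratification is to compare the cluster labels produced by two independent runs (for example, on resampled data). The sketch below uses the Rand index, the fraction of patient pairs on which two partitions agree. This is an illustrative assumption: the abstract does not specify how the reported consistency figure was computed, and the function name and labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of patient pairs on which two partitions agree.

    A pair counts as an agreement when both partitions place the two
    patients together, or both place them apart. The measure ignores
    the arbitrary numbering of clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must label the same patients")
    if len(labels_a) < 2:
        raise ValueError("need at least two patients")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total


if __name__ == "__main__":
    # Hypothetical labels for eight patients from two clustering runs.
    # The second run uses different cluster numbers but the same grouping.
    run_1 = [0, 0, 1, 1, 2, 2, 3, 3]
    run_2 = [3, 3, 2, 2, 1, 1, 0, 0]
    print(rand_index(run_1, run_2))
```

Because cluster identifiers are arbitrary across runs, a pair-based measure of this kind is preferable to a direct label-by-label comparison.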
Keywords: ALS; Amyotrophic lateral sclerosis; Decision support; Machine learning; Predictive analytics; Data science; Big data
Colleagues from the Statistics Online Computational Resource (SOCR), Center for Complexity and Self-management of Chronic Disease (CSCD), Big Data Discovery Science (BDDS), and the Michigan Institute for Data Science (MIDAS) provided constructive feedback about this study.
Data used in the preparation of this article were obtained from the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) Database. As such, the following organizations and individuals within the PRO-ACT Consortium contributed to the design and implementation of the PRO-ACT Database and/or provided data, but did not participate in the analysis of the data or the writing of this report: Neurological Clinical Research Institute, MGH; Northeast ALS Consortium; Novartis; Prize4Life Israel; Regeneron Pharmaceuticals, Inc.; Sanofi; Teva Pharmaceutical Industries, Ltd.
Finally, the authors are deeply indebted to the journal editors and the anonymous reviewers who provided valuable recommendations and constructive critiques that improved the manuscript.
MT: developed techniques, conducted analyses, and wrote manuscript.
CG: developed techniques, conducted analyses, and wrote manuscript.
SAG: conceptualized the study and wrote manuscript.
AK: informatics, data analytics, and wrote manuscript.
BM: biostatistical methodology and wrote manuscript.
YG: conducted analyses and wrote manuscript.
IDD: conceptualized the study, developed methods, conducted analyses, and wrote manuscript.
This research was partially supported by NSF grants 1734853, 1636840, 1416953, 0716055 and 1023115, NIH grants P20 NR015331, P50 NS091856, UL1TR002240, P30 DK089503, U54 EB020406, P30 AG053760, and K23 ES027221, and the Elsie Andresen Fiske Research Fund. These funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Compliance with Ethical Standards
Ethics Approval and Consent to Participate
University of Michigan Institutional Review Board (IRB) approval (HUM00115107) was obtained prior to managing, processing and analyzing the PRO-ACT data.
Dr. Goutman (S.A.G.) has received research support from the NIH/NIEHS (K23ES027221), the Agency for Toxic Substances and Disease Registry/Centers for Disease Control, the ALS Association, Target ALS, Cytokinetics, and Neuralstem, Inc., and has consulted for Cytokinetics.