Evaluating Imputation Techniques for Missing Data in ADNI: A Patient Classification Study

  • Sergio Campos
  • Luis Pizarro
  • Carlos Valle
  • Katherine R. Gray
  • Daniel Rueckert
  • Héctor Allende
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9423)


In real-world applications it is common to find data sets whose records contain missing values. As many data analysis algorithms are not designed to work with missing data, all variables associated with such records are generally removed from the analysis. A better alternative is to employ data imputation techniques to estimate the missing values using statistical relationships among the variables. In this work, we test the most common imputation methods used in the literature for filling in missing values in the ADNI (Alzheimer’s Disease Neuroimaging Initiative) data set, where missing data affect about 80% of the patients, which makes removing most of the data unwise. We measure the imputation error of the different techniques and then evaluate their impact on classification performance. We train support vector machine and random forest classifiers using all the imputed data, as opposed to a reduced set of samples having complete records, for the task of discriminating among different stages of Alzheimer’s disease. Our results show the importance of using imputation procedures to achieve higher accuracy and robustness in the classification.
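As an illustration of how an imputation error can be measured, the following minimal sketch hides a fraction of known values, fills them in, and compares the estimates with the hidden truth. It is not the procedure used in the paper: the data are synthetic (not ADNI), column-mean imputation stands in for the methods compared, and the helper names `mean_impute` and `imputation_rmse`, the 20% masking rate, and the RMSE metric are all assumptions made for the example.

```python
import math
import random

def mean_impute(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

def imputation_rmse(truth, imputed, masked):
    """Root mean squared error over the artificially hidden cells only."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in masked]
    return math.sqrt(sum(squared) / len(squared))

# Synthetic stand-in data (assumption: 50 records, 3 numeric variables).
rng = random.Random(0)
truth = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]

# Hide roughly 20% of the cells so the true values are known for scoring.
masked = [(i, j) for i in range(50) for j in range(3) if rng.random() < 0.2]
incomplete = [row[:] for row in truth]
for i, j in masked:
    incomplete[i][j] = None

imputed = mean_impute(incomplete)
rmse = imputation_rmse(truth, imputed, masked)

# Complete-case analysis would keep only the records with no missing value.
complete_cases = [r for r in incomplete if None not in r]

print(f"RMSE on masked cells: {rmse:.3f}")
print(f"complete records: {len(complete_cases)} of {len(incomplete)}")
```

The last two lines contrast the two options discussed above: imputation keeps every record at the cost of some estimation error, whereas discarding incomplete records shrinks the sample available for training a classifier.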


Missing data, Imputation, Classification, ADNI, Alzheimer



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Sergio Campos (1), corresponding author
  • Luis Pizarro (3)
  • Carlos Valle (1)
  • Katherine R. Gray (2)
  • Daniel Rueckert (2)
  • Héctor Allende (1)
  1. Departamento de Informática, Universidad Técnica Federico Santa María, Valparaíso, Chile
  2. Department of Computer Science, University College London, London, UK
  3. Department of Computing, Imperial College London, London, UK
