Exploring the Effects of Data Distribution in Missing Data Imputation

  • Jastin Pompeu Soares
  • Miriam Seoane Santos
  • Pedro Henriques Abreu
  • Hélder Araújo
  • João Santos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11191)


In data imputation problems, researchers typically apply several techniques, individually or in combination, to find the one that performs best over all the features in the dataset. This strategy, however, neglects the nature of the data (its distribution) and makes the findings impractical to generalise, since each new dataset requires a large number of new, time-consuming experiments. To overcome this issue, this work aims to understand the relationship between data distribution and the performance of standard imputation techniques, providing a heuristic for choosing suitable imputation methods and avoiding the need to test a large set of methods. To this end, several datasets were selected covering different sample sizes, numbers of features, distributions and contexts, and missing values were inserted at different percentages and under different scenarios. Then, different imputation methods were evaluated in terms of predictive and distributional accuracy. Our findings show that there is a relationship between the features' distributions and the algorithms' performance, and that performance seems to be affected by the combination of missing rate and scenario at stake, as well as by other less obvious factors such as sample size, the goodness-of-fit of features and the ratio between the number of features and the different distributions present in the dataset.
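The evaluation loop described above (insert missing values at several rates, impute, then score predictive and distributional accuracy) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic log-normal feature, the missing-completely-at-random mechanism, mean imputation as the method under test, RMSE as the predictive measure and a Kolmogorov-Smirnov-style distance as the distributional measure. The helper names are hypothetical.

```python
import bisect
import math
import random
import statistics


def insert_mcar(values, rate, rng):
    """Copy `values`, setting a fraction `rate` of entries to None uniformly at random."""
    n_missing = round(rate * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Replace every None with the mean of the observed entries (assumed baseline method)."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def rmse_on_missing(original, masked, imputed):
    """Predictive accuracy: RMSE computed only over the entries that were removed."""
    squared = [(o - p) ** 2 for o, m, p in zip(original, masked, imputed) if m is None]
    return math.sqrt(sum(squared) / len(squared))


def ks_distance(a, b):
    """Distributional accuracy: largest gap between the empirical CDFs of a and b."""
    sa, sb = sorted(a), sorted(b)

    def ecdf(sorted_vals, x):
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    # Both ECDFs are step functions, so the largest gap occurs at a data point.
    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in set(sa) | set(sb))


rng = random.Random(0)
# Assumed toy feature: a skewed (log-normal) sample standing in for one dataset column.
feature = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]

results = {}
for rate in (0.05, 0.20, 0.40):  # assumed missing rates
    masked = insert_mcar(feature, rate, rng)
    imputed = mean_impute(masked)
    results[rate] = (
        rmse_on_missing(feature, masked, imputed),
        ks_distance(feature, imputed),
    )
    print(f"rate={rate:.2f}  rmse={results[rate][0]:.3f}  ks={results[rate][1]:.3f}")
```

Swapping the toy feature for samples drawn from other distributions, or `mean_impute` for another imputer, gives a rough way to see how the two scores respond to the shape of the data.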


Keywords: Missing data, Data imputation, Data distribution



This article is a result of the project NORTE-01-0145-FEDER-000027, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jastin Pompeu Soares (1)
  • Miriam Seoane Santos (1)
  • Pedro Henriques Abreu (1), corresponding author
  • Hélder Araújo (2)
  • João Santos (3)

  1. CISUC, Department of Informatics Engineering, University of Coimbra, Coimbra, Portugal
  2. ISR, Department of Electrical and Computer Engineering, University of Coimbra, Coimbra, Portugal
  3. IPO-Porto Research Centre (CI-IPOP), Porto, Portugal
