Skip to main content

Dealing with Missing Data and Uncertainty in the Context of Data Mining

Part of the Lecture Notes in Computer Science book series (LNAI,volume 10870)


Missing data is an issue in many real-world datasets yet robust methods for dealing with missing data appropriately still need development. In this paper we conduct an investigation of how some methods for handling missing data perform when the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR) both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifiers on the basis of complete case analysis, simple imputation and then we study the performance of the algorithms that can handle missing data. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible when missing data increases, particularly for high dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested but the different algorithms tested either using preprocessing in the form of simple imputation or handling the missing data do not show a significant difference in performance.


  • Missing data
  • Classification algorithms
  • Complete case analysis
  • Single imputation

This is a preview of subscription content, access via your institution.

Buying options

USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-3-319-92639-1_24
  • Chapter length: 13 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
USD   89.00
Price excludes VAT (USA)
  • ISBN: 978-3-319-92639-1
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Softcover Book
USD   119.99
Price excludes VAT (USA)
Fig. 1.
Fig. 2.
Fig. 3.
Fig. 4.
Fig. 5.


  1. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

    CrossRef  Google Scholar 

  2. Chai, X., Deng, L., Yang, Q., Ling, C.X.: Test-cost sensitive naive bayes classification. In: 2004 Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 51–58. IEEE (2004)

    Google Scholar 

  3. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(Jan), 1–30 (2006)

    MathSciNet  MATH  Google Scholar 

  4. Fichman, A., Cummings, J.N.: Multiple imputation for missing data: Making the most of what you know. Organ. Res. Meth. 6(3), 282–308 (2003)

    CrossRef  Google Scholar 

  5. García-Laencina, P.J., Sancho-Gómez, J.-L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)

    CrossRef  Google Scholar 

  6. Gavankar, S., Sawarkar, S.: Decision tree: Review of techniques for missing values at training, testing and compatibility. In: 2015 3rd International Conference on Artificial Intelligence, Modelling and Simulation (AIMS), pp. 122–126. IEEE (2015)

    Google Scholar 

  7. George-Nektarios, T.: Weka classifiers summary. Athens University of Economics and Bussiness Intracom-Telecom, Athens (2013)

    Google Scholar 

  8. Grzymala-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001).

    CrossRef  MATH  Google Scholar 

  9. Horton, N., Kleinman, K.P.: Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90 (2007)

    CrossRef  MathSciNet  Google Scholar 

  10. Khalilia, M., Chakraborty, S., Popescu, M.: Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inf. Decis. Making 11(1), 51 (2011)

    CrossRef  Google Scholar 

  11. Kohavi, R., Becker, B., Sommerfield, D.: Improving simple bayes. In: Proceedings of the European Conference on Machine Learning. Citeseer (1997)

    Google Scholar 

  12. Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)

    Google Scholar 

  13. Lichman, M.: UCI machine learning repository (2013).

  14. Little, R.J.A., Rubin, D.B.: Statistical Analysis With Missing Data. Wiley, Hoboken (2014)

    MATH  Google Scholar 

  15. Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, San Francisco (2014)

    Google Scholar 

  16. Quinlan, J.R., et al.: Bagging, boosting, and c4. 5. In: The Association for the Advancement of Artificial Intelligence (AAAI), vol. 1, pp. 725–730 (1996)

    Google Scholar 

  17. Donald, B.: Rubin. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 91(434), 473–489 (1996)

    CrossRef  Google Scholar 

  18. Scheffer, J.: Dealing with missing data. Res. Lett. Inf. Math. Sci. 3(1), 153–160 (2002)

    Google Scholar 

  19. Schölkopf, B., Burges, C.J.C., Smola, A.J.: Advances in Kernel Methods: Support Vector Learning. MIT press, Cambridge (1999)

    MATH  Google Scholar 

  20. Soley-Bori, M.: Dealing with missing data: Key assumptions and methods for applied analysis. Boston University School of Public Health (2013)

    Google Scholar 

  21. Tabachnick, B.G., Fidell, L.S., Osterlind, S.J.: Using Multivariate Statistics. Allyn and Bacon, Boston (2001)

    Google Scholar 

  22. Tran, C.T., Zhang, M., Andreae, P., Xue, B., Bui, L.T.: Multiple imputation and ensemble learning for classification with incomplete data. In: Leu, G., Singh, H.K., Elsayed, S. (eds.) Intelligent and Evolutionary Systems. PALO, vol. 8, pp. 401–415. Springer, Cham (2017).

    CrossRef  Google Scholar 

  23. van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)

    CrossRef  Google Scholar 

  24. Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Massachusetts (2016)

    Google Scholar 

  25. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Angus, N., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)

    CrossRef  Google Scholar 

Download references


This work was supported by the Economic and Social Research Council (grant number ES/L011859/1).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Aliya Aleryani .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and Permissions

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Verify currency and authenticity via CrossMark

Cite this paper

Aleryani, A., Wang, W., De La Iglesia, B. (2018). Dealing with Missing Data and Uncertainty in the Context of Data Mining. In: , et al. Hybrid Artificial Intelligent Systems. HAIS 2018. Lecture Notes in Computer Science(), vol 10870. Springer, Cham.

Download citation

  • DOI:

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92638-4

  • Online ISBN: 978-3-319-92639-1

  • eBook Packages: Computer ScienceComputer Science (R0)