Abstract
Missing data is an issue in many real-world datasets yet robust methods for dealing with missing data appropriately still need development. In this paper we conduct an investigation of how some methods for handling missing data perform when the uncertainty increases. Using benchmark datasets from the UCI Machine Learning repository we generate datasets for our experimentation with increasing amounts of data Missing Completely At Random (MCAR) both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naïve Bayes and Support Vector Machines (SVMs). We measure the performance of each classifiers on the basis of complete case analysis, simple imputation and then we study the performance of the algorithms that can handle missing data. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible when missing data increases, particularly for high dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested but the different algorithms tested either using preprocessing in the form of simple imputation or handling the missing data do not show a significant difference in performance.
Keywords
- Missing data
- Classification algorithms
- Complete case analysis
- Single imputation
This is a preview of subscription content, access via your institution.
Buying options





References
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Chai, X., Deng, L., Yang, Q., Ling, C.X.: Test-cost sensitive naive bayes classification. In: 2004 Fourth IEEE International Conference on Data Mining, ICDM 2004, pp. 51–58. IEEE (2004)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7(Jan), 1–30 (2006)
Fichman, A., Cummings, J.N.: Multiple imputation for missing data: Making the most of what you know. Organ. Res. Meth. 6(3), 282–308 (2003)
García-Laencina, P.J., Sancho-Gómez, J.-L., Figueiras-Vidal, A.R.: Pattern classification with missing data: a review. Neural Comput. Appl. 19(2), 263–282 (2010)
Gavankar, S., Sawarkar, S.: Decision tree: Review of techniques for missing values at training, testing and compatibility. In: 2015 3rd International Conference on Artificial Intelligence, Modelling and Simulation (AIMS), pp. 122–126. IEEE (2015)
George-Nektarios, T.: Weka classifiers summary. Athens University of Economics and Bussiness Intracom-Telecom, Athens (2013)
Grzymala-Busse, J.W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. In: Ziarko, W., Yao, Y. (eds.) RSCTC 2000. LNCS (LNAI), vol. 2005, pp. 378–385. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45554-X_46
Horton, N., Kleinman, K.P.: Much ado about nothing: a comparison of missing data methods and software to fit incomplete data regression models. Am. Stat. 61, 79–90 (2007)
Khalilia, M., Chakraborty, S., Popescu, M.: Predicting disease risks from highly imbalanced data using random forest. BMC Med. Inf. Decis. Making 11(1), 51 (2011)
Kohavi, R., Becker, B., Sommerfield, D.: Improving simple bayes. In: Proceedings of the European Conference on Machine Learning. Citeseer (1997)
Kotsiantis, S.B., Zaharakis, I., Pintelas, P.: Supervised machine learning: a review of classification techniques. Emerg. Artif. Intell. Appl. Comput. Eng. 160, 3–24 (2007)
Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
Little, R.J.A., Rubin, D.B.: Statistical Analysis With Missing Data. Wiley, Hoboken (2014)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier, San Francisco (2014)
Quinlan, J.R., et al.: Bagging, boosting, and c4. 5. In: The Association for the Advancement of Artificial Intelligence (AAAI), vol. 1, pp. 725–730 (1996)
Donald, B.: Rubin. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 91(434), 473–489 (1996)
Scheffer, J.: Dealing with missing data. Res. Lett. Inf. Math. Sci. 3(1), 153–160 (2002)
Schölkopf, B., Burges, C.J.C., Smola, A.J.: Advances in Kernel Methods: Support Vector Learning. MIT press, Cambridge (1999)
Soley-Bori, M.: Dealing with missing data: Key assumptions and methods for applied analysis. Boston University School of Public Health (2013)
Tabachnick, B.G., Fidell, L.S., Osterlind, S.J.: Using Multivariate Statistics. Allyn and Bacon, Boston (2001)
Tran, C.T., Zhang, M., Andreae, P., Xue, B., Bui, L.T.: Multiple imputation and ensemble learning for classification with incomplete data. In: Leu, G., Singh, H.K., Elsayed, S. (eds.) Intelligent and Evolutionary Systems. PALO, vol. 8, pp. 401–415. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-49049-6_29
van der Heijden, G.J.M.G., Donders, A.R.T., Stijnen, T., Moons, K.G.M.: Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J. Clin. Epidemiol. 59(10), 1102–1109 (2006)
Witten, I.H., Frank, E., Hall, M.A., Pal, C.J.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Massachusetts (2016)
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Angus, N., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
Acknowledgements
This work was supported by the Economic and Social Research Council (grant number ES/L011859/1).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Aleryani, A., Wang, W., De La Iglesia, B. (2018). Dealing with Missing Data and Uncertainty in the Context of Data Mining. In: , et al. Hybrid Artificial Intelligent Systems. HAIS 2018. Lecture Notes in Computer Science(), vol 10870. Springer, Cham. https://doi.org/10.1007/978-3-319-92639-1_24
Download citation
DOI: https://doi.org/10.1007/978-3-319-92639-1_24
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92638-4
Online ISBN: 978-3-319-92639-1
eBook Packages: Computer ScienceComputer Science (R0)