Abstract
Missing values are common in real-world datasets, and handling them properly is essential for classification because inadequate treatment of missing values often leads to large classification errors. One of the most popular ways to address incomplete data is to use imputation methods to fill missing fields with plausible values. Multiple imputation, which fills each missing field with a set of plausible values, is a powerful approach to dealing with incomplete data, but it is mainly used for statistical analysis. Ensemble learning, which constructs a set of classifiers instead of a single classifier, has proven capable of improving classification accuracy, but it has mainly been applied to complete data. This paper proposes a combination of multiple imputation and ensemble learning to build an ensemble of classifiers for incomplete data classification tasks. A multiple imputation method is used to generate a set of diverse imputed datasets, which is then used to build a set of diverse classifiers. Experiments on ten benchmark datasets use a decision tree as the classification algorithm and compare the proposed approach with two other popular approaches to dealing with incomplete data. The results show that, in almost all cases, the proposed method achieves significantly better classification accuracy than the other methods.
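The idea in the abstract (impute the incomplete training data several times, train one classifier per imputed copy, then combine the predictions) can be illustrated with a minimal sketch. This is not the authors' implementation. To stay dependency-free it makes several assumptions: imputation is a simple random draw from the observed values of each feature (a stand-in for a proper multiple imputation method), the base learner is a nearest-centroid classifier rather than a decision tree, predictions are combined by majority vote, and incomplete query instances are filled in by a fresh random draw for each ensemble member. All function names are hypothetical.

```python
import random
from collections import Counter

def impute_once(rows, pool, rng):
    """Return a copy of rows with each None replaced by a random observed value.

    Assumption: a random draw from the observed values of the same column,
    used here as a simple stand-in for a real multiple imputation method.
    """
    return [[v if v is not None else rng.choice(pool[j]) for j, v in enumerate(row)]
            for row in rows]

def fit_centroids(X, y):
    """Stand-in base learner: one mean vector (centroid) per class."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroid(model, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(model[lab], row)))

def fit_ensemble(X, y, n_imputations=5, seed=0):
    """Train one classifier per imputed copy of the incomplete training data."""
    rng = random.Random(seed)
    # Observed (non-missing) values per column; each column needs at least one.
    pool = [[r[j] for r in X if r[j] is not None] for j in range(len(X[0]))]
    models = [fit_centroids(impute_once(X, pool, rng), y) for _ in range(n_imputations)]
    return models, pool, rng

def predict_ensemble(ensemble, row):
    """Majority vote; an incomplete query is re-imputed for each ensemble member."""
    models, pool, rng = ensemble
    votes = Counter(predict_centroid(m, impute_once([row], pool, rng)[0]) for m in models)
    return votes.most_common(1)[0][0]

# Toy incomplete dataset: None marks a missing field.
X_train = [[1.0, 1.2], [0.8, None], [None, 0.9],
           [5.0, 5.1], [None, 4.8], [5.2, None]]
y_train = ["a", "a", "a", "b", "b", "b"]

ensemble = fit_ensemble(X_train, y_train, n_imputations=5, seed=0)
print(predict_ensemble(ensemble, [1.0, 1.0]))   # complete query
print(predict_ensemble(ensemble, [None, 5.0]))  # incomplete query
```

Because every ensemble member is trained on a differently imputed copy of the data, the members disagree where the imputed values matter, and that disagreement is the source of diversity that the vote exploits. Swapping in a real imputation method and a decision tree changes only `impute_once` and the two base-learner functions.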
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Tran, C.T., Zhang, M., Andreae, P., Xue, B., Bui, L.T. (2017). Multiple Imputation and Ensemble Learning for Classification with Incomplete Data. In: Leu, G., Singh, H., Elsayed, S. (eds) Intelligent and Evolutionary Systems. Proceedings in Adaptation, Learning and Optimization, vol 8. Springer, Cham. https://doi.org/10.1007/978-3-319-49049-6_29
DOI: https://doi.org/10.1007/978-3-319-49049-6_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49048-9
Online ISBN: 978-3-319-49049-6
eBook Packages: Engineering, Engineering (R0)