Abstract
Missing data is a fundamental challenge in data analysis and classification. A classifier built on incomplete data can produce different accuracy results from one built on complete data. In this work we compare six scalable imputation methods applied to a Heart Failure dataset. The comparison is based on the performance metrics of three classification methods: J48, REPTree, and Random Forest. The aim of the research is to find the classifier that achieves the best performance after the missing data have been imputed with different imputation methods. The results show that, in general, Random Forest achieves better results than the decision trees J48 and REPTree. Furthermore, classification performance improved when the missing values were imputed by concept most common (CMC) and support vector machine (SVM).
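To make the concept most common (CMC) idea concrete, the sketch below fills each missing value using only the records that share the same class label: the mean for numeric attributes and the most frequent value for nominal ones. This is a minimal illustration under stated assumptions, not the implementation used in the study. The function name `cmc_impute`, the use of `None` as the missing-value marker, and the toy records are all hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def cmc_impute(rows, labels, missing=None):
    """Fill gaps with the per-class mean (numeric) or per-class mode (nominal).

    Assumption: `missing` marks an absent value (None here).
    """
    # Collect the observed values for every (class label, column) pair.
    observed = defaultdict(list)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            if value is not missing:
                observed[(label, j)].append(value)

    def fill_value(label, j):
        values = observed.get((label, j))
        if not values:
            # Nothing was observed for this class and column: leave the gap.
            return missing
        if all(isinstance(v, (int, float)) for v in values):
            return mean(values)
        return Counter(values).most_common(1)[0][0]

    return [
        [fill_value(label, j) if value is missing else value
         for j, value in enumerate(row)]
        for row, label in zip(rows, labels)
    ]


# Hypothetical toy records: one numeric and one nominal attribute.
rows = [[60, "yes"], [None, "yes"], [70, None], [40, "no"], [None, "no"]]
labels = ["a", "a", "a", "b", "b"]
filled = cmc_impute(rows, labels)
```

Because the fill values are computed per class, the class label must be known for every record being imputed. That holds for training data but not for unlabeled records, which would need a different strategy.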
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Al Khaldy, M., Kambhampati, C. (2018). Performance Analysis of Various Missing Value Imputation Methods on Heart Failure Dataset. In: Bi, Y., Kapoor, S., Bhatia, R. (eds) Proceedings of SAI Intelligent Systems Conference (IntelliSys) 2016. IntelliSys 2016. Lecture Notes in Networks and Systems, vol 16. Springer, Cham. https://doi.org/10.1007/978-3-319-56991-8_31
DOI: https://doi.org/10.1007/978-3-319-56991-8_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-56990-1
Online ISBN: 978-3-319-56991-8
eBook Packages: Engineering, Engineering (R0)