Data Preprocessing in Data Mining, pp. 59–105
Dealing with Missing Values
Abstract
This chapter introduces the approaches used in the literature to tackle the presence of Missing Values (MVs). In real-life data, information is frequently lost because attribute values are missing, which hampers data mining tasks. Several schemes have been studied to overcome the drawbacks produced by MVs; one of the best known is based on preprocessing and is formally known as imputation. After the introduction in Sect. 4.1, the chapter begins with the theoretical background, which analyzes the underlying distribution of the missingness, in Sect. 4.2. From this point on, the successive sections go from the simplest approaches in Sect. 4.3 to the most advanced proposals, focusing on the imputation of the MVs. Such advanced methods include the classic maximum likelihood procedures, like Expectation-Maximization or Multiple Imputation (Sect. 4.4), and the latest Machine Learning based approaches, which use classification or regression algorithms to accomplish the imputation (Sect. 4.5). Finally, a comparative experimental study is carried out in Sect. 4.6.
Keywords
Multiple Imputation; Imputation Method; Principal Component Regression; Input Attribute; Data Augmentation