Abstract
The treatment of missing data has become increasingly important in scientific research and engineering applications. The classic imputation strategy based on K nearest neighbours (KNN) has been widely used to address this problem. However, previous studies have paid little attention to feature relevance, which has a significant impact on the selection of nearest neighbours; as a result, similarity measurements may be biased. In this paper, we propose a novel method for imputing missing data, named the feature weighted grey KNN (FWGKNN) imputation algorithm. The approach employs mutual information (MI) to measure feature relevance. We present an experimental evaluation on five UCI datasets under three missingness mechanisms with various missing rates. The experimental results show that feature relevance has a non-negligible influence on missing data estimation based on grey theory, and that our method is superior to the other four estimation strategies. Moreover, using our approach can significantly reduce classification bias in classification tasks.
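The abstract names the ingredients of the method (MI-based feature weights, a grey relational similarity, and KNN) but does not give the formulas. The following is therefore only a minimal sketch of how such a pipeline could look, not the authors' implementation. It assumes Deng's grey relational coefficient with a distinguishing coefficient of 0.5, a simple histogram estimate of MI, features already scaled to a comparable range, and a plain mean over the K selected neighbours. All function names and parameters (`mutual_information`, `grey_relational_coefficients`, `impute_one`, `rho`, `bins`) are hypothetical.

```python
import numpy as np


def mutual_information(x, y, bins=5):
    """Histogram-based MI estimate (in nats) between two 1-D arrays (assumed estimator)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def grey_relational_coefficients(target, candidates, cols, rho=0.5):
    """Deng's grey relational coefficient of each candidate row to the target, per feature."""
    diff = np.abs(candidates[:, cols] - target[cols])
    dmin, dmax = diff.min(), diff.max()
    if dmax == 0:                         # all candidates identical to the target
        return np.ones_like(diff)
    return (dmin + rho * dmax) / (diff + rho * dmax)


def impute_one(X, row, col, k=3, rho=0.5, bins=5):
    """Estimate X[row, col] from the k complete rows with the highest MI-weighted grey grade."""
    complete = X[~np.isnan(X).any(axis=1)]
    # Features observed in the incomplete row, excluding the one being imputed.
    cols = [j for j in range(X.shape[1]) if j != col and not np.isnan(X[row, j])]
    # Assumed weighting: MI between each observed feature and the feature to impute.
    w = np.array([mutual_information(complete[:, j], complete[:, col], bins) for j in cols])
    w = w / w.sum() if w.sum() > 0 else np.full(len(cols), 1.0 / len(cols))
    grade = grey_relational_coefficients(X[row], complete, cols, rho) @ w
    nearest = np.argsort(grade)[::-1][:k]  # larger grade means more similar
    return float(complete[nearest, col].mean())


# Toy usage: column 2 depends on column 0, while column 1 is irrelevant noise.
rng = np.random.default_rng(0)
a = rng.random(200)
X = np.column_stack([a, rng.random(200), 2 * a])
X[0, 2] = np.nan
estimate = impute_one(X, row=0, col=2, k=5)
```

In this toy data the MI weight of the noise column is close to zero, so neighbours are chosen almost entirely by the relevant feature, which is the effect of feature relevance that the abstract describes.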
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China (Grant Nos. 71172219 and 71302056), the Humanity and Social Science Youth Foundation of the Ministry of Education, China (Grant No. 10YJC630352), and the Research Foundation of the Education Department of Anhui Province of China (Grant No. SK2012B578).
Cite this article
Pan, R., Yang, T., Cao, J. et al. Missing data imputation by K nearest neighbours based on grey relational structure and mutual information. Appl Intell 43, 614–632 (2015). https://doi.org/10.1007/s10489-015-0666-x
DOI: https://doi.org/10.1007/s10489-015-0666-x