
Missing data imputation by K nearest neighbours based on grey relational structure and mutual information

Applied Intelligence

Abstract

Handling missing data has become increasingly important in scientific research and engineering applications. The classic imputation strategy based on K nearest neighbours (KNN) is widely used to address this problem. However, previous studies have paid little attention to feature relevance, which strongly affects the selection of nearest neighbours; as a result, similarity measurements may be biased. In this paper, we propose a novel method for imputing missing data, the feature weighted grey KNN (FWGKNN) imputation algorithm, which employs mutual information (MI) to measure feature relevance. We present an experimental evaluation on five UCI datasets under three missingness mechanisms at various missing rates. The experimental results show that feature relevance has a non-negligible influence on grey-theory-based missing data estimation, and that our method outperforms the four other estimation strategies considered. Moreover, using our approach can significantly reduce classification bias in classification tasks.
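The abstract names the ingredients of the method (grey relational similarity, MI-based feature weights, KNN imputation) but gives no formulas, so the following is only a minimal sketch of how such pieces could fit together, not the authors' algorithm. It assumes Deng's standard grey relational coefficient with distinguishing coefficient `zeta = 0.5`, a simple histogram estimate of MI, min-max scaling, numeric features only, and a grade-weighted average of the `k` most similar complete rows. The function names, the parameters `k`, `zeta` and `bins`, and the requirement that every incomplete row has at least one observed value are all assumptions made for illustration.

```python
import numpy as np

def mutual_information(x, y, bins=4):
    """Histogram (plug-in) estimate of MI between two 1-D arrays, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def grey_relational_coefficients(target, candidates, zeta=0.5):
    """Deng's grey relational coefficient for every candidate and feature.

    Assumes target and candidates are already scaled to comparable ranges.
    """
    delta = np.abs(candidates - target)
    dmin, dmax = delta.min(), delta.max()
    if dmax == 0:                         # every candidate equals the target
        return np.ones_like(delta)
    return (dmin + zeta * dmax) / (delta + zeta * dmax)

def fwgknn_impute(X, k=3, zeta=0.5, bins=4):
    """Illustrative feature-weighted grey KNN imputation (assumed design)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]        # donor pool: complete rows
    lo, hi = complete.min(axis=0), complete.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard constant columns
    complete_scaled = (complete - lo) / span
    for i in np.flatnonzero(np.isnan(X).any(axis=1)):
        obs = ~np.isnan(X[i])                     # observed features of row i
        obs_idx = np.flatnonzero(obs)
        row_scaled = (X[i, obs] - lo[obs]) / span[obs]
        coef = grey_relational_coefficients(
            row_scaled, complete_scaled[:, obs], zeta)
        for j in np.flatnonzero(~obs):
            # Weight each observed feature by its MI with the missing feature.
            mi = np.array([mutual_information(complete[:, f], complete[:, j], bins)
                           for f in obs_idx])
            if mi.sum() > 0:
                w = mi / mi.sum()
            else:
                w = np.full(mi.size, 1.0 / mi.size)
            grade = coef @ w                      # weighted grey relational grade
            nn = np.argsort(grade)[-k:]           # k most similar complete rows
            out[i, j] = np.average(complete[nn, j], weights=grade[nn])
    return out
```

Calling `fwgknn_impute` on a 2-D array with `np.nan` marking missing entries returns a copy in which only the missing entries are filled; observed values are left unchanged. Weighting the per-feature coefficients by MI is what lets features that carry more information about the missing attribute dominate the neighbour ranking, which is the idea the abstract describes.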

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 71172219 and 71302056), the Humanity and Social Science Youth Foundation of the Ministry of Education, China (Grant No. 10YJC630352), and the Research Foundation of the Education Department of Anhui Province, China (Grant No. SK2012B578).

Author information

Corresponding author

Correspondence to Ruilin Pan.

About this article

Cite this article

Pan, R., Yang, T., Cao, J. et al. Missing data imputation by K nearest neighbours based on grey relational structure and mutual information. Appl Intell 43, 614–632 (2015). https://doi.org/10.1007/s10489-015-0666-x
