Applied Intelligence, Volume 43, Issue 3, pp 614–632

Missing data imputation by K nearest neighbours based on grey relational structure and mutual information

  • Ruilin Pan
  • Tingsheng Yang
  • Jianhua Cao
  • Ke Lu
  • Zhanchao Zhang

Abstract

The treatment of missing data has become increasingly important in scientific research and engineering applications. The classic imputation strategy based on K nearest neighbours (KNN) has been widely used to address this problem. However, previous studies have paid little attention to feature relevance, which has a significant impact on the selection of nearest neighbours; as a result, similarity measurements may be biased. In this paper, we propose a novel method for imputing missing data, named the feature weighted grey KNN (FWGKNN) imputation algorithm. This approach employs mutual information (MI) to measure feature relevance. We present an experimental evaluation on five UCI datasets under three missingness mechanisms with various missing rates. Experimental results show that feature relevance has a non-negligible influence on missing data estimation based on grey theory, and that our method outperforms the other four estimation strategies compared. Moreover, the classification bias can be significantly reduced by using our approach in classification tasks.
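
The general idea of weighting a grey relational similarity by mutual information can be illustrated with a minimal sketch. This is not the authors' FWGKNN implementation: the histogram-based MI estimator, the function names, the distinguishing coefficient rho = 0.5, the min-max scaling and the grade-weighted average of the neighbours are all assumptions chosen for illustration. The grey relational coefficient follows Deng's standard form, (Δmin + ρΔmax) / (Δ + ρΔmax).

```python
import numpy as np


def hist_mutual_info(x, y, bins=8):
    """Histogram estimate of mutual information (in nats) between two 1-D arrays.

    Assumption: a simple binned estimator, used here only for illustration.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def grey_relational_grades(ref, candidates, weights, rho=0.5):
    """Weighted grey relational grade of each candidate row against `ref`."""
    delta = np.abs(candidates - ref)      # absolute differences, shape (n, d)
    dmin, dmax = delta.min(), delta.max()
    if dmax == 0:                         # every candidate equals the reference
        return np.ones(len(candidates))
    coeff = (dmin + rho * dmax) / (delta + rho * dmax)
    return coeff @ weights                # weights are expected to sum to 1


def impute_one(X, row, col, k=3, rho=0.5, bins=8):
    """Estimate the missing entry X[row, col] from the k most related complete rows."""
    complete = X[~np.isnan(X).any(axis=1)]
    obs = [j for j in range(X.shape[1])
           if j != col and not np.isnan(X[row, j])]
    if not obs or len(complete) == 0:
        raise ValueError("need at least one observed feature and one complete row")

    # Min-max scale features so differences are comparable (assumption).
    lo, hi = complete.min(axis=0), complete.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Z = (complete - lo) / span
    z_ref = (X[row, obs] - lo[obs]) / span[obs]

    # Feature relevance: MI between each observed feature and the target feature.
    mi = np.array([hist_mutual_info(complete[:, j], complete[:, col], bins)
                   for j in obs])
    weights = mi / mi.sum() if mi.sum() > 0 else np.full(len(obs), 1.0 / len(obs))

    grades = grey_relational_grades(z_ref, Z[:, obs], weights, rho)
    top = np.argsort(grades)[-k:]         # indices of the k largest grades
    # Grade-weighted mean of the neighbours' target values (assumption).
    return float(np.average(complete[top, col], weights=grades[top]))
```

Calling `impute_one(X, row, col)` on a NumPy array `X` that uses `np.nan` for missing entries returns an estimate for a single cell; features that share more information with the target column contribute more to the neighbour ranking.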

Keywords

Missing data; Grey theory; Mutual information; Feature relevance; K nearest neighbours

Notes

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 71172219 and 71302056), the Humanity and Social Science Youth Foundation of the Ministry of Education of China (Grant No. 10YJC630352), and the Research Foundation of the Education Department of Anhui Province of China (Grant No. SK2012B578).

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Ruilin Pan (1)
  • Tingsheng Yang (1)
  • Jianhua Cao (1)
  • Ke Lu (1)
  • Zhanchao Zhang (1)

  1. School of Management Science and Engineering, Anhui University of Technology, Maanshan, People’s Republic of China