Missing data imputation by K nearest neighbours based on grey relational structure and mutual information

Pan, Ruilin; Yang, Tingsheng; Cao, Jianhua; Lu, Ke; Zhang, Zhanchao

doi:10.1007/s10489-015-0666-x

Published: 10 May 2015

Applied Intelligence, Volume 43, pages 614–632 (2015)

Abstract

The treatment of missing data has become increasingly important in scientific research and engineering applications. The classic imputation strategy based on K nearest neighbours (KNN) has been widely used to address this problem. However, previous studies have paid little attention to feature relevance, which has a significant impact on the selection of nearest neighbours. As a result, similarity measurements may be biased. In this paper, we propose a novel method for imputing missing data, named the feature weighted grey KNN (FWGKNN) imputation algorithm. This approach employs mutual information (MI) to measure feature relevance. We present an experimental evaluation on five UCI datasets under three missingness mechanisms with various missing rates. Experimental results show that feature relevance has a non-negligible influence on missing data estimation based on grey theory, and that our method outperforms the four other estimation strategies compared. Moreover, our approach can significantly reduce classification bias in classification tasks.
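The abstract names the ingredients of the method (grey relational analysis for similarity, MI for feature relevance, KNN for selecting donors) but this preview does not give the formulas. The following is therefore only a minimal sketch of how such a feature-weighted grey KNN imputation could look, not the authors' FWGKNN implementation. It assumes a histogram-based MI estimate, the standard grey relational coefficient with distinguishing coefficient rho = 0.5, MI values normalised into feature weights, and a similarity-weighted mean of the K most similar complete rows. All function names and parameters (`mutual_information`, `fw_grey_knn_impute`, `bins`, `rho`) are hypothetical.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Histogram-based estimate of the MI (in nats) between two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def grey_relational_coefficients(query, candidates, rho=0.5):
    """Grey relational coefficient of every candidate feature value w.r.t. the query."""
    delta = np.abs(candidates - query)    # absolute differences, shape (n, d)
    dmin, dmax = delta.min(), delta.max()
    if dmax == 0:                         # all candidates identical to the query
        return np.ones_like(delta)
    return (dmin + rho * dmax) / (delta + rho * dmax)


def fw_grey_knn_impute(X, row, col, k=3, rho=0.5, bins=8):
    """Estimate the missing entry X[row, col] from the complete rows of X.

    Assumption (not taken from the paper): each observed feature is weighted
    by its normalised MI with the feature being imputed.
    """
    complete = X[~np.isnan(X).any(axis=1)]
    obs = [j for j in range(X.shape[1])
           if j != col and not np.isnan(X[row, j])]

    # Min-max scale the observed features so differences are comparable.
    lo = complete[:, obs].min(axis=0)
    span = complete[:, obs].max(axis=0) - lo
    span[span == 0] = 1.0
    cand = (complete[:, obs] - lo) / span
    query = (X[row, obs] - lo) / span

    # Feature relevance: MI between each observed feature and the target feature.
    mi = np.array([mutual_information(complete[:, j], complete[:, col], bins)
                   for j in obs])
    w = mi / mi.sum() if mi.sum() > 0 else np.full(len(obs), 1.0 / len(obs))

    # Weighted grey relational grade: larger means more similar to the query.
    grade = grey_relational_coefficients(query, cand, rho) @ w
    nearest = np.argsort(grade)[-k:]
    return float(np.average(complete[nearest, col], weights=grade[nearest]))
```

In this sketch a feature that carries little information about the target contributes little to the relational grade, so it has less influence on which neighbours are selected. Setting all weights equal recovers an unweighted grey KNN, which is one way to compare the two behaviours on the same data.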

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China (Grant Nos. 71172219 and 71302056), the Humanity and Social Science Youth Foundation of Ministry of Education, China (Grant No. 10YJC630352), and the Research Foundation of Education Department of Anhui Province of China (Grant No. SK2012B578).

Author information

Authors and Affiliations

School of Management Science and Engineering, Anhui University of Technology, Maanshan, 243032, People’s Republic of China
Ruilin Pan, Tingsheng Yang, Jianhua Cao, Ke Lu & Zhanchao Zhang

Corresponding author

Correspondence to Ruilin Pan.

About this article

Cite this article

Pan, R., Yang, T., Cao, J. et al. Missing data imputation by K nearest neighbours based on grey relational structure and mutual information. Appl Intell 43, 614–632 (2015). https://doi.org/10.1007/s10489-015-0666-x

Published: 10 May 2015
Issue Date: October 2015
DOI: https://doi.org/10.1007/s10489-015-0666-x
