Abstract
Missing data imputation is an important research topic in data mining. Previous work has seldom considered the impact of noise, even though real-world data often contain substantial noise. In this paper, we systematically investigate the impact of noise on imputation methods and propose a new imputation approach, RIBG (robust imputation based on GMDH), which introduces the mechanism of the Group Method of Data Handling (GMDH) to deal with incomplete data that contain noise. We compare RIBG with four commonly used imputation methods on nine benchmark datasets. The experimental results demonstrate that noise has a great impact on the effectiveness of imputation techniques and that RIBG is more robust to noise than the four benchmark methods.
Additional information
This work is supported by National Natural Science Foundation of China (Grant No. 70771067) and the NSFC/RS (Royal Society of the UK) International Joint Project (Grant No. 70911130133).
About this article
Cite this article
Zhu, B., He, C. & Liatsis, P. A robust missing value imputation method for noisy data. Appl Intell 36, 61–74 (2012). https://doi.org/10.1007/s10489-010-0244-1