Applied Intelligence, Volume 35, Issue 1, pp 123–133

Shell-neighbor method and its application in missing data imputation

Article

Abstract

Data preparation is an important step in mining incomplete data. To support this step, this paper introduces a new imputation approach called Shell Neighbors (SN) imputation, or simply SNI. SNI fills in an incomplete instance (one with missing values) in a given dataset using only its left and right nearest neighbors with respect to each factor (attribute), which are referred to as its Shell Neighbors. The left and right nearest neighbors are selected from a set of nearest neighbors of the incomplete instance, and the size of this set is determined by cross-validation. SNI is then generalized to handle missing data in datasets with mixed attributes, for example continuous and categorical attributes. Experiments evaluating the proposed approach demonstrate that the generalized SNI method outperforms the kNN imputation method in both imputation accuracy and classification accuracy.
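The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `shell_neighbor_impute`, the Euclidean distance over observed attributes, the choice of "nearest" as closest in attribute value, and the use of a plain mean over the shell neighbors are all assumptions made here for illustration. The sketch handles numeric attributes only, and `k` is passed in directly rather than tuned by cross-validation.

```python
import math


def shell_neighbor_impute(query, complete_rows, k):
    """Fill None entries of `query` from its shell neighbors (illustrative sketch).

    Assumptions (not taken from the paper): Euclidean distance on observed
    attributes, and the mean of the shell neighbors as the imputed value.
    """
    if not complete_rows:
        raise ValueError("need at least one complete row")

    observed = [j for j, v in enumerate(query) if v is not None]
    missing = [j for j, v in enumerate(query) if v is None]

    def dist(row):
        return math.sqrt(sum((row[j] - query[j]) ** 2 for j in observed))

    # Candidate set: the k nearest complete rows, ordered by distance.
    candidates = sorted(complete_rows, key=dist)[:k]

    shell = []  # indices into `candidates`
    for j in observed:
        left = [i for i, r in enumerate(candidates) if r[j] <= query[j]]
        right = [i for i, r in enumerate(candidates) if r[j] >= query[j]]
        for side in (left, right):
            if side:
                # Closest value on attribute j; ties go to the overall-nearest row.
                best = min(side, key=lambda i: abs(candidates[i][j] - query[j]))
                if best not in shell:
                    shell.append(best)

    if not shell:
        # No observed attributes: fall back to every candidate.
        shell = list(range(len(candidates)))

    filled = list(query)
    for j in missing:
        filled[j] = sum(candidates[i][j] for i in shell) / len(shell)
    return filled


rows = [[1.0, 10.0], [2.0, 20.0], [4.0, 40.0], [9.0, 90.0]]
print(shell_neighbor_impute([3.0, None], rows, k=3))  # → [3.0, 30.0]
```

In this toy example the three nearest rows are `[2.0, 20.0]`, `[4.0, 40.0]` and `[1.0, 10.0]`, but only the first two bracket the query on the observed attribute, so the imputed value is their mean, 30.0. A plain kNN mean over all three candidates would give about 23.3, which shows how restricting to shell neighbors changes the estimate.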

Keywords

kNN, Shell-NN, Missing data imputation, Mining incomplete data

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Computer Science, Zhejiang Normal University, Jinhua, China
  2. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
