Missing Value Imputation Based on Data Clustering

  • Shichao Zhang
  • Jilian Zhang
  • Xiaofeng Zhu
  • Yongsong Qin
  • Chengqi Zhang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4750)


We propose an efficient nonparametric missing value imputation method based on clustering, called CMI (Clustering-based Missing value Imputation), for dealing with missing values in target attributes. In our approach, we impute the missing values of an instance A with plausible values generated, using a kernel-based method, from the instances that contain no missing values and are most similar to A. Specifically, we first divide the dataset (including the instances with missing values) into clusters. Next, the missing values of an instance A are filled in with plausible values generated from A’s cluster. Extensive experiments show the effectiveness of the proposed method in the missing value imputation task.
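
The two steps described in the abstract (cluster all instances, then impute each missing target value from the complete instances in the same cluster with a kernel-based estimate) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract names neither the clustering algorithm nor the kernel, so plain k-means and a Gaussian-kernel weighted average (a Nadaraya-Watson style estimator) are assumed here. The function names `kmeans` and `cmi_impute`, the `bandwidth` parameter, and the toy data are all hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on a list of equal-length tuples; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: initial centers are random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the old center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def cmi_impute(X, y, k=2, bandwidth=1.0, seed=0):
    """Fill None entries of y using complete instances from the same cluster.

    X holds the fully observed attributes (tuples); y is the target attribute,
    with None marking a missing value. The Gaussian kernel is an assumption.
    """
    labels = kmeans(X, k, seed=seed)  # clustering includes instances with missing targets
    imputed = list(y)
    for i, yi in enumerate(y):
        if yi is not None:
            continue
        donors = [j for j, yj in enumerate(y) if yj is not None and labels[j] == labels[i]]
        if not donors:
            # Assumed fallback: no complete instance in this cluster, so use all complete ones.
            donors = [j for j, yj in enumerate(y) if yj is not None]
        weights = [math.exp(-0.5 * (math.dist(X[i], X[j]) / bandwidth) ** 2) for j in donors]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed to zero; fall back to the unweighted donor mean.
            imputed[i] = sum(y[j] for j in donors) / len(donors)
        else:
            imputed[i] = sum(w * y[j] for w, j in zip(weights, donors)) / total
    return imputed


# Toy data (hypothetical): two well-separated groups, one missing target in each.
X = [(0.0,), (0.2,), (0.1,), (10.0,), (10.2,), (10.1,)]
y = [1.0, 1.2, None, 5.0, 5.2, None]
imputed = cmi_impute(X, y, k=2, bandwidth=1.0)
print(imputed)
```

In this sketch each missing value is estimated only from donors in its own cluster, so the instance near 0 draws on the targets 1.0 and 1.2 and is unaffected by the distant group. The number of clusters and the kernel bandwidth are free parameters here and would need to be tuned for real data.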





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Shichao Zhang (1)
  • Jilian Zhang (2)
  • Xiaofeng Zhu (1)
  • Yongsong Qin (1)
  • Chengqi Zhang (3)
  1. Department of Computer Science, Guangxi Normal University, Guilin, China
  2. School of Information Systems, Singapore Management University, Singapore
  3. Faculty of Information Technology, University of Technology Sydney, Australia
