kDMI: A Novel Method for Missing Values Imputation Using Two Levels of Horizontal Partitioning in a Data set

  • Md. Geaur Rahman
  • Md Zahidul Islam
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8347)


Imputation of missing values is an important data mining task for improving the quality of data mining results. Imputation based on similar records is generally more accurate than imputation based on all records of a data set. Therefore, in this paper we present a novel algorithm called kDMI that employs two levels of horizontal partitioning of a data set (the first based on a decision tree, the second on the k-NN algorithm) in order to find the records that are most similar to a record with one or more missing values. Additionally, it uses a novel approach to automatically find the value of k for each record. We compare the performance of kDMI against three high-quality existing methods on two real data sets using four evaluation criteria. Our initial experimental results, including 95% confidence interval analysis and statistical t-test analysis, indicate the superiority of kDMI over the existing methods.


Keywords: data pre-processing, data cleansing, missing value imputation, EM algorithm, decision trees
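The two-level partitioning described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: a single threshold split stands in for the decision tree, the mean of the k nearest neighbours stands in for the final imputation step, and `choose_k` uses a simple hold-out heuristic (hide one known value of the record and keep the k that recovers it best), since the abstract does not specify how kDMI selects k. The function names `leaf_of`, `knn_mean`, `choose_k`, and `impute` are hypothetical.

```python
import math


def leaf_of(record, split_attr, threshold):
    """Level 1 (assumed stand-in for a decision tree): one split, two leaves."""
    return 0 if record[split_attr] <= threshold else 1


def distance(a, b, attrs):
    """Euclidean distance over the given attribute indices."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in attrs))


def knn_mean(target, pool, attrs, value_attr, k):
    """Level 2: mean of value_attr over the k records in pool nearest to target."""
    nearest = sorted(pool, key=lambda r: distance(target, r, attrs))[:k]
    return sum(r[value_attr] for r in nearest) / len(nearest)


def choose_k(target, pool, known):
    """Assumed heuristic: hide the last known value and keep the k that recovers it best."""
    probe, rest = known[-1], known[:-1]
    best_k, best_err = 1, float("inf")
    for k in range(1, len(pool) + 1):
        err = abs(knn_mean(target, pool, rest, probe, k) - target[probe])
        if err < best_err:
            best_k, best_err = k, err
    return best_k


def impute(record, complete, split_attr, threshold):
    """Fill None entries using only complete records that fall in the same leaf."""
    known = [j for j, v in enumerate(record) if v is not None]
    missing = [j for j, v in enumerate(record) if v is None]
    leaf = leaf_of(record, split_attr, threshold)
    pool = [r for r in complete if leaf_of(r, split_attr, threshold) == leaf]
    k = choose_k(record, pool, known)  # k is chosen separately for each record
    filled = list(record)
    for m in missing:
        filled[m] = knn_mean(record, pool, known, m, k)
    return filled


# Toy data: two clearly separated groups of complete records.
complete = [
    [1.0, 10.0, 100.0],
    [1.2, 11.0, 110.0],
    [1.4, 12.0, 120.0],
    [8.0, 50.0, 500.0],
    [8.5, 52.0, 520.0],
]
target = [1.1, 10.5, None]  # third attribute is missing

print(impute(target, complete, split_attr=0, threshold=5.0))
```

The sketch assumes numerical attributes, that the split attribute is present in the incomplete record, and that the record has at least two known values so that one can be hidden while choosing k. In the toy data, only the three records sharing the target's leaf are eligible neighbours, which is the point of partitioning before running k-NN.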





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Md. Geaur Rahman (1)
  • Md Zahidul Islam (1)
  1. Center for Research in Complex Systems, School of Computing and Mathematics, Charles Sturt University, Bathurst, Australia
