k-CCM: A Center-Based Algorithm for Clustering Categorical Data with Missing Values

  • Duy-Tai Dinh
  • Van-Nam Huynh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11144)


This paper addresses the problem of clustering categorical data with missing values. Specifically, we design a new framework that imputes missing values and assigns objects to appropriate clusters. For the imputation step, we use a decision tree-based method to fill in missing values. For the clustering step, we use a kernel density estimation approach to define cluster centers and an information theoretic-based dissimilarity measure to quantify the differences between objects. Building on these components, we propose a center-based algorithm for clustering categorical data with missing values, namely k-CCM. An experimental evaluation was performed on real-life datasets with missing values to compare the clustering quality of the proposed algorithm with that of other popular clustering algorithms. Overall, the experimental results show that the proposed algorithm performs comparably to the other algorithms on all datasets.
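
To make the idea of a kernel-smoothed cluster center concrete, the following is a minimal Python sketch and not the k-CCM implementation. It assumes the Aitchison-Aitken kernel with a fixed bandwidth, represents a missing value as `None`, and simply skips missing values instead of imputing them with a decision tree. The function names (`smoothed_center`, `dissimilarity`, `assign`) are hypothetical, and the dissimilarity used here (one minus the center's probability of the object's value) is a simple stand-in for the information theoretic-based measure named in the abstract.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing value (not from the paper)


def smoothed_center(column, domain, bandwidth=0.2):
    """Kernel-smoothed probability of each category for one attribute of a cluster.

    Averages the Aitchison-Aitken kernel over the observed values:
    K(x, v) = 1 - bandwidth if x == v, else bandwidth / (len(domain) - 1).
    """
    observed = [v for v in column if v is not MISSING]
    counts = Counter(observed)
    n, m = len(observed), len(domain)
    if m == 1:
        return {domain[0]: 1.0}
    center = {}
    for value in domain:
        # Fall back to a uniform frequency when nothing was observed.
        freq = counts[value] / n if n else 1.0 / m
        center[value] = (1 - bandwidth) * freq + bandwidth / (m - 1) * (1 - freq)
    return center


def dissimilarity(obj, centers):
    """Stand-in measure: sum of (1 - center probability) over non-missing attributes."""
    return sum(1.0 - c[v] for v, c in zip(obj, centers) if v is not MISSING)


def assign(objects, cluster_centers):
    """Assign each object to the index of its least dissimilar cluster center."""
    return [
        min(range(len(cluster_centers)),
            key=lambda k: dissimilarity(obj, cluster_centers[k]))
        for obj in objects
    ]


# Two toy clusters over two attributes with domains {a, b} and {x, y}.
domains = [["a", "b"], ["x", "y"]]
clusters = [
    [("a", "x"), ("a", "x")],
    [("b", "y"), ("b", MISSING)],
]
cluster_centers = [
    [smoothed_center([row[j] for row in rows], domains[j]) for j in range(2)]
    for rows in clusters
]
labels = assign([("a", MISSING), ("b", "y")], cluster_centers)
```

Because the kernel spreads a small amount of probability mass over unobserved categories, every category keeps a nonzero probability in the center, which keeps the dissimilarity well defined for values that never appear in a cluster.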


Keywords: Data mining, Partitional clustering, Categorical data, Missing values, Incomplete dataset, Decision tree-based imputation



This paper is based upon work supported in part by the Air Force Office of Scientific Research/Asian Office of Aerospace Research and Development (AFOSR/AOARD) under award number FA2386-17-1-4046.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. School of Knowledge Science, Japan Advanced Institute of Science and Technology, Nomi, Japan
