Abstract
This paper addresses the problem of clustering categorical data with missing values. Specifically, we design a new framework that imputes missing values and assigns objects to appropriate clusters. For the imputation step, we use a decision tree-based method to fill in missing values. For the clustering step, we use a kernel density estimation approach to define cluster centers and an information theoretic-based dissimilarity measure to quantify the differences between objects. On this basis, we propose k-CCM, a center-based algorithm for clustering categorical data with missing values. An experimental evaluation on real-life datasets with missing values compares the clustering quality of the proposed algorithm with that of other popular clustering algorithms. Overall, the experimental results show that the proposed algorithm performs comparably to the other algorithms on all datasets.
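As a rough illustration of the clustering step, the sketch below shows one way a kernel-smoothed cluster center could be computed for categorical attributes and used to assign an object to its nearest cluster. It is not the k-CCM implementation. It assumes the Aitchison-Aitken kernel with a fixed bandwidth `lam`, and it replaces the information theoretic-based dissimilarity measure with a simple probability-based stand-in. The function names (`kernel_center`, `build_centers`, `dissimilarity`, `assign`) and the toy data are hypothetical, and the imputation step is omitted, so all values are assumed to be already filled in.

```python
from collections import Counter


def kernel_center(values, categories, lam=0.2):
    """Kernel-smoothed category probabilities for one attribute of one cluster.

    Assumption: Aitchison-Aitken kernel, which gives weight 1 - lam to a
    matching category and lam / (c - 1) to each non-matching category.
    Requires at least two categories.
    """
    c = len(categories)
    n = len(values)
    counts = Counter(values)
    center = {}
    for v in categories:
        f = counts[v] / n  # relative frequency of v in the cluster
        center[v] = (1 - lam) * f + (lam / (c - 1)) * (1 - f)
    return center


def build_centers(objects, categories_per_attr, lam=0.2):
    """One smoothed distribution per attribute, computed column by column."""
    columns = list(zip(*objects))
    return [kernel_center(col, cats, lam)
            for col, cats in zip(columns, categories_per_attr)]


def dissimilarity(obj, centers):
    """Stand-in measure (an assumption, not the paper's measure):
    sum over attributes of 1 - P(observed value) under the center."""
    return sum(1.0 - center[v] for v, center in zip(obj, centers))


def assign(obj, cluster_centers):
    """Index of the cluster whose center is least dissimilar to obj."""
    return min(range(len(cluster_centers)),
               key=lambda k: dissimilarity(obj, cluster_centers[k]))


# Toy data: two attributes, two clusters (hypothetical values).
categories = [["a", "b", "c"], ["x", "y"]]
cluster0 = [("a", "x"), ("a", "x"), ("b", "x")]
cluster1 = [("c", "y"), ("c", "y")]
centers = [build_centers(cluster0, categories),
           build_centers(cluster1, categories)]
label = assign(("a", "x"), centers)
```

Because the kernel spreads a small amount of probability over unseen categories, every category gets a nonzero weight in each center and the weights for one attribute still sum to one. A full center-based loop would alternate between recomputing the centers and reassigning objects until the assignments stop changing.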
Acknowledgment
This paper is based upon work supported in part by the Air Force Office of Scientific Research/Asian Office of Aerospace Research and Development (AFOSR/AOARD) under award number FA2386-17-1-4046.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Dinh, DT., Huynh, VN. (2018). k-CCM: A Center-Based Algorithm for Clustering Categorical Data with Missing Values. In: Torra, V., Narukawa, Y., Aguiló, I., González-Hidalgo, M. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2018. Lecture Notes in Computer Science, vol 11144. Springer, Cham. https://doi.org/10.1007/978-3-030-00202-2_22
DOI: https://doi.org/10.1007/978-3-030-00202-2_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00201-5
Online ISBN: 978-3-030-00202-2
eBook Packages: Computer Science, Computer Science (R0)