Abstract
In this paper, we investigate the problem of determining the number of clusters in k-modes-based categorical data clustering. We propose a new categorical data clustering algorithm that selects k automatically. The new algorithm extends the k-modes clustering algorithm by adding a penalty term to the objective function so that more clusters compete for objects. In the new objective function, a regularization parameter controls the number of clusters produced by the clustering process. Instead of finding k directly, we choose a value of the regularization parameter for which the corresponding clustering result is the most stable among all the generated clustering results. Experimental results on synthetic and real data sets demonstrate the effectiveness of the proposed algorithm.
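For orientation, the baseline that the abstract says is being extended can be sketched as follows. This is a minimal illustration of plain k-modes with the simple matching dissimilarity, not the proposed algorithm: the abstract does not give the form of the penalty term or the stability criterion, so neither is reproduced here. The function names, the toy data, and the choice of initial modes are assumptions made for the example.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent category per attribute (ties go to the first one seen)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(data, initial_modes, max_iter=100):
    """Plain k-modes: alternate assignment and mode update until labels stop changing."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for j in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    # Objective value: total dissimilarity of records to their assigned modes.
    cost = sum(matching_dissimilarity(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost


# Hypothetical toy data: six records with three categorical attributes each.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("b", "w", "r"),
]
labels, modes, cost = k_modes(data, initial_modes=[data[0], data[3]])
print(labels, modes, cost)
```

In this baseline, k is fixed by the number of initial modes. The approach described in the abstract would instead start from more clusters than needed and let a regularized objective decide how many survive.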
Cite this article
Liao, Hy., Ng, M.K. Categorical data clustering with automatic selection of cluster number. Fuzzy Inf. Eng. 1, 5–25 (2009). https://doi.org/10.1007/s12543-009-0001-5