Categorical data clustering with automatic selection of cluster number

Liao, Hai-yong; Ng, Michael K.

doi:10.1007/s12543-009-0001-5

Original Article
Published: 18 March 2009

Volume 1, pages 5–25, (2009)

Fuzzy Information and Engineering

Abstract

In this paper, we investigate the problem of determining the number of clusters in the k-modes-based categorical data clustering process. We propose a new categorical data clustering algorithm with automatic selection of k. The new algorithm extends the k-modes clustering algorithm by adding a penalty term to the objective function so that more clusters compete for objects. In the new objective function, a regularization parameter controls the number of clusters produced by the clustering process. Instead of finding k directly, we choose the value of the regularization parameter whose clustering result is the most stable among all the generated clustering results. Experimental results on synthetic and real data sets demonstrate the effectiveness of the proposed algorithm.
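
For orientation, the baseline that the proposed method extends can be sketched as follows. This is a minimal illustration of standard k-modes (simple matching dissimilarity, per-attribute mode updates), not the authors' algorithm: the abstract does not give the form of the penalty term or the stability criterion, so neither is reproduced here. The function names, the `seed` and `max_iter` parameters, and the choice to draw initial modes from distinct records are all assumptions made for this sketch.

```python
from collections import Counter
import random


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent category per attribute (ties go to the first seen)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(data, k, max_iter=100, seed=0):
    """Plain k-modes with a fixed k; returns (labels, modes, cost).

    Assumption of this sketch: initial modes are k distinct records,
    so k must not exceed the number of distinct records in `data`.
    """
    rng = random.Random(seed)
    modes = rng.sample(sorted(set(data)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments unchanged, so the modes are too
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    # Objective: total mismatches between records and their cluster modes.
    cost = sum(matching_dissimilarity(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost


if __name__ == "__main__":
    toy = [("a", "x", "p")] * 3 + [("b", "y", "q")] * 3
    labels, modes, cost = k_modes(toy, k=2)
    print(labels, modes, cost)
```

Here k is supplied by the caller, which is exactly the requirement the paper sets out to remove; a regularized variant would add a penalty to `cost` and select the parameter value by comparing results across runs.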

Author information

Authors and Affiliations

Centre for Mathematical Imaging and Vision and Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
Hai-yong Liao & Michael K. Ng

Corresponding author

Correspondence to Michael K. Ng.

About this article

Cite this article

Liao, Hy., Ng, M.K. Categorical data clustering with automatic selection of cluster number. Fuzzy Inf. Eng. 1, 5–25 (2009). https://doi.org/10.1007/s12543-009-0001-5

Received: 25 August 2008
Revised: 20 September 2008
Accepted: 03 November 2008
Published: 18 March 2009
Issue Date: March 2009
DOI: https://doi.org/10.1007/s12543-009-0001-5
