Categorical data clustering with automatic selection of cluster number

  • Original Article
Fuzzy Information and Engineering

Abstract

In this paper, we investigate the problem of determining the number of clusters in k-modes-based categorical data clustering. We propose a new categorical data clustering algorithm that selects k automatically. The new algorithm extends the k-modes clustering algorithm by adding a penalty term to the objective function so that more clusters compete for objects. In the new objective function, a regularization parameter controls the number of clusters produced by the clustering process. Instead of searching for k directly, we choose the value of the regularization parameter whose clustering result is the most stable among all the generated clustering results. Experimental results on synthetic and real data sets demonstrate the effectiveness of the proposed algorithm.
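For orientation, the baseline that the abstract extends can be sketched as follows. This is a minimal sketch of plain k-modes only, assuming the usual simple matching dissimilarity and a fixed k; it does not reproduce the paper's penalty term, regularization parameter, or stability criterion, whose exact forms are not given on this page. The function names `matching_dissimilarity` and `k_modes` are illustrative choices, not taken from the article.

```python
import random
from collections import Counter


def matching_dissimilarity(x, z):
    """Count the attributes on which object x and mode z disagree."""
    return sum(a != b for a, b in zip(x, z))


def k_modes(data, k, max_iter=100, seed=0):
    """Plain k-modes with fixed k (illustrative sketch, not the paper's method).

    data: list of equal-length tuples of categorical values.
    Returns (labels, modes, cost), where cost is the total dissimilarity
    between each object and the mode of its cluster.
    """
    rng = random.Random(seed)
    # Assumption: initial modes are k objects drawn at random from the data.
    modes = [tuple(row) for row in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes are too
        labels = new_labels
        # Update step: each mode takes the most frequent category per attribute.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = tuple(
                    Counter(column).most_common(1)[0][0]
                    for column in zip(*members)
                )
    cost = sum(matching_dissimilarity(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost
```

Calling `k_modes(data, k)` on a list of categorical tuples returns the cluster labels, the cluster modes, and the total matching cost. In this baseline k must be supplied by the user, which is the limitation the abstract's penalized objective is meant to address.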

Author information

Corresponding author

Correspondence to Michael K. Ng.

About this article

Cite this article

Liao, Hy., Ng, M.K. Categorical data clustering with automatic selection of cluster number. Fuzzy Inf. Eng. 1, 5–25 (2009). https://doi.org/10.1007/s12543-009-0001-5

  • DOI: https://doi.org/10.1007/s12543-009-0001-5
