“Best K”: critical clustering structures in categorical datasets

  • Regular Paper
Knowledge and Information Systems

Abstract

The demand for cluster analysis of categorical data has continued to grow over the last decade. A well-known problem in categorical clustering is determining the best number of clusters, K. Although several categorical clustering algorithms have been developed, surprisingly, none has satisfactorily addressed the problem of the best K for categorical clustering. Since categorical data do not have an inherent distance function to serve as a similarity measure, traditional cluster validation techniques based on geometric shapes and density distributions are not appropriate for categorical data. In this paper, we study the entropy property between the clustering results of categorical data with different numbers of clusters K, and propose the BKPlot method to address three important cluster validation problems: (1) How can we determine whether there is significant clustering structure in a categorical dataset? (2) If there is significant clustering structure, what is the set of candidate “best Ks”? (3) If the dataset is large, how can we efficiently and reliably determine the best Ks?
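
To make the entropy idea concrete, the following minimal sketch computes a generic entropy score for a partition of categorical records: the size-weighted sum of per-attribute Shannon entropies within each cluster. This is not the BKPlot method itself, which the abstract does not specify. The function names `column_entropy` and `partition_entropy`, the toy data, and the choice to sum entropies attribute by attribute are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy, in bits, of one categorical column."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def partition_entropy(rows, labels):
    """Size-weighted sum of per-attribute entropies over all clusters.

    Assumption for illustration: attributes are scored independently,
    so a cluster's entropy is the sum of its column entropies.
    """
    n = len(rows)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    total = 0.0
    for members in clusters.values():
        weight = len(members) / n
        # zip(*members) transposes the cluster's rows into columns.
        total += weight * sum(column_entropy(col) for col in zip(*members))
    return total


# Hypothetical toy dataset: four records, two categorical attributes.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
one_cluster = partition_entropy(rows, [0, 0, 0, 0])
two_clusters = partition_entropy(rows, [0, 0, 1, 1])
```

On this toy data the score drops from 2 bits with a single cluster to 0 bits once the two homogeneous groups are separated. How such a score changes as K varies is the kind of quantity an entropy-based validation method can examine.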

Author information

Corresponding author

Correspondence to Keke Chen.

About this article

Cite this article

Chen, K., Liu, L. “Best K”: critical clustering structures in categorical datasets. Knowl Inf Syst 20, 1–33 (2009). https://doi.org/10.1007/s10115-008-0159-x

  • DOI: https://doi.org/10.1007/s10115-008-0159-x
