Clustering Based on Compressed Data for Categorical and Mixed Attributes

  • Erendira Rendón
  • José Salvador Sánchez
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4109)


Clustering in data mining is a discovery process that groups a set of data so as to maximize intra-cluster similarity and minimize inter-cluster similarity. It becomes more challenging when the data are categorical and the available memory is smaller than the data set. In this paper, we introduce CBC (Clustering Based on Compressed Data), an extension of the BIRCH algorithm that is especially suitable for very large databases and can work with both categorical attributes and mixed features. The effectiveness and performance of the CBC procedure were compared with those of the well-known K-modes clustering algorithm, demonstrating that the CBC summary process does not affect the final clustering, while execution times can be drastically reduced.
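For readers unfamiliar with the K-modes baseline named in the abstract, the following is a minimal sketch of its two usual building blocks for categorical objects: a simple-matching dissimilarity (the number of attributes on which two objects differ) and a cluster "mode" (the most frequent value of each attribute). It is not the CBC procedure, which the abstract does not specify; the function names `matching_dissimilarity` and `cluster_mode` and the sample data are illustrative assumptions.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(objects):
    """Return the most frequent value of each attribute across a cluster."""
    # zip(*objects) regroups the data attribute by attribute (column-wise).
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


# Hypothetical cluster of three categorical objects (illustrative data only).
cluster = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "small", "round"),
]
mode = cluster_mode(cluster)
distance = matching_dissimilarity(mode, ("blue", "large", "round"))
```

In a K-modes-style iteration, each object would be assigned to the cluster whose mode is nearest under this dissimilarity, after which the modes are recomputed.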


Keywords: Cluster Algorithm, Leaf Node, Compress Data, Categorical Object, Memory Size



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Erendira Rendón (1)
  • José Salvador Sánchez (2)
  1. Lab. Reconocimiento de Patrones, Instituto Tecnológico de Toluca, Metepec, Mexico
  2. Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I, Castelló de la Plana, Spain
