Measures of Dispersion and Cluster-Trees for Categorical Data

  • Ulrich Müller-Funk
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)


A clustering algorithm is essentially characterized by two features: (1) the way in which heterogeneity within and between clusters is measured (the objective function), and (2) the steps by which splitting or merging proceeds. For categorical data there are no "standard indices" formalizing the first aspect; instead, a number of ad hoc concepts labelled "similarity", "information", "impurity" and the like have been used in cluster analysis. To clarify matters, we start out from a set of axioms summarizing our conception of "dispersion" for categorical attributes. Unsurprisingly, it turns out that some well-known measures, including the Gini index and the entropy, qualify as measures of dispersion. We indicate how these measures can be used in unsupervised classification problems as well. Due to its simple analytic form, the Gini index admits a dispersion-decomposition formula that can serve as the starting point for a CART-like cluster tree. Trees are favoured for two reasons: (i) factor selection and (ii) communicability.
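The two measures named in the abstract, and the decomposition idea behind a CART-like split, can be sketched as follows. This is an illustrative sketch, not the paper's own formulation; the function names and the gain criterion are the standard textbook versions (Gini index 1 − Σ p_k², Shannon entropy −Σ p_k log p_k, and gain = parent dispersion minus the frequency-weighted dispersion of the children).

```python
from collections import Counter
import math

def gini(labels):
    """Gini index: 1 - sum(p_k^2). Zero for a pure group,
    approaching 1 - 1/K for a uniform spread over K categories."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum(p_k * log p_k), natural log.
    Zero for a pure group, log K for a uniform spread."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def gini_gain(parent, children):
    """Dispersion decomposition: parent dispersion minus the
    weighted within-child dispersion. A CART-like tree picks
    the split that maximizes this gain."""
    n = len(parent)
    within = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - within
```

For example, splitting the mixed group `["a", "a", "b", "b"]` into the two pure children `["a", "a"]` and `["b", "b"]` removes all within-cluster dispersion, so the gain equals the parent's Gini index of 0.5.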


Keywords: Cluster Algorithm · Categorical Data · Association Rule · Probability Vector · Gini Index





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Ulrich Müller-Funk, ERCIS, Münster, Germany
