Abstract
In this paper we propose a new index, Z, for measuring the dissimilarity between two hierarchical clusterings (dendrograms). The index is a metric: it satisfies the axioms of non-negativity, symmetry and the triangle inequality. A desirable property is that Z can be decomposed into additive contributions, one for each stage of the hierarchies. We show how these components relate to the criteria currently used for comparing two partitions. We obtain a global similarity index as the complement to one of the proposed dissimilarity, and we derive its adjustment for agreement due to chance. We likewise obtain a similarity index for each stage of the hierarchies as the complement to one of the corresponding additive part of the global distance Z. We also consider the use of the proposed distance with more than two dendrograms, and its use for the consensus of classifications and for variable selection in cluster analysis. A series of simulation experiments and an application to a real data set are presented.
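The abstract does not state the formula for Z, so the following is only an illustrative sketch of the general idea, not the authors' definition. It assumes that each stage of a hierarchy is given as a list of cluster labels, that two partitions are compared through the L1 distance between their binary pair co-membership vectors, and that the global distance is the plain average of the per-stage distances. The function names (`comembership`, `stage_distance`, `hierarchy_distance`) and the toy hierarchies are hypothetical.

```python
from itertools import combinations


def comembership(labels):
    """Binary vector over all object pairs: 1 if the pair shares a cluster."""
    return [int(labels[i] == labels[j])
            for i, j in combinations(range(len(labels)), 2)]


def stage_distance(labels_a, labels_b):
    """L1 distance between co-membership vectors, divided by the pair count."""
    p, q = comembership(labels_a), comembership(labels_b)
    return sum(abs(x - y) for x, y in zip(p, q)) / len(p)


def hierarchy_distance(stages_a, stages_b):
    """Average of the per-stage distances (an assumed normalisation).

    Returns the total and the list of additive per-stage contributions.
    """
    parts = [stage_distance(a, b) / len(stages_a)
             for a, b in zip(stages_a, stages_b)]
    return sum(parts), parts


# Two hypothetical hierarchies on 4 objects, cut into k = 2 and k = 3 groups.
hier_a = [[0, 0, 1, 1], [0, 0, 1, 2]]
hier_b = [[0, 1, 1, 1], [0, 1, 2, 2]]

total, parts = hierarchy_distance(hier_a, hier_b)
print(total, parts)
```

Under these assumptions the total is a sum of one term per stage, and the complement to one of a single `stage_distance` value equals the Rand index of the two partitions at that stage, which is one example of a standard criterion for comparing partitions.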
Cite this article
Morlini, I., Zani, S. Dissimilarity and similarity measures for comparing dendrograms and their applications. Adv Data Anal Classif 6, 85–105 (2012). https://doi.org/10.1007/s11634-012-0106-2
DOI: https://doi.org/10.1007/s11634-012-0106-2
Keywords
- Cluster analysis
- Consensus of classifications
- Distance
- Hierarchical trees
- L1 norm
- Partitions
- Similarity of dendrograms
- Variable selection