Abstract
In evaluating the results of cluster analysis, it is common practice to rely on a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering obtained from human informants. Given the dearth of research into techniques for expressing the similarity between clusterings, there is broad scope for fundamental research in this area. In defining the comparative problem, we identify two types of worst-case matches between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. Desirable behaviour for a similarity measure in each of the two worst cases is discussed, giving rise to five test scenarios in which characteristics of one of a pair of clusterings were manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously proposed clustering similarity measures, as well as for a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. Accompanying this is the proposal of a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
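As a concrete illustration of one family of measures named above, the following is a minimal Python sketch of Normalised Mutual Information between two clusterings of the same items. It is not the authors' implementation: the function names (`entropy`, `mutual_information`, `nmi`), the choice of the geometric-mean normalisation (one of several forms in use), and the convention adopted for single-cluster inputs are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a clustering given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))          # contingency table as a sparse dict
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (nij / n) * math.log(nij * n / (count_a[x] * count_b[y]))
        for (x, y), nij in joint.items()
    )


def nmi(a, b):
    """NMI normalised by sqrt(H(a) * H(b)); an assumed, commonly used variant."""
    if len(a) != len(b):
        raise ValueError("both clusterings must label the same items")
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        # Assumed convention: two single-cluster labelings count as identical,
        # otherwise a zero-entropy clustering shares no information.
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / math.sqrt(ha * hb)


if __name__ == "__main__":
    reference = [0, 0, 1, 1]
    print(nmi(reference, [1, 1, 0, 0]))  # same partition, labels swapped
    print(nmi(reference, [0, 1, 0, 1]))  # statistically independent partition
```

In this sketch, a relabelled copy of the same partition scores 1 and a statistically independent pair scores 0, which is the kind of boundary behaviour the test scenarios are designed to probe.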
References
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data-mining applications
Arabie P, Boorman SS (1973) Multidimensional scaling of measures of distance between partitions. Math Psychol 10: 148–203
Baroni-Urbani C, Buser MW (1976) Similarity of binary data. Syst Zool 25(3): 251–259
Berkhin P (2002) Survey of clustering data mining techniques. Technical report, Accrue Software
Braun-Blanquet JNY (1932) Plant sociology: the study of plant communities. McGraw-Hill Book Company, Inc, New York
Cheeseman P, Stutz J (1996) Bayesian classification (autoclass): theory and results. In: Fayyad UN, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI/MIT Press, Cambridge, pp 153–180
Coombs CH, Dawes RM, Tversky A (1970) Mathematical psychology: an elementary introduction. Prentice-Hall, Englewood Cliffs, NJ
Dennis RLH, Williams WR, Shreeve TG (1998) Faunal structures among European butterflies: evolutionary implications of bias for geography, endemism and taxonomic affiliation. Ecography 21: 181–203
Dice LE (1945) Measures of the amount of ecologic association between species. Ecology 26(3): 297–302
Fager EW, McGowan JA (1963) Zooplankton species groups in the North Pacific: co-occurrences of species can be used to derive groups whose members react similarly to water-mass types. Science 140: 453–460 doi:10.1126/science.140.3566.453
Faith DP (1983) Asymmetric binary similarity measures. Oecologia 57(3): 287–290
Filkov V, Skiena S (2004) Heterogeneous data integration with the consensus clustering formalism. Data Integration in the Life Sciences (DILS). Int Workshop No 1 2994: 110–123
Forbes S (1925) Method of determining and measuring the associative relations of species. Science 61(1585): 518–524
Fossum TV, Haller SM (2004) Measuring card sort orthogonality. Expert Syst 22(3): 139–146
Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. Am Stat Assoc 78(383): 553–569
Fred A, Jain A (2003) Robust data clustering. In: IEEE computer society conference on computer vision and pattern recognition
Gilbert N, Wells TCE (1966) Analysis of quadrat data. Ecology 54(3): 675–685
Goodall DW (1967) The distribution of the matching coefficient. Biometrics 23(4): 647–656
Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. Intell Inf Syst 17: 107–145
Hamann U (1961) Merkmalbestand und verwandtschaftsbeziehungen der Farinosae: Ein beitrag zum system der monokotyledonen. Willdenowia 2: 639–768
Hayek LC (1994) Analysis of amphibian biodiversity data. In: Heyer WR, Donnelly MA, McDiarmid RW, Hayek L-AC, Foster MS (eds) Measuring and monitoring biological diversity: standard methods for amphibians. Smithsonian Institution Press
Hinneburg A, Keim DA (2003) A general approach to clustering in large databases with noise. Knowl Inf Syst 5(4): 387–415
Holliday JD, Hu C-Y, Willett P (2002) Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2d fragment bit-strings. Comb Chem High Throughput Screen 5(2): 155–166
Horibe Y (1985) Entropy and correlation. IEEE Trans Syst Man Cybern (SMC) SMC-15(5): 641–642
Jaccard P (1901) Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, pp 241–272
Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32(3): 241–254
Karypis G, Han E-H, Kumar V (1999) Chameleon: a hierarchical clustering algorithm using dynamic modeling. IEEE Comput 32(8): 68–75
Knobbe AJ, Adriaans PW (1996) Analysis of binary association. In: Knowledge Discovery and Data Mining (KDD-96). Portland, Oregon, pp 311–314
Kulczynski S (1927) Zespoly roslin w pieninach—die pflanzenassoziationen der pieninen. Bulletin international de l’académie polonaise des sciences et des lettres B(2): 57–203
Kvalseth TO (1987) Entropy and correlation: some comments. IEEE Trans Syst Man Cybern SMC-17: 517–519
Lee TT (1987) An information theoretic analysis of relational databases - part 1: data dependencies and information metric. IEEE Trans Softw Eng SE-13(10): 1049–1061
Linfoot EH (1957) An informational measure of correlation. Inf Control 1: 85–87
Lopez de Mantaras R (1989) ID3 revisited: a distance-based criterion for attribute selection. In: International symposium on methodologies for intelligent systems (ISMIS-89). Charlotte, North Carolina
MacQueen J (1967) Some methods for classification and analysis of multivariate observations
Malvestuto FM (1986) Statistical treatment of the information content of a database. Inf Syst 11(3): 211–223
Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, New York
McConnaughey BH (1964) The determination and analysis of plankton communities. Marine Research Indonesia Special (Penelitian Laut Di Indonesia) Spec. no. 30
Meila M (2003) Comparing clusterings by variation of information. Proceedings of the 16th annual conference of computational learning theory (COLT)
Michael EL (1920) Marine ecology and the coefficient of association: A plea in behalf of quantitative biology. J Ecol 8(1): 54–59
Mirkin B (1996) Mathematical classification and clustering. Kluwer Academic Press, Boston–Dordrecht
Mirkin B (2001) Eleven ways to look at the chi-squared coefficient for contingency tables. Am Stat 55(6): 111–120
Mountford MD (1962) An index of similarity and its application to classificatory problems. In: Murphy PW (ed) Progress in soil zoology. Butterworth, London, pp 43–50
Pawlak Z, Wong SKM, Ziarko W (1988) Rough sets: probabilistic versus deterministic approach. Int J Man Mach Stud 29(1): 81–95
Powers DMW (2007) Expected information in the transmission of an equality selection of distribution/clustering or of individual class labels. Technical report, Flinders University (S.A.)
Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1988) Numerical recipes in C: the art of scientific computing. Cambridge University Press, Cambridge
Quinlan JR (1990) Induction of decision trees. In: Shavlik JW, Dietterich TG (eds) Readings in machine learning, Morgan Kaufmann. Originally published in machine learning 1:81–106, 1986.
Rajski C (1961) A metric space of discrete probability distributions. Inf Control 4(4): 371–377
Rand WM (1971) Objective criteria for evaluation of clustering methods. J Am Stat Assoc 66(336): 846–850
Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132(3434): 1115–1118
Russell PF, Rao TR (1940) On habitat and association of species of anopheline larvae in southeastern Madras. Malaria Inst India 3: 153–178
Savage RM (1934) The breeding behavior of the common frog, Rana temporaria linn., and of the common toad Bufo bufo bufo linn. Zoological Society of London, pp 55–70
Sneath PHA (1968) Vigour and pattern in taxonomy. Gen Microbiol 54(1): 1–11
Sneath PHA, Sokal RR (1973) Numerical taxonomy. Freeman and Company, San Francisco
Sokal RR, Sneath PHA (1964) Principles of numerical taxonomy. Syst Zool 13: 106–108
Sorgenfrei T (1958) Molluscan assemblages from the marine middle miocene of south jutland and their environments
Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining partitionings. Mach Learn Res 3: 583–617
Tarwid K (1960) Szacowanie zbieznosci nisz ekologicznych gatunkow droga oceny prawdopodobienstwa spotykania sie ich w polowach. Ecol Polska B(6): 115–130
Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic Press, New York
Thurstone L (1927) A law of comparative judgement. Psychol Rev 34: 278–286
Wallace DL (1983) A method for comparing two hierarchical clusterings: comment. Am Stat Assoc 78(383): 569–576
Wan SJ, Wong SKM (1989) A measure for concept dissimilarity and its applications in machine learning. In: International conference on computing and information. Toronto North, Canada, pp 23–27
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Amsterdam
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1): 1–37
Yao YY, Wong SKM, Butz CJ (1999) On information theoretic measures of attribute importance. In: Zhong N (ed) PAKDD’99. Beijing, China, pp 133–137
Yule GU (1912) On the methods of measuring association between two attributes. R Soc Lond 75(6): 579–642
Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3): 374–384
Cite this article
Pfitzner, D., Leibbrandt, R. & Powers, D. Characterization and evaluation of similarity measures for pairs of clusterings. Knowl Inf Syst 19, 361–394 (2009). https://doi.org/10.1007/s10115-008-0150-6
DOI: https://doi.org/10.1007/s10115-008-0150-6