Characterization and evaluation of similarity measures for pairs of clusterings

  • Darius Pfitzner
  • Richard Leibbrandt
  • David Powers
Regular Paper

Abstract

In evaluating the results of cluster analysis, it is common practice to make use of a number of fixed heuristics rather than to compare a data clustering directly against an empirically derived standard, such as a clustering obtained from human informants. Given the dearth of research into techniques for expressing the similarity between clusterings, there is broad scope for fundamental research in this area. In defining the comparative problem, we identify two types of worst-case matches between pairs of clusterings, characterised as independently codistributed clustering pairs and conjugate partition pairs. Desirable behaviour for a similarity measure in either of the two worst cases is discussed, giving rise to five test scenarios in which characteristics of one of a pair of clusterings were manipulated in order to compare and contrast the behaviour of different clustering similarity measures. This comparison is carried out for previously proposed clustering similarity measures, as well as a number of established similarity measures that have not previously been applied to clustering comparison. We introduce a paradigm apparatus for the evaluation of clustering comparison techniques and distinguish between the goodness of clusterings and the similarity of clusterings by clarifying the degree to which different measures confuse the two. Accompanying this is the proposal of a novel clustering similarity measure, the Measure of Concordance (MoC). We show that only MoC, Powers’s measure, Lopez and Rajski’s measure and various forms of Normalised Mutual Information exhibit the desired behaviour under each of the test scenarios.
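
To make the idea of a similarity measure between two clusterings concrete, the following is a minimal sketch of one commonly used form of Normalised Mutual Information, in which the mutual information is divided by the geometric mean of the two clustering entropies. The abstract does not say which normalisations the paper evaluates, so this choice, the function names (`entropy`, `mutual_information`, `nmi`) and the convention of returning 0 when either clustering has a single cluster are all assumptions made for illustration. The formula for MoC is not given here and is not attempted.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))      # contingency table as sparse counts
    size_a, size_b = Counter(a), Counter(b)
    return sum(
        (n_ij / n) * log(n_ij * n / (size_a[i] * size_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI normalised by the geometric mean of the entropies (assumed form)."""
    if len(a) != len(b):
        raise ValueError("both clusterings must label the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Assumed convention: a single-cluster labeling carries no information.
        return 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


# Same partition under different label names, and a pair of labelings
# whose cluster memberships are statistically independent.
same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

In this sketch a relabelled copy of the same partition scores 1 and the statistically independent pair scores 0. The second example only illustrates independence in the informal sense; the paper's precise definition of independently codistributed clustering pairs is not reproduced in the abstract.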

Keywords

Clustering, Evaluation, Similarity measures, Cluster comparison, Review


Copyright information

© Springer-Verlag London Limited 2008

Authors and Affiliations

  • Darius Pfitzner (1)
  • Richard Leibbrandt (1)
  • David Powers (1)
  1. Department of Computer Science, Engineering and Mathematics, Flinders University of South Australia, Bedford Park, Australia
