Correcting Jaccard and other similarity indices for chance agreement in cluster analysis

  • Regular Article
Advances in Data Analysis and Classification

Abstract

Correcting a similarity index for chance agreement requires computing its expectation under fixed marginal totals of a matching counts matrix. For some indices, such as the Jaccard, Rogers and Tanimoto, Sokal and Sneath, and Gower and Legendre indices, the expectations cannot be found easily. We show how such similarity indices can be expressed as functions of other indices, so that their expectations can be found by approximation and an approximate correction becomes possible. A second approach is based on Taylor series expansion. A simulation study illustrates the effectiveness of the resulting corrections using structured and unstructured data generated from bivariate normal distributions.
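To make the idea of chance correction concrete, the following is a minimal sketch, not the authors' method. It computes the Jaccard index between two partitions from the pair counts of their matching counts matrix and then applies the usual correction form (S − E[S]) / (1 − E[S]), assuming the index has maximum 1. Because the abstract notes that E[S] has no simple closed form for Jaccard, the sketch substitutes a Monte Carlo estimate: shuffling one label vector keeps both sets of marginal totals fixed. The function names `pair_counts`, `jaccard`, and `corrected_jaccard`, and the parameters `n_perm` and `seed`, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Count object pairs placed together in both partitions, only in A, only in B."""
    cells = Counter(zip(labels_a, labels_b))  # cells of the matching counts matrix
    rows = Counter(labels_a)                  # row marginal totals
    cols = Counter(labels_b)                  # column marginal totals
    both = sum(comb(m, 2) for m in cells.values())
    only_a = sum(comb(m, 2) for m in rows.values()) - both
    only_b = sum(comb(m, 2) for m in cols.values()) - both
    return both, only_a, only_b


def jaccard(labels_a, labels_b):
    """Jaccard index on pairs: both / (both + only_a + only_b)."""
    both, only_a, only_b = pair_counts(labels_a, labels_b)
    denom = both + only_a + only_b
    # Convention (an assumption): no co-clustered pairs at all counts as agreement.
    return both / denom if denom else 1.0


def corrected_jaccard(labels_a, labels_b, n_perm=2000, seed=0):
    """Chance-corrected Jaccard with a permutation estimate of the expectation.

    This Monte Carlo estimate stands in for the analytic approximations
    described in the abstract; it is not the authors' procedure.
    """
    rng = random.Random(seed)
    observed = jaccard(labels_a, labels_b)
    shuffled = list(labels_b)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # keeps cluster sizes, hence marginal totals, fixed
        total += jaccard(labels_a, shuffled)
    expected = total / n_perm
    if expected == 1.0:
        return 0.0  # degenerate case (assumed convention): nothing to correct
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
    b = [0, 0, 1, 1, 1, 2, 2, 2, 0]
    print("Jaccard:", jaccard(a, b))
    print("Corrected Jaccard:", corrected_jaccard(a, b))
```

A corrected value near 0 indicates agreement no better than expected by chance under fixed marginals, and negative values indicate agreement below that level. The permutation estimate is slower than a closed-form approximation but needs no assumptions beyond the fixed cluster sizes.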

References

  • Albatineh AN, Niewiadomska-Bugaj M, Mihalko DP (2006) On similarity indices and correction for chance agreement. J Classif 23: 301–313

  • Albatineh AN, Niewiadomska-Bugaj M (2011) MCS: a method for finding the number of clusters. J Classif 28. doi:10.1007/s00357-010-9069-1

  • Albatineh AN (2010) Means and variances for a family of similarity indices used in cluster analysis. J Stat Plan Inference 140: 2828–2838

  • Czekanowski J (1932) “Coefficient of racial likeness” und “durchschnittliche Differenz”. Anthropologischer Anzeiger 14: 227–249

  • Dice LR (1945) Measures of the amount of ecological association between species. Ecology 26: 297–302

  • Fligner MA, Verducci JS, Blower PE (2002) A modification of the Jaccard–Tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics 44: 110–119

  • Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78: 553–569

  • Gower JC, Legendre P (1986) Metric and Euclidean properties of dissimilarity coefficients. J Classif 3: 5–48

  • Hamann U (1961) Merkmalsbestand und Verwandtschaftsbeziehungen der Farinosae. Willdenowia 2: 639–768

  • Hubálek Z (1982) Coefficients of association and similarity based on binary (presence–absence) data: an evaluation. Biol Rev 57: 669–689

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2: 193–218

  • Jaccard P (1908) Nouvelles recherches sur la distribution florale. Bull Soc Vaudoise Sci Nat 44: 223–270

  • Jaccard P (1912) The distribution of the flora of the alpine zone. New Phytol 11: 37–50

  • Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey

  • Janson S, Vegelius J (1981) Measures of ecological association. Oecologia 49: 371–376

  • Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32: 241–254

  • Kulczynski S (1927) Die Pflanzenassoziationen der Pinien, Bulletin International de L’Académie Polonaise des Sciences et des Lettres, Classe des Sciences Mathématiques et Naturelles. Series B, Supplément II 2: 57–203

  • Lamont BB, Grant KJ (1979) A comparison of twenty-one measures of site dissimilarity. In: Orlóci L, Rao CR, Stiteler WM (eds) Multivariate methods in ecological work. International Cooperation Publishing House, Fairland, pp 101–126

  • Lancaster HO (1969) The Chi-squared distribution. John Wiley, New York

  • Lehmann EL (1959) Testing statistical hypotheses. Wiley, New York

  • Legendre P, Legendre L (1998) Numerical ecology. Elsevier, Amsterdam

  • McConnaughey BH (1964) The determination and analysis of plankton communities. Marine Research, Special No, Indonesia, pp 1–40

  • Milligan G, Cooper M (1986) A study of the comparability of external criteria for hierarchical cluster analysis. Multivar Behav Res 21: 441–458

  • Milligan G, Soon S, Sokol L (1983) The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Trans Patt Anal Mach Intell PAMI-5: 40–47

  • Morey L, Agresti A (1984) The measurement of classification agreement: an adjustment to the Rand statistic for chance agreement. Educ Psychol Meas 44: 33–37

  • Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 846–850

  • Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132: 1115–1118

  • Russell PF, Rao TR (1940) On habitat and association of species of anopheline larvae in South-Eastern Madras. J Malar Inst India 3: 153–178

  • Saxena PC, Navaneerham K (1991) The effect of cluster size, dimensionality, and number of clusters on recovery of true cluster structure through Chernoff-type faces. Statistician 40: 415–425

  • Saxena PC, Navaneerham K (1993) Comparison of Chernoff-type face and non-graphical methods for clustering multivariate observations. Comput Stat Data Anal 15: 63–79

  • Snijders TAB, Dormaar M, Van Schuur WH, Dijkman-Caes C, Driessen G (1990) Distribution of some similarity coefficients for dyadic binary data in the case of associated attributes. J Classif 7: 5–31

  • Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38: 1409–1438

  • Sokal RR, Sneath PHA (1963) Principles of numerical taxonomy. WH Freeman, San Francisco

  • Sørensen T (1948) A Method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Biologiske Skrifter 5: 1–34

  • Southwood TS (1978) Ecological methods. Chapman and Hall, London

  • Steinley D (2004) Properties of the Hubert–Arabie adjusted Rand index. Psychol Methods 9: 386–396

  • Van Der Maarel E (1969) On the use of ordination models in phytosociology. Vegetatio 19: 21–46

  • Wallace DL (1983) A method for comparing two hierarchical clusterings: comment. J Am Stat Assoc 78: 569–576

Author information

Correspondence to Ahmed N. Albatineh.

About this article

Cite this article

Albatineh, A.N., Niewiadomska-Bugaj, M. Correcting Jaccard and other similarity indices for chance agreement in cluster analysis. Adv Data Anal Classif 5, 179–200 (2011). https://doi.org/10.1007/s11634-011-0090-y

  • DOI: https://doi.org/10.1007/s11634-011-0090-y
