Data Mining and Knowledge Discovery

, Volume 29, Issue 5, pp 1458–1485 | Cite as

Generalization of clustering agreements and distances for overlapping clusters and network communities

Article

Abstract

A measure of distance between two clusterings has important applications, including clustering validation and ensemble clustering. Generally, such distance measure provides navigation through the space of possible clusterings. Mostly used in cluster validation, a normalized clustering distance, a.k.a. agreement measure, compares a given clustering result against the ground-truth clustering. The two widely-used clustering agreement measures are adjusted rand index and normalized mutual information. In this paper, we present a generalized clustering distance from which these two measures can be derived. We then use this generalization to construct new measures specific for comparing (dis)agreement of clusterings in networks, a.k.a. communities. Further, we discuss the difficulty of extending the current, contingency based, formulations to overlapping cases, and present an alternative algebraic formulation for these (dis)agreement measures. Unlike the original measures, the new co-membership based formulation is easily extendable for different cases, including overlapping clusters and clusters of inter-related data. These two extensions are, in particular, important in the context of finding communities in complex networks.

Keywords

Clustering agreement Cluster evaluation Cluster validation Network clusters Community detection Overlapping clusters 

Supplementary material

10618_2015_426_MOESM1_ESM.pdf (166 kb)
Supplementary material 1 (pdf 166 KB)
10618_2015_426_MOESM2_ESM.pdf (1 mb)
Supplementary material 2 (pdf 1073 KB)

References

  1. Aggarwal CC, Reddy CK (2014) Data clustering: algorithms and applications. CRC Press, Boca RatonGoogle Scholar
  2. Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006) On similarity indices and correction for chance agreement. J Classif 23:301–313MathSciNetCrossRefGoogle Scholar
  3. Anderson DT, Bezdek JC, Popescu M, Keller JM (2010) Comparing fuzzy, probabilistic, and possibilistic partitions. IEEE Trans Fuzzy Syst 18(5):906–918CrossRefGoogle Scholar
  4. Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749MathSciNetMATHGoogle Scholar
  5. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008:P10008+CrossRefGoogle Scholar
  6. Brouwer RK (2008) Extending the rand, adjusted rand and Jaccard indices to fuzzy partitions. J Intell Inf Syst 32(3):213–235MathSciNetCrossRefGoogle Scholar
  7. Campello RR (2010) Generalized external indexes for comparing data partitions with overlapping categories. Pattern Recogn Lett 31(9):966–975CrossRefGoogle Scholar
  8. Collins LM, Dent CW (1988) Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivar Behav Res 23(2):231–242CrossRefGoogle Scholar
  9. Cortina-Borja M (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. J R Stat Soc Ser A 175(3):829–829Google Scholar
  10. Cui Y, Fern X, Dy J (2007) Non-redundant multi-view clustering via orthogonalization. In: Seventh IEEE International Conference on Data Mining, 2007. ICDM 2007. pp 133–142Google Scholar
  11. Dhillon IS, Tropp JA (2007) Matrix nearness problems with bregman divergences. SIAM J Matrix Anal Appl 29(4):1120–1146MathSciNetCrossRefGoogle Scholar
  12. Fortunato S (2010) Community detection in graphs. Phys Rep 486(35):75–174MathSciNetCrossRefGoogle Scholar
  13. Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103,018CrossRefGoogle Scholar
  14. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218CrossRefGoogle Scholar
  15. Hullermeier E, Rifqi M, Henzgen S, Senge R (2012) Comparing fuzzy partitions: a generalization of the rand index and related measures. IEEE Trans Fuzzy Syst 20(3):546–556CrossRefGoogle Scholar
  16. Kulis B, Sustik MA, Dhillon IS (2009) Low-rank kernel learning with bregman matrix divergences. J Mach Learn Res 10:341–376MathSciNetMATHGoogle Scholar
  17. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056,117CrossRefGoogle Scholar
  18. Lancichinetti A, Fortunato S, Kertesz J (2008a) Detecting the overlapping and hierarchical community structure of complex networks. New J Phys 11(3):20Google Scholar
  19. Lancichinetti A, Fortunato S, Radicchi F (2008b) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046,110CrossRefGoogle Scholar
  20. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLoS One 6(4):e18,961CrossRefGoogle Scholar
  21. Light RJ, Margolin BH (1971) An analysis of variance for categorical data. J Am Stat Assoc 66(335):534–544MathSciNetCrossRefMATHGoogle Scholar
  22. Manning CD, Raghavan P, Schtze H (2008) Introduction to information retrieval. Cambridge University Press, New YorkCrossRefMATHGoogle Scholar
  23. Mcauley J, Leskovec J (2014) Discovering social circles in ego networks. ACM Trans Knowl Discov Data 8(1):4:1–4:28CrossRefGoogle Scholar
  24. McDaid A, Hurley N (2010) Detecting highly overlapping communities with model-based overlapping seed expansion. In: 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, pp 112–119Google Scholar
  25. McDaid AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv:1110.2515
  26. Meilă M (2007) Comparing clusteringsan information based distance. J Multivar Anal 98(5):873–895CrossRefMATHGoogle Scholar
  27. Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev E 69(066):133Google Scholar
  28. Nielsen F, Nock R (2014) On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process Lett 21(1):10–13CrossRefGoogle Scholar
  29. Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Computer and Information Sciences-ISCIS 2005, Springer, pp 284–293Google Scholar
  30. Quere R, Le Capitaine H, Fraisseix N, Frelicot C (2010) On normalizing fuzzy coincidence matrices to compare fuzzy and/or possibilistic partitions with the rand index. In: 2010 IEEE International Conference on Data Mining, IEEE, pp 977–982Google Scholar
  31. Ronhovde P, Nussinov Z (2009) Multiresolution community detection for megascale networks by information-based replica correlations. Phys Rev E 80(1):016,109CrossRefGoogle Scholar
  32. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123CrossRefGoogle Scholar
  33. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, ICML ’09, pp 1073–1080Google Scholar
  34. Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854MathSciNetMATHGoogle Scholar
  35. Warrens MJ (2008a) On similarity coefficients for \(2\times 2\) tables and correction for chance. Psychometrika 73:487–502MathSciNetCrossRefGoogle Scholar
  36. Warrens MJ (2008b) On the equivalence of Cohen’s Kappa and the Hubert-Arabie adjusted rand index. J Classif 25:177–183MathSciNetCrossRefMATHGoogle Scholar
  37. Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, KDD ’09, pp 877–886Google Scholar
  38. Yang J, Leskovec J (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, New York, pp 587–596Google Scholar
  39. Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on Machine learning, ACM, New York, pp 1028–1035Google Scholar

Copyright information

© The Author(s) 2015

Authors and Affiliations

  1. 1.Department of Computing ScienceUniversity of AlbertaEdmontonCanada

Personalised recommendations