Communities validity: methodical evaluation of community mining algorithms

Abstract

Grouping data points is one of the fundamental tasks in data mining, which is commonly known as clustering if data points are described by attributes. When dealing with interrelated data, that is represented in the form a graph wherein a link between two nodes indicates a relationship between them, there has been a considerable number of approaches proposed in recent years for mining communities in a given network. However, little work has been done on how to evaluate the community mining algorithms. The common practice is to evaluate the algorithms based on their performance on standard benchmarks for which we know the ground-truth. This technique is similar to external evaluation of attribute-based clustering methods. The other two well-studied clustering evaluation approaches are less explored in the community mining context; internal evaluation to statistically validate the clustering result and relative evaluation to compare alternative clustering results. These two approaches enable us to validate communities discovered in a real-world application, where the true community structure is hidden in the data. In this article, we investigate different clustering quality criteria applied for relative and internal evaluation of clustering data points with attributes and also different clustering agreement measures used for external evaluation and incorporate proper adaptations to make them applicable in the context of interrelated data. We further compare the performance of the proposed adapted criteria in evaluating community mining results in different settings through extensive set of experiments.

This is a preview of subscription content, access via your institution.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

References

  1. Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006) On similarity indices and correction for chance agreement. J Classif 23:301–313. doi:10.1007/s00357-006-0017-z

    MathSciNet  Article  Google Scholar 

  2. Aldecoa R, Marin I (2012) Closed benchmarks for network community structure characterization. Phys Rev E 85:026109

    Article  Google Scholar 

  3. Bezdek JC (1981) Pattern Recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell

    Google Scholar 

  4. Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat Theory Methods 3:1–27

    MathSciNet  Article  MATH  Google Scholar 

  5. Campello R (2010) Generalized external indexes for comparing data partitions with overlapping categories. Pattern Recogn Lett 31(9):966–975

    Article  Google Scholar 

  6. Campello R, Hruschka ER (2006) A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets Syst 157(21):2858–2875

    MathSciNet  Article  MATH  Google Scholar 

  7. Chen J, Zaïane OR, Goebel R (2009) Detecting communities in social networks using max-min modularity. In: SIAM international conference on data mining, pp 978–989

  8. Clauset A (2005) Finding local community structure in networks. Phys Rev E (Statistical, Nonlinear, and Soft Matter Physics) 72(2):026132

    Article  Google Scholar 

  9. Collins LM, Dent CW (1988) Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivar Behav Res 23(2):231–242

    Article  Google Scholar 

  10. Dalrymple-Alford EC (1970) Measurement of clustering in free recall. Psychol Bull 74:32–34

    Article  Google Scholar 

  11. Danon L, Díaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech Theory Exp 2005(09):09008. doi:10.1088/1742-5468/2005/09/P09008

  12. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227

    Google Scholar 

  13. Dumitrescu D, BL, Jain LC (2000) Fuzzy sets and their application to clustering and training. CRC Press, Boca Raton

  14. Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104

    MathSciNet  Article  Google Scholar 

  15. Fortunato S (2010) Community detection in graphs. Phys Rep 486(35):75–174

    MathSciNet  Article  Google Scholar 

  16. Fortunato S, Barthélemy M (2007) Resolution limit in community detection. Proc Nat Acad Sci 104(1):36–41

    Article  Google Scholar 

  17. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Nat Acad Sci 99(12):7821–7826

    MathSciNet  Article  MATH  Google Scholar 

  18. Gregory S (2011) Fuzzy overlapping communities in networks. J Stat Mech Theory Exp 2:17

    Google Scholar 

  19. Gustafsson M, Hörnquist M, Lombardi A (2006) Comparison and validation of community structures in complex networks. Phys A Stat Mech Appl 367:559–576

    Article  Google Scholar 

  20. Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inform Syst 17:107–145

    Article  MATH  Google Scholar 

  21. Hppner F, Klawonn F, Kruse R, Runkler T (1999) Fuzzy cluster analysis: methods for classification, data analysis and image recognition. Wiley, New York

  22. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218

    Article  Google Scholar 

  23. Hubert LJ, Levin JR (1976) A general statistical framework for assessing categorical clustering in free recall. Psychol Bull 83:1072–1080

    Article  Google Scholar 

  24. Kenley EC, Cho Y-R (2011) Entropy-based graph clustering: application to biological and social networks. In: IEEE International Conference on Data Mining

  25. Krebs V. Books about us politics. http://www.orgnet.com/2004

  26. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117

    Article  Google Scholar 

  27. Lancichinetti A, Fortunato S (2012) Consensus clustering in complex networks. Nat Sci Rep 2:336

    Google Scholar 

  28. Lancichinetti A, Fortunato S, Kertsz J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015

    Article  Google Scholar 

  29. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110

    Google Scholar 

  30. Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: ACM SIGKDD international conference on knowledge discovery in data mining, pp 177–187

  31. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: International conference on world wide web, pp 631–640

  32. Luo F, Wang JZ, Promislow E (2008) Exploring local community structures in large networks. Web Intell Agent Syst 6(4):387–400

    Google Scholar 

  33. Manning CD, Raghavan P, Schtze H (2008) Introduction to information retrieval. Cambridge University Press, New York

    Google Scholar 

  34. Meil M (2007) Comparing clusteringsan information based distance. J Multivar Anal 98(5):873–895

    Article  Google Scholar 

  35. Milligan G, Cooper M (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179

    Article  Google Scholar 

  36. Newman M (2010) Networks: an introduction. Oxford University Press, Inc., New York

    Google Scholar 

  37. Newman MEJ (2006) Modularity and community structure in networks. Proc Nat Acad Sci 103(23):8577–8582

    Article  Google Scholar 

  38. Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113

    Article  Google Scholar 

  39. Nooy Wd, Mrvar A, Batagelj V (2004) Exploratory Social Network Analysis with Pajek. Cambridge University Press, Cambridge

  40. Onnela J-P, Fenn DJ, Reid S, Porter MA, Mucha PJ, Fricker MD, Jones NS (2010) Taxonomies of Networks. ArXiv e-prints

  41. Orman GK, Labatut V (2010) The effect of network realism on community detection algorithms. In: Proceedings of the 2010 international conference on advances in social networks analysis and mining. ASONAM ’10, pp 301–305

  42. Orman GK, Labatut V, Cherifi H (2011) Qualitative comparison of community detection algorithms. In: International conference on digital information and communication technology and its applications, vol 167, pp 265–279

  43. Pakhira M, Dutta A (2011) Computing approximate value of the pbm index for counting number of clusters using genetic algorithm. In: International conference on recent trends in information systems

  44. Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818

    Article  Google Scholar 

  45. Porter MA, Onnela J-P, Mucha PJ (2009) Communities in networks. Notices of the AMS 56(9):1082–1097

    Google Scholar 

  46. Rabbany R, Chen J, Zaïane OR (2010) Top leaders community detection approach in information networks. In: SNA-KDD workshop on social network mining and analysis

  47. Rabbany R, Takaffoli M, Fagnan J, Zaiane O, Campello R (2012) Relative validity criteria for community mining algorithms. In: International conference on advances in social networks analysis and mining (ASONAM)

  48. Rabbany R, Zaïane OR (2011) A diffusion of innovation-based closeness measure for network associations. In: IEEE international conference on data mining workshops, pp 381–388

  49. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabsi A-L (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555

    Article  Google Scholar 

  50. Rees BS, Gallagher KB (2012) Overlapping community detection using a community optimized graph swarm. Soc Netw Anal Mining 2(4):405–417

    Article  Google Scholar 

  51. Rosvall M, Bergstrom CT (2007) An information-theoretic framework for resolving community structure in complex networks. Proc Nat Acad Sci 104(18):7327–7331

    Article  Google Scholar 

  52. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Nat Acad Sci 105(4):1118–1123

    Article  Google Scholar 

  53. Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65

    Article  MATH  Google Scholar 

  54. Sallaberry A, Zaidi F, Melançon G (2013) Model for generating artificial social networks having community structures with small-world and scale-free properties. Soc Netw Anal Min 3(3):597–609

    Google Scholar 

  55. Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

    MathSciNet  MATH  Google Scholar 

  56. Theodoridis S, Koutroumbas K (2009) Cluster validity. In: Pattern recognition, chapter 16, 4 ed. Elsevier Science, London

  57. Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Stat Anal Data Mining 3(4):209–235

    MathSciNet  Google Scholar 

  58. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th annual international conference on machine learning, ICML ’09. ACM, New York, pp 1073–1080

  59. Vinh NX, Epps J, Bailey J (2010). Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854

    MathSciNet  MATH  Google Scholar 

  60. Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press, Cambridge

  61. Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09. ACM, New York, pp 877–886

  62. Yoshida T (2013) Weighted line graphs for overlapping community discovery. Soc Netw Anal Min 1–13. doi:10.1007/s13278-013-0104-1

  63. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33:452–473

    Google Scholar 

Download references

Acknowledgments

The authors are grateful for the support from Alberta Innovates Centre for Machine Learning and NSERC. Ricardo Campello also acknowledges the financial support of Fapesp and CNPq.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Reihaneh Rabbany.

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Rabbany, R., Takaffoli, M., Fagnan, J. et al. Communities validity: methodical evaluation of community mining algorithms. Soc. Netw. Anal. Min. 3, 1039–1062 (2013). https://doi.org/10.1007/s13278-013-0132-x

Download citation

Keywords

  • Evaluation approaches
  • Quality measures
  • Clustering evaluation
  • Clustering objective function
  • Community mining