Communities validity: methodical evaluation of community mining algorithms

Abstract

Grouping data points is one of the fundamental tasks in data mining, commonly known as clustering when the data points are described by attributes. For interrelated data, represented as a graph in which a link between two nodes indicates a relationship between them, a considerable number of approaches have been proposed in recent years for mining communities in a given network. However, little work has been done on how to evaluate community mining algorithms. The common practice is to evaluate the algorithms based on their performance on standard benchmarks for which the ground truth is known. This technique is similar to external evaluation of attribute-based clustering methods. The other two well-studied clustering evaluation approaches are less explored in the community mining context: internal evaluation, which statistically validates the clustering result, and relative evaluation, which compares alternative clustering results. These two approaches make it possible to validate communities discovered in a real-world application, where the true community structure is hidden in the data. In this article, we investigate different clustering quality criteria used for relative and internal evaluation of clustering data points with attributes, as well as different clustering agreement measures used for external evaluation, and incorporate proper adaptations to make them applicable in the context of interrelated data. We further compare the performance of the proposed adapted criteria in evaluating community mining results in different settings through an extensive set of experiments.
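To make the distinction between the evaluation approaches concrete, the following minimal Python sketch scores one community assignment in two ways. It is an illustration under stated assumptions, not the adapted criteria proposed in the article: the abstract does not name specific measures, so the adjusted Rand index stands in for an external agreement measure and Newman-Girvan modularity stands in for a quality criterion that needs no ground truth. The function names, the toy graph and the labels are all hypothetical, and the graph is assumed to be undirected and unweighted, with disjoint communities.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """External evaluation: chance-corrected agreement of two disjoint partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same nodes")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no node pairs to disagree on
    # Count node pairs that are grouped together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # both partitions are the same trivial partition
    return (together_both - expected) / (maximum - expected)


def modularity(edges, community_of):
    """Quality score without ground truth (assumes an undirected, unweighted edge list)."""
    m = len(edges)
    if m == 0:
        raise ValueError("the graph needs at least one edge")
    internal = Counter()  # edges with both endpoints in the community
    degree = Counter()    # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy network (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
truth = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}              # known only for benchmarks
found = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}  # output of some algorithm

nodes = sorted(truth)
external_score = adjusted_rand_index([truth[v] for v in nodes], [found[v] for v in nodes])
quality_score = modularity(edges, found)
print(external_score, quality_score)
```

The external score can only be computed when `truth` is available, as on a benchmark, whereas the quality score depends only on the graph and the discovered communities, which is what makes internal and relative evaluation usable on real-world networks.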

Acknowledgments

The authors are grateful for the support from Alberta Innovates Centre for Machine Learning and NSERC. Ricardo Campello also acknowledges the financial support of Fapesp and CNPq.

Author information

Corresponding author

Correspondence to Reihaneh Rabbany.

About this article

Cite this article

Rabbany, R., Takaffoli, M., Fagnan, J. et al. Communities validity: methodical evaluation of community mining algorithms. Soc. Netw. Anal. Min. 3, 1039–1062 (2013). https://doi.org/10.1007/s13278-013-0132-x

  • DOI: https://doi.org/10.1007/s13278-013-0132-x

Keywords

  • Evaluation approaches
  • Quality measures
  • Clustering evaluation
  • Clustering objective function
  • Community mining