Community Discovery: Simple and Scalable Approaches

  • Chapter
User Community Discovery

Part of the book series: Human–Computer Interaction Series (HCIS)

Abstract

The increasing size and complexity of online social networks have brought distinct challenges to the task of community discovery. A community discovery algorithm needs to be efficient, finishing without a prohibitive amount of running time. It should also be scalable, capable of handling large networks containing billions of edges or even more. Furthermore, it should be effective in that it produces community assignments of high quality. In this chapter, we present a selection of algorithms that follow simple design principles and that have proven highly effective and efficient according to extensive empirical evaluations. We start by discussing a generic approach to community discovery that combines multilevel graph contraction with core clustering algorithms. Next, we describe the use of network sampling in community discovery, where the goal is to reduce the number of nodes and/or edges while retaining the network’s underlying community structure. Finally, we review research efforts that leverage various parallel and distributed computing paradigms in community discovery, which can facilitate finding communities in tera- and peta-scale networks.

Notes

  1. http://newsroom.fb.com/company-info/. Accessed in December 2014.

  2. Here, we will discuss methods based on both shared-memory and distributed-memory architectures.

  3. Overlapping community detection has also attracted considerable research attention [51, 53], yet existing studies have not adapted the multilevel framework discussed here. Combining the multilevel paradigm with overlapping community discovery will be an exciting future direction.

  4. http://snap.stanford.edu/data/index.html.

  5. Note that node sampling can also be achieved by creating an edge-induced subgraph from a subset of edges, so the node selection process is not always performed explicitly. The key distinction here is whether all nodes of the original graph are kept in the resulting sample graph.

  6. The forest fire model described here is slightly different from that originally proposed in [25], which operates on directed graphs and thus has two parameters to control the “burning” of in- and out-links, respectively.

  7. Both content and attribute information are modeled as an auxiliary feature vector associated with each node in the graph, so that the formulation is applicable to text, image, and many other forms of information, all of which will be referred to as “content information” henceforth.

  8. An empirical guideline to select K is to let the size of \(E_{content}\) be similar to that of E.

  9. This is different from the Twitter network described in Sect. 2.2.5.

  10. Although forest fire is designed for node sampling, one can perform forest fire repeatedly, each time starting from a randomly selected unburned node, until most nodes are burned. The collection of all burned edges is then taken as the set of sampled edges.

  11. The computed independent set is no longer guaranteed to be maximal.
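The repeated forest fire procedure in note 10 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: the graph is undirected and given as a dictionary of neighbor sets, a single hypothetical parameter `p_forward` controls how far each fire spreads (the undirected variant mentioned in note 6), a "burned edge" is taken to mean an edge along which the fire spread to a new node, and `target_fraction` stands in for "most nodes". The function name and all parameter names are made up for this example.

```python
import random
from collections import deque


def forest_fire_edge_sample(adj, p_forward=0.7, target_fraction=0.9, seed=0):
    """Sample edges by running forest fire repeatedly (illustrative sketch).

    adj: dict mapping each node to a set of neighbors (undirected graph).
         Nodes are assumed to be mutually comparable (e.g. all ints).
    Returns (burned_nodes, sampled_edges), edges stored as (small, large).
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    burned = set()
    sampled_edges = set()
    # Stop once this many nodes are burned (clamped so the loop terminates).
    goal = min(target_fraction, 1.0) * len(nodes)

    while len(burned) < goal:
        # Start a new fire at a randomly selected unburned node.
        start = rng.choice([v for v in nodes if v not in burned])
        burned.add(start)
        frontier = deque([start])

        while frontier:
            u = frontier.popleft()
            candidates = [v for v in sorted(adj[u]) if v not in burned]
            rng.shuffle(candidates)
            # Draw a geometrically distributed count: keep adding one more
            # neighbor with probability p_forward, capped by availability.
            k = 0
            while k < len(candidates) and rng.random() < p_forward:
                k += 1
            for v in candidates[:k]:
                burned.add(v)
                sampled_edges.add((min(u, v), max(u, v)))
                frontier.append(v)

    return burned, sampled_edges


if __name__ == "__main__":
    # Two triangles joined by the edge (2, 3).
    toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
    burned_nodes, edges = forest_fire_edge_sample(toy, seed=1)
    print(sorted(burned_nodes), sorted(edges))
```

Under this reading, every sampled edge introduces a newly burned node, so the sampled edges form a forest. An implementation that also records edges between two already-burned nodes would yield a denser sample; which convention applies is not settled by the note itself.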

References

  1. Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on link discovery. ACM, pp 36–43

  2. Aggarwal CC, Zhao Y, Yu PS (2010) On clustering graph streams. In: SDM. SIAM, pp 478–489

  3. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech: Theory Exp 2008(10):P10008

  4. Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations. In: Proceedings of the thirtieth annual ACM symposium on theory of computing. ACM, pp 327–336

  5. Bui TN, Jones C (1993) A heuristic for reducing fill-in in sparse matrix factorization. In: PPSC, pp 445–452

  6. Bustamam A, Burrage K, Hamilton NA (2012) Fast parallel Markov clustering in bioinformatics using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 9(3):679–692

  7. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, pp 380–388

  8. Chung FR (1997) Spectral graph theory, vol 92. American Mathematical Society, Providence

  9. Dhillon I, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans Pattern Anal Mach Intell 29(11):1944

  10. Diniz PC, Plimpton S, Hendrickson B, Leland RW (1995) Parallel algorithms for dynamically partitioning unstructured grids. In: PPSC, pp 615–620

  11. Doreian P, Mrvar A (2009) Partitioning signed social networks. Soc Netw 31(1):1–11

  12. Fiduccia CM, Mattheyses RM (1982) A linear-time heuristic for improving network partitions. In: 19th conference on design automation. IEEE, pp 175–181

  13. Fiedler M (1973) Algebraic connectivity of graphs. Czechoslov Math J 23(2):298–305

  14. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174

  15. George A, Liu J (1981) Computer solution of large sparse positive definite systems. Prentice Hall, Englewood Cliffs

  16. Heath MT, Ng E, Peyton BW (1991) Parallel algorithms for sparse linear systems. SIAM Rev 33(3):420–460

  17. Hubler C, Kriegel HP, Borgwardt K, Ghahramani Z (2008) Metropolis algorithms for representative subgraph sampling. In: Eighth IEEE international conference on data mining, ICDM’08. IEEE, pp 283–292

  18. Kang U, Meeder B, Papalexakis EE, Faloutsos C (2014) HEigen: spectral analysis for billion-scale graphs. IEEE Trans Knowl Data Eng 26(2):350–362

  19. Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392

  20. Karypis G, Kumar V (1999) Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Rev 41(2):278–300

  21. Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 49(2):291–307

  22. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110

  23. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLOS ONE 6(4):e18961

  24. Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 631–636

  25. Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, pp 177–187

  26. Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2008) Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th international conference on world wide web. ACM, pp 695–704

  27. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web. ACM, pp 631–640

  28. Leung IX, Hui P, Lio P, Crowcroft J (2009) Towards real-time community detection in large networks. Phys Rev E 79(6):066107

  29. Luby M (1986) A simple parallel algorithm for the maximal independent set problem. SIAM J Comput 15(4):1036–1053

  30. Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. J Mach Learn Res 8:935–983

  31. Maiya AS, Berger-Wolf TY (2010) Sampling community structure. In: Proceedings of the 19th international conference on world wide web. ACM, pp 701–710

  32. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113

  33. Niu Q, Lai PW, Faisal SM, Parthasarathy S, Sadayappan P (2014) A fast implementation of MLR-MCL algorithm on multi-core processors. In: 21st annual international conference on high performance computing, HiPC 2014, Goa, India, 17–20 December 2014

  34. Ovelgonne M (2013) Distributed community detection in web-scale networks. In: 2013 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM). IEEE, pp 66–73

  35. Papadopoulos S, Kompatsiaris Y, Vakali A, Spyridonos P (2012) Community detection in social media. Data Min Knowl Discov 24(3):515–554

  36. Parlett BN (1980) The symmetric eigenvalue problem, vol 7. SIAM, Philadelphia

  37. Parthasarathy S, Faisal SM (2013) Network clustering. CRC Press, Boca Raton, pp 415–456

  38. Parthasarathy S, Ruan Y, Satuluri V (2011) Community discovery in social networks: applications, methods and emerging trends. Social network data analytics. Springer, Berlin, pp 79–113

  39. Pemmaraju S, Skiena S (2003) Computational discrete mathematics: combinatorics and graph theory with mathematica. Cambridge University Press, New York

  40. Prat-Pérez A, Dominguez-Sal D, Brunat JM, Larriba-Pey JL (2012) Shaping communities out of triangles. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 1677–1681

  41. Prat-Pérez A, Dominguez-Sal D, Larriba-Pey JL (2014) High quality, scalable and parallel community detection for large real graphs. In: Proceedings of the 23rd international conference on world wide web, international world wide web conferences steering committee, pp 225–236

  42. Richter Y, Yom-Tov E, Slonim N (2010) Predicting customer churn in mobile networks through analysis of social groups. In: SDM. SIAM, vol 2010, pp 732–741

  43. Riedy EJ, Meyerhenke H, Ediger D, Bader DA (2012) Parallel community detection for massive graphs. Parallel processing and applied mathematics. Springer, Berlin, pp 286–296

  44. Ruan Y, Fuhry D, Parthasarathy S (2013) Efficient community detection in large networks using content and links. In: Proceedings of the 22nd international conference on world wide web, international world wide web conferences steering committee, pp 1089–1098

  45. Satuluri V, Parthasarathy S (2009) Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 737–746

  46. Satuluri V, Parthasarathy S, Ruan Y (2011) Local graph sparsification for scalable clustering. In: Proceedings of the 2011 international conference on management of data. ACM, pp 721–732

  47. Soffer SN, Vázquez A (2005) Network clustering coefficient without degree-correlation biases. Phys Rev E 71(5):057101

  48. Staudt CL, Meyerhenke H (2013) Engineering high-performance community detection heuristics for massive graphs. In: Proceedings of the 2013 42nd international conference on parallel processing. IEEE Computer Society, pp 180–189

  49. Van Dongen SM (2000) Graph clustering by flow simulation. Ph.D. thesis, University of Utrecht

  50. Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416

  51. Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput Surv (CSUR) 45(4):43

  52. Yang B, Cheung WK, Liu J (2007) Community mining from signed social networks. IEEE Trans Knowl Data Eng 19(10):1333–1348

  53. Yang J, Leskovec J (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 587–596

  54. Yang J, McAuley J, Leskovec J (2013) Community detection in networks with node attributes. In: 2013 IEEE 13th international conference on data mining (ICDM). IEEE, pp 1151–1156

  55. Yang T, Jin R, Chi Y, Zhu S (2009) Combining link and content for community detection: a discriminative approach. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 927–936

  56. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33:452–473

    Article  Google Scholar 

Download references

Acknowledgments

We are thankful to the editors and anonymous reviewers for their valuable comments, insightful suggestions, and constructive feedback, which greatly helped improve this article.

This work is supported by NSF Grants IIS-1111118, CCF-1240651, and DMS-1418265. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information

Corresponding author

Correspondence to Yiye Ruan.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Ruan, Y., Fuhry, D., Liang, J., Wang, Y., Parthasarathy, S. (2015). Community Discovery: Simple and Scalable Approaches. In: Paliouras, G., Papadopoulos, S., Vogiatzis, D., Kompatsiaris, Y. (eds) User Community Discovery. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-319-23835-7_2

  • DOI: https://doi.org/10.1007/978-3-319-23835-7_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23834-0

  • Online ISBN: 978-3-319-23835-7

  • eBook Packages: Computer Science (R0)
