Community Discovery: Simple and Scalable Approaches

Ruan, Yiye; Fuhry, David; Liang, Jiongqian; Wang, Yu; Parthasarathy, Srinivasan

doi:10.1007/978-3-319-23835-7_2

Part of the book series: Human–Computer Interaction Series (HCIS)

Abstract

The increasing size and complexity of online social networks have brought distinct challenges to the task of community discovery. A community discovery algorithm needs to be efficient: it must finish without taking a prohibitive amount of time. It should also be scalable, capable of handling large networks containing billions of edges or more. Furthermore, it should be effective, in that it produces community assignments of high quality. In this chapter, we present a selection of algorithms that follow simple design principles and have proven highly effective and efficient in extensive empirical evaluations. We start by discussing a generic approach to community discovery that combines multilevel graph contraction with core clustering algorithms. Next, we describe the use of network sampling in community discovery, where the goal is to reduce the number of nodes and/or edges while retaining the network’s underlying community structure. Finally, we review research efforts that leverage various parallel and distributed computing paradigms in community discovery, which can facilitate finding communities in tera- and peta-scale networks.
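
To make the multilevel idea in the abstract concrete, the following is a minimal sketch of a single contraction level and of projecting cluster labels back to the original nodes. It is not the chapter's implementation: the function names `coarsen` and `project`, the random matching rule, and the edge-dictionary representation are all assumptions chosen for illustration, and the core clustering step and any refinement during uncoarsening are left out.

```python
import random
from collections import defaultdict


def coarsen(edges, seed=0):
    """Contract one level by merging matched node pairs into supernodes.

    edges: dict mapping (u, v) with u < v to a positive weight (assumed format).
    Returns (coarse_edges, parent); parent maps each node to its supernode id.
    """
    rng = random.Random(seed)
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    order = sorted(adj)
    rng.shuffle(order)

    parent = {}
    next_id = 0
    for u in order:
        if u in parent:
            continue
        parent[u] = next_id
        # Random matching (an assumption): pair u with one unmatched neighbor.
        free = [v for v in adj[u] if v not in parent]
        if free:
            parent[rng.choice(free)] = next_id
        next_id += 1

    # Edges between different supernodes are kept, with weights summed.
    coarse = defaultdict(float)
    for (u, v), w in edges.items():
        a, b = parent[u], parent[v]
        if a != b:
            coarse[(min(a, b), max(a, b))] += w
    return dict(coarse), parent


def project(coarse_labels, parent):
    """Give every original node the cluster label of its supernode."""
    return {node: coarse_labels[sup] for node, sup in parent.items()}


# Two triangles joined by one edge, used only as a toy input.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0, (2, 3): 1.0,
         (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0}
coarse_edges, parent = coarsen(edges, seed=1)
```

In a full multilevel scheme, `coarsen` would be applied repeatedly until the graph is small, a core clustering algorithm would run on the smallest graph, and `project` would carry its labels back level by level. Nodes with no edges do not appear in this representation and would need separate handling.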

Notes

1. http://newsroom.fb.com/company-info/. Accessed in December 2014.
2. Here, we will discuss methods based on both shared-memory and distributed-memory architectures.
3. Overlapping community detection has also attracted considerable research attention [51, 53], yet existing studies have not adapted the multilevel framework discussed here. Combining the multilevel paradigm with overlapping community discovery will be an exciting future direction.
4. http://snap.stanford.edu/data/index.html.
5. Note that node sampling can also be achieved by creating an edge-induced subgraph from a subset of edges, so the node selection process is not always performed explicitly. The key distinction here is whether all nodes from the original graph are kept in the resultant sample graph.
6. The forest fire model described here is slightly different from that originally proposed in [25], which operates on directed graphs and thus has two parameters to control the “burning” of in- and out-links, respectively.
7. Both content and attribute information are modeled as an auxiliary feature vector associated with each node in the graph, so that the formulation is applicable to text, image, and many other forms of information, all of which will be referred to as “content information” henceforth.
8. An empirical guideline to select K is to let the size of \(E_{content}\) be similar to that of E.
9. This is different from the Twitter network described in Sect. 2.2.5.
10. Although forest fire is designed for node sampling, one can perform forest fire repeatedly, each time on a randomly selected unburned node, until most nodes are burned. The collection of all burned edges is considered the set of sampled edges.
11. The computed independent set is no longer guaranteed to be maximal.
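
The repeated forest fire procedure in note 10 can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's code: the function name `forest_fire_edges`, the adjacency-dictionary input, the single burning probability `p_burn` (matching the one-parameter undirected variant mentioned in note 6), the geometric draw for how many neighbors to burn, and the `target_fraction` that stands in for "most nodes" are all hypothetical choices.

```python
import random
from collections import deque


def forest_fire_edges(adj, p_burn=0.5, target_fraction=0.9, seed=0):
    """Sample edges by starting fires until enough nodes are burned.

    adj: dict mapping each node to a list of neighbors (undirected graph).
    Returns (burned_nodes, sampled_edges).
    """
    if not 0.0 <= p_burn < 1.0:
        raise ValueError("p_burn must lie in [0, 1)")
    rng = random.Random(seed)
    nodes = list(adj)
    target = target_fraction * len(nodes)
    burned = set()
    sampled_edges = set()

    while len(burned) < target:
        # Start a new fire at a randomly selected unburned node.
        start = rng.choice([v for v in nodes if v not in burned])
        burned.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            # Draw how many neighbors to burn (geometric; an assumption).
            count = 0
            while rng.random() < p_burn:
                count += 1
            candidates = [w for w in adj[u] if w not in burned]
            rng.shuffle(candidates)
            for w in candidates[:count]:
                burned.add(w)
                # Store each undirected edge once, endpoints in sorted order.
                sampled_edges.add((min(u, w), max(u, w)))
                queue.append(w)
    return burned, sampled_edges
```

Each fire burns at least its starting node, so the outer loop always terminates. Every burned node other than a starting node contributes exactly one sampled edge, which means the sample is a forest and the number of sampled edges is bounded by the number of burned nodes.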

References

1. Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on link discovery. ACM, pp 36–43
2. Aggarwal CC, Zhao Y, Yu PS (2010) On clustering graph streams. In: SDM. SIAM, pp 478–489
3. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech: Theory Exp 2008(10):P10008
4. Broder AZ, Charikar M, Frieze AM, Mitzenmacher M (1998) Min-wise independent permutations. In: Proceedings of the thirtieth annual ACM symposium on theory of computing. ACM, pp 327–336
5. Bui TN, Jones C (1993) A heuristic for reducing fill-in in sparse matrix factorization. In: PPSC, pp 445–452
6. Bustamam A, Burrage K, Hamilton NA (2012) Fast parallel Markov clustering in bioinformatics using massively parallel computing on GPU with CUDA and ELLPACK-R sparse format. IEEE/ACM Trans Comput Biol Bioinform (TCBB) 9(3):679–692
7. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, pp 380–388
8. Chung FR (1997) Spectral graph theory, vol 92. American Mathematical Society, Providence
9. Dhillon I, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans Pattern Anal Mach Intell 29(11):1944
10. Diniz PC, Plimpton S, Hendrickson B, Leland RW (1995) Parallel algorithms for dynamically partitioning unstructured grids. In: PPSC, pp 615–620
11. Doreian P, Mrvar A (2009) Partitioning signed social networks. Soc Netw 31(1):1–11
12. Fiduccia CM, Mattheyses RM (1982) A linear-time heuristic for improving network partitions. In: 19th conference on design automation. IEEE, pp 175–181
13. Fiedler M (1973) Algebraic connectivity of graphs. Czechoslov Math J 23(2):298–305
14. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
15. George A, Liu J (1981) Computer solution of large sparse positive definite systems. Prentice Hall, Englewood Cliffs
16. Heath MT, Ng E, Peyton BW (1991) Parallel algorithms for sparse linear systems. SIAM Rev 33(3):420–460
17. Hubler C, Kriegel HP, Borgwardt K, Ghahramani Z (2008) Metropolis algorithms for representative subgraph sampling. In: Eighth IEEE international conference on data mining, ICDM’08. IEEE, pp 283–292
18. Kang U, Meeder B, Papalexakis EE, Faloutsos C (2014) HEigen: spectral analysis for billion-scale graphs. IEEE Trans Knowl Data Eng 26(2):350–362
19. Karypis G, Kumar V (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput 20(1):359–392
20. Karypis G, Kumar V (1999) Parallel multilevel k-way partitioning scheme for irregular graphs. SIAM Rev 41(2):278–300
21. Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 49(2):291–307
22. Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110
23. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLOS ONE 6(4):e18961
24. Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 631–636
25. Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, pp 177–187
26. Leskovec J, Lang KJ, Dasgupta A, Mahoney MW (2008) Statistical properties of community structure in large social and information networks. In: Proceedings of the 17th international conference on world wide web. ACM, pp 695–704
27. Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: Proceedings of the 19th international conference on world wide web. ACM, pp 631–640
28. Leung IX, Hui P, Lio P, Crowcroft J (2009) Towards real-time community detection in large networks. Phys Rev E 79(6):066107
29. Luby M (1986) A simple parallel algorithm for the maximal independent set problem. SIAM J Comput 15(4):1036–1053
30. Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. J Mach Learn Res 8:935–983
31. Maiya AS, Berger-Wolf TY (2010) Sampling community structure. In: Proceedings of the 19th international conference on world wide web. ACM, pp 701–710
32. Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113
33. Niu Q, Lai PW, Faisal SM, Parthasarathy S, Sadayappan P (2014) A fast implementation of MLR-MCL algorithm on multi-core processors. In: 21st annual international conference on high performance computing, HiPC 2014, Goa, India, 17–20 December 2014
34. Ovelgonne M (2013) Distributed community detection in web-scale networks. In: 2013 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM). IEEE, pp 66–73
35. Papadopoulos S, Kompatsiaris Y, Vakali A, Spyridonos P (2012) Community detection in social media. Data Min Knowl Discov 24(3):515–554
36. Parlett BN (1980) The symmetric eigenvalue problem, vol 7. SIAM, Philadelphia
37. Parthasarathy S, Faisal SM (2013) Network clustering. CRC Press, Boca Raton, pp 415–456
38. Parthasarathy S, Ruan Y, Satuluri V (2011) Community discovery in social networks: applications, methods and emerging trends. In: Social network data analytics. Springer, Berlin, pp 79–113
39. Pemmaraju S, Skiena S (2003) Computational discrete mathematics: combinatorics and graph theory with Mathematica. Cambridge University Press, New York
40. Prat-Pérez A, Dominguez-Sal D, Brunat JM, Larriba-Pey JL (2012) Shaping communities out of triangles. In: Proceedings of the 21st ACM international conference on information and knowledge management. ACM, pp 1677–1681
41. Prat-Pérez A, Dominguez-Sal D, Larriba-Pey JL (2014) High quality, scalable and parallel community detection for large real graphs. In: Proceedings of the 23rd international conference on world wide web. International World Wide Web Conferences Steering Committee, pp 225–236
42. Richter Y, Yom-Tov E, Slonim N (2010) Predicting customer churn in mobile networks through analysis of social groups. In: SDM. SIAM, vol 2010, pp 732–741
43. Riedy EJ, Meyerhenke H, Ediger D, Bader DA (2012) Parallel community detection for massive graphs. In: Parallel processing and applied mathematics. Springer, Berlin, pp 286–296
44. Ruan Y, Fuhry D, Parthasarathy S (2013) Efficient community detection in large networks using content and links. In: Proceedings of the 22nd international conference on world wide web. International World Wide Web Conferences Steering Committee, pp 1089–1098
45. Satuluri V, Parthasarathy S (2009) Scalable graph clustering using stochastic flows: applications to community discovery. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 737–746
46. Satuluri V, Parthasarathy S, Ruan Y (2011) Local graph sparsification for scalable clustering. In: Proceedings of the 2011 international conference on management of data. ACM, pp 721–732
47. Soffer SN, Vázquez A (2005) Network clustering coefficient without degree-correlation biases. Phys Rev E 71(5):057101
48. Staudt CL, Meyerhenke H (2013) Engineering high-performance community detection heuristics for massive graphs. In: Proceedings of the 2013 42nd international conference on parallel processing. IEEE Computer Society, pp 180–189
49. Van Dongen SM (2000) Graph clustering by flow simulation. Ph.D. thesis, University of Utrecht
50. Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416
51. Xie J, Kelley S, Szymanski BK (2013) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Comput Surv (CSUR) 45(4):43
52. Yang B, Cheung WK, Liu J (2007) Community mining from signed social networks. IEEE Trans Knowl Data Eng 19(10):1333–1348
53. Yang J, Leskovec J (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the sixth ACM international conference on web search and data mining. ACM, pp 587–596
54. Yang J, McAuley J, Leskovec J (2013) Community detection in networks with node attributes. In: 2013 IEEE 13th international conference on data mining (ICDM). IEEE, pp 1151–1156
55. Yang T, Jin R, Chi Y, Zhu S (2009) Combining link and content for community detection: a discriminative approach. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 927–936
56. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33:452–473

Acknowledgments

We are thankful to the Editors and anonymous reviewers for their valuable comments, insightful suggestions, and constructive feedback, which greatly helped improve this article.

This work is supported by NSF Grants IIS-1111118, CCF-1240651, and DMS-1418265. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, 2015 Neil Avenue, 395 Dreese Lab, Columbus, OH, 43210, USA
Yiye Ruan, David Fuhry, Jiongqian Liang, Yu Wang & Srinivasan Parthasarathy

Corresponding author

Correspondence to Yiye Ruan.

Editor information

Editors and Affiliations

Inst. of Informatics and Telecommunications, Nat. Center for, Ag. Paraskevi, Greece
Georgios Paliouras
Information Technologies Institute, CERTH, Thermi, Thessaloniki, Greece
Symeon Papadopoulos
Institute of Informatics, NCSR Demokritos, Aghia Paraskevi, Greece
Dimitrios Vogiatzis
Informatics and Telematics Institut Centre for Research & Technology He, Thermi-Thessaloniki, Greece
Yiannis Kompatsiaris

About this chapter

Cite this chapter

Ruan, Y., Fuhry, D., Liang, J., Wang, Y., Parthasarathy, S. (2015). Community Discovery: Simple and Scalable Approaches. In: Paliouras, G., Papadopoulos, S., Vogiatzis, D., Kompatsiaris, Y. (eds) User Community Discovery. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-319-23835-7_2

DOI: https://doi.org/10.1007/978-3-319-23835-7_2
Published: 29 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-23834-0
Online ISBN: 978-3-319-23835-7
eBook Packages: Computer Science, Computer Science (R0)
