PEGASUS: mining peta-scale graphs

Kang, U; Tsourakakis, Charalampos E.; Faloutsos, Christos

doi:10.1007/s10115-010-0305-0

Regular Paper
Published: 08 June 2010

Volume 27, pages 303–325, (2011)
Knowledge and Information Systems

Abstract

In this paper, we describe PeGaSus, an open source Peta Graph Mining library that performs typical graph mining tasks such as computing the diameter of a graph, computing the radius of each node, finding the connected components, and computing the importance scores of nodes. As graphs grow to several giga-, tera-, or petabytes, the need for such a library grows too. To the best of our knowledge, PeGaSus is the first such library; it is implemented on top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially repeated matrix-vector multiplications. We describe a very important primitive for PeGaSus, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) performance more than 5 times faster than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.
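
To make the idea of a generalized iterated matrix-vector multiplication concrete, the following is a minimal single-machine sketch in Python. It is not the PeGaSus code and does not use Hadoop. It assumes that the multiplication is split into three pluggable operations, here called `combine2`, `combine_all`, and `assign`; that naming follows our reading of the full paper and is not stated in the abstract, so treat it, along with the helpers `gimv_step` and `iterate`, as illustrative. Swapping the three operations turns the same loop into an ordinary matrix-vector product or into connected-component labeling.

```python
from collections import defaultdict


def gimv_step(edges, vector, combine2, combine_all, assign):
    """Apply one generalized matrix-vector multiplication.

    edges:  list of (i, j, m_ij) triples, the nonzero entries of the matrix
    vector: dict mapping each node id to its current value
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        # Pair a matrix entry with the matching vector entry.
        partial[i].append(combine2(m_ij, vector[j]))
    # Reduce each row's partial results, then decide the new value per node.
    return {i: assign(v_i, combine_all(partial[i])) for i, v_i in vector.items()}


def iterate(edges, vector, ops, max_iters=100):
    """Repeat gimv_step until the vector stops changing (or max_iters)."""
    for _ in range(max_iters):
        new_vector = gimv_step(edges, vector, *ops)
        if new_vector == vector:
            break
        vector = new_vector
    return vector


# Instantiation 1: ordinary matrix-vector product (multiply, sum, overwrite).
matvec_ops = (lambda m, v: m * v, sum, lambda old, new: new)

# Instantiation 2: connected components by min-label propagation.
component_ops = (
    lambda m, v: v,                             # pass the neighbor's label
    lambda xs: min(xs, default=float("inf")),   # smallest neighbor label
    min,                                        # keep the smaller of old and new
)

# Toy undirected graph with two components: {0, 1, 2} and {3, 4}.
undirected = [(0, 1), (1, 2), (3, 4)]
edges = [(i, j, 1) for i, j in undirected] + [(j, i, 1) for i, j in undirected]

labels = iterate(edges, {n: n for n in range(5)}, component_ops)
product = gimv_step([(0, 1, 2.0), (1, 0, 3.0)], {0: 1.0, 1: 4.0}, *matvec_ops)
print(labels)
print(product)
```

In this sketch every step touches each edge exactly once, which is consistent with the linear-in-edges running time reported above, although the sketch says nothing about the optimizations that give the reported speedup. The per-edge `combine2` calls are independent of one another and the per-row `combine_all` is a grouped reduction, which is the shape one would expect to map onto a MapReduce job.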

Author information

Authors and Affiliations

School of Computer Science, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
U Kang, Charalampos E. Tsourakakis & Christos Faloutsos

Corresponding author

Correspondence to U Kang.

About this article

Cite this article

Kang, U., Tsourakakis, C.E. & Faloutsos, C. PEGASUS: mining peta-scale graphs. Knowl Inf Syst 27, 303–325 (2011). https://doi.org/10.1007/s10115-010-0305-0

Received: 08 January 2010
Revised: 24 March 2010
Accepted: 11 April 2010
Published: 08 June 2010
Issue Date: May 2011
DOI: https://doi.org/10.1007/s10115-010-0305-0
