PEGASUS: mining peta-scale graphs

Abstract

In this paper, we describe PeGaSus, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, finding the connected components, and computing the importance score of nodes. As graphs grow to several giga-, tera-, or petabytes, the need for such a library grows too. To the best of our knowledge, PeGaSus is the first such library implemented on top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially a repeated matrix-vector multiplication. In this paper, we describe a very important primitive for PeGaSus, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) performance more than 5 times faster than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.
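To illustrate the claim that PageRank and connected components reduce to a repeated matrix-vector multiplication, here is a minimal single-machine sketch in Python. It is not the PeGaSus code and does not use Hadoop. The split of one multiplication step into three pluggable functions, and the names `gimv_step`, `iterate`, `combine2`, `combine_all`, and `assign`, are assumptions made for this illustration; the abstract itself only states that the primitive is a generalized, iterated matrix-vector product.

```python
from collections import defaultdict

def gimv_step(entries, vec, combine2, combine_all, assign):
    """One generalized matrix-vector step over sparse entries (i, j, m_ij)."""
    partial = defaultdict(list)
    for i, j, m in entries:
        # Generalizes the product m_ij * v_j.
        partial[i].append(combine2(m, vec[j]))
    # combine_all generalizes the row sum; assign merges old and new values.
    return {i: assign(vec[i], combine_all(partial[i])) for i in vec}

def iterate(entries, vec, combine2, combine_all, assign, max_iter=100, tol=1e-9):
    """Repeat the step until no value changes by more than tol."""
    for _ in range(max_iter):
        new = gimv_step(entries, vec, combine2, combine_all, assign)
        if all(abs(new[i] - vec[i]) <= tol for i in vec):
            return new
        vec = new
    return vec

# Connected components: every node repeatedly adopts the smallest
# label seen among itself and its neighbours.
undirected = [(0, 1), (1, 2), (3, 4)]
cc_entries = [(a, b, 1) for a, b in undirected] + [(b, a, 1) for a, b in undirected]
labels = iterate(
    cc_entries,
    {i: i for i in range(5)},
    combine2=lambda m, v: m * v,
    combine_all=lambda xs: min(xs, default=float("inf")),
    assign=min,
)
print(labels)  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}

# PageRank on a small graph with no dangling nodes.
# Entry (dst, src, 1 / outdegree(src)) gives a column-normalized matrix.
directed = [(0, 1), (0, 2), (1, 2), (2, 0)]  # (src, dst)
n, damping = 3, 0.85
outdeg = defaultdict(int)
for src, _ in directed:
    outdeg[src] += 1
pr_entries = [(dst, src, 1.0 / outdeg[src]) for src, dst in directed]
ranks = iterate(
    pr_entries,
    {i: 1.0 / n for i in range(n)},
    combine2=lambda m, v: damping * m * v,
    combine_all=lambda xs: (1.0 - damping) / n + sum(xs),
    assign=lambda old, new: new,
)
print(ranks)
```

Only the three small functions differ between the two tasks, while the loop over matrix entries stays the same. In a MapReduce setting that shared loop is the part that would be distributed, which is presumably why optimizing a single primitive benefits every algorithm built on it.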

Author information

Correspondence to U Kang.

About this article

Cite this article

Kang, U., Tsourakakis, C.E. & Faloutsos, C. PEGASUS: mining peta-scale graphs. Knowl Inf Syst 27, 303–325 (2011) doi:10.1007/s10115-010-0305-0

Keywords

  • PEGASUS
  • Graph mining
  • GIM-V
  • Generalized iterated matrix-vector multiplication
  • Hadoop