Knowledge and Information Systems, Volume 27, Issue 2, pp 303–325

PEGASUS: mining peta-scale graphs

  • U Kang
  • Charalampos E. Tsourakakis
  • Christos Faloutsos
Regular Paper

Abstract

In this paper, we describe PeGaSus, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, finding the connected components, and computing the importance score of nodes. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the need for such a library grows as well. To the best of our knowledge, PeGaSus is the first such library implemented on top of the Hadoop platform, the open source version of MapReduce. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components, etc.) are essentially a repeated matrix-vector multiplication. In this paper, we describe an important primitive for PeGaSus, called GIM-V (generalized iterated matrix-vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up with the number of available machines, (b) running time linear in the number of edges, and (c) more than five times faster performance than the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ≈ 6.7 billion edges.
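
The idea that several graph mining tasks reduce to a repeated, generalized matrix-vector multiplication can be illustrated with a minimal single-machine sketch. It is not the PeGaSus or Hadoop implementation. The abstract does not give the primitive's interface, so the function names (`gim_v`, `connected_components`) and the three pluggable operations (`combine2`, `combine_all`, `assign`) are assumptions made here for illustration. Plugging in multiply and sum recovers an ordinary matrix-vector product; plugging in a minimum propagates the smallest node id through each connected component.

```python
from collections import defaultdict


def gim_v(edges, v, combine2, combine_all, assign):
    """One generalized matrix-vector step (illustrative sketch only).

    edges: iterable of (i, j, m_ij), the nonzero entries of the matrix.
    v:     dict mapping node id -> current vector value.
    The three callables are hypothetical hooks, not a library API.
    """
    partial = defaultdict(list)
    for i, j, m_ij in edges:
        # Combine one matrix entry with the matching vector entry.
        partial[i].append(combine2(m_ij, v[j]))
    # Reduce the partial results per row, then merge with the old value.
    return {i: assign(v[i], combine_all(partial.get(i, []))) for i in v}


def connected_components(undirected_edges, nodes, max_iters=100):
    """Label each node with the smallest node id in its component."""
    # Store every undirected edge in both directions with weight 1.
    edges = [(a, b, 1) for a, b in undirected_edges]
    edges += [(b, a, 1) for a, b in undirected_edges]
    v = {n: n for n in nodes}  # each node starts as its own component
    for _ in range(max_iters):
        new_v = gim_v(
            edges,
            v,
            combine2=lambda m_ij, v_j: m_ij * v_j,  # neighbour's label
            combine_all=lambda xs: min(xs, default=float("inf")),
            assign=min,  # keep the smaller of old and new label
        )
        if new_v == v:  # labels stopped changing
            return new_v
        v = new_v
    return v


if __name__ == "__main__":
    # Ordinary matrix-vector product as a special case: [[1, 2], [3, 4]] * [1, 1].
    dense = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
    print(gim_v(dense, {0: 1, 1: 1},
                combine2=lambda m, x: m * x,
                combine_all=sum,
                assign=lambda old, new: new))

    # Two components: {0, 1, 2} and {3, 4}.
    print(connected_components([(0, 1), (1, 2), (3, 4)], range(5)))
```

In this sketch only the three small operations change between tasks, while the iteration loop and the pass over the edge list stay the same, which is the property that makes a single optimized primitive attractive.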

Keywords

PEGASUS, Graph mining, GIM-V, Generalized iterative matrix-vector multiplication, Hadoop

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • U Kang (1)
  • Charalampos E. Tsourakakis (1)
  • Christos Faloutsos (1)

  1. School of Computer Science, Department of Computer Science, Carnegie Mellon University, Pittsburgh, USA
