SGraph: A Distributed Streaming System for Processing Big Graphs
Big graph processing is widely used in computational domains ranging from language modeling to social networks. Graph-parallel systems have been proposed to process such graphs on clusters with up to hundreds of nodes. However, the size of a big graph often exceeds the main memory available in a small cluster, and as a consequence task failures happen frequently. To address this problem, we propose SGraph, a distributed streaming graph processing system built on top of Spark. SGraph introduces a streaming data model that avoids loading the entire graph at once, since the full graph data may exceed the available RAM. In addition, SGraph leverages an edge-centric scatter-gather computing model in which graph algorithms can be implemented conveniently. Experiments demonstrate that SGraph can process graphs with up to 1.5 billion edges on small clusters of several low-cost commodity PCs, whereas existing systems may require tens or hundreds of high-end machines. Furthermore, SGraph is up to 2.3 times faster than existing systems.
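To make the two ideas in the abstract concrete, the following is a minimal single-process sketch of an edge-centric scatter-gather PageRank that reads edges in small chunks instead of holding the whole edge list in memory. It is not SGraph's implementation or API, which the abstract does not show: the names `stream_edges`, `scatter`, `gather`, and `pagerank`, the chunk size, and the damping factor are all assumptions chosen for illustration, and the sketch uses neither Spark nor any distribution across machines.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Edge = Tuple[int, int]          # (source vertex, destination vertex)
Update = Tuple[int, float]      # (destination vertex, contribution)


def stream_edges(edges: Iterable[Edge], chunk_size: int) -> Iterator[List[Edge]]:
    """Yield edges in fixed-size chunks so only one chunk is resident at a time."""
    chunk: List[Edge] = []
    for edge in edges:
        chunk.append(edge)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def scatter(chunk: List[Edge], rank: List[float], out_degree: List[int]) -> List[Update]:
    """Edge-centric scatter: each edge emits an update for its destination."""
    return [(dst, rank[src] / out_degree[src]) for src, dst in chunk]


def gather(updates: List[Update], incoming: List[float]) -> None:
    """Gather: fold the streamed updates into per-vertex accumulators."""
    for dst, value in updates:
        incoming[dst] += value


def pagerank(edge_source: Callable[[], Iterable[Edge]], num_vertices: int,
             iterations: int = 20, damping: float = 0.85,
             chunk_size: int = 2) -> List[float]:
    """Hypothetical driver: re-streams the edges once per iteration.

    Only vertex state (rank, out_degree, incoming) stays in memory.
    Vertices without outgoing edges simply lose their rank mass here;
    a real system would handle them explicitly.
    """
    out_degree = [0] * num_vertices
    for chunk in stream_edges(edge_source(), chunk_size):
        for src, _ in chunk:
            out_degree[src] += 1

    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        incoming = [0.0] * num_vertices
        for chunk in stream_edges(edge_source(), chunk_size):
            gather(scatter(chunk, rank, out_degree), incoming)
        rank = [(1.0 - damping) / num_vertices + damping * incoming[v]
                for v in range(num_vertices)]
    return rank


if __name__ == "__main__":
    # A three-vertex cycle; edge_source would normally re-read a file or partition.
    cycle = [(0, 1), (1, 2), (2, 0)]
    print(pagerank(lambda: iter(cycle), num_vertices=3))
```

In this sketch the memory footprint is governed by the vertex arrays plus one chunk of edges, which is the property the streaming data model is meant to provide; a distributed version would additionally have to partition vertices and route updates between machines.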
Keywords: Distributed computing, Graph processing, Streaming
We appreciate the reviewers' comments and the efforts of open-source contributors. This work is supported by the National Natural Science Foundation of China-Guangdong Government Joint Funding (2nd) for Super Computer Application Research and the Hong Kong GRF 2150851.