Abstract
Large-scale graph processing plays an increasingly important role in many data-related applications. Recently, GPUs have been adopted to accelerate various graph processing algorithms. However, because the GPU architecture differs greatly from the traditional computing model, the learning curve for developing GPU-based applications is steep. In this paper, we propose a GPU-based parallel graph processing system named GPregel to tackle this challenge. GPregel adopts the BSP (bulk synchronous parallel) model for graph processing, as in Google's Pregel. It uses a lightweight compiler to hide the underlying details of parallel processing and simplify programming, which greatly reduces the difficulty of using the GPU to solve graph computing problems. Moreover, GPregel introduces several performance optimizations: (1) a special storage model for running the BSP model on the GPU, which overcomes execution divergence and irregular memory access through coarse-grained designs; (2) a warp-level optimization strategy, Parallelized-Messages-Sending, and a thread-level optimization strategy, Threads-Merge-Executing, which accelerate the computation of high-degree and low-degree vertices, respectively; (3) an optimized message copy mechanism that uses a shared array and a rolling array to speed up message copying. Experiments demonstrate that GPregel achieves high performance while requiring little effort from developers.
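To make the BSP vertex-centric model mentioned above concrete, the following is a minimal CPU-only sketch in Python of Pregel-style supersteps, using single-source shortest paths as the example. It is not GPregel's API or implementation: the function name `bsp_sssp`, the adjacency-dictionary layout, and the inbox/outbox variables are all assumptions made for illustration, and none of the GPU-specific optimizations described in the abstract are modeled.

```python
import math
from collections import defaultdict


def bsp_sssp(edges, source, num_vertices):
    """Illustrative BSP single-source shortest paths (hypothetical helper).

    edges: dict mapping a vertex id to a list of (neighbor, weight) pairs.
    Returns (distances, number_of_supersteps).
    """
    dist = [math.inf] * num_vertices
    # Messages waiting to be consumed in the upcoming superstep.
    inbox = {source: [0.0]}
    supersteps = 0

    # The computation halts when no messages are in flight.
    while inbox:
        outbox = defaultdict(list)

        # Compute phase: each vertex that received messages runs its
        # vertex program independently of the others.
        for v, msgs in inbox.items():
            best = min(msgs)
            if best < dist[v]:
                dist[v] = best
                # Send the improved distance along every outgoing edge.
                for u, w in edges.get(v, []):
                    outbox[u].append(best + w)

        # Barrier: messages sent in this superstep become visible
        # only in the next one.
        inbox = outbox
        supersteps += 1

    return dist, supersteps


if __name__ == "__main__":
    graph = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 2.0)], 2: [(3, 1.0)]}
    print(bsp_sssp(graph, source=0, num_vertices=5))
```

In this sketch, swapping `inbox` for `outbox` plays the role of the synchronization barrier between supersteps; a GPU system would run the per-vertex loop body in parallel threads instead of a sequential `for` loop.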
Acknowledgements
This paper is supported in part by the National Natural Science Foundation of China under Grant No. 61472454.
Cite this article
Lai, S., Lai, G., Lu, F. et al. A BSP model graph processing system on many cores. Cluster Comput 20, 1359–1377 (2017). https://doi.org/10.1007/s10586-017-0829-0
DOI: https://doi.org/10.1007/s10586-017-0829-0