LightHouse: An Automatic Code Generator for Graph Algorithms on GPUs

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10136)

Abstract

We propose LightHouse, a GPU code generator for the graph language Green-Marl, for which a multicore CPU backend already exists. This allows a user to seamlessly generate both multicore and GPU code from the same specification of a graph algorithm. The restriction of not modifying the language poses several challenges, since we must work with the language's existing abstract syntax tree, which is not tailored to GPUs. LightHouse overcomes these challenges with optimizations such as reducing the number of atomic operations and collapsing loops. We illustrate its effectiveness by generating efficient CUDA code for four graph analytic algorithms and comparing its performance against the multicore OpenMP versions generated by Green-Marl. In particular, the generated CUDA code performs comparably to 4- to 64-threaded OpenMP versions, depending on the algorithm.
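To make the atomic-reduction idea concrete, the following is a minimal, hypothetical CUDA sketch (not LightHouse's actual generated output): an edge-parallel shortest-path relaxation kernel in which the expensive atomicMin is issued only after an ordinary read shows the update can still improve the destination's distance. The graph layout, kernel name, and host driver loop are illustrative assumptions.

// Guarded-atomic edge relaxation: a sketch of reducing the number of atomics.
#include <cstdio>
#include <climits>
#include <cuda_runtime.h>

__global__ void relaxEdges(const int *src, const int *dst, const int *wt,
                           int *dist, int numEdges, int *changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;

    int du = dist[src[e]];
    if (du == INT_MAX) return;              // edge source not reached yet
    int alt = du + wt[e];

    // Guard: pay for the atomic only when the tentative distance can still win.
    if (alt < dist[dst[e]]) {
        int old = atomicMin(&dist[dst[e]], alt);
        if (alt < old) *changed = 1;        // request another relaxation round
    }
}

int main() {
    // Tiny 3-vertex, 3-edge example: 0->1 (w=4), 0->2 (w=1), 2->1 (w=2).
    const int V = 3, E = 3;
    int hSrc[E] = {0, 0, 2}, hDst[E] = {1, 2, 1}, hWt[E] = {4, 1, 2};
    int hDist[V] = {0, INT_MAX, INT_MAX};

    int *dSrc, *dDst, *dWt, *dDist, *dChanged;
    cudaMalloc(&dSrc, E * sizeof(int));  cudaMalloc(&dDst, E * sizeof(int));
    cudaMalloc(&dWt, E * sizeof(int));   cudaMalloc(&dDist, V * sizeof(int));
    cudaMalloc(&dChanged, sizeof(int));
    cudaMemcpy(dSrc, hSrc, E * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dDst, hDst, E * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dWt, hWt, E * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dDist, hDist, V * sizeof(int), cudaMemcpyHostToDevice);

    int hChanged = 1;
    while (hChanged) {                      // Bellman-Ford style fixed point
        hChanged = 0;
        cudaMemcpy(dChanged, &hChanged, sizeof(int), cudaMemcpyHostToDevice);
        relaxEdges<<<(E + 255) / 256, 256>>>(dSrc, dDst, dWt, dDist, E, dChanged);
        cudaMemcpy(&hChanged, dChanged, sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaMemcpy(hDist, dDist, V * sizeof(int), cudaMemcpyDeviceToHost);
    for (int v = 0; v < V; ++v) printf("dist[%d] = %d\n", v, hDist[v]);
    cudaFree(dSrc); cudaFree(dDst); cudaFree(dWt); cudaFree(dDist); cudaFree(dChanged);
    return 0;
}

The guard does not affect correctness, because atomicMin re-checks the value atomically; it only skips atomics that could not possibly lower the stored distance.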

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

IIT Madras, Chennai, India
