Exploitation of GPUs for the Parallelisation of Probably Parallel Legacy Code

  • Zheng Wang
  • Daniel Powell
  • Björn Franke
  • Michael O’Boyle
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8409)


General-purpose GPUs provide massive compute power, but are notoriously difficult to program. In this paper we present a complete compilation strategy to exploit GPUs for the parallelisation of sequential legacy code. Using a hybrid data dependence analysis combining static and dynamic information, our compiler automatically detects suitable parallelism and generates parallel OpenCL code from sequential programs. We exploit the fact that dependence profiling provides us with parallel loop candidates that are highly likely to be genuinely parallel, but cannot be statically proven so. For the efficient GPU parallelisation of those probably parallel loop candidates, we propose a novel software speculation scheme, which ensures correctness for the unlikely, yet possible, case of dynamically detected dependence violations. Our scheme operates in place and supports speculative read and write operations. We demonstrate the effectiveness of our approach in detecting and exploiting parallelism using sequential codes from the NAS benchmark suite. We achieve an average speedup of 3.2x, and up to 99x, over the sequential baseline. On average, this is 1.42 times faster than state-of-the-art speculation schemes and corresponds to 99% of the performance level of a manual GPU implementation developed by independent expert programmers.
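To make the abstract's speculation scheme concrete, the following is a minimal, purely illustrative C sketch of in-place speculative loop execution with run-time dependence checking: each iteration records its read and write footprints, a post-pass detects cross-iteration read-after-write conflicts, and on misspeculation the array is rolled back from a checkpoint and the loop is re-executed sequentially. This is not the paper's implementation; the loop body, the `idx` indirection, and all names are invented for illustration.

```c
#include <string.h>

#define N 8

/* Hypothetical loop body: a[i] = a[idx[i]] + 1.
   With idx[i] == i the iterations are independent ("probably parallel");
   an aliasing idx introduces a cross-iteration dependence. */

/* Per-iteration access logs: a toy stand-in for read/write tracking. */
static int read_log[N], write_log[N];

static void body(int *a, const int *idx, int i) {
    read_log[i]  = idx[i];     /* which element this iteration reads  */
    write_log[i] = i;          /* which element this iteration writes */
    a[i] = a[idx[i]] + 1;      /* speculative in-place update         */
}

/* Returns 1 if some iteration j read an element written by an earlier
   iteration i < j, i.e. a cross-iteration dependence was violated. */
static int violated(void) {
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < j; ++i)
            if (read_log[j] == write_log[i]) return 1;
    return 0;
}

/* Returns 1 if speculation succeeded, 0 if it had to roll back. */
int speculative_loop(int *a, const int *idx) {
    int backup[N];
    memcpy(backup, a, sizeof backup);    /* checkpoint for rollback   */

    for (int i = 0; i < N; ++i)          /* speculative "parallel" run */
        body(a, idx, i);

    if (violated()) {                    /* misspeculation detected    */
        memcpy(a, backup, sizeof backup);
        for (int i = 0; i < N; ++i)      /* safe sequential re-run     */
            a[i] = a[idx[i]] + 1;
        return 0;
    }
    return 1;
}
```

The checkpoint-and-rollback step is what "operates in place" refers to: speculative writes go directly to the original array rather than to a private copy, so only misspeculation pays a recovery cost.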


Keywords: GPU · OpenCL · Parallelization · Thread-Level Speculation


Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Zheng Wang (1)
  • Daniel Powell (2)
  • Björn Franke (2)
  • Michael O’Boyle (2)

  1. School of Computing and Communications, Lancaster University, United Kingdom
  2. School of Informatics, University of Edinburgh, United Kingdom