Optimistic Parallelism on GPUs

  • Min Feng
  • Rajiv Gupta
  • Laxmi N. Bhuyan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8967)


We present speculative parallelization techniques that can exploit parallelism in loops even in the presence of dynamic irregularities that may give rise to cross-iteration dependences. The execution of a speculatively parallelized loop consists of five phases: scheduling, computation, misspeculation check, result committing, and misspeculation recovery. While the first two phases enable the exploitation of data parallelism, the latter three represent the overhead costs of using speculation. We perform the misspeculation check on the GPU to minimize its cost, and we perform result committing and misspeculation recovery on the CPU to reduce the result copying and recovery overhead. The scheduling policies are designed to reduce the misspeculation rate. Our programming model provides an API through which programmers can give hints about potential misspeculations, reducing the cost of detecting them. Our experiments yielded speedups of 3.62x to 13.76x on an NVIDIA Tesla C1060 hosted in an Intel Xeon E5540 machine.
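The five phases named in the abstract can be illustrated with a small CPU-only simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function `run_speculative`, the `read`/`write` callbacks, the fixed-size batching, and the policy of re-running everything from the first conflicting iteration are all hypothetical choices made for illustration. In the paper's setting, the computation and the misspeculation check would run on the GPU, while committing and recovery would run on the CPU.

```python
from typing import Callable, List


def run_speculative(data: List[int], n_iters: int, body: Callable,
                    batch_size: int = 4) -> int:
    """Run body(i, read, write) for i in range(n_iters) speculatively.

    Returns the number of iterations re-executed during recovery.
    All names here are hypothetical, chosen for this sketch only.
    """
    reexecuted = 0
    for start in range(0, n_iters, batch_size):
        # Phase 1, scheduling: choose the next batch of iterations.
        batch = range(start, min(start + batch_size, n_iters))
        snapshot = list(data)  # state every speculative iteration starts from
        logs = []

        # Phase 2, computation: iterations do not see each other's writes,
        # so each one could be a separate GPU thread. Writes are buffered.
        for i in batch:
            reads, writes = set(), {}

            def read(idx, reads=reads, writes=writes):
                if idx in writes:  # reading our own write is not speculative
                    return writes[idx]
                reads.add(idx)
                return snapshot[idx]

            def write(idx, val, writes=writes):
                writes[idx] = val

            body(i, read, write)
            logs.append((reads, writes))

        # Phase 3, misspeculation check: find the first iteration that read
        # a location written by an earlier iteration of the same batch.
        written = set()
        first_bad = len(logs)
        for k, (reads, writes) in enumerate(logs):
            if reads & written:
                first_bad = k
                break
            written.update(writes)

        # Phase 4, result committing: apply buffered writes in iteration order.
        for _, writes in logs[:first_bad]:
            for idx, val in writes.items():
                data[idx] = val

        # Phase 5, misspeculation recovery: re-run the rest of the batch
        # sequentially, directly on the committed state.
        for i in batch[first_bad:]:
            body(i, data.__getitem__, data.__setitem__)
            reexecuted += 1
    return reexecuted


# Irregular loop: a[i] = a[src[i]] + 1, where src is only known at run time.
src = [0, 1, 2, 2, 4, 5, 6, 3]


def body(i, read, write):
    write(i, read(src[i]) + 1)


data = [0] * 8
redo = run_speculative(data, 8, body, batch_size=4)
```

In this example only iteration 3 depends on an earlier iteration of its own batch (it reads the element that iteration 2 writes), so a single iteration is re-executed and `data` ends up equal to the result of a plain sequential loop. Iteration 7 reads an element written in the previous batch, which is already committed, so it does not misspeculate.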



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. NEC Laboratories America, Princeton, USA
  2. University of California, Riverside, USA
