On the Correctness of the SIMT Execution Model of GPUs

  • Axel Habermaier
  • Alexander Knapp
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7211)


GPUs are becoming a primary source of computing power. They use a single instruction, multiple threads (SIMT) execution model that executes batches of threads in lockstep. If the control flow of threads within the same batch diverges, the different execution paths are scheduled sequentially; once the control flows reconverge, all threads are executed in lockstep again. Several thread-batching mechanisms have been proposed, albeit without establishing their semantic validity or their scheduling properties. To increase the level of confidence in the correctness of GPU-accelerated programs, we formalize the SIMT execution model with a stack-based reconvergence mechanism as an operational semantics and prove its correctness by constructing a simulation between the SIMT semantics and a standard interleaved multi-thread semantics. We also demonstrate that the SIMT execution model produces unfair schedules in some cases, and we discuss the unfairness problem for other batching mechanisms such as dynamic warp formation and a stack-less reconvergence strategy.


Keywords: Execution Model, Program Counter, Disable State, Active Thread, Unfairness Problem
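To illustrate the stack-based reconvergence mechanism the abstract refers to, the following is a minimal executable sketch, not the paper's formal operational semantics. The instruction format, the explicit `join` annotation marking the reconvergence point of a branch, and the `run_warp` function are all assumptions made for this example; real hardware derives reconvergence points from the control-flow graph (e.g. immediate post-dominators) rather than from source annotations.

```python
def run_warp(program, n_threads):
    """Run `program` for a warp of `n_threads` threads in lockstep.

    Illustrative instruction formats (not a real ISA):
      ("set", var, fn)             per-thread assignment, fn(regs) -> value
      ("branch", cond, tgt, join)  jump to tgt where cond(regs) holds;
                                   join is the reconvergence point
      ("jump", tgt)                unconditional jump

    Returns the per-thread register files.
    """
    regs = [{"tid": t} for t in range(n_threads)]
    # Reconvergence stack: entries are (pc, active_mask, reconvergence_pc).
    # Initially all threads are active and "reconverge" at program end.
    stack = [(0, set(range(n_threads)), len(program))]
    while stack:
        pc, mask, rpc = stack.pop()
        while pc < rpc:  # execute until the reconvergence point
            op = program[pc]
            if op[0] == "set":
                _, var, fn = op
                for t in mask:  # only active threads execute in lockstep
                    regs[t][var] = fn(regs[t])
                pc += 1
            elif op[0] == "jump":
                pc = op[1]
            elif op[0] == "branch":
                _, cond, tgt, join = op
                taken = {t for t in mask if cond(regs[t])}
                fallthru = mask - taken
                if taken and fallthru:
                    # Divergence: serialize the two paths. The merged
                    # continuation is pushed first, so it runs last; the
                    # taken path runs after the fall-through path finishes.
                    stack.append((join, mask, rpc))   # merged continuation
                    stack.append((tgt, taken, join))  # taken path
                    pc, mask, rpc = pc + 1, fallthru, join
                elif taken:
                    pc = tgt
                else:
                    pc += 1
    return regs
```

A small divergent program shows the mechanism: a 4-thread warp splits at the branch, each path executes with a partial mask, and all threads execute the final instruction in lockstep again.

```python
prog = [
    ("set", "x", lambda r: r["tid"]),            # 0
    ("branch", lambda r: r["x"] >= 2, 4, 5),     # 1: diverges (threads 2,3 take it)
    ("set", "y", lambda r: 0),                   # 2: x < 2 path
    ("jump", 5),                                 # 3
    ("set", "y", lambda r: 1),                   # 4: x >= 2 path
    ("set", "z", lambda r: r["y"] + 1),          # 5: reconverged, full mask
]
out = run_warp(prog, 4)
```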



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Axel Habermaier (1)
  • Alexander Knapp (1)
  1. Institute for Software and Systems Engineering, University of Augsburg, Germany
