Static detection of uncoalesced accesses in GPU programs

Abstract

GPU programming has become popular due to the high computational capabilities of GPUs. Obtaining significant performance gains on a GPU is challenging, however, and the programmer needs to be aware of various subtleties of the GPU architecture. One such subtlety lies in accessing GPU memory, where certain access patterns can lead to poor performance. Such access patterns are referred to as uncoalesced global memory accesses. This work presents a lightweight compile-time static analysis to identify such accesses in GPU programs. The analysis relies on a novel abstraction that tracks the access pattern across multiple threads. The abstraction enables quick prediction while providing correctness guarantees. We have implemented the analysis in LLVM and compare it against a dynamic analysis implementation. The static analysis identifies 95 pre-existing uncoalesced accesses in Rodinia, a popular benchmark suite of GPU programs, and finishes within seconds for most programs, whereas the dynamic analysis finds 69 accesses and takes orders of magnitude longer to finish.
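
To make the notion of an uncoalesced access concrete, the following is a minimal sketch that counts how many memory segments one warp touches for a given access pattern. It is not the static analysis described above: it enumerates thread ids by brute force, which is closer in spirit to a dynamic check. The warp size of 32, the 128-byte segment, the 4-byte element size, and all function names are assumptions made for illustration only.

```python
# Illustrative model only; all constants below are assumptions,
# not values taken from the article.
WARP_SIZE = 32        # assumed number of threads per warp
SEGMENT_BYTES = 128   # assumed size of one global-memory transaction
ELEM_BYTES = 4        # assumed element size (e.g. a 32-bit float)


def segments_touched(index_of, warp_size=WARP_SIZE):
    """Count the distinct memory segments accessed by one warp.

    index_of maps a thread id to the array index that thread accesses.
    Fewer segments means the accesses coalesce into fewer transactions.
    """
    return len({(index_of(tid) * ELEM_BYTES) // SEGMENT_BYTES
                for tid in range(warp_size)})


def coalesced_index(tid):
    # Adjacent threads read adjacent elements, as in a[tid].
    return tid


def strided_index(tid, row_length=1024):
    # Adjacent threads read elements one full row apart, as in a[tid * N].
    return tid * row_length


if __name__ == "__main__":
    print("a[tid]     ->", segments_touched(coalesced_index), "segment(s)")
    print("a[tid * N] ->", segments_touched(strided_index), "segment(s)")
```

Under these assumptions the first pattern falls into a single segment, while the strided pattern needs a separate segment for every thread in the warp. Patterns of the second kind are the ones the abstract refers to as uncoalesced.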

Author information

Corresponding author

Correspondence to Nimit Singhania.

About this article

Cite this article

Alur, R., Devietti, J., Leija, O.S.N. et al. Static detection of uncoalesced accesses in GPU programs. Form Methods Syst Des (2021). https://doi.org/10.1007/s10703-021-00362-8

  • DOI: https://doi.org/10.1007/s10703-021-00362-8

Keywords

  • GPU performance
  • Static analysis
  • Abstract execution