
RT-CUDA: A Software Tool for CUDA Code Restructuring


Abstract

Recent developments in graphics processing units (GPUs) have opened the challenge of harnessing their computing power as a new general-purpose computing paradigm. However, porting applications to CUDA remains a challenge for average programmers, who have to package code in separate functions, explicitly manage data transfers between the host and device memories, and manually optimize GPU memory utilization. In this paper, we propose a restructuring tool (RT-CUDA) that takes a C-like program and some user directives as compiler hints and produces optimized CUDA code. The tool's strategy is based on efficient management of the memory system: it minimizes data motion by managing transfers between host and device, maximizes bandwidth for device memory accesses, and enhances data locality and reuse of cached data using shared memory and registers. Enhanced resource utilization is achieved by rewriting code as parametric kernels and by efficient auto-tuning. The tool enables calling numerical libraries (cuBLAS, cuSPARSE, etc.) to help implement applications in scientific simulation such as iterative linear algebra solvers. For these applications, the tool implements an inter-block global synchronization that allows execution to proceed across iterations, which helps balance the load and avoid polling. RT-CUDA has been evaluated using a variety of basic linear algebra operators (Madd, MM, MV, VV, etc.) as well as iterative solvers for systems of linear equations such as the Jacobi and Conjugate Gradient algorithms. Significant speedup has been achieved over other compilers, such as the PGI OpenACC and GPGPU compilers, for the above applications. The evaluation shows that the generated kernels efficiently call math libraries and enable the implementation of complete iterative solvers. The tool helps scientists develop parallel simulators, such as reservoir simulators and molecular dynamics codes, without exposing them to the complexity of GPU and CUDA programming. We have a partnership with a group of researchers at Saudi Aramco, a national company in Saudi Arabia, and RT-CUDA is currently being explored by this group as a potential development tool for applications involving linear algebra solvers. In addition, RT-CUDA is being used by senior and graduate students at King Fahd University of Petroleum and Minerals in their projects as part of its continuous enhancement.


Notes

  1. https://sites.google.com/site/ayazresearch/research.

  2. https://sites.google.com/site/ayazresearch/research.

References

  1. Beyer, J.C., Stotzer, E.J., Hart, A., de Supinski, B.R.: OpenMP for accelerators. In: Chapman, B.M., Gropp, W.D., Kumaran, K., Müller, M.S. (eds.) OpenMP in the Petascale Era, IWOMP, Lecture Notes in Computer Science, pp. 108–121. Springer, Berlin (2011)

  2. Bondhugula, U., Hartono, A., Ramanujam, J., Sadayappan, P.: A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not. 43(6), 101–113 (2008). doi:10.1145/1379022.1375595

  3. van den Braak, G., Mesman, B., Corporaal, H.: Compile-time GPU memory access optimizations. In: Embedded Computer Systems (SAMOS), 2010 International Conference on, pp. 200–207 (2010). doi:10.1109/ICSAMOS.2010.5642066

  4. Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M., Hanrahan, P.: Brook for GPUs: stream computing on graphics hardware. ACM Trans. Graph. 23(3), 777–786 (2004). doi:10.1145/1015706.1015800

  5. Chen, C., Chame, J., Hall, M.: Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In: International Symposium on Code Generation and Optimization (2005)

  6. Daemen, J., Rijmen, V.: The Design of Rijndael: AES—the Advanced Encryption Standard. Springer, Berlin (2002)

  7. Dagum, L., Menon, R.: OpenMP: an industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998). doi:10.1109/99.660313

  8. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011). doi:10.1145/2049662.2049663

  9. Ershov, A.P.: On programming of arithmetic operations. Commun. ACM 1(8), 3–6 (1958). doi:10.1145/368892.368907

  10. Farivar, R., Campbell, R.: PLASMA: shared memory dynamic allocation and bank-conflict-free access in GPUs. In: Parallel Processing Workshops (ICPPW), 2012 41st International Conference on, pp. 612–613 (2012). doi:10.1109/ICPPW.2012.94

  11. Gebhart, M., Johnson, D.R., Tarjan, D., Keckler, S.W., Dally, W.J., Lindholm, E., Skadron, K.: A hierarchical thread scheduler and register file for energy-efficient throughput processors. ACM Trans. Comput. Syst. 30(2), 8:1–8:38 (2012). doi:10.1145/2166879.2166882

  12. Gray, A., Sjostrom, A., Ilieva-Litova, N.: Best practice mini-guide accelerated clusters: using general purpose GPUs. Tech. rep., University of Warsaw (2013)

  13. Ha, P.H., Tsigas, P., Anshus, O.J.: The synchronization power of coalesced memory accesses. In: Taubenfeld, G. (ed.) DISC, Lecture Notes in Computer Science, vol. 5218, pp. 320–334. Springer, Berlin (2008)

  14. Han, T.D., Abdelrahman, T.S.: hiCUDA: a high-level directive-based language for GPU programming. In: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-2, pp. 52–61. ACM, New York (2009). doi:10.1145/1513895.1513902

  15. Harris, M.: Optimizing Parallel Reduction in CUDA. Tech. rep., NVIDIA (2008). http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

  16. Hennessy, J., Patterson, D., Asanović, K.: Computer Architecture: A Quantitative Approach. Morgan Kaufmann/Elsevier, Los Altos/Amsterdam (2012)

  17. Ikeda, T., Ino, F., Hagihara, K.: A code motion technique for accelerating general-purpose computation on the GPU. In: Proceedings of the International Parallel and Distributed Processing Symposium, pp. 1–10 (2006)

  18. Jackson, A., Agathokleous, O.: Dynamic loop parallelisation. CoRR arXiv:1205.2367 (2012)

  19. Kasichayanula, K., Terpstra, D., Luszczek, P., Tomov, S., Moore, S., Peterson, G.: Power aware computing on GPUs. In: Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on, pp. 64–73 (2012). doi:10.1109/SAAHPC.2012.26

  20. Khan, A., Al-Mouhamed, M., Fatayar, A., Almousa, A., Baqais, A., Assayony, M.: Padding free bank conflict resolution for CUDA-based matrix transpose algorithm. In: Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2014 15th IEEE/ACIS International Conference on, pp. 1–6 (2014). doi:10.1109/SNPD.2014.6888709

  21. Khan, A.H., Al-Mouhamed, M., Fatayer, A., Mohammad, N.: Optimizing the matrix multiply using Strassen and Winograd algorithms with limited recursions on many-core. To appear in International Journal of Parallel Programming (IJPP) (2015)

  22. Khan, M., Basu, P., Rudy, G., Hall, M., Chen, C., Chame, J.: A script-based autotuning compiler system to generate high-performance CUDA code. ACM Trans. Archit. Code Optim. 9(4), 31:1–31:25 (2013). doi:10.1145/2400682.2400690

  23. Lee, S., Eigenmann, R.: OpenMPC: extended openMP for efficient programming and tuning on GPUs. Int. J. Comput. Sci. Eng. (IJCSE) 8(1), 4–20 (2013)

  24. Lee, S., Min, S.J., Eigenmann, R.: OpenMP to GPGPU: a compiler framework for automatic translation and optimization. SIGPLAN Not. 44(4), 101–110 (2009). doi:10.1145/1594835.1504194

  25. Leung, A., Vasilache, N., Meister, B., Baskaran, M.M., Wohlford, D., Bastoul, C., Lethin, R.: A mapping path for multi-gpgpu accelerated computers from a portable high level programming abstraction. In: Proceedings of 3rd Workshop on General Purpose Processing on Graphics Processing Units (GPGPU), pp. 51–61 (2010)

  26. Liao, S.W., Du, Z., Wu, G., Lueh, G.Y.: Data and computation transformations for brook streaming applications on multiprocessors. In: Fourth IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 196–207 (2006)

  27. Liu, W., Vinter, B.: An efficient GPU general sparse matrix-matrix multiplication for irregular data. In: Proceedings of the IEEE 28th International Symposium on Parallel Distributed Processing, IPDPS14 (2014)

  28. Murthy, G., Ravishankar, M., Baskaran, M., Sadayappan, P.: Optimal loop unrolling for GPGPU programs. In: Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1–11 (2010). doi:10.1109/IPDPS.2010.5470423

  29. Nugteren, C.: Improving the programmability of GPU architectures. Ph.D. thesis, Department of Electrical Engineering, Eindhoven University of Technology (2014)

  30. NVIDIA: NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. Whitepaper, NVIDIA Corporation (2013)

  31. NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara 95050, USA: CUDA C Best Practices Guide, 4.0 edn. (2011)

  32. NVIDIA Corporation: NVIDIA CUDA C Programming Guide (2011)

  33. OpenMP: The OpenMP API specification for parallel programming (2013). http://openmp.org/wp/

  34. Peercy, M., Segal, M., Gerstmann, D.: A performance-oriented data parallel virtual machine for GPUs. In: SIGGRAPH ’06: ACM SIGGRAPH 2006 Sketches, p. 184. ACM, New York (2006)

  35. PGI: Portland group (2013). http://www.pgroup.com/resources/accel.htm

  36. Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.W.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: PPOPP, pp. 73–82 (2008)

  37. National Institute of Standards and Technology: Text file formats. http://math.nist.gov/MatrixMarket/formats.html. Accessed: 19 Nov 2014

  38. Tojo, N., Tanabe, K., Matsuzaki, H.: Program conversion apparatus and computer readable medium (2014). US Patent 8,732,684

  39. NVIDIA: Tuning CUDA Applications for Kepler. http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html. Accessed: 10 Jun 2013

  40. Ueng, S.Z., Lathara, M., Baghsorkhi, S.S., Hwu, W.M.W.: CUDA-lite: reducing GPU programming complexity. In: Amaral, J.N. (ed.) Languages and Compilers for Parallel Computing, pp. 1–15. Springer, Berlin (2008). doi:10.1007/978-3-540-89740-8_1

  41. Volkov, V., Demmel, J.: Benchmarking GPUs to tune dense linear algebra. In: Proceedings of the ACM/IEEE Conference on High Performance Computing, p. 31 (2008)

  42. Wakatani, A.: Effectiveness of a strip-mining approach for VQ image coding using GPGPU implementation. In: Image and Vision Computing New Zealand, 2009. IVCNZ '09. 24th International Conference, pp. 35–38 (2009). doi:10.1109/IVCNZ.2009.5378382

  43. Wang, G.: Coordinate strip-mining and kernel fusion to lower power consumption on GPU. In: Design, Automation Test in Europe Conference Exhibition (DATE), 2011, pp. 1–4 (2011). doi:10.1109/DATE.2011.5763317

  44. Wilt, N.: The CUDA Handbook: A Comprehensive Guide to GPU Programming. Addison-Wesley, Reading (2013)

  45. Xiao, S., Feng, W.C.: Inter-block GPU communication via fast barrier synchronization. In: IPDPS, pp. 1–12 (2010)

  46. Xu, Q., Jeon, H., Annavaram, M.: Graph processing on GPUs: Where are the bottlenecks? In: Workload Characterization (IISWC), 2014 IEEE International Symposium on, pp. 140–149 (2014). doi:10.1109/IISWC.2014.6983053

  47. Yang, Y., Xiang, P., Kong, J., Zhou, H.: A GPGPU compiler for memory optimization and parallelism management. In: Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 86–97 (2010)

  48. Yang, Y., Zhou, H.: The implementation of a high performance GPGPU compiler. Int. J. Parallel Program. 41(6), 768–781 (2013)

  49. Ye, D., Titov, A., Kindratenko, V., Ufimtsev, I., Martinez, T.: Porting optimized GPU kernels to a multi-core CPU: computational quantum chemistry application example. In: Application Accelerators in High-Performance Computing (SAAHPC), 2011 Symposium on, pp. 72–75 (2011). doi:10.1109/SAAHPC.2011.8

Download references

Acknowledgments

The authors would like to acknowledge the support provided by King Abdulaziz City for Science and Technology (KACST) through the Science & Technology Unit at King Fahd University of Petroleum & Minerals (KFUPM) for funding this work through Project No. 12-INF3008-04 as part of the National Science, Technology and Innovation Plan. We are also very thankful to Mr. Anas Al-Mousa for providing the OpenACC code implementations, and to King Abdullah University of Science and Technology (KAUST) for providing access to their K20X GPU cluster to run the experiments.

Author information


Corresponding author

Correspondence to Ayaz H. Khan.

Appendix: CUDA Kernel Optimizations

1.1 Manual Optimizations

Vectorization improves bandwidth utilization by using one of the CUDA vector data types, such as float2 or float4, which are structs with special data alignment [32, 44]. Global memory transactions on the GPU are 128 bytes wide even if the actual data requested is smaller. For example, if a thread issues four float loads as a sequence of scalar load instructions, the GPU executes four separate load requests, whereas a single float4 load fetches the same 16 bytes in one request, so a warp's accesses map onto fewer, wider transactions. To improve global memory bandwidth utilization, the programmer should therefore use vector data types when consecutive scalar loads would not otherwise fill the 128-byte transactions.

Texture Fetching uses texture memory, a read-only region of device memory (DRAM) that is cached on access [32, 44]. It is accessible by all threads as well as by the host. It is optimized for 2D spatial locality, so threads of the same warp that read texture addresses close together achieve the best performance. Texture references bound to CUDA arrays can be written to via surface-write operations by binding a surface to the same underlying CUDA array storage. Reading from a texture while writing to its underlying global memory array in the same kernel launch should be avoided, because the texture caches are read-only and are not invalidated when the associated global memory is modified; texture fetches from addresses that have been written via global stores in the same kernel call therefore return undefined data. Texture data can consist of 1, 2, or 4 elements of any of the following types: (1) signed or unsigned 8-, 16-, or 32-bit integers, (2) 16-bit floating-point values, and (3) 32-bit floating-point values. Arrays placed in texture memory are accessed in kernels through the texture intrinsics provided by CUDA, such as tex1D(), tex2D(), and tex3D() for 1D, 2D, and 3D CUDA arrays respectively. Before invoking a kernel that uses texture memory, the texture must be bound to a CUDA array or to device memory by calling cudaBindTexture(), cudaBindTexture2D(), or cudaBindTextureToArray().

Coalesced Global Memory Access refers to combining multiple memory accesses into a single transaction [13]. Global memory is the slowest memory on the GPU. Simultaneous global memory accesses by the threads of a half-warp (16 threads) during the execution of a single read or write instruction are coalesced into a single access when the following conditions hold: (1) the size of the memory element accessed by each thread is 4, 8, or 16 bytes; (2) the elements accessed form a contiguous block of memory; (3) the Nth element is accessed by the Nth thread in the half-warp, where coalescing is not broken if some threads in between do not access global memory (a divergent warp); and (4) the address of the first element is aligned to 16 times the element's size. A brief sketch combining vector loads and coalesced access is given below.
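As a concrete illustration of the vectorization and coalescing points above, the following minimal CUDA sketch (our own example, not RT-CUDA output; the kernel and variable names are assumptions) copies an array using float4 loads so that a warp's accesses map onto full 128-byte transactions.

  // Hypothetical illustration: copying N floats with float4 vector loads.
  // Assumes N is a multiple of 4 and the buffers are 16-byte aligned,
  // as they are when allocated with cudaMalloc.
  __global__ void copy_vec4(const float4 *in, float4 *out, int n4)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n4) {
          float4 v = in[i];   // one 16-byte load per thread; a warp's 32
          out[i] = v;         // contiguous accesses coalesce into 128-byte
      }                       // transactions
  }

  // Example launch with n4 = N / 4:
  //   copy_vec4<<<(n4 + 255) / 256, 256>>>(
  //       reinterpret_cast<const float4 *>(d_in),
  //       reinterpret_cast<float4 *>(d_out), n4);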
NVIDIA Kepler's Shuffle Instructions perform data exchange between threads within a warp [30]. They are faster than using shared memory: the threads of a warp can exchange data with each other directly without going through shared (or global) memory, which makes them an attractive way for applications to rapidly interchange data among threads. There are four variants of the shuffle instruction in CUDA: __shfl(), __shfl_up(), __shfl_down(), and __shfl_xor(). Shuffle instructions can be used to free up shared memory for other data or to increase warp occupancy, and to perform warp-synchronous optimizations (removing __syncthreads()), as in the warp-sum sketch below. All the __shfl() intrinsics take an optional width parameter which permits sub-division of the warp into segments, for example to exchange data between 4 groups of 8 lanes in a SIMD manner. If width is less than 32, each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0, and a thread may only exchange data with others in its own subsection. Width must be a power of 2 so that the warp can be subdivided equally; results are undefined if width is not a power of 2 or is greater than warpSize.
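The warp-level sum below is a common way such shuffle intrinsics are used. It is a hedged sketch rather than RT-CUDA output; on CUDA 9 and later the *_sync variants with an explicit lane mask replace the intrinsics listed above.

  // Minimal sketch: summing one value per thread across a warp with
  // __shfl_down(), so no shared memory or __syncthreads() is needed.
  __inline__ __device__ float warp_reduce_sum(float val)
  {
      // Each step halves the number of lanes still holding partial sums.
      for (int offset = warpSize / 2; offset > 0; offset /= 2)
          val += __shfl_down(val, offset);
      return val;   // lane 0 ends up holding the warp-wide sum
  }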

1.2 User Driven Optimizations

Loop Collapsing is a technique that transforms nested loops into a single loop to reduce loop overhead and improve runtime performance [18], specifically for irregular applications such as sparse matrix-vector multiplication (SpMV). Such applications pose challenges in achieving high performance on GPUs because stream architectures are optimized for regular program patterns. Collapsing improves application performance in three ways: (1) the amount of parallel work (the number of iterations to be executed by GPU threads) is increased; (2) inter-thread locality is increased; and (3) control-flow divergence is eliminated, so that adjacent threads can execute concurrently in a SIMD manner [24].

Thread and Thread-Block Merging enhance data sharing among thread blocks to reduce the number of global memory accesses [47, 48]. Thread-Block Merge determines the workload for each thread block while Thread Merge decides the workload for each thread. If data sharing among neighbouring blocks is due to a global-to-shared-memory (G2S) access, Thread-Block Merge should be preferred for better utilization of the shared memory. When data sharing comes from a global-to-register (G2R) access, Thread Merge across neighbouring blocks should be preferred due to the reuse of registers. If there are many G2R accesses leading to data sharing among different thread blocks, the register file is not large enough to hold all of the reused data; in this case, Thread-Block Merge should be used and shared memory variables should be introduced to hold the shared data. In addition, if a block does not have enough threads, Thread-Block Merge rather than Thread Merge should be used to increase the number of threads in a block even if there is no data sharing. Thread Merge achieves the effect of loop unrolling: it combines several threads' workloads into one thread (combining N neighbouring blocks along the column direction into one), so the merged threads can share not only shared memory but also registers in the register file. Furthermore, some control-flow statements and address computations can be reused, further reducing the overall instruction count. The limitation is that the increased workload typically requires more registers, which may reduce the number of active threads that fit in the hardware.

Parallel Loop Swap is used to improve the performance of regular data accesses in nested loops [3, 24]. It transforms non-contiguous memory accesses within a loop nest into contiguous memory accesses, which then become candidates for the coalesced global memory access optimization.

Strip Mining splits a loop into two nested loops [16, 41–43]: the outer loop has a stride equal to the strip size, and the inner loop iterates with the stride of the original loop within a strip, as sketched below. This technique is also used in loop tiling (loop blocking), where loops are additionally interchanged after strip mining to improve the locality of memory references, which is why loop tiling is also called strip-mine-and-interchange.
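The following sketch shows plain strip mining on a simple loop; the strip size S, the function, and the array names are illustrative assumptions rather than code from the paper.

  // Strip mining a loop over n elements with strip size S: the outer loop
  // advances by S, the inner loop walks within one strip. Interchanging
  // such strip loops across a loop nest yields classic loop tiling.
  #define S 128

  void scale(float *a, int n, float c)
  {
      for (int is = 0; is < n; is += S) {          // outer loop: strides of S
          int end = (is + S < n) ? is + S : n;     // handle a partial strip
          for (int i = is; i < end; ++i)           // inner loop: within strip
              a[i] *= c;
      }
  }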
Bank-Conflict-Free Shared Memory Access improves performance by reordering data in shared memory so that the memory addresses requested by consecutive threads of a half-warp map to different shared memory banks [10]. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks, and the bandwidth is 32 bits per bank per clock cycle. On the GPUs considered here the warp size is 32 threads and the number of banks is 16, so a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half (newer architectures such as Kepler provide 32 banks and serve a full warp per request). No bank conflict occurs as long as a half-warp of threads accesses at most one memory location per bank.

Using the Read-Only Data Cache, introduced with NVIDIA Kepler in addition to the L1 cache, can benefit the performance of bandwidth-limited kernels [30, 39]. This is the same cache used by the texture pipeline, but it is reached via a standard pointer without the need to bind a texture beforehand and without the size limitations of standard textures. The feature is enabled and managed by the compiler: any variable or data structure that the programmer marks as constant with the C99-standard "const __restrict__" qualifiers is tagged by the compiler to be loaded through the read-only data cache, as in the sketch below.
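A minimal example of the const __restrict__ idiom described above; the SAXPY-style kernel is our own illustration, not RT-CUDA output.

  // Tagging read-only kernel arguments with const __restrict__ lets the
  // compiler route their loads through the Kepler read-only data cache
  // (the texture path) without explicit texture binding.
  __global__ void saxpy_ro(int n, float a,
                           const float * __restrict__ x,
                           float * __restrict__ y)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          y[i] = a * x[i] + y[i];   // loads of x[] are eligible for the
                                    // read-only cache; y[] is written, so not
  }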

1.3 Compiler Optimizations

Common Sub-Expression Elimination is a compiler optimization that searches for instances of identical expressions that evaluate to the same value and replaces them with a single variable holding the computed value [9, 49]. It enhances application performance by reducing the number of floating-point operations. In CUDA, common sub-expression elimination can be used to avoid redundant calculations of the initial address of an array.

Loop-Invariant Code Motion (also called hoisting or scalar promotion) is a compiler optimization that is performed automatically [17, 36]. Loop-invariant code is a set of statements or expressions within the body of a loop that can be moved outside of the body without affecting the semantics of the program. Hoisting it makes loops faster by reducing the amount of code executed in each iteration. The CUDA C compiler automatically applies this optimization when generating PTX code.

Loop Unrolling is a compiler optimization that is applied when trip counts are known at compile time, either through constants or by templating the kernel [28]. The NVIDIA compiler also provides the directive '#pragma unroll' to explicitly request unrolling of a particular loop, as illustrated below.
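A small illustrative kernel (our own, not from the paper) showing #pragma unroll on a loop with a compile-time trip count; TILE and the kernel name are assumptions.

  // #pragma unroll asks nvcc to fully unroll a loop whose trip count is
  // known at compile time, removing the loop branch from the generated PTX.
  #define TILE 8

  __global__ void add_tile(const float *in, float *out)
  {
      // Assumes the input holds gridDim.x * blockDim.x * TILE elements.
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      float acc = 0.0f;
  #pragma unroll
      for (int k = 0; k < TILE; ++k)   // fully unrolled by the compiler
          acc += in[tid * TILE + k];
      out[tid] = acc;
  }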

About this article

Cite this article

Khan, A.H., Al-Mouhamed, M., Al-Mulhem, M. et al. RT-CUDA: A Software Tool for CUDA Code Restructuring. Int J Parallel Prog 45, 551–594 (2017). https://doi.org/10.1007/s10766-016-0433-6
