International Journal of Parallel Programming, Volume 43, Issue 5, pp 876–891

Efficient 3D Transpositions in Graphics Processing Units

  • Jose L. Jodra
  • Ibai Gurrutxaga
  • Javier Muguerza


Matrix transposition is a basic operation in many computing tasks, and transposing a matrix in a computer's main memory has long been well studied. More recently, out-of-place matrix transposition has been performed efficiently on graphics processing units (GPUs), which are now broadly used for general-purpose computing. However, because of the particular architecture of GPUs, adapting the matrix transposition operation to 3D arrays is not straightforward. In this paper, we describe efficient GPU implementations of the 5 possible out-of-place 3D transpositions. We also include implementations of the most basic in-place 3D transpositions. The results show that the achieved bandwidth is close to that of a simple array copy and similar to that of the 2D transposition.
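As an illustration of why there are exactly 5 out-of-place 3D transpositions, the following minimal CPU sketch enumerates the non-identity orderings of three axes (3! − 1 = 5) and applies any of them to a flat array. It is only a reference for the index mapping and is not the GPU implementation described in the paper; the row-major layout and the names `transpose3d` and `NON_IDENTITY_PERMS` are assumptions made for this example.

```python
from itertools import permutations


def transpose3d(src, dims, perm):
    """Out-of-place transposition of a flat, row-major 3D array.

    dims: sizes (d0, d1, d2) of the source axes, last axis fastest (assumed layout).
    perm: output axis k takes its index from source axis perm[k].
    Returns the new flat array and its dimensions.
    """
    out_dims = tuple(dims[p] for p in perm)
    dst = [None] * len(src)
    for i in range(dims[0]):
        for j in range(dims[1]):
            for k in range(dims[2]):
                idx = (i, j, k)
                # Reorder the source index according to the axis permutation.
                o = tuple(idx[p] for p in perm)
                dst_pos = (o[0] * out_dims[1] + o[1]) * out_dims[2] + o[2]
                dst[dst_pos] = src[(i * dims[1] + j) * dims[2] + k]
    return dst, out_dims


# The five non-identity orderings of three axes.
NON_IDENTITY_PERMS = [p for p in permutations(range(3)) if p != (0, 1, 2)]
```

Calling `transpose3d(list(range(24)), (2, 3, 4), (2, 1, 0))`, for example, reverses the axis order of a 2×3×4 array and returns it with dimensions `(4, 3, 2)`.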


Keywords: 3D transposition, GPU, CUDA, Heterogeneous systems



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Jose L. Jodra (1)
  • Ibai Gurrutxaga (2, corresponding author)
  • Javier Muguerza (2)

  1. Department of Electronic Technology, University of the Basque Country, UPV/EHU, Donostia-San Sebastián, Spain
  2. Department of Computer Architecture and Technology, University of the Basque Country, UPV/EHU, Donostia-San Sebastián, Spain
