
Efficient 3D Transpositions in Graphics Processing Units


Matrix transposition is a basic operation in many computing tasks, and transposing a matrix in a computer's main memory has been studied extensively for many years. More recently, out-of-place matrix transposition has been performed efficiently on graphics processing units (GPUs), which are now broadly used for general-purpose computing. However, due to the particular architecture of GPUs, adapting the matrix transposition operation to 3D arrays is not straightforward. In this paper, we describe efficient GPU implementations of the 5 possible out-of-place 3D transpositions. We also include implementations of the most basic in-place 3D transpositions. The results show that the achieved bandwidth is close to that of a simple array copy and similar to that of the 2D transposition.
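The count of 5 follows from the 3! = 6 possible orderings of three axes, minus the identity. The following is a minimal CPU reference sketch in Python of what each out-of-place transposition computes on a flat row-major array. It is not the paper's GPU implementation; the function name `transpose3d`, the `perm` convention, and the list `TRANSPOSITIONS` are assumptions made for illustration only.

```python
from itertools import permutations


def transpose3d(src, dims, perm):
    """Out-of-place transposition of a flat, row-major 3D array.

    src  -- flat list holding dims[0] * dims[1] * dims[2] elements
    dims -- (d0, d1, d2) extents of the source array
    perm -- output axis k takes source axis perm[k] (illustrative convention)
    Returns (dst, out_dims); src is left untouched.
    """
    out_dims = tuple(dims[p] for p in perm)
    dst = [None] * len(src)
    for i in range(dims[0]):
        for j in range(dims[1]):
            for k in range(dims[2]):
                idx = (i, j, k)
                # Coordinates of this element in the transposed array.
                o = tuple(idx[p] for p in perm)
                src_pos = (i * dims[1] + j) * dims[2] + k
                dst_pos = (o[0] * out_dims[1] + o[1]) * out_dims[2] + o[2]
                dst[dst_pos] = src[src_pos]
    return dst, out_dims


# All orderings of three axes except the identity: the 5 transpositions.
TRANSPOSITIONS = [p for p in permutations(range(3)) if p != (0, 1, 2)]
```

A reference like this is mainly useful for checking the output of an optimized kernel. It says nothing about memory coalescing or shared-memory tiling, which are the concerns that make the GPU versions non-trivial.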





Author information



Corresponding author

Correspondence to Ibai Gurrutxaga.

Additional information

This work was funded by the Department of Education, Universities and Research of the Basque Government (IT395-10 Research Group Grant), by the University of the Basque Country UPV/EHU (ALDAPA Research Group Grant, GIU10/02 and BAILab Research and Training Unit Grant, UFI11/45), and by the Science and Education Department of the Spanish Government (ModelAccess Project, TIN2010-15549).


About this article


Cite this article

Jodra, J.L., Gurrutxaga, I. & Muguerza, J. Efficient 3D Transpositions in Graphics Processing Units. Int J Parallel Prog 43, 876–891 (2015).



  • 3D transposition
  • GPU
  • CUDA
  • Heterogeneous systems