Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors

  • Conference paper

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 117)

Abstract

An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures running in a hybrid OpenMP/MPI configuration is presented. Significant boosts in speed are observed relative to the distributed transpose used in the state-of-the-art adaptive FFTW library. In some cases, a hybrid configuration allows one to reduce communication costs by reducing the number of message passing interface (MPI) nodes, and thereby increasing message sizes. This also allows for a more slab-like than pencil-like domain decomposition for multidimensional fast Fourier transforms (FFT), reducing the cost of, or even eliminating the need for, a second distributed transpose. Nonblocking all-to-all transfers enable user computation and communication to be overlapped.
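The data movement behind a distributed transpose, and the effect of the process count on message size, can be illustrated with a minimal single-process sketch. It is not the authors' algorithm and uses no MPI or OpenMP: the "ranks" are plain Python lists, the all-to-all exchange is simulated, and the names `split_rows`, `all_to_all`, `distributed_transpose`, and `message_elements` are hypothetical helpers introduced only for this illustration. It assumes a square N x N matrix with N divisible by the number of ranks.

```python
# Simulated distributed transpose (illustrative sketch, no real MPI).
# Each of P "ranks" owns N/P contiguous rows of an N x N matrix.

def split_rows(matrix, nranks):
    """Distribute the rows of a square matrix evenly over nranks slabs."""
    n = len(matrix)
    assert n % nranks == 0, "sketch assumes N divisible by the rank count"
    b = n // nranks
    return [matrix[r * b:(r + 1) * b] for r in range(nranks)]

def all_to_all(send):
    """Simulated all-to-all: send[src][dst] arrives as recv[dst][src]."""
    p = len(send)
    return [[send[src][dst] for src in range(p)] for dst in range(p)]

def distributed_transpose(slabs):
    """Transpose a row-distributed matrix; the result is row-distributed too."""
    p = len(slabs)
    b = len(slabs[0])
    # Step 1: every rank cuts its b x N slab into p tiles of size b x b.
    send = [[[row[dst * b:(dst + 1) * b] for row in slab] for dst in range(p)]
            for slab in slabs]
    # Step 2: tile (src, dst) travels from rank src to rank dst.
    recv = all_to_all(send)
    # Step 3: every rank transposes its tiles locally and joins them by row.
    result = []
    for rank in range(p):
        tiles = [[list(col) for col in zip(*tile)] for tile in recv[rank]]
        slab = [sum((tiles[src][i] for src in range(p)), []) for i in range(b)]
        result.append(slab)
    return result

def message_elements(n, nranks):
    """Elements in each pairwise message of the all-to-all exchange."""
    return (n // nranks) ** 2
```

In this model each pairwise message carries (N/P)^2 elements, so halving the number of ranks P quadruples the message size. That is the trade-off the abstract describes when MPI processes are replaced by threads within a node.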

Notes

  1. However, the recent availability of serial cache-oblivious in-place transposition algorithms in some cases tips the balance in favor of local transposition, if transposed output is acceptable.
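The recursive idea behind a cache-oblivious in-place transpose of a square matrix can be sketched as follows. This is a generic divide-and-conquer illustration, not the implementation the note refers to. The function names and the `cutoff` parameter are assumptions made for this sketch, and Python lists of lists will not show the cache behavior that motivates the technique in compiled code.

```python
# Recursive in-place transpose of a square matrix (illustrative sketch).
# The matrix is halved recursively until blocks are at most cutoff x cutoff,
# so no cache size appears anywhere in the code.

def _swap_offdiag(a, r0, r1, c0, c1, cutoff):
    """Swap block a[r0:r1][c0:c1] with its mirror image across the diagonal."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= cutoff and cols <= cutoff:
        for i in range(r0, r1):
            for j in range(c0, c1):
                a[i][j], a[j][i] = a[j][i], a[i][j]
    elif rows >= cols:
        mid = r0 + rows // 2          # split the longer (row) dimension
        _swap_offdiag(a, r0, mid, c0, c1, cutoff)
        _swap_offdiag(a, mid, r1, c0, c1, cutoff)
    else:
        mid = c0 + cols // 2          # split the longer (column) dimension
        _swap_offdiag(a, r0, r1, c0, mid, cutoff)
        _swap_offdiag(a, r0, r1, mid, c1, cutoff)

def _transpose_diag(a, lo, hi, cutoff):
    """Transpose the diagonal block a[lo:hi][lo:hi] in place."""
    n = hi - lo
    if n <= cutoff:
        for i in range(lo, hi):
            for j in range(i + 1, hi):
                a[i][j], a[j][i] = a[j][i], a[i][j]
        return
    mid = lo + n // 2
    _transpose_diag(a, lo, mid, cutoff)           # upper-left quadrant
    _transpose_diag(a, mid, hi, cutoff)           # lower-right quadrant
    _swap_offdiag(a, lo, mid, mid, hi, cutoff)    # exchange off-diagonal pair

def transpose_in_place(a, cutoff=4):
    """Transpose the square list-of-lists matrix a in place and return it."""
    assert cutoff >= 1
    assert all(len(row) == len(a) for row in a), "sketch assumes a square matrix"
    _transpose_diag(a, 0, len(a), cutoff)
    return a
```

Each element pair across the diagonal is swapped exactly once, so no second buffer is needed. That property is what makes a local transpose attractive when transposed output is acceptable.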

References

  1. Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: 40th Annual Symposium on Foundations of Computer Science (IEEE, 1999), pp. 285–297

  2. Dow, M.: Transposing a matrix on a vector computer. Parallel Comput. 21(12), 1997 (1995)

  3. Choi, J., Dongarra, J.J., Walker, D.W.: Parallel matrix transpose algorithms on distributed memory concurrent computers. Parallel Comput. 21(9), 1387 (1995)

  4. Al Na’mneh, R., Pan, W.D., Yoo, S.M.: Efficient adaptive algorithms for transposing small and large matrices on symmetric multiprocessors. Informatica 17(4), 535 (2006)

  5. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216 (2005)

  6. Bowman, J.C., Roberts, M.: FFTW++: A fast Fourier transform C++ header class for the FFTW3 library. http://fftwpp.sourceforge.net (2010)

  7. Bowman, J.C., Roberts, M.: Efficient dealiased convolutions without padding. SIAM J. Sci. Comput. 33(1), 386 (2011)


Acknowledgements

The authors gratefully acknowledge Professor Wendell Horton for providing access to state-of-the-art computing facilities at the Texas Advanced Computing Center.

Author information

Corresponding author

Correspondence to John C. Bowman.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Bowman, J., Roberts, M. (2015). Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors. In: Cojocaru, M., Kotsireas, I., Makarov, R., Melnik, R., Shodiev, H. (eds) Interdisciplinary Topics in Applied Mathematics, Modeling and Computational Science. Springer Proceedings in Mathematics & Statistics, vol 117. Springer, Cham. https://doi.org/10.1007/978-3-319-12307-3_14
