Abstract
An adaptive parallel matrix transpose algorithm optimized for distributed multicore architectures running in a hybrid OpenMP/MPI configuration is presented. Significant boosts in speed are observed relative to the distributed transpose used in the state-of-the-art adaptive FFTW library. In some cases, a hybrid configuration allows one to reduce communication costs by reducing the number of message passing interface (MPI) nodes and thereby increasing message sizes. A hybrid configuration also allows for a more slab-like than pencil-like domain decomposition for multidimensional fast Fourier transforms (FFTs), reducing the cost of, or even eliminating the need for, a second distributed transpose. Nonblocking all-to-all transfers enable user computation and communication to be overlapped.
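As a rough illustration of what a distributed transpose involves, the following single-process sketch simulates a slab-decomposed matrix being transposed with one all-to-all block exchange followed by local transposes. It is not the authors' adaptive algorithm: the function names (`distribute_rows`, `all_to_all`, `distributed_transpose`) are hypothetical, the "ranks" are plain Python lists, and a real implementation would replace `all_to_all` with an MPI collective (for instance a nonblocking all-to-all, so that computation can proceed while messages are in flight).

```python
def distribute_rows(matrix, nprocs):
    """Split a matrix into nprocs slabs of consecutive rows (one slab per simulated rank)."""
    nrows = len(matrix)
    assert nrows % nprocs == 0, "sketch assumes the row count divides evenly"
    rows = nrows // nprocs
    return [matrix[p * rows:(p + 1) * rows] for p in range(nprocs)]


def all_to_all(send):
    """Simulated all-to-all: recv[q][p] is the block that rank p addressed to rank q."""
    nprocs = len(send)
    return [[send[p][q] for p in range(nprocs)] for q in range(nprocs)]


def distributed_transpose(slabs, ncols):
    """Transpose a row-slab-distributed matrix; returns row slabs of the transpose."""
    nprocs = len(slabs)
    assert ncols % nprocs == 0, "sketch assumes the column count divides evenly"
    cols = ncols // nprocs

    # Step 1: each rank cuts its slab into nprocs column blocks, one per destination.
    send = [[[row[q * cols:(q + 1) * cols] for row in slab]
             for q in range(nprocs)]
            for slab in slabs]

    # Step 2: exchange blocks between all ranks.
    recv = all_to_all(send)

    # Step 3: each rank transposes its received blocks locally and joins them
    # in source-rank order to form its rows of the transposed matrix.
    result = []
    for q in range(nprocs):
        out_rows = []
        for j in range(cols):
            out_row = []
            for p in range(nprocs):
                block = recv[q][p]
                out_row.extend(block[i][j] for i in range(len(block)))
            out_rows.append(out_row)
        result.append(out_rows)
    return result


def gather(slabs):
    """Concatenate the slabs of all simulated ranks back into one matrix."""
    return [row for slab in slabs for row in slab]
```

In this picture, using fewer ranks (each with several threads) means fewer but larger blocks in step 2, which is the message-size effect the abstract refers to. A call such as `gather(distributed_transpose(distribute_rows(a, 2), len(a[0])))` returns the transpose of `a` when both dimensions are divisible by the number of simulated ranks.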
Notes
1. However, the recent availability of serial cache-oblivious in-place transposition algorithms in some cases tips the balance in favor of local transposition, if transposed output is acceptable.
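The note does not say which cache-oblivious algorithm is meant. As a minimal sketch of the general idea, the divide-and-conquer routine below transposes a square matrix in place by recursively halving it until the blocks are small, so that no cache size has to be known in advance. The function name `transpose_in_place` and the `base` cutoff are assumptions made for illustration, and a Python list of lists shows only the recursion pattern, not the memory behaviour of a contiguous array.

```python
def transpose_in_place(a, base=16):
    """Transpose the square matrix `a` (list of lists) in place by recursive halving."""
    n = len(a)
    assert all(len(row) == n for row in a), "sketch handles square matrices only"
    assert base >= 1

    def swap_blocks(r0, r1, c0, c1):
        # Swap the off-diagonal block a[r0:r1][c0:c1] with its mirror image,
        # splitting the longer dimension until the block is small.
        if max(r1 - r0, c1 - c0) <= base:
            for i in range(r0, r1):
                for j in range(c0, c1):
                    a[i][j], a[j][i] = a[j][i], a[i][j]
        elif r1 - r0 >= c1 - c0:
            rm = (r0 + r1) // 2
            swap_blocks(r0, rm, c0, c1)
            swap_blocks(rm, r1, c0, c1)
        else:
            cm = (c0 + c1) // 2
            swap_blocks(r0, r1, c0, cm)
            swap_blocks(r0, r1, cm, c1)

    def transpose_diagonal(lo, hi):
        # Transpose the diagonal block a[lo:hi][lo:hi].
        if hi - lo <= base:
            for i in range(lo, hi):
                for j in range(i + 1, hi):
                    a[i][j], a[j][i] = a[j][i], a[i][j]
        else:
            mid = (lo + hi) // 2
            transpose_diagonal(lo, mid)     # upper-left quadrant
            transpose_diagonal(mid, hi)     # lower-right quadrant
            swap_blocks(lo, mid, mid, hi)   # exchange the two off-diagonal quadrants

    transpose_diagonal(0, n)
    return a
```

Because the recursion keeps shrinking the working set, each sub-block eventually fits in whatever cache level is available, which is the property that makes a local transposition step competitive.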
References
Frigo, M., Leiserson, C.E., Prokop, H., Ramachandran, S.: Cache-oblivious algorithms. In: Foundations of Computer Science, 1999. 40th Annual Symposium on, pp. 285–297. IEEE (1999)
Dow, M.: Transposing a matrix on a vector computer. Parallel Comput. 21(12), 1997 (1995)
Choi, J., Dongarra, J.J., Walker, D.W.: Parallel matrix transpose algorithms on distributed memory concurrent computers. Parallel Comput. 21(9), 1387 (1995)
Al Na’mneh, R., Pan, W.D., Yoo, S.M.: Efficient adaptive algorithms for transposing small and large matrices on symmetric multiprocessors. Informatica 17(4), 535 (2006)
Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proc. IEEE 93(2), 216 (2005)
Bowman, J.C., Roberts, M.: FFTW++: A fast Fourier transform C++ header class for the FFTW3 library. http://fftwpp.sourceforge.net (2010)
Bowman, J.C., Roberts, M.: Efficient dealiased convolutions without padding. SIAM J. Sci. Comput. 33(1), 386 (2011)
Acknowledgements
The authors gratefully acknowledge Professor Wendell Horton for providing access to state-of-the-art computing facilities at the Texas Advanced Computing Center.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bowman, J., Roberts, M. (2015). Adaptive Matrix Transpose Algorithms for Distributed Multicore Processors. In: Cojocaru, M., Kotsireas, I., Makarov, R., Melnik, R., Shodiev, H. (eds) Interdisciplinary Topics in Applied Mathematics, Modeling and Computational Science. Springer Proceedings in Mathematics & Statistics, vol 117. Springer, Cham. https://doi.org/10.1007/978-3-319-12307-3_14
DOI: https://doi.org/10.1007/978-3-319-12307-3_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12306-6
Online ISBN: 978-3-319-12307-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)