Abstract
We present hardware-oriented implementations of block-recursive approaches for matrix operations, in particular matrix multiplication and LU decomposition. Matrix elements are stored in an order based on a recursively constructed Peano space-filling curve. This block-recursive numbering scheme switches to standard row-major order as soon as the respective matrix subblocks fit into the level-1 cache. For operations on these small blocks, we implemented hardware-oriented kernels optimised for Intel’s Core architecture. The resulting matrix-multiplication and LU-decomposition codes compete well with optimised libraries such as Intel’s MKL, ATLAS, or GotoBLAS, but have the advantage that only comparatively small and well-defined kernel operations have to be optimised to achieve high performance.
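The hybrid storage scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `peano_coords` and `hybrid_layout`, the orientation of the curve, and the block size `b` are assumptions made for illustration. The sketch uses Peano's classical base-3 digit construction to order the blocks, and plain row-major order inside each block. In the paper, the block size would be chosen so that the blocks fit into level-1 cache.

```python
def peano_coords(t, k):
    """Map a curve index t in [0, 9**k) to block coordinates (x, y)
    on a 3**k x 3**k grid, using Peano's base-3 digit construction."""
    digits = []
    for _ in range(2 * k):
        digits.append(t % 3)
        t //= 3
    digits.reverse()  # most significant digit first: t1, t2, ..., t_2k

    x = y = 0
    sum_odd = 0   # running sum of t1, t3, ... (digits that feed x)
    sum_even = 0  # running sum of t2, t4, ... (digits that feed y)
    for i in range(k):
        tx, ty = digits[2 * i], digits[2 * i + 1]
        # A digit is mirrored (d -> 2 - d) when the sum of the preceding
        # digits belonging to the other coordinate is odd.
        dx = tx if sum_even % 2 == 0 else 2 - tx
        sum_odd += tx
        dy = ty if sum_odd % 2 == 0 else 2 - ty
        sum_even += ty
        x = 3 * x + dx
        y = 3 * y + dy
    return x, y


def hybrid_layout(k, b):
    """Return a dict (row, col) -> storage offset for a square matrix of
    size b * 3**k: blocks follow the Peano curve, and elements inside
    each b x b block are stored in row-major order."""
    layout = {}
    for t in range(9 ** k):
        bx, by = peano_coords(t, k)
        for r in range(b):
            for c in range(b):
                layout[(bx * b + r, by * b + c)] = t * b * b + r * b + c
    return layout


if __name__ == "__main__":
    # Print the order in which the nine blocks of a 3 x 3 grid are visited.
    for t in range(9):
        print(t, peano_coords(t, 1))
```

Consecutive curve indices always map to edge-adjacent blocks, so blocks that are neighbours in memory are also neighbours in the matrix. This locality is the property that block-recursive algorithms of this kind rely on.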
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bader, M., Franz, R., Günther, S., Heinecke, A. (2008). Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2007. Lecture Notes in Computer Science, vol 4967. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68111-3_66
DOI: https://doi.org/10.1007/978-3-540-68111-3_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68105-2
Online ISBN: 978-3-540-68111-3
eBook Packages: Computer Science, Computer Science (R0)