Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves
We present hardware-oriented implementations of block-recursive approaches for matrix operations, especially matrix multiplication and LU decomposition. The matrix elements are stored in an order based on a recursively constructed Peano space-filling curve. This block-recursive numbering scheme switches to standard row-major order as soon as the respective matrix subblocks fit into the level-1 cache. For operations on these small blocks, we implemented hardware-oriented kernels optimised for Intel’s Core architecture. The resulting matrix-multiplication and LU-decomposition codes compete well with optimised libraries such as Intel’s MKL, ATLAS, or GotoBLAS, but have the advantage that only comparatively small and well-defined kernel operations have to be optimised to achieve high performance.
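The storage scheme described above can be illustrated with a minimal sketch. It assumes the classical ternary-digit construction of the Peano curve and a square matrix of side 3^levels * block; the function names, the `levels` and `block` parameters, and the orientation of the curve are illustrative assumptions and are not taken from the paper's implementation. Blocks are numbered along the Peano curve, and elements inside each block are stored in row-major order.

```python
def peano_block_coords(t, levels):
    """Map a Peano index t (0 <= t < 9**levels) to (row, col) on a
    3**levels x 3**levels grid, using the classical digit construction.
    Hypothetical helper; the curve orientation is an assumption."""
    # Ternary digits of t, most significant first: t1, t2, ..., t_{2*levels}.
    digits = []
    for _ in range(2 * levels):
        digits.append(t % 3)
        t //= 3
    digits.reverse()

    row = col = 0
    odd_sum = 0   # t1 + t3 + ... seen so far; its parity flips column digits
    even_sum = 0  # t2 + t4 + ... seen so far; its parity flips row digits
    for i in range(levels):
        d_row = digits[2 * i]
        d_col = digits[2 * i + 1]
        r = d_row if even_sum % 2 == 0 else 2 - d_row
        odd_sum += d_row
        c = d_col if odd_sum % 2 == 0 else 2 - d_col
        even_sum += d_col
        row = 3 * row + r
        col = 3 * col + c
    return row, col


def build_offset_table(levels, block):
    """Return a dict mapping (i, j) to a position in 1-D storage:
    blocks follow the Peano curve, elements inside a block are row-major.
    `block` stands in for a block size chosen to fit the level-1 cache."""
    offsets = {}
    for t in range(9 ** levels):
        block_row, block_col = peano_block_coords(t, levels)
        base = t * block * block  # blocks are stored contiguously
        for i in range(block):
            for j in range(block):
                offsets[(block_row * block + i, block_col * block + j)] = base + i * block + j
    return offsets


if __name__ == "__main__":
    # Print the visiting order of the 3 x 3 blocks at the coarsest level.
    print([peano_block_coords(t, 1) for t in range(9)])
```

In this sketch, consecutive blocks along the curve are always edge neighbours, which is the locality property the block-recursive scheme relies on. A kernel operating on one block would see an ordinary contiguous row-major array of `block * block` elements.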
Keywords: Matrix Multiplication, Matrix Block, Numbering Scheme, Cache Line, Single Precision