Hardware-Oriented Implementation of Cache Oblivious Matrix Operations Based on Space-Filling Curves

  • Michael Bader
  • Robert Franz
  • Stephan Günther
  • Alexander Heinecke
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4967)


We present hardware-oriented implementations of block-recursive approaches for matrix operations, in particular matrix multiplication and LU decomposition. The matrix elements are stored in an order given by a recursively constructed Peano space-filling curve. This block-recursive numbering scheme switches to standard row-major order as soon as the respective matrix subblocks fit into level-1 cache. For operations on these small blocks, we implemented hardware-oriented kernels optimised for Intel’s Core architecture. The resulting matrix-multiplication and LU-decomposition codes compete well with optimised libraries such as Intel’s MKL, ATLAS, or GotoBLAS, but have the advantage that only comparatively small and well-defined kernel operations have to be optimised to achieve high performance.
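
The storage scheme in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `peano_cell` and `hybrid_offsets`, the parameters `levels` and `block`, and the orientation of the curve are assumptions made for illustration. The block order follows Peano's classical base-3 digit construction, and the elements inside each block are laid out in row-major order.

```python
def peano_cell(index, levels):
    """Return (row, col) of the index-th cell of a Peano curve
    on a 3**levels x 3**levels grid (assumed orientation: start at (0, 0))."""
    # Base-3 digits of the index, most significant first: t1, t2, ..., t_{2*levels}.
    digits = []
    for _ in range(2 * levels):
        digits.append(index % 3)
        index //= 3
    digits.reverse()

    row = col = 0
    row_flips = 0  # sum of the even-position digits seen at coarser levels
    col_flips = 0  # sum of the odd-position digits seen so far
    for level in range(levels):
        t_odd = digits[2 * level]
        t_even = digits[2 * level + 1]
        # A digit d is mirrored to 2 - d when the relevant digit sum is odd.
        r = t_odd if row_flips % 2 == 0 else 2 - t_odd
        col_flips += t_odd
        c = t_even if col_flips % 2 == 0 else 2 - t_even
        row_flips += t_even
        row = 3 * row + r
        col = 3 * col + c
    return row, col


def hybrid_offsets(levels, block):
    """Map each matrix index (i, j) to a storage offset: blocks follow the
    Peano order, elements inside a block x block tile are row-major."""
    n_blocks = 3 ** levels
    offsets = {}
    for p in range(n_blocks * n_blocks):
        block_row, block_col = peano_cell(p, levels)
        base = p * block * block  # each tile occupies a contiguous range
        for lr in range(block):
            for lc in range(block):
                i = block_row * block + lr
                j = block_col * block + lc
                offsets[(i, j)] = base + lr * block + lc
    return offsets
```

For example, with `levels=1` and `block=2` the matrix is 6 x 6, and `hybrid_offsets(1, 2)[(0, 2)]` is 4, because the tile containing that element is the second tile on the curve. In the paper's setting the tile size would be chosen so that the tiles involved in a kernel call fit into level-1 cache. The value used here is only a placeholder.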


Matrix Multiplication, Matrix Block, Numbering Scheme, Cache Line, Single Precision
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Michael Bader (1)
  • Robert Franz (1)
  • Stephan Günther (1)
  • Alexander Heinecke (1)

  1. Dept. of Informatics, TU München, München, Germany
