Using Non-canonical Array Layouts in Dense Matrix Operations

  • José R. Herrero
  • Juan J. Navarro
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4699)


We present two implementations of dense matrix multiplication based on two different non-canonical array layouts: one based on a hypermatrix data structure (HM) where data submatrices are stored using a recursive layout; the other based on a simple block data layout with square blocks (SB) where blocks are arranged in column-major order. We show that the iterative code using SB outperforms a recursive code using HM and obtains competitive results on a variety of platforms.
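The square-block (SB) layout described above can be illustrated with a small index-mapping sketch. This is not the authors' code; it is a minimal illustration assuming column-major order both among blocks and within each block, and assuming the matrix dimension is a multiple of the block size:

```c
#include <stddef.h>

/* Square-block (SB) layout: an N x N matrix is partitioned into B x B
 * blocks stored contiguously, with the blocks themselves arranged in
 * column-major order.  The intra-block ordering (column-major here) is
 * an assumption for illustration.
 * nb = number of blocks per dimension (N / B, assuming B divides N). */
size_t sb_offset(size_t i, size_t j, size_t B, size_t nb) {
    size_t bi = i / B, bj = j / B;       /* block coordinates          */
    size_t ii = i % B, jj = j % B;       /* position inside the block  */
    size_t block = bj * nb + bi;         /* column-major order of blocks */
    return block * B * B + jj * B + ii;  /* column-major inside block  */
}
```

Because each block occupies a contiguous range of memory, an inner kernel that multiplies two B x B blocks walks through memory with unit stride, which is the locality property the paper exploits.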


Keywords: Matrix Multiplication, Memory Hierarchy, Data Layout, Page Fault, Iterative Code
(These keywords were added by machine, not by the authors, and may be updated as the learning algorithm improves.)




Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • José R. Herrero (1)
  • Juan J. Navarro (1)
  1. Computer Architecture Dept., Univ. Politècnica de Catalunya, C/ Jordi Girona 1-3, D6, ES-08034 Barcelona, Spain
