Is Cache-Oblivious DGEMM Viable?

  • John A. Gunnels
  • Fred G. Gustavson
  • Keshav Pingali
  • Kamen Yotov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4699)


We present a study of implementations of DGEMM using both the cache-oblivious and cache-conscious programming styles. The cache-oblivious programs use recursion and automatically block DGEMM operands A,B,C for the memory hierarchy. The cache-conscious programs use iteration and explicitly block A,B,C for register files, all caches and memory. Our study shows that the cache-oblivious programs achieve substantially less performance than the cache-conscious programs. We discuss why this is so and suggest approaches for improving the performance of cache-oblivious programs.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Agarwal, R.C., Gustavson, F.G., Zubair, M.: Exploiting functional parallelism of POWER2 to design high-performance numerical algorithms. IBM Journal of Research and Development 38(5), 563–576 (1994)Google Scholar
  2. 2.
    Belady, L.A.: A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal 5(2), 78–101 (1966)Google Scholar
  3. 3.
    Chatterjee, S., et al.: Design and Exploitation of a High-performance SIMD Floating-point Unit for Blue Gene/L. IBM Journal of Research and Development 49(2-3), 377–391 (2005)CrossRefGoogle Scholar
  4. 4.
    Frigo, M., Leiserson, C., Prokop, H., Ramachandran, S.: Cache-oblivious Algorithms. In: FOCS 1999: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, p. 285. IEEE Computer Society Press, Los Alamitos (1999)Google Scholar
  5. 5.
    Dongarra, J.J., Gustavson, F.G., Karp, A.: Implementing Linear Algebra Algorithms for Dense Matrices on a Vector Pipeline Machine. SIAM Review 26(1), 91–112 (1984)MATHCrossRefMathSciNetGoogle Scholar
  6. 6.
    Elmroth, E., Gustavson, F.G., Kågström, B., Jonsson, I.: Recursive Blocked Algorithms and Hybrid Data Structures for Dense Matrix Library Software. SIAM Review 46(1), 3–45 (2004)MATHCrossRefMathSciNetGoogle Scholar
  7. 7.
    Hong, J.-W., Kung, H.T.: I/O complexity: The red-blue pebble game. In: Proc. of the thirteenth annual ACM symposium on Theory of computing, pp. 326–333 (1981)Google Scholar
  8. 8.
    Gunnels, J.A., Gustavson, F.G., Henry, G.M., van de Geijn, R.A.: A Family of High-Performance Matrix Multiplication Algorithms. In: Dongarra, J.J., Madsen, K., Waśniewski, J. (eds.) PARA 2004. LNCS, vol. 3732, pp. 256–265. Springer, Heidelberg (2006)CrossRefGoogle Scholar
  9. 9.
    Gustavson, F.G.: Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms. IBM Journal of Research and Development 41(6), 737–755 (1997)Google Scholar
  10. 10.
    Gustavson, F.G.: High Performance Linear Algebra Algorithms using New Generalized Data Structures for Matrices. IBM Journal of Research and Development 47(1), 31–55 (2003)MathSciNetGoogle Scholar
  11. 11.
    Gustavson, F.G., Gunnels, J.A., Sexton, J.C.: Minimal Data Copy for Dense Linear Algebra Factorization. In: Kågström, B., Elmroth, E. (eds.) Computational Science - Para 2006. LNCS, vol. xxxx, pp. 540–549. Springer, Heidelberg (2006)Google Scholar
  12. 12.
    Gustavson, F.G., Henriksson, A., Jonsson, I., Kågström, B., Ling, P.: Recursive blocked data formats and BLAS’s for dense linear algebra algorithms. In: Kagström, B., Elmroth, E., Waśniewski, J., Dongarra, J.J. (eds.) PARA 1998. LNCS, vol. 1541, pp. 195–206. Springer, Heidelberg (1998)CrossRefGoogle Scholar
  13. 13.
    Gustavson, F.G., Henriksson, A., Jonsson, I., Kågström, B., Ling, P.: Superscalar GEMM-based level 3 BLAS—the on-going evolution of a portable and high-performance library. In: Kagström, B., Elmroth, E., Waśniewski, J., Dongarra, J.J. (eds.) PARA 1998. LNCS, vol. 1541, pp. 207–215. Springer, Heidelberg (1998)CrossRefGoogle Scholar
  14. 14.
    Park, N., Hong, B., Prasanna, V.K.: Tiling, Block Data Layout, and Memory Hierarchy Performance. IEEE Trans. Parallel and Distributed Systems 14(7), 640–654 (2003)CrossRefGoogle Scholar
  15. 15.
    Roeder, T., Yotov, K., Pingali, K., Gunnels, J., Gustavson, F.: The Price of Cache Obliviousness. Department of Computer Science, University of Texas, Austin Technical Report CS-TR-06-43 (September 2006)Google Scholar
  16. 16.
    Sinharoy, B., Kalla, R.N., Tendler, J.M, Kovacs, R.G., Eickemeyer, R.J., Joyner, J.B.: POWER5 System Microarchitecture. IBM Journal of Research and Development 49(4/5), 505–521 (2005)Google Scholar
  17. 17.
    Whaley, R.C., Petitet, A., Dongarra, J.J.: Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing (1-2), 3–35 (2001)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • John A. Gunnels
    • 1
  • Fred G. Gustavson
    • 1
  • Keshav Pingali
    • 2
  • Kamen Yotov
    • 2
  1. 1.IBM T. J. Watson Research Center, Yorktown Heights, NY 10598USA
  2. 2.Dept. of Computer Science, Cornell University, Ithaca, NY 14853USA

Personalised recommendations