Automatic Tuning of PDGEMM Towards Optimal Performance

  • Sascha Hunold
  • Thomas Rauber
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3648)


Sophisticated parallel matrix multiplication algorithms such as PDGEMM have a complex structure and are controlled by a large set of parameters, including blocking factors and the block sizes used for the serial execution on a participating processor. Selecting these parameters so that the execution time is minimized requires a deep understanding of both the parallel algorithm and the execution platform. In this article, we describe a simple mechanism that automatically selects a suitable set of parameters for PDGEMM, which leads to a minimum execution time in most cases.
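
To make the idea of automatic parameter selection concrete, the following is a minimal sketch of empirical tuning for a single parameter: the block size of a serial blocked matrix multiplication. It is not the mechanism described in the article and it does not call PDGEMM; the names `blocked_matmul` and `pick_block_size`, the candidate list, and the use of plain Python lists are all assumptions made for illustration. The sketch simply times each candidate and keeps the fastest one.

```python
import time


def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) tile by tile.

    `block` is the edge length of the square tiles (hypothetical parameter).
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # Multiply one tile of A with one tile of B into a tile of C.
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + block, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return c


def pick_block_size(n, candidates, repeats=3):
    """Time every candidate block size and return (fastest, timings)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, n, block)
            # Keep the best of several runs to reduce timing noise.
            best = min(best, time.perf_counter() - start)
        timings[block] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_block, timings = pick_block_size(64, [8, 16, 32, 64])
    print("fastest block size on this machine:", best_block)
```

The selected block size depends on the machine and on the matrix dimension, which is why the search is run on the target platform rather than fixed in advance. A tuner for a parallel routine would have to search over several interacting parameters at once, which this sketch does not attempt.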


Keywords: Block Size, Matrix Dimension, Blocking Factor, Serial Execution, Execution Platform
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Sascha Hunold (1)
  • Thomas Rauber (1)
  1. Department of Mathematics and Physics, University of Bayreuth, Germany
