Computer Science - Research and Development, Volume 27, Issue 4, pp 277–287

Profiling high performance dense linear algebra algorithms on multicore architectures for power and energy efficiency

  • Hatem Ltaief
  • Piotr Luszczek
  • Jack Dongarra
Special Issue Paper


This paper presents the power profiles of two high performance dense linear algebra libraries, LAPACK and PLASMA. The former is based on block algorithms that use the fork-join paradigm to achieve parallel performance. The latter uses fine-grained task parallelism and recasts the computation to operate on submatrices called tiles, which yields tile algorithms. We profile the power consumption of the most common routines, which lets us clearly identify the different phases of the computations and isolate the bottlenecks in terms of energy efficiency. Our results show that PLASMA surpasses LAPACK not only in performance but also in energy efficiency.
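
As a rough illustration of how a power profile relates to energy, the sketch below integrates a sampled power trace over time with the trapezoidal rule. It is not the authors' measurement setup: the function name `energy_joules` and both traces are hypothetical, chosen only to show that a run drawing more peak power can still consume less energy if it finishes sooner.

```python
def energy_joules(samples):
    """Integrate a power trace with the trapezoidal rule.

    samples: list of (time_seconds, power_watts) pairs, sorted by time.
    Returns the energy in joules (watts * seconds).
    """
    total = 0.0
    # Pair each sample with its successor and add the trapezoid area.
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical traces (assumed values, not data from the paper).
slow_run = [(0.0, 100.0), (1.0, 100.0), (2.0, 100.0), (3.0, 100.0), (4.0, 100.0)]
fast_run = [(0.0, 100.0), (1.0, 150.0), (2.0, 100.0)]

print(energy_joules(slow_run))  # lower power, longer run
print(energy_joules(fast_run))  # higher peak power, shorter run
```

In this made-up example the shorter run uses less energy despite its higher peak, which is the kind of trade-off a power profile makes visible.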


Keywords: Power profile; Energy efficiency; Dense linear algebra; Tile algorithms; Multicore architectures





Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia
  2. Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA
