Power profiling of Cholesky and QR factorizations on distributed memory systems

Special Issue Paper

Abstract

This paper presents the power profiles of two high-performance dense linear algebra libraries for distributed memory systems, ScaLAPACK and DPLASMA. Algorithmically, the two take opposite approaches. ScaLAPACK is based on block algorithms and relies on multithreaded BLAS and a two-dimensional block-cyclic data distribution to achieve high parallel performance. DPLASMA is based on tile algorithms operating on a tile data layout and uses fine-grained task parallelism combined with a dynamic distributed scheduler (DAGuE) to exploit distributed memory systems. We report performance (Gflop/s) and power profiles (Watts) for two common dense factorizations used to solve linear systems of equations, Cholesky and QR. The results show that DPLASMA outperforms ScaLAPACK both in performance (up to 2x speedup) and in energy efficiency (up to 62%).
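To make the two-dimensional block-cyclic data distribution mentioned above concrete, here is a minimal sketch of how matrix blocks can be mapped onto a process grid. It is an illustration only, not ScaLAPACK code: the function names and parameters (`mb`, `nb`, `p_rows`, `p_cols`) are hypothetical, indices are zero-based, and the distribution is assumed to start at process (0, 0).

```python
def block_cyclic_owner(i, j, mb, nb, p_rows, p_cols):
    """Return the (process row, process column) that owns global entry (i, j).

    Assumes zero-based indices, mb-by-nb blocks, a p_rows-by-p_cols process
    grid, and a distribution that starts at process (0, 0).
    """
    return (i // mb) % p_rows, (j // nb) % p_cols


def owner_map(m, n, mb, nb, p_rows, p_cols):
    """Return the owner of every block of an m-by-n matrix as a nested list."""
    block_rows = -(-m // mb)  # ceiling division: number of block rows
    block_cols = -(-n // nb)  # ceiling division: number of block columns
    return [
        [(bi % p_rows, bj % p_cols) for bj in range(block_cols)]
        for bi in range(block_rows)
    ]


if __name__ == "__main__":
    # 8-by-8 matrix, 2-by-2 blocks, 2-by-2 process grid.
    for row in owner_map(8, 8, 2, 2, 2, 2):
        print(row)
```

Because consecutive blocks wrap around the grid in both dimensions, every process keeps a share of the trailing submatrix as a factorization proceeds, which is the usual motivation for this layout.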

Keywords

Power profile analysis; Dense linear algebra; Distributed memory system; Dynamic scheduler

Copyright information

© Springer-Verlag (outside the USA) 2012

Authors and Affiliations

  1. Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
  2. Supercomputing Laboratory, KAUST, Thuwal, Saudi Arabia
