Power profiling of Cholesky and QR factorizations on distributed memory systems
This paper presents the power profiles of two high-performance dense linear algebra libraries for distributed memory systems, ScaLAPACK and DPLASMA. Algorithmically, the two libraries take opposite approaches. ScaLAPACK is based on block algorithms and relies on multithreaded BLAS and a two-dimensional block-cyclic data distribution to achieve high parallel performance. DPLASMA is based on tile algorithms running on top of a tile data layout and uses fine-grained task parallelism combined with a dynamic distributed scheduler (DAGuE) to exploit distributed memory systems. We present performance results (Gflop/s) as well as power profiles (Watts) for two common dense factorizations used to solve linear systems of equations, namely Cholesky and QR. The reported numbers show that DPLASMA surpasses ScaLAPACK not only in performance (up to a 2x speedup) but also in energy efficiency (up to 62%).
Keywords: Power profile analysis; Dense linear algebra; Distributed memory system; Dynamic scheduler
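The two-dimensional block-cyclic distribution mentioned in the abstract can be illustrated with a minimal sketch. It assumes a process grid whose first row and column own the first block (zero source offsets); the function names `block_owner` and `ownership_map` and the matrix, block, and grid sizes are hypothetical choices for illustration and are not taken from ScaLAPACK or from the paper.

```python
def block_owner(i, j, mb, nb, p_rows, p_cols):
    """Return the (process_row, process_col) that owns global entry (i, j).

    Assumes a 2D block-cyclic layout with mb x nb blocks dealt out
    round-robin over a p_rows x p_cols process grid, starting at
    process (0, 0). All names here are illustrative, not a library API.
    """
    return (i // mb) % p_rows, (j // nb) % p_cols


def ownership_map(m, n, mb, nb, p_rows, p_cols):
    """Build an m x n nested list giving the owner of every matrix entry."""
    return [
        [block_owner(i, j, mb, nb, p_rows, p_cols) for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    # Hypothetical sizes: an 8 x 8 matrix, 2 x 2 blocks, a 2 x 3 process grid.
    for row in ownership_map(8, 8, 2, 2, 2, 3):
        print(" ".join(f"{pr}{pc}" for pr, pc in row))
```

Because consecutive blocks go to different processes, every process keeps a share of the trailing submatrix as a factorization proceeds, which is the usual load-balancing argument for this layout.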
The authors would like to thank Prof. Kirk Cameron of the Department of Computer Science at Virginia Tech for granting access to his platform.