Abstract
In this paper we conduct a detailed analysis of the sources of power dissipation and energy consumption during the execution of current dense linear algebra kernels on multicore processors, relating these two metrics, together with performance, to the arithmetic intensity of the operations. In particular, by leveraging the RAPL interface of an Intel E5 (“Sandy Bridge”) six-core CPU, we decompose power and energy into their core (mainly due to floating-point units and cache), RAM (off-chip accesses), and uncore components, performing a series of illustrative experiments for a range of high-performance kernels from memory-bound to CPU-bound. Additionally, we investigate the energy proportionality of these three architecture components for the execution of linear algebra routines on the Intel E5.
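As an illustration of the decomposition described above, the following minimal sketch splits RAPL-style energy readings into core, uncore, and RAM shares. It rests on assumptions that are not stated in the text: that the interface reports a package domain, a cores domain (often called PP0), and a DRAM domain; that the uncore share is estimated as package minus cores; and that raw counters are 32 bits wide. The function names `counter_delta` and `decompose` are hypothetical.

```python
def counter_delta(start, end, wrap=2**32):
    """Difference between two raw counter readings.

    Assumption: the counter is `wrap` wide and overflowed at most once
    between the two readings.
    """
    return (end - start) % wrap


def decompose(pkg_joules, cores_joules, dram_joules, seconds):
    """Split measured energy into core, uncore and RAM components.

    Assumption: uncore energy is approximated as package minus cores.
    Returns (energy in joules, average power in watts) per component.
    """
    energy = {
        "core": cores_joules,
        "uncore": pkg_joules - cores_joules,
        "ram": dram_joules,
    }
    power = {name: joules / seconds for name, joules in energy.items()}
    return energy, power


if __name__ == "__main__":
    # Made-up readings for a 10-second kernel run, for illustration only.
    energy, power = decompose(100.0, 60.0, 20.0, 10.0)
    for name in energy:
        print(f"{name}: {energy[name]:.1f} J, {power[name]:.1f} W")
```

The readings in the usage example are invented; real values would come from sampling the energy counters before and after the kernel of interest.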
Notes
According to [18], the “core” mainly comprises the floating-point execution units, branch prediction logic, and the higher levels of cache. The “uncore” is basically composed of the last level of cache (L3 in this processor), the memory and interconnect controllers, and the power control logic.
We also evaluated Intel MKL 10.3 and GotoBLAS 1.13, but observed higher performance with OpenBLAS for the kernels and platform targeted in this study.
LAPACK version 3.5.0, http://www.netlib.org/lapack.
For simplicity, hereafter we neglect lower order terms in the arithmetic and storage costs.
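To make the effect of keeping only leading-order terms concrete, the sketch below computes the arithmetic intensity (flops per stored matrix element) of two standard kernels. The choice of kernels is an illustrative assumption and is not taken from the notes above: a matrix-vector product with an n x n matrix (about 2n^2 flops over n^2 elements) and a matrix-matrix product of n x n matrices (about 2n^3 flops over 3n^2 elements). The function names are hypothetical.

```python
def gemv_costs(n):
    """Leading-order (flops, stored elements) for y = A x + y, A of size n x n.

    The O(n) vector terms are neglected.
    """
    return 2 * n**2, n**2


def gemm_costs(n):
    """Leading-order (flops, stored elements) for C = A B + C, all n x n."""
    return 2 * n**3, 3 * n**2


def intensity(flops, elements):
    """Arithmetic intensity: flops per stored matrix element."""
    return flops / elements


if __name__ == "__main__":
    for n in (1000, 3000):
        print(n, intensity(*gemv_costs(n)), intensity(*gemm_costs(n)))
```

Under these assumptions the matrix-vector product has a constant intensity of 2 (memory-bound), while the matrix-matrix product has intensity 2n/3, which grows with the problem size (CPU-bound for large n).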
References
Alonso P, Dolz MF, Mayo R, Quintana-Ortí ES (2014) Modeling power and energy consumption of dense matrix factorizations on multicore processors. Concurr. Computat. Practice Exp. (to appear)
Anderson E, Bai Z, Bischof C, Blackford LS, Demmel J, Dongarra JJ, Croz JD, Hammarling S, Greenbaum A, McKenney A, Sorensen D (1999) LAPACK Users’ guide, 3rd edn. SIAM, Philadelphia
Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA (2006) The landscape of parallel computing research: a view from Berkeley. Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley
Barroso LA (2005) The price of performance. ACM Queue 3:48–53
Barroso LA, Hölzle U (2007) The case for energy-proportional computing. Computer 40(12):33–37
Beckett J, Bradfield R (2011) Power efficiency comparison of enterprise-class blade servers and enclosures. A Dell Technical White Paper
Bosilca G, Ltaief H, Dongarra J (2012) Power profiling of Cholesky and QR factorizations on distributed memory systems. In: Third international conference on energy-aware high performance computing (Ena-HPC), Hamburg, pp 1–9
Choi JW, Bedard D, Fowler R, Vuduc R (2013) A roofline model of energy. In: 27th IEEE Int Symp Parallel Distributed Processing (IPDPS), pp 661–672
Curtis-Maury M, Dzierwa J, Antonopoulos CD, Nikolopoulos DS (2006) Online power-performance adaptation of multithreaded programs using hardware event-based prediction. In: Proc. 20th Annual Int. Conf. Supercomputing, ICS ’06, pp 157–166
David H, Gorbatov E, Hanebutte UR, Khanna R, Le C (2010) RAPL: memory power estimation and capping. In: 2010 ACM/IEEE Int. Symp. Low-Power Electronics and Design (ISLPED), pp 189–194
Demmel J, Gearhart A (2012) Instrumenting linear algebra energy consumption via on-chip energy counters. Tech. Rep. UCB/EECS-2012-168, EECS Department, University of California, Berkeley
Dongarra JJ, Du Croz J, Hammarling S, Duff I (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17
Elnozahy E, Kistler M, Rajamony R (2003) Energy-efficient server clusters. In: Power-Aware Computer Systems Second International Workshop, PACS 2002. Lecture Notes in Computer Science (LNCS), vol 2325. Springer, Cambridge, pp 179–197
Esmaeilzadeh H, Blem E, St. Amant R, Sankaralingam K, Burger D (2011) Dark silicon and the end of multicore scaling. In: Proc. 38th Annual Int. Symp. Computer architecture, ISCA ’11, pp 365–376
Freeh VW, Lowenthal D, Pan F, Kappiah N, Springer R, Rountree B, Femal M (2007) Analyzing the energy-time trade-off in high-performance computing applications. IEEE Trans Parallel Distrib Syst 18(6):835–848
Golub GH, Van Loan CF (1989) Matrix computations, 2nd edn. The Johns Hopkins Univ. Press, Baltimore
Goto K, van de Geijn R (2008) High performance implementation of the level-3 BLAS. ACM Trans Math Softw 35(1):4:1–4:14
Hill DL, Huff T, Kulick S, Safranek R (2010) The uncore: a modular approach to feeding the high-performance cores. Intel Technol J 14(3):30–49
Intel: Math Kernel Library (2012). http://developer.intel.com/software/products/mkl/. Accessed Apr 2014
OpenBLAS (2012). http://xianyi.github.com/OpenBLAS/. Accessed Apr 2014
Ryckbosch F, Polfliet S, Eeckhout L (2011) Trends in server energy proportionality. Computer 44(9):69–72
Van Zee FG, van de Geijn RA (2013) BLIS: A framework for generating BLAS-like libraries. ACM Trans Math Softw (to appear)
Acknowledgments
This work was supported by the CICYT project TIN2011-23283 of MINECO and FEDER, and the EU Project FP7 318793 “EXA2GREEN”.
Cite this article
Aliaga, J.I., Barreda, M., Dolz, M.F. et al. Are our dense linear algebra libraries energy-friendly? Comput Sci Res Dev 30, 187–196 (2015). https://doi.org/10.1007/s00450-014-0263-y
DOI: https://doi.org/10.1007/s00450-014-0263-y