Latency-aware DVFS for efficient power state transitions on many-core architectures

Lai, Zhiquan; Lam, King Tin; Wang, Cho-Li; Su, Jinshu

doi:10.1007/s11227-015-1415-y

Published: 05 April 2015

Volume 71, pages 2720–2747 (2015)

The Journal of Supercomputing

Zhiquan Lai¹,
King Tin Lam²,
Cho-Li Wang² &
Jinshu Su¹

Abstract

Energy efficiency is quickly becoming a first-class design constraint in high-performance computing (HPC). More efficient power management solutions are needed to reduce the energy costs and carbon footprint of HPC systems. Dynamic voltage and frequency scaling (DVFS) is a commonly used power management technique for trading off power consumption against system performance according to time-varying program behavior. However, prior work on DVFS seldom takes into account the voltage and frequency scaling latencies, which we found to be a crucial factor in determining the efficiency of a power management scheme. Frequent power state transitions without latency awareness can noticeably degrade the execution performance of applications. The design of multiple voltage domains in some many-core architectures makes the effect of DVFS latencies even more significant. These concerns lead us to propose a new latency-aware DVFS scheme that selects the optimal power state more accurately. Our main idea is to analyze the latency characteristics in depth and design a novel profile-guided DVFS solution that exploits the varying execution patterns of a parallel program to avoid excessive power state transitions. We implement the solution as a power management library for use by shared-memory parallel applications. Experimental evaluation on the Intel SCC many-core platform shows significant improvement in power efficiency with our scheme. Compared with a latency-unaware approach, we achieve 24.0 % extra energy saving, 31.3 % more reduction in the energy–delay product and 15.2 % less overhead in execution time in the average case across various benchmarks. Our algorithm is also shown to outperform a prior DVFS approach that attempted to mitigate the latency effects.
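
To make the two central notions in the abstract concrete, the following is a minimal sketch of the energy–delay product and of a latency-aware check on whether a power state transition is worth making. It is not the algorithm of the article: the cost model, the function names and all numeric values are assumptions chosen for illustration only.

```python
def energy_delay_product(energy_j: float, time_s: float) -> float:
    """Energy-delay product (EDP): energy times execution time; lower is better."""
    return energy_j * time_s


def net_saving_j(phase_s: float, p_high_w: float, p_low_w: float,
                 down_latency_s: float, up_latency_s: float) -> float:
    """Estimated energy saved by dropping to a low power state for one phase.

    Toy model (assumption, not taken from the article): the lower power
    level only takes effect once the down transition has completed, and
    scaling back up afterwards stalls the core for up_latency_s while it
    still draws p_high_w.
    """
    time_in_low = max(0.0, phase_s - down_latency_s)
    saved = (p_high_w - p_low_w) * time_in_low
    stall_cost = p_high_w * up_latency_s
    return saved - stall_cost


def should_scale_down(phase_s: float, p_high_w: float, p_low_w: float,
                      down_latency_s: float, up_latency_s: float) -> bool:
    """Latency-aware decision: transition only if the phase is long enough
    for the saving to outweigh the transition overhead."""
    return net_saving_j(phase_s, p_high_w, p_low_w,
                        down_latency_s, up_latency_s) > 0.0


if __name__ == "__main__":
    # Hypothetical numbers: 2 W vs 1 W states, 1 ms latency in each direction.
    for phase in (0.0005, 0.1):
        print(phase, should_scale_down(phase, 2.0, 1.0, 0.001, 0.001))
```

Under this toy model, a phase shorter than the transition latency never justifies a transition, which is the kind of excessive power state change a latency-unaware policy would still make.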

Notes

Including external cooling, the system would draw an aggregate power of 24 megawatts.
In 2013, the average annual residential electricity consumption per capita in China and the US was 498.6 kWh and 4,327.6 kWh, respectively. Detailed calculations and sources: Electricity consumption by China’s urban and rural residents (\(E_\mathrm{china}\)) was \(6,793 \times 10^8\) kWh [25]. China’s population (\(P_\mathrm{china}\)) as of September 2013 was 1,362,391,579 [45]. Dividing \(E_\mathrm{china}\) by \(P_\mathrm{china}\) gives 498.6 kWh. Electricity usage per household in the US (\(E_\mathrm{us}\)) in 2013 was 10,819 kWh [7]. The average household size in the US (\(P_\mathrm{us}\)), as in most wealthy countries, is close to 2.5 persons [44]. Dividing \(E_\mathrm{us}\) by \(P_\mathrm{us}\) gives 4,327.6 kWh.
Our estimation is done as follows: Tianhe-2 uses Xeon E5 2692v2 processors and Xeon Phi 31S1P coprocessors (with 125 and 270 W TDPs). Assume their average power consumptions are 90 and 165 W, respectively (reference [20]). Then 90 W \(\times\) 32,000 + 165 W \(\times\) 48,000 = 10,800 kW. Dividing this by 17,808 kW gives 60.65 %.
For practical safety, we apply a slightly higher voltage than the theoretical least voltage; hence there is a small margin between the theoretical safe boundary curve and the least-voltage operating points for each frequency in Fig. 1.
We are aware of a recent compiler-based study [48] that reports diminishing returns from DVFS, based on an analysis with a high-level model. The authors argue that the reduction in dynamic power achieved by DVFS is trivial compared with the total system power once the performance degradation due to DVFS is considered, and that a “race to sleep” approach is therefore more energy efficient than DVFS. However, this is true only for compute-bound workloads. We observe two recent trends that run counter to the conclusion of their analysis. First, in state-of-the-art supercomputers such as Tianhe-2, the many-core (co)processors account for up to 60 % of the entire system power. Second, it is increasingly important to support data-intensive HPC and multi-tenant cloud computing workloads. Such relatively memory-bound or I/O-bound workloads expose rich opportunities for DVFS to reap energy savings. So, DVFS is still an effective technique for achieving a performance–energy trade-off, as we have experimentally confirmed.
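
The arithmetic behind the per-capita consumption figures and the Tianhe-2 power share in the notes above can be reproduced with a short script. This is only a sketch: the variable names are ours, and the input values are copied from the notes as stated there.

```python
# Per-capita residential electricity use in 2013 (values from the note).
e_china_kwh = 6793e8          # China's residential consumption, kWh [25]
p_china = 1_362_391_579       # China's population, September 2013 [45]
per_capita_china = e_china_kwh / p_china

e_us_household_kwh = 10_819   # US electricity usage per household, kWh [7]
us_household_size = 2.5       # persons per household [44]
per_capita_us = e_us_household_kwh / us_household_size

# Estimated share of processor and coprocessor power in Tianhe-2.
cpu_kw = 90 * 32_000 / 1000   # 32,000 Xeon E5 at an assumed 90 W average
phi_kw = 165 * 48_000 / 1000  # 48,000 Xeon Phi at an assumed 165 W average
denominator_kw = 17_808       # denominator used in the note
share_percent = 100 * (cpu_kw + phi_kw) / denominator_kw

print(f"China: {per_capita_china:.1f} kWh per capita")
print(f"US:    {per_capita_us:.1f} kWh per capita")
print(f"Processor share: {share_percent:.2f} %")
```

Rounded to the precision used in the notes, the three results are 498.6 kWh, 4,327.6 kWh and 60.65 %.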

References

Arabnia HR, Thapliyal H, Vinod AP (2006) Combined integer and floating point multiplication architecture (CIFM) for FPGAs and its reversible logic implementation. In: The 49th IEEE international midwest symposium on circuits and systems (MWSCAS’06), August 6–9, San Juan, Puerto Rico, pp 148–154
Baumann A, Barham P, Dagand PE, Harris T, Isaacs R, Peter S, Roscoe T, Schüpbach A, Singhania A (2009) The multikernel: a new OS architecture for scalable multicore systems. In: The ACM symposium on operating system principles (SOSP’09), pp 29–44
Bennett C, Grossman RL, Locke D, Seidman J, Vejcik S (2010) Malstone: towards a benchmark for analytics on large data clouds. In: The 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’10), ACM, pp 145–152
Cameron KW, Ge R, Feng X (2007) Designing computational clusters for performance and power. Adv Comput 69:89–153
David R, Bogdan P, Marculescu R, Ogras U (2011) Dynamic power management of voltage-frequency island partitioned networks-on-chip using Intel’s Single-Chip Cloud Computer. In: The international symposium on networks-on-chip (NOCS’11), pp 257–258
Donald J, Martonosi M (2006) Techniques for multicore thermal management: classification and new exploration. In: The ACM/IEEE international symposium on computer architecture (ISCA’06), pp 78–88
Fahey J (2013) Home electricity use in US falling to 2001 levels. http://bigstory.ap.org/article/home-electricity-use-us-falling-2001-levels. Accessed Oct 2014
Feng WC, Cameron K (2007) The Green500 list: encouraging sustainable supercomputing. Computer 40(12):50–55
Freeh VW, Lowenthal DK (2005) Using multiple energy gears in MPI programs on a power-scalable cluster. In: The 10th ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP’05). ACM, pp 164–173
Govil K, Chan E, Wasserman H (1995) Comparing algorithms for dynamic speed-setting of a low-power CPU. In: The 1st annual international conference on mobile computing and networking (MobiCom’95). ACM, Berkeley, California, USA, pp 13–25
Graph500: The Graph 500 benchmark. http://www.graph500.org. Accessed Oct 2014
Howard J, Dighe S, Vangal S, Ruhl G, Borkar N, Jain S, Erraguntla V, Konow M, Riepen M, Gries M, Droege G, Lund-Larsen T, Steibl S, Borkar S, De V, Wijngaart RVD (2011) A 48-core IA-32 message-passing processor in 45 nm CMOS using on-die message passing and DVFS for performance and power scaling. IEEE J Solid-State Circuits 46(1):173–183
Ioannou N, Kauschke M, Gries M, Cintra M (2011) Phase-based application-driven hierarchical power management on the Single-Chip Cloud Computer. In: The 20th international conference on parallel architectures and compilation techniques (PACT’11), pp 131–142
Intel Labs (2010) SCC external architecture specification (EAS) (revision 0.94). Technical report. https://communities.intel.com/servlet/JiveServlet/downloadBody/5852-102-1-9012/SCC_EAS.pdf. Accessed May 2010
Intel Labs (2010) The SCC programmer’s guide (revision 1.0). Technical report. https://communities.intel.com/servlet/JiveServlet/previewBody/5684-102-8-22523/SCCProgrammersGuide.pdf. Accessed Nov 2010
Iyer A, Marculescu D (2002) Power efficiency of voltage scaling in multiple clock, multiple voltage cores. In: The IEEE/ACM international conference on computer-aided design (ICCAD’02). ACM, New York, pp 379–386
Lai Z, Lam KT, Wang CL, Su J (2014) A power modeling approach for many-core architectures. In: The 10th international conference on semantics, knowledge and grids (SKG’14), pp 128–132
Lai Z, Lam KT, Wang CL, Su J, Yan Y, Zhu W (2013) Latency-aware dynamic voltage and frequency scaling on many-core architectures for data-intensive applications. In: The international conference on cloud computing and big data (CloudCom-Asia’13), pp 78–83
Lam KT, Shi J, Hung D, Wang CL, Lai Z, Yan Y, Zhu W (2014) Rhymes: a shared virtual memory system for non-coherent tiled many-core architectures. In: The 20th IEEE international conference on parallel and distributed systems (ICPADS’14), December 16–19, Hsinchu, Taiwan
Li B, Chang HC, Song SL, Su CY, Meyer T, Mooring J, Cameron K (2014) The power-performance tradeoffs of the Intel Xeon Phi on HPC applications. In: Workshop on large-scale parallel processing (LSPP’14), pp 1448–1456
Li D, Supinski BRd, Schulz M, Nikolopoulos DS, Cameron KW (2013) Strategies for energy-efficient resource management of hybrid programming models. IEEE Trans Parallel Distrib Syst 24(1):144–157
Lo D, Kozyrakis C (2014) Dynamic management of TurboMode in modern multi-core chips. In: The 20th international symposium on high performance computer architecture (HPCA’14), pp 603–613
Ma K, Li X, Chen M, Wang X (2011) Scalable power control for many-core architectures running multi-threaded applications. In: The ACM/IEEE international symposium on computer architecture (ISCA’11), pp 449–460
Matthews O, Zhang M, Sorin D (2014) Scalably verifiable dynamic power management. In: The 20th IEEE international symposium on high performance computer architecture, pp 579–590
National Energy Administration (NEA) of China: China’s total electricity consumption in 2013. http://www.nea.gov.cn/2014-01/14/c_133043689.htm. Accessed Oct 2014
Deng Q, Meisner D, Bhattacharjee A, Wenisch TF, Bianchini R (2012) CoScale: coordinating CPU and memory system DVFS in server systems. In: The 45th annual IEEE/ACM international symposium on microarchitecture (MICRO’12). IEEE Computer Society, Vancouver, BC, Canada, pp 143–154
Rangan KK, Wei GY, Brooks D (2009) Thread motion: fine-grained power management for multi-core systems. In: The ACM/IEEE international symposium on computer architecture (ISCA’09), pp 302–313
Ravishankar C, Ananthanarayanan S, Garg S, Kennings A (2012) Analysis and evaluation of greedy thread swapping based dynamic power management for MPSoC platforms. In: The 13th international symposium on quality electronic design (ISQED’12), pp 617–624
Rotem E, Mendelson A, Ginosar R, Weiser U (2009) Multiple clock and voltage domains for chip multi processors. In: The 42nd annual IEEE/ACM international symposium on microarchitecture (MICRO’09), New York, pp 459–468
Sartori J, Kumar R (2007) Proactive peak power management for many-core architectures. Technical report CRHC-07-04, University of Illinois at Urbana-Champaign
Simone D (2009) Power management in a manycore operating system. Master’s thesis
Sinkar A, Ghasemi H, Schulte M, Karpuzcu U, Kim NS (2014) Low-cost per-core voltage domain support for power-constrained high-performance processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 22(4):747–758
Sueur EL, Heiser G (2010) Dynamic voltage and frequency scaling: the laws of diminishing returns. In: The 2nd workshop on power aware computing and systems (HotPower’10), pp 1–8
Talpes E, Marculescu D (2005) Toward a multiple clock/voltage island design style for power-aware processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 13(5):591–603
Thapliyal H, Arabnia H, Srinivas MB (2009) Efficient reversible logic design of BCD subtractors. In: Gavrilova M, Tan CJK (eds) Transactions on computational science III, lecture notes in computer science, vol 5300, chap. 6. Springer, Berlin, Heidelberg, pp 99–121
Thapliyal H, Arabnia HR, Bajpai R, Sharma KK (2007) Combined integer and variable precision (CIVP) floating point multiplication architecture for FPGAs. In: The international conference on parallel and distributed processing techniques and applications (PDPTA’07), pp 449–450
Thapliyal H, Arabnia HR, Bajpai R, Sharma KK (2007) Partial reversible gates (PRG) for reversible BCD arithmetic. In: The international conference on computer design (CDES’07), pp 97–98
Thapliyal H, Jayashree HV, Nagamani AN, Arabnia H (2013) Progress in reversible processor design: a novel methodology for reversible carry look-ahead adder. In: Gavrilova M, Tan CJK (eds) Transactions on computational science XVII, lecture notes in computer science, vol 7420, chap. 4. Springer, Berlin, Heidelberg, pp 73–97
Top500 List, June 2014. http://www.top500.org/lists/2014/06/. Accessed Oct 2014
Trader T (2014) China’s supercomputing strategy called out. http://www.hpcwire.com/2014/07/17/dd/. Accessed Oct 2014
Vogeleer KD, Memmi G, Jouvelot P, Coelho F (2014) The energy/frequency convexity rule: modeling and experimental validation on mobile devices. In: Wyrzykowski R, Dongarra J, Karczewski K, Waśniewski J (eds) Parallel processing and applied mathematics, lecture notes in computer science. Springer, Berlin, Heidelberg, pp 793–803
Weiser M, Welch B, Demers A, Shenker S (1994) Scheduling for reduced CPU energy. In: The 1st USENIX conference on operating systems design and implementation (OSDI’94). USENIX Association, Monterey, California
Weissel A, Bellosa F (2002) Process cruise control: event-driven clock scaling for dynamic power management. In: The international conference on compilers, architecture and synthesis for embedded systems (CASES’02), pp 238–246
Wilson L. Average household electricity use around the world. http://shrinkthatfootprint.com/average-household-electricity-consumption. Accessed Oct 2014
World Population Review: China population 2014. http://worldpopulationreview.com/countries/china-population. Accessed Oct 2014
Yang B, Yu Z, Wei J (2014) Design of low-power modern radar SoC based on ASIX. Tsinghua Sci Technol 19(2):168–173
Ye R, Xu Q (2012) Learning-based power management for multi-core processors via idle period manipulation. In: The 17th Asia and South Pacific design automation conference (ASP-DAC’12), pp 115–120
Yuki T, Rajopadhye S (2013) Folklore confirmed: compiling for speed = compiling for energy. In: The 26th international workshop on languages and compilers for parallel computing (LCPC’13), pp 169–184

Acknowledgments

This work is supported by Hong Kong RGC Grant HKU 716712E, the National Basic Research Program of China (973) (No. 2014CB340303) and the National Natural Science Foundation of China (No. 61303264, 61202482). Special thanks go to Intel China Center of Parallel Computing (ICCPC) and Beijing Soft Tech Technologies Co., Ltd. for providing us with support services for the SCC platform in their Wuxi data centers.

Author information

Authors and Affiliations

National Key Laboratory of Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha, China
Zhiquan Lai & Jinshu Su
Department of Computer Science, The University of Hong Kong, Hong Kong, China
King Tin Lam & Cho-Li Wang

Corresponding author

Correspondence to Zhiquan Lai.

About this article

Cite this article

Lai, Z., Lam, K.T., Wang, CL. et al. Latency-aware DVFS for efficient power state transitions on many-core architectures. J Supercomput 71, 2720–2747 (2015). https://doi.org/10.1007/s11227-015-1415-y

Published: 05 April 2015
Issue Date: July 2015
DOI: https://doi.org/10.1007/s11227-015-1415-y
