Latency-aware DVFS for efficient power state transitions on many-core architectures

The Journal of Supercomputing

Abstract

Energy efficiency is quickly becoming a first-class design constraint in high-performance computing (HPC). More efficient power management solutions are needed to reduce the energy costs and carbon footprint of HPC systems. Dynamic voltage and frequency scaling (DVFS) is a commonly used power management technique that trades off power consumption against system performance according to time-varying program behavior. However, prior work on DVFS seldom takes into account the voltage and frequency scaling latencies, which we found to be a crucial factor in determining the efficiency of a power management scheme. Frequent power state transitions made without latency awareness can noticeably degrade the execution performance of applications. The design of multiple voltage domains in some many-core architectures makes the effect of DVFS latencies even more significant. These concerns lead us to propose a new latency-aware DVFS scheme that selects the optimal power state more accurately. Our main idea is to analyze the latency characteristics in depth and design a novel profile-guided DVFS solution that exploits the varying execution patterns of the parallel program to avoid excessive power state transitions. We implement the solution as a power management library for use by shared-memory parallel applications. Experimental evaluation on the Intel SCC many-core platform shows significant improvement in power efficiency with our scheme. Compared with a latency-unaware approach, we achieve 24.0 % extra energy saving, 31.3 % more reduction in the energy–delay product and 15.2 % less overhead in execution time in the average case for various benchmarks. Our algorithm is also shown to outperform a prior DVFS approach that attempted to mitigate the latency effects.
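To make the role of transition latency concrete, the following Python sketch compares the energy–delay product (energy multiplied by execution time) of staying in the current power state against switching to a lower one for a single program phase. It is only an illustration of the general idea, not the profile-guided algorithm of the paper, which this preview does not describe. The `PowerState` type, the helper names, the operating points and the assumption that the core stalls at its current power during a transition are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    """Hypothetical operating point; the values used below are illustrative."""
    freq_mhz: float
    power_w: float


def phase_cost(work_mcycles, state, latency_s=0.0, stall_power_w=0.0):
    """Return (time_s, energy_j) for one phase, including an optional transition stall."""
    run_time_s = work_mcycles / state.freq_mhz  # Mcycles / MHz = seconds
    time_s = latency_s + run_time_s
    energy_j = stall_power_w * latency_s + state.power_w * run_time_s
    return time_s, energy_j


def edp(time_s, energy_j):
    """Energy-delay product: energy multiplied by execution time."""
    return energy_j * time_s


def should_switch(work_mcycles, current, target, latency_s):
    """Switch only if the phase's EDP is lower after paying the transition latency.

    Assumption: during the transition the core makes no progress and keeps
    drawing the power of the current state.
    """
    stay = edp(*phase_cost(work_mcycles, current))
    move = edp(*phase_cost(work_mcycles, target, latency_s, current.power_w))
    return move < stay


# Illustrative (made-up) operating points and a 40 ms transition latency.
high = PowerState(freq_mhz=800.0, power_w=4.0)
low = PowerState(freq_mhz=400.0, power_w=0.8)

print(should_switch(800.0, high, low, latency_s=0.04))  # long phase
print(should_switch(8.0, high, low, latency_s=0.04))    # short phase
print(should_switch(8.0, high, low, latency_s=0.0))     # short phase, latency ignored
```

Under these made-up numbers, the long phase amortizes the transition cost, while the short phase does not. A latency-unaware policy (the third call) would still scale down for the short phase, which is the kind of excessive transition the abstract warns against.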


Notes

  1. Including external cooling, the system would draw an aggregate power of 24 megawatts.

  2. In 2013, the average annual residential electricity consumption per capita in China and the US was 498.6 kWh and 4,327.6 kWh, respectively. Detailed calculations and sources: Electricity consumption by China’s urban and rural residents (\(E_\mathrm{china}\)) was \(6,793 \times 10^8\) kWh [25]. China’s population (\(P_\mathrm{china}\)) as of September 2013 was 1,362,391,579 [45]. Dividing \(E_\mathrm{china}\) by \(P_\mathrm{china}\) gives 498.6 kWh. Electricity usage per household in the US (\(E_\mathrm{us}\)) in 2013 was 10,819 kWh [7]. The average household size in the US (\(P_\mathrm{us}\)) (or in most wealthy countries) is close to 2.5 persons [44]. Dividing \(E_\mathrm{us}\) by \(P_\mathrm{us}\) gives 4,327.6 kWh.

  3. Our estimation is as follows: Tianhe-2 uses Xeon E5 2692v2 and Xeon Phi 31S1P processors (with TDPs of 125 and 270 W, respectively). Assume their average power consumptions are 90 and 165 W, respectively (reference [20]). Then 90 W \(\times \) 32,000 + 165 W \(\times \) 48,000 = 10,800 kW. Dividing this by 17,808 kW gives 60.65 %.

  4. For practical safety, we apply a slightly higher voltage than the theoretical least voltage, hence there is a small margin between the theoretical safe boundary curve and the least-voltage operating points for each frequency in Fig. 1.

  5. We are aware of a recent compiler-based study [48] that reports diminishing returns from DVFS, based on an analysis with a high-level model. The authors argue that the reduction in dynamic power from DVFS is trivial compared with the total system power once the performance degradation due to DVFS is considered, and that a “race to sleep” approach is therefore more energy efficient than DVFS. However, this is true only for compute-bound workloads. We observe two recent trends that run counter to the conclusion of their analysis. First, in state-of-the-art supercomputers such as Tianhe-2, the many-core (co)processors account for up to 60 % of the entire system power. Second, it is increasingly important to support data-intensive HPC and multi-tenant cloud computing workloads. Such relatively memory-bound or I/O-bound workloads expose rich opportunities for DVFS to save energy. DVFS is therefore still an effective technique for achieving a performance–energy trade-off, as we have experimentally confirmed.
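The arithmetic in Notes 2 and 3 can be reproduced with a few lines of Python. All input figures are taken directly from the notes; the variable names are ours, and the 17,808 kW denominator is used exactly as it appears in Note 3.

```python
# Note 2: per-capita residential electricity consumption in 2013.
e_china_kwh = 6793e8           # residential consumption in China [25]
p_china = 1_362_391_579        # population of China, September 2013 [45]
e_us_household_kwh = 10_819    # electricity usage per US household [7]
p_us_household = 2.5           # average US household size [44]

china_per_capita_kwh = e_china_kwh / p_china
us_per_capita_kwh = e_us_household_kwh / p_us_household

# Note 3: estimated share of Tianhe-2 power drawn by its (co)processors.
xeon_kw = 90 * 32_000 / 1000       # 32,000 Xeon E5 2692v2 at an assumed 90 W each
xeon_phi_kw = 165 * 48_000 / 1000  # 48,000 Xeon Phi 31S1P at an assumed 165 W each
processor_kw = xeon_kw + xeon_phi_kw
denominator_kw = 17_808            # figure quoted in Note 3
processor_share = processor_kw / denominator_kw

print(f"China: {china_per_capita_kwh:.1f} kWh per capita")
print(f"US: {us_per_capita_kwh:.1f} kWh per capita")
print(f"Processors: {processor_kw:.0f} kW, {processor_share:.2%} of {denominator_kw} kW")
```

The results agree with the values quoted in the notes: 498.6 kWh, 4,327.6 kWh, 10,800 kW and 60.65 %.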

References

  1. Arabnia HR, Thapliyal H, Vinod AP (2006) Combined integer and floating point multiplication architecture (CIFM) for FPGAs and its reversible logic implementation. In: The 49th IEEE international midwest symposium on circuits and systems (MWSCAS’06), August 6–9, San Juan, Puerto Rico, pp 148–154

  2. Baumann A, Barham P, Dagand PE, Harris T, Isaacs R, Peter S, Roscoe T, Schüpbach A, Singhania A (2009) The multikernel: a new OS architecture for scalable multicore systems. In: The ACM symposium on operating system principles (SOSP’09), pp 29–44

  3. Bennett C, Grossman RL, Locke D, Seidman J, Vejcik S (2010) Malstone: towards a benchmark for analytics on large data clouds. In: The 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’10), ACM, pp 145–152

  4. Cameron KW, Ge R, Feng X (2007) Designing computational clusters for performance and power. Adv Comput 69:89–153

  5. David R, Bogdan P, Marculescu R, Ogras U (2011) Dynamic power management of voltage-frequency island partitioned networks-on-chip using Intel’s Single-Chip Cloud Computer. In: The international symposium on networks-on-chip (NOCS’11), pp 257–258

  6. Donald J, Martonosi M (2006) Techniques for multicore thermal management: classification and new exploration. In: The ACM/IEEE international symposium on computer architecture (ISCA’06), pp 78–88

  7. Fahey J (2013) Home electricity use in US falling to 2001 levels. http://bigstory.ap.org/article/home-electricity-use-us-falling-2001-levels. Accessed Oct 2014

  8. Feng WC, Cameron K (2007) The Green500 list: encouraging sustainable supercomputing. Computer 40(12):50–55

  9. Freeh VW, Lowenthal DK (2005) Using multiple energy gears in MPI programs on a power-scalable cluster. In: The 10th ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP’05). ACM, pp 164–173

  10. Govil K, Chan E, Wasserman H (1995) Comparing algorithms for dynamic speed-setting of a low-power CPU. In: The 1st annual international conference on mobile computing and networking (MobiCom’95). ACM, Berkeley, California, USA, pp 13–25

  11. Graph500: The Graph 500 benchmark. http://www.graph500.org. Accessed Oct 2014

  12. Howard J, Dighe S, Vangal S, Ruhl G, Borkar N, Jain S, Erraguntla V, Konow M, Riepen M, Gries M, Droege G, Lund-Larsen T, Steibl S, Borkar S, De V, Wijngaart RVD (2011) A 48-core IA-32 message-passing processor in 45 nm CMOS using on-die message passing and DVFS for performance and power scaling. IEEE J Solid-State Circuits 46(1):173–183

  13. Ioannou N, Kauschke M, Gries M, Cintra M (2011) Phase-based application-driven hierarchical power management on the Single-Chip Cloud Computer. In: The 20th international conference on parallel architectures and compilation techniques (PACT’11), pp 131–142

  14. Intel Labs (2010) SCC external architecture specification (EAS) (revision 0.94). Technical report. https://communities.intel.com/servlet/JiveServlet/downloadBody/5852-102-1-9012/SCC_EAS.pdf. Accessed May 2010

  15. Intel Labs (2010) The SCC programmer’s guide (revision 1.0). Technical report. https://communities.intel.com/servlet/JiveServlet/previewBody/5684-102-8-22523/SCCProgrammersGuide.pdf. Accessed Nov 2010

  16. Iyer A, Marculescu D (2002) Power efficiency of voltage scaling in multiple clock, multiple voltage cores. In: The IEEE/ACM international conference on computer-aided design (ICCAD’02). ACM, New York, pp 379–386

  17. Lai Z, Lam KT, Wang CL, Su J (2014) A power modeling approach for many-core architectures. In: The 10th international conference on semantics, knowledge and grids (SKG’14), pp 128–132

  18. Lai Z, Lam KT, Wang CL, Su J, Yan Y, Zhu W (2013) Latency-aware dynamic voltage and frequency scaling on many-core architectures for data-intensive applications. In: The international conference on cloud computing and big data (CloudCom-Asia’13), pp 78–83

  19. Lam KT, Shi J, Hung D, Wang CL, Lai Z, Yan Y, Zhu W (2014) Rhymes: a shared virtual memory system for non-coherent tiled many-core architectures. In: The 20th IEEE international conference on parallel and distributed systems (ICPADS’14), December 16–19, Hsinchu, Taiwan

  20. Li B, Chang HC, Song SL, Su CY, Meyer T, Mooring J, Cameron K (2014) The power-performance tradeoffs of the Intel Xeon Phi on HPC applications. In: Workshop on large-scale parallel processing (LSPP’14), pp 1448–1456

  21. Li D, Supinski BRd, Schulz M, Nikolopoulos DS, Cameron KW (2013) Strategies for energy-efficient resource management of hybrid programming models. IEEE Trans Parallel Distrib Syst 24(1):144–157

  22. Lo D, Kozyrakis C (2014) Dynamic management of TurboMode in modern multi-core chips. In: The 20th international symposium on high performance computer architecture (HPCA’14), pp 603–613

  23. Ma K, Li X, Chen M, Wang X (2011) Scalable power control for many-core architectures running multi-threaded applications. In: The ACM/IEEE international symposium on computer architecture (ISCA’11), pp 449–460

  24. Matthews O, Zhang M, Sorin D (2014) Scalably verifiable dynamic power management. In: The 20th IEEE international symposium on high performance computer architecture, pp 579–590

  25. National Energy Administration (NEA) of China: China’s total electricity consumption in 2013. http://www.nea.gov.cn/2014-01/14/c_133043689.htm. Accessed Oct 2014

  26. Deng Q, Meisner D, Bhattacharjee A, Wenisch TF, Bianchini R (2012) CoScale: coordinating CPU and memory system DVFS in server systems. In: The 45th annual IEEE/ACM international symposium on microarchitecture (MICRO’12). IEEE Computer Society, Vancouver, BC, Canada, pp 143–154

  27. Rangan KK, Wei GY, Brooks D (2009) Thread motion: fine-grained power management for multi-core systems. In: The ACM/IEEE international symposium on computer architecture (ISCA’09), pp 302–313

  28. Ravishankar C, Ananthanarayanan S, Garg S, Kennings A (2012) Analysis and evaluation of greedy thread swapping based dynamic power management for MPSoC platforms. In: The 13th international symposium on quality electronic design (ISQED’12), pp 617–624

  29. Rotem E, Mendelson A, Ginosar R, Weiser U (2009) Multiple clock and voltage domains for chip multi processors. In: The 42nd annual IEEE/ACM international symposium on microarchitecture (MICRO’09), New York, pp 459–468

  30. Sartori J, Kumar R (2007) Proactive peak power management for many-core architectures. Technical report CRHC-07-04, University of Illinois at Urbana-Champaign

  31. Simone D (2009) Power management in a manycore operating system. Master’s thesis

  32. Sinkar A, Ghasemi H, Schulte M, Karpuzcu U, Kim NS (2014) Low-cost per-core voltage domain support for power-constrained high-performance processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 22(4):747–758

  33. Sueur EL, Heiser G (2010) Dynamic voltage and frequency scaling: the laws of diminishing returns. In: The 2nd workshop on power aware computing and systems (HotPower’10), pp 1–8

  34. Talpes E, Marculescu D (2005) Toward a multiple clock/voltage island design style for power-aware processors. IEEE Trans Very Large Scale Integr (VLSI) Syst 13(5):591–603

  35. Thapliyal H, Arabnia H, Srinivas MB (2009) Efficient reversible logic design of BCD subtractors. In: Gavrilova M, Tan CJK (eds) Transactions on computational science III, lecture notes in computer science, vol 5300, chap. 6. Springer, Berlin, Heidelberg, pp 99–121

  36. Thapliyal H, Arabnia HR, Bajpai R, Sharma KK (2007) Combined integer and variable precision (CIVP) floating point multiplication architecture for FPGAs. In: The international conference on parallel and distributed processing techniques and applications (PDPTA’07), pp 449–450

  37. Thapliyal H, Arabnia HR, Bajpai R, Sharma KK (2007) Partial reversible gates (PRG) for reversible BCD arithmetic. In: The international conference on computer design (CDES’07), pp 97–98

  38. Thapliyal H, Jayashree HV, Nagamani AN, Arabnia H (2013) Progress in reversible processor design: a novel methodology for reversible carry look-ahead adder. In: Gavrilova M, Tan CJK (eds) Transactions on computational science XVII, lecture notes in computer science, vol 7420, chap. 4. Springer, Berlin, Heidelberg, pp 73–97

  39. Top500 List, June 2014. http://www.top500.org/lists/2014/06/. Accessed Oct 2014

  40. Trader T (2014) China’s supercomputing strategy called out. http://www.hpcwire.com/2014/07/17/dd/. Accessed Oct 2014

  41. Vogeleer KD, Memmi G, Jouvelot P, Coelho F (2014) The energy/frequency convexity rule: modeling and experimental validation on mobile devices. In: Wyrzykowski R, Dongarra J, Karczewski K, Waśniewski J (eds) Parallel processing and applied mathematics, lecture notes in computer science. Springer, Berlin, Heidelberg, pp 793–803

  42. Weiser M, Welch B, Demers A, Shenker S (1994) Scheduling for reduced CPU energy. In: The 1st USENIX conference on operating systems design and implementation (OSDI’94). USENIX Association, Monterey, California

  43. Weissel A, Bellosa F (2002) Process cruise control: event-driven clock scaling for dynamic power management. In: The international conference on compilers, architecture and synthesis for embedded systems (CASES’02), pp 238–246

  44. Wilson L. Average household electricity use around the world. http://shrinkthatfootprint.com/average-household-electricity-consumption. Accessed Oct 2014

  45. World Population Review: China population 2014. http://worldpopulationreview.com/countries/china-population. Accessed Oct 2014

  46. Yang B, Yu Z, Wei J (2014) Design of low-power modern radar SoC based on ASIX. Tsinghua Sci Technol 19(2):168–173

  47. Ye R, Xu Q (2012) Learning-based power management for multi-core processors via idle period manipulation. In: The 17th Asia and South Pacific design automation conference (ASP-DAC’12), pp 115–120

  48. Yuki T, Rajopadhye S (2013) Folklore confirmed: compiling for speed = compiling for energy. In: The 26th international workshop on languages and compilers for parallel computing (LCPC’13), pp 169–184

Acknowledgments

This work is supported by Hong Kong RGC Grant HKU 716712E, the National Basic Research Program of China (973) (No. 2014CB340303) and the National Natural Science Foundation of China (No. 61303264, 61202482). Special thanks go to Intel China Center of Parallel Computing (ICCPC) and Beijing Soft Tech Technologies Co., Ltd. for providing support services for the SCC platform in their Wuxi data centers.

Author information

Corresponding author

Correspondence to Zhiquan Lai.

About this article

Cite this article

Lai, Z., Lam, K.T., Wang, CL. et al. Latency-aware DVFS for efficient power state transitions on many-core architectures. J Supercomput 71, 2720–2747 (2015). https://doi.org/10.1007/s11227-015-1415-y

  • DOI: https://doi.org/10.1007/s11227-015-1415-y
