The Journal of Supercomputing, Volume 71, Issue 7, pp 2720–2747

Latency-aware DVFS for efficient power state transitions on many-core architectures

  • Zhiquan Lai
  • King Tin Lam
  • Cho-Li Wang
  • Jinshu Su


Energy efficiency is quickly becoming a first-class design constraint in high-performance computing (HPC). We need more efficient power management solutions to reduce the energy costs and carbon footprint of HPC systems. Dynamic voltage and frequency scaling (DVFS) is a commonly used power management technique that trades off power consumption against system performance according to time-varying program behavior. However, prior work on DVFS seldom takes into account the voltage and frequency scaling latencies, which we found to be a crucial factor in determining the efficiency of a power management scheme. Frequent power state transitions made without latency awareness can noticeably degrade the execution performance of applications. The design of multiple voltage domains in some many-core architectures has made the effect of DVFS latencies even more significant. These concerns lead us to propose a new latency-aware DVFS scheme that adjusts to the optimal power state more accurately. Our main idea is to analyze the latency characteristics in depth and design a novel profile-guided DVFS solution that exploits the varying execution patterns of the parallel program to avoid excessive power state transitions. We implement the solution as a power management library for use by shared-memory parallel applications. Experimental evaluation on the Intel SCC many-core platform shows a significant improvement in power efficiency with our scheme. Compared with a latency-unaware approach, we achieve 24.0% extra energy saving, 31.3% more reduction in the energy–delay product and 15.2% less overhead in execution time in the average case across various benchmarks. Our algorithm is also shown to outperform a prior DVFS approach that attempted to mitigate the latency effects.
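To make the role of transition latency concrete, the following minimal sketch computes the energy–delay product (energy multiplied by execution time) and applies a simple break-even test for scaling down during one program phase. It is not the algorithm from the paper: the `PowerState` type, the cost model (the core does no useful work while switching and draws the power of the higher state), and all numeric values are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PowerState:
    """Hypothetical operating point; the values used with it are illustrative only."""
    name: str
    power_watts: float  # average power drawn while running in this state
    slowdown: float     # run time relative to the fastest state (>= 1.0)


def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """EDP = energy x delay; lower is better."""
    return energy_joules * delay_seconds


def phase_cost(state, base_seconds, latency=0.0, transition_power=0.0):
    """Return (energy, delay) for one phase run in `state`.

    Assumed model: a round trip (switch down, then back up) costs
    2 * latency seconds, during which no useful work is done and
    `transition_power` watts are drawn.
    """
    run_seconds = base_seconds * state.slowdown
    delay = run_seconds + 2 * latency
    energy = state.power_watts * run_seconds + transition_power * 2 * latency
    return energy, delay


def should_scale_down(high, low, base_seconds, latency):
    """Scale down only if it still saves energy after paying the latency."""
    e_stay, _ = phase_cost(high, base_seconds)
    e_switch, _ = phase_cost(low, base_seconds, latency, high.power_watts)
    return e_switch < e_stay


high = PowerState("high", power_watts=2.0, slowdown=1.0)
low = PowerState("low", power_watts=1.0, slowdown=1.2)

# A long phase amortizes the switching cost; a very short one does not.
long_phase = should_scale_down(high, low, base_seconds=1.0, latency=0.01)
short_phase = should_scale_down(high, low, base_seconds=0.01, latency=0.01)
```

Under these assumed numbers the one-second phase is worth scaling down for, while the ten-millisecond phase is not, because the round-trip latency outweighs the saving. That is the kind of excessive transition a latency-aware policy would skip.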


Power management; Dynamic voltage and frequency scaling; Profiling; Shared virtual memory; Many-core processors; The single-chip cloud computer



This work is supported by Hong Kong RGC Grant HKU 716712E, the National Basic Research Program of China (973) (No. 2014CB340303) and the National Natural Science Foundation of China (No. 61303264, 61202482). Special thanks go to the Intel China Center of Parallel Computing (ICCPC) and Beijing Soft Tech Technologies Co., Ltd. for providing us with support services for the SCC platform in their Wuxi data centers.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Zhiquan Lai (1)
  • King Tin Lam (2)
  • Cho-Li Wang (2)
  • Jinshu Su (1)

  1. National Key Laboratory of Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha, China
  2. Department of Computer Science, The University of Hong Kong, Hong Kong, China
