Optimization power consumption model of reliability-aware GPU clusters

The Journal of Supercomputing

Abstract

Power control on reliability-aware GPU clusters with dynamically variable voltage and speed is investigated as a combinatorial optimization problem in two forms: minimizing task execution time under an energy consumption constraint, and minimizing energy consumption under a system reliability constraint. Both problems arise in general multiprocessor computing and in real-time multiprocessing systems, where energy consumption and system reliability are both important. These problems, which emphasize the trade-off among performance, power, and reliability, have not been well studied before. In this research, a novel power control model is built on Model Predictive Control theory. The Maximum Entropy Method is used to determine the partial ordering relation of the control variables and to assess the quality of solutions. Our controller caps redundant energy consumption by dynamically switching the energy states of the nodes in a GPU cluster. We compare our controller with a control scheme that does not consider system reliability. The experimental results demonstrate that the proposed controller is more reliable and valuable.
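
To make the second problem (minimizing energy under a reliability constraint) concrete, the following is a minimal sketch and not the controller described in the article. It assumes a cubic dynamic-power model and an exponential fault-rate model of the kind commonly used in reliability-aware voltage scaling work. Every function name, parameter, and constant here (`lam0`, `d`, `f_min`, `p_static`, `c_eff`, the frequency levels) is a hypothetical placeholder chosen for illustration.

```python
import math

# All constants below are illustrative assumptions, not values from the article.

def fault_rate(f, lam0=1e-6, d=2.0, f_min=0.4):
    """Assumed transient fault rate at normalized frequency f (rises as f drops)."""
    return lam0 * 10 ** (d * (1.0 - f) / (1.0 - f_min))

def reliability(f, work):
    """Probability that `work` cycles finish without a transient fault at frequency f."""
    return math.exp(-fault_rate(f) * work / f)

def energy(f, work, p_static=0.1, c_eff=1.0):
    """Assumed energy: (static power + c_eff * f^3) times execution time work / f."""
    return (p_static + c_eff * f ** 3) * work / f

def pick_frequency(levels, work, r_min):
    """Return the feasible level with the least energy, or None if none is feasible."""
    feasible = [f for f in levels if reliability(f, work) >= r_min]
    if not feasible:
        return None
    return min(feasible, key=lambda f: energy(f, work))

if __name__ == "__main__":
    levels = [0.4, 0.6, 0.8, 1.0]  # hypothetical normalized frequency levels
    for r_min in (0.5, 0.95, 0.99, 0.9999):
        print(r_min, pick_frequency(levels, work=1000, r_min=r_min))
```

Under these assumed models, lowering the frequency saves energy but lengthens execution and raises the fault rate, so a tighter reliability bound pushes the choice toward higher frequencies. This is the trade-off among performance, power, and reliability that the abstract refers to.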

Acknowledgements

The authors gratefully acknowledge the support of the National Natural Science Foundation of China (No. 60970012), the Innovation Program of the Shanghai Science and Technology Commission (Nos. 09511501000, 09220502800), and the Shanghai Leading Academic Discipline Project (XTKX2012).

Corresponding author

Correspondence to Haifeng Wang.

Cite this article

Wang, H., Chen, Q. Optimization power consumption model of reliability-aware GPU clusters. J Supercomput 67, 153–174 (2014). https://doi.org/10.1007/s11227-013-0993-9

  • DOI: https://doi.org/10.1007/s11227-013-0993-9
