Cluster Computing, Volume 16, Issue 1, pp 77–90

Energy cost evaluation of parallel algorithms for multiprocessor systems

  • Zhuowei Wang
  • Xianbin Xu
  • Naixue Xiong (corresponding author)
  • Laurence T. Yang
  • Wuqing Zhao


With the continuous development of hardware and software, Graphics Processing Units (GPUs) have come into use for general-purpose computation. They have emerged as computational accelerators that, working together with CPUs, dramatically reduce application execution time. To achieve high computing performance, a GPU typically includes hundreds of computing units. This high density of computing resources on a chip brings high power consumption, which has therefore become one of the most important problems in the development of GPUs. This paper analyzes the energy consumption of parallel algorithms executed on GPUs and provides a method to evaluate the energy scalability of parallel algorithms. The parallel prefix sum is then analyzed to illustrate the method for energy conservation, and the energy scalability is experimentally evaluated using Sparse Matrix-Vector Multiply (SpMV). The results show that the optimal number of blocks, the memory choice, and task scheduling are the keys to balancing the performance and the energy consumption of GPUs.


GPUs, Parallel algorithms, Energy scalability, Energy conservation, Performance





Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Zhuowei Wang (1)
  • Xianbin Xu (1)
  • Naixue Xiong (2), corresponding author
  • Laurence T. Yang (3)
  • Wuqing Zhao (1)

  1. School of Computer, Wuhan University, Wuhan, China
  2. Department of Computer Science, Georgia State University, Atlanta, USA
  3. Department of Computer Science, St. Francis Xavier University, Antigonish, Canada
