Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies

  • Lennart Johnsson
Part of the Lecture Notes in Earth System Sciences book series (LNESS)


Over the last few years, the decade-long convergence in architecture for High-Performance Computing (HPC) systems has been replaced by a divergence. This divergence is driven by the quest for performance and cost-performance and, more recently, by energy consumption, whose cost over the lifetime of a system now often exceeds the acquisition cost of the HPC system itself. Mass-market, specialized processors, such as the Cell Broadband Engine (CBE) and graphics processors, have received particular attention, the latter especially after hardware support for double-precision floating-point arithmetic was introduced about three years ago. The recent support for Error-Correcting Code (ECC) memory and the significantly enhanced double-precision performance of the current generation of Graphics Processing Units (GPUs) have further solidified the interest in GPUs for HPC.

To assess the issues involved in deploying clusters whose nodes combine commodity microprocessors with some type of specialized processor, for enhanced performance, enhanced energy efficiency, or both, on science and engineering workloads, PRACE, the Partnership for Advanced Computing in Europe, undertook a study covering three types of accelerators, the CBE, GPUs, and ClearSpeed, together with tools for programming them. The study focused on assessing performance, efficiency, power efficiency for double-precision arithmetic, and programmer productivity. Four kernels (matrix multiplication, sparse matrix-vector multiplication, FFT, and random number generation) were used for the assessment, together with High-Performance Linpack (HPL) and a few application codes.

We report here on the results from the kernels and HPL for GPU- and ClearSpeed-accelerated systems. The GPU performed significantly better than the CPU on sparse matrix-vector multiplication, on which the ClearSpeed performed surprisingly poorly. For matrix multiplication, HPL, and FFT, the ClearSpeed accelerator was by far the most energy-efficient device.
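The sparse matrix-vector kernel on which the accelerators diverged so sharply is, unlike dense matrix multiplication, memory-bandwidth-bound rather than compute-bound. A minimal sketch of the kernel over the common CSR (compressed sparse row) storage format illustrates why; this is an illustrative sketch only, not the benchmark code used in the PRACE study:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form.

    values  - nonzero entries of A, stored row by row
    col_idx - column index of each entry in `values`
    row_ptr - row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # One multiply-add per stored nonzero, i.e. two flops per roughly
        # twelve bytes fetched (8-byte value plus 4-byte column index),
        # with an indirect, irregular access into x: memory-bound work.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# Small example: A = [[2, 0], [1, 3]], x = [1, 1] gives y = [2, 4]
y = csr_spmv([2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3], [1.0, 1.0])
```

The low arithmetic intensity and irregular indexing favor devices with high memory bandwidth, which is consistent with the GPU's strong showing on this kernel relative to the compute-oriented ClearSpeed boards.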


Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Lennart Johnsson 1, 2
  1. Department of Computer Science, University of Houston, Houston, USA
  2. School of Computer Science and Communications, KTH, Stockholm, Sweden
