Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies

  • Lennart Johnsson
Part of the Lecture Notes in Earth System Sciences book series (LNESS)


Over the last few years, the decade-long convergence in architecture for High-Performance Computing (HPC) systems has been replaced by a divergence. This divergence is driven by the quest for performance and cost-performance and, more recently, by energy consumption, whose cost over the lifetime of a system now often exceeds the acquisition cost of the HPC system itself. Mass-market, specialized processors, such as the Cell Broadband Engine (CBE) and graphics processors, have received particular attention, the latter especially after hardware support for double-precision floating-point arithmetic was introduced about three years ago. The recent support for Error-Correcting Code (ECC) memory and the significantly enhanced double-precision performance of the current generation of Graphics Processing Units (GPUs) have further solidified the interest in GPUs for HPC.

To assess the issues involved in deploying clusters whose nodes combine commodity microprocessors with some type of specialized processor, for enhanced performance, enhanced energy efficiency, or both, on science and engineering workloads, PRACE, the Partnership for Advanced Computing in Europe, undertook a study covering three types of accelerators, the CBE, GPUs, and ClearSpeed, together with tools for programming them. The study focused on assessing performance, efficiency, power efficiency for double-precision arithmetic, and programmer productivity. Four kernels (matrix multiplication, sparse matrix-vector multiplication, FFT, and random number generation) were used for the assessment, together with High-Performance Linpack (HPL) and a few application codes.

We report here on the results from the kernels and HPL for GPU- and ClearSpeed-accelerated systems. The GPU performed significantly better than the CPU on sparse matrix-vector multiplication, on which the ClearSpeed performed surprisingly poorly. For matrix multiplication, HPL, and FFT, the ClearSpeed accelerator was by far the most energy-efficient device.
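The sparse matrix-vector kernel on which the accelerators diverged so sharply is, unlike dense matrix multiplication, memory-bandwidth-bound rather than compute-bound. A minimal sketch of the kernel over the common CSR (compressed sparse row) storage format illustrates why; this is an illustrative sketch only, not the benchmark code used in the PRACE study:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form.

    values  - nonzero entries of A, stored row by row
    col_idx - column index of each entry in `values`
    row_ptr - row i's nonzeros occupy values[row_ptr[i]:row_ptr[i+1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # One multiply-add per stored nonzero, i.e. two flops per roughly
        # twelve bytes fetched (8-byte value plus 4-byte column index),
        # with an indirect, irregular access into x: memory-bound work.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# Small example: A = [[2, 0], [1, 3]], x = [1, 1] gives y = [2, 4]
y = csr_spmv([2.0, 1.0, 3.0], [0, 0, 1], [0, 1, 3], [1.0, 1.0])
```

The low arithmetic intensity and irregular indexing favor devices with high memory bandwidth, which is consistent with the GPU's strong showing on this kernel relative to the compute-oriented ClearSpeed boards.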


Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Lennart Johnsson 1, 2
  1. Department of Computer Science, University of Houston, Houston, USA
  2. School of Computer Science and Communications, KTH, Stockholm, Sweden
