Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies
- Lennart Johnsson
Abstract
The convergence in architecture for High-Performance Computing (HPC) systems that took place for over a decade has, during the last few years, been replaced by a divergence. The divergence is driven by the quest for performance, cost-performance and, in the last few years, also by energy consumption, whose cost over the lifetime of a system has in many cases come to exceed the cost of the HPC system itself. Mass-market specialized processors, such as the Cell Broadband Engine (CBE) and graphics processors, have received particular attention, the latter especially after hardware support for double-precision floating-point arithmetic was introduced about three years ago. The recent support for Error Correcting Code (ECC) memory and the significantly enhanced double-precision performance in the current generation of Graphics Processing Units (GPUs) have further solidified the interest in GPUs for HPC. To assess the issues involved in potentially deploying clusters whose nodes combine commodity microprocessors with some type of specialized processor for enhanced performance, enhanced energy efficiency, or both, for science and engineering workloads, PRACE, the Partnership for Advanced Computing in Europe, undertook a study that included three types of accelerators (the CBE, GPUs and ClearSpeed) and tools for their programming. The study focused on assessing performance, efficiency and power efficiency for double-precision arithmetic, as well as programmer productivity. Four kernels (matrix multiplication, sparse matrix-vector multiplication, FFT and random number generation) were used for the assessment, together with High-Performance Linpack (HPL) and a few application codes. We report here on the results from the kernels and HPL for GPU and ClearSpeed accelerated systems. The GPU performed significantly better than the CPU on sparse matrix-vector multiplication, a surprising result, whereas the ClearSpeed accelerator performed surprisingly poorly on that kernel. For matrix multiplication, HPL and FFT, the ClearSpeed accelerator was by far the most energy-efficient device.
Within this Chapter
- Introduction
- Highlights of a PRACE Study of Accelerated IA-32 Servers
- Programming Tools Assessment
- Conclusions
- References
About this Chapter
- Title
- Efficiency, Energy Efficiency and Programming of Accelerated HPC Servers: Highlights of PRACE Studies
- Book Title
- GPU Solutions to Multi-scale Problems in Science and Engineering
- Pages
- pp 33-78
- Copyright
- 2013
- DOI
- 10.1007/978-3-642-16405-7_3
- Print ISBN
- 978-3-642-16404-0
- Online ISBN
- 978-3-642-16405-7
- Series Title
- Lecture Notes in Earth System Sciences
- Series ISSN
- 2193-8571
- Publisher
- Springer Berlin Heidelberg
- Copyright Holder
- Springer-Verlag Berlin Heidelberg
- Editors
- David A. Yuen (ID1)
- Long Wang (ID2)
- Xuebin Chi (ID3)
- Lennart Johnsson (ID4)
- Wei Ge (ID5)
- Yaolin Shi (ID6)
- Editor Affiliations
- ID1. University of Minnesota, Dep. of Earth Sciences and Minnesota Supercomputing Institute
- ID2. Network Information Center, Computer Center and Computer
- ID3. Supercomputing Center
- ID4. Computer Science, University of Houston
- ID5. Inst. Process Engineering (IPE), Chinese Academy of Sciences
- ID6. Laboratory of Computational Geodynamics, Chinese Academy of Sciences
- Authors
- Lennart Johnsson (1) (2)
- Author Affiliations
- 1. Department of Computer Science, University of Houston, Houston, TX, USA
- 2. School of Computer Science and Communications, KTH, Stockholm, Sweden