Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer

Wang, Feng; Yang, Can-Qun; Du, Yun-Fei; Chen, Juan; Yi, Hui-Zhan; Xu, Wei-Xia

doi:10.1007/s11390-011-0184-1

Published: 23 September 2011

Volume 26, pages 854–865, (2011)

Journal of Computer Science and Technology

Abstract

In this paper we present the programming of the Linpack benchmark on the TianHe-1 system, the first petascale supercomputer system of China and the largest GPU-accelerated heterogeneous system attempted to date. A hybrid programming model consisting of MPI, OpenMP and streaming computing is described to exploit the task, thread and data parallelism of Linpack. We explain how we optimized the load distribution across the CPUs and GPUs using the two-level adaptive method and describe the implementation in detail. To overcome the low bandwidth of communication between the CPU and GPU, we present a software pipelining technique to hide the communication overhead. Combined with other traditional optimizations, the Linpack we developed achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained using the vendor's library. On the full configuration of TianHe-1 our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list in November 2009.
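
The abstract does not spell out the two-level adaptive method, so the following is only a minimal sketch of the general idea behind adaptive CPU/GPU load distribution, not the authors' implementation. It assumes a single level of adaptation in which the GPU's share of the next block of matrix columns is re-estimated from the throughput each device showed on the previous block, so that both devices would finish at about the same time. The function names `split_work` and `update_gpu_fraction` and the timing figures are hypothetical.

```python
def split_work(total_cols, gpu_fraction):
    """Return (gpu_cols, cpu_cols) for a given GPU share of the columns."""
    gpu_cols = round(total_cols * gpu_fraction)
    return gpu_cols, total_cols - gpu_cols


def update_gpu_fraction(gpu_cols, gpu_seconds, cpu_cols, cpu_seconds):
    """Re-estimate the GPU share from the throughput measured in the last step."""
    gpu_rate = gpu_cols / gpu_seconds  # columns per second on the GPU
    cpu_rate = cpu_cols / cpu_seconds  # columns per second on the CPU
    # Giving each device work in proportion to its rate equalizes finish times.
    return gpu_rate / (gpu_rate + cpu_rate)


# Hypothetical measurements from the previous step: an even 500/500 split
# took the GPU 1.0 s and the CPU 4.0 s.
fraction = update_gpu_fraction(500, 1.0, 500, 4.0)
gpu_cols, cpu_cols = split_work(1000, fraction)
```

With these made-up timings the GPU is four times faster than the CPU, so the next split assigns it four fifths of the columns. A real implementation would also have to account for the CPU-GPU transfer time that the abstract's software pipelining is meant to hide.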

Author information

Authors and Affiliations

School of Computer Science, National University of Defense Technology, Changsha, 410073, China
Feng Wang (Member, CCF, ACM), Can-Qun Yang, Yun-Fei Du, Juan Chen, Hui-Zhan Yi & Wei-Xia Xu

Corresponding author

Correspondence to Feng Wang.

Additional information

Supported by the National High Technology Research and Development 863 Program of China under Grant No. 2009AA01A128, the Major Science and Technology Project of China under Grant No. 2009ZX01036-001-003-001, the National Natural Science Foundation of China under Grant Nos. 61003087, 60903044, 60903059, 60970033, and 60673150.

About this article

Cite this article

Wang, F., Yang, CQ., Du, YF. et al. Optimizing Linpack Benchmark on GPU-Accelerated Petascale Supercomputer. J. Comput. Sci. Technol. 26, 854–865 (2011). https://doi.org/10.1007/s11390-011-0184-1

Received: 24 November 2010
Revised: 15 June 2011
Published: 23 September 2011
Issue Date: September 2011
DOI: https://doi.org/10.1007/s11390-011-0184-1
