A CUDA Implementation of the High Performance Conjugate Gradient Benchmark

Phillips, Everett; Fatica, Massimiliano

doi:10.1007/978-3-319-17248-4_4

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8966)

Included in the following conference series:

International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems

Abstract

The High Performance Conjugate Gradient (HPCG) benchmark was recently proposed as a complement to the High Performance Linpack (HPL) benchmark currently used to rank supercomputers in the Top500 list. The new benchmark solves a large sparse linear system using a multigrid preconditioned conjugate gradient (PCG) algorithm. The PCG algorithm contains the computational and communication patterns prevalent in the numerical solution of partial differential equations and is designed to better represent modern application workloads, which rely more heavily on memory system and network performance than HPL does. GPU-accelerated supercomputers have proved very effective, especially with regard to power efficiency, for accelerating compute-intensive applications like HPL. This paper presents the details of a CUDA implementation of HPCG and the results obtained at full scale on the largest GPU supercomputers available: the Cray XK7 at ORNL and the Cray XC30 at CSCS. The results indicate that GPU-accelerated supercomputers are also very effective for this type of workload.
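
For readers unfamiliar with the PCG iteration named in the abstract, the following is a minimal sketch in plain Python, not the authors' CUDA implementation. It makes several assumptions for illustration: a small 1D Poisson matrix stands in for the benchmark's sparse system, a simple Jacobi (diagonal) preconditioner is swapped in for the multigrid preconditioner, and all function names (`apply_laplacian_1d`, `jacobi_preconditioner`, `pcg`) are hypothetical.

```python
def apply_laplacian_1d(x):
    """Matrix-free y = A x for the 1D Poisson matrix tridiag(-1, 2, -1).

    Assumption: a toy stand-in for the benchmark's sparse matrix.
    """
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        y[i] = 2.0 * x[i]
        if i > 0:
            y[i] -= x[i - 1]
        if i < n - 1:
            y[i] -= x[i + 1]
    return y


def dot(u, v):
    """Dot product; in a distributed run this would need a global reduction."""
    return sum(a * b for a, b in zip(u, v))


def jacobi_preconditioner(r):
    """z = M^{-1} r with M = diag(A) = 2 I.

    Assumption: Jacobi replaces the multigrid preconditioner for brevity.
    """
    return [ri / 2.0 for ri in r]


def pcg(apply_a, apply_m_inv, b, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient for a symmetric positive definite A."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the zero initial guess
    z = apply_m_inv(r)
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for iteration in range(1, max_iter + 1):
        ap = apply_a(p)  # sparse matrix-vector product
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, iteration
        z = apply_m_inv(r)  # preconditioner application
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Usage: build b from a known solution, then recover that solution.
n = 64
x_true = [1.0] * n
b = apply_laplacian_1d(x_true)
x, iterations = pcg(apply_laplacian_1d, jacobi_preconditioner, b)
max_error = max(abs(xi - ti) for xi, ti in zip(x, x_true))
print(iterations, max_error)
```

Each iteration consists of one sparse matrix-vector product, one preconditioner application, two dot products, and a few vector updates. These operations perform little arithmetic per byte moved, which is consistent with the abstract's point that such workloads depend more on memory system and network performance than HPL does.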

Acknowledgments

This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. We wish to thank Buddy Bland, Jack Wells and Don Maxwell of Oak Ridge National Laboratory for their support. This work was also supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID g33. We also want to acknowledge the support from Gilles Fourestey and Thomas Schulthess at CSCS. We wish to thank Lung Scheng Chien and Jonathan Cohen at NVIDIA for relevant discussions.

Author information

Authors and Affiliations

NVIDIA Corporation, Santa Clara, CA, 95050, USA
Everett Phillips & Massimiliano Fatica

Corresponding author

Correspondence to Everett Phillips.

Editor information

Editors and Affiliations

University of Warwick, Coventry, United Kingdom
Stephen A. Jarvis
University of Warwick, Coventry, United Kingdom
Steven A. Wright
Sandia National Laboratories CSRI, Albuquerque, New Mexico, USA
Simon D. Hammond

Cite this paper

Phillips, E., Fatica, M. (2015). A CUDA Implementation of the High Performance Conjugate Gradient Benchmark. In: Jarvis, S., Wright, S., Hammond, S. (eds) High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. PMBS 2014. Lecture Notes in Computer Science, vol 8966. Springer, Cham. https://doi.org/10.1007/978-3-319-17248-4_4

DOI: https://doi.org/10.1007/978-3-319-17248-4_4
Published: 18 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17247-7
Online ISBN: 978-3-319-17248-4
eBook Packages: Computer Science, Computer Science (R0)
