Performance and Portability of a Linear Solver Across Emerging Architectures

Walden, Aaron C.; Zubair, Mohammad; Nielsen, Eric J.

doi:10.1007/978-3-030-74224-9_4

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12655)

Included in the following conference series:

International Workshop on Accelerator Programming Using Directives

Abstract

A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined for a broad range of familiar and emerging architectures. Efficient implementation of a linear solver is challenging on recent CPUs offering vector architectures. Vector loads and stores are essential to effectively utilize available memory bandwidth on CPUs, and maintaining performance across different CPUs can be difficult because each offers a different vector length. A similar challenge occurs on GPU architectures, where coalesced memory accesses are essential to utilize memory bandwidth effectively. In this work, we demonstrate that restructuring a computation, and possibly its data layout, with regard to architecture is essential to achieve optimal performance. We do so by establishing a performance benchmark for each target architecture in a low-level language such as vector intrinsics or CUDA. In doing so, we demonstrate how a linear solver kernel can be mapped to Intel® Xeon™ and Xeon Phi™, Marvell® ThunderX2®, NEC® SX-Aurora™ TSUBASA Vector Engine, and NVIDIA® and AMD® GPUs. We further demonstrate that the required code restructuring can be achieved in higher-level programming environments such as OpenACC, OCCA, and Intel® oneAPI™/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.
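The point about data layout can be illustrated with a small sketch. The code below is not the solver or the layout used in the paper; it assumes a generic sparse matrix-vector product and compares a compressed-row layout with a padded, slot-major layout (an ELLPACK-style arrangement chosen here purely for illustration). All function and variable names are hypothetical. In the slot-major version the inner loop walks consecutive rows at unit stride, which is the access pattern that vector loads on CPUs and coalesced loads on GPUs tend to favor.

```python
# Illustrative sketch only: the layouts and names here are assumptions,
# not the data structures used by the solver described in the abstract.

def spmv_csr(row_ptr, col_idx, vals, x):
    """Row-by-row product: each row walks its own short run of nonzeros."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

def csr_to_slot_major(row_ptr, col_idx, vals):
    """Pad every row to the longest row; store slot j of all rows contiguously."""
    n = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n))
    ell_cols = [0] * (n * width)    # padded entries point at column 0
    ell_vals = [0.0] * (n * width)  # padded entries contribute zero
    for i in range(n):
        for j, k in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[j * n + i] = col_idx[k]
            ell_vals[j * n + i] = vals[k]
    return width, ell_cols, ell_vals

def spmv_slot_major(n, width, ell_cols, ell_vals, x):
    """Inner loop runs over rows with unit stride in ell_vals and ell_cols."""
    y = [0.0] * n
    for j in range(width):
        base = j * n
        for i in range(n):  # consecutive i maps to consecutive memory
            y[i] += ell_vals[base + i] * x[ell_cols[base + i]]
    return y

# 3x3 example matrix:
# [[4, 0, 1],
#  [0, 3, 0],
#  [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]

y_csr = spmv_csr(row_ptr, col_idx, vals, x)
width, ell_cols, ell_vals = csr_to_slot_major(row_ptr, col_idx, vals)
y_ell = spmv_slot_major(3, width, ell_cols, ell_vals, x)
```

Both routines compute the same product. The difference is only in which index varies fastest in memory. The padding in the slot-major layout is the price paid for the regular access pattern, and whether that trade is worthwhile depends on the matrix structure and the target hardware.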

Acknowledgments

The authors would like to express their appreciation to the following people for many helpful conversations pertaining to the current work: Justin Luitjens (NVIDIA Corporation), Erich Focht and Rudolf Fischer (NEC Corporation), John Linford (Arm Limited), Tim Warburton (Department of Mathematics, Virginia Tech), Noel Chalmers (AMD Incorporated), Sameer Shende (Department of Computer and Information Science, University of Oregon), and Jeff Hammond, Varsha Madananth, and Kevin O’Leary (Intel Corporation). The authors also wish to thank the High Performance Computing Incubator at the NASA Langley Research Center and the NASA Headquarters Office of Chief Engineer Research and Analysis program for providing support for this work. The support of Dr. Mujeeb Malik, Technical Lead for the Revolutionary Computational Aerosciences subproject within the NASA Aeronautics Research Mission Directorate Transformational Tools and Technologies Project, is also acknowledged.

Author information

Authors and Affiliations

NASA Langley Research Center, Hampton, VA, USA
Aaron C. Walden & Eric J. Nielsen
Old Dominion University, Norfolk, VA, USA
Mohammad Zubair

Corresponding author

Correspondence to Aaron C. Walden.

Editor information

Editors and Affiliations

Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Sridutt Bhalachandra
RWTH Aachen University, Aachen, Germany
Sandra Wienke
University of Delaware, Newark, DE, USA
Sunita Chandrasekaran
Helmholtz-Zentrum Dresden-Rossendorf, Dresden, Germany
Guido Juckeland

About this paper

Cite this paper

Walden, A.C., Zubair, M., Nielsen, E.J. (2021). Performance and Portability of a Linear Solver Across Emerging Architectures. In: Bhalachandra, S., Wienke, S., Chandrasekaran, S., Juckeland, G. (eds) Accelerator Programming Using Directives. WACCPD 2020. Lecture Notes in Computer Science, vol 12655. Springer, Cham. https://doi.org/10.1007/978-3-030-74224-9_4

DOI: https://doi.org/10.1007/978-3-030-74224-9_4
Published: 17 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74223-2
Online ISBN: 978-3-030-74224-9
eBook Packages: Computer Science, Computer Science (R0)
