Abstract
A linear solver algorithm used by a large-scale unstructured-grid computational fluid dynamics application is examined for a broad range of familiar and emerging architectures. Efficient implementation of a linear solver is challenging on recent CPUs offering vector architectures. Vector loads and stores are essential to effectively utilize available memory bandwidth on CPUs, and maintaining performance across different CPUs can be difficult because the vector length varies from one CPU to the next. A similar challenge occurs on GPU architectures, where coalesced memory accesses are essential to utilize memory bandwidth effectively. In this work, we establish a performance benchmark for each target architecture in a low-level language such as vector intrinsics or CUDA, and thereby demonstrate that restructuring a computation, and possibly its data layout, to suit the architecture is essential to achieve optimal performance. In doing so, we show how a linear solver kernel can be mapped to Intel® Xeon™ and Xeon Phi™, Marvell® ThunderX2®, NEC® SX-Aurora™ TSUBASA Vector Engine, and NVIDIA® and AMD® GPUs. We further demonstrate that the required code restructuring can be achieved in higher-level programming environments such as OpenACC, OCCA, and Intel® oneAPI™/SYCL, and that each generally results in optimal performance on the target architecture. Relative performance metrics for all implementations are shown, and subjective ratings for ease of implementation and optimization are suggested.
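The role of data layout mentioned above can be illustrated with a small sketch. The abstract does not show the solver kernel or its storage formats, so everything here is an assumption made for illustration: a sparse matrix-vector product stands in for the kernel, and the function names `csr_to_ell`, `spmv_csr`, and `spmv_ell` are hypothetical. The sketch converts a row-oriented CSR matrix into a padded, slot-major (ELLPACK-style) layout in which entries for consecutive rows sit next to each other in memory, which is the access pattern that unit-stride vector loads and coalesced GPU accesses favor.

```python
def csr_to_ell(row_ptr, col_idx, vals):
    """Convert CSR arrays to a padded, slot-major (ELLPACK-style) layout.

    Entry k of row i is stored at index k * n_rows + i, so for a fixed k
    the entries of consecutive rows are contiguous in memory.
    Padding uses column 0 with value 0.0 (an illustrative choice).
    """
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    ell_cols = [0] * (width * n_rows)
    ell_vals = [0.0] * (width * n_rows)
    for i in range(n_rows):
        for k, j in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[k * n_rows + i] = col_idx[j]
            ell_vals[k * n_rows + i] = vals[j]
    return n_rows, width, ell_cols, ell_vals


def spmv_csr(row_ptr, col_idx, vals, x):
    """Reference y = A x using the row-oriented CSR layout."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for j in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[j] * x[col_idx[j]]
        y.append(acc)
    return y


def spmv_ell(n_rows, width, ell_cols, ell_vals, x):
    """y = A x using the slot-major layout.

    The inner loop walks consecutive rows with unit stride through
    ell_vals and ell_cols; on real hardware that loop would map to
    vector lanes or to adjacent GPU threads.
    """
    y = [0.0] * n_rows
    for k in range(width):
        base = k * n_rows
        for i in range(n_rows):
            y[i] += ell_vals[base + i] * x[ell_cols[base + i]]
    return y
```

Both routines compute the same product; only the order in which memory is traversed differs. The padding introduced by the slot-major layout is the usual price of the regular access pattern, and whether that trade pays off depends on the matrix and the target architecture.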
Acknowledgments
The authors would like to express their appreciation to the following people for many helpful conversations pertaining to the current work: Justin Luitjens (NVIDIA Corporation), Erich Focht and Rudolf Fischer (NEC Corporation), John Linford (Arm Limited), Tim Warburton (Department of Mathematics, Virginia Tech), Noel Chalmers (AMD Incorporated), Sameer Shende (Department of Computer and Information Science, University of Oregon), and Jeff Hammond, Varsha Madananth, and Kevin O’Leary (Intel Corporation). The authors also wish to thank the High Performance Computing Incubator at the NASA Langley Research Center and the NASA Headquarters Office of Chief Engineer Research and Analysis program for providing support for this work. The support of Dr. Mujeeb Malik, Technical Lead for the Revolutionary Computational Aerosciences subproject within the NASA Aeronautics Research Mission Directorate Transformational Tools and Technologies Project, is also acknowledged.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Walden, A.C., Zubair, M., Nielsen, E.J. (2021). Performance and Portability of a Linear Solver Across Emerging Architectures. In: Bhalachandra, S., Wienke, S., Chandrasekaran, S., Juckeland, G. (eds) Accelerator Programming Using Directives. WACCPD 2020. Lecture Notes in Computer Science, vol 12655. Springer, Cham. https://doi.org/10.1007/978-3-030-74224-9_4
DOI: https://doi.org/10.1007/978-3-030-74224-9_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74223-2
Online ISBN: 978-3-030-74224-9
eBook Packages: Computer Science, Computer Science (R0)