Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations

Li, Dali; Xu, Chuanfu; Cheng, Bin; Xiong, Min; Gao, Xiang; Deng, Xiaogang

doi:10.1007/s11227-016-1943-0


Published: 16 December 2016

Volume 73, pages 2506–2524, (2017)

The Journal of Supercomputing

Dali Li¹,
Chuanfu Xu¹,²,
Bin Cheng¹,
Min Xiong¹,
Xiang Gao¹ &
…
Xiaogang Deng³


Abstract

As a typical Gauss–Seidel method, lower-upper symmetric Gauss–Seidel (LU-SGS) has an inherent strong data dependency that poses tough challenges for shared-memory parallelization. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as Xeon Phi, experience from our in-house high-order CFD program shows that the parallel efficiency drops dramatically to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm and present a two-level pipeline (TL-Pipeline) approach that uses nested OpenMP to further exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves 20% performance gains for a regular problem \((256\times 256\times 256)\) on Xeon Phi. We also discuss some practical problems, including domain decomposition and algorithm parameter tuning for realistic CFD simulations. Generally, our work is applicable to the shared-memory parallelization of all Gauss–Seidel-like methods with intrinsic strong data dependency.
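To make the data dependency and the pipelining idea concrete, the following is a minimal Python sketch and not the authors' implementation or performance model. It assumes a simplified forward sweep in which each cell depends on its already-updated lower neighbors in i, j, and k (a stand-in for the lower sweep of LU-SGS), splits the j direction into one block per thread, and lets thread t process plane k at pipeline step k + t. The function names, the weight `w`, and the idealized efficiency formula (equal work per step, no synchronization cost) are all assumptions introduced for illustration. The pipeline is emulated serially; no real threads are used.

```python
def make_grid(ni, nj, nk, value=0.0):
    """Allocate a 3D grid as nested lists, indexed grid[i][j][k]."""
    return [[[value for _ in range(nk)] for _ in range(nj)] for _ in range(ni)]


def update_cell(x, rhs, i, j, k, w):
    """Gauss-Seidel-like update: uses lower neighbors that must already be updated."""
    lower = 0.0
    if i > 0:
        lower += x[i - 1][j][k]
    if j > 0:
        lower += x[i][j - 1][k]
    if k > 0:
        lower += x[i][j][k - 1]
    x[i][j][k] = rhs[i][j][k] + w * lower


def sequential_sweep(rhs, w=0.25):
    """Reference sweep in plain lexicographic (k, j, i) order."""
    ni, nj, nk = len(rhs), len(rhs[0]), len(rhs[0][0])
    x = make_grid(ni, nj, nk)
    for k in range(nk):
        for j in range(nj):
            for i in range(ni):
                update_cell(x, rhs, i, j, k, w)
    return x


def split_range(n, parts):
    """Split range(n) into `parts` contiguous blocks of near-equal size."""
    base, extra = divmod(n, parts)
    blocks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def pipelined_sweep(rhs, n_threads, w=0.25):
    """Same sweep in pipeline order: thread t handles plane k at step s = k + t.

    Within one step, the (t, k) tasks touch disjoint data whose dependencies
    were satisfied in step s - 1, so they could run concurrently.
    """
    ni, nj, nk = len(rhs), len(rhs[0]), len(rhs[0][0])
    x = make_grid(ni, nj, nk)
    blocks = split_range(nj, n_threads)
    n_steps = nk + n_threads - 1  # includes pipeline fill and drain
    for s in range(n_steps):
        for t in range(n_threads):  # emulates the concurrent threads
            k = s - t
            if 0 <= k < nk:
                for j in blocks[t]:
                    for i in range(ni):
                        update_cell(x, rhs, i, j, k, w)
    return x, n_steps


def ideal_pipeline_efficiency(nk, n_threads):
    """Idealized efficiency: useful steps over total steps (assumption, see lead-in)."""
    return nk / (nk + n_threads - 1)
```

Under these assumptions, the pipelined ordering reproduces the sequential result exactly, and the fill and drain phases alone cap the ideal efficiency at nk / (nk + T - 1) for T threads. This bound shrinks as the thread count approaches the number of planes, which is one simple way to see why a single-level pipeline can lose efficiency on processors with many threads; measured efficiency also depends on synchronization and memory effects that this sketch ignores.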



Acknowledgements

This paper was supported by the Basic Research Program of National University of Defense Technology under Grant No. ZDYYJCYJ20140101, the Open Research Program of China State Key Laboratory of Aerodynamics under Grant No. SKLA20160104, the Defense Industrial Technology Development Program under Grant No. C1520110002, and the National Science Foundation of China under Grant Nos. 11502296 and 61561146395.

Author information

Authors and Affiliations

College of Computer Science, National University of Defense Technology, Changsha, 410073, People’s Republic of China
Dali Li, Chuanfu Xu, Bin Cheng, Min Xiong & Xiang Gao
State Key Laboratory of Aerodynamics, P.O. Box 211, Mianyang, 621000, People’s Republic of China
Chuanfu Xu
National University of Defense Technology, Changsha, 410073, People’s Republic of China
Xiaogang Deng

Corresponding author

Correspondence to Chuanfu Xu.


About this article

Cite this article

Li, D., Xu, C., Cheng, B. et al. Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations. J Supercomput 73, 2506–2524 (2017). https://doi.org/10.1007/s11227-016-1943-0


Issue Date: June 2017
DOI: https://doi.org/10.1007/s11227-016-1943-0
