Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations

The Journal of Supercomputing

Abstract

Lower-upper symmetric Gauss–Seidel (LU-SGS) is a typical Gauss–Seidel method, and its inherently strong data dependency poses tough challenges for shared-memory parallelization. On early multi-core processors, the pipelined parallel LU-SGS approach achieves promising scalability. However, on emerging many-core processors such as Xeon Phi, experience from our in-house high-order CFD program shows that the parallel efficiency drops dramatically to less than 25%. In this paper, we model and analyze the performance of the pipelined parallel LU-SGS algorithm and present a two-level pipeline (TL-Pipeline) approach that uses nested OpenMP to further exploit fine-grained parallelism and mitigate the parallel performance bottlenecks. Our TL-Pipeline approach achieves a 20% performance gain for a regular problem \((256\times 256\times 256)\) on Xeon Phi. We also discuss some practical problems, including domain decomposition and algorithm parameter tuning for realistic CFD simulations. Generally, our work is applicable to the shared-memory parallelization of all Gauss–Seidel-like methods with intrinsic strong data dependency.
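
To make the data dependency and the pipelined ordering concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy forward sweep in which each cell depends on its already-updated lower neighbours in i, j and k (a stand-in for the lower sweep of LU-SGS), splits the j direction into slabs owned by hypothetical "threads", and replays the pipeline serially: at a given stage, thread t works on plane k = stage - t. All function names and the coefficient `c` are illustrative assumptions. The efficiency function is the textbook fill/drain bound for an idealised pipeline with equal work per slab and no synchronisation cost; it is not the performance model developed in the paper.

```python
def make_grid(nk, nj, ni, fill=0.0):
    """Allocate an nk x nj x ni grid as nested lists."""
    return [[[fill] * ni for _ in range(nj)] for _ in range(nk)]


def update_cell(x, rhs, k, j, i, c):
    """Toy Gauss-Seidel-like update: uses already-updated lower neighbours."""
    lower = 0.0
    if i > 0:
        lower += x[k][j][i - 1]
    if j > 0:
        lower += x[k][j - 1][i]
    if k > 0:
        lower += x[k - 1][j][i]
    x[k][j][i] = rhs[k][j][i] + c * lower


def sweep_sequential(rhs, c=0.1):
    """Reference sweep in plain lexicographic (k, j, i) order."""
    nk, nj, ni = len(rhs), len(rhs[0]), len(rhs[0][0])
    x = make_grid(nk, nj, ni)
    for k in range(nk):
        for j in range(nj):
            for i in range(ni):
                update_cell(x, rhs, k, j, i, c)
    return x


def split_slabs(nj, n_threads):
    """Split range(nj) into n_threads contiguous slabs of near-equal size."""
    base, extra = divmod(nj, n_threads)
    slabs, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        slabs.append(range(start, start + size))
        start += size
    return slabs


def sweep_pipelined(rhs, n_threads, c=0.1):
    """Serial replay of a pipelined schedule: thread t handles plane stage - t."""
    nk, nj, ni = len(rhs), len(rhs[0]), len(rhs[0][0])
    x = make_grid(nk, nj, ni)
    slabs = split_slabs(nj, n_threads)
    n_stages = nk + n_threads - 1  # includes pipeline fill and drain
    for stage in range(n_stages):
        # Work items within one stage are mutually independent, so the order
        # over t does not matter; reversed() is used only to illustrate that.
        for t in reversed(range(n_threads)):
            k = stage - t
            if 0 <= k < nk:
                for j in slabs[t]:
                    for i in range(ni):
                        update_cell(x, rhs, k, j, i, c)
    return x, n_stages


def ideal_pipeline_efficiency(nk, n_threads):
    """Fraction of busy thread-stages in an idealised, perfectly balanced pipeline."""
    return nk / (nk + n_threads - 1)


if __name__ == "__main__":
    rhs = [[[float(k + 2 * j + 3 * i) for i in range(4)]
            for j in range(6)] for k in range(5)]
    reference = sweep_sequential(rhs)
    piped, stages = sweep_pipelined(rhs, n_threads=3)
    print("same result:", piped == reference, "stages:", stages)
    print("ideal efficiency:", ideal_pipeline_efficiency(5, 3))
```

In this toy model the pipelined schedule reproduces the sequential sweep exactly, because every neighbour a cell reads was updated in an earlier stage or earlier within the same slab. The idealised bound also shows that idle time during fill and drain grows as the thread count approaches the number of planes, which is one plausible reason a single-level pipeline may scale less well on processors with many hardware threads.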

Acknowledgements

This paper was supported by the Basic Research Program of National University of Defense Technology under Grant No. ZDYYJCYJ20140101, the Open Research Program of China State Key Laboratory of Aerodynamics under Grant No. SKLA20160104, the Defense Industrial Technology Development Program under Grant No. C1520110002, and the National Science Foundation of China under Grant Nos. 11502296 and 61561146395.

Author information

Corresponding author

Correspondence to Chuanfu Xu.

About this article

Cite this article

Li, D., Xu, C., Cheng, B. et al. Performance modeling and optimization of parallel LU-SGS on many-core processors for 3D high-order CFD simulations. J Supercomput 73, 2506–2524 (2017). https://doi.org/10.1007/s11227-016-1943-0

  • DOI: https://doi.org/10.1007/s11227-016-1943-0
