Performance and Scalability Improvements for Discontinuous Galerkin Solutions to Conservation Laws on Unstructured Grids
This paper presents a computational framework developed to improve both the serial and parallel performance of two-dimensional, unstructured, discontinuous Galerkin (DG) solutions to hyperbolic conservation laws. The coding techniques employed account for current trends in HPC hardware: they are designed to maximize loop vectorization, use cache efficiently, facilitate straightforward shared memory parallelization, reduce message passing volume, and increase the overlap between computation and communication. With today's CPU technology and HPC networks rapidly evolving, it is important to quantitatively assess and compare these methodologies against standard paradigms in order to make the most of current computational resources. In our benchmark studies, we specifically investigate the shallow water equations to show that the refactored algorithm implementation provides a significant performance increase over the conventional elemental DG code structure in terms of both CPU time and parallel scalability. Our results show that the serial optimizations yield a 28–38% performance increase. For parallel computations, our improvements yield a speedup factor of 1.5–2.0 for local problem sizes between 10 and 2000 elements per core, regardless of the overall problem size. The computational benchmarks were performed on the Lonestar and Stampede supercomputers at the Texas Advanced Computing Center.
Keywords: Parallel computing · Conservation laws · Shallow water equations · Discontinuous Galerkin · Finite element method
This work was supported by the National Science Foundation Grants DMS-1228212, ACI-1339738, and ACI-1339801. J.J. Westerink was also partly supported by the Henry J. Massman and the Joseph and Nona Ahearn endowments at the University of Notre Dame. The authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. We would also like to thank TACC research scientist John McCalpin for assisting with the hardware counter results. URL: http://www.tacc.utexas.edu. The benchmark studies were performed using the XSEDE Allocation TG-DMS080016N.