Fast Matrix-Free Discontinuous Galerkin Kernels on Modern Computer Architectures
This study compares the performance of high-order discontinuous Galerkin finite elements on modern hardware. The main computational kernel is the matrix-free evaluation of differential operators by sum factorization, exemplified by the symmetric interior penalty discretization of the Laplacian, which serves as a proxy for a complex application code in fluid dynamics. State-of-the-art implementations of these kernels stress both arithmetic throughput and memory transfer. The implementations of SIMD vectorization and shared-memory parallelization are detailed. Computational results are presented for a dual-socket Intel Haswell system with 28 cores, a 64-core Intel Knights Landing, and a 16-core IBM Power8 processor. Up to polynomial degree six, Knights Landing is approximately twice as fast as Haswell. Power8 performs similarly to Haswell, trading a higher frequency for narrower SIMD units. The performance comparison shows that simple ways of expressing parallelism through for loops perform better at medium and high core counts than a more elaborate task-based parallelization with dynamic scheduling according to dependency graphs, even though the latter algorithm transfers less memory.
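The sum factorization named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`apply_1d`, `sum_factorized`, `naive_kronecker`) and the flat storage layout are assumptions made for illustration, and SIMD vectorization, quadrature, and face integrals are all omitted. The sketch only shows the central idea that a tensor-product operator on an n × n × n block of values can be applied as three one-dimensional sweeps, at a cost of about 3n⁴ multiply-adds instead of n⁶ for the assembled Kronecker product.

```python
from itertools import product


def apply_1d(S, u, n, direction):
    """Apply the n x n matrix S along one direction of an n*n*n tensor u.

    u is a flat list with index i0 + n*i1 + n*n*i2 (assumed layout).
    """
    stride = n ** direction
    out = [0.0] * (n ** 3)
    for idx in range(n ** 3):
        # Position of this entry along the chosen direction.
        i = (idx // stride) % n
        # Index of the first entry on the same 1D line.
        base = idx - i * stride
        out[idx] = sum(S[i][j] * u[base + j * stride] for j in range(n))
    return out


def sum_factorized(S, u, n):
    """Evaluate (S kron S kron S) u as three successive 1D sweeps."""
    for direction in range(3):
        u = apply_1d(S, u, n, direction)
    return u


def naive_kronecker(S, u, n):
    """Reference: apply the full n^3 x n^3 Kronecker product entry by entry."""
    out = [0.0] * (n ** 3)
    for i2, i1, i0 in product(range(n), repeat=3):
        acc = 0.0
        for j2, j1, j0 in product(range(n), repeat=3):
            acc += S[i2][j2] * S[i1][j1] * S[i0][j0] * u[j0 + n * j1 + n * n * j2]
        out[i0 + n * i1 + n * n * i2] = acc
    return out
```

Both functions return the same vector up to rounding. For polynomial degree p the block size is n = p + 1, so the gap between the two operation counts widens quickly with the degree, which is why the sum-factorized form is the one used in matrix-free kernels.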
Keywords: Discontinuous Galerkin; Quadrature Point; Spectral Element Method; Memory Transfer; Pressure Poisson Equation
The authors acknowledge the support given by the Bayerische Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen (KONWIHR) in the framework of the project High performance finite difference stencils for modern parallel processors. This work was supported by the German Research Foundation (DFG) under the project High-order discontinuous Galerkin for the exa-scale (ExaDG) within the priority program Software for Exascale Computing (SPPEXA). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ, www.lrz.de) through project id pr83te.
The authors acknowledge collaboration with Benjamin Krank, Niklas Fehn, and Matthias Brehm.