Abstract
This study compares the performance of high-order discontinuous Galerkin finite elements on modern hardware. The main computational kernel is the matrix-free evaluation of differential operators by sum factorization, exemplified on the symmetric interior penalty discretization of the Laplacian as a metric for a complex application code in fluid dynamics. State-of-the-art implementations of these kernels stress both arithmetic and memory transfer. The implementations of SIMD vectorization and shared-memory parallelization are detailed. Computational results are presented for dual-socket Intel Haswell CPUs with 28 cores, a 64-core Intel Knights Landing, and a 16-core IBM Power8 processor. Up to polynomial degree six, Knights Landing is approximately twice as fast as Haswell. Power8 performs similarly to Haswell, trading a higher frequency for narrower SIMD units. The performance comparison shows that simple ways to express parallelism through for loops perform better on medium and high core counts than a more elaborate task-based parallelization with dynamic scheduling according to dependency graphs, despite less memory transfer in the latter algorithm.
Notes
- 1.
- 2. https://github.com/RRZE-HPC/likwid, retrieved on September 18, 2016.
- 3. As a complement to the numbers given by likwid, which count each FMA as one FLOP, we recorded FMAs, additions, and multiplications separately with the Intel Software Development Emulator.
Acknowledgements
The authors acknowledge the support given by the Bayerische Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen (KONWIHR) in the framework of the project High performance finite difference stencils for modern parallel processors. This work was supported by the German Research Foundation (DFG) under the project High-order discontinuous Galerkin for the exa-scale (ExaDG) within the priority program Software for Exascale Computing (SPPEXA). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ, www.lrz.de) through project id pr83te.
The authors acknowledge collaboration with Benjamin Krank, Niklas Fehn, and Matthias Brehm.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kronbichler, M., Kormann, K., Pasichnyk, I., Allalen, M. (2017). Fast Matrix-Free Discontinuous Galerkin Kernels on Modern Computer Architectures. In: Kunkel, J.M., Yokota, R., Balaji, P., Keyes, D. (eds) High Performance Computing. ISC High Performance 2017. Lecture Notes in Computer Science, vol 10266. Springer, Cham. https://doi.org/10.1007/978-3-319-58667-0_13
DOI: https://doi.org/10.1007/978-3-319-58667-0_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-58666-3
Online ISBN: 978-3-319-58667-0
eBook Packages: Computer Science, Computer Science (R0)