Fast Matrix-Free Discontinuous Galerkin Kernels on Modern Computer Architectures

  • Martin Kronbichler
  • Katharina Kormann
  • Igor Pasichnyk
  • Momme Allalen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10266)


This study compares the performance of high-order discontinuous Galerkin finite elements on modern hardware. The main computational kernel is the matrix-free evaluation of differential operators by sum factorization, exemplified on the symmetric interior penalty discretization of the Laplacian as a proxy for a complex application code in fluid dynamics. State-of-the-art implementations of these kernels stress both arithmetic units and memory transfer. The implementations of SIMD vectorization and shared-memory parallelization are detailed. Computational results are presented for a dual-socket Intel Haswell system with 28 cores, a 64-core Intel Knights Landing, and a 16-core IBM Power8 processor. Up to polynomial degree six, Knights Landing is approximately twice as fast as Haswell. Power8 performs similarly to Haswell, trading a higher clock frequency for narrower SIMD units. The performance comparison shows that a simple way of expressing parallelism through for loops performs better at medium and high core counts than a more elaborate task-based parallelization with dynamic scheduling according to dependency graphs, despite the lower memory transfer of the latter algorithm.
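The sum-factorization technique named in the abstract can be illustrated with a small sketch (a hypothetical example, not the authors' deal.II implementation): for a tensor-product basis, the interpolation from nodal values to quadrature points factorizes into one sweep of a 1D matrix per coordinate direction, avoiding the full Kronecker-product matrix. The function name and sizes below are illustrative assumptions.

```python
import numpy as np

def sum_factorized_interpolation(S, u):
    """Apply the 2D tensor-product interpolation (S kron S) @ u
    by two sweeps of the 1D matrix S instead of forming the full
    Kronecker product (hypothetical sketch of sum factorization)."""
    m, n = S.shape          # m quadrature points, n basis functions per direction
    U = u.reshape(n, n)     # nodal values on the tensor-product grid (row-major)
    # One sweep per direction: cost O(m*n*(m+n)) instead of O(m^2 * n^2)
    return (S @ U @ S.T).reshape(-1)

# Verify against the naive Kronecker-product evaluation
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 3))   # e.g. 4 quadrature points, degree-2 basis
u = rng.standard_normal(9)        # 3 x 3 nodal coefficients, flattened
naive = np.kron(S, S) @ u
assert np.allclose(sum_factorized_interpolation(S, u), naive)
```

In three dimensions the same idea applies with one additional sweep, which is why the per-element cost grows only linearly in the polynomial degree per direction rather than with the full tensor size.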


Keywords: Discontinuous Galerkin · Quadrature point · Spectral element method · Memory transfer · Pressure Poisson equation



The authors acknowledge the support given by the Bayerische Kompetenznetzwerk für Technisch-Wissenschaftliches Hoch- und Höchstleistungsrechnen (KONWIHR) in the framework of the project High performance finite difference stencils for modern parallel processors. This work was supported by the German Research Foundation (DFG) under the project High-order discontinuous Galerkin for the exa-scale (ExaDG) within the priority program Software for Exascale Computing (SPPEXA). The authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. for funding this project by providing computing time on the GCS Supercomputer SuperMUC at Leibniz Supercomputing Centre (LRZ) through project id pr83te.

The authors acknowledge collaboration with Benjamin Krank, Niklas Fehn, and Matthias Brehm.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Institute for Computational Mechanics, Technical University of Munich, Garching, Germany
  2. Max Planck Institute for Plasma Physics, Garching, Germany
  3. Zentrum Mathematik, Technical University of Munich, Garching, Germany
  4. IBM Deutschland, Garching, Germany
  5. Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften, Garching, Germany
