High-Order Discontinuous Galerkin Methods by GPU Metaprogramming

  • Andreas Klöckner
  • Timothy Warburton
  • Jan S. Hesthaven
Part of the Lecture Notes in Earth System Sciences book series (LNESS)


Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: they allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. In a recent publication, we have shown that DG methods also adapt readily to execution on modern, massively parallel graphics processors (GPUs). Several qualities of the method contribute to this suitability, ranging from locality of reference, through regularity of access patterns, to high arithmetic intensity. In this article, we illuminate a few of the more practical aspects of bringing DG onto a GPU, including the use of a Python-based metaprogramming infrastructure that was created specifically to support DG but has found many uses across all disciplines of computational science.
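
The abstract does not show what the metaprogramming infrastructure looks like, so the following is only a minimal sketch of the general idea of run-time code generation, written with the Python standard library alone. It is not the authors' code. The kernel name `apply_elementwise`, the memory layout, and the helper `generate_kernel` are hypothetical and chosen purely for illustration. The sketch builds CUDA C source text for an element-local matrix-vector product whose inner loop is fully unrolled for a node count that is known only at run time; it does not compile or launch anything.

```python
from string import Template

# Hypothetical kernel skeleton (an assumption, not taken from the article).
# ${n} is the number of nodes per element; ${body} is the unrolled dot product.
KERNEL_TEMPLATE = Template("""\
// Generated for ${n} nodes per element (illustration only).
__global__ void apply_elementwise(const float *mat, const float *u, float *out)
{
    const int elem = blockIdx.x;   // one thread block per element
    const int row = threadIdx.x;   // one thread per output node
    float acc = 0.0f;
${body}
    out[elem * ${n} + row] = acc;
}
""")


def generate_kernel(n):
    """Return CUDA C source text with the length-n dot product unrolled."""
    lines = [
        f"    acc += mat[row * {n} + {j}] * u[elem * {n} + {j}];"
        for j in range(n)
    ]
    return KERNEL_TEMPLATE.substitute(n=n, body="\n".join(lines))


# Generate source for a small (hypothetical) element with 3 nodes.
source = generate_kernel(3)
print(source)
```

Because the node count is substituted as a literal, the generated text for `n = 3` contains exactly three unrolled `acc +=` statements and no loop. In a real setting, a string like this would be passed to a run-time GPU compiler, and several variants could be generated and timed to pick the fastest.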


Discontinuous Galerkin; Global Memory; Discontinuous Galerkin Method; Memory Bandwidth; Spectral Element Method
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



AK’s research was partially funded by AFOSR under contract number FA9550-07-1-0422, through the AFOSR/NSSEFF Program Award FA9550-10-1-0180 and also under contract DEFG0288ER25053 by the Department of Energy. TW acknowledges the support of AFOSR under grant number FA9550-05-1-0473 and of the National Science Foundation under grant number DMS 0810187. JSH was partially supported by AFOSR, NSF, and DOE. The opinions expressed are the views of the authors. They do not necessarily reflect the official position of the funding agencies.



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Andreas Klöckner (1)
  • Timothy Warburton (2)
  • Jan S. Hesthaven (3)
  1. Courant Institute of Mathematical Sciences, New York University, New York, USA
  2. Department of Computational and Applied Mathematics, Rice University, Houston, USA
  3. Division of Applied Mathematics, Brown University, Providence, USA
