Journal of Scientific Computing, Volume 60, Issue 2, pp 457–482

Exploiting Batch Processing on Streaming Architectures to Solve 2D Elliptic Finite Element Problems: A Hybridized Discontinuous Galerkin (HDG) Case Study

  • James King
  • Sergey Yakovlev
  • Zhisong Fu
  • Robert M. Kirby
  • Spencer J. Sherwin


Numerical methods for elliptic partial differential equations (PDEs) within both continuous and hybridized discontinuous Galerkin (HDG) frameworks share the same general structure: local (elemental) matrix generation followed by global linear system assembly and solve. The local matrix generation stage requires no inter-element communication and parallelizes easily, and parallelization techniques already exist for linear system solvers; together these make a numerical scheme for elliptic PDEs a good candidate for implementation on streaming architectures such as modern graphical processing units (GPUs). We propose an algorithmic pipeline for mapping an elliptic finite element method to the GPU and perform a case study for a particular method within the HDG framework. This study compares CPU and GPU implementations of the method and highlights certain performance-critical implementation details. We chose the HDG method for the case study because of its computationally heavy local matrix generation stage and its reduced trace-based communication pattern, which together make the method amenable to the fine-grained parallelism of GPUs. We demonstrate that the HDG method is well suited for GPU implementation, obtaining total speedups on the order of 30–35 times over a serial CPU implementation for moderately sized problems.
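The three-stage structure described in the abstract (independent local matrix generation, global assembly, global solve) can be illustrated with a much simpler problem than the one studied in the paper. The sketch below is an assumption-laden toy, not the authors' HDG implementation: it uses a 1D continuous Galerkin discretization of -u'' = 1 on (0, 1) with homogeneous Dirichlet conditions and linear elements, in plain serial Python. All function names are hypothetical. The point is only that the first stage is a map over elements with no inter-element communication, which is the property that makes it a candidate for batch processing on a GPU.

```python
# Illustrative sketch only: 1D continuous Galerkin for -u'' = 1 on (0, 1)
# with u(0) = u(1) = 0 and linear elements. This is NOT the paper's HDG code;
# it only mirrors the "local generation -> assembly -> solve" structure.

def local_matrices(h):
    """Stage 1: stiffness matrix and load vector of one element of width h."""
    k = [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]
    f = [h / 2.0, h / 2.0]  # exact integral of f(x) = 1 against each hat function
    return k, f

def assemble(n_elems):
    """Stages 1 and 2: batch of local matrices, then global assembly."""
    h = 1.0 / n_elems
    # Each element is independent of the others, so this map is the part that
    # could be processed as a batch (for example one element per GPU thread).
    batch = [local_matrices(h) for _ in range(n_elems)]
    n = n_elems + 1
    K = [[0.0] * n for _ in range(n)]
    F = [0.0] * n
    for e, (k, f) in enumerate(batch):
        dofs = (e, e + 1)  # global node indices of element e
        for a in range(2):
            F[dofs[a]] += f[a]
            for b in range(2):
                K[dofs[a]][dofs[b]] += k[a][b]
    return K, F

def solve_dirichlet(K, F):
    """Stage 3: solve the interior system by Gaussian elimination.

    No pivoting is used, which is acceptable here because the interior
    stiffness matrix is symmetric positive definite.
    """
    n = len(F)
    interior = list(range(1, n - 1))
    A = [[K[i][j] for j in interior] for i in interior]
    b = [F[i] for i in interior]
    m = len(b)
    for p in range(m):                      # forward elimination
        for r in range(p + 1, m):
            factor = A[r][p] / A[p][p]
            for c in range(p, m):
                A[r][c] -= factor * A[p][c]
            b[r] -= factor * b[p]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):          # back substitution
        s = b[r] - sum(A[r][c] * x[c] for c in range(r + 1, m))
        x[r] = s / A[r][r]
    return [0.0] + x + [0.0]                # re-attach boundary values

def solve_poisson_1d(n_elems):
    K, F = assemble(n_elems)
    return solve_dirichlet(K, F)
```

For this toy problem the exact solution is u(x) = x(1 - x)/2, and linear elements reproduce it at the nodes, so `solve_poisson_1d(4)[2]` should agree with u(0.5) = 0.125 up to rounding. In the HDG setting the local stage is considerably more expensive and the global unknowns live on element traces rather than nodes, but the division of work follows the same pattern.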


High-order finite elements; Spectral/\(hp\) elements; Discontinuous Galerkin method; Hybridization; Streaming processors; Graphical processing units (GPUs)



We would like to thank Professor B. Cockburn (U. Minnesota) for the helpful discussions on this topic. This work was supported by the Department of Energy (DOE NETL DE-EE0004449) and by NSF OCI-1148291.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • James King (1)
  • Sergey Yakovlev (2)
  • Zhisong Fu (1)
  • Robert M. Kirby (1)
  • Spencer J. Sherwin (3)

  1. School of Computing and Scientific Computing and Imaging (SCI) Institute, University of Utah, Salt Lake City, USA
  2. Scientific Computing and Imaging (SCI) Institute, University of Utah, Salt Lake City, USA
  3. Department of Aeronautics, Imperial College London, London, UK
