
Finite Element Algorithms and Data Structures on Graphical Processing Units

International Journal of Parallel Programming

Abstract

The finite element method (FEM) is one of the most commonly used techniques for the solution of partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM, with special attention to the balance of computation and data movement. We present a GPU assembly algorithm that scales to basis functions of arbitrary polynomial degree, at the expense of redundant computations. We show how the storage of the stiffness matrix affects the performance of both the assembly and the solution. We investigate two approaches, global assembly into the CSR and ELLPACK matrix formats and matrix-free algorithms, and show the trade-off between the amount of indexing data and stiffness data. We discuss the performance of the different approaches in light of the implicit caches on Fermi GPUs and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms that are part of a conjugate gradient iteration and show that a matrix-free approach may be up to 2 times faster than global assembly approaches and up to 4 times faster than NVIDIA’s cuSPARSE library, depending on the preconditioner used.
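To make the storage options named in the abstract concrete, the following is a minimal CPU-side sketch in Python, not the paper's GPU kernels. It assumes a 1D mesh of linear elements with unit spacing, and every function name in it is hypothetical. It assembles the same stiffness matrix into CSR and ELLPACK, then computes the matrix-vector product a third way without a global matrix, by keeping only the local element matrices and the element-to-node map (one possible reading of "matrix-free").

```python
# Illustrative sketch only: 1D linear elements on a uniform mesh (an assumption),
# plain Python lists instead of GPU arrays. All names are hypothetical.

def element_nodes(n_elems):
    """Element-to-node map: element e touches nodes e and e + 1."""
    return [(e, e + 1) for e in range(n_elems)]

def element_matrices(n_elems, h=1.0):
    """Local 2x2 stiffness matrix of a 1D linear element, one copy per element."""
    k = 1.0 / h
    return [[[k, -k], [-k, k]] for _ in range(n_elems)]

def assemble_csr(n_nodes, elems, local):
    """Global assembly: sum local contributions, then pack rows into CSR."""
    rows = [dict() for _ in range(n_nodes)]
    for nodes, ke in zip(elems, local):
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                rows[i][j] = rows[i].get(j, 0.0) + ke[a][b]
    row_ptr, col_idx, values = [0], [], []
    for r in rows:
        for j in sorted(r):
            col_idx.append(j)
            values.append(r[j])
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, values

def csr_to_ell(row_ptr, col_idx, values):
    """ELLPACK: pad every row to the longest row (padding is value 0.0, column 0)."""
    n = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n))
    ell_cols = [[0] * width for _ in range(n)]
    ell_vals = [[0.0] * width for _ in range(n)]
    for i in range(n):
        for k, p in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[i][k] = col_idx[p]
            ell_vals[i][k] = values[p]
    return ell_cols, ell_vals

def spmv_csr(row_ptr, col_idx, values, x):
    """y = A x with A stored in CSR."""
    return [sum(values[p] * x[col_idx[p]] for p in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]

def spmv_ell(ell_cols, ell_vals, x):
    """y = A x with A stored in ELLPACK; padded entries contribute zero."""
    return [sum(v * x[c] for c, v in zip(cols, vals))
            for cols, vals in zip(ell_cols, ell_vals)]

def spmv_matrix_free(n_nodes, elems, local, x):
    """y = A x without forming A: gather, local multiply, scatter-add."""
    y = [0.0] * n_nodes
    for nodes, ke in zip(elems, local):
        xe = [x[j] for j in nodes]
        for a, i in enumerate(nodes):
            y[i] += sum(ke[a][b] * xe[b] for b in range(len(nodes)))
    return y

# Usage: four elements, five nodes, an arbitrary test vector.
n_elems = 4
elems = element_nodes(n_elems)
local = element_matrices(n_elems)
csr = assemble_csr(n_elems + 1, elems, local)
ell = csr_to_ell(*csr)
x = [0.0, 1.0, 4.0, 9.0, 16.0]
y_csr = spmv_csr(*csr, x)
y_ell = spmv_ell(*ell, x)
y_mf = spmv_matrix_free(n_elems + 1, elems, local, x)
```

All three products agree on this small mesh. The difference lies in what each layout has to store and read: CSR keeps one column index per nonzero plus a row-pointer array, ELLPACK trades padding for a regular row width, and the element-wise variant keeps only the element-to-node map as indexing data but holds one local matrix per element. How that balance plays out on a GPU is the subject of the paper and is not something this sketch measures.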





Acknowledgments

This research was supported in part by the UK Engineering and Physical Sciences Research Council through project EP/J010553/1 on “Algorithms and Software for Emerging Architectures”, and in part by the EU LLP/Erasmus program 10/2010-2011/Erasmus-SMP. The authors would like to acknowledge the help and support of Csaba Józsa, András Oláh, Barna Garay and Tamás Roska at PPKE Hungary.

Author information


Corresponding author

Correspondence to I. Z. Reguly.


About this article

Cite this article

Reguly, I.Z., Giles, M.B. Finite Element Algorithms and Data Structures on Graphical Processing Units. Int J Parallel Prog 43, 203–239 (2015). https://doi.org/10.1007/s10766-013-0301-6


  • DOI: https://doi.org/10.1007/s10766-013-0301-6
