Abstract
The finite element method (FEM) is one of the most commonly used techniques for solving partial differential equations on unstructured meshes. This paper discusses both the assembly and the solution phases of the FEM, with special attention to the balance between computation and data movement. We present a GPU assembly algorithm that scales to basis functions of arbitrary polynomial degree, at the expense of redundant computations. We show how the storage of the stiffness matrix affects the performance of both assembly and solution. We investigate two approaches, global assembly into the CSR and ELLPACK matrix formats and matrix-free algorithms, and show the trade-off between the amount of indexing data and the amount of stiffness data. We discuss the performance of the different approaches in light of the implicit caches on Fermi GPUs and show a speedup over a two-socket 12-core CPU of up to 10 times in the assembly phase and up to 6 times in the solution phase. We present our sparse matrix-vector multiplication algorithms, which are part of a conjugate gradient iteration, and show that a matrix-free approach may be up to 2 times faster than global assembly approaches and up to 4 times faster than NVIDIA’s cuSPARSE library, depending on the preconditioner used.
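To make the storage trade-off mentioned above concrete, the following is a minimal CPU-side sketch in Python, not the GPU implementation from the paper. It assumes a hypothetical 1D mesh of three linear elements with local stiffness matrix [[1, -1], [-1, 1]]. All function names are illustrative. Storing one local matrix per element in the matrix-free variant is also an assumption of this sketch. The same matrix-vector product is computed from CSR storage, from ELLPACK storage, and directly from the element data.

```python
# Hypothetical 1D mesh: 4 nodes, 3 linear elements (assumption for illustration).
ELEMENTS = [(0, 1), (1, 2), (2, 3)]
K_LOCAL = [[1.0, -1.0], [-1.0, 1.0]]   # local stiffness matrix, same for every element
N_NODES = 4

def assemble_dense(elements, k_local, n):
    """Global assembly: scatter-add every local matrix into a dense global matrix."""
    A = [[0.0] * n for _ in range(n)]
    for nodes in elements:
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                A[i][j] += k_local[a][b]
    return A

def dense_to_csr(A):
    """CSR: one row-pointer array plus one column index per stored nonzero."""
    row_ptr, col_idx, vals = [0], [], []
    for row in A:
        for j, a in enumerate(row):
            if a != 0.0:
                col_idx.append(j)
                vals.append(a)
        row_ptr.append(len(vals))
    return row_ptr, col_idx, vals

def csr_spmv(row_ptr, col_idx, vals, x):
    y = []
    for i in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y.append(s)
    return y

def dense_to_ell(A):
    """ELLPACK: every row is padded to the length of the longest row."""
    width = max(sum(1 for a in row if a != 0.0) for row in A)
    cols, vals = [], []
    for row in A:
        nz = [(j, a) for j, a in enumerate(row) if a != 0.0]
        pad = width - len(nz)
        cols.append([j for j, _ in nz] + [0] * pad)     # padded entries point at column 0
        vals.append([a for _, a in nz] + [0.0] * pad)   # and carry a zero value
    return width, cols, vals

def ell_spmv(width, cols, vals, x):
    return [sum(vals[i][k] * x[cols[i][k]] for k in range(width))
            for i in range(len(vals))]

def matrix_free_spmv(elements, k_local, x, n):
    """Matrix-free: gather x per element, apply the local matrix, scatter-add into y."""
    y = [0.0] * n
    for nodes in elements:
        x_local = [x[i] for i in nodes]
        for a, i in enumerate(nodes):
            y[i] += sum(k_local[a][b] * x_local[b] for b in range(len(nodes)))
    return y

A = assemble_dense(ELEMENTS, K_LOCAL, N_NODES)
x = [1.0, 4.0, 9.0, 16.0]
row_ptr, col_idx, csr_vals = dense_to_csr(A)
width, ell_cols, ell_vals = dense_to_ell(A)

y_csr = csr_spmv(row_ptr, col_idx, csr_vals, x)
y_ell = ell_spmv(width, ell_cols, ell_vals, x)
y_free = matrix_free_spmv(ELEMENTS, K_LOCAL, x, N_NODES)

# Storage as (index entries, value entries); the matrix-free count assumes
# one stored local matrix per element.
storage = {
    "csr": (len(row_ptr) + len(col_idx), len(csr_vals)),
    "ellpack": (N_NODES * width, N_NODES * width),
    "matrix_free": (sum(len(e) for e in ELEMENTS), len(ELEMENTS) * 4),
}
```

In this toy case the matrix-free variant keeps fewer index entries than CSR (only the element connectivity) but more stiffness values, because entries shared between neighbouring elements are stored once per element rather than summed. ELLPACK trades padding for a regular layout. How these counts translate into GPU performance depends on the mesh, the element degree, and the hardware, none of which this sketch models.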
Acknowledgments
This research was supported in part by the UK Engineering and Physical Sciences Research Council through project EP/J010553/1 on “Algorithms and Software for Emerging Architectures”, and in part by the EU LLP/Erasmus program 10/2010-2011/Erasmus-SMP. The authors would like to acknowledge the help and support of Csaba Józsa, András Oláh, Barna Garay and Tamás Roska at PPKE Hungary.
Cite this article
Reguly, I.Z., Giles, M.B. Finite Element Algorithms and Data Structures on Graphical Processing Units. Int J Parallel Prog 43, 203–239 (2015). https://doi.org/10.1007/s10766-013-0301-6
DOI: https://doi.org/10.1007/s10766-013-0301-6