Abstract
Parallelization of the finite-element method (FEM) has been studied by the scientific and high-performance computing community for over a decade. Most of the computations in the FEM are linear algebra operations on matrices and vectors. These operations follow the single-instruction multiple-data (SIMD) computation pattern, which is well suited to shared-memory parallel architectures. General-purpose graphics processing units (GPGPUs) have been used effectively to parallelize FEM computations since 2007. The solver step of the FEM is often carried out using conjugate gradient (CG)-type iterative methods because of their higher convergence rates and greater opportunities for parallelization. Although the SIMD computation patterns in the FEM map naturally onto GPU computing, there are some pitfalls, such as underutilization of threads, uncoalesced memory access, low arithmetic intensity, the limited amount of fast memory on GPUs and synchronization overheads. Nevertheless, FEM applications have been successfully deployed on GPUs over the last 10 years with significant performance improvements. This paper presents a comprehensive review of the parallel optimization strategies applied in each step of the FEM. The pitfalls and trade-offs linked to each step are also discussed. Furthermore, some notable methods that exploit the large computing power of a GPU are discussed. This review is not limited to a single field of engineering. Rather, it is applicable to all fields of engineering and science in which FEM-based simulations are necessary.
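To make concrete why the CG solver step is dominated by matrix and vector operations, the following is a minimal serial sketch, not the implementation of any work reviewed here. It assumes a symmetric positive-definite matrix stored in compressed sparse row (CSR) form; the function names `spmv_csr`, `dot` and `conjugate_gradient` are hypothetical and chosen only for illustration. Each loop over rows or vector entries is the kind of data-parallel work that a GPU kernel would distribute over threads.

```python
# Illustrative sketch only: unpreconditioned CG on a CSR matrix in pure Python.
# All function names are hypothetical; a GPU version would replace each loop
# with a kernel (one thread or warp per row / per vector entry).

def spmv_csr(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A x for A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):  # rows are independent of each other
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect read of x through col_idx: on a GPU this is a
            # typical source of uncoalesced memory access.
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def dot(u, v):
    """Inner product; on a GPU this is a reduction that needs synchronization."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A; returns (x, iterations)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x with x = 0
    p = list(r)          # initial search direction
    rs_old = dot(r, r)
    for it in range(max_iter):
        if rs_old ** 0.5 < tol:
            return x, it
        Ap = spmv_csr(row_ptr, col_idx, vals, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, max_iter
```

Per iteration the sketch performs one sparse matrix-vector product, two inner products and three vector updates. All of these do few floating-point operations per byte read, which is one way to see the low arithmetic intensity mentioned above.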
Cite this article
Pikle, N.K., Sathe, S.R. & Vyavhare, A.Y. GPGPU-based parallel computing applied in the FEM using the conjugate gradient algorithm: a review. Sādhanā 43, 111 (2018). https://doi.org/10.1007/s12046-018-0892-0
DOI: https://doi.org/10.1007/s12046-018-0892-0