pp 1–25

High performance iterative elemental product strategy in assembly-free FEM on GPU with improved occupancy

  • Nileshchandra K. Pikle
  • Shailesh R. Sathe
  • Arvind Y. Vyavahare


In the assembly-free Finite Element Method, matrix-vector products (MvPs) are computed either at the element level or at the degree-of-freedom (DoF) level. On a GPU, the MvPs are mapped on a per-thread basis, with one thread per element or one thread per DoF. Both strategies exploit the computing power of the GPU and improve performance, but they suffer from poor global memory load/store efficiency. This paper proposes an efficient implementation of the DoF-based MvP strategy that uses the faster on-chip shared memory to store elemental matrices on the GPU. Because shared memory on the GPU is small, the MvPs are carried out iteratively in chunks to alleviate the resulting poor occupancy. The iterative method gains performance from two factors: coalesced access to global memory and improved occupancy. Numerical experiments show that the proposed iterative method outperforms the DoF-based strategy by a factor of approximately 3.
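As an illustration of the idea, the following is a minimal CPU reference sketch in Python, not the GPU implementation described in the paper. It computes y = Kx DoF by DoF without assembling K, sweeping the elemental matrix in column chunks and accumulating partial results. The 1D bar mesh, the single shared elemental matrix `KE`, the `chunk` parameter (standing in for the slice that would be staged in shared memory), and both function names are assumptions made for this example only.

```python
# CPU reference sketch (not GPU code): DoF-based, assembly-free y = K x.
# Assumptions (not from the paper): a uniform 1D mesh of 2-node bar elements
# that all share one elemental matrix KE, and a hypothetical `chunk` size
# standing in for the part of KE that would fit in shared memory.

KE = [[1.0, -1.0],
      [-1.0, 1.0]]  # elemental stiffness of a unit 1D bar element


def dof_based_mvp(x, num_elems, chunk=1):
    """Compute y = K x one DoF at a time, sweeping KE in column chunks."""
    ndof = num_elems + 1
    assert len(x) == ndof
    y = [0.0] * ndof
    nloc = len(KE)  # local DoFs per element
    # Outer loop over column chunks of KE: only this slice is "resident"
    # during one pass, mimicking an iterative, chunked product.
    for c0 in range(0, nloc, chunk):
        cols = range(c0, min(c0 + chunk, nloc))
        for dof in range(ndof):  # one "thread" per DoF
            acc = 0.0
            # Elements sharing this DoF: element e has nodes (e, e + 1).
            for e in (dof - 1, dof):
                if 0 <= e < num_elems:
                    row = dof - e  # local row of this DoF inside element e
                    for j in cols:
                        acc += KE[row][j] * x[e + j]
            y[dof] += acc  # accumulate the partial result of this chunk
    return y


def assembled_mvp(x, num_elems):
    """Reference: assemble the global K explicitly, then multiply."""
    ndof = num_elems + 1
    K = [[0.0] * ndof for _ in range(ndof)]
    for e in range(num_elems):
        for i in range(2):
            for j in range(2):
                K[e + i][e + j] += KE[i][j]
    return [sum(K[i][j] * x[j] for j in range(ndof)) for i in range(ndof)]
```

Calling `dof_based_mvp(x, 3, chunk=1)` and `dof_based_mvp(x, 3, chunk=2)` on the same vector should give the same result as `assembled_mvp(x, 3)`, since chunking changes only the order in which partial sums are accumulated. On a GPU, a smaller chunk would reduce the shared memory needed per block at the cost of more passes; the sketch does not model that trade-off.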


Keywords

Graphics Processing Unit (GPU) · Finite Element Method (FEM) · Preconditioned Conjugate Gradient (PCG) · Compute Unified Device Architecture (CUDA)

Mathematics Subject Classification

Finite Element Methods 74S05 



Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur, India
  2. Department of Applied Mechanics, Visvesvaraya National Institute of Technology, Nagpur, India
