The Journal of Supercomputing, Volume 72, Issue 11, pp 4160–4180

Nekbone performance on GPUs with OpenACC and CUDA Fortran implementations

  • Jing Gong
  • Stefano Markidis
  • Erwin Laure
  • Matthew Otten
  • Paul Fischer
  • Misun Min


We present a hybrid GPU implementation and performance analysis of Nekbone, which represents one of the core kernels of the incompressible Navier–Stokes solver Nek5000. The implementation uses OpenACC and CUDA Fortran for local parallelization of the compute-intensive matrix–matrix multiplication part, which keeps modifications to the existing CPU code to a minimum while extending the simulation capability of the code to GPU architectures. Our discussion includes GPU results for OpenACC interoperating with CUDA Fortran and for the gather–scatter operations with GPUDirect communication. We demonstrate performance of up to 552 Tflops on 16,384 GPUs of the OLCF Cray XK7 Titan.


  • Nekbone/Nek5000
  • OpenACC
  • CUDA Fortran
  • GPUDirect
  • Gather–scatter communication
  • Spectral element discretization



This material is based upon work supported by the US Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under Contract DE-AC02-06CH11357, and partially supported by the Swedish e-Science Research Centre (SeRC). This research used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the US Department of Energy under Contract No. DE-AC05-00OR22725. The research also used computing resources of the French Alternative Energies and Atomic Energy Commission (CEA) in France via the Partnership for Advanced Computing in Europe (PRACE).



Copyright information

© Springer Science+Business Media New York (outside the USA) 2016

Authors and Affiliations

  • Jing Gong (1)
  • Stefano Markidis (1)
  • Erwin Laure (1)
  • Matthew Otten (2)
  • Paul Fischer (3, 4)
  • Misun Min (4)

  1. PDC, KTH, Stockholm, Sweden
  2. Cornell University, Ithaca, USA
  3. University of Illinois Urbana-Champaign, Champaign, USA
  4. Argonne National Laboratory, Lemont, USA
