The Journal of Supercomputing, Volume 59, Issue 2, pp 693–719

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms



Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible-flow Navier–Stokes solver for multi-GPU workstation platforms. To provide a fair comparison between CPUs and GPUs, we also develop a shared-memory parallel code for multi-core CPUs that uses identical numerical methods. For the GPU implementation, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU, and then use Pthreads to enable communication across multiple GPUs on a workstation. The projection algorithm that solves the incompressible fluid flow equations is implemented as a set of separate CUDA kernels. Each kernel is implemented on the GPU memory space suited to its arithmetic intensity, and this memory-hierarchy-specific implementation yields significantly faster performance. We present a systematic analysis of speedup and scaling on two generations of NVIDIA GPU architectures and compare single- and double-precision computational performance on the GPU. Using a quad-GPU platform for single-precision computations, we observe a speedup of two orders of magnitude relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective, small-footprint parallel computing platform that substantially accelerates computational fluid dynamics (CFD) simulations.


Keywords: CFD, CUDA, Graphics processing unit (GPU), Incompressible flow, Navier–Stokes equations, Pthreads





Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Department of Computer Science, Boise State University, Boise, USA
  2. Department of Mechanical and Biomedical Engineering, Boise State University, Boise, USA
