Overcoming GPU Memory Capacity Limitations in Hybrid MPI Implementations of CFD

  • Jake Choi
  • Yoonhee Kim
  • Heon-young Yeom
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11874)


In this paper, we describe a hybrid MPI implementation of a discontinuous Galerkin scheme in Computational Fluid Dynamics (CFD) that can utilize all the available processing units (CPU cores or GPU devices) on each computational node. We describe the optimization techniques used in our GPU implementation, which make it up to 74.88x faster than the single-core CPU implementation in our machine environment. We also perform experiments on work partitioning between heterogeneous devices to find the load balance that achieves the best performance on a single node consisting of heterogeneous processing units. The key problem is that CFD workloads need to allocate large amounts of both host and GPU device memory in order to compute accurate results. Simply scaling out by adding more nodes with high-end scientific GPU devices imposes an economic burden, not to mention additional communication overheads. From a micro-management perspective, the workload size on each node is also limited by the memory capacity of its attached GPU. To overcome this, we use ZFP, a floating-point compression algorithm, to save at least 25% of data usage in our workloads, with less performance degradation than using NVIDIA Unified Memory (UM).
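The two quantitative ideas in the abstract, throughput-proportional work partitioning and memory saved by compression, can be illustrated with a small sketch. This is not the authors' implementation: the function names `partition_cells` and `fixed_rate_savings` are hypothetical, the device weights are made-up inputs, and the 48 bits-per-value rate is an assumption chosen only because it yields a 25% saving for 64-bit doubles. The abstract does not state which ZFP mode or rate was used, and real ZFP streams also carry block-granularity and header overheads that are ignored here.

```python
def partition_cells(total_cells, throughputs):
    """Split total_cells across devices in proportion to relative throughput.

    `throughputs` holds one positive weight per device (hypothetical input,
    e.g. measured cells processed per second). Returns one integer share
    per device; the shares always sum to total_cells.
    """
    total = sum(throughputs)
    # Floor of each proportional share.
    shares = [int(total_cells * t / total) for t in throughputs]
    # Hand the cells lost to rounding to the fastest devices first.
    leftover = total_cells - sum(shares)
    fastest_first = sorted(range(len(throughputs)),
                           key=lambda i: throughputs[i], reverse=True)
    for i in fastest_first[:leftover]:
        shares[i] += 1
    return shares


def fixed_rate_savings(rate_bits, raw_bits=64):
    """Fraction of memory saved when each value is stored in rate_bits
    instead of raw_bits (idealized fixed-rate model, no overheads)."""
    return 1 - rate_bits / raw_bits


if __name__ == "__main__":
    # One slow device and one device three times faster (illustrative weights).
    print(partition_cells(100, [1, 3]))
    # Assumed rate of 48 bits per double-precision value.
    print(fixed_rate_savings(48))
```

Under this idealized model, a device whose memory holds N uncompressed values could hold roughly N / (1 - savings) values in compressed form, which is how compression raises the workload size that fits on a single GPU.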


CFD MPI CUDA GPU Compression Memory 



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science, Seoul National University, Seoul, South Korea
  2. Department of Computer Science, Sookmyung Women’s University, Seoul, South Korea
