Abstract
We present an implementation of the Lattice Boltzmann Method (LBM) with Locally Recursive non-Locally Asynchronous (LRnLA) algorithms on GPU and CPU. The algorithm is based on the recursive subdivision of the domain of the dD1T space-time simulation and loosens the memory-bound limit for numerical schemes with local dependencies. We show that the LRnLA algorithm makes it possible to overcome the main memory bandwidth limitations in both CPU and GPU implementations. For CPU, we find a data layout that provides the alignment needed for full use of AVX2/AVX512 vectorization. For GPU, we devise a procedure for pairwise CUDA-block synchronization and apply it to the implementation of the LRnLA algorithm, which previously worked only on CPU. The performance on GPU is higher, as is usual in modern implementations. However, the performance gap in our implementation is smaller, thanks to a more efficient CPU version. Through a detailed comparison, we show possible future applications for both the CPU and the GPU implementations of the lattice Boltzmann method in complex settings.
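To make the memory-bound limit mentioned in the abstract concrete, the following minimal sketch gives a roofline-style estimate of LBM throughput. It is an illustration under stated assumptions and not the authors' performance model: the D3Q19 lattice, double precision, the 100 GB/s bandwidth figure, and the names `memory_bound_mlups` and `steps_per_pass` are all hypothetical choices made for this example. The estimate also ignores halo overhead, cache capacity, and the compute roof.

```python
def memory_bound_mlups(bandwidth_gb_s, q=19, bytes_per_value=8, steps_per_pass=1):
    """Idealized upper bound on million lattice updates per second (MLUPS).

    Assumes every cell's q populations are loaded once and stored once per
    pass over main memory. steps_per_pass is the number of time steps applied
    to a cell while it stays in fast memory; 1 corresponds to a plain
    step-by-step traversal of the whole domain.
    """
    # One load and one store per population (assumed D3Q19, double precision).
    bytes_per_cell_pass = 2 * q * bytes_per_value
    cell_passes_per_second = bandwidth_gb_s * 1e9 / bytes_per_cell_pass
    return cell_passes_per_second * steps_per_pass / 1e6


if __name__ == "__main__":
    bandwidth = 100.0  # GB/s, hypothetical main memory bandwidth
    for steps in (1, 2, 4, 8):
        bound = memory_bound_mlups(bandwidth, steps_per_pass=steps)
        print(f"steps per pass = {steps}: bound = {bound:.1f} MLUPS")
```

In this simplified picture the bound grows linearly with the number of time steps performed per memory pass, which is the kind of gain that a space-time decomposition aims for. In practice the gain is capped by the size of fast memory and by the arithmetic throughput of the device.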
Acknowledgments
The work was supported by the Russian Science Foundation (grant No. 18-71-10004).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Levchenko, V., Zakirov, A., Perepelkina, A. (2019). LRnLA Lattice Boltzmann Method: A Performance Comparison of Implementations on GPU and CPU. In: Sokolinsky, L., Zymbler, M. (eds) Parallel Computational Technologies. PCT 2019. Communications in Computer and Information Science, vol 1063. Springer, Cham. https://doi.org/10.1007/978-3-030-28163-2_10
DOI: https://doi.org/10.1007/978-3-030-28163-2_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-28162-5
Online ISBN: 978-3-030-28163-2
eBook Packages: Computer Science, Computer Science (R0)