# Acceleration of Wind Simulation Using Locally Mesh-Refined Lattice Boltzmann Method on GPU-Rich Supercomputers

## Abstract

A real-time simulation of the environmental dynamics of radioactive substances is very important from the viewpoint of nuclear security. Since airflows in large cities are turbulent with Reynolds numbers of several million, large-scale CFD simulations are needed. We developed a CFD code based on the adaptive mesh-refined Lattice Boltzmann Method (AMR-LBM). The AMR method arranges fine grids only in the necessary regions, so that we can realize a high-resolution analysis covering a global simulation area. The code is developed on the GPU-rich supercomputer TSUBAME3.0 at the Tokyo Institute of Technology, and the GPU kernel functions are tuned to achieve high performance on the Pascal GPU architecture. The code is validated against a wind tunnel experiment released by the National Institute of Advanced Industrial Science and Technology (AIST) in Japan. Thanks to the AMR method, the total number of grid points is reduced to less than 10% of that of a fine uniform grid system. The weak scaling from 1 node to 36 nodes is examined. The GPUs (NVIDIA TESLA P100) achieve more than 10 times higher node performance than the CPUs (Broadwell).

## Keywords

High performance computing · GPU · Lattice Boltzmann Method · Adaptive mesh refinement · Real-time wind simulation

## 1 Introduction

A real-time simulation of the environmental dynamics of radioactive substances is very important from the viewpoint of nuclear security. In particular, high-resolution analysis is required for residential areas and urban cities, where the concentration of buildings makes the air flow turbulent. In order to understand the details of the air flow there, it is necessary to carry out large-scale Computational Fluid Dynamics (CFD) simulations. Since air flows behave as almost incompressible fluids, CFD simulations based on the incompressible Navier-Stokes equations are widely developed. The LOcal-scale High-resolution atmospheric DIspersion Model using Large-Eddy Simulation (LOHDIM-LES [1]) has been developed at the Japan Atomic Energy Agency (JAEA). The LOHDIM-LES can solve turbulent wind fields with Reynolds numbers of several million. However, an incompressible formulation sets the speed of sound to infinity, and thus the pressure Poisson equation has to be solved iteratively with sparse matrix solvers. In such large-scale problems, it is rather difficult for sparse matrix solvers to converge efficiently, because the problem becomes ill-conditioned as the problem size grows, and the overhead of inter-node communication increases with the number of nodes.

The Lattice Boltzmann Method (LBM) [2, 3, 4, 5] is a class of CFD methods that solve the discrete-velocity Boltzmann equation. Since the LBM is based on a weakly compressible formulation, the time integration is explicit and we do not need to solve the pressure Poisson equation. This makes the LBM scalable, and thus suitable for large-scale computation. As an example, studies performing large-scale calculations using the LBM were nominated for the Gordon Bell prize at SC10 [6] and SC15 [7]. However, it is difficult to perform multi-scale analyses with a uniform grid from the viewpoint of computational resources and calculation time. In this work, we address this issue with two approaches: the development of an adaptive mesh refinement (AMR) method for the LBM, and the optimization of the AMR-LBM on the latest Pascal GPU architecture.

The AMR method was proposed to overcome this kind of problem [8, 9]. Since the AMR method arranges fine grids only in a necessary region, we can realize a high-resolution multi-scale analysis covering global simulation areas. AMR algorithms for the LBM have been proposed, and they have achieved successful results [10, 11].

Recently, GPU-based simulations have been emerging as an effective technique to accelerate many important classes of scientific applications, including CFD applications [12, 13, 14]. GPU implementations of the LBM have also been reported [15, 16]. Since there are not many examples of AMR-based applications on the latest GPU architectures, there is room for research and development of such advanced applications. In this work, we implement an AMR-based LBM code to solve multi-scale air flows. The code is developed on the GPU-rich supercomputer TSUBAME3.0 at the Tokyo Institute of Technology, and the GPU kernel functions are tuned to realize a real-time simulation of the environmental dynamics of radioactive substances.

This paper reports implementation strategies of the AMR-LBM on the latest Pascal GPU architecture and its performance results. The code is written in CUDA 8.0 with CUDA-aware MPI. Host/device memory is managed by using Unified Memory, and the GPU/CPU buffers are passed directly to MPI functions. We demonstrate the performance of both CPUs and GPUs on TSUBAME3.0. A single GPU process (a single NVIDIA TESLA P100 processor) achieves 383.3 mega-lattice updates per second (MLUPS) with a leaf size of \( 4^{3} \) in single precision. This performance is about 16 times higher than that of a single CPU process (two Broadwell-EP processors, 14 × 2 cores, 2.4 GHz). Regarding the weak scaling results, the AMR-LBM code achieves 22535 MLUPS on 36 GPU nodes, which corresponds to 85% parallel efficiency relative to the performance on a single GPU node.

## 2 Lattice Boltzmann Method

The time evolution of the velocity distribution function \( f_{i} \) is given by

\( f_{i} \left( \vec{x} + \vec{c}_{i} \Delta t, t + \Delta t \right) = f_{i} \left( \vec{x}, t \right) + \varOmega_{i} \)

Here, \( \Delta t \) is the time interval, \( \vec{c}_{i} \) are the lattice vectors of the pseudo particles, and \( \varOmega_{i} \) is the collision operator.

It is important to choose a proper lattice velocity (vector) model by taking account of the tradeoff between efficiency and accuracy. Because of their low computational cost and high efficiency, the D3Q15 and D3Q19 models are popular. Recently, however, it was pointed out that these velocity models do not have enough accuracy at high Reynolds numbers with complex geometries [17]. On the other hand, the D3Q27 model is a suitable model for a weakly compressible flow at high Reynolds numbers.

Here, *c* is the sound speed, normalized as *c* = 1. Each velocity function refers to the predetermined upwind quantity. Since the memory accesses are simple and contiguous, the streaming process is suitable for high performance computing.

### 2.1 Single Relaxation Time Model

In this wind simulation, since the Mach number is less than 0.3, the flow can be regarded as incompressible. The equilibrium distribution function \( f_{i}^{eq} \) of the incompressible model is given as

\( f_{i}^{eq} = w_{i} \rho \left[ 1 + 3\left( \vec{c}_{i} \cdot \vec{u} \right) + \frac{9}{2}\left( \vec{c}_{i} \cdot \vec{u} \right)^{2} - \frac{3}{2}\left| \vec{u} \right|^{2} \right] \)

Here, ρ is the density and \( \vec{u} \) is the macroscopic velocity vector. The collision operator is equivalent to the viscous term in the Navier-Stokes equation. The corresponding weighting factors \( w_{i} \) of the D3Q27 model are \( 8/27 \) for the rest particle, \( 2/27 \) for the six nearest-neighbor directions, \( 1/54 \) for the twelve planar-diagonal directions, and \( 1/216 \) for the eight corner directions.

Since the SRT model is unstable at high Reynolds number, a Large-Eddy Simulation (LES) model has to be used to solve the LBM equation. The dynamic Smagorinsky model [19, 20] is often used, but it requires an averaging process over a wide area to determine the model constant. This is a huge overhead for large-scale computations, and it will negate the simplicity of the SRT model.

### 2.2 Cumulant Relaxation Time Model

The cumulant relaxation time model [21, 22] is a promising approach to solve the above problems. This model realizes turbulent simulations without an LES model, and we can determine the equilibrium distribution function locally. Unlike the SRT model, the collision process is not determined in the momentum space. We redefine the physical quantities in the following. We take the two-sided Laplace transform of the distribution function as

\( F\left( \vec{\varXi } \right) = \mathcal{L}\left\{ f\left( \vec{\xi } \right) \right\} = \int f\left( \vec{\xi } \right) e^{ - \vec{\varXi } \cdot \vec{\xi }}\, d\vec{\xi } \)

Here, \( \vec{\varXi } = \left( \varXi, \varUpsilon, Z \right) \) is the velocity frequency variable, and \( \vec{\xi } = \left( {\xi , \upsilon , \zeta } \right) \) are the microscopic velocities. The coefficients of the series of \( \ln F \), the countable cumulants \( c_{\alpha \beta \gamma } \), are written as

\( c_{\alpha \beta \gamma } = c^{ - \left( \alpha + \beta + \gamma \right)} \left. \frac{\partial^{\alpha + \beta + \gamma } \ln F}{\partial \varXi^{\alpha } \, \partial \varUpsilon^{\beta } \, \partial Z^{\gamma } } \right|_{\vec{\varXi } = 0} \)

where *α*, *β*, and *γ* are the indices of the cumulant. All decay processes are computed by

\( c_{\alpha \beta \gamma }^{*} = \omega_{\alpha \beta \gamma } \, c_{\alpha \beta \gamma }^{eq} + \left( 1 - \omega_{\alpha \beta \gamma } \right) c_{\alpha \beta \gamma } \)

The asterisk ∗ denotes the post-collision cumulant, and \( \omega_{\alpha \beta \gamma } \) is the relaxation frequency.

The velocities *u*, *v*, and *w* are the components of the macroscopic velocity vector \( \vec{u} \), and *θ* is a parameter. The cumulants are calculated from local quantities such as the discrete velocity functions *f*_{ i } and the macroscopic velocity \( \vec{u} \). Since this model is a computationally intensive algorithm with local memory access, it is well suited to achieve high efficiency in GPU computing.

### 2.3 Boundary Treatment

The LBM is suitable for modeling boundary conditions with complex shapes. The bounce-back (BB) scheme and the interpolated bounce-back (IBB) scheme make it easy to implement the no-slip velocity condition. Immersed boundary methods (IBM) [23, 24] are also able to handle complex boundary conditions by adding external forces in the LBM.

Here \( \overrightarrow {{u_{b} }} \) is the velocity vector of the boundary. Since each velocity function refers to predetermined neighboring upwind and downwind quantities, this scheme is more suitable for high performance computing than the IBM [23, 24].

## 3 Adaptive Mesh Refinement (AMR) Method

### 3.1 Block-Structured AMR Method

Since many buildings and complex structures make the air flow turbulent in large urban areas, it is necessary to carry out multi-scale CFD simulations. However, it is difficult to perform such a multi-scale analysis with uniform grids from the viewpoint of computational resources and calculation time. The AMR method [8, 27] is a grid generation method which arranges high-resolution grids only in the necessary regions. In AMR methods based on a forest-of-octrees approach [16, 28], a domain called a leaf is subdivided into four leaves in two dimensions (quadtree) or eight leaves in three dimensions (octree). Since a leaf is recursively subdivided into halves along each axis, the algorithm is easy to implement for parallel computing, and the same number of leaves is assigned to each process.

Since each leaf holds \( N^{3} \) grid points and its memory accesses are contiguous, it is suitable for GPU computation. Figure 3(a) shows a schematic figure of computational leaves at the interface of leaves at different levels, where each level needs a halo region across the interface. In such halo leaves, the data is constructed from the data on another level. Figure 3(b) shows an example of the leaf arrangement in the 2D case, where the calculation region at each level is surrounded by the halo region, which is constructed from the data on leaves at the next level. Therefore, only one level of difference is allowed at the interface of leaves at different levels.

The AMR method is applied to resolve the boundary layers near the buildings. The octree is initialized at the beginning of the simulation, and the mesh is not changed dynamically during the time stepping.

### 3.2 LBM with AMR

The LBM is dimensionless in time and space, and it is necessary to arrange these parameters according to the resolution of the AMR grids [5]. The kinematic viscosity defined in the LBM depends on the time step size as

\( \nu = c_{s}^{2} \left( \tau - \frac{1}{2} \right) \Delta t \)

where \( c_{s} = c/\sqrt{3} \) is the lattice sound speed and τ is the dimensionless relaxation time.

To keep the viscosity constant on the coarse and fine grids, the relaxation time τ satisfies the following expression

\( \tau_{f} = m\left( \tau_{c} - \frac{1}{2} \right) + \frac{1}{2} \)

Here the subscripts *c* and *f* denote the values on the coarse and fine grids, respectively, and the coefficient *m* is the refinement factor. The time step is also redefined for each resolution as \( \Delta t_{f} = \left( {\Delta t_{c} } \right)/m \). To ensure the continuity of the hydrodynamic variables and their derivatives, the distribution functions are rescaled consistently on the interface between the two resolutions.

The refinement factor \( m \) is set to 2 for stability and simplicity reasons.

## 4 Implementation and Optimization

### 4.1 CPU and GPU Implementation

In this section, we describe the implementation of the wind simulation code. The code is written in CUDA 8.0. We adopted the Array of Structures (AoS) memory layout to optimize multi-threaded performance. Each array is allocated by using the CUDA runtime API “cudaMallocManaged”, which places CPU and GPU allocations in the same address space. The CUDA system software automatically migrates data between CPU and GPU, which keeps the code portable.

The code is parallelized with the MPI library. OpenMPI 2.1.1 is a CUDA-aware MPI implementation that can send and receive CUDA device memory directly. OpenMPI 2.1.1 also supports Unified Memory, so the GPU/CPU buffers can be passed directly to MPI functions. MPI communications are executed leaf by leaf, and each leaf is transferred by one-sided communication with the “MPI_Put” function introduced in MPI-2.

### 4.2 Optimization for GPU Computation

In our GPU implementation, the streaming and collision processes are fused to reduce global memory accesses. In order to achieve high performance, it is also necessary to use the thousands of cores in a GPU. The upper limit of the number of threads is limited by the register usage per streaming multiprocessor (SM), which is determined at compile time. For example, according to the NVIDIA GP100 whitepaper [32], the Pascal GP100 provides 65536 32-bit registers on each SM. If one thread requires 128 registers, only 512 threads can be executed on an SM simultaneously. On the other hand, if one thread requires 32 registers, 2048 threads can be executed, which is the hardware upper limit of the Pascal GP100. Since the D3Q27 model and its cumulant collision operator need many registers on GPUs, the number of executed threads is limited by the lack of registers.

As described above, the function without boundary conditions (Func1) requires fewer registers than the original function (Func2). By executing the two functions asynchronously, it is possible to run more threads than in the original implementation. Details of the computational performance are discussed in Sect. 6.1 below.

## 5 Numerical Verification and Validation

### 5.1 Lid-Driven Cavity Flow

Discretization parameters for 2D lid-driven cavity flow.

| AMR lv. | Δleaf | Δx | # of leaves | # of grid points |
|---|---|---|---|---|
| 0 | L/8 | L/64 | \( 36 = 6^{2} \) | 2304 |
| 1 | L/16 | L/128 | \( 52 = 14^{2} - 12^{2} \) | 3328 |
| 2 | L/32 | L/256 | \( 240 = 32^{2} - 28^{2} \) | 15360 |
| Total | – | – | 328 | 20992 |

### 5.2 Wind Tunnel Test

The reference height is *z*_{ s } = 0.5 m and the wind velocity coefficient is *u*_{ s } = 2.14 m/s. The Reynolds number, which is evaluated from the inlet velocity and the physical properties of the air, is about 14000 at the top of the cube (*z* = 0.1 m).

Discretization parameters for wind tunnel test.

| AMR lv. | \( \Delta x \) (H = 0.1 m) | Domain size \( \left( X_{min,max} / Y_{min,max} / Z_{min,max} \right) \) | # of leaves | # of grid points \( \left( \times 10^{6} \right) \) |
|---|---|---|---|---|
| 0 | H/4 | −1.5, 1.5 / −0.5, 0.5 / −0.2, 0.75 | 24048 | 12.31 |
| 1 | H/8 | −4.0, 4.0 / −1.0, 1.0 / −0.2, 1.5 | 25800 | 13.21 |
| 2 | H/16 | −19.2, 19.2 / −1.2, 1.2 / −0.2, 2.2 | 24000 | 12.29 |
| Total | – | – | 73848 | 37.81 |

We compute the simulation with three refinement levels. Fine-resolution leaves are located near the cube, middle-resolution leaves surround the fine-resolution leaves, and coarse-resolution leaves are used in the outer region. The total number of grid points is 3.78 × 10^{7}, which is only 4.2% of the number required by a uniform grid at the finest resolution over the whole domain.

## 6 Performance on the TSUBAME 3 Supercomputer

TSUBAME 3.0 specification of a node.

| | Architecture | Bandwidth/node (GB/s) |
|---|---|---|
| CPU | Intel Xeon E5-2680 V4 (14 cores) × 2 | 153.6 (76.8 × 2) |
| GPU | NVIDIA TESLA P100 (16 GB, SXM2) × 4 | 2928 (732 × 4) |
| Network | Intel Omni-Path HFI 100 Gbps × 4 | 50 (12.5 × 4) |
| Memory | DDR4-2400 DIMM 256 GB | – |
| PCI Express | PCI Express Gen3 × 16 | – |

### 6.1 Performance on a Single Process

We show the performance results of the application on a single process by comparing the following three versions. The CPU version is the original code parallelized by using the OpenMP library and executed on a single node (two CPU sockets). The GPU version is written in CUDA and executed on a single GPU. The Optimal GPU version is optimized by using the boundary-separation technique described in Sect. 4.2 above. The CPU and GPU codes are compiled with the NVIDIA CUDA Compiler 8.0.61 (`-O3 -use_fast_math -restrict -Xcompiler -fopenmp --gpu-architecture=sm_60 -std=c++11`). For the OpenMP parallelization, we use 28 threads on two Intel Xeon E5-2680 V4 processors, while for the GPU computation, the number of threads is set to \( min(N_{Leaf} , 256) \).

Performance on a single process in a single node of TSUBAME 3.0.

| \( N_{Leaf} \) | # of leaves in each level (Lv. 0/1/2) | CPU (2 sockets) MLUPS | GPU MLUPS | Optimal GPU MLUPS |
|---|---|---|---|---|
| \( 4^{3} \) | 19008 / 73728 / 294912 | 23.3 | 231.6 | 383.5 |
| \( 8^{3} \) | 2448 / 9216 / 36864 | 17.4 | 237.4 | 369.7 |
| \( 16^{3} \) | 324 / 1152 / 4608 | 18.0 | 229.0 | 342.7 |
| \( 32^{3} \) | 45 / 144 / 576 | 13.2 | 184.4 | 243.5 |

The performance of the Optimal GPU version is about 1.5 times higher than that of the GPU version for \( N_{Leaf} = (4^{3} , 8^{3} ,16^{3} ) \). Since the benchmark covers all AMR leaves, including boundary ones, the boundary-separation technique works especially well for small leaf sizes.

### 6.2 Performance on Multiple Processes in a Single Node

We show the performance results of the application on multiple processes in a single node. The communication cost of GPU-based applications is a larger overhead than that of CPU-based ones. Table 3 shows that the memory bandwidth of the GPUs is 19 times higher than that of the CPUs in a single node; in other words, the relative impact of the communication cost on GPUs is roughly 19 times larger than on CPUs.

Performance of GPU computation in a single node.

| \( N_{Leaf} \) | # of leaves in each process (Lv. 0/1/2) | MLUPS (4 GPUs) | MPI cost (%) |
|---|---|---|---|
| \( 4^{3} \) | 19008 / 73728 / 294912 | 261.0 | 88.2 |
| \( 8^{3} \) | 2448 / 9216 / 36864 | 729.5 | 65.4 |
| \( 16^{3} \) | 324 / 1152 / 4608 | 840.6 | 48.8 |
(Note: OpenMPI 2.1.2 supports GPUDirect RDMA, which enables direct P2P (Peer-to-Peer) data transfer between GPUs. However, we did not succeed in using GPUDirect RDMA for MPI communications on TSUBAME 3.0.)

### 6.3 Performance on Multiple Nodes

We show the performance results of the application in multiple nodes. The leaf size is set to 8^{3} considering the performance and applicability to real problems. The number of leaves in a node is the same as that in Sect. 6.2 above.

In the weak scaling tests, the parallel efficiencies from 1 node to 36 nodes are 98% for CPUs and 85% for GPUs. Although the CPUs show better scalability, the performance on a single GPU node (733 MLUPS) is comparable to that on 36 CPU nodes (767 MLUPS).

### 6.4 Estimation of Performance in Wind Simulation

Our final goal is to develop a real-time simulation of the environmental dynamics of radioactive substances. We estimate the minimum mesh resolution \( \Delta x_{real\,time} \) at which a wind simulation can be executed in real time. The mesh resolution can be easily estimated from the Courant–Friedrichs–Lewy (CFL) condition as

\( \Delta x_{real\,time} = \frac{U_{target} }{CFL_{target} } \Delta t_{cal} \)

Here \( U_{target} \) is the wind velocity, \( CFL_{target} \) is the CFL number at \( U_{target} \), and \( \Delta t_{cal} \) is the elapsed time per step.

We estimate the mesh resolution under the condition of \( (U_{target} , CFL_{target} ) = (5.0 \,{\text{m}}/{\text{s}}, 0.2) \). The computational condition is based on the single GPU node case in Sect. 6.3 above. The fine leaves are placed near the ground surface, and the resolution changes in the height direction. The leaves are arranged as \( 24 \times 24 \times 17 \) at Lv. 0, \( 48 \times 48 \times 16 \) at Lv. 1, and \( 96 \times 96 \times 16 \) at Lv. 2. The computational performance reaches 733 MLUPS on a single GPU node. The minimum mesh resolution becomes \( \Delta x_{real\,time} \approx 3.6 \) m, which corresponds to a whole computational domain size of \( \left( {L_{x} , L_{y} , L_{z} } \right) = \left( {2.8\,{\text{km}}, 2.8\,{\text{km}}, 3.3\,{\text{km}}} \right) \). This estimation shows that a detailed real-time wind simulation can be realized by GPU computing.

## 7 Summary and Conclusions

This paper presented the GPU implementation of air flow simulations for the environmental dynamics of radioactive substances. We have successfully implemented the AMR-based LBM with a state-of-the-art cumulant collision operator. Our code is written in CUDA 8.0 and is executed both on CPUs and GPUs by using the CUDA runtime API “cudaMallocManaged”. Since the LBM kernel needs many registers on GPUs, the number of executed threads is limited by the lack of registers. We proposed an effective optimization that creates a kernel function for each conditional branch. This technique reduces the number of registers compared to the original function, and the single GPU performance is accelerated by about 1.5 times. A single GPU process (NVIDIA TESLA P100) achieved 383.3 mega-lattice updates per second (MLUPS) with a leaf size of 4^{3} in single precision. This performance is about 16 times higher than that of a single CPU process (two Broadwell-EP processors, 14 cores each, 2.4 GHz).

We have also discussed the weak scaling results: 36 GPU nodes achieved 22535 MLUPS with a parallel efficiency of 85% compared with a single GPU node. The present scaling studies revealed a severe performance bottleneck due to MPI communication, which will be addressed via GPUDirect RDMA or NVLink in future work.

Finally, we estimated the minimum mesh resolution \( \Delta x_{real\,time} \) at which air flow simulations can be executed in real time. The estimation shows that a detailed real-time wind simulation can be realized by GPU computing. We conclude that the present scheme is an efficient approach to realize a real-time simulation of the environmental dynamics of radioactive substances.

## Notes

### Acknowledgements

This research was supported in part by the Japan Society for the Promotion of Science (KAKENHI), a Grant-in-Aid for Scientific Research (C) 17K06570 and a Grant-in-Aid for Scientific Research (B) 17H03493 from the Ministry of Education, and “Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures” in Japan (Project ID: jh170031-NAH). Computations were performed on the TSUBAME 3.0 at the Tokyo Institute of Technology, and the ICEX at the Japan Atomic Energy Agency.

## References

1. Nakayama, H., Takemi, T., Nagai, H.: Adv. Sci. Res. 12, 127–133
2. Rothman, D.H., Zaleski, S.: J. Fluid Mech. 382(01), 374–378 (1997)
3. Inamuro, T.: Fluid Dyn. Res. 44, 024001 (2012). 21 pp.
4. Inagaki, A., Kanda, M., et al.: Boundary-Layer Meteorology, pp. 1–21 (2017)
5. Kuwata, Y., Suga, K.: J. Comput. Phys. 311 (2016)
6. Rahimian, A., Lashuk, I., et al.: In: Proceedings of the 2010 ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. IEEE Computer Society (2010)
7. Rossinelli, D., Tang, Y.H., et al.: In: Proceedings of the 2015 ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, vol. 2. IEEE Computer Society (2015)
8. Berger, M.J., Oliger, J.: J. Comput. Phys. 53(3), 484–512 (1984)
9. Zhao, Y., Liang-Shih, F.: J. Comput. Phys. 228(17), 6456–6478 (2009)
10. Zhao, Y., Qiu, F., et al.: In: Proceedings of the 2007 Symposium on Interactive 3D Graphics, pp. 181–188 (2007)
11. Yu, Z., Fan, L.S.: J. Comput. Phys. 228(17), 6456–6478 (2009)
12. Wang, X., Aoki, T.: Parallel Comput. 37(9), 521–535 (2011)
13. Shimokawabe, T., Aoki, T., et al.: In: Proceedings of the 2010 ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–11. IEEE Computer Society (2010)
14. Shimokawabe, T., Aoki, T., et al.: In: Proceedings of the 2011 ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis, vol. 3. IEEE Computer Society (2011)
15. Feichtinger, C., Habich, J., et al.: Parallel Comput. 37(9), 536–549 (2011)
16. Zabelok, S., et al.: J. Comput. Phys. 303(15), 455–469 (2015)
17. Kang, S.K., Hassan, Y.A.: J. Comput. Phys. 232(1), 100–117 (2013)
18. Zou, Q., He, X., et al.: Phys. Fluids 9(6), 1591–1598 (1996)
19. Germano, M., Piomelli, U., Moin, P., Cabot, W.H.: Phys. Fluids A 3(7), 1760–1765 (1991)
20. Lilly, D.K.: Phys. Fluids A 4(3), 633–635 (1992)
21. Geier, M., Schonherr, M., et al.: Comput. Math. Appl. 70(4), 507–547 (2015)
22. Geier, M., Pasquali, A., et al.: J. Comput. Phys. 348, 889–898 (2017)
23. Kim, J., Kim, D., Choi, H.: J. Comput. Phys. 171(20), 132–150 (2001)
24. Peng, Y., Shu, C., et al.: J. Comput. Phys. 218(2), 460–478 (2006)
25. Chun, B., Ladd, A.J.C.: Phys. Rev. E 75(6), 066705 (2007)
26. Yin, X., Zhang, J.: J. Comput. Phys. 231(11), 4296–4303 (2012)
27. Guzik, S.M., Weisgraber, T.H., et al.: J. Comput. Phys. 259(15), 461–487 (2014)
28. Laurmaa, V., Picasso, M., Steiner, G.: Comput. Fluids 131(5), 190–204 (2016)
29. Zuzio, D., Estivalezes, J.L.: Comput. Fluids 44(1), 339–357 (2011)
30. Usui, H., Nagara, A., et al.: Proc. Comput. Sci. 29, 2351–2359 (2014)
31. Open MPI: Running CUDA-aware Open MPI. https://www.open-mpi.org/faq/?category=runcuda
32. NVIDIA: Whitepaper, NVIDIA Tesla P100. https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
33. Ghia, U., Ghia, K.N., Shin, C.T.: J. Comput. Phys. 48, 387–411 (1982)
34. National Institute of Advanced Industrial Science and Technology: Database (in Japanese). https://unit.aist.go.jp/emri/ja/results/db/01/db_01.html

## Copyright information

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.