The simulation of flows requires following a particular workflow. First, the computational mesh corresponding to a flow problem and its associated geometry is generated. The mesh and a parallelized geometry are input to the multi-physics framework ZFS, which uses its solvers to compute an approximate solution of the Navier–Stokes equations. The following Sect. 2.1 describes the method to generate the mesh and a parallelized geometry. This is followed by a description of the LB solver in Sect. 2.2, which is used for the applications presented in Sect. 3. Some performance aspects of the simulation code are presented in Sect. 2.3. A workflow chart for the LB method is shown in Fig. 1.
Mesh generation
The mesh generator [53] is part of the framework ZFS and generates hierarchical Cartesian meshes. It is independent of the method employed for the simulation. It works fully in parallel using the Message Passing Interface (MPI) and OpenMP, and is able to construct meshes in a short amount of time on a large number of processes.
Parallel meshing begins with reading the geometry, which is usually given in Standard Tessellation Language (STL). Each process places an initial cube around this geometry and continuously subdivides the cube into smaller cubes. This constitutes an octree hierarchy with parent–child and neighborhood relations and cells living on different levels l in the tree. The subdivision is performed until a level \(l_\alpha\) is reached. In each iteration, cells that are outside the geometry are removed by a mixture of intersection tests and flooding algorithms. At \(l_\alpha\), all levels \(l<l_\alpha\) are removed and the remaining cells are decomposed by a Hilbert space-filling curve [61]. Each MPI rank keeps only those cells for which it is responsible and creates rank-neighborhood information based on the local domain boundary cells, the so-called window cells. Continuous subdivision is then performed on the remaining cells up to a level \(l_\beta >l_\alpha\). At this stage, the mesh is uniformly refined. The subsequent step creates locally refined meshes. The algorithm can refine regions where higher resolutions are required based on user-defined geometrical objects or the distance to the boundary. For the former method, cells inside the defined geometrical shapes are refined, while for the latter method, the distance from the wall is measured and used as an indicator for refinement. Global cell neighborhood information is restored using the window cells and the corresponding halo cells, which are created as copies of each window cell on a neighboring MPI rank. From the decomposition on \(l_\alpha\), a list of cells is defined, which is used in a preprocessing step to the solver for the weighted decomposition of the mesh. This list is also utilized for the parallelization of the geometry, i.e., for each cell in this list, the number of triangles and the corresponding triangle information are gathered and written to disk in parallel using either parallel NetCDF [47] or HDF5 [18]. These parallel libraries are also used in a final step to write the mesh to disk. For more details on the parallel mesh generator, the interested reader is referred to [49, 53].
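To illustrate the coarse structure of this procedure, the following minimal serial sketch shows the octree-style subdivision loop up to \(l_\alpha\) and \(l_\beta\). All types and predicates are hypothetical placeholders; the STL intersection tests, the flooding algorithm, and the Hilbert-curve decomposition over the MPI ranks are only indicated and are not reproduced from ZFS.

```cpp
// Minimal sketch of the coarse meshing loop described above (hypothetical
// types and predicates; the actual ZFS interfaces are not shown here).
#include <array>
#include <vector>

struct Cell {
  std::array<double, 3> center;  // cell center coordinates
  double length;                 // edge length of the cubic cell
  int level;                     // octree level l
};

// Placeholder for the STL intersection/flooding test mentioned in the text.
bool isOutsideGeometry(const Cell&) { return false; }

// Subdivide every kept cell into its 8 children (one octree level).
std::vector<Cell> refineOnce(const std::vector<Cell>& cells) {
  std::vector<Cell> next;
  for (const Cell& c : cells) {
    for (int child = 0; child < 8; ++child) {
      Cell f;
      f.length = 0.5 * c.length;
      f.level = c.level + 1;
      for (int d = 0; d < 3; ++d) {
        const double sign = ((child >> d) & 1) ? 0.5 : -0.5;
        f.center[d] = c.center[d] + sign * f.length;
      }
      if (!isOutsideGeometry(f)) next.push_back(f);  // drop outside cells
    }
  }
  return next;
}

int main() {
  std::vector<Cell> cells{{{0.0, 0.0, 0.0}, 1.0, 0}};  // initial cube
  const int lAlpha = 3, lBeta = 5;                     // illustrative levels
  while (cells[0].level < lAlpha) cells = refineOnce(cells);
  // ... here: Hilbert-curve decomposition of the l_alpha cells over the
  //     MPI ranks, keeping only the local partition (not shown) ...
  while (cells[0].level < lBeta) cells = refineOnce(cells);
  return 0;
}
```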
Lattice–Boltzmann solver
ZFS implements different LB methods, i.e., the standard SRT, MRT, RLB, and CLB models. For spatial discretization, all these models employ the parallel octree mesh placed into memory via parallel I/O. The molecular velocity space is discretized using the DxQy schemes from [59]. Figure 2 shows the various discretization schemes implemented in ZFS. The inner box of Fig. 1 shows the basic algorithm of the LB methods, i.e., a collision step, which represents, in a statistical sense, the change of the PPDFs due to cell-local collisions, is usually followed by a propagation step, which pushes the new PPDFs to the neighboring cells. Since the code is executed in parallel, window cells need to copy their information via inter-process communication (IPC) to the corresponding halo cells on the neighboring MPI ranks such that a valid propagation can be executed. Depending on the kind of boundary condition (BC), the BC is either executed right after the collision and before the IPC, or after the propagation. After that, the iteration step t is advanced, and if \(t=t^*\), a special event is executed. Such events can be checkpointing actions, solution I/O, in-situ analysis, or simply finishing the simulation at a target time step \(t^*=t_{end}\).
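The following minimal sketch illustrates the structure of this loop. The function names are hypothetical placeholders and not the actual ZFS API; they merely mirror the collision, boundary condition, communication, propagation, and event steps of Fig. 1.

```cpp
// Conceptual sketch of the LB time loop from Fig. 1. All function names are
// illustrative placeholders, not the actual ZFS API.
void collide() {}                  // cell-local collision step
void applyPreCommBCs() {}          // BCs executed after collision, before IPC
void exchangeWindowHaloCells() {}  // IPC: window cells -> remote halo cells
void propagate() {}                // push post-collision PPDFs to neighbors
void applyPostPropBCs() {}         // BCs executed after propagation
void handleEvent(int /*t*/) {}     // checkpointing, solution I/O, in-situ analysis

int main() {
  const int tEnd = 1000, tEvent = 100;  // illustrative values for t_end and t*
  for (int t = 1; t <= tEnd; ++t) {
    collide();
    applyPreCommBCs();
    exchangeWindowHaloCells();
    propagate();
    applyPostPropBCs();
    if (t % tEvent == 0 || t == tEnd) handleEvent(t);  // special events at t = t*
  }
  return 0;
}
```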
The most frequently used LB model is the SRT method with the D3Q27 scheme. The MRT, RLB, and CLB models, however, offer advanced features with respect to the range of applications and stability. The following describes these methods independently of the DxQy schemes. Furthermore, a method for large-eddy simulations (LES), the employed mesh refinement technique, and the boundary conditions are described. Note that all methods have previously been validated in [15, 19, 50]. That is, in [15], the SRT method with mesh refinement is validated by simulating the flow past a circular cylinder at Reynolds numbers \(Re=40\) and \(Re=100\), and past a sphere at \(100\le Re\le 300\) and at \(3700\le Re\le 10,000\). For the latter case, the Reynolds number is of the same order as that of the landing gear configuration presented in Sect. 3.3. A detailed analysis of the averaged drag coefficient \({\bar{C}}_d\), the mean base pressure coefficient \({\bar{C}}_{pb}\), the mean non-dimensional length of the recirculation region \({\bar{L}}_r/D\), the mean separation angle \({\bar{\phi }}_s\), the Strouhal number St of the large-scale vortex shedding, the mean wall-pressure coefficient \({\bar{C}}_p\), and the mean skin friction \({\bar{\tau }}\) is given. The values and distributions are in good agreement with results from the literature. In [19], the MRT and CLB methods are validated for the D3Q19 and D3Q27 models by simulations of plane Poiseuille flow at \(Re = 200\), of the flow in a three-dimensional lid-driven cavity at yaw at \(Re = 700\), and of a turbulent channel flow at \(Re_\tau =200\). Finally, in [50], the thermal MDF LB approach is validated by analyzing a boundary-layer flow over a heated flat plate at \(Re=10,000\) and a Prandtl number of \(Pr=1.0\).
SRT method
The SRT method [59] advances the PPDFs \(f_i(r+\delta r, t+\delta t), i\in \{1,\ldots ,y\}\), using a single relaxation parameter \(\omega _F\):
$$\begin{aligned} f_i(r+\delta r, t+\delta t) = f_i(r,t) +\omega _F\left( f_i^{eq}(r,t)- f_i(r,t)\right) . \end{aligned}$$
(1)
The parameter \(\omega _F\) is inversely related to the viscosity \(\nu\), that is:
$$\begin{aligned} \omega _F = \frac{c_s^2}{\nu +\delta t c_s^2/2}. \end{aligned}$$
(2)
In Eq. (1), r represents the spatial location, \(\delta r\) is the grid distance, t is the time, \(\delta t\) is the time increment, and \(f_i^{eq}\) is the discretized Maxwellian equilibrium distribution function given by:
$$\begin{aligned} f_i^{eq}(r,t)=\rho t_p \underbrace{\bigg [1+\frac{v_a\xi _{i,a}}{c_s^2}+\frac{v_a v_b}{2c_s^2}\cdot \left( \frac{\xi _{i,a}\xi _{i,b}}{c_s^2} -\delta _{ab}\right) \bigg ]}_{\chi }, \end{aligned}$$
(3)
where \(\rho\) is the density, \(t_p\) is a discretization scheme-dependent factor (see Appendix A), \(c_s\) is the speed of sound, \(v_{a,b}\) and \(\xi _{i,a,b}\) are the fluid velocity and molecular velocity components, and \(\delta _{ab}\) is the Kronecker delta with indices \(a,b\in \{1,\ldots ,x\}\). The algorithm usually performs collision and propagation as separate steps, i.e., an explicit scheme alternating between collision and propagation operations is used:
$$\begin{aligned}&{\hat{f}}_i(r,t) = f_i(r,t) +\omega _F\left( f_i^{eq}(r,t)- f_i(r,t)\right) \end{aligned}$$
(4)
$$\begin{aligned}&f_i(r+\delta r, t+\delta t) = {\hat{f}}_i(r,t), \end{aligned}$$
(5)
where the PPDFs marked with a hat \((\hat{\cdot })\) represent the post-collision PPDFs. The macroscopic variables can be obtained from the moments of the PPDFs, see, e.g., Hänel [27]:
$$\begin{aligned} \rho=\, & {} \sum _i f_i(r,t) \end{aligned}$$
(6)
$$\begin{aligned} \rho v_a=\, & {} \sum _i \xi _{i,a} f_i(r,t) \end{aligned}$$
(7)
$$\begin{aligned} \rho (e+v_a^2)=\, & {} \frac{1}{2}\sum _i \xi _{i,a}^2 f_i(r,t) \end{aligned}$$
(8)
$$\begin{aligned} \underbrace{\rho v_av_b+p\delta _{ab}-\sigma _{ab}}_{\varPi _{ab}}=\, & {} \sum _i\xi _{i,a} \xi _{i,b}f_i(r,t). \end{aligned}$$
(9)
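As an illustration of Eqs. (3), (4), (6), and (7), the following is a minimal single-cell SRT collision for the D2Q9 scheme. It assumes the standard D2Q9 weights and \(c_s^2=1/3\) (the values of Appendix A are not reproduced here) as well as lattice units with \(\delta t=\delta r=1\); it is a sketch, not the ZFS kernel.

```cpp
// Minimal D2Q9 SRT collision for a single cell, following Eqs. (3), (4),
// (6), and (7). Standard D2Q9 weights, c_s^2 = 1/3, and lattice units with
// delta_t = delta_r = 1 are assumed.
#include <array>
#include <cstdio>

constexpr int Q = 9;
constexpr double cs2 = 1.0 / 3.0;
constexpr std::array<double, Q> tp = {4.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0,
                                      1.0 / 9.0, 1.0 / 9.0, 1.0 / 36.0,
                                      1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0};
constexpr std::array<std::array<double, 2>, Q> xi = {
    {{0.0, 0.0}, {1.0, 0.0}, {0.0, 1.0}, {-1.0, 0.0}, {0.0, -1.0},
     {1.0, 1.0}, {-1.0, 1.0}, {-1.0, -1.0}, {1.0, -1.0}}};

void collideSrt(std::array<double, Q>& f, double omegaF) {
  // Moments, Eqs. (6) and (7)
  double rho = 0.0, rhoU = 0.0, rhoV = 0.0;
  for (int i = 0; i < Q; ++i) {
    rho += f[i];
    rhoU += xi[i][0] * f[i];
    rhoV += xi[i][1] * f[i];
  }
  const double u = rhoU / rho, v = rhoV / rho;
  // Equilibrium, Eq. (3), and relaxation toward it, Eq. (4)
  for (int i = 0; i < Q; ++i) {
    const double cu = (xi[i][0] * u + xi[i][1] * v) / cs2;
    const double uu = (u * u + v * v) / cs2;
    const double feq = rho * tp[i] * (1.0 + cu + 0.5 * (cu * cu - uu));
    f[i] += omegaF * (feq - f[i]);
  }
}

int main() {
  std::array<double, Q> f;
  for (int i = 0; i < Q; ++i) f[i] = tp[i];  // quiescent fluid with rho = 1
  collideSrt(f, /*omegaF=*/1.8);
  std::printf("f0 after collision: %f\n", f[0]);
  return 0;
}
```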
The temperature distribution can be simulated by an MDF approach [26], i.e., by additionally solving:
$$\begin{aligned} g_i(r+\delta r, t+\delta t) = g_i(r,t) +\omega _T\left( g_i^{eq}(r,t)- g_i(r,t)\right) . \end{aligned}$$
(10)
The parameter \(\omega _T\) depends on the heat conduction coefficient \(\kappa\) of the fluid:
$$\begin{aligned} \omega _T = \frac{c_s^2}{\kappa +\delta t c_s^2/2}, \end{aligned}$$
(11)
where \(\kappa\) is obtained from the Prandtl number \(Pr=\nu /\kappa\), and \(g_i^{eq}(r,t)\) is given by:
$$\begin{aligned} g_i^{eq}(r,t)=T t_p\chi , \end{aligned}$$
(12)
with T representing the temperature. The macroscopic temperature variable is given by:
$$\begin{aligned} T = \sum _i g_i. \end{aligned}$$
(13)
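A minimal sketch of these thermal relations, Eqs. (11)–(13), in lattice units with \(\delta t=1\) and \(c_s^2=1/3\); names and values are illustrative.

```cpp
// Sketch of the MDF thermal relations, Eqs. (11)-(13), in lattice units.
#include <array>
#include <cstddef>
#include <cstdio>
#include <numeric>

constexpr double cs2 = 1.0 / 3.0;

double omegaT(double nu, double Pr) {
  const double kappa = nu / Pr;      // Pr = nu / kappa
  return cs2 / (kappa + 0.5 * cs2);  // Eq. (11) with delta_t = 1
}

template <std::size_t Q>
double temperature(const std::array<double, Q>& g) {
  return std::accumulate(g.begin(), g.end(), 0.0);  // Eq. (13)
}

int main() {
  std::array<double, 9> g{};
  g.fill(1.0 / 9.0);  // illustrative PPDFs whose sum gives T = 1
  std::printf("omega_T = %f, T = %f\n", omegaT(1e-3, 1.0), temperature(g));
  return 0;
}
```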
The advantages of the SRT method are that the implementation is straightforward for all DxQy models and that the collision kernel is well suited for HPC simulations. Due to stability limitations, the fixed collision frequencies \(\omega _F\) and \(\omega _T\), and the quasi-incompressibility ansatz in the derivation of the equilibrium distribution function (see [27]), the method is, however, restricted to rather small Mach and Reynolds numbers.
MRT method
Unlike the SRT method, the MRT method [12] relaxes in moment space and introduces individual collision frequencies for the various moments, resulting in higher numerical stability [42]. Various physical processes in fluids, such as viscous transport, can be approximately described by mode coupling. The modes are directly related to the moments of the PPDFs. Since the collision frequencies of the moments are directly related to various transport coefficients, each mode can be controlled independently. This overcomes the fixed-Prandtl-number issue that the SRT models suffer from. The MRT equation can be written in vector notation as:
$$\begin{aligned}&{\mathbf {f}}(r+\delta r,t+\delta t)\nonumber \\&\quad ={\mathbf {f}}(r,t) - \underline{{\mathbf {M}}}^{-1}\underline{{\mathbf {K}}}_{MRT} \cdot \left[ {\mathbf {m}}\left( r,t\right) -{\mathbf {m}}^{eq}\left( r,t\right) \right] , \end{aligned}$$
(14)
where \({\mathbf {f}}\) is the vector of the PPDFs, \(\underline{{\mathbf {M}}}\) is the moment matrix, \(\underline{{\mathbf {K}}}_{MRT}\) is the relaxation matrix, and \({\mathbf {m}}\) and \({\mathbf {m}}^{eq}\) are the moment and equilibrium moment vectors. The relaxation matrix \(\underline{{\mathbf {K}}}_{MRT}\) is a diagonal matrix holding the various collision frequencies. Example vectors and matrices for the D2Q9 MRT model are given in Appendix B.
Note that setting all diagonal elements of \(\underline{{\mathbf {K}}}_{MRT}\) to \(\omega _F\) yields the SRT method. The calculation of the macroscopic variables and the propagation are performed analogously to the SRT method, see Eqs. (6)–(8). That is, the MRT method differs from the SRT method only in the collision step. A stability analysis that considers the eigenvalues of the spectral matrix of the Fourier transform of Eqs. (1) and (14) underlines the stability advantages of the MRT method over the SRT method [42]. That is, the MRT method is advantageous for flows at higher Reynolds numbers. Computationally, it is more expensive than the SRT method due to the required matrix operations. On modern supercomputers with decent vector units, the performance difference can, however, be considered negligible if the compiler is able to efficiently vectorize the code.
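The following sketch shows the data flow of the MRT collision step in Eq. (14): transformation to moment space, relaxation of each moment with its own frequency, and back transformation. The matrices \(\underline{{\mathbf {M}}}\), \(\underline{{\mathbf {M}}}^{-1}\), and the equilibrium moments are model specific (cf. Appendix B) and are passed in here as generic inputs; the toy check in main uses an identity matrix and is not a physical MRT model.

```cpp
// Structure of the MRT collision step, Eq. (14): f <- f - M^{-1} K (m - m^eq).
// The model-specific matrices and equilibrium moments are generic inputs.
#include <array>
#include <cstddef>

template <std::size_t Q>
using Vec = std::array<double, Q>;
template <std::size_t Q>
using Mat = std::array<std::array<double, Q>, Q>;

template <std::size_t Q>
Vec<Q> matVec(const Mat<Q>& A, const Vec<Q>& x) {
  Vec<Q> y{};
  for (std::size_t i = 0; i < Q; ++i)
    for (std::size_t j = 0; j < Q; ++j) y[i] += A[i][j] * x[j];
  return y;
}

// K is given by its diagonal entries kDiag (one collision frequency per moment).
template <std::size_t Q>
void collideMrt(Vec<Q>& f, const Mat<Q>& M, const Mat<Q>& Minv,
                const Vec<Q>& kDiag, const Vec<Q>& mEq) {
  Vec<Q> m = matVec(M, f);              // moments m = M f
  for (std::size_t i = 0; i < Q; ++i)
    m[i] = kDiag[i] * (m[i] - mEq[i]);  // relax each moment individually
  const Vec<Q> df = matVec(Minv, m);    // transform back to PPDF space
  for (std::size_t i = 0; i < Q; ++i) f[i] -= df[i];
}

int main() {
  // Toy check only: with M = M^{-1} = I, mEq = f^eq, and all diagonal
  // entries equal to omega_F, the update reduces to the SRT form of Eq. (1).
  constexpr std::size_t Q = 3;
  Mat<Q> ident{};
  for (std::size_t i = 0; i < Q; ++i) ident[i][i] = 1.0;
  Vec<Q> f = {0.2, 0.3, 0.5}, mEq = {0.25, 0.25, 0.5}, k = {1.8, 1.8, 1.8};
  collideMrt(f, ident, ident, k, mEq);
  return 0;
}
```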
CLB method
Similarly to the MRT method, the CLB method [22] relaxes in moment space. It reconstructs, however, the high-order non-hydrodynamic moments of the discrete velocity set up to the fourth order in a cascade from lower-order moments. Short-wavelength oscillations at higher Reynolds numbers are captured by these higher order moments, yielding a stable numerical scheme. The advancement of the state vector of the PPDFs \({\mathbf {f}}\) is given by [22]:
$$\begin{aligned} {\mathbf {f}}(r+\delta r, t+\delta t) = {\mathbf {f}}(r,t)+\underline{{\mathbf {K}}}_{CLB}\cdot {\mathbf {k}} \end{aligned}$$
(15)
where \(\underline{{\mathbf {K}}}_{CLB}=({\mathbf {K}}_1,\ldots ,{\mathbf {K}}_{y})\) is an orthogonal transformation matrix that transforms the configuration space to the moment space to obtain Galilean invariance. For example, the first hydrodynamic macroscopic quantities can be obtained by:
$$\begin{aligned} (\rho ,\rho v_1,\ldots ,\rho v_x)^T={\mathbf {f}}\cdot ({\mathbf {K}}_1,\ldots ,{\mathbf {K}}_{x+1}), \end{aligned}$$
(16)
and \({\mathbf {k}}\) is the moment vector. The matrix \(\underline{{\mathbf {K}}}_{CLB}\) and the vector \({\mathbf {k}}\) are exemplarily given in Appendix C for the D2Q9 model.
Concerning the limitations of the SRT method for \(\nu \rightarrow 0\), or in other words \(Re\rightarrow \infty\), the CLB method offers higher numerical stability and allows the viscosity to be decreased by many orders of magnitude compared to the original model [22].
RLB method
In the SRT method, the symmetry properties of the PPDFs required to reach the hydrodynamic limit are not necessarily fulfilled. Therefore, Latt and Chopard [43] introduced the RLB method, which adds a pre-collision regularization step to restore this symmetry. The regularization leads to enhanced stability and accuracy of the employed scheme in the hydrodynamic regime. The Chapman–Enskog expansion expands the PPDFs in powers of the Knudsen number \(\epsilon =l_f/L\), where \(l_f\) is the mean free path and L a characteristic length. This leads to:
$$\begin{aligned} f_i(r,t)=f_i^{(0)}(r,t)+\epsilon f_i^{(1)}(r,t)+\epsilon ^2f_i^{(2)}(r,t)+\ldots , \end{aligned}$$
(17)
with \(f_i^{eq}(r,t)=f_i^{(0)}(r,t)\) at zeroth order. The non-equilibrium parts are then given by:
$$\begin{aligned} f_i^{neq}(r,t)=f_i(r,t) - f_i^{eq}(r,t). \end{aligned}$$
(18)
The non-equilibrium part of the PPDFs can furthermore be expressed as [11]:
$$\begin{aligned} f_i^{neq}(r,t)\approx f_i^{(1)}(r,t)=-\frac{\delta t}{\omega _Fc_s^2}\left( \xi _{i,a}\xi _{i,b} -c_s^2\delta _{ab} \right) \cdot \frac{\partial \rho v_b}{\partial x_a}. \end{aligned}$$
(19)
Inserting this into the non-equilibrium momentum flux tensor [cf. Eq. (9)] leads to:
$$\begin{aligned} \varPi _{ab}^{neq}=\, & {} \varPi _{ab}-\sum _i\xi _{i,a}\xi _{i,b}f_i^{eq}(r,t) \end{aligned}$$
(20)
$$\begin{aligned}\approx & {} \sum _i\xi _{i,a}\xi _{i,b}f_i^{(1)}(r,t) \end{aligned}$$
(21)
$$\begin{aligned}=\, & {} -\frac{c_s^2}{\omega _F}\left( \nabla \cdot \rho {\mathbf {v}}\right) . \end{aligned}$$
(22)
Combining Eqs. (19) and (22) yields the regularization term:
$$\begin{aligned} f_i^{(1)}(r,t)=\frac{\delta t}{2c_s^4}\left( \xi _{i,a}\xi _{i,b} -c_s^2\delta _{ab} \right) \cdot \varPi _{ab}, \end{aligned}$$
(23)
which can be used in \(f_i^{reg}(r,t)=f_i^{eq}(r,t)+f_i^{(1)}(r,t)\) to set \({\hat{f}}_i(r,t)=f_i^{reg}(r,t)\) prior to the next collision.
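A minimal D2Q9 sketch of this regularization step following Eqs. (18)–(23): the non-equilibrium momentum flux is accumulated from the PPDFs, cf. Eqs. (20)–(21), and used to rebuild \(f_i^{(1)}\). The standard D2Q9 weights and \(c_s^2=1/3\) are assumed, and the prefactor uses the direction-dependent weight \(t_p\) as in the common form of the reconstruction [43], with \(\delta t=\delta r=1\); this is an illustration, not the ZFS implementation.

```cpp
// Pre-collision regularization for D2Q9, cf. Eqs. (18)-(23). Standard D2Q9
// weights and c_s^2 = 1/3 are assumed; prefactor uses the weight t_p as in
// the common form of the reconstruction [43].
#include <array>

constexpr int Q = 9;
constexpr double cs2 = 1.0 / 3.0;
constexpr std::array<double, Q> tp = {4.0 / 9.0, 1.0 / 9.0, 1.0 / 9.0,
                                      1.0 / 9.0, 1.0 / 9.0, 1.0 / 36.0,
                                      1.0 / 36.0, 1.0 / 36.0, 1.0 / 36.0};
constexpr std::array<std::array<double, 2>, Q> xi = {
    {{0.0, 0.0}, {1.0, 0.0}, {0.0, 1.0}, {-1.0, 0.0}, {0.0, -1.0},
     {1.0, 1.0}, {-1.0, 1.0}, {-1.0, -1.0}, {1.0, -1.0}}};

// Second-order equilibrium, Eq. (3), for given rho and velocity (u, v).
double feq(int i, double rho, double u, double v) {
  const double cu = (xi[i][0] * u + xi[i][1] * v) / cs2;
  const double uu = (u * u + v * v) / cs2;
  return rho * tp[i] * (1.0 + cu + 0.5 * (cu * cu - uu));
}

void regularize(std::array<double, Q>& f) {
  // Moments, Eqs. (6) and (7)
  double rho = 0.0, rhoU = 0.0, rhoV = 0.0;
  for (int i = 0; i < Q; ++i) {
    rho += f[i];
    rhoU += xi[i][0] * f[i];
    rhoV += xi[i][1] * f[i];
  }
  const double u = rhoU / rho, v = rhoV / rho;
  // Non-equilibrium momentum flux, Eqs. (20)-(21)
  double pixx = 0.0, pixy = 0.0, piyy = 0.0;
  for (int i = 0; i < Q; ++i) {
    const double fneq = f[i] - feq(i, rho, u, v);  // Eq. (18)
    pixx += xi[i][0] * xi[i][0] * fneq;
    pixy += xi[i][0] * xi[i][1] * fneq;
    piyy += xi[i][1] * xi[i][1] * fneq;
  }
  // Regularized PPDFs: f^reg = f^eq + f^(1), cf. Eq. (23)
  for (int i = 0; i < Q; ++i) {
    const double qxx = xi[i][0] * xi[i][0] - cs2;
    const double qxy = xi[i][0] * xi[i][1];
    const double qyy = xi[i][1] * xi[i][1] - cs2;
    const double f1 = tp[i] / (2.0 * cs2 * cs2) *
                      (qxx * pixx + 2.0 * qxy * pixy + qyy * piyy);
    f[i] = feq(i, rho, u, v) + f1;
  }
}

int main() {
  std::array<double, Q> f;
  for (int i = 0; i < Q; ++i) f[i] = tp[i] * (1.0 + 0.01 * i);  // perturbed state
  regularize(f);
  return 0;
}
```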
Large-eddy simulations
LES computations employ the Smagorinsky sub-grid scale (SGS) model for LB methods [33, 68]. Spatial filtering is performed using coarsened meshes. To account for the filtered scales, the turbulent viscosity \(\nu _t\) is added to the molecular viscosity in the collision frequency (see Eq. (2)):
$$\begin{aligned} \omega _F^{SGS} = \frac{c_s^2}{(\nu +\nu _t)+\delta t c_s^2/2} \end{aligned}$$
(24)
The turbulent viscosity is determined by Smagorinsky’s constant \(C_s\) and is given by:
$$\begin{aligned} \nu _t = (C_s\delta r)^2\sqrt{2{\bar{S}}^2_{ab}}, \end{aligned}$$
(25)
where \({\bar{S}}_{ab}\) is the filtered strain-rate tensor, which can be obtained from:
$$\begin{aligned} {\bar{S}}_{ab}=\frac{\omega _F}{2\rho c_s^2}\sum _i \left[ f_i(r,t)-f_i^{eq}(r,t)\right] \xi _{i,a}\xi _{i,b}. \end{aligned}$$
(26)
Grid refinement
Simulations on multiple octree hierarchy levels are performed using the method of Dupuis and Chopard [14], i.e., by executing restriction and prolongation operations on succeeding levels \(l_\kappa\) and \(l_{\kappa +1}\) in the corresponding overlapping regions. The method splits the missing interpolated incoming PPDFs \({\tilde{f}}_i(r,t)\) from other levels into equilibrium and non-equilibrium parts, see Eqs. (18) and (19). The relation between the non-equilibrium PPDFs on the fine and the coarse level can then be expressed as:
$$\begin{aligned} \frac{f_{i,\kappa +1}(r,t)^{neq}}{f_{i,\kappa }(r,t)^{neq}}=\, & {} \frac{\delta t_{\kappa +1}}{\delta t_{\kappa }}\cdot \frac{\omega _{F,\kappa }}{\omega _{F,\kappa +1}} \frac{\left( \xi _{i,a}\xi _{i,b} -c_s^2\delta _{ab} \right) }{\left( \xi _{i,a}\xi _{i,b} -c_s^2\delta _{ab} \right) }\nonumber \\=\, & {} \frac{\delta r_{\kappa +1}}{\delta r_\kappa }\cdot \frac{\omega _{F,\kappa }}{\omega _{F,\kappa +1}} \end{aligned}$$
(27)
with \(\delta t_{\kappa +1}/\delta t_{\kappa } = \delta r_{\kappa +1}/\delta r_{\kappa }\). For the transformations from fine to coarse and coarse to fine, this yields:
$$\begin{aligned} f_{i,\kappa +1}(r,t)=\, & {} {\tilde{f}}_i^{eq}(r,t)+({\tilde{f}}_{i,\kappa }(r,t) -{\tilde{f}}_i^{eq}(r,t))\nonumber \\&\cdot \underbrace{\frac{\delta r_{\kappa +1}}{\delta r_\kappa }\cdot \frac{\omega _{F,\kappa }}{\omega _{F,\kappa +1}}}_{\varTheta _{\kappa +1}}, \end{aligned}$$
(28)
$$\begin{aligned} f_{i,\kappa }(r,t)=\, & {} f_i^{eq}(r,t)+\left( f_{i,\kappa +1}(r,t)-f_i^{eq}(r,t)\right) \nonumber \\&\cdot \underbrace{\frac{\delta r_{\kappa }}{\delta r_{\kappa +1}}\cdot \frac{\omega _{F,\kappa +1}}{\omega _{F,\kappa }}}_{\varTheta _{\kappa }}. \end{aligned}$$
(29)
Keeping the viscosity constant across levels leads to the transformation factors \(\varTheta _{\kappa ,\kappa +1}\), which depend on the ratios of the grid distances and the collision frequencies on \(l_\kappa\) and \(l_{\kappa +1}\). The same scheme can be applied to the MDF approach by using the same heat conduction coefficient on both levels together with \(\omega _{T,\kappa }\) and \(\omega _{T,\kappa +1}\).
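In practice, the level-transfer operations of Eqs. (28) and (29) reduce to scaling the non-equilibrium part by \(\varTheta\); a minimal sketch with illustrative names and values follows.

```cpp
// Level-transfer operations, cf. Eqs. (28) and (29): scale the non-equilibrium
// part by Theta, built from the grid-spacing and collision-frequency ratios.

// Coarse-to-fine prolongation, Eq. (28): fTilde and fTildeEq are the
// interpolated coarse-level PPDF and its equilibrium part.
double coarseToFine(double fTilde, double fTildeEq, double drFine,
                    double drCoarse, double omegaCoarse, double omegaFine) {
  const double theta = (drFine / drCoarse) * (omegaCoarse / omegaFine);
  return fTildeEq + (fTilde - fTildeEq) * theta;
}

// Fine-to-coarse restriction, Eq. (29).
double fineToCoarse(double fFine, double fEq, double drFine, double drCoarse,
                    double omegaCoarse, double omegaFine) {
  const double theta = (drCoarse / drFine) * (omegaFine / omegaCoarse);
  return fEq + (fFine - fEq) * theta;
}

int main() {
  // Refinement ratio 1:2 between levels; all values are illustrative only.
  const double drC = 1.0, drF = 0.5, omC = 1.6, omF = 1.77;
  return coarseToFine(0.2, 0.19, drF, drC, omC, omF) > 0.0 ? 0 : 1;
}
```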
Boundary conditions
Various boundary conditions can be applied to simulate a large spectrum of numerical setups. In more detail, multiple Neumann and Dirichlet conditions for in- and outflows, and periodic boundary conditions are implemented in ZFS. This also includes sponge-layer, Saint-Venant/Wantzel, and pressure boundary conditions [51]. That is, for Dirichlet conditions, e.g., for the prescription of a velocity profile at an inlet or a pressure at the outlet, Eq. (3) is evaluated for a given velocity or density. An adaptive pressure-based outlet condition adapts according to the local Reynolds number. Such a boundary condition is frequently used at an outlet in combination with a Saint-Venant/Wantzel condition at an inlet in respiratory flow simulations to imitate the pressure drop caused by the expansion of the diaphragm; see Sect. 3.1. A Neumann condition is fulfilled by extrapolating from interpolated values, e.g., velocity or density, of the inlet-/outlet-nearest neighbor cells. The Saint-Venant/Wantzel condition extrapolates the momentum. No-slip wall boundary conditions are at least of second-order accuracy. Among the most frequently used wall boundary conditions is the interpolated bounce-back method of Bouzidi et al. [6], which, for cells intersected by the geometry, measures the distance q from the cell center to the wall and uses this information to interpolate the bounced-back PPDF. The interpolation scheme is chosen based on whether \(q<0.5\cdot \delta r\) or \(0.5\cdot \delta r\le q\le \delta r\). Furthermore, the schemes of Ginzburg and D’Humières [24] and Yu et al. [76] are implemented. The former proposes a multi-reflection boundary condition, which employs the PPDFs at three nodes to find the unknown PPDFs at the boundary after the bounce-back. The latter applies a double-interpolated bounce-back using only a single equation for all values of q. This boundary condition finds a PPDF at a location between the two or three wall-nearest nodes that will, during the bounce-back, stream exactly to the wall. Subsequently, it performs the bounce-back and then interpolates the missing PPDF based on the PPDF at the wall.
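As an illustration of the distance-based interpolation, the following sketches the linear variant of the interpolated bounce-back of Bouzidi et al. [6] for a single cut link. The quadratic variant and the exact formulation used in ZFS may differ in detail, and the variable names are illustrative.

```cpp
// Linear interpolated bounce-back of Bouzidi et al. [6] for one cut link.
// q is the normalized wall distance (q = q_wall / delta_r), fHatI is the
// post-collision PPDF on the boundary cell pointing toward the wall,
// fHatIUpstream is the same PPDF on the next fluid neighbor away from the
// wall, and fHatIBar is the post-collision PPDF of the opposite direction
// on the boundary cell. The return value is the bounced-back PPDF at t+dt.
double bouzidiLinear(double q, double fHatI, double fHatIUpstream,
                     double fHatIBar) {
  if (q < 0.5) {
    // Wall closer than half a cell: interpolate between the boundary cell
    // and its upstream fluid neighbor before the bounce-back.
    return 2.0 * q * fHatI + (1.0 - 2.0 * q) * fHatIUpstream;
  }
  // Wall farther than half a cell: combine the bounced-back PPDF with the
  // already reflected one on the boundary cell.
  return fHatI / (2.0 * q) + (2.0 * q - 1.0) / (2.0 * q) * fHatIBar;
}

int main() {
  // Example: wall at 30 % of the grid spacing from the cell center.
  return bouzidiLinear(0.3, 0.1, 0.11, 0.09) > 0.0 ? 0 : 1;
}
```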
Performance analysis
The scalability of the code is tested on two HPC systems. The CRAY HAZEL HEN system is located at the High-Performance Computing Center Stuttgart (HLRS), Germany. The system consists of 7712 dual-socket nodes, each containing two Intel Haswell E5-2680v3 CPUs with 12 cores clocked at 2.5 GHz. The system has a peak performance of 7.4 PFlops with 185,088 cores. The nodes contain 128 GB of RAM. Parallel I/O is implemented via a Lustre File System (LFS), see [77]. The IBM BlueGene/Q system JUQUEEN [69] is located at the Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Germany, and consists of 28,672 nodes, each containing an IBM PowerPC A2 CPU with 16 cores at 1.6 GHz and 16 GB of RAM. The overall peak performance is 5.9 PFlops. Due to its four-way SMT hardware-threaded floating-point units, each core is capable of running a maximum of four OpenMP threads. The JUQUEEN system uses the IBM LoadLeveler as job scheduler and has a 5D torus network with a 40 GB/s bandwidth.
Figure 3a shows the performance of the mesh generation on the JUQUEEN system for two cases with \(\approx 9.82\cdot 10^9\) and \(\approx 78.54\cdot 10^9\) cells. The domains are cubic and periodic in all Cartesian directions. Both meshes have a base level of \(l_\alpha =7\). The smaller mesh is refined up to \(l_\beta =11\) and the finer mesh up to \(l_\beta =12\). From Fig. 3a, it can be seen that the mesh generation for the smaller case scales well up to 2048 nodes. Defining the computational efficiency \(\epsilon\) in percent as the ratio of the run time expected under continuous bisection of the run time measured on the lowest number of nodes to the run time actually measured for the individual node counts, a value of \(\epsilon =92.06\%\) is achieved on 2048 nodes. The efficiency drops to \(\epsilon =46.01\%\) on 8192 nodes. A slightly superlinear scaling behavior up to 16,384 nodes is visible for the finer case, where \(\epsilon =106.0\%\) is reached. It should be noted that despite the scalability drop for the smaller case at higher node counts, the meshing process only requires 27.63 s and 44.94 s for the small and fine cases on 8192 and 16,384 nodes, respectively.
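Restated as a formula, with \(n_{ref}\) the lowest node count, \(t(n_{ref})\) the run time measured there, and \(t(n)\) the run time measured on n nodes, this efficiency reads:
$$\begin{aligned} \epsilon (n) = \frac{t(n_{ref})\cdot n_{ref}/n}{t(n)}\cdot 100\%. \end{aligned}$$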
Figure 3b shows the results of a strong scaling experiment on both the HAZEL HEN and the JUQUEEN system using the SRT method on a cubic domain with periodic boundary conditions in all directions. The computational mesh is uniformly refined, has levels from \(l_\alpha =8\) to \(l_\beta =10\), and consists of \(\approx 1.225\cdot 10^9\) cells. The analyses are performed using \(n_H=16,\ldots ,512\) compute nodes on HAZEL HEN with 24 MPI ranks per node, and \(n_J=64,\ldots ,16,384\) nodes on JUQUEEN with 16 MPI ranks per node. The simulations are advanced for \(t=100\) iterations. The computation shows very good scalability up to 512 nodes on HAZEL HEN with efficiencies of \(\epsilon =\{84.15\%,77.1\%,69.46\%\}\) on 128, 256, and 512 nodes. On JUQUEEN, the code scales up to 16,384 nodes with efficiencies of \(\epsilon =\{81.06\%, 70.4\%, 41.26\%\}\) on 4096, 8192, and 16,384 nodes.
To reduce the memory footprint of simulations and to accelerate the preprocessing, parallelized geometries are employed. Figure 4 shows the memory saving factors and preprocessing acceleration factors obtained by employing parallelized geometries in contrast to serial geometries. The results are obtained for the simulation of the flow in a respiratory tract using a mesh consisting of \(266.5\cdot 10^6\) cells with levels \(l_\alpha =9\), \(l_\beta =10\), and \(l_\gamma =12\). The corresponding geometry consists of \(7\cdot 10^6\) triangles and consumes 1239 MB in serial. The memory graph is based on the estimate that a geometry allocates twice as much aggregate space when the number of nodes is doubled. The acceleration of the preprocessing is based on a comparison to a test run using a serial geometry on 8192 nodes. From Fig. 4, it is obvious that using a parallelized geometry massively reduces the allocated space, i.e., from 512 up to 8192 nodes, saving factors of 1204 and 13,306 are obtained. The preprocessing time, which is dominated by I/O for the serial geometry, is decreased by factors of 50 and 27.5 for 512 and 8192 nodes. That is, on the one hand, parallelized geometries leave more memory available for the computation and, on the other hand, they reduce the effort of preparing a simulation.
For further ideas on accelerating LB codes, e.g., for exascale computing or porting to GPUs, the reader is referred to [4, 70, 73].