Lattice–Boltzmann simulations for complex geometries on high-performance computers

Complex geometries pose multiple challenges to the field of computational fluid dynamics. Grid generation for intricate objects is often difficult and requires accurate and scalable geometrical methods to generate meshes for large-scale computations. Such simulations, furthermore, presume optimized scalability on high-performance computers to solve high-dimensional physical problems in an adequate time. Accurate boundary treatment for complex shapes is another issue and influences parallel load-balance. In addition, large serial geometries prevent efficient computations due to their increased memory footprint, which leads to reduced memory availability for computations. In this paper, a framework is presented that is able to address the aforementioned problems. Hierarchical Cartesian boundary-refined meshes for complex geometries are obtained by a massively parallel grid generator. In this process, the geometry is parallelized for efficient computation. Simulations on large-scale meshes are performed by a high-scaling lattice–Boltzmann method using the second-order accurate interpolated bounce-back boundary conditions for no-slip walls. The method employs Hilbert decompositioning for parallel distribution and is hybrid MPI/OpenMP parallelized. The parallel geometry allows to speed up the pre-processing of the solver and massively reduces the local memory footprint. The efficiency of the computational framework, the application of which to, e.g., subsonic aerodynamic problems is straightforward, is shown by simulating clearly different flow problems such as the flow in the human airways, in gas diffusion layers of fuel cells, and around an airplane landing gear configuration.


Introduction and historical review
As computing power continuously increases, more and more complex engineering problems are investigated by means of computational fluid dynamics (CFD) methods. In research, high-performance computing (HPC) systems are commonly employed to solve large-scale multi-physics problems in fundamental and applied engineering. HPC architectures come in various flavors. The trend in HPC is towards heterogeneous architectures, which poses new challenges to simulation software. The TOP500 list of supercomputers 1 reflects this trend, e.g., the leading machine SUMMIT (status of 2020) of the Oak Ridge National Laboratory, U.S., is equipped with a massive amount of GPU accelerators. The adaption of software to run efficiently on such systems necessitates porting and tuning activities as well as multi-level parallelizations and optimized domain decompositioning methods.
A promising numerical method for the efficient simulation of flows are lattice-Boltzmann (LB) methods [3,27,31]. Their parallelization and their compute kernel implementation is simple, and intricate geometries can efficiently be treated. They are derived from lattice-gas cellular automata (LGCA). First theoretical work in [75] led to an application in fluid mechanics, i.e., to the first numerical solution of the Navier-Stokes equations via LGCA [20].
LGCA, however, suffer from their non-Galilean invariance at high Reynolds numbers Re leading to statistical noise [59,74]. McNamara and Zanetti [57] solved this problem by replacing the individual particle consideration by particle probability distribution functions (PPDFs). In [59], the collision matrix [31] is replaced by a single relaxation scheme. Higuera et al. [31] were the first to show results with the LB method for 2D flows in a channel containing a periodic array of obstacles. This is in line with the physical work of Bhatnagar-Gross-Krook and led to the numerical BGK model. Qian et al. [59] also introduce the DxQy discretization model with x space dimensions and y discrete directions. In [42,56,74], the Boltzmann equation is derived and by a Chapman-Enskog development the Navier-Stokes equations. Based on this model, a lot of novel variants of the LB method for different applications were deduced.
The stability of the standard BGK method with a single relaxation time (SRT) is non-linear dependent on the flow solution and the collision frequency. The viscosity is limited by a stability threshold, i.e., for Re → ∞ the viscosity cannot become infinitely small. Moment-based methods such as the multiple relaxation time (MRT) [12] and the cascaded LB (CLB) [22] methods are in this respect advantageous. In the MRT model, the relaxation is performed in moment space, which decouples the conservative and non-conservative variables. This leads to an increase in accuracy and stability. The CLB method also relaxes in moment space. Moments of the fourth order of the velocity components are, however, reconstructed from lower order moments in a cascade to capture oscillations in high-frequency space. Yet, this requires to use at least a D3Q27 discretization model. The entropic LB (ELB) method [36] increases the stability by an adaption of the relaxation time in each LB iteration. For the correction of instabilities, an entropy equation such as the Boltzmann entropy function or the Tsallis entropy function [5] is employed. The ELB model is stable even at high Reynolds numbers and more efficient than the MRT model [1]. The collision operator in the regularized LB (RLB) method [43] is separated into an equilibrium and non-equilibrium part. Under application of a term for the non-equilibrium stresses, the collision term is transformed into a solvable form. The RLB method is not as stable as the MRT, CLB, or ELB methods, and the computational costs are, however, similar to the standard BGK operator. Recently, the cumulant LB method [23] has been developed, which shows a smaller analytical and numerical error than the MRT method. The collision operator in this model is based on the so-called cumulants. The method is especially suited for high Reynolds number flows.
For the derivation of the equilibrium distribution function, a small Mach number Ma approximation is used [27], which limits the aforementioned methods to weakly compressible flow. This leads to a decoupling of the energy equation from the Navier Stokes equations. To satisfy the first law of thermodynamics or, in other words, to determine the temperature distribution, a multidistribution function (MDF) approach [26] is commonly used. Natural convection, i.e., a change of the density based on the temperature, can be realized by incorporating the Boussinesq approximation into the Navier-Stokes equations. The MDF method neglects, however, added work via compression and viscous heating at moderate Mach numbers. For Ma > 0.3 , either more energy states are considered [71] or finite-difference approaches [37] are employed. The full Navier-Stokes equations can be derived from the discretized equations using discretization models with y = 120 or y = 39 [66]. Chen et al. [9] present a method based on this approach, which is capable of simulating flows at Ma = 2.0.
LB methods are, furthermore, suited for the simulation of multi-phase flows. Depending on which thermal equation of state (EOS) and maximum density ratio are considered, or how fast and accurate a solution is requested, different models have been developed. Color-gradient methods [25] distinguish between two differently colored fluids, each with an own set of PPDFs [60]. Although this method is quite accurate, high-density gradients may lead to numerical instabilities. Shan-Chen models [65] generate a phase separation by integrating attracting and repelling forces in the vicinity of phase interfaces. This, however, leads to spurious currents. In free-energy models [72], the issue of the non-monotonic thermal EOS is addressed. The EOS is incorporated into the pressure tensor of the Navier-Stokes equations. Like the color-gradient method, the free-energy methods suffer from limiting density ratios. The interface tracking methods [30] solve for two sets of PPDFs. Macroscopically, the Cahn-Hilliard and the Navier-Stokes equations are considered. Recent advances using interface tracking methods [45] allow high-density gradients on the order of O(10 3 ) and an efficient and accurate solution scheme.
In general, all aforementioned LB schemes operate on Cartesian meshes, which may lead to an excessive number of cells due to the unique grid spacing, e.g., in boundary-layer flows. The application of boundary-refined Cartesian meshes delivers in such cases only a slight reduction of the computational effort. Furthermore, the Courant-Friedrichs-Lewy (CFL) number is bound to unity. It is hence worth noting that there exist several works on non-uniform meshes [13,39,58]. In [58], a finite volume (FV) formulation of the LB equation is derived, extending the applicability of LB methods to irregular meshes. Krämer et al. [39] employ a semi-Lagrangian propagation step, where the streamed PPDF is reconstructed by a finite-element approach. This allows for non-unity CFL numbers, higher order spatial discretizations, and higher time steps. Di Ilio et al. [13] combine the standard Cartesian approach with an FV formulation and name it the hybrid LB method (HLBM). In this method, two meshes, a Cartesian mesh in the far field of a solid object and an unstructured body-fitted FV mesh in the vicinity of the object, are combined. In the overlapping region, information is exchanged via interpolation. Note that despite the advantages of these irregular mesh methods, they complicate automatic meshing and dynamic solution-adaptive refinement, which is why the approach in this paper employs hierarchical Cartesian meshes. For a good introduction and further details on LB simulations, the reader is referred to [3,27].
The Institute of Aerodynamics and Chair of Fluid Mechanics, RWTH Aachen University, develops the framework Zonal Flow Solver (ZFS) [52], which is supported by the Jülich Aachen Research Alliance Center for Simulation and Data Science (JARA-CSD). This framework features a massively parallel Cartesian grid generator, parallelized geometries, and, amongst others, different LB methods to solve technical and biomedical engineering problems. The code is written in C++. In this contribution, the numerical methods implemented in ZFS are described and some application examples of large-scale simulations from different fields are presented as proof of concept. Note that despite the theory of the computational kernels is derived from available publications, the modularity, meshing, and parallelization approaches, also for the geometry, are unique to ZFS. That is, the paper aims at delivering details on a production code, which is under continuous development and frequently used. For a review of other LB simulation codes in the context of large-scale computations on HPC systems, the reader is referred to [70].
The paper is structured as follows. Section 2 discusses the numerical methods and corresponding performance aspects. Subsequently, Sect. 3 presents the application examples, i.e., flow computations in the human respiratory tract, in gas diffusion layers (GDLs) of fuel cells, and around a airplane landing gear configuration. In Sect. 4 a summary and a conclusion are given. Finally, in Sect. 5, other fields of application and an outlook are presented.

Numerical methods
The simulation of flows necessitates to follow a particular workflow path. First, the computational mesh corresponding to a flow problem and an associated geometry is generated. The mesh and a parallelized geometry are input to the multiphysics framework ZFS, which uses its solvers to compute an approximate solution of the Navier-Stokes equations. The following Sect. 2.1 describes the method to generate the mesh and a parallelized geometry. This is followed by a description of the LB solver in Sect. 2.2, which is used for the applications presented in Sect. 3. Some performance aspects of the simulation code are presented in Sect. 2.3. A workflow chart for the LB method is also shown in Fig. 1.

Mesh generation
The mesh generator [53] is part of the framework ZFS and generates hierarchical Cartesian meshes. It is independent of the method employed for the simulation. It fully works in parallel using the Message Passing Interface (MPI) and OpenMP, and is able to construct meshes in a short amount of time on a large number of processes.
Parallel meshing begins with an I/O of the geometry, which is usually given in Standard Tessellation Language (STL). Each process places an initial cube around this geometry and starts to continuously subdivide the cube into smaller cubes. This constitutes an octree hierarchy with parent-child and neighborhood relations, and cells living on different levels l in the tree. The subdivision is performed until a level l is reached. In each iteration, cells that are outside the geometry are removed by a mixture of intersection tests and flooding algorithms. At l , all levels l < l are removed and the remaining cells are decomposed by a Hilbert spacefilling curve [61]. Each MPI rank keeps only those cells which it is responsible for and creates rank-neighborhood information based on the local domain boundary cells, the Workflow overview of ZFS using the LB solver. At time step t = t * , some event is triggered, i.e., checkpointing, solution output, or in-situ analysis is performed so-called window cells. Continuous subdivision is then performed on the remainder of the cells up to a level l > l . At this stage, the mesh is uniformly refined. The subsequent step creates locally refined meshes. The algorithm can refine regions, where higher resolutions are required based on userdefined geometrical objects or the distance of the boundary. For the former method, cells inside the defined geometrical shapes are refined, while for the latter method, the distance from the wall is measured and used as in indicator for refinement. Global cell neighborhood is restored using the window and the corresponding halo cells, which are created as copy of each window cell on a neighboring MPI rank. From the decomposition on l , a list of cells is defined, which is used in a preprocessing step to the solver for the weighted decomposition of the mesh. This list is also utilized for the parallelization of the geometry, i.e., for each cell in this list, the number of triangles and the corresponding triangle information is gathered and written to disk in parallel using either parallel NetCDF [47] or HDF5 [18]. These parallel libraries are in a final step also used to write the mesh to disk. For more details on the parallel mesh generator, the interested reader is referred to [49,53].

Lattice-Boltzmann solver
ZFS implements different LB methods, i.e., the standard SRT, MRT, RLB, and CLB models. For spatial discretization, all these models employ the parallel octree mesh placed into memory via parallel I/O. The space is furthermore discretized using the DxQy schemata from [59]. Figure 2 shows the various discretization schemata implemented in ZFS. The inner box of Fig. 1 shows the basic algorithm of the LB methods, i.e., a collision step, which represents, in a statistical sense, the change of PPDFs due to cell-local collisions, is usually followed by a propagation step, which pushes the new PPDFs to the neighboring cells. Since the code is executed in parallel, window cells need to copy their information to the neighboring MPI ranks via inter-process communication (IPC) to their corresponding halo cells such that a valid propagation can be executed. Depending on the kind of boundary condition (BC), the BC is either executed right after collision and before IPC, or after propagation. After that, the iteration step t is advanced, and if t = t * , a special event is executed. Such events can be checkpointing actions, solution I/O, in-situ analysis, or simply finishing the simulation at a target time step t * = t end .
The most frequently used LB models are the SRT with the D3Q27 scheme. The MRT, RLB, and CLB models come, however, with advanced features with respect to the range of applications and stability. The following describes these methods independent of the DxQy schemata. Furthermore, a method for large-eddy simulations (LES), the employed mesh refinement technique, and boundary conditions are described. Note that all methods have previously been validated in [15,19,50]. That is, in [15], the SRT method with mesh refinement is validated by simulating the flow past a circular cylinder at Reynolds numbers Re = 40 and Re = 100 , and past a sphere at 100 ≤ Re ≤ 300 and at 3700 ≤ Re ≤ 10, 000 . For the latter case, the Reynolds number is on the same order as that of the landing gear configuration presented in Sect. 3.3. A detailed analysis of the averaged drag coefficient C d , the mean base pressure coefficient C pb , the mean non-dimensional length of the recirculation region L r ∕D , the mean separation angle ̄s , the stRouhal number St for the large-scale vortex shedding, the mean wall-pressure coefficient C p , and the mean skin-friction ̄ is given. The values and distributions are in good agreement with the other results from the literature. In [19], the MRT and the CLB methods are validated for the D3Q19 and D3Q27 models by simulations of plane Poiseuille flow at Re = 200 , flow in a threedimensional lid-driven cavity at yaw at Re = 700 , and of a turbulent channel flow at Re = 200 . Finally, in [50], the thermal MDF LB approach is validated by analyzing a boundary-layer flow over a heated flat plate at Re = 10, 000 and a PRandtl number of Pr = 1.0.

SRT method
T h e S RT m et h o d [ 5 9 ] e qu a t e s t h e P P D F s The parameter F is a function of the inverse of the viscosity , that is: In Eq. (1), r represents the spatial location, r is the grid distance, t is the time, t is the time increment, and f eq i is the discretized Maxwellian equilibrium distribution function given by: where is the density, t p is a discretization scheme-dependent factor, see Appendix A, c s is the speed of sound, v a,b and i,a,b are fluid velocity and molecular velocity components, and ab is the Kronecker delta with indices a, b ∈ {1, … , x} . The algorithm usually performs the collision and propagation steps in separate steps, i.e., an explicit scheme, alternating between collision and propagation operations, is used: where the PPDFs with a <̂> represent the post-collision PPDFs. The macroscopic variables can be obtained from the moments of the PPDFs, see, e.g., Hänel [27]: . (3) The temperature distribution can be simulated by an MDF approach [26], i.e., by additionally solving: The parameter T depends on the heat conduction coefficient of the fluid: where is given by the PRandtl number Pr = ∕ and g eq i (r, t) equates to: with T representing the temperature. The macroscopic temperature variable is given by: The advantages of the SRT method are that the implementation is straightforward for all DxQy models and that the collision kernel is well suited for HPC simulations. The Reynolds number range and its quasi-incompressibility ansatz in the derivation of the equilibrium distribution function (see [27]) are, however, due to stability limitations and fixed collision frequencies F and T restricted to rather small Mach and Reynolds numbers.

MRT method
Unlike the SRT method, the MRT method [12] relaxes in momentum space and introduces individual collision frequencies for the various moments resulting in higher numerical stabilities [42]. Various physical processes in fluids such as viscous transport can be approximately described by mode coupling. The modes are directly related to the moments of the PPDFs. Since the collision frequencies of the moments are directly related to various transport coefficients, each mode can be controlled independently. This overcomes the fixed PRandtl number issue that the SRT models suffer from. The MRT equation can be written in vector notation and is given by:

3
with being the vector of the PPDFs and moment and relaxation matrices and MRT , and vectors eq and . The relaxation matrix MRT is a diagonal matrix holding the various collision frequencies. Example vectors and matrices for the D2Q9 MRT model are given in Appendix B.
Note that setting the diagonal elements of MRT to F yields the SRT method. The calculations of the macroscopic variables and the propagation are performed analogue to the SRT method, see Eqs. (6)- (8). That is, the MRT method only differs from the SRT method by the collision step. A stability analysis that considers the eigenvalues of the spectral matrix of the Fourier transform of Eqs. (1) and (14) underlines the stability advantages of the MRT method over the SRT method [42]. That is, the MRT method is advantageous for higher Reynolds number flows. Computationally, it is due to the required matrix operations more expensive than the SRT method. On modern supercomputers with decent vector units, the performance difference can, however, be considered negligible if the compiler is able to efficiently vectorize the code.

CLB method
Similarly to the MRT method, the CLB method [22] also relaxes in moment space, which reconstructs, however, the high-order non-hydrodynamic moments of the discrete velocity set up to the fourth order in a cascade from lower order moments. Short wave-length oscillations at higher Reynolds numbers are captured by these higher order moments yielding a stable numerical scheme. The advancement of the state vector of the PPDFs is given by [22]: where CLB = ( 1 , … , y ) is a orthogonal transformation matrix that transforms the configuration space to the moment space to obtain Galilean invariance, e.g., the first hydrodynamic macroscopic quantities can be obtained by: and is the moment vector. The matrix CLB and the vector are exemplarily given in Appendix C for the D2Q9 model. Concerning the limitations of the SRT method for → 0 , or in other words Re → ∞ , the CLB method allows for higher numerical stability by enabling a decrease of the viscosity by many orders of magnitude as compared to the original model [22].

RLB method
Using the SRT method, the symmetry properties of the PPDFs are not necessarily fulfilled to reach the hydrodynamic limit. Therefore, Latt and Chopard [43] introduced the RLB method, which introduces a pre-collision regularization step to restore this symmetry. This regularization leads to enhanced stability and accuracy in the hydrodynamic regime of the employed scheme. The Chapman-Enskog expansion expands the PPDFs in powers of the Knudsen number = l f ∕L , where l f is the mean free path and L a characteristic length. This leads to: i (r, t) at zeroth order. The non-equilibrium parts are then given by: The non-equilibrium part of the PPDFs can furthermore be expressed as [11]: Inserting this into the non-equilibrium momentum flux tensor [cf. Eq. (9)] leads to: Combining Eqs. (19) and (22) yields the regularization term: prior to the next collision.

Large-eddy simulations
LES computations employ the Smagorinsky sub-grid scale (SGS) model for LB methods [33,68]. Spatial filtering is performed using coarsened meshes. To account for the filtered scales, the turbulent viscosity t is added to the collision frequency (see Eq. (2)): .
with Smagorinsky's constant C s . The turbulent viscosity is given by: where S ab is the filtered strain tensor, which can be obtained by:

Grid refinement
Simulations on multiple octree hierarchy levels are performed using the method of Dupuis and Chopard [14], i.e., by executing restriction and prolongation operations on succeeding levels l and l +1 in the corresponding overlapping regions. The method splits missing interpolated incoming PPDFs f i (r, t) from other levels into equilibrium and nonequilibrium parts, see Eqs. (18) and (19). The relation of the non-equilibrium PPDFs on the fine and the coarse level can then be expressed as: with t +1 ∕ t = r +1 ∕ r . For the transformations from fine to coarse and coarse to fine, this yields: Keeping the viscosity constant across levels leads to the transformation factors , +1 , which depend on the ratios of the grid distances and the relaxation times on l and l +1 . The same scheme can be applied to the MDF approach, however, using the same heat conduction coefficient on both levels and T, and T, +1 .

Boundary conditions
Various boundary conditions can be applied to simulate a large spectrum of numerical setups. In more detail, multiple von Neumann and Dirichlet conditions for in-and outflows, and periodic boundary conditions are implemented in ZFS. This also includes sponge layered, Saint-Venant/Wantzel, and pressure boundary conditions [51]. That is, for Dirichlet conditions, e.g., for the prescription of a velocity profile at an inlet or a pressure at the outlet, Eq. (3) is equated for a given velocity or density. An adaptive pressure-based outlet condition adapts according to the local Reynolds number. Such a boundary condition is frequently used at an outlet in combination with a Saint-Venant/Wantzel at an inlet in respiratory flow simulations to imitate the pressure drop caused by the expansion of the diaphragm; see Sect. 3.1. A von Neumann condition is fulfilled by extrapolating from interpolated values, e.g., velocity or density, of the inlet-/outletnearest neighbor cells. The Saint-Venant/Wantzel condition extrapolates the momentum. No-slip wall boundary conditions are at least of second-order accuracy. Among the most frequently used wall boundary conditions is the interpolated bounce-back method from Bouzidi et al. [6], which measures for cells intersecting the geometry the distance q from the cell center to the wall and uses this information to interpolate the bounced-back PPDF. The interpolation scheme is chosen based on q < 0.5 ⋅ r and 0.5 ⋅ r ≤ q ≤ r . Furthermore, schemes from Ginzburg and D'Humières [24] and Yu et al. [76] are implemented. In the former, a multi-reflection boundary condition, which employs PPDFs at three nodes to find the unknown PPDFs at the boundary condition after a bounce-back, is proposed. In the latter, a double interpolated bounce-back using only a single equation for all values of q is applied. This boundary condition finds a PPDF at a location between the two or three wall-nearest nodes, which during the bounce-back will exactly stream to the wall. Subsequently, it performs the bounce-back and then interpolates the missing PPDF based on the PPDF at the wall.

Performance analysis
The scalability of the code is tested on two HPC systems. The CRAY HAZEL HEN system is located at the High-Performance Computing Center (HLRS) Stuttgart, Germany.
The system consists of 7712 dual socket nodes containing each 2 Intel Haswell E5-2680v3 CPUs, each with 12 cores clocked at 2.5 GHz. The system has a peak performance of 7.4PFlops for 185, 088 cores. The nodes contain 128 GB of RAM. Parallel I/O is implemented via a Lustre File System (LFS), see [77]. The IBM BlueGene/Q system JUQUEEN [69] is located at the Jülich Supercomputing Centre (JSC), Forschungszentrum Jülich, Germany, and consists of 28,672 nodes containing IBM PowerPC A2 CPUs at 1.6 GHz, 16 cores, and 16 GB of RAM per node. The overall peak performance is 5.9PFlops. Due to its four-way SMT hardware threaded floating point units, it is capable of running a maximum number of 4 OpenMP threads per core. The JUQUEEN system uses the IBM LoadLeveler as job scheduler and has a 5D Torus network with a 40 GB/s bandwidth. Figure 3a shows the performance of the mesh generation on the JUQUEEN system for two cases with ≈ 9.82 ⋅ 10 9 and ≈ 78.54 ⋅ 10 9 cells. The domains are cubic and periodic in all Cartesian directions. Both meshes have a base level of l = 7 . The smaller mesh is refined up to l = 11 and the finer mesh up to l = 12 . From Fig. 3a, it can be seen that the mesh generation on for the smaller case scales well up to 2048 nodes. Defining the ratio of the expected run time under continuous bisection of the first run time on the lowest number of nodes and the actual measured time for the individual node counts as computational efficiency in percent, a value of = 92.06% is achieved on 2048 nodes. The performance increase drops to = 46.01% on 8192 nodes. A slightly superlinear scaling behavior up to 16,384 nodes is visible for the finer case, where = 106.0% is reached. It should be noted that despite the scalability drops for the smaller case for higher node counts, the meshing process only requires 27.63s and 44.94s for the small and fine cases on 8192 and 16,384 nodes. Figure 3b shows the results of a strong scaling experiment on both systems HAZEL HEN and JUQUEEN using the SRT method on a cubic domain with periodic boundary conditions in all directions. The computational mesh is uniformly refined, has levels from l = 8 to l = 10 , and consists of ≈ 1.225 ⋅ 10 9 cells. The analyses are performed using n H = 16, … , 512 compute nodes on HAZEL HEN, each with 24 MPI ranks per node, and n J = 64, … , 16, 384 on JUQUEEN using 16  To reduce the memory footprint of simulations and to accelerate the preprocessing performance of simulations, parallelized geometries are employed. Figure 4 shows the memory saving factors and preprocessing accelerator factors employing parallelized geometries in contrast to using serial geometries. The results are obtained for the simulation  Memory saving factor (left ordinate) and preprocessing acceleration factor (right ordinate) for computations on increasing node counts using parallelized geometries related to resources allocated on 512 JUQUEEN nodes using a serial geometry of the flow in a respiratory tract using a mesh consisting of 266.5 ⋅ 10 6 cells with levels l = 9 , l = 10 , and l = 12 . The corresponding geometry consists of 7 ⋅ 10 6 triangles and consumes 1239MB in serial. The memory graph is based on the estimate that a geometry allocates twice as much space under a doubling of the number of nodes. The acceleration of the preprocessing bases on a comparison to a test run using a serial geometry on 8192 nodes. From Fig. 4, it is obvious that using a parallelized geometry massively reduces allocated space, i.e., from 512 up to 8192 nodes saving factors of 1204 and 13,306 are obtained. The preprocessing, which is for the serial geometry dominated by I/O, can be decreased by factors of 50 and 27.5 for 512 and 8192 nodes. That is, on one hand, parallelized geometries allow to have more memory available for the computation and, on the other hand, reduce the effort in preparing a simulation.
For further ideas to accelerate LB codes, e.g., for exascale computing or to port to GPUs, the reader is referred to [4,70,73].

Applications
To show the applicability of the methods presented in Sect. 2 for complex flows, three simulation examples are presented in the following. Section 3.1 discusses some results for the flow in a complex geometry such as the human respiratory system. Subsequently, the applicability for technical applications, i.e., for the simulation of the flow in porous media such as in GDLs of fuel cells and around an airplane landing gear configuration, are shown in Sects. 3.2 and 3.3.

Flow in the human respiratory tract
The flow in the human respiratory tract is quasi-incompressible and is mainly in the low Reynolds number and low Mach number regime, i.e., the D3Q27 SRT LB method is well suited for the simulation of such flows. Considering respiration, it is important, from a fluid mechanics point of view, to understand the functionalities of the human nasal cavity and what pathologies, or in fluid mechanics context, what flow phenomena might cause their reduction [51]. That is, the pressure loss along the airway can be used as an indication for strenuous respiration. A local consideration might even lead to the identification of anatomical structures causing diminished respiration. The distribution of the wall-shear stress can be used to find potential locations of inflammations, while the temperature distribution and the heating capability classify a nasal cavity to be protective for the lungs. The epiglottis and the larynx might lead to jets at inspiration [49]. Furthermore, particle deposition behavior in the nose as well as in the lung is of high interest, i.e., where and why, e.g., fine dust particles or Diesel aerosols deposit in the airway [55]. The human organism is a highly complex structure and simulations in the human respiratory tract are quite a challenge for today's simulation algorithms. ZFS, however, is capable of simulating such flows with LB methods on octree-based meshes. Especially large geometries that may be the output of algorithms extracting surfaces from computer tomography data [54] might become large in size and necessitate a geometrical parallelization approach as presented in Sects. 2.1 and 2.3. Figure 5 shows the results of a simulation embedded in the original computer tomography data set that was computed on the JUQUEEN system. The computational mesh consists of 266.5 ⋅ 10 6 cells and has levels l = 9 , l = 10 , and l = 12 , cf. Sect. 2.3. The Reynolds number, which is based on the hydraulic diameter of the trachea D T , the kinematic viscosity of air, and the velocity corresponding to a volume flux of 360ml/s is Re = 1, 515 . The flow in the nasal cavity is primarily in the laminar regime. Depending on the local Reynolds number, i.e., the local shape of the anatomy and the respiration velocity, the flow can, however, become transitional [51]. Further downstream, the epiglottis and the larynx are responsible for the formation of a jet. The magnification in Fig. 5 shows that the flow becomes transitional in this region before it turns laminar again in the trachea. For visualizing unsteady vortical structures, the -criterion [10] defined by has been evaluated in the magnification. It is based on the Q-criterion [34]: where are the strain and the vorticity tensor, = (v 1 , v 2 , v 3 ) T is the velocity vector, and |⋅| denotes the Frobenius norm.
Quantitative results are depicted in Fig. 6. As it is in many cases the nasal cavity which determines the respiratory capability, the total pressure loss from the nostrils to the pharynx, i.e., is juxtaposed to results in [51] in Fig. 6a. That is, the pressure loss of the nasal cavities N with indices 1, … , 3 , is compared to the current case with index 4 for the left l and right r nasal cavity. In Eq. (34), the quantities x 1 ∈ {l, r} and x 2 represent the locations of the left and right nostrils and the pharynx location (see Fig. 5). The static pressure p s and the dynamic pressure p d are averaged over their corresponding cross-sectional areas. Note that to allow for a comparison, the total pressure has been normalized according to [51], that is: Case N 1 is considered a healthy cavity, N 2 a pathological case with swollen turbinates and a bend septum, and N 3 is a case which underwent a septoplastic surgery. Obviously, the respiratory capability of case N 4 is below that of the healthy and even the pathological case. It can be seen that strenuous respiration is present in both the left and right nasal cavity. It should, however, be mentioned that for N 1,…,3 a volume flux of 250 ml/s was assumed, which explains the comparative increased pressure loss for case N 4 .
The laminarization process along the airway is analyzed by considering the turbulent kinetic energy 3 ) with temporally averaged nondimensional fluctuating squared quantities v ′2 a . It is sampled along the geometrical centerline of the anatomy from the pharynx region to the end of the trachea. The results along the arc length of the curve are shown in Fig. 6b. It can be seen that the flow is already quite energetic when reaching  5. The corresponding restriction helps laminarizing the flow before the turbulent kinetic energy increases in the epiglottis jet region. Subsequently, the larynx constriction supports again laminarization and the larynx jet again increases the energy. In the trachea, the energy level stays at a low and almost constant level before the flow finally enters the lung.

Flow in gas diffusion layers of fuel cells
In fuel cells, it is important to have a uniform supply of oxygen and hydrogen in the reactive layer to reliably produce electricity. Therefore, GDLs [21,41] are placed between the gas supply channels and the reactive layer. Such GDLs consist of porous material composed of dense arrays of carbon fibers, sometimes treated with hydrophobic substances to allow for enhanced water droplet transport across the layers. The optimal structure to homogenize the flow is, however, difficult to find, which is why numerical simulations are employed to understand the fundamental aspects of gas transport across the layers. The following simulations make use of the LB method with the SRT D3Q27 model using parallelized geometries. Figure 7 shows the simulation setup, which covers 27 GDLs embedded in a 1mm 3 cube. For the inlet, a uniform velocity profile is prescribed via a Dirichlet BC and the density is extrapolated using a von Neumann BC. In contrast, the outlet extrapolates the velocity and the density is prescribed. At the outer wall, a slip condition is used and the no-slip wall BC for at the fiber surfaces is realized via the interpolated bounce-back approach from [6]. The Reynolds number, which is based on the fiber diameter of D = 7.2 m , the kinematic viscosity of hydrogen, and a velocity v G = 1.8 ⋅ 10 −3 m∕s , is Re G = 6.47 ⋅ 10 −5 . The computational mesh consists of hierarchy levels l = 7 , l = 8 , and l = 11 and of 2.032 ⋅ 10 9 cells. Figure 8 shows some results of the simulation, i.e., the velocity magnitude of the gas flow across the GDL is depicted via distributions in cross sections in the center of the GDL (horizontal and vertical) and right downstream of the GDL. The results match the small Reynolds number, i.e., the flow is dominated by diffusive effects. This is also corroborated by the smeared velocity distribution in the cross section right downstream of the GDL.
The simulations were performed on the JURECA system [40] at JSC Jülich. The JURECA supercomputer consists of

Flow around an airplane landing gear configuration
To show the applicability of the LB method to subsonic aerodynamic problems, simulations are performed on the JURECA system at JSC for the flow around a front airplane landing gear configuration 2 as shown in Fig. 9a. The landing gear can be considered typical for vertical take-off landing (VTOL) jet airplanes with double front gear wheels. The MRT method described in Sect. 2.2.2 with the D3Q27 discretization scheme is used. The Reynolds number of this problem is Re = v ∞ ⋅ H∕ = 0.5 ⋅ 10 6 . It is based on the freestream velocity v ∞ = 22.725m∕s , the landing gear height H = 1.20m (see label in Fig. 9a), and the viscosity of air at a temperature of T = 267.039K of = 15.15 ⋅ 10 −6 m 2 ∕s . The Reynolds number based on the diameter of the main hydraulic damper d = 0.062m is Re d = 25, 800 . Although the velocity and, hence, the Reynolds number are quite low compared to real inflight conditions, it represents a characteristic Reynolds number in the transition phase from vertical ascent to accelerated horizontal flight. Figure 9b shows Fig. 8 Simulation results of the GDL simulation. The different cross-sections are colored by the velocity magnitude. Cross-section locations from left to right: vertical in the center of the GDL; horizontal in the center of the GDL; directly downstream of the GDL the computational mesh of the simulation which is uniformly refined up to a base level l = l = 7 . The mesh is furthermore boundary-refined in the vicinity of the gear and patchrefined in the wake of the gear up to level l = 11 . The inset of Fig. 9b shows a magnification of the computational mesh in the vicinity of the top airplane mount joint. The mesh has a total number of cells of 191.5 ⋅ 10 6 . On level l , the resolution is at r = 1.479mm . To show that this resolution is sufficient, the non-dimensional wall distance y + is calculated on a temporally averaged simulation data set. The mean velocity components v a and the mean density ̄ are obtained by averaging the solution after s = 300 ⋅ 10 3 LB iterations for another ̄= 200 ⋅ 10 3 LB iterations. The non-dimensional wall distance is given by: where v * = √ w ∕ is the friction velocity with wall-shear stress w = ( v ∥ ∕ r w ) , and r w , , and v ∥ are the coordinate normal to the wall, the dynamic viscosity, and the walltangential velocity component. The minimal distance (i) between field data points i and the landing gear geometry employs the signed distance calculation method by Baerentzen and Aanaes [2]. The wall-shear stress is linearly approximated by w (i) ≈ [v ∥ ∕ (i)] . Together with Eq. (36), this leads to: Figure 11a shows contours of (i) colored by y + (i) . The maximum y + that is found in the wall-nearest cells, i.e., at a maximum distance of r∕2 on level l is max{y + (i)} = 4.086 . Although max{y + (i)} > 1 , the majority of the cells has a value of y + (i) ≤ 1 . Furthermore, since the flow is mainly shear-driven, the resolution of the computational mesh can be considered sufficient. The corresponding contour at = r∕2 , colored by y + , is shown in Fig. 11b. Obviously, the maximum y + is found on the left and right of the stagnation lines on the main and minor dampers, the hydraulic dampers, the wheels, and on top and on the bottom of the wheels. Figure 10a shows the flow in a cross section parallel to the main flow direction. The cross section is colored by the velocity magnitude. Obviously, the complex geometry in combination with the Reynolds number leads to a transition of undisturbed flow upstream of the landing gear to highly vortical structures in the wake region. This is also visible in Fig. 10b, which shows contours of the -criterion colored by the velocity magnitude. The two insets in Fig. 10b show the magnifications of these structures in the top region of the geometry and in the vicinity of the small hydraulic hoses, which resemble hair-pin vortex-like shapes that detach in the near-wall region.
To quantitatively analyze the flow, the Reynolds stresses are computed over the LB iterations ̄ . The turbulent kinetic energy k and the mean velocities v a along a line L * that runs from the inlet to the outlet and crosses the center of the axle of the wheel construction are plotted over the normalized length of the line. The position of the line is shown in Fig. 10a. Figure 12 shows the strong increase of the turbulent kinetic energy in the impingement region of the wheel axle. The center of the wheels, i.e., the rotational axis, is located at p * w = 0.3445 (see Fig. 10a) and the wheel range is p * w = [0.2974, 0.3919] . A slight decrease of k is visible in the wake region, before it normalizes towards an equilibrium further downstream. While the mean axial flow component v 1 strongly decreases towards the object, an increase of v 2 and v 3 due to the geometry-induced side-way displacement of the flow in this regions is visible. Downstream of the gear, v 3 experiences a stronger decrease than v 2 , which is due to the left and right mounted wheels forcing the flow to move inwards. The turbulent flow is, furthermore, analyzed by means of the power spectral density PSD calculated for the velocity components v a and the density at probe locations p * 1,2 ≈ 0.4 (see Fig. 10a). Probe p * 1 is located in the center of the geometry, while p * 2 is located right behind the left wheel. Comparing the PSDs in Fig. 13a and Fig. 13b shows that the flow at p * 2 is slightly more energetic in the higher wave number range and experiences a stronger drop towards non-resolved frequencies. Especially in the lower frequency range, the energy content at p * 1 for component v 3 is higher than for p * 2 , which is probably due to the same effect as for the turbulent kinetic energy k at this location. Three wave numbers k w (p * 1 ) = {0.176 ⋅ 10 −3 , 0.251 ⋅ 10 −3 , 0.325 ⋅ 10 −3 } , as highlighted by the three vertical lines in Fig. 13a, seem to dominate the low wave number range for v 3 . At p * 2 , a slight bump is visible for velocity component v 1 at k w (p * 2 ) = 0.113 ⋅ 10 −3 (indicated by the vertical line in Fig. 13b). A similar behavior, although with a slightly higher wave number, is present for p * 1 .
The results show the MRT method together with mesh refinement method and the Smagorinsky SGS model for LES computations to be a suitable combination to compute such an intricate flow over a highly complicated geometry.

Summary and conclusion
Over the last three decades, the LB method has grown mature and is nowadays used in a variety of physics and engineering applications. Methods for simple flow problems, flow in highly complex geometries, multi-phase, low-and high Mach, and Reynolds number flows have been developed. Furthermore, the LB algorithms are easy to parallelize and show good scaling behavior.
The simulation framework ZFS is developed by the Institute of Aerodynamics and Chair of Fluid Mechanics, RWTH Aachen University, and is supported by JARA-CSD. It features a massively parallel grid generator, which enables to construct hierarchical octree meshes on hundreds of thousands of compute cores in a small amount of time for the simulation of large-scale problems. It shows good scalability behavior on present HPC systems. Furthermore, parallelized geometries can easily be generated, which enable to massively reduce the memory footprint and the preprocessing times of simulations. For the simulation, different methods have been implemented with LB methods primarily being used for the simulation in complex geometries. The standard SRT, MRT, CLB, and RLB models as well as the thermal MDF approach, the refinement strategy, and the Smagorinsky LES approach for LB methods have been presented. All methods are hybrid MPI/OpenMP parallelized, use parallel I/O for output and feature internal data usage for in situ analyses of the results. The scalability of the SRT method has been shown up to 16,384 nodes on the PowerPC-based JUQUEEN system at JSC and up to 512 nodes on the Intel-based HAZEL HEN machine at HLRS. To show the applicability of these methods, three simulation examples have been presented.
The first example considered the flow in the human respiratory tract. Highly resolved simulations were carried out and flow regime transitions such as laminar to transitional flow regions could be identified. The second example showed the applicability to porous media flows, i.e., the flow in GDLs of fuel cells has been investigated. The method showed to be capable of dealing with highly dense fiber structures. The flow in GDLs is dominated by low Reynolds numbers and diffusive effects. The third example is the flow around a front landing gear configuration of a VTOL jet airplane. The resolution of the simulation has been discussed by an analysis of the non-dimensional wall distance. The flow has been analyzed qualitatively by contours of the -criterion and in-plane velocities as well as quantitatively by considering the turbulent kinetic energy derived from the Reynolds stress tensor and power spectral density analyses.
To conclude, with ZFS, an efficient multi-tool framework to implement various LB methods has been established. It allows to simulate laminar to transitional flow in and around complex geometries at low to medium as well  as higher Reynolds numbers leading to turbulent flow and at low Mach numbers. The modularity of the framework, however, allows for an easy integration of methods such as the Cumulant LB or the interface tracking method for high Reynolds number and compressible, and multi-phase flows. That is, an extension towards aeronautical engineering applications with realistic inflight conditions is straightforward and is line with the research and development strategy of the developers. Note that ZFS is already capable of simulating such problems by means of an FV method solving the compressible Navier-Stokes equations, see, e.g., [64]. Interfaces of moving boundaries can be tracked by a level-set approach, cf. [29]. Particles can be traced with a Lagrangian approach [55]. Aeroacoustics are simulated with a coupled FV discontinuous Galerkin direct-hybrid approach [63] solving the flow problem and the acoustic perturbation equations [16]. That is, ZFS is a tool to simulate a variety of multi-physics problems in engineering and biomedicine. The modularity of the framework allows to couple any of the described methods and for easy model and method extension.

Further fields of application and outlook
The LB method is extremely appealing for a variety of fields of application. The cumulant LB method has, e.g., been used to simulate turbulent flows over a resolved urban canopy [46]. König and Fares [38] use the LB method to simulate transonic flows around the NASA Common Research model. The method is based on the trans-and supersonic method presented in [17,67], i.e., it uses the D3Q39 model with a Hermite expansion to arrive at a fourth-order polynomial for the equilibrium PPDFs. Latt et al. [44] also use the D3Q39 model to  Fig. 10a. All quantities are derived from the nondimensional LB velocities simulate supersonic flows in a sod shock tube and around a NACA0012 profile. The LB method is also used to solve computational aeroacoustic (CAA) problems, e.g., to study the community noise of urban air transportation vehicles [8] in combination with a Ffowcs-Williams and Hawkings acoustics model [7]. Another approach is to run a direct acoustic simulation, e.g., using a potential energy double-distribution-function (DDF) LB method [48]. Furthermore, the LB method can be used to simulate microfluidic behavior, where especially multi-phase flows, cf. Sect. 1, play an important role. For example, Harting et al. [28] simulate fluid flows in hydrophobic and rough microchannels as well as over surfaces covered by nano-or microscale gas bubbles. In [32], simulations of droplets manipulation generated in lab-on-chip (LOC) microfluidic T-junction are performed. For a good review on LBmicrofluid simulations, the reader is also referred to [79]. LB methods are also well suited to simulate rarefied gas flows, where for Knudsen numbers Kn > 0.1 continuous approaches such as the FV method fail [27]. For example, rarefied gas flows in microchannels are simulated in [78] using the MRT method with a second-order slip condition. A Bosanquet-type effective viscosity is used to correlate the effective relaxation time with the local Knudsen number to account for varying rarefaction effects. In contrast, a Knudsen-number-relaxation-time relation using the viscosity-based mean free path and mean thermal speed is applied in [35] to simulate 2D and 3D microchannel flows.