1 Introduction

When solving the compressible two-phase equations, the gas, as a continuum, is best represented by a set of partial differential equations (the Navier–Stokes equations) that are numerically solved on a mesh. Thus, the gas characteristics are calculated at the mesh points within the flowfield. However, as the particles (or fragments) may be relatively sparse in the flowfield, they can be modeled by either:

  (a) A continuum description, i.e. in the same manner as the fluid flow, or

  (b) A particle (or Lagrangian) description, where individual particles (or groups of particles) are monitored and tracked in the flow.

Although the continuum (so-called multi-fluid) method has proven relatively successful for compressible two-phase flows, the inherent assumptions of the continuum approach lead to several disadvantages which may be countered with a Lagrangian treatment for dilute flows [15, 16, 23, 30, 70]. The continuum assumption cannot robustly account for local differences in particle characteristics, particularly if the particles are polydispersed. In addition, the only boundary conditions that can be considered in a straightforward manner are slipping and sticking, whereas reflection boundary conditions, such as specular and diffuse reflection, may be additionally considered with a Lagrangian approach. Turbulent dispersion can also be treated on a more fundamental basis. Finally, numerical diffusion of the particle density can be eliminated by employing Lagrangian particles due to their pointwise spatial accuracy.

While a Lagrangian approach offers many potential advantages, this method also creates problems that should be addressed by the model. For instance, large numbers of particles may cause a Lagrangian analysis to be memory intensive. This problem is circumvented by treating parcels of particles, i.e. doing the detailed analysis for one particle and then applying the effect of many. In addition, continuous mapping and remapping of particles to their respective elements may increase computational requirements, particularly for unstructured grids.

The present paper summarizes the procedures used, as well as some of the difficulties encountered, when implementing a particle description of dilute phases in a flow solver based on unstructured grids.

2 Equations describing the motion of particles

In order to describe the interaction of particles with the flow, the mass, forces and energy/work exchanged between the flowfield and the particles must be defined. Before going on, we need to define the physical parameters involved. For the fluid, we denote by \(\rho , p, e, T, k, v_i, \mu , \gamma \) and \(c_v\) the density, pressure, specific total energy, temperature, conductivity, velocity in direction \(x_i\), viscosity, ratio of specific heats and the specific heat at constant volume. For the particles, we denote by \(\rho _p, T_p, v_{pi}, d, c_p\) and \(Q\) the density, temperature, velocity in direction \(x_i\), equivalent diameter, specific heat and heat transferred per unit volume. In what follows, we will refer to particles, fragments, or chunks collectively as particles.

Making the classic assumptions that the particles may be represented by an equivalent sphere of diameter \(d\), the drag forces \(\mathbf{D}\) acting on the particles will be due to the difference of fluid and particle velocity:

$$\begin{aligned} \mathbf{D}= {{\pi d^2} \over 4} \cdot c_D \cdot { 1 \over 2} \rho | \mathbf{v}- \mathbf{v}_p | ( \mathbf{v}- \mathbf{v}_p ). \end{aligned}$$
(1)

The drag coefficient \(c_D\) is obtained empirically from the Reynolds-number \(Re\):

$$\begin{aligned} Re = {{\rho | \mathbf{v}- \mathbf{v}_p | d } \over { \mu }} \end{aligned}$$
(2)

as:

$$\begin{aligned} c_D = max\left( 0.1 , {24 \over Re} \left( 1 + 0.15 Re^{0.687} \right) \right) \end{aligned}$$

The lower bound of \(c_D=0.1\) is required to obtain the proper limit for the Euler equations, where \(Re \rightarrow \infty \).
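As an illustration, the Reynolds number of Eq. (2) and the bounded drag law can be evaluated as in the following minimal Python sketch. The function names and the small floor on \(Re\) (used only to avoid a division by zero when the particle moves with the fluid) are assumptions made for this example and are not taken from the flow solver.

```python
def reynolds(rho, v_rel, d, mu):
    """Particle Reynolds number of Eq. (2); v_rel is |v - v_p|."""
    return rho * abs(v_rel) * d / mu


def drag_coefficient(re, cd_min=0.1, re_floor=1.0e-12):
    """Empirical drag law with the lower bound cd_min for Re -> infinity.

    re_floor is an assumption of this sketch: it only guards against
    a division by zero for vanishing relative velocity.
    """
    re = max(re, re_floor)
    return max(cd_min, 24.0 / re * (1.0 + 0.15 * re ** 0.687))


# Example in CGS units: 100 nm particle in air with 100 m/s slip velocity.
re = reynolds(rho=1.2e-3, v_rel=1.0e4, d=1.0e-5, mu=1.8e-4)
cd = drag_coefficient(re)
```

For very large \(Re\) the correlation drops below the bound, so the function returns \(c_D=0.1\).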

The heat transferred between the particles and the fluid is given by

$$\begin{aligned} Q = {{\pi d^2} \over 4} \cdot \left[ h \cdot ( T - T_p ) + \sigma ^* \cdot ( T^4 - T_p^4 ) \right] , \end{aligned}$$
(3)

where \(h\) is the film coefficient and \(\sigma ^*\) the radiation coefficient. For the class of problems considered here, the particle temperature and kinetic energy are such that the radiation coefficient \(\sigma ^*\) may be ignored. The film coefficient \(h\) is obtained from the Nusselt-Number \(Nu\):

$$\begin{aligned} Nu = 2 + 0.459 Pr^{0.333} Re^{0.55}, \end{aligned}$$
(4)

where \(Pr\) is the Prandtl number of the gas, with \(\gamma c_v\) the specific heat of the gas at constant pressure,

$$\begin{aligned} Pr = {{\gamma c_v \mu } \over k}, \end{aligned}$$
(5)

as

$$\begin{aligned} h = {{ Nu \cdot k }\over d}. \end{aligned}$$
(6)

Having established the forces and heat flux, the particle motion and temperature are obtained from Newton’s law and the first law of thermodynamics. For the particle velocities, we have:

$$\begin{aligned} \rho _p {{\pi d^3} \over 6 } \cdot {{ d\mathbf{v}_p} \over {dt}} = \mathbf{D}~~. \end{aligned}$$
(7)

This implies that:

$$\begin{aligned} {{ d\mathbf{v}_p} \over {dt}} = {{3 \rho } \over {4 \rho _p d}} \cdot c_D | \mathbf{v}- \mathbf{v}_p | ( \mathbf{v}- \mathbf{v}_p ) = \alpha _v | \mathbf{v}- \mathbf{v}_p | ( \mathbf{v}- \mathbf{v}_p ). \end{aligned}$$
(8)

The particle positions are obtained from:

$$\begin{aligned} {{ d\mathbf{x}_p} \over {dt}} = \mathbf{v}_p. \end{aligned}$$
(9)

The temperature change in a particle is given by:

$$\begin{aligned} \rho _p c_p {{\pi d^3} \over 6 } \cdot {{ dT_p} \over {dt}} = Q, \end{aligned}$$
(10)

which may be expressed as:

$$\begin{aligned} {{ dT_p} \over {dt}} = {{3 k}\over {4 c_p \rho _p d^2}} \cdot Nu \cdot ( T - T_p ) = \alpha _T ( T - T_p ). \end{aligned}$$
(11)

Equations (8, 9, 11) may be formulated as a system of Ordinary Differential Equations (ODEs) of the form:

$$\begin{aligned} {{d\mathbf{u}_p} \over {dt}} = \mathbf{r}(\mathbf{u}_p, \mathbf{x}, \mathbf{u}_f), \end{aligned}$$
(12)

where \(\mathbf{u}_p, \mathbf{x}, \mathbf{u}_f\) denote the particle unknowns, the position of the particle and the fluid unknowns at the position of the particle.

3 Numerical integration

As seen above, the equations describing the position, velocity and temperature of a particle or fragment may be formulated as a system of nonlinear Ordinary Differential Equations of the form:

$$\begin{aligned} {{d\mathbf{u}_p} \over {dt}} = \mathbf{r}(\mathbf{u}_p, \mathbf{x}, \mathbf{u}_f) ~~. \end{aligned}$$
(13)

They can be integrated numerically in a variety of ways. Due to its speed, low memory requirements and simplicity, we have chosen the following k-step low-storage Runge-Kutta procedure to integrate them:

$$\begin{aligned} \mathbf{u}^{n+i}_p&= \mathbf{u}^n_p + \alpha _i \Delta t \cdot \mathbf{r}(\mathbf{u}^{n+i-1}_p, \mathbf{x}^{n+i-1}, \mathbf{u}^{n+i-1}_f), \nonumber \\ i&= 1,k, \quad \Delta \mathbf{u}^0 = 0. \end{aligned}$$
(14)

For linear ODEs the choice

$$\begin{aligned} \alpha _i= {1 \over {k+1-i}} ~~,\quad ~~ i=1,k \end{aligned}$$
(15)

leads to a scheme that is \(k\)-th order accurate in time. Note that in each stage the location of the particle with respect to the fluid mesh needs to be updated in order to obtain the proper values for the fluid unknowns. The default number of stages used is \(k=4\). This would seem unnecessarily high, given that the flow solver is of second-order accuracy, and that the particles are integrated separately from the flow solver before the next (flow) timestep, i.e. in a staggered manner. However, it was found that the 4-stage particle integration preserves the motion in vortical structures very well and leads to less ‘wall sliding’ close to the boundaries of the domain. The stability/accuracy of the particle integrator should not be a problem, as the particle motion will always be slower than the maximum wave speed of the fluid (fluid velocity + speed of sound).
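The scheme of Eqs. (14) and (15) can be sketched in a few lines of Python. This is a simplified illustration: the right-hand side is assumed to depend on the particle unknowns only, whereas in the solver the host element and the fluid unknowns are re-evaluated in every stage. The function name `rk_low_storage` is hypothetical.

```python
import math


def rk_low_storage(rhs, u0, dt, k=4):
    """One timestep of the k-stage low-storage scheme, Eqs. (14)-(15).

    Every stage restarts from u^n and uses only the previous stage
    value, so two copies of the unknowns suffice.
    """
    u = u0
    for i in range(1, k + 1):
        alpha = 1.0 / (k + 1 - i)
        u = u0 + alpha * dt * rhs(u)
    return u


# Linear test problem du/dt = -u: the 4-stage scheme reproduces the
# Taylor series of exp(-dt) up to and including the dt^4 term.
u1 = rk_low_storage(lambda u: -u, 1.0, 0.1)
error = abs(u1 - math.exp(-0.1))
```

For the linear problem the result equals the truncated Taylor series \(1 + z + z^2/2 + z^3/6 + z^4/24\) with \(z=-\Delta t\), which is the sense in which the scheme is \(k\)-th order accurate for linear ODEs.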

The transfer of forces and heat flux between the fluid and the particles must be accomplished in a conservative way, i.e. whatever is added to the fluid must be subtracted from the particles and vice-versa. The Finite Element Discretization of the fluid equations will lead to a system of ODEs of the form:

$$\begin{aligned} \mathbf{M}\Delta \mathbf{u}= \mathbf{r}, \end{aligned}$$
(16)

where \(\mathbf{M}, \Delta \mathbf{u}\) and \(\mathbf{r}\) denote, respectively, the consistent mass matrix, increment of the unknowns vector and right-hand side vector. Given the ‘host element’ of each particle, i.e. the fluid mesh element that contains the particle, we add the forces and heat transferred to \(\mathbf{r}\) as follows:

$$\begin{aligned} \mathbf{r}^i_D = \sum _{el surr i} N^i(\mathbf{x}_p) \mathbf{D}_p . \end{aligned}$$
(17)

Here \(N^i(\mathbf{x}_p)\) denotes the shape-function values of the host element for the point coordinates \(\mathbf{x}_p\). As the sum of all shape-function values is unity at every point:

$$\begin{aligned} \sum N^i(\mathbf{x}) = 1 \quad \forall \mathbf{x}, \end{aligned}$$
(18)

this procedure is strictly conservative.
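The conservation property can be checked with a small Python sketch for a linear triangle. The element type, array layout and function names are assumptions of this example; the solver itself works with its own element library.

```python
import numpy as np


def shape_functions_tri(xp, tri_xy):
    """Linear shape functions N^i(xp) of a triangle with vertices tri_xy (3x2)."""
    a, b, c = tri_xy
    jac = np.column_stack((b - a, c - a))
    n1, n2 = np.linalg.solve(jac, xp - a)
    return np.array([1.0 - n1 - n2, n1, n2])


def scatter_drag(r, host_nodes, tri_xy, xp, drag):
    """Add the drag of one particle to the nodal right-hand side r (npoin x 2)."""
    weights = shape_functions_tri(xp, tri_xy)
    for node, w in zip(host_nodes, weights):
        r[node] += w * drag
    return r


tri_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r = np.zeros((4, 2))
drag = np.array([2.0, -1.0])
scatter_drag(r, [0, 2, 3], tri_xy, np.array([0.2, 0.3]), drag)
# The nodal contributions add up to the particle force, as implied by Eq. (18).
total = r.sum(axis=0)
```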

The change in momentum and energy for one particle is given by:

$$\begin{aligned} \mathbf{f}_p&= \rho _p {{\pi d^3}\over 6} {{\left( \mathbf{v}^{n+1}_p - \mathbf{v}^n_p \right) } \over {\Delta t}},\end{aligned}$$
(19)
$$\begin{aligned} q_p&= \rho _p c_p {{\pi d^3}\over 6} {{\left( T^{n+1}_p - T^n_p \right) } \over {\Delta t}}. \end{aligned}$$
(20)

These quantities are multiplied by the number of particles in a packet in order to obtain the final values transmitted to the fluid. Before going on, we summarize the basic steps required in order to update the particles one timestep:

  • Initialize Fluid Source-Terms: \(\mathbf{r}=0\)

  • DO: For Each Particle:

    • DO: For Each Runge-Kutta Stage:

      1. Find Host Element of Particle: IELEM, \(N^i(\mathbf{x})\)

      2. Obtain Fluid Variables Required

      3. Update Particle: Velocities, Position, Temperature, ...

    • ENDDO

    • Transfer Loads to Element Nodes

  • ENDDO
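The steps above can be sketched for a one-dimensional mesh as follows. Several simplifications are assumed for this illustration: the factor \(\alpha _v\) of Eq. (8) is taken as constant, only velocity and position are updated, the loads are transferred with the shape functions of the final host element, and the fluid receives the negative of the momentum change of the parcel. All names are hypothetical.

```python
import numpy as np


def locate(x_nodes, x):
    """Host element index and local coordinate of position x on a 1-D mesh."""
    e = int(np.clip(np.searchsorted(x_nodes, x) - 1, 0, len(x_nodes) - 2))
    xi = (x - x_nodes[e]) / (x_nodes[e + 1] - x_nodes[e])
    return e, xi


def update_parcels(x_nodes, v_fluid, xp, vp, n_per_parcel, mass, alpha_v, dt, k=4):
    """Advance all parcels one timestep; return the nodal momentum sources."""
    r = np.zeros(len(x_nodes))                  # initialize fluid source terms
    for p in range(len(xp)):                    # loop over particles
        x0, v0 = xp[p], vp[p]
        x, v = x0, v0
        for i in range(1, k + 1):               # loop over Runge-Kutta stages
            e, xi = locate(x_nodes, x)          # 1. host element
            vf = (1.0 - xi) * v_fluid[e] + xi * v_fluid[e + 1]  # 2. fluid data
            a = 1.0 / (k + 1 - i)
            dvdt = alpha_v * abs(vf - v) * (vf - v)
            x, v = x0 + a * dt * v, v0 + a * dt * dvdt          # 3. update
        xp[p], vp[p] = x, v
        # Momentum change of one particle, Eq. (19), times the parcel size;
        # the fluid receives the opposite sign (assumption of this sketch).
        f = n_per_parcel[p] * mass * (v - v0) / dt
        e, xi = locate(x_nodes, x)
        r[e] -= (1.0 - xi) * f
        r[e + 1] -= xi * f
    return r


x_nodes = np.linspace(0.0, 1.0, 11)
xp, vp = np.array([0.25]), np.array([0.0])
r = update_parcels(x_nodes, np.ones(11), xp, vp, [10], 2.0, 1.0, 0.01)
```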

4 Particle parcels

For a large number of very small particles, it becomes impossible to carry every individual particle in a simulation. The solution to this dilemma is to:

  (a) Agglomerate the particles into so-called packets of \(N_p\) particles;

  (b) Integrate the governing equations for one individual particle; and

  (c) Transfer back to the fluid \(N_p\) times the effect of one particle.

Beyond a reasonable number of particles per element (typically \(>8\)), this procedure produces accurate results without any deterioration in physical fidelity.

4.1 Agglomeration/subdivision of particle parcels

As the fluid mesh may be adaptively refined and coarsened in time, or the particle traverses elements of different sizes, it may be important to adapt the parcel concentrations as well. This is necessary to ensure that there is sufficient parcel representation in each element and yet, that there are not too many parcels as to constitute an inefficient use of CPU and memory. For example, as an element with parcels is refined by one level (the maximum is typically four or five levels of refinement) to yield eight new elements, the number of parcels per new element will be significantly reduced if no parcel adaption is employed. This can lead to a reduction in local spatial accuracy, especially if no parcels are left in one or more of the new elements.

In order to locally determine if a refinement or a coarsening of parcels is to be performed, the number of parcels in each element is checked and modified either after a set number of timesteps or after each mesh adaptation/change.

5 Limiting during particle updates

As the particles are integrated independently from the flow solver, it is not difficult to envision situations where, for the extreme cases of very light or very heavy particles, physically meaningless or unstable results may be obtained.

5.1 Small/light particles

In order to see the difficulties that can occur with very small and/or light particles, consider an impulsive start from rest. This situation can happen when a shock enters a dusty zone. The friction forces are proportional to the difference of fluid and particle velocities to the 2nd power, and to the diameter of the particle to the 2nd power. The mass of the particle, however, is proportional to the diameter of the particle to the 3rd power. If the timestep is large and the particle very light, after a timestep (or Runge-Kutta substep) the velocity of the particle may exceed the velocity of the fluid. This is clearly impossible and is only due to the discretization error of the numerical integration in time (i.e. the timestep is too large). The same can happen to the temperature (and diameter, in the case of burning particles) of the particle.

It would be impractical (and unnecessary) to reduce the timestep so as to achieve high temporal accuracy throughout the calculation. After all, for the case of a shock entering a quiescent dusty zone the timestep would have to be reduced until the shock has traversed the complete region. To avoid such overshoots without reducing the timestep, the changes in particle velocities and temperatures are limited so that they do not exceed the differences in velocities and temperature between the particles and the fluid. Assume (in 1D) a difference of velocities at time \(t=t^n\):

$$\begin{aligned} \Delta v^n = v^n - v^n_p. \end{aligned}$$
(21)

Furthermore, assume that the particles are updated before the flow. The particle velocity is then limited as follows:

  • If: \(v^n_p < v^n \) \(\Rightarrow \) \(v^{n+1}_p \le v^n\)

  • If: \(v^n_p > v^n \) \(\Rightarrow \) \(v^{n+1}_p \ge v^n\)

This limiting procedure is applied to each of the Runge–Kutta stages.
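In one dimension the limiter amounts to a simple clipping operation, as in the following Python sketch (the function name is hypothetical). The same clipping would be applied to the particle temperature.

```python
def limit_particle_velocity(vp_old, vp_new, v_fluid):
    """Clip the updated particle velocity so it cannot overshoot the fluid velocity."""
    if vp_old < v_fluid:
        return min(vp_new, v_fluid)
    if vp_old > v_fluid:
        return max(vp_new, v_fluid)
    return vp_new  # no slip velocity, hence nothing to limit


# Explicit update of a very light particle with a timestep that is too large:
# the unlimited value 1.5 exceeds the fluid velocity 1.0 and is clipped.
v_limited = limit_particle_velocity(vp_old=0.0, vp_new=1.5, v_fluid=1.0)
```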

5.2 Large/heavy/many particles

Consider now the opposite case. Assume that the particles are started impulsively from rest (e.g. by a shock entering a quiescent dusty region), but that there are many of these and/or they are large or heavy. In this case, when the drag force is added back to the fluid, a flow reversal could occur if the timestep is too large (if the particles are accelerated the flow is decelerated). To prevent this unphysical (and unstable) phenomenon from happening, the source-terms are limited. This is done by comparing the resulting source-terms for the momentum and energy equations of the fluid with the fluid velocities and temperature. Assuming we know the source-terms for the particles \(\mathbf{s}_v, s_T\) and the current timestep \(\Delta t\), the procedure is as follows:

  • Obtain the average particle velocities and temperatures at the points of the flow mesh \(\mathbf{v}_p, T_p\).

  • Obtain the change in flow velocities and temperatures if only the source-terms from the particles are added, e.g. for the velocities:

    $$\begin{aligned} \mathbf{M}\rho \Delta \mathbf{v}= \mathbf{s}_v ; \end{aligned}$$
    (22)
  • Obtain the allowed increase/decrease factors from:

    $$\begin{aligned} \alpha _v = {{| (\mathbf{v}_p - \mathbf{v}) \cdot \Delta \mathbf{v}|} \over {\Delta \mathbf{v}\cdot \Delta \mathbf{v}}};\quad \alpha _T = {{| T_p - T |} \over {| \Delta T |}}; \end{aligned}$$
    (23)
  • Limit the allowed increase/decrease factor:

    $$\begin{aligned} \alpha _v = max(0,min(1,\alpha _v)), \quad \alpha _T = max(0,min(1,\alpha _T)). \end{aligned}$$
    (24)

This assures that the source-terms added to the momentum and energy equation remain bounded. While this procedure works very well, avoiding instabilities, it is non-conservative.
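A possible realization of Eqs. (22)-(24) for the momentum source-terms is sketched below in Python. Two assumptions are made that are not stated in the text: a lumped (diagonal) mass matrix is used to solve Eq. (22), and the limited source-term is obtained by multiplying \(\mathbf{s}_v\) with the factor \(\alpha _v\). The function name and array layout are hypothetical.

```python
import numpy as np


def limit_momentum_source(s_v, m_lumped, rho, v, v_p):
    """Limit nodal momentum sources s_v (npoin x ndim) following Eqs. (22)-(24)."""
    dv = s_v / (m_lumped * rho)[:, None]             # Eq. (22), lumped mass
    num = np.abs(np.sum((v_p - v) * dv, axis=1))
    den = np.sum(dv * dv, axis=1)
    alpha = np.ones_like(den)                        # no source: nothing to limit
    np.divide(num, den, out=alpha, where=den > 0.0)  # Eq. (23)
    alpha = np.clip(alpha, 0.0, 1.0)                 # Eq. (24)
    return alpha[:, None] * s_v, alpha               # scaling is an assumption


s_v = np.array([[-40.0, 0.0], [-5.0, 0.0], [0.0, 0.0]])
v = np.array([[10.0, 0.0]] * 3)
v_p = np.zeros((3, 2))
s_lim, alpha = limit_momentum_source(s_v, np.ones(3), np.ones(3), v, v_p)
```

At the first point the unlimited source would change the fluid velocity by \(-40\) although the particles are only \(10\) slower than the fluid, so the source is scaled by \(0.25\); at the second point the change stays within the velocity difference and the source is left untouched.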

6 Particle contact

In some situations, the density of the particles increases to the point that they occupy essentially all of the available volume. Although such high density situations are outside the scope of the underlying theory, production runs require techniques that can cope with them. What happens physically is that at some point particles come into contact with one another, thereby limiting the achievable density and volume-fill ratio of particles.

6.1 Particle forces due to contact

In order to approximate the forces exerted by the contact, the first measure that has to be obtained is the equivalent radius. After all, we are computing packets of particles. Some of these packets represent hundreds or thousands of actual particles. Given \(n_p\) particles of diameter \(d_p\), the volume occupied by them is given by:

$$\begin{aligned} V = { n_p \over {\alpha _K}} {\pi \over 6} d_p^3, \end{aligned}$$
(25)

where \(\alpha _K\) is the maximum filling factor (whose theoretical limit for spheres is the Kepler limit of \(\alpha _K \approx 0.74\)). The equivalent radius is therefore given by:

$$\begin{aligned} r^a = \left[ {3 \over {4 \pi }} V \right] ^{1/3} = \left[ n_p \over {8 \alpha _K} \right] ^{1/3} d_p. \end{aligned}$$
(26)

Given two particle packets with positions \(\mathbf{x}_i, \mathbf{x}_j\), the overlap distance is given by:

$$\begin{aligned} do_{ij} = r^a_i + r^a_j - d_{ij}, \quad d_{ij} = |\mathbf{x}_i - \mathbf{x}_j|. \end{aligned}$$
(27)

The average overlap between particles is then:

$$\begin{aligned} do^s_{ij} = do_{ij} {1 \over 2} \left( { 1 \over {n_p}_i } + { 1 \over {n_p}_j } \right) . \end{aligned}$$
(28)

Defining a unit normal \(\mathbf{n}\) in the direction \(i,j\) as:

$$\begin{aligned} \mathbf{n}_{ij} = { {\mathbf{x}_i - \mathbf{x}_j} \over {|\mathbf{x}_i - \mathbf{x}_j|} }, \end{aligned}$$
(29)

the relative velocity of the particles defines a tangential direction \(\mathbf{t}\):

$$\begin{aligned} \mathbf{v}_{ij} = \mathbf{v}_{j} - \mathbf{v}_{i},\quad v^n_{ij} = \mathbf{v}_{ij} \cdot \mathbf{n}_{ij},\quad \mathbf{v}^t_{ij}&= \mathbf{v}_{ij} - v^n_{ij} \cdot \mathbf{n}_{ij},\nonumber \\ \mathbf{t}_{ij}&= {{\mathbf{v}^t_{ij}} \over {|\mathbf{v}^t_{ij}|}}. \end{aligned}$$
(30)

The normal and tangential forces are:

$$\begin{aligned} f^n_{ij}&= {1 \over 2} \left( k_i + k_j \right) do^s_{ij},\end{aligned}$$
(31)
$$\begin{aligned} f^t_{ij}&= {1 \over 2} \left( h_i + h_j \right) f^n_{ij}, \end{aligned}$$
(32)

where \(k,h\) refer to the stiffness and damping specified. The tangential force is limited so as to avoid a reversal in relative tangential velocities:

$$\begin{aligned} f^t_{ij} = min\left( f^t_{ij} , {{|\mathbf{v}^t_{ij}|} \over {{\Delta t}\cdot max(m_i,m_j)}} \right) . \end{aligned}$$
(33)

A damping force is added in the normal direction in order to avoid ‘ringing’. This force is given by:

$$\begin{aligned} f^{nd}_{ij} = - v^n_{ij} {{(h_i+h_j) \cdot (k_i+k_j)} \over {(m_i+m_j)}}, \end{aligned}$$
(34)

and is limited to the lowest possible value of damping in order to avoid a reversal of the contact force due to velocity damping:

$$\begin{aligned} f^{nd}_{ij} = max\left( - f^n_{ij} , f^{nd}_{ij} \right) . \end{aligned}$$
(35)

The complete force is then given by:

$$\begin{aligned} \mathbf{f}_{ij} = ( f^n_{ij} + f^{nd}_{ij} )\cdot \mathbf{n}_{ij}+ f^t_{ij} \cdot \mathbf{t}_{ij}. \end{aligned}$$
(36)

The particles are stored in a bin in order to quickly find the particles in the vicinity of any given particle.
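The geometric part of the contact model, Eqs. (25)-(31), can be sketched in Python as follows. Only the undamped normal force is shown. Returning zero force when the packets do not overlap is an assumption of this sketch, as are the function names; coincident packet centres are not handled.

```python
import numpy as np

ALPHA_K = 0.74  # maximum filling factor (Kepler limit) used in Eq. (25)


def equivalent_radius(n_p, d_p, alpha_k=ALPHA_K):
    """Equivalent radius of a packet of n_p particles of diameter d_p, Eq. (26)."""
    return (n_p / (8.0 * alpha_k)) ** (1.0 / 3.0) * d_p


def normal_contact_force(x_i, x_j, n_i, n_j, d_i, d_j, k_i, k_j):
    """Normal force magnitude and unit normal between two packets, Eqs. (27)-(31)."""
    dist = np.linalg.norm(x_i - x_j)
    overlap = equivalent_radius(n_i, d_i) + equivalent_radius(n_j, d_j) - dist
    if overlap <= 0.0:
        return 0.0, np.zeros_like(x_i)                      # packets do not touch
    overlap_single = overlap * 0.5 * (1.0 / n_i + 1.0 / n_j)  # Eq. (28)
    n_ij = (x_i - x_j) / dist                               # Eq. (29)
    return 0.5 * (k_i + k_j) * overlap_single, n_ij         # Eq. (31)


x_i = np.array([0.0, 0.0, 0.0])
x_j = np.array([1.5, 0.0, 0.0])
f_n, n_ij = normal_contact_force(x_i, x_j, 5.92, 5.92, 1.0, 1.0, 100.0, 100.0)
```

In this example each packet has an equivalent radius of one particle diameter, so the packets overlap by \(0.5\) and the normal points from packet \(j\) to packet \(i\).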

6.2 Estimating contact stiffness and damping parameters

The estimation of the required particle contact stiffness and damping parameters presents an interesting challenge. The measured values for contact stiffness may be very high, forcing a reduction of the allowable timestep. Therefore, an attempt was made to obtain values that would avoid penetration, yet allow the usual CFL-based flowfield timesteps to be kept. Let us consider a stationary particle in a flow with density \(\rho _f\) and velocity \(v_f\). In this case the force exerted by the fluid is given by:

$$\begin{aligned}&D = {{\pi d^2} \over 4} \cdot c_D \cdot { 1 \over 2} \rho _f v_f^2\end{aligned}$$
(37)
$$\begin{aligned}&c_D = max\left( 0.1 , {24 \over Re} \left( 1 + 0.15 Re^{0.687} \right) \right) , Re = {{\rho _f v_f d } \over {\mu }}.\nonumber \\ \end{aligned}$$
(38)

The stiffness required in order to avoid a penetration distance of \(\xi d\) is then:

$$\begin{aligned} k \xi d = D, \end{aligned}$$
(39)

or

$$\begin{aligned} k = {{D} \over {\xi d}}. \end{aligned}$$
(40)

For the low and high Reynolds-number regimes, we obtain:

$$\begin{aligned} k_{Re<1} \approx {{ 3 \pi \mu v} \over \xi }, \quad k_{Re \gg 1} \approx {{ 0.1 \pi d \rho v^2} \over { 8 \xi }}. \end{aligned}$$
(41)
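The stiffness estimate of Eqs. (37)-(40) and its two limits in Eq. (41) can be evaluated with a short Python sketch. The function name and the sample values (in CGS units) are chosen for illustration only.

```python
import math


def contact_stiffness(rho_f, v_f, d, mu, xi):
    """Stiffness k = D / (xi d) for a stationary particle in a flow, Eqs. (37)-(40)."""
    re = rho_f * abs(v_f) * d / mu
    c_d = max(0.1, 24.0 / re * (1.0 + 0.15 * re ** 0.687))
    drag = math.pi * d ** 2 / 4.0 * c_d * 0.5 * rho_f * v_f ** 2
    return drag / (xi * d)


xi = 0.01  # allowed penetration as a fraction of the diameter (sample value)

# Low Reynolds number (Re = 1e-6): approaches 3 pi mu v / xi.
k_low = contact_stiffness(1.0e-3, 1.0, 1.0e-7, 1.0e-4, xi)
k_low_limit = 3.0 * math.pi * 1.0e-4 * 1.0 / xi

# High Reynolds number (Re = 1e7): equals 0.1 pi d rho v^2 / (8 xi).
k_high = contact_stiffness(1.0e-3, 1.0e5, 10.0, 1.0e-4, xi)
k_high_limit = 0.1 * math.pi * 10.0 * 1.0e-3 * 1.0e5 ** 2 / (8.0 * xi)
```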

7 Accounting for void fractions

The volume the fluid can occupy in any given region is reduced by the presence of particles. In the original derivation of the theory the assumption of a very dilute solid phase was made. This implied that all volume (or void fraction) effects could be neglected. As users keep pushing up the particle loading, and with it the volume occupied by the particles, this assumption is no longer valid and the effect of the void fraction has to be accounted for in the flow solver. Given a volume \(V\), occupied to the extent \(V_f\) by a fluid and \(V_p\) by particles, the void fraction \(\epsilon \) is defined as:

$$\begin{aligned} \epsilon = {V_f \over V} = {{ V - V_p } \over {V}}. \end{aligned}$$
(42)

The Navier–Stokes equations for the case of noticeable void fractions are given by [28, 69]:

$$\begin{aligned} \mathbf{u}_{,t} + \nabla \cdot ( \mathbf{F}^a - \mathbf{F}^v ) = \mathbf{S}, \end{aligned}$$
(43)

where

$$\begin{aligned}&\mathbf{u}= \epsilon \left\{ \rho ~,~ \rho v_i ~,~ \rho e \right\} ,\nonumber \\&\mathbf{F}^a_j= \epsilon \left\{ \rho v_j ~,~ \rho v_i v_j + p \delta _{ij} ~,~ v_j (\rho e +p) \right\} ,\nonumber \\&\mathbf{F}^v_j= \epsilon \left\{ 0 ~,~ \sigma _{ij} ~,~ v_l \sigma _{lj} + k T_{,j}\right\} .\end{aligned}$$
(44)
$$\begin{aligned}&\mathbf{S}= \left\{ 0 , p \epsilon _{,x} \,+\, \epsilon \rho g_x ~,~ p \epsilon _{,y}\right. \nonumber \\&\left. \quad +\, \epsilon \rho g_y,~ p \epsilon _{,z} \,+\, \epsilon \rho g_z,~ (p \epsilon _{,j} + \epsilon \rho g_j) v_j\right\} . \end{aligned}$$
(45)

Here \(\rho , p, e, T, k, v_i, g_i\) denote the density, pressure, specific total energy, temperature, conductivity, fluid velocity and gravity in direction \(x_i\) respectively. This set of equations is closed in the usual way by providing an equation of state for the pressure and by relating the stress tensor \(\sigma _{ij}\) to the deformation rate tensor.

8 Particle tracking

A common feature of all particle-grid applications is that the particles do not move far between timesteps. This makes physical sense: if a particle jumped ten gridpoints during one timestep, it would have no chance to exchange information with the points along the way, leading to serious errors. Therefore, the assumption that the new host elements of the particles are in the vicinity of the current ones is a valid one. For this reason, the most efficient way to search for the new host elements is via the vectorized neighbour-to-neighbour algorithm described in [33, 45] (see Fig. 1). The idea is to search for the host elements of as many particles as possible. The obstacle to this approach is that not every particle will find its host element in the same number of attempts or passes over the particles. The solution is to reorder the particles to be interpolated after each pass so that all particles that have not yet found their host element are at the top of the list.

Fig. 1 Particle tracing on unstructured grid
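The basic neighbour-to-neighbour walk can be sketched for a single particle on a triangular mesh as follows. This scalar Python version only illustrates the idea; it does not include the vectorization and particle reordering described above. The data layout is an assumption: `neigh[e][i]` holds the element adjacent to the face opposite local node `i` of element `e`, or `-1` on the boundary.

```python
import numpy as np


def shape_functions(xp, xy):
    """Linear shape functions of a triangle with vertex coordinates xy (3x2)."""
    a, b, c = xy
    n1, n2 = np.linalg.solve(np.column_stack((b - a, c - a)), xp - a)
    return np.array([1.0 - n1 - n2, n1, n2])


def find_host(xp, start, points, tris, neigh, tol=1.0e-12, max_hops=1000):
    """Walk from element start towards xp; return (host, N) or (-1, None)."""
    e = start
    for _ in range(max_hops):
        n = shape_functions(xp, points[tris[e]])
        worst = int(np.argmin(n))
        if n[worst] >= -tol:
            return int(e), n                 # all shape functions >= 0: inside
        e = neigh[e][worst]                  # cross the face opposite that node
        if e < 0:
            return -1, None                  # left the mesh through a boundary
    return -1, None


# Strip of two unit squares, each split into two triangles.
points = np.array([[0, 0], [1, 0], [2, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
tris = np.array([[0, 1, 4], [0, 4, 3], [1, 2, 5], [1, 5, 4]])
neigh = np.array([[3, 1, -1], [-1, -1, 0], [-1, 3, -1], [-1, 0, 2]])
host, n = find_host(np.array([1.8, 0.2]), 1, points, tris, neigh)
```

Starting in element 1, the walk passes through elements 0 and 3 before finding the host element 2.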

9 Particles and shared memory parallel machines

For shared memory parallel machines, the ‘find host element’ technique described above can be used directly. One only has to make sure that sufficiently long vectors are obtained so that even on tens of cores the procedure is efficient. For the ‘particle loads to fluid nodes’ assembly, the particle loads are first accumulated according to elements. Thereafter, these element loads are added to the fluid nodes using the standard mesh coloring techniques [45].

10 Particles and GPUs

Almost all physics subroutines employed by the authors in the FEFLO code have been ported to GPUs using the semi-automatic tool F2CUDA [49, 51, 58, 59]. Particles on an unstructured mesh represent two ‘irregular’ data structures. Therefore, it is not surprising that porting the particle modules to GPUs required some effort. In order to pre-sort these particles, heavy use was made of the CUDA Thrust library, in particular the thrust::copy_if algorithm. A further algorithmic difficulty was encountered during the step that adds the source-terms (forces, heat fluxes) from particles to points. After all, many particles could reside in an element, adding repeatedly to points. Such memory contention inhibits straightforward parallelization directly over the particles, necessitating a more advanced algorithm.

One solution is to use the colouring/grouping of elements (used to avoid memory conflicts during scatter-add operations) and the host element information of each particle to presort the order in which the particle source-terms are added. The problem with this approach is that the number of particles in each element is not the same. Therefore, besides being difficult to vectorize, large load imbalances may occur.

An alternative approach is to apply data-parallel algorithms provided by the Thrust library [29]. In particular, prior to scattering particle-point contributions, they are first written to a temporary array along with the point index to which the contributions should be scattered. Next, the contributions array is sorted by the point indexes using thrust::sort_by_key, which is based on the Merrill-radix-sort algorithm [57]. With particle-point contributions now arranged consecutively in memory, the next step is to sum the contributions to each point using the thrust::reduce_by_key algorithm. Given that not necessarily all points will receive a contribution from particles, it is necessary to then perform a thrust::scatter to map the computed particle contributions to the appropriate points. While most of FEFLO is automatically ported to the GPU via F2CUDA [49, 51], this is one instance where this is not the case. The employed translator allows for incorporating manual overrides of the original Fortran code. This is done here by expressing the algorithms in terms of standardized data-parallel primitives. These primitives are highly non-trivial to implement efficiently, but are easily accessible via the Thrust library. An advantage of this approach is that future performance improvements to the Thrust library will be immediately reflected in the GPU version of FEFLO.
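The sort, reduce and scatter sequence can be mimicked on the CPU with NumPy, as in the following sketch. This is only an analogue of the Thrust primitives named above, intended to show the data flow; it is not the GPU code, and the function name is hypothetical.

```python
import numpy as np


def scatter_add_sorted(point_idx, contrib, npoin):
    """Sum particle contributions per point via sort, segmented reduce, scatter."""
    r = np.zeros(npoin)
    if len(point_idx) == 0:
        return r
    order = np.argsort(point_idx, kind="stable")    # analogue of sort_by_key
    keys, vals = point_idx[order], contrib[order]
    # First entry of every run of equal keys.
    starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
    sums = np.add.reduceat(vals, starts)            # analogue of reduce_by_key
    r[keys[starts]] = sums                          # analogue of scatter
    return r


point_idx = np.array([3, 1, 3, 0, 1, 3])
contrib = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
r = scatter_add_sorted(point_idx, contrib, npoin=5)
```

Points 2 and 4 receive no contribution and keep a zero entry, which is why the final scatter step is needed after the reduction.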

We remark here, as we have done on several occasions before, that favourable GPU timings require that all operations be performed on the GPU.

11 Particles and distributed memory parallel machines

Porting particle tracing and particle/fluid interaction options to a distributed memory environment within a domain decomposition framework/approach requires the proper transfer of particles from one domain to the other, with the associated extra coding. FEFLO uses overlap of domains that is one layer thick. While this has many advantages, in the case of particles one faces the problem that a particle could be in several domains. In order to arrive at a fast particle transmission algorithm, the following procedure was implemented:

  • Obtain all the elements that border/overlap other domains;

  • Order these border elements according to the communication passes that exchange mesh information with neighbours;

  • Obtain all the particles in each element;

  • In each exchange pass:

    • From the list of border elements and the list of particles in each element: assemble all particles that need to be sent to the neighbouring domain;

    • Exchange the information of how many particles will be sent and received in this pass;

    • Send/receive the particles from the neighbouring domains;

  • Remove the duplicate particles residing in the domain.

A number of problems had to be overcome before this procedure would work reliably and without excessive CPU requirements:

  • Duplicate particles: after exchanging the particles, the same particle may appear repeatedly in the same processor. For example, the particle may be moving along the border of two domains, or may move slowly (which implies it was also sent from the neighbouring domain in the previous timestep). The best solution for this dilemma is to assign to each particle a so-called ‘unique universal number’ (UUN). The particles are then traversed, and those whose UUN is already in the list are discarded.

  • Particles with the same location: due to flow physics and/or geometric singularities, distinct particles may end up in the same location. This is avoided by traversing all elements, finding the particles in each, and then separating in each element those that are too close together. This last step is done by assigning a small random change to the shape-functions of the particles at the current location.
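The removal of duplicates via the UUN can be sketched in Python as follows. The particle record (a dictionary with a `uun` entry) and the function name are assumptions made for this example.

```python
def remove_duplicates(particles):
    """Keep the first occurrence of every UUN and discard later copies."""
    seen = set()
    unique = []
    for p in particles:
        if p["uun"] not in seen:
            seen.add(p["uun"])
            unique.append(p)
    return unique


received = [
    {"uun": 17, "x": 0.10},
    {"uun": 42, "x": 0.55},
    {"uun": 17, "x": 0.10},  # same particle, sent again by a neighbouring domain
]
local = remove_duplicates(received)
```

Using a hash set keeps the cost of the traversal linear in the number of particles.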

It was observed that the parallel particle update modules required a considerable amount of CPU resources. For cases where the particle count per processor was in the millions (and this happens frequently), an update could take 2–3 min (!). After extensive recoding and optimization for both MPI and OMP, the same update was reduced to 2–3 s per update.

12 Examples

The techniques described above were implemented in FEFLO, a general-purpose CFD code based on the following general principles:

  • Use of unstructured grids (automatic grid generation and mesh refinement);

  • Finite element discretization of space;

  • Separate flow modules for compressible and incompressible flows;

  • Edge-based data structures for speed;

  • Optimal data structures for different architectures;

  • Bottom-up coding from the subroutine level to assure an open-ended, expandable architecture.

The code has had a long history of relevant applications involving compressible flow simulations in the areas of transonic flow [37, 52–56], store separation [4, 7, 9, 11, 12], blast–structure interaction [3, 5, 6, 8, 10, 13, 14, 40, 46, 63, 65, 67], incompressible flows [2, 39, 42, 50, 60, 62, 66], free-surface hydrodynamics [36, 43, 44], dispersion [17–20, 41], patient-based haemodynamics [1, 21, 22, 37, 47] and aeroacoustics [31]. The code has been ported to vector [38], shared memory [35, 64, 68], distributed memory [34, 48, 60, 61] and GPU-based [24–27, 49] machines.

12.1 Shock into dust

This case considers a 60-cm-square rigid shaft, with the axis running in the \(x\)-direction. The shaft is considered as semi-infinite starting at \(x=-250~cm\). The gas is air, and treated as a perfect gas obeying \(p=\rho ~R~T\) with \(R=2.869\cdot 10^6~dynes-cm/g-K\) and \(\gamma =1.4\). The air for \(x>0\) is initially at \(p=1.01\cdot 10^6~dynes/cm^2\) and \(T=15.15^oC=288.3~K\). The air for \(x<0\) is initially at \(p=4000~psi= 2.7579\cdot 10^8~dynes/cm^2\) and \(T=1430.6~K\). Initially the region \(x>250.0~cm\) is filled with a mixture of air and uniformly distributed dust. The dust particles have a density of \(\rho _p=2.3~g/cm^3\) and a diameter of \(d=100~nm\). The average mass loading of dust inside the disk is \(0.1~g/cm^3\). The dust particles therefore occupy a volume fraction of \(0.1/2.3=0.0435\) within the dusty region, sufficiently low for a dilute species assumption. At time \(t=0\) the gases are allowed to begin interacting as in a Riemann problem, launching a shock wave propagating in the \(+x\) direction. Although the problem is one-dimensional, it was run using the 3-D code. The results obtained have been summarized in Fig. 2a–c, which show the variables along the centerline of the tube for different times. The emergence of the classical Riemann problem is visible, followed by slowdown and partial reflection due to the presence of the particles. This is a high-loading case (the density of the particles is roughly 100x the ambient density of air), and can lead to instabilities. In order to trigger these, we ran on a Cartesian mesh split into tetrahedra, with 2 \(\times \) 2 \(\times \) 2 particles per Cartesian cell. Note the emergence of oscillations in the velocities as the shock enters the region of quiescent particles. Without the limiters described above, this type of run would fail.

Fig. 2 a Tube with particles: fluid density. b Tube with particles: fluid velocity. c Tube with particles: fluid pressure

This case was repeated with 4 \(\times \) 4 \(\times \) 4 particles per Cartesian cell. The results are shown in Fig. 3. Note that the oscillations have largely disappeared.

Fig. 3 a Tube with particles: fluid density. b Tube with particles: fluid velocity. c Tube with particles: fluid pressure. d Tube with particles: dust velocity. e Tube with particles: dust density

12.2 Blast in room with dilute material

This example considers the flow and particle transport resulting from a blast in a room where dilute material has been deposited. The powder-like material is modeled via particles. The geometry, together with the solution, can be discerned from Fig. 4.

Fig. 4 Blast in room: pressures and particle velocities

The compressible Euler equations are solved using an edge-based FEM-FCT technique [32, 45, 52]. The initialization was performed by interpolating the results of a very detailed 1D (spherically symmetric) run. The particles are transported using a 4th order Runge-Kutta technique. The timing studies (summarized in Table 1) were carried out with the following set of parameters:

  • Compressible Euler

  • Ideal Gas EOS

  • Explicit FEM–FCT

  • Initialization from a 1D file

  • 4.0 Million elements, 93,552 particles

  • Run for 60 steps

Table 1 Blast in room with dilute material

13 Conclusions and outlook

The treatment of dilute solid (or liquid) phases via Lagrangian particles within mesh-based gas-dynamics (or hydrodynamic) codes is common in computational fluid dynamics. While these techniques work very well for a large spectrum of physical parameters, in some cases, notably for very light or very heavy particles, numerical instabilities appear. The present paper has examined ways of mitigating these instabilities. Furthermore, important implementational issues were summarized.

Current efforts are directed at porting all compressible flow modules to account for volume blockage effects, and the link to chemical reactions with burning particles.