After having discussed the discretization in time and space (i.e. mass, for Lagrangian schemes) of the evolution equations, we now turn to the problem of computing the gravitational interactions of the simulated mass distribution. This step is usually the most time-consuming aspect of a modern N-body simulation and thus also where most numerical approximations are made and where various parallelization strategies have the largest impact. Depending on the problem at hand, the targeted numerical accuracy, and the computer architecture employed, several different methods exist that are in different senses ‘optimal’. Modern state-of-the-art codes typically exploit all these existing techniques. Optimal algorithmic complexity, \(\mathcal {O}(N)\) in the number N of particles, is achieved e.g. by the Fast Multipole Method (FMM), which is very promising for simulations with very large particle counts and is used e.g., in the PKDGRAV3 or Gadget-4 codes, and by the geometric multigrid method, used e.g., in the RAMSES code. The newest codes also readily utilise thousands of GPUs to generate simulated Universes for the upcoming generations of cosmological observations.
In the following, we provide a brief overview of the main methods and the key ideas behind them. For further details on gravity calculations in N-body simulations, we also refer the reader to other reviews (Dehnen and Read 2011).
Mesh-based methods
A robust and fast method to solve for the gravitational interactions of a periodic system is provided by the particle-mesh (PM) method (Doroshkevich et al. 1980; Hockney and Eastwood 1981). Derived from the particle-in-cell (PIC) technique developed in plasma physics, it is among the oldest numerical methods employed to study cosmological structure formation. The techniques described here can be employed not only for N-body discretisations, but are readily applicable also e.g. to full phase space or integer lattice methods (cf. Sects. 3.2 and 3.3), see also Miller and Prendergast (1968), and even in the case of Schrödinger-Poisson systems (Woo and Chiueh 2009).
Force and potential determination—spectral calculation
Considering a periodic domain of side length L, we want to solve the cosmological Poisson equation (Eq. 32b). Assume that both density \(\rho \) and potential \(\phi \) are periodic in \([-L/2,L/2)\) and can be expanded in a Fourier series, i.e.
$$\begin{aligned} \rho (\varvec{x})=\sum _{\varvec{n}\in \mathbb {Z}^3} \tilde{\rho }_{\varvec{n}}\exp \left( \mathrm{i} k_0\, \varvec{x}\cdot \varvec{n}\right) ,\quad \text {with}\quad k_0:=\frac{2\pi }{L} \end{aligned}$$
(56)
and identically for \(\phi (\varvec{x})\) with coefficients \(\tilde{\phi }_{\varvec{n}}\). It then follows from Poisson’s equation (Eq. 17) that their Fourier coefficients obey the algebraic relation
$$\begin{aligned} -k_0^2\left\| \varvec{n}\right\| ^2\,\tilde{\phi }_{\varvec{n}} = 4\pi G a^{-1} \left( \tilde{\rho }_{\varvec{n}} - \overline{\rho }\,\delta _D(\varvec{n}) \right) \quad \text {for all}\quad \varvec{n}\in \mathbb {Z}^3. \end{aligned}$$
(57)
This equation imposes the consistency condition \(\tilde{\rho }_{\varvec{n}=\varvec{0}}=\overline{\rho }\), i.e. the mean Poisson source must vanish. In practice, this is achieved in PM codes by explicitly setting to zero the \(\varvec{n}=0\) mode (a.k.a. the “DC mode”, in analogy to AC/DC electric currents). For the acceleration field \(\varvec{g} = -\nabla \phi \), one finds \(\tilde{\varvec{g}}_{\varvec{n}} = -\mathrm{i}k_0 \varvec{n} \tilde{\phi }_{\varvec{n}}\). The solution for potential and acceleration can thus be conveniently computed using the Discrete Fourier transform (DFT) as
$$\begin{aligned} \tilde{\phi }_{\varvec{n}} = \left\{ \begin{array}{cl} -\frac{4\pi G }{a k_0^2} \frac{\tilde{\rho }_{\varvec{n}}}{\Vert \varvec{n}\Vert ^2}&{} \quad \text {if}\quad \varvec{n}\ne \varvec{0} \\ 0 &{} \quad \text {otherwise } \end{array} \quad , \right. \qquad \tilde{\varvec{g}}_{\varvec{n}} = \left\{ \begin{array}{cl} \frac{4\pi G}{a k_0} \frac{\mathrm{i}\,\varvec{n}\tilde{\rho }_{\varvec{n}}}{\Vert \varvec{n}\Vert ^2}&{} \quad \text {if}\quad \varvec{n}\ne \varvec{0} \\ 0 &{} \quad \text {otherwise } \end{array} \right. . \end{aligned}$$
(58)
If one considers a uniform spatial discretisation of both potential \(\phi _{\varvec{m}}:=\phi _{i,j,k} := \phi (\varvec{m} h)\) and density \(\rho _{\varvec{m}}\), with \(i,j,k\in [0\dots N_g-1]\), mesh index \(\varvec{m}:=(i,j,k)^T\), and grid spacing \(h:=L/N_g\), then the solution can be directly computed using the Fast-Fourier-Transform (FFT) algorithm at \(\mathcal {O}(M\log M)\) for \(M=N_g^3\) grid points. Many implementations exist; the FFTW libraryFootnote 11 (Frigo and Johnson 2005) is one of the most commonly used, with support for multi-threading and MPI. In the case of the DFT, the Fourier sum is truncated at the Nyquist wave number, so that \(\varvec{n} \in (-N_g/2,N_g/2]^3\).
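The spectral solution of Eq. (58) can be sketched in a few lines of numpy (a minimal illustration in units with \(G=a=1\); all function and variable names are ours, not those of any particular code):

```python
import numpy as np

def poisson_solve_fft(rho, L, G=1.0, a=1.0):
    """Solve the periodic Poisson equation with the spectral Green's
    function of Eq. (58): phi_k = -(4 pi G / a) rho_k / k^2 for k != 0."""
    Ng = rho.shape[0]
    k1 = 2.0 * np.pi * np.fft.fftfreq(Ng, d=L / Ng)   # k = k0 * n per axis
    kx, ky, kz = np.meshgrid(k1, k1, k1, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                                 # avoid 0/0 below
    phi_k = -(4.0 * np.pi * G / a) * np.fft.fftn(rho) / k2
    phi_k[0, 0, 0] = 0.0    # zero the DC mode: the mean source must vanish
    return np.real(np.fft.ifftn(phi_k))

# single-mode check: rho = cos(k0 x) must yield phi = -(4 pi G/(a k0^2)) cos(k0 x)
Ng, L = 16, 1.0
x = np.arange(Ng) * (L / Ng)
k0 = 2.0 * np.pi / L
rho = np.tile(np.cos(k0 * x)[:, None, None], (1, Ng, Ng))
phi = poisson_solve_fft(rho, L)
```

For a single Fourier mode the DFT is exact up to round-off, so the recovered potential matches the analytic amplitude \(-4\pi G/(a k_0^2)\).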
Note that instead of the exact Fourier-space Laplacian, \(-k_0^2 \Vert \varvec{n} \Vert ^2\), which is implicitly truncated at the Nyquist wave numbers, sometimes a finite difference version is used in PM codes such as Fast-PM (Feng et al. 2016) (cf. Sect. 4.4). Inverting the second-order accurate finite difference Laplacian in Fourier space yieldsFootnote 12
$$\begin{aligned} \tilde{\phi }_{\varvec{n}}^{\mathrm{FD2}} = \left\{ \begin{array}{cl} -\frac{\pi G \varDelta x^2 }{a} \;\tilde{\rho }_{\varvec{n}}\;\left( \sin ^2\left[ \frac{\pi n_x}{N_g} \right] + \sin ^2\left[ \frac{\pi n_y}{N_g} \right] + \sin ^2\left[ \frac{\pi n_z}{N_g} \right] \right) ^{-1}&{} \quad \text {if}\quad \varvec{n}\ne \varvec{0} \\ 0 &{} \quad \text {otherwise. } \end{array} \right. \end{aligned}$$
(59)
This kernel has substantially suppressed power on small scales compared to the Fourier space Laplacian, which reduces aliasing (see the discussion in the next section). It also reduces the effect of anisotropies due to the mesh on grid scales.
Solving Poisson’s equation in Fourier space with FFTs becomes less efficient if boundary conditions are not periodic, or if spatial adaptivity is necessary. For isolated boundary conditions, the domain has to be zero-padded to twice its size per linear dimension, which in three dimensions increases the memory footprint by a factor of eight. This is a problem on modern architectures, where memory is expensive and slow while floating-point operations are comparatively cheap. A further problem of FFT methods is their parallelization: a multidimensional FFT requires a global transpose of the array. This leads to a very non-local communication pattern and the need to transfer all of the data multiple times between compute nodes per force calculation.
Additionally, if high resolution is required, as is often the case in cosmology due to the nature of gravity as an attractive force, the size of the grid can quickly become the computational bottleneck. One possibility is to introduce additional higher-resolution meshes (Jessop et al. 1994; Suisalu and Saar 1995; Pearce and Couchman 1997; Kravtsov et al. 1997; Teyssier 2002), deposit particles onto them, and then solve using an adaptive “relaxation method” such as the adaptive multigrid method (see below), or by employing the periodic FFT solution as a boundary condition. Adaptive algorithms are typically more complex due to the more complicated data structures involved.
It is also possible to employ another (or many more) Fourier mesh extended over a particular region of interest in a so-called “zoom simulation”, cf. Sect. 6.3.4, if higher force resolution is required in a few isolated subregions of the simulation volume. A problem related to this method is that, for a finite grid resolution, Fourier modes beyond the Nyquist frequency of the grid will be incorrectly aliased to modes supported by the Fourier grid (Hockney and Eastwood 1981), which causes a biased solution to the Poisson equation. The magnitude of this aliasing effect depends on the mass assignment scheme and can be reduced when PM codes are complemented with other force calculation methods, as discussed below in Sect. 5.3, since then the PM force is usually UV truncated.
Instead of adding a fine mesh on a single region of interest, it is possible to add it everywhere in space. This approach is known as two-level PM or PMPM, and has been used for carrying out Cosmo-\(\pi \), the largest N-body simulation to date (cf. Sect. 10). This approach has the advantage that, for a cubical domain decomposition, all the operations related to the fine grid can be performed locally, i.e. without communication among nodes in distributed-memory systems, which might result in significant advantages, especially when employing hundreds of thousands of compute nodes.
For full phase-space techniques, the PM approach is also preferable if a regular mesh already exists in configuration space onto which the mass distribution can be easily projected. The Fourier space spectral solution of the Poisson equation can also be readily employed in the case of Schrödinger–Poisson discretisations on a regular grid. In this case, the Poisson source is computed from the wave function which is known on the grid, so that \(\rho _{\varvec{m}} = \psi _{\varvec{m}} \psi _{\varvec{m}}^*\).
Mass assignment schemes
Grid-based methods always rely on a charge assignment scheme (Hockney and Eastwood 1981) that deposits the mass \(M_i\) associated with a particle i at location \(\varvec{X}_i\) by interpolating the particle masses in a conservative way to grid point locations \(\varvec{x}_{\varvec{n}}\) (where \(\varvec{n}\in \mathbb {N}^3\) is a discrete index, such that e.g. \(\varvec{x}_{\varvec{n}} = \mathbf {n}\,\varDelta x\) in the simplest case of a regular (cubic) grid of spacing \(\varDelta x\)). This gives a charge assignment of the form
$$\begin{aligned} \rho _{\varvec{n}} = \int _{\mathbb {R}^3} \mathrm{d}^3x^{\prime}\,\hat{\rho }(\varvec{x}^{\prime}) \,W_{3D}(\varvec{n}\,\varDelta x-\varvec{x}^{\prime})\quad \text {with}\quad \hat{\rho }(\varvec{x}):=\sum _{i=1}^N M_i \delta _D(\varvec{x}-\varvec{X}_i), \end{aligned}$$
(60)
where the periodic copies in the density were dropped since periodic boundary conditions are assumed in the Poisson solver. Charge assignment to a regular mesh is equivalent to a single convolution if \(M_i=M\) is identical for all particles. The most common particle-grid interpolation functions (cf. Hockney and Eastwood 1981) of increasing order are given for each spatial dimension byFootnote 13
$$\begin{aligned} W_{\mathrm{NGP}}(x)= & {} \frac{1}{\varDelta x}\left\{ \begin{array}{ll} 1 &{} \quad {\text {for}}\,\left| x \right| \le \frac{\varDelta x}{2}\\ 0 &{} \quad {\text {otherwise}} \end{array}\right. \end{aligned}$$
(61a)
$$\begin{aligned} W_{\mathrm{CIC}}(x)= & {} \frac{1}{\varDelta x}\left\{ \begin{array}{ll} 1-\frac{\left| x\right| }{\varDelta x} &{}\quad {\text {for}}\,\left| x\right| < \varDelta x \\ 0 &{}\quad \text {otherwise} \end{array}\right. \end{aligned}$$
(61b)
$$\begin{aligned} W_{\mathrm{TSC}}(x)= & {} \frac{1}{\varDelta x}\left\{ \begin{array}{ll} \frac{3}{4} - \left( \frac{x}{\varDelta x}\right) ^2 &{} \quad {\text {for}}\,\left| x\right| \le \frac{\varDelta x}{2}\\ \frac{1}{2}\left( \frac{3}{2} - \frac{\left| x\right| }{\varDelta x}\right) ^2 &{} \quad \text {for }\frac{\varDelta x}{2}\le \left| x \right| < \frac{3\varDelta x}{2}\\ 0 &{} \quad {\text {otherwise}} \end{array}\right. \end{aligned}$$
(61c)
$$\begin{aligned} W_{\mathrm{PCS}}(x)= & {} \frac{1}{\varDelta x}\left\{ \begin{array}{ll} \frac{1}{6} \left[ 4 - 6\left( \frac{x}{\varDelta x}\right) ^2 + 3 \left( \frac{|x|}{\varDelta x}\right) ^3 \right] &{} \quad {\text {for}}\,\left| x\right| \le \varDelta x\\ \frac{1}{6}\left( 2 - \frac{\left| x\right| }{\varDelta x}\right) ^3 &{} \quad {\text {for}}\, \varDelta x \le |x| < 2 \varDelta x\\ 0 &{} \quad {\text {otherwise}} \end{array}\right. \end{aligned}$$
(61d)
The three-dimensional assignment function is then just the product \(W_{3D}(\varvec{x})=W(x)\,W(y)\,W(z)\), where \(\varvec{x}=(x,y,z)^T\). It can easily be shown that interpolation with these operators increases the regularity of the deposited density field, and thus also has a smoothing effect on the resulting effective gravitational force. This can also be seen directly from the Fourier transform of the assignment functions, which have the form (per dimension)
$$\begin{aligned} \tilde{W}_{n}(k) = \left[ \mathrm{sinc }\frac{\pi }{2}\frac{k}{k_\mathrm{Ny}}\right] ^n\quad \text {with}\quad \mathrm{sinc}\,x = \frac{\sin x}{x}. \end{aligned}$$
(62)
where \(n=1\) for NGP, \(n=2\) for CIC, \(n=3\) for TSC, \(n=4\) for PCS interpolation, and \(k_{\mathrm{Ny}}:=\pi /\varDelta x\) is the Nyquist wave number. NGP leads to a piecewise constant, CIC to a piecewise linear, TSC to a piecewise quadratic (i.e., continuous in value and first derivative), and PCS to a piecewise cubic acceleration as a particle moves between grid points. The real-space and Fourier-space shapes of the kernels are shown in Fig. 6. Note that the support is always \(n \varDelta x\), i.e. n cells, per dimension and thus increases with the order, and by the central limit theorem \(\tilde{W}_n\) converges to a normal distribution as \(n\rightarrow \infty \). Hence, going to higher order can impact memory locality and communication ghost zones negatively. Since an a priori unknown number of particles might deposit to the same grid cell, special care needs to be taken to make the particle projection thread-safe in shared-memory parallelism (Ferrell and Bertschinger 1994).
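For concreteness, the one-dimensional kernels of Eqs. (61a–61d) can be transcribed directly, and the conservative nature of the assignment can be verified numerically: the deposit weights of a particle at an arbitrary position always sum to unity. A numpy sketch (function names are ours):

```python
import numpy as np

dx = 1.0   # grid spacing (arbitrary units)

def W_ngp(x):
    return np.where(np.abs(x) <= dx / 2, 1.0, 0.0) / dx

def W_cic(x):
    return np.where(np.abs(x) < dx, 1.0 - np.abs(x) / dx, 0.0) / dx

def W_tsc(x):
    u = np.abs(x) / dx
    return np.where(u <= 0.5, 0.75 - u**2,
                    np.where(u < 1.5, 0.5 * (1.5 - u)**2, 0.0)) / dx

def W_pcs(x):
    u = np.abs(x) / dx
    return np.where(u <= 1.0, (4.0 - 6.0 * u**2 + 3.0 * u**3) / 6.0,
                    np.where(u < 2.0, (2.0 - u)**3 / 6.0, 0.0)) / dx

# weights onto neighbouring grid points j*dx for several particle positions;
# mass conservation requires each set of weights to sum to unity
j = np.arange(-4, 5)
weight_sums = {W.__name__: [np.sum(W(xp - j * dx)) * dx
                            for xp in (0.0, 0.137, 0.49, 0.73)]
               for W in (W_ngp, W_cic, W_tsc, W_pcs)}
```

The partition-of-unity property holds because CIC, TSC, and PCS are B-splines of increasing order (NGP being the zeroth member of the family).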
As an alternative to these mass assignment kernels for particles, it is possible to project phase-space tessellated particle distributions (cf. Sect. 3.2) exactly onto the force grid (Powell and Abel 2015; Sousbie and Colombi 2016). In practice, when using such sheet tessellation methods, for a given set of flow tracers, the phase-space interpolation can be constructed and sampled with M “mass carrying” particles which can then be deposited onto the grid. Since the creation of mass carriers is a local operation, M can be arbitrarily large, and thus the noise associated with N-body discreteness can be reduced systematically. This approach has been adopted by Hahn et al. (2013) and Angulo et al. (2013b) to simulate warm dark matter while suppressing artificial fragmentation, as we will discuss in greater detail in Sect. 7.3.
The same mass assignment schemes can be used in reverse, to interpolate values of a discrete field back to the particle positions \(\left\{ \varvec{X}_i\right\} \). It has to be ensured that the same order is used for both mass deposit and interpolation of the force to the particle positions, i.e., that deposit and interpolation are mutually inverse. This is an important consistency requirement since, otherwise, (1) exact momentum conservation is not guaranteed, and (2) self-forces can occur, allowing particles to accelerate themselves (cf. Hockney and Eastwood 1981). It is important to note that, due to the grid discretisation, particle separations that are unresolved by the discrete grid are aliased to the wrong wave numbers, which e.g. can cause certain Fourier modes to grow at the wrong rate. Aliasing can be ameliorated by filtering out scales close to the Nyquist frequency, or by using interlacing techniques, in which multiple shifted deposits are combined such that the leading-order aliasing contributions cancel (Chen et al. 1974; Hockney and Eastwood 1981). Such techniques are important also when estimating Fourier-space statistics (i.e., poly-spectra) from density fields obtained using the above deposit techniques (see Sect. 9 for a discussion).
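The consistency argument can be made concrete in one dimension: if the same CIC kernel is used for deposit and for force interpolation, the PM self-force of an isolated particle vanishes identically (up to round-off), and the total momentum change of a particle pair cancels exactly. A minimal sketch with \(G=a=1\) (our own function names):

```python
import numpy as np

def cic_deposit(X, M, Ng, L):
    """CIC deposit of particle masses onto a 1D periodic density mesh."""
    dx = L / Ng
    rho = np.zeros(Ng)
    for xp, m in zip(X, M):
        i = int(np.floor(xp / dx)); f = xp / dx - i
        rho[i % Ng] += m * (1.0 - f) / dx
        rho[(i + 1) % Ng] += m * f / dx
    return rho

def pm_accel_mesh(rho, L):
    """Mesh acceleration g = -grad(phi) from the spectral Poisson solution."""
    Ng = len(rho)
    k = 2.0 * np.pi * np.fft.fftfreq(Ng, d=L / Ng)
    rho_k = np.fft.fft(rho)
    k[0] = 1.0                       # avoid 0/0; DC mode zeroed below
    phi_k = -4.0 * np.pi * rho_k / k**2
    phi_k[0] = 0.0
    return np.real(np.fft.ifft(-1j * k * phi_k))

def cic_interp(g, X, L):
    """Interpolate the mesh field back to particle positions (same kernel)."""
    Ng = len(g); dx = L / Ng
    out = []
    for xp in X:
        i = int(np.floor(xp / dx)); f = xp / dx - i
        out.append(g[i % Ng] * (1.0 - f) + g[(i + 1) % Ng] * f)
    return np.array(out)

X, M, Ng, L = [0.3371], [1.0], 32, 1.0
g = pm_accel_mesh(cic_deposit(X, M, Ng, L), L)
self_force = cic_interp(g, X, L)[0]
```

The cancellation follows from the symmetry of the combined kernel: the deposit window is even while the spectral gradient is odd, so the double mesh sum is antisymmetric under exchange of the two mesh indices.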
Relaxation methods and multi-scale
In order to overcome the limitations of Fourier-space solvers (in particular, the large cost of the global data transpose, along with the lack of spatial adaptivity), a range of other methods have been developed. The requirement is that the Poisson source is known on a grid, which can also be an adaptively refined ‘AMR’ grid structure. On the grid, a finite difference version of the Poisson equation is then solved; e.g., for a second-order approximation in three dimensions, the solution is given by the finite difference equation:
$$\begin{aligned} \phi _{i-1,j,k}+\phi _{i+1,j,k}+\phi _{i,j-1,k}+\phi _{i,j+1,k}+\phi _{i,j,k-1}+\phi _{i,j,k+1}-6\phi _{i,j,k} = \varDelta x^2\,f_{i,j,k} \,, \end{aligned}$$
(63)
where indices refer to grid point locations as above, \(\varDelta x\) is the grid spacing, and \(f_{i,j,k} := 4\pi G (\rho _{i,j,k}-\overline{\rho })/a\) is the Poisson source. This can effectively be written as a matrix inversion problem \(\mathsf{\varvec {A}} \phi = f\) where the finite difference stencil gives rise to a sparse matrix \(\mathsf{\varvec {A}}\) and the solution sought is \(\phi =\mathsf{\varvec {A}}^{-1}f\). Efficient methods exist to solve such equations. A particularly powerful one, that can directly operate even on an AMR structure, is the adaptive multigrid method (Brandt 1977; Trottenberg et al. 2001), which is used e.g., by the RAMSES code (Teyssier 2002). It combines simple point relaxation (e.g., Jacobi or Gauss–Seidel iterations) with a hierarchical coarsening procedure which spreads the residual correction exponentially fast across the domain. Some additional care is required at the boundaries of adaptively refined regions. Here the resolution of the mesh changes, typically by a linear factor of two, and interpolation from the coarser grid to the ghost zones of the fine grid is required. In the one-way interface type of solvers, the coarse solution is obtained independently of the finer grid, and then interpolated to the finer grid ghost zones to serve as the boundary condition for the fine solution (Guillet and Teyssier 2011), but no update of the coarse solution is made based on the fine solution. This approach is particularly convenient for block-stepping schemes (cf. Sect. 4.2.2) where each level of the grid hierarchy has its own time step by solving e.g. twice on the fine level while solving only once on the coarse. A limitation of AMR grids is however that the force resolution can only change discontinuously by the refinement factor, both in time—if one wants to achieve a resolution that is constant in physical coordinates—and in space—as a particle moves across coarse-fine boundaries. 
On the other hand, AMR grids contain self-consistently an adaptive force softening (see Sect. 8.2), if the refinement strategy is tied to the local density or other estimators (Hobbs et al. 2016).
Depending on the fragmentation of the finer levels due to the dynamic adaptivity, other solvers can be more efficient than multigrid, such as direct relaxation solvers (Kravtsov et al. 1997) or conjugate gradient methods. However, it is in principle more accurate to account for the two-way interface and allow for a correction of the coarse potential from the fine grid as well, as discussed e.g. by Johansen and Colella (1998), Miniati and Colella (2007). Note that, once a deep grid hierarchy has developed, global Poisson solves in each fine time step are usually prohibitively expensive. For this reason, optimizations are often employed to solve for the gravitational acceleration of only a subset of particles in multi-stepping schemes. In the case of AMR, some care is necessary to also interpolate boundary conditions in time to avoid possible spurious self-interactions of particles.
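A single relaxation sweep for the discrete system (63) is easy to write down; a full multigrid solver wraps such sweeps in a hierarchy of coarser grids, which we do not reproduce here. A numpy sketch with periodic boundaries (names are ours):

```python
import numpy as np

def fd_laplacian(phi, dx):
    """Second-order 7-point Laplacian of Eq. (63), periodic boundaries."""
    nb = sum(np.roll(phi, s, axis=ax) for ax in range(3) for s in (-1, 1))
    return (nb - 6.0 * phi) / dx**2

def jacobi_sweep(phi, f, dx):
    """One Jacobi relaxation sweep: solve the stencil equation (63) for
    phi_{i,j,k} using the current values of the six neighbours."""
    nb = sum(np.roll(phi, s, axis=ax) for ax in range(3) for s in (-1, 1))
    return (nb - dx**2 * f) / 6.0

# the exact solution of the discrete system is a fixed point of the sweep,
# and one sweep reduces the residual of a perturbed initial guess
Ng, L = 32, 1.0
dx = L / Ng
xg = np.arange(Ng) * dx
phi_exact = np.tile(np.sin(2.0 * np.pi * xg / L)[:, None, None], (1, Ng, Ng))
f = fd_laplacian(phi_exact, dx)
phi0 = phi_exact + 0.1 * np.random.default_rng(0).standard_normal(phi_exact.shape)
phi1 = jacobi_sweep(phi0, f, dx)
```

Plain Jacobi (or Gauss–Seidel) damps only the high-frequency error components efficiently; it is precisely the coarsening step of multigrid that turns the remaining smooth error into high-frequency error on coarser levels, yielding the exponentially fast spreading of the residual correction mentioned above.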
Direct P2P summation
As discussed above, mesh-based methods bring along an additional discretisation of space. This can be avoided by computing interactions directly at the particle level from Eqs. (32b–33). In this case, the gravitational potential at particle i’s location, \(\varvec{X}_i\), is given by the sum over the contributions of all the other particles in the system, along with all periodic replicas of the finite box, i.e.
$$\begin{aligned} \phi (\varvec{X}_i) = - a^{-1} \sum _{\varvec{n}\in \mathbb {Z}^3} \left[ \sum _{\substack{j=1 \\ j\ne i}}^N\frac{G M_j}{\Vert \varvec{X}_i-\varvec{X}_j-\varvec{n}L \Vert } + \varphi _{\mathrm{box},L}(\varvec{X}_i-\varvec{n}L)\right] . \end{aligned}$$
(64)
Note that we neglected force softening for the moment, i.e. we set \(W(\varvec{x})=\delta _D(\varvec{x})\). Here \(\varphi _{\mathrm{box},L}\) is the potential due to a box \([0,L)^3\) of uniform background density \(\overline{\rho }=\varOmega _m\rho _c\) that guarantees that the density \(\rho -\overline{\rho }\) sourcing \(\phi \) vanishes when integrated over the box.
This double sum is slowly convergent with respect to \(\varvec{n}\), and in general there can be spurious forces arising from a finite truncation [but note that the sum is unconditionally convergent if the box has no dipole, e.g., Ballenegger (2014)]. A fast and exact way to compute this expression is provided by means of an Ewald summation (Ewald 1921), in which the sum is replaced by two independent sums, one in Fourier space for the periodic long-range contribution, and one in real space for the non-periodic local contribution, both of which converge rapidly. It is then possible to rewrite Eq. (64) employing the position of the nearest replica, which results in pairwise interactions with a modified gravitational potential. This potential needs to be computed numerically; in GADGET3, for instance, it is tabulated and then interpolated at runtime, whereas GADGET4 relies on a look-up table of a Taylor expansion with analytic derivatives of the Ewald potential. We summarise in more detail how this is achieved in Sect. 5.3, where we discuss in particular how the FFT can be efficiently used to execute the Fourier summation.
This direct summation of individual particle-particle forces is \(\mathcal {O}(N^2)\), i.e. quadratic in the number of particles, and thus quickly becomes computationally prohibitive. In addition, since it is a highly non-local operation, it would require a considerable amount of inter-process communication. In practice, this method is sometimes used to compute short-range interactions, where the operation is local and can exploit the large computational power provided by GPUs. This is, for instance, the approach followed by the HACC code (Habib et al. 2016), when running one of the largest simulations to date with 3.6 trillion particles; and also by the ABACUS code (Garrison et al. 2018). Direct summation enabled by GPUs has also been adopted by Rácz et al. (2019) for compactified simulations, where there is the additional advantage that only a small subset of the volume has to be followed down to \(z=0\) (cf. Sect. 6.3.5).
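A minimal direct-summation kernel can be sketched as follows, restricted to the nearest periodic image and with a Plummer-like softening as an illustrative regularisation (i.e., the Ewald correction and background term of Eq. 64 are omitted here; \(G=a=1\), names are ours):

```python
import numpy as np

def direct_forces(X, M, L, eps):
    """O(N^2) particle-particle accelerations with Plummer-like softening,
    nearest-image convention only (no Ewald correction, no background)."""
    N = len(X)
    acc = np.zeros_like(X)
    for i in range(N):
        d = X - X[i]                     # separation vectors to all others
        d -= L * np.round(d / L)         # minimum-image convention
        r2 = np.sum(d * d, axis=1) + eps**2
        r2[i] = np.inf                   # exclude self-interaction
        acc[i] = np.sum(M[:, None] * d / r2[:, None]**1.5, axis=0)
    return acc

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (20, 3))
M = rng.uniform(0.5, 2.0, 20)
acc = direct_forces(X, M, L=1.0, eps=0.01)
```

Since the pairwise forces are exactly antisymmetric, total momentum is conserved to machine precision, which is one of the attractive features of direct summation.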
Particle mesh Ewald summation, force splitting and the P\(^3\)M method
Beyond the poor \(\mathcal {O}(N^2)\) scaling of the direct P2P summation (for which we discuss solutions below), another important limitation of the naïve direct summation is the infinite periodic contribution in Eq. (64). At the root of the solution is the Ewald summation (Ewald 1921), first used for cosmological simulations by Bouchet and Hernquist (1988), in which the total potential or acceleration is split into a short- and a long-range contribution: the short-range contribution is summed in real space, while the long-range contribution is summed in Fourier space, where it converges much faster due to its periodic character. One thus introduces a ‘splitting kernel’ S so that
$$\begin{aligned} \phi (\varvec{x}) = \phi _{\mathrm{lr}}(\varvec{x})+ \phi _{\mathrm{sr}}(\varvec{x}) := S*\phi + (1-S)*\phi . \end{aligned}$$
(65)
The long-range contribution \(\phi _{\mathrm{lr}}\) can be computed using the PM method on a relatively coarse mesh. The short-range contribution \(\phi _{\mathrm{sr}}\), on the other hand, can be computed from the direct force between particles only in their immediate vicinity, since the particles further away contribute through the PM part. Using the direct force gives rise to the P\(^3\)M method; modern codes often use instead a tree method (see next section) for the short-range force [this is e.g., what is implemented in the GADGET2 code by Springel (2005), see also Wang (2021)].
The splitting kernel effectively spreads the mass over a finite scale \(r_s\) for the long-range interaction, and corrects for the residual with the short-range interaction on scales \(\lesssim r_s\). Many choices are a priori possible; Hockney and Eastwood (1981), e.g., propose a sphere of uniformly decreasing density, or a Gaussian cloud. The latter is, e.g., used in the GADGET codes.
In terms of the Green’s function of the Laplacian \(G(\varvec{r}) = -1/(4\pi \Vert \varvec{r}\Vert )\), the formal solution for the cosmological Poisson equation reads \(\phi = \frac{4\pi G}{a} \left( \rho -\overline{\rho }\right) *G\). For a Gaussian cloud of scale \(r_s\), one has in real and Fourier space
$$\begin{aligned} S(r; r_s) = (2\pi r_s^2)^{-3/2} \exp \left( -\frac{r^2}{2r_s^2} \right) ,\quad \tilde{S}(k; r_s) = \exp \left[ -\frac{1}{2}k^2 r_s^2\right] . \end{aligned}$$
(66)
The ‘dressed’ Green’s functions \(G_{\mathrm{lr}} = G*S\) and \(G_\mathrm{sr} = G*(1-S)\) then become explicitly in real and Fourier space
$$\begin{aligned} G_{\mathrm{lr}}(r; r_s)&= - \frac{1 }{4\pi \,r} \,\mathrm{erf}\left[ \frac{r}{\sqrt{2}r_s} \right] , \quad&\tilde{G}_{\mathrm{lr}}(k; r_s)&= -\frac{1}{k^2}\exp \left[ -\frac{1}{2}k^2r_s^2 \right] , \end{aligned}$$
(67a)
$$\begin{aligned} G_{\mathrm{sr}}(r; r_s)&= - \frac{1 }{4\pi \,r} \,\mathrm{erfc}\left[ \frac{r}{\sqrt{2}r_s} \right] ,&\tilde{G}_{\mathrm{sr}}(k; r_s)&= -\frac{1}{k^2}\left( 1-\exp \left[ -\frac{1}{2}k^2r_s^2 \right] \right) . \end{aligned}$$
(67b)
Instead of the normal Green’s functions, one thus simply uses these truncated functions and obtains a hybrid solver. In order to use this approach, one chooses a transition scale of order the grid scale, \(r_s\sim \varDelta x\), and then replaces the PM Green’s function with \(G_{\mathrm{lr}}\). Instead of the particle-particle interaction in the direct summation or tree force (see below), one uses \(G_\mathrm{sr}\) for the potential, and \(\varvec{\nabla } G_{\mathrm{sr}}\) for the force.
The long-range interaction already includes the periodic Ewald summation component if solved with Fourier-space methods. For the short-range interaction, the periodic replica summation can in practice be restricted to the nearest replica, owing to the rapid convergence of the regulated interaction. In addition, since PM forces are exponentially suppressed on scales comparable to \(r_s\), which is chosen to be close to the grid spacing \(\varDelta x\), aliasing of Fourier modes is suppressed.
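The split (67a)–(67b) is straightforward to verify numerically: the two dressed Green's functions sum to the full kernel \(-1/(4\pi r)\), and the short-range part is negligible a few \(r_s\) away, which justifies the nearest-replica truncation. A sketch (function names are ours):

```python
from math import erf, erfc, pi, sqrt

def G_lr(r, rs):
    """Long-range (PM) Green's function, Eq. (67a)."""
    return -erf(r / (sqrt(2.0) * rs)) / (4.0 * pi * r)

def G_sr(r, rs):
    """Short-range (direct/tree) Green's function, Eq. (67b)."""
    return -erfc(r / (sqrt(2.0) * rs)) / (4.0 * pi * r)

rs = 1.0
# the split must reassemble into the full kernel -1/(4 pi r) at any r
split_ok = all(abs(G_lr(r, rs) + G_sr(r, rs) + 1.0 / (4.0 * pi * r)) < 1e-15
               for r in (0.3, 1.0, 3.0, 10.0))
# relative size of the short-range kernel six r_s away from a particle
tail = G_sr(6.0 * rs, rs) / (-1.0 / (4.0 * pi * 6.0 * rs))
```

Since \(\mathrm{erfc}(x)\sim e^{-x^2}/(x\sqrt{\pi })\), the residual short-range coupling at \(6\,r_s\) is already below \(10^{-8}\) of the full interaction, so neglecting more distant replicas in the short-range sum is an excellent approximation.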
Note that another, more aggressive near-far field combination is adopted by the ABACUS code. In this approach, the computational domain is first split into a uniform grid with \(K^3\) cells. Interactions of particles separated by less than approximately 2L/K are computed using direct summation (neglecting Ewald corrections); otherwise, they are computed using a high-order multipole (\(p=8\)) representation of the force field on the \(K\)-grid. Since two particles only interact via either the near- or the far-field force, and the tree structure is fixed to the K-grid, this allows for several optimizations and out-of-core computations. The price is discontinuous force errors with a non-trivial spatial dependence, as well as reduced accuracy due to the lack of Ewald corrections. This, however, might be acceptable for some applications and, as we will see in Sect. 8.5, ABACUS performs well when compared to other state-of-the-art codes.
Hierarchical tree methods
Assuming that it is acceptable to compute gravitational forces at a given specified accuracy, there are ways to circumvent the \(\mathcal {O}(N^2)\) complexity and non-locality problems of direct summation. A common approach is to employ a hierarchical tree structure to partition the mass distribution in space and compute the gravitational potential jointly exerted by groups of particles, whose potential is expanded to a given multipole order (Barnes and Hut 1986). Thus, instead of particle-particle interactions, particle-node interactions are evaluated. Since the depth of such a tree is typically \(\mathcal {O}(\log N)\), the complexity of the evaluation of all interactions can be reduced to \(\mathcal {O}(N\log N)\). This can be further reduced to an ideal \(\mathcal {O}(N)\) complexity with the fast multipole method (FMM, see below).
There are several alternatives for constructing tree structures. The most common choice is a regular octree, in which each tree level is subdivided into 8 sub-cells of equal volume; this is, for instance, used by GADGET. Another alternative, used for instance in old versions of PKDGRAV, are binary trees, in which a node is split into only two daughter cells. This in principle has the advantage of adapting more easily to anisotropic domains and of a smoother transition among levels, at the expense of a higher cost in walking the tree or the need to go to higher-order multipole expansions at fixed force error. The tree subdivision continues until a maximum number M of particles per node is reached (\(M=1\) in GADGET2-3 but higher in GADGET4 and PKDGRAV).
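The recursive octree subdivision can be sketched as follows (a structural sketch only, splitting until at most `leaf_max` particles remain per node; all names and the dictionary-based node representation are ours, not those of any production code):

```python
import numpy as np

def build_octree(X, centre, half, leaf_max=8, depth=0, max_depth=20):
    """Recursively subdivide a cubic node of half-side `half` into eight
    equal-volume children until at most leaf_max particles remain."""
    node = {"centre": centre, "half": half, "n": len(X), "children": []}
    if len(X) <= leaf_max or depth >= max_depth:
        return node
    for octant in range(8):
        sign = np.array([(octant >> b) & 1 for b in range(3)]) * 2 - 1
        c = centre + 0.5 * half * sign          # child centre
        mask = np.all((X >= c - 0.5 * half) & (X < c + 0.5 * half), axis=1)
        if mask.any():                          # skip empty octants
            node["children"].append(build_octree(
                X[mask], c, 0.5 * half, leaf_max, depth + 1, max_depth))
    return node

def count_particles(node):
    """Sum particle counts over the leaf nodes (sanity check)."""
    if not node["children"]:
        return node["n"]
    return sum(count_particles(c) for c in node["children"])

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, (500, 3))
root = build_octree(X, centre=np.full(3, 0.5), half=0.5)
```

Because the half-open child intervals tile the parent cell exactly, every particle lands in precisely one octant, and the leaves of the tree partition the particle set.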
The main advantage brought by tree methods is that the pairwise interaction can be expanded perturbatively and grouped among particles at similar locations, thus reducing dramatically the number of calculations that need to be carried out. The key philosophical difference with respect to direct summation is that one seeks to obtain the result at a desired accuracy, rather than the exact result to machine precision. This difference allows a dramatic improvement in algorithmic complexity. Another key aspect is that hierarchical trees are well suited for hierarchical (adaptive) timesteps.
Tree methods have for a long time been extraordinarily popular for evaluating the short range interactions also in hybrid tree-PM methods, as pioneered by Bagla (2002); Bagla and Ray (2003), or more recent FMM-PM (Gnedin 2019; Wang 2021; Springel et al. 2021) approaches, thus supplementing an efficient method for periodic long-range interactions with an efficient method which is not limited to the uniform coarse resolution of FFT-based approaches (or also discrete jumps in resolution of AMR approaches). We discuss some technical aspects of these methods next.
Hierarchical multipole expansion
In the ‘Barnes & Hut tree’ algorithm (Appel 1985; Barnes and Hut 1986), particle-node interactions are evaluated instead of particle-particle interactions. Let us consider a hierarchical octree decomposition of the simulation box volume \(\mathcal {V}:=[0,L_{\mathrm{box}}]^3\) at level \(\ell \) into cubical subvolumes, dubbed ‘nodes’, \(\mathcal {S}^\ell _{i=1\dots N_\ell }\) of side length \(L_{\mathrm{box}}/2^\ell \), where \(N_\ell =2^{3\ell }\), so that \(\bigcup _i \mathcal {S}^\ell _i = \mathcal {V}\) and \(\mathcal {S}^\ell _i\cap \mathcal {S}^\ell _{j\ne i} = \emptyset \) on each level gives a space partitioning. Let us consider the gravitational potential due to all particles contained in a node, \(\varvec{X}_j\in \mathcal {S}^\ell _i\). The partitioning is halted when only one (but typically a few) particle is left in a node. We shall assume isolated boundary conditions for clarity, i.e. we neglect the periodic sum in Eq. (64). Thanks to the partitioning, the gravitational interaction can be effectively localised with respect to the ‘tree node’ pivot at location \(\varvec{\lambda }\in \mathcal {S}^\ell _i\), so that the distance \(\Vert \varvec{X}_j - \varvec{\lambda } \Vert \le \sqrt{3} L_{\mathrm{box}}/2^\ell =: r_\ell \) is by definition bounded by the ‘node size’ \(r_\ell \) and can serve as an expansion parameter. To this end, one re-writes the potential due to the particles in the node subvolume \(\mathcal {S}^\ell _i\)
$$\begin{aligned} \phi ^\ell _i(\varvec{x}) \propto \sum _{\varvec{X}_j\in \mathcal {S}_i^\ell } \frac{M_j}{\Vert \varvec{x}-\varvec{X}_j\Vert } = \sum _{\varvec{X}_j\in \mathcal {S}_i^\ell } \frac{M_j}{\Vert (\varvec{x}-\varvec{\lambda })-(\varvec{X}_j-\varvec{\lambda })\Vert } = \sum _{\varvec{X}_j\in \mathcal {S}_i^\ell } \frac{M_j}{\Vert \varvec{d}+\varvec{\lambda }-\varvec{X}_j\Vert } \end{aligned}$$
(68)
where \(\varvec{d}:=\varvec{x}-\varvec{\lambda }\). This can be Taylor expanded to yield the ‘P2M’ (particle-to-multipole) kernels
$$\begin{aligned} \begin{aligned} \frac{1}{\Vert \varvec{d}+\varvec{\lambda }-\varvec{X}_j\Vert } =&\underbrace{\frac{1}{\Vert \varvec{d}\Vert }}_{\text {monopole}} + \underbrace{\frac{d_k}{\Vert \varvec{d}\Vert ^3} \left( X_{j,k}-\lambda _k\right) }_{\text {dipole}\; \mathcal {O}(r_\ell /d^2)} + \\&\quad + \underbrace{\frac{1}{2}\frac{d_kd_l}{\Vert \varvec{d} \Vert ^5} \left( 3(X_{j,k}-\lambda _k)(X_{j,l}-\lambda _l) -\delta _{kl} \Vert \varvec{X}_j-\varvec{\lambda } \Vert ^2 \right) }_{\text {quadrupole}\;\mathcal {O}(r_\ell ^2/d^3)} +\dots , \end{aligned} \end{aligned}$$
(69)
which converges quickly if \(\Vert \varvec{d}\Vert \gg r_\ell \). The multipole moments depend only on the vectors \((\varvec{X}_j-\varvec{\lambda })\) and can be pre-computed up to a desired maximum order p during the tree construction and stored with each node. In doing so, one can exploit that multipole moments are best constructed bottom-up, as they can be translated in an upward sweep to the parent pivot and then co-added; this yields the ‘upward M2M’ (multipole-to-multipole) sweep. Note that if one sets \(\varvec{\lambda }\) to be the centre of mass of each tree node, then the dipole moment vanishes. The complexity of such a tree construction is \(\mathcal {O}(N\log N)\) for N particles.
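The P2M step can be sketched as follows; this is a minimal, hypothetical NumPy illustration (the function name `p2m` is invented) that accumulates the monopole, dipole, and traceless quadrupole moments of Eq. (69) about a chosen pivot:

```python
import numpy as np

def p2m(masses, positions, pivot):
    """P2M: monopole, dipole and (traceless) quadrupole moments of Eq. (69)
    for a set of particles about a chosen pivot lambda."""
    d = positions - pivot                                  # (X_j - lambda)
    mono = masses.sum()                                    # monopole
    dip = (masses[:, None] * d).sum(axis=0)                # dipole
    # quadrupole: Q_kl = sum_j m_j (3 d_jk d_jl - delta_kl |d_j|^2)
    quad = 3.0 * np.einsum('j,jk,jl->kl', masses, d, d)
    quad -= np.eye(3) * np.einsum('j,jk,jk->', masses, d, d)
    return mono, dip, quad

# choosing the pivot at the centre of mass makes the dipole vanish
rng = np.random.default_rng(0)
m = rng.uniform(0.5, 1.5, 100)
X = rng.uniform(0.0, 1.0, (100, 3))
com = (m[:, None] * X).sum(axis=0) / m.sum()
mono, dip, quad = p2m(m, X, com)
```

As noted above, with the centre-of-mass pivot the dipole vanishes identically, and the quadrupole is traceless by construction.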
When evaluating the potential \(\phi (\varvec{x})\), one now proceeds top-down from the root node at \(\ell =0\) in a ‘tree walk’ and evaluates M2P (multipole-to-particle) interactions between the given particle and the node. Since the error in \(\phi ^\ell _i(\varvec{x})\) is \(\mathcal {O}\left( (r_\ell /d)^p \right) \), one defines a maximum ‘opening angle’ \(\theta _{\mathrm{c}}\) and accepts the multipole expansion \(\phi ^\ell _i(\varvec{x})\) as an approximation for the potential due to the mass distribution in \(\mathcal {S}^\ell _i\) only if the respective opening angle obeys
$$\begin{aligned} \frac{r_\ell }{\Vert \varvec{d}\Vert } <\theta _{\mathrm{c}}. \end{aligned}$$
(70)
Otherwise the procedure is recursively repeated with each of the eight child nodes. Since the depth of a (balanced) octree built from a distribution of N particles is typically \(\mathcal {O}(\log N)\), a full potential or force calculation has an algorithmic complexity of \(\mathcal {O}(N\log N)\) instead of the \(\mathcal {O}(N^2)\) of the direct summation. The resulting relative error in a node-particle interaction is (Dehnen 2002)
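As an illustration of the tree walk, the following is a deliberately simplified, monopole-only sketch (not any specific code's implementation; the class and function names are invented). A node is accepted whenever the opening criterion of Eq. (70) is met, and leaf nodes fall back to softened direct summation:

```python
import numpy as np

class Node:
    """Octree node over the particle subset `idx`; the pivot is the centre of mass."""
    def __init__(self, center, side, idx, pos, mass, leaf_max=8):
        self.side = side                                     # L_box / 2^l
        self.mass = mass[idx].sum()
        self.com = (mass[idx][:, None] * pos[idx]).sum(axis=0) / self.mass
        self.children, self.idx = [], idx
        if len(idx) > leaf_max:                              # split into 8 octants
            for octant in range(8):
                sign = 2 * np.array([(octant >> k) & 1 for k in range(3)]) - 1
                c = center + sign * side / 4
                sel = idx[np.all(np.abs(pos[idx] - c) <= side / 4, axis=1)]
                if len(sel) > 0:
                    self.children.append(Node(c, side / 2, sel, pos, mass, leaf_max))

def accel(node, x, pos, mass, theta_c=0.2, eps=1e-4):
    """Monopole M2P if r_l / d < theta_c, else open the node (P2P in leaves)."""
    if not node.children:                                    # leaf: direct summation
        a = np.zeros(3)
        for j in node.idx:
            dj = pos[j] - x
            a += mass[j] * dj / (dj @ dj + eps**2)**1.5      # zero for j == self
        return a
    d = node.com - x
    r_l = np.sqrt(3.0) * node.side                           # node size of Eq. (70)
    if r_l / np.linalg.norm(d) < theta_c:                    # accept multipole (M2P)
        return node.mass * d / (d @ d + eps**2)**1.5
    return sum(accel(c, x, pos, mass, theta_c, eps) for c in node.children)

# toy setup: compare the tree force on an off-centre particle to direct summation
rng = np.random.default_rng(1)
N = 400
pos, mass = rng.uniform(0, 1, (N, 3)), rng.uniform(0.5, 1.5, N)
root = Node(np.full(3, 0.5), 1.0, np.arange(N), pos, mass)
i0 = int(np.argmax(np.linalg.norm(pos - 0.5, axis=1)))
a0 = accel(root, pos[i0], pos, mass)
```

With the centre-of-mass pivot the dipole vanishes, so even this monopole-only walk achieves percent-level force errors at \(\theta_{\mathrm{c}}=0.2\) for the toy problem above.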
$$\begin{aligned} \delta \phi \le \frac{\theta _c^{p+1}}{1-\theta _c} \frac{M_\mathrm{node}}{\Vert \mathbf {d}\Vert }, \end{aligned}$$
(71)
where \(M_{\mathrm{node}}\) is the node mass (i.e. the sum of the masses of all particles in \(\mathcal {S}^\ell _i\)), and p is the order of the multipole expansion. The opening criterion of Eq. (70) is purely geometric: it is independent of the magnitude of \(M_{\mathrm{node}}\) and of the multipole moments, as well as of the actual value of the gravitational acceleration. It is thus also blind to the magnitude of the individual interactions, i.e. it neglects that far nodes contribute more to the total interaction than nearby ones.
An alternative method, proposed by Springel et al. (2001b), is to use a dynamical criterion by comparing the expected acceleration with the force error induced by a given node interaction. Specifically, when evaluating the particle-node interactions for particle j one sets
$$\begin{aligned} \theta _{\mathrm{c},j} = \left( \alpha \Vert \varvec{A}_j\Vert \frac{\Vert \varvec{d}\Vert ^2}{G M_{\mathrm{node}}}\right) ^{1/p}, \end{aligned}$$
(72)
where \(\Vert \varvec{A}_j\Vert \) is the modulus of the gravitational acceleration (which could be estimated from the force calculation performed in a previous step), and \(\alpha \) is a dimensionless parameter that controls the desired accuracy. Note, however, that for relatively isotropic mass distributions, the uncertainty of a given interaction might not be representative of the uncertainty in the total acceleration.
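In code, the criterion of Eq. (72) amounts to a one-liner; the following is a hypothetical sketch (function name and default values invented), which a tree walk would compare against \(r_\ell /\Vert \varvec{d}\Vert \):

```python
def dynamic_opening_angle(a_old, d, m_node, alpha=0.005, p=1, G=1.0):
    """Eq. (72): node-specific opening angle from a relative force-error
    criterion. `a_old` is the modulus of the particle's acceleration from
    the previous step, `d` the particle-node distance, `m_node` the node
    mass, and `alpha` the dimensionless accuracy parameter."""
    return (alpha * a_old * d**2 / (G * m_node))**(1.0 / p)
```

A node is accepted when \(r_\ell /\Vert \varvec{d}\Vert < \theta_{\mathrm{c},j}\); massive nearby nodes are therefore opened more readily than under the fixed-angle criterion.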
We highlight that the expressions (68)–(69) are valid for the non-periodic particle-node interactions, but for periodic boundary conditions additional terms arise owing to the modified Green’s function as seen in Eq. (64). The Green’s function is also modified in the case when tree interactions are combined with other methods such as PM in a tree-PM method (see Sect. 5.3). This implies in principle also modified error criteria (or opening angles), however, this is often neglected.
So far, performing the multipole expansion only to monopole order (with nodes centred at the centre of mass) has been a popular choice for N-body codes. The reason is that a second-order accurate expression is obtained with very low memory requirements (one simply needs to store the centre of mass of each tree node instead of its geometric centre), which is enough when moderate accuracy is sought. However, in search of higher accuracy, a growing number of codes have started to also consider quadrupole and octupole terms, which require more memory and computation but allow a less conservative opening criterion. This has been advocated as the optimal combination that provides the most accurate estimate at a fixed computational cost (Dehnen and Read 2011; Potter and Stadel 2016), although the precise optimal order depends on the required accuracy (Springel et al. 2021). In the future, further gains from higher-order terms can be expected as computer architectures evolve towards higher FLOP/byte ratios.
A problem for tree codes used for cosmological simulations is that on large scales and/or at high redshift the mass distribution is very homogeneous. The net acceleration of a particle is then the sum of many terms of similar magnitude but opposite sign that mostly cancel. Obtaining accurate forces therefore requires a low error tolerance, which increases the computational cost of a simulation. For instance, the Euclid Flagship simulation (Potter and Stadel 2016), which employed a pure tree algorithm (cf. Sect. 10), spends a considerable amount of time on the gravitational evolution at high redshift. Naturally, this problem is exacerbated the larger the simulation and the higher the starting redshift.
A method to address this problem, proposed by Warren (2013) and implemented in the 2HOT code, is known as “background subtraction”. The main idea is to add to each interaction the multipole expansion of a local uniform negative density, which can be computed analytically for each cubic cell in a particle-node interaction. Although this adds computational cost to the force calculation, it results in an important overall reduction of the cost of a simulation, since many more interactions can be represented by multipole approximations at high redshift. As far as we know, this approach has not been widely adopted by other codes.
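At monopole order, the idea can be sketched as follows. This is a deliberately simplified, hypothetical illustration: Warren (2013) subtracts the full analytic multipole expansion of the uniform density over the cubic cell, not just its monopole.

```python
def bg_subtracted_monopole(m_node, side, rho_mean):
    """'Background subtraction' at monopole order: remove the mass of a
    uniform background of density `rho_mean` filling a cubic node of side
    length `side`. For a near-homogeneous (high-redshift) node the net
    monopole nearly vanishes, so the residual interaction is tiny and can
    be accepted at much wider opening angles."""
    return m_node - rho_mean * side**3
```

The cancellation of many near-equal interaction terms is thereby performed analytically instead of numerically.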
A further optimization that is usually worthwhile on modern architectures is to prevent tree refinement down to single particles (for which all multipoles beyond the monopole vanish anyway). Since the most local interactions end up being effectively direct summation in any case, one can get rid of the tree overhead and retain a ‘bucket’ of \(10^{2-3}\) particles in each leaf node rather than a single individual particle. All interactions within the node, as well as those which would open child nodes, are carried out in direct summation. While algorithmically more complex, such a direct summation is memory-local and can be highly optimized and e.g. offloaded to GPUs, providing a significant speed-up over the tree.
Fast-multipole method
Despite the huge advantage with respect to direct summation, a single interaction of a particle with the tree is still computationally expensive as it has an \(\mathcal {O}(\log N)\) complexity for a well-balanced tree. Furthermore, trees as described above have other disadvantages; for instance, gravitational interactions are not strictly symmetric, which leads to a violation of momentum conservation. A solution to these limitations is provided by fast multipole methods (FMM), originally proposed by Greengard and Rokhlin (1987) and extended to Cartesian coordinates by Dehnen (2000, 2002). These algorithms take the idea of hierarchical expansions one step further by realising that significant parts of the particle-node interactions are redundantly executed for particles that are within the same node. In order to achieve an \(\mathcal {O}(1)\) complexity per particle, the node-node interaction should be known and translated to the particle location. This is precisely what FMM achieves by symmetrising the interaction to node-node interactions between well-separated nodes, which are separately Taylor expanded inside of the two nodes. Until recently, FMM methods were not widespread in cosmology, presumably due to a combination of higher algorithmic and parallelization complexity. Their advantages are becoming evident in modern N-body codes, which simulate extremely large numbers of particles and seek high accuracy; thus FMM has been adopted in PKDGRAV, GADGET-4, and SWIFT. We only briefly summarize the main steps of the FMM algorithm here, and refer the reader to the reviews by, e.g., Kurzak and Pettitt (2006), Dehnen and Read (2011) for details on the method.
The FMM method builds on the same hierarchical space decomposition as the Barnes&Hut tree above and shares some of its operators. Compared to the tree algorithm outlined in the previous section, three additional steps are needed: a ‘downward M2L’ (multipole-to-local) sweep, which propagates the interactions back down the tree after the upward M2M sweep, thereby computing a local field expansion in each node; the shift of this expansion in ‘downward L2L’ (local-to-local) steps to the centers of the child nodes; and a final ‘L2P’ (local-to-particle) translation to the particles. As one has to rely on the quality of the local expansion in each node, FMM requires significantly higher-order multipole expansions than standard Barnes&Hut trees to achieve low errors. Note that for a Cartesian expansion in monomials \(x^ly^mz^n\) at a fixed order \(p=l+m+n\), one has \((p+1)(p+2)/2\) multipole moments, i.e. \((p+1)(p+2)(p+3)/6\) for all orders up to and including p, so that the memory needed for each node scales as \(\mathcal {O}(p^3)\), and a standard implementation evaluating multipole pair interactions scales as \(\mathcal {O}(p^6)\). For expansions in spherical harmonics, one can achieve \(\mathcal {O}(p^3)\) scaling (Dehnen 2014). Note that for higher-order expansions one can rely on known recursion relations to obtain the kernel coefficients (Visscher and Apalkov 2010), allowing arbitrary-order implementations. Recently, it was demonstrated that a trace-free reformulation of the Cartesian expansion has a slimmer memory footprint (Coles and Bieri 2020) (better than 50% for \(p\ge 8\)). The same authors provide convenient Python scripts to auto-generate code for optimized symbolic expressions of the FMM operators. It is important to note that the higher arithmetic intensity lends itself well to recent architectures, which favour high FLOP-to-byte ratio algorithms (Yokota and Barba 2012).
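The bookkeeping for the Cartesian moment counts quoted above is easy to verify; a short sketch (helper names invented):

```python
def moments_at_order(p):
    """Number of Cartesian monomials x^l y^m z^n with l + m + n == p."""
    return (p + 1) * (p + 2) // 2

def moments_up_to(p):
    """Total number of moments of all orders <= p, which equals
    (p+1)(p+2)(p+3)/6; this O(p^3) growth sets the per-node memory
    cost of a Cartesian FMM implementation."""
    return sum(moments_at_order(q) for q in range(p + 1))
```

For example, `moments_up_to(8)` gives 165 coefficients per node, which makes the better-than-50% savings of the trace-free formulation for \(p\ge 8\) (Coles and Bieri 2020) significant.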
While the FMM force is symmetric, Springel et al. (2021) report that force errors can be much less uniform in FMM than in a standard tree approach, so that it might be necessary to randomise the relative position of the expansion tree w.r.t. the particles between time steps in order to suppress the effect of correlated force errors on sensitive statistics for cosmology. In principle, isotropy could be further improved with random rotations. Note, however, that errors might have a different spatial structure with different expansion bases.
The FMM method indeed has constant time force evaluation complexity for each N-body particle. This assumes that the tree has already been built, or that building the tree does not have \(\mathcal {O}(N\log N)\) complexity (which is only true if it is not fully refined but truncated at a fixed scale). Note however that for FMM solvers, it is preferable to limit the tree depth to a minimum node size or at least use a larger number of particles in a leaf cell for which local interactions are computed by direct ‘P2P’ (particle-to-particle) interactions. Also, tree construction has typically a much lower pre-factor than the ‘tree walk’. Note further that many codes use some degree of ‘tree updating’ in order to avoid rebuilding the tree in every timestep.
In order to avoid explicit Ewald summation, some recent methods employ hybrid FFT-FMM methods, where essentially a PM method is used to evaluate the periodic long range interactions as in tree-PM and the FMM method is used to increase the resolution beyond the PM mesh for short-range interactions (Gnedin 2019; Springel et al. 2021).