1 Introduction

The recent advances in the field of deep learning have initiated a new era for generative models. Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) have become a very popular approach by demonstrating their ability to learn complicated representations and produce high-resolution images (Karras et al. 2018). In the field of cosmology, high-resolution simulations of the matter distribution are becoming increasingly important for deepening our understanding of the evolution of structures in the universe (Springel et al. 2005; Potter et al. 2017; Kuhlen et al. 2012). These simulations are made using the N-body technique, which represents the distribution of matter in 3D space by trillions of particles. They are very slow to run and computationally expensive, as they evolve the positions of the particles over cosmic time in small time intervals. Generative models have been proposed to emulate this type of data, dramatically accelerating the process of obtaining new simulations once training is finished (Rodríguez et al. 2018; Mustafa et al. 2019).

N-body simulations represent the matter in a cosmological volume, typically between 0.1–10 Gpc, as a set of particles, typically between \(100^{3}\) and \(2000^{3}\). The initial 3D positions of the particles are typically drawn from a Gaussian random field with a specific power spectrum. The particles are then displaced over time according to the laws of gravity, the properties of dark energy, and other physical effects included in the simulations. During this evolution, the field becomes increasingly non-Gaussian and displays characteristic features, such as halos, filaments, sheets, and voids (Bond et al. 1996; Dietrich et al. 2012).

N-body simulations that consist only of dark matter effectively solve Poisson's equation numerically. This process is computationally expensive, as the forces must be recalculated at short time intervals to retain the precision of the approximation, which requires frequent updates of the particle positions. The speed of these simulations is a large computational bottleneck for cosmological experiments, such as the Dark Energy Survey, Euclid, or LSST.

Recently, GANs have been proposed for emulating matter distributions in two dimensions (Rodríguez et al. 2018; Mustafa et al. 2019). These approaches have been successful in generating data of high visual quality, almost indistinguishable from the real simulations even to experts. Moreover, several summary statistics often used in cosmology, such as power spectra and density histograms, also showed good levels of agreement. Some challenges remain when comparing sets of generated samples: in both works, the properties of sets of generated images did not match exactly, with the covariance matrix of the power spectra of the generated maps differing from that of the real maps by of order 10%.

While these results are encouraging, a significant difficulty remains in scaling these models to generate three-dimensional data, which contain several orders of magnitude more pixels per data instance. We address this problem in this work. We present a publicly available dataset of N-body cubes, consisting of 30 independent instances. Because the dark matter distribution is homogeneous and isotropic, and the simulations are made using periodic boundary conditions, the data can easily be augmented through shifts, rotations, and flips. The data is in the form of a list of particles with spatial positions x, y, z. It can be pixelised into 3D histogram cubes, where the matter distribution is represented by density voxels, each voxel containing the count of particles falling into it. If the resolution of the voxel cube is high enough, the particle- and voxel-based representations can be used interchangeably for most applications. Approaches that generate the matter distribution in the particle-based representation could also be designed; in this work, however, we focus on the voxel-based representation. By publishing the N-body data and the accompanying code we aim to encourage the development of large-scale generative models capable of handling such data volumes.
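
As an illustration of the voxel-based representation, the sketch below bins a particle list into a 3D histogram cube with NumPy; the box size and the cube resolution used in the toy example are illustrative, not those of the released dataset.

```python
import numpy as np

def particles_to_cube(positions, box_size, n_voxels=256):
    """Bin an (N, 3) array of particle positions into a 3D count cube.

    positions : particle coordinates in [0, box_size)
    box_size  : side length of the periodic simulation box
    n_voxels  : number of voxels per dimension
    """
    edges = np.linspace(0.0, box_size, n_voxels + 1)
    cube, _ = np.histogramdd(positions, bins=(edges, edges, edges))
    return cube  # shape (n_voxels, n_voxels, n_voxels), voxel = particle count

# Toy usage with random particles (a real sample contains 1024**3 particles).
rng = np.random.default_rng(0)
toy_positions = rng.uniform(0.0, 100.0, size=(100000, 3))
cube = particles_to_cube(toy_positions, box_size=100.0)
print(cube.shape, cube.sum())
```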

We present a benchmark GAN system to generate 3D N-body voxel cubes. Our novel GAN architecture scales to volumes of \(256^{3}\) voxels. The proposed solution relies on two key building blocks. First, we split the generation of the high-dimensional data into smaller patches. Instead of assuming that the distribution of each patch is independent of the surrounding context, we model it as a function of the neighboring patches. Although splitting the generation process into patches provides a scalable way to generate images of arbitrary size, it also limits the field of view of the generator, reducing its ability to learn global image features. The second core idea of our method addresses this problem by relying on a multi-scale approach that efficiently captures global dependencies that would otherwise be lost in the splitting process.

Our results constitute a baseline solution to the challenge. While the obtained statistical accuracy is currently insufficient for a real cosmological use case, we achieve two goals: (i) we demonstrate that the project is tractable by GAN architectures, and (ii) we provide a framework for evaluating the performance of new algorithms in the future.

1.1 Related work

Generative models that produce novel representative samples from high-dimensional data distributions are becoming increasingly popular in various fields such as image-to-image translation (Zhu et al. 2017) or image in-painting (Iizuka, Simo-Serra and Ishikawa 2017), to name a few. There are many different deep learning approaches to generative models. The most popular ones are Variational Auto-Encoders (VAE) (Kingma and Welling 2014), autoregressive models such as PixelCNN (van den Oord et al. 2016), and Generative Adversarial Networks (GAN) (Goodfellow et al. 2014). Regarding prior work for generating 3D images or volumes, two main types of architectures – in particular GANs – have been proposed. The first type (Achlioptas et al. 2018; Fan et al. 2017) generates 3D point clouds with a 1D convolutional architecture by producing a list of 3D point positions. This type of model does not scale to cases where billions of points are present in a simulation, which is an important concern given the size of current and future N-body simulations. The second type of approach, including Wu et al. (2016) and Mosser et al. (2017), directly uses 3D convolutions to produce a volume. Although the computation and memory cost is independent of the number of particles, it scales with the number of voxels of the desired volume, which grows cubically with the resolution. While recursive models such as PixelCNN (van den Oord et al. 2016) can scale to some extent, they are slow to generate samples, as they build the output image pixel-by-pixel in a sequential manner. We take inspiration from PixelCNN to design a patch-by-patch approach, rather than a pixel-by-pixel one, which significantly speeds up the generation of new samples.

As mentioned above, splitting the generation process into patches reduces the ability of the generator to learn global image features. Some partial solutions to this problem can already be found in the literature, such as the Laplacian pyramid GAN (Denton et al. 2015), which provides a mechanism to learn at different scales for high-quality sample generation, but this approach is not scalable as the sample image size is still limited. Similar techniques are used for super-resolution (Ledig et al. 2017; Lai et al. 2017; Wang et al. 2018). Recently, progressive growing of GANs (Karras et al. 2018) has been proposed to improve the quality of the generated samples and stabilize the training of GANs. The size of the samples produced by the generator is progressively increased by adding layers at the end of the generator and at the beginning of the discriminator. In the same direction, Brock et al. (2019) and Lučić et al. (2019) achieved impressive quality in the generation of large images by leveraging better optimization. However, beyond a certain sample size these approaches hit the memory limits of the hardware on which the model is trained, and therefore eventually fail to offer the scalability we are after.

GANs were proposed for generating matter distributions in 2D. A generative model for the projected matter distribution, also called a mass map, was introduced by Mustafa et al. (2019). Mass maps are cosmological observables, as they can be reconstructed by techniques such as gravitational lensing (Chang et al. 2018). Mass maps arise through integration of the matter density over the radial dimension with a specific, distance-dependent kernel. The generative model presented in Mustafa et al. (2019) achieved very good agreement with the real data on several important non-Gaussian summary statistics: power spectra, density histograms, and Minkowski functionals (Schmalzing et al. 1996). The distributions of these summaries between sets of generated and real data also agreed well. However, the covariance matrix of the power spectra within the generated and real sets did not match perfectly, differing by of order 10%.

A generative model working on 2D slices from N-body simulations was developed by Rodríguez et al. (2018). N-body slices have much more complex features, such as filaments and sheets, as they are not averaged out in projection. Moreover, the dynamic range of pixel values spans several orders of magnitude. GANs presented by Rodríguez et al. (2018) also achieved good performance, but only for larger cosmological volumes of 500 Mpc. Some mismatch in the power spectrum covariance was also observed.

Alternative approaches to emulating cosmological matter distributions using deep learning have also been proposed recently. The Deep Displacement Model (He et al. 2018) uses a U-shaped neural network that learns how to modify the positions of the particles from the initial conditions to a given time in the history of the universe.

Generative models have also been proposed for solving other problems in cosmology, such as the generation of galaxies (Regier et al. 2015), adding baryonic effects to the dark matter distribution (Tröster et al. 2019), recovering features from noisy astrophysical images (Schawinski et al. 2017), deblending galaxy superpositions (Reiman and Göhre 2019), and improving the resolution of matter distributions (Kodi Ramanah et al. 2019).

2 The N-body data

2.1 Cosmological N-body simulations

The distribution of matter, dark matter and other particles in the universe at large scales, under the influence of gravity, forms a convoluted network-like structure called the cosmic web (Bond et al. 1996; Coles and Chiang 2000; Forero-Romero et al. 2009; Dietrich et al. 2012). This distribution contains information vital to the study of dark matter, dark energy, and the very laws of gravity (Abbott et al. 2018; Hildebrandt et al. 2017; Joudaki et al. 2017). Simulations of various computational cosmological models (Springel 2005; Potter et al. 2017) lead to an understanding of the fundamentals of cosmological measurements (Fosalba et al. 2015; Busha et al. 2013) and of other properties of the universe (Springel et al. 2005). These simulations are done using N-body techniques, which represent the cosmic web by a set of particles in three-dimensional space and evolve their positions with time. This evolution is governed by the underlying cosmological model and the laws of gravity. The end result of an N-body simulation is the position of billions of particles in space, as depicted in Fig. 1. Unfortunately, N-body simulations are extremely resource intensive, requiring days or even weeks of computation (Teyssier et al. 2009; Boylan-Kolchin et al. 2009). Moreover, a large number of these simulations is needed to obtain good statistical accuracy, which further increases the computational requirements.

Figure 1

An example N-body simulation at the current cosmological time (redshift \(z=0\)). The density is represented by particle positions in 3D. In this work, the generative models use a representation of this distribution based on a 3D voxel histogram of the particle positions

This computational bottleneck opens up an opportunity for deep learning and generative models to offer an alternative solution to the problem. Generative models have the potential to learn the underlying data distribution of the N-body simulations from a seed set of N-body samples used for training.

There are multiple techniques for running N-body simulations, which agree well on large scales but start to diverge on small scales, around wavenumber \(k=1~\mbox{Mpc}^{-1}\) (Schneider et al. 2016). Moreover, baryonic feedback can also affect the small-scale matter distribution (Mead et al. 2015; Huang et al. 2019; Barreira et al. 2019), and large uncertainty remains at these scales.

2.2 Data preprocessing

We produce samples of the cosmic web using standard N-body simulation techniques. We used L-PICOLA (Howlett et al. 2015) to create 30 independent simulations. The cosmological model used was ΛCDM with Hubble constant \(\mathrm{H}_{0} = 500h = 350~\mbox{km/s/Mpc}\), dark energy density \(\varOmega _{\varLambda }= 0.724\) and matter density \(\varOmega _{m} = 0.276\). We used the particle distribution at redshift \(z = 0\). The output of the simulator is a list of the 3D positions of \(1024^{3}\) particles. To obtain the matter distribution, we first convert it to a standard \(256^{3}\) cube using histogram binning. We consider these cubes as the raw data for our challenge; they can be downloaded at https://zenodo.org/record/1464832. The goal is to build a generative model able to produce new 3D histograms. While 30 samples might seem like a low number to train a deep neural network, each sample contains a large number of voxels. One can also expand the training data by relying on data augmentation, using various rotations and circular translations as described in Appendix 2. Problematically, the dynamic range of this raw data spans several orders of magnitude and the distribution is skewed towards smaller values, with a very elongated tail towards larger values. Empirically, we find that this very skewed distribution makes learning a generative model difficult. Therefore, we first transform the data using a logarithm-based function, as described in Appendix 1. Note that this transform needs to be inverted before the evaluation procedure.
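
The exact transform is given in Appendix 1 and is not reproduced here; as an illustration only, the snippet below applies a simple invertible, logarithm-based mapping of the kind described, with the unit shift being an arbitrary choice.

```python
import numpy as np

def forward_transform(cube):
    """Illustrative log-based compression of the voxel counts.

    The counts are non-negative, so log(1 + x) is well defined and maps the
    heavy right tail of the distribution to a much smaller range.
    """
    return np.log1p(cube)

def inverse_transform(transformed):
    """Invert the illustrative transform before computing summary statistics."""
    return np.expm1(transformed)

# Round-trip check on a toy cube of counts.
toy = np.random.poisson(lam=2.0, size=(8, 8, 8)).astype(float)
assert np.allclose(inverse_transform(forward_transform(toy)), toy)
```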

2.3 Evaluation procedure

The evaluation of a GAN is not a simple task (Borji 2019; Grnarova et al. 2019). Fortunately, following Rodríguez et al. (2018), we can evaluate the quality of the generated samples with three summary statistics commonly used in the field of cosmology.

1. Mass histogram: the average (normalized) histogram of the pixel values in the image. Note that the pixel value is proportional to the matter density.

2. Power spectrum density (PSD): the amplitude of the Fourier modes as a function of their frequency (the phase is ignored). In practice, a set of bins is defined and the modes of similar frequency are averaged together.

3. Peak histogram: the distribution of maxima of the density distribution, often called “peak statistics” or the “mass function”. Peaks capture the non-Gaussian features present in the cosmic web. This statistic is commonly used on weak lensing data (Kacprzak et al. 2016; Martinet et al. 2017). A peak is defined as a pixel greater than every pixel in its 2-pixel neighbourhood (24 pixels in 2D and 124 in 3D); see the code sketch after this list.
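
As a minimal illustration of how the PSD and the peak histogram can be computed on a voxel cube, the sketch below uses NumPy and SciPy; the frequency binning and the periodic boundary handling are illustrative assumptions and may differ from the released evaluation code.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def power_spectrum_density(cube, n_bins=64):
    """Isotropic PSD: average |FFT|^2 in shells of similar spatial frequency."""
    amplitudes = np.abs(np.fft.fftn(cube)) ** 2
    freq = np.fft.fftfreq(cube.shape[0])
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    k = np.sqrt(kx**2 + ky**2 + kz**2).ravel()
    bins = np.linspace(0.0, k.max(), n_bins + 1)
    which = np.digitize(k, bins) - 1
    psd = np.bincount(which, weights=amplitudes.ravel(), minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    return psd[:n_bins] / np.maximum(counts[:n_bins], 1)

def peak_values(cube):
    """Values of local maxima over a 5x5x5 (2-pixel) neighbourhood,
    assuming periodic boundaries."""
    local_max = maximum_filter(cube, size=5, mode="wrap")
    return cube[cube == local_max]

toy = np.random.rand(32, 32, 32)
print(power_spectrum_density(toy).shape, peak_values(toy).shape)
```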

Other statistics, such as Minkowski functionals, three-point correlation functions, or phase distributions, could be considered. Nevertheless, we find that the three aforementioned statistics are currently sufficient to compare different generative models.

Distance between statistics

We define a score that reflects the agreement of the three aforementioned cosmological measures. Problematically, the scalars forming the three vectors representing them have very different scales, and their metric should represent the relative error rather than the absolute one. For this reason, we first compute the logarithm (in base 10) of the computed statistics s. As all statistics are non-negative but not strictly positive, we add a small value ϵ before computing the logarithm, i.e., \(s_{\log }= \log _{10}(s+\epsilon )\). Here ϵ is set to the maximum value of the statistic averaged over all real samples, divided by \(10^{5}\).

At this point, the relative difference connects to the difference of the real and fake \(s_{\log }\), i.e.: \({s_{\log }^{r}-s_{\log }^{f}\approx \log _{10}\frac{s^{r}}{s^{f}}}\). One could quantify the error using a norm, i.e.: \(\|\mathbb {E}{s_{\log }^{r}}-\mathbb {E}{s_{\log }^{f}}\|\). However, such a distance does not take second-order moments into account. Instead, we take inspiration from the Fréchet Inception Distance (Heusel et al. 2017). We start by modeling the real and fake log statistics \(s_{\log }^{r}\), \(s_{\log }^{f}\) as two multivariate Gaussian distributions. This allows us to compute the Fréchet Distance (FD) between the two distributions (Fréchet 1957), which is also the Wasserstein-2 distance. The FD between two Gaussian distributions with means and covariances \((m^{r},C^{r})\) and \((m^{f}, C^{f})\) is given by Dowson and Landau (1982):

$$ \begin{aligned}[b] &d^{2} \bigl( \bigl(m^{r},C^{r}\bigr), \bigl(m^{f}, C^{f}\bigr) \bigr) \\ &\quad = \bigl\Vert m^{r} - m^{f} \bigr\Vert _{2}^{2} + \operatorname{Tr} \bigl( C^{r} + C^{f} - 2 \bigl(C^{r} C^{f}\bigr)^{1/2} \bigr). \end{aligned} $$
(1)

Note that Dowson and Landau (1982) also prove that \(\operatorname{Tr} ( C^{r} + C^{f} - 2 (C^{r} C^{f})^{1/2} )\) is a metric for the space of covariance matrices. We choose the FD over the Kullback–Leibler (KL) divergence for two reasons: (a) the Wasserstein distance remains an appropriate distance when the distributions have non-overlapping support, and (b) the KL divergence is computationally more unstable, since the covariance matrices need to be inverted.
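
For concreteness, a minimal NumPy/SciPy sketch of the distance (1) applied to sets of log statistics is given below; the array shapes and the estimation of the Gaussian parameters by the sample mean and covariance are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import sqrtm

def log_stats(stats, eps_scale=1e-5):
    """Log10 of a (n_samples, n_bins) array of a non-negative summary statistic.

    eps is the maximum of the mean statistic divided by 10**5, as in the text.
    """
    eps = stats.mean(axis=0).max() * eps_scale
    return np.log10(stats + eps)

def frechet_distance(s_real, s_fake):
    """Squared Frechet distance (1) between Gaussian fits of two sets of
    log statistics, each of shape (n_samples, n_bins)."""
    m_r, m_f = s_real.mean(axis=0), s_fake.mean(axis=0)
    c_r = np.cov(s_real, rowvar=False)
    c_f = np.cov(s_fake, rowvar=False)
    covmean = sqrtm(c_r @ c_f)
    if np.iscomplexobj(covmean):           # discard tiny imaginary parts
        covmean = covmean.real
    return np.sum((m_r - m_f) ** 2) + np.trace(c_r + c_f - 2.0 * covmean)

# Toy usage: two sets of 30 "power spectra" with 16 bins each.
rng = np.random.default_rng(1)
d2 = frechet_distance(log_stats(rng.gamma(2.0, size=(30, 16))),
                      log_stats(rng.gamma(2.1, size=(30, 16))))
print(d2)
```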

2.4 N-Body mass map generation challenge

Using the FD, we define a score for each statistic as

$$ S^{*} = \frac{1}{d^{2} ( (m^{r},C^{r} ), (m^{f}, C^{f} ) )}, $$
(2)

where m, C are computed on the log statistics, i.e., \(m^{r}=\mathbb {E}{s_{\log }^{r}}\), \(m^{f}=\mathbb {E}{s_{\log }^{f}}\). Along with this manuscript, we release our dataset and our evaluation procedure in the hope that further contributions will improve our solution. All information can be found at https://github.com/nperraud/3DcosmoGAN. We hope that this dataset and evaluation procedure can serve as a tool for the evaluation of GANs in general. In practice, these cosmological statistics are very sensitive, and we observed two important properties of this problem. First, even a small variation in the generated images has an impact on the statistics. The statistics can be strongly affected by the high-density regions of the N-body data, and these regions are also the rarest in the training set. Second, while mode collapse may not directly affect the mean of the statistics, it can affect their second-order moment significantly. We observed that obtaining a good statistical agreement (and hence a high score) is much more difficult than obtaining generated images that are indistinguishable to the human eye, especially in the two-dimensional case. We found that the problems of high data volume, large dynamic range of the data, and strict requirements on the agreement between the statistical properties of real and generated samples pose a significant challenge for generative models.

Interpretation of the \(S^{*}\) scores

We report the \(S^{*}\) scores for the 3D case in Sect. 4, and the web page hosting the source code also includes baseline scores for the 2D case. We would like to emphasize that these scores are mostly suitable for comparing two generative models applied to the same distribution. It is a priori unclear whether they provide a meaningful comparison when considering two generative models trained on different distributions. We therefore refrain from relying on such scores to compare generative models trained on 2D and 3D data, since the latter gives access to correlations along the third dimension that are absent from the 2D data. Finally, while in theory the \(S^{*}\) score is unbounded and can be arbitrarily large as \(d^{2}\rightarrow 0\), its empirical evaluation is in practice limited by the estimation error of \(d^{2}\), which depends on the number of samples used to estimate the mean vectors and the covariance matrices \((m^{r},C^{r})\), as well as on the moments of the estimated statistics. In the case where the training set contains a large number of samples, one would expect the training score to be indicative of the score of the true distribution. One can see that the estimation error of the first term of (1) depends mostly on the variance of the statistic, while for the second term it depends on the moments of order 3 and 4. Hence, the estimated score reflects the variance of the estimated statistics given the number of samples used. A high score therefore means less variance within the corresponding statistic.

3 Sequential generative approach

We propose a novel approach to efficiently learn a Generative Adversarial Network model (see Sect. 3.1) for 2D and 3D images of arbitrary size. Our method relies on two building blocks: (1) a multi-scale model that improves the quality of the generated samples, both visually and quantitatively, by learning unique features at different scales (see Sect. 3.2), and (2) a training strategy that enables learning images of arbitrary size, that we call “conditioning on neighborhood” (see Sect. 3.3).

3.1 Generative adversarial networks (GANs)

Generative Adversarial Networks (GAN) rely on two competing neural networks that are trained simultaneously: the generator G, which produces new samples, and the discriminator D, which attempts to distinguish them from the real ones. During training, the generator's objective is to fool the discriminator, while the discriminator resists by learning to accurately discriminate between real and fake data. Eventually, if the optimization process is carried out successfully, the generator should improve to the point that its generated samples become indistinguishable from the real ones. In practice, this optimization process is challenging, and numerous variants of the original GAN approach have been proposed, many of them aiming to improve stability, e.g. Roth et al. (2017), Gulrajani et al. (2017), Arjovsky et al. (2017). In our work, we rely on the improved Wasserstein GAN (WGAN) approach introduced in Gulrajani et al. (2017). The latter optimizes the Wasserstein distance instead of the Jensen–Shannon divergence and penalizes the norm of the gradient of the critic instead of using the hard clipping of the original WGAN (Arjovsky et al. 2017). The resulting objective function is

$$ \min_{G} \max_{D} \; \mathbb{E}_{\mathrm{x} \sim \mathbb{P}_{r}} \bigl[ D(\mathrm{x}) \bigr] - \mathbb{E}_{\tilde{\mathrm{x}} \sim \mathbb{P}_{g}} \bigl[ D(\tilde{\mathrm{x}}) \bigr] - \lambda \, \mathbb{E}_{\hat{\mathrm{x}} \sim \mathbb{P}_{\hat{\mathrm{x}}}} \bigl[ \bigl( \bigl\Vert \nabla _{\hat{\mathrm{x}}} D(\hat{\mathrm{x}}) \bigr\Vert _{2} - 1 \bigr)^{2} \bigr], $$

where \(\mathbb{P}_{r}\) is the data distribution and \(\mathbb{P}_{g}\) is the generator distribution implicitly defined by \(\tilde{\mathrm{x}} = G(\mathrm{z})\), \(\mathrm{z} \sim p(\mathrm{z})\). The latent variable z is sampled from a prior distribution p, typically a uniform or a Gaussian distribution. Finally, \(\mathbb{P}_{\hat{\mathrm{x}}}\) is defined implicitly by sampling uniformly along straight lines between pairs of points sampled from the true data distribution \(\mathbb{P}_{r}\) and the generator distribution \(\mathbb{P}_{g}\). The weight λ is the penalty coefficient.
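
For concreteness, a minimal PyTorch-style sketch of this objective is given below; the toy critic, the batch shapes, and the penalty weight \(\lambda = 10\) are illustrative assumptions rather than the settings used in our implementation.

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP penalty: interpolate between real and fake samples and push
    the gradient norm of the critic towards 1 (Gulrajani et al. 2017)."""
    alpha = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1.0 - alpha) * fake).requires_grad_(True)
    grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]
    grad_norm = grad.flatten(start_dim=1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()

def critic_loss(critic, real, fake):
    """Negative of the objective above, minimized w.r.t. the critic."""
    return critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)

# Toy usage on 32^3 cubes with a deliberately tiny critic.
critic = nn.Sequential(nn.Flatten(), nn.Linear(32**3, 1))
real = torch.rand(4, 1, 32, 32, 32)
fake = torch.rand(4, 1, 32, 32, 32)
print(critic_loss(critic, real, fake).item())
```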

3.2 Multi-scale model

Our multi-scale approach is inspired by the Laplacian pyramid GAN (Denton et al. 2015). We refer to three image types of different sizes, namely \(I_{3} = 32 \times 32 \times 32\), \(I_{2} = 64 \times 64 \times 64\) and \(I_{1} = 256 \times 256 \times 256\) pixels, where \(I_{2}\) is a down-scaled version of \(I_{1}\) and \(I_{3}\) is a down-scaled version of \(I_{2}\). The multi-scale approach is shown in Fig. 2 and uses three different GANs, \(M_{1}\), \(M_{2}\) and \(M_{3}\), which are trained independently from each other and can therefore be trained in parallel. We train GAN \(M_{3}\) to learn the data distribution of images \(I_{3}\), while the GANs \(M_{2}\) and \(M_{1}\) are conditioned on the images produced by \(M_{3}\) and \(M_{2}\), respectively. In our implementation, we take \(M_{3}\) to be a standard deep convolutional GAN (DCGAN) that learns to produce down-scaled samples of size \(I_{3}\). We design GANs \(M_{2}\) and \(M_{1}\) to have the following properties: (1) they produce outputs using a sequential patch-by-patch approach and (2) the output of each patch is conditioned on the neighboring patches. This procedure allows handling of high data volume, while preserving the long-range features in the data. Moreover, the different GANs learn salient features at different scales, which contributes to an overall improved quality of the samples produced by the final GAN \(M_{1}\). Further implementation details are provided in Appendix 3.
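
A natural way to obtain the down-scaled cubes \(I_{2}\) and \(I_{3}\) is to sum the particle counts over non-overlapping blocks, so that down-scaled voxels still hold counts; the sketch below implements this choice, which should be read as an illustrative assumption rather than the exact operator used in our pipeline.

```python
import numpy as np

def downscale(cube, factor):
    """Down-scale a cubic array by summing non-overlapping blocks of
    size factor**3, so each output voxel holds the total count of the
    corresponding block of the input."""
    n = cube.shape[0] // factor
    return cube.reshape(n, factor, n, factor, n, factor).sum(axis=(1, 3, 5))

# I1 -> I2 -> I3, matching the 256^3 / 64^3 / 32^3 sizes used in the text.
i1 = np.random.poisson(1.0, size=(256, 256, 256)).astype(np.int64)
i2 = downscale(i1, 4)   # 64^3
i3 = downscale(i2, 2)   # 32^3
print(i2.shape, i3.shape, i1.sum() == i3.sum())
```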

Figure 2

Multiscale approach using multiple intermediary GANs, each learning features at different scales and all trained independently from each other. The same approach can in principle be extended to produce higher-resolution images, potentially adding more intermediary GANs if necessary

3.3 Conditioning on neighborhoods

The limited amount of memory available to train a GAN generator makes it impractical to directly produce large image samples. Using modern GPUs with 16 GB of RAM and a state-of-the-art network architecture, the maximum sample size we could use was \(32^{3}\), which is far from our target of \(256^{3}\). In order to circumvent this limitation, we propose a new approach that produces the full image (of size \(256^{3}\)) patch-by-patch, each patch being of smaller size (\(32^{3}\) in our case). This approach is reminiscent of PixelCNN (van den Oord et al. 2016), where 2D images are generated pixel-by-pixel rather than the entire picture being generated at once. Instead of assuming that the distribution of each patch is independent of the surrounding context, we model it as a function of the neighboring patches. The generation is done in raster-scan order, which implies that a patch depends on the neighboring patches produced before it. The process illustrated in Fig. 3 is for the 2D case with three neighboring patches; the generalization to three dimensions is straightforward, as it simply requires seven neighboring 3D patches.
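
As an illustration of the raster-scan conditioning in 3D, the sketch below collects the seven previously generated neighboring cubes of a patch at grid position (i, j, k), padding with zeros where no neighbor exists; the patch size and the dictionary-based bookkeeping are illustrative assumptions.

```python
import numpy as np

PATCH = 32  # side length of a generated patch

def gather_borders(patches, i, j, k):
    """Return the 7 neighboring 32^3 cubes of grid cell (i, j, k) that precede
    it in raster-scan order; missing neighbors (at the cube boundary) are zeros.

    patches: dict mapping (i, j, k) -> generated 32^3 array.
    """
    offsets = [(di, dj, dk)
               for di in (0, -1) for dj in (0, -1) for dk in (0, -1)
               if (di, dj, dk) != (0, 0, 0)]
    borders = []
    for di, dj, dk in offsets:
        key = (i + di, j + dj, k + dk)
        borders.append(patches.get(key, np.zeros((PATCH, PATCH, PATCH))))
    return borders  # list of 7 arrays, fed to the conditional generator

# Toy usage: pretend the first two patches of a cube have been generated.
done = {(0, 0, 0): np.ones((PATCH,) * 3), (0, 0, 1): 2 * np.ones((PATCH,) * 3)}
print(len(gather_borders(done, 0, 1, 1)))  # 7 border cubes
```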

Figure 3

Left: Sequential Generation. The image is generated patch-by-patch. Right: Upscaling GAN conditioned on neighborhood information. Given border patches B1, B2 and B3, and the down-scaled version of the 4th patch, the generator generates the 4th fake patch. The discriminator also receives the border patches, as well as either an original or fake patch from the generator. The principle is similar in 3D, with the difference that 7 border cubes are used instead of 3 border patches

In the generator, the neighborhood information, i.e. the borders (3 patches in 2D, 7 cubes in 3D), is first encoded using several convolutional layers. It is then concatenated with the latent variable and fed to a fully connected layer before being reshaped into a 3D image. At this point, the down-sampled version of the image is concatenated. After a few extra convolutional layers, the generator produces the lower rightmost patch, with the aim of making it indistinguishable from the corresponding patch of the real data. The generator architecture is detailed in Fig. 4. In the case where no neighborhood information is available, such as at the boundary of a cube, we pad with zeros. The discriminator is given images containing four patches, where the lower right patch is either the real data or the fake data generated by the generator. The generator only produces patches of size \(32^{3}\), irrespective of the size of the original image. In this way, the method can scale to any image size, which is a great advantage. The discriminator, however, only has access to a limited part of the image and ignores long-range dependencies. This issue is handled by the multi-scale approach described in the previous section. We also tried a model conditioned only on the neighborhoods, as detailed in Sect. 4.3; it performed significantly worse than the multi-scale approach.

Figure 4

Details of the upsampling generator. The borders/cubes are encoded using several convolutional layers

4 Experimental results

Our method relies on a recursive approach that progressively builds a 3D cube, starting from a low-resolution version and upsampling it at every step. We detail and analyze each step separately in order to understand its impact on the final performance of the full model. We also compare the results of our multi-scale approach to a simpler uni-scale model. Additional details regarding the network architectures and various implementation choices are available in Appendix 3.

4.1 Scale by scale analysis of the pipeline

In the following, we describe our model that relies on three different GANs, namely \(M_{1}\), \(M_{2}\) and \(M_{3}\), to generate samples at distinct resolutions. We detail each step of the generation process below.

Step 1: low-scale generation (latent code to \(I_{3}\))

The low-scale generation of a sample of size \(32^{3}\) is performed by the WGAN \(M_{3}\). The architecture of both the generator and the discriminator is composed of five 3D convolutional layers with kernels of size \({4 \times 4 \times 4}\). We use leaky ReLU non-linearities. Further details can be found in Table 2 of Appendix 3.
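
As an illustration, the following is a minimal PyTorch sketch of such a five-layer 3D convolutional critic for \(32^{3}\) inputs; the channel widths, strides, and padding are illustrative assumptions, and the exact configuration is given in Table 2 of Appendix 3.

```python
import torch
import torch.nn as nn

# Hypothetical critic for 32^3 cubes: five 3D convolutions with 4x4x4 kernels
# and leaky ReLU activations, halving the spatial size at every layer.
critic_m3 = nn.Sequential(
    nn.Conv3d(1, 32, kernel_size=4, stride=2, padding=1),    # 32^3 -> 16^3
    nn.LeakyReLU(0.2),
    nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1),   # 16^3 -> 8^3
    nn.LeakyReLU(0.2),
    nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1),  # 8^3 -> 4^3
    nn.LeakyReLU(0.2),
    nn.Conv3d(128, 256, kernel_size=4, stride=2, padding=1), # 4^3 -> 2^3
    nn.LeakyReLU(0.2),
    nn.Conv3d(256, 1, kernel_size=4, stride=2, padding=1),   # 2^3 -> 1^3 score
    nn.Flatten(),
)

x = torch.randn(4, 1, 32, 32, 32)   # a batch of four 32^3 cubes
print(critic_m3(x).shape)           # torch.Size([4, 1])
```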

Figure 5 shows the middle slice from 16 3D samples \(I_{3}\) drawn from the generator of \(M_{3}\), compared to real samples. In Fig. 6, we additionally plot our evaluation statistics for the 30 samples, corresponding to the total number of N-body simulations used to build the dataset. The generated samples drawn from \(M_{3}\) are generally similar to the true data, both from a visual and from a statistical point of view, although one can observe slight disagreements at higher frequencies. Note that the low number of samples and their size do not allow us to compute very accurate statistics, which limits our evaluation.

Figure 5

8×-down-scaled sample generation (\(32^{3}\) cubes). Middle slice from 16 real and 16 fake WGAN \(M_{3}\) samples. Video: https://youtu.be/uLwrF73wX2w or Additional file 1

Figure 6

Statistics of the 8×-down-scaled \(32^{3}\) cubes. The fake samples are generated from WGAN \(M_{3}\). The power spectrum density is shown in units of \(\mbox{h}~\mbox{Mpc}^{-1}\), where \(\mbox{h} = \mbox{H}_{0}/100\) corresponds to the Hubble parameter

Step 2: up-scale (\(I_{3}\) to \(I_{2}\))

Upscaling the sample size from \(32^{3}\) to \(64^{3}\) is performed by the WGAN \(M_{2}\). The architecture is similar to \(M_{3}\) (see Table 3 in Appendix 3), except that the border patches are first encoded using three 3D convolutional layers and then concatenated with the latent variables before being fed to the generator.

In order to visualize the quality of the up-scaling achieved by this first up-sampling step independently from the rest of the pipeline, we first down-scale the real \(256^{3}\) samples to \(32^{3}\), and then provide them as input \(I_{3}\) to the WGAN \(M_{2}\). We then observe the result of the up-scaling to \(64^{3}\). Figure 7 shows slices from some generated \(I_{2}\) samples, as well as the real down-scaled \(I_{3}\) image and the real \(I_{2}\) image. We observe a clear resemblance between the up-scaled fake samples and the real samples. The statistics for this step of the generation process are shown in Fig. 8. We observe more significant discrepancies for features that rarely occur in the training set, such as large peaks. This is, however, not surprising, as learning from few examples is in general a difficult problem.

Figure 7

Up-scaling a \(32^{3}\) cube to \(64^{3}\). Left and right: middle slices from 4 real and fake \(64^{3}\) samples. The fake is generated by conditioning the WGAN \(M_{2}\) on the real down-scaled \(32^{3}\) cube (center). Video: https://youtu.be/IIPoK8slShU or Additional file 2

Figure 8

Statistics of the samples produced by \(M_{2}\). The fake samples are generated by conditioning WGAN \(M_{2}\) on the real down-scaled \(32^{3}\) samples. The power spectrum density is shown in units of \(\mbox{h}~\mbox{Mpc}^{-1}\), where \(\mbox{h} = \mbox{H}_{0}/100\) corresponds to the Hubble parameter

Step 3: up-scale (\(I_{2}\) to \(I_{1}\))

The final upscaling from \(64^{3}\) to \(256^{3}\) is performed by the WGAN \(M_{1}\). The architecture of both the generator and the discriminator of \(M_{1}\) is composed of eight 3D convolutional layers with inception modules (Szegedy et al. 2014). The inception modules have filters of three different sizes: \(1 \times 1 \times 1\), \(2 \times 2 \times 2\) and \(4 \times 4 \times 4\). The input tensor is convolved with all three types of filters, using padding to keep the output shape the same as the input shape. Eventually, the three outputs are summed to recover the desired number of output channels.
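
The following is a minimal PyTorch sketch of such an inception block; the asymmetric padding used to keep the spatial size unchanged for the even-sized kernels, as well as the channel counts in the example, are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Inception3d(nn.Module):
    """Sum of three 3D convolutions with kernel sizes 1, 2 and 4, each padded
    so that the output has the same spatial shape as the input."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        self.conv2 = nn.Conv3d(in_ch, out_ch, kernel_size=2)
        self.conv4 = nn.Conv3d(in_ch, out_ch, kernel_size=4)

    def forward(self, x):
        y1 = self.conv1(x)
        # Even kernels need asymmetric padding (k - 1 voxels in total per
        # dimension) to preserve the spatial size.
        y2 = self.conv2(F.pad(x, (0, 1, 0, 1, 0, 1)))
        y4 = self.conv4(F.pad(x, (1, 2, 1, 2, 1, 2)))
        return y1 + y2 + y4

block = Inception3d(in_ch=8, out_ch=16)
x = torch.randn(2, 8, 16, 16, 16)
print(block(x).shape)  # torch.Size([2, 16, 16, 16, 16])
```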

To visualize the quality of the up-scaling achieved by this final up-sampling step, we down-scale the real \(256^{3}\) samples to \(64^{3}\), and then provide them as inputs \(I_{2}\) to the WGAN \(M_{1}\). Figure 9 shows the middle slices of two real and fake samples \(I_{1}\). Although the up-sampling factor is 4, the WGAN \(M_{1}\) is able to produce convincing samples even in terms of high-frequency components.

Figure 9

Upsampling a \(64^{3}\) cube to \(256^{3}\). Left and right: middle slices from 2 real and fake \(256^{3}\) samples. The WGAN \(M_{1}\) that generates the fake sample is conditioned on the real image down-scaled to \(64^{3}\) (center). Video: https://youtu.be/guUYP8ZOoVU or Additional file 3

4.2 Full multi-scale pipeline

Sample generation

The generation process used to produce new samples proceeds as follows. First, a latent variable is randomly drawn from the prior distribution and fed to \(M_{3}\), which produces a \(32^{3}\) low-resolution cube. The latter is then upsampled by \(M_{2}\). At first, all border patches shown in Fig. 3 are set to zero. Then the \(64^{3}\) cube is built recursively (in \(2^{3}=8\) steps), where at each step the previously generated patches are re-fed as borders into the generator. The generation of the full \(256^{3}\) cube is done in a similar fashion by \(M_{1}\). Note that this last step requires the generation of \(8^{3}=512\) patches/smaller cubes. An illustration, adapted to the 2D case for simplicity, is shown in Fig. 10. The full generation process takes around 7 seconds to produce one sample of size \(256^{3}\) using a single GPU node with 24 cores, compared to approximately 6–8 hours for the fast and approximate L-PICOLA (Howlett et al. 2015) simulator running on two identical nodes.
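
To make the control flow concrete, the sketch below walks through the three stages with random stand-in functions in place of \(M_{3}\), \(M_{2}\) and \(M_{1}\); the mapping between an output patch and the region of the lower-resolution cube it is conditioned on is an assumption made for illustration.

```python
import itertools
import numpy as np

PATCH = 32

# Random stand-ins for the trained generators M3, M2 and M1.
g_m3 = lambda z: np.random.rand(PATCH, PATCH, PATCH)
g_m2 = lambda borders, low_res, z: np.random.rand(PATCH, PATCH, PATCH)
g_m1 = lambda borders, low_res, z: np.random.rand(PATCH, PATCH, PATCH)

def assemble(patches, grid):
    """Stack a dict of (i, j, k) -> 32^3 patches into one cube."""
    cube = np.zeros((grid * PATCH,) * 3)
    for (i, j, k), p in patches.items():
        cube[i*PATCH:(i+1)*PATCH, j*PATCH:(j+1)*PATCH, k*PATCH:(k+1)*PATCH] = p
    return cube

def borders_of(patches, idx):
    """7 previously generated neighbors in raster-scan order, zeros if absent."""
    i, j, k = idx
    return [patches.get((i+di, j+dj, k+dk), np.zeros((PATCH,) * 3))
            for di, dj, dk in itertools.product((0, -1), repeat=3)
            if (di, dj, dk) != (0, 0, 0)]

low = g_m3(np.random.randn(100))                        # 32^3 cube from M3

mid_patches = {}
for i, j, k in itertools.product(range(2), repeat=3):   # 2^3 = 8 steps for M2
    cond = low[i*16:(i+1)*16, j*16:(j+1)*16, k*16:(k+1)*16]  # matching 16^3 region
    mid_patches[(i, j, k)] = g_m2(borders_of(mid_patches, (i, j, k)), cond,
                                  np.random.randn(100))
mid = assemble(mid_patches, grid=2)                     # 64^3 cube

high_patches = {}
for i, j, k in itertools.product(range(8), repeat=3):   # 8^3 = 512 steps for M1
    cond = mid[i*8:(i+1)*8, j*8:(j+1)*8, k*8:(k+1)*8]        # matching 8^3 region
    high_patches[(i, j, k)] = g_m1(borders_of(high_patches, (i, j, k)), cond,
                                   np.random.randn(100))
full = assemble(high_patches, grid=8)                   # 256^3 cube
print(full.shape)
```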

Figure 10

Up-scaling patches in three steps, using three different WGANs

Quantitative results

Figure 11 shows a few slices from a 3D fake \(I_{1}\) image generated using the full pipeline, alongside a random real \(I_{1}\) image. Figure 12 shows the summary statistics of 500 GAN-generated samples, compared to those of the real samples. The visual agreement between the real and generated samples is good, although a trained human eye can still distinguish between them. In particular, careful visual inspection reveals that the transitions between the different patches are still imperfect and that the generated samples have fewer long-range filaments than the true samples. The summary statistics agree very well for the middle range of mass density, with slight disagreements at the extremes. The shape of the power spectrum density (PSD) matches well, but the overall amplitude is too high for most of the k range. Naturally, we would expect the error of the multi-scale pipeline to be the result of the accumulation of errors from the three upsampling steps. In practice, we observe many similarities between the statistics shown in Fig. 13 (from the last upscaling step) and in Fig. 12 (from the full pipeline). Finally, Table 1 presents the scores obtained for the 30 cubes of size \(256^{3}\) generated by the full multi-scale GAN pipeline. As explained in Sect. 2.4, the reference score gives an indication of the variability within the training set, i.e., how similar two independent sample sets from the training set are. The reference mass histogram score is much higher than the PSD and peak histogram scores, because this statistic in general has much less variance. For that reason, it is probably easier to estimate, as indicated by the score of our pipeline. Cosmological analyses typically require a precision of better than a few percent on these summary statistics, which is achieved by the GAN method only for specific scales, peak and density values.

Figure 11

Middle slice from real and generated \(256^{3}\) samples. The GAN-generated samples are produced using the full multi-scale pipeline. Videos: 32-scale: https://youtu.be/uLwrF73wX2w; 64-scale: https://youtu.be/xI2cUuk3DRc; 256-scale: https://youtu.be/nWXP6DVEalA or Additional file 4

Figure 12

Summary statistics of real and GAN-generated \(256^{3}\) images using the full multi-scale pipeline. The power spectrum density is shown in units of \(\mbox{h}~\mbox{Mpc}^{-1}\), where \(\mbox{h} = \mbox{H}_{0}/100\) corresponds to the Hubble parameter

Figure 13

Statistics for the WGAN \(M_{1}\) producing fake \(256^{3}\) cubes. The fake samples are generated by conditioning WGAN \(M_{1}\) on the real cube down-scaled to \(64^{3}\). The power spectrum density is shown in units of \(\mbox{h}~\mbox{Mpc}^{-1}\), where \(\mbox{h} = \mbox{H}_{0}/100\) corresponds to the Hubble parameter

Table 1 Scores computed for the 30 GAN-generated \(256^{3}\) images using the full multi-scale pipeline. We refer to (2) and (1) for details. The reference is computed by comparing two sets of 15 real samples. It represents the best score that can be achieved

The scores of the multi-scale approach are much higher than those of the simpler single-scale approach described in the following section.

4.3 Comparison to the single scale model

In order to evaluate the effectiveness of the multi-scale approach, we compare our model to a uni-scale variant that is not conditioned on the down-sampled image. Here each patch is generated using only its 7 neighboring border patches and a latent variable drawn from the prior distribution. Put simply, it is a direct 3D implementation of the principle displayed in Fig. 3 (left), where the discriminator is not conditioned on the down-sampled image. An important issue with the uni-scale model is that the discriminator never sees more than \(64^{3}\) pixels at once. Hence, it is likely to fail to capture long-range correlations. In practice, we observed that the training process is unstable and subject to mode collapse. At generation time, the recursive structure of the model often led to repeating patterns. The resulting models were of poor quality, as shown in Figs. 14 and 15. This is also reflected in the scores presented in Table 1. This experiment demonstrates that conditioning on data at lower scales plays an important role in generating samples of good quality.

Figure 14

Middle slice from real and generated \(256^{3}\) samples. The GAN-generated samples are produced using the full uni-scale pipeline. Video: https://youtu.be/fxZEQHEGunA or Additional file 5

Figure 15

Summary statistics of real and GAN-generated \(256^{3}\) images using the full uni-scale pipeline. The power spectrum density is shown in units of \(\mbox{h}~\mbox{Mpc}^{-1}\), where \(\mbox{h} = \mbox{H}_{0}/100\) corresponds to the Hubble parameter

5 Conclusion

In this work we introduced a new benchmark for the generation of 3D N-body simulations using deep generative models. The dataset is made publicly available and contains matter density distributions represented as cubes of \(256\times 256\times 256\) voxels. While the performance of a generative model can be assessed by visual inspection of the generated samples, as commonly done on datasets of natural images, we also offer a more principled alternative based on a number of summary statistics that are commonly used in cosmology. Our investigation into this problem has revealed several factors that make this task challenging: (i) the sheer volume of each data sample, which is not straightforwardly tractable using conventional GAN architectures, (ii) the large dynamic range of the data, spanning several orders of magnitude, which requires a custom-designed transformation of the voxel values, and (iii) the high accuracy needed for the model to be practically usable in a real cosmological study. Adding to the difficulties of (i) and (ii), this also requires accurately capturing features that are rare in the training set.

As a first baseline result for the newly introduced benchmark, we proposed a new method to train a deep generative model on 3D images. We split the generation process into the generation of smaller patches, each conditioned on the neighboring patches. We also apply a multi-scale approach, learning multiple WGANs at different image resolutions, each capturing salient features at a different scale. This approach is inspired by the Laplacian pyramid GAN (Denton et al. 2015) and by PixelCNN (van den Oord et al. 2016), which have both been developed for 2D data.

We find that the proposed baseline produces N-body cubes of good visual quality compared to the training data, but significant differences can still be perceived. Overall, the summary statistics of real and generated data match well, but notable differences are present for high voxel values in the mass and peak histograms. The power spectrum has the expected shape, with an amplitude that is too high for most of the k values. The overall level of agreement is promising, but cannot yet be considered sufficient for practical applications in cosmology. Further development will be needed to achieve this goal; in order to encourage it, we have made the dataset and the code publicly available.

In our current model, the discriminator only has access to a partial view of the final image. The small-scale dependencies that may exist between distant patches are therefore not captured by the discriminator. Extending the model to allow the discriminator a more global view would be the next logical step for this work. We have also observed empirically that the extreme right tail of the histogram is often not fully captured by the generator. Designing architectures that help the generative model handle the large dynamic range of the data could further improve performance. One could also draw further inspiration from the literature on generative models for video data, such as Vondrick et al. (2016), Xiong et al. (2018), Saito et al. (2017). Given the observations made in our experiments, one might for instance expect that the two-stage approach suggested in Saito et al. (2017) could address some of the problems seen with the right tail of the distribution.

Another interesting research direction would be to condition the generation process on cosmic time or cosmological parameters. One could for instance rely on a conditional GAN model such as Mirza and Osindero (2014).