The Theoretical Background
Cosmologists have been interested in the large-scale distribution of matter in the universe since the early twentieth century.Footnote 5 Hubble’s observations in the 1930s indicated that, at sufficiently large scales, galaxies are distributed homogeneously in the universe. On smaller scales, however, Hubble showed that planets, stars, galaxies, and even groups of galaxies exhibit clustering, forming so-called “structures.” Attention was next turned to understanding Hubble’s observations: Could such observations be predicted directly from (gravitational) theory or could they guide future theoretical research?
Given the enormous length and time scales involved, the basic theory behind large-scale structure formation is remarkably simple.Footnote 6 One models the (dark) matter in the universe as a perfect, homogeneous fluid and treats gravity via a Newtonian gravitational potential. The creation of structure requires deviations from homogeneity, so one introduces small density perturbations and studies their evolution. The method outlined so far is applicable as long as the density fluctuations remain small. As the system is evolved forward and the perturbations grow larger (on the order of the background fluid density or greater), however, this simple linear theory is no longer quantitatively useful. To compare theoretical predictions with observations, cosmologists need to study these larger perturbations and the structures that they seed. Thus, they turn to another method: simulations.
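To give the flavor of this linear treatment, the standard textbook result (stated here schematically, not as it appears in any of the works cited) is that the fractional density perturbation $\delta = (\rho - \bar{\rho})/\bar{\rho}$ of a pressureless fluid in an expanding background obeys

$$\ddot{\delta} + 2H\dot{\delta} - 4\pi G \bar{\rho}\,\delta = 0,$$

where $H$ is the Hubble parameter, $G$ the gravitational constant, and $\bar{\rho}$ the mean background density. The growing solution of this equation is the gravitational instability that seeds structure, and the derivation presupposes $\delta \ll 1$, which is why the linear theory breaks down once the perturbations become comparable to the background density.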
Investigating the Large-Scale Structure of the Universe with Simulations
The simulations used to investigate the large-scale structure of the universe begin much like the theory described above. One first assumes that the (dark) matter in the universe can be modeled as a perfect fluid. Because these are computer simulations, they represent this perfect fluid as a discrete set of particles (N-bodies) interacting via Newtonian gravitational forces. For this reason, such simulations are called “N-body simulations.” Importantly, as Coles and Lucchin note, though they employ discrete particles, “these techniques are not intended to represent the motion of a discrete set of particles. The particle configuration is itself an approximation to a fluid” (2002, 305). Indeed, the particles themselves are not meant to be representations of real, physical particles but rather each particle “represent[s] a phase space patch covering a vast number of individual [dark matter] particles” (Tulin and Yu 2018, 26).
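To make the basic computational kernel concrete, a naive direct-summation force calculation can be sketched as follows. This is only an illustrative toy (the function name, softening parameter, and units are placeholders), not the far more sophisticated solvers used in the simulations discussed below.

```python
import numpy as np

def pairwise_accelerations(positions, masses, G=1.0, softening=1e-3):
    """Naive direct-summation Newtonian accelerations for N bodies.

    positions: (N, 3) array of particle positions; masses: (N,) array.
    The softening length is a common trick to avoid divergent forces
    when two particles pass very close to one another.
    """
    n = len(masses)
    acc = np.zeros_like(positions)
    for i in range(n):
        dx = positions - positions[i]                 # displacements to all others
        r2 = np.sum(dx**2, axis=1) + softening**2     # squared (softened) distances
        r2[i] = np.inf                                # exclude self-interaction
        acc[i] = G * np.sum((masses / r2**1.5)[:, None] * dx, axis=0)
    return acc  # cost grows as O(N^2) in the number of particles
```

Even this toy makes visible why production codes replace direct summation with cleverer force solvers: with tens of millions of particles, the pairwise sum above quickly becomes prohibitively expensive.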
In this shift to simulations, cosmologists are no longer asking if gravitational theory can predict structure formation—this question has already been answered affirmatively. Instead, they are using simulations to investigate the statistical distributions of matter that different combinations of cosmological parameters give rise to.Footnote 7 The values of these cosmological parameters are not typically constrained directly by observations. Instead, observations of the statistical distributions of matter (i.e., the matter power spectrum) provide constraints to many of these parameters at once. Thus, cosmologists require large data sets from simulations that vary the parameters of interest to compare to observations and infer the values of the cosmological parameters instantiated in the universe.
The challenge of N-body simulations comes not from the underlying theory but from their execution. The first difficulty is that each simulation employs millions of particles (e.g., Heitmann et al. 2010 employ over 16 million particles), so calculating the pairwise gravitational forces between particles, summing the total forces, and evolving the entire system forward is computationally expensive. The second difficulty is that using these simulations to infer cosmological parameters requires comprehensive coverage of the cosmological parameter space. This is because the standard method for inferring cosmological parameters from such simulations is Markov Chain Monte Carlo (MCMC) analysis, a method of Bayesian inference that allows researchers to sample from a probability distribution. In this case, MCMC analysis is used to sample from a probability distribution over cosmological parameters and to determine which parameter values are instantiated in observations. When this requirement for comprehensive coverage of the parameter space is coupled with the expense of running each simulation, the impracticality of the task becomes obvious.
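A bare-bones Metropolis–Hastings loop illustrates why MCMC demands such dense coverage of the parameter space. The sketch below is purely schematic (the likelihood, proposal scale, and parameter vector are placeholders, not those of any analysis cited here); the key point is that every iteration requires a fresh prediction of the matter power spectrum for a newly proposed set of parameters.

```python
import numpy as np

def metropolis_sample(log_likelihood, theta0, n_steps=10_000, step_size=0.05):
    """Minimal Metropolis-Hastings sampler over cosmological parameters.

    log_likelihood(theta) must compare a predicted matter power spectrum
    for parameters theta (from a simulation or emulator) with observations.
    """
    rng = np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    logl = log_likelihood(theta)
    chain = []
    for _ in range(n_steps):
        proposal = theta + step_size * rng.standard_normal(theta.shape)
        logl_prop = log_likelihood(proposal)     # requires a new power-spectrum prediction
        # accept the proposal with probability min(1, L_prop / L_current)
        if np.log(rng.uniform()) < logl_prop - logl:
            theta, logl = proposal, logl_prop
        chain.append(theta.copy())
    return np.array(chain)   # samples approximate the posterior over parameters
```

If each likelihood evaluation required a full N-body run, even a chain of modest length would be computationally out of reach.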
What do Cosmologists Learn from N-Body Simulations?
Researchers conducting N-body simulations acknowledge the limits of their gravity-only simulations. Heitmann and her collaborators note, for example, that at sufficiently small length scales, additional physics beyond mere gravitational interactions will be needed for accurate calculations (e.g., gas dynamics and feedback effects; 2010, 105–107). Nonetheless, N-body simulations remain a good approximation and are useful for studying the effects of changing the cosmological parameters on large-scale structure formation.
Given that N-body simulations leave out what is known to be relevant physics on small scales, one may wonder what cosmologists are trying to learn with such simulations. In this context, cosmologists are clearly not asking the question “When all our best models of the relevant physics have been included, do we get a universe like ours?” Their investigations cannot be aimed at this question, as their simulations leave out large domains of relevant physics. The framework of minimalist idealization (or minimal conditions modeling) can help clarify the situation. Weisberg describes minimalist idealization as “the practice of constructing and studying theoretical models that include only the core causal factors which give rise to a phenomenon” (Weisberg 2007, 642). O’Connor writes similarly that minimal conditions modeling identifies reasonable, minimal conditions for a phenomenon to arise (2017, 7). Perhaps, then, N-body simulations should be understood as a minimalist idealization—as modeling the minimal conditions for large-scale structure formation. N-body simulations do show that gravitational force is a minimal causal variable in producing the large-scale distribution of matter, but it also seems clear that this is not all cosmologists are learning from such simulations.
I suggested above that such simulations are designed to answer questions such as: “What would the statistical distribution of matter in the universe be if these were the true values of the cosmological parameters, assuming some particular cosmological model?” To appreciate the importance of this question, consider the role of such simulations in cosmologists’ larger research programs. The results of N-body simulations are often compared to cosmologists’ ever-improving observations of the statistical distribution of matter in the universe. The interplay between observations and theory/simulations points us towards the role of such simulations: considered together, observations and simulations serve as tests of different instantiations of cosmological parameters.
We can also ask whether and how such simulations are explanatory. To address this question, consider a distinction Batterman draws between what he calls type (i) and type (ii) why questions. Type (i) why questions ask why a phenomenon occurred in some particular circumstance, while type (ii) why questions ask why phenomena of this general type occur across a variety of circumstances (1992, 332). In a later paper, Batterman and Rice (2014) argue that the explanations provided by minimal models answer this second kind of why question and are distinct from various other kinds of explanations (e.g., causal, mechanical, etc.) discussed in the philosophy of science literature. They claim that minimal models are explanatory insofar as they provide a story about why a class of systems all exhibit some large-scale behavior.
I argue that the simulations discussed above are actually answering both types of questions. The type (i) why question is “Why does our universe have the particular statistical distribution of matter that it does?” The answer these N-body simulations give would include the values of the cosmological parameters in the cosmological model being tested. The type (ii) why question is “Why does the universe exhibit structures across a variety of cosmological parameters?” This question (which is closer to the one minimal conditions modeling is meant to answer) could then be answered by both linear theory and N-body simulations. Both would point to gravitational forces acting on small perturbations to bring about clustering behavior.
Ultimately, these N-body simulations do address the minimal conditions needed for structure formation. More importantly, however, coupled with observations, they serve as tests of instantiations of various cosmological parameters. For such simulations to fulfill this role requires that they be at least as precise as the observations they are being compared to. Considering the huge computational expense involved in running these simulations, it is unsurprising that cosmologists are looking for new methods to employ in these contexts.
Investigating the Large-Scale Structure of the Universe with Machine Learning
ML has a long history of use in astronomy and cosmology. Some of the first uses included scheduling observation time and performing adaptive optics for array telescopes (see Serra-Ricart et al. 1994 for a review of uses in the early 1990s). Contemporary uses of ML range from identifying structure in simulations to interpreting observations of the cosmic microwave background.Footnote 8 The role of ML in the next decade of cosmology was the topic of a recent white paper submitted as part of the Astro2020 Decadal Survey on Astronomy and Astrophysics organized by the National Academy of Sciences. There, Ntampaka and collaborators argue that the upcoming “era of unprecedented data volumes” (2019, 3) in cosmology provides rich opportunities to employ ML techniques. They further argue that cosmology is uniquely positioned not only to benefit from advances in ML, but to itself provide “opportunities for breakthroughs in the fundamental understanding of ML” (Ntampaka et al. 2019, 5). They consider the corresponding “temptation to choose expediency over understanding” (2019, 3) but outline some methods for improving the interpretability of ML. It is with these same worries and goals that I have chosen to focus on a cosmological case study in this paper.
The case study presented here uses ML to address the second of the two sources of computational expense in the context of N-body simulations. Recall, from Sect. 3.2, that the first source is the number of particles needed for any individual N-body simulation while the second is the number of simulations needed for MCMC analysis. Cosmologists have begun using ML methods like Gaussian processes and artificial neural networks to quickly fill in the relevant parameter space using a limited number of simulations, thus addressing the second source of computational expense. One of the first groups to employ ML to study large-scale structure formation was Katrin Heitmann’s research team. They call their methodology “emulation,” describing it as a “generalized interpolation method capable of yielding very accurate predictions” (Heitmann et al. 2009, 2). But what is an emulator, how is it different from a simulation, and how does it use machine learning to reduce computational expense?
Developing an emulator requires three steps: (i) building a training set (often just a collection of simulation results), (ii) regressing for analytic functions that mimic the training set data, and (iii) evaluating those functions at the desired interpolation points while accounting for interpolation error (Schneider et al. 2011; Kern et al. 2017).Footnote 9 As Kern et al. note, “The emulator itself is then just the collection of these functions, which describe the overall behavior of our simulation” (2017, 2–3). Emulators do not include physical laws or principles. Rather, they statistically characterize the space of simulation results and allow for sophisticated interpolation.
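The following sketch is meant only to make the three steps vivid; the stand-in simulation, the radial-basis-function interpolator, and all of the numbers are illustrative assumptions rather than the actual machinery of the emulators discussed here.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

def run_simulation(theta):
    """Stand-in for an expensive N-body run: returns a toy 'power spectrum'."""
    k = np.logspace(-2, 1, 50)
    return theta.sum() * k**-1.5

# (i) Build a training set: design points in a 5-parameter space and the
#     simulated spectra produced at each of them.
design = np.random.default_rng(1).uniform(size=(37, 5))
spectra = np.array([run_simulation(theta) for theta in design])

# (ii) Regress for functions that mimic the training data.
emulator = RBFInterpolator(design, spectra)

# (iii) Evaluate those functions at a new, unsimulated parameter point.
predicted_spectrum = emulator(np.array([[0.3, 0.8, 0.7, 0.96, 0.05]]))
```

The “emulator” here is nothing more than the fitted interpolating functions, which is precisely the sense in which Kern et al. describe an emulator as a collection of functions characterizing the simulation’s behavior.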
Below, I present the methodology used to construct two emulators, the Cosmic Emulator and PkANN. I have chosen these two emulators for a variety of reasons. First, the two emulators share a relatively simple goal: to fill in the parameter space needed for MCMC analysis. This simple goal makes them a valuable case study for investigating the role of emulators in broader research contexts and for providing a proof of concept that ML can deliver scientific understanding. Second, both research groups express skepticism about the ability of their emulators to deliver scientific understanding (as discussed in Sect. 1). I will argue that, when understood in the larger research context, these emulators can overcome the worries expressed by their developers and provide explanations.Footnote 10
The Cosmic Emulator
The construction of Heitmann’s emulator, the so-called Cosmic Emulator, proceeds according to the three steps outlined above. Heitmann et al. begin with a five-dimensional parameter space, with each dimension corresponding to one of the five cosmological parameters they are investigating. They then decide on a methodology for sampling the parameter space and run the appropriate simulations. They employ Symmetric optimal Latin Hypercube (SLH) sampling, a method that enforces even coverage of the parameter space and is thought to be most appropriate when one is ignorant of how the function of interest varies across that space (Habib et al. 2007, 5). Using SLH sampling, Heitmann et al. find that only 37 cosmological simulations are necessary to train their emulator. In other words, they need only 37 points in the five-dimensional parameter space (Heitmann et al. 2009, 163).
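Plain Latin hypercube sampling (without the symmetry and optimization refinements of SLH) can be generated with standard library routines. The sketch below is meant only to convey the space-filling idea; the parameter ranges are hypothetical, not those of Heitmann et al.

```python
from scipy.stats import qmc

# A Latin hypercube design: 37 points in a 5-dimensional unit cube, with each
# parameter's range stratified into 37 bins so no bin is sampled twice.
sampler = qmc.LatinHypercube(d=5, seed=0)
unit_design = sampler.random(n=37)

# Rescale from the unit cube to (hypothetical) physical parameter ranges.
lower = [0.02, 0.10, 0.85, 0.60, -1.3]
upper = [0.03, 0.15, 1.05, 0.90, -0.7]
design = qmc.scale(unit_design, lower, upper)   # shape (37, 5)
```

Each row of such a design would then be handed to an N-body code as one set of cosmological parameters to simulate.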
Having built their training set, Heitmann and her collaborators decide to use Gaussian Process (GP) modeling to interpolate among the simulation runs. GP modeling works by finding, through Bayesian inference, the function that best characterizes the data. As noted by Mohammadi et al., GPs have several advantages: they can be used to fit any smooth, continuous function, and they are considered “non-parametric,” meaning “no strong assumptions about the form of the underlying function are required” (Mohammadi et al. 2018, 2). This makes them especially compatible with the sampling methodology employed for the Cosmic Emulator.
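For concreteness, a GP regressor can be fit to a design of this kind with off-the-shelf tools; the kernel choice and the toy data below are my own illustrative assumptions and do not reproduce the Cosmic Emulator’s actual implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(2)
design = rng.uniform(size=(37, 5))            # 37 training points, 5 parameters
spectra = design @ rng.normal(size=(5, 50))   # placeholder simulation outputs

# A smooth, stationary kernel; no parametric form is assumed for the
# underlying function, only that it varies smoothly with the parameters.
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(5))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(design, spectra)   # kernel hyperparameters tuned by maximizing the marginal likelihood

# Interpolation: predicted spectrum, with uncertainty, at a new parameter point.
mean, std = gp.predict(rng.uniform(size=(1, 5)), return_std=True)
```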
Once the emulator is trained, the final step in the process is to test the emulator. Heitmann et al. consider a mock data set of 10 test cosmological models and find that emulation reproduces the nonlinear matter power spectrum to within 1% accuracy (2009, 167). Ultimately, the fully trained emulator allows an investigator to use MCMC analysis to infer the values for various cosmological parameters that would have given rise to a particular observation.
PkANN
Artificial neural networks (ANNs) are another method of machine learning that has been employed by cosmologists. In a series of two papers, Agarwal et al. (2012, 2014) present PkANN, an ANN-based emulator.Footnote 11 The main advantage ANNs have over GPs in this context is their ability to cover a broader parameter space, but the drawback is that ANNs require a much larger training set of simulations.
Agarwal’s methodology, like Heitmann’s, begins with Latin hypercube sampling. Then, instead of a GP, they train an ANN on this simulation set and evaluate the ANN’s accuracy. Fundamentally, an ANN consists of interconnected nodes that can be thought of as artificial neurons with activation functions; each activation function maps the input a node receives to its output. These nodes are arranged in “layers,” with each node passing its output to nodes in the next layer. The connections linking nodes do not merely transmit this information; they also multiply the output of the previous node by a “weight” as it travels along the connection to the next node. Adjusting these weights to get better results constitutes the “training” of an ANN. Though one could in principle perform this training by hand (adjusting the weights until the network’s output matches the desired output), the sheer number of weights in an ANN usually makes doing so infeasible. Instead, the ANN typically adjusts the weights itself, a process referred to as “learning.” Training an ANN in this way requires a labeled training set. The ANN compares its own output to a known answer from the training set, quantifies the difference with a “cost function,” and then shifts the weights in whatever direction yields a better evaluation from the cost function. In sum, whereas Gaussian processes use Bayesian inference to find the function that best characterizes the data, ANNs are essentially trained to solve a minimization problem: finding the weight values that make their cost function as small as possible.Footnote 12
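The training loop just described can be written out in a few dozen lines. The sketch below is a toy with made-up data and a single hidden layer, not PkANN’s actual architecture, but it shows the forward pass, the cost function, and the weight updates that constitute “learning.”

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy labeled training set: 5 input parameters -> a 50-bin "power spectrum".
X = rng.uniform(size=(200, 5))
Y = X @ rng.normal(size=(5, 50))     # placeholder "known answers"

# One hidden layer of artificial neurons with tanh activation functions.
W1, b1 = 0.1 * rng.standard_normal((5, 32)), np.zeros(32)
W2, b2 = 0.1 * rng.standard_normal((32, 50)), np.zeros(50)
learning_rate = 0.5

for step in range(2000):
    # Forward pass: each layer weights its inputs and applies its activation.
    H = np.tanh(X @ W1 + b1)
    Y_hat = H @ W2 + b2

    # Cost function: mean squared difference from the known answers.
    error = Y_hat - Y
    cost = np.mean(error**2)

    # Backward pass: shift every weight against the gradient of the cost.
    dY = 2 * error / error.size
    dW2, db2 = H.T @ dY, dY.sum(axis=0)
    dH = (dY @ W2.T) * (1 - H**2)
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
```

In PkANN’s case the inputs are cosmological parameters and the outputs are matter power spectra; as with the Cosmic Emulator, the trained network then serves as a fast stand-in for simulations during MCMC analysis.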
Once they have a trained ANN, Agarwal et al. use it to fill in the required parameter space. They then use MCMC analysis to infer the values of various cosmological parameters. Their fully trained emulator outperforms the Cosmic Emulator but requires an order of magnitude more simulation runs for its training set.