A comparison of optimisation algorithms for high-dimensional particle and astrophysics applications

Optimisation problems are ubiquitous in particle and astrophysics, and involve locating the optimum of a complicated function of many parameters that may be computationally expensive to evaluate. We describe a number of global optimisation algorithms that are not yet widely used in particle astrophysics, benchmark them against random sampling and existing techniques, and perform a detailed comparison of their performance on a range of test functions. These include four analytic test functions of varying dimensionality, and a realistic example derived from a recent global fit of weak-scale supersymmetry. Although the best algorithm to use depends on the function being investigated, we are able to present general conclusions about the relative merits of random sampling, Differential Evolution, Particle Swarm Optimisation, the Covariance Matrix Adaptation Evolution Strategy, Bayesian Optimisation, Grey Wolf Optimisation, and the PyGMO Artificial Bee Colony, Gaussian Particle Filter and Adaptive Memory Programming for Global Optimisation algorithms.


Introduction
Typical theories in various branches of science, such as particle physics, particle astrophysics and astrophysics, are formulated as parametric models. To make predictions in these models, one needs to specify the values of a set of free numerical parameters. Comparison with experiment, in turn, can constrain the values of these parameters and single out a theory that matches observation the best. Such parameter estimation and model selection are fundamental components of the scientific method in the physical sciences that we hope will lead us to the best description of nature.
In the modern era, the exploding size of experimental data sets, coupled with complicated theories, has introduced a substantial computational complexity in parameter extraction and model selection. It is frequently necessary to perform computationally expensive simulations of experiments, and further problems result from the moderately large dimensionality of typical beyond-Standard Model physics models, whose parameter multiplicity often ranges up to O (100). In parameter spaces of this size, sampling randomly or on a uniform grid is both inefficient and unlikely to lead to robust statistical conclusions [1].
We have entered a period in fundamental physics where one experiment is unlikely to unambiguously determine the next theory of particle physics or cosmology. Consequently, we must combine clues from many different branches of observation. Fortunately, at the same time, exponentially increasing computing power has allowed scientists to become more ambitious both in the scope of observation and theoretical calculation. Thus, parameter extraction and model comparison can be performed at an extraordinary scale to find the theory that best describes the physical world, if one employs efficient and ingenious sampling algorithms.
The likelihood function is a key quantity in statistical inference (see e.g., Ref. [2] for a pedagogical introduction), as it tells us the probability of the observed experimental data for a particular set of model parameters. If we are considering data from several experiments, the likelihood function may often be written as a product of likelihoods, one for each of the individual experiments, if the individual experiments are independent. In frequentist statistics, one can obtain consistent estimators for the values of the parameters of the model by finding the set of parameters that maximises the likelihood, and use the maximum likelihood itself to construct a test-statistic to perform statistical tests (see e.g., Ref. [3] for an introduction to likelihood-based tests in particle physics). The difficulty is that the likelihood function is rarely known as a simple function of the original parameters and so we cannot find the maximum analytically. In fact, although in our setting we assume a tractable likelihood, evaluating the likelihood may still involve non-differentiable forward simulations of experiments (see e.g., Ref. [4]) and so even derivatives are unavailable. We are thus forced to use derivative-free numerical optimisation algorithms (see Ref. [5] for a review) to explore the likelihood function.
Furthermore, likelihood functions of interest often contain multiple modes, i.e., several distinct local maxima. In this setting, exploring the likelihood function and locating the global maximum may be extremely challenging, as we risk getting stuck in a local maximum. For this reason, we focus on stochastic algorithms that, e.g., step out of a local maximum with a particular probability, and neglect local optimizers commonly used in physics [6]. The simplest such approach is repeated random sampling from the entire parameter space (followed by picking either the highest or lowest value found.). This, however, is known to be deficient for two reasons. First, as the dimension of the parameter space increases, the number of samples that need to be drawn increases exponentially if the same point density is to be maintained. In practice this would thus come with an exponentially higher demand for computational power, which is worsened by the fact that the function evaluations are typically already costly themselves. Second, it is highly inefficient in most physical examples, as the high likelihood regions of the parameter space usually occupy a very small region of the total multidimensional volume. The past decades have thus seen the development of a series of novel sampling and optimisation procedures, particularly metaheuristic ones, many of which have been utilised in particle astrophysics applications .
The purpose of this paper is to survey a wide range of optimisation techniques that, to the best of our knowledge, have not received mainstream use in particle astrophysics applications. The different techniques are explored by different authors of this paper in the form of the following challenge: use your optimisation technique of choice to find the optima of a set of reference functions, including both analytic examples and a 12-dimensional parameter space representing a supersymmetric model called the phenomenological MSSM7 [106] (a popular theory of beyond-Standard Model Particle Physics). Apart from the common set of reference functions and the use of a common test framework, there was no common tuning of the free parameters of the techniques. Although this introduces a human factor in the experiments, we believe this is representative of a real-life application of any one of the explored methods. The work was completed within the DarkMachines community 1 , which aims to develop new approaches for dark matter research thorough closer collaboration with machine learning and data science experts. It is worth noting that the algorithms we compare of course have many uses beyond maximising likelihood functions, such as minimisation of fine-tuning or the optimisation of the hyperparameters of an algorithm. Our ultimate aim is to provide a self-contained overview of optimisation methods that can be used as a reference by researchers working in the physical sciences. For the purposes of this publication, we developed a testing framework in Python with interfaces to codes representing each of the optimisation techniques we use below.
This paper is structured as follows. In Section 2, we define our optimisation problem in more detail, and provide a description of each of the techniques used. We describe the test functions that we use for our comparative studies in Section 3, and detail the results in Section 4. Finally, we present conclusions in Section 5, and describe our Python framework for implementing the scanning techniques in Appendix A. The best found solutions for each investigated function and algorithm, together with the corresponding algorithm's hyperparameters, can be found in Appendix B.

Optimisation
The problem that we address in this review is the following. Given a deterministic function f : R n → R, defined over a domain of interest given by lower and upper bounds on the function parameters, what is the optimum of the function? In particle astrophysics applications, the function can represent the likelihood of observed data given a particular physical theory, and the optimum then gives the maximum likelihood estimate of the parameters, which is an important quantity in frequentist inference. It is in fact more usual to minimise the negative log-likelihood function, rather than maximise the likelihood.
Optimisation techniques can be divided into categories, based on whether they require a knowledge of the derivatives of the function or not (either analytical or numerical). Since derivative information is not always available in the physical sciences, we focus on techniques that only require evaluations of the likelihood function itself. Arguably the most challenging optimisation problems in particle astrophysics applications arise in global fits of beyond-Standard Model physics models. Popular techniques for performing frequentist inference on dark matter models have included Markov Chain Monte Carlo techniques (see e.g., Ref. [107]) and nested sampling [108] which, although designed for Bayesian computation, can be repurposed for frequentist studies [7-22, 24, 26-28, 30-98]. In recent years, genetic algorithms and, to an even greater extent, Differential Evolution, have proven capable of adequately exploring very complex likelihood functions in multiple theories of dark matter [23,106,[109][110][111][112][113]. In selecting techniques for this study, we have focused on global optimisers that are expected to provide comparable performance to Differential Evolution. We include the Differential Evolution implementation previously studied in [114] in order to benchmark the performance of our newly-explored techniques. The full list of algorithms that we explore is as follows.

Differential Evolution
Differential Evolution [DE; [115][116][117][118] is a population-based heuristic optimisation strategy belonging to the class of evolutionary algorithms. DE does not rely on derivatives of the function being optimised, and is often the algorithm of choice for highly multimodal or otherwise poorly-behaved objective functions.
DE consists of evolving a population of N P individuals or 'target vectors' {X g i }, of specific points in the parameter space, for a number of generations. Here i refers to the ith individual, and g corresponds to the generation of the population. The initial generation is generally selected randomly within the parameter intervals to be sampled.
One generation is evolved to the next via three main steps: mutation, crossover and selection. The simplest variant of the algorithm is known as rand/1/bin; the first two parts of the name refer to the mutation strategy (random population member, single difference vector), and the third to the crossover strategy (binomial).
Mutation proceeds by identifying an individual X i to be evolved, and constructing one or more donor vectors V i with which the individual will later be crossed over. In the rand/1 mutation step, three unique random members of the current generation X r1 , X r2 and X r3 are chosen (with none equal to the target vector), and a single donor vector V i is constructed as with the scale factor F being a parameter of the algorithm. A more general mutation strategy known as rand-to-best/1 also allows some admixture of the current best-fit individual X best in a single donor vector, according to the value of another free parameter of the algorithm, λ. Crossover then proceeds by constructing a trial vector U i , by selecting each component (parameter value) from either the target vector, or from one of the donor vectors. In simple binomial crossover (the bin of rand/1/bin), this is controlled by an additional algorithm parameter Cr. For each component of the trial vector U i , a random number is chosen uniformly between 0 and 1; if the number is greater than Cr, the corresponding component of the trial vector is taken from the target vector; otherwise, it is taken from the donor vector. At the end of this process, a single component of U i is chosen at random, and replaced by the corresponding component of V i (to make sure that U i = X i ).
Selection simply faces the target vector X i off against the trial vector U i , with the vector returning the best value of the objective function retained for the next generation. In this way, each member of a generation is pitted against exactly one trial vector in each generation step.
A widely-used variant of simple rand/1/bin Differential Evolution is so-called jDE [119], where the parameters F and Cr are optimised on-the-fly by the Differential Evolution algorithm itself, as if they were regular parameters of the objective function. An even more aggressive variant known as λjDE [114] is the self-adaptive equivalent of rand-to-best/1/bin, where F, Cr and λ are all dynamically optimised.
In this paper, we run two different software implementations of Differential Evolution. The first is the open-source implementation of the λjDE algorithm contained in the Diver package 2 . We use this via the pyScannerBit interface to the ScannerBit package of the GAMBIT code for beyond-Standard Model global statistical fits [109,114,120]. We also run the jDE [121] and iDE [122] algorithms implemented in the PyGMO package [123]. In doing so, we have varied the number of generations and the parameter adaptation scheme which is used to optimize the weight coefficient and the crossover probability.

Particle Swarm Optimisation
Particle Swarm Optimisation [PSO; 124,125] is another population-based evolutionary algorithm that does not make use of derivatives. Here, each member of the population of parameter samples ('the swarm') is also given a velocity. In each generation step, the positions of other particles in the swarm are used to update each particle's velocity. The position of each particle is updated by allowing it to move along its velocity vector for a fixed amount of time.
The standard velocity update for particle i in generation g is v g+1 such that x g+1 Here r 1 and r 2 are uniform random numbers between 0 and 1, x i,pb is the ith particle's personal best-fit position so far (i.e. in any generation), x gb is the global best-fit position so far (i.e. by any particle, in any generation), and ω, φ 1 and φ 2 are free parameters of the algorithm.
In this paper, we will make use of a self-adaptive variant inspired by jDE [121] that we will refer to as j-Swarm, where ω and/or φ 1 and φ 2 can be dynamically optimised in the course of a run by treating them as parameters of the objective. The implementation that we use is bundled in ScannerBit, and will be released within GAMBIT and described in a forthcoming GAMBIT publication.

CMA-ES
The Covariance Matrix Adaptation (CMA) Evolution Strategy (ES) is another evolutionary optimisation algorithm based on the idea of natural selection [126]. From an initial point in the n-dimensional parameter space, x (0) , a set of λ new points (called a population) are sampled from a multivariate normal distribution with covariance matrix σ (g) 2 C (g) , where k = 1, . . . , λ and g counts the number of generations. The optimisation function is then evaluated at all x The obtained values are used to sort the points and the best µ points (the parents) used to calculate where w j > 0 are weights with µ j=1 w j = 1 and j is the sorted index running from best to worst point.
For optimum performance of the algorithm, the step size σ (g) and the matrix C (g) should be updated to maximize the probability that the new generation is closer to the minimum of the objective function. The optimal update has been found to be given with c is a cumulative path, storing information about the direction taken during previous steps, c c < 1 is the learning rate for the cumulative path, c cov < 1 is the learning rate for the covariance matrix and µ cov ≥ 1 controls the ratio between cumulation and rank-µ updates.
The parameter µ −1 eff = µ j=1 w 2 j represents the effective selection mass. The cumulation update adapts the matrix to the large scale gradient of the optimisation function, while the rank-µ update adapts to the local gradient.
The optimal update for the step size is based on the absolute length of the cumulative path. If consequent steps are taken in the same direction, the length is expected to be large, while for steps in random direction, the length is shorter. In the former case, fewer and longer steps can be taken, while in the latter the step size should be decreased. The update is therefore The adaptation speed is controlled by the learning rate c σ and the damping parameter d σ .
The CMA-ES is an invariant and stationary optimisation algorithm, meaning that its tuning parameters are insensitive to the objective function and depend almost exclusively on the dimensionality of the parameter space. Only a brief overview of the algorithm has been given here, for details and explanations see Ref. [126]. The implementation used in this work is from the pycma package. 3

Bayesian Optimisation
Bayesian Optimisation [BO; [127][128][129] is a set of techniques that attempts to find the optimum x * of an objective function f (x) with the minimum number of function evaluations, which is particularly useful when the function is computationally expensive to evaluate. It works by explicitly approximating the objective function f (x) with a probabilistic regression model, called a surrogate model, that can predict the outcome of yet unseen samples to make a more informed decision of which samples to evaluate next. The initial surrogate model is trained on a set of random samples of the objective function, or a set of samples selected by any other sampling technique. The surrogate model needs to be probabilistic, and popular choices are Gaussian processes or probabilistic ensembles. Every further sample of the objective function f (x) counts as a training point x, continuously updating the surrogate model to a new posterior distribution that, after a given number of samples D{(x i , y i )}, gives our best belief of what the objective function f (x) looks like. It can also provide some uncertainty at each point in the parameter space by using Gaussian likelihoods. An acquisition function α(x) is used to choose where to sample next, taking the latest posterior of the surrogate model as an input. The acquisition function α(x) is easy to evaluate and can be sampled with techniques such as Thompson sampling. In this way, cheap samples of the surrogate model are used to guide the sampling, rather than expensive samples of the objective function f (x) itself. To avoid getting stuck in local minima, the acquisition function trades off exploration and exploitation, exploring regions of high expected outcome and those where the model shows high uncertainty, and hence requires more training data than is currently available. The result of this procedure is that fewer samples are needed to find the optimum x * of the objective function f (x), but more computation is required for predicting each next sample to try. More formally, the method consists of the following steps.
Step 1: Define a surrogate model Assuming that we use a Gaussian process (GP), we also need to choose a prior distribution and a covariance function, or kernel, that defines the shape of the regression curves. The surrogate model can be written as f (x) ∼ GP (µ(x), k(x, x )). We can assume a normal prior with µ(x) = 0 without loss of generality. A popular choice for the kernel k is the radial basis function, also known as the square exponential kernel, in which the length scale λ and signal variance σ 2 control how flexible or flat the surrogate model can be: Step 2: Choose an acquisition function A number of acquisition functions are commonly used, each defining a specific way to trade off exploration and exploitation. Given a surrogate model trained on n samples that can return, for every input x, the predicted mean µ(x) and standard deviation σ(x), as well as the current best value f * and a tuneable parameter ψ which balances exploration and exploitation, we can describe the following popular choices: 1. Maximum probability of improvement (MPI): 2. Expected improvement (EI): 3. Upper confidence bound (UCB): In these equations Φ and φ are the CDF and PDF of the standard normal distribution, respectively. The acquisition function α(x) is sampled using, for instance, Thompson sampling, and the sample with the highest acquisition score will be evaluated next on the actual optimisation function. After evaluation, this sample becomes a new training point, the surrogate model is updated, and the next sample is selected. This process is iterated until an acceptable result is reached, a certain budget is exhausted, or when all acquisition scores fall below a predefined threshold. The final recommendation of the optimisation process is the best observed result x * or the optimisation of the mean of the updated posterior distribution on all observations D{(x i , y i )}.

Trust Region Bayesian Optimisation
Standard Bayesian Optimisation suffers from scalability issues in high-dimensional problems. This is mainly due to the implicit homogeneity assumption of most surrogate models, and an overemphasis on exploration in the used acquisition functions. The acquisition function α(x), also becomes difficult to optimize in high-dimensional problems as it has the same number of dimensions as the number of dimensions of the input space.
For instance, the commonly used Gaussian process surrogates in Bayesian Optimisation assume a constant length scale λ and signal variance σ in the search space. This is often not the reality of high dimensional functions, which tend to be flat in most of the space between local or global optima. Moreover, in high-dimensional spaces, samples are few and far between, meaning that the surrogate will exhibit high uncertainty and cause the acquisition function α(x) to focus predominantly on exploration instead of exploitation. This will harm the performance of Bayesian Optimisation.
To overcome those issues, the Trust Region Bayesian Optimisation (TuRBO) algorithm [130] fits a set of local models and determines how to allocate samples from those such that the global optimum x * is found most efficiently. It performs a collection of simultaneous local optimisation executions using independent Gaussian processes.
Through this procedure, each Gaussian process enjoys the typical benefits of Bayesian modeling and these local surrogates allow for heterogeneous modeling of the objective function f (x) without suffering from over-exploration. In order to optimize all surrogates, TuRBO leverages an implicit multi-armed bandit strategy at each iteration to allocate samples between these local areas and thus decide which local optimisation runs to continue.
Each Gaussian process resides in a Trust Region, a hyperrectangle centered at the best solution found so far, x * , under a surrogate model with a base side length L that is a function of the length scales λ of the Gaussian process: The local optimisation runs use a batch acquisition function that is restricted to select points X that lie within the Trust Region hyperrectangle.
The base side length L evolves during the optimisation process. If L contains all the input space X the algorithm becomes standard Bayesian Optimisation. The trade-off between exploration, large L, and exploitation of good solutions, small L, becomes critical. A popular criterion is to double L in size after τ results better than x * , and halve the size in the other case.
TuRBO maintains m trust regions simultaneously, selecting max_eval candidates drawn from the union of all trust regions and updating all local optimisation problems for which candidates were drawn. TuRBO gets the i th candidate from all the trust regions by drawing a sample of the posterior mean function from all the Gaussian Processes per trust region and getting the minimum from them all, which is a greedy Thompson Sampling approach: where f l is a sample from the GP corresponding to trust region l.

Grey Wolf Optimisation
The Grey Wolf Optimisation algorithm [131] falls under the category of swarm intelligence algorithms, and takes inspiration from the hunting mechanism and leadership hierarchies in packs of grey wolves. Like other swarm intelligence algorithms there are two main phases: exploration and exploitation. The algorithm takes inspiration from the behaviour of grey wolves here through their tracking and attacking of prey, in which the social hierarchy of the pack of grey wolves also plays an important role. In the grey wolf analogy the wolves are associated with the candidate solutions in the swarm and the prey is associated with the best solution to the optimisation problem. The search agents in the algorithm are assigned to one of four categories, α, β, δ, or ω, which are meant to imitate the social hierarchy of the grey wolves. The fittest solution is assigned to α, the second fittest to β, the third fittest to δ, and the rest to ω. During the optimisation the search for the optimal solution is usually led by the α solution, however the β and δ also have influence. The algorithm is initiated by assigning the search agents to random solutions in the search space, and these candidate solutions are assigned to the categories based on their fitness. The positions of each search agent are then updated, with the updates containing stochastic components as well as influence from the positions of the α, β, and δ search agents. In order to model the search strategy of the search agents the following definitions are required: (2.13) These vectors have a length equal to the dimension of the search space. The vectors A and C will be used to update the positions of the search agents, where r 1,2 and vectors of random numbers in [0, 1], and a has components which decrease linearly from 2 to 0 over the course of the iterations. Vectors capturing the distance between each agent and the three fittest agents are defined as: 14) The positions of each agent are then updated according to: where t labels the current iteration, and: The form of these updates are such that the α, β, and δ agents estimate the position of the optima by encircling it, and the ω agents search randomly around this position. The sequence of updates is continued until some condition on the fitness of the optimal solution is met, or for a fixed number of iterations. The only hyper-parameters in this algorithm are the number of agents to use, which should be at least 4. It is possible to introduce other hyper-parameters by varying the possible range of the numbers in the vector a also, but this is not part of the minimal set-up described in [131].

PyGMO Artificial Bee Colony
The Artificial Bee Colony algorithm is inspired by the intelligent behaviour of honey bees in their search for the right food sources [132,133]. The algorithm has an automated mechanism to balance exploration and exploitation. In the experiments performed for this paper we use the implementation of the Artificial Bee Colony in PyGMO [123], for which the pseudocode can be found in Listing 2 of Ref. [134]. Here we restrict ourselves to a more conceptual explanation of the algorithm. The Artificial Bee Colony algorithm keeps track of SN active data points x s , where s runs from 1 to SN . The locations of these data points are uniformly initialised and the function value f s for each of these data points is evaluated. In the bee-analogy these data points can be seen as the locations of food sources and the function value can be interpreted as being related to the food gain from each of these sources. The goal of the algorithm is then to find the food source (i.e. data point) with the highest gain (i.e. best function value).
Finding this data point is done iteratively, where each iteration consists of three phases. In the first phase each active data point x i is used as reference to explore a new location v i . The proposal location v i is calculated by moving, for each dimension j, in the direction of another, randomly selected active data point where ϕ ij is a uniform random number between 0 and 1. For the proposal v i the functional value f s is calculated and if this value is better than f s , x i and f s will be replaced by v i and f s respectively. The old location and function value are kept if this test fails.
For each data point the number of failed update attempts is kept track of. This information can be used to give up on locations x i that fail to be updated for many subsequent iterations. This is done in the second phase, where these dead data points (as determined with respect to some user-configured threshold) are reinitialised uniformly in the parameter space.
These first two steps can be seen as a form of exploration: the parameter space is explored by sampling new data points and evaluating their function values. The third -and last -step of each iteration is a form of exploitation, in which Equation 2.17 is used to perform updates on the active data points. Which active data points are updated is determined by assigning to each data point x i a fitness-score 4 . This score is dependent on the functional value f i of each data point: This score is then used to calculate an update probability for each data point The update attempts are distributed over the active data points using Bernoulli experiments with respect to these probabilities. Just like in the first phase data points are only updated if the proposal point v i improves on the function value of the original location x i . Also in this phase the number of update attempts is kept track of.
The automated balancing of exploration on the one hand and exploitation on the other makes the Artificial Bee Colony algorithm a potentially strong optimisation algorithm. Moreover, it has the nice property that as the number of iterations increases, data points will tend to move towards minima, because updates are only performed if they improve the function value for the data point under consideration. This causes the updates of Equation 2.17 to provide increasingly more resolution on these minima.

Gaussian Particle Filter
The Gaussian Particle Filter was first explored in Ref. [135]. The scanning algorithm starts off by collecting an initial seed of randomly generated points. The number of points that are generated is a user definable quantity, as are the range and the sampling prior (uniform or logarithmic). By then using these randomly generated points as seeds, a multi-dimensional Gaussian distribution is used to draw new points around the location of those seed data points. The number of data points sampled from each Gaussian is proportional to the relative function value of the target function at the seed. The width of these Gaussians varies over the run time of the algorithm, starting at (O(1 − 2)) at the start of the run, in order to cover a large part of the parameter space. In higher iterations, this width is shrunk by multiplying by a factor < 1. The speed at which it reduces is a hyperparameter of the algorithm and will henceforth be called the width decay.
The implementation of the algorithm that we employ in this paper 5 allows for the use of a uniform or logarithmic exploration, indicating if the standard deviations of the multi-dimensional Gaussian distributions should be multiplied by the coordinate x. This is configured through the boolean parameter logarithmic.
In each iteration N sample data points are sampled from N Gaussians Gaussians and the function values for these samples are calculated. These samples are combined with the best fraction of the (already evaluated) samples that formed the seed for the Gaussians. This fraction is named the survival rate. From the resulting dataset, N Gaussians data points (i.e. those with the lowest function value) are selected as the seed for the Gaussians of the next iteration.
The algorithm is highly flexible, and is aided by adding user knowledge about how the function behaves. This can help to determine an appropriate value to choose for the width of the Gaussians. The stopping criterion of the implementation that we employed was based entirely on the width of the Gaussians, which we controlled using a width scheduling scheme. We did not use any information on the function values found, allowing the algorithm to potentially continue to sample after already finding the global minimum.

AMPGO
Adaptive Memory Programming for Global Optimisation (AMPGO) is a solver that uses multiple steps to solve a global optimisation problem [136]. First, a set of initial points in parameter space is chosen randomly. Next, a local solver is used to find the local optima of those points. When these are found, a method called Tabu Tunneling [137] is used to find a different local optimum in the parameter space from which the local optimizer is started again. Using this iterative approach, the global optimum can be found. In Figure 1 a visualisation of this approach is shown. The goal is to find the global optimum w and the initial random point is s 0 . As a first step, using the local optimizer, the point s is found. Then, using the Tabu Tunneling method, one tries to find a different point in parameter space with a value as good as s. In this example, that point is t. From t the local optimisation algorithm is invoked again to find the point w.
Any method can be used for the local solver; we employed L-BFGS-B [138]. A full mathematical description of the algorithm can be found in Ref. [136]. The AMPGO website [139] contains benchmarks of 184 multidimensional test functions against multiple different algorithms. A Python implementation of AMPGO is available on GitHub 6 .
The algorithm relies heavily on tunneling towards a new region in parameter space with the same or lower function value. However, as the number of possible tunnelling directions from any point in parameter space is infinite, there are significant hurdles to overcome when trying to find a region with a lower or equal function value: • When the global optimum is very narrow, the probability of tunnelling into the basin of attraction of this minimum can be very low (e.g. Analytic Function 1 discussed in Section 3.1); Tunneling Figure 1: Visualisation of the tunneling approach for minimisation as used in AMPGO [136]. From the initial sample s 0 the local optimum at s is found. After it is found, Tabu Tunneling is used to find location t, from which the local gradient-based approach can be reused again to find the global minimum at w.
• When the dimensionality increases, the volume to tunnel through in order to find a new minimum grows exponentially; • When there are many local minima of similar depth, the number of tunneling steps needed can be very high. Each time a new local minimum is found that is deeper than the last, the probability of finding a new minimum with an even lower function value goes down (see e.g. Analytic Function 2).
Given these difficulties, this algorithm can struggle to improve on its local solution in high-dimensional problems.

Characterisation of algorithms
It is not immediately obvious how to compare optimisation algorithms that depend on wildly different strategies, with each having a number of different hyperparameters. The best choice of the hyperparameters in each case will depend on the function being explored, and finding those best values is itself an optimisation challenge of reasonable complexity.
In this paper we attempt to provide the average particle astrophysicist with some knowledge of which algorithm might perform well for a particular type of function, without requiring them to engage in overly-aggressive hyperparameter optimisation. We therefore group the parameters of each of our techniques into four categories: • Convergence parameters: These are parameters that affect the stopping point of an algorithm, or the point at which convergence is presumed to have occurred. A more stringent tolerance condition should have the effect of improving the best fit found by the algorithm, but will require a greater number of likelihood evaluations to reach that point.
• Resolution parameters: These are parameters that affect the resolution with which the target function is explored. A higher resolution will increase the detail with which the likelihood function is mapped around the best-fit point, leading to a better mapping of the 1σ, 2σ and 3σ confidence regions, at the cost of a greater number of likelihood evaluations in total. Higher resolution can also improve the final quality of the best-fit point, independently of convergence parameters.
• Hint parameters: These are parameters that give the algorithm a clue as to how to obtain the best solution. For example, algorithms that are required to start at a certain point would strongly benefit from starting near the global minimum of the function. A wise choice of such a parameter (if this is possible) would reduce the number of likelihood evaluations required to give a good fit.
• Reliability parameters: These are parameters whose general effect is to improve the robustness of a technique.
In Table 1, we provide a grouping of the main parameters of each of our optimisation techniques, where it can be seen that not all techniques have a parameter in each group. Nevertheless, most of the techniques have a resolution and a convergence parameter, and the typical use-case for a particle astrophysicist would be to set these parameters to provide the best possible sampling within the available CPU budget, whilst using out-of-the-box choices for the hint and reliability parameters. Our results in Section 4 will therefore be presented for different choices of the resolution and convergence parameters for each algorithm (where these are available).

Analytic test functions
In this paper two sets of tests are performed. In the first, we compare the performance of each of the above algorithms on a series of analytic test functions, each of which embodies a different pathology that is commonly asociated with difficult physics examples. In the second, a real physics example is used, in order to determine whether the insights gained on analytic test functions generalise to a more realistic setting.
We use four reference functions 7 to evaluate and compare the performance of every optimisation algorithm. The analytic forms of these functions were unknown to the operators of the optimisation algorithms until the results were generated, in order not to introduce biases towards particular global minima.

Analytic Function 1
The equation for Analytic Function 1 reads as follows: 7 Our choice of these functions is greatly influenced by [139].    f

Parameter
where n is the number of dimensions. This function is visualised in Figure 2a for 2-dimensional input. The global minimum is at 2, where a function value of −1 is reached. This function is expected to be difficult to optimise numerically, as this minimum is surrounded by a region with relatively high function values. The global minimum is put at 2 rather than 0 to discourage algorithms that take zero as a starting point. The minimum of this function is searched for in the domain [−30.0, 30.0] n .

Analytic Function 2
The equation for Analytic Function 2 reads as follows: where n is the number of dimensions. This equation is visualised in Figure 2b

Analytic Function 3
The equation for Analytic Function 3 reads as follows: where n is the number of dimensions. This equation is visualised in Figure 2c for 2-dimensional input. This test function has many global minima and the optimisers should be able to find the global minimum, which is at −1. The minimum of this function is searched for in the domain [0.0, 1.0] n .

Analytic Function 4
The equation for Analytic Function 4 reads as follows: where n is the number of dimensions. This equation is visualised in Figure 2d for 2dimensional input. The difficulty in this analytic function is that it has an irregular shape and the parameter range is quite large. The global optimum is at about 420.968746 with f (420.968746) 0. The minimum of this function is searched for in the domain [−500.0, 500.0] n .

Particle astrophysics test problem
In addition to the test functions described above, it is interesting to test our various optimisation algorithms on a realistic particle astrophysics problem. A leading use of optimisation techniques in particle astrophysics is in global fits of models beyond the Standard Model of particle physics. These add a number of additional parameters to the Standard Model, and one must find the regions of the extended parameter space that are most compatible with current experimental data. In frequentist statistics, this is typically performed by maximising a likelihood function L, which is equivalent to minimising − log L. This problem therefore resembles the problem of minimising the analytic functions.
For our example, we take a recent global fit of a supersymmetric theory performed by the GAMBIT collaboration in Ref. [140], and obtain a fast interpolation of the likelihood function that was originally computationally expensive to obtain. The original fit explored a 7-parameter phenomenological version of the Minimal Supersymmetric Standard Model (the so-called "MSSM7"), which is described by the soft masses M 2 , m 2 f , m 2 Hu , m 2 H d , the trilinear couplings for the third generation of quarks A u 3 , A d 3 and tan β (plus the input scale Q = 1 TeV and the sign of µ, which was chosen to be positive). The mass parameters above are all defined at the Q common scale whereas tan β is defined at m Z . In addition to these supersymmetric model parameters, the original fit added a variety of nuisance parameters, comprising the strong coupling constant, the top quark mass, the local dark matter density, and the nuclear matrix elements for the strange, up and down quarks. Therefore, the global fit was performed on a 12-dimensional parameter space.
To make it possible to compare the performance of the different algorithms considered in this work we have approximated the joint likelihood 8 using a deep neural network as proposed e.g. in [141,142]. The total number of samples collected from the global fit to train the network was about 2.3 × 10 7 . The network consists out of four hidden layers of 20 fullyconnected neurons each, activated through the SELU function [143]. We have normalized the input and output data to a normal Gaussian distribution and split the data into 90% for training and 10% for testing. We have set the batch size for training to 1024 and optimized the performance by halving the learning rate of the Adam optimizer [144] when the loss function, in this case the mean absolute error (MAE), stopped improving. Early stopping was applied to stop training after a couple of these iterations. Figure 3 shows the validation plot of the trained network, where it can be seen that the neural net prediction of the likelihood is well-correlated with the true likelihood. It is important to note that perfect performance of the fast likelihood interpolation is not required, as for our purposes it is sufficient that it provides a suitable proxy for a difficult likelihood function that would typically be encountered in a particle astrophysics application.

Results
In this section, we compare the results of running the optimisation methods described in Section 2 on the functions described in Section 3. To make sure the algorithm was the only difference between all these optimisation experiments (apart from the algorithms' hyperparameters), we performed all experiments using the same Python framework. A description of this framework can be found in Appendix A. A comprehensive overview of the best found result for each of the algorithms on each optimised function (for all explored dimensionalities) can be found in Appendix B.

Analytic Functions
We first present results for the analytic test functions described in Section 3.1. These four functions were explored as 2-dimensional, 3-dimensional, 5-dimensional and 7-dimensional functions. It is expected that the ability of optimisation algorithms to find the true minimum decreases as the dimensionality increases. Each algorithm was run for several sets of its resolution and convergence hyperparameters, which are summarised in Table 1. Figures 4 to 7 show the accuracy with which each algorithm recovered the minimum of each function for each of the dimensionalities, with each circle representing a particular run of each algorithm with a specific choice of hyperparameters. The size of the circles is proportional to log 10 of the total number of likelihood evaluations in that run, whilst the different lines for each algorithm correspond to different dimensionalities (going from 2D for the line on the bottom, to 7D for the top-most line). To illustrate the accuracy, we show the difference between the known global minimum and the best value found for each run.
The results for Analytic Function 1 show that some algorithms never get anywhere close to the global minimum, regardless of the dimensionality of the problem. In ≥ 3 dimensions, all algorithms fail to find the global minimum. Inspecting the function in 2D (see Figure 2a) reveals why -there is a very spiked local minimum that is hard to locate, a problem that will get even harder as the dimensionality of the function increases. The best-performing algorithm, such as it is, is the PyGMO Artificial Bee Colony, which finds the correct minimum in 2D and 3D with a relatively small number of likelihood evaluations. The worst performing algorithms are PyGMO Grey Wolf Optimisation, Gaussian Particle Filter, AMPGO, GPyOpt and the PyGMO implementation of Differential Evolution. In the latter case this may simply be due to the low number of total likelihood evaluations, suggesting that a more stringent set of hyperparameters might yield better performance. This is confirmed by the fact that the Diver performance is apparently better, giving the correct global minimum in 2D with more likelihood evaluations and better, though not adequate, performance in 3D. The Gaussian Particle Filter algorithm works in 2D, but this is almost certainly due to the fact that it has performed a large number of likelihood evaluations in a low-dimensional space. It is outperformed by random sampling in 3D, as are all algorithms except PyGMO Artificial Bee Colony and TuRBO. Finally, it is worth comparing the performance of the two Bayesian Optimisation algorithms. The failure of GPyOpt to find the global minimum in any dimensionality is not unexpected, as a sharply-spiked local minimum is exactly the case that is expected to be missed by Bayesian Optimisation due to its relatively low number of samples of the objective function, and concentration of those samples in areas where the algorithm thinks it has found interesting points. TuRBO, meanwhile, is able to find the global minimum correctly in 2D (and reasonably well in 3D) because it breaks up the space into separate regions, and runs an independent Bayesian Optimisation within each of them. One of these regions is small enough to contain the global minimum as the obvious minimum, rather than the plateau of false minima at the edge of the function range.
For the second analytic function, it is interesting to see that most algorithms are able to find the global minimum. Moreover, the fact that most of these algorithms are able to systematically outperform random sampling indicates that these algorithms are in fact effective methods for optimisation problems that resemble Analytic Function 2, with a large number of local minima but a single clear global minimum. AMPGO is the only algorithm that fails in all numbers of dimensions, being outperformed by random sampling, followed by GPyOpt, which only properly succeeds in 2D. TuRBO performs better, giving adequate performance in 2D and 3D, but at the cost of a large number of total likelihood evaluations. The best algorithm is the PyGMO implementation of Differential Evolution, which gets the correct answer in all dimensionalities with a low number of likelihood evaluations. Diver also performs well, with a higher number of evaluations, but the PyGMO results suggest that fewer Diver evaluations would still give good performance. It is interesting to note that the PyGMO Artificial Bee Colony still performs well, getting the correct answer in all dimensionalities with a relatively low number of evaluations. A final interesting feature of the results in Figure 5 is that -as expected -the performance of each algorithm deteriorates with increasing dimensionality, as can be seen from the increased spread of results for higher dimensionalities.
The results for Analytic Function 3 in Figure 6 show that AMPGO retains its status as the worst algorithm, once again being outperformed by random sampling. However, in 2D the global minimum is almost found, with a fairly modest number of likelihood evaluations. In general, all of the algorithms now show very good performance, which is a reflection of the fact that there are now many global minima in the function, and it is easy to find at least one of them. This is further reflected in the fact that the precise configuration of each algorithm now becomes less important, with much less variation in the results of runs with different hyperparameter settings. Indeed, this is the only example among the analytic functions where a local optimisation approach would work. GPyOpt now shines, and finds the global minimum correctly in all dimensionalities with the smallest number of total likelihood evaluations. PyGMO Artificial Bee Colony still does a good job, with a relatively small number of likelihood evaluations. Comparing the Differential Evolution implementations we see that Diver does not quite get the global minimum in 7D, whilst the PyGMO Differential Evolution implementation finds it in all cases, with fewer likelihood evaluations. TuRBO is able to match the performance of GPyOpt, but with more likelihood evaluations due to the requirement of running many separate optimisations (these extra optimisations are redundant in this case).
Analytic Function 4 has various local minima, but only one global minimum. In Figure 7, we see that random sampling outperforms AMPGO again. PyGMO Artificial Bee Colony gets the answer right in all dimensionalities, as do the two Differential Evolution algorithms, CMA-ES, and Particle Swarm Optimisation. The best performing algorithm is now the PyGMO implementation of Differential Evolution, although again we should caution that Diver may give similar performance for a suitable choice of its hyperparameters. Again we see Bayesian Optimisation fail as the dimensionality increases, although TuRBO is better than GPyOpt in >5 dimensions. Of particular note is the fact that the Grey Wolf Optimisation now does not perform well at all, and seems only to have worked for Analytic Function 2 and Analytic Function 3.
In summarising the performance of the 11 different algorithms on the four different analytic functions, we can ask whether any algorithm emerges with acceptable performance on all of them. A summary of their performance is given in Table 2. PyGMO Artificial Bee Colony emerges as perhaps the best candidate, since it performs well for all functions except Analytic Function 1, and it gives the best performance in that case. AMPGO is poor and consistently worse than random sampling. The success or failure of Bayesian Optimisation (GPyOpt and TuRBO) is interesting to investigate across the analytic functions. It generally fails when there are hidden minima, but the situation can be improved by adding latin hypercube sampling such as that found in the TuRBO algorithm. Differential Evolution is consistently strong in both of the PyGMO and Diver implementations, whilst CMA-ES also shows consistent perfor-mance (all except for Analytic Function 1). Particle Swarm Optimisation is not as consistent across the analytic functions, and where it succeeds it requires a large number of evaluations. Finally, the Gaussian Particle Filter algorithm struggles in higher dimensionalities, and is highly sensitive to details of its configuration.
In Appendix B, Table 3 to Table 6, we show the best minimum found by each algorithm for each analytic function and dimensionality. The hyperparameter settings shown in each case are those that led to the best minimum and, where multiple different runs obtained the same result, the settings are those that needed the smallest number of total likelihood evaluations to obtain that result.  The four horizontal lines for each algorithm belong to the four explored dimensionalities, from top to bottom 7-dimensional (pink), 5-dimensional (purple), 3-dimensional (orange) and 2-dimensional (green). The horizontal axis shows the difference between the (known) log-likelihood at the global minimum and that at the found minimum.

Particle astrophysics test problem
In Figure 8, we show the results for each algorithm for the MSSM7 test example described in Section 3.2. The immediate thing to note is that Diver emerges as the best algorithm, finding the best fit of all algorithms. It comfortably outperforms the PyGMO implementations of Differential Evolution, albeit with a higher number of likelihood evaluations (suggesting, once more, that the PyGMO code may give better results for different choices of the hyperparameters). The cause for this exceptionally strong performance of Diver might be the fact that the training data for the neural network was itself sampled by Diver. Although the training data was created independently of any of the optimisation experiments presented here, a neural network trained on that data might still encode the patterns typically explored by Diver, while not encoding the patterns used by the other algorithms equally well. However, the neural network still provides an example of a physically-motivated function and it is true to say that the non-Diver algorithms were not able to find the minimum of this function. Apart from Diver, many of the algorithms give similar performance, comparing favourably to random sampling. Although random sampling is, for this physics test problem, able to come as close to the same solution as many other algorithms, it needs significantly more function evaluations to achieve this performance. The PyGMO Artificial Bee Colony algorithm is, in a surprising turn of events, not notably better than most of the other algorithms, even though it performed consistently well on the analytic functions.
The two Bayesian Optimisation methods, GPyOpt and TuRBO, are amongst the algorithms that perform better than, or comparable to, random sampling, but they underperform relative to other algorithms in this group. This is likely caused by the dimensionality of the problem (12D); we already saw in the results of the analytic functions that an increase in dimensionality dragged the performance of these algorithms down strongly.
The only exception to the general trend of "at least similar performance to random sampling" is AMPGO, which remains consistently poor. The results are shown as semi-opaque circles, of which the area increases logarithmically with the number of function evaluations needed to obtain that specific result. The horizontal axis shows the difference between the log-likelihood at the found minimum and the deepest minimum found by any algorithm for any settings. To emphasise the possible bias in the test function towards Diver, the results for that algorithm are coloured differently.

Conclusions
We have performed a detailed comparison of a variety of optimisation algorithms in an attempt to find new algorithms for particle astrophysics problems. Many of these algorithms have not been used in a particle astrophysics context before, and we have examined their ability to find the correct global minimum of a range of test functions. They were also tested in their ability to correctly maximise a likelihood in a realistic particle astrophysics example based on a recent global fit of a phenomenological supersymmetry model. The algorithms we investigated were Differential Evolution (using two different software implementations), Particle Swarm Optimisation, the Covariance Matrix Adaptation Evolution Strategy, Bayesian Optimisation (in two different forms), Grey Wolf Optimisation, PyGMO Artificial Bee Colony, Gaussian Particle Filter and AMPGO. All of the algorithms used in our comparison have publicly-available software implementations.
For each algorithm, we characterised the hyperparameters as affecting the convergence or resolution of the optimisation, or as providing hints or improving the reliability. We then ran the different algorithms with different resolution and convergence hyperparameter settings, and compared the performance on test functions of different dimensionality, and our custom implementation of the MSSM7 likelihood function. Understandably, our main conclusion is the almost facile observation that the "best" algorithm depends strongly on the type of function that one wishes to optimise. However, it is possible to add some further interesting conclusions: • Algorithms that emerge as the most consistent performers when evaluated on analytic functions do not necessarily give the best performance on a realistic particle astrophysics example. This is evidenced by the fact that the PyGMO Artificial Bee Colony algorithm arguably emerged as the most consistent performer on our analytic functions. It struggled to match the performance of the Differential Evolution implementation Diver on the MSSM7 likelihood function, but this might be due to a bias towards Diver in this physics inspired test case.
• Differential evolution (in various implementations) performed consistently well across the full barrage of tests.
• AMPGO performed consistently poorly on all test examples, being outperformed by random sampling in most cases.
• Bayesian Optimisation (in two variants, standard Gaussian Process-based Bayesian Optimisation, implemented in GPyOpt, and the Trust Region Bayesian Optimisation algorithm, implemented in the TuRBO package) performs well for functions with many global minima, but struggles in cases with very sharply-peaked minima, or multiple global minima. Performance can be enhanced by performing separate optimisations in different latin hypercubes, at the cost of increasing the total number of likelihood evaluations. Even then, the performance degraded significantly for the analytic functions once the dimensionality increased.
Finally, many of the algorithms used here show promising performance both on the analytic functions and the particle astrophysics example. This certainly motivates their use in real-world particle astrophysics applications, and we look forward to reading future examples of their application.

A Description of DarkMachines sampling framework
All experiments run for this paper were performed in an open-source Python package written specifically for this research: the High-Dimensional Sampling framework, henceforth abbreviated to HDS. This appendix outlines the general workings of, and the design ideas behind this package. It is however by no means a manual. A more complete and more technical introduction to the package can be found on the wiki at the project GitHub page: https://github.com/DarkMachines/high-dimensional-sampling/. The full code is published under the MIT license.

Design considerations
As many different algorithms needed to be tested for this paper, a consistent test framework was needed. Ideally this meant that this framework would be invariant under changes of the optimisation algorithm used for investigation: it should just use the supplied optimisation method and test its performance just like it would test the performance of any other optimisation method when supplied. As the intention was to use a wide range of optimisation methods -of which a significant number were already implemented in packages like ScannerBit [114] -no single existing package was found that perfectly matched our requirements.
In designing and writing the custom package the following guidelines were therefore followed: 1. The package should make experiments reproducible; 2. The evaluation of the performance of an optimisation algorithm should not depend on the exact algorithm under investigation; 3. The package should automate as much of the experiments as possible, with a minimum loss of configurability; 4. The package should be easy to use and install, as experiments will be performed by many people on many different machines; 5. The output of the package should make it possible to easily compare the performance of different algorithms.

Core components
Following guideline (2) the package does indeed not depend on any optimisation algorithm: experiments can run with any implemented algorithm and these can be freely interchanged. However, to achieve this independence a common interface is needed. In the HDS framework this interface is implemented as the Procedure class. This class implements a small selection of functions that give other parts of the framework the possibility to query the procedure for new samples or for its status (e.g., whether or not it has finished sampling and whether or not it can optimise a specific target function). As the interface is very minimal (there are only 5 functions that need to be implemented), implementing new optimisation methods is relatively easy. Implemented Procedures can be found in the optimisation submodule of the package. As some of these require the installation of extra third-party packages, it is recommended to read the documentation on the GitHub wiki for more information.
Optimisation procedures are tested on test functions. To open up the possibility of having third-party test functions (i.e., functions defined by an external likelihood evaluation procedure), these functions also have an interface class: TestFunction. The majority of the TestFunctions have their analytic form programmed directly in the python code. They include common optimisation targets, like the Himmelblau function, the Rastrigin function and the ThreeHumpCamel function (see the GitHub wiki for a complete list of implemented functions and references to their analytic forms).
Although useful as tests of optimisation procedures, the fact that the user has access to the analytic form of these functions (or can just google for the coordinate and functional value of the optimum), they are not true blind tests of optimisation procedures. Because of this, four additional functions are implemented for which the pre-compiled binaries are included in the package. The analytic forms of these functions are not included in the package. These so-called HiddenFunctions are the functions referred to in Section 3.1.
Experiments are run using instances of the Experiment class. It is this class that guarantees the first and third design guidelines. Providing the configured optimisation Procedure and defining on which TestFunctions the procedure should be tested, the Experiment class runs the experiment and outputs all necessary information to interpret the procedure's performance. This information includes: • Benchmarks of the speed of the machine on which the computer is run; • For each used TestFunction: -Meta data about the Procedure and the TestFunction (e.g., values of the configurable parameters); -The number of times the function was called (and if applicable the number of times the function was queried for its derivative); -The coordinates of the taken samples; -The found optimum and the function value at that coordinate; Tables 3 to 7 show for each explored optimisation algorithm the configuration and result of the run which came closest to the global minimum (for the analytic functions) or the overall best found minimum (for the MSSM7 function). If multiple runs resulted in the same function value, the result with the least number of function evaluations is shown.

C Acknowledgements
MvB acknowledges support from the Science and Technology Facilities Council (grant number ST/T000864/1). R. RdA acknowledges partial funding/support from the Elusives ITN (Marie Skłodowska-Curie grant agreement No 674896), the "SOM Sabor y origen de la Materia" (FPA 2017-85985-P). BS, LH and SC acknowledges the support by the Netherlands eScience Center