1 Introduction

Models of physics beyond the Standard Model often feature many new parameters that are unknown a priori and may only be determined by experiment. However, experimental constraints are not trivial to apply, as they often are expressed in terms of weak scale observables rather than the theory’s fundamental parameters. While it is often straightforward (if computationally expensive) to calculate the weak scale observables from the parameters, the inverse problem is typically intractable. That is, weak scale constraints do not allow for a trivial reduction of the dimensionality of the theory space.

The standard approach is to numerically scan over the theoretical parameters and reject those that are not consistent with experimental data. However, the number of samples required for a brute-force search of the parameter space increases exponentially with its dimension. Thus, particle physicists studying models of new physics are often faced with a computationally intractable task. One may pragmatically restrict to a more tractable subset of parameters based on theoretical prejudice. The danger of this approach is that one may miss viable parameters that are both consistent with experimental observations and generate novel phenomenology.

The Minimal Supersymmetric Standard Model (MSSM) is a well-known example of a new physics model with a large number of free parameters. Most of these parameters are the masses and couplings of the supersymmetric partners of Standard Model particles [1]. This overwhelming dimensionality prohibits a fully general survey of the parameter space. Studies of the MSSM typically restrict to theoretically motivated subspaces [2,3,4,5,6,7,8,9,10,11,12,13]. These include the 4+1 dimensional constrained MSSM (cMSSM) as well as the 19 dimensional phenomenological MSSM (pMSSM) [14, 15]. However, even these reduced spaces are difficult to scan using a brute-force search.

High dimensionality is not the only challenge when scanning the parameters of the MSSM. The fundamental parameters of the theory are defined at some high energy scale and must be evolved to the energy scale of the experiment. This evolution requires one to solve the coupled renormalization group equations (RGEs) for the high-scale parameters over many orders of magnitude down to the weak scale. RGE running and the calculation of experimental observables are computationally expensive even for a single set of parameters.

Many recent scans have incorporated machine learning in some capacity to decrease the computational burden of brute-force searching these spaces [7, 12, 13]. These scans use various machine learning models to learn the forward problem of determining weak scale properties given high-scale parameters. This bypasses the need to perform RGE running and weak scale computations; however, one is still faced with the challenge of performing a brute-force search over a high-dimensional parameter space. Machine learning models for the forward problem thus provide only a constant-factor improvement in computational time and do not address the exponential dependence on the dimension of the space.

In this work, we introduce two methods to efficiently sample high-dimensional parameter spaces subject to constraints at the weak scale. We test these frameworks by sampling regions of the cMSSM and pMSSM parameter spaces that admit a Higgs mass consistent with its experimental value [16, 17]. The first uses a deep neural network to learn the likelihood that a point satisfies this constraint and then samples this likelihood using Hamiltonian Monte Carlo (HMC). The second trains a generative model known as a normalizing flow. We analyze the performance of these frameworks by determining the fraction of generated samples that survive the chosen constraint and compare to the performance of random sampling.

These methods allow us to directly and quickly generate points in the parameter space that admit a consistent Higgs mass. By solving the inverse problem of sampling high-scale parameters given weak scale properties, we aim to minimize inefficiencies that arise in a brute-force search.

Our presentation is a proof of concept for these generative models and is encouraging for practical applications. For example, the ability to efficiently scan the MSSM parameter space makes it much easier to determine the high-scale parameters that are consistent with a new particle’s mass and width if a sparticle is discovered. Alternatively, a trained generative model may permit scans over parameters that are consistent with experimental observations to search for specific theoretical features that one may wish to study, for example: gauge coupling unification, a particular type of dark matter particle, or low fine-tuning measures.

As a demonstration of the efficiency of the generative models, we scan the cMSSM and pMSSM parameter spaces for points that reproduce the Higgs mass and saturate the observed dark matter relic density, requiring [18, 19]

$$\begin{aligned}&122~\text {GeV}<{} m_h {}< 128~\text {GeV},\\&\quad 0.08< {}\varOmega _{\text {DM}}h^2 {}< 0.14. \end{aligned}$$

In this study, the generative models have been trained for consistency with the Higgs mass, not the relic density. We compare a brute-force scan using random sampling to a generative model that has been trained to sample points that admit a consistent Higgs mass. We show that the generative models dramatically increase the sampling efficiency of this scan.

2 Methods

2.1 Data generation

The cMSSM contains 4 continuous parameters defined at the Grand Unified Theory (GUT) scale and 1 discrete sign parameter. These are the universal scalar mass \(m_0\), the universal gaugino mass \(M_{1/2}\), the universal trilinear coupling \(A_0\), the ratio of Higgs vacuum expectation values \(\tan \beta \), and the sign of \(\mu \). The pMSSM is the most general subspace of the MSSM that admits first and second generation universality, no new sources of CP violation, and no flavor changing neutral currents [15]. Parameters of the pMSSM are defined at the electroweak (EW) scale. The full list of pMSSM parameters is given in Table 2.

Our datasets are formed by uniform random sampling within bounded regions of the parameter space: cMSSM parameters are sampled at the GUT scale and pMSSM parameters are sampled at the EW scale. Bounds are listed for the cMSSM and the pMSSM in Tables 1 and 2, respectively [2, 9], and are chosen to cover large volumes of the parameter space that are sensitive to modern collider experiments. For the cMSSM, we fix \({\text {sign}}(\mu )=1\). We sample approximately \(1.5\times 10^6\) datapoints in the cMSSM and approximately \(1.95 \times 10^7\) datapoints in the pMSSM. Once sampled, we calculate Higgs masses and relic densities with micrOMEGAs, which internally uses the spectrum generator SoftSUSY v4.1.0 [20, 21].

Table 1 Parameter bounds in the cMSSM scan, following Ref. [2]. A uniform prior is used for all parameters except \(A_0\), where we uniformly sample \(A_0 / m_0\)
Table 2 Parameter bounds in the pMSSM scan, following Ref. [9]. A uniform prior is used for all parameters. “Left-handed” and “right-handed” are abbreviated by l.h. and r.h., respectively

We apply two theoretical constraints: (i) consistent electroweak symmetry breaking and (ii) the positivity of all squared masses. In addition to these, we also require that SoftSUSY converges. We do not require that the lightest supersymmetric particle is neutral.

The theoretical uncertainty in the Higgs mass is significantly larger than its experimental uncertainty [22]. We take the uncertainty in the Higgs mass calculations to be \(\sigma _{m_h} = 3~\)GeV for all points in the data set [2, 9].
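To make the sampling step concrete, the following sketch draws uniform cMSSM points in the manner described above. It is only a sketch: the numerical bounds are placeholders rather than the values of Table 1, \(A_0/m_0\) is the uniformly sampled feature as noted in the table caption, and the SoftSUSY/micrOMEGAs evaluation of each point is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng()

# Illustrative cMSSM bounds (GeV where dimensionful); the bounds actually used
# are those listed in Table 1.
CMSSM_BOUNDS = {
    "m0":    (100.0, 10000.0),
    "M12":   (100.0, 10000.0),
    "A0/m0": (-3.0, 3.0),       # A0 / m0 is the uniformly sampled feature
    "tanb":  (2.0, 60.0),
}

def sample_cmssm(n):
    """Uniformly sample n cMSSM points inside the bounded box, with sign(mu) = +1."""
    pts = {k: rng.uniform(lo, hi, n) for k, (lo, hi) in CMSSM_BOUNDS.items()}
    pts["A0"] = pts.pop("A0/m0") * pts["m0"]
    return pts

# Each point is then run through SoftSUSY/micrOMEGAs (not shown) to obtain m_h
# and Omega_DM h^2; failed spectra are flagged so that Eq. (1) below can assign
# them zero likelihood.
```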

2.2 Neural network

We train the neural network by assigning all points in the dataset a likelihood

$$\begin{aligned} L(\theta ) = {\left\{ \begin{array}{ll} 1 &{}\quad |m_h(\theta ) - m_{h,\mathrm {exp}}| < \sigma _{m_h}, \\ 0 &{}\quad \text {otherwise}, \end{array}\right. } \end{aligned}$$
(1)

where we ignore a normalization constant. All data points that fail the theoretical constraints are assigned a likelihood of zero.
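As a minimal sketch, the labels of Eq. (1) can be assigned as follows. The central value of 125 GeV is inferred from the 122–128 GeV window quoted in the introduction and is used here only for illustration; \(\sigma_{m_h} = 3\) GeV is the uncertainty adopted in Sect. 2.1.

```python
import numpy as np

M_H_EXP = 125.0    # GeV; central value reproducing the 122-128 GeV window (illustrative)
SIGMA_M_H = 3.0    # GeV; theoretical uncertainty adopted in Sect. 2.1

def likelihood_labels(m_h, theory_ok):
    """Binary training targets implementing Eq. (1).

    m_h       : array of calculated Higgs masses (NaN where the spectrum failed)
    theory_ok : boolean array, True where the theoretical constraints hold
    """
    in_window = np.abs(m_h - M_H_EXP) < SIGMA_M_H   # False wherever m_h is NaN
    return (in_window & theory_ok).astype(np.float32)
```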

We use a deep neural network to learn the function \(L(\theta )\) [23]. This has two benefits. First, it greatly reduces the time required to evaluate the likelihood of a point. Second, it provides a differentiable interpolation of \(L(\theta )\). In the next section we show that HMC requires many evaluations of the likelihood and its gradient, and thus makes full use of both benefits.

We train a deep neural network \(\hat{L}(\theta )\) to minimize the usual L2 loss function

$$\begin{aligned} \mathcal {L} = |\hat{L}(\theta ) - L(\theta )|^2. \end{aligned}$$
(2)

We use a training, validation, and testing split of 0.7, 0.15, and 0.15, respectively, for both datasets. Batch normalization and dropout layers are used between the hidden layers of the neural network. Backpropagation is performed using the Adam optimizer [24].
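A minimal PyTorch sketch of this network and training loop is given below, assuming the preprocessed features and the labels of Eq. (1) are supplied by standard data loaders. Layer widths, dropout rate, learning rate, and epoch count are illustrative stand-ins for the hyperparameters given in the Appendix.

```python
import torch
import torch.nn as nn

class LikelihoodNet(nn.Module):
    """Fully connected network for L_hat(theta); layer sizes here are illustrative."""
    def __init__(self, n_features, hidden=(256, 256, 256), p_drop=0.1):
        super().__init__()
        layers, d_in = [], n_features
        for d_out in hidden:
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(p_drop)]
            d_in = d_out
        layers.append(nn.Linear(d_in, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def train(model, train_loader, val_loader, epochs=50, lr=1e-3):
    """Minimize the L2 loss of Eq. (2) with the Adam optimizer."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():   # validation loss, monitored for model selection
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)
    return model
```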

Some of the pMSSM parameters in Table 2 span a disconnected range of positive and negative values, for example \(M_1\), \(M_2\) and \(\mu \). We preprocess these parameters by shifting negative values to create a single continuous domain; for example, for \(\mu \) we shift the negative values by 200 GeV. This has no physical significance and simply prepares the data for input into the neural network. We then standardize each feature. For the cMSSM dataset, we use the feature \(A_0 / m_0\) in place of \(A_0\), as this feature is uniformly distributed.
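A minimal sketch of this preprocessing follows, assuming the parameters are collected in a pandas DataFrame. The column name "mu" is illustrative, and only the 200 GeV shift for \(\mu\) is quoted in the text; \(M_1\) and \(M_2\) would be treated the same way with their own gap sizes.

```python
import pandas as pd

# Shift (in GeV) applied to the negative branch of each gapped feature.
# Only the 200 GeV shift for mu is quoted in the text.
GAP_SHIFTS = {"mu": 200.0}

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Close the gaps in the gapped features, then standardize every feature."""
    df = df.copy()
    for col, shift in GAP_SHIFTS.items():
        neg = df[col] < 0
        df.loc[neg, col] += shift        # negative branch moved up to meet the positive one
    return (df - df.mean()) / df.std()   # zero mean, unit variance per feature
```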

2.3 Hamiltonian Monte Carlo

The Hamiltonian Monte Carlo method is a Markov chain Monte Carlo technique that uses an analog of energy conservation to effectively sample the target distribution [25, 26]. To use the method, we first define an auxiliary momentum variable p, where each component is initially drawn from a normal distribution. Next, we define a potential energy function given by

$$\begin{aligned} V(\theta ) = -\log (\hat{L}(\theta )). \end{aligned}$$
(3)

The kinetic energy function takes the familiar form \(T=p^2/2\) where we set the mass to unity, \(m=1\). We then evolve the system from time \(t=0\) to \(t=\tau \) according to the Hamiltonian equations of motion

$$\begin{aligned} \frac{\mathop {}\!\mathrm {d}\theta _i}{\mathop {}\!\mathrm {d}t}&= p_i,&\frac{\mathop {}\!\mathrm {d}p_i}{\mathop {}\!\mathrm {d}t}&= \frac{1}{\hat{L}(\theta )}\frac{\partial \hat{L}(\theta )}{\partial \theta _i}. \end{aligned}$$
(4)

We solve these equations using the leap-frog algorithm so that energy is approximately conserved. We take \(\theta (\tau )\) as a proposal to add to the Markov chain. The proposal is accepted with probability

$$\begin{aligned} P = \min \left( 1, \frac{e^{-H(\theta (\tau ), p(\tau ))}}{e^{-H(\theta (0), p(0))}}\right) . \end{aligned}$$
(5)

Energy conservation implies that an exact solution to the equations of motion should always yield probability 1. However, a rejection step is necessary because we solve these equations numerically. If \(\theta (\tau )\) is rejected, then \(\theta (0)\) is added to the Markov chain instead. In the limit of an infinite number of samples, the Markov chain converges to a sample of the distribution \(\hat{L}(\theta )\). We seed the Markov chain with a random positive sample from the dataset used to train the neural network. We bound the parameter space with hard walls of infinite potential energy.
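The loop below is a compact sketch of this procedure. It assumes that log_like wraps the trained network \(\hat{L}\) (for example, the logarithm of its output with a small floor to avoid \(\log 0\)) and returns a scalar for a single parameter point; the step size, trajectory length, unit mass matrix, and treatment of the hard walls are illustrative choices rather than the settings of the Appendix.

```python
import math
import torch

def hmc_chain(log_like, theta0, n_samples, n_steps=20, eps=0.01, bounds=None):
    """Hamiltonian Monte Carlo for V(theta) = -log L_hat(theta) with unit masses.

    log_like : callable returning the scalar log L_hat(theta) for a 1D tensor theta
    bounds   : optional (low, high) tensors implementing the hard walls
    """
    def grad_log_like(theta):
        theta = theta.detach().clone().requires_grad_(True)
        log_like(theta).backward()
        return theta.grad

    chain, theta = [], theta0.detach().clone()
    for _ in range(n_samples):
        p = torch.randn_like(theta)                        # momenta drawn from N(0, 1)
        H0 = -log_like(theta).item() + 0.5 * (p ** 2).sum().item()

        # Leap-frog integration of the equations of motion, Eq. (4)
        q = theta.clone()
        p = p + 0.5 * eps * grad_log_like(q)
        for _ in range(n_steps - 1):
            q = q + eps * p
            p = p + eps * grad_log_like(q)
        q = q + eps * p
        p = p + 0.5 * eps * grad_log_like(q)

        inside = bounds is None or bool(((q > bounds[0]) & (q < bounds[1])).all())
        H1 = -log_like(q).item() + 0.5 * (p ** 2).sum().item()

        # Metropolis accept/reject step of Eq. (5); the hard walls reject any
        # proposal that leaves the bounded parameter box.
        if inside and torch.rand(1).item() < math.exp(min(0.0, H0 - H1)):
            theta = q
        chain.append(theta.clone())                        # rejected step repeats theta(0)
    return torch.stack(chain)
```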

2.4 Normalizing flows

It is difficult to draw samples from a complicated distribution in a high-dimensional parameter space. On the other hand, it is easy to draw samples from an equally high-dimensional Gaussian distribution. A normalizing flow learns an invertible map f from a simple distribution \(p_Z\) to a challenging distribution \(p_Y\). One then creates samples from the challenging distribution by mapping easy-to-generate samples, with the densities related by

$$\begin{aligned} p_Y(y) = p_Z(f^{-1}(y))\left| \det \left( \frac{\partial f}{\partial y}\right) \right| ^{-1}. \end{aligned}$$
(6)

The function f depends on a set of parameters \(\varTheta \) which are learned by maximizing the log likelihood of a training set, \(\mathcal {X}\). The loss function for this training is thus

$$\begin{aligned} \mathcal {L}(\mathcal {X})&= -\sum _{y \in \mathcal {X}} \left( \log \left( p_Z(f^{-1}(y))\right) - \log \left| \det \left( \frac{\partial f}{\partial y}\right) \right| \right) . \end{aligned}$$

It is helpful to construct f to be the composition of n successive maps, \(f=f_n\circ \cdots \circ f_1\) [23]. Defining \(z_{i+1} = f_i(z_i)\) and identifying \(y = z_{n+1}\) yields the loss function

$$\begin{aligned} \mathcal {L}(\mathcal {X})&= -\sum _{y \in \mathcal {X}} \left( \log \left( p_Z(z_1)\right) - \sum _{i=1}^n \log \left| \det \left( \frac{\partial z_{i+1}}{\partial z_{i}}\right) \right| \right) . \end{aligned}$$

We choose the \(f_i\) to be autoregressive transformations. This means that the parameters \(\varTheta ^k_i\) that define the function \(f_i\) acting on the \(k\text {th}\) feature \(z_i^k\) depend only on the first \(k-1\) features \(z_i^1, \ldots , z_i^{k-1}\):

$$\begin{aligned} z_{i+1}^k = f_i\bigl (z_i^k \,;\; \varTheta _i^k(z_{i}^{1:k-1})\bigr ). \end{aligned}$$

This structure ensures that the Jacobian matrix \(\partial z_{i+1}/\partial z_{i}\) is lower triangular so that the determinant is simply the product of diagonal elements and may be computed in linear time.

The function \(\varTheta _i^k\left( z_{i}^{1:k-1}\right) \) can be represented efficiently with a Masked Autoencoder for Distribution Estimation (MADE) [27]. MADE networks turn off specific internal weights of the neural network so that the autoregressive property is enforced, allowing one neural network to output all model parameters rather than performing a sequential loop over features.

For our application, we choose \(f_i\) to be rational-quadratic neural spline flows with autoregressive layers [28]. These are piece-wise monotonic functions defined as the ratio of two quadratic functions on the interval \([-B, B]\), with \(K+1\) knots determining the boundaries between bins. Outside of this interval, the transformation is defined to be the identity. These transformations are parameterized by \(3K-1\) parameters for each feature, which are K bin heights, K bin widths, and \(K-1\) positive derivative values at the knots, as the derivatives are set to 1 at \(-B\) and B to ensure a continuous derivative over the domain. Permutation layers are included between rational-quadratic transformation layers. We implement the normalizing flow using the Python package nflows [28].
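As a sketch of how such a flow can be assembled and trained with the nflows package, the code below composes masked autoregressive rational-quadratic spline layers with random permutations. The number of layers, hidden units, bins, tail bound, learning rate, and epoch count are illustrative placeholders for the hyperparameters listed in the Appendix; the training loop implements the negative log-likelihood loss defined above.

```python
import torch
from nflows.flows.base import Flow
from nflows.distributions.normal import StandardNormal
from nflows.transforms.base import CompositeTransform
from nflows.transforms.autoregressive import (
    MaskedPiecewiseRationalQuadraticAutoregressiveTransform,
)
from nflows.transforms.permutations import RandomPermutation

def build_flow(dim, n_layers=5, hidden=128, num_bins=8, tail_bound=3.0):
    """Rational-quadratic neural spline flow with MADE-based autoregressive layers."""
    transforms = []
    for _ in range(n_layers):
        transforms.append(
            MaskedPiecewiseRationalQuadraticAutoregressiveTransform(
                features=dim,
                hidden_features=hidden,
                num_bins=num_bins,      # K bins -> 3K - 1 spline parameters per feature
                tails="linear",         # identity outside [-B, B]
                tail_bound=tail_bound,  # B
            )
        )
        transforms.append(RandomPermutation(features=dim))   # permutation layer
    return Flow(CompositeTransform(transforms), StandardNormal(shape=[dim]))

def train_flow(flow, loader, epochs=50, lr=1e-3):
    """Maximize the log likelihood of the training set, i.e. minimize the loss above."""
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        for (x,) in loader:              # loader yields batches of standardized samples
            opt.zero_grad()
            (-flow.log_prob(x).mean()).backward()
            opt.step()
    return flow

# After training, new parameter points are drawn directly, e.g.
# samples = flow.sample(400_000)
```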

3 Results

We analyze the performance of these generative frameworks on the cMSSM and pMSSM datasets described above. The cMSSM is low dimensional and can be scanned relatively well with brute-force search. Thus, we view the cMSSM as a test for the generation methods and the pMSSM as a more practical application. We present the results for the neural network with HMC as well as the normalizing flow side by side. For each method, we generate a dataset of \(4\times 10^5\) datapoints.

We present histograms of generated variables to confirm that the distribution of theory parameters is not biased by our generative framework. We also present histograms of \(m_h\) to check that our generative models sample within the band of permitted Higgs masses, and of \(\varOmega _{\text {DM}}h^2\) to provide evidence that the distributions of weak scale quantities match, as these are sensitive to higher order correlations in the high energy scale parameters. Finally, we report sampling efficiencies, defined as the fraction of a dataset that satisfies a given constraint. The hyperparameters used for the supervised neural network, Hamiltonian Monte Carlo, and normalizing flow are given in the Appendix for both datasets.
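In code, these efficiencies reduce to means of boolean masks over a generated dataset; the mask names below (theory_ok, higgs_ok, relic_ok) are illustrative.

```python
import numpy as np

def efficiency(mask):
    """Sampling efficiency: fraction of generated points that satisfy a constraint."""
    return float(np.mean(mask))

# e.g. for the rows of Tables 3 and 4 (boolean arrays over a generated dataset):
#   efficiency(theory_ok)
#   efficiency(theory_ok & higgs_ok)
#   efficiency(theory_ok & higgs_ok & relic_ok)
```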

3.1 cMSSM

Fig. 1

Histograms of cMSSM parameters that yield the experimental Higgs mass. We observe good agreement between the random sampling, HMC, and the flow model. Black: Data obtained through random sampling with a uniform prior and rejecting points that do not have a consistent Higgs mass. Magenta: data sampled with HMC. Blue: data sampled from the flow model. No rejection step is applied to generated samples

In Fig. 1, we compare histograms of the cMSSM parameters at the GUT scale. For both generative models, we see very good agreement between the distribution of generated samples and the distribution of randomly sampled points after the Higgs mass constraint is applied. Next, we run the parameters to the weak scale in order to perform the combined search for \(\varOmega _{\text {DM}}h^2\) and \(m_h\). In Fig. 2, we show the distribution of Higgs masses for generated points and randomly sampled points with a rejection step applied. We see that the generative models typically sample within the band of permitted Higgs masses.

Fig. 2

Histogram of Higgs masses in the cMSSM for different sampling methods. The generative models are seen to mostly sample points consistent with the Higgs mass constraint. Gray: data obtained through random sampling with a uniform prior. Black: the same randomly sampled data, but points that do not have a consistent Higgs mass are rejected. Magenta: data sampled with HMC. Blue: data sampled with the normalizing flow

Fig. 3

Histogram of dark matter thermal relic densities in the cMSSM for different sampling methods. We observe that the distributions of the generative models match the distribution of random sampling, providing evidence that the generative models are able to match higher order correlations in GUT scale parameters. Gray: data obtained through random sampling with a uniform prior. Black: the same randomly sampled data, but points that do not have a consistent Higgs mass are rejected. Magenta: data sampled with HMC. Blue: data sampled with the normalizing flow. Generative models have been trained to satisfy the Higgs mass constraint

As an example application, we show histograms of the dark matter relic density for these datasets in Fig. 3. We see that the distribution of dark matter relic densities from the generative models appears to accurately reflect the corresponding distribution in the dataset after the Higgs mass constraint is applied. We emphasize that because the RGEs are coupled, weak scale quantities are generally sensitive to higher order correlations of the GUT scale parameters, so matching the weak scale distributions is evidence that these higher order correlations have also been matched. This indicates that the \(m_h\)-constrained subspace has been accurately sampled, allowing for an exploration of additional constraints, such as the relic density.

In Table 3, we compare various statistical properties of random sampling to those of our generative frameworks trained to satisfy the Higgs mass constraint. The first row shows the sampling efficiency with respect to the theoretical constraints mentioned in Sect. 2.1. We see that samples from the generative models are more likely to pass these constraints, as points with a consistent Higgs mass necessarily satisfy the theoretical constraints. The second row shows the sampling efficiency with respect to the Higgs mass constraint. Predictably, the generative models have significantly higher sampling efficiencies than random sampling. We also see that the flow model slightly outperforms the HMC sampling method.

Table 3 Comparison of sampling efficiency in the cMSSM for several methods and several levels of constraints. We compare a brute force random scan (random), Hamiltonian MC of a neural network trained to learn the \(m_h\) constraint (HMC\(_{m_h}\)), and normalizing flows that incorporate the \(m_h\) constraint (NF\(_{m_h}\)). The constraints applied are theoretical consistency checks (see text), consistency with the experimental Higgs mass, and consistency with both the Higgs mass and the dark matter relic density \((\varOmega _{\text {DM}}h^2)\)
Fig. 4

Histograms of pMSSM parameters that yield the experimental Higgs mass. We observe good agreement between random sampling, HMC, and the flow model. Black: Data obtained through random sampling with a uniform prior and rejecting points that do not have a consistent Higgs mass. Magenta: data sampled with HMC. Blue: data sampled from the flow model. No rejection step is applied to generated samples

The third row shows the sampling efficiencies with respect to the combined Higgs mass and relic density constraint, where the generative models are still trained to only satisfy the Higgs mass constraint. This simulates a scenario where one would like to study the effect of imposing a new constraint in addition to the constraints that are explicitly trained on. Once again, we see that the generative models have much higher sampling efficiencies, resulting from the high probability that the samples pass the Higgs mass constraint. We see an increase in sampling efficiency of approximately an order of magnitude for both generative frameworks.

3.2 pMSSM

Differences between the generative models appear in the higher-dimensional pMSSM. In Fig. 4, we compare histograms of parameters sampled using brute-force search, HMC and the normalizing flow model. Despite the increased dimensionality, we find very good agreement in the distributions of all parameters.

Figures 5 and 6 present histograms of \(m_h\) and \(\varOmega _{\text {DM}}h^2\) for the pMSSM. The generative models tend to sample in the band of allowed Higgs masses, with the normalizing flow model matching the brute-force scan well. We see general agreement with the true distribution of dark matter abundances for both generative frameworks, though the HMC samples do not match the brute-force distributions as well as those from the flow model.

Table 4 summarizes the performance of our sampling methods in the pMSSM. See Sect. 3.1 for a detailed description of the quantities presented in the table. We find that generative models greatly increase the sampling efficiency relative to a brute-force search. In fact, the improvement in sampling efficiency is much greater than that seen in the cMSSM. This is largely due to the poorer performance of a brute-force search in the higher-dimensional pMSSM.

Fig. 5

Histogram of Higgs masses in the pMSSM. The generative models are seen to mostly sample points consistent with the Higgs mass constraint. Gray: data obtained through random sampling with a uniform prior. Black: the same randomly sampled data, but points that do not have a consistent Higgs mass are rejected. Magenta: data sampled with HMC. Blue: data sampled with the normalizing flow

Fig. 6

Histogram of dark matter thermal relic densities in the pMSSM. We observe that the distributions of the generative models match the distribution of random sampling, providing evidence that the generative models are able to match higher order correlations in EW scale parameters. Gray: data obtained through random sampling with a uniform prior. Black: the same randomly sampled data, but points that do not have a consistent Higgs mass are rejected. Magenta: data sampled with HMC. Blue: data sampled with the normalizing flow. Generative models have been trained to satisfy the Higgs mass constraint

Table 4 Comparison of sampling efficiency in the pMSSM for several methods and several levels of constraints. Methods compared are a brute force random scan, Hamiltonian MC of a neural network trained to learn the \(m_h\) constraint (HMC\(_{m_h}\)), and normalizing flows that incorporate the \(m_h\) constraint (NF\(_{m_h}\)). Constraints applied are theoretical consistency checks (see text), consistency with the experimental Higgs mass, and consistency with both the Higgs mass and the dark matter relic density \((\varOmega _{\text {DM}}h^2)\)

4 Conclusion

We implement two generative frameworks that utilize machine learning to increase the sampling efficiency of searches in supersymmetric parameter spaces. These sampling methods offer a more efficient way to search the high-dimensional parameter spaces of models of new particle physics. We compare these generative frameworks to the currently used method of a brute-force search and find orders-of-magnitude improvements in the sampling efficiency for both parameter spaces considered here. We show that our generative frameworks are able to sample the underlying data distribution without any evidence of bias or mode collapse.

In the cMSSM, both methods significantly outperform random sampling, with the flow model slightly outperforming HMC. In the pMSSM, the flow model significantly outperforms HMC, likely due to the larger dimensionality of the pMSSM. In addition to its performance benefits, the flow model is also quicker to train and sample, making it clearly preferable to HMC. However, the HMC framework is more complementary to previous works, as it learns the forward problem of determining likelihoods and uses well-tested Monte Carlo algorithms to sample from this likelihood.

Possibilities for future work include incorporating additional constraints into the generative model. In principle, there is no limit to the number of constraints that can be incorporated into either generative model. However, forming an initial dataset for learning may be difficult when the constraints are very strict. A possible remedy is to train generative models with less restrictive constraints, which are then used to produce sizable datasets of points that already satisfy many constraints. This new dataset could then be searched to form a training set for a generative model with increasingly restrictive constraints.

Given the ability of the generative machine learning models to efficiently explore high-dimensional parameter spaces, it will be interesting to apply the techniques described here to other problems. For instance, one may identify relations that explain why there is a ‘little hierarchy’ between the electroweak scale and the scale of soft parameters, which go beyond the focus point scenario [29]. In general, one may be able to identify manifolds of viable points in high-dimensional parameter sets, and explore their geometry.

We have shown promising results in subspaces of the MSSM parameter space. These results apply generally to any high-dimensional parameter space with constraints that are computationally expensive to verify. Another direction for future study may be applications to the parameter spaces of even higher-dimensional models of new physics. This includes potentially relaxing the constraints built into the pMSSM parameter space, but could also include applications to non-supersymmetric theories. Finally, one could attempt to further tune the neural network structure and hyperparameters in order to achieve higher sampling efficiency than was achieved in this work.