Visualization and efficient generation of constrained high-dimensional theoretical parameter spaces

We describe a set of novel methods for efficiently sampling high-dimensional parameter spaces of physical theories defined at high energies, but constrained by experimental measurements made at lower energies. Often, theoretical models such as supersymmetry are defined by many parameters, O(10-100), expressed at high energies, while relevant experimental constraints are often defined at much lower energies, preventing them from directly ruling out portions of the space. Instead, the low-energy constraints define a complex, potentially non-contiguous subspace of the theory parameters. Naive scanning of the theory space for points which satisfy the low-energy constraints is hopelessly inefficient due to the high dimensionality, and the inverse problem is considered intractable. As a result, many theoretical spaces remain under-explored. We introduce a class of modified generative autoencoders, which attack this problem by mapping the high-dimensional parameter space to a structured low-dimensional latent space, allowing for easy visualization and efficient generation of theory points which satisfy experimental constraints. An extension without dimensional compression, which focuses on limiting potential information loss, is also introduced.

1 Introduction

Decades of searches for extensions to the Standard Model such as supersymmetry (SUSY) have come up empty, and future higher-energy colliders may be required in order to discover new particles and interactions. But given the vast expense and wait time incurred by such projects, it is worth exploring diligently whether the parameter spaces of these theories have been thoroughly exhausted in current datasets. Could there still be novel unexplored islands: unexcluded regions of parameter space with signatures accessible to the LHC and astroparticle experiments?
The full parameter space of the minimal supersymmetric standard model (MSSM) has more than 100 parameters [1], making a complete exploration daunting. Out of desperate pragmatism, the search space is often restricted to a more tractable subset of parameters based on theoretical inclinations. Common examples include the (4 + 1)-dimensional cMSSM as well as the 19-dimensional pMSSM. These high-dimensional spaces guide the LHC program [2][3][4][5][6][7][8][9][10], but direct experimental searches are typically performed in only one or two dimensions, with all other parameters fixed to theoretically preferred values. The resulting surfaces of LHC exclusion occupy a vanishingly small fraction of the volume of the larger theoretical space, most of which is essentially unexamined. Each point in such a space defines a spectrum of particles which determine well-known quantities such as the Higgs boson mass and the dark matter relic density. But what portions of the full SUSY space are consistent with these weak-scale constraints and LHC exclusion results? Are there unexplored islands where promising models abound? Nobody knows.
A critical obstacle is that weak-scale constraints such as the Higgs mass or the dark matter relic density are not expressed directly in terms of the fundamental parameters of the theory, such as SUSY particle masses, instead requiring intensive calculations. This prevents us from using experimental constraints to immediately rule out large portions of the model's parameter space or reduce its dimensionality. Instead, the weak-scale constraints define a complex, potentially non-contiguous subspace of the theory parameters. And the inverse problem, determining a map from weak-scale observables to fundamental parameters, is intractable. Instead, we are forced to scan the high-dimensional parameter space, evaluating each point for weak-scale consistency. Efforts to speed up the calculation of the weak-scale quantities with machine learning [11][12][13] do not solve the core problem, which is dominated by the dependence of the scanning cost on the dimensionality rather than the per-sample computation time.
Even the reduced spaces such as the pMSSM are difficult to scan using a brute-force search, so much so that the discovery of new islands of experimental consistency in this subspace relies largely on the inspiration of particle theorists intuiting their existence. Each island (e.g. "well-tempered SUSY" [14], "focus point SUSY" [15], "Natural SUSY" [16], "Gauge Mediated SUSY Breaking" [17], "General Gauge Mediation" [18]) is then carefully studied [19][20][21][22][23][24][25][26][27][28], and novel signatures can motivate new experimental analyses. But the danger of this exploration by intuition is that one may miss viable regions. Currently, many particle physics theories include vast, mostly-unexplored parameter spaces, which may contain undiscovered islands consistent with experimental constraints. Some of these islands may reveal new, unanticipated signatures accessible at the LHC.
Fortunately, an arsenal of techniques in Artificial Intelligence (AI), and in particular statistical techniques such as Machine Learning (ML), has shown great promise at exploring and summarizing information from high-dimensional data spaces. Recent approaches for exploring high-dimensional HEP spaces fall largely into two broad categories.
One class of generative methods, such as GANs [29], HMC [30][31][32], Normalizing Flows [33], and Genetic Algorithms [34], takes advantage of the ability of ML to perform high-dimensional interpolation, making notable improvements in the efficiency of sampling high-dimensional subspaces consistent with experimental constraints [35][36][37][38][39][40]. But these methods often act as black boxes, making interpretability difficult, and ameliorate but do not fundamentally solve the underlying difficulty of searching high-dimensional spaces. The increasing dimensionality of SUSY-related theoretical spaces may instead require methods which map the problem to a lower-dimensional space where extensive sampling is more feasible.
The other broad category of approaches, observational dimensional reduction methods (e.g. UMAP [41], TSNE [42], non-generative auto-encoders [43], and self-organizing maps [44,45]), analyzes high-dimensional datasets by mapping them to a lower-dimensional latent space [46,47], allowing for visualization of any resulting structure, such as fertile islands of desirable points. For example, in the context of SUSY parameter space searches, these techniques allow for visual examination of the latent space in the hopes of finding clusters of points consistent with experimental constraints. While the goal of collapsing the high-dimensional structure into a lower-dimensional space has solid mathematical underpinnings [48], actually creating useful structure is challenging, as many of the methods are unsupervised and thus cannot explicitly encourage such structure in the latent space. Often, the resulting latent space is too poorly structured for efficient sampling [49]. In other cases, the lack of a return mapping from the latent space prevents sampling altogether. However, when used in the context of generative methods, reducing the dimensional complexity of the sampling space has the potential to directly remove the primary difficulty posed by searching high-dimensional spaces.
Additionally, Manifold Learning methods [50][51][52][53] spanning both of these categories have sought to learn and sample from low-dimensional manifolds on which interesting points sit in high dimensional spaces, although they can be non-trivial and computationally expensive to train [50,51].
In this paper, we build a new class of generative methods which map the problem to a lower-dimensional space, but specifically encourage useful structure by treating the problem as a supervised learning task. We use a modified Sliced Wasserstein Autoencoder (SWAE) [49] structure to create bidirectional maps between the GUT-scale parameter space and a structured low-dimensional latent space. The structure of the latent space is specialized to force valid points to cluster near the latent origin, allowing for efficient generation of new points which can be mapped back to the GUT-scale parameter space. Because the latent space is low-dimensional, it is inherently easier to search. Additionally, the added benefit of visualization due to dimensional reduction makes our maps easy to understand and interact with, and in some cases allows the addition of secondary constraints in media res with no additional training. Because our method reduces the complexity of the sampling space, it may be scalable in the future to tackle larger, less-constrained subsections of the MSSM. The structure of our approach is shown in Fig. 1. To explore whether encouraging structure in the latent space is sufficient, without insisting on dimensional reduction, we also introduce a complementary formulation of our method which operates without any dimensional reduction.
We call our primary method the SWAE Algorithm For Experimentally Sound Point-Production And Mapping (SAFESPAM), and the two variants without dimensional reduction we refer to as Constraint-driven High-dimensional UNcompressed (Categorical) Clustering, or CHUNC and CHUNC2 [54].
The paper is outlined as follows: In Section II we detail the task and the nature of our experimental constraints. Section III describes the generation and parameters of our dataset. In Section IV, we discuss our approach for SAFESPAM, CHUNC and CHUNC2, as well as our sampling procedures. Section V presents the results of our trials for various sets of experimental constraints, and Section VI contains further discussion and conclusions.
Figure 1: A simplified diagram of SAFESPAM for the cMSSM. GUT-scale inputs are mapped to an abstract 2D latent space, where they are shaped to a target distribution via the loss function. Weak-scale values that accompany each theory point are used only to classify points as valid (orange) or invalid (blue), and do not enter the network as inputs. Valid points, shown in orange, cluster towards the center of the latent space. The mapping f from the input theory space to the latent 2D space is referred to as the encoder, and the return mapping g as the decoder. SAFESPAM tries to reconstruct the GUT-scale parameters as accurately as possible by minimizing the L2 term in the loss function, ensuring that the outputs from the decoder match the corresponding inputs to the encoder. Sampling new points in the valid region of the latent space allows for efficient generation of new valid points in GUT space.

Task
To efficiently generate new points in high-dimensional theoretical parameter spaces which satisfy the experimental constraints placed on weak-scale observables, we seek maps between the theory space and a structured latent space which are:
• generative: capable of generating new points with high efficiency compared with naive brute-force sampling
• contrastive: with structured latent spaces where valid and invalid points are separated into distinct regions
• low-dimensional: for ease of sampling and visualization of the latent space
In this paper, we study the performance of the SAFESPAM method in two constrained SUSY spaces, the cMSSM in 4+1 dimensions and the pMSSM in 19 dimensions [55][56][57]. We impose two experimental constraints, on the Higgs boson mass and the dark matter relic density:
• Higgs Boson Mass: 122.09 GeV ≤ m_h ≤ 128.09 GeV [58,59]
• Dark Matter Relic Density: 0.08 ≤ Ω_DM h² ≤ 0.14 [60,61]
Points whose experimental values fall within these constraints are denoted 'valid'. While in principle our methods could also use other experimental variables as constraints, prior work has often focused on these [35,36,38]. With CHUNC and CHUNC2, we introduce a third constraint that the Lightest Supersymmetric Particle (LSP) be a neutralino. All weak-scale constraints are calculated using softsusy 4.1.10 [62] and micromegas 5.2.6 [63].
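The two-constraint validity criterion above can be expressed as a small predicate; the function and argument names here are our own illustration, not part of any published code:

```python
# Constraint windows taken from the text; observable names are assumptions.
HIGGS_WINDOW = (122.09, 128.09)   # m_h in GeV
RELIC_WINDOW = (0.08, 0.14)       # Omega_DM h^2

def is_valid(m_h: float, omega_dm_h2: float) -> bool:
    """Return True if a theory point's weak-scale observables satisfy
    both the Higgs-mass and dark-matter relic-density constraints."""
    return (HIGGS_WINDOW[0] <= m_h <= HIGGS_WINDOW[1]
            and RELIC_WINDOW[0] <= omega_dm_h2 <= RELIC_WINDOW[1])
```

In practice the observables themselves would come from a spectrum calculator such as softsusy and micromegas; this predicate only encodes the acceptance windows.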

Dataset
We utilize initial brute-force scans, O(10-20M) points, of the parameters of the pMSSM and cMSSM at the GUT scale to create our datasets [7,55]. Below we list the parameter ranges for the cMSSM (Table 1) and pMSSM (Table 2). All parameters are sampled uniformly over the specified range, with the exception of A_0, for which the ratio A_0/m_0 is sampled uniformly [35].
GUT-scale points are considered to be valid if they are consistent with experimental constraints, calculated via softsusy 4.1.10 [62] and micromegas 5.2.6 [63].

Approach
Unlike other dimensional reduction methods, auto-encoders include a mapping back to the original space by design. We utilize a modified Sliced Wasserstein Autoencoder (SWAE) [49] which constrains the shape and consistency of the latent space to be suitable for efficient sampling, unlike standard non-generative auto-encoders. The latent space created is abstract in the sense that there is no direct physical interpretation of its coordinates. But it is also structured, in that points are encouraged towards a target distribution using a Sliced Wasserstein loss term. The network finds the most accurate way to encode this structure into the reduced dimensionality by minimizing the reconstruction loss. New points that are generated in the low-dimensional latent space are then mapped back up to the original space, allowing for efficient sampling of the valid substructures in high dimensions.
All feature columns are normalized to values between 0 and 1 before training.

Why not a VAE?
The Variational Auto-Encoder (VAE) [64] is a well-known generative framework using dimensional reduction, but complications in its structure make it non-ideal for our needs.
Notably, the standard analytical form of the VAE's Kullback-Leibler (KL) loss assumes a Gaussian target distribution, which is achieved when the term is minimized at µ = 0, σ = 1. This pushes all points in the latent space to within 1 unit of the origin, which, when combined with the variational nature of the VAE, has been shown to cause information collapse during training: loss values can diverge or become degenerate because the KL loss is not a true distance metric [65][66][67]. Additionally, because the KL loss term is defined only in terms of µ and σ, it is difficult to fit points to more complicated distributions such as those that are concentric or hollow. SWAEs come from a family of autoencoders specifically developed to avoid these issues [49,66] by replacing the KL loss term with the Sliced Wasserstein distance from optimal transport theory. The Sliced Wasserstein loss term fits latent points to individual points from the target distribution deterministically instead of variationally, mapping directly to latent coordinates instead of µ and σ. Thus SWAEs avoid the VAE's information collapse problem and can utilize complex target distribution shapes, making them a superior choice for our purposes. Wasserstein-style autoencoders have been shown in the past to produce higher-quality samples in image generation tasks, compared to VAEs [49].

SAFESPAM
Using pytorch [68] and an ADAM optimizer [69], we train a modified SWAE to learn the mappings to and from the latent space, which we denote as f and g respectively. The mapping {f : T → L} from the theory space T to the latent space L is learned by the encoder, and the return mapping {g : L → T} is learned by the decoder. The full mapping of the SWAE can then be written as t̂ = g(f(t)) ≈ t. In SAFESPAM, the primary modification is the addition of a clustering loss term which encourages valid points to cluster together. The loss function we use thus has three terms: the standard L2 reconstruction loss used by autoencoders to ensure accurate mapping, the Sliced Wasserstein loss term which fits latent points to a target distribution, and the clustering loss term, which pushes valid points toward the origin and invalid points away. Because the clustering loss term is aware of our experimental constraints, SAFESPAM is a supervised deep learning method, although weak-scale values do not enter the model as inputs. Our loss equation takes the form L = α·L_L2 + β·L_SW + γ·L_cluster, where α, β, γ are hyperparameters. Below we explain each term in more depth. The L2 loss, or MSE (mean squared error) loss, is calculated between the training data and the output of the SWAE, ensuring that the decoder accurately reconstructs the original inputs when mapping: L_L2 = (1/N) Σ_i ||t_i − g(f(t_i))||², where N is the number of points per batch and t the input theory point vector.
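A minimal sketch of combining the three loss terms in PyTorch, assuming the Sliced Wasserstein and clustering terms are computed elsewhere (all names are illustrative, not the authors' released code):

```python
import torch

def safespam_loss(t, t_hat, sw_term, cluster_term,
                  alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted three-term SAFESPAM-style loss (sketch).

    t, t_hat : input and reconstructed theory points, shape (N, D).
    sw_term, cluster_term : scalar tensors for the Sliced Wasserstein
    and clustering contributions, assumed computed separately."""
    recon = torch.mean((t - t_hat) ** 2)   # L2 / MSE reconstruction term
    return alpha * recon + beta * sw_term + gamma * cluster_term
```

Setting `gamma=0` here mirrors the paper's observation that the network reverts to an unmodified SWAE without the clustering term.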
The Sliced Wasserstein loss is a formulation of the Wasserstein (earth-mover) distance calculated along projected one-dimensional slices in latent space; it quantifies how the target distribution and the latent distribution differ from one another [49,66,71]. Schematically, L_SW = (1/(K·N)) Σ_i Σ_k tc(θ_i · s_k, θ_i · f(t_k)), where the θ_i represent the K randomly sampled one-dimensional slices along which the marginal distributions are defined, s_k and f(t_k) are random samples from the target distribution and the latent input data respectively, and tc(·,·) represents the transport cost between the two distributions.
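The slicing computation can be sketched as follows; this is a generic Monte-Carlo sliced Wasserstein implementation under the assumption of equal-sized sample sets, not the authors' exact code:

```python
import torch

def sliced_wasserstein(z, s, n_slices=50, p=2):
    """Monte-Carlo sliced Wasserstein distance between latent codes z
    and target samples s, both of shape (N, d). Random unit directions
    are drawn, both sets are projected onto each direction, and the
    mean p-th power transport cost between matched (sorted) order
    statistics is returned."""
    n, d = z.shape
    theta = torch.randn(n_slices, d)
    theta = theta / theta.norm(dim=1, keepdim=True)  # unit slice directions
    proj_z = z @ theta.T                             # (N, n_slices) projections
    proj_s = s @ theta.T
    # Sorting each 1D marginal gives the optimal one-dimensional transport plan.
    zs, _ = torch.sort(proj_z, dim=0)
    ss, _ = torch.sort(proj_s, dim=0)
    return torch.mean(torch.abs(zs - ss) ** p)
```

Sorting along each slice is what makes the one-dimensional transport problem trivial, which is the key computational advantage of the sliced formulation over the full Wasserstein distance.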
Our exploratory experiments indicated that for SAFESPAM, strong performance can often be achieved by training only on valid points. In these instances, the latent valid cores often take a convenient compact shape, useful for sampling and further structural exploration. Heuristically, we find that utilizing concentric valid and invalid target distributions outperforms a bimodal normal target distribution in which invalid points cluster around a point far from the latent origin.
Our clustering loss term takes inspiration from the Triplet Loss originally used in the fields of contrastive and metric learning [77,78], with the goal of embedding a hierarchy of distances in the latent space, such that the distance between a desirable point and an anchor point is minimized, and conversely the distance between an undesirable point and the anchor point is maximized. For ease of sampling, we choose the latent origin as our anchor point. Per batch, our clustering term can be written schematically as L_cluster = (1/N) Σ_i [ δ_valid,i · a·||f(t_i)||² + (1 − δ_valid,i) · max(0, b − ||f(t_i)||)² ], where a, b are hyperparameters, N is the number of points per batch, and t the input theory point vector. δ_valid is 1 for valid points and 0 for invalid points. This term encourages valid points to cluster near the latent origin and pushes invalid points outward. When training with only valid points, this term serves as extra assurance that the valid portion of the latent space is sufficiently compact. When γ = 0, a SAFESPAM network reverts to an unmodified SWAE.
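A schematic implementation of such a clustering term, under the assumed hinge form sketched above (the exact functional form used in the paper may differ):

```python
import torch

def clustering_loss(z, delta_valid, a=1.0, b=1.0):
    """Triplet-inspired clustering term (schematic, assumed form).

    z : latent codes, shape (N, d).
    delta_valid : 1.0 for valid points, 0.0 for invalid, shape (N,).
    Valid points are pulled toward the latent origin; invalid points
    are pushed out beyond a margin b."""
    d = z.norm(dim=1)                                 # distance to the origin anchor
    pull = delta_valid * d ** 2                       # attract valid points
    push = (1 - delta_valid) * torch.clamp(b - d, min=0.0) ** 2  # repel invalid points
    return a * torch.mean(pull + push)
```

The hinge on the invalid term keeps the loss bounded: invalid points already outside the margin contribute nothing, rather than being pushed to infinity.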
Together, the Sliced Wasserstein and clustering loss terms allow us to force favorable structure onto the latent space, controlling the distribution of valid and invalid points. Unlike standard non-generative auto-encoders, where poorly-defined latent spaces make sampling unfeasible, SAFESPAM enables efficient sampling by forcing desirable points to cluster together in one area; see Fig. 2 for a cartoon schematic.

Sampling the latent space
To sample, we generate points in the latent space and map them back to the original space using the network's decoder. Our method supports a variety of sampling options, including simple approaches such as uniformly sampling in the valid region. We have found that Kernel Density Estimation (KDE) sampling is best at automatically capturing the shape of the latent valid core. KDE evaluates a kernel function over a dataset to estimate and sample from its probability density function. We utilize a top-hat kernel, with bandwidth h such that K(x; h) ∝ 1 if |x| < h (and 0 otherwise), to estimate the density of our latent valid cores from known valid points, and sample the latent space accordingly. Using this kernel function, the density at a point X in relation to a group of data points x_j can be estimated as ρ(X) ∝ (1/N) Σ_j K(X − x_j; h), where N is the number of points being used by the density estimation module. Essentially, this method approximates the density at each location by counting the number of data points that fall within the bandwidth distance of that point in space. KDE sampling works best in low-dimensional applications, making it well-suited to the latent space of our dimensional-reduction methods.
The valid latent points can be used to build a density estimate, via SciKit Learn's Density Estimation module [79], from which new points can be sampled. Because our mapping has already done the difficult task of clustering these points together in a reduced-dimensional representation, KDE sampling in our latent space is more efficient than in the original space.
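The fit-and-sample step can be sketched with scikit-learn's `KernelDensity`, which supports drawing samples for the top-hat kernel; the wrapper and its defaults are illustrative:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_sample(valid_latent, n_samples, bandwidth=0.1):
    """Fit a top-hat KDE to known valid latent points and draw new
    latent candidates from the estimated density (a sketch of the
    sampling step described in the text)."""
    kde = KernelDensity(kernel='tophat', bandwidth=bandwidth)
    kde.fit(np.asarray(valid_latent))
    return kde.sample(n_samples)   # array of shape (n_samples, latent_dim)
```

The sampled latent points would then be passed through the decoder g to obtain candidate theory points in GUT space.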
In some cases, the latent valid core takes an irregular shape that is more difficult to sample. To remedy this, we augment our basic KDE sampling approach iteratively. Valid points found in our first round of KDE sampling are added to the pool of points used to fit the next round of KDE sampling; see Fig. 3. As this iteration progresses, the bandwidth of the kernel is successively tightened, to sample finer structures more accurately; see Fig. 4. Iterative KDE serves as a means of transitioning from initial exploration of the latent space into more intensive sampling of the valid regions already found [38]. This method can be repeated until a satisfactory efficiency is achieved.
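The iterative loop can be sketched as follows; `oracle` (a stand-in for the softsusy/micromegas validity check) and `decoder` are hypothetical callables supplied by the user, and the round counts and bandwidth schedule are assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def iterative_kde(seed_points, oracle, decoder, n_rounds=4,
                  n_per_round=1000, bw0=0.2, shrink=0.5):
    """Iterative KDE sampling sketch. Each round, a top-hat KDE is fit
    to the current pool of valid latent points, new latent candidates
    are drawn, candidates whose decoded theory points pass the validity
    oracle are added to the pool, and the bandwidth is tightened to
    resolve finer structure."""
    pool = np.asarray(seed_points, dtype=float)
    bw = bw0
    for _ in range(n_rounds):
        kde = KernelDensity(kernel='tophat', bandwidth=bw).fit(pool)
        cand = kde.sample(n_per_round)
        mask = oracle(decoder(cand))           # validity of decoded candidates
        pool = np.vstack([pool, cand[mask]])   # grow the pool with new valids
        bw *= shrink                           # tighten the kernel each round
    return pool
```

In a real application the oracle call is the expensive step, so the per-round sample count would be tuned against the cost of the weak-scale calculation.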
While more advanced sampling options exist, such as training a normalizing flow to sample the potentially irregularly-shaped latent valid cores, we find KDE-style sampling to be much less labor-intensive and more robust at handling cases where there are disparities between the locations of valid training data and valid sampled data. See, for example, the structures shown in Fig. 10.
While SAFESPAM does not guarantee unbiased sampling of the full valid subspace, it does offer significantly higher yields compared with naive sampling.In practice, this often means sampling with very high efficiency over an incomplete portion of the valid range.All SAFESPAM trials shown here used a sampling size of 10k points.

CHUNC
Any function which reduces the dimensionality of its input has the potential to destroy information about the high-dimensional structures it seeks to describe. We anticipate such difficulties in high-dimensional spaces like the cMSSM and pMSSM, where collapse to a two-dimensional latent space could cause significant portions of the valid subspace to be overlooked by SAFESPAM-style methods. Thus, in an attempt to avoid this information loss, we introduce a similar methodology that operates without dimensional compression, by imposing that the latent space be of the same dimension as the input space and that none of the layers in the network have a dimension smaller than the input.
The motivation for this approach comes from the Data Processing Inequality (DPI) [80], which states that any function f : T → L acting on a space T can only ever destroy or preserve information. This information is represented by the joint probability distribution p(T, θ), where T is the theory space discussed in Section III and Section IV.2, and θ are the experimental constraints. The SWAE networks constructed here attempt to transform the unknown distribution p(T, θ) to one whose structure is known, p(L, θ), while preserving the correlations between L and θ. The DPI states that the mutual information satisfies I[T; θ] ≥ I[f(T); θ] ≥ I[g(f(T)); θ], with equality when no information loss has occurred.
To maximize I[L; θ] = I[f(T); θ], then, we impose that the dimension of the space L be the same as the dimension of T. This comes at the expense of the ease of sampling which is otherwise gained from utilizing a low-dimensional latent space, and we also sacrifice the benefits of visualization. However, our latent space still retains the desired quality that valid points cluster near the origin while invalid points are pushed outward. We demonstrate this method in the cMSSM and pMSSM on a triple constraint: Higgs, DM, and neutralino LSP. We train on an equal mixture of valid and invalid points, which we fit to separate non-overlapping target distributions. Our Sliced Wasserstein loss term then becomes, schematically, L_SW = c·SW(f(T_valid), S_valid) + d·SW(f(T_invalid), S_invalid), where c, d are hyperparameters and S_valid, S_invalid are the two target distributions. To reiterate, the role of the latent space is not to generate a low-dimensional representation, but rather solely to transform the unknown distribution of the input space p(T, θ) to one p(L, θ) which is intentionally structured for sampling purposes, without any potential loss of information. We sample from this known distribution with KDE and map back to the original space, in the same manner as SAFESPAM. Because CHUNC retains the Sliced Wasserstein loss term, the method avoids the VAE pitfalls in the same way SAFESPAM does. While the lack of dimensional reduction makes CHUNC superficially similar to normalizing flows, we believe our concentric contrastive target distributions are too complex to be utilized in a standard normalizing flow framework, which requires simple and well-understood base distributions [35].
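One simple way to realize such non-overlapping concentric targets is to sample a ball for the valid points and a surrounding annulus for the invalid points; the radii and the radial sampling scheme here are assumptions for illustration, not the paper's exact distributions:

```python
import numpy as np

def concentric_targets(n, dim, r_valid=1.0, r_invalid=(2.0, 3.0), rng=None):
    """Draw n samples each from two non-overlapping target shapes:
    a ball of radius r_valid for valid points and an annulus with radii
    r_invalid for invalid points (assumed shapes for illustration)."""
    rng = np.random.default_rng(rng)

    def shell(radii):
        x = rng.normal(size=(n, dim))
        x /= np.linalg.norm(x, axis=1, keepdims=True)      # random unit directions
        r = rng.uniform(radii[0], radii[1], size=(n, 1))   # radial placement
        return x * r

    return shell((0.0, r_valid)), shell(r_invalid)
```

Note that sampling the radius uniformly does not give a volume-uniform distribution; for a structural target used only to separate the two classes, this is typically unimportant.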

CHUNC2
CHUNC2 is an alternative formulation of CHUNC which posits that further information loss might be avoided if the contrastive clustering is limited to a single additional categorical variable, θ, allowing the rest of the latent space more freedom to preserve reconstruction capabilities. Instead of pushing invalid points to infinity, the network learns to map to the joint space L × θ, so that the latent space is constrained as a simple Gaussian, p(L, θ) = p(θ)p(L|θ) = p(θ)p(µ_L, σ_L|θ), where the categorical variable θ interpolates between signal-like and background-like events. As such, CHUNC2 replaces the clustering loss term described above with a loss term built around an added categorical variable tasked with codifying whether a point is valid or invalid. Thus our latent space is now N + 1 dimensional, where the new dimension serves as a predictor of a point's validity:
• Binary L2 Loss - The latent categorical variable z_i is constrained through an L2 loss to match the category θ_i = {s, b} of each event: L_binary = (1/N) Σ_i (z_i − θ_i)².
In the following studies, the categorical variable is trained to map to 1.0 for valid points and 0.0 for invalid points. It is of course straightforward to generalize this approach to one in which θ represents a multi-class problem, but we do not pursue this here since our task is a binary one (i.e. with our triple-constraint task we are not interested in generating points that meet only some but not all of the individual constraints). Sampling utilizes KDE on latent points which fall into the most-valid categorical bin to generate new latent points in 6D or 20D respectively, which are mapped back to the original space. Since CHUNC2's latent space is of a higher dimensionality than the parameter space, normalizing flows would not be able to emulate this methodology.
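The binary L2 term above is straightforward to sketch; the function and variable names are illustrative:

```python
import torch

def binary_l2_loss(z_cat, theta):
    """Binary L2 term (sketch): pushes the added categorical latent
    coordinate z_cat toward theta, which is 1.0 for valid points and
    0.0 for invalid points."""
    return torch.mean((z_cat - theta) ** 2)
```

At sampling time, only latent points whose categorical coordinate falls in the most-valid bin would be fed to the KDE step described in the text.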

SAFESPAM
We trained a series of modified SWAEs on cMSSM and pMSSM data using the Higgs and DM constraints separately. Unless otherwise noted, all SAFESPAM models were trained on valid-only data, and latent points were constrained to a truncated normal target distribution by minimizing the Sliced Wasserstein loss. The latent space was sampled with an initial uniform scan whose location was determined by the latent position of valid training data. Valid points from this scan were then used to define a probability density sampled with single-round KDE sampling, using a top-hat kernel for increased precision. Iterative KDE was utilized to improve results in cases with the dark matter experimental constraint. Below, we present the results for Higgs mass constraints in the cMSSM and pMSSM cases, followed by dark matter relic density constraints in each space, and an analysis of the ability of our method to simultaneously satisfy both constraints. We present visuals of the sampled points in the latent space to confirm the presence of a central valid core. We also include histograms comparing our generated valid data to our valid training data to assess to what extent our sampling has introduced unwanted bias or excluded viable regions of the space.
We then report sampling efficiencies, defined as the fraction of points from the sample which adhere to the desired constraint. Relevant hyperparameters are listed in Appendix A. All SAFESPAM model training for this project took place on a 2017 MacBook Pro without use of a GPU. We introduce two additional metrics to better quantify the successes and shortcomings of this method, described as follows:
• Core-forming Metric - This metric measures how successfully the generated valid points match the core shape suggested by the position of latent valid training data. We calculate the Sliced Wasserstein distance between randomly selected points from the latent distribution of valid training points and the latent distribution of valid generated points. Latent points are normalized before calculation to account for differing target distribution sizes between trials. A low value suggests that the generated valid points occur where we expect them to, whereas a higher value suggests potential disparities between the latent distributions of training valids and generated valids. Multiple calculations of this value are averaged to account for the random nature of the slices; all SAFESPAM metrics were calculated by taking the average over 10 iterations, using 200 projections. Schematically, the metric takes the form M_core = (1/(K·N)) Σ_i Σ_k tc(θ_i · gl_k, θ_i · f(tl_k)), where the θ_i represent the K randomly sampled one-dimensional slices along which the marginal distributions are defined, gl_k are the normalized valid latent points generated by the model, f(tl_k) are random normalized latent representations of samples from the valid training dataset, and tc(·,·) represents the transport cost between the two distributions.

• Incompleteness Metric - This metric measures how successfully the generated valid points account for the full distribution of expected valid points seen in the training data. We calculate the Sliced Wasserstein distance between randomly selected points from the GUT distribution of valid training points and the GUT distribution of valid generated points (excluding valid generated points outside the original range with which the training data was generated). All points are normalized before calculation to account for the different ranges used in data generation in the cMSSM and pMSSM. A low value indicates we are sampling without much bias, whereas a higher value suggests we are introducing more significant bias when sampling. Multiple calculations of this value are averaged to account for the random nature of the slices; all SAFESPAM metrics were calculated by taking the average over 10 iterations, using 200 projections. Schematically, the metric takes the form M_inc = (1/(K·N)) Σ_i Σ_k tc(θ_i · gg_k, θ_i · tg_k), where the θ_i represent the K randomly sampled one-dimensional slices along which the marginal distributions are defined, gg_k are the normalized valid GUT points generated by the model, tg_k are random normalized samples from the valid training dataset, and tc(·,·) represents the transport cost between the two distributions.

Higgs cases
Our method had little trouble creating a valid core using the Higgs mass as the experimental constraint in the cMSSM (see Fig. 6), which allowed for a high sampling efficiency (0.957) compared to a naive scan (0.347). Small incursions of invalid points into the core are largely avoided by KDE sampling. The distribution of SAFESPAM-sampled points in the space of cMSSM parameters is shown in Fig. 7, where the newly generated points reasonably match the distribution in the original training set. In the case of the pMSSM, SAFESPAM was again able to structure the latent space for efficient sampling. We note that the specific form of the latent space demonstrates significant sensitivity to hyperparameters such as the shape of the target distribution, the length of training, and the inclusion or exclusion of invalid points, though without a significant change in efficiency.
As a demonstration, we present results for two separate networks trained for the pMSSM Higgs constraint, labeled Trials 1 and 2. Trial 1 was trained first on a combination of valid and invalid points fit to a concentric distribution (a normal distribution with an ablated annulus to create two distinct regions), before being trained further with only valid points fit to a truncated normal. Trial 2 was trained only on valid points fit to a truncated normal, but for four times as many epochs. Figure 8 shows the distribution of the valid points in the latent space for each trial, with clear variation in the structure of the latent space. KDE sampling of the core achieves an efficiency of 0.965 (0.961) for Trial 1 (Trial 2), compared to a naive sampling efficiency of 0.171. The distribution of the generated points in the pMSSM parameter space for both trials is shown in Fig. 9. Although neither trial generated the full range of valid points seen in the training data, future methods might make use of ensemble sampling to remedy this.

Dark Matter Cases
Our dark matter trials had significantly more difficulty creating well-structured latent spaces. While the method was successful in clustering valid training data together, sampling in this vicinity resulted in many points which do not satisfy the constraints; see Fig. 10. However, a subset of points from the initial core allowed for efficient generation of new valid points, as discovered after iterating the KDE sampling procedure four times. This led to a high efficiency of 0.972 (0.979), compared to a naive sampling efficiency of 0.05534 (0.006506), in the cMSSM (pMSSM); see Figs. 11 and 13. The iterative nature of our sampling method allowed us to probe these valid cores efficiently despite their irregular shape. As in the Higgs trials, this method had more success in preserving the valid range of points from the training data in the cMSSM (Fig. 12), and less so in the pMSSM (Fig. 14). We speculate that the difficulty here stems from the particularly complex and narrow structures created in the cMSSM and pMSSM by the extreme narrowness of the dark matter constraint as applied to these spaces: we believe our valid-only training dataset lacked the resolution required to encode their boundaries accurately in the latent space.
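The iterative procedure used in these trials can be sketched as follows. Here `oracle` is a hypothetical stand-in for the full validation pipeline (decoding a latent point and checking the experimental constraint against it), and the bandwidth schedule and sample counts are illustrative rather than the exact values used:

```python
import numpy as np
from scipy.stats import gaussian_kde

def iterative_kde_sampling(seed_valids, oracle, n_rounds=4,
                           n_per_round=1000, bw_start=0.2, bw_shrink=0.7):
    """Iteratively sample the latent valid core with a shrinking bandwidth."""
    valids = np.asarray(seed_valids, dtype=float)  # (n, latent_dim)
    bandwidth = bw_start
    efficiency = 0.0
    for round_idx in range(n_rounds):
        # Fit a KDE to all currently-known valid latent points.
        kde = gaussian_kde(valids.T, bw_method=bandwidth)
        candidates = kde.resample(n_per_round, seed=round_idx).T
        keep = np.array([bool(oracle(c)) for c in candidates])
        efficiency = keep.mean()
        # Add newly found valid points and shrink the bandwidth,
        # tightening the estimate around the valid core.
        valids = np.vstack([valids, candidates[keep]])
        bandwidth *= bw_shrink
    return valids, efficiency
```

Because each round seeds the next KDE with the newly validated points, the sampled region contracts onto the genuinely valid core even when it is irregularly shaped.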

Adding secondary constraints without additional training
The structure of the SAFESPAM latent space allows us to apply additional constraints in media res when sampling. We identify latent points which meet these additional constraints via an initial scan or previous sampling, and use KDE sampling to draw new points from their vicinity in the latent space, without providing any additional information to the autoencoder about which points satisfy these secondary constraints. We demonstrate this in the cMSSM (Fig. 15) and pMSSM (Fig. 16) by re-sampling our Higgs models for points that have both valid Higgs and DM values, i.e. a joint constraint. Although our method does not guarantee that such 'double-valid' points will exist in the latent space or form any neat self-contained substructure, we found our KDE sampling was sufficiently robust to sample the clusters of points that were present in the latent space, achieving a significantly increased efficiency compared to the naive scan, and doing so without requiring any additional training. This property gives our maps an added versatility that many non-mapping generative models and efficient search methods lack. For example, one can add further experimental constraints, or look for theory points from a specific dark matter annihilation class, simply by selectively sampling regions of the map where those characteristics are known to exist. Our method's dimensional reduction also gives it a significant advantage in the ease of sampling under these added constraints compared to other mapping-based generative methods, such as normalizing flows, which could in principle attempt something similar. We see significant increases in efficiency with single-round KDE, up to 0.3191 (0.5621) in the cMSSM (pMSSM) compared to a naive value of 0.004467 (0.01892). Additional iterations of KDE may improve these results further.
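A minimal sketch of this re-sampling step, assuming hypothetical inputs: an array of latent points already evaluated, and a boolean mask marking which of them are double-valid:

```python
import numpy as np
from scipy.stats import gaussian_kde

def resample_joint(latent_pts, is_double_valid, n_samples=500, seed=0):
    """Draw new latent points near those known to satisfy both constraints."""
    double_valid = latent_pts[is_double_valid]
    if len(double_valid) < 2:
        raise ValueError("need at least two double-valid points for a KDE")
    # Build the KDE only from the double-valid subset; the autoencoder is
    # not retrained -- its decoder is simply reused on the new samples.
    kde = gaussian_kde(double_valid.T)
    return kde.resample(n_samples, seed=seed).T
```

Decoding the returned latent points then yields candidate theory points concentrated where both constraints are known to be satisfied.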
Table 3 summarizes the generation efficiency of experimentally valid points using SAFESPAM and naive scanning.

Figure 16: Distributions in the two-dimensional latent space for the pMSSM study, where points which satisfy (fail) the Higgs mass constraint are marked Valid (Invalid), and valid points which also satisfy the dark matter constraint are marked as Double-Valid. In the top left, the initial scan identifies a core of valid Higgs points. In the top right, the subset of those points which also satisfy the dark matter constraint is identified. In the bottom left are points generated using a KDE built from the double-valid points, and in the bottom right are the new points which satisfy both constraints.

The cMSSM and pMSSM training datasets for CHUNC and CHUNC2 each consist of approximately 10K points, evenly split between invalid points and triple-valid points, which satisfy the Higgs mass, dark matter relic density, and LSP constraints defined in section 2. The higher-dimensional latent space requires a slightly different KDE sampling method: rather than performing a uniform scan to identify a potential core, we use the positions of known valid latent points from the training set to perform the kernel density estimation. CHUNC attempts to cluster all invalid points on an N-dimensional spherical shell of radius 1 with a thickness of 0.6, and valid points to an N-dimensional spherical core of radius 0.3. In CHUNC2, all points are forced to a Gaussian distribution in the original dimensions, while the additional categorical quantity is constrained to a value of 1.0 (0.0) for valid (invalid) points.
Despite not using dimensional reduction, CHUNC and CHUNC2 both achieve improvements in generation efficiency of more than two orders of magnitude over the naive scan; see Table 5. In the cMSSM, CHUNC2 has an efficiency of 0.32, outperforming CHUNC's efficiency of 0.21. However, in the higher-dimensional pMSSM case, CHUNC's efficiency of 0.24 exceeds CHUNC2's efficiency of 0.15. It is difficult to say why this is the case. Iterative KDE sampling may improve these results further.
While CHUNC and CHUNC2 appear evenly matched at preserving the proportionality of the valid subspace in the lower-dimensional cMSSM trials (see Fig. 17), in the pMSSM CHUNC2 does a visibly better job of capturing the proportionality of the valid subspace without introducing significant distortion, as seen most notably in the |M_2|, |A_t|, m_L1, m_L3, and m_ũ3 distributions in Fig. 18. However, CHUNC outperforms CHUNC2 in pMSSM efficiency, and our 'incompleteness' metric suggests these differences may be less prominent than they appear. Both CHUNC and CHUNC2 appear to introduce notably less bias than SAFESPAM when sampling from the pMSSM.

6 Discussion and Conclusion

Strengths and Weaknesses
SAFESPAM is capable of significant improvements in efficiency over naive sampling, attaining greater than 95% efficiency in all single-constraint cases; see Table 3. However, the range of points sampled often missed valid regions from the training data, especially in the pMSSM. In many cases we also found valid points generated outside the boundaries specified in our initial data generation, in ranges previously unknown to the model; such behavior is not unexpected for these models. For example, Fig. 7 shows valid points with m_0, m_{1/2} > 10,000 GeV. While the latent distributions produced by SAFESPAM often display a distinct boundary between valid and invalid points, this was not the case for the dark matter trials, perhaps due to the narrower constraint in weak-scale space. Differences in evaluation software, differing combinations of experimental constraints, and variations in which parameters of the spaces are explored and over what ranges all make direct comparisons to other state-of-the-art methods a significant challenge. In most cases these differences constitute essentially different, though related, problems with potentially significant differences in level of difficulty. That said, setting these caveats aside, we find that for the Higgs mass constraint, in both the cMSSM and pMSSM, SAFESPAM achieves efficiencies which significantly outperform prior work using HMC (0.723 cMSSM and 0.319 pMSSM), normalizing flows (0.796 cMSSM and 0.663 pMSSM), NSGA-II (0.715 cMSSM and 0.862 pMSSM), and TPE (0.668 cMSSM and 0.557 pMSSM) methods, and slightly outperform CMA-ES (0.924 cMSSM and 0.899 pMSSM) [35,38]. A significant caveat is that SAFESPAM often fails to capture portions of the valid subspace, especially in the pMSSM. Other methods do not seem to struggle with this to the same extent (although this is difficult to quantify, since few papers provide a numerical metric which directly measures this sort of sampling bias in aggregate over all dimensions).
Within each of SAFESPAM, CHUNC, and CHUNC2 individually, better (lower) core-forming values generally correspond with better (lower) incompleteness, but comparing across methods suggests that removing the constraints of dimensional reduction allows for greater leeway in this regard: in SAFESPAM, core formation seems to become more difficult as the number of dimensions increases, but for CHUNC and CHUNC2 the opposite holds true. Although CHUNC seems to struggle with core formation at times even more than SAFESPAM, this does not seem to affect its incompleteness nearly as much, perhaps because of the method's lack of dimensional reduction. Notably, in the cMSSM, CHUNC has a core-forming value several orders of magnitude worse than any other trial, but is still able to achieve better incompleteness than our pMSSM SAFESPAM dark matter trial. With respect to core-forming alone, CHUNC2 seems to perform around two orders of magnitude better than the corresponding CHUNC trials in both the pMSSM and cMSSM, perhaps because the contrastive task has been shifted onto the categorical variable. However, it is worth noting the lack of pronounced effect this core-forming issue has on incompleteness in CHUNC and CHUNC2, compared to SAFESPAM. It is also possible that the differences in incompleteness we observe arise because CHUNC and CHUNC2 were trained on a combination of valid and invalid points, where SAFESPAM was trained only on valid data. Additionally, we see that although CHUNC2 outperforms CHUNC in both the core-forming and incompleteness metrics in both spaces, this does not always lead to better efficiency. It is difficult to say definitively why this is the case.
Both SAFESPAM and CHUNC-style methods show notable improvement over naive sampling across several constraints in the cMSSM and pMSSM. The ability to apply additional constraints while sampling makes these maps versatile tools, and the iterative usage of KDE sampling ensures that even difficult latent shapes can be sampled with high efficiency. Optimistically, one imagines a web of theorists training SWAE networks on specific sub-regions of interest in these parameter spaces and sharing their models with one another for easy data generation in those regions. Another possibility is to use these methods to explore lesser-known and under-explored subspaces within the full MSSM, such as those which might lead to new solutions to the little hierarchy problem [81,82]. Our methodology could also be used with non-MSSM theoretical parameter spaces, or indeed any high-dimensional supervised learning problem with a clear labeled distinction between valid and invalid points.
Broadly, there is a clear trade-off between the efficiency benefits of dimensional reduction methods such as SAFESPAM and the potential sampling biases introduced by compressing these high-dimensional spaces. In contrast, CHUNC- and CHUNC2-style networks avoid dimensional reduction in their latent spaces, and achieve significantly less bias in sampling the pMSSM despite imposing no constraints that select for this, allowing these methods to sample a broader range of valid points. However, with no dimensional reduction, these latent spaces remain vast and difficult to search exhaustively, even when well-organized, and KDE-style sampling can be less effective, resulting in lower yield.

Future Avenues
In terms of near-term improvements to our methodology, sampling jointly from an ensemble of models with varying hyperparameters could help recover portions of these spaces lost to dimensional reduction in any single model. Alternatively, one might introduce a secondary Sliced Wasserstein loss term, applied during training to fit randomly generated output data to the expected proportionality of the valid subspace, in order to prevent biased sampling.
The cMSSM and pMSSM both adopt de facto theory biases in constraining the underlying MSSM space. Ideally, machine-learning-driven methods will someday be powerful enough to tackle larger unconstrained spaces with high accuracy and efficiency, obviating the need for such constraints. However, neither the SAFESPAM nor the CHUNC-style method would, in its current iteration, perform well in a ∼100-parameter space such as the unconstrained MSSM. Reduction to two dimensions via a SAFESPAM-style methodology would cause significant information loss, while a CHUNC-style method with no dimensional reduction would be overwhelmed by the sheer vastness of the space, the primary reason the MSSM is difficult to explore in the first place. Standard KDE sampling is also known to degrade due to the curse of dimensionality, though there are proposed methods to circumvent this [83][84][85]. Ultimately, future work in this vein will need to thread the needle with dimensional reduction, simplifying sampling and search methods without incurring significant information loss. One optimistically imagines a variable-dimension autoencoder in which the dimension of the latent space at any given point is itself a trainable parameter. Such a network could constrain different portions of a space into separate yet interconnected lower-dimensional representations, changing the dimension of the latent space as needed to maximize the benefits of dimensional reduction while avoiding information loss. The boundaries of the resulting structures in the latent space could be elucidated iteratively, and those structures could then be learned and sampled via manifold learning [50,51].
Another fundamental issue is that as dimensionality increases, a larger initial scan is needed to represent the space without losing resolution. In the unconstrained MSSM, any sort of uniform scan is currently intractable: a scan with only 2 values per dimension would require ∼2^100 points. A potential solution would be to focus on sequential or algorithmic search methods that can operate without the overhead of evaluating a large initial dataset (see [36,40] for examples), but it is unclear how exactly one would meld these with a dimensional-reduction-based framework.

Conclusion
We trained a variety of modified Sliced Wasserstein autoencoders, both with and without dimensional reduction, to sample efficiently from experimentally valid regions of two subspaces of the MSSM, the cMSSM and pMSSM, under multiple experimental constraints. Utilizing these mappings alongside KDE sampling, we demonstrated efficiencies significantly greater than naive values in all cases, and we suspect additional iterative usage of KDE sampling could improve these results even further. Our method is also capable of easily adding secondary constraints without any retraining, in part due to the benefits of dimensional reduction.
If subtle, unexplored regions remain in spaces like the cMSSM and pMSSM, their discovery hinges on sampling methods that can overcome the inherent difficulties of searching high-dimensional spaces. Methods like SAFESPAM, CHUNC, and CHUNC2 provide a novel way to drill deeper into such parameter spaces and ensure underexplored regions do not go overlooked. Further work remains in applying these methods to analyze such regions in the cMSSM and pMSSM, as does the development of iterative strategies for applying successive experimental constraints. The discovery of notable regions in these spaces could easily inspire new LHC analyses in search of supersymmetric signatures we would not otherwise know to look for.

Figure 2 :
Figure 2: A cartoon illustration of the desired outcome of training on the latent space. Minimizing the clustering loss term pushes valid points (shown in orange) towards the latent origin, while minimizing the Sliced Wasserstein loss fits those latent valid points to a target distribution. Shown in blue, invalid points (if present in training data) are pushed outward by the clustering loss term. The resulting clustering of valid points in the latent space allows for straightforward, high-efficiency sampling. Minimizing the L2 loss (not pictured) allows the network to map back and forth between this structured latent space and the constrained MSSM parameter space.

Figure 3 :
Figure 3: A cartoon illustration of the KDE sampling methodology. In CHUNC and CHUNC2, step 1 is omitted and known latent valid points from the training data are used instead.

Figure 4 :
Figure 4: A cartoon illustration of the bandwidth-shrinking procedure in our iterative KDE sampling methodology. As the number of known valid points grows with each successive iteration, we shrink the bandwidth used to estimate the density at each point in the latent space, allowing for higher precision.

Figure 5 :
Figure 5: An example of KDE sampling in the pMSSM: an initial uniform scan of the latent space (left) demarcates the shape of the valid core, and the valid points found in that scan are utilized via kernel density estimation to sample the core efficiently (right). White spaces represent areas that are avoided by the KDE to improve efficiency.

Figure 6 :
Figure 6: Distribution of sampled points in the two-dimensional latent space, for the case of learning to generate cMSSM points which satisfy the Higgs mass constraint. Points which satisfy (fail) the Higgs mass constraint are marked Valid (Invalid). Invalid regions in the center are partially avoided due to KDE sampling.

Figure 7 :
Figure 7: Distribution of SAFESPAM-generated points (red) in the cMSSM in each of the five cMSSM parameters, under the Higgs mass constraint, compared to the distribution of the training dataset (blue).

Figure 8 :
Figure 8: Distribution of sampled points in the two-dimensional latent space, for the case of learning to generate pMSSM points which satisfy the Higgs mass constraint. Points which satisfy (fail) the Higgs mass constraint are marked Valid (Invalid). Shown are points sampled via KDE, where invalid regions in the center are partially avoided due to KDE sampling. Trial 1 is shown on the left and Trial 2 on the right, showcasing the sensitivity of the latent space structure to the hyperparameters.

Figure 9 :
Figure 9: Distribution of SAFESPAM-generated points (red for Trial 1, orange for Trial 2) in the pMSSM in each of the nineteen pMSSM parameters, compared to the distribution of the training dataset (blue). While both trials achieve strong efficiency, they sample different portions of the parameter space.

Figure 10 :
Figure 10: Distributions in the two-dimensional latent space for the cMSSM dark matter study, where points which satisfy (fail) the dark matter constraint are marked Valid (Invalid). The left shows the distribution of valid points in the training data, with a well-formed core. However, many new points sampled from this core do not satisfy the dark matter constraint (right), perhaps due to insufficient resolution of DM-valid features in the training data. Successive KDE iterations allowed for precise, efficient sampling in the remaining regions of high validity (see Figs. 11 and 13).

Figure 11 :
Figure 11: Distributions of sampled points in the two-dimensional latent space in four successive rounds of sampling for the cMSSM dark matter case. Points which satisfy (fail) the dark matter constraint are marked Valid (Invalid). In each round, newly discovered valid points are used to update the kernel density estimate of valid regions of the latent space, which is used to perform the next round of sampling. As a result, the 'crust' of invalid points shown in blue shrinks away with each successive iteration. The efficiencies for these four rounds are 0.724, 0.887, 0.945, and 0.972, respectively.

Figure 12 :
Figure 12: Distribution of SAFESPAM-generated points in the cMSSM (red) in each of the five cMSSM parameters under the dark matter constraint, compared to the distribution of the training dataset (blue).

Figure 13 :
Figure 13: Distributions in the two-dimensional latent space for the pMSSM dark matter study, where points which satisfy (fail) the dark matter constraint are marked Valid (Invalid). Shown are distributions of sampled points in successive rounds, where in each round valid points from previous rounds are used to perform a kernel density estimation of valid regions of the latent space for the next round of sampling. The initial scan in the top-left frame samples via a uniform distribution over a region determined by the positions of latent valid training data. The efficiency for each round is 0.049, 0.432, 0.881, 0.950, and 0.979, respectively.

Figure 14 :
Figure 14: Distribution of SAFESPAM-generated points in the pMSSM (red) in each of the nineteen pMSSM parameters under the dark matter constraint, compared to the distribution of the training dataset (blue).

Figure 15 :
Figure 15: Distributions in the two-dimensional latent space for the cMSSM study, where points which satisfy (fail) the Higgs mass constraint are marked Valid (Invalid), and valid points which also satisfy the dark matter constraint are marked as Double-Valid. On the left, the initial scan identifies a core of valid Higgs points. On the right are the newly generated points which satisfy both constraints.

Figure 17 :
Figure 17: Distribution of CHUNC-generated (red) and CHUNC2-generated (orange) points in each of the five cMSSM parameters under the triple constraint, compared to the distribution of the training dataset (blue).

Figure 18 :
Figure 18: Distribution of CHUNC-generated (red) and CHUNC2-generated (orange) points in each of the nineteen pMSSM parameters under the triple constraint, compared to the distribution of the training dataset (blue).

Table 1 :
Range for cMSSM parameters used in the initial brute-force generation of the training dataset.

Table 2 :
Range for pMSSM parameters used in the initial brute-force generation of the training dataset. An additional pre-processing step was taken to manually remove the gaps at |M 1 | < 0.05 TeV, |M 2 | < 0.1 TeV, and |µ| < 0.1 TeV before training.

Table 3 :
Comparison of the efficiency of generation of experimentally-valid points in naive scanning versus SAFESPAM, for several experimental constraints (Higgs mass m H , dark matter relic density Ω DM , or both) and two theoretical spaces (5-dim cMSSM, 19-dim pMSSM).

Table 4 :
Comparison of the Core-Forming and Incompleteness Metrics for various SAFESPAM trials for several experimental constraints (Higgs mass m H , dark matter relic density Ω DM ) and two theoretical spaces (5-dim cMSSM, 19-dim pMSSM).

Table 5 :
Comparison of the efficiency of generation of experimentally-valid points in naive scanning versus CHUNC and CHUNC2 methods for three experimental constraints (Higgs mass m H , dark matter relic density Ω DM , and the lightest supersymmetric particle (LSP) being a neutralino) and two theoretical spaces (5-dim cMSSM, 19-dim pMSSM). The uncertainty across all trials is ∼ O(0.01), estimated by running 10 independent samplings for each case.

Table 6 :
Comparison of the Core-Forming and Incompleteness Metrics for CHUNC and CHUNC2 for the 'triple' experimental constraint (Higgs mass m H ∩ dark matter relic density Ω DM ∩ neutralino LSP) in two theoretical spaces (5-dim cMSSM, 19-dim pMSSM).

Table 7 :
Table of hyperparameters for SAFESPAM. Note that for pMSSM m H Trial 1, training was split between 20 epochs with a proportional mix of valid and invalid data and 20 epochs of only valid data.

Table 8 :
Table of hyperparameters for CHUNC and CHUNC2.