Constraining the parameters of high-dimensional models with active learning

Constraining the parameters of physical models with >5-10\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$>5-10$$\end{document} parameters is a widespread problem in fields like particle physics and astronomy. The generation of data to explore this parameter space often requires large amounts of computational resources. The commonly used solution of reducing the number of relevant physical parameters hampers the generality of the results. In this paper we show that this problem can be alleviated by the use of active learning. We illustrate this with examples from high energy physics, a field where simulations are often expensive and parameter spaces are high-dimensional. We show that the active learning techniques query-by-committee and query-by-dropout-committee allow for the identification of model points in interesting regions of high-dimensional parameter spaces (e.g. around decision boundaries). This makes it possible to constrain model parameters more efficiently than is currently done with the most common sampling algorithms and to train better performing machine learning models on the same amount of data. Code implementing the experiments in this paper can be found on GitHub


Introduction
With the rise of computational power over the last decades, science has gained the power to evaluate predictions of new theories and models at unprecedented speeds. Determining the output or predictions of a model given a set of input parameters often boils down to running a program and waiting for it to finish. The same is however not true for the inverse problem: determining which (ranges of) input parameters a model can take to produce a certain output a e-mail: b.stienen@science.ru.nl (e.g., finding which input parameters of a universe simulation yield a universe that looks like ours) is still a challenging problem. In fields like high energy physics and astronomy, where models are often high-dimensional, determining which model parameter sets are still allowed given experimental data is a time-consuming process that is currently often approached by looking only at lower-dimensional simplified models. This not only requires large amounts of computational resources, in general it also reduces the range of possible physics the model is able to explain.
In this paper we approach this problem by exploring the use of active learning [9,24,25], an iterative method that applies machine learning to guide the sampling of new model points to specific regions of the parameter space. Active learning reduces the time needed to run expensive simulations by evaluating points that are expected to lie in regions of interest. As this is done iteratively, this method increases the resolution of the true boundary with each iteration. For classification problems this results in the sampling of points around -and thereby a better resolution on -decision boundaries. An example of this can be seen in Fig. 1. In this paper we investigate techniques called query-by-committee [25] and queryby-dropout-committee [11,21,28], which allow for usage of active learning in parameter spaces with a high dimensionality.
The paper is structured as follows: in Sect. 2 we explain how active learning works. In Sect. 3 we show applications of active learning to determine decision bounds of a model in the context of high energy physics, working in model spaces of a 19-dimensional supersymmetry (SUSY) model. 1 We con-

Active learning
Scientific simulations can be computationally expensive to run, making it expensive to explore the output space of these. Approximations of these simulations can however be constructed in the form of machine learning estimators, which are typically quick to evaluate. Active learning leverages this speed, exploiting the ability to quickly estimate how much information can be gained by querying a specific point to the true simulation (also known as the labelling procedure).
Active learning works as an iterative sampling technique. In this paper we specifically explore a technique called poolbased sampling [24], of which a diagrammatic representation can be found in Fig. 2. In this technique an initial data set is sampled from the parameter space and queried to the labelling procedure (also called the oracle). After retrieving the new labels one or more machine learning estimators are trained on the available labelled data. This estimator (or set of estimators) can then provide an approximation of the boundary of the region of interest. We gather a set of candidate (unlabelled) data points, which can for example be sampled randomly or be generated through some simulation, and provide these to the trained estimator. The output of the Fig. 2 Diagrammatic representation of active learning. Data is sampled and used to create a data set. This data is used to train an ML estimator (or a committee of estimators), which is used to get an approximation of the labelling on newly sampled data. From this new data the points with the highest uncertainty in their labelling are selected for sampling via the true sampling procedure and added to the data set. This process can be repeated until enough data is collected estimator can then be used to identify which points should be queried to the oracle. For a classification problem this might for example entail finding out which of the candidate points the estimator is most uncertain about. As only these points are queried to the oracle, it will not spend time on evaluating points which are not expected to yield significant information about our region of interest. The selected data points and their labels are then added to the total data set. This procedure -from creating an estimator to adding new to the data set -can be repeated to get an increasingly better estimation of the region of interest and be stopped when e.g. the collected data set reaches a certain size or when the performance increase between iterations becomes smaller than a predetermined size.
It should be noted that the active learning procedure as described above has hyperparameters: the size of the initial data set, the size of the pool of candidate data points and the number of candidate data points queried to the oracle in each iteration. The optimal configuration for these parameters is problem dependent and finding them requires a dedicated search. We performed a random search on the hyperparameters of the experiments in Sect. 3 and selected the best configuration for all experiments. For completeness a discussion on the hyperparameters can be found in Appendix A. We do want to note that in all the active learning configurations we experimented with, active learning always performed at least equally as good as random sampling.
Arguably the most important step in Fig. 2 is to select those points that ought to be queried to the labelling procedure from a large set of candidate data points. As the problems we look at here are classification problems, the closeness to the boundary can be estimated by the uncertainty of the trained estimator on the classification of the model point.
This uncertainty can for example be obtained from an algorithm like Gaussian Processes [22], which has already been successfully applied in high energy physics to aim sampling of new points around 2-dimensional exclusion boundaries [10]. Due to the computational complexity of this algorithm it is however limited to low-dimensional parameter spaces, as it scales at best with the number of data points squared [13]. Because of this, we investigate specifically query-by-committee and query-by-dropout-committee.

Query-by-committee (QBC)
By training multiple machine learning estimators on the same data set, one could use their disagreement on the prediction for a data point as a measure for uncertainty. Data points on which the estimators disagree most are expected to provide the highest information gain. This method is called query-bycommittee (QBC) [25]. To create and enhance the disagreement among the committee members in uncertain regions the training set can be changed for each estimator (e.g. via bagging [6]) or by varying the configuration of the estimator (e.g. when using a committee of Neural Networks, each of these could have a different architecture or different initial conditions), such that we get a reasonable amount of diversity in the ensemble.
The disagreement among the estimators can be quantified through metrics like the standard deviation. For binary classification problems it can even be done by taking the mean of the outputs of the set of N estimators. If the classes are encoded as 0 and 1, a mean output of 0.5 would mean maximal uncertainty, so that the uncertainty measure for N estimators could be The QBC approach is not bound to a specific estimator. If one were to use N estimators of which the training scales linearly with the number of data points K , the active learning procedure would have a computational complexity of O(N K ) for each iteration. This allows for the use of large amounts of data, as is needed in high-dimensional parameter space.

Query-by-Dropout-committee (QBDC)
Compared to Random Forests, Neural Networks are able to capture more complex data patterns. Using Neural Networks for active learning might therefore be beneficial. However, although query-by-committee can also be used to create a committee of Neural Networks, this is generally ill-advised due to the computational expensive more training of Neural Networks. As an alternative to this, one can also build a committee using Monte Carlo dropout [12]. This technique uses a Neural Network with dropout layers [26] as estimator. These dropout layers are conventionally used to prevent overtraining (i.e. increased performance on the training set at the cost of a reduction in performance on general data sets) by disabling a random selection of the neurons in the preceding layer of the network. This selection is changed at each evaluation, and therefore at each training step of the network. This makes it harder for the network to strengthen a specific node path through the network during training, making it more robust to changes in the input data. At test time the dropout layers are typically disabled and the weights of the layer following the dropout layer are rescaled with a factor of 1/(1 -dropout fraction), in order to make the network.
In Monte Carlo dropout, on the other hand, the dropout layers are left enabled at test time, making the output of the network vary at each evaluation. Making x predictions on a specific data point can therefore be interpreted as a set of predictions coming from a committee of x networks. The advantage here however is that only a single network has to be trained. As this method uses Monte Carlo dropout, this method is called Query-by-Dropout-Committee (QBDC) [11,21,28].
For classification problems one might wonder why not to simply use a softmax output of a normal network (i.e. one not using Monte Carlo dropout) and use this output as a measure for uncertainty. Softmax layers associate a probability for each output class, which could in principle be transformed in an uncertainty. A measure like this can be used the address the uncertainty inherent in the data, also known as the aleatoric uncertainty [14]. It captures uncertainty around decision boundaries for example, where based on the data the network is insecure on how to classify new data. The output of the softmax layer says however nothing about the uncertainty in the model itself, the epistemic uncertainty [14]. The epistemic uncertainty on the other hand quantifies the uncertainty related to the model configuration (i.e. that different models would give different predictions). A single network with a softmax activation on its output would not take epistemic uncertainty into account, as its output has no intrinsic mechanism to capture epistemic uncertainty. This is especially important in regions where training data is sparse. A single network may therefore still give the impression to be very certain in such regions. QBDC, like other ensemble methods, captures model uncertainty by producing more variation in regions where training data is sparse. For active learning, both aleatoric and epistemic uncertainty are important.

Applications in HEP
In this section active learning as a method is investigated using data sets from high energy physics. The experiments investigated here are all classification problems, as these have a clear region of interest: the decision boundary. It should be noted that the methods explored here also hold for regression problems with a region of interest (e.g. when searching for an optimum). Although active learning can also be used to improve the performance of a regression algorithm over the entire parameter space, whether or not this works is highly problem and algorithm dependent, as can for example be seen in Ref. [23].

Increase resolution of exclusion boundary
As there are no significant experimental signals found in "beyond the standard model" searches that indicate the presence of unknown physics, the obtained experimental data is used to find the region in the model parameter space that still allowed given the experimental data. Sampling the region around the boundary of this region in high-dimensional spaces is highly non-trivial with conventional methods due to the curse of dimensionality.
We test the application of active learning on a model of new physics with 19 free parameters (the 19-dimensional pMSSM [16]). This test is related to earlier work on the generalisation of high-dimensional results, which resulted in SUSY-AI [7]. In that work the exclusion information on ∼ 310,000 model points as determined by the ATLAS collaboration [27] was used; the same data is used in this study. We investigate three implementations of active learning: two Random Forest set ups, one with a finite and the other with an infinite pool, and a setup with a QBDC. The performance of each of these is compared to the performance of random sampling, in order to evaluate the added value of active learning. This comparison is quantified by using the following steps: 1. Call max_performance the maximum reached performance for random sampling; 2. Call N random the number of data points needed for random sampling to reach max_performance; 3. Call N active the number of data points needed for active learning to reach max_performance; 4. Calculate the performance gain through The configurations of the experiments were explicitly made identical and were not optimised on their own.

Random forest with a finite pool
Just as for SUSY-AI we trained a Random Forest classifier on the public ATLAS exclusion data set [27] (details on the configuration of this experiment can be found in Appendix B). This data set was split into three parts: an initial training set of 1000 model points, a test set of 100,000 model points and a pool of the remaining ∼ 200,000 model points. As the labeling of the points is 0 for excluded points and 1 for allowed points, after each training iteration the 1000 new points with their Random Forest prediction closest to 0.5 (following the QBC scheme outlined in Sect. 2.1) are selected from the pool and added to the training set. Using this now expanded dataset a new estimator is trained from scratch. The performance of this algorithm is determined using the test set. This experiment is also performed with all points selected from the pool at random, so that a comparison of the performance of active learning and random sampling becomes possible. The results of both experiments are shown in Fig. 3. The bands around the curves in this figure indicate the range in which the curves for 7 independent runs of the experiment lie. The figure shows that active learning outperforms random sampling initially, but after a while random sampling catches up in performance. The decrease in accuracy of the active learning method is caused by an overall lack of training data. After having selected approximately 70,000 points via active learning, new data points are selected further away from this boundary, causing a relative decrease of the weight of the points around the decision boundary, degrading the generalisation performance.
Based on Fig. 3 the performance gain of active learning over random sampling in the early stages of learning, up to a train size of 50,000, -as described by Eq. 2 -lies in the range 3.5 to 4.

Random forest with an infinite pool
A solution to the problem with the finite pool in Sect. 3.1.1 is to not give it a pool of data but to give it access to a method through which new data points can be generated. In this experiment we create such a method that generates data in the training volume of SUSY-AI with a uniform prior. Although in each iteration only a limited set of candidate points is considered, the fact that this set is sampled anew in each iteration guarantees that the decision boundary is never depleted of new candidate points. Because of this, the pool can be considered infinite. In contrast to the experiment in Sect. 3.1.1, where labeling (i.e., excluded or allowed) was readily available, determining true labeling on these newly sampled data points would be extremely costly. Because of this SUSY-AI [7] was used as a stand-in for this labeling pro- True labeling was provided by ATLAS [1]. Active learning quickly starts outperforming the random sampling. The decline in accuracy for active learning, starting from a training size of 60,000, is caused by the limited size of the pool and the fact that the region around the pool is depleted from data around the decision boundary. The bands around the curves show the range in which all curves of that colour lie when the experiment was repeated 7 times cess. 2 Since we are training a Random Forest estimator, we retrained SUSY-AI as a Neural Network, to make sure the newly trained Random Forest estimator would not be able to exactly match the SUSY-AI model, as this would compromise the possibility to generalise the result beyond this one. The accuracy of this Neural Network was comparable to the accuracy of the original SUSY-AI. Details on the technical implementation can be found in Appendix B.
The accuracy development as recorded in this experiment is shown in Fig. 4. The bands again correspond to the ranges of the accuracy as measured over 7 independent runs of the experiment. The gain of active learning with respect to random sampling (as described by Eq. 2) is 6 to 7. The overall reached accuracy is however lower than in Fig. 3, but note that this experiment stopped when a total of 100,000 points as sampled, compared to the 200,000 points in the previous experiment.

QBDC with an infinite pool
To test the performance of QBDC, the infinite pool experiment above was repeated with a QBDC setup. The technical 2 Since SUSY-AI has an accuracy of 93.2% on the decision boundary described by the ATLAS data [27], active learning will not find the decision boundary described by the true labeling in the ATLAS data. However, as the goal of this example is to show that it is possible to find a decision boundary in a high-dimensional parameter space in the first place, we consider this not to be a problem. True labeling was provided by a machine learning algorithm trained on model points and labels provided by ATLAS [1]. Here active learning is vastly superior over random sampling, yielding a gain in computational time of a factor of 5-6. The bands around the curves show the range in which all curves of that colour lie when the experiment was repeated 7 times details of the setup can be found in Appendix B. The accuracy development plot resulting from the experiment can be seen in Fig. 5. The bands around the lines representing the accuracies for active learning and random sampling indicate the minimum and maximum gained accuracy for the corresponding data after running the experiment 7 times. The performance gain (as defined in Eq. 2) for active learning in this experiment lies in the range 3-4.
QBDC sampling is approximately K times faster than ensemble sampling with K committee members for a fixed number of samples, as only one network has to be trained. However, as active learning outperforms random sampling by a factor of 3 to 4, it depends on how expensive training of the estimator is in comparison to how much computational time is gained.
Compared to Figs. 3 and 4 the accuracies obtained in Fig. 5 are significantly higher. This can be caused by the fact that the model trained to quantify the performance more strongly resembles the oracle (both of them are Neural Networks with a similar architecture), or that the Neural Network is inherently more capable of capturing the exclusion function.

QBC with infinite pool for smaller parameter spaces
To investigate the performance of the method on lower dimensional spaces, we perform the QBC experiment of Sect. 3.1.2 five times. Each of these iterations uses the 19dimensional parameter space from this original experiment, but in each of these we fix an increasingly larger subset of variables of the original 19. For those familiar with SUSY: True labeling was provided by a machine learning algorithm trained on model points and labels provided by ATLAS [1]. The gain of active learning with respect to random sampling (as described by Eq. 2) is 3-4. The bands show the range in which all curves of that colour lay when the experiment was repeated 7 times we consecutively fix the variables associated with the slepton sector, the electroweak (EW) sector, the higgs sector and the third generation. An overview of the variables and their fixed values can be found in Table 1. The specific values to which the variables were fixed were determined using SUSY-AI: we required that whatever value selected, there was a balance in the fraction of allowed and excluded points. To make results from these experiments comparable to the experiments in Sect This avoids, amongst other things, for example that the initial data set is too large for active learning to be most effective.
The results of the five reduced dimensionality experiments can be found in Fig. 6. We have also included the results of Sect. 3.1.2 for comparison. The gain over active learning with respect to random sampling (Eq. 2) differs slightly from plot to plot and can be found in the last column of Table 2. From this table it can be seen that active learning is of added value, even when applied to parameter spaces with a smaller dimensionality.
However, although the relative gain in performance is an informative measure, when deciding to apply active learning one also has to consider the additional computational overhead that the procedure adds to the sampling process. How large of an overhead one is willing to accept depends largely on the cost associated with querying the oracle. If a query to  To investigate this influence of oracle query times we kept track of the time taken by the sampling process and oracle query times for active learning in this and the previous section. Using this information we calculated the time needed for active learning to obtain the maximum performance by random sampling. Comparing this time to the time needed to randomly sample the necessary amount of points for this performance gives a time-based comparison between the two sampling methods. In Table 2 the following metric is used time metric = time taken by random sampling time taken by active learning .
The table shows the results for this metric for the experiments in this and the previous section, averaged out over the seven performed iterations. From these results it can be concluded that active learning can provide a benefit over random sampling even for low dimensional parameter spaces, as long as the oracle query time is long enough. An alternative interpretation could be that if one would be able to find an invertible transformation that projects the data to a lower dimensional The bands show the range in which all curves of that colour lay when the experiment was repeated 7 times. The subsplots show the development for different dimensionalities for the simplified models. The relative gain of active learning over random sampling is approximately equal for all six of these plots, showing that active learning has added benefit even for smaller parameter spaces space, active learning might not be necessary -as long as the oracle query time is short enough. An important note should be made about the exact oracle query times in Table 2 however. Although the ratios reported in the table are independent on the used hardware and the used active learning method, this is not the case of the dependence of these ratios on the quoted oracle query times. Where the time taken by random sampling is only dependent on this oracle query time and has negligible overhead, this is not the case for active learning. So if the oracle query time stays the same but the computer becomes for example twice as slow (or the machine learning method would be replaced with a method that is twice as expensive), the ratio would drop with a factor anywhere between 1 and 2, the exact value depending on the oracle query time and the time taken by the active learning overhead. The general trend that active learning can be beneficial for lower dimensional parameter spaces if the oracle time is just long enough -or alternatively: that active learning becomes more and more relevant as one considers higher and higher dimensional parameter spaces -is however unaffected by this hardware dependence.

Identifying uncertain regions and steering new searches
Instead of using active learning e.g. to iteratively increase the resolution on a decision boundary, the identification of uncertain regions of the parameter space on which active learning is built can also be used to identify regions of interest.
For example, in high energy physics one could attempt to identify model points around the exclusion boundary in a high-dimensional model. These model points could then be used as targets for new searches or even new experiments. This is an advantage over the conventional method of trying to optimise a 2-dimensional projection of the exclusion region, as this method works over the full dimensionality of the model, which thereby can respect a more detailed account of the underlying theory that is being tested for. One could even go a step further by reusing the same pool for these search-improvement studies, so that regions of parameter space that no search has been able to exclude can be identified. Analogous to this one could also apply this method to find targets for the design of a new experiment.
To test the application of this technique in the context of searches for new physics we trained a Neural Network on the publicly available ATLAS exclusion data on the pMSSM-19 [27], enhanced with the 13 TeV exclusion information as calculated by [4]. The technical setup is detailed in Appendix B. We sampled ∼ 87,000 model points in the SUSY-AI parameter space [7] using SOFTSUSY 4.1.0 [3] as spectrum generator and selected 1000 points with the highest uncertainty following the QBDC technique outlined in Sect. 3.1.3. Figure 7 shows the sampled model points in the gluino mass -LSP mass projection. As the LSP mass was not directly one of the input parameters, the fact that the selected points are nevertheless well-sampled in the region of the decision boundary, we conclude that the active learning algorithm did successfully find the decision boundary in the 19dimensional model.
We conclude this section by noting that in all the active learning experiments in this section new points were selected exclusively with active learning. In more realistic scenarios the user can of course use a combination of random sampling and active learning, in order not to miss any features in parameter space that were either unexpected or not sampled by the initial dataset.  . 7 The model points that were selected in a pool-based sampling projected on the gluino mass (mg) -LSP mass (mχ0 1 ) plane. The algorithm did not have direct access to the variables on the axes, but was nevertheless able to sample points in the region around the decision boundary, indicated by the solid black line in the figure. The dashed red lines indicate the boundary of the model, outside of which no data points will be sampled (caused by the fact that no supersymmetric particle can be lighter than the LSP)

Conclusion
In this paper we illustrated the possibility to improve the resolution of regions of interest in high-dimensional parameter spaces. We specifically investigated query-by-committee and query-by-dropout-committee as tools to constrain parameters and the possibility to improve the identification of uncertain regions in parameter space to steer the design of new searches. We find that all active learning strategies presented in this paper query the oracle more efficiently than random sampling, by up to a factor of 7. By reducing the dimensionality of the 19-dimensional parameter space used in the experiments, we have additionally shown that active learning can be beneficial even for parameter spaces with a lower dimensionality.
One of the limiting factors of the techniques as presented in this paper is the fact that a pool of candidate points needs to be sampled from the parameter space. If sampling candidate points randomly yields too few interesting points, generative models can be used to sample candidate points more specifically.
The code for all performed experiments is made public on GitHub . 3 Fig. 8 The dependence of the accuracy in the last iteration of the active learning procedure on the number of candidates in each iteration. The error bars indicate the range within which the accuracies over 7 runs lie. As described in the text, an optimum value can be observed, although it should be noted that this value also depends on the number of data points selected in each iteration -size_select: The number of data points to select from the pool of candidate data points and query to the oracle.
Which settings are optimal depends on the problem at hand, although some general statements can be made about the possible values for these hyperparameters. To illustrate this we performed a hyperparameter optimisation for the experiment in Sect. 3.1.2, although it should be noted that this optimisation was performed only for illustration purposes and was not used to configure the experiments in this paper.
The size_initial for example configures how well the first trained machine learning estimator approximates the oracle. If this approximation is bad, the first few sampling iterations will sample points in what will later turn out to be uninteresting regions. A higher value for size_initial would therefore be preferable over a smaller value, although this could diminish the initial motivation for active learning: avoiding having to run the oracle on points that are not interesting with respect to a specific goal.
The size_sample parameter however will have an optimum: if chosen too small the selected samples will be more spread out and possibly less interesting points will be queried to the oracle. If chosen too high on the other hand the data could be focused in a specific subset of the region of interest because the trained estimator happens to have a local minimum there. The existence of an optimal value for size_sample can be seen in Fig. 8.
It should be noted that the location of the optimum does not only depend on size_sample, but also on size_select. If one were to set size_select to 1, the size of the candidate pool is best as large as possible, Fig. 9 The dependence of the accuracy in the last iteration of the active learning procedure on the number of candidates and the number of points selected for querying to the oracle in each iteration. The last iteration was defined as the last iteration before 100,000 data points were selected, meaning that a setup with size_select equal to 500 had more iterations than a setup with a size_select of 7500 for example in order to be sure that the selected point is really the most informative one you can select. This would avoid the selection of clustered data points, but this comes at the cost of having to run the procedure for more iterations in order to get the same size for the final data set. This would however be very expensive if the cost for training the ML estimator(s) is very high. The dependence of the accuracy on these two variables is shown in Fig. 9, in which the accuracy gained in the last step of the active learning procedure is shown for different configurations of these two parameters. The script to generate this figure can be found on GitHub .

Appendix B: Experiment configuration
All networks were trained using Keras [8] with a Tensorflow [2] backend linked to CUDA [19]. scikit-learn [20] was used when implementations of Random Forest were used.

Increase resolution of exclusion boundary
The configuration of the active learning procedure can be found in Table 3. The experiments are denoted by the section in this paper in which they were covered. Random Forest with a finite pool The trained Random Forest classifier followed the defaults of scikit-learn [20]: it consisted out of 10 decision trees with gini impurity as splitting criterion. Random Forest with an infinite pool For active learning we trained a Random Forest [5] classifier that consisted out of 100 decision trees with gini impurity as splitting criterion. All other settings were left at their default values.