Efficiency Parameterization with Neural Networks

Multidimensional efficiency maps are commonly used in high energy physics experiments to mitigate the limitations in the generation of large samples of simulated events. Binned multidimensional efficiency maps are however strongly limited by statistics. We propose a neural network approach to learn ratios of local densities to estimate in an optimal fashion efficiencies as a function of a set of parameters. Graph neural network techniques are used to account for the high dimensional correlations between different physics objects in the event. We show in a specific toy model how this method is applicable to produce accurate multidimensional efficiency maps for heavy flavor tagging classifiers in HEP experiments, including for processes on which it was not trained.


Introduction
An overarching issue of Large Hadron Collider (LHC) experiments is the need for massive numbers of simulated collision events to estimate the rates of expected processes in very restricted regions of phase space. To mitigate this difficulty, a commonly used approach is the event weighting technique, which replaces selection cuts with event weights. These weights are typically defined from binned efficiency maps. The difficulty with these methods is the limited range of applicability of efficiency maps, which are restricted in the number of dimensions (typically two) and consequently fail to capture more subtle effects that appear in specific regions of phase space. To account for these dependencies, a multidimensional mapping is required. This in turn implies large statistical fluctuations in the map itself, which defeats the original purpose of the method.
A common example of the use of event weighting techniques is given by analyses relying on the identification of jets originating from b-quarks (b-tagging) [1][2][3]. Applying a weight corresponding to the expected identification efficiency of a jet, i.e. the probability of it being identified as a b-jet, instead of a direct selection cut can provide large gains in statistics (especially when percent-level efficiencies are applied to several jets in an event). However, obtaining universally applicable maps requires accounting for a large number of parameters, some of which are typically not known.
The goal of the proposed method is to provide higher dimensional parametrizations of efficiencies that can capture non-trivial dependencies while making optimal use of the available statistics and therefore be applicable in any analysis context considered. When achieving this goal the parameterization will be referred to as universal. The proposed approach is based on Graph Neural Networks (GNN). The case study used is the b-tagging performance in the analysis of Higgs boson decays to b quarks (H → bb).
The strength of the proposed method relies on its ability to model high-dimensional correlations between jets. These jet-by-jet dependencies are not given explicitly as input variables to the neural network; rather, they are inferred from single-jet properties during the training of the network. When multiple jets in the event are b-tagged, the jet efficiencies provided by the NN can be combined to derive an unbiased estimator of the event tagging efficiency. A toy model is built to probe the capability of the ML approach to provide a robust parametrization of the b-tagging efficiency.
The paper is organized as follows. Section 2 introduces the event weighting technique and describes the main challenges and goals of the method. Section 3 describes the MC simulation technique used to generate the toy data-set. Section 4 describes a map-based technique that is commonly used to estimate the event weight based on a parameterization of the b-tagging classifier performance. Section 5 describes the GNN model, whose results are compared to the ones of the map-based technique in Section 6. In Section 7 some considerations about the usage of the proposed methodology in real experiments are presented. Conclusions are drawn in Section 8.

Event weighting technique
In high energy physics (HEP) experiments, estimating a background rate or a signal efficiency from a selection cut, such as $f(x) > T_f$, is most accurately achieved by a full simulation of the event. However, the precision of such an estimate can be heavily affected by the limitation in the number of events that can be simulated in a given region of phase space. If instead of selecting events based on a classification cut, a weight corresponding to the classifier efficiency is applied, significant improvements in sensitivity can be gained, as schematically shown in Figure 1. This procedure is also known as the Tag-Rate-Function (TRF) method or Truth Tagging (TT) [4,5].
Selections can be interpreted as a classification depending on a vector of input variables $x$. The classifier can be represented by a function $f(x)$ and the classification by a simple selection cut on the classifier output above a given threshold $T_f$. The classifier can represent simple cuts or a multivariate method. Typically the variables $x$ depend on several underlying properties of the events, their simulation and their reconstruction. This leads to non-trivial dependencies of the distribution of $x$ on the type of event and on the event features used in the analysis, which will be denoted by $\theta$.
In the case of Heavy Flavor tagging, $x$ includes the reconstruction of secondary vertices and a combination of track impact parameter information estimated from the properties of a set of reconstructed charged particle tracks. This information is then combined to produce a multivariate jet-based classifier $f(x)$.
A classifier efficiency can be written as:
$$\epsilon = \frac{N_{\text{jets}}\left(f(x) > T_f\right)}{N_{\text{jets}}}$$
where $T_f$ is the operating working point threshold of the classifier, the numerator is the number of jets of a given flavor selected at this working point, and the denominator is the total number of jets of the same flavor.
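The statistical gain from weighting instead of cutting can be illustrated with a minimal numerical sketch. Everything here is an assumption made for illustration: the percent-level efficiencies are drawn from an arbitrary uniform range, and `yields_from_toy` is a hypothetical helper, not part of the method described in the text.

```python
import numpy as np

def yields_from_toy(n_events, rng):
    """Return (direct-cut yield, weighted yield) for one toy sample."""
    # Hypothetical per-event efficiencies at the percent level.
    eff = rng.uniform(0.005, 0.015, size=n_events)
    passed = rng.random(n_events) < eff   # emulate the selection cut
    return passed.sum(), eff.sum()        # count survivors vs. sum of weights

rng = np.random.default_rng(0)
toys = np.array([yields_from_toy(10_000, rng) for _ in range(200)])
cut_yields, weighted_yields = toys[:, 0], toys[:, 1]

print("direct cut: mean %.1f, spread %.2f" % (cut_yields.mean(), cut_yields.std()))
print("weighting : mean %.1f, spread %.2f" % (weighted_yields.mean(), weighted_yields.std()))
```

Both estimators target the same expected yield, but the weighted one keeps every simulated event and therefore fluctuates much less from toy to toy.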
To achieve a parametrization of the efficiency that is applicable to a large number of analyses, a set of relevant variables $\theta$ must be defined such that the conditional probability of the classifier inputs $x$ at a given value of $\theta$, $p(x|\theta)$, is identical between samples or different regions of phase space, as illustrated in Fig. 2. This motivates the efficiency map approach, where an attempt is made to parametrize $\epsilon_{jet}$ in bins of $\theta$. Efficiency maps are a commonly used tool in real experiments. However, taking into account the full dependencies of the classifier efficiency is often impractical with efficiency maps, because a small enough set of variables that fully captures these dependencies might not be available.
In the case of b-tagging, $\theta$ is typically defined as the jet transverse momentum $p_T$ and pseudorapidity $\eta$ [1]. While $p_T$ and $\eta$ were found to be the dominant variables in determining $\epsilon_{jet}$, other variables also affect the efficiency and could be considered if they were known, e.g. the angular separation and flavor of the adjacent jets [2,6].
We propose a different approach to estimate $\epsilon_{jet}$, based on a neural network built using a GNN. The neural network takes as input a set of jet variables $\Theta^e_j$ for each jet $j$ in the event $e$. In the toy model used throughout this paper, the input variables are the jet ($p_T$, $\eta$, $\phi$, flavor), and the neural network model infers, in addition to $p_T$ and $\eta$, the jet-by-jet angular dependencies $\Delta R(i,j)$ of the b-tagging efficiency, which reflect the environment of the tagged jet. $\Delta R(i,j)$ is the angular distance between the tagged jet $i$ and the adjacent jets $j$, defined in Section 3.

Figure 2: The joint distribution of $(x, \theta)$ is generally different between two samples. The top right plot shows the overall probability distribution of the input variables of the classifier, $P(x)$, for two different samples. Different $P(x)$ distributions lead to different overall efficiencies between the two samples. The bottom right plot shows the conditional probability distributions, $P(x|\theta)$, for the two samples. The set of relevant variables $\theta$ is defined to provide a $P(x|\theta)$ which is sample independent. Under this condition, the parametrized classifier efficiency $\epsilon(\theta)$ is expected to be universal.

Simulated samples
The samples employed in this study consist of toy pp collision events with multiple jets generated with generic kinematic and flavor properties. We assume a cylindrical coordinate system where particle beams collide on the z axis, xy is denoted as the transverse plane, φ is the azimuthal angle, θ the polar angle, and pseudo-rapidity η is defined as η = − log tan(θ/2).
The generated events are sampled using an exponential function to fix the number of jets in the event, and Gaussian or polynomial distributions to sample the jet kinematic variables and the angular distance between two jets, $\Delta R(i,j) = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$. More details about the event generation can be found in Appendix A. Three separate samples of four-momenta representing b-, c- and light-jets are generated. The b-tagging efficiency is modeled with an ad-hoc parameterization: a multivariate Gaussian distribution depending on $p_T$ and $\eta$, modified by a multiplicative correction factor depending on the angular distance $\Delta R(i,j)$ to the other jets in the event as well as on their flavor. This efficiency is chosen to mimic the b-tagging performance of ATLAS and CMS [1,7] and is expressed as:

$$\epsilon^{jet}_i = \epsilon_{f_i}(p_T, \eta) \cdot \prod_{j \neq i} \hat{\epsilon}_{ij}\left(\Delta R(i,j), f_j\right) \qquad (3.1)$$

where $\epsilon_{f_i}(p_T, \eta)$ is the two-dimensional parameterisation of the efficiency to tag a jet of a given flavor $f_i$, and $\hat{\epsilon}_{ij}(\Delta R(i,j), f_j)$ is the one-dimensional correction factor which accounts for the effect of any close-by jet $j$ of flavor $f_j$ in the event. The efficiencies $\epsilon_{f_i}(p_T, \eta)$ and the correction factors $\hat{\epsilon}_{ij}$ are shown in Figure 3.

The true b-tagging efficiency of each individual jet in the event is computed using Eq. 3.1. This efficiency value $\epsilon^{jet}_i$ is used to emulate b-tagging by assigning to each jet a boolean value, istag, which is set based on a random score $s_i$ sampled from a uniform distribution. Namely, if $s_i < \epsilon^{jet}_i$, the $i$-th jet in the event is considered to be b-tagged (istag=1). In many physics analyses, multiple jets in the event are required to pass b-tagging selections, hence the efficiencies of the single jets need to be combined to form a per-event efficiency. In this toy analysis the event selection is based on the two jets with highest $p_T$ in the event ("leading jets", labeled as 1 and 2), and the event efficiency $\epsilon_{event}$ is defined depending on the number of tagged jets, $n_{tag}$ (Eq. 3.2). When both leading jets are required to be tagged, for instance, $\epsilon_{event} = \epsilon_1 \cdot \epsilon_2$.
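The tagging emulation described above can be sketched as follows. The functional forms and numerical constants in `base_eff` and `correction` are hypothetical placeholders rather than the parameterizations of the toy model (those are given in the appendix and Figure 3), and for brevity the correction here ignores the flavor of the neighboring jet.

```python
import numpy as np

def delta_r(eta, phi):
    """Pairwise angular distance matrix between the jets of one event."""
    deta = eta[:, None] - eta[None, :]
    dphi = np.abs(phi[:, None] - phi[None, :])
    dphi = np.where(dphi > np.pi, 2 * np.pi - dphi, dphi)  # wrap to [0, pi]
    return np.hypot(deta, dphi)

def base_eff(pt, eta):
    """Hypothetical Gaussian-shaped two-dimensional efficiency."""
    return 0.7 * np.exp(-0.5 * ((pt - 100.0) / 150.0) ** 2 - 0.5 * (eta / 2.5) ** 2)

def correction(dr):
    """Hypothetical correction: close-by jets reduce the efficiency."""
    return 1.0 - 0.3 * np.exp(-dr ** 2)

def jet_efficiencies(pt, eta, phi):
    """Base efficiency times the product of corrections over all other jets."""
    corr = correction(delta_r(eta, phi))
    np.fill_diagonal(corr, 1.0)            # a jet does not correct itself
    return base_eff(pt, eta) * corr.prod(axis=1)

def emulate_tagging(eff, rng):
    """istag = 1 if a uniform random score falls below the jet efficiency."""
    return (rng.random(eff.shape) < eff).astype(int)

rng = np.random.default_rng(0)
pt = np.array([120.0, 80.0, 40.0])
eta = np.array([0.1, 0.3, -1.5])
phi = np.array([0.0, 0.4, 2.5])
eff = jet_efficiencies(pt, eta, phi)
istag = emulate_tagging(eff, rng)
```

In this example the first two jets are close to each other, so both receive a sizeable suppression, while the third jet is nearly isolated.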

Efficiency Map techniques
The estimation of $\epsilon_{event}$ in the case of b-tagging in real experiments is commonly based on binned two-dimensional efficiency maps in the jet $p_T$-$\eta$ plane [4,5], $\tilde{\epsilon}$, derived from MC simulation separately for b-jets, c-jets and light-jets, which are used to approximate the per-jet b-tagging efficiency of Eq. 3.1 as:

$$\epsilon^{jet}_i \approx \tilde{\epsilon}_{f_i}(p_T, \eta) \qquad (4.1)$$

The choice of the variables used to parameterize $\tilde{\epsilon}$ is motivated by the expected dependency of the b-tagging performance. For example, as the transverse momentum of a b-jet increases, the dilation of its lifetime in the laboratory frame results in secondary decay vertices which are reconstructed further from the interaction point of the primary collision. The reconstruction efficiency of secondary vertices is not constant as a function of their distance to the primary vertex, and this affects the response of the b-tagging classifier. Similarly, the typical configuration of multi-purpose detectors produces a dependency of track reconstruction performance on detector geometry, which in turn propagates into a dependency of the b-tagging performance on $\eta$.
From the per-jet efficiency maps $\tilde{\epsilon}$, the event weight $\epsilon_{event}$ is computed by factorizing the contributions from the various jets, similarly to what is shown in Eq. 3.2.
The main limitation of this map-based approach is the assumption that correlations between jets can be neglected and that the efficiency of b-tagging a single jet depends only on its $p_T$ and $\eta$. The dependency of the efficiency on residual observables is marginalized out when deriving $\tilde{\epsilon}$ from MC samples, introducing a bias that is particularly significant for final states with large jet multiplicities or events where close-by or overlapping jets are reconstructed from the decay of boosted resonances. A dedicated $\Delta R(i,j)$ reweighting was derived and used to correct for this effect in previous H → bb and H → cc analyses [2,6]. Given the uncertain nature of this correction and the limited statistics of the sample used to derive it, a large systematic uncertainty equal to half of the correction was assigned to the relevant MC templates [6]. The overall uncertainty related to the statistics of the MC templates constitutes a contribution of up to around 20% to the total background uncertainty [3,8].
Additional limitations come from the binning of the two-dimensional maps. To reduce discontinuities, smoothing techniques need to be employed. However, these techniques often require a non-trivial interplay between the bin sizes and the parameters of the smoothing model, which makes them impractical compared to an unbinned neural network training.
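A binned map of this kind can be sketched with two histograms and a bin lookup. This is a minimal illustration under stated assumptions: `build_map` and `lookup` are hypothetical helper names, the bin edges are arbitrary, and empty bins are simply assigned zero efficiency, with no smoothing.

```python
import numpy as np

def build_map(pt, eta, istag, pt_edges, eta_edges):
    """Efficiency per (pT, eta) bin: tagged jets divided by all jets."""
    bins = [pt_edges, eta_edges]
    total, _, _ = np.histogram2d(pt, eta, bins=bins)
    tagged, _, _ = np.histogram2d(pt[istag == 1], eta[istag == 1], bins=bins)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(total > 0, tagged / total, 0.0)  # empty bins -> 0

def lookup(eff_map, pt, eta, pt_edges, eta_edges):
    """Read the map value for each jet, clamping out-of-range jets to edge bins."""
    ipt = np.clip(np.digitize(pt, pt_edges) - 1, 0, len(pt_edges) - 2)
    ieta = np.clip(np.digitize(eta, eta_edges) - 1, 0, len(eta_edges) - 2)
    return eff_map[ipt, ieta]

pt_edges = np.array([0.0, 50.0, 100.0])
eta_edges = np.array([-2.5, 2.5])
pt = np.array([10.0, 10.0, 60.0, 60.0])
eta = np.zeros(4)
istag = np.array([1, 0, 1, 1])
eff_map = build_map(pt, eta, istag, pt_edges, eta_edges)
weights = lookup(eff_map, np.array([25.0, 500.0]), np.array([1.0, 0.0]), pt_edges, eta_edges)
```

Because the lookup uses only the jet's own $p_T$ and $\eta$, any dependence on neighboring jets is averaged over the sample used to fill the map.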

Truth Tagging with Neural Networks
Taking into account the full dependency of the jet-tagging probability on all event observables would be impractical with a map-based approach. ML techniques, on the other hand, provide the possibility to scale the problem to higher dimensionality and therefore to more challenging physics topologies.
In principle, a standard feedforward neural network could be used to perform the task. However, such models cannot cope with inputs of variable size, and thus the overall correlations between jets in the event cannot be easily exploited during the training. The technique we propose uses a graph neural network to capture these correlations efficiently. A GNN also offers a more natural representation of the data by exploiting pair-wise relationships between the jets. In our toy experiment, each jet is represented by a set of variables corresponding to ($p_T$, $\eta$, $\phi$, flavor). The neural network takes as input these variables for each jet in the event $e$, $\Theta_e = ((p_{T,1}, \eta_1, \phi_1, \text{flavor}_1), ..., (p_{T,n_{jets}}, \eta_{n_{jets}}, \phi_{n_{jets}}, \text{flavor}_{n_{jets}}))$, and learns to approximate the efficiency given in Eq. 3.1 for each of these jets. Note that the inputs to the neural network do not include $\Delta R$ between neighboring jets, which is the variable that determines the correction applied in Eq. 3.1; rather, this dependency is inferred directly during the training.
Model Architecture. The model consists of two components: a graph neural network (GNN) [9] and a jet efficiency network. The flow of information between the different parts is illustrated in Figure 4.
The GNN component creates a hidden representation for each jet that is based on the information of the other jets in the event. The GNN takes as input the $n_{jets} \times 4$ matrix of jet features, and outputs an $n_{jets} \times d_{hidden}$ matrix of jet hidden representations. The jet efficiency network then operates on each jet individually. It takes as input the jet variables and the jet hidden representation, and it returns as output $\epsilon_{jet}$ for every jet. More details about the model architecture can be found in Appendix B.
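The flow of shapes through the two components can be sketched in a few lines. This is not the architecture of Appendix B: it assumes a single sum-aggregation message-passing step on a fully connected jet graph, uses untrained random weights, and all names (`W_msg`, `W_out`, `jet_hidden`, `jet_efficiency`, `D_HIDDEN`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
D_HIDDEN = 8
W_msg = rng.normal(scale=0.1, size=(4, D_HIDDEN))      # message weights (untrained)
W_out = rng.normal(scale=0.1, size=(4 + D_HIDDEN, 1))  # efficiency-network weights

def jet_hidden(jets):
    """(n_jets, 4) -> (n_jets, D_HIDDEN): each jet sums messages from the other jets."""
    messages = np.tanh(jets @ W_msg)
    return messages.sum(axis=0, keepdims=True) - messages  # sum over j != i

def jet_efficiency(jets):
    """Per-jet output in (0, 1) from the jet features and its hidden representation."""
    features = np.concatenate([jets, jet_hidden(jets)], axis=1)
    logits = (features @ W_out)[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid keeps the output a probability

event_a = rng.normal(size=(3, 4))   # an event with 3 jets
event_b = rng.normal(size=(5, 4))   # an event with 5 jets, same weights
eff_a = jet_efficiency(event_a)
eff_b = jet_efficiency(event_b)
```

The same weights handle events with any number of jets, and reordering the jets only reorders the outputs, which are the two properties that motivate a graph-based model over a fixed-size feedforward input.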
Training Procedure. The network is trained to predict the $n_{jets} \times 1$ vector of efficiencies. The loss function used for training is the weighted binary cross-entropy (BCE), which for a single event can be written as:

$$\mathrm{BCE}_e = -\sum_{i \in \text{tag}} \log \epsilon_{NN}(\Theta_e)_i - \mu \sum_{i \in \text{non-tag}} \log\left(1 - \epsilon_{NN}(\Theta_e)_i\right) \qquad (5.1)$$

where the sums run over the sets of jets in the event $e$ which pass (istag=1) and do not pass (istag=0) b-tagging, and $\epsilon_{NN}(\Theta_e)_i$ is the $i$-th component of the output of the NN, a vector of variable size representing the predicted efficiency of tagging each jet in an event. The loss function being minimized is the sum of $\mathrm{BCE}_e$ over all the events in the training sample. The factor $\mu$ controls the weight of the non-tagged jets and can be used to balance the number of tagged and non-tagged jets to facilitate the training. This approach could be useful for light-jets, where the number of non-tagged jets is O(100) times larger than the number of tagged ones. Although this factor was found to be helpful in tests conducted with feedforward networks, for GNNs it was found to have a negligible impact on the final results. Therefore, $\mu = 1$ is assumed in the following discussions.
Using a well-known result, the neural network trained using BCE as loss function converges to the following ratio [10,11]:

$$\epsilon_{NN}(\Theta_e)_i \rightarrow \frac{p_{\text{tag}}(g_i(\Theta_e))}{p_{\text{tag}}(g_i(\Theta_e)) + p_{\text{non-tag}}(g_i(\Theta_e))} \approx \epsilon(\theta_i) \qquad (5.2)$$

where $\epsilon_{NN}(\Theta_e)_i$ and $\epsilon(\theta_i)$ are the predicted and true efficiency $\epsilon_{jet}$ of the $i$-th jet in the event $e$, respectively. In the toy model employed for this study, $\epsilon(\theta_i)$ represents the true single-jet efficiency, $\epsilon^{jet}_i$, computed in Eq. 3.1. $g_i(\Theta_e)$ is the function, inferred during the training, which approximates the relevant variables $\theta_i$ of the $i$-th jet, $g_i(\Theta_e) \approx \theta_i$. For example, for the $i$-th jet in the event: $g_i((p_{T,1}, \eta_1, \phi_1, \text{flavor}_1), ..., (p_{T,n_{jets}}, \eta_{n_{jets}}, \phi_{n_{jets}}, \text{flavor}_{n_{jets}})) \approx (p_{T,i}, \eta_i, \Delta R(i,j), \text{flavor}_j)$, where the index $j$ runs over every jet in the event, excluding the $i$-th jet. Finally, $p_{\text{non-tag}}(g_i(\Theta_e))$ and $p_{\text{tag}}(g_i(\Theta_e))$ are the distributions of $g_i(\Theta_e)$ for the $i$-th jet being non-tagged and tagged as a b-jet, respectively. It is worth noting that the NN directly computes the efficiency $\epsilon_{NN}(\Theta_e)_i$ without regressing $p_{\text{tag}}(g_i(\Theta_e))$ and $p_{\text{non-tag}}(g_i(\Theta_e))$ independently. In the map-based approach, on the other hand, the distribution of tagged jets, $p_{\text{tag}}(p_{T,i}, \eta_i)$, and the distribution of the total number of jets, which is the sum of tagged and non-tagged jets, $p_{\text{tag}}(p_{T,i}, \eta_i) + p_{\text{non-tag}}(p_{T,i}, \eta_i)$, are computed independently in bins of $p_T$ and $\eta$. The efficiency is then estimated in a second step by taking the bin-by-bin ratio between these two distributions.
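The per-event loss described above can be written as a short NumPy function. This is a sketch rather than the training code: `weighted_bce` and the clipping constant `tiny` are assumptions added here to keep the logarithms finite.

```python
import numpy as np

def weighted_bce(eff_pred, istag, mu=1.0, tiny=1e-12):
    """Weighted binary cross-entropy for the jets of one event.

    eff_pred : predicted tagging efficiency of each jet
    istag    : 1 for tagged jets, 0 for non-tagged jets
    mu       : weight applied to the non-tagged term
    """
    eff_pred = np.clip(eff_pred, tiny, 1.0 - tiny)   # avoid log(0)
    tagged_term = -np.log(eff_pred[istag == 1]).sum()
    untagged_term = -np.log(1.0 - eff_pred[istag == 0]).sum()
    return tagged_term + mu * untagged_term

eff_pred = np.array([0.5, 0.5])
istag = np.array([1, 0])
loss_balanced = weighted_bce(eff_pred, istag)          # equals 2 ln 2
loss_reweighted = weighted_bce(eff_pred, istag, mu=2)  # equals 3 ln 2
```

The total training loss is then the sum of this quantity over all events in a batch.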
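The convergence result quoted above can be made explicit with a short pointwise minimization. This is a sketch of the standard argument, assuming the infinite-statistics limit and a network flexible enough to reach the minimum at every point; here $g$ is shorthand for $g_i(\Theta_e)$, and $p_{\text{tag}}$, $p_{\text{non-tag}}$ are taken as unnormalized densities (jet counts per unit of $g$).

```latex
\begin{align}
  L[\epsilon] &= -\int \left[\, p_{\text{tag}}(g)\,\log \epsilon(g)
      + \mu\, p_{\text{non-tag}}(g)\,\log\bigl(1-\epsilon(g)\bigr) \right] \mathrm{d}g , \\
  \frac{\delta L}{\delta \epsilon(g)} &= -\frac{p_{\text{tag}}(g)}{\epsilon(g)}
      + \mu\,\frac{p_{\text{non-tag}}(g)}{1-\epsilon(g)} = 0
  \quad\Longrightarrow\quad
  \epsilon(g) = \frac{p_{\text{tag}}(g)}{p_{\text{tag}}(g) + \mu\, p_{\text{non-tag}}(g)} .
\end{align}
```

For $\mu = 1$ this is the ratio of tagged to all jets at a given $g$, i.e. the efficiency. For $\mu \neq 1$ the network output is a biased ratio that would need to be corrected back before being used as a weight.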
The training workflow of the proposed approach is illustrated in Figure 5. The training is done with stochastic gradient descent, with a batch size of 5,000 events. The batch size was chosen to be as large as possible given the memory constraints of the system used for training. The batch size is particularly important for this task, as a sufficient number of tagged and non-tagged jets needs to be present in each batch to reduce statistical fluctuations during training. To further reduce the effect of these fluctuations, 20 neural networks with different weight initializations and random batching were trained. The efficiency for each jet is computed by taking the mean of these 20 predictions.

Results
In this section, the results of approximating $\epsilon_{event}$ using the jet b-tagging efficiencies calculated from the NN are presented and compared to the results obtained with the map-based technique discussed in Sec. 4. Three main aspects are discussed: the modeling of single-jet distributions after jet weighting, the capability of the NN technique to provide an unbiased estimation of $\epsilon_{event}$, and the independence of the GNN performance from the choice of the sample used for training. Figure 6 shows the relative residuals $(\epsilon_{true} - \epsilon_{predicted})/\epsilon_{true}$ for each jet in the event, and the $\Delta R(i,j)$ distributions for the leading and subleading jet where the leading jet is classified as b-tagged. $\epsilon_{true}$ is computed from Eq. 3.1. The results of direct tagging (see footnote 2) are also shown together with the jets weighted with the predicted per-jet efficiency from either the map-based (Eq. 4.1) or the NN approach (Eq. 5.2). While, as expected, the map-based approach is unable to provide good modeling of the $\Delta R(i,j)$ distribution, the NN predictions are in good agreement with the distributions obtained with direct tagging and with true efficiency weights. These results give us confidence in the ability of the NN to approximate the set of relevant variables $\theta$ as well as the dependence of the per-jet efficiency $\epsilon(\theta)$ on them.
Results of the reweighting procedure are further studied when both the leading and sub-leading jets are classified as b-jets, and compared to those from direct tagging. In this case, the event weight is simply computed as the product of the efficiencies of b-tagging each of the two jets, $\epsilon_{event} = \epsilon_1 \cdot \epsilon_2$. It is therefore important to study the modeling of distributions that capture correlations among individual jet observables, once event weights are applied.
The invariant mass distribution computed from the leading and subleading jets in each event is shown in Figure 7. The figures are further sub-divided based on the true flavors of the two jets. Similarly to the single-jet case, the NN predictions show good agreement with the true efficiency, while the map-based approach is unable to properly capture the effect of close-by jets on b-tagging. It can also be noted that the reweighting procedure based on NN predictions significantly improves the statistical uncertainty compared to direct tagging.
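Combining the two leading-jet efficiencies into an event weight can be sketched as below. Only the $n_{tag} = 2$ product is taken from the text. The other two cases are an assumption made here: they treat the two tagging decisions as independent once the per-jet efficiencies are given and interpret $n_{tag}$ as an exclusive count. `event_weight` is a hypothetical helper.

```python
def event_weight(eff1, eff2, n_tag):
    """Probability that exactly n_tag of the two leading jets are tagged."""
    if n_tag == 2:
        return eff1 * eff2                              # both jets tagged
    if n_tag == 1:
        return eff1 * (1 - eff2) + eff2 * (1 - eff1)    # assumed: exactly one tagged
    if n_tag == 0:
        return (1 - eff1) * (1 - eff2)                  # assumed: none tagged
    raise ValueError("n_tag must be 0, 1 or 2 for two leading jets")

w_two_tags = event_weight(0.5, 0.2, 2)
w_all = sum(event_weight(0.5, 0.2, n) for n in (0, 1, 2))
```

Any dependence between the two jets enters through the per-jet efficiencies themselves, which is why a per-jet estimate that ignores close-by jets biases the product.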
Finally, the generality of the method is probed by using the same network to reweight events from a separate sample with different jet $p_T$, $\eta$ and $\Delta R(i,j)$ distributions compared to the training sample. For this purpose, events were simulated in which a boosted scalar particle decays into exactly two jets per event, where the $p_T$ of the decaying particle is generated from an exponentially decaying distribution and its mass is generated from a Gaussian distribution peaked at 90 GeV. The boson decays with a rate of 33% each to light-, c- or b-jets. Figure 8 shows the results for the angular separation between the two decay products as well as for the reconstructed invariant mass of the generated boson. An overall good agreement is found between the NN results and direct tagging, similarly to the previous cases. This gives confidence in the universality of the proposed approach: as long as the phase space is sampled adequately during training, the efficiency estimated using the neural network is expected to be independent of the chosen sample.

Footnote 2: Only jets passing the b-tagging classification are included in the distributions, without any additional weight.

Figure 6 (b): Distribution of the jet $\Delta R(i,j)$ of the leading and subleading jet, obtained when the leading jet is classified as b-tagged (black), compared to the same distributions obtained when jets are weighted with their true efficiency (grey), using the efficiency $\tilde{\epsilon}$ from the map-based approach (blue) or using the NN output (red). The lower pad shows the ratio between the two latter distributions and the one obtained with true weights.

Figure 7: Distribution of the invariant mass of the two leading jets, when the events are weighted by the product of true efficiencies, as calculated in Eq. 3.1 (grey). Also shown is the distribution for events where both jets are b-tagged (direct tagging), or when the events are weighted using the estimated efficiency $\tilde{\epsilon}$ from the map-based approach (blue) or using the NN output (red). The lower pad shows the ratio between all distributions and the one obtained with true weights. Events are split into categories based on the true flavor of the two leading jets.

Figure 8: Distribution of the $\Delta R(i,j)$ (a) and invariant mass (b) of the leading-subleading jet system, obtained for events where these jets are classified as b-tagged (blue), compared to the same distributions obtained when these jets are instead weighted with their probability of passing b-tagging, calculated using the true weight from Eq. 3.1 (grey), using the efficiency $\tilde{\epsilon}$ from the map-based approach (blue) or using the NN output (red). The lower pad shows the ratio between the two latter distributions and the one obtained with true weights.
In this section we summarize some of the main considerations aimed at generalizing the proposed approach for use cases beyond the toy model presented in this paper.
The size of $\theta$. We used a relatively small number of variables that control the efficiency and required the network to infer only the variable $\Delta R(i,j)$. In real-life applications, $\theta$ may include more variables, and the related inference may be more complex in higher dimensions. To cope with this, the set of input variables $\Theta$ needs to be extended with additional variables. Neural networks are a particularly suitable tool for this task due to their flexibility in coping with higher dimensions. Any variable potentially correlated with the tagging decision could be used to ensure that all correlations are captured.
The functional form of $\epsilon(\theta)$. We assumed a relatively simple efficiency in Eq. 3.1. In principle, the neural network can learn any function, no matter how complex the functional form is, as shown in Ref. [12]. This method can therefore be applied in scenarios where $\epsilon(\theta)$ presents more complex dependencies between the efficiency and the relevant variables $\theta$.
Systematic uncertainties. In applications of simple efficiency maps, the insufficient capture of the underlying correlations requires the introduction of a systematic uncertainty. The proposed method aims at avoiding this systematic error. It will, however, require thorough checks to ensure that its estimates are accurate.
Generalization of the method. In the proposed approach we have focused our studies on approximating efficiencies, i.e. density ratios between two complementary classes. The method can also be generalized to approximate ratios between two separate classes. A multidimensional ratio between two classes could be used in a variety of different applications, such as deriving multi-dimensional scale factors from data to correct the tagging efficiency in Monte Carlo simulation.

Conclusions
The parametrization of classifier efficiencies can play an important role to mitigate the limitations in the number of simulated events at LHC experiments. To be effective, parametrized classifier efficiencies need to be accurate in any context and therefore need to capture the dependencies on event properties that are used in analyses and which entail variations of efficiencies. A new technique that optimally exploits these dependencies is proposed. This technique is based on graph neural networks that provide an estimate of ratios between multidimensional local densities. We use the case of the identification of heavy-flavor jets as a topical example building a toy model based on ad-hoc parameterizations of the classifier efficiency inspired by the observed dependencies of b-tagging performance in the ATLAS and CMS experiments. A Graph Neural Network is used to exploit correlations between jets in the event to provide an unbiased parametrization of the efficiency.
A toy example is used to probe the performance of the method, which takes as input the true flavors and momenta of reconstructed jets and returns the b-tagging efficiency of each. These efficiencies are used to build the per-event weights in a sample of simulated events with multiple b-tagged jets. We use the estimated efficiency for the event reweighting technique, which reduces the statistical fluctuations of Monte Carlo samples after classification.
Results show good compatibility between per-jet and per-event kinematic distributions obtained with the proposed approach and the distributions expected from the direct application of b-tagging. We also show that the proposed technique can generalize to samples with input distributions differing significantly compared to the training sample.