Preserving physically important variables in optimal event selections: A case study in Higgs physics

Analyses of collider data, often assisted by modern Machine Learning methods, condense a number of observables into a few powerful discriminants for the separation of the targeted signal process from the contributing backgrounds. These discriminants are highly correlated with important physical observables; using them in the event selection thus leads to the distortion of physically relevant distributions. We present an alternative event selection strategy, based on adversarially trained classifiers, that exploits the discriminating power contained in many event variables, but preserves the distributions of selected observables. This method is implemented and evaluated for the case of a Standard Model Higgs boson decaying into a pair of bottom quarks. Compared to a cut-based approach, it leads to a significant improvement in analysis sensitivity and retains the shapes of the relevant distributions to a greater extent.


Introduction
The discovery of a Standard Model (SM)-like Higgs boson by the ATLAS [1] and CMS [2] collaborations gave rise to a rich experimental programme, seeking to measure its properties with the highest degree of precision. The Higgs boson has since been observed in its main production modes and a range of decay channels [3][4][5][6][7][8][9].
Many of these results rely on machine learning algorithms at various stages of the analysis pipeline, most prominently for the separation of signal and background processes. Often the entire analysis is designed around a powerful multivariate discriminant computed by a boosted decision tree (BDT) or artificial neural network. The Higgs boson signal strength is commonly extracted through a maximum-likelihood fit to the distribution of this discriminant [3,9].
As shown in e.g. [10], such a discriminant approximates a sufficient statistic of the data and can therefore exploit the correlations between the input variables optimally. While statistically very appealing, it cannot, however, be directly related to individual physical observables. More importantly, its sufficiency drains all other event variables of their power to separate signal and background: once the discriminant is used to select a sample of signal-enriched events, even the distributions of physically highly relevant observables (such as the invariant mass of a system of particles) become distorted and lose their conventional interpretations.
In the past, this has led to the development of dedicated "cross-check" analyses. Typically, these do not strive to maximise the expected signal sensitivity, but rather try to directly operate in terms of individual physical observables. For example, the observation of the Higgs boson decay to two bottom quarks, H → bb, by the ATLAS and CMS collaborations was complemented by such an alternative analysis [8,9]. Its event selection is purely based on simple cuts on a few physically relevant event variables. The Higgs boson signal strength is then extracted through a maximum-likelihood fit to the invariant mass of the two b-jets, m bb , arguably the most physically relevant variable in this setting and a quantity whose meaning is easily understood.
In this letter, we explore the possibility of combining these two approaches. We design an event classification procedure that allows for an optimal separation of signal and background events, while preserving the distributions of a few manually selected "flagship" variables. This makes it possible to harvest the discriminative power inherent in a large collection of features that is otherwise hard to exploit, while keeping the physical interpretation of the chosen observables manifest.
To introduce this approach and demonstrate its feasibility, we evaluate it in the context of the process mentioned earlier, H → bb. In particular, we target the production of the Higgs boson in association with an electroweak vector boson (V = W , Z) and focus on the 0-lepton channel of its decay. The analysis of this process is complicated by the presence of a number of challenging backgrounds. Following our procedure, we partition the events into a number of signal regions, using m bb as our flagship variable. The expected discovery significance for a SM Higgs boson is evaluated through a maximum-likelihood fit to the resulting m bb distributions. We benchmark this method against the equivalent cut-based result for the same process presented by ATLAS [9].
The remainder of this letter is structured as follows. Section 2 describes the generation of the samples of simulated events on which our studies are based. Then, Section 3 introduces our analysis strategy and contrasts it to the cut-based approach. Section 4 summarises our findings and Section 5 provides an outlook to future work.

Simulation Techniques and Validation
Monte Carlo (MC) simulation is used to model the signal and background processes at a centre-of-mass energy of √ s = 13 TeV. The signal sample contains a Higgs boson H produced in association with a vector boson V and subsequently decaying to a pair of bottom quarks, V H → bb. The main background processes are W and Z production in association with b-jets (W +jets, Z+jets), top quark pair (tt) production, and diboson production.
For all samples, the hard scattering is generated using MadGraph5 2.6.5 [11]. All signal and background processes are simulated at leading order (LO). Their inclusive cross-sections are normalised to the NNLO calculation where available and to NLO otherwise. The yields are further scaled to correspond to the full LHC Run II integrated luminosity of 140 fb −1 . Parton showering is implemented using Pythia 8.2 [12]. Contributions from pileup were explicitly switched off to reduce the required processing time. The detector response is simulated using the Delphes 3.4.1 [13] parametrisation of the ATLAS detector.
After performing the event selection outlined in Section 3.1, our setup is validated against the yields shown in Table 13 in Ref. [14]. Signal yields were found to agree within 5% while individual background yields were found to agree within about 40%, where the largest disagreement was observed for tt. The overall background normalisation agrees to within 20%.

Analysis Methodology
Section 3.1 describes the event selection used for all subsequent studies. The cut-based analysis employed by ATLAS is described in Section 3.2 as a baseline approach. Section 3.3 describes the proposed alternative event selection and classification scheme.

Event selection
The event selection closely follows that of Ref. [9]. The analysis considers only events containing exactly two b-jets; at most one additional non-b-flavoured jet is permitted. To make efficient use of the generated MC samples, this b-tagging requirement is implemented at the level of the generated partons. Detector effects are then included by multiplying the event weight by the b-tagging efficiencies achieved by the individual b-jets, according to [15]. Only events without reconstructed leptons (electrons or muons) with a transverse momentum of p T > 7 GeV are used.
The leading b-jet is required to have a transverse momentum larger than 45 GeV and the missing transverse energy in the event (E miss T ) is required to be larger than 150 GeV. A requirement on the scalar sum of the jet p T , H T > 120 (150) GeV for events containing 2 (3) jets, is used to remove phase space where trigger efficiencies depend on the jet multiplicity. Cleaning cuts on the angular separation between the jets, and between the di-b-jet system and E miss T , are used to remove events in which the E miss T results from mismeasured jets. A requirement based on ∆φ(E miss T , p miss T ), used in the ATLAS analysis for the same purpose, is dropped.
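As a sketch, the kinematic part of this selection can be expressed as a vectorised boolean mask. The array names below are hypothetical (not taken from the analysis code), event weights are taken as equal, and the parton-level b-tagging weighting described above is omitted for brevity:

```python
import numpy as np

def preselection_mask(events):
    """Boolean mask implementing the kinematic cuts of the 0-lepton selection.

    `events` is assumed to be a dict of NumPy arrays with hypothetical keys;
    the thresholds follow the values quoted in the text (all in GeV).
    """
    njets = events["njets"]                      # 2 or 3 jets after the b-jet requirement
    ht_cut = np.where(njets == 2, 120.0, 150.0)  # jet-multiplicity dependent H_T cut
    return (
        (events["pt_b1"] > 45.0)        # leading b-jet transverse momentum
        & (events["met"] > 150.0)       # missing transverse energy
        & (events["ht"] > ht_cut)       # scalar sum of jet pT
    )

# Five invented events to illustrate the selection.
events = {
    "njets": np.array([2, 3, 2, 3, 2]),
    "pt_b1": np.array([50.0, 40.0, 60.0, 80.0, 46.0]),
    "met":   np.array([160.0, 200.0, 140.0, 155.0, 151.0]),
    "ht":    np.array([130.0, 160.0, 125.0, 149.0, 121.0]),
}
mask = preselection_mask(events)
```

The angular cleaning cuts would enter as further conjunctions of the same form.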

Cut-based analysis
The cut-based analysis (CBA) [9] makes use of four signal regions. The events passing the above selection are first divided into two categories, containing events with two and three jets, respectively. These two categories are further split into two regions each, containing events with 150 GeV < E miss T < 200 GeV and E miss T > 200 GeV, respectively. Distributions of the di-b-jet invariant mass m bb in each of the CBA signal regions defined by these cuts are shown in Figure 1.
The cut values are re-optimised to maximise the sensitivity of the CBA on our MC samples, ensuring a fair comparison with the proposed strategy. The optimised cuts are determined through Bayesian optimisation by demanding that the Asimov discovery significance (as defined in Section 4) of the Higgs boson signal be maximised.

Pivotal classifier analysis
Our proposed analysis strategy is based on the same underlying event selection. The event categories are now defined through cuts on a multivariate discriminant ŷ computed by a neural network, which is trained to separate signal from background, i.e. to solve a classification problem. To avoid any distortion of the m bb distributions upon the selection of signal-like events, ŷ must be statistically independent of m bb , i.e. the joint probability distribution p(m bb , ŷ) must factorise,

p(m bb , ŷ) = p(m bb ) · p(ŷ). (3.1)

In the parlance of statistics, the classifier output ŷ is required to form a pivotal quantity with respect to m bb . We thus refer to our proposed strategy as pivotal classifier analysis (PCA). The requirement of statistical independence is incorporated through a modification of the loss function that is used to train the classifier neural network [16][17][18][19]. We add an adversarial loss term L adv to the usual classification loss L clf , suitably constructed to impose the independence constraint in Equation 3.1,

L = L clf + λ L adv . (3.2)

Here, λ is a continuous (but fixed) parameter that compromises between the (generally conflicting) objectives specified by L adv and L clf . Both are explained in more detail below.
Classifier loss L clf : Let the classifier network be parametrised by θ c and its response to an event i with kinematics x i and invariant mass (m bb ) i be denoted by ŷ i . The true identity of i, i.e. whether it originates from signal or background, is labelled by y i . Binary cross entropy between y i and ŷ i is used to compute the classification loss on a batch of N b events,

L clf = − (1/N b ) Σ i [ y i log ŷ i + (1 − y i ) log(1 − ŷ i ) ]. (3.3)

Adversary loss L adv : The mutual information between the classifier response and the invariant mass, MI(ŷ, m bb ), is defined as the Kullback-Leibler divergence [20] between the joint probability density and the product of the marginals,

MI(ŷ, m bb ) = D KL ( p(ŷ, m bb ) || p(ŷ) p(m bb ) ) = ∫ dm bb dŷ p(ŷ, m bb ) · log [ p(ŷ, m bb ) / ( p(ŷ) · p(m bb ) ) ]. (3.4)

MI(ŷ, m bb ) is zero if and only if ŷ and m bb are independent random variables, and positive otherwise.
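For intuition, Equation 3.4 can be evaluated exactly for discrete toy distributions. The sketch below (plain NumPy, invented joint distributions) verifies that the mutual information vanishes for a factorising joint and is positive otherwise:

```python
import numpy as np

def mutual_information(p_joint):
    """MI of a discrete joint distribution given as a 2-D array of
    probabilities: the KL divergence between the joint and the product
    of its marginals (Equation 3.4, discretised)."""
    p_y = p_joint.sum(axis=1, keepdims=True)   # marginal of the first variable
    p_m = p_joint.sum(axis=0, keepdims=True)   # marginal of the second variable
    mask = p_joint > 0                          # 0 · log 0 contributes nothing
    ratio = p_joint[mask] / (p_y * p_m)[mask]
    return float(np.sum(p_joint[mask] * np.log(ratio)))

# Independent variables: the joint factorises, so the MI vanishes.
indep = np.outer([0.3, 0.7], [0.5, 0.5])
# Perfectly correlated binary variables: the MI equals log 2.
corr = np.array([[0.5, 0.0], [0.0, 0.5]])
```

This mirrors the role MI plays in the adversarial loss: it is exactly zero under the independence constraint of Equation 3.1.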
Mutual information has historically been difficult to compute [21] for general probability distributions. However, recent work on the dual representation of f -divergences [22,23] rephrased the estimation of mutual information from a set of samples as a functional optimisation problem [24]. To this end, we introduce an ancillary "adversary" neural network, parametrised by θ a . It takes ŷ and m bb as inputs and so parametrises a smooth function T[ŷ, m bb ]. The loss function of the adversary network is then defined as

L adv = (1/N b ) Σ i T[ŷ i , (m bb ) i ] − log ( (1/N b ) Σ j exp T[ŷ j , (m bb ) j ] ), (3.5)

where the sums run over a batch of N b events: in the first sum, ŷ and m bb are sampled from their joint probability density, while in the second they are sampled from the respective marginals.
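The content of Equation 3.5 can be checked numerically on a case where the mutual information is known in closed form. The sketch below (NumPy, illustrative only; in the analysis T is a neural network, whereas here the optimal T, the log density ratio of a correlated bivariate Gaussian, is plugged in by hand) evaluates the Donsker-Varadhan functional on samples; marginal samples are obtained by shuffling one coordinate:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 200_000

# Samples from the joint density of a correlated bivariate Gaussian ...
cov = [[1.0, rho], [rho, 1.0]]
y, m = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
# ... and from the product of the marginals (shuffle one coordinate).
m_shuf = rng.permutation(m)

def dv_bound(T):
    """Donsker-Varadhan functional of Equation 3.5 for a test function T."""
    return T(y, m).mean() - np.log(np.mean(np.exp(T(y, m_shuf))))

def T_opt(a, b):
    """Optimal test function: log of the joint over the product of marginals."""
    return (-0.5 * np.log(1 - rho**2)
            - (a**2 - 2 * rho * a * b + b**2) / (2 * (1 - rho**2))
            + (a**2 + b**2) / 2)

mi_true = -0.5 * np.log(1 - rho**2)   # analytic MI of the bivariate Gaussian
estimate = dv_bound(T_opt)            # saturates the bound up to sampling noise
```

Any suboptimal T (e.g. a rescaled version of T_opt) yields a strictly smaller value, illustrating that the functional is a lower bound on the mutual information for every member of the family F.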
Ref. [24] shows that, for a sufficiently expressive family F of functions, the supremum of this functional is a tight lower bound on the mutual information,

MI(ŷ, m bb ) ≥ sup T∈F L adv [T]. (3.6)

Importantly, this lower bound on the mutual information is expressed as a manifestly differentiable function of the inputs. This allows it to be included in the loss function of the classifier and extremised through standard gradient-based methods, as explained below.
Classifier architecture: We can equivalently formulate the constraint in Equation 3.1 as

p(ŷ|m bb ) = ∫ dx p(ŷ|x) p(x|m bb ) = p(ŷ). (3.7)

The object p(ŷ|x i ) is the probability density of the classifier response generated for an event with kinematics x i . It is thus natural to think of the classifier as defining a probability distribution rather than implementing a deterministic function, i.e. as a randomised classifier [25]. We choose to parametrise p(ŷ|x i ) through a mixture of N Gaussian distributions, transformed to limit its support to the interval (0, 1). The 3N − 1 free parameters of this model (the means µ k , the standard deviations σ k , and the mixture fractions) are computed by the classifier network. The response ŷ i to an event x i is then defined as a random variate drawn from this distribution, i.e. ŷ i ∼ p(ŷ|x i ).
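Sampling from such a randomised classifier can be sketched as follows. The parameter values below are invented for a single hypothetical event (in the analysis they are computed by the classifier network), and a sigmoid is used as one possible transformation restricting the support to (0, 1):

```python
import numpy as np

def sample_response(means, sigmas, fractions, rng, size=1):
    """Draw ŷ from a Gaussian mixture squashed to the interval (0, 1).

    `means`, `sigmas` and `fractions` stand in for the 3N − 1 parameters the
    classifier network would output for one event (the N-th mixture fraction
    is fixed by normalisation)."""
    k = rng.choice(len(means), p=fractions, size=size)   # pick a mixture component
    z = rng.normal(means[k], sigmas[k])                  # sample on the real line
    return 1.0 / (1.0 + np.exp(-z))                      # map the support to (0, 1)

rng = np.random.default_rng(0)
# Hypothetical parameters for one signal-like event, N = 2 components.
y_hat = sample_response(np.array([2.0, -1.0]), np.array([0.5, 0.3]),
                        np.array([0.8, 0.2]), rng, size=10_000)
```

Repeated evaluations of the same event thus yield different responses ŷ i , distributed according to p(ŷ|x i ).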
Training algorithm: The classifier network is trained adversarially [16,26], whereby the adversary adapts to compute the mutual information between the classifier output and the invariant mass as the classifier develops, while the classifier is incentivised to minimise this quantity. In this way, the classifier is effectively penalised for revealing information about the invariant mass, and its output remains statistically independent of m bb .
The adversarial training proceeds as follows. The adversary network is first pretrained for N p steps by updating θ a so as to maximise L adv , which provides an initial estimate of MI(ŷ, m bb ) by virtue of Equation 3.6. The training loop then consists of a single update step of the classifier parameters, taking into account the adversary, followed by an update of the adversary,

θ a ← θ a + η a ∇ θa L adv , (3.8)

in order to adapt to the changes in the classifier. To allow the adversary to continuously follow the classifier, a larger learning rate is chosen for the adversary, η a > η c .

The input variables used for the classifier are a subset of those used for the BDTs employed by the main analysis in [9], all known to be reliably modelled by MC. These include m bb , ∆R bb , ∆η bb , E miss T , the transverse momenta of the two b-jets, the scalar sum of all jet momenta, and the azimuthal angle difference between the di-b-jet system and E miss T . It is worth noting that the classifier has access not only to variables highly correlated with m bb , such as ∆R bb , but also to m bb itself, and yet the adversarial training reliably achieves independence from the invariant mass.
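The alternating update scheme can be illustrated in a deliberately minimal one-parameter toy. Everything below is a stand-in, not the analysis code: the "classifier" is a one-parameter logistic model, the "adversary" test function is linear in its single parameter, finite differences replace backpropagation, and the data are synthetic, with one feature x deliberately correlated with a surrogate mass m:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: one kinematic feature x that leaks information about a "mass" m.
n = 4000
m = rng.normal(0.0, 1.0, n)
x = m + rng.normal(0.0, 0.5, n)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-x))).astype(float)  # signal labels
m_shuf = rng.permutation(m)   # fixed sample from the product of the marginals

def grad(f, p, eps=1e-5):
    """Central finite-difference derivative, standing in for backpropagation."""
    return (f(p + eps) - f(p - eps)) / (2.0 * eps)

def response(theta_c):        # deterministic one-parameter toy classifier
    return 1.0 / (1.0 + np.exp(-theta_c * x))

def L_clf(theta_c):           # binary cross entropy, cf. Equation 3.3
    p = np.clip(response(theta_c), 1e-7, 1.0 - 1e-7)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def L_adv(theta_a, theta_c):  # Donsker-Varadhan functional, cf. Equation 3.5
    yh = response(theta_c)
    return (theta_a * yh * m).mean() - np.log(np.exp(theta_a * yh * m_shuf).mean())

lam, eta_c, eta_a = 1.0, 2e-2, 4e-2
theta_c, theta_a = 1.0, 0.1

for _ in range(100):          # pretraining of the adversary (N_p steps)
    theta_a += eta_a * grad(lambda p: L_adv(p, theta_c), theta_a)

for _ in range(500):          # alternating adversarial updates
    theta_c -= eta_c * grad(lambda p: L_clf(p) + lam * L_adv(theta_a, p), theta_c)
    theta_a += eta_a * grad(lambda p: L_adv(p, theta_c), theta_a)
```

The penalty drives the classifier parameter below its purely discriminative optimum, trading a little classification power for reduced dependence between its output and m.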
The classifier (adversary) is a dense neural network of 3 (5) layers of 30 nodes each, using ReLU activations [27]. The Adam [28] optimisation algorithm is used for all studies. A batch size of 1000 events was found to work well in our situation, while the learning rates were set to η a = 4 · 10 −4 and η c = 2 · 10 −4 . The adversary is pretrained for N p = 100 steps. The value of λ is varied between 0 and 1.4. Separate classifiers (with separate adversaries) are trained for events containing two and three jets. The training makes use of 1/3 of the available simulated events, with the remaining events held back for performance validation. A single Gaussian was already found to be sufficient for the parametrisation of the classifier output distribution p(ŷ|x i ). Using a deterministic classifier instead was found to lead to significantly worsened training stability. In addition, using dropout regularisation on both the classifier and adversary networks helped to improve convergence. No systematic tuning of hyperparameters to improve the performance of the neural network was performed. Our implementation of the algorithm can be accessed at [29]. It makes use of the TensorFlow and TensorFlow Probability libraries [30].
Event categorisation: Events are classified based on ŷ. Tight and loose regions are defined, separately for events with two and three jets. Events are allocated to the tight or loose category based on the ordering defined by the discriminant ŷ, with the two tight categories containing the most signal-like events.
The PCA loose (tight) category is constructed to achieve the same signal efficiency as the CBA low (high)-E miss T category, separately for both jet multiplicities. This guarantees a fair benchmark of the performance achieved by either event classification. In addition, it suggests a natural correspondence between CBA and PCA signal regions, used for comparisons in the following. Figure 2 shows the m bb distributions of the signal and main backgrounds in the four PCA categories for a classifier trained with λ = 1.4.
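Matching a target signal efficiency amounts to placing the cuts on ŷ at the corresponding quantiles of the signal response distribution. A minimal sketch, assuming equal event weights and an invented response distribution:

```python
import numpy as np

def matched_thresholds(y_sig, eff_tight, eff_loose):
    """Cuts on the discriminant ŷ that reproduce target signal efficiencies.

    `y_sig` are classifier responses of signal events (equal weights assumed);
    the tight region keeps the eff_tight most signal-like fraction, the loose
    region the next (eff_loose - eff_tight) fraction."""
    t_tight = np.quantile(y_sig, 1.0 - eff_tight)
    t_loose = np.quantile(y_sig, 1.0 - eff_loose)
    return t_loose, t_tight

rng = np.random.default_rng(3)
y_sig = rng.beta(5, 2, size=100_000)   # stand-in for signal responses on (0, 1)
t_loose, t_tight = matched_thresholds(y_sig, eff_tight=0.3, eff_loose=0.7)
```

With the CBA efficiencies as the targets, this construction yields PCA categories directly comparable to the CBA signal regions.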

Results
The effect of λ on the classifier response is illustrated in Figure 3. It shows the invariant mass distributions generated by the Z+jets and tt backgrounds in different signal regions of the PCA. For comparison, the distributions obtained in the equivalent CBA signal regions are shown as well, as are the inclusive mass distributions, i.e. those for the corresponding jet multiplicity, but without any selection applied. Small values of λ (corresponding to a traditional classifier) result in events being selected preferentially around the Higgs boson mass peak, seriously deforming the m bb distribution. Classifiers trained with a larger value of λ select events more uniformly across the invariant mass range and hence preserve the original m bb shapes. We note the large distortion of the mass spectrum caused by the CBA when using the original cut values. While not explicitly required, the optimisation of these cuts led to a large reduction of the shaping in the low-E miss T categories.

The CBA and PCA event classification schemes are compared in Figure 4, separately for the two-jet (top) and three-jet (bottom) categories. For each jet multiplicity, the achieved performance is shown individually for each signal region, as well as for their combination. The horizontal axis shows the estimate [31] of the binned discovery significance. The vertical axis shows the inverse of the Jensen-Shannon divergence between the m bb distributions P and Q obtained before and after the event selection,

JSD(P || Q) = 1/2 D KL (P || M) + 1/2 D KL (Q || M), where M = 1/2 (P + Q).

By virtue of Equation 3.1, this distortion measure directly indicates the extent to which m bb and ŷ are statistically independent random variables. Larger values of 1/JSD imply a higher degree of independence, i.e. correspond to less distortion of the m bb spectrum in the signal regions.
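The JSD between two binned m bb shapes can be computed directly from normalised histograms; the example histograms below are invented for illustration:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two normalised histograms (0 log 0 := 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence between two normalised histograms,
    JSD(P||Q) = ½ D_KL(P||M) + ½ D_KL(Q||M) with M = ½(P + Q)."""
    m_avg = 0.5 * (p + q)
    return 0.5 * kl(p, m_avg) + 0.5 * kl(q, m_avg)

# An inclusive m_bb histogram vs. the same shape after a selection.
inclusive = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
undistorted = inclusive.copy()                      # shape-preserving selection
peaked = np.array([0.02, 0.08, 0.8, 0.08, 0.02])    # sculpted toward the peak
```

The JSD vanishes for a shape-preserving selection and is bounded above by log 2, so 1/JSD grows without limit as the distortion disappears.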
With respect to these metrics, PCA clearly outperforms CBA and achieves higher significances, even in situations where the distortion of the mass distributions is below the levels achieved by CBA.
As a measure of the overall analysis performance, a combined Asimov fit to the m bb distributions in all four signal regions is performed. Each signal and background component is scaled by a separate, unconstrained normalisation factor that is determined by the fit. The signal discovery significance obtained in this way gives a combined performance measure, which also takes into account the distortion of the mass distribution produced by the event classification. Effects coming from systematic uncertainties are not included in the fit. These are orthogonal to the present discussion and are expected to have similar impacts on both PCA and CBA.

Figure 5 shows the Asimov discovery significance for the signal process V H → bb as a function of λ, contrasting the results for the PCA with those for the CBA. The corresponding uncertainty band indicates the spread due to the inherent stochasticity of the training procedure. A performance improvement of about 13% with respect to the CBA is observed. Evaluating the classifier with maximum discovery significance on a separate set of events results in a very similar improvement of 12.7%. It should be noted that both the binned and Asimov significance estimates were introduced solely for the purpose of comparing the two analysis methods. Because of the simplifications made in producing the MC dataset and the neglected systematic uncertainties, they are not intended as realistic absolute projections of the achievable sensitivity.

Figure 5: Asimov discovery significance σ A for V H → bb in the 0-lepton channel as a function of λ. The significance expected for the CBA using the original (optimised) cut values is indicated by the pink (dashed) line.
The solid blue line represents the mean significance achieved by the PCA for each λ, while the uncertainty band encompasses the results obtained from 10 independent training runs.
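A binned significance of the kind used above can be sketched as follows, assuming the standard Asimov expression of Ref. [31] per bin, combined in quadrature over bins (the function name and inputs are illustrative):

```python
import numpy as np

def asimov_significance(s, b):
    """Median discovery significance for signal yields s on background yields b,
    using the per-bin Asimov expression combined in quadrature over bins."""
    s, b = np.asarray(s, float), np.asarray(b, float)
    z2 = 2.0 * ((s + b) * np.log(1.0 + s / b) - s)   # squared per-bin significance
    return float(np.sqrt(z2.sum()))

# In the s << b limit each bin reduces to the familiar s / sqrt(b).
z = asimov_significance([1.0], [100.0])
```

Adding bins with independent signal content increases the combined significance, which is why the fit to several signal regions outperforms any single region.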

Conclusions
Separating contributions from signal and background processes is a ubiquitous problem in the analysis of collider data. In this work, we presented a selection procedure based on a discriminating variable computed by a neural network. This discriminant is designed to be statistically independent from a set of physically important variables, whose distributions are thus unaffected by the event selection.
We demonstrated the practical feasibility of this technique in the study of the 0-lepton channel of the V H → bb process. Constructed to preserve the spectrum of the invariant mass of the two b-jets, our discriminant achieves a gain in analysis sensitivity of about 13% compared to a conventional cut-based approach.
Based on the adversarial minimisation of mutual information, this method is expected to generalise seamlessly to situations where the preservation of multiple variables is desired. In addition, achieving independence from a set of continuous and discrete variables (e.g. discrete nuisance parameters) can also be considered, a case whose study we leave to future work.