Enhancing searches for resonances with machine learning and moment decomposition

A key challenge in searches for resonant new physics is that classifiers trained to enhance potential signals must not induce localized structures. Such structures could result in a false signal when the background is estimated from data using sideband methods. A variety of techniques have been developed to construct classifiers which are independent from the resonant feature (often a mass). Such strategies are sufficient to avoid localized structures, but are not necessary. We develop a new set of tools using a novel moment loss function (Moment Decomposition or MoDe) which relax the assumption of independence without creating structures in the background. By allowing classifiers to be more flexible, we enhance the sensitivity to new physics without compromising the fidelity of the background estimation.


Introduction
Searching for new phenomena associated with localized excesses in otherwise featureless spectra, often referred to as bump hunting, is one of the most widely used approaches in particle and nuclear physics, dating back at least to the discovery of the ρ meson [1], and used continuously since, including recently in the discovery of the Higgs boson [2,3]. In the present day, such searches reach the multi-TeV scale [4,5] and span high energy particle and nuclear physics experiments [6][7][8][9][10][11][12]. A key feature of these searches is that they are relatively model-agnostic since sidebands in data can be used to estimate the background under a potential localized excess. These sideband fits are possible because the background data can be well-approximated either with simple parametric functions or smooth non-parametric techniques such as Gaussian processes [13].
A key challenge with complex event selections like those involved in boosted W tagging is that they can invalidate the smoothness assumption of the background. In particular, if classifiers can infer the mass of the parent resonance, then selecting signal-like events will simply pick out background events with a reconstructed mass near the target resonance mass. Many techniques have been developed that modify or simultaneously optimize classifiers so that their responses are independent of a given resonance feature [40-43, 43-51, 51-57]. For machine learning classifiers, the proposed solutions include modifications to loss functions that implicitly or explicitly enforce independence. These methods have been successfully deployed in bump hunts; see, e.g., Refs. [23, 32-37, 39, 58-68]. A variety of similar proposals under the monikers of domain adaptation and fairness have been proposed in the machine learning literature (see e.g. Ref. [69,70] and Ref. [71,72]).
Ensuring that a classifier is independent from a given resonant feature is sufficient for mitigating sculpting, but it is not necessary. The original requirement is simply that a selection using the classifier does not introduce localized features in the background spectrum, which is a much looser requirement than enforcing independence. For example, if a classifier has a linear dependence on the resonant feature, then there would be a strong correlation. However, a threshold requirement on such a classifier would not sculpt any bumps in the background-only case. This example motivates a new class of techniques that allow classifiers to depend on the resonant feature in a controlled way. In the limit that constant dependence is required, then the classifier and the resonant feature will be independent. The advantage of relaxing the independence requirement is that the resulting classifiers can achieve superior performance because they are allowed to be more flexible.
In this article, we present a new set of tools that allow for controlled dependence on a resonant feature. This new approach is called Moment Decomposition (MoDe). Using MoDe, analysts can require independence, linear dependence, and quadratic dependence. In addition, analysts can place bounds on the slope of the linear dependence, and restrict quadratic dependence to be monotonic. Extending MoDe to allow for arbitrarily higherorder dependence is straightforward. This article is organized as follows. Section 2 briefly reviews existing decorrelation methods and then introduces MoDe. Numerical results using a simplified model and a physically motivated example are presented in Sec. 3. Finally, we present conclusions and outlook in Sec. 4.

Existing decorrelation methods
We will consider the binary classification setting in which examples are given by the triplet (X, Y, M ), where X ∈ X is a feature vector, Y ∈ Y := {0, 1} is the target label, and finally, M ∈ M is the resonant feature (or protected attribute) whose spectrum will be used in the bump hunting. Throughout this article, we take M to be mass, though it could be any feature. The feature vector X can either contain M directly as one of its elements or contain other features that are arbitrarily indicative of M . We are interested in finding a mapping f : X → S where s ∈ S are scores used to obtain predictionsŷ ∈ Y with the additional constraint that f be conditionally independent of (or uniform with) M in the sense that for one or more values y, although typically, Eq. (2.1) is required only for the background. Existing decorrelation methods used in particle physics that simultaneously train a classifier f (x) : R n → [0, 1] and decorrelate from a resonant feature m use the following loss function: where S = {i| y i = 1} and B = {i| y i = 0} denote signal and background, respectively, L class is the usual classification loss such as the binary cross entropy L BCE (f (x), y) = y log(f (x)) + (1 − y) log(1 − f (x)), w is a weighting function, λ is a hyperparameter, and L decor generically denotes some form of decorrelation loss. Standard classification corresponds to w(m) = 1 and λ = 0. Decorrelation methods include: • Planing [55,73]: λ = 0 and w(m i ) ≈ p S (m)/p B (m) so that the marginal distribution of m is non-discriminatory after the reweighting.
• Adversaries [40,44,49,56]: w(m) = 1, λ < 0, and L decor is the loss of a second neural network (adversary) that takes f (x) as input and tries to learn some properties of m or its probability density.
Decorrelation methods have proven to be useful additions to the toolkit of the bump hunter.

Moment decorrelation
First, we will derive a new decorrelation method based on moments. While this technique achieves state-of-the-art decorrelation performance, along with being robust, simple, and fast, its true value is that it is trivially extended to allow for controlled dependence beyond just decorrelation. We begin by noting that the uniformity constraint in Eq. (2.1) can be written in terms of the conditional cumulative distribution function (CDF) of scores at s, F (s|M, Y ), as This is the same observation that lies at the heart of the flatness loss defined in Ref. [51].
Here, we will consider the conditional CDFs in bins of mass and only on the background, which allows us to adopt the following more compact notation where now m is discrete and indexes the mass bins. We leave the exploration of similar unbinned approaches for future work. Furthemore, we assume that some transformation is performed on m such that M → [−1, 1]. This could be a simple linear transformation but does not have to be, discussion on this point is provided later. The uniformity constraint of Eq. (2.3) can be imposed on the learned function by defining the decorrelation loss using 2 Here, F 0 m is based on the 0 th Legendre 3 moment of F m (s) in m, c 0 , and polynomial, where ∆ m denotes the width of bin m, andm is its central mass value. Note that the loss in Eq. (2.5) is clearly minimized when which implies that Eq. (2. 3) holds and f (X) and M are indeed independent. Note that in the limit that all bins have equal width and occupancy, it is straightforward to show that the loss function in Eq. (2.5) is the same as the flatness loss of Ref. [51]; however, when the underlying background distribution is highly nonuniform, these loss functions are drastically different resulting in MoDe outperforming Ref. [51] in such cases.

Beyond decorrelation: Moment decomposition
We will now generalize moment decorrelation to allow for controllable mass dependence in the form of an th order polynomial, where is a hyperparameter chosen by the analyst. The generalized MoDe loss is given by where Here, F 0 m in Eq. (2.5) has been replaced by (2.10) and the Legendre moments are given by We note that setting = 0 reduces the generalized MoDe loss of Eq. (2.9) down to the moment decorrelation of Eq. (2.5).
The MoDe loss in Eq. (2.9) is optimal when F m (s) = F m (s) ∀ m, s, which clearly occurs when the mass dependence of the classifier is at most an th order polynomial. For example, taking = 0 drives the classifier to be independent of mass. More interestingly, choosing = 1 allows for a linear mass dependence, = 2 quadratic dependence, etc. Furthermore, making the replacement 4 in Eq. (2.10) places an upper limit c max 1 c 0 (s) > 0 on the magnitude of the linear slope (the first Legendre moment is the coefficient of them term), allowing the analyst to control this aspect of the mass dependence through a hyperparameter, c max 1 . In addition, for the case where = 2 is selected, it is straightforward to show that as long as 3|c 2 (s)| ≤ |c 1 (s)| the derivative of F m (s) is nonzero on (−1, 1). Therefore, making the replacement in Eq. (2.10) results in monotonic mass dependence. This option can be turned on or off in MoDe, and can be used in conjunction with c max 1 if desired. Finally, controlled higher-order mass dependence can be achieved by extending these ideas to larger values.

Computational details
Computing the MoDe loss and its gradient is straightforward using a few approximations. At the batch level, where n is the number of samples in the batch, n m is the number of samples in bin m, is the score of sample i, and Θ is the Heaviside function: Θ(x) = 1 if x > 0 and θ(x) = 0 otherwise. Minimizing the loss function requires calculating the functional derivative of L MoDe with respect to f . This requires specifying how the MoDe loss changes due to variations of the score of each sample in the batch: where from Eq. (2.14) In addition, using this result, along with Eqs. (2.10) and (2.11), we obtain . . .
where the sum over mass bins in Eq. (2.11) is no longer needed, since changes to the score for sample i only affect the CDF in bin m i . The factors of δ(s − s i ) eliminate the integral over s resulting in relatively simple gradient terms, e.g., for = 1 we obtain The fact that terms like Eq. (2.19) do not depend on how the integral over s is approximated yields high-precision gradients, which is a big advantage when performing gradient descent. The results in this subsection are easily generalized for weighted samples. The CDFs in Eq. (2.14) become where w i are the per-sample weights and is the sum of the weights in bin m. Equation (2.16) then becomes and updating the rest of the results follows accordingly.
Finally, we address the topic of scalability. While the optimization of the MoDe loss works well stochastically with few examples every step, its performance increases greatly with larger batch sizes. This is not surprising due to the global nature of the MoDe constraint. Fortunately, all of the calculations scale well with batch size (see Appendix A for time and memory performance as functions of the number of inputs). Most computational costs occur in the forward direction, where MoDe scales linearly with the number of inputs n (batch size) and the number of steps chosen for the integral in s, n s . In addition, dynamic binning sorts m i and reindexes s i are required, and so MoDe runs in O(n s × n + n log n) time. In the forward direction, we also compute and cache the residual F m (s i ) − F m (s i ) which is used in the backward pass. Since the CDFs are evaluated at every s i , this contributes an O(n × n m ) component. In theory, this could be improved, e.g., if the CDFs at s i were instead approximated using nearest neighbor interpolation. Finally, MoDe takes O(n × n m ) extra memory (beyond what is required to store the data) when calculating the gradients, since the CDF is computed for each input for each bin. It is worth noting that, at particularly small batch sizes, MoDe might be susceptible to slow convergence due to mini-batch statistics not accurately reflecting the full-batch statistics. That is why we recommend using MoDe with a sizeable fraction of the full sample.

Example Results
In the this section, we will demonstrate how MoDe performs on a simple model problem, and on the W -jet tagging problem used in the decorrelation studies of Ref. [46,47]. All of the numerical results reported in this section are obtained using the PyTorch framework [78].

Simple Model
We first consider a binary classification example composed of a signal and two types of background. Each sample X ∈ X has 2 features: where N denotes the normal distribution. The mass, m, is drawn from N (0.2, 0.1) and U(−1, 1) at equal rates when Y = 1. For the backgrounds, we sample a uniform random variable, U ∼ U(0, 1), then define m = 1 − 2 √ U and m = −1 + 2 √ U for background types 1 and 2, respectively. This simple-model data is shown in Fig. 1.
In this scenario, an unconstrained classifier with sufficient capacity will learn the underlying mass distribution (due to the explicit mass dependence of x 2 ) and use it to discriminate between signal and background. Figure 2 shows how such a classifier favors regions near m = 0.2, leading to extreme peak-sculpting in the background. It would be difficult to employ this classifier in a real-world analysis and obtain an unbiased signal estimator. Figure 2 also shows that MoDe[0] successfully decorrelates the classifier response from mass producing a viable classifier for such an analysis.
In this simple example, we can analytically find the optimal classifier-defined here as one that only uses information not explicitly indicative of mass-by removing x 2 from X . Figure 2 shows that the resulting mass agnostic classifier is linearly correlated with mass (we ensured this via our choice of x 1 ). However, a classifier that enforces decorrelation must accept backgrounds 1 and 2 at equal rates to keep p(s|m) flat, which as shown in Fig. 2 produces performance that is far from optimal. By relaxing the flatness constraint, MoDe [1] is able to reject background 2 at a higher rate, while producing the expected linear dependence on mass. This linear mass dependence, which will not sculpt out any fake peaks from the background, allows MoDe[1] to achieve better classification power than is possible using decorrelation. In this case, it is able to achieve the same performance as the optimal mass-agnostic classifier, since the optimal performance here is linear. Figure 2 shows that even though MoDe [2] is given the freedom to find quadratic mass dependence, it also produces the same optimal linear mass dependence in this case.
As discussed in Sec. 2, the MoDe package provides an even higher level of control over the response of a model, including allowing the analyst to define the maximum linear slope and to require that the quadratic dependence is monotonic. To demonstrate these features, we make the following minor change to the simple model: In this case, the optimal classifier is no longer linear. Figure 3 shows that here the additional freedom given to MoDe [2] does improve the classification performance. Figure 4 shows that the MoDe[2] solution does indeed have quadratic mass dependence. The MoDe [2] response is not monotonic by default, but we also show in Fig. 4 that we can apply such a constraint. As can be seen in Fig. 3, there is a small decrease in classification performance; whether this is acceptable is a problem-specific decision left to the analyst. Finally, Fig. 5 shows that the analyst can exert full control over the maximum linear slope. This could be desirable in cases where the signal mass is not known, and similar-but not necessarily equivalent-performance across the mass range is viewed as beneficial. MoDe [1] MoDe [2] m-agnostic Unconstrained

Boosted hadronic W tagging
As mentioned in Sec. 1, highly lorentz boosted, hadronically decaying W bosons commonly arise in extensions of the Standard Model. The boost causes the decay products of these bosons to be collimated in the lab frame and to be mostly captured by a single large-radius jet. Various features of the substructure of these jets can be used to distinguish the boosted bosons from generic quark and gluon jets. A bump hunt is performed either in the mass of the W candidate jet, m J , or the mass of one W candidate jet and another (possibly W candidate) jet, m JJ . The challenge with substructure classifiers is that they can introduce artificial bumps into the mass spectrum because substructure is correlated with the jet mass and the jet kinematic properties (which are related to m JJ ). For this reason, boosted W tagging has become a benchmark process for studying decorrelation methods at the LHC.
The simulated samples used in this section are the same as in Ref. [47] (intended to emulate the study in Ref. [46]) and are briefly summarized here. In particular, boosted W bosons (signal) and generic multijet (background) events are generated with Pythia 8.219 [79,80] and a detector simulation is provided by Delphes 3.4.1 [81][82][83]. Jets are clustered using the anti-k t algorithm [84] with R = 1.0 implemented in FastJet 3.0.1 [85,86]. The selected jets for this study have 300 GeV < p T < 400 GeV and 50 GeV< m J < 300 GeV. Ten representative jet substructure features are computed for each jet and used for classification. This list is the same as in Ref. [46] (based on Ref. [87]) and includes the energy correlation ratios C 2 and D 2 [88], the N -subjettiness ratio τ 21 [89], the Fox-Wolfram moment R FW 2 [90], planar flow P [91], the angularity a 3 [92], aplanarity A [93], the splitting scales Z cut [94] and √ d 12 [95], and the k t subjet opening angle KtDR [96]. Detailed explanations of these features can be found in the references.

Classifier Details
MoDe and DisCo: We use a simple 3-layer neural network with a similar architecture to that described in Ref. [47]. However, unlike Refs. [47] and [46], after each of the 3 fully connected 64-node layers, we use Swish activation [97] as it provides a notable performance increase. We also use a batch normalization layer after the first fully connected layer. The output layer has a single node with a sigmoid activation. Both MoDe and DisCo are trained with the ADAM optimizer [98] using a 1cycle learning rate policy [99] with a starting learning rate of 10 −3 and a maximum learning rate (lr) of 10 −2 , which is reached using a cosine annealing strategy [100] and decayed to 10 −5 during the last few iterations. Momentum is cycled in the inverse direction from 0.95 to a minimum of 0.85. These hyperparameters were selected through a learning rate range test. Training is done using large batches of 10-20% of the training data. Note that large batch sizes do not necessarily make training more difficult especially when coupled with the 1cycle learning policy; see Ref. [101].
Adversarial Decorrelation: The same classifier used for MoDe and DisCo is trained against a Gaussian Mixture Network (GMN) [102] that parametrizes a Gaussian mixture model with 20 components, i.e. its outputs are the means, variances, and mixing coefficients of 20 normal distributions. We follow a similar adversarial setup to that referenced in Refs. [46] and [47]. We use one hidden layer with 64 nodes with ReLu activation connected to 60 output nodes. These outputs are then used to model the posterior probability density function p θ adv (m|f (X) = s), where θ adv are the parameters of the GMN. The adversary is optimized by maximizing the likelihood of the data given by where θ class and θ adv are the parameters of the classifier and the adversary, respectively. To control the classification-decorrelation trade-off, we vary λ between 1 and 100. The nonconvex nature of the loss makes training considerably more difficult; the hyperparameters must be chosen carefully.

Decorrelation
First, we will show that MoDe[0] achieves state-of-the-art decorrelation performance. Following Ref. [47], we quantify the classification and decorrelation performance using the following metrics: R50, the background rejection power (inverse false positive rate) at 50% signal efficiency; and 1/JSD, where the Jensen-Shannon divergence (JSD) is a symmetrized version of the Kullback-Leibler divergence. Here, JSD is used to compare the mass distributions of backgrounds that pass and fail the classifier-based selections, with the relative entropy measured in bits. Figure 6 shows that, as expected, without imposing a strong constraint on mass decorrelation, the classifier learns to select samples near the W -boson mass, which sculpts a fake peak in the background. Figure 6 also shows that MoDe[0] successfully decorrelates its response from mass (if the decorrelation hyperparameter λ is chosen to be sufficiently large). Figure 7 shows that the existing state-of-the-art decorrelation methods discussed in Sec. 2 perform similarly to MoDe[0] on this W -tagging problem. More precisely, as observed in Ref. [47], the adversary method performs slightly better, but is considerably more difficult to train, since it requires carefully tuning two neural networks against each other. The optimal solution is a saddle point, where the classification and adversarial losses are minimized and maximized, respectively, which makes the training inherently unstable. Conversely, both DisCo and MoDe[0] minimize convex loss functions, making their training robust and stable, and both only introduce one additional hyperparameter in the loss function. The performances of MoDe[0] and DisCo are comparable in these metrics (Appendix A shows that MoDe[0] is less resource intensive), though decorrelation is not our primary objective.

Beyond Decorrelation
Moving beyond decorrelation the 1/JSD metric is no longer relevant. Figure 6 shows that neither MoDe [1] nor MoDe [2] sculpts a peaking structure in the background, but their 1/JSD values are small since neither seeks to decorrelate from the mass. Therefore, we replace the 1/JSD metric with the signal bias induced by the classifier selection, which is what actually matters when searching for resonant new physics. Specifically, we use the signal estimators obtained by fitting the selected background-only samples to a simple polynomial function as proxies for the signal biases. These are divided by their uncertainties such that values of roughly unity are consistent with no bias, while values significantly larger than unity indicate substantial bias that could result in false claims of observations. Figure 8 shows that the DisCo and MoDe[0] decorrelation methods provide unbiased signal estimators for R50 9, which from Fig. 7 corresponds to 1/JSD 1000. While achieving higher decorrelation values is possible, this does not provide any tangible gains in the bump-hunt analysis. Figure 8 also shows that the flexibility to go beyond decorrelation provided by MoDe [1] and MoDe [2] results in achieving unbiased signal estimators at larger background-rejection power. This would directly translate to improved sensitivity in a real-world analysis.

Conclusions and Outlook
In summary, a key challenge in searches for resonant new physics is that classifiers trained to enhance potential signals must not induce localized structures. In particular, if classifiers can infer the mass of the parent resonance, then selecting signal-like events will simply pick out background events with a reconstructed mass near the target resonance mass creating an artificial structure in the background. Such structures could result in a false signal when the background is estimated from data using sideband methods. A variety of techniques have been developed to construct classifiers which are independent from the resonant feature (often a mass). Such strategies are sufficient to avoid localized structures, but are not necessary.
In this article, we presented a new set of tools using a novel moment loss function (Moment Decomposition or MoDe) which relax the assumption of independence without creating structures in the background. Using MoDe, analysts can require independence, linear dependence, quadratic dependence, etc. In addition, analysts can place bounds on the slope of the linear dependence, and restrict higher-order dependence to be monotonic. By allowing classifiers to be more flexible, we enhance the sensitivity to new physics without compromising the fidelity of the background estimation.

Code and Data
An implementation for MoDe in PyTorch as well as Keras/Tensorflow is available at https://github.com/okitouni/MoDe, along with example code used to produce the results in this article. The simulated W and QCD samples are available from Zenodo at Ref. [103].