Resonant Anomaly Detection with Multiple Reference Datasets

An important class of techniques for resonant anomaly detection in high energy physics builds models that distinguish between reference and target datasets, where only the latter has appreciable signal. Such techniques, including Classification Without Labels (CWoLa) and Simulation Assisted Likelihood-free Anomaly Detection (SALAD), rely on a single reference dataset. They cannot take advantage of commonly available multiple datasets and thus cannot fully exploit the available information. In this work, we propose generalizations of CWoLa and SALAD for settings where multiple reference datasets are available, building on weak supervision techniques. We demonstrate improved performance in a number of settings with realistic and synthetic data. As an added benefit, our generalizations enable us to provide finite-sample guarantees, improving on existing asymptotic analyses.


Introduction
Due to the vast parameter space of Standard Model extensions and the lack of significant evidence for new particles or forces of nature, a new model-agnostic search paradigm has emerged. Many of these anomaly detection (AD) strategies are enabled by machine learning (see, e.g., Refs. [1-4]), and the first results with collision data are now available [5, 6]. One way to characterize AD methods is by their physics assumptions about the new phenomena [2]. Strategies that assume the new physics is "rare" [7] estimate (explicitly or implicitly) the data probability density and focus on events with low density. In contrast, techniques that assume the new physics will manifest as an overdensity in phase space use likelihood ratio methods to compare a reference dataset to a target dataset. The latter approach has been extensively studied in the context of resonant anomaly detection [8], where one resonant feature (usually a mass) is used to create a sideband region (reference dataset) nearly devoid of any anomalous events and a signal region (target dataset) that may contain anomalies. The reference dataset is used to estimate the presence of anomalies in the target dataset via interpolation.
Many existing approaches are defined using one reference dataset and one target dataset [9-24]. However, in practice one can have access to, or can construct, multiple references. First, there may exist multiple resonant features that can be used to construct sideband and signal regions. For instance, when a particle decays into two new particles, the decay products can be used to construct all three intermediate resonances, a setting present in the LHC Olympics Dataset [3]. Second, there may also exist multiple independent Standard Model simulators available for producing a dataset (e.g., Pythia [25], Herwig [26], or Sherpa [27]). Using multiple reference datasets may improve performance, but it is not clear how to incorporate all of their information when using existing methods designed for a single set.
We explore two generalizations of resonant AD to multiple reference datasets. First, we consider Classification Without Labels (CWoLa) [9, 10, 28], in which the reference is simply the sideband region, a form of weak supervision where the noisy label of "signal" is assigned to events in the signal region and the noisy label of "background" to events in the sideband region. We propose a new method, Multi-CWoLa, that builds multiple reference datasets by constructing signal and sideband regions along different resonant features. We consider a point's membership in each feature's signal region as a noisy vote for anomaly, learn weights on each vote, and aggregate the votes to produce a higher-quality noisy label. We demonstrate Multi-CWoLa's performance on the LHC Olympics Dataset [3].
Next, we study Simulation Assisted Likelihood-free Anomaly Detection (SALAD) [14]. In this method, a reweighting function between a reference simulation dataset and a target dataset is learned in the sideband, conditioned on the resonant feature. The simulated events in the signal region are reweighted by interpolating this function and are then used to distinguish anomalies in the target dataset. We extend this to the case of multiple simulated datasets, each of which may make different approximation choices and thus provide complementary accuracy when used with SALAD. We introduce Multi-SALAD, which combines the simulated datasets accordingly and then reweights, with the key finding that combining data helps when each simulator approximates a different component of the background well. We demonstrate Multi-SALAD's performance on synthetic data.
Finally, we study the finite-sample guarantees of our proposed methods. Many resonant AD methods have optimality guarantees in some asymptotic limit, but there is no first-principles understanding of the methods' performance with finite samples. In particular, approaches like the ones described above that use classifiers to distinguish a reference dataset from a target dataset approximate the data-to-background likelihood ratio. When the reference (physics) model is correct, this approach converges to the optimal Neyman-Pearson likelihood ratio test in the limit of infinite statistics, a sufficiently complex classifier architecture, and a sufficiently flexible training procedure [15, 29]. However, a finite-sample understanding of these approaches is lacking. We draw on results from statistical theory to begin a formal study of resonant AD methods with limited data. Our results lay a foundation for future investigations into the finite-sample properties of AD and related methods.
This paper is organized as follows. Section 2 briefly sets up the resonant AD setting; Multi-CWoLa and Multi-SALAD are then introduced in Secs. 3 and 4, respectively. The paper ends with conclusions and outlook in Sec. 5.

Problem Setup
We have an input space of discriminating features x ∈ X and k resonant features m = [m_1, . . ., m_k] ∈ ℝ^k. Associated with a point (x, m) is an unknown label y ∈ Y for Y = {0, 1} (background vs. signal). Points (x, m, y) are drawn from a distribution P with density p(·). For a resonant feature m_i ∈ ℝ, an interval I_{m_i} ⊂ ℝ is used to define a signal region SR_i = {(x, m) : m_i ∈ I_{m_i}} and a sideband region SB_i = {(x, m) : m_i ∉ I_{m_i}} (when the resonant feature is clear from context, the i is dropped and we use SR and SB). We assume that the sideband region contains little to no signal, i.e., p(y = 1 | (x, m) ∈ SB) ≈ 0. Our goal is to construct a predictor f : X → Y to perform anomaly detection.
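As a concrete illustration of this setup (not from the paper's code; the feature values and the interval below are hypothetical), the split along one resonant feature can be sketched as:

```python
import numpy as np

# Hypothetical example: split a dataset into signal region (SR) and
# sideband (SB) along one resonant feature m using an interval I_m.
rng = np.random.default_rng(0)
m = rng.normal(3.5, 0.5, size=1000)    # resonant feature (e.g. a mass)
I_m = (3.3, 3.7)                       # assumed signal-region interval

in_sr = (m >= I_m[0]) & (m <= I_m[1])  # SR = {(x, m) : m in I_m}
sr, sb = m[in_sr], m[~in_sr]           # SB is the complement

# Every point lands in exactly one of the two regions.
assert len(sr) + len(sb) == len(m)
```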

Multi-CWoLa: Learning from Multiple Resonant Features
We introduce Multi-CWoLa, an approach to anomaly detection that uses multiple reference datasets and is built using principles from the area of weak supervision [30,31].
Standard CWoLa. Consider a dataset D = {(x_i, m_i)}^n_{i=1} with one resonant feature (k = 1) that we want to use to learn f. We use m to construct the signal and sideband regions D_SR, D_SB ⊂ D, where D_SR = D ∩ SR and D_SB = D ∩ SB, with distributions p_SR and p_SB respectively. With the intuition that there are more anomalies in the signal region, we express each distribution as a mixture of signal and background components with weights 0 ≤ η_SR, η_SB ≤ 1:

p_SR(x) = η_SR p(x|y=1) + (1 − η_SR) p(x|y=0),
p_SB(x) = η_SB p(x|y=1) + (1 − η_SB) p(x|y=0).

Under this construction, the density ratio of the mixtures p_SR(x)/p_SB(x) can be written in terms of the ratio of the signal and background components, r(x) = p(x|y=1)/p(x|y=0), as

p_SR(x)/p_SB(x) = (η_SR r(x) + 1 − η_SR) / (η_SB r(x) + 1 − η_SB).

Assuming η_SR > η_SB (i.e., more signal in the signal region), the mixture ratio is monotonically increasing in r(x). Therefore, we train a classifier f to learn p_SR(x)/p_SB(x) by distinguishing between D_SR and D_SB; this f provides information about r(x) and can be used for anomaly detection.
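The monotonicity claim can be checked numerically; the mixture fractions below (η_SR = 0.10, η_SB = 0.01) are illustrative choices, not values from the paper:

```python
import numpy as np

# Numerical check: the mixture ratio
#   (eta_SR * r + 1 - eta_SR) / (eta_SB * r + 1 - eta_SB)
# is monotonically increasing in r whenever eta_SR > eta_SB.
def mixture_ratio(r, eta_sr=0.10, eta_sb=0.01):
    return (eta_sr * r + 1 - eta_sr) / (eta_sb * r + 1 - eta_sb)

r = np.linspace(0.0, 50.0, 1000)   # candidate signal/background ratios r(x)
vals = mixture_ratio(r)
assert np.all(np.diff(vals) > 0)   # strictly increasing in r
```

The derivative of the ratio has sign η_SR(1 − η_SB) − η_SB(1 − η_SR) = η_SR − η_SB, which is positive exactly when the signal region is signal-enriched.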

Multi-CWoLa Method
Intuitively, CWoLa uses the resonant feature m as a noisy label that identifies the signal versus sideband region and then trains a classifier using these noisy labels. This idea leads to a simple question: if more than one such feature is available (k > 1), how can the multiple noisy labels best be utilized? We tackle this question using principles from weak supervision [30-33].

Model
In our approach, we split D along each resonant feature m_i to produce pairs of datasets (D_SB_i, D_SR_i) for each i ∈ [k], based on membership in I_{m_i}, and construct for each point a binary vector of k noisy membership labels M(m) = [M_1(m), . . ., M_k(m)], where M_i(m) = 1{m_i ∈ I_{m_i}}. We propose to directly aggregate these labels M(m) into an estimate ŷ of y and train a classifier on the aggregated ŷ along with the discriminative features x. Since each M_i(m)'s "vote" can have a different correlation with the true y, we aim to combine the votes in a weighted fashion. We cannot directly measure each membership label's accuracy since the true y is unknown, so we draw on methods from weak supervision. We model the distribution p(y, M(m)) as a probabilistic graphical model with the following parametrization:

p(y, M(m); θ) = (1/Z) exp(θ_y ȳ + Σ^k_{i=1} θ_i ȳ M̄_i(m)),   (3.3)

where θ = {θ_y, θ_i ∀i ∈ [k]} are the canonical parameters of the distribution, Z is for normalization, and ȳ and M̄_i(m) are y and M_i(m) scaled from {0, 1} to {−1, 1}. Intuitively, θ_i represents the (unobserved) strength of the correlation between M_i(m) and y and thus captures a notion of M_i's accuracy. For simplicity, this model also implies that M_i(m) ⊥ M_j(m) | y; that is, the resonant features are conditionally independent given y. Our goal is to estimate the parameters of the graphical model and use them to perform inference, producing aggregated weak labels ŷ drawn from the distribution p(y = 1 | M(m); θ) given a vector of noisy labels M(m).

Parameter Estimation
We first learn the parameters of p(y, M(m); θ) as defined in (3.3). Of key interest is the accuracy parameter α_i = p(M_i(m) = 1 | y = 1) = p(M_i(m) = 0 | y = 0) of the ith resonant feature, which corresponds to the canonical parameter θ_i (see [35] for more background on probabilistic graphical models). We estimate the accuracy parameters by adapting the triplet approach from [31]. First, we draw triplets of resonant features a, b, c ∈ [k]. If the distribution on y, M(m) follows the graphical model in (3.3), it holds that

E[M̄_a(m) M̄_b(m)] = E[M̄_a(m) ȳ] E[M̄_b(m) ȳ].

Writing one such equation for each pair in the triplet (a, b, c), we have

E[M̄_a M̄_b] = E[M̄_a ȳ] E[M̄_b ȳ],   E[M̄_a M̄_c] = E[M̄_a ȳ] E[M̄_c ȳ],   E[M̄_b M̄_c] = E[M̄_b ȳ] E[M̄_c ȳ].

Solving this system, we obtain

|E[M̄_a ȳ]| = √( E[M̄_a M̄_b] E[M̄_a M̄_c] / E[M̄_b M̄_c] ),

and similarly for b and c. We assume that each signal region is positively correlated with the true signal, which allows us to drop the absolute value and uniquely recover properties of the graphical model in (3.3). Note that in practice, all of these quantities are empirical estimates, with expectations such as E[M̄_a M̄_b] replaced by sample averages over D.
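A minimal sketch of the triplet recovery on synthetic votes, assuming the conditional-independence model holds; the accuracies and sample size below are illustrative, not from the paper:

```python
import numpy as np

# Sketch of triplet-based accuracy recovery on synthetic weak labels.
rng = np.random.default_rng(0)
n, true_alpha = 200_000, np.array([0.85, 0.75, 0.65])

y = rng.choice([-1, 1], size=n)                # latent label, scaled to {-1, 1}
agree = rng.random((n, 3)) < true_alpha        # vote matches y w.p. alpha_i
M = np.where(agree, y[:, None], -y[:, None])   # conditionally independent votes

# Under the model, E[M_a M_b] = E[M_a y] * E[M_b y], so
# |E[M_a y]| = sqrt(E[M_a M_b] * E[M_a M_c] / E[M_b M_c]).
E = (M.T @ M) / n                              # empirical pairwise moments
est = np.array([
    np.sqrt(E[0, 1] * E[0, 2] / E[1, 2]),
    np.sqrt(E[0, 1] * E[1, 2] / E[0, 2]),
    np.sqrt(E[0, 2] * E[1, 2] / E[0, 1]),
])
alpha_hat = (1 + est) / 2                      # accuracy: alpha_i = (1 + E[M_i y]) / 2
assert np.allclose(alpha_hat, true_alpha, atol=0.03)
```

Because each signal region is assumed positively correlated with the true signal, the positive square root is taken directly.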

Inference and Training
After we learn the accuracy parameters, we use them to estimate p(y = 1 | M(m)) for a given M(m). We use Bayes' rule and the conditional independence among the M_i(m) to write

p(y = 1 | M(m)) = Π^k_{i=1} p(M_i(m) | y = 1) p(y = 1) / p(M(m)).

We assume that the class balance p(y = 1) is known; otherwise, it can be estimated via tensor decomposition [33]. The numerator is computed from the estimated accuracies, and the denominator p(M(m)) can either be estimated directly, since all of its quantities are observable, or computed as

p(M(m)) = Π^k_{i=1} p(M_i(m) | y = 1) p(y = 1) + Π^k_{i=1} p(M_i(m) | y = 0) p(y = 0)

using the estimated accuracies and class balance.
Once p(y = 1 | M(m)) is estimated for all M(m) ∈ {0, 1}^k, the aggregated weak label ŷ is drawn from this distribution. With a label ŷ_i for each (x_i, m_i) ∈ D, we train a classifier f on the weakly labeled dataset {(x_i, ŷ_i)}^n_{i=1}. This procedure is summarized in Algorithm 1.
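The inference step amounts to a small naive-Bayes computation; the accuracy values and class balance below are hypothetical:

```python
import numpy as np

# Sketch of the inference step: combine k noisy membership labels into
# p(y=1 | M(m)) via Bayes' rule and conditional independence.
alpha = np.array([0.85, 0.75, 0.65])   # estimated p(M_i=1|y=1) = p(M_i=0|y=0)
p_y1 = 0.1                             # assumed known class balance p(y=1)

def posterior(M):
    """p(y=1 | M) for a binary vote vector M in {0,1}^k."""
    like1 = np.prod(np.where(M == 1, alpha, 1 - alpha)) * p_y1
    like0 = np.prod(np.where(M == 1, 1 - alpha, alpha)) * (1 - p_y1)
    return like1 / (like1 + like0)

# A unanimous "signal" vote gives a much higher posterior than a split vote.
assert posterior(np.array([1, 1, 1])) > posterior(np.array([1, 0, 0]))
```

In the full method, ŷ is then sampled from this posterior for each event and used as the training label.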
Algorithm 1 (Multi-CWoLa, sketch). Input: dataset D; thresholds I_{m_i} that split D into signal and sideband regions, D_SR_i and D_SB_i respectively, for each m_i; class balance probability of anomaly p(y = 1). For each triplet of resonant features, compute the empirical pairwise moments Ê[M̄_a(m) M̄_b(m)], where Ê is an empirical estimate of the expectation over D and M̄(m) indicates M(m) scaled to {−1, 1}; set the accuracy parameters via the triplet method; infer p(y = 1 | M(m)); draw weak labels ŷ; and train f on {(x_i, ŷ_i)}.

Theoretical Results
Under (3.3), Multi-CWoLa offers finite-sample generalization guarantees. Suppose the downstream model f trained on ŷ belongs to a class F. Define a loss function ℓ_C : Y × Y → ℝ and let the expected loss of f on true labels be L_C(f) := E[ℓ_C(f(x), y)]. Then the optimal classifier is f* = argmin_{f∈F} L_C(f), which is achieved with unlimited labeled data. Let the empirical loss of f on ŷ be L̂_C(f). The classifier we learn is f̂ = argmin_{f∈F} L̂_C(f), i.e., it is learned on finite and noisily labeled data. Note that this construction is different from the standard empirical risk minimization (ERM) loss on labeled data, and thus L̂_C(f) does not asymptotically equal L_C(f). We aim to bound the generalization error L_C(f̂) − L_C(f*).

We now present our upper bound on L_C(f̂) − L_C(f*). Define e_min as the minimum eigenvalue of the covariance matrix of [ȳ, M̄_1(m), . . ., M̄_k(m)], and let a_min be the minimum accuracy over the resonant features.

Theorem 1. Assume that p(y, M(m)) can be parametrized according to (3.3) and that ℓ_C is scaled to be bounded in [0, 1]. Assume that the class balance p(y) is known (if not, there are ways to estimate it [33]) and that k ≥ 3. Then, with probability at least 1 − δ, the generalization error of Multi-CWoLa on D is at most the sum of three terms: the Rademacher complexity of F, a sampling term c_1 √(log(1/δ)/n), and a weak-label term of order c_2 n^{−1/2} that scales inversely with a_min and e_min, where c_1, c_2 are positive constants.
We observe that there are three quantities controlling the above bound:
• The Rademacher complexity of F: this term describes the model's expressivity. Smaller Rademacher complexity means that the model is easier to learn and that our f̂ will be closer to the best model in F. This quantity can be readily computed for a variety of function classes F, such as decision trees, linear models, and two-layer feedforward networks, which makes our bound in Theorem 1 tractable. See Appendix C.2 for exact values.
• Using n finite samples: as the amount of data increases, the error decreases as O(n^{−1/2}).
• Using noisy labels ŷ instead of y: for our weak supervision algorithm and graphical model, using ŷ rather than y contributes an additional O(n^{−1/2}) error. Asymptotically, our approach thus does no worse than training with labeled data.
By contrast, the standard CWoLa approach has k = 1 and therefore cannot use any aggregation or weak supervision, which requires k ≥ 3. For standard CWoLa, the term in the generalization error due to noisy labels is irreducible, because using any single resonant feature in place of y is biased. Multi-CWoLa, on the other hand, corrects for some of this bias: its noisy-label term asymptotically approaches 0 with more data.

Empirical Results
In Figure 1, we compare Multi-CWoLa with standard CWoLa as well as two other baselines. We use simulation data from the LHC Olympics Dataset [3], in particular from Pythia 8 [25], where the signal is a resonant boson decay and the background is generic 2 → 2 parton scattering. This dataset contains 5 features; in the standard CWoLa setup, we use one thresholded resonant feature (k = 1) and use 4 discriminative features as x. For Multi-CWoLa, we generate k = 3 mixtures by varying how the 3 resonant features (the jet masses in addition to the dijet mass) are thresholded, and we use 2 discriminative features as x. The two other baselines also utilize the 3 resonant features:
• CWoLa + intersect defines the signal region as the intersection of the resonant features' signal regions, e.g. SR = SR_1 ∩ SR_2 ∩ SR_3, but this can be overly conservative.
• CWoLa + average runs standard CWoLa three times, once per resonant feature and with the 2 discriminative features. The three model scores are averaged to produce the final output.
We vary the number of samples available on a logarithmic scale from n = 59 to 6003 and plot the AUC averaged over 5 runs per sample size in Figure 1. We find that Multi-CWoLa offers a higher AUC and lower variance, especially when data is limited. We also plot the significance improvement (SI) curves averaged over 5 runs for n = 59, 530, and 6003 in Figure 2.

Multi-SALAD: Learning from Multiple Simulations
We often have access to an (approximate) simulation of the background process. We first provide an overview of SALAD, which reweights samples from the simulation to better assist with classification on the real dataset. Then, we present Multi-SALAD, a variant of SALAD that uses multiple simulations.
Standard SALAD. In addition to one true dataset D = {(x_i, m_i)}^n_{i=1}, we have a background simulation dataset D_sim with y_i = 0 for all i, drawn from some distribution P_sim with density p_sim. While CWoLa learns the likelihood ratio between the signal and sideband regions of D alone, SALAD utilizes D_sim as well. Note that if p_sim were equal to p(·|y = 0), we could directly train a model to distinguish between D and D_sim in the signal region to get a classifier that detects anomalies. However, since D_sim may not match the true background data, we instead first learn a reweighting function that captures the differences between D_sim and D's background data, and then train a model to distinguish between D and the reweighted D_sim in the signal region. Formally, given fixed SR and SB for both datasets, the method can be broken into two steps:
1. Reweighting: A classifier ĝ is trained in the sideband to distinguish D_SB from D_sim_SB, yielding an estimate ŵ(x, m) = ĝ(x, m)/(1 − ĝ(x, m)) of the ratio w(x, m) = p(x, m|y=0)/p_sim(x, m|y=0), which is then interpolated into the signal region.
2. Detection: Using a loss function L_S with the estimated ŵ(x, m) applied to D_sim_SR = D_sim ∩ SR, a classifier ĥ is trained to distinguish between D_SR and D_sim_SR.
If the estimate ŵ(x, m) is exactly equal to w(x, m) (e.g., ĝ is Bayes-optimal), then the second step is equivalent in expectation to learning the ratio p(x)/p(x|y=0) (see Lemma 2 in Appendix C.3), from which one can detect anomalies.
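The two SALAD steps can be sketched on 1-D toy data; here a histogram density ratio stands in for the neural-network classifier ĝ, and all distributions and numbers below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

data_sb = rng.normal(0.0, 1.0, 20_000)   # true background in the sideband
sim_sb = rng.normal(0.3, 1.2, 20_000)    # imperfect simulation of the same region

# Step 1 (reweighting): estimate w(x) ~ p_data(x) / p_sim(x) in the sideband.
bins = np.linspace(-5, 5, 41)
h_data, _ = np.histogram(data_sb, bins=bins, density=True)
h_sim, _ = np.histogram(sim_sb, bins=bins, density=True)
w = np.divide(h_data, h_sim, out=np.ones_like(h_data), where=h_sim > 0)

# Step 2 (detection): apply w, interpolated into the signal region, as
# per-event weights on simulated events before training the SR classifier.
sim_sr = rng.normal(0.3, 1.2, 20_000)    # simulation in the signal region
idx = np.clip(np.digitize(sim_sr, bins) - 1, 0, len(w) - 1)
weights = w[idx]

# After reweighting, the simulation's mean moves toward the data mean (0).
assert abs(np.average(sim_sr, weights=weights)) < abs(sim_sr.mean())
```

In the paper's setting the density ratio is learned with a classifier rather than histograms, but the reweight-then-classify structure is the same.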

Multi-SALAD Method
Now suppose we have multiple simulation datasets D_sim_1, . . ., D_sim_k. One approach would be to maintain distinctions among simulations by reweighting each pair (D_sim_i, D) to learn k weight functions w_i(x, m), and then using one overall loss function that weights points from each D_sim_SR,i with w_i. However, it has been shown that importance reweighting, despite working in expectation, can be highly unstable and result in poor downstream performance on the target data D [40]. To understand why, Ref. [41] showed that the generalization error of an empirical loss function with importance weights w depends on the magnitude of w. Applied to our setting, this suggests that the more inaccurate a simulation is, the less its reweighted loss recovers the true p(x)/p(x|y=0), and the model may instead pick up on differences between D_SR and the reweighted D_sim_SR that are noise rather than anomaly. As a result, aggregating individual SALAD outputs can be equivalent to ensembling many poor classifiers.
Given these observations, Multi-SALAD uses multiple simulation datasets in a simple yet theoretically principled way: it controls the magnitude of the overall w by combining all of the D_sim_i into one large simulation dataset D_sim whose distribution best approximates the true background p(x|y = 0), and then runs standard SALAD with D_sim and D. Note that this approach both improves sample complexity and can "suppress" a simulation that on its own has a large w, while the approach of learning k weight functions offers neither improvement. In Algorithm 2 and Appendix B.1, we write out this procedure, where we simply concatenate all of the D_sim_i together. With domain knowledge of the strengths and weaknesses of each simulation across features, one could instead produce D_sim by sampling accordingly from each; we leave this direction for future work.

Theoretical Results
We now present a finite-sample generalization error bound for Multi-SALAD that also applies to SALAD. To measure the generalization error, recall w(x, m) = p(x, m|y=0)/p_sim(x, m|y=0) and let ŵ be the estimate derived from the classifier ĝ. We denote by h the reweighted classifier. Let h* = argmin_{h∈H} L_S(h, w) and let ĥ = argmin_{h∈H} L̂_S(h, ŵ). We aim to bound L_S(ĥ, ŵ) − L_S(h*, w).
We first set up some definitions. Define n_SR as the number of points from D and D_sim belonging to the signal region, and n_SB as the number belonging to the sideband. Let n_SR^sim be the number of points in D_sim belonging to the signal region. Let ĝ(x) ∈ [ĝ_min, ĝ_max] and g*(x) ∈ [g*_min, g*_max], where g* is the optimal classifier. Let R_{n_SR}(ℓ_S ∘ {H, G}) be the Rademacher complexity of the overall loss L_S(h, w) across function classes h ∈ H, g ∈ G. Define W = max_{x,m} w(x, m) as the maximum ratio between the true background and the simulation. Let B_1 = max{−log h*(x, m), −log(1 − h*(x, m))} be based on the most extreme value of h* (i.e., how far apart p and p(·|y = 0) can be). Let R_{n_SB}(ℓ ∘ G) be the Rademacher complexity of the loss function class used for learning the reweighting, where ℓ is the point-wise cross-entropy. Finally, let B_2 = −log(min{ĝ_min, g*_min}).
Theorem 2. With probability at least 1 − δ, there exists a constant c > 0 such that the generalization error L_S(ĥ, ŵ) − L_S(h*, w) of Multi-SALAD on D_sim and D is bounded in terms of the quantities defined above. We make several observations about this bound:
• The bound scales as (n_SB)^{−1/2} and (n_SR^sim)^{−1/2}, where the former comes from the initial reweighting step and the latter from the weighted classification step.
• The bound is also dependent on the Rademacher complexities of both classifiers g and h used.
• The bound depends on the difference between the simulation and data distributions through the quantities W, B_1, B_2, η, ĝ_max, and g*_max. If the distributions have very different densities, these quantities will all be large, increasing the generalization error.
We now comment on how this bound differs when instantiated for SALAD versus Multi-SALAD. The following example shows how SALAD with one simulation can result in a large W (and other large constants), while Multi-SALAD with the two simulations combined reduces the W in the bound.
Example 1. Let P^1_sim(x|y = 0) = N(μ, σ²) and P^2_sim(x|y = 0) = N(−μ, σ²) be Gaussian distributions on x with μ > 0, and let the true background distribution P(·|y = 0) be the mixture of the Gaussians on x, P(x|y = 0) = ½ P^1_sim + ½ P^2_sim. Let P^1_sim, P^2_sim, and P have the same marginal distribution over m with x ⊥ m | y. Then, if we only use simulation 1,

w(x) = p(x|y=0) / p^1_sim(x|y=0) = ½ (1 + exp(−2μx/σ²)).

Therefore, as x → −∞, W → ∞. However, if we define P_sim as the distribution of the two simulation datasets concatenated, we have that p_sim(x|y = 0) = p(x|y = 0), and as a result W → 1, making the generalization error bound smaller.
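Example 1 can be verified numerically; μ = 1 and σ² = 1 below are illustrative choices:

```python
import numpy as np

# Numerical illustration of Example 1: with one simulation, the weight
# w(x) = p(x|y=0) / p_sim(x|y=0) blows up in the left tail, while
# concatenating both simulations gives w(x) = 1 everywhere.
def gauss(x, mu, sig2):
    return np.exp(-(x - mu) ** 2 / (2 * sig2)) / np.sqrt(2 * np.pi * sig2)

x = np.array([-1.0, -3.0, -5.0])        # points moving into the left tail
p1 = gauss(x, 1.0, 1.0)                 # P^1_sim = N(mu, sigma^2)
p2 = gauss(x, -1.0, 1.0)                # P^2_sim = N(-mu, sigma^2)
p_bkg = 0.5 * p1 + 0.5 * p2             # true background: the mixture

w_single = p_bkg / p1                       # SALAD with simulation 1 only
w_combined = p_bkg / (0.5 * p1 + 0.5 * p2)  # Multi-SALAD: concatenated sims

assert np.all(np.diff(w_single) > 0)    # grows without bound as x -> -inf
assert np.allclose(w_combined, 1.0)     # combined simulation matches exactly
```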
From this example, we can see that significantly differing simulation and data distributions can result in large, unbounded weight ratios, which are correlated with poor performance. This concretely motivates our algorithmic objective of combining multiple simulation datasets so as to closely approximate the true data.

Empirical Results
To demonstrate how Multi-SALAD can improve over using only one simulation, and over using simulations separately, we consider a synthetic experiment with two simulation datasets. The true background is a two-mode Gaussian mixture, and the anomaly is a small component only slightly shifted from the background; each simulation matches one mode of the background well while mismatching the other (for instance, simulation 2 is P^2_sim = ½ N(−1, 0.2) + ½ N(0, 1)). We generate 2000 points from the true background and 100 anomalous points to form D, and 2000 points each from P^1_sim and P^2_sim to form D_sim_1 and D_sim_2. We construct signal and sideband regions by splitting the datasets in half randomly, assuming they follow the same distribution over x (i.e., m is independent of x), except that there is no anomaly in the sideband regions. A visualization is shown in Figure 3.
Intuitively, the anomaly is only slightly different from the background data, which makes it important to learn a good reweighting function from the simulations. Because each simulation alone diverges greatly from the data for one mode, each individual reweighting may not approximate the true background P(·|y = 0) well. On the other hand, if we combine both simulation datasets together, the aggregate distribution has smaller weights with lower variance, which allows for more accurate reweighting. This is demonstrated in Figure 4, which depicts the reweighting in the sideband region. Figure 5 depicts the reweighting's interpolation into the signal region, where we introduce an additional baseline, SALAD-Switch, which uses k separate weight functions w_i(x, m) and switches among them in the reweighted loss function L_S. In all but the bottom-right subfigure in both figures, the reweighted simulation data poorly approximates the true background data. As a result, a classifier trained to distinguish between the high-variance reweighted simulation and the true background data plus some small anomaly will more likely learn the distinctions coming from the poor approximation rather than the anomaly. In particular, note that SALAD-Switch results in significant overweighting in Figure 5.
With these observations, we present the signal efficiency versus rejection rate of each method in Figure 6, where we compare Multi-SALAD against SALAD using simulation 1 only, SALAD using simulation 2 only, and SALAD-Switch. Table 1 contains the accuracy and AUC scores for each method. Averaged over 10 random seeds, Multi-SALAD outperforms the other methods. The signal efficiency versus rejection rate for each of the 10 runs is available in Appendix E.

Table 1. Accuracy and AUC scores (%) for Multi-SALAD on two simulation datasets. We compare to SALAD-Switch (different reweighting), as well as standard SALAD on individual simulations and no reweighting. Performance is averaged over 10 random runs with one standard deviation reported.

Conclusions and Outlook
We extend two resonant AD approaches to incorporate multiple reference datasets. For Multi-CWoLa, we draw from weak supervision models to handle multiple resonant features. For Multi-SALAD, we combine multiple simulation datasets to best approximate the background process. Future work includes 1) exploring Multi-SALAD's applicability on real data and algorithms for sampling from simulation datasets, 2) extending Multi-CWoLa to model more complex relationships among resonant features, and 3) using such approaches together over multiple simulations and resonant features, effectively utilizing as much information as possible.

(Algorithm 2, continued) 4: ... ŵ(x, m) = ĝ(x, m)/(1 − ĝ(x, m)), where ĝ is a classifier that distinguishes data D_SB from simulation D_sim_SB in the sideband region. 5: Train a new classifier ĥ on the signal region to distinguish between points in D_SR and points in D_sim_SR reweighted by ŵ, using the loss in (B.2). 6: Output: Classifier output ĥ(x, m), which yields a score that is thresholded for anomaly detection.
In expectation with an optimal w, we can see that minimizing this loss is equivalent to minimizing the cross-entropy loss on a task that distinguishes between points drawn from p and points drawn from p(·|y = 0) in the signal region. Therefore, ĥ can be used for anomaly detection. The procedure is summarized in Algorithm 2.

C.1 The Need for 3 Resonant Features
We show that to identify the model (3.3), we need at least k = 3 resonant features. Proof. The strategy we use to show that the model cannot be identified for k = 1 or k = 2 is to prove that the observable distributions P(M̄_1(m), . . ., M̄_k(m)) are consistent with multiple values of θ. We do so by direct calculation.
First, consider the case of k = 1. Set θ_y = 0 for simplicity. Then the model is

p(y, M_1(m); θ_1) = (1/Z) exp(θ_1 ȳ M̄_1(m)).

Then Z = 2 exp(θ_1) + 2 exp(−θ_1), and

P(M̄_1(m) = 1) = (exp(θ_1) + exp(−θ_1))/Z = 1/2.

Thus, any θ_1 value produces the same observable distribution, so we cannot identify θ_1.
Next, we consider k = 2. Again, set θ_y = 0. The model is now

p(y, M_1(m), M_2(m); θ_1, θ_2) = (1/Z) exp(θ_1 ȳ M̄_1(m) + θ_2 ȳ M̄_2(m)).

The observable distribution is now P(M̄_1(m), M̄_2(m)). We have

P(M̄_1 = 1, M̄_2 = 1) = P(M̄_1 = −1, M̄_2 = −1) = 2 cosh(θ_1 + θ_2)/Z

and

P(M̄_1 = 1, M̄_2 = −1) = P(M̄_1 = −1, M̄_2 = 1) = 2 cosh(θ_1 − θ_2)/Z,

with Z = 8 cosh(θ_1) cosh(θ_2). Note that cosh(θ_1 + θ_2), cosh(θ_1 − θ_2), and Z are all symmetric under swapping θ_1 and θ_2. As a result, we have the same distribution P(M̄_1(m), M̄_2(m)) for the parameters (θ_1, θ_2) = (a, b) and (θ_1, θ_2) = (b, a), where a, b are non-negative values. If a ≠ b, we end up with at least two solutions that cannot be distinguished, completing the proof.
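This symmetry argument can be checked numerically for concrete parameter values (a = 0.7 and b = 1.3 below are arbitrary):

```python
import numpy as np
from itertools import product

# Numerical check of the k=2 non-identifiability argument: the observable
# distribution P(M1, M2) is identical for (theta1, theta2) = (a, b) and (b, a).
def observable_dist(t1, t2):
    """P(M1, M2) under the Ising-style model with theta_y = 0."""
    Z = sum(np.exp(t1 * y * m1 + t2 * y * m2)
            for y, m1, m2 in product([-1, 1], repeat=3))
    probs = {}
    for m1, m2 in product([-1, 1], repeat=2):
        probs[(m1, m2)] = sum(np.exp(t1 * y * m1 + t2 * y * m2)
                              for y in [-1, 1]) / Z
    return probs

a, b = 0.7, 1.3   # two distinct non-negative parameter values
d_ab, d_ba = observable_dist(a, b), observable_dist(b, a)
assert all(np.isclose(d_ab[k], d_ba[k]) for k in d_ab)
```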

C.2 Rademacher Complexity Bounds
We present bounds on the Rademacher complexity R_n(F) of various models F. For all of the F below, we obtain R_n(ℓ ∘ F) by computing R_n(F). These two Rademacher complexities are equal when we assume that ℓ is 1-Lipschitz and apply Talagrand's lemma.
• Linear models: for F = {x ↦ ⟨w, x⟩ : ||w||_2 ≤ B} with ||x||_2 ≤ C, the standard bound is R_n(F) ≤ BC/√n.
• Two-layer feed-forward neural networks (MLPs): we define f_θ(x) where θ = (U, w) are the parameters for the weights of the two layers of an MLP. Here U ∈ ℝ^{m×d} and w ∈ ℝ^m, and we suppose ReLU is the activation function. Under norm constraints on U, w, and x, norm-based analyses give bounds of order O(1/√n), with constants depending on those norms.
• Kernels: let k : X × X → ℝ be a continuous symmetric function such that for x_1, . . ., x_n, the matrix given by K_ij = k(x_i, x_j) is positive semidefinite. For the class of kernel estimators with norm at most B, R_n(F) ≤ (2B/n) √(Σ^n_{i=1} k(x_i, x_i)). For particular kernels it is easy to bound the sum in the numerator; for example, the RBF kernel has maximum one, yielding R_n(F) ≤ 2B/√n.
Proof. Let n_SR^data be the number of points from D that belong to the signal region. Under our assumptions, the empirical loss function decomposes into two terms. As n_SR → ∞, the first term approaches its population counterpart, an expectation over P weighted by Pr(z = 1). For the second term, we can construct n_SR,0^data, the amount of data where x is from p(·|y = 0), to be equal to n_SR^sim, so that the expression asymptotically approaches −Pr(z = 0) times an expectation over P_sim reweighted by p(x, m|y=0)/p_sim(x, m|y=0). Putting this together, the limiting loss matches the population loss of the task that distinguishes p from p(·|y = 0) in the signal region. We can then apply standard learning theory bounds on L_C(f̂) − L_C(f*).

D Proofs
where we have used the fact that L̂_C(f̂) ≤ L̂_C(f*). Then, applying uniform convergence bounds, such as Theorem 3.3 of [44], to each term gives us our desired result.

E.2 Multi-SALAD Experiments
Setup. We use MLPs from Keras [46], each with 3 hidden layers of dimension 32 and ReLU activations, trained with the cross-entropy loss and the Adam optimizer. We train for 50 epochs with batch size 200 and default parameters otherwise. Finally, we evaluate our approach on a new test set containing 200,000 background points and 200,000 anomaly points. This test set is used to produce the signal efficiency versus rejection rate. All experiments were run on a personal laptop.

Additional Results
In Figure 7, we show our results on individual runs. We show individual runs because confidence intervals for these curves averaged across the 10 random runs are too noisy, due to the magnitude of the reciprocal 1/FPR.

Figure 1. Comparison between CWoLa and Multi-CWoLa. Using multiple mixed samples helps performance across a range of dataset sizes. Access to multiple weak sources enables better AUC and lower variance compared to the single-feature version.

Figure 5. Top left: SALAD reweighting using simulation 1 on the signal region. Top right: reweighting using simulation 2. Bottom left: using both simulation 1 and 2 weights separately. Bottom right: reweighting using simulations 1 and 2 combined.

Figure 6. Signal efficiency versus rejection rate of Multi-SALAD and other baselines (weighted and unweighted).

Algorithm 2 (Multi-SALAD).
1: Input: Simulation datasets D_sim_1, . . ., D_sim_k and real dataset D.
2: Construct the overall simulation dataset D_sim = ∪^k_{i=1} D_sim_i.
3: Split each dataset into signal and sideband regions using the resonant feature m to get {D_sim_SR, D_sim_SB} and {D_SR, D_SB}.
4: Learn the weight ŵ(x, m) = ĝ(x, m)/(1 − ĝ(x, m)).

Lemma 1 .
If k = 1 or k = 2 in model (3.3), the parameters θ_i cannot be recovered from the observable quantities.

D.1 Proof of Theorem 1
Proof. From Theorem 3 of [31], we have that L_C(f̂) − L_C(f*) is bounded by the traditional ERM generalization gap L_C(f̃) − L_C(f*), where f̃ = argmin_{f∈F} (1/n) Σ^n_{i=1} ℓ(f(x_i, m_i), y_i) is the classifier learned on labeled data, plus a term accounting for the use of the noisy labels ŷ.
(Continuation of Section 3.) ... D_SB_i and D_SR_i for each i ∈ [k], based on membership in I_{m_i}. A straightforward way to use all of the datasets (D_SB_1, D_SR_1), . . ., (D_SB_k, D_SR_k) is to apply standard CWoLa k times, training k classifiers that we can then ensemble or average. Instead, in Multi-CWoLa, we construct a binary vector per x consisting of k noisy membership labels, M(m) = [M_1(m), . . ., M_k(m)].