Improving new physics searches with diffusion models for event observables and jet constituents

We introduce a new technique called Drapes to enhance the sensitivity in searches for new physics at the LHC. By training diffusion models on side-band data, we show how background templates for the signal region can be generated either directly from noise, or by partially applying the diffusion process to existing data. In the partial diffusion case, data can be drawn from side-band regions, with the inverse diffusion performed for new target conditional values, or from the signal region, preserving the distribution over the conditional property that defines the signal region. We apply this technique to the hunt for resonances using the LHCO di-jet dataset, and achieve state-of-the-art performance for background template generation using high level input features. We also show how Drapes can be applied to low level inputs with jet constituents, reducing the model dependence on the choice of input observables. Using jet constituents we can further improve sensitivity to the signal process, but observe a loss in performance where the signal significance before applying any selection is below 4$\sigma$.

Introduction
The standard model of particle physics is one of the most successful scientific theories ever, accurately describing the interactions of elementary particles for three of the fundamental forces of nature in a quantum field theory. However, it is known to be incomplete as it cannot describe gravity, provide an explanation for dark matter, or address the (anti)matter imbalance in the universe, amongst other observed phenomena. At the energy and intensity frontier, experiments at the Large Hadron Collider [1] at CERN, such as the ATLAS and CMS experiments [2,3], hunt for new physics phenomena which could provide an insight into physics beyond the standard model (BSM).
Dedicated analyses optimised for a wide variety of BSM models have been performed, but as yet no significant evidence of new physics has been uncovered [4][5][6][7][8][9]. Recently, focus has turned to searches with broader sensitivity in the context of anomaly detection. Instead of aiming for high sensitivity to a small subset of models, these approaches target sensitivity to a much wider spectrum of possible new physics scenarios, at the expense of sensitivity to any one process [10,11].
A cornerstone of these model agnostic searches is the bump hunt, which identifies localised excesses of data above a falling background, typically as a resonance in an invariant mass. One of the drawbacks to the standard bump hunt is that the sensitivity is limited to the invariant mass after applying a selection to define a signal enriched sample. To enhance the sensitivity, machine learning approaches are studied, in particular weakly supervised classifiers (CWoLa) trained to separate signal region data from a reference background sample [12,13]. For this approach it is important that the reference background sample closely matches the true background data in the signal region, otherwise the classifier can learn to separate events based on the background mismodelling, rather than differences between signal and background data. Several machine learning methods have been developed for this utilising normalizing flows or multidimensional sample reweighting [14][15][16][17][18][19][20]. These methods, however, only form a subset of the array of anomaly detection techniques being developed in high energy physics.
In this work we introduce a new method, Drapes (Denoising resonant anomalies by perturbing existing samples), which leverages state-of-the-art diffusion models to generate reference background samples in a fully data driven approach for resonant anomaly searches. After training Drapes on side-band data, reference background samples can be generated in the signal region either by sampling from noise, or by partially applying the forward and reverse diffusion processes on existing data drawn from either the side-bands, the signal region, or another reference sample. We apply Drapes to the LHCO R&D dataset [120] and demonstrate state-of-the-art performance in training CWoLa classifiers for a range of signal fractions. Due to the flexibility of diffusion models, we can apply Drapes to both high level features and low level jet constituents.
This work represents one of the first applications of diffusion models for the task of sample generation in weakly supervised searches for new physics. Concurrent with the development of this work a similar approach was developed, applying diffusion models to di-jet generation for weakly supervised searches using low level jet constituents [114]. However, in addition to low level generation, our work studies the application to high level variables as well as the first application of partial diffusion in high energy physics. Other previous work applying diffusion models for anomaly detection has seen a diffusion model trained as an unsupervised tagger for identifying anomalous jets [121].

Dataset
The LHCO R&D dataset [120] has become a standard candle of performance for resonant anomaly detection approaches. It comprises background data from di-jet events produced in QCD interactions, and signal data from the all-hadronic decay of a massive resonant BSM particle into two lighter BSM particles. The masses of the three BSM particles in the signal events are $m_{W'} = 3.5$ TeV, $m_X = 500$ GeV, and $m_Y = 100$ GeV.
Signal and background events are generated with Pythia 8.219 [122] with detector simulation performed with Delphes 3.4.1 [123]. Large radius jets, with a radius parameter R = 1.0, are reconstructed using the anti-$k_t$ clustering algorithm [124] using the FastJet [125] package. There are in total 1 million background and 100,000 signal events.
The LHCO Black Box 2 sample, which also comprises 1 million events, is used as an alternative background sample. The generation chain follows the same prescription, except Herwig++ [126] is used to generate the events, with a slightly modified detector parametrisation.
Events are required to have exactly two jets, where at least one jet has $p_T^J > 1.2$ TeV, and the jets are ordered by decreasing mass. In order to compare Drapes to other methods, we use the same high level input features as in Refs. [17,19,20,127].
Furthermore, with diffusion models it is possible to operate on input sets, such as the constituent particles within a jet, with permutation equivariance. As such, the model can be trained without enforcing any order on the constituents within the jet. Therefore, in addition to the default feature set, we also consider the leading 128 constituents in each jet. They are represented by their transverse momentum as a fraction of the jet transverse momentum and their coordinates in η-ϕ space relative to the central axis of the jet, $(p_T / p_T^J, \Delta\eta, \Delta\phi)$.
The background data follow a falling spectrum in $m_{JJ}$ above 2.7 TeV, with signal events localised as a resonance in $m_{JJ}$ with a peak at 3.5 TeV (corresponding to $m_{W'}$), and mostly contained within a signal window $m_{JJ} \in [3300, 3700)$ GeV, which we define as our signal region (SR). To improve the statistical accuracy when evaluating model performance, additional background events falling within the SR are also available for the high level input features [128].
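The constituent representation described above can be sketched in a few lines. This is a minimal illustration, not the authors' preprocessing: the function name, the tuple layout, the choice to rank constituents by transverse momentum, and the wrapping of Δϕ into [-π, π) are all assumptions made for the example.

```python
import math

def relative_constituents(jet, constituents, max_constituents=128):
    """Map constituents to (pT / pT_jet, d_eta, d_phi) relative to the jet axis.

    jet: (pt, eta, phi) of the jet; constituents: list of (pt, eta, phi).
    Hypothetical helper: "leading" is assumed to mean highest pT.
    """
    jet_pt, jet_eta, jet_phi = jet
    # Keep only the leading constituents, ranked by transverse momentum.
    leading = sorted(constituents, key=lambda c: c[0], reverse=True)
    leading = leading[:max_constituents]
    out = []
    for pt, eta, phi in leading:
        # Wrap the azimuthal difference into [-pi, pi) (assumed convention).
        dphi = (phi - jet_phi + math.pi) % (2.0 * math.pi) - math.pi
        out.append((pt / jet_pt, eta - jet_eta, dphi))
    return out

# Toy usage with made-up numbers.
example = relative_constituents((100.0, 0.5, 3.0),
                                [(40.0, 0.4, 2.9), (60.0, 0.6, -3.1)])
```

Because each constituent is mapped independently, the output is a set in the sense used above: reordering the input list only reorders the output.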

Method
In Drapes, we train diffusion models to learn a conditional generative model of the data from side-bands (SB) surrounding our signal region. The model is conditioned on $m_{JJ}$ and, under the assumption that the side-bands contain only background data, can be assumed to be a generative model for the background process across the full $m_{JJ}$ range, including the SR. The Drapes models are trained on either the standard high level variables or the jet constituents of the two jets in the event.
For a nominal choice of side-bands we consider the region $m_{JJ} \in [2800, 6000)$ GeV, excluding the signal region. Separate Drapes models are trained for a wide range of injected signal levels.

Diffusion model
We follow the 'EDM' noise scheduler and network pre-conditioning introduced in Ref. [98]. Here, the probability flow ordinary differential equation defining the reverse diffusion process is given by $\mathrm{d}x_t = -t \nabla_x \log p(x; t)\,\mathrm{d}t$, where $p(x; t)$ is the distribution over the data at time $t$, and $t = \sigma \in [0, 80)$, with σ referred to as the noise strength.
For high level input variables we employ a simple ResNet [129] architecture comprising four hidden layers, each with 64 nodes, for the diffusion network. The noise strength σ is processed with a cosine embedding layer before being concatenated alongside $m_{JJ}$ as the conditional vector.
Skip and scaling connections, $c_\mathrm{in}(\sigma)$, $c_\mathrm{out}(\sigma)$ and $c_\mathrm{skip}(\sigma)$, are used in conjunction with the network to form the denoiser output. The resulting training objective is simply the distance between the network output and the true target, namely the original, noise-free data.
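For orientation, the standard EDM pre-conditioning of Ref. [98] is sketched below. The symbols $F_\theta$ (raw network), $D_\theta$ (denoiser), $\lambda(\sigma)$ (loss weight) and $\sigma_\mathrm{data}$ (data standard deviation) follow that reference and are assumptions here, as is the way the conditioning on $m_{JJ}$ enters; the exact expressions used for Drapes are not recoverable from this text.

```latex
D_\theta(x; \sigma, m_{JJ}) = c_\mathrm{skip}(\sigma)\, x
  + c_\mathrm{out}(\sigma)\, F_\theta\!\left(c_\mathrm{in}(\sigma)\, x;\ \sigma, m_{JJ}\right),
\qquad
c_\mathrm{skip} = \frac{\sigma_\mathrm{data}^2}{\sigma^2 + \sigma_\mathrm{data}^2},\quad
c_\mathrm{out} = \frac{\sigma\, \sigma_\mathrm{data}}{\sqrt{\sigma^2 + \sigma_\mathrm{data}^2}},\quad
c_\mathrm{in} = \frac{1}{\sqrt{\sigma^2 + \sigma_\mathrm{data}^2}}

\mathcal{L} = \mathbb{E}_{x_0,\, \sigma,\, n \sim \mathcal{N}(0, \sigma^2 I)}
  \left[ \lambda(\sigma)\, \left\lVert D_\theta(x_0 + n; \sigma, m_{JJ}) - x_0 \right\rVert_2^2 \right]
```

In this form the skip term dominates at small σ, where the input is already close to the clean data, and the network term dominates at large σ.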

Background sample generation
Considering a single Drapes model, there are several options which can be used to generate reference background samples by performing the reverse diffusion process.
It is possible to use Drapes as a purely conditional generative model, generating new data for target values of $m_{JJ}$. To generate new samples, noise is drawn from a multidimensional Gaussian distribution, and the reverse diffusion process is performed starting at σ = 80. We refer to this method as Drapes ∅. This generation method can be considered a diffusion analogue to the Cathode approach [15], where samples are generated for target $m_{JJ}$ values with a normalizing flow starting from noise.
As with all diffusion models, the solver plays a key role in the fidelity and generation speed of samples. Here we find performance is saturated with 50 integration steps, with no discernible difference between the Heun and Euler ordinary differential equation solvers. Nonetheless, to ensure that the choice of solver plays a subleading role, we choose to use the Heun 2nd order solver with 200 integration steps starting from σ = 80 when generating all samples.
However, in addition to generating from pure noise, it is also possible to generate a reference background template by taking existing samples and using them as the inputs for partial diffusion.
In partial diffusion, the forward diffusion process is performed by adding noise to the samples corresponding to a target noise strength σ′. Then, the reverse diffusion process is performed for a target value $m_{JJ}$, starting at σ = σ′. In this case, fewer passes are required to recover the output samples, as the diffusion process is starting partway along the trajectory. The number of diffusion steps $N$ and σ′ are related through the noise schedule, where $N_\mathrm{max} = 200$ is the total number of reverse diffusion steps and $\sigma_\mathrm{max} = 80$ is the corresponding noise strength during training. Here $\sigma_\mathrm{min} = 10^{-5}$ and $\rho = 7$ are hyperparameters of the diffusion model.
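To illustrate how a starting noise strength σ′ shortens the reverse process, the sketch below builds the standard EDM noise schedule of Ref. [98] with the hyperparameters quoted above and keeps only the noise levels at or below σ′. Counting steps this way is an assumption made for illustration; the exact relation between N and σ′ used in this work is not reproduced here, and both function names are hypothetical.

```python
def karras_sigmas(n_steps=200, sigma_min=1e-5, sigma_max=80.0, rho=7.0):
    """Noise levels of the EDM schedule, ordered from sigma_max to sigma_min."""
    inv = 1.0 / rho
    hi, lo = sigma_max ** inv, sigma_min ** inv
    return [(hi + i / (n_steps - 1) * (lo - hi)) ** rho for i in range(n_steps)]

def partial_schedule(sigma_prime, **kwargs):
    """Subset of the full schedule with noise strength at or below sigma_prime.

    Assumed convention: partial diffusion re-uses the tail of the full schedule.
    """
    return [s for s in karras_sigmas(**kwargs) if s <= sigma_prime]

full = karras_sigmas()
tail = partial_schedule(2.24)
print(len(full), len(tail))  # the tail is shorter than the full schedule
```

Since ρ = 7 concentrates the schedule at small noise strengths, a modest σ′ still retains a sizeable fraction of the integration steps.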
We consider three different approaches for partial diffusion, which differ based on the source of the initial data.
Drapes SR: Here, partial diffusion is applied to the data in the signal region with a target noise strength σ′. They are subsequently denoised using the reverse diffusion process using the original $m_{JJ}$ values.
Drapes SB: Here, data are drawn from the side-bands and partial diffusion is applied for a target $m_{JJ}$ value which lies in the SR, rather than their initial $m_{JJ}$ value. This case most closely resembles the Curtains method which uses normalizing flows [17,18] to transform side-band data to target values of $m_{JJ}$ within the signal region.
Drapes MC: Here, partial diffusion is applied to data in the signal region which are drawn from another reference sample, for example Monte Carlo simulation. Here we use the alternative background sample from the LHC Olympics. This case most closely resembles the Feta approach [19] which uses normalizing flows to transform signal region data from another sample to the signal region of the nominal sample.
For the case of Drapes ∅ and Drapes SB, we sample target values of $m_{JJ}$ from a four-parameter polynomial in $z = m_{JJ} / \sqrt{s}$. The parameters $p_i$ are extracted by performing a fit to the side-band data.
In all cases it is possible to oversample the reference background sample by generating more data than are present in the SR. For Drapes SR and Drapes MC this is done by sampling different noise in the forward diffusion process multiple times for the same events. For Drapes ∅ and Drapes SB, both the noise in the forward diffusion process and target $m_{JJ}$ values are sampled independently multiple times for each event.
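The partial diffusion procedure described above can be illustrated on a one-dimensional toy problem. In this sketch the trained network is replaced by the exact score of a Gaussian "background" distribution, there is no conditioning on $m_{JJ}$, and all names are hypothetical; it only shows the mechanics of noising to σ′ and integrating the probability flow ODE back to σ = 0 with a Heun solver.

```python
import math
import random

DATA_STD = 1.0  # toy background: x ~ N(0, DATA_STD^2)

def score(x, sigma):
    """Exact score of the noised toy density N(0, DATA_STD^2 + sigma^2)."""
    return -x / (DATA_STD ** 2 + sigma ** 2)

def dx_dsigma(x, sigma):
    """Probability flow ODE with t = sigma: dx/dsigma = -sigma * score."""
    return -sigma * score(x, sigma)

def sigmas_from(sigma_start, n_steps, sigma_min=1e-5, rho=7.0):
    """EDM-style noise levels from sigma_start down to sigma_min, then 0."""
    inv = 1.0 / rho
    hi, lo = sigma_start ** inv, sigma_min ** inv
    levels = [(hi + i / (n_steps - 1) * (lo - hi)) ** rho for i in range(n_steps)]
    return levels + [0.0]

def heun_denoise(x, sigmas):
    """Integrate the ODE along the given noise levels with Heun's method."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        d_cur = dx_dsigma(x, s_cur)
        x_euler = x + (s_next - s_cur) * d_cur
        if s_next > 0.0:
            # Second-order correction using the slope at the Euler estimate.
            d_next = dx_dsigma(x_euler, s_next)
            x = x + (s_next - s_cur) * 0.5 * (d_cur + d_next)
        else:
            x = x_euler  # final step to sigma = 0 uses plain Euler
    return x

def partial_diffusion(x0, sigma_prime, n_steps, rng):
    """Forward-noise x0 to strength sigma_prime, then denoise back to sigma = 0."""
    noisy = x0 + sigma_prime * rng.gauss(0.0, 1.0)
    return heun_denoise(noisy, sigmas_from(sigma_prime, n_steps))

rng = random.Random(0)
inputs = [rng.gauss(0.0, DATA_STD) for _ in range(2000)]
# Oversampling: each input event is re-noised independently several times.
outputs = [partial_diffusion(x, 2.24, 30, rng) for x in inputs for _ in range(3)]
```

In this toy the outputs follow the same distribution as the inputs for any σ′, while individual events move further from their starting point as σ′ grows, which mirrors the trade-off discussed for the choice of σ′.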

Drapes for jet constituents
Due to the flexibility of diffusion models in comparison to normalizing flows, Drapes can easily be adapted to any type of input. To apply Drapes to low level jet constituents, we generate the constituents of each jet in isolation using the same diffusion model. The diffusion model builds upon the architecture and prescription used in Ref. [109] with the denoising architecture built from a series of transformer encoder (TE) blocks [130], without a further optimisation of the hyperparameters.
The jet four-momentum and number of jet constituents, $p_4 = (p_T, \eta, \phi, m_j, N_\mathrm{const})$, are provided as conditional information. However, in order to obtain the conditional information $p_4$ for each jet, we require a second diffusion model conditional on $m_{JJ}$. This network is identical to the Drapes architecture used for the high level features. The two networks comprising the low level Drapes model are shown in Fig. 2.
In background QCD events we do not expect significant correlations between the substructure of the two jets, which are produced as either two gluons, two quarks, or one of each, predominantly via t-channel. However, for any resonant signal process the two daughter particles will be related and their substructure will be strongly correlated. By applying Drapes to each jet in isolation, in the case where there are signal events in the side-bands the substructure of signal jets could be learned, but the correlations between the substructure of the two jets would not be. Alternatively, if Drapes were applied to the constituents in both jets simultaneously, the correlations between the substructure of signal jets would be learned. As a result, reference templates for the signal region generated by considering each jet independently should contain fewer signal-like events, where both jets have signal-like substructure, resulting in better sensitivity than when the two jets are not independent.
Even in the case where there are no signal events present in the side-bands, applying Drapes to each jet in isolation should still be beneficial for partial diffusion. After the addition of noise and applying independent denoising processes to each jet, the correlations between the substructure of two signal jets in an event would be decreased.
Figure caption: The network is conditioned on $m_{JJ}$ and σ. To learn the jet constituents as a function of $p_4$, the constituents of each jet are passed through the jet diffusion network $H_\psi$ after being perturbed with noise with strength σ. The network is conditioned on $p_4$ and σ. In both networks skip connections modify the input and output of the network, and partial diffusion can also be employed.

Weakly-supervised classifiers
To test the performance of the various approaches, we train CWoLa classifiers to separate the signal region data from the generated templates. The SR data can either be purely background events, or contain a controlled number of signal events, which can also contaminate the side-bands.
As the classifiers themselves are not the focus of this work, we use the same hyperparameters and architecture used in Refs. [17,18] without further optimisation for the high level features, despite promising performance improvements observed with decision trees [95,131] or for mitigating sculpting after applying cuts [20]. This enables easier direct comparison with CurtainsF4F. A fivefold cross validation approach is employed to train all classifiers, including the fully supervised configuration.
Following the CWoLa classifiers for the high level features, we also keep the architecture small for the low level approach with Drapes applied to generate jet constituents. We use a classifier which is permutation invariant with respect to both the jet constituents and the two jets. To achieve this we use transformer encoders operating on the jet constituents with self-attention [130] layers.
The output is then pooled with a learnable class token cross-attention [132] layer, weighting the contribution from each constituent; this ensures permutation invariance over the jet constituents.
The same transformer is applied to the constituents of both jets, with the outputs combined in a summation in order to ensure permutation invariance between the two jets. The resulting vector is subsequently processed with a small multi-layer perceptron.

Comparison with prior work
As previously discussed, various modes of sample generation with Drapes are analogous to existing approaches but using diffusion models rather than normalizing flows. During the development of this work, Ref. [114] applied flow matching and diffusion models to the task of dijet generation in a signal region. This approach is equivalent to Drapes ∅ applied to the low level jet constituents; however, this work differentiates itself in several key regards with respect to Ref. [114].
With Drapes, we introduce and study the use of partial diffusion as a new means of generating samples, rather than generating only from noise as with Drapes ∅. Furthermore, we demonstrate that diffusion models can outperform normalizing flow based approaches on the same features, and that there are advantages beyond the more flexible choice of architecture that enables jet constituent generation. We compare the performance of Drapes applied to high and low level variables across a wide range of injected signal events, and make comparisons for the case of partial diffusion for both sets of inputs.
In order to measure how well Drapes generates reference background samples, we focus on a single signal region defined as $m_{JJ} \in [3300, 3700)$ GeV with various levels of signal contamination. The main measure of performance of the subsequent classifier is the significance improvement, given by $\epsilon_S / \sqrt{\epsilon_B}$, where $\epsilon_S$ and $\epsilon_B$ are the signal and background efficiencies respectively. Similar to $\epsilon_B$ in a receiver operator characteristic (ROC) curve, we calculate the significance improvement across the full range of background rejection values ($1/\epsilon_B$).
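The significance improvement defined above can be computed directly from classifier scores. The sketch below is one possible implementation under stated assumptions: higher scores are taken to be more signal-like, the threshold is chosen from the background scores to reach the requested rejection, and the function name and inputs are hypothetical.

```python
import math

def significance_improvement(sig_scores, bkg_scores, bkg_rejection):
    """Return eps_S / sqrt(eps_B) at a threshold giving eps_B close to 1 / bkg_rejection."""
    # Number of background events allowed to pass the cut (at least one).
    n_keep = max(1, int(len(bkg_scores) / bkg_rejection))
    threshold = sorted(bkg_scores, reverse=True)[n_keep - 1]
    eps_b = sum(s >= threshold for s in bkg_scores) / len(bkg_scores)
    eps_s = sum(s >= threshold for s in sig_scores) / len(sig_scores)
    return eps_s / math.sqrt(eps_b)

# Toy scores: uniformly spread background, two signal events.
bkg = [i / 1000 for i in range(1000)]
sig = [0.95, 0.5]
sic = significance_improvement(sig, bkg, bkg_rejection=10)
```

A classifier with no separation power has equal signal and background efficiencies, so its significance improvement is $\sqrt{\epsilon_B}$ and falls below one as the cut tightens.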
We compare the performance of Drapes in several scenarios to a CurtainsF4F model trained using the full side-band width, which was shown in Ref. [18] to deliver state-of-the-art performance.
For reference we also show the performance of a supervised classifier trained to separate the signal process from the QCD background on an independent set of data, and an idealised CWoLa classifier. In the idealised classifier, simulated QCD background events are used as the reference background sample, with oversampling equal to the amount used in Drapes and CurtainsF4F (Over-Idealised).

Simple diffusion
As it is the simplest approach, we first look at the performance of Drapes ∅, which is shown in Fig. 4. We see that Drapes ∅ reaches state-of-the-art performance where 3,000 signal events have been injected. However, we note here that both Drapes ∅ and CurtainsF4F reach saturation in significance improvement at high levels of background rejection, achieving the same level of performance as the supervised classifier.
In Fig. 5 we look at the performance of Drapes ∅ as the number of signal events present in the data changes. In comparison to CurtainsF4F we see that Drapes ∅ is able to remain more sensitive at lower S/B. However, Drapes ∅ drops off in performance below injected significances of 1σ at background rejections of $10^3$, where fewer than 350 signal events are present in the signal region. This is just before the Over-Idealised classifier indicates the limit of the CWoLa weakly supervised approach. The same trend is observed for a background rejection factor of $5 \times 10^3$.
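As a rough cross-check of the numbers quoted above, and assuming the injected significance is defined as $S/\sqrt{B}$ in the signal region before any cut, the statement that 1σ corresponds to roughly 350 signal events implies a background yield in the signal region of about:

```latex
\frac{S}{\sqrt{B}} = 1, \quad S \approx 350
\;\Longrightarrow\;
B \approx 350^2 \approx 1.2 \times 10^{5}
```

This background count is inferred from the two quoted values rather than stated in the text.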

Partial diffusion
When using Drapes with partial diffusion, it is critical that the generated reference distributions still match the true background distribution. For example, when starting from side-band data with a value of σ′ too close to zero, the overall distributions will not change and a CWoLa classifier will identify the differences between the two distributions arising from the correlations to $m_{JJ}$ instead of differences between the signal present in one class, and the approximation of the background data in the other. Similarly, when starting from the signal region data themselves, if σ′ is too small, the samples will be statistically identical and the classifier will not be able to separate them, leading to a classifier unable to separate signal events from the background.
We [...] classifier would instead focus on this mismodelling of the background.
Figure caption: The injected significance, S/√B, measured from the number of signal and background events in the signal region before applying a cut, is also shown. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.
In addition, we compare the performance of Drapes ∅ with partial diffusion, starting with data drawn from noise but with fewer denoising steps and the same values of σ′. The AUC scores for all four Drapes methods are compared in Fig. 6. We can see that all three methods starting from reference data, Drapes SR, Drapes SB, and Drapes MC, result in reference samples much closer to the true background distribution than when starting from noise with Drapes ∅. This shows that there is a benefit to starting with some data rather than just pure noise, but only up to a point, as after some value of σ′ the input data are almost indistinguishable from noise. By construction, Drapes SR is closest to the true background distribution, as the signal region data themselves are used as the starting samples.
For Drapes SR and Drapes SB, σ′ values above 0.62 result in reference templates with AUC values less than 0.55, which are generally close enough to result in reasonable CWoLa classifiers in the presence of signal.
In Fig. 7 the performance of the partial diffusion approaches is compared as we vary σ′ in the case of 1,000 and 3,000 injected signal events. As expected, in the case of fewer signal events the overall performance is more sensitive to the modelling of the background reference template. For 1,000 signal events we see that both Drapes SR and Drapes SB maintain competitive performance for values of σ′ down to 0.6, where the AUC scores are also below 0.55; for lower values, however, the performance degrades. Interestingly, Drapes MC has lower overall performance and requires more denoising steps to reach competitive performance. Drapes SR and Drapes SB have similar performance for 1,000 injected signal events, though Drapes SR can maintain performance with far fewer steps in the presence of more signal in the SR.
Figure caption: By construction, all methods converge to the same outputs as σ′ increases, and at a value of σ′ = 80 all methods are equivalent when using the same $m_{JJ}$ values. An AUC of 0.505 is observed at the convergence of the methods. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty, which is negligible at this scale.
To avoid optimising σ′ separately for each amount of signal and each subsequent study, we choose a single value for σ′ using the AUC in the no signal case shown in Fig. 6. Here we see that the agreement of the partial diffusion approaches with true background data saturates at around σ′ = 2.24. This also corresponds to where the performance has saturated for all three methods in the case of 1,000 signal events in Fig. 7. It should be noted that this may not be the optimal choice for all methods at low signal levels.
In Fig. 8 the significance performance at fixed background rejection rates is compared to Drapes ∅ for a range of signal values. Here we see that Drapes SR comes closest to the overall performance of Drapes ∅, with all methods becoming more similar as the injected S/√B increases, though Drapes ∅ performs best for all values. However, Drapes SR and Drapes SB remain competitive, with fewer than half the number of denoising steps used by Drapes ∅.

Drapes with jet constituents
When using low level inputs (LLV) we focus on the two best Drapes generation approaches, Drapes ∅ and Drapes SR. We compare the performance to Drapes trained on high level variables (HLV) and a supervised classifier trained on the same low level inputs.
Fig. 9 shows the performance for the case of 3,000 injected signal events. We can see that Drapes trained on low level inputs substantially outperforms the high level features. Here Drapes ∅ again results in overall better performance compared to partial diffusion, with Drapes SR suffering a non-negligible performance reduction not observed for HLV. This is most notable in the ROC curve, where at high signal efficiencies Drapes SR LLV has significantly lower background rejection.
For Drapes SR a value of σ′ = 8.9 is used for LLV, with σ′ = 0.62 used for HLV as before.
The motivation for this can be seen in Fig. 10, where Drapes SR has little to no sensitivity below σ′ = 2. This suggests that the partial diffusion approaches have less direct benefit when applied to LLV in comparison to HLV. However, this could be strongly dependent on how Drapes SR is applied when using the jet constituents. In this work, the jet kinematics used to condition the Drapes model are unmodified from the signal region data, and partial diffusion is only applied to the jet constituents. Another option, which has not been studied in this work, would be to perform partial diffusion on both $p_4$ and the jet constituents. This could improve the sensitivity of the partial diffusion method, and would be necessary for using partial diffusion with Drapes SB.
The performance of Drapes ∅ (LLV) as the amount of signal present changes is shown in Fig. 11. Here we can see that although the performance achieved by Drapes ∅ LLV substantially outperforms HLV where more signal is present, the performance drops off below 2,000 injected signal events. Drapes ∅ trained on HLV outperforms the jet constituent approach in the region of interest where the injected significance is lower than 3. This observation is consistent with what is observed in Ref. [114]. This drop in performance, however, can be expected and is reproducible in the idealised setting. CWoLa classifiers based on neural networks have been observed to lose sensitivity as the dimensionality of the input increases, especially with inputs which are not sensitive to signal processes [20,95,131]. This effect is amplified as the number of signal events in the training data decreases. In moving from five high level variables to jet constituents with Drapes, the dimensionality of the input to the weakly supervised neural networks has increased by more than a factor of 150.
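The quoted factor can be checked with a short count, assuming the low level classifier sees only the three features of each of the leading 128 constituents in both jets, compared against five high level inputs:

```latex
\underbrace{2}_{\text{jets}} \times \underbrace{128}_{\text{constituents}} \times
\underbrace{3}_{(p_T/p_T^J,\ \Delta\eta,\ \Delta\phi)} = 768,
\qquad
\frac{768}{5} \approx 154 > 150
```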

Conclusion
In this work we have introduced Drapes, a new approach for generating background reference samples for use in resonant anomaly searches, using diffusion models with deterministic samplers. This approach can be used on both high level features as well as low level jet constituents.
For high level features, state-of-the-art performance is achieved on the LHCO R&D dataset when using Drapes as a purely data-driven generative model. Even better performance could be achieved by combining the generated background templates with boosted decision trees, rather than neural networks, for the CWoLa classifier.
Furthermore, we demonstrate an advantage of the flexibility of diffusion models by applying Drapes to the generation of low level jet constituents. This approach is more model agnostic than using high level variables, as it does not impose a choice of observables which may be sensitive to new physics processes. Substantial improvements are observed for many levels of signal events. However, for initial levels of signal significance below 4σ the performance degrades and is outperformed by the high level features.
In addition to pure template generation, we demonstrate the first application of partial diffusion in high energy physics. Partial diffusion is a promising approach to generate new samples from an initial reference dataset; however, on this dataset it does not reach the same level of performance as Drapes ∅. This is likely due to the nature of the application in high energy physics. In other fields, anomalies can be characterised as single-sample out-of-distribution events. In this case, distance based metrics between the original sample and the partially diffused sample can be used as an anomaly score. However, in the case of resonant anomalies, we are searching for over-densities in feature space, localised in one observable ($m_{JJ}$). In this case, what is important is that we compare the differences in densities of our datasets. With partial diffusion, if the initial samples are all in distribution, performing the forward and reverse diffusion process will preserve the density of the initial samples. This approach could still have merit when combined with an anomaly metric other than the classifier based approaches studied here, for example by exploiting the displacement of samples in the signal region with Drapes SB, or by assigning labels during training to side-band and signal region data which can be reversed at inference.
The impact of SB width on the performance of Drapes and CurtainsF4F is studied for two narrow side-bands in comparison to the nominal maximum width. Fig. 12 shows the performance for the case where there are 3,000 signal events in the data. Here we see that all Drapes methods are more sensitive to the side-band width than CurtainsF4F, demonstrating similar behaviour to Cathode observed in Ref. [18]. However, Drapes SR retains slightly more of its performance at high background rejection values.

B.1 Impact of oversampling background template
In Fig. 13 the impact of oversampling the background reference samples with Drapes ∅ is studied for 1,000 and 3,000 injected signal events. We observe that a saturation of performance is achieved at a factor of four for 3,000 signal events; however, in the case of 1,000 signal events more training data are required to saturate the performance. Here saturation is achieved at a factor of 12. For reference, the over-idealised curve has an oversampling factor of five.

B.2 Oversampling signal region data
With Drapes we observe that oversampling the background template is able to bring substantial improvement to the overall performance of the CWoLa classifier. This is something which has also been observed and exploited in many other approaches [15,17-20,127], and also brings performance gains in the idealised case. However, this oversampling is only performed on one class in the classifier, with the real signal region data untouched.
Figure caption: All methods are trained on the sample with 1,000 injected signal events (left) and 3,000 injected signal events (right), and a signal region $3300 \leq m_{JJ} < 3700$ GeV. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.
Another observation is that, even in the idealised case, the performance degrades as the number of signal events present in the signal region decreases. By training a Drapes model on the data in the signal region, we can learn the conditional density for the combination of both signal and background events as a function of $m_{JJ}$. Using this model, GenSR, we can sample multiple events for each value of $m_{JJ}$ in the signal region and test whether this is able to improve the sensitivity of the CWoLa classifiers in the case of few signal events. The CWoLa classifiers are subsequently trained to separate this oversampled signal region data from the oversampled Drapes background reference samples. Using generative models to inflate training statistics has been observed to improve the performance relative to limited available statistics [133,134]. However, to our knowledge this is the first test of this observation in the context of weakly supervised learning.
To study this we compare the nominal Drapes ∅ against a setup in which a second Drapes ∅ model has been trained inclusive of the signal region, and is used to oversample the signal region data. The SB data are also included when training the new model, labelled Drapes GenSR, to minimise the chance of the CWoLa classifier only identifying differences in the background modelling arising from different training statistics. Unfortunately, in Fig. 14 we find that the classifiers trained using Drapes GenSR perform worse than when using the real signal region data. This is especially the case for low amounts of injected signal, and suggests the Drapes model is not able to sufficiently capture the distribution over signal events when the training dataset comprises mostly background data. At higher amounts of signal, the performance only approaches the upper limit already reached with Drapes ∅.

C.1 Partial diffusion
When using partial diffusion, noise is added to initial events, which are subsequently denoised. The mean absolute shift between the input data and the denoised output data is shown as a function of $\sigma'$ in Fig. 15. In other domains, this displacement can be used as a single-event classification score, with out-of-distribution inputs often resulting in larger displacements. However, as can be seen for $m_{j_1}$, the signal events have a smaller displacement for all values of $\sigma'$.
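The displacement measure can be illustrated with a one-dimensional toy, given here as a sketch under strong assumptions rather than the procedure used for Fig. 15. The trained diffusion network is replaced by the closed-form probability-flow solution for a Gaussian data density $\mathcal{N}(\mu, s^2)$ under variance-exploding noise, for which the noised density is $\mathcal{N}(\mu, s^2 + \sigma'^2)$ and the deterministic reverse map rescales the distance to $\mu$ by $s/\sqrt{s^2 + \sigma'^2}$. The function names and the toy parameters `mu` and `s` are hypothetical.

```python
import numpy as np


def partial_diffusion(x, sigma, rng, mu=0.0, s=1.0):
    """Noise x up to strength sigma, then denoise back to sigma = 0.

    Assumption: the data density is N(mu, s^2), so the reverse
    probability-flow map is known in closed form (no network needed).
    """
    x_noisy = x + sigma * rng.normal(size=x.shape)
    return mu + (x_noisy - mu) * s / np.sqrt(s**2 + sigma**2)


def mean_abs_shift(x, sigma, rng):
    """Mean absolute shift between inputs and partially diffused outputs."""
    return float(np.mean(np.abs(partial_diffusion(x, sigma, rng) - x)))


rng = np.random.default_rng(1)
background = rng.normal(0.0, 1.0, size=20_000)
shifts = {sigma: mean_abs_shift(background, sigma, rng) for sigma in (0.0, 0.5, 2.24)}
```

In this toy the shift vanishes when no noise is added and grows with the noise strength, since more of the information in the initial event is discarded before denoising.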
In Fig. 16 distributions for signal and background data after applying the forward and inverse diffusion processes in partial diffusion are compared to the true signal and background distributions.
In all cases, background events result in generated data following the same ground truth distribution.
However, in the case of signal events, the resulting distributions become more background-like as $\sigma'$ increases, although even at $\sigma' = 2.24$ differences remain between the data generated using signal as a starting point and the true background distribution.
The significance improvement and receiver operator characteristic curves comparing Drapes ∅ to the partial diffusion methods are shown in Fig. 17. Here a value of $\sigma' = 2.24$ is used for all three partial diffusion methods. At 1,000 injected signal events, Drapes SB is able to reach the highest level of performance. However, for the same oversampling factor this results in more events in the background reference sample, as each side-band event is considered when generating the background template. When using the same number of raw training data, Drapes ∅ achieves even higher levels

C.2 Maximum significance improvement
In other studies, the maximum significance improvement is used as a measure of performance, despite its sensitivity to statistical fluctuations at very high background rejection. The maximum significance improvement as a function of the number of injected signal events for all Drapes methods is shown in Fig. 18. The maximum significance improvement as a function of $\sigma'$ for Drapes with partial diffusion is shown in Fig. 19.
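For concreteness, the significance improvement at a given threshold on the classifier score is the signal efficiency divided by the square root of the background efficiency, $\epsilon_S/\sqrt{\epsilon_B}$, and the background rejection is $1/\epsilon_B$. The sketch below computes both, together with the maximum over thresholds. The function names are hypothetical, and the `min_bkg_events` guard is an illustrative assumption, not part of this analysis, included to show how strongly the maximum can depend on the few background events that survive at very high rejection.

```python
import numpy as np


def significance_improvement(sig_scores, bkg_scores):
    """Return background rejection, significance improvement and the
    number of surviving background events for every score threshold."""
    sig_scores = np.asarray(sig_scores, dtype=float)
    bkg_scores = np.asarray(bkg_scores, dtype=float)
    thresholds = np.unique(np.concatenate([sig_scores, bkg_scores]))
    eps_s = np.array([(sig_scores >= t).mean() for t in thresholds])
    n_bkg = np.array([(bkg_scores >= t).sum() for t in thresholds])
    keep = n_bkg > 0  # the rejection is undefined once no background survives
    eps_b = n_bkg[keep] / len(bkg_scores)
    return 1.0 / eps_b, eps_s[keep] / np.sqrt(eps_b), n_bkg[keep]


def max_significance_improvement(sig_scores, bkg_scores, min_bkg_events=1):
    """Maximum significance improvement over all thresholds that keep at
    least `min_bkg_events` background events (hypothetical guard)."""
    _, sic, n_bkg = significance_improvement(sig_scores, bkg_scores)
    return float(sic[n_bkg >= min_bkg_events].max())


# Toy scores: a single background event is as signal-like as the signal.
sig = np.ones(4)
bkg = np.concatenate([np.zeros(99), np.ones(1)])
max_sic = max_significance_improvement(sig, bkg)
max_sic_stable = max_significance_improvement(sig, bkg, min_bkg_events=2)
```

In this toy the unguarded maximum is set entirely by a threshold at which one background event survives, whereas requiring two surviving background events returns the value at the loosest threshold.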


noised input data, $x_0$.
Figure 1: Schematic overview of the Drapes architecture during training when using high level variables. Data are drawn from the side-bands of the $m_{JJ}$ distribution, and their features are passed through the diffusion network $F_\theta$ after being perturbed with noise with strength $\sigma$. The network is conditioned on $m_{JJ}$ and $\sigma$. Skip connections modify the input and output of the network.

Figure 2: Schematic overview of the Drapes architecture during training when applied to jet constituents. The approach comprises two components, the di-jet kinematic model (left) and the per-jet model (right). Data are drawn from the side-bands of the $m_{JJ}$ distribution. To learn the di-jet kinematics as a function of $m_{JJ}$, the four-momentum and number of jet constituents for each jet p 4 are passed through the kinematic diffusion network $G_\phi$ after being perturbed with noise with strength $\sigma$. The network is conditioned on $m_{JJ}$ and $\sigma$. To learn the jet constituents as a function

Figure 3: Schematic overview of the weakly supervised classifier for the jet constituents. The same transformer network is applied to the constituents of both jets, with the outputs combined in a

we look at the performance of the CWoLa classifiers for two levels of signal injection. Where 1,000 signal events are present in the data, we see that Drapes ∅ achieves state-of-the-art performance as a purely generative model. Here, Drapes ∅ has higher significance improvement values than the Over-Idealised classifier only due to the different profile of the ROC curve. The Over-Idealised classifier has greater separation power, with an AUC of 0.89 in comparison to 0.85 with Drapes ∅.

Figure 4: Background rejection as a function of signal efficiency (left) and significance improvement as a function of background rejection (right) for Drapes ∅ (purple), CurtainsF4F (orange), Supervised (black), and Over-Idealised (green). All methods are trained on the sample with 1,000 (top) and 3,000 (bottom) injected signal events, and a signal region 3300 ≤ $m_{JJ}$ < 3700 GeV. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.

Figure 6: AUC values of classifiers trained to separate the Drapes reference samples from data in the case where no signal has been injected. The AUC is calculated as a function of the initial $\sigma'$ used to generate samples using Drapes SR (blue), Drapes MC (magenta) and Drapes SB (red).

Figure 7: Significance improvement as a function of noise rate for Drapes SR (red), Drapes SB (blue), and Drapes MC (magenta). All generation methods use the same diffusion model trained on the sample with 1,000 (top) and 3,000 (bottom) injected signal events, and a signal region 3300 ≤ $m_{JJ}$ < 3700 GeV. Reference values for Drapes ∅ (purple), Supervised (black) and Over-Idealised (green) are shown in addition. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.

Figure 8: Significance improvement at fixed background rejection rates of $10^3$ (left) and $5 \times 10^3$ (right) as a function of the number of signal events in the signal region, 3300 ≤ $m_{JJ}$ < 3700 GeV, for CurtainsF4F (orange), Drapes ∅ (purple), Drapes SR (red), Drapes SB (blue), Drapes MC (magenta), Supervised (black), and Over-Idealised (green). The injected significance measured from the number of signal and background events in the signal region before applying a cut, $S/\sqrt{B}$, is also shown. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty. A dashed line is shown at 1 to indicate where an improvement over the initial significance is achieved.

Figure 9: Background rejection as a function of signal efficiency (left) and significance improvement as a function of background rejection (right) for Drapes ∅ (purple) and Drapes SR (red). Models trained on the low level jet constituents (LLV) are shown as solid lines, and models using the high level variables (HLV) are shown as dashed lines. All methods are trained on the sample with 3,000 injected signal events, and a signal region 3300 ≤ $m_{JJ}$ < 3700 GeV. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty. A supervised classifier (black) trained on low level jet constituents is shown for reference.

Figure 10: Impact on performance of varying the level of partial diffusion $\sigma'$ in the case where there are no signal events (left) and 3,000 injected signal events (right). For the no signal case we show the AUC values of classifiers trained to separate the Drapes SR reference samples from true background data. For the case where signal events are present we show the significance improvement at a background rejection of 5,000. Models trained on the low level jet constituents (LLV) are shown as solid lines, and models using the high level variables (HLV) are shown as dashed lines. All methods are trained on the sample with 3,000 injected signal events, and a signal region 3300 ≤ $m_{JJ}$ < 3700 GeV. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty. For the significance improvement, a supervised classifier (black) trained on low level jet constituents is shown for reference.

Figure 11 :
Figure 11: Significance improvement at fixed background rejection rates of $10^3$ (left) and $5 \times 10^3$ (right) as a function of the number of signal events in the signal region, 3300 ≤ $m_{JJ}$ < 3700 GeV, for Drapes ∅ (purple) using low level jet constituents (LLV, solid line) or high level variables (HLV, dashed line). The injected significance measured from the number of signal and background events in the signal region before applying a cut ($S/\sqrt{B}$) is also shown. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty. A supervised classifier (black) trained on low level jet constituents is shown for reference.

Figure 12: Significance improvement as a function of background rejection for Drapes ∅ (purple), Drapes SR (blue), Drapes SB (red), and CurtainsF4F (orange), trained on side-band data for three different side-band widths. All methods are trained on the sample with 3,000 injected signal events, and a signal region 3300 ≤ $m_{JJ}$ < 3700 GeV. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.

Figure 13: Significance improvement as a function of background rejection for Drapes ∅ for different levels of background oversampling, compared to Over-Idealised (green) and Supervised (black).

Figure 14: Significance improvement as a function of background rejection for 1,000 (left) and 3,000 (right) injected signal events for the nominal Drapes ∅ approach (purple, solid) and when Drapes GenSR is used to oversample the signal region data (purple, dashed). The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.

Figure 15: Displacement of input jets in the signal region after adding then removing noise with various noise rates using Drapes SR. A noise rate of 0 represents the input data without any noise perturbation or diffusion steps, and a rate of 80 corresponds to the case where the inputs are pure noise. The same diffusion model, trained on the sample with 3,000 injected signal events, is used throughout.

Figure 16: Events generated with Drapes SR for increasing values of $\sigma'$ for $m_{j_1}$, $\tau_{21}^{j_1}$ and $\Delta R_{jj}$, shown separately for the case where initial events are from the signal (Gen Signal, brown) and background (Gen Background, blue) processes. The true signal (red) and background (green) distributions are shown for reference.

Figure 18: Maximum significance improvement as a function of the number of signal events in the signal region, 3300 ≤ $m_{JJ}$ < 3700 GeV, for CurtainsF4F (orange), Drapes ∅ (purple), Drapes SR (red), Drapes SB (blue), Drapes MC (magenta), Supervised (black), and Over-Idealised (green). The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.

Figure 19: Maximum significance improvement as a function of noise rate for Drapes SR (red), Drapes SB (blue), and Drapes MC (magenta). All generation methods use the same diffusion model trained on the sample with 1,000 (left) and 3,000 (right) injected signal events, and a signal region 3300 ≤ $m_{JJ}$ < 3700 GeV. CurtainsF4F (orange), Supervised (black) and Over-Idealised (green) are shown for reference. The lines show the mean value of fifty independent classifiers, with the shaded band representing a 68% uncertainty.
measure the performance of the partial diffusion generation as we change $\sigma'$ by training Drapes on a sample with no signal present. By generating background reference templates for a wide range of $\sigma'$ we can then test how closely they match the true background by training a classifier to separate them. If the area under the ROC curve of the classifier is close to 0.5, then the reference samples are good approximations of the background. However, if the values are too large, a CWoLa