Cluster Scanning: a novel approach to resonance searches

: We propose a new model-independent method for new physics searches called Cluster Scanning. It uses the k-means algorithm to perform clustering in the space of low-level event or jet observables, and separates potentially anomalous clusters to construct a signal-enriched region. The invariant mass spectra in these two regions are then used to determine whether a resonant signal is present. A pseudo-analysis on the LHC Olympics dataset with a Z ′ resonance shows that Cluster Scanning outperforms the widely used 4-parameter functional background fitting procedures, reducing the number of signal events needed to reach a 3 σ significant access by a factor of 0.61. Emphasis is placed on the speed of the method, which allows the test statistic to be calibrated on synthetic data.

The Standard Model (SM) is the current apex of theoretical physics, describing the electromagnetic, weak and strong interactions with unparalleled precision.Unfortunately, it is still far from complete, as several phenomena remain unexplained.In order to create a "theory of everything", one would not only need to combine the SM with general relativity, but also provide an explanation for many other issues, including the existence of neutrino masses, the origin of the matter-antimatter asymmetry, and most importantly, the origin of dark matter.To solve these problems, researchers are collaborating to formalise new theories, design, build and carry out new experiments, as well as simulate and analyze research data.One of the most renowned experimental facilities, the Large Hadron Collider (LHC) was constructed with the purpose of testing the SM in the high energy regime.The last elementary particle predicted by the SM, the Higgs boson, was discovered in 2012 [1,2].Since then, LHC research has shifted towards precision measurements and searches for beyond the Standard Model (BSM) effects.
Many extensions of the SM imply the existence of as yet undiscovered massive particles, often associated with proposed new symmetry groups.If a new particle has a narrow decay width, the straightforward method is to search for a resonant peak in the spectrum of a mass-like observable, such as the invariant mass of a dijet event.However, such a bump hunt is not completely free of assumptions.Often complex analytical functions need to be chosen to model the background distribution, with the possibility to introduce spurious signals and varying sensitivity under the assumption of different functional forms.Furthermore, additional observables or fiducial cuts need to be chosen and optimised to enhance sensitivity in the case where potential signal yields are low, causing searches to become more model-specific.To broaden the range of signal models covered by searches one may employ the model-unspecific search steategies [3][4][5].
Over the past decade, machine learning-based algorithms have become increasingly popular for solving a multitude of problems.Deep learning, in particular, has gained popularity for various tasks, with large neural networks being utilised.For example, many methods were implemented to perform anomaly detection (AD) tasks in various industries.Some of these AD methods have been repurposed and extended to support BSM searches  (see Refs. [86][87][88][89] for a comparison of various ML assisted BSM methods and Refs.[90,91] for a comparison of weakly supervised and unsupervised approaches).The ATLAS collaboration produced the first experimental results for such searches applied to experimental data using weakly supervised methods [92] and unsupervised ML anomaly detection methods [80,93].However, these efforts have not observed any significant deviations from the SM expectation.
Many AD approaches rely on the assumption that any new signal would form a set of outliers.However, in a bump hunt the assumption is instead that any new signal would be localised in some feature space, in particular in an invariant mass spectrum.Weakly supervised approaches, on the other hand, aim to enhance the sensitivity by applying a cut on a classifier trained directly on the data.However, in both instances the same bump hunt restrictions apply with either functional forms or input observables impacting the sensitivity to a model.
In this work we introduce a new data-driven method, Cluster Scanning (CS), which builds on the foundations of the bump hunt but addresses several limitations.By leveraging more information from the event CS is able to enhance sensitivity to potential signals without enforcing any model specific assumptions, and can also provide a direct estimate of the background distribution.The proposed approach complements existing techniques and is designed to be computationally efficient.
The paper is structured as follows.In Section 2, we briefly describe the LHCO R&D dataset [94], commonly used to benchmark the performance of anomaly detection techniques, and introduce our data preprocessing steps.Section 3 touches on the general topic of bump-hunting strategies in the literature, introduces the novel CS method, and discusses similarities and differences between them.In Section 4 we provide the results of applying CS in an anomaly search.Finally, we draw conclusions in Section 5.

Dataset
The LHCO R&D dataset consists of one million background Standard Model dijet events (also subsequently referred to as QCD) and 100 000 signal BSM Z ′ → XY events, where massive particles with m X = 500 GeV and m Y = 100 GeV decay into quark-antiquark pairs.The resonance itself has a mass of m Z ′ = 3.5 GeV.This anomaly model is discussed in detail in Ref. [95].

Jet images
In addition to the di-jet invariant mass (m jj ) of the event, used in a bump hunt, we extract additional information from the image representations of the two jets.This allows for a more model agnostic approach than selecting specific jet substructure observables.The jet images are processed following a prescription similar to that used in Ref. [11,[103][104][105] from the η, ϕ and p T of the jet constituents.Individual jet images are centred, rotated, and flipped in order to provide a consistent input to a convolutional neural network, reducing the number of symmetries the ML method would need to learn.
The jet images are cropped to [−0.8, 0.8] × [−0.8, 0.8] in η − ϕ space relative to the jet centre, binned with a 40 × 40 pixel grid, and normalised such that the sum of all pixels is equal to one.Fig. 1 shows the average jet images for QCD background, and the separate averages of all lighter (mostly Y ) and heavier (mostly X) jets in each Z ′ event.
Despite being used in many applications, the jet image representation has two main drawbacks, namely the sparsity of non-zero pixels (see app.B) and the imbalance in the magnitudes of their intensities.This is particularly problematic for approaches that depend on the L 2 (Euclidean) distance.We address both of these problems with the solutions introduced in Refs.[63,106].
To take the soft constituents into account, which have intensities orders of magnitudes lower than hard constituents, we apply a non-linear scaling to all pixels of I ij → I γ ij .To address sparsity we convolve (smear) the whole image with a two-dimensional Gaussian kernel with an isotropic standard deviation σ k .We find that using a value of γ = 0.5 for the pixel scaling alongside σ k = 1 for the Gaussian kernel provides a adequate solution to both issues without excessive impact on the structure of the jets.3 Method

Bump hunt
The bump hunt approach is a standard method used to search for excesses over a nonresonant background in HEP (high-energy physics) data.This method usually follows four main steps that we briefly discuss below.Each of these steps is a complex topic in itself with several different approaches in the literature, thus for our study, we choose only simplified and basic approaches.

Signal enrichment
Signal enrichment, in general, refers to the selection of a subset of experimental data in such a manner that the fraction of signal events in it is increased compared to the initial sample.Most often, this is done by cutting out a region of the observable space where the signal is expected to be abundant compared to the rest of the space, typically using a theoretical model of the signal of interest.These approaches, despite being sensitive to specific signal processes, make the search less model-agnostic and are ill-suited for general anomaly detection searches.Alternatively, one can hope to define a signal-rich region of the experimental data using a plethora of unsupervised ML (machine learning) techniques, which are expected to provide enhanced sensitivity over a wider range of potential signal processes.
In our particular example of LHCO data, we choose to explore a wide, smoothly falling region of the spectrum of dijet events with invariant dijet mass m jj from 3000 GeV to 4600 GeV.We choose this lower bound to avoid the turn on curve of the mass distribution, resulting from the jet trigger, and the upper bound is selected to remain in a region with relatively high statistics, so that we work in the region where the fit functions from Subsection 3.1.2are applicable.This interval contains, in total, around 380,000 QCD events and nearly all Z ′ events.We divide this region into 16 non-intersecting bins with 100 GeV width each, as in Ref. [107,108].

Background estimation
To perform a hypothesis test, one must first postulate a null hypothesis, which in counting experiments takes form of the expected background coming from the Standard Model processes.Often the background prediction relies on a theoretical basis to calculate the cross sections of the hard process and a simulation to account for detector response and measurement uncertainties.Still there are a number of searches where theory and simulation cannot provide a reliable background estimate.In these cases the background has to be estimated from the data itself in an empiric manner, using some general assumptions.
In dijet-like searches a background is often estimated by fitting a function of the form to a smoothly falling part of the dijet mass distribution [109][110][111][112][113][114][115][116][117][118][119][120][121][122][123][124][125], where x = m jj / √ s.This function is referred to as the "n-parameter dijet fit function", where n is the number of nonzero free parameters p i used in the function.Despite being a good fit to the simulated data, this functional form is still an empirical assumption and thus is subjects to a systematic error.Furthermore, after applying some selection criteria on the events which could be correlated with m jj , this function may no longer well describe the resulting distribution.
More advanced methods of fitting, such as the Sliding Window Fit (SWIFT) [126] and the ABCD method used in [127,128] are other methods that reduce the assumption of a functional form but introduce their own assumptions instead.However, due to the simplicity and wide use of the n-parameter fit function, we choose to use global 3-parameter and 4-parameter function fits as the benchmark analysis strategy.Further details of the (pseudo-)analysis on the LHCO R&D data performed using these background estimates are given in Appendix A. To access the upper bound on the performance of all background estimation methods, we use the underlying background distribution as an idealised fit, i.e. a fit with no systematic error.The (pseudo-)analysis using this is also described in Appendix A.

Test Statistic definition and calibration
There are several ways to calculate a global test statistic for two spectra.In HEP one of the more popular tests in model agnostic searches, called BumpHunter [129], relies on the maximal local significance (MLS) as the test statistic, where it is computed using a range of different windows over the spectrum.One of the benefits of the MLS test statistic is its simplicity and that it is well suited for signals that give rise to narrow, localised resonances.Here the MLS is applied to the binned m jj distributions of the data.Given a set B = {b 1 , ..., b n bins } of non-intersecting bins with N sig+bkg,b events or jets from the signalrich (experimental) distribution and N bkg,b events or jets from the background estimation, the MLS can be written as where CDF is the cumulative density function of the respective distribution.In equation Although some test statistics, like χ 2 , have well-known distributions, other more unusual test statistics, like the BumpHunter test statistic, require calibration.This is commonly done by modelling its distribution using Monte Carlo simulation.
Moreover, as systematic uncertainties arise from the definition of a signal region selection and the background estimate, this calibration should be performed even in the case where the distribution is known a priori.The calibration for the BumpHunter test statistic is performed in Ref. [129] by running pseudo-experiments in which the counts in each bin are varied according to Poisson's law.This can be extended to higher dimensions by resampling the background events with bootstrapping.By calculating the test statistic for each of our bootstrapped background-only pseudo-experiments, we obtain the distribution of the test statistic in the background-only hypothesis.To ensure good modelling of the tail of the test statistic distribution, which corresponds to large significance values in the presence of signal, a large number of pseudo-experiments is required.

Significance evaluation
To obtain a calibrated p-value for a given value of the test statistic t, one counts the number of background only pseudo-experiments exceeding this value N >t and divides it by the total number of pseudo-experiments done, N tot .
The (one-sided) significance is computed using the inverse cumulative density function of the normal distribution Z = CDF −1 N (0,1) (1 − p-value).In the case of N >t = 0 arising from the limited number of pseudo-experiments, we instead set a lower bound: For every experiment with added signal events, we still bootstrap the background (for consistency) and combine it with a given number of signal events chosen at random from 100,000 signal events (around 5% of events fall outside of the evaluation region).Due to statistical fluctuations we also perform several pseudo-experiments in the signal enriched case in order to obtain a robust estimate of the significance for each level of signal doping.

Cluster scanning
In this section we present a novel approach called Cluster Scanning, which follows the same bump hunting scheme, but relies on a distinct set of assumptions than the commonly employed methods and thus has several favorable characteristics.Our approach can be divided into several key steps given below, with the hyperparameters chosen in order to search for narrow resonances in the m jj spectrum of the LHCO R&D data.The motivation for these hyperparameters in each step and the argumentation on how to choose them for a different application case is given in App. C.

Training region selection:
We select a narrow m jj window [3000, 3100] GeV for training of the k-means algorithm.This window contains 56,486 original background events.In this publication, we focus on relatively small signal injections that include only 5% or less of the total number of Z' signals available.Therefore the training region is expected to contain 89 signal events or less, which can be regarded as negligible (proof given in App.G).In App.G we show an improvement in performance in case the training region matches the resonant peak and thus has a larger portion of signals events involved in clustering.However, in an actual analysis the position of the peak will be unknown, thus we choose to discuss a more representative case given here, when the training region happens to be in the tail of the signal peak and thus has a negligible number of signal events.

K-means Clustering:
We apply a mini-batch k-means clustering algorithm with k = 50 implemented in the scikit-learn [130] library, with a batch size of 2048 on the set containing jet images of the leading two jets from each event in this m jj window.The minibatch implementation is chosen due to its computational speed.The seeding of the cluster centroids is performed using the k-means++ prescription described and motivated in Ref. [131].
Cluster Spectra: After performing the fit of k-centroids to the data in the training region, we fix the centroid positions and evaluate how many jet images from each of the 16 m jj bins of the evaluation region, defined in Subsection 3. Per bin standardisation: We note that in each bin the normalised cluster spectra follow an approximately normal distribution with several outliers from the anomalous clusters (see discussion in App.D).Therefore we standardise the normalised cluster spectra in each bin using outlier robust estimators (described in App.E) for mean and standard deviation with an outlier factor of 0.2.Here we make the assumption that the majority of the signal is located in a small number of clusters, and the rest of the clusters are signal depleted.Figure 3 shows the cluster spectra from Fig. 2 after normalising with the outlier robust estimator.
Selecting anomalous clusters: Utilising the assumption that the signal is localised in m jj , we select potentially signal-rich cluster spectra as those with a deviation of more than a threshold value of θ = 3 standard deviations from the robust mean in the positive direction as we are only interested in a resonance leading to excess of events.The rest Figure 2: The m jj distributions for the jets in each of the 50 clusters, each normalised to unity.Here, 5,000 signal events have been injected into the evaluation dataset, which corresponds to 5% of the total available signal events.Test statistic: As previously discussed, to test the significance of an observed excess we use the simple maximum local significance, as defined in Equation 3.2.It may occur that no cluster is selected as anomalous.In this case we assign a default value of 0, in order to show good agreement with the null-hypothesis expectation.This is similar in motivation to setting the value for an observed deficit in events to zero.Following the discussion in Subsection 3.1 for the calibration process, we construct 3,900 pseudo-experiments using bootstrap resampling on 1 million background events.The distribution of the test statistic is discussed in App.F.
Ensembling: Different initialisations lead to a broader distribution over the final test statistic obtained with cluster scanning.In order to obtain a final value for the test statistic, the cluster scanning method is performed 15 times with independent initialisations.The mean of the test statistic from all the runs forms the final ensembled test statistic.The distribution of this statistic is presented in app.F.

Discussion
As we can see, CS follows the general bump hunt strategy, but introduces novel approaches for the first two steps of this strategy.First of all, CS selects the most anomalous looking clusters to define the signal-enriched region, and constructs a background estimate from the rest of the clusters.Notably though, this selection is completely data-driven and does not target a specific family of signal models.However, CS relies on a set of assumptions that fundamentally differ from those commonly used in other anomaly detection approaches.
Search for overdensity instead of outliers: Most anomaly search methods like Autoencoders [106] and SVDDs [88] rely on outlier detection, namely, identifying the data instances that lie in a region of very low probability density or outside the support of the "normal" distribution.Notably, while all normal events share similar characteristics and exhibit easily recognisable trends, anomalous data, such as defects or fraud, can differ in numerous ways and are thus given a wide prior.Although model-agnostic searches should accommodate a wide range of possible anomaly models, it is usually assumed that a signal is produced by only one or a few unknown BSM process.Thus, all anomalous events have many features in common and exhibit some similarity to SM events, as any new particle must radiate and decay into SM particles to be detectable.
Therefore we use the localisation of anomalies in both low-level (e.g.jet images) and high-level variable (e.g.m jj ) space as the first main assumption of the CS method.Localisation of anomalies in low-level variable space means that only a few out of all clusters contain a fraction of anomalies much higher than the rest of the clusters.This way clustering plays a role of data-driven binning in low-level variable space.Localisation in m jj gives us a possibility to distinguish these anomaly rich clusters from the rest, namely, by searching for an overdensity in m jj in one cluster spectrum compared to all others.Thus, CS is able to select a signal-rich region of events by leveraging the assumption of signal being localised rather than consisting of outliers.
Although semi-supervised methods based on CWoLa (see Refs. [107,108,[132][133][134][135][136]) and density estimation methods are also sensitive to overdensities, they usually require construction of a background template, which until recent developments [135,136] was preferably constructed for a smooth distribution of low dimensionality, typically using a few high-level observables.In this publication, we show that CS is able to draw significant improvement from a high-dimensional distribution of low-level jet observables.In this way, it can be considered less signal-model dependent than the methods that rely on handcrafted high-level observables.
Assume cluster mass independence instead of smoothness: CS proposes a solution to the second step of the analysis, namely, it estimates the form of the background by combining the signal-depleted clusters.In this way, we do not rely on any assumptions on smoothness or on a particular functional form of the background-only spectrum in m jj , which are heavily relied upon by most other bump hunt methods, such as the global functional fit mentioned in Subsection 3.1, SWIFT [126] and even Gaussian processes [137].
Instead, the second main assumption of CS is that in the background-only case, the assigned cluster centroid is approximately independent from m jj .Ideally, we would want the distribution of background events over m jj in each cluster to be identical within statistical uncertainty, such that the probability of a jet belonging to a cluster and having a specific mass factorises, p(i, m jj ) = p(i)p(m jj ), or at least that the correlation is weak.This would minimise the rate of incorrectly identified signal-enriched clusters.In practice, although Fig. 2 shows that the distributions all follow a similar trend, there are still some systematic deviations.These are a result of the finite width in m jj of the training window and slight correlations between the distribution of the jet constituents and m jj arising from the transverse momenta of the two non-resonant jets depending on m jj .Therefore, for the selection of the clusters, we estimate the uncertainty separately for each experiment and for bin based on the sample of our k cluster spectra values.This uncertainty estimate includes both, statistical Poisson fluctuations and systematical uncertainty from mass dependence (see discussion in App.D).
Unlike in methods with sliding window approach [108,126], in CS the fit only needs to be performed once 1 .Moreover k-means clustering is a simple classical algorithm that typically requires less training than deep learning approaches, making CS a relatively fast analysis method.This is important in the context of the ensembling and calibration, which both require a large number of analysis iterations, and are thus a notable obstacle to incorporating deep learning in HEP analysis under the constraint in computing resources.Fast analysis is also advantageous for testing its efficiency for simulated BSM events in order to produce the exclusion limits (see the RECAST [138] framework).Moreover, CS avoids other disadvantages inherent to sliding window approaches, such as limited search range due to the definition of the sidebands and the need to optimise sideband and signal window widths.

Idealised CS
Despite choosing a narrow m jj window to reduce mass dependence systematics, the variables that we use for clustering are in general not independent of m jj .Thus, we observe the background-only spectra of some clusters do not just statistically fluctuate around the expected shape of the background, but exhibit some degree of smooth mass sculpting.This affects the performance of the method by introducing false positives at the cluster selection stage.We expect that this may be partially remedied by a more sophisticated method of selecting anomalous clusters or a better background estimate, both of which would rely on further assumptions.These studies are outside the scope of this publication.However, to give an upper bound on the performance one may achieve with such improvements we propose an idealised version of clusters scanning.
Idealised CS version requires us to modify the distribution of the jets between the clusters.First, we count the numbers of jets that fall into each cluster in the first m jj bin.If no mass dependence were present, the fractions of QCD jets in each cluster x QCD,i,b = N QCD,i,b / i N QCD,i,b should be independent of bin number b within statistical uncertainties.To simulate this case in all the consecutive bins except the first we distribute the QCD jets in these bins among clusters using a multinomial distribution with weights equal to the fractions obtained in the first bin x QCD,i,b = x QCD,i,1 , thus generating cluster spectra that follow the original background spectrum with statistical fluctuations, i.e. the case with no mass dependence.The signal jets are distributed as before according to which cluster they belong to, such that the fractions of Z ′ jets may differ between different bins.This is done because we assume that only the background is distributed roughly proportionally between clusters, which is equivalent to assumption 2, but not the signal.
This distribution of jets creates idealised cluster spectra for each clustering, and the rest of the algorithm remains unchanged.

Results
As a proof of concept we perform an analysis applying CS and global fit based bump-hunting with the above mentioned hyperparameters to the LHCO R&D dataset with different amounts of signal injection, given in figures either as an absolute number of injected events ϵ or as a signal to background ratio S/B of events in the considered [3000, 4600] Gev m jj region.For each pseudo-experiment with signal injection we calculate the significance Z as discussed in Subsection 3.1.4using the calibration test statistic distribution.For each signal injection level we run 100 pseudo-experiments with bootstrapped background data and randomly sampled signal events.As a reference for the significance and its statistical variation for each contamination level, we report the median significance of these pseudo-experiments and 0.25 and 0.75 quantiles.We define the ratio between the median significance provided by CS to the significance of a baseline method as the significance improvement (SI).We also quote the relation between the number of events needed to obtain a 3σ evidence in each analysis strategy.
Figure 5a shows how the global significance of CS and the parametric fit-based methods depends on the signal contamination in the pseudo-experiment.It characterises the performance of these realistic analysis strategies, which do not use any truth information for the evaluation of the test statistic, thus including all the systematical uncertainties coming from partially fulfilled assumptions needed for the respective method.
We observe that although 3-and 4-parameter fits give approximately the same results, CS outperforms them by a significant margin in the region from 1500 to 4000 signal events.Beyond 3000 signal events, the significance yield from CS is limited by the number of bootstrap pseudo-experiments in the calibration set, but the lower bound on its significance still remains substantially higher than the significance of the parametric fits.This is the most interesting region as there the transition between non-significant signal (below 1σ)  The dotted lines mark lower bounds, as there was not enough statistics to access higher significance levels.Bottom figure: Significance improvement of the CS method compared to the 4-parameter fit for the (a) and the significance improvement of idealised CS compared to idealised fit for the (b) as a function of the signal-to-background ratio S/B. and new physics evidence (above 3σ) takes place.Looking at the lower subplot in Fig. 5a we see that CS gives us an SI of 2 and higher on the majority of regions of interest.We can also see that CS produces a 3σ evidence for only 61% of the events needed to obtain this evidence with the parametric fit.This shows that although both suffer from fit and assumption induced systematic uncertainties, CS has a clear advantage over parametric fitting procedures.
Above we have also described the idealised version of the cluster scanning method and in Appendix A the analysis with an idealised background fit.Both methods rely on event labels to remove systematic uncertainties introduced by the limitations of the assumptions of our methods and to make the background estimate in both cases close to the true background, with only statistical fluctuation taken in consideration.This is done to separate the influence of additional information, namely the low-level observables used in the analysis, from the systematic uncertainties introduced by the fits, and to construct the upper bound on the performance of our methods.
Figure 5b shows how the global significance of both idealised methods depend on the signal contamination in the pseudo-experiment.It can be seen that one needs substantially less signal for it to be significant in the idealised methods compared to the realistic methods, as it removes false-positives induced by systematic uncertainties in the fits.Still, we see that the idealised CS outperforms the idealised functional fit on the majority of the interval between 1000 and 2200 signal events.Looking at the lower subplot in Fig. 5b we see that by using CS in the region of interest we gain a significance improvement factor of up to 1.5.We can also see that CS produces a 3σ evidence with only 69% of the events needed to gain this evidence with the idealised fit.This shows that in the case of negligible systematic uncertainties, CS gives an improvement over any smooth fit as, in addition to just using information from m jj , it also makes use of the low-level event information.From the difference between idealised and non-idealised CS we can see that there is some room for improvement of CS to reduce the false positive rate, and improve the analysis efficiency.

Conclusions and outlook
This paper is a first proof of concept for the cluster scanning anomaly search method, which is designed to search for resonant overdensities on the distribution of an observable using clustering techniques in auxiliary observables.
We found that it outperforms the widely used bump-hunting method, which relies on the functional background fits, in several metrics relevant to an analysis.In the transition region, where the benchmark algorithm achieves 1σ to 3σ significance, CS improves the result by a factor of 2 or more for the realistic case, or by a factor of 1.5 for the idealised comparison.This reduces the number of signal events required to produce a 3σ significance by a factor of 0.61 in the realistic case and by a factor of 0.69 in the idealised case.The former factor of improvement should be expected in a real application.We also discuss the comparison of cluster scanning with other anomaly detection algorithms in Subsection 3.3, outlining its advantages and limitations.
The CS method should not be seen as a direct competitor to background fitting methods, but rather as a complementary approach that relies on a different set of assumptions about the nature of the anomaly and the background distributions, which are not well known.
There remains a large unexplored field of potential extensions and improvements to this method or synergies with other methods.Straightforward follow-up studies can explore the use of clustering methods other than k-means.One can look for other ways of selecting the anomalous clusters, alternatives to the one proposed in Subsection 3.2, that would rely on different assumptions.For example, one can require that all anomalous clusters are neighbors in the space of clustered inputs.One can also unify the assumptions of CS and functional fits to produce separate background estimates for each of the clusters separately, greatly reducing the m jj dependent systematic uncertainties.
CS could benefit from using features developed by other algorithms that have already been optimised for other tasks, such as flavour tagging, or even using unsupervised learning for feature extraction.
Furthermore, since many other ML approaches to improve sensitivity in model-independent searches rely on a bump hunt for the final statistical analysis, CS could also be used to further enhance sensitivity.This could be of particular interest when the background distribution is no longer well described by simple, smoothly decreasing functional forms.thereafter estimate the significance.Depending on the number of doped signal events, the median and quartile region significance given by this test is provided in the main text in Fig. 5b.
Unfortunately the background model is usually unknown, so for each experimental sample the background should be estimated in some less precise way relying on weaker assumptions.
As a realistic benchmark to our method we explore how sensitive the analysis is using global n-parameter functional to the kind of signal presented in LHCO R&D dataset.We use the binning with 16 bins defined in subsection 3.1.2and count the number of background events in each bin to get an original background spectrum N bkg,orig,b .
For all the fits in this studies, we use Trust Region Reflective nonlinear least squares fitting algorithm implementation from Scipy python package [139].The chosen bins generally contain more than 5000 counts, so the Poisson distributions of these counts can be well approximated by a Gaussian distributions with the variances equal to the bin counts.Using variances to scale the summands in the least squares objective we make it equivalent to the maximum likelihood objective for this setup.
First, we fit our 3-and 4-parameter functions to the spectrum to see if the fit is valid.
Resulting fits with 13 and 12 degrees of freedom score Unlike the CS method that doesn't generally rely on the smoothness of the background, global n-parameter takes it as the main assumption, so as N bkg,orig,b already has some statistical fluctuations a distribution resampled from it will have even larger statistical fluctuations than the ones expected for Poisson distribution.To simulate the proper scale Poisson fluctuations in the chosen region for our pseudo-experiments we resample events not from N bkg,orig,b but from the best possible fit.This also negates the systematic error from null-hypothesis not corresponding to the empirical functional form, so these experiments can be viewed as semi-idealised.In a more realistic cases, the space of functions given by all possible parameter values, does not contain the true form of null-hypothesis distribution and can only yield an approximation of it with limited precision.It is usual for fit functions with a small number of parameters, but with increasing number of parameters the function fit problem becomes over-defined and the function can fit the signal bump as well.Experimentally we have observed only insignificant increase in performance when comparing sampling from N bkg,orig,b or from the best fit distributions.On top of the resampled background events we add a number of signal events from signal's original distribution when needed.
The initial parameters of the fit in each experiment are chosen to be equal to the optimal parameters of the initial fit discussed above, so that one gets an "idealised" background fit if no optimisation is done.However, because of the statistical fluctuations and/or added signal contamination, the maximisation of likelihood results in a different set of parameters for this functional form.This error of background mismodeling under its statistical fluctuations and addition of the signal is exactly the type of error we want to demonstrate with this pseudo-analysis.
The results of such analysis for different signal contamination is given in Fig. 5a.We can see that the 3-parameter fit provides a slightly better result than 4-parameter fit as the latter has more flexibility to overfit the signal and the statistical fluctuations.This is so because the samples are drawn from 3-and 4-parameter functions with fixed parameters themselves.If we were to sample from other distribution the error coming from mismatch in true end expected functional forms may switch this ordering but it will reduce both performances.Therefore, the curves shown in Fig. 5a are upper limits of these realistic n-parameter fit analyses achievable only when the true distribution is described by one of the functions in the chosen parameterised space.

B Sparsity of the jet images
Fig. 6 show that the jet images are very sparsely populated ususally having less tha 100 non-zero pixels per 1600 pixels total.

C Hyperparameter selection and motivation
In this appendix we give motivation for every not yet discussed choice of hyperparameter in our pseudo-analysis.All the hyperparameter suggestions are done in an unsupervised way coming from general assumptions about signal and background and are not optimised using the truth information from LHCO R&D data.As such the levels of significance improvements may be further increased by performing a dedicated parameter scan for a specific application, however, we recommend to follow the same reasoning when applying CS in other analyses.
Training region: Training on the full spectrum would likely result in each cluster corresponding to a specific mass region, thus the background spectrum for each cluster would not be close to the original mass spectrum.Therefore we perform clustering in a narrow mass window [3000, 3100] GeV.We choose this window as it lies in the studied region defined in subsection 3.1.1and has the largest statistic of all other 100 GeV windows.

Number of clusters:
The most important parameter we had to choose is the number of clusters k.Two factors play the key role in this choice.On one hand, the number of clusters has to be as large as possible to better narrow down the anomaly-rich region.On the other hand, for a given number of events in the evaluation region and the binning of this region one has to take the number of clusters sufficiently small so that the least populated clusters in the smallest m jj bin min i,m jj (N i (m jj )) still has enough statistics for a meaningful statistical analysis.We assume that min k,m jj 50) should be sufficient.Using a coarse search, we determine, that for our choice of binning and overall statistic at hand choosing k = 50 gives a good trade-off as it has a median of 55 events in smallest cluster-bin and it goes below 20 only 1 time in 1000 pseudo-experiment runs, as can be seen in Fig. 7.  Batch size: Scikit-learn [130] documentation states that the parallelisation is performed on all available N cores computing cores if the batch size is N cores • 256 or larger.We performed all computations with 8 core parallelisation, thus the natural choice of a batch size was 2048.It is also important to maintain the batch size much larger than the number of clusters to ensure faster convergence.
Outlier fraction: To quantify the performance of our method we introduce the signal fraction improvement score (SFI) that characterises a subset S of the events in evaluation set E by the relative increase in the signal to background ratio Our main assumption is that the signal is distributed in clusters unevenly and there only several clusters have a significantly large SFI.To put it in numbers, we assume that not more than 20% of clusters have SFI of 2 or more.Following this assumption we choose the outlier fraction of 0.2 for outlier robust estimators.This is an ad.hoc prior assumption about the data at hand, and it has to be made prior to analysis and has no way to be validated without knowing the truth lables.Still we can show that this assumption is satisfied in our case with a margin for the pseudo-experiment shown on all the figures of section 3. 5000 signal events were giving an overdensity on the original spectrum that was not identifiable as a deviation from smooth background by human eye (without knowing the background truth), but in Fig. 2 one can easily notice two spectra with a significant bump around 3.5 TeV that stand out of the crowd of other spectra.Unsurprisingly these two spectra have SFIs of 9.1 and 8.9.Three more clusters also have a visible overdensity at this position possessing SFIs of 6.3, 5.6, 4.4.In total, exactly 8 clusters have SF I > 2. Still as we will see later only 3 of these clusters have a signal significant enough to be selected as anomalous, showing that our assumption is quite conservative in its limit and either the threshold SF I can be increased or the percentage of clusters to path the threshold reduced for it to still remain a valid assumption.Runs of the analysis on other (pseudo-)experiments behave in the similar manner.
Cluster selection threshold θ: First of all, we use the threshold only for positive deviations as we only search for excesses of events.Apart from the signal-rich outlier clusters the threshold can be passed by signal poor clusters, but only with an expected false positive rate of 1 − (1 − p-value N (0,1) (θ)) n bins .Then for large enough thresholds the average number of false positives can be estimated as k•n bins •p-value N (0,1) .Higher thresholds result in lower false positive and lower true positive rates.To retain the sensitivity for statistically small signal we choose to use θ = 3 that will result in approximately 50 • 16 • 0.00135 = 1.08 signal poor cluster being assigned a false positive label on average.Fig. 3 shows 4 clusters being chosen using this threshold.Three of them have an overdensity at 3.5 TeV and one does not, implying that it is a likely false posive.
Ensemble size: We recommend to take the ensemble size as high as possible, for given computation resource constrains to reduce the width of the test statistic distribution (see appendix F).

D Distribution of cluster scanning bin entries
The assumption on the Gaussianity of cluster spectra in each bin can be shown to be valid by robustly standardizing the cluster counts in each bin (see App. E) and checking if these distributions match N (0, 1).The upper part of Fig. 8 visually demonstrates that the distribution of cluster spectra in each bin in Fig. 3 matches the Gaussian model, except in bins 3.4 TeV < m jj < 3.5 TeV and 3.5 TeV < m jj < 3.6 TeV, where many outlier clusters are found due to the presence of signal.Fifty samples are usually not sufficient to determine whether the distribution is Gaussian or not, but by marginalizing (summing up) over 16 bins, we obtain 800 samples in total.The lower plot in Fig. 8 shows the said distribution.
If we consider only signal-poor clusters, the distribution fits the N (0, 1) distribution well visually and according to a consensus of three normality tests: Shapiro-Wilk, Kolmogorov-Smirnov, and Jarque-Bera (note that the p-values for all the tests are high).Including signal-rich clusters adds outliers, which is reflected in lower p-values for the normality tests; however, apart from these, it still can be well approximated by a unit Gaussian.
The variance among clusters in each bin depends on both statistical fluctuations and mass-dependent systematic effects.In the case of infinite statistics, all cluster spectra would appear smooth but would vary from one another due to differences in mass sculpting.In the absence of mass dependence, the variation in each bin would be caused solely by Poisson fluctuations (as mentioned in subsection 3.4, dedicated to idealised CS).In a realistic scenario, these two factors cannot be separated but can be jointly estimated using a robust standard deviation for each bin (see App. E). Figure 8: a histograms of cluster spectra for each bin in Fig. 3, depending on the deviation from the robust mean measured in robust standard deviations.Bottom: sum (marginal) of all the histograms at the top compared to expected bin counts of a Gaussian with a sum of 736 count (all the signal poor counts).

E Outlier robust estimators
While searching for outliers, it is preferred to use outlier robust estimators for standard deviation (SD) and mean.We define them as follows: given a sample of observations S = {x 1 , x 2 , ... x n } we find a median med(S) (which is itself an outlier robust estimator) of this sample and take a subsample Sf that is constructed from S by discarding a fraction 0 < f < 1 of all samples that have largest absolute distance to this median.In this way we have discarded the outliers.After that we construct estimators μf = mean( Sf ) and σf = SD( Sf ) • g(f ).If S is a sample from N (µ, σ) it is obvious that with lim n→∞ μf = lim n→∞ mean(S) = µ.If one takes S from N (0, 1) and rescales x i → σx i , then both estimators transform as σf → σσ f and SD(S) → σSD(S) by definition, so both estimators σf and SD(S) are proportional to a true σ of the Gaussian distribution.
Both σf and SD(S) are independent of µ and there are no other parameters of the normal distribution for estimators to depend on, therefore for a family of Gaussian distribution estimators σf and SD(S) are proportional to each other by some constant factor g(f ) in the limit of infinite sample.In other words, adjusting numerically g(f ) = SD(N (0,1)) σf (N (0,1)) = 1 σf (N (0,1)) is sufficient to sattisfy lim n→∞ σf = lim n→∞ SD(S) = σ.So μf and σf are unbiased estimators of µ and σ of a normal distribution, although depending on f they are less efficient than usual non-robust mean and SD.
Fig. 9 shows us the cluster spectra from Fig. 2 with subtracted normalised original spectrum (which is only needed for better visualisation as this step has no effect on the standardisation).Fig. 9 also shows the conventional and the outlier robust estimations of mean and SD of the cluster spectra values in each bin.As expected for lower m jj the SD is higher as these deviations is partially caused by the Poisson fluctuations which are proportional to N i,b .We can also see the conventional estimators have a bump around 3.5 TeV that is induced by our outlier signal-rich clusters, while the robust estimators are unaffected by the outliers.

F Calibration distributions
The distribution of the test statistics given by CS without ensembling for all background only pseudo-experiments is shown in Fig. 10a as a histogram.We see that around 300 of those were assigned test statistic of 0 as they had no clusters selected as anomalous.Other cases where one or more anomalous clusters were selected form a smooth continuous distribution.
The median CS test statistic for 100 signal contaminated pseudo-experiments is represented in Fig. 10a by a vertical line, and the vertical band represent the region between the quartiles of such a test statistic sample.For each signal-doped pseudo-experiment we calculate significance as it is described in subsection 3.1.4.The median significance is quoted in the legend of the figure.
Fig. 10b shows the distribution that is analogous to the one in Fig. 10a, but with an ensemble of 15 runs of CS algorithm for each pseudo-experiment.We notice that the distribution in Fig. 10b is significantly narrower than in 10a which reduces the frequency of background only experiment having large test statistic, thus increasing the sensitivity to signal injection.An additional benefit is that the uncertainty region (between two quarterlies) for each signal doping have significantly decreased which is important for lower uncertainty in the analysis on experimental data on the excess significance or on the exclusion limits.
This motivates, that in general the ensemble size should be taken as large as reasonably possible.Our choice of ensemble size 15 together with the number of pseudo-experiments 3900 were dictated by the computing time and storage memory limits as the amount of full CS algorithm iterations is the product of those numbers (excluding the pseudo-experiments with signal injection) Finally, Fig. 10c shows that for idealised CS without systematics introduced by mass correlations the MLS between our signal spectrum and background estimate is lower.Moreover, as expected, it improved the sensitivity of the method to the signal.Obviously this technique cannot be utilised in an actual analysis as jet labels are needed to distribute signal and background jets in a different manner.

G Impact of signal in the training region
First of all, we retrain all the signal-contaminated experiments such that we do not include the O(100) or fewer signal events while performing k-means fit.We still include the correct number of signal events when evaluating.The performance of this version of CS  is demonstrated with the curve labeled as "ignore signal" in Fig. 11.It is evident that the two versions have a statistically insignificant difference.We conclude that indeed the original CS version well describes a general and realistic case of having negligible to no signal contamination in the training region.For the next experiment, we train clustering in the region with the highest signal event fraction, namely [3450,3550] GeV.We run background-only and signal-contaminated pseudo-experiments in this region with all other hyperparameters equal to the values used in main studies.In the case of a non-negligible signal fraction in the training region, the cluster centroids will be attracted to the regions of signal event concentration.We observe this effect, as the resulting signal clusters have a much higher signal fraction as the signal events "pull" the corresponding cluster centers closer, leading to a large increase in the performance of the CS method that is visible in Fig. 11.Although in the general case the position of the signal-rich region is unknown, these studies prove that Fig. 5a and Fig. 5b only show the lower bound for the discovery potential of CS.

14 A 15 B Sparsity of the jet images 17 C Hyperparameter selection and motivation 17 D
Idealised fit and n-parameter fit pseudo-analysis Distribution of cluster scanning bin entries20 1 Introduction

Figure 1 :
Figure 1: From left to right: average of all 2M available QCD jet images, average image of all 100K lighter jets in a Z ′ event and average image of all 100K heavier jets in a Z ′ event before smearing and pixel scaling.
1.1 (it includes training region as one of the bins), fall into each of the k clusters N i,b where i ∈ {1...k}, b ∈ {1...n bins }.Fig. 2 shows the resulting 50 normalised cluster spectra N i,b / b (N i,b ) for one pseudo-experiment with signal injection.

Figure 3 :N
Figure 3: Spectra in Fig. 2 standardised over clusters in each bin.Potentially signal-rich cluster spectra are shown in red.

Figure 4 :
Figure 4: Curves corresponding to the sum of signal-rich and signal-poor spectra in Fig. 3.The blue signal-poor curve is rescaled to have the same total jet number as the signal-rich curve.The coloured region around the blue curve is a σ bkg,b = N bkg,b Poisson deviation after recaling used to compute MLS.

Figure 5 :
Figure 5: Upper figure: Median and the quartile bounds of global significance of the signal contaminated pseudo-experiments as a function of the number of signal events ϵ injected shown for the (a) realistic and (b) idealised analysis methods.The dotted lines mark lower bounds, as there was not enough statistics to access higher significance levels.Bottom figure: Significance improvement of the CS method compared to the 4-parameter fit for the (a) and the significance improvement of idealised CS compared to idealised fit for the (b) as a function of the signal-to-background ratio S/B.
n dof ≈ 1.338 that correspond to p-values of 0.275 and 0.182 which signify validity of these fits.

Figure 6 :
Figure 6: Distribution of images in QCD and top datasets from top-tagging task vs the number of non-0 pixels in them.

Figure 7 :
Figure 7: Distribution of the number of jet image counts in the least populated bin of the least populated cluster in each of the 1000 random background only runs of the CS algorithm on backround only data.

Figure 9 :
Figure 9: Normalised spectra with subtracted normalised original m jj spectrum.Amount of signal is 5000.The selection of the anomalous clusters is taken from Fig. 4.

Figure 10 :
Figure 10: Histogram of the CS test statistic for pseudo-experiments with bootstrapped background only samples.Vertical lines and vertical bands show median and region between lower and higher quartiles of test statistics for pseudo-experiments with signal injection.Several signal injection levels are represented by different colours.Panel (a) shows a case with only 1 initialisation of clusters in CS per pseudo-experiment, panel (b) shows a case for ensembling 15 runs of CS with different intialisations per pseudo-experiment and panel (c) shows a case for ensembling 15 runs of idealised CS with different initialisations per pseudo-experiment.

Figure 11 :
Figure 11: Analogous to Fig. 5a, with two curves added for the experiments of training without signal and for training in the most signal-rich region of [3450, 3550] GeV.
3.2 only overdensities are taken into account, i.e.Z b > 0 only for N sig+bkg,b > N bkg,b as we are searching for a resonance.For bins with N sig+bkg/bkg,b ≫ 1 one can approximate the Poisson distribution with a normal distribution N (N sig+bkg/bkg,b , N sig+bkg/bkg,b ).

Table 1 :
Summary of the hyperparameters used in cluster scanning.