Detecting New Physics as Novelty -- Complementarity Matters

Novelty detection is a task of machine learning that aims at detecting novel events without prior knowledge. In particular, its techniques can be applied to detect unexpected signals from new phenomena at colliders. In this paper, we develop an analysis scheme that exploits the complementarity, originally studied in Ref.~\cite{Hajer:2018kqm}, between isolation-based and clustering-based novelty evaluators. This approach can significantly improve the performance and overall applicability of novelty detection at colliders, which we demonstrate using a variety of two-dimensional Gaussian samples mimicking collider events. As a further proof of principle, we subsequently apply this scheme to the detection of two significantly different signals at the LHC featuring a $t\bar{t}\gamma\gamma$ final state: $t\bar t h$, giving a narrow resonance in the diphoton mass spectrum, and gravity-mediated supersymmetry, which results in broad distributions at high transverse momentum. Compared to existing dedicated searches at the LHC, the sensitivities for both signals are found to be encouraging.


Introduction
Particle physics has a long history of applying Machine Learning (ML) techniques in data analyses [3], where the complexity of event topologies at colliders poses great challenges to traditional wisdom. For example, neural networks were applied to the top quark search by the DØ Collaboration in the 1990s [4,5], long before the deep neural network (DNN) was popularized by hardware development and big-data availability about fifteen years ago [6]. Another example is the Boosted Decision Tree (BDT) [7]. Although it was first introduced by the MiniBooNE Collaboration for the analysis of neutrino data in 2004 [8], BDTs are nowadays extensively used in the analysis of collider data. For both methods, the algorithms are first trained on labeled data, i.e. in a supervised way, and then applied to classify testing data into the categories defined during training. The usage of these ML techniques in the analysis of collider events often results in significant improvements, assisting, e.g., in the discovery of the Higgs boson at the Large Hadron Collider (LHC) in 2012 [9,10], and in many other new-phenomena searches and Standard Model (SM) measurements.
The triumph of the Higgs-boson discovery propelled the ambition for new physics (NP) hunting at the energy frontier. Since then, many dedicated searches targeting well-motivated beyond-the-SM extensions, such as supersymmetry (SUSY), Composite Higgs models, extra spatial dimensions, etc., have been pursued. Yet, no convincing signals of NP have been observed so far. This suggests that, if NP exists, it may present itself in a highly unexpected form. This strongly motivates the design of new analysis strategies that would allow NP to be detected in a more model-independent way and with a broader coverage in theory space, hence complementing the ongoing model-dependent search program at the LHC.
In the science of ML, this involves a well-known task: novelty (or anomaly) detection, i.e., detecting novel events without prior knowledge. This implies that no data on the signal pattern are available for model training. Yet, different from its usual applications, such as the face- or fingerprint-recognition of a stranger, where an individual novel event is expected to be detected, at colliders novelty detection is defined on a statistical basis. To address this task, a series of pioneering studies has been pursued over the last decades, with the design of novelty evaluators forming the main line of development (for a review, see, e.g., Ref. [11]). Depending on the characteristics of the events employed to perform this task, these novelty evaluators/algorithms can be roughly classified into two types [1]:

• Isolation-based (O_iso). The novelty response for a given testing event is evaluated according to its isolation from the known-pattern data in the feature space. All of the other testing events are not directly involved in this evaluation.
• Clustering-based (O_clu). The novelty response for a given testing event is evaluated according to the clustering of testing events around it, on top of the known-pattern data in the feature space. The other testing events, especially those nearby in the feature space, are potentially relevant in this evaluation.
Here the distribution of the known-pattern data or background events in both cases can be obtained either from Monte-Carlo simulation (semi-supervised ML) or from data extrapolation (fully unsupervised ML). A short summary of the evaluators/algorithms proposed in the literature so far for novelty detection at colliders is given in Table 1. In this landscape, the autoencoder (AE) was first introduced in Ref. [1], where two kNN-based measures, O_iso and O_clu, were proposed for novelty evaluation in the AE latent space. After that, a set of metrics based on an ordinary AE [14-16], a variational AE [17,18,21], an adversarial AE [14,19,20], and a graph AE [22] were suggested as novelty measures. They mainly include the reconstruction error in Refs. [14-20,22], the Kullback-Leibler divergence [39] in Refs. [18,21], the Energy Mover's Distance [40] in Ref. [18], and logit and feature scores in Ref. [19]. Furthermore, the authors of Ref. [23] studied the potential of novelty detection with a variational-quantum-circuit-based quantum autoencoder. These proposals are clearly isolation-based, since the novelty response of each testing event is evaluated by them independently. In addition, a graph network was employed in UCluster [38], to be discussed below.

Footnote 1: In Ref. [1], we demonstrated the isolation-based (O_iso) and clustering-based (O_clu) evaluators, together with the synergy-based evaluator (O_syn) introduced below, using the k-nearest-neighbors (kNN)-based designs (O_trad, O_new and O_comb). In this paper, based on their nature, we rename these designs as the kNN-based O_iso, O_clu and O_syn, respectively.

Footnote 2: The classification taken in this article is based on the event characteristics employed for evaluating the novelty of the collider events. The goal is to make use of the complementarity between these characteristics to further improve the quality of evaluation. Notably, multiple ways exist to classify the evaluators/algorithms of novelty detection. For example, one can also classify them according to local binning versus global statistical test [12,13].
The local density of each event was estimated with kNN in Refs. [1,13], where the O_clu evaluator in Ref. [1] mimics the structure of (N − B)/√B (where N and B denote the number of data and background events, respectively; see Sec. 2), while a test statistic TS(p_test, p_B) in Ref. [13] was defined in the logarithmic form of the Neyman-Pearson lemma. The lemma was applied as a novelty score in Refs. [2,12,29] as well. In Refs. [2,12,26,27], the local density was parameterized straightforwardly with supervised learning to distinguish data from the reference, with the loss function serving as a novelty score t. In Ref. [28], each event is evaluated based on the fraction of its k nearest neighbors which are signal-like or carry a pre-assigned "signal" label. ANODE [29] utilizes the Neyman-Pearson lemma directly, but the physical distribution was simplified into a prior distribution via a masked autoregressive flow [41]. The unsupervised tagging in Ref. [30] is essentially novelty detection with the score defined by a ratio of Poisson probabilities. CWoLa [32-34], TNT, SALAD and SULU are based on weakly supervised learning (for a review see Ref. [42]), where a DNN trained with a density-based algorithm is used as an evaluator. CWoLa was proposed for a resonance search, using the difference of signal fractions between the signal region and its sideband region. TNT and SALAD are extensions of it: TNT generalizes it to a broad-distributed signal search by assuming each event has at least two independent objects [35], while SALAD uses a reweighting method to eliminate the discrepancy between simulation and real data [36]. SULU shows a possibility to synthesize data and manages to provide unlabelled data points with soft labels [37]. UCluster [38] constructs a graph network to define an embedded space and clusters collision events with similar properties in it. In this process, the locations of the testing events and cluster centroids in the embedded space are updated by considering the contribution of all testing events.
Figure 1: Cartoon illustrating the response of different types of novelty evaluators in the case of a 2D Gaussian example, consisting of 10000 background events (red points) and 1000 signal events (blue points). In the case of O_iso evaluators, the backgrounds are mainly from the ring band between the two yellow circles in (a). In contrast, O_clu evaluators are affected by the upward fluctuations in the local density of background events, denoted by the areas within the yellow-plus circles in (b) (downward fluctuations are denoted by the yellow-minus circles). A synergy-based evaluator, O_syn, would be affected only by the upward fluctuations in local density that lie within the ring band, as illustrated in (c). The distribution of the O_clu vs. O_iso responses is shown in (d), where yellow points represent the background events within the yellow circle in (a), and the red points are the rest of the background events. Exploiting the information encoded in both evaluators would clearly allow to improve the signal sensitivity by suppressing the backgrounds described above.

Despite this great progress, challenges exist for these evaluators/algorithms that are closely related to their applicability:
1. In terms of signal events, the isolation-based evaluators O_iso score high for events that are far from the bulk of the backgrounds in the feature space, such as those sitting on the tail of the distribution, while the clustering-based evaluators O_clu tend to be sensitive to events that show a large deviation from the background-only prediction in their local density, such as those arising from a resonance. In short, these two types of evaluators attempt to detect novelty along two separate but special directions, which to some extent limits their respective signal coverage.
2. In terms of background events, the O_iso evaluators may suffer from the tail background events, whose distance to the bulk of the backgrounds is comparable to that of the signal events, while the O_clu evaluators may suffer from those events that show a large fluctuation in their local density. These features are illustrated in Fig. 1 with a cartoon of 2D Gaussian samples.
These differences between O_iso and O_clu are rooted in the characteristics of the collider events employed to define O_iso and O_clu, respectively. Given the lack of prior knowledge on the signal pattern and the relevant backgrounds, a synergistic treatment based on these two types of evaluators is strongly motivated. This leads to the construction of a third type of novelty evaluator:

• Synergy-based (O_syn). The novelty response for a given testing event is evaluated according to a combination of its novelty responses to the isolation-based and clustering-based evaluators, i.e. by a function f(O_iso, O_clu).
We expect that the synergy-based O_syn evaluator can take advantage of the complementarity between the isolation-based and clustering-based evaluators and provide an option with much wider applicability. Actually, essentially motivated by such a consideration, a synergy-based evaluator O_syn = f(O_iso, O_clu) was already proposed in Ref. [1]. This design is not optimized, however. Firstly, the advantages of O_iso are not fully exploited. As we will see, signal events from the same region in feature space tend to have similar O_iso scores. This fact is essentially a reflection of the signal-event "clustering" in feature space. But the information on signal-event clustering is not well picked up by O_syn if the O_iso score happens to be low. Secondly, even if O_syn has a better performance than O_iso and O_clu, there is room to further improve the separation between the signal and the background events. Motivated by these considerations, in this paper we design an analysis scheme for novelty detection in which the complementarity between O_iso and O_clu can be fully utilized. We will demonstrate that this design indeed yields a broad coverage of signal patterns and excellent generalization properties.

This paper is organized as follows. In Sec. 2, we will introduce the proposed analysis scheme for novelty detection and demonstrate its performance using the 2D Gaussian example, using the kNN designs for illustration. We will especially discuss in Subsec. 2.3 the potential generalization of this scheme from the kNN-based evaluators to the other ones listed in Tab. 1. In Sec. 3, this scheme will be applied to a more realistic use case, i.e. the LHC detection of SM tth production and of direct stop-pair production, both in the ttγγ channel. Finally, we will conclude in Sec. 4.

A Complementarity-Based Analysis Scheme for Novelty Detection
Below we will present an optimized analysis scheme for novelty evaluation at colliders and study its performance. Two classes of samples are relevant: the training (or reference) sample will (1) assist the evaluation of the novelty response for a given event and (2) set up a reference for the novelty response of the known-pattern data, while the testing sample represents "real" data. To demonstrate the relevant points, we will use the kNN designs for the isolation-based and clustering-based evaluators proposed in Ref. [1], which are built from kNN mean distances. Here d̄_train is the mean distance of a testing event to its k nearest neighbors in the training sample; d̄'_train is the reference mean distance defined by the training sample only; d̄_test is the mean distance of the testing event to its k nearest neighbors in the testing dataset; m is the dimension of the latent space, which we specify while performing a concrete analysis; and c is a scaling factor chosen as the root mean square of the ∆_iso (∆_clu) scores for all testing data. For simplicity, the kNN distance metric is defined to be Euclidean. Please note: this by no means implies that the scheme developed here exclusively relies on the kNN designs for O_iso and O_clu; they represent a broad class of such designs instead. We will discuss its generalization in detail in Sec. 2.3.

Figure 2: Workflow for the proposed novelty-detection scheme at colliders.
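These kNN mean distances can be sketched in a few lines of numpy (a brute-force illustration of ours, not code from Ref. [1]; the sample sizes and the choice of k here are arbitrary):

```python
import numpy as np

def mean_knn_distance(ref, query, k, exclude_self=False):
    """Mean Euclidean distance of each query point to its k nearest
    neighbors in `ref` (brute force; adequate for small toy samples)."""
    d = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=-1)
    d.sort(axis=1)
    if exclude_self:  # drop the zero self-distance when query IS ref
        return d[:, 1:k + 1].mean(axis=1)
    return d[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(size=(1500, 2))   # known-pattern (training/reference) sample
test = rng.normal(size=(800, 2))     # "real" testing data

k = 40
d_train = mean_knn_distance(train, test, k)                           # d̄_train
d_train_ref = mean_knn_distance(train, train, k, exclude_self=True)  # d̄'_train
d_test = mean_knn_distance(test, test, k, exclude_self=True)         # d̄_test
```

Isolation-type scores grow with d̄_train, while clustering-type scores compare the local testing-sample density (∝ d̄_test^{-m}) with the training-sample one, as described above.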

Design
The proposed analysis scheme for novelty detection at colliders involves five steps, which are outlined in Fig. 2 and described below:

• Step I: dimensionality reduction. Reduce the high-dimensional feature space to a lower-dimensional latent space using a tool like an AE [43] or its variants. This step can suppress the impact of statistical fluctuations in the high-dimensional feature space on the detection efficiency, and generally allows a stronger response of anomalous data to the novelty evaluator.

• Step II: novelty evaluation. Evaluate the novelty response of the data, using the isolation-based and the clustering (density)-based methods, and project them onto the O_iso−O_clu plane for identifying the potential signal phase space. Note that with this step the data binned w.r.t. O_clu may no longer respect the original statistics, where the data are expected to be mutually independent. Although one could analyze potential signal events in this 2D plane directly (using, e.g., O_syn [1]), it is possible to proceed with the following steps to attempt to build a more optimal discriminant. This is especially valuable for cases where the backgrounds are rich and also widely distributed.

• Step III: bin resorting of O_iso. The goal is to move the signal events to the top-right corner of the O_iso−O_clu plane (recall that, if the signal events are located in the bulk of the background distribution, they tend to be scored low by O_iso), such that they can be isolated with a cut based on the geometric mean of the evaluators (see Step IV). For this purpose, we define n bins based on the O_iso scores and then calculate ξ_i = N_i/B_ref,i. Here N_i and B_ref,i are the event numbers of the i-th bin in the testing and training samples, respectively. Assuming that ξ_i is maximal for bin i*, we propose a moving strategy as follows:

- i* > n/2. We will shift all bins to the right so that bin i* (i*+1) becomes the rightmost bin if ξ_{i*+1} < 1 (ξ_{i*+1} ≥ 1).
- i* ≤ n/2. We will shift all bins to the left so that bin i* (i*−1) becomes the leftmost bin if ξ_{i*−1} < 1 (ξ_{i*−1} ≥ 1), and then assign a new score 1 − O_iso to each event.
This strategy preserves high priority for the bin with a large ξ_i value. For the convenience of the discussion, below we denote the new score after the movement again as O_iso. We will use 10 uniform bins for the resorting to get O_iso.

• Step IV: signal-like region identification. We define the signal-like region S as the part of the O_iso−O_clu plane where the geometric mean of the two scores exceeds a threshold r_0. This threshold defines the boundary of the signal-like region in the O_iso−O_clu plane. The determination of r_0 is not unique. For example, we can choose it to be the value above which (N − B_ref)/√B_ref is largest, with N and B_ref being the numbers of the testing and simulated background events in this region, respectively. In this study we will use a threshold r_0 = 0.7 or 0.6, unless otherwise specified. Alternatively, one can identify the signal-like region by collecting the 2D bins with a local excess of testing over reference events. In that case, S tends to be contaminated by fluctuating non-signal patches, given the finite size of the samples.
• Step V: novelty re-evaluation. Construct a weakly supervised DNN classifier to distinguish the data falling in the S region from the simulated background training samples. The inputs can be the full set of kinematic features or only the ones defining the latent space. The novelty of all testing data will then be re-evaluated by a new evaluator, i.e., the output neuron (denoted as O_syn). In this paper, we build this network with a simple architecture (five layers with 2, 8, 6, 4, 1 neurons, respectively), with the input neurons being the latent-space dimensions (see more discussions below).
To ensure its robustness, we train this model with multiple initial seeds and average all O_syn scores for each data point. We will train twenty such classifiers in total to define Ō_syn, namely the average of O_syn, for all scenarios in this study. Different from the cases of O_clu and O_syn, the data binned by Ō_syn respect the Gaussian/Poisson statistics. As will be seen, the application of Ō_syn strengthens the effectiveness and efficiency of the novelty evaluation, and finally optimizes the sensitivity of the detection. All neural networks used in this study are trained with Keras [44].
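Steps III and IV can be illustrated with a simplified numpy sketch. This is our own reading of the moving strategy: the normalization of ξ_i by the sample-size ratio, the cyclic shift (which stands in for the separate left/right cases described above), and the geometric-mean form of the Step-IV cut are simplifying assumptions:

```python
import numpy as np

def resort_iso_bins(o_iso_test, o_iso_ref, n_bins=10):
    """Step III (sketch): shift the O_iso bins so that the most signal-like
    bin (largest xi_i = N_i / B_ref,i) ends up at the high-score edge.
    A cyclic shift replaces the left/right cases of the full strategy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n_i, _ = np.histogram(o_iso_test, bins=edges)
    b_i, _ = np.histogram(o_iso_ref, bins=edges)
    # xi_i, normalized by the sample-size ratio (our assumption)
    xi = (n_i / np.maximum(b_i, 1)) * (len(o_iso_ref) / len(o_iso_test))
    i_star = int(np.argmax(xi))
    shift = n_bins - 1 - i_star           # move bin i* to the rightmost slot
    bin_idx = np.clip(np.digitize(o_iso_test, edges) - 1, 0, n_bins - 1)
    new_idx = (bin_idx + shift) % n_bins
    return (new_idx + 0.5) / n_bins       # resorted score in (0, 1)

def signal_like_region(o_iso_new, o_clu, r0=0.7):
    """Step IV (sketch): keep events whose geometric-mean score exceeds r0."""
    return np.sqrt(o_iso_new * o_clu) > r0
```

With a uniform reference and a testing sample carrying an excess near O_iso = 0.45, the excess bin is relocated to the top of the resorted score range, after which the geometric-mean cut isolates the top-right corner of the plane.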
After Steps I-V, we will be able to calculate the sensitivity reach of this analysis scheme. Explicitly, we will use a Poisson-statistics-based formula to calculate the significance of excluding the background-only hypothesis with O_syn.
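A commonly used Poisson-based significance for excluding the background-only hypothesis is the Asimov formula; the text does not spell out its exact choice, so we quote this as a standard possibility rather than necessarily the formula used here:

```latex
Z \;=\; \sqrt{\,2\left[\,(s+b)\,\ln\!\left(1+\frac{s}{b}\right)-s\,\right]}\,,
```

where s and b denote the expected signal and background yields in the selected region; in the limit s ≪ b this reduces to the familiar Z ≈ s/√b.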

Performance
Below we will explore the sensitivity performance of the proposed working scheme for novelty detection, with the signal and background events in the latent space being mimicked by 2D Gaussian samples.^3 Concretely, the backgrounds are defined as N((0,0), I), i.e. a Gaussian sample centered at the origin with unit standard deviation, with the signals N((µ,0), σ²I) sitting on top of it. In total, nine different signal benchmarks (denoted "benchmark points", or BPs) are considered, with µ = 0, 1, 2 and σ = 0.1, 0.3, 0.6, all of which are unknown to the detection. The distributions of these signals, together with that of the backgrounds, are shown in Fig. 3, where x and y represent the two directions defining the latent space. As µ increases, the signal distribution moves along the x axis from the background center to its tail. As σ increases, the signal distribution evolves from a narrow peak to a broad distribution.
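These benchmark samples can be generated in a few lines of numpy (a sketch of ours; the function name and the fixed seed are illustrative, while the distributions and the µ, σ grid follow the text):

```python
import numpy as np

def make_samples(mu, sigma, n_bkg=100_000, n_sig=1_000, seed=0):
    """2D Gaussian toy: background ~ N((0,0), I), signal ~ N((mu,0), sigma^2 I)."""
    rng = np.random.default_rng(seed)
    bkg = rng.normal(loc=0.0, scale=1.0, size=(n_bkg, 2))
    sig = rng.normal(loc=(mu, 0.0), scale=sigma, size=(n_sig, 2))
    return bkg, sig

# the nine benchmark points: mu in {0, 1, 2} crossed with sigma in {0.1, 0.3, 0.6}
benchmarks = [(mu, sigma) for mu in (0, 1, 2) for sigma in (0.1, 0.3, 0.6)]
```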
For this study we generate 4×10^5 (background) events for training, and 1×10^3 + 1×10^5 (signal + background) events for testing. For the O_iso evaluator, d̄_train and d̄'_train are calculated with k = 4000, and for the O_clu evaluator, d̄_train and d̄_test are calculated with k = 4000 and k = 1000, respectively. The corresponding scores can be found in App. A. Notably, despite the impact on O_iso and O_clu (and hence on O_syn) of varying the k value over a range around N_sig, we find that the O_syn performance is robust against this variation.

Figure 3: 2D Gaussian samples used for testing. Each contains 100000 background events and 1000 signal events. For convenience, we label these benchmark points as "BP 1", ..., "BP 9", from left to right and from top to bottom.

As an isolation-based evaluator, the performance of O_iso is highly sensitive to the µ parameter. As shown in Fig. 4, the signal events tend to score lower as the µ value decreases. This is a result of the O_iso definition, since for lower µ values the signal events are located in the bulk of the background distribution and d̄_train becomes numerically closer to d̄'_train. The performance of O_iso for the different BPs is illustrated in Fig. 8, which summarizes the ROC curves and AUC values. As expected, the BPs in the left column have a smaller AUC value for O_iso than their respective counterparts in the right column. Compared to variations in the µ parameter, the impact of the σ parameter on the O_iso performance is relatively weak. It should be noted that the degraded novelty response in the small-µ BPs does not imply that the signal events are not well-picked by O_iso. It just means that the score in such a case is not a proper measure of the novelty. Actually, from Fig. 4 one can see that the signal events are still clustered in some specific O_iso bins, despite their low scores in the small-µ BPs.
This reveals one simple but important fact for both isolation-based and clustering-based evaluators: signal events from the same region in feature space tend to be scored similarly. In this analysis scheme, we therefore apply the bin resorting to O_iso (redefining the resorted score, for convenience of presentation, again as O_iso), and demonstrate the outcomes after the bin resorting in Fig. 5, with the corresponding ROC curves and AUC values in Fig. 8. As expected, the signal events now tend to be scored high by O_iso, which eventually yields an AUC value ≥ 0.5.
As a clustering-based evaluator, O_clu performs excellently in detecting a resonance-like structure. As shown in Fig. 4 and Fig. 5, most signal events are scored very high for the BPs with σ = 0.1. Yet, as expected, the performance of O_clu degrades as σ increases, since for a broader signal distribution it becomes increasingly harder to detect event clustering. The same trend can also be seen from the ROC curves of O_clu and their AUC values shown in Fig. 8. Indeed, the obtained AUC values are all close to one for the BPs with σ = 0.1, but they are reduced as σ becomes larger. Compared to the σ parameter, the impact of the µ parameter on the O_clu performance is relatively weak. But still, as µ increases, the obtained AUC values gradually improve, due to the reduced overlap between the signal and background distributions.

Following the previous discussions, the complementarity between the isolation-based and the clustering-based novelty evaluators is two-fold. First, O_iso is more sensitive to µ while O_clu is more sensitive to σ. This makes the BPs with small or large µ and small σ relatively easy to probe, while leaving the BPs with intermediate µ and large σ harder to probe. Second, signal events from the same feature-space region tend to be scored similarly by both evaluators, which allows a signal-like region to be identified in the O_iso−O_clu plane (cf. Fig. 3). This is Step IV in this analysis scheme.
The novelty responses of the 2D Gaussian samples to O_syn and Ō_syn, and the corresponding ROC curves, are demonstrated in Fig. 6, Fig. 7 and Fig. 8, respectively. The significance curves based on Ō_syn are illustrated in Fig. 9. By comparing the AUC values of the O_syn ROC curves with those of O_iso and O_clu, one can see that this intuitive design performs universally better than at least one of O_iso and O_clu. Indeed, to some extent the complementarity discussed above is already captured by this design.

Generalization
As previously indicated, the proposed analysis scheme is general. One can pair any of the isolation-based and clustering-based evaluators listed in Table 1 and use them to replace the kNN-based O_iso and O_clu used in this study, with the expectation of similar outcomes.^5 Alternatively, one can develop a clustering-based "partner" evaluator for each isolation-based evaluator, as occurs for the kNN-based designs, and then apply them to this scheme. Recall that the definition of both O_iso and O_clu relies on some distance measure "d" (see Eq. (2.2)). Here we only need to replace the kNN-based measure with the one defining the given isolation-based evaluator.

Footnote 5: The complementarity between the AE reconstruction error and CWoLa was recently shown in Ref. [45]. In Ref. [46] it was suggested to detect anomalies by combining the isolation-based Deep SVDD and clustering-based autoregressive flows. Besides, it was also pointed out in Refs. [47] and [48] that the AE reconstruction error detects novelty only in one direction, as discussed in this paper. In relation to that, latent-space tagging (a clustering-based method in our definition) was introduced to further improve the detection performance in Ref. [47]. Essentially, these suggested methods exploit the complementarity between the isolation-based and clustering-based evaluators/algorithms, as was originally done for the kNN-based O_iso and O_clu in Ref. [1].

Figure 8: ROC curves and their AUC values (given in brackets) for the set of novelty evaluators considered, applied to the 2D Gaussian testing samples corresponding to the same BPs as in Fig. 3. Here r_0 = 0.7 is taken to define the signal-like sample S for training O_syn. As a reference, the performance from supervised learning (SL) has also been reported. The dashed and solid blue curves are fully overlapped in the panels of the right column.
To demonstrate these points (especially the second one), let us consider an AE-based novelty evaluator as an example. The AE reconstruction error (R_AE) is essentially a distance measure in the feature space, i.e. an isolation-based evaluator. So we can define the R_AE-based O_iso and O_clu by taking d = R_AE^{1/2} (the power of 1/2 arises from the fact that R_AE is formally a sum of squared "lengths" in the feature space). Here two AEs with the same architecture (denoted as AE-A and AE-B) need to be trained, using the training and testing samples respectively. Then d̄_train and d̄'_train will be calculated with AE-A, while d̄_test will be calculated with AE-B.
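Schematically, the R_AE-based distance can be written as below. For a self-contained illustration we let a linear one-component "autoencoder" (a PCA projection) stand in for the trained Tanh AEs; this stand-in and the function names are our assumptions, not the paper's setup:

```python
import numpy as np

def fit_linear_ae(x):
    """Toy stand-in for a trained AE: encode onto the first principal
    component and decode back (a 1D linear bottleneck)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    v = vt[0]  # leading direction in feature space
    return lambda y: mean + np.outer((y - mean) @ v, v)

def rae_distance(model, x):
    """d = R_AE^{1/2}: square root of the summed squared reconstruction error."""
    return np.sqrt(np.sum((x - model(x)) ** 2, axis=1))

rng = np.random.default_rng(0)
cloud = rng.normal(size=(5000, 2)) * np.array([3.0, 0.3])  # elongated background
ae_a = fit_linear_ae(cloud)  # plays the role of "AE-A", trained on background
```

Events lying along the learned direction reconstruct well (small d), while events off that direction acquire a large d, exactly the behavior the kNN distance measure is replaced with here.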
As a concrete example, we apply this variant of the designed scheme to the analysis of the 2D Gaussian samples corresponding to BP 5. Here AE-A and AE-B are constructed with 3 hidden layers, which contain 2, 1 and 2 neurons respectively. Tanh is selected as the activation function for all layers except the last one, where a linear function is applied to match the input and output ranges. Stochastic gradient descent is used as the optimizer, with its learning rate, decay rate and momentum being 0.3, 0.0001 and 0.99, respectively. The epoch number is fixed to 400. AE-A is trained with a training sample of 400000 events, while AE-B is trained with a testing sample of 101000 events. To improve the analysis stability, we train both AE-A and AE-B with 20 random initial seeds and use the square root of the averaged reconstruction error of each testing event, i.e. ⟨R_AE⟩^{1/2}, to define the distance "d" and eventually the evaluator O_syn. Here r_0 = 0.6 is taken to define the signal-like sample S for training O_syn; as a reference, the SL performance has also been reported.
For this BP, the complementarity between the R_AE-based O_iso and O_clu is strong. As discussed above, this is essentially determined by the definition of O_iso and O_clu, rather than by the actual distance measure used. Quantitatively, the kNN-based O_iso tends to score these signal events lower, while the R_AE-based one tends to score them higher. This yields a set of AUC values for the R_AE-based O_iso and O_clu that are closer to each other, compared to the kNN-based case. Finally, the R_AE-based design returns an O_syn AUC value (0.88 vs. 0.90) and significance (7.7σ vs. 8.5σ) comparable to those shown in Fig. 8 and Fig. 9.
Here the reference sample contains 4×10^4 background events. This case is close to BP 6 studied above, but with slightly larger (effective) µ and σ values, and a five times larger S/B. It was first used in Ref. [1] to show the capability of O_iso, O_clu and O_syn, and then in Ref. [2] to demonstrate the performance of the likelihood-based algorithm [12]. We present the ROC curves and their AUC values for the set of novelty evaluators, and the significance curve of O_syn, in Fig. 12. In the original paper [1], a maximal significance of ∼17σ was obtained with O_syn, while in Ref. [2] the significance quoted is around 20σ. Using the scheme suggested in this paper, however, we find that a maximal significance of ∼24σ can be reached by O_syn. It improves the maximal significances reported in Refs. [1,2] by ∼7σ and ∼4σ, respectively. Actually, the performance of O_syn in this context is very close to the maximal significance of ∼24.5σ obtained with SL.

Here r_0 = 0.7 is taken to define the signal-like sample S for training O_syn. As a reference, the significance obtained in [2] is shown in the right panel while the SL performance is reported in both.
The comparison can be extended to another specific case, which concerns resonant bump hunting in a 3D feature space (m, x, y). This toy case was first considered in Ref. [32] and further analyzed in Ref. [2]. In this case, the signal and background events are uniformly distributed in the regions defined by (|m| < 1, |x| < 0.1, |y| < 0.1) and (|m| < 2, |x| < 0.5, |y| < 0.5), respectively. The reference sample has 40000 background events and the testing sample consists of 10000 background events and 300 signal events. We present the ROC curves and their AUC values for the set of novelty evaluators, and the significance curve of O_syn, in Fig. 13. Based on O_syn, we find a maximal significance of ∼13σ, which is not far from the SL reach. As a comparison, the t-score method gives a median global significance of 8.1σ [2] and the CWoLa hunting yields a local significance of ∼10.8σ [32].
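The 3D toy above can be generated directly from its description (a sketch of ours; the function name and seed are illustrative, while the box boundaries and event counts follow the text):

```python
import numpy as np

def make_bump_toy(n_ref=40_000, n_bkg=10_000, n_sig=300, seed=0):
    """Bump-hunt toy in (m, x, y): background uniform in (|m|<2, |x|<0.5,
    |y|<0.5), signal uniform in (|m|<1, |x|<0.1, |y|<0.1)."""
    rng = np.random.default_rng(seed)
    uniform_box = lambda n, bounds: rng.uniform(-1, 1, size=(n, 3)) * bounds
    bkg_bounds = np.array([2.0, 0.5, 0.5])
    sig_bounds = np.array([1.0, 0.1, 0.1])
    reference = uniform_box(n_ref, bkg_bounds)
    testing = np.vstack([uniform_box(n_bkg, bkg_bounds),
                         uniform_box(n_sig, sig_bounds)])
    return reference, testing
```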
Finally, we would like to point out that these comparisons are based on some simple toy cases known to us. To have a full picture of the performance of these methods, one needs to go to cases with more realistic and dedicated kinematics. We leave this to future work. Instead, below we will apply the suggested scheme to the analysis of ttγγ events in proton-proton collisions.

A Collider Case Study: Novelty Detection in ttγγ Events
In this section, we will apply the proposed novelty detection scheme to a more realistic case: the analysis of ttγγ events produced in proton-proton collisions at √s = 13 TeV, assuming an integrated luminosity of 3 ab−1. The potential signal events to detect include: (1) SM tth production with h → γγ, and (2) direct stop-pair production in gravity-mediated SUSY, with both stop quarks undergoing the chain decay t̃ → t χ̃₁⁰ (→ γ G̃), resulting in large missing transverse momentum from the undetected gravitinos (G̃). They represent two typical signatures at colliders: a resonant peak and a broad shape. As will be shown, these two types of signal events have a strong response to O_clu and O_iso, respectively, but eventually both of them can be picked up by O_syn with a higher efficiency. So this analysis provides a nice context to test the suggested analysis scheme. As for the backgrounds, they arise primarily from ttγγ, ttγ+jets (with one jet misidentified as a photon), tt+jets (with two jets misidentified as photons), and continuum γγ+jets production.

Here r_0 = 0.8 is taken to define the signal-like sample S for training O_syn. As a reference, the significances obtained in Refs. [2,32] are shown in the right panel while the SL performance is reported in both.
The background samples are simulated at leading order, following Ref. [53]. We use the MLM algorithm to perform the matrix-element to parton-shower matching, for the $t\bar t\gamma$ process with up to one additional jet and for the $t\bar t$ process with up to two additional jets. Re-weighting, with the probability provided in Ref. [54], is applied to compensate for the low generation efficiency caused by the tiny rate of jets faking photons. The yields after preselection for $t\bar t\gamma\gamma$, $t\bar t\gamma$ and $t\bar t$ are eventually close to 78.8 : 18.6 : 2.6 [53]. The continuum $\gamma\gamma$ process contributes about half of the total background [52]. It is simulated in a five-flavour scheme, with MLM matching up to two additional jets. One of these jets is randomly assigned in each event to represent the jet misidentified as a lepton, as required by the preselection.
The $t\bar t h$ signal sample is generated at NLO, forcing the $h \to \gamma\gamma$ decay, and normalized to a cross section of 0.51 pb times the Higgs boson branching ratio to diphotons of 0.227% [53]. The $\tilde t\tilde t^*$ sample is generated at leading order, with the stop quark ($\tilde t$) decaying into a top quark and a bino-like neutralino ($\tilde\chi_1^0$), and the neutralino further decaying into a photon and a gravitino ($\tilde G$). The mass parameters for the stop quark, neutralino and gravitino are set to $m_{\tilde t} = 1$ TeV, $m_{\tilde\chi_1^0} = 0.2$ TeV and $m_{\tilde G} \sim 0$ GeV. The $\tilde t\tilde t^*$ sample is normalized to a cross section of 6.83 fb [55], times an assumed branching fraction of 10% for the decay into $t\bar t\gamma\gamma + 2\tilde G$.
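As a quick cross-check of these normalizations, the raw signal yields before any selection efficiency follow from cross section times branching ratio times luminosity:

```python
# Back-of-envelope signal yields at 3 ab^-1, before selection efficiencies
# (the post-preselection numbers are the ones quoted in Table 2).
lumi = 3000.0                          # integrated luminosity in fb^-1 (= 3 ab^-1)

sigma_tth = 0.51e3                     # tth cross section: 0.51 pb = 510 fb
br_h_aa   = 0.227e-2                   # BR(h -> gamma gamma) = 0.227%
n_tth = sigma_tth * br_h_aa * lumi     # raw tth(h -> gamma gamma) events

sigma_stop = 6.83                      # stop-pair cross section in fb
br_stop    = 0.10                      # assumed BR into tt + diphoton + 2 gravitinos
n_stop = sigma_stop * br_stop * lumi   # raw stop-pair signal events
```

This gives roughly $3.5\times 10^3$ and $2.0\times 10^3$ raw events for the two signals, respectively, before the preselection efficiencies that produce the Table 2 yields.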
The expected event yields after preselection are summarized in Table 2. In addition, we generate $1.3 \times 10^5$ background events satisfying the preselection criteria as the reference sample.

Novelty Analysis
Although multiple kinematic features could be exploited to discriminate the signals from the background even after preselection, in this analysis we consider only the diphoton kinematics for simplicity. We sort the two photons by energy and take their four-momentum components $\{P_T, \eta, \phi, E\}$ to define an eight-dimensional feature space. We then standardize each of these features as $x \to (x - \mu_x)/\sigma_x$, where $\mu_x$ is the mean of the reference sample and $\sigma_x$ is its standard deviation. To reduce the potential sparse errors caused by the low rate of rare events, we perform dimensionality reduction by encoding the data from the eight-dimensional feature space into the AE latent space [1]. The AE is built with eleven layers of 8, 12, 8, 8, 6, 2, 6, 8, 8, 12, 8 neurons, respectively; the resulting latent space is two-dimensional. We choose Tanh as the activation function except for the last layer, where a linear activation function is applied to ensure the matching of the output and input data ranges. The model is trained with a batch size of 250, a learning rate of 0.94 and a decay rate of 0.88, with ADADELTA [56] used as the optimizer.
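The layer structure described above can be sketched as a plain forward pass. This is a minimal numpy illustration with random weights (the actual model is trained with ADADELTA on the reconstruction loss, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths as described: 8 -> 12 -> 8 -> 8 -> 6 -> 2 (latent) -> 6 -> 8 -> 8 -> 12 -> 8
widths = [8, 12, 8, 8, 6, 2, 6, 8, 8, 12, 8]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(widths[:-1], widths[1:])]
biases  = [np.zeros(b) for b in widths[1:]]

def forward(x):
    latent = None
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:   # Tanh on all layers except the last,
            x = np.tanh(x)         # which stays linear to match the input range
        if x.shape[1] == 2:        # the 2-neuron bottleneck: the latent space
            latent = x
    return x, latent

batch = rng.normal(size=(250, 8))  # one batch of standardized diphoton features
recon, z = forward(batch)          # reconstruction and 2D latent representation
```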
The orientation of the latent space depends on the definition of the loss function. As a physical requirement, we incorporate Lorentz invariants, namely the single-photon and diphoton invariant masses, into the vanilla loss function (i.e. $L = |x - x'|^2$), in the framework of a relational AE [57]. Then we have
$$ L' = L + c \left[ \sum_{i=1,2} \frac{(m_{\gamma_i} - m'_{\gamma_i})^2}{\sigma_{E_i}^2} + \frac{(m_{\gamma\gamma} - m'_{\gamma\gamma})^2}{\sigma_{m_{\gamma\gamma}}^2} \right] . \quad (3.2) $$
Here the output quantities are distinguished from the input ones by a prime. The two photons are sorted by energy, with $m_{\gamma_i}$ and $m_{\gamma\gamma}$ being the single-photon and diphoton invariant masses, and $\sigma_{E_i}$ and $\sigma_{m_{\gamma\gamma}}$ being the single-photon energy and diphoton $m_{\gamma\gamma}$ standard deviations (in the background sample). To ensure that each term in Eq. (3.2) contributes comparably to the total loss, the coefficient $c$ is set to 10. With this construction, the AE not only learns to reconstruct low-level observables such as the single-object four-momenta, but also retains high-level features, such as $m_{\gamma\gamma}$, which encode the correlations among the objects in each event. The 2D latent spaces obtained by training with the two loss functions, $L$ and $L'$, are displayed in Fig. 14. The broad scattering of the points in the $n_1$-$n_2$ planes indicates that the correlation between the two dimensions of the latent space is weak. As can be appreciated, with the Lorentz invariants incorporated in the $L'$ loss function, the AE tends to project signal events with different topologies into different regions of the latent space. The $t\bar t h$ signal events are highly clustered in the bulk of the background distribution, while the $\tilde t\tilde t^*$ signal events are broadly distributed at its tail. The internal correlations among the particles in each event, secured by Lorentz invariance, thus provide a useful guide for the AE to separate signals with different topologies.
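The extra invariant-mass terms can be computed directly from the photon kinematics. The sketch below is a hedged illustration keeping only the diphoton-mass term, assuming the feature ordering $\{P_{T1}, \eta_1, \phi_1, E_1, P_{T2}, \eta_2, \phi_2, E_2\}$ in unstandardized units; the single-photon mass terms are analogous:

```python
import numpy as np

def diphoton_mass(p):
    # p: (batch, 8) = [pT1, eta1, phi1, E1, pT2, eta2, phi2, E2]
    pt1, eta1, phi1, e1 = p[:, 0], p[:, 1], p[:, 2], p[:, 3]
    pt2, eta2, phi2, e2 = p[:, 4], p[:, 5], p[:, 6], p[:, 7]
    # Total three-momentum of the diphoton system
    px = pt1 * np.cos(phi1) + pt2 * np.cos(phi2)
    py = pt1 * np.sin(phi1) + pt2 * np.sin(phi2)
    pz = pt1 * np.sinh(eta1) + pt2 * np.sinh(eta2)
    m2 = (e1 + e2) ** 2 - px ** 2 - py ** 2 - pz ** 2
    return np.sqrt(np.clip(m2, 0.0, None))

def relational_loss(x, x_rec, c=10.0, sigma_mgg=1.0):
    vanilla = ((x - x_rec) ** 2).sum(axis=1)        # L = |x - x'|^2
    dm = diphoton_mass(x) - diphoton_mass(x_rec)    # m_gg - m'_gg
    return (vanilla + c * dm ** 2 / sigma_mgg ** 2).mean()
```

For two back-to-back photons of unit energy and transverse momentum, this returns the expected diphoton mass of 2 (in the same units), and the loss vanishes for a perfect reconstruction.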
In contrast, such a symmetry-based guide is missing with the regular loss function $L$, which yields a broad distribution of the $t\bar t h$ signal events in the latent space (although the $\tilde t\tilde t^*$ signal events are still loosely clustered at the tail of the background distribution). Given the similarity (in terms of both broadness and overlap with the backgrounds) of this distribution to that of the 2D Gaussian sample in BP8, we expect the detection sensitivity for $t\bar t h$ to be low in this case. Below we therefore perform our analysis in the latent space defined by $L'$. To evaluate the novelty response of the testing data, we take $k = 1000$ for $O_{\rm iso}$ and $k = 1000$ ($k' = 19$) for $O_{\rm clu}$. In the latter case, $k = 1000$ is applied to calculate $d_{\rm train}$, while the rescaled $k' = 19$ is applied to calculate $d_{\rm test}$. The distributions of signal and background events in the $O_{\rm iso}$-$O_{\rm clu}$ plane are shown in Fig. 15. As expected, the $t\bar t h$ and $\tilde t\tilde t^*$ signal events tend to score high in $O_{\rm clu}$ and $O_{\rm iso}$ (or $O'_{\rm iso}$), respectively, while their novelty responses to the other evaluator are relatively weak. In spite of this, the synergy-based strategy ensures that many of these signal events are classified into the new signal-like bin $S'$. Indeed, in both cases the signal events obtain a relatively high $O_{\rm syn}$ score, as shown in Fig. 16, and hence good discrimination against the backgrounds.

[Figure caption (displaced): Here $r_0 = 0.6$ is taken to define the signal-like sample $S'$ for training $O_{\rm syn}$. As a reference, the SL performance has also been reported.]

The ROC curves and their AUC values for the set of novelty evaluators considered are shown in Fig. 17. As previously discussed, $O_{\rm iso}$ and $O_{\rm clu}$ are each sensitive to only one of the two signal patterns (resonance or continuous spectrum), whereas $O_{\rm syn}$ brings a further improvement, performing stably in both cases. Finally, we show the statistical significances for the discovery of both types of signal in Fig. 18. The sensitivity of a supervised-learning classifier is also shown for each case.
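The $d_{\rm train}$ and $d_{\rm test}$ quantities above are average distances to the $k$ nearest neighbours. A minimal numpy sketch of such isolation- and clustering-style measures is given below; the names and normalizations here are purely illustrative, and the exact evaluator definitions follow Refs. [1,2]:

```python
import numpy as np

def mean_knn_distance(points, sample, k):
    # Average distance from each point to its k nearest neighbours in `sample`.
    d = np.linalg.norm(points[:, None, :] - sample[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, :k].mean(axis=1)

def isolation_score(test, reference, k):
    # Isolation-style measure: points far from the reference sample look novel.
    d_train = mean_knn_distance(test, reference, k)
    return d_train / d_train.max()          # crude normalization for illustration

def clustering_score(test, reference, k_train, k_test):
    # Clustering-style measure: a local over-density of testing events relative
    # to the reference sample (d_test < d_train) signals clustered novelty.
    d_train = mean_knn_distance(test, reference, k_train)
    d_self = np.linalg.norm(test[:, None, :] - test[None, :, :], axis=-1)
    d_self.sort(axis=1)
    d_test = d_self[:, 1:k_test + 1].mean(axis=1)   # skip the zero self-distance
    return (d_train - d_test) / d_train.max()
```

The two measures respond to opposite signal morphologies, which is exactly the complementarity that the synergy-based combination exploits.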
This represents the best performance that could be achieved. Encouragingly, the sensitivities of $O_{\rm syn}$ are not far from these "ideal" ones. Moreover, these sensitivities are compatible with the results of real-data analyses. The $t\bar t h$ production with $h \to \gamma\gamma$ was analyzed in Ref. [53] using 36.1 fb$^{-1}$ of ATLAS data at $\sqrt{s} = 13$ TeV. This analysis yielded a significance of $1.16\sigma$ in the ttH-lep region, with the events in the diphoton mass window containing 90% signal (see Table 27 in Ref. [53]). As a comparison, Fig. 18(a) indicates an optimal significance of $\sim 1.0\sigma$ for the same amount of integrated luminosity. Separately, stop pair production with a full decay into $t\bar t\gamma\gamma + 2\tilde G$ was analyzed in Ref. [58], using 35.9 fb$^{-1}$ of CMS data at $\sqrt{s} = 13$ TeV. According to Fig. 5 in Ref. [58], the benchmark scenario studied here (i.e. $m_{\tilde t} = 1$ TeV and $m_{\tilde\chi_1^0} = 0.2$ TeV) has been excluded with a significance $> 2.0\sigma$, considering inclusive decays of $\tilde\chi_1^0$ with BR($\tilde\chi_1^0 \to \gamma\tilde G$) = BR($\tilde\chi_1^0 \to Z\tilde G$) = 50%. Assuming the same luminosity and BR($t\bar t\gamma\gamma + 2\tilde G$) = 25%, Fig. 18(b) implies an optimal significance (against the background-only hypothesis instead) of $\sim 3.1\sigma$.

Conclusions
The null results of the broad program of searches at the LHC so far may imply that NP signals at colliders could take highly unexpected forms. This strongly motivates the development of strategies that allow NP to be detected in a less model-dependent way and with a broader coverage of theory space, hence complementing the current model-dependent search programs at the LHC. The ML techniques of novelty detection can serve this purpose well, since they are essentially designed to detect novel events without prior knowledge.
Following Ref. [1], where $O_{\rm syn}$ (i.e. a novelty evaluator utilizing the complementarity between the kNN-based $O_{\rm iso}$ and $O_{\rm clu}$ evaluators) was proposed to improve the detection sensitivity, we have developed an analysis scheme to exploit this feature in a more systematic way. One improvement here is an additional step that re-sorts the bins of $O_{\rm iso}$ (see Step III in Sec. 2.1), which defines $O'_{\rm iso}$, according to the level of deviation of the testing sample from the reference sample in each of its bins. As discussed, signal events from the same region of the feature space tend to have close $O_{\rm iso}$ scores. These events, however, may not be well identified by $O_{\rm iso}$ if they lie in the background bulk. This step resolves the problem to a large extent. We can then define $O_{\rm syn}$ based on the $O'_{\rm iso}$ and $O_{\rm clu}$ scores. To interpret the confidence level of the data deviation based on $O_{\rm syn}$ with Gaussian/Poisson statistics, we select a signal region and a background region separated by a certain $O_{\rm syn}$ threshold. We then use the latent space as input for a supervised DNN to classify these two regions. The output score of this DNN defines our final novelty score $O'_{\rm syn}$, which quantifies the level of deviation of the data from known patterns. Such a treatment improves the confidence-level interpretation in a general context. If the data statistics are sufficiently high, one can even use the full final-state particle kinematics as input to separate the signal and non-signal regions, diminishing the information loss caused by dimensionality reduction. We stress that this scheme is rather broad: it represents a class of designs in which the kNN-based $O_{\rm iso}$ and $O_{\rm clu}$ can be replaced with another pair of isolation- and clustering-based evaluators (see Table 1), with the features of $O_{\rm syn}$ and $O'_{\rm syn}$ qualitatively unchanged.
We then conducted a comparative study of novelty evaluators, demonstrating the generality and efficiency of $O_{\rm syn}$ in a variety of NP scenarios mimicked with two-dimensional Gaussian samples. The resulting signatures range from loose clustering in the center of the known-pattern distribution to compact isolation. We subsequently applied this study to the LHC detection of SM $t\bar t h$ production and of direct stop-quark pair production in gravity-mediated SUSY as novel events in the $t\bar t\gamma\gamma$ channel. These two scenarios yield signal patterns with a sharp resonance and a broad distribution in $m_{\gamma\gamma}$, respectively. With $O_{\rm syn}$, we successfully identify both types of signal, reaching a discovery/exclusion confidence level comparable to that of dedicated supervised-learning searches. We would like to stress that, although the sensitivity in these two physical cases might depend on the chosen latent-space dimensionality of the invariant-mass-preserving AE architecture, a systematic way to define a symmetry-preserving latent space for novelty evaluation is another broad aspect that deserves further exploration (see, e.g., Ref. [59] in this direction).
Our study is semi-supervised in the sense that the training or reference samples are expected to be generated with MC simulation tools. Such a method will unavoidably introduce systematics, including QCD uncertainties due to the inaccuracy of simulation. One way to reduce these systematics could be to develop adversarially trained autoencoders, in which the sensitivity of the autoencoder to simulation-induced bias is suppressed [20]. Alternatively, one can extend the semi-supervised learning to fully unsupervised learning or a data-driven method. To achieve this, one needs to extrapolate the backgrounds in the signal region from control regions. This strategy may reduce the systematics to below 10% (for an example using the ABCD method, see, e.g., [60]). The background extrapolation could be further improved by using a generative adversarial network, with which a background sample is generated to mimic the data. Notably, in all these variants the novelty evaluation for collider events remains essentially unchanged. We leave these explorations to future work.