Use of a Generalized Energy Mover's Distance in the Search for Rare Phenomena at Colliders

In this paper we expand on the previously proposed concept of Energy Mover's Distance. The resulting observables are shown to provide a way of identifying rare processes in proton-proton collider experiments. It is shown that different processes are grouped together differently, and that this can contribute to the improvement of experimental analyses. The $t\bar{t}Z$ production at the Large Hadron Collider is used as a benchmark to illustrate the applicability of the method. Furthermore, we study the use of these observables as new features which can be used in the training of Deep Neural Networks.

ENERGY MOVER'S DISTANCE AS A TOOL FOR MEASUREMENTS AT COLLIDERS
The concept of a metric for the space of collider events based on the Energy Mover's Distance (EMD) has been recently proposed in [1,2]. Here, we apply an EMD for full event properties to the classification of physical processes. This application can be particularly important for the measurement of rare Standard Model (SM) processes, which typically have cross-sections several orders of magnitude below the backgrounds affecting their measurement. In such cases, a good discrimination between signal and background is critical and thus any new variables contributing to a correct classification of events are of utmost importance, particularly so for precision measurements. Similarly, extracting information from the event kinematics to search for new physics phenomena is also a key aspect when analyzing data produced at colliders [3].
In order to extend the concept of EMD to full reconstructed events, we propose a generalization of the EMD definition that introduces a new factor encoding information on the identity of the reconstructed physics objects present in each event. The generalised distance $d(I, J)$ between the events $I$ and $J$ is then given by Eq. (1), where $i$ and $j$ are the final state objects of the events $I$ and $J$, respectively. The five leading small-$R$ jets and large-$R$ jets, the two leading electrons and muons and the missing transverse energy (MET) are the final state objects considered. When an event has fewer objects, the non-existing ones enter the algorithm as null four-vectors, providing a proxy for the object multiplicity in the event. Jets are built from calorimeter energy clusters grouped using the anti-$k_t$ jet finder algorithm [4] as implemented in the FastJet package [5], with radius parameter $R=0.4$ and $R=1.0$ for small- and large-$R$ jets, respectively.
In Eq. (1), $p_T$ is the momentum of the final state objects in the transverse plane [6], and $\Delta R_{ij} = \sqrt{\Delta\phi_{ij}^2 + \Delta\eta_{ij}^2}$ is the radial distance between object $i$ in event $I$ and object $j$ in event $J$, where $\phi$ is the azimuthal angle defined in the transverse plane and $\eta$ is the pseudo-rapidity. Before computing $d(I, J)$ from the simulated Monte Carlo samples, the events are first boosted to their centre-of-mass frame and then rotated to align the hardest object vertically in the $(\eta, \phi)$ plane. Since physical laws are Lorentz invariant, this procedure simply removes spurious information.
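As an illustration of the angular distance defined above, the following minimal sketch computes $\Delta R$ for two objects given their $(\eta, \phi)$ coordinates. The function names are hypothetical, and the wrapping of $\Delta\phi$ into $[-\pi, \pi)$ is standard practice that we assume here; it is not spelled out in the text.

```python
import math


def delta_phi(phi1: float, phi2: float) -> float:
    """Signed azimuthal difference wrapped into [-pi, pi) (assumed convention)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta1: float, phi1: float, eta2: float, phi2: float) -> float:
    """Radial distance sqrt(d_eta^2 + d_phi^2) in the (eta, phi) plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))
```

Without the wrapping, two objects on either side of the $\phi = \pm\pi$ boundary would appear far apart even though they are close in the detector.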
The first term in Eq. (1) defines an overall distance between events weighted by the $p_T$ difference of their objects. The factor $ID(i, j)$ is introduced to encode information on the identity of the final state objects, but it implies that Eq. (1) cannot, in general, be interpreted as a distance in the geometric sense. For simplicity we still call it a distance throughout the paper. $ID(i, j)$ consists of a variable scale factor that penalises the distance between two objects if they are of different type, where small-$R$ jets, large-$R$ jets, electrons, muons and MET are considered different types of objects, as defined in Eq. (2). Computing the minimal distance implies minimizing the first term of Eq. (1). We address this by using the Earth Mover's Distance algorithm implemented in the Python Optimal Transport (ot) library [7]. Conceptually, the algorithm computes the minimal cost to transform one event into another.
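The role of the identity factor can be sketched as follows. This is an illustration under explicit assumptions, not the definition used in the paper: we assume that $ID(i, j)$ equals 1 for objects of the same type and a constant `id_scale` otherwise, and that it multiplies the angular distance $\Delta R_{ij}$ in the ground cost. The function names and the event representation (a list of `(type, eta, phi)` tuples) are hypothetical.

```python
import math


def id_factor(type_i: str, type_j: str, id_scale: float = 2.0) -> float:
    """Assumed form of ID(i, j): 1 for equal object types, id_scale otherwise."""
    return 1.0 if type_i == type_j else id_scale


def cost_matrix(event_i, event_j, id_scale: float = 2.0):
    """Ground-cost matrix Delta R_ij * ID(i, j) between two events.

    Each event is a list of (type, eta, phi) tuples (hypothetical layout).
    """
    matrix = []
    for type_i, eta_i, phi_i in event_i:
        row = []
        for type_j, eta_j, phi_j in event_j:
            # Wrap the azimuthal difference into [-pi, pi).
            dphi = (phi_i - phi_j + math.pi) % (2.0 * math.pi) - math.pi
            dr = math.hypot(eta_i - eta_j, dphi)
            row.append(dr * id_factor(type_i, type_j, id_scale))
        matrix.append(row)
    return matrix
```

A matrix of this kind, together with the $p_T$ weights of the two events, is the sort of input that `ot.emd2(a, b, M)` accepts; that function expects the two weight vectors to carry the same total mass. Setting `id_scale=1.0` switches the penalty off.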
The second term of the equation takes into account the difference in total energy $E$ between the events $I$ and $J$. We study four variations of the distance between events, resulting from the combination of two options: including or not the energy term, and applying or not the $ID(i, j)$ scale factor, i.e. setting $ID_{scale} = \{1, 2\}$. Regardless of which of these options is used, distances between events of similar topology or kinematics will be short, while events yielding different final states will be further apart. This suggests that such an approach could help to differentiate between physical processes, providing an additional tool in tasks that demand high discrimination. Its impact could be especially relevant in studies of rare signals, as is often the case in searches for new physics, where the discriminative performance plays a crucial role. We highlight the adaptable nature of the constructed observables: distances can be defined regardless of the event topology, the data filters employed or the channel to be analysed, and are therefore suitable for generic and model-independent searches for new physics and for anomaly detection.
The time performance of the workflow is key to establishing its practicability in a real experimental environment where billions of events need to be processed. In order to extract discriminative information about the events in a sample with $N$ events, we would, in principle, need to compute the distances between all the pairwise combinations of events in the sample, i.e. $N!/(2(N-2)!) = N(N-1)/2$ distances. This is not feasible even when resorting to parallel computing and attaining an average processing time of around 1 ms/distance with the Python multiprocessing module.
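The scaling argument above can be made concrete with a short calculation. The sketch below counts the pairwise distances and converts them into processing time at the quoted rate of about 1 ms per distance; the function names are hypothetical.

```python
def n_pairs(n_events: int) -> int:
    """Number of distinct event pairs, N! / (2 (N - 2)!) = N (N - 1) / 2."""
    return n_events * (n_events - 1) // 2


def processing_days(n_events: int, seconds_per_distance: float = 1e-3) -> float:
    """Total time in days to compute all pairwise distances at a fixed rate."""
    return n_pairs(n_events) * seconds_per_distance / 86400.0
```

For a single sample of 100 k events this already amounts to about $5 \times 10^9$ distances, or roughly 58 days at 1 ms per distance, and the cost grows quadratically with the sample size.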
To overcome this drawback we define event references per process sample that can later be used as the sample representatives to assess how far a given event is from the represented process. For each sample we compute the distances between all its events and then use a clustering technique to capture the structures present in the data, such as different kinematic regimes. We employ the k-Medoids clustering algorithm with the pyclustering Python library [8] and identify the medoid of each cluster, i.e., the central event according to Eq. (1). The medoid approach was used in [1] to visualize subcategories of jets. Here we expand this idea and use the medoids as the event references per process.
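The idea of a medoid as the central event of a cluster can be illustrated with a minimal sketch operating on a precomputed matrix of pairwise distances. It uses the simple alternating variant of k-medoids (assign each event to its nearest medoid, then recompute the medoids) and is not the pyclustering implementation used in the analysis; all function names are hypothetical.

```python
def medoid(dist, members):
    """Member whose summed distance to all other members is smallest."""
    return min(members, key=lambda i: sum(dist[i][j] for j in members))


def assign_to_medoids(dist, medoids):
    """For each event, the position in `medoids` of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda k: dist[i][medoids[k]])
        for i in range(len(dist))
    ]


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate between assignment and medoid update until stable."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign_to_medoids(dist, medoids)
        new_medoids = []
        for k in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            # Keep the old medoid if a cluster ends up empty.
            new_medoids.append(medoid(dist, members) if members else medoids[k])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_to_medoids(dist, medoids)
```

Because only the distance matrix is needed, the same procedure applies to any of the distance variants; the returned medoid indices identify actual events of the sample, which is what allows them to serve as references.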
The number of clusters per sample is optimized using the Silhouette technique implemented in pyclustering [8]. Two clusters were found to be optimal. Fig. 1 shows the distribution of the event distances for a sample of simulated $t\bar{t}Z$ events for all pairwise combinations of events in the sample, for pairwise combinations of events belonging to the same cluster, and between the cluster events and the cluster medoid. The distributions follow a Landau curve, typical of many observables in collider experiments. Events within the first cluster are closer to each other, as indicated by the lower average and standard deviation. The second cluster is composed of events that are farther apart than those in the first cluster but less scattered than in the original distribution, as seen from the lower standard deviation. The distances between the events and the cluster medoids are even shorter, as expected from the k-Medoids clustering.
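The Silhouette criterion used to choose the number of clusters can be sketched directly from a precomputed distance matrix. The following is a minimal illustration of the standard definition, not the pyclustering implementation; the function name is hypothetical and at least two clusters are assumed.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette score for a clustering given pairwise distances.

    For each event, a is the mean distance to the other events of its own
    cluster and b the smallest mean distance to any other cluster; the
    score is (b - a) / max(a, b). Requires at least two clusters.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab_i in enumerate(labels):
        own = clusters[lab_i]
        if len(own) == 1:
            continue  # a singleton contributes a score of 0 by convention
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        b = min(
            sum(dist[i][j] for j in other) / len(other)
            for lab, other in clusters.items()
            if lab != lab_i
        )
        scale = max(a, b)
        if scale > 0:
            total += (b - a) / scale
    return total / len(labels)
```

The number of clusters would then be chosen by repeating the clustering for several values of k and keeping the one with the highest mean score.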

PHYSICS CASE AND DATA SIMULATION
We use simulated samples of proton-proton collision events generated with MADGRAPH5_aMC@NLO 2.6.5 [9] at leading order with a centre-of-mass energy of 13 TeV. The parton showering and hadronisation were performed with Pythia 8.2 [10], using the CMS underlying event tune CUETP8M1 [11] and the NNPDF 2.3 [12] parton distribution functions. The detector simulation employs the Delphes 3 [13] multipurpose detector simulator with the default configuration, corresponding to the parameters of the CMS detector.
The $t\bar{t}Z$ process is used as a benchmark, corresponding to a typical measurement of a rare process at the Large Hadron Collider (LHC). Both the ATLAS and CMS Collaborations have considered trilepton final states for the measurement of the $t\bar{t}Z$ cross-section [14,15] and, therefore, we focus on such topologies. For this we select events with a final state composed of at least three leptons (i.e. electrons or muons) compatible with the $Z \to \ell\ell$ decay and a leptonic top decay. Our main source of background is composed of $t\bar{t}X$ ($X = W, Z, H$), $tX$ ($X = WZ, Zj$) and dibosons ($WZ$ and $ZZ$). In addition, fake leptons arising from the misidentification of jets make $t\bar{t}$+jets and $Z$+jets an additional non-negligible source of background.
In order to increase the efficiency of the trilepton selection and obtain a good statistical representation of the different processes, the individual samples are generated with a dileptonic decay filter. Particle decays are implemented with MadSpin [16,17], a tool for simulating the decays of narrow resonances that preserves the spin information and correctly implements the resulting angular correlations in the decay products.
Around 22 M events were simulated in order to achieve a statistical uncertainty adequate for the analysis of 150 fb$^{-1}$ of data produced by the LHC:
• 100 k for each of the $t\bar{t}Z$, $t\bar{t}W$ and $tX$ ($X = WZ, Zj$) processes;
• 500 k for $t\bar{t}H$ and for each diboson ($WZ$ and $ZZ$) sample;
• 8 M for the $t\bar{t}$+jets process;
• 12 M for $Z$+jets events.
Each process was normalized to the expected yield for the considered benchmark luminosity of 150 fb$^{-1}$, assuming the SM cross-sections computed at leading order with MADGRAPH5.
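The normalization described above amounts to assigning each simulated event a weight so that the weighted sample reproduces the expected yield $\sigma \times L$. A minimal sketch is given below; the function name is hypothetical, the cross-section in the usage note is a made-up number for illustration only, and the efficiency of the generator-level decay filter is assumed to be already included in the cross-section.

```python
def event_weight(cross_section_fb: float,
                 luminosity_fb_inv: float,
                 n_generated: int) -> float:
    """Per-event weight so that the sample sums to sigma * L expected events.

    cross_section_fb is assumed to already include any generator-level
    filter efficiency.
    """
    return cross_section_fb * luminosity_fb_inv / n_generated
```

For example, a hypothetical process with a cross-section of 1000 fb, generated with 100 k events and normalized to 150 fb$^{-1}$, would receive a weight of 1.5 per event.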

EMD AS HIGH-LEVEL FEATURES
In order to study the use of EMD as high-level features, we compute the distances between the events of all generated processes and the two medoids representing each process sample, for each of the four distance options considered: $d(I, J)$, $d(I, J)_{ID}$, $d(I, J)_{\Delta E}$ and the variant combining both the ID scale factor and the energy term. Fig. 2 shows two example distributions of the event distances to a $t\bar{t}Z$ medoid and a $WZ$ medoid. Both the average and the median distance to the $t\bar{t}Z$ and $WZ$ medoids are lower for the $t\bar{t}Z$ and $WZ$ samples, respectively, as expected. Moreover, $ZZ$ and $Z$+jets events are on average close to the $WZ$ medoid, and the $t\bar{t}X$ and $tX$ processes exhibit a short distance from the $t\bar{t}Z$ medoid. This observation provides evidence that the constructed set of distance observables is able to discriminate between event topologies. This conclusion is valid across all distributions of the distance observables and, although definite conclusions would require the detailed detector simulation used by the LHC Collaborations [18,19], the presented results look promising.
To further investigate the potential of the proposed generalization of EMD to distinguish physical processes, we determine the distances of events with respect to each sample's medoids and use them as a discriminant against the process of the medoid event. The corresponding Receiver Operating Characteristic (ROC) curves are shown in Fig. 3 for one example medoid per process. Distances computed with respect to the $t\bar{t}Z$, $t\bar{t}X$, $t\bar{t}$+jets and $tX$ medoids allow the diboson and $Z$+jets processes to be discriminated. Conversely, distances to the diboson and $Z$+jets medoids are sensitive to processes containing top quarks. It is interesting to note that the constructed observable does not allow $Z$+jets to be distinguished from diboson events. With the hardest jets originating from gluon splitting and the jet system recoiling against a dileptonic $Z$, the $Z$+jets events indeed constitute an irreducible background to the diboson signals.
In order to further explore how this technique can be used in the context of High Energy Physics measurements, we selected a set of high-level reconstructed event variables, from which we derive a baseline to assess their discriminant power, as well as to assess how the different distances impact the corresponding separation performance. Following a typical choice of information used in dedicated analyses at the LHC, the selected reconstructed variables used as features are:
• $(p_T, \eta, \phi)$ of the two leptons with the highest $p_T$;
• $(p_T, \eta, \phi, m)$ of the two small-$R$ jets with the highest $p_T$;
• $(b_1, b_2)$, two binary variables indicating whether these jets were tagged as originating from a $b$-quark;
• $(p_T, \eta, \phi, m, \tau_1, ..., \tau_5)$ of the large-$R$ jet with the highest $p_T$;
• small-$R$ jet, electron, muon and large-$R$ jet multiplicities;
• scalar sum of the $p_T$ of all the reconstructed objects, $H_T$;
• missing transverse energy and corresponding $\phi$.
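The use of a distance to a medoid as a one-dimensional discriminant, as in the ROC curves of Fig. 3, can be summarized by the area under the ROC curve (AUC). The sketch below computes it as the probability that a randomly chosen event of the medoid's process lies closer to the medoid than a randomly chosen event of another process. The function names are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(signal_scores, background_scores):
    """AUC as the fraction of (signal, background) pairs ranked correctly.

    Ties count as half a correctly ranked pair.
    """
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(signal_scores) * len(background_scores))


def distance_auc(dist_same_process, dist_other_process):
    """AUC when a short distance to the medoid indicates the medoid's process."""
    # Negate the distances so that a larger score means more signal-like.
    return roc_auc([-d for d in dist_same_process],
                   [-d for d in dist_other_process])
```

A value of 0.5 corresponds to no separation between the two processes, while a value of 1 means every event of the medoid's process is closer to the medoid than every event of the other process.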
With both the event distances and the selected high-level features, we performed an exploratory analysis by embedding the events into a two-dimensional space using UMAP [20], as implemented by [21]. The embeddings for the selected features, for all the event distances, and for the combination of all event distances with the selected features can be seen in Fig. 4. In the first panel we notice how, in a completely unsupervised manner, the embedding of the events through the selected features is able to isolate clusters of events from different samples. The fact that the diboson events appear well separated from those with a $t$-quark suggests that these events are the easiest to classify against the other classes, followed by $t\bar{t}Z$ events, which mostly occupy a single cluster. We also notice that fakes are spread throughout most of the clusters, highlighting the difficulty of isolating them. The middle panel shows the embedding obtained using all the event distances defined above. Here again, we confirm the conclusion drawn in the previous section: these distances convey a notion of continuity from diboson events to $tX$ events. In the third panel we used all the event distances in addition to the selected features. In this case we can identify the same clusters as those appearing in the first panel, but the event distances bring in the notion of continuity between events, continuously connecting some of the clusters.

DEEP LEARNING APPLICATION
Since the event distances, either alone or combined with other high-level features, present a good discriminating power between physics processes, we went a step further and studied how such discrimination compares with the one obtained through advanced machine learning techniques, namely Dense Neural Networks (DNNs). For this, we implemented DNN discriminants to perform the multiclass classification task across the different sample classes (diboson, fakes, $tX$, $t\bar{t}X$ and $t\bar{t}Z$), corresponding to the physics processes defined above.
We used TensorFlow 2.0 [22] through its internal Keras API and followed the same general sequential architecture: an input layer with width matching the number of input features, and a Softmax layer with five units as the output layer. The hyperparameters were fixed using HyperBand [23] as implemented by Keras-Tuner [24] for each set of features. A 1:1:1 train-validation-test split was performed for the whole process and the final results presented here were derived from the test set.
In Fig. 5 we show the confusion matrices for the three combinations of features of Fig. 4, for two operating points. The first operating point (top) is defined by only accepting predictions where the probability of the most likely class is greater than 0.2. This excludes the cases where the DNN cannot differentiate between any of the classes and predicts 0.2 for all five. The second operating point (bottom) is set to 0.9, which only retains very confident predictions. In these confusion matrices we notice that, for the low operating point, adding the event distances to the high-level features has little impact on performance. For the high operating point, the event distances seem to help retain a fair discriminative power for fakes. These operating points are only meant to illustrate the potential of the proposed method, since for each specific analysis they would need to be optimized. A more realistic experimental analysis would also need to take into account the effect of systematic sources of uncertainty in such an optimization.
Next, in Fig. 6, we present in the top panel the areas under the ROC curves for the multiclass discrimination using the different feature combinations and, in the bottom panel, how these compare to the baseline of training a DNN with the selected reconstructed variables only. We see that each distance has discriminative power which depends slightly on the class we are trying to isolate. For example, the task of identifying fakes seems to benefit from the inclusion of the distances that contain the $\Delta E$ contribution, and even more from taking all distances into account. Identifying the remaining classes seems to benefit little or not at all from the inclusion of the different event distances as features.
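The operating-point selection applied to the confusion matrices of Fig. 5 can be sketched as follows: an event enters the matrix only if the probability of its most likely class exceeds the chosen threshold. The function name is hypothetical, and the convention of rows for the true class and columns for the predicted class is an assumption.

```python
def confusion_at_operating_point(probabilities, true_labels, n_classes, threshold):
    """Confusion matrix counting only predictions above the threshold.

    probabilities: per-event lists of class probabilities (e.g. Softmax output).
    Rows index the true class and columns the predicted class (assumed layout).
    """
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for probs, truth in zip(probabilities, true_labels):
        predicted = max(range(n_classes), key=lambda k: probs[k])
        # Events whose best class is not confident enough are discarded.
        if probs[predicted] > threshold:
            matrix[truth][predicted] += 1
    return matrix
```

With five classes, a threshold of 0.2 removes only events for which all classes are predicted as equally likely, whereas a threshold of 0.9 keeps a much smaller and purer subset.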
Finally, in Fig. 7, we show the ROC curves for the task of discriminating fakes from the remaining classes. In this figure we see how different event distances provide different discriminant power for this specific case. We also notice that the combination of all event distances without the selected features performs better than each distance separately. Finally, we observe that the ROC curve for the combination of all event distances with the selected features is the outermost curve for a large portion of the operating points.

CONCLUSIONS
In this paper, the Energy Mover's Distance concept was used to create a new set of observables that could be used in the measurement of rare processes at proton-proton colliders, using $t\bar{t}Z$ as a case study. We have shown that such new observables, which build on the previously proposed concept of EMD, perform well in the task of grouping together different processes based on their topologies, showing a fair discrimination power by themselves. Namely, it can be seen that the distances between $t\bar{t}Z$ and $t\bar{t}X$ are smaller than those to $ZZ$ and $WZ$. This indicates that the EMD-based observables can be useful in the analysis of collider data, providing an interesting way to explore data and classify it in generic classes, which can be matched with significant accuracy to different physics processes.
Additionally, the use of these observables in the training of a DNN was tested. Even if the overall performance of the DNN is not, in general, significantly increased, such observables are interesting in themselves since they provide event-level information which is beneficial for the classification of processes with fake leptons in some scenarios. Furthermore, such event-level observables might be affected differently by systematic uncertainties, a study that is beyond the scope of the current paper and deserves further investigation.