JEDI-net: a jet identification algorithm based on interaction networks

We investigate the performance of a jet identification algorithm based on interaction networks (JEDI-net) to identify all-hadronic decays of high-momentum heavy particles produced at the LHC and distinguish them from ordinary jets originating from the hadronization of quarks and gluons. The jet dynamics are described as a set of one-to-one interactions between the jet constituents. Based on a representation learned from these interactions, the jet is associated with one of the considered categories. Unlike other architectures, the JEDI-net models achieve their performance without special handling of the sparse input jet representation, extensive pre-processing, particle ordering, or specific assumptions regarding the underlying detector geometry. The presented models give better results with fewer model parameters, offering interesting prospects for LHC applications.


Introduction
Jets are collimated cascades of particles produced at particle accelerators. Quarks and gluons originating from hadron collisions, such as the proton-proton collisions at the CERN Large Hadron Collider (LHC), generate a cascade of other particles (mainly other quarks or gluons) that then arrange themselves into hadrons. The stable and unstable hadrons' decay products are observed by large particle detectors, reconstructed by algorithms that combine the information from different detector components, and then clustered into jets, using physics-motivated sequential recombination algorithms such as those described in Refs. [1][2][3]. Jet identification, or tagging, algorithms are designed to identify the nature of the particle that initiated a given cascade, inferring it from the collective features of the particles generated in the cascade.
Traditionally, jet tagging was meant to distinguish three classes of jets: light-flavor quarks (q = u, d, s, c), gluons (g), or bottom quarks (b). At the LHC, due to the large collision energy, new jet topologies emerge. When heavy particles, e.g. W, Z, or Higgs (H) bosons or the top quark, are produced with large momentum and decay to all-quark final states, the resulting jets are contained in a small solid angle. A single jet emerges from the overlap of two (for bosons) or three (for the top quark) jets, as illustrated in Fig. 1. These jets are characterized by a large invariant mass (computed from the sum of the four-momenta of their constituents) and they differ from ordinary quark and gluon jets, due to their peculiar momentum flow around the jet axis.
Several techniques have been proposed to identify these jets by using physics-motivated quantities, collectively referred to as "jet substructure" variables. A review of the different techniques can be found in Ref. [4]. As discussed in the review, approaches based on deep learning (DL) have been extensively investigated (see also Sec. 2), processing sets of physics-motivated quantities with dense layers or raw data representations (e.g. jet images or particle feature lists) with more complex architectures (e.g. convolutional or recurrent networks).
Fig. 1 Pictorial representations of the different jet categories considered in this paper. Left: jets originating from quarks or gluons produce one cluster of particles, approximately cone-shaped, developing along the flight direction of the quark or gluon that started the cascade. Center: when produced with large momentum, a heavy boson decaying to quarks would result in a single jet, made of two particle clusters (usually referred to as prongs). Right: a high-momentum t → Wb → qqb decay chain results in a jet composed of three prongs.
In this work, we compare the typical performance of some of these approaches to what is achievable with a novel jet identification algorithm based on an interaction network (JEDI-net). Interaction networks [5] (INs) were designed to decompose complex systems into distinct objects and relations, and to reason about their interactions and dynamics. One of the first uses of INs was to predict the evolution of physical systems under the influence of internal and external forces, for example, to emulate the effect of gravitational interactions in n-body systems. The n-body system is represented as a set of objects subject to one-on-one interactions: the n bodies are embedded in a graph, and these one-on-one interaction functions, expressed as trainable neural networks, are used to predict the post-interaction status of the n-body system. We study whether this type of network generalizes to a novel context in high energy physics. In particular, we represent a jet as a set of particles, each of which is represented by its momentum and embedded as a vertex in a fully-connected graph. We use neural networks to learn a representation of each one-on-one particle interaction in the jet, which we then use to define jet-related high-level features (HLFs). Based on these features, a classifier associates each jet to one of the five categories shown in Fig. 1.
For comparison, we consider other classifiers based on different architectures: a dense neural network (DNN) [6] receiving a set of jet-substructure quantities, a convolutional neural network (CNN) [7][8][9] receiving an image representation of the transverse momentum (p_T) flow in the jet, and a recurrent neural network (RNN) with gated recurrent units [10] (GRUs), which processes a list of particle features. These models can achieve state-of-the-art performance, although they require additional ingredients: the DNN model requires processing the constituent particles to pre-compute HLFs, the GRU model assumes an ordering criterion for the input particle feature list, and the CNN model requires representing the jet as a rectangular, regular, pixelated image. Any of these aspects can be handled in a reasonable way (e.g. one can use a jet clustering metric to order the particles), sometimes sacrificing some detector performance (e.g., with coarser image pixels than the realistic tracking angular resolution, in the case of many CNN-based models). It is then worth exploring alternative solutions that could reach state-of-the-art performance without making these assumptions. In particular, it is interesting to consider architectures that directly take the jet constituents as input and are invariant under their permutation. This motivated the study of jet taggers based on recursive [11], graph [12,13], and energy flow networks [14]. In this context, we aim to investigate the potential of INs.
This paper is structured as follows: we provide a list of related works in Sec. 2. In Sec. 3, we describe the utilized data set. The structure of the JEDI-net model is discussed in Sec. 4, together with the alternative architectures considered for comparison. Results are shown in Sec. 5. Sections 6 and 7 discuss what JEDI-net learns when processing the graph and quantify the amount of resources needed by the tagger, respectively. We conclude with a discussion and outlook for this work in Sec. 8. Appendix A describes the design and optimization of the alternative models.

Related work
Jet tagging is one of the most popular LHC-related tasks to which DL solutions have been applied. Several classification algorithms have been studied in the context of jet tagging at the LHC [15][16][17][18][19][20][21][22] using DNNs, CNNs, or physics-inspired architectures. Recurrent and recursive layers have been used to construct jet classifiers starting from a list of reconstructed particle momenta [11][12][13]. Recently, these different approaches, applied to the specific case of top quark jet identification, have been compared in Ref. [23]. While many of these studies focus on data analysis, work is underway to apply these algorithms in the early stages of LHC real-time event processing, i.e. the trigger system. For example, Ref. [24] focuses on converting these models into firmware for field programmable gate arrays (FPGAs) optimized for low latency (less than 1 µs). If successful, such a program could allow for a more resource-efficient and effective event selection for future LHC runs.
Graph neural networks have also been considered as jet tagging algorithms [25,26], as a way to circumvent the sparsity of image-based representations of jets. These approaches demonstrate remarkable categorization performance. Motivated by the early results of Ref. [25], graph networks have also been applied to other high energy physics tasks, such as event topology classification [27,28], particle tracking in a collider detector [29], pileup subtraction at the LHC [30], and particle reconstruction in irregular calorimeters [31].

Data set description
This study is based on a data set consisting of simulated jets with transverse momentum p_T ≈ 1 TeV, originating from light quarks q, gluons g, W and Z bosons, and top quarks produced in √s = 13 TeV proton-proton collisions. The data set was created using the configuration and parametric description of an LHC detector described in Refs. [24,32], and is available on the Zenodo platform [33][34][35][36].
Jets are clustered from individual reconstructed particles, using the anti-k_T algorithm [3,37] with jet-size parameter R = 0.8. Three different jet representations are considered:

- A list of 16 HLFs, described in Ref. [24], given as input to a DNN. The 16 distributions are shown in Fig. 2 for the five jet classes.
- An image representation of the jet, derived by considering a square with pseudorapidity and azimuthal distances ∆η = ∆φ = 2R, centered along the jet axis. The image is binned into 100 × 100 pixels. Such a pixel size is comparable to the cell of a typical LHC electromagnetic calorimeter, but much coarser than the typical angular resolution of a tracking device for the p_T values relevant to this task. Each pixel is filled with the scalar sum of the p_T of the particles in that region. These images are obtained by considering the 150 highest-p_T constituents for each jet. This jet representation is used to train a CNN classifier. The average jet images for the five jet classes are shown in Fig. 3. For comparison, a randomly chosen set of images is shown in Fig. 4.
- A constituent list for up to 150 particles, in which each particle is represented by 16 features, computed from the particle four-momenta: the three Cartesian coordinates of the momentum (p_x, p_y, and p_z), the absolute energy E, p_T, the pseudorapidity η, the azimuthal angle φ, the distance ∆R = √(∆η² + ∆φ²) from the jet center, the relative energy E^rel = E^particle/E^jet and relative transverse momentum p_T^rel = p_T^particle/p_T^jet defined as the ratio of the particle quantity and the jet quantity, the relative coordinates η^rel = η^particle − η^jet and φ^rel = φ^particle − φ^jet defined with respect to the jet axis, cos θ and cos θ^rel where θ^rel = θ^particle − θ^jet is defined with respect to the jet axis, and the relative η and φ coordinates of the particle after applying a proper Lorentz transformation (rotation), as described in Ref. [38]. Whenever fewer than 150 particles are reconstructed, the list is filled with zeros. The distributions of these features, considering the 150 highest-p_T particles in the jet, are shown in Fig. 5 for the five jet categories. This jet representation is used for a RNN with a GRU layer and for JEDI-net.

Fig. 2 Distributions of the 16 high-level features used in this study, described in Ref. [24].
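The image representation described above (p_T-weighted two-dimensional binning around the jet axis) can be sketched as follows. This is a simplified illustration, not the pipeline used in the paper; the function and argument names are our own, and for brevity the azimuthal distance is not wrapped around ±π.

```python
import numpy as np

def jet_image(eta, phi, pt, jet_eta, jet_phi, R=0.8, n_pix=100):
    """Bin constituent pT into an n_pix x n_pix image covering a
    (2R x 2R) window in (eta, phi), centered on the jet axis."""
    d_eta = np.asarray(eta) - jet_eta
    d_phi = np.asarray(phi) - jet_phi   # note: no wrap-around at +-pi here
    img, _, _ = np.histogram2d(
        d_eta, d_phi,
        bins=n_pix,
        range=[[-R, R], [-R, R]],
        weights=np.asarray(pt),         # each pixel: scalar sum of particle pT
    )
    return img

# Two toy constituents of a jet centered at (eta, phi) = (0, 0)
img = jet_image([0.1, -0.1], [0.2, 0.0], [10.0, 5.0], 0.0, 0.0)
```

Averaging such images over many jets of a given class would produce pictures analogous to Fig. 3.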

JEDI-net
In this work, we apply an IN [5] architecture to learn a representation of a given input graph (the set of constituents in a jet) and use it to accomplish a classification task (tagging the jet). One can see the IN architecture as a processing algorithm that learns a new representation of the initial input. This is done by replacing the set of input features, describing each individual vertex of the graph, with a set of engineered features, specific to each vertex but whose values depend on the connections between the vertices in the graph.
The starting point consists of building a graph for each input jet. The N_O particles in the jet are represented by the vertices of the graph, fully interconnected through directional edges, for a total of N_E = N_O(N_O − 1) edges. An example is shown in Fig. 6 for the case of a three-vertex graph. The vertices and edges are labeled for practical reasons, but the network architecture ensures that the labeling convention plays no role in creating the new representation.
Once the graph is built, a receiving matrix (R_R) and a sending matrix (R_S) are defined. Both matrices have dimensions N_O × N_E. The element (R_R)_ij is set to 1 when the i-th vertex receives the j-th edge and is 0 otherwise. Similarly, the element (R_S)_ij is set to 1 when the i-th vertex sends the j-th edge and is 0 otherwise. In the case of the graph of Fig. 6, each matrix has dimensions 3 × 6, with exactly one nonzero entry per column, since each edge has exactly one receiving and one sending vertex. The input particle features are represented by an input matrix I. Each column of the matrix corresponds to one of the graph vertices, while the rows correspond to the P features used to represent each vertex. In our case, the vertices are the particles inside the jet, each represented by its array of features (i.e., the 16 features shown in Fig. 5). Therefore, the I matrix has dimensions P × N_O.
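As a concrete illustration, the R_R and R_S matrices for a fully connected directed graph can be built as follows (a minimal NumPy sketch; the edge ordering chosen here is arbitrary, which, as noted above, does not affect the learned representation):

```python
import numpy as np

def adjacency_matrices(n_o):
    """Build the receiving (R_R) and sending (R_S) matrices for a
    fully connected directed graph with n_o vertices and
    n_e = n_o * (n_o - 1) edges (no self-loops)."""
    edges = [(r, s) for r in range(n_o) for s in range(n_o) if r != s]
    n_e = len(edges)
    r_r = np.zeros((n_o, n_e))
    r_s = np.zeros((n_o, n_e))
    for j, (recv, send) in enumerate(edges):
        r_r[recv, j] = 1.0   # vertex `recv` receives edge j
        r_s[send, j] = 1.0   # vertex `send` sends edge j
    return r_r, r_s

# The three-vertex example of Fig. 6: two 3 x 6 matrices
r_r, r_s = adjacency_matrices(3)
```

Each column of either matrix contains exactly one nonzero entry, since every edge has a single sender and a single receiver.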
The I matrix is processed by the IN in a series of steps, represented in Fig. 7. The I matrix is multiplied by the R_R and R_S matrices and the two resulting matrices are then concatenated to form the B matrix, having dimension 2P × N_E. Each column of the B matrix represents an edge, i.e. a particle-to-particle interaction. The 2P elements of each column are the features of the sending and receiving vertices for that edge. Using this information, a D_E-dimensional hidden representation of the interaction edge is created through a trainable function f_R: R^{2P} → R^{D_E}. This gives a matrix E with dimensions D_E × N_E. The cumulative effects of the interactions received by a given vertex are gathered by summing the D_E hidden features over the edges arriving at it. This is done by computing Ē = E R_R^T, with dimensions D_E × N_O, which is then appended to the initial input matrix I, forming the matrix C. At this stage, each column of the C matrix represents a constituent in the jet, expressed as a (P + D_E)-dimensional feature vector, containing the P input features and the D_E hidden features representing the combined effect of the interactions with all the connected particles. A trainable function f_O: R^{P+D_E} → R^{D_O} is then applied to each column of C, building a post-interaction representation of each jet constituent; the result is the O matrix, with dimensions D_O × N_O.
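The full chain of operations can be sketched in NumPy. Here f_R and f_O are stand-ins (single random ReLU layers) for the trained two-hidden-layer networks, and the toy dimensions are illustrative assumptions, not the optimized values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: P features, N_O particles, hidden sizes D_E, D_O
P, N_O, D_E, D_O = 16, 4, 8, 6

def mlp(in_dim, out_dim):
    """A one-layer random ReLU network as a stand-in for f_R / f_O."""
    w = rng.normal(size=(out_dim, in_dim))
    return lambda x: np.maximum(0.0, w @ x)

f_r = mlp(2 * P, D_E)
f_o = mlp(P + D_E, D_O)

# Fully connected directed graph: N_E = N_O * (N_O - 1) edges
edge_list = [(r, s) for r in range(N_O) for s in range(N_O) if r != s]
N_E = len(edge_list)
R_R = np.zeros((N_O, N_E)); R_S = np.zeros((N_O, N_E))
for j, (r, s) in enumerate(edge_list):
    R_R[r, j] = R_S[s, j] = 1.0

I = rng.normal(size=(P, N_O))         # input particle features
B = np.vstack([I @ R_R, I @ R_S])     # 2P x N_E edge features
E = f_r(B)                            # D_E x N_E hidden edge features
E_bar = E @ R_R.T                     # D_E x N_O, summed over incoming edges
C = np.vstack([I, E_bar])             # (P + D_E) x N_O
O = f_o(C)                            # D_O x N_O post-interaction features
O_bar = O.sum(axis=1)                 # D_O summed features (one readout option)
```

The dimensions of each intermediate matrix match those stated in the text; only the content of f_R and f_O differs from the trained model.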


A final classifier φ_C takes as input the elements of the O matrix and returns the probability for that jet to belong to each of the five categories. This is done in two ways: (i) in one case, we define summed quantities Ō_i, aggregating each row of the O matrix across the graph vertices, as detailed below; (ii) alternatively, the full O matrix is flattened and used directly. Among the hyperparameters scanned in the Bayesian optimization are:
- The activation function for the hidden and output layers of the f_R network: ReLU [42], ELU [43], or SELU [44].
- The activation function for the hidden and output layers of the f_O network: ReLU, ELU, or SELU.
In addition, the output neurons of the φ_C network are activated by a softmax function. A learning rate of 10^-4 is used. For a given network architecture, the network parameters are optimized by minimizing the categorical cross entropy. The Bayesian optimization is repeated four times: in each case, the input particles are ordered by descending p_T value and the first 30, 50, 100, or 150 particles are considered. The parameter optimization is performed on the training data set, while the loss for the Bayesian optimization is estimated on the validation data set.
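The softmax activation and the categorical cross-entropy loss minimized during training can be written explicitly. This is a minimal NumPy sketch of the two definitions, not the training code used in the paper:

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over class scores."""
    z = z - z.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class."""
    n = len(labels)
    return -np.mean(np.log(probs[np.arange(n), labels]))

# Toy batch: raw scores for the five jet categories and true labels
scores = np.array([[2.0, 0.5, 0.1, -1.0, 0.3],
                   [0.2, 3.0, -0.5, 0.0, 1.0]])
probs = softmax(scores)
loss = categorical_cross_entropy(probs, np.array([0, 1]))
```

In practice, deep learning frameworks fuse the two operations for stability; the definitions are the same.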
Tables 2 and 1 summarize the result of the Bayesian optimization for the JEDI-net architecture with and without the sum over the columns of the O matrix, respectively. The best result of each case, highlighted in bold, is used as a reference for the rest of the paper. For comparison, three alternative models are trained on the three different representations of the same data set described in Sec. 3: a DNN model taking as input a list of HLFs, a CNN model processing jet images, and a recurrent model applying GRUs to the same input list used for JEDI-net. The three benchmark models are optimized through a Bayesian optimization procedure, as done for the INs. Details of these optimizations and the resulting best models are discussed in Appendix A.

Results
Figure 8 shows the receiver operating characteristic (ROC) curves obtained for the optimized JEDI-net tagger in each of the five jet categories, compared to the corresponding curves for the DNN, CNN, and GRU alternative models. The curves are derived by fixing the network architectures to the optimal values based on Table 2 and App. A and performing a k-fold cross-validation training, with k = 10. The solid lines represent the average ROC curve, while the shaded bands quantify the ±1 RMS dispersion. The area under the curve (AUC) values, reported in the figure, allow for a comparison of the performance of the different taggers.
The tagging performance is quantified by computing the true positive rate (TPR) values for two reference false positive rate (FPR) values (10% and 1%). The comparison of the TPR values gives an assessment of the tagging performance in a realistic use case, typical of an LHC analysis. Table 3 shows the corresponding TPR values for the optimized JEDI-net taggers, compared to the corresponding values for the benchmark models. The largest TPR value for each class is highlighted in bold. As shown in Fig. 8 and Table 3, the two JEDI-net models outperform the other architectures in almost all cases. The only notable exception is the tight working point of the top-jet tagger, for which the DNN model gives a TPR higher by about 2%, while the CNN and GRU models give much worse performance.
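Extracting the TPR at a fixed FPR working point amounts to choosing the score threshold at the corresponding background quantile. A minimal sketch, on toy classifier outputs (the Gaussian scores are illustrative, not the models' actual outputs):

```python
import numpy as np

def tpr_at_fpr(scores_sig, scores_bkg, target_fpr):
    """True positive rate at the threshold giving the requested
    false positive rate (e.g. the 10% or 1% working points)."""
    # Threshold = the (1 - target_fpr) quantile of the background scores
    thr = np.quantile(scores_bkg, 1.0 - target_fpr)
    return float(np.mean(scores_sig > thr))

rng = np.random.default_rng(1)
sig = rng.normal(1.0, 1.0, 10000)   # toy classifier scores for signal jets
bkg = rng.normal(0.0, 1.0, 10000)   # toy scores for background jets
loose = tpr_at_fpr(sig, bkg, 0.10)  # TPR at FPR = 10%
tight = tpr_at_fpr(sig, bkg, 0.01)  # TPR at FPR = 1%
```

Scanning the threshold over all values traces out the full ROC curve of Fig. 8.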
The TPR values for the two JEDI-net models agree within 1%. The only exception is observed for the tight working points of the W and Z taggers, for which the model using the Ō sums shows a drop in TPR of ∼4%. Despite this small TPR loss, the model using the summed Ō features is preferable, given its reduced complexity (see Section 7) and its independence of the labeling convention for the particles embedded in the graph and for the edges connecting them.
What did JEDI-net learn?
In order to characterize the information learned by JEDI-net, we consider the Ō sums across the N_O vertices of the graph (see Section 4) and we study their correlations to physics-motivated quantities, typically used when exploiting jet substructure in a search. We consider the HLF quantities used for the DNN model and the N-subjettiness variables. Not all the Ō sums exhibit an obvious correlation with the considered quantities, i.e., the network engineers high-level features that encode other information than what is used, for instance, in the DNN model.

Table 3 True positive rates (TPR) for the optimized JEDI-net taggers and the three alternative models (DNN, CNN, and GRU), corresponding to a false positive rate (FPR) of 10% (top) and 1% (bottom). The largest TPR value for each case is highlighted in bold.
Nevertheless, some interesting correlation patterns between the physics-motivated quantities and the Ō_i sums are observed. The most relevant examples are given in Fig. 9, where the 2D histograms and the corresponding linear correlation coefficients (ρ) are shown. The correlation between Ō_1 and the particle multiplicity in the jet is not completely unexpected: as long as the O quantities aggregated across the graph have the same order of magnitude, the corresponding sum Ō_1 is proportional to the jet-constituent multiplicity.
The strong correlation between Ō_4 and τ_1^(β=2) (with ρ values between 0.69 and 0.97, depending on the jet class) is much less expected. The τ_1^(β) quantities assume small values when the jet constituents can be arranged into a single sub-jet inside the jet. By aggregating information from the constituent momenta across the jet, the JEDI-net model based on the Ō quantities learns to build a quantity very close to τ_1^(β=2). The other Ō sums considered are correlated with the corresponding substructure quantities, but with smaller correlation coefficients (between 0.48 and 0.77).
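The linear correlation coefficient quoted here is the standard Pearson ρ between an engineered feature and a physics-motivated quantity. A minimal sketch on toy data (the linear-plus-noise relation below is an assumption for illustration, not the actual Ō_4 vs. τ_1 dependence):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins: a "substructure variable" and an engineered feature
# that is linearly related to it up to Gaussian noise
tau = rng.uniform(0.0, 1.0, 5000)
o_bar = 2.0 * tau + rng.normal(0.0, 0.2, 5000)

# Pearson linear correlation coefficient, as quoted in Fig. 9
rho = np.corrcoef(o_bar, tau)[0, 1]
```

A 2D histogram of the two variables, as in Fig. 9, shows the same relation visually; ρ summarizes it in one number.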

Resource comparison
Table 4 shows a comparison of the computational resources needed by the different models discussed in this paper. The best-performing JEDI-net model has more than twice the number of trainable parameters of the DNN and GRU models, but approximately a factor of 6 fewer parameters than the CNN model. The JEDI-net model based on the summed Ō features achieves comparable performance with about a factor of 4 fewer parameters, less than the DNN and GRU models. While far from expensive in terms of number of parameters, the JEDI-net models are expensive in terms of the number of floating point operations (FLOP): the simpler model based on Ō sums, using as input a sequence of 150 particles, requires 458 MFLOP. The cost is mainly due to the scaling with the number of vertices in the graph. Many of these operations are the ×0 and ×1 products involving the elements of the R_R and R_S matrices; their cost could be reduced with an IN implementation optimized for inference, e.g., through an efficient sparse-matrix representation.
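The scaling with the number of vertices can be made explicit: the number of directed edges, and hence the number of f_R evaluations, grows quadratically with the number of constituents.

```python
def edge_count(n_o):
    """Number of directed edges (f_R evaluations) in the fully
    connected graph over n_o jet constituents."""
    return n_o * (n_o - 1)

# Going from 30 to 150 constituents multiplies the edge count by ~26
print(edge_count(30), edge_count(150))  # 870 22350
```

This is why truncating the constituent list, as suggested below, directly reduces the FLOP budget.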
In addition, we quote the average inference time on a GPU, measured by applying the model to 1000 events as part of a Python application based on TensorFlow [48]. To this end, the JEDI-net models, implemented and trained in PyTorch, are exported to ONNX [49] and then loaded as a TensorFlow graph. The quoted time includes loading the data, which occurs at the first inference and differs between event representations; it is smaller for the JEDI-net models than for the CNN model. The GPU used is an NVIDIA GTX 1080 with 8 GB of memory, mounted on a commercial desktop with an Intel Xeon CPU operating at a frequency of 2.60 GHz. The tests were executed in Python 3.7, with no other concurrent process running on the machine. Given the larger number of operations, the GPU inference time for the two IN models is much larger than for the other models.
The current IN algorithm is costly to deploy in the online selection environment of a typical LHC experiment.A dedicated R&D effort is needed to reduce the resource consumption in a realistic environment in order to benefit from the improved accuracy that INs can achieve.For example, one could trade model accuracy for reduced resource needs by applying neural network pruning [50,51], reducing the numerical precision [52,53], and limiting the maximum number of particles in each jet representation.

Conclusions
This paper presents JEDI-net, a jet tagging algorithm based on interaction networks. Applied to a data set of jets from light-flavor quarks, gluons, vector bosons, and top quarks, this algorithm achieves better performance than models based on dense, convolutional, and recurrent neural networks, trained and optimized with the same procedure on the same data set. Like other graph networks, JEDI-net offers several practical advantages that make it particularly suitable for deployment in the data-processing workflows of LHC experiments: it can directly process the list of jet constituent features (e.g. particle four-momenta), it does not assume specific properties of the underlying detector geometry, and it is insensitive to any ordering principle applied to the input jet constituents. For these reasons, the implementation of this and other graph networks is an interesting prospect for future runs of the LHC. On the other hand, the current implementation of this model demands large computational resources and a large inference time, which makes the use of these models problematic for real-time selection and calls for a dedicated program to optimize the model deployment on typical L1 and HLT environments.
The quantities engineered by one of the trained IN models exhibit interesting correlation patterns with some of the jet substructure quantities proposed in the literature, showing that the model is capable of learning some of the relevant physics in the problem. On the other hand, some of the engineered quantities do not exhibit striking correlation patterns, implying the possibility of a nontrivial insight to be gained by studying these quantities.
The three benchmark models are optimized with the GpyOpt library [40], based on Gpy [41]. For each iteration, the training is performed using early stopping to prevent over-fitting and to allow a fair comparison between different configurations. The data set for training (validation) consists of 630,000 (240,000) jets, with 10,000 jets used for testing purposes. The loss for the Bayesian optimization is estimated on the validation data set. The CNN and GRU networks are trained on four different input data sets, obtained considering the first 30, 50, 100, or 150 highest-p_T jet constituents. The DNN model is trained on quantities computed from the full list of particles.
The DNN model consists of a multilayer perceptron, alternating dense and dropout layers. The optimal architecture is determined by optimizing the following hyperparameters:
- Number of dense layers (N_DL), between 1 and 3.
- Optimization algorithm: Adam, Nadam [54], or AdaDelta.

The optimization process gives as output an optimal architecture with three hidden layers of 80 neurons each, activated by ELU functions. The best dropout rate is found to be 0.11, with a batch size of 50 and the Adam optimizer. This optimized network gives a loss of 0.66 and an accuracy of 0.76.
The CNN model consists of two-dimensional convolutional layers with batch normalization, followed by a set of dense layers. A 2 × 2 max pooling layer is applied after the first convolutional layer. The optimal architecture is derived by optimizing the following hyperparameters:
- Number of convolutional layers N_CL, between 1 and 3.
- Number of convolutional filters n_f in each layer (10, 15, 20, 25, or 30).
- Optimization algorithm: Adam, Nadam, or AdaDelta.

The stride of the convolutional filters is fixed to 1 and "same" padding is used. Table 5 shows the optimal sets of hyperparameter values, obtained for the four different data set representations. While the optimal networks are equivalent in performance, we select the network obtained for ≤ 50 constituents, because it has the smallest number of parameters.
The recurrent model consists of a GRU layer feeding a set of dense layers. The following hyperparameters are considered:
- Number of GRU units: 50, 100, 200, 300, 400, or 500.

Table 7 The optimized hyperparameters, number of trainable parameters, and performance metrics of the JEDI-net models on the top tagging data set. Performance metrics are evaluated on the test sample. We quote the area under the ROC curve (AUC), the accuracy, and the background rejection at a signal efficiency of 30%.

Fig. 3 Average 100 × 100 images for the five jet classes considered in this study: q (top left), g (top center), W (top right), Z (bottom left), and top jets (bottom right). The temperature map represents the amount of p_T collected in each cell of the image, measured in GeV and computed from the scalar sum of the p_T of the particles pointing to each cell.

Fig. 5 Distributions of kinematic features described in the text for the 150 highest-p T particles in each jet.

Fig. 6 An example graph with three fully connected vertices and the corresponding six edges.


Fig. 7 A flowchart illustrating the interaction network scheme.

Ō_i = Σ_j O_ij ,

where j is the index of the vertex in the graph (the particle, in our case), and the index i ∈ [0, D_O] runs across the D_O outputs of the f_O function. The Ō quantities are used as input to φ_C: R^{D_O} → R^N. This choice preserves the independence of the architecture from the labeling convention adopted to build the I, R_R, and R_S matrices, at the cost of losing some discriminating information in the summation. (ii) Alternatively, the input to φ_C is defined directly from the D_O × N_O elements of the O matrix, flattened into a one-dimensional array. The full information from O is preserved, but φ_C then assumes an ordering of the N_O input objects; in our case, we rank the input particles in descending order by p_T.

The trainable functions f_O, f_R, and φ_C consist of three DNNs, each with two hidden layers, the first (second) having N_n^1 (N_n^2 = N_n^1/2) neurons. The model is implemented in PyTorch [39] and trained using an NVIDIA GTX 1080 GPU. The training (validation) data set consists of 630,000 (240,000) examples, while 10,000 events are used for testing purposes. The architecture of the three trainable functions is determined by minimizing the loss function through a Bayesian optimization, using the GpyOpt library [40], based on Gpy [41]. We consider the following hyperparameters:
- The number of output neurons of the f_R network, D_E (between 4 and 14).
- The number of output neurons of the f_O network, D_O (between 4 and 14).
- The number of neurons N_n^1 in the first hidden layer of the f_O, f_R, and φ_C networks (between 5 and 50).
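The difference between the two readout options can be demonstrated directly: summing over the graph vertices is invariant under any permutation of the particles, while flattening the O matrix is not. A small NumPy sketch on a toy O matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
D_O, N_O = 6, 4
O = rng.normal(size=(D_O, N_O))   # toy post-interaction matrix

# Option (i): sum over graph vertices -> permutation-invariant input
o_sum = O.sum(axis=1)             # shape (D_O,)

# Option (ii): flatten the full matrix -> ordering-dependent input
o_flat = O.flatten()              # shape (D_O * N_O,)

# Permuting the particles changes option (ii) but leaves option (i) unchanged
perm = rng.permutation(N_O)
assert np.allclose(O[:, perm].sum(axis=1), o_sum)
```

This is why option (i) needs no particle-ordering convention, at the price of discarding the per-particle detail that option (ii) retains.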
two rows of Fig. 9 show two intermediate cases: the correlation between Ō_2 and τ

Table 1
Optimal JEDI-net hyperparameter setting for different input data sets, when the summed O i quantities are given as input to the φ C network.The best result, obtained when considering up to 150 particles per jet, is highlighted in bold.

Table 2
Optimal JEDI-net hyperparameter setting for different input data sets, when all the O ij elements are given as input to the φ C network.The best result, obtained when considering up to 100 particles per jet, is highlighted in bold.

Table 4
Resource comparison across models. The quoted number of parameters refers only to the trainable parameters for each model. The average inference time on a GPU is measured by applying the model to batches of 1000 events 100 times: the median (50% quantile) is quoted as the central value and the 10%-90% semi-distance as the uncertainty. The GPU used is an NVIDIA GTX 1080 with 8 GB of memory, mounted on a commercial desktop with an Intel Xeon CPU operating at a frequency of 2.60 GHz. The tests were executed in Python 3.7 with no other concurrent process running on the machine.

Table 6
As for the CNN model, the best performance is obtained when the list of input particles is truncated at 50 elements.