Jet tagging in the Lund plane with graph networks

The identification of boosted heavy particles such as top quarks or vector bosons is one of the key problems arising in experimental studies at the Large Hadron Collider. In this article, we introduce LundNet, a novel jet tagging method which relies on graph neural networks and an efficient description of the radiation patterns within a jet to optimally disentangle signatures of boosted objects from background events. We apply this framework to a number of different benchmarks, showing significantly improved performance for top tagging compared to existing state-of-the-art algorithms. We study the robustness of the LundNet taggers to non-perturbative and detector effects, and show how kinematic cuts in the Lund plane can mitigate overfitting of the neural network to model-dependent contributions. Finally, we consider the computational complexity of this method and its scaling as a function of kinematic Lund plane cuts, showing an order of magnitude improvement in speed over previous graph-based taggers.


Introduction
As the Large Hadron Collider continues to explore proton collisions at the energy frontier, an important task to search for signals of new physics beyond the Standard Model (SM) is the identification of heavy particles at the electroweak scale, which might emerge from decays of yet unknown heavier particles.
An entity of particular interest in this quest is the jet, which is essentially defined as a collimated bunch of particles with a certain energy and direction, typically determined using a sequential recombination algorithm [1]. One important problem is that electroweak-scale particles such as vector bosons or top quarks produced from yet heavier new states can become sufficiently boosted that their hadronic decays are reconstructed as single jets. It is therefore crucial to have efficient tools to probe the radiation patterns within jets and determine their physical origin. This topic has been the focus of much attention over the past decade, with a range of approaches being developed to extract information from a jet's substructure [2][3][4][5][6]. In recent years, a new generation of tools based on deep learning models has emerged, which can achieve very high performance on specific benchmarks [7][8][9][10][11][12][13][14][15][16][17][18] and provide some insights into which kinematic variables drive the discrimination performance [19][20][21][22][23][24][25][26][27][28]. A limitation of such deep learning-based methods is the difficulty of estimating their uncertainties, as well as their tendency to rely on unphysical features present in the training data to achieve their high performance, as this data is generally derived from Monte Carlo simulations of proton collisions.
Table 1. Benchmarks of several jet tagging algorithms for a range of processes. The first column gives the area under the ROC curve, and the latter two show the background rejection at two different signal efficiencies, 50% and 70% respectively. In each case, larger values indicate better performance.
In this article, we introduce a novel method to identify jets using graph networks. To this end, we represent jets through their so-called Lund plane, associating each Lund declustering with a node on the graph. Compared with other state-of-the-art tools, our new method shows improved performance, notably for processes with complicated topologies such as top decays, while requiring substantially less training time. We will also investigate the robustness of our new tagger, and show how kinematic cuts on the Lund variables can mitigate overfitting to the model-dependent effects of Monte Carlo simulations, reducing the reliance of the neural network on non-perturbative contributions.
We provide a brief review of the Lund plane for jet physics in section 2, and describe the LundNet model in section 3. Results for a range of benchmarks are described in section 4, of which a summary is given in table 1. The robustness and computational complexity of the models are explored in section 5. Finally, we offer our conclusions in section 6.

Jets in the Lund plane
The models we introduce in this article rely on the Lund plane [29]. This representation provides a useful mapping of the emission phase space to a two-dimensional plane representing the angle and transverse momentum of a given emission with respect to its emitter, and is often used in discussions of resummations of large logarithms in perturbation theory or of Monte Carlo parton showers. Each emission then creates an additional triangular leaf corresponding to the phase space for further emissions. It was shown in recent work that the Lund plane provides a useful basis to achieve an efficient description of the clustering sequence of a jet, containing a rich set of information about its substructure, with notable potential for jet tagging [30]. The Lund jet plane allows for a visual representation of the clustering history of a jet. This systematic encoding of a jet's radiation patterns can be measured experimentally [31], allowing for comparisons between theoretical predictions and experimental data [32] and with potential for constraining general purpose Monte Carlo event generators [33].
The Lund plane is obtained by first reclustering a jet's constituents with the Cambridge/Aachen (CA) algorithm [34,35], which sequentially identifies and combines the pair of particles $a$ and $b$ closest in rapidity $y$, a measure of relativistic velocity along the beam axis, and azimuthal angle $\phi$ around the same axis, i.e. minimising $\Delta^2 = (y_a - y_b)^2 + (\phi_a - \phi_b)^2$. We then iterate over this clustering sequence, starting from the full jet and proceeding by:
1. Declustering the current (pseudo)jet into two transverse momentum ordered pseudojets $a$ and $b$ such that $p_{t,a} > p_{t,b}$, and where we consider $b$ to be the emission of the $(a+b)$ emitter.
2. Determining a number of kinematic variables associated with the declustering step $i$, which we denote as a tuple
$$T^{(i)} = \{k_t, \Delta, z, m, \psi\}. \qquad (2.1)$$
Here $k_t = p_{t,b}\,\Delta$ is the transverse momentum of emission $b$ with respect to its emitter in the limit where $p_{t,b} \ll p_{t,a}$, $\Delta$ is the previously defined rapidity-azimuth distance, $z = p_{t,b}/(p_{t,a} + p_{t,b})$ is the momentum fraction of the softer subjet $b$, $m$ is the invariant mass of the $(a+b)$ pair, and $\psi = \tan^{-1}\frac{y_b - y_a}{\phi_b - \phi_a}$ is the azimuthal angle around subjet $a$'s axis.
3. Repeating this procedure for pseudojets $a$ and $b$ if they contain more than one particle.
This procedure produces a binary Lund tree with a tuple of variables $T^{(i)}$ for each node $i$ of the Lund tree, as shown in figure 1. The first two elements of the tuple provide the coordinates in the Lund plane of the corresponding splitting, and the remaining ones provide complementary kinematic information. A subset of this tree of particular significance is the primary list of tuples $L_{\mathrm{primary}}$ containing the kinematic variables of each splitting along the primary branch of the tree, i.e. following only the pseudojet with larger transverse momentum in step 3 of the algorithm above, corresponding to points on the blue primary plane in figure 1. The primary Lund sequence can be used notably for two-dimensional visual representations of the radiation patterns in a jet [30,31,36].
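The kinematic variables of a single declustering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the results in this article: the function names, the `(pt, y, phi, m)` tuple layout of a pseudojet, the wrapping of the azimuthal difference into $(-\pi, \pi]$ and the use of a quadrant-aware `atan2` for $\psi$ are all choices made here for concreteness.

```python
import math

def four_momentum(pt, y, phi, m=0.0):
    """Return (E, px, py, pz) from transverse momentum, rapidity, azimuth and mass."""
    mt = math.sqrt(pt * pt + m * m)  # transverse mass
    return (mt * math.cosh(y), pt * math.cos(phi), pt * math.sin(phi), mt * math.sinh(y))

def wrapped_dphi(phi_b, phi_a):
    """Azimuthal difference phi_b - phi_a wrapped into (-pi, pi]."""
    d = (phi_b - phi_a) % (2.0 * math.pi)
    return d - 2.0 * math.pi if d > math.pi else d

def lund_tuple(a, b):
    """Lund variables (kt, Delta, z, m, psi) for one declustering.

    a and b are hypothetical pseudojet tuples (pt, y, phi, m); the harder
    one is treated as the emitter-like branch a, the softer one as emission b.
    """
    if b[0] > a[0]:
        a, b = b, a  # enforce pt_a > pt_b
    dy = b[1] - a[1]
    dphi = wrapped_dphi(b[2], a[2])
    delta = math.hypot(dy, dphi)          # rapidity-azimuth distance
    kt = b[0] * delta                     # kt = pt_b * Delta
    z = b[0] / (a[0] + b[0])              # momentum fraction of the softer branch
    e, px, py, pz = (ca + cb for ca, cb in zip(four_momentum(*a), four_momentum(*b)))
    mass = math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))
    psi = math.atan2(dy, dphi)            # quadrant-aware version of tan^-1(dy/dphi)
    return kt, delta, z, mass, psi
```

Applying `lund_tuple` recursively to both branches of each declustering, whenever they contain more than one particle, would yield one such tuple per node of the Lund tree.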
Corrections to the Lund plane originating from non-perturbative hadronisation effects affect the low-$k_t$ region of the plane. One can therefore limit the dependence on the non-perturbative region of any model trained on Lund declusterings by removing emissions that fall below a certain transverse momentum $k_t$ threshold. In figure 2 (left), we show the distribution of the number of Lund declusterings per jet for several choices of $k_t$ cut in 2 TeV QCD jets simulated using Pythia 8.223 [37]. The mean of each distribution is indicated as a dashed line. An additional benefit of a $k_t$ threshold is that even for small cut values the number of nodes per jet is significantly reduced, and with it the computational cost of training a machine learning model on these inputs. The right-hand side of figure 2 shows the average number of nodes per jet as a function of the $k_t$ cut, which decreases quadratically as the cut is increased.
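The effect of such a threshold on the node count can be illustrated with a short Python sketch. It assumes, purely for illustration, that each jet is stored as a flat list of tuples whose first entry is $k_t$ in GeV; the function names are hypothetical, and the question of how the remaining nodes of the tree are reconnected after the cut is not addressed here.

```python
import math

def apply_kt_cut(declusterings, ln_kt_min=None):
    """Keep declusterings with ln(kt/GeV) above ln_kt_min; None means no cut."""
    if ln_kt_min is None:
        return list(declusterings)
    # The first element of each tuple is assumed to be kt in GeV.
    return [t for t in declusterings if math.log(t[0]) > ln_kt_min]

def mean_nodes_per_jet(jets, ln_kt_min=None):
    """Average number of surviving Lund declusterings per jet."""
    return sum(len(apply_kt_cut(jet, ln_kt_min)) for jet in jets) / len(jets)
```

Scanning `ln_kt_min` over a set of values and plotting `mean_nodes_per_jet` gives the kind of curve shown on the right-hand side of figure 2.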

LundNet Models
The Lund plane encodes a rich set of information on the substructure and radiation patterns of a jet, therefore serving as a natural input to machine learning models for jet physics. The use of Lund planes for jet tagging was first proposed in Ref. [30], where log-likelihood and deep learning models were applied, and good performance was observed for tagging boosted electroweak bosons. However, the main focus of Ref. [30] was the primary Lund plane, which inevitably leads to some loss of information due to the omission of the secondary and tertiary splittings. In this article, we propose LundNet, a new deep learning model capable of digesting the full Lund plane. Graph neural networks are used in this model to better exploit the structural information associated with the Lund plane representation of a jet, leading to significantly improved performance on a range of jet tagging benchmarks. The LundNet model starts by transforming the Lund tree into a graph, where each node corresponds to a Lund declustering and carries the tuple of kinematic variables $T^{(i)}$ as its input features, and bidirectional edges are formed following the structure of the Lund declustering tree. The graph network architecture is adapted from the ParticleNet [17] model, with the EdgeConv operation proposed in Ref. [38] as a core step.
Figure 3(a) illustrates how EdgeConv operates for one node (the highlighted one) in the Lund tree. It consists of two steps. First, a shared multi-layer perceptron (MLP) is applied to each of its incoming edges, using features of the node pair connected by the edge as inputs, and produces a learned "edge feature". As the Lund tree is a binary tree, there are at most three edges for each node, and these do not require a nearest-neighbour search, so the computational cost is much lower than for the ParticleNet model. As shown in figure 3(b), we use two layers for this shared MLP, each consisting of a linear layer followed by a batch normalization (BN) [39] and a ReLU activation [40]. Then, an aggregation step is performed for the node by taking an element-wise average of the learned edge features of all the incoming edges. A shortcut connection [41] is also added to take the original node features into account directly, and the node feature is then updated to the new value. This operation is performed for all the nodes using the same shared MLPs, therefore updating all the node features but keeping the graph structure unchanged.
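The two steps of this update can be sketched in plain Python. This is a toy illustration rather than the Deep Graph Library implementation used for LundNet: the names `tree_neighbours` and `edge_conv`, the parent-list encoding of the tree and the callable `edge_fn` standing in for the shared MLP are assumptions, and the shortcut is taken to be a plain element-wise addition, which requires the edge features to have the same dimension as the node features.

```python
def tree_neighbours(parent):
    """Bidirectional adjacency lists from a parent list (parent[i] is None for the root)."""
    neighbours = [[] for _ in parent]
    for child, par in enumerate(parent):
        if par is not None:
            neighbours[child].append(par)   # edge child <- parent
            neighbours[par].append(child)   # edge parent <- child
    return neighbours

def edge_conv(features, neighbours, edge_fn):
    """One EdgeConv-style update: shared edge function, mean aggregation, shortcut."""
    updated = []
    for i, x_i in enumerate(features):
        # Step 1: apply the shared function to every incoming edge (j -> i).
        messages = [edge_fn(x_i, features[j]) for j in neighbours[i]]
        # Step 2: element-wise average of the learned edge features.
        if messages:
            mean = [sum(col) / len(messages) for col in zip(*messages)]
        else:
            mean = [0.0] * len(x_i)
        # Shortcut connection: add the original node features.
        updated.append([x + m for x, m in zip(x_i, mean)])
    return updated

# Toy stand-in for the shared MLP: ReLU of the feature difference.
def toy_edge_fn(x_i, x_j):
    return [max(0.0, b - a) for a, b in zip(x_i, x_j)]

# Root node 0 with two children, one feature per node.
neighbours = tree_neighbours([None, 0, 0])
print(edge_conv([[1.0], [2.0], [4.0]], neighbours, toy_edge_fn))  # [[3.0], [2.0], [4.0]]
```

Because the adjacency lists come directly from the tree, no neighbour search is needed and no node of a binary tree has more than three neighbours.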
The architecture of the LundNet model is shown in figure 3(c). We stack six such EdgeConv blocks to form a deep graph network. The numbers of channels of the MLPs are (32, 32), (32, 32), (64, 64), (64, 64), (128, 128) and (128, 128) for the six EdgeConv blocks, respectively. Outputs from these EdgeConv blocks are concatenated per node and further processed by another MLP with 384 channels to better aggregate features learned at different stages. A global average pooling is applied afterwards to read out information from all nodes in the graph. This is followed by a fully connected layer with 256 units and a dropout layer with a drop probability of 0.1, before the final classification output.
The LundNet model uses the Lund kinematic variables defined in equation (2.1) as the input node features. Two variants of the LundNet models are investigated in this article. The first one uses all five Lund variables, $(\ln k_t, \ln\Delta, \ln z, \ln m, \psi)$, as input features to extract as much information as possible from the Lund plane and maximize the jet tagging performance, and is referred to as LundNet-5. The second one uses only three Lund variables, $(\ln k_t, \ln\Delta, \ln z)$, and is referred to as LundNet-3. The removal of the $\ln m$ and $\psi$ variables significantly increases the resilience of the model to non-perturbative effects at only a small cost in performance, as will be discussed in section 5.
We implement the LundNet model with the Deep Graph Library 0.4.3 [42] using the PyTorch 1.7 [43] backend. The training is performed on an Nvidia GTX 1080 Ti graphics card with a minibatch size of 256. The Adam optimizer [44] is used to minimize the cross entropy loss. The training is performed for 30 epochs, with an initial learning rate of 0.001, which is subsequently lowered by a factor of 10 after the 10th and the 20th epochs. A snapshot of the model is saved at the end of each epoch, and the model snapshot showing the best accuracy on the validation dataset is selected for the final evaluation.
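The learning-rate schedule described above amounts to a simple step decay, sketched below in plain Python. The function name and the 0-indexed epoch convention are assumptions made for this illustration; the source does not state which scheduler class was used, although a step-decay scheduler such as PyTorch's `MultiStepLR` with milestones at 10 and 20 would produce the same values.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(10, 20), gamma=0.1):
    """Learning rate for a 0-indexed epoch under step decay.

    The rate is multiplied by gamma once for every milestone already passed,
    i.e. after the 10th and after the 20th epoch with the default arguments.
    """
    n_drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_drops

# Learning rate used in each of the 30 training epochs.
schedule = [learning_rate(epoch) for epoch in range(30)]
```

With these defaults the first ten epochs run at 0.001, the next ten at one tenth of that, and the final ten at one hundredth.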

Jet tagging in the Lund plane
Let us now turn to a detailed evaluation of our models for the identification of several hallmark signals at the LHC. We will look at four different benchmarks: the tagging of boosted electroweak W bosons for two different transverse momentum cuts, the tagging of top quarks, and the discrimination between quark and gluon jets. The data samples consist in each benchmark of 1.2M signal and background jets simulated through the corresponding process with Pythia 8.223 [37], with an equal split between signal and background events. Events are generated at hadron level with underlying event turned on, but without including detector effects or the presence of additional pile-up collisions. A subset of 100k jets each is used as validation and test data, with the same number of signal and background events in both samples. All data sets are taken from Refs. [45,46]. Jets are clustered using the anti-$k_t$ algorithm [47,48] with a radius $R = 1.0$ using FastJet 3.3.2, and are required to pass a selection cut, with transverse momentum $p_t > 500$ GeV or $p_t > 2$ TeV as indicated, and rapidity $|y| < 2.5$. In each event, only the two jets with the highest transverse momentum are considered, and are saved as training data if they pass the selection cuts. A summary of these benchmarks is given in table 1.

W tagging
We start by considering the identification of hadronically decaying W bosons, one of the key objects commonly appearing in high energy proton collisions. The signal data is obtained from 600k jets passing the selection cuts and simulated using the $pp \to WW$ process, where the W bosons are decayed hadronically. The background consists of the same number of QCD jets simulated through a sample of $pp \to jj$ events. Training of the neural network weights for every model is performed using 500k of the W and background samples each. At the end of each epoch, the performance is monitored on a separate validation sample consisting of 100k jets. The final performance of each model is then evaluated using a further independent sample of 100k jets with an equal number of signal and background events.
In figure 4 we show for each model the background rejection $1/\varepsilon_{\mathrm{QCD}}$ against the signal efficiency $\varepsilon_W$ for W bosons, for jets passing a transverse momentum cut of $p_t > 500$ GeV. Better performance translates to curves that achieve a higher background rejection for a given signal efficiency, i.e. which are closer to the top right corner of the figure. We compare the LundNet-3 and LundNet-5 models with three recent benchmarks: the ParticleNet model introduced in [17], the RecNN model from [9] and the Lund+LSTM model from the original Lund plane paper [30], which uses an LSTM network on the primary Lund sequence. Both the RecNN and the Lund+LSTM models, while superior to heuristic substructure algorithms, are vastly outperformed by all of the graph-based methods considered. The LundNet-3 model is able to achieve about the same signal purity as ParticleNet, but can be trained in substantially less time, as will be discussed in more detail in section 5.3, and takes only a small 3-dimensional input for each declustering node in the Lund plane. By including more kinematic information, the LundNet-5 model is able to provide a slightly higher performance, but as we will see in section 5, this comes at the price of being less robust to non-perturbative effects than its lower-dimensional counterpart.
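For concreteness, the background rejection at a fixed signal efficiency can be obtained from classifier scores as in the following sketch. It assumes that each jet is assigned a single score that is larger for more signal-like jets; the function name and the convention of including the threshold score itself are illustrative choices, not taken from the analysis code.

```python
def background_rejection(signal_scores, background_scores, signal_efficiency):
    """Return 1/eps_bkg at the score cut that keeps the requested signal fraction."""
    ranked = sorted(signal_scores, reverse=True)
    # Number of signal jets that must pass the cut (at least one).
    n_keep = max(1, round(signal_efficiency * len(ranked)))
    threshold = ranked[n_keep - 1]
    n_bkg = sum(1 for s in background_scores if s >= threshold)
    if n_bkg == 0:
        return float("inf")  # no background survives in this finite sample
    return len(background_scores) / n_bkg
```

Evaluating this function over a grid of signal efficiencies traces out a curve of the kind shown in the figures of this section.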
In figure 5, we show the same process but with a transverse momentum selection cut of $p_t > 2$ TeV for the jets. Here we can observe roughly the same qualitative behaviour as at lower transverse momentum, but with the LundNet-5 model now clearly outperforming the remaining taggers even at high signal efficiencies. At higher transverse momentum, the peak in the Lund plane associated with the W splitting, and the corresponding depletion associated with the colour-singlet nature of the W, become more distinguishable. The Lund+LSTM model, which relies purely on the primary Lund sequence, also shows a strong performance, although it still lags significantly behind all the graph-based approaches.

Top tagging
We now turn to the identification of jets originating from top quark decays. Top quarks are of particular interest at the LHC, interacting strongly with the Higgs boson and providing a valuable avenue in searches for new physics, as well as being the only quarks to decay before hadronising. Here the signal data is obtained from the $pp \to t\bar{t}$ process in Pythia 8.223, where the top quarks decay to hadrons and the jets are required to pass a 500 GeV transverse momentum cut. The background QCD jets are identical to the ones used in figure 4. Each model is again trained using 500k signal and 500k background jets, with further validation and testing samples that are both one tenth the size of the training data.
In figure 6, we show the QCD background rejection as a function of the top efficiency. In this case, the Lund+LSTM model does not perform as well as RecNN. This is to be expected, as it was designed for one- or two-pronged jet identification and uses only information from the primary Lund declustering sequence. It therefore contains information about the structure of only one of the initial decay products of the original top quark, limiting the performance that can be achieved without input from secondary planes. It is however interesting to see that in this process with more complex topology, the LundNet-5 model provides a substantial performance gain over existing state-of-the-art methods such as ParticleNet. This is due to the nature of its input, which already contains high-level kinematic information about the radiation patterns of the jet, making it much simpler for the neural network to learn how to distinguish signals with more involved signatures. Thus the LundNet-3 model achieves almost the same signal purity as the ParticleNet algorithm, despite having as input only a reduced 3-tuple of kinematic variables per node and taking about an order of magnitude less time to train. Interestingly, the performance gap between the two LundNet taggers is entirely due to the addition of the subjet mass and azimuthal angle $\psi$ to the input features of each declustering for the LundNet-5 model.

Quark/gluon discrimination
For this study, we consider a signal sample of 500k quark-initiated jets obtained through the q q → q q process in Pythia 8.223, while the background is obtained from $gg \to gg$ events. The jets are clustered with an anti-$k_t$ algorithm with radius $R = 0.4$ and are again required to pass a transverse momentum $p_t > 500$ GeV and rapidity $|y| < 2.5$ selection cut.
The gluon-jet rejection as a function of the quark-jet efficiency is shown in figure 7. In this case there is not as large a hierarchy between models, with the Lund+LSTM model performing somewhat below the competing approaches. ParticleNet has a slight edge over the other algorithms at small quark efficiencies, but is indistinguishable from the LundNet-5 tagger at high efficiency. The LundNet-3 and RecNN models show similar performance at high efficiency, with RecNN providing slightly higher gluon rejection at lower quark efficiencies.

Robustness study
We will now investigate the robustness of the different models we considered in our benchmarks.To this end we will consider three axes: their resilience to non-perturbative effects, their resilience to detector effects, and the complexity and computational cost of each tagger.

Non-perturbative effects
Beyond its raw performance, it is important for practical applications that a tagger be relatively robust to model-dependent non-perturbative effects. To carry out studies of sensitivity to non-perturbative effects, we compare performance between a data sample of both 50k signal and background jets produced at parton level, and a sample obtained with hadronisation and underlying event models turned on in the event generator. The same model, trained on hadron-level data, is evaluated on both samples for the comparison. For this study, we use the same 2 TeV W jet sample as was used in section 4.1 as well as the corresponding models shown in figure 5, which are now used to label jets from both parton- and hadron-level data.
Figure 8 shows the robustness of the tagger in conjunction with its performance. This robustness is measured through the resilience $\zeta_{\mathrm{NP}}$ [56], calculated using both the efficiency on the hadron-level sample, $\varepsilon$, and that on the parton-level sample, $\varepsilon'$,
$$\zeta_{\mathrm{NP}} = \left( \frac{\Delta\varepsilon_W^2}{\langle\varepsilon\rangle_W^2} + \frac{\Delta\varepsilon_{\mathrm{QCD}}^2}{\langle\varepsilon\rangle_{\mathrm{QCD}}^2} \right)^{-1/2}, \qquad (5.1)$$
where $\Delta\varepsilon = \varepsilon - \varepsilon'$ and $\langle\varepsilon\rangle = \frac{1}{2}(\varepsilon + \varepsilon')$. The efficiencies are obtained with a fixed cut corresponding to a signal efficiency $\varepsilon_W = 70\%$ on the hadron-level sample. The curves in figure 8 are obtained by increasing a transverse momentum cut on the $k_t$ variable of the Lund plane, progressively removing declustering nodes that fall below the cut. Each curve starts on the upper left of figure 8, with a model trained without any cuts on the Lund plane, and ends in the lower right part of the figure with a model trained with a transverse momentum cut $\ln k_t/\mathrm{GeV} > 2$ that has higher resilience but lower performance due to the removal of parts of the Lund tree. We can observe that despite their good performance, the ParticleNet and RecNN models have very little resilience to non-perturbative effects, and have no handles through which such robustness can be consistently imposed. Somewhat surprisingly, the LundNet-5 model also offers relatively poor robustness to non-perturbative effects. This is due to its higher dimensional input state, which allows the neural network to extrapolate some information on emissions in the non-perturbative regime despite the presence of a transverse momentum cut. In contrast, the LundNet-3 model becomes very resilient to non-perturbative effects as the transverse momentum cut is increased, outperforming the Lund+LSTM model by a factor of two for the same resilience value.
In figure 9 (left), we show the ROC curve for each model trained on the hadron-level W data, with the ROC curve obtained on the parton-level data shown as a dotted line. The lower panel provides the ratio between the parton-level ROC curve and the hadron-level one. The right-hand side of figure 9 gives the ROC curve of the LundNet-3 models obtained for several choices of the $\ln k_t$ transverse momentum cut applied on the Lund tree. Here we can observe the improved resilience as the $k_t$ cut is increased, with the $\ln k_t/\mathrm{GeV} > 1$ model providing almost the same performance at parton and hadron level, albeit at the cost of a factor of 20 in background rejection when compared to the unconstrained model.
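A numerical version of the resilience can be sketched as follows, assuming it takes the form $\zeta = (\Delta\varepsilon_W^2/\langle\varepsilon\rangle_W^2 + \Delta\varepsilon_{\mathrm{QCD}}^2/\langle\varepsilon\rangle_{\mathrm{QCD}}^2)^{-1/2}$ with $\Delta\varepsilon = \varepsilon - \varepsilon'$ and $\langle\varepsilon\rangle = \frac{1}{2}(\varepsilon + \varepsilon')$. The function and argument names are hypothetical, and the infinite return value for identical efficiencies is a convention adopted here.

```python
import math

def resilience(eps_sig, eps_sig_alt, eps_bkg, eps_bkg_alt):
    """Resilience from signal and background efficiencies on two samples.

    eps_* are the efficiencies on the reference (e.g. hadron-level) sample and
    eps_*_alt those on the alternative (e.g. parton-level) sample, all obtained
    with the same fixed cut on the tagger output.
    """
    def relative_shift_squared(eps, eps_alt):
        delta = eps - eps_alt               # Delta eps
        mean = 0.5 * (eps + eps_alt)        # <eps>
        return (delta / mean) ** 2

    total = (relative_shift_squared(eps_sig, eps_sig_alt)
             + relative_shift_squared(eps_bkg, eps_bkg_alt))
    return math.inf if total == 0.0 else 1.0 / math.sqrt(total)
```

For example, a signal efficiency that is unchanged between the two samples combined with a background efficiency that moves from 0.02 to 0.01 gives a relative shift of 2/3 and hence a resilience of 1.5.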

Detector effects
Let us now turn to the impact of detector effects on the model robustness. To this end, we create a sample of 100k 2 TeV W and QCD jets using Pythia 8.223, including fast detector simulation with Delphes v3.4.1 using the delphes_card_CMS_NoFastJet.tcl card to simulate both detector effects and particle flow reconstruction [57]. The effects of detector granularity are then partially mitigated by applying a subjet-particle rescaling algorithm [30,58], where the Delphes particle-flow objects in a jet are reclustered into subjets using a CA algorithm with $R_h = 0.12$ and rescaling the particle-flow charged particles and photons by a factor before discarding neutral hadron candidates. The resulting particles of all subjets are then reclustered into a single jet on which the Lund tree can be measured. Applying the 2 TeV W taggers trained in section 4.1 on this sample, we can now compute an index of resilience to detector effects $\zeta_D$ in the same way as was done for non-perturbative effects, but now taking $\varepsilon'$ in equation (5.1) to be the detector-level efficiency. We show in figure 10 the resilience as a function of performance $\varepsilon_W/\sqrt{\varepsilon_{\mathrm{QCD}}}$ for a signal efficiency of $\varepsilon_W = 70\%$ on the hadron-level data. One can observe here that while good resilience can be achieved at high performance, a transverse momentum cut in the Lund plane does not result in models that are particularly insensitive to detector effects. Adding a further Lund-plane angular cut $\ln 1/\Delta < 4$ to remove unmitigated effects due to electromagnetic calorimeter granularity did not provide any noticeable improvement, as is shown in dashed lines in the figure for the Lund+LSTM model. The limitations in achieving higher resilience values for any of the considered models are due to the consistently enhanced performance of taggers at hadron level. We show the ROC curves for each model in figure 11 (left), with the dotted lines showing the hadron-level model applied on detector-level data. Here one can note the performance of the Lund+LSTM model on the detector-level sample, achieving performance quite close to the LundNet and ParticleNet models. The LundNet-5 model in particular performs slightly worse than LundNet-3 when applied on the detector-level sample, despite having a substantial edge over it on the hadron-level data that both were trained on. The lower panel gives the ratio between both curves, with the background rejection ratio of the Lund+LSTM tagger with an angular cut $\ln 1/\Delta < 4$ shown in dashed lines. On the right-hand side of figure 11, one can see the ROC curve of the LundNet-3 model trained for increasing $k_t$ cuts, showing somewhat improved robustness at larger cut values.

Complexity of models
An important quality for a deep learning-based jet tagger is the simplicity of the model, and the speed of its training and inference on new samples. To quantify these considerations we measure three different metrics for the models:
• the number of trainable parameters of the model,
• the training time per sample and per epoch,
• the inference time per sample.
The results are shown in table 2 for the graph-based models and the LSTM tagger. As LundNet-3 and LundNet-5 only differ in the dimension of the input features, the number of parameters and the computational cost are essentially the same, therefore we do not distinguish between them in this section and provide numbers derived from the LundNet-3 tagger. The Lund+LSTM model has a much simpler architecture, resulting in only 67k trainable parameters, significantly fewer than any of the graph-based models. It is however not substantially faster than these larger models, and even underperforms the LundNet models in inference time. The relatively long training time is partly due to the smaller learning rate used when training the LSTM network, and the smaller number of epochs needed for the Lund+LSTM model to converge. Due to its increased number of EdgeConv blocks, the LundNet model has 26k more parameters than ParticleNet. However, the direct use of the Lund tree as the graph structure removes the need for a costly nearest-neighbour search and also significantly reduces the number of edges for each node, therefore increasing both the training and inference speed by almost an order of magnitude. This is compounded by the fact that due to their higher-level kinematic inputs, the LundNet models take significantly fewer epochs to converge to a good solution.
An interesting side-effect of the $k_t$ cut applied in the Lund plane to improve the robustness of the model, as described in previous sections, is that it also reduces the number of nodes present in the graph. As such, both the training and the inference time of the model are expected to be reduced as the transverse momentum cut is increased and more nodes are removed from the input graph. To demonstrate this we show in figure 12 the inference time per sample as a function of the average number of Lund declusterings per QCD jet, obtained through models trained with different Lund plane $k_t$ cuts, each of which is shown as a blue circle. As expected, the inference time scales linearly with the number of nodes in the graph, such that computing time increases quadratically as the $\ln k_t$ cut is reduced.

Conclusions
In this article, we have introduced LundNet, a novel algorithm used to detect signals at the LHC. We showed that this method provides substantial improvements over existing methods on the identification of key benchmark processes, as well as in training speed and robustness to non-perturbative effects.
The LundNet model combines the power of graph convolutional networks with an efficient representation of the radiation patterns within a jet to optimally extract information from its substructure. Jets are represented through a Lund tree constructed from the CA clustering tree of each jet. Each node of the Lund tree contains a tuple of kinematic information for the corresponding pairwise splitting, used as input to the graph network. By using the clustering tree structure to aggregate information in the graph convolution, the weights of the LundNet model can be trained ten times faster than previous graph-based methods such as ParticleNet, with a similar gain on the inference time when applying the trained model to identify new jets.
We introduced two taggers, LundNet-3 and LundNet-5, which rely on a three- and five-dimensional emission feature space respectively. The LundNet-5 tagger outperforms current state-of-the-art methods on several pivotal jet tagging benchmarks, most notably for the identification of jets originating from top decays, while the lower-dimensional LundNet-3 tagger matches the performance of current jet taggers despite its reduced kinematic input size.
We provided a concrete study of the resilience of these taggers to model-dependent non-perturbative and detector effects. Through the use of an appropriate transverse momentum cut in the Lund plane, we showed how one can establish an algorithm that retains high performance while maintaining a handle on robustness. Due to its limited kinematic input, the LundNet-3 tagger is best positioned to provide jet identification that is relatively insensitive to non-perturbative effects and detector smearing, while substantially outperforming previous methods based on the primary Lund plane.
These results offer a concrete avenue to implementing effective machine-learning based taggers that can be robust to model-dependent effects present in the training data, a key feature for real-life applications of artificial intelligence at the LHC. In this context, the work presented in this article provides a key step towards a new generation of efficient, robust and tractable jet substructure tools for LHC physics.

Figure 1. The Lund plane representation of a jet (left), where each emission is positioned according to its $\Delta$ and $k_t$ coordinates, and the corresponding mapping to a binary Lund tree of tuples (right). The thick blue line represents the primary sequence of tuples $L_{\mathrm{primary}}$.

Figure 3. (a) Illustration of the EdgeConv operation on a node of the Lund tree. (b) Architecture of the EdgeConv block used in the LundNet model. (c) Architecture of the LundNet model.

Figure 4. Background rejection $1/\varepsilon_{\mathrm{QCD}}$ versus signal efficiency $\varepsilon_W$ for W jet tagging with transverse momentum $p_t > 500$ GeV.

Figure 5. Background rejection $1/\varepsilon_{\mathrm{QCD}}$ versus signal efficiency $\varepsilon_W$ for W jet tagging with transverse momentum $p_t > 2$ TeV.

Figure 6. Background rejection $1/\varepsilon_{\mathrm{QCD}}$ versus signal efficiency $\varepsilon_{\mathrm{Top}}$ for top jet tagging with transverse momentum $p_t > 500$ GeV.

Figure 7. Background rejection $1/\varepsilon_{\mathrm{Gluon}}$ versus signal efficiency $\varepsilon_{\mathrm{Quark}}$ for quark/gluon discrimination between $R = 0.4$ anti-$k_t$ jets with transverse momentum $p_t > 500$ GeV.

Figure 9. Background rejection as a function of W tagging efficiency. Dotted lines indicate a W tagger applied on parton-level data.

Figure 10. Performance $\varepsilon_W/\sqrt{\varepsilon_{\mathrm{QCD}}}$ versus resilience to detector effects.

Figure 11. Background rejection as a function of W tagging efficiency. Dotted lines indicate a W tagger applied on detector-level data.

• the training time of each model per data sample and per epoch, which provides a measure of the time needed to train a full tagger on a given data set,
• and finally the inference time per sample of the trained model on new data points, which provides a measure of how efficiently an existing model can be deployed to label a given sample of jets.
We evaluate the training time on the 2 TeV W and QCD training sample used previously in section 4.1, and the inference time on the corresponding test data of 100k jets.
[Figure 12 plot: inference time [ms/sample] versus Lund declusterings per QCD jet; cuts: ∅, -2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2]

Figure 12. Inference time per jet of the LundNet model as a function of the mean number of Lund declusterings per 2 TeV QCD jet. Each circle corresponds to a separate LundNet model trained for a different $k_t$ cut, as indicated in the figure text.

Table 2. Summary for each model of the number of parameters, training time per sample and epoch, and inference time per sample. The time is measured in milliseconds as obtained when running the models on an Nvidia GTX 1080 Ti card.