Resolving Extreme Jet Substructure

We study the effectiveness of theoretically-motivated high-level jet observables in the extreme context of jets with a large number of hard sub-jets (up to $N=8$). Previous studies indicate that high-level observables are powerful, interpretable tools to probe jet substructure for $N\le 3$ hard sub-jets, but that deep neural networks trained on low-level jet constituents match or slightly exceed their performance. We extend this work to up to $N=8$ hard sub-jets, using deep particle-flow networks (PFNs) and Transformer-based networks to estimate a loose upper bound on the classification performance. A fully-connected neural network operating on a standard set of high-level jet observables, 135 $N$-subjettiness observables and the jet mass, reaches a classification accuracy of 86.90\%, but falls short of the PFN and Transformer models, which reach classification accuracies of 89.19\% and 91.27\% respectively, suggesting that the constituent-based networks utilize information not captured by this set of high-level observables. We then identify additional high-level observables which are able to narrow this gap, and utilize LASSO regularization for feature selection to identify and rank the most relevant observables, providing further insight into the learning strategies used by the constituent-based neural networks. The final model contains only 31 high-level observables and is able to match the performance of the PFN and approximate the performance of the Transformer model to within 2\%.


Introduction
The era of the Large Hadron Collider (LHC) has opened a new frontier in jet physics: the interior of jets. Hadronically decaying objects with large transverse momentum produce collimated quarks and gluons which lead to jets with complex energy patterns, including some with multiple, distinct sub-jets [1]. Jets with two or three sub-jets have been observed and extensively studied [2][3][4][5][6][7][8], and jets with additional hard sub-jets will become more important as the high-luminosity LHC collects large datasets in which high-p T objects appear in greater numbers [9,10]. Additionally, future colliders may push the energy frontier forward, creating jets with many more hard sub-jets. In these settings, distinguishing jets with multiple hard sub-jets will be an increasingly important element in searches for new physics.
Many theoretically-motivated jet substructure techniques have been proposed and studied to identify jets with multiple hard sub-jets by summarizing the information of the low-level jet constituents into a compact set of high-level observables [11,12]. However, recent strides in deep learning have demonstrated the ability to extract information directly from low-level detector data, making such networks powerful probes of the information content of the jets [13][14][15][16][17][18][19][20]. Application of these deep neural networks (DNNs) to jets has revealed that there is often additional information available in the constituents that is not captured by such high-level observables.
Specifically, studies with two hard sub-jets [17,21] have found a small but stubborn gap when comparing the performance of DNNs on low-level jet constituents to networks on high-level observables, indicating that the observables fail to capture all of the information used by the constituent-based networks. In studies of jets with three hard sub-jets, several studies have analyzed the performance gap between low- and high-level jet information in the context of top-quark tagging. While some have found a negligible gap between high-level observables and pixelated jet image representations [22], modern deep learning architectures operating on unpixelated low-level constituent inputs continue to outperform high-level networks [23][24][25][26]. For jets with four hard sub-jets, taggers have been developed which use high-level observables [27][28][29], but without a comparison to the performance of networks using low-level jet information.
In this paper, we extend jet tagging studies to jets with up to eight hard sub-jets, to probe the question of whether the existing high-level observables are sufficient to analyze such extreme jets, or whether modern deep learning architectures can identify additional, untapped information. In each case, we compare the performance of networks which use compact high-level observables to those that take voluminous low-level calorimeter data. 1 We then use the networks trained on low-level data as a probe, and attempt to map their strategies into a set of high-level observables that contain the full range of discriminating information.
The rest of this paper is structured as follows. Section 2 contains the details of the dataset generation, and describes the different topologies used to simulate jets with many hard sub-jets. Section 3 describes the fully-connected network operating on existing high-level observables, followed by Sec. 4, which describes the networks applied to the low-level jet constituents. Section 5 compares the performance of the network based on high-level observables to the constituent-based networks. In Sec. 6, we study the performance gap between the various networks, and find additional high-level observables which narrow the gap to the constituent-based networks. Concluding remarks are given in Sec. 7.

Dataset and Preprocessing
Simulated proton-proton collision events enriched in jets with many collimated quarks are generated using the processes shown in Figure 1. Most samples use the decay of a hypothetical heavy particle, such as a graviton (G), which subsequently decays via heavy Standard Model (SM) particles such as W bosons, Higgs bosons, or top quarks, whose hadronic decays yield collimated pairs and triplets of quarks that contribute N = 2 hard sub-jets per boson or N = 3 hard sub-jets per top quark. For processes with N ≥ 4 hard sub-jets, events are further required to contain high-energy photon radiation which can, for example, boost a G → tt decay into a single jet with N = 6 hard sub-jets. See Table 1 for further generation details.
Samples are generated for N = 1, 2, 3, 4, 6, 8 hard sub-jet classes. To probe the dependence of the classification on the topology of the decay, the N = 4 class is subdivided in two: the N = 4q case, in which two collimated W bosons produce a jet with four hard sub-jets from four light quarks, and the N = 4b case, in which two collimated Higgs bosons result in a jet with four hard sub-jets from four b-quarks. A summary of the generated samples is given in Table 1.
Proton-proton collisions at a center-of-mass energy √ s = 13 TeV are simulated with Madgraph5 v2.8.1 [30], showered and hadronized with Pythia 8.244 [31], and the detector response is simulated with Delphes 3.4.2 [32] using an ATLAS-like card with calorimeter grids of uniform width 0.0125 in both η and φ. This fine granularity ensures that our studies include some of the effects of detector response, but probe the limits of the algorithms rather than the resolution of the detector. Jets are clustered using the anti-k T algorithm [33] with radius parameter R = 1.2 using FastJet 3.1.2 [34]. Only jets with p T in the range [1000, 1200] GeV and jet mass in the range [300, 700] GeV are kept. The quarks produced from each process are truth-matched to the large-radius jet via a ∆R requirement.

Table 1: Processes used to generate jets with N = 1, 2, 3, 4b, 4q, 6, 8 hard sub-jets. Also shown are the particle masses and generator-level requirements used to more efficiently produce jets with p T ∈ [1000, 1200] GeV and mass ∈ [300, 700] GeV. All masses and momenta are in GeV. The selected W boson masses of (264.5, 440.8, 617.1) correspond to M Z = (300, 500, 700), respectively. See Figure 1 for the corresponding Feynman diagrams.
To generate jets with a variety of masses, several choices are made for the intermediate particle masses. The resulting spectrum of generated jet masses features clear artifacts due to these choices. Similarly, the distribution of jet p T shows some dependence on the process used to generate the set of collimated quarks. To avoid learning artifacts in jet mass and p T , we selectively reject events until we achieve uniform distributions of jet mass, p T , and N hard sub-jets. This process yields a balanced sample of unweighted events, at the expense of reduced generation efficiency. The balanced histograms are shown in Figure 2, demonstrating an approximately uniform distribution in both jet mass and p T .
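The balancing step above can be sketched with a toy rejection-sampling routine; the bin edges, Gaussian peaks, and sample sizes below are purely illustrative stand-ins for the actual generated mass spectrum, not the paper's samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def flatten_sample(values, bins, rng):
    """Randomly reject entries until the histogram of `values` over
    `bins` is approximately uniform: every occupied bin is downsampled
    to the population of the emptiest occupied bin."""
    idx = np.digitize(values, bins) - 1                  # bin index per entry
    counts = np.bincount(idx, minlength=len(bins) - 1)
    target = counts[counts > 0].min()                    # flatten to this count
    keep = np.zeros(len(values), dtype=bool)
    for b in range(len(bins) - 1):
        in_bin = np.flatnonzero(idx == b)
        if len(in_bin) > 0:
            keep[rng.choice(in_bin, size=target, replace=False)] = True
    return keep

# toy jet-mass spectrum with sculpted peaks, flattened over [300, 700] GeV
mass = np.concatenate([rng.normal(400, 20, 5000), rng.normal(600, 20, 2000)])
mass = mass[(mass >= 300) & (mass < 700)]
keep = flatten_sample(mass, np.linspace(300, 700, 9), rng)
```

The same kind of selection, applied jointly in jet mass, p T , and N, yields the balanced, unweighted sample described above at the cost of generation efficiency.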
Since the low-level networks require fixed-size arrays as input, the events are zero-padded with zero-p T particles to ensure all input arrays have a length of 230. We find that padding the events to this length is more than enough to capture all hard constituents in the events, as they have a mean constituent multiplicity of 82; see Figure 2.
The jets are preprocessed by normalizing the p T of their constituents to sum to unity, and centering them on the p T -weighted arithmetic mean in η and the p T -weighted circular mean in φ.
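A minimal sketch of this preprocessing, assuming constituents arrive as parallel (p T , η, φ) arrays; the function names and array layout are illustrative, not the paper's actual code:

```python
import numpy as np

def preprocess(pt, eta, phi, max_len=230):
    """Normalize constituent pT to sum to unity, center eta on the
    pT-weighted mean and phi on the pT-weighted circular mean, then
    zero-pad to a fixed length of 230 entries."""
    pt = pt / pt.sum()
    eta = eta - np.average(eta, weights=pt)
    # circular mean in phi avoids the +/-pi wrap-around
    phi_mean = np.arctan2(np.average(np.sin(phi), weights=pt),
                          np.average(np.cos(phi), weights=pt))
    phi = np.mod(phi - phi_mean + np.pi, 2 * np.pi) - np.pi
    pad = max_len - len(pt)
    out = np.stack([np.pad(pt, (0, pad)),
                    np.pad(eta, (0, pad)),
                    np.pad(phi, (0, pad))], axis=-1)
    return out  # shape (230, 3); zero-pT rows are padding
```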

High-Level Observables Network
Identification of jets with N hard sub-jets is a well-explored topic experimentally for N ≤ 3, for which many theoretically motivated observables have been constructed to summarize the information contained in the jet energy pattern [11][12][13][35]. These observables have the advantage that they are compact and physically interpretable, and can in principle be applied to jets with many more hard sub-jets. Here, we use N-subjettiness [35], together with the jet mass, as a well-known benchmark. Following Ref. [36], a total of 135 N-subjettiness observables (τ_N^β) are calculated along the k T axes, with sub-jet axis parameter N = 1, . . . , 45 and angular weighting exponent β ∈ {1/2, 1, 2}, which together with the jet mass account for 136 jet observables. Distributions of some of the N-subjettiness observables are shown in Figure 3 for jets with various numbers of hard sub-jets.
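The observable itself can be written out directly from its definition: τ_N^β sums each constituent's p T weighted by its angular distance to the nearest of N sub-jet axes. In the sketch below the axes are simply passed in as an assumption; in the paper they come from the k T algorithm, so this illustrates only the observable, not the axis finding:

```python
import numpy as np

def nsubjettiness(pt, eta, phi, axes, beta=1.0, R0=1.2):
    """tau_N^beta for a given set of candidate sub-jet axes
    (rows of `axes` are (eta, phi)). Normalized by d0 = sum(pT) * R0^beta
    with R0 the jet radius parameter."""
    # angular distance of every constituent to every axis
    deta = eta[:, None] - axes[:, 0][None, :]
    dphi = np.mod(phi[:, None] - axes[:, 1][None, :] + np.pi, 2 * np.pi) - np.pi
    dR = np.sqrt(deta**2 + dphi**2)
    tau = np.sum(pt * dR.min(axis=1)**beta)   # nearest-axis distance per constituent
    return tau / (pt.sum() * R0**beta)
```

A jet whose constituents sit exactly on the N axes gives τ_N^β = 0, i.e. it is perfectly described by N sub-jets.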
We train a fully-connected (dense) network which operates on the 136 high-level observables, referred to as DNN 136 . A grid search of the hyperparameters of the dense network indicates that the best structure has six hidden layers of size (800-800-800-800-800-64) with the ReLU [37] activation function. To prevent overfitting and to stabilize training, dropout with rate 0.3 and batch normalization are applied after every hidden layer. The output layer is a 7-dimensional softmax, with one dimension for each category of N hard sub-jets. Please refer to Table 4 for the model summary of DNN 136 .

Low-Level Jet Constituent Models
The primary focus of our study is to answer the question of whether the patterns of energy depositions in the jets contain additional information useful to the classification task that is not captured by the high-level observables. To probe the relevant information content, we train networks which operate directly on the low-level calorimeter tower constituents of the jets. We focus on two such network architectures: Particle-Flow Networks (PFNs) [38] and Transformers [39]. Both have matched or outperformed other low-level architectures, such as convolutional networks, in a variety of classification tasks [21,38,40], but the decision to choose PFNs and Transformers to probe the information content of the jets rests on their specific learning strategies. PFNs learn an event-level latent representation of the jets, see Sec. 4.1. Transformers, on the other hand, learn a contextualized embedding of each constituent and use self-attention mechanisms to determine which parts of the embedded input sequence to focus on in order to make a final prediction, see Sec. 4.2.
Both networks are invariant to the input ordering, which makes them well-suited for jet classification tasks.

Particle-Flow Networks
The power of Particle-Flow Networks (PFNs) [38] relies on their ability to learn virtually any symmetric function of the jet constituents. Their mathematical structure is naturally invariant under permutation of the input ordering, constructed as a summation over the constituents:

PFN = F ( Σ_i Φ(p Ti , η i , φ i ) ),

where in our case Φ(p Ti , η i , φ i ) represents the per-tower latent space operating on the normalized three-momentum of constituent tower i, and F represents the jet-level latent space. A grid search of the hyperparameters of the PFN finds that the best structure for the network has two layers in the Φ module of size (128, 128) and two layers in the F module of size (1024, 1024), with a dropout rate of 0.2. All hidden layers have ReLU [37] activation functions. The output layer is the same as for DNN 136 : a 7-dimensional softmax with one element for each possible value of N hard sub-jets. Please refer to Table 4 for a summary of the PFN model.
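The structure F(Σ_i Φ(p_i)) can be demonstrated with a toy forward pass; random weights and much smaller layer sizes stand in for the trained (128, 128) Φ and (1024, 1024) F modules, so this shows only the permutation invariance of the architecture, not its performance:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# random weights standing in for the trained Phi and F modules
W1, W2 = rng.normal(size=(3, 16)), rng.normal(size=(16, 16))
V1, V2 = rng.normal(size=(16, 32)), rng.normal(size=(32, 7))

def pfn_forward(constituents):                     # constituents: (n, 3) = (pT, eta, phi)
    latent = relu(relu(constituents @ W1) @ W2)    # per-particle Phi
    pooled = latent.sum(axis=0)                    # permutation-invariant sum
    return relu(pooled @ V1) @ V2                  # jet-level F -> 7 logits

jet = rng.normal(size=(5, 3))
out1 = pfn_forward(jet)
out2 = pfn_forward(jet[::-1])   # reversed constituent order gives identical output
```

Because the only interaction between constituents is the sum, any reordering of the input rows leaves the output unchanged.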

Transformers
Like the PFN, the Transformer [39] model operates on the low-level jet constituents. Transformers are characterized by the use of the self-attention mechanism, which learns to focus on the parts of the input sequence that the model deems important in order to extract meaningful representations of that sequence. With self-attention layers, Transformer models and their variants achieve state-of-the-art performance in a variety of sequence modeling tasks [41][42][43], and in applications to particle physics tasks with inherent symmetries [44,45]. In our application, we employ the Encoder model of the Transformer in [41], which uses multiple attention heads computed in parallel to increase the expressiveness of the network.
A grid search of the hyperparameters of the Transformer finds that the best structure for the network has four transformer layers, with hidden size of 256 and intermediate dimension of 128. For computational efficiency and to compare networks of similar complexity (see Table 4), we focus our search on smaller architectures compared to the ones used in the original paper [39,41]. A summary of the Transformer model is given in Table 4.
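The key ingredient can be sketched with a single self-attention head in numpy; the dimension, random weights, and absence of positional encoding are illustrative assumptions (the full model stacks four such layers with multiple heads). Without positional encodings, attention is permutation-equivariant, so a summed jet-level output is order-invariant, as noted above:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    """One self-attention head: each constituent attends to all others
    and receives a contextualized embedding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)        # row-wise softmax
    return attn @ V                                 # contextualized embeddings

X = rng.normal(size=(6, d))                        # 6 toy constituents
perm = rng.permutation(6)
pooled1 = self_attention(X).sum(axis=0)
pooled2 = self_attention(X[perm]).sum(axis=0)      # same jet, shuffled order
```

Shuffling the constituents permutes the rows of the attention output identically, so the pooled jet-level representation is unchanged.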

Performance
We measure a network's accuracy as the fraction of correctly identified jets, i.e. the fraction of jets for which the predicted class matches the true class. For jets in each class, the 10-fold average accuracy is shown in Figure 4. The Transformer network achieves the highest overall classification performance, with an accuracy of 91.27 ± 0.31%, followed by the PFN with an accuracy of 89.19 ± 0.23%. 2 The high-level model DNN 136 is the least performant, with an accuracy of 86.90 ± 0.20%. The accuracy of the networks is, however, not uniform across the classes: some classes achieve better performance than others. Despite this, the relative ranking of the three networks is largely the same for each class, with the exception of N = 4b, for which DNN 136 outperforms the PFN. The classes with the lowest accuracy scores are N = 2, N = 3, and N = 4b. The confusion matrices showing the mean 10-fold classification predictions for the jets in each class are shown in Figure 5; they show that these classes are often misclassified among each other, and with N = 1 to a lesser extent. The confusion matrices are also highly diagonal for all networks, with the largest off-diagonal elements typically located in adjacent categories. This suggests that the networks have learned to identify the number of hard sub-jets, with the largest misclassifications falling in the N ± 1 classes, as might be expected. It is, however, interesting to note that the class with the highest accuracy for all networks is N = 4q, and that it is rarely confused with N = 4b, suggesting that the networks are learning more information than simply the number of hard sub-jets.
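The accuracy and confusion-matrix bookkeeping above amounts to the following; the toy labels use three classes rather than the paper's seven, purely for illustration:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=7):
    """Row-normalized confusion matrix: entry [i, j] is the fraction
    of true-class-i jets predicted as class j, so the diagonal holds
    the per-class accuracies."""
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (y_true, y_pred), 1.0)       # count (true, predicted) pairs
    cm /= cm.sum(axis=1, keepdims=True)        # normalize each true-class row
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
accuracy = (y_true == y_pred).mean()           # overall fraction correct
```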
To study whether the networks have a dependence on the p T of the jets, the accuracy of their predictions as a function of jet p T is shown in Figure 6. For all networks, the accuracy remains relatively constant across the spectrum, indicating that the networks do not have a strong p T dependence, as expected given the balancing of the dataset. In all ranges, the Transformer slightly outperforms the PFN, both followed by the DNN 136 .
We also study whether the networks have a dependence on the jet mass by assessing the accuracy of their predictions as a function of jet mass, 3 as shown in Figure 7. While the Transformer outperforms the PFN and the DNN 136 in all ranges, the accuracy of the network's predictions is not uniform, but generally rises with jet mass, in contrast with the lack of variation with jet p T . This dependence is further studied in Section 6.3 by inspecting the correlations between the most important high-level observables and the jet mass.
To our knowledge, this is the first comparison of high-level observables to networks which use jet constituents for jets with more than three hard sub-jets. Our results indicate that observables such as the N-subjettiness variables are powerful discriminants for jets with many hard sub-jets, and perform well when tested on jets with different topologies. There is, however, a small but persistent gap between their overall performance and that of the low-level networks. In the following sections, we identify new observables which can be combined with the N-subjettiness variables to bridge this gap.

Closing the Performance Gap
The Transformer and PFN models operating on the low-level jet constituents have generally outperformed the high-level observables across the different classes N and across the jet mass and p T spectra. Low-level networks have the advantage of efficiently extracting information from the jet constituents to achieve state-of-the-art performance on the classification task, making them useful probes of the information content of the jets. However, due to the high dimensionality of their inputs and the low-level nature of the data, such models can be challenging to interpret directly in an experimental context, where a smaller number of well-understood high-level inputs which capture the necessary information would be preferred. Thus, we aim to identify new high-level observables which can be added to the existing observables to close the performance gap.
A strategy for identifying jet observables which are able to bridge the performance gap is described in Ref. [21], which searches among the pool of observables called energy flow polynomials (EFPs) [11] to identify those that yield similar classification decisions as DNNs trained on low-level detector data. This strategy, however, applies only to binary decision functions [40,46]. We leave a generalization of that method to multi-class networks for future work, and instead apply a simpler but commonly-used technique for variable selection. First, we employ a large set of EFPs in combination with the existing N-subjettiness variables to capture the information needed to match the performance of the Transformer and the PFN. Then, we systematically reduce the set of observables to find the minimum set that best approximates the performance of the constituent-based networks.

Adding Energy-Flow Polynomials
As described in [11], EFP observables are constructed as nested sums over the jet constituents' transverse momenta, scaled by their pairwise angular separations θ ij . These parametric sums are indexed by the set of all isomorphic multigraphs G, where for a jet with M constituents:

EFP_G = Σ_{i_1=1}^{M} · · · Σ_{i_n=1}^{M} z_{i_1} · · · z_{i_n} Π_{(k,l)∈G} θ_{i_k i_l} .

Each graph can be further parameterized by the weighting factors (κ, β), where

z_i = (p Ti / Σ_j p Tj)^κ , θ_{ij} = (∆η_{ij}^2 + ∆φ_{ij}^2)^{β/2} .

Here, p Ti is the transverse momentum of constituent i, and ∆η ij (∆φ ij ) is the pseudorapidity (azimuthal) difference between constituents i and j. Following the documentation of [11], each EFP is accompanied by its unique identifier (n, d, k), which specifies the number of nodes n, edges d, and index k of the corresponding graph.
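The simplest connected EFP (two nodes, one edge) can be written out directly from this definition; larger graphs add more nested sums and one θ factor per edge. This is a minimal sketch of the formula, not the optimized implementation used to compute the full observable set:

```python
import numpy as np

def efp_two_node(pt, eta, phi, beta=1.0, kappa=1.0):
    """Two-node, one-edge EFP: sum_{i,j} z_i z_j theta_ij^beta,
    with z_i the pT fraction raised to kappa and theta_ij the
    pairwise angular distance raised to beta."""
    z = (pt / pt.sum())**kappa
    deta = eta[:, None] - eta[None, :]
    dphi = np.mod(phi[:, None] - phi[None, :] + np.pi, 2 * np.pi) - np.pi
    theta = (deta**2 + dphi**2)**(beta / 2.0)
    return np.einsum('i,j,ij->', z, z, theta)   # contract both sums at once
```

For two equal-p T constituents separated by ∆R = 1 and β = 1, the two cross terms each contribute 0.25 and the diagonal vanishes, giving 0.5.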
For our studies, we select all connected (prime) EFP graphs with five or fewer edges and weighting factors κ = 1 and β ∈ {1/2, 1, 2}, for a total of 162 observables, which are denoted by EFP β (n, d, k) in what follows. We also include the constituent multiplicity, which is described by the one-node EFP with κ = 0, and has been shown to be a useful observable for jet discrimination [47]. Combined with the N-subjettiness variables and the jet mass, the new augmented dataset has 299 observables. We train a dense network on the 299 observables, labelled DNN 299 , with the same hyperparameters as DNN 136 . The resulting overall accuracy is 89.23%, as seen in Figure 8. With the augmented set, we are thus able to match the overall performance of the PFN, and to close the performance gap with the Transformer to approximately 2%.
The accuracies of DNN 299 across the jet p T and mass spectra are shown in Figure 9 and Figure 10, respectively. In most ranges, DNN 299 matches or closely approximates the performance of the PFN. A small but persistent gap remains between the DNN 299 and Transformer models, indicating that the Transformer model is still utilising useful information not available in the augmented observable set. Additional EFP observables may be able to further narrow the gap, such as EFPs with energy and angular measures not considered in this paper, or with a larger number of edges (degree), which allow for more complex polynomial forms. The costly computation of these variables makes this infeasible at the present time and so is left for future work.

Selecting Energy-Flow Polynomials
The next step in the selection process is to identify the minimal set of high-level observables which manages to close the gap with the PFN and approximate the Transformer. The core idea of the observable selection strategy is to apply L 1 regularization, also known as LASSO [48] regularization, to zero out the weights of the observables that are unimportant to the network during training. In principle, the regularization must be applied to the input weights of the DNN 299 network, which form a matrix of dimension 800 × 299, namely W = [W 1 , W 2 , ..., W 299 ], where W i is a column vector of length 800. However, since we are interested in finding the minimal set of observables most relevant to the classification task, we encourage the network to shrink the entire weight column vector of an irrelevant observable to zero by introducing a learnable gate parameter g i , such that W i = [g i w i,1 , ..., g i w i,800 ]. This allows us to apply the regularization to the gate parameters g = [g 1 , g 2 , ..., g 299 ] during training. When the LASSO regularization shrinks g i to zero, the entire vector W i also shrinks to zero, and we can confidently exclude the i-th observable during training. The overall loss function can be written as

L = − log f (Y, Y pred ) + λ Σ_{i=1}^{299} |g i | ,

where the first term − log f (Y, Y pred ) is the negative log likelihood of the predicted label Y pred and true label Y , and the second term is the LASSO regularization term on the gate parameters g with regularization strength parameter λ.
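A toy version of this gated objective, with a single linear layer standing in for the full DNN 299 and illustrative shapes; only the structure (gate scaling a whole weight column, plus the L1 penalty on the gates) mirrors the method described above:

```python
import numpy as np

def gated_lasso_loss(g, W_raw, x, y_onehot, lam=5.0):
    """Negative log likelihood of a softmax over gated inputs, plus the
    LASSO term lam * sum|g_i|. Because each gate g_i multiplies the
    entire column W_i, driving g_i to zero removes feature i."""
    logits = (x * g) @ W_raw                     # gate scales whole column W_i
    p = np.exp(logits - logits.max())
    p /= p.sum()                                 # softmax
    nll = -np.sum(y_onehot * np.log(p))          # -log f(Y, Y_pred)
    return nll + lam * np.abs(g).sum()           # + LASSO term on the gates
```

With all gates at zero the prediction is uniform over the classes and the penalty vanishes, so the loss reduces to log of the number of classes.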
A grid search of the regularization strengths ranging from 1 to 10 indicates that the best regularization strength parameter value for the classification task is λ = 5. To further limit the minimum observable set, only observables with a gate parameter of |g i | > 0.01 are kept. With these settings, the selection strategy results in 31 LASSO-selected observables whose distributions are shown in Figures 11 and 12.
We train a dense network operating on the 31 LASSO-selected observables, DNN 31 . A hyperparameter search indicates that the best structure for the network has six hidden layers of size (800-800-800-800-800-32) with the ReLU [37] activation function. Dropout with rate 0.3 and batch normalization are applied after every hidden layer. Please refer to Table 4 for the model summary of DNN 31 .
The DNN 31 network achieves an overall accuracy of 89.11 ± 0.32%, as shown in Figure 8. Note that for all classes, DNN 31 matches or closely approximates the performance of DNN 299 , indicating that the LASSO selection strategy has succeeded in identifying the observables which capture most of the information relevant to the classification task. 4 Figure 9 and Figure 10 show the accuracies across the jet p T and mass spectra, respectively. Both figures show similar trends for DNN 31 as for the other networks: uniform predictions across the jet p T spectrum, and generally rising accuracy with jet mass.

4 Although the LASSO-selected observables do a good job of matching the performance of the full 299-observable set, the overlapping nature of the EFP observables makes it likely that this is not a unique solution, and that different settings in the LASSO selection process may yield a somewhat different subset of observables.

Importance of EFP observables
After narrowing the set down to the 31 LASSO-selected observables, we attempt to interpret the strategies of the low-level networks in terms of the selected observables by ranking them in order of importance. We measure how much the DNN 31 model relies on each observable to make a prediction by randomly shuffling it at test time. That is, the k-th observable (k ∈ {1, . . . , 31}) in the test set is shuffled by randomly replacing it with a corresponding observable value from the training set, on an event-by-event basis. This maintains the marginal distribution of the k-th observable, but destroys any correlations with the other observables. The performance of the network is then evaluated on the set of events which includes the shuffled observable. This allows us to determine the importance of the k-th observable by measuring the drop in the performance of the network relative to the unshuffled set. By this method, an observable is considered important if shuffling its values decreases the accuracy of the model. Likewise, an observable is considered unimportant if shuffling its values leaves the accuracy unchanged, as this indicates that the model does not heavily rely on that observable when making predictions. Table 2 shows the ranking of the 31 LASSO-selected observables, with the most important corresponding to the largest drop in accuracy of the model when shuffled. These observables mainly consist of EFPs with four or fewer nodes, and N-subjettiness variables with N < 8. The latter is not surprising, as N-subjettiness variables with N significantly larger than the number of sub-jets are expected to be less important for the classification problem. Figures 13 and 14 show contour plots of the LASSO-selected observables versus the jet mass. The contour plots indicate that there is a correlation between the top-ranked observables and the jet mass.
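The shuffling procedure above can be sketched as follows; the toy classifier and data are illustrative (the class depends only on feature 0, so only that feature should show an accuracy drop):

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_importance(predict, X_test, y_test, X_train, rng):
    """For each feature k, replace the test-set column with values drawn
    from the training set (preserving its marginal distribution while
    destroying its correlations) and record the drop in accuracy."""
    base = (predict(X_test) == y_test).mean()
    drops = []
    for k in range(X_test.shape[1]):
        Xs = X_test.copy()
        Xs[:, k] = rng.choice(X_train[:, k], size=len(Xs))
        drops.append(base - (predict(Xs) == y_test).mean())
    return np.array(drops)

# toy model: the class is determined entirely by feature 0
predict = lambda X: (X[:, 0] > 0).astype(int)
X_train = rng.normal(size=(500, 3))
X_test = rng.normal(size=(200, 3))
y_test = (X_test[:, 0] > 0).astype(int)
drops = permutation_importance(predict, X_test, y_test, X_train, rng)
# feature 0 shows a large accuracy drop; features 1 and 2 show none
```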
This correlation is particularly striking for the τ_1^1 variable, which is the fourth most important observable 5 and can be interpreted as a measure of how collimated the jet constituents are along a single sub-jet axis, with lower τ_1^1 values indicating more collimated constituents. By inspecting this observable, we can explain the lower classification power of the networks at lower jet mass values seen in Figure 10: jets with lower masses may be more collimated and thus harder to classify.
As the importance of an observable may vary across the classes, we measure how much each class relies on each of the 31 LASSO-selected observables by measuring the drop in class accuracy as the observables are shuffled. Figures 15 and 16 show the drop in class accuracy for the top 10 observables per class. For the N = 1, 2, 3, 4b, 4q classes, the top observables mainly consist of EFPs with four or fewer nodes, and N-subjettiness variables with three or fewer sub-jet axes. It is interesting to note that the constituent multiplicity and jet mass are highly important for the N = 2 class. It is also notable that, although the classes N = 4b and 4q share some of their top observables, the relative importance of these observables differs between the two classes. For the N = 6, 8 classes, we find that while EFPs with four or fewer nodes remain among the most important observables, the N-subjettiness variables with N larger than 3 become more relevant. In particular, for the N = 8 case, τ_6^1 and τ_4^1 are the two most important observables, indicating that, at large scales, these jets may look like jets with six or four sub-jets.
For most classes, the most relevant observables are two-node EFPs or N-subjettiness variables with relatively small N (N = 1, 2, 3, or 6). Subsequent observables generally consist of EFPs with more complex shapes or N-subjettiness variables with larger N. This suggests that the network may distinguish between the different classes of jets by first utilizing observables that broadly capture the number of sub-jets, and then utilizing more complex observables that specialize in capturing specific traits of each class, such as the presence of collimated sub-jets or subtle differences in the topologies of the jets. The latter is further explored in Sec. 6.4.

Topology Dependence
The networks have learned to distinguish jets with various numbers of hard sub-jets. We do not, however, claim that they have learned to identify any jet with this number of hard sub-jets. On the contrary, it is likely that the patterns of energy depositions depend on the details of the topology, such as the invariant mass of intermediate resonances and the jet flavors. The networks' ability to distinguish jets from G → HH → 4b and G → W W → 4q is an example. In this section, we explore the dependence of the networks' classification strategies on these details of the sub-jet topology.
We generate two additional samples of jets with N = 4 hard sub-jets. In the first, labeled 4bM W , jets are produced with the G → HH → 4b process, but the mass of the Higgs boson has been set to the mass of the W boson to more closely align with the 4q sample. In the second, labeled 4qM H , jets are produced with the G → W W → 4q process, but the mass of the W boson has been set to the mass of the Higgs boson to more closely align with the 4b sample. As with the other samples, the 4bM W and 4qM H samples are selected such that they have uniform distributions in jet mass and p T . We evaluate the networks' predictions on the 4bM W and 4qM H samples; the frequency of the classification outputs for these two samples is shown in Figure 17. In the 4qM H case, all three networks mainly classify the samples as N = 4b jets, and rarely classify them as N = 4q jets. This suggests that the networks are learning a strong correlation between the intermediate masses and the final-state kinematics. Conversely, in the 4bM W case, the high- and low-level networks produce different classification predictions. The low-level networks mainly classify the 4bM W samples as N = 3 jets, while the DNN 31 mainly classifies these samples as N = 4q or N = 3. The more frequent prediction of N = 3 by the low-level networks hints at the possibility that these networks are identifying complex details of the jet topology, beyond simply identifying hard sub-jets, as the N = 3 sample includes an intermediate W boson as well as a b-jet.
The different predictions of the networks come as no surprise: although they reach similar accuracies, they each have different learning strategies. The DNN 31 learns functions of the 31 LASSO-selected observables, while the PFN learns an event-level latent representation of the jets by summing over the constituents, and the Transformer uses self-attention mechanisms to determine which parts of the constituent sequence to focus on when making predictions. 6 It is interesting to note that despite their different learning strategies, all taggers appear to depend on the detailed topologies, which is a benefit rather than a flaw in studies aiming to classify specific sub-jet topologies.

Conclusions
We have studied the task of classifying jets with a large number of hard sub-jets (up to N = 8), comparing the performance of networks trained on theoretically-motivated high-level observables to networks operating on low-level jet constituents. We find that the networks trained on the jet mass and high-level N-subjettiness variables are able to effectively discriminate between all classes, particularly for jets with many hard sub-jets or large jet mass. These findings are useful for current studies seeking to classify jets with multiple hard sub-jets. In addition, our results bode well for future jet studies in high-energy collider settings, where jets with additional hard sub-jets will become more important as the high-luminosity LHC collects large datasets in which high-p_T objects appear in greater numbers [9,10]. However, the networks trained on the N-subjettiness observables fall somewhat short in classification accuracy compared to networks utilizing low-level constituent information, particularly for jets with fewer sub-jets or smaller jet mass. We therefore supplement these observables with a large set of EFPs, for a total of 299 high-level observables. The network trained on the augmented high-level observable set is then able to match the performance of the well-studied PFN, and to approximate the performance of the Transformer model. The performance gap between the Transformer model and the high-level observables is small but significant, suggesting that there remains useful discriminating information in the low-level constituents that is not captured even by the full set of 299 high-level observables.
Utilizing LASSO regularization for feature selection, we identify the observables that contain most of the information relevant to the classification task. We find 31 high-level observables which largely bridge the gap to the PFN, and rank them in order of importance to gain insight into the learning strategies of the networks. We find that the most important observables are mainly EFPs with four or fewer nodes, and N-subjettiness variables with N < 8. We also identify the constituent multiplicity as one of the top observables, and find it to be particularly important for classifying N = 2 jets.
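The mechanics of LASSO-based selection can be sketched on toy data: the L1 penalty drives the coefficients of uninformative inputs exactly to zero, and the surviving inputs form the selected set. The observable count, informative indices, and regularization strength below are hypothetical, and the solver is a plain iterative soft-thresholding (ISTA) loop rather than the one used in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 20 candidate "observables", only 3 of which carry signal.
n, p = 500, 20
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = X @ true_w + 0.1 * rng.normal(size=n)

def lasso_ista(X, y, lam=0.1, n_iter=500):
    """LASSO via iterative soft-thresholding: minimize
    0.5*||Xw - y||^2 + lam*n*||w||_1."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1/L, L = Lipschitz const. of gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = w - step * grad
        # Soft-threshold: coefficients below the penalty scale are set to zero.
        w = np.sign(w) * np.maximum(np.abs(w) - step * lam * len(y), 0.0)
    return w

w = lasso_ista(X, y)
selected = np.flatnonzero(np.abs(w) > 1e-3)
print(selected)  # only the informative observables survive the L1 penalty
```

Ranking then follows from the magnitudes of the surviving coefficients (or from the order in which they vanish as the penalty is increased).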
By analyzing the most important observables for each input class N, we find that the networks' strategy may rely on first using simple observables that broadly capture the number of sub-jets, such as EFP polynomials with only a few nodes or N-subjettiness variables with small N values. Subsequently, the networks may use more complex observables to capture subtle traits of the jet topology. This is further supported by our results, which reveal the classifiers to have a strong topology dependence: they appear to be sensitive to the sub-jet resonance masses and flavor rather than simply to the sub-jet multiplicity. Future work may involve disentangling the nature of this information to probe more deeply the classification strategy for jets with many hard sub-jets.

A Binary Classification Performance
We evaluate binary classification of jets with N = k versus N = 1 hard sub-jets, with performance measured using the area under the receiver operating characteristic curve (ROC-AUC); see Table 3 for the ROC-AUC scores and Figure 18 for the ROC curves.

Table 3: ROC-AUC percentage score for classifying jets with N = k versus N = 1 hard sub-jets, averaged over results from 10-fold cross validation.
The binary classification results agree with the confusion matrices in Figure 5, which show a larger degree of confusion between the N = 1 and the N = 2, 3, 4b classes, and almost perfect classification between the N = 1 and the N = 4q, 6, 8 classes.
As a cross-check, the PFN and DNN_31 are retrained to perform the binary classification tasks. The results are included in Table 3 as PFN_binary and DNN_31,binary. The multi-class taggers slightly outperform the binary taggers, particularly for lower N. This suggests that training on the different N-sub-jet classes may have resulted in a more robust decision boundary for the N = 1 class. This hypothesis aligns with the results in [49,50], where training on sub-classes helped improve multi-class classification performance.
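For reference, the ROC-AUC used in these comparisons can be computed directly from its rank-statistic definition: the probability that a randomly chosen positive (N = k) jet scores above a randomly chosen negative (N = 1) jet, with ties counted as one half. The scores below are illustrative, not taken from the paper:

```python
import numpy as np

def roc_auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney U statistic: P(score_pos > score_neg),
    counting ties as 1/2."""
    pos = np.asarray(scores_pos)[:, None]
    neg = np.asarray(scores_neg)[None, :]
    return (pos > neg).mean() + 0.5 * (pos == neg).mean()

# e.g. classifier outputs for N = k jets (positives) vs N = 1 jets (negatives)
auc = roc_auc([0.9, 0.8, 0.6, 0.7], [0.3, 0.4, 0.8, 0.2])
print(auc)  # 0.84375
```

This pairwise definition is equivalent to the area under the ROC curve traced out by sweeping the decision threshold.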

B Additional Technical Details
In this appendix, we describe the details of the machine learning models and network architectures. The Transformer and PFN models are trained on the three-momenta of the simulated jet constituents, which are preprocessed by normalizing the jet p_T to unity and subtracting the p_T-weighted angular means, as described in the main text. The dense networks are instead trained on high-level observables computed from the constituent three-momenta. This ensures that the dense networks have access to only a subset of the information available in the low-level jet constituents. The 10-fold average accuracy of the models and the number of parameters are summarized in Table 4. Properties common to all networks include ReLU [37] activation functions for all hidden layers, and a 7-dimensional softmax output to classify between the seven sub-jet classes. All networks are trained using the Adam [51] optimizer for up to 1000 epochs, with a batch size of 256.
All networks were optimized via a hyperparameter search using the Sherpa [52] hyperparameter optimization library, while ensuring that the range of trainable parameters explored in the optimization is roughly the same for all networks. The hyperparameters of the dense networks were optimized over ranges including the intermediate layer dimension. Training of the Transformer and all dense networks was implemented in PyTorch [53] using NVIDIA V100 GPUs; training of the PFN was implemented in Keras [54].

Table 4: Summary of the machine learning models used in the classification task. The table shows a brief description of each model, the number of trainable parameters, and the accuracy measured using 10-fold cross validation. The inference times are provided to give a sense of the latency of each algorithm, as we did not perform any inference optimization.
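The structure common to the dense networks can be sketched as a plain forward pass. Only the 31-observable input, the ReLU hidden layers, and the 7-way softmax output follow the text; the layer widths and random weights are hypothetical stand-ins for the tuned values:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dense architecture: 31 inputs -> two ReLU hidden layers -> 7 classes.
sizes = [31, 128, 128, 7]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(x):
    for W in weights[:-1]:
        x = np.maximum(x @ W, 0.0)  # ReLU hidden layers
    logits = x @ weights[-1]
    # Numerically stable softmax over the 7 sub-jet classes
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

probs = forward(rng.normal(size=(4, 31)))  # a batch of 4 jets
assert probs.shape == (4, 7)
assert np.allclose(probs.sum(axis=1), 1.0)
```

In the actual study the weights are trained with Adam and the widths and depths are set by the Sherpa search; this sketch only fixes the input/output interface.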