Streamlined jet tagging network assisted by jet prong structure

Attention-based transformer models have become increasingly prevalent in collider analysis, offering enhanced performance for tasks such as jet tagging. However, they are computationally intensive and require substantial data for training. In this paper, we introduce a new jet classification network using an MLP mixer, where two subsequent MLP operations serve to transform particle and feature tokens over the jet constituents. The transformed particles are combined with subjet information using multi-head cross-attention so that the network is invariant under the permutation of the jet constituents. We utilize two clustering algorithms to identify subjets: the standard sequential recombination algorithms with fixed radius parameters and a new IRC-safe, density-based algorithm of dynamic radii based on HDBSCAN. The proposed network demonstrates comparable classification performance to state-of-the-art models while boosting computational efficiency drastically. Finally, we evaluate the network performance using various interpretable methods, including centred kernel alignment and attention maps, to highlight network efficacy in collider analysis tasks.

Because the order of the particles in an event should not alter the outcome, any ML model for HEP should be a function of a set of particles without order, a "particle cloud". While these methods have demonstrated high performance in various applications, they are not well-suited for particle cloud analysis. The particle cloud model mitigates combinatorial ambiguities by representing final state particles as a permutation-invariant sequence. A network with an input data set of size $N$ must be invariant under the $N!$ permutations of the inputs. This simultaneously provides the ability to capture both local structure from nearby particles and global structure among all particles in the cloud by ensuring that all possible combinations of particles are considered. Models based on particle clouds were initially introduced in particle physics by [36,37]. These models offer not only the advantages of particle-based approaches but also the flexibility to incorporate arbitrary features of particles, such as particle ID and vertex.
Several particle cloud ML models have been introduced for collider analysis, including Deep Sets [36], Edge convolution [37] and Transformers [38-42]. The Deep Sets model was first introduced in [43]. To achieve state-of-the-art performance, the Deep Sets model requires a sizeable latent space and becomes very complex. The Edge Convolution Neural Network (EDGCNN) offers a method to incorporate local information gained from the nearest neighbours of each particle in the cloud when learning the task. Other networks, such as JEDI-net [44], Point Cloud Transformer [45], LorentzNet [28], and PELICAN [46], are also utilized for particle cloud analysis, and we plan to compare our results with these models.
Particle Transformer [38] is a Transformer-based particle cloud model. Transformers were initially proposed as sequence-to-sequence models for machine translation [47] and modified for collider purposes in [38]. At the core of the Particle Transformer is the attention mechanism, which enables the model to focus selectively on different parts of the input data, assigning varying degrees of importance to each component. Since the attention mechanism computes the attention weights of each particle with respect to all other particles in the dataset, these weights remain unchanged even if the order of the input tokens is permuted. The transformer models have shown the best performance for collider analysis. However, they suffer from high model complexity and require extensive run time.
This paper presents a novel permutation-invariant network that retains high classification performance with reduced runtime. This is achieved by analyzing subjet and jet constituent information via cross-attention heads and a "Mixer network" consisting of multi-layer perceptrons (MLPs).
The utilization of subjet information in jet clustering has already appeared in the original ideas of jet classifications [1], aiming to identify the cluster location inside the jet for jet classification.
The subjet momentum is an IRC-safe quantity that matches the momentum of the hard parton originating the jet. The Mixer network integrates two MLPs to capture local and global information about the event. Like transformers, these MLPs amalgamate particle and feature tokens from the dataset, facilitating efficient structure learning. The first MLP combines all particle tokens in the dataset, with weight sharing enabling the model to learn the dependencies among each particle and all others. Meanwhile, the second MLP mixes the features of all particle tokens, resulting in a dataset with the same dimensionality as the inputs. The concept of the mixer layer, initially introduced for image classification in [48], is central to our model. However, the MLP part of the mixer layer is not inherently permutation invariant and relies on the order of input particles within the set.
We utilize a cross-attention mechanism that analyzes pre-clustered subjet data to achieve both permutation invariance and model performance. Adapting cross-attention heads to analyze jet kinematics and jet constituents has already been introduced in [41,49], where global event kinematics and constituents of the jets are first processed individually by self-attention heads, and subsequently combined by cross-attention. In this setup, jet constituents become the keys, and global event kinematics become queries and values. The scaled cross-attention score becomes the matrix of the dot product between the query and key matrices, which is trained to highlight the quantity significant for the event classification. In short, jet constituent information valuable for event classification is used to transform jet momenta.
In this paper, similar to the previous setup, we apply the cross-attention for jet substructure analysis; the key matrix is constructed from subjet momenta, while jet constituents become the query and the value matrices. These cluster momenta are closely related to the parton momenta from the hard process or resolved emissions from the parton shower. The applied cross-attention heads weigh the relation between clusters and the jet constituents contributing to the event classification. Moreover, the network structure is well-suited for estimating the probability of hadron collider processes. This is because the final state particle distribution at a hadron collider may be described schematically in the factorization limit (Eq. 1) in terms of $P_s$, the probability of the hadrons $\{x\}$ conditioned on the parent parton information $\{y\}$, and $P_h$, the parton distribution of the hard process involving $\{y\}$.
The hadron distribution in the jet correlates with hard parton momenta through the cluster momenta. Therefore, a model based on the jet-to-jet-constituent cross-attention is promising in representing the QCD process. See the schematic illustration of the relations between the parton shower process and hadronization in Fig. 1.
This paper is organized as follows. In section II, we describe the network structure, which consists of mixer layers for the constituents and a cross-attention network to incorporate subjet information. We pay particular attention to ensuring that the resulting network is permutation invariant. We describe our top and QCD samples for network validation in section III.
In section IV, we describe the different subjet clustering methods studied in this paper. We utilize standard sequential recombination algorithms such as Cambridge-Aachen (CA) [50] and anti-kt [51], which typically require a fixed radius parameter for clustered subjets.
Because the fixed radius parameter could be a limitation in capturing the jet structure and requires hyperparameter tuning, we employ Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [52], which does not depend on the radius parameter, and use a suitable distance metric to ensure infrared and collinear (IRC) safe clustering. The results are given in section V. In section VI, we examine the network output using different interpretability methods, including centred kernel alignment and attention maps, to highlight network efficacy in collider analysis tasks.

II. NETWORK ARCHITECTURE
In this section, we explain the structure of our network. As we already stressed in the introduction, the core of our network is a simple mixer layer integrated with the subjet information by cross-attention so that the network maintains the hierarchy between low- and high-scale physics.
The proposed network comprises distinct layers: an input layer, a mixer layer, an aggregation layer and a final fully connected (FC) layer. The permutation invariance of the network is ensured by the aggregation layer and the cross-attention heads within the mixer layer.
The core of the network is the mixer layer, which consists of two components, two MLPs and cross-attention heads, and is discussed in subsection II.A. The first MLP acts on each particle in the cloud individually, while the second one acts on each feature of the mixed particles after transposing the dataset. The MLP shares weights across the processing layers, ensuring that all particle and feature tokens obey the same transformation (see Fig. 2). This allows the network to learn a unified representation among different features.
Input data to the mixer layer passes sequentially through two MLPs consisting of densely connected neural network layers. It is then passed to the cross-attention heads along with the subjet dataset. Note that the two MLPs operate similarly to transformer models with self-attention heads, combining particles and their corresponding features across the entire dataset. This enables the extraction of local and global structural information within the event. A side effect is that the MLPs have fewer tunable parameters with which to express the complex structure of the event than other particle cloud models such as ParticleNet or transformers.
To compensate for the reduced expressivity due to the smaller number of parameters of the Mixer network, we introduce a second input dataset containing subjets to discern the substructure of top and QCD jets. The details of the subjet clustering methods are not essential for the network description and are discussed in section IV. The additional dataset, together with the output of the MLPs, is analyzed by the network using multi-head cross-attention, described in subsection II.B. The mixer layer preserves the dimension of the input dataset and can be repeated for better performance with more complex data. The network structure is shown in Fig. 2.
To further ensure the permutation invariance of the network, the mixer layers are followed by a global Max-Pooling layer. An additional FC layer is added before the output layer with two neurons.

A. MLP mixer
At the heart of the MLP Mixer lies its feature-mixing mechanism. It begins with feature mixing by transposing the particle and feature axes, and then it continues with particle MLP mixing so that the input data is mixed in the feature and particle axes. Consider the input data set $X_{i,j}$, in which $i$ and $j$ run over the particle tokens and their features, respectively. The input data is first passed through a linear FC layer to map the features to higher dimensions. The first MLP acts on the features of each particle token individually, mixing them into new features; $W_1$ and $W_2$ are the weight matrices of the FC layers of the first MLP, and $\sigma$ is the activation function, which acts on each component. The output is then passed to the second MLP, which mixes the particle tokens; $W_3$ and $W_4$ are the weights of the FC layers of the second MLP, and $\sigma$ is again the activation function. The skip connection ensures that the input ($X_{i,j}$) and output ($X'_{i,j}$) are strongly correlated. The output $X'_{i,j}$, together with the clustered subjet dataset, is the input to the multi-head cross-attention.
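The two mixing steps can be illustrated with a minimal NumPy sketch. It is not the implementation used in this paper: the placement of the layer normalization and of the skip connections, the hidden width, the random weights, and the names `mlp` and `mixer_block` are assumptions made for illustration only. The last lines check numerically that the particle-mixing MLP alone is not permutation equivariant, which is why an additional mechanism is needed.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalize each row (one particle token) over its features
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w_in, w_out):
    # two FC layers acting on the last axis, shared across all rows
    return gelu(x @ w_in) @ w_out

def mixer_block(x, w1, w2, w3, w4):
    # x has shape (n_particles, n_features)
    # feature mixing: the same MLP acts on every particle token
    y = x + mlp(layer_norm(x), w1, w2)
    # particle mixing: transpose so that the MLP acts along the particle axis
    return y + mlp(layer_norm(y).T, w3, w4).T

# illustrative sizes and random weights (assumptions, not trained values)
rng = np.random.default_rng(0)
n_particles, n_features, hidden = 100, 128, 64
x = rng.normal(size=(n_particles, n_features))
w1 = rng.normal(scale=0.1, size=(n_features, hidden))
w2 = rng.normal(scale=0.1, size=(hidden, n_features))
w3 = rng.normal(scale=0.1, size=(n_particles, hidden))
w4 = rng.normal(scale=0.1, size=(hidden, n_particles))

out = mixer_block(x, w1, w2, w3, w4)

# reordering the particles does not simply reorder the output rows
perm = rng.permutation(n_particles)
order_dependent = not np.allclose(mixer_block(x[perm], w1, w2, w3, w4), out[perm])
```

The block preserves the input shape, so it can be stacked, in line with the statement that the mixer layer can be repeated.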

B. Cross-attention
The advantage of cross-attention lies in its ability to capture relationships and dependencies between different input sequences or modalities. Unlike self-attention, which operates within a single sequence, cross-attention allows a neural network to incorporate information from one sequence into the processing of another by giving a structure that is suitable for understanding interactions between different classes of the input data.
In the context of our paper, the subjet momentum and jet constituents have different modalities. In principle, the structure of jet clustering can be learned from the jet constituent information. However, having the subjet clustering information as a separate dataset, the network can easily assign jet constituents to the subjet through their location information, allowing the network to concentrate on the structure of each subjet.
Consider the input to the multi-head cross-attention, $X'_{i,j}$, which represents the output of the second MLP, along with the subjet information $S_{n,m}$. A linear FC layer acts on $X'$ to generate the query matrix, while other FC layers generate the key and value matrices from $S$; here $Q$, $K$ and $V$ are the query, key and value matrices, respectively, and are used to compute the attention of the dataset. The scaled dot-product attention score is computed from $Q$ and $K$. The resulting attention weight matrix has dimensions (particle tokens × subjet tokens) and assigns each particle token to its subjets. The attention output is then obtained from the attention weights and the value matrix. The attention output matrix has the same dimension as the first input dataset $X$. It highlights the importance of each particle token once assigned to the corresponding subjets and the entire dataset. This assignment of each particle token to its corresponding subjets enables the attention output matrix to emphasize the crucial particle tokens for learning the substructure of the top jet. In this case, the attention output matrix exhibits a different structure between top and QCD jets, as discussed in section VI. The process described above is repeated for each attention head, resulting in multiple attention outputs for each token. Finally, the attention outputs from all heads are combined and projected into a single representation. This combined representation captures different aspects of the input data learned by each attention head. In the output of the multi-head cross-attention, $W^{(n \cdot j \times j)}$ is the learnable linear transformation matrix that retains the dimensions of the input dataset. The attention output is used to scale the input dataset via a skip connection. The transformed dataset signifies the importance of each element relative to all elements within the set. While the attention output integrates input and feature tokens, the skip connection preserves the correlation to the original input dataset. Moreover, it preserves the dimensions of $X_{i,j}$.
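The cross-attention step can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the results: the softmax normalization and the $1/\sqrt{d}$ scaling are the conventional choices for scaled dot-product attention, and the number of heads, the head dimension, the random weights and all function names are hypothetical. Queries are built from the constituents and keys and values from the subjets, following the description in this subsection.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_head(x, s, w_q, w_k, w_v):
    q = x @ w_q   # (n_particles, d): queries from the constituents
    k = s @ w_k   # (n_subjets, d): keys from the subjets
    v = s @ w_v   # (n_subjets, d): values from the subjets
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (n_particles, n_subjets)
    return weights @ v, weights

def multi_head_cross_attention(x, s, heads, w_o):
    # concatenate the head outputs, project back to the input width, add the skip connection
    outs = [cross_attention_head(x, s, *h)[0] for h in heads]
    return x + np.concatenate(outs, axis=-1) @ w_o

# illustrative sizes and random weights (assumptions, not trained values)
rng = np.random.default_rng(0)
n_particles, n_subjets, n_feat, d_head, n_heads = 100, 15, 7, 16, 4
x = rng.normal(size=(n_particles, n_feat))
s = rng.normal(size=(n_subjets, n_feat))
heads = [tuple(rng.normal(scale=0.3, size=(n_feat, d_head)) for _ in range(3))
         for _ in range(n_heads)]
w_o = rng.normal(scale=0.3, size=(n_heads * d_head, n_feat))

out = multi_head_cross_attention(x, s, heads, w_o)
_, weights = cross_attention_head(x, s, *heads[0])

# reordering the constituents reorders the output rows in the same way,
# and reordering the subjets leaves the output unchanged
perm_p = rng.permutation(n_particles)
perm_s = rng.permutation(n_subjets)
equivariant = np.allclose(multi_head_cross_attention(x[perm_p], s, heads, w_o), out[perm_p])
subjet_invariant = np.allclose(multi_head_cross_attention(x, s[perm_s], heads, w_o), out)
```

In this sketch the cross-attention is equivariant under constituent permutations; the invariance of the final score then comes from a symmetric aggregation over the particle axis.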
Ultimately, the transformed dataset is processed by a global Max-Pooling layer, which identifies the particle token with the highest score. The global max pooling takes the maximum over the $k$ particle tokens in the dataset. While any symmetric aggregation function could be utilized to maintain the network's permutation invariance, we found that Max-Pooling has the best performance [53].
The output is then passed to an FC layer with ReLU activation and an output layer with two neurons. The final output score encodes the probability of the input event being a signal or a background event.
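As a small numerical illustration of why a symmetric aggregation yields permutation invariance, the following sketch applies a global maximum over the particle axis to a random array and to a row-permuted copy of it. The array contents and the function name `global_max_pool` are illustrative assumptions.

```python
import numpy as np

def global_max_pool(x):
    # x has shape (n_particles, n_features); the maximum runs over the particle axis
    return x.max(axis=0)

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 7))
perm = rng.permutation(len(x))

pooled = global_max_pool(x)
pooled_permuted = global_max_pool(x[perm])
```

The two pooled vectors are identical because the maximum does not depend on the order of its arguments; the same holds for other symmetric functions such as the sum or the mean.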

C. The role of cross-attention for collider physics
The cross-attention network is suited to studying the correlation between hard partons and hadrons in the events. Considering a hard process leading to $N$ final jets, the factorization picture connects the parton distribution to the hadron distribution as in Eq. (11) [54], where $H_N$ expresses the hard scattering cross section; $B_a$ and $B_b$ are the beam functions; $J$ expresses the collinear evolution of hard partons from the hard scattering; and the soft function $S_N$ expresses the soft radiation. The formula suggests that the soft hadron distributions in a jet are conditioned by the hard process $H_N$, the parton evolutions, and the hadronization processes that connect all partons.
Due to the correlation between parton momenta and jet momenta, the QCD process may schematically be expressed (Eq. 12) in terms of $P_s$, the hadron distribution in the jet conditioned on the jet features, and $P_h$, the distribution of jets, which approximately expresses $H_N \prod_k J_k$. Note that $P_s$ is conditioned on all jets in the event due to the effect of $S_N$ in Eq. 11. Eq. 1 is a much simpler approximation, which assumes hadrons arising from a single parton.
In our network, the cross-attention score is computed as $\alpha = QK^T$, which is the product of the output from the mixer layer and the subjet information. Therefore, the network is strongly directed to study the structure given by Eq. 12. Taking the correlation between all subjets and all constituents accounts for the $S_N$ factor in our network. Note that the splitting between $P_h$ and $P_s$ has an ambiguity in the choice of the jet radius parameter $R$. If one takes a smaller $R$, the number of subjets increases because subjets are split. In Eq. 11, this corresponds to a change of the resolving scale of the parton shower. The radius $R$ is an ad hoc parameter of our network. The proper choice of the radius parameter $R$ for our event sample, and a method that does not rely on the radius parameter $R$, will be discussed in section IV.

III. TOP TAGGING DATASET
Top tagging, namely the identification of jets originating from hadronically decaying top quarks, is crucial in searches for new physics at the LHC. To assess the effectiveness of the proposed network, we utilize the top tagging dataset [34]. Jets in this dataset are generated at a centre-of-mass energy of $\sqrt{s} = 14$ TeV using Pythia8 [55]. Delphes [56] is used for fast detector simulation. The simulation does not account for multiple parton interactions or pileup effects. The jets are clustered from Delphes E-Flow objects using the anti-kt algorithm with a radius parameter $R = 0.8$. Jets with transverse momentum $p_T \in [550, 650]$ GeV and pseudorapidity $|\eta| < 2$ are considered. For top events, the event should contain a jet that matches the top quark, namely, a jet within $\Delta R = 0.8$ of a hadronically decaying top quark, with all three quarks from the top decay also within $\Delta R = 0.8$ of the jet axis.
The QCD dijet process is considered as the background.
The data set contains 1 million $t\bar{t}$ events and 1 million QCD dijet events. We adhere to the official split: 1.2M events for training, 400k events for validation, and 400k events for testing. The data sample has been widely used in the previous literature, making it easy to compare the network performance with others. One drawback of using this sample is the effective sample imbalance around the top mass region: the top sample peaks around 170 GeV while the QCD sample peaks near zero. In other words, the overlap between the top sample and the QCD sample is poor, making it difficult to compare the fine differences among high-performance networks.
Up to 200 constituent particles (hadrons) are retained for each jet in the dataset, with the 4-momenta $(p_x, p_y, p_z, E)$ of each particle. From this dataset, we construct the first input dataset with up to 100 $p_T$-ordered jet constituents with seven features for each particle as:
• $\Delta\eta = \eta - \eta_{\rm jet}$, where $\eta$ ($\eta_{\rm jet}$) is the pseudorapidity of each constituent (jet).
• $\log(p_T)$, transverse momentum of each constituent in GeV.
• $\log(E)$, energy of each constituent in GeV.
• $\log(p_T/p_{T,{\rm jet}})$, normalized transverse momentum of each constituent.
• $\log(E/E_{\rm jet})$, normalized energy of each constituent.
The first input dataset has the dimension (100, 7), with the first and second numbers referring to the maximum number of jet constituents and the number of features, respectively. Any events with fewer jet constituents are padded with zeros to maintain the uniformity of the data size. The second input dataset is the subjet information of the jets, with the size (15, 7). The first index denotes the maximum number of subjets. Again, if the number of subjets is less than 15, the remaining entries are padded with zeros. The second index denotes the subjet features, which are the same as the features of the jet constituents.
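The construction of the padded constituent input can be sketched as below. The sketch is hedged in two respects. First, the list above names five of the seven features; the two remaining columns used here, $\Delta\phi$ and $\Delta R$, are an assumption made only to reach the stated shape (100, 7). Second, the jet four-momentum is taken as the sum of the supplied constituents, and the function names are hypothetical.

```python
import numpy as np

MAX_CONSTITUENTS = 100  # maximum number of constituents kept per jet (from the text)

def eta_phi_pt(px, py, pz):
    pt = np.hypot(px, py)
    return np.arcsinh(pz / pt), np.arctan2(py, px), pt

def constituent_features(p4, max_len=MAX_CONSTITUENTS):
    """p4: array of shape (n, 4) with rows (px, py, pz, E) in GeV."""
    p4 = np.asarray(p4, dtype=float)
    px, py, pz, energy = p4.T
    eta, phi, pt = eta_phi_pt(px, py, pz)
    jet = p4.sum(axis=0)  # assumption: jet momentum is the sum of the constituents
    jet_eta, jet_phi, jet_pt = eta_phi_pt(jet[0], jet[1], jet[2])
    d_eta = eta - jet_eta
    d_phi = (phi - jet_phi + np.pi) % (2 * np.pi) - np.pi  # assumed feature
    feats = np.stack([
        d_eta,
        d_phi,
        np.log(pt),
        np.log(energy),
        np.log(pt / jet_pt),
        np.log(energy / jet[3]),
        np.hypot(d_eta, d_phi),                            # assumed feature
    ], axis=1)
    order = np.argsort(-pt)[:max_len]                      # pT ordering, hardest first
    padded = np.zeros((max_len, feats.shape[1]))           # zero padding
    padded[:len(order)] = feats[order]
    return padded

# toy jet with three massless constituents
momenta = np.array([[1.0, -1.0, 0.2], [100.0, 0.0, 10.0], [50.0, 5.0, 5.0]])
p4 = np.column_stack([momenta, np.linalg.norm(momenta, axis=1)])
x = constituent_features(p4)
```

The same routine applied to subjet four-momenta with `max_len=15` would give an array of shape (15, 7), matching the second input described above.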

IV. SUBJETS CLUSTERING
In jet classification tasks using ML, the local structure of the dataset may be extracted from its nearest neighbours. Network models can extract the data patterns to tag jets by finding the useful correlations among the jet constituents [57]. However, if we ask the network to analyze the correlation among the jet constituents from scratch, the required computational resources increase significantly. Instead, one can incorporate the information by adding some feature representation. For this purpose, we introduce the network assisted by the subjet information, which encodes the local structure of the jet.
Unlike the non-parametric approach using ML models, the subjet depends on the choice of the jet clustering algorithm. In hadron collider physics, the preferred choice of jet clustering is the sequential recombination algorithms: Cambridge-Aachen (CA), kT, or anti-kt. These algorithms have several advantageous properties with minimal parameter adjustments and are provided by a computationally efficient package, FastJet [58]; they are IRC safe and flexible enough to capture various jet properties in different environments. For example, CA is suited to comparing the data with QCD calculations, while anti-kt is used to reconstruct jets in the presence of the underlying event.
These algorithms operate on a recursive or iterative basis and are agglomerative. Agglomerative clustering is a bottom-up hierarchical method where each data point starts as its own cluster. The clustering process involves computing pairwise distances between clusters, called pseudojets, merging the closest ones, and updating the distance matrix until the smallest distance exceeds a threshold. Ensuring the infrared safety of the created jets entails an early combination of pairs of particles resulting from soft and collinear emissions during the clustering process. It should be noted that the clustering sequence carries an important hint of the parton shower process, and the graph neural network utilizing the clustering sequence, LundNet [29], achieves high performance.
Despite the success of jet clustering algorithms, they have an ad hoc parameter $R$ that determines the cluster size, indirectly affecting the number of constructed jets [59,60].
HDBSCAN is a clustering algorithm based on a hierarchical density estimate, and it is not associated with any distance parameter $R$. HDBSCAN defines clusters adaptively by leveraging the density of data points within proximity, and does not have a predetermined jet radius parameter. It identifies meaningful clusters and distinguishes outliers and noise, so that soft particles in low-density regions are considered noise points and left unclustered, thus providing a comprehensive perspective on the data structure.
In the rest of this paper, we consider the three clustering algorithms to prepare the subjet dataset. We also test the impact of each algorithm on the classification performance of the network.

A. Clustering with radius parameter
The CA and anti-kt algorithms are commonly used in jet clustering. They follow the nearest-neighbour method: for the pair $i$ and $j$ with the minimum distance $d_{ij}$, one replaces $i$ and $j$ with a new object called a "pseudojet" with momentum $p_i + p_j$. If the smallest distance is a distance to the beam, $d_{iB}$, the particle $i$ is removed from the list. This procedure is repeated until the smallest $d_{ij}$ or $d_{iB}$ is above some threshold $d_{\rm cut}$. The CA algorithm comes with the pair distance measure
$$ d_{ij} = \frac{\Delta R_{ij}^2}{R^2}, \qquad d_{iB} = 1, $$
where $\Delta R_{ij}^2 = (\eta_i - \eta_j)^2 + (\phi_i - \phi_j)^2$ and $d_{iB}$ is the distance of particle $i$ from the beam. On the other hand, the anti-kt algorithm [51] comes with the pair distance measure
$$ d_{ij} = \min\left(p_{T,i}^{-2},\, p_{T,j}^{-2}\right) \frac{\Delta R_{ij}^2}{R^2}, \qquad d_{iB} = p_{T,i}^{-2}. $$
Hard anti-kt jets have circular shapes on the $\eta$-$\phi$ plane and look like jets in a cone algorithm [51].
To examine the impact of the radius parameter $R$ on the model classification performance, we tested the network performance for different values of $R$, ranging from 0.1 to 0.5, using the CA algorithm. Figure 3 illustrates the performance of the mixer network for classifying top and QCD jets, evaluated using the Area Under the ROC Curve (AUC) metric. Optimal classification performance is observed at $R = 0.3$, with deviations from this value resulting in decreased accuracy. This behaviour is expected, as increasing or decreasing $R$ may dilute the subjet structure of the top events.
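The recombination procedure can be made concrete with a short pure-Python sketch. It is a didactic $O(N^3)$ version, not FastJet and not the code used for the results: it uses the standard generalised-kt distance with exponent `power` (0 for CA, -1 for anti-kt, 1 for kT), the pseudorapidity-based $\Delta R$ defined above, four-momentum addition for merging, and it runs until every object has been promoted to a subjet instead of stopping at a threshold $d_{\rm cut}$. The toy particles and function names are illustrative assumptions.

```python
import math

def pt_eta_phi(p):
    px, py, pz, _ = p
    pt = math.hypot(px, py)
    return pt, math.asinh(pz / pt), math.atan2(py, px)

def delta_r2(a, b):
    _, eta_a, phi_a = pt_eta_phi(a)
    _, eta_b, phi_b = pt_eta_phi(b)
    dphi = (phi_a - phi_b + math.pi) % (2 * math.pi) - math.pi
    return (eta_a - eta_b) ** 2 + dphi ** 2

def cluster(particles, radius=0.3, power=0):
    """power = 0: Cambridge-Aachen, power = -1: anti-kt, power = 1: kT."""
    pseudojets = [tuple(p) for p in particles]
    jets = []
    while pseudojets:
        best = None  # (distance, i, j); j is None for a beam distance
        for i, a in enumerate(pseudojets):
            d_ib = pt_eta_phi(a)[0] ** (2 * power)
            if best is None or d_ib < best[0]:
                best = (d_ib, i, None)
            for j in range(i + 1, len(pseudojets)):
                b = pseudojets[j]
                d_jb = pt_eta_phi(b)[0] ** (2 * power)
                d_ij = min(d_ib, d_jb) * delta_r2(a, b) / radius ** 2
                if d_ij < best[0]:
                    best = (d_ij, i, j)
        _, i, j = best
        if j is None:
            jets.append(pseudojets.pop(i))  # beam distance is smallest: final subjet
        else:
            merged = tuple(u + v for u, v in zip(pseudojets[i], pseudojets[j]))
            pseudojets.pop(j)               # j > i, so remove j first
            pseudojets.pop(i)
            pseudojets.append(merged)
    return sorted(jets, key=lambda p: pt_eta_phi(p)[0], reverse=True)

def massless(pt, eta, phi):
    return (pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta), pt * math.cosh(eta))

# toy jet: two hard prongs, each with a nearby softer particle, plus one isolated soft particle
particles = [massless(100, 0.00, 0.00), massless(20, 0.05, 0.05),
             massless(80, 0.50, 0.50), massless(10, 0.55, 0.45),
             massless(1, -0.60, -0.60)]
ca_subjets = cluster(particles, radius=0.3, power=0)
akt_subjets = cluster(particles, radius=0.3, power=-1)
```

For this toy configuration both choices of `power` return three subjets, since each soft neighbour lies well inside $R = 0.3$ of its hard prong while the prongs and the isolated particle are separated by more than $R$.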

B. Dynamic radius clustering (HDBSCAN)
HDBSCAN represents a sophisticated extension of traditional clustering algorithms for point data. It introduces a hierarchical approach that identifies clusters of varying densities. Unlike conventional clustering algorithms that necessitate prior specification of the cluster count, HDBSCAN autonomously discerns clusters of varying sizes and shapes. This adaptability renders HDBSCAN particularly well suited to capturing meaningful geometrical information about the substructure of the top jet. The algorithm is based on the k-nearest-neighbour method and is as fast as the FastJet package.
HDBSCAN begins by computing the mutual reachability distance between particles in the event, creating a reachability distance matrix. Then, HDBSCAN constructs a minimum spanning tree from this matrix and identifies the particles corresponding to the peaks in the density of the minimum spanning tree. These particles become the initial cluster centres. Next, the algorithm performs a hierarchical clustering of the particles using a variation of single linkage clustering. Finally, it assigns each particle to a cluster based on the hierarchical clustering structure, leaving particles in low-density areas as noise points. In Fig. 4, the bottom left (right) plot shows the resulting five (four) subjet clusters when using anti-kt (CA) with radius $R = 0.3$. Due to the unclustered particles, there is a mismatch between the original jet momentum and the sum of the reclustered subjet momenta; for HDBSCAN clustering on the top and QCD samples, the average value is $(\sum P_{\rm sub})/P_{\rm jet} = 0.983$, and 75.97% of the events have $\sum P_{\rm sub} \ge 0.95\, P_{\rm jet}$, where $P_{\rm sub}$ and $P_{\rm jet}$ are the subjet and jet momenta.
Subjets are clustered as follows. The algorithm starts by calculating the distance to the $k$-th nearest neighbour (the core distance) for all the particles. This $k$ is the discrete hyperparameter of HDBSCAN. A particle $p$ is a core particle if the distance to all its nearest neighbours is less than the distance to other core particles. We modify the reachability distance with the particle transverse momentum (Eq. 15), where $\eta$, $\phi$ are the pseudorapidity and azimuthal angle of the considered particle and $S_{i,j}$ is a singularity factor that ensures that softer particles are combined earlier with the nearest cluster, as discussed in [59]; it contains a small number $k$, taken to be 0.0001. The singularity factor is 0 for soft and collimated particles and 1 for well-isolated particles. Therefore, this distance measure interpolates between the $k_T$ and CA distance measures. Note that this definition could make the algorithm sensitive to the underlying event, while it helps the network actively capture the geometry of the soft particle distributions.
Once the core distances are defined, the density of the particles in the $\eta$-$\phi$ plane is defined via a mutual reachability distance measure $r$, which is defined as $r(x_i, x_j) = \max\{{\rm cor}_k(x_i),\, {\rm cor}_k(x_j),\, d(x_i, x_j)\}$, where ${\rm cor}_k(x_i)$ and ${\rm cor}_k(x_j)$ are the core distances of $x_i$ and $x_j$, respectively, from their $k$-th nearest neighbour, and $d(x_i, x_j)$ is the distance between the two particles, as defined in Eq. 15. Note that the distance between the particles is replaced with the core distance if the particle is in a sparse area; therefore, the particles in the sparse area tend to be pushed away from the other clusters.
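The effect of the core distance on sparse regions can be seen in a small sketch. Two simplifications are assumed: the pair distance is the plain $\eta$-$\phi$ distance rather than the $p_T$-modified measure of Eq. 15, and the core distance is taken to the $k$-th neighbour excluding the point itself. The four toy points and the function names are illustrative.

```python
import math

def pairwise_distances(points):
    # plain eta-phi distance (assumption: stands in for the modified distance)
    return [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in points] for a in points]

def core_distances(dist, k):
    # distance from each point to its k-th nearest neighbour, excluding itself
    return [sorted(row[:i] + row[i + 1:])[k - 1] for i, row in enumerate(dist)]

def mutual_reachability(dist, k):
    core = core_distances(dist, k)
    n = len(dist)
    return [[0.0 if i == j else max(core[i], core[j], dist[i][j]) for j in range(n)]
            for i in range(n)]

# three nearby particles and one isolated particle, as (eta, phi)
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0)]
dist = pairwise_distances(points)
core = core_distances(dist, k=2)
mreach = mutual_reachability(dist, k=2)
```

With $k = 2$ the isolated particle has a core distance of 0.9, so its mutual reachability distance to its closest neighbour is raised from the raw distance 0.8 to 0.9, which is the sense in which sparse particles are pushed away from the clusters.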
For hierarchical clustering, one constructs a graph with nodes representing particles and edges connecting the nodes, weighted by the mutual reachability distance. The graph called the minimum spanning tree is built by considering one edge at a time, always adding the lowest-weight edge that connects the current tree to a vertex not yet in the tree. In the left plot of Fig. 5, we show the constructed minimum spanning tree of the example top jet event in Fig. 4. This figure clarifies that the unclustered particles, the four blue points in Fig. 4, are assigned larger mutual reachability distances, shown as red lines.
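The edge-by-edge construction described above is Prim's algorithm. A minimal pure-Python sketch on a dense weight matrix follows; the four-node matrix is a toy example with illustrative values, and the function name is hypothetical.

```python
def minimum_spanning_tree(weights):
    """Prim's algorithm on a symmetric weight matrix; returns edges (u, v, weight)."""
    n = len(weights)
    in_tree = [False] * n
    in_tree[0] = True
    # best[v] = (weight, u): cheapest known edge from the current tree to vertex v
    best = [(weights[0][v], 0) for v in range(n)]
    edges = []
    for _ in range(n - 1):
        # pick the vertex outside the tree that is cheapest to attach
        v = min((c for c in range(n) if not in_tree[c]), key=lambda c: best[c][0])
        w, u = best[v]
        edges.append((u, v, w))
        in_tree[v] = True
        # the new vertex may offer cheaper connections to the remaining vertices
        for x in range(n):
            if not in_tree[x] and weights[v][x] < best[x][0]:
                best[x] = (weights[v][x], v)
    return edges

# toy mutual reachability matrix: three close particles and one isolated particle
weights = [[0.0, 0.2, 0.2, 1.0],
           [0.2, 0.0, 0.2, 0.9],
           [0.2, 0.2, 0.0, 0.9],
           [1.0, 0.9, 0.9, 0.0]]
edges = minimum_spanning_tree(weights)
total_weight = sum(w for _, _, w in edges)
```

In this example the isolated particle is attached last, through the single long edge of weight 0.9, which mirrors the long edges assigned to the unclustered particles.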
Having constructed the minimum spanning tree, the next step is to convert it into a hierarchy of connected components. This can be done by sorting the edges of the tree by distance, so that particles with smaller distances are merged first, and then iterating through them, creating a new merged cluster for each edge and ending up with a single linkage tree. In the right plot of Fig. 5, we show a dendrogram of the hierarchical structure reconstructed from the minimum spanning tree. The length of the branches in the dendrogram represents the distance between clusters and/or clustered particles in the event. Once we obtain the dendrogram, we go back through the clustering sequence toward the small-distance pairs to identify stable clusters. As depicted in the single linkage tree, it is common to observe a cluster of one or two points separating from the main cluster. Instead of interpreting this event as the split of a cluster, we perceive it as a single enduring cluster experiencing a reduction of points. For this purpose, we require a minimum number of points $m$ for a single cluster. It is another hyperparameter of the algorithm, and we fix it to $m = 5$. If each child cluster contains fewer than $m$ particles at a cluster split, this split is considered spurious, and the particles are removed from the parent cluster. Conversely, if the split is into two clusters, each at least as large as the minimum cluster size, we consider it a true cluster split and let that split persist in the tree. In this manner, the tree shrinks to smaller clusters with larger stability.
To measure the stability of the condensed tree and choose clusters that persist for long during the splitting, a density parameter is introduced as $\lambda = 1/d$, where $d$ is the distance represented on the vertical axis in Fig. 5 (right). The stability of each cluster can be defined as the sum of the $\lambda$ for the particles in the cluster. The final clusters of HDBSCAN are the subtrees with significant stability. To select the final clusters, we start by declaring all leaf nodes as selected clusters. Then we traverse the tree upward toward the root. If the combined stabilities of the child clusters exceed that of the parent cluster, we update the parent's stability to equal the sum of its children's stabilities. Conversely, if the parent's stability surpasses the sum of its children's, we designate the parent cluster as a selected cluster and deselect all its descendants. Upon reaching the root node, we take the selected clusters as our final subjets.
In the selection process, neither the distance nor the (sum of the) momenta is involved. In the case of the CA or anti-kt algorithms, the momentum is always updated to the sum of the momenta of the parents, where the most energetic constituent dominates the direction. One may imagine that the clustering pattern can differ completely from that of CA or anti-kt. However, it is natural that high-energy particles create collinear particles. Therefore, the direction of the high-energy cluster and the high-density location are strongly correlated.
To use HDBSCAN for jet analysis, it is necessary to calibrate the relation between the parton momenta and the momenta of the clusters. For example, the relation between the CA subjet and the HDBSCAN cluster should depend on the colour of the parent parton and its colour connection. The effect of the colour connection might be interesting to study because HDBSCAN is more sensitive to the distribution of soft particle constituents. To validate the HDBSCAN clustering result, we compare the number of clustered subjets and the shape of the subjet distributions with those obtained using the anti-kt and CA algorithms. Fig. 6 illustrates the number of subjets clustered by HDBSCAN alongside CA and anti-kt for top jets (left plot) and QCD jets (right plot). HDBSCAN yields a larger number of clustered subjets, attributed to the dynamic nature of the jet radius. Furthermore, Figs. 7 and 8 depict the kinematic distributions of the leading subjets for top and QCD jet events across all three algorithms. The energy distribution of the leading subjet is softer for HDBSCAN because HDBSCAN does not depend on the radius parameter and can identify structure inside the $R < 0.3$ radius of the CA or anti-kt subjets.
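The selection rule can be sketched on a toy condensed tree. The sketch follows the standard bottom-up comparison of a parent's stability with the summed stability of its children; the tree, the stability values and the function names are made up for illustration, and the root is treated like any other node, which is an assumption.

```python
def descendants(children, node):
    # all nodes below `node` in the tree
    found = []
    for child in children.get(node, []):
        found.append(child)
        found.extend(descendants(children, child))
    return found

def select_clusters(children, stability, root):
    """Return the set of selected clusters of a condensed tree."""
    selected = set()

    def visit(node):
        # returns the best total stability achievable within the subtree of `node`
        kids = children.get(node, [])
        if not kids:
            selected.add(node)            # leaves start out selected
            return stability[node]
        child_total = sum(visit(kid) for kid in kids)
        if stability[node] > child_total:
            # the parent wins: select it and deselect everything below it
            for d in descendants(children, node):
                selected.discard(d)
            selected.add(node)
            return stability[node]
        return child_total                # the children win: propagate their sum upward

    visit(root)
    return selected

# toy condensed tree (illustrative values only)
children = {"root": ["A", "B"], "A": ["A1", "A2"]}
stable_parent = {"root": 1.0, "A": 5.0, "B": 3.0, "A1": 1.5, "A2": 2.0}
stable_children = {"root": 1.0, "A": 2.0, "B": 3.0, "A1": 1.5, "A2": 2.0}

picked_parent = select_clusters(children, stable_parent, "root")
picked_children = select_clusters(children, stable_children, "root")
```

In the first case cluster A outlives its split and is kept whole, while in the second its two children are more stable together and are kept as separate subjets.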

V. NETWORK PERFORMANCE
In this section, we compare the performance of the Mixer network against three baseline models: Particle Flow network (PFN) [36], ParticleNet [37], and Particle Transformer network (ParT) [38] for the top tagging task.
PFN, rooted in Deep Sets, employs two symmetric neural networks to parameterize permutation-invariant symmetric functions. The first neural network comprises three fully connected (FC) layers with 100, 100, and 256 neurons, respectively. The second neural network consists of three FC layers, each with 100 neurons. The parameters of the FC layers are randomly initialized and activated by ReLU. The output layer, featuring two neurons, utilizes a softmax activation function.
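To make the Deep Sets structure concrete, the following NumPy sketch builds a forward pass with the layer widths quoted above and checks that the output does not change when the particles are shuffled. The weight initialisation, the random input, and the helper names (`dense_stack`, `forward`, `pfn`) are assumptions for illustration; masking of zero-padded particles and training are omitted.

```python
import numpy as np


def dense_stack(rng, sizes):
    """Randomly initialised (weight, bias) pairs for a stack of FC layers."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(layers, x):
    for w, b in layers:
        x = np.maximum(x @ w + b, 0.0)      # FC layer followed by ReLU
    return x


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
phi = dense_stack(rng, [7, 100, 100, 256])       # per-particle network
big_f = dense_stack(rng, [256, 100, 100, 100])   # jet-level network
w_out, b_out = rng.normal(0.0, 0.1, (100, 2)), np.zeros(2)


def pfn(jet):
    """jet has shape (particles, 7); returns two class probabilities."""
    latent = forward(phi, jet).sum(axis=0)       # order-independent sum
    return softmax(forward(big_f, latent) @ w_out + b_out)


jet = rng.normal(size=(100, 7))                  # stand-in for a real jet
shuffled = jet[rng.permutation(100)]
same = np.allclose(pfn(jet), pfn(shuffled))      # permutation invariance
```

The sum over the particle axis is the only place where particles interact, which is why the result is independent of their ordering.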
The ParticleNet architecture integrates three Edge Convolution (EdgeConv) blocks. The initial EdgeConv block computes pairwise distances using the spatial coordinates of particles in the (η − ϕ) plane. Subsequent EdgeConv blocks employ the learned feature vectors as coordinates. The network updates the information of each particle from its 16 nearest neighbours. The dimensions of the EdgeConv blocks are (64, 64, 64), (128, 128, 128), and (256, 256, 256), followed by channel-wise global average pooling to ensure the network's permutation invariance. An FC layer with 256 neurons and ReLU activation processes the aggregated information from the EdgeConv blocks. Finally, a two-unit layer with softmax activation serves as the output layer.
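The neighbour search that feeds the first EdgeConv block can be illustrated as follows. This is a sketch under stated assumptions: the first two feature columns are taken to be the (η, ϕ) coordinates, the periodicity of ϕ is ignored, and the function names are hypothetical. It only builds the edge inputs (x_i, x_j − x_i); the learned MLP and the aggregation over neighbours are left out.

```python
import numpy as np


def knn_indices(coords, k):
    """Indices of the k nearest neighbours of every point, self excluded."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)          # a point is not its own neighbour
    return np.argsort(dist2, axis=1)[:, :k]


def edge_features(features, idx):
    """Edge inputs (x_i, x_j - x_i) for each neighbour j of particle i."""
    centre = np.repeat(features[:, None, :], idx.shape[1], axis=1)
    neighbours = features[idx]               # shape (particles, k, features)
    return np.concatenate([centre, neighbours - centre], axis=-1)


rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 7))            # stand-in for 100 constituents
idx = knn_indices(feats[:, :2], k=16)        # assumed (eta, phi) columns
edges = edge_features(feats, idx)            # shape (100, 16, 14)
```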
ParT, based on attention-based transformer networks, comprises eight particle attention blocks and two class attention blocks with eight attention heads and a query dimension of 16. Input feature interactions are encoded using four point-wise 1D convolutions with dimensions (64, 64, 64, 8). The attention blocks incorporate a 10% dropout. A final output layer with two neurons and softmax activation completes the network.
For the Mixer network, we use the jet constituent dataset with dimensions (100, 7) and the subjet dataset with dimensions (15, 7). The mixer layer takes the jet constituent input and consists of one embedding layer with 128 neurons. The first MLP consists of two FC layers with 128 and 64 neurons and one Gaussian Error Linear Unit (GELU) non-linear activation layer [62]. The second MLP consists of two FC layers with 64 and 128 neurons and a GELU activation layer. The multi-head cross-attention acts on the output of the second MLP, with dimensions (100, 7), and on the subjet dataset. The cross-attention uses 15 parallel heads and a hidden dimension of size 64. A dense layer with 64 neurons and ReLU activation is used before the output layer.
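The idea of mixing particle tokens and feature tokens with two MLPs can be sketched as below. This follows the generic MLP-mixer pattern rather than the exact configuration used here: the residual connections, the hidden width of 64 in both MLPs, the weight initialisation, and all function names are assumptions for illustration. The point of the example is that the layer returns an array with the same shape as its input.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))


def mlp(x, w1, b1, w2, b2):
    """Two FC layers with a GELU in between, applied along the last axis."""
    return gelu(x @ w1 + b1) @ w2 + b2


def mixer_layer(x, particle_params, feature_params):
    """x has shape (particles, features); the output has the same shape."""
    # Particle mixing: transpose so the MLP acts along the particle axis.
    x = x + mlp(x.T, *particle_params).T
    # Feature mixing: the same MLP weights are shared by all particles.
    return x + mlp(x, *feature_params)


def init(rng, n_in, n_hidden):
    return (rng.normal(0.0, 0.05, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0.0, 0.05, (n_hidden, n_in)), np.zeros(n_in))


rng = np.random.default_rng(1)
constituents = rng.normal(size=(100, 7))         # stand-in input
w_embed = rng.normal(0.0, 0.05, (7, 128))        # embedding to 128 channels
embedded = constituents @ w_embed                # shape (100, 128)
mixed = mixer_layer(embedded, init(rng, 100, 64), init(rng, 128, 64))
```

Because input and output shapes agree, several such layers can be stacked without any reshaping in between.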
These networks are trained on the seven features extracted from the jet constituents of the top and QCD datasets, as detailed in Section III. Our network is trained on the same features but with an additional dataset encoding subjet information. This dataset is generated by applying jet clustering algorithms to cluster the jet constituents in the Top dataset, resulting in a secondary dataset comprising the seven features of all subjets. Figures 7 and 8 illustrate the considered features of the leading subjets for top and QCD jet events, respectively, when clustered using the Anti-kt, CA, and HDBSCAN algorithms.
To further test the impact of the mixer layer, we compare the network performance with a plain transformer model that analyzes the subjets dataset only. The architecture of the plain transformer is detailed in Appendix A.
Table I shows the network performances, the number of tunable parameters in each network, and the training time per epoch. Mixer network performances are reported according to the clustering algorithm of the subjet dataset. For the subjet definition, HDBSCAN achieves the best performance over CA and Anti-kt, as it can likely capture more geometrical information that improves network performance. To achieve this performance, one needs both the mixer network for the jet constituents and the subjet inputs. This can be seen from the performance of the transformer using subjet information alone, which shows the lowest classification performance. This underscores the necessity of incorporating MLPs in the mixer layer. The mixer networks not only achieve performance comparable to the state-of-the-art ParticleNet, PELICAN and ParT, but are also approximately 20 times faster in training. PFN has the shortest training time but lacks learning of the local information shared between particles and their neighbours, leading to relatively poor performance.

FIG. 8. Properties of the leading subjets of the QCD jet events clustered with Anti-kt (blue), CA (orange) and HDBSCAN (green). For Anti-kt and CA, we use a radius parameter R = 0.3.

TABLE I. Performance of the Mixer network for top quark tagging compared with other models. Results for EDI-net [44], Point Cloud Transformer (PCT) [45], Lorentz Net [28], PELICAN [46], PFN [36], ParticleNet [37], and ParT [38] are quoted from their published results. Pretrained ParticleNet and ParT have higher performance, with AUC = 0.9866 and AUC = 0.9877, respectively. The pretraining is done on the JETCLASS dataset, followed by fine-tuning on the top dataset. The Transformer (subjet) model is trained from scratch using the CA subjets dataset only.

VI. INTERPRETABLE ML TECHNIQUES
The interpretability of ML models can be challenging due to their intricate hidden layers. Understanding the model's architecture and learned representations is crucial for accurate predictions.
Various interpretable ML methods have been developed to provide insights into how models make predictions. This helps to validate model decisions. In this section, we employ two methods that offer a straightforward interpretation of the network outcomes, namely, Centred Kernel Alignment (CKA) and attention map visualization. CKA is a metric used to compare the similarity between two sets of learned representations in a high-dimensional feature space. It was first introduced in [63] and used in collider analysis in [64].
It measures the similarity of the representations learned by the network layers or by hidden layers of different models, considering local similarities and global structure. On the other hand, attention maps are visual representations generated by attention mechanisms in neural networks, highlighting the input data most relevant for making predictions. They provide insights into where the model focuses during processing, aiding the interpretation of the decision-making process.
In the following, we apply those interpretable methods to the Mixer network trained on a jet constituents dataset with dimensions (100, 7) and a subjets dataset with dimensions (15, 7) clustered using the CA algorithm with R = 0.3. Importantly, these interpretable methods are agnostic to the specific network configuration and can be applied to other results presented in this paper.
A. CKA similarity
CKA similarity, rooted in the principles of kernel methods and alignment-based metrics, offers a comprehensive framework for assessing the similarity between two sets of representations learned by different models or layers within a model. It measures the alignment between representations in a high-dimensional feature space rather than simply comparing their values. Unlike linear similarity measures such as Pearson correlation or Euclidean distance, CKA captures complex relationships between representations learned by different models or layers, making it suitable for comparing high-dimensional and non-linearly transformed data. The primary obstacle in analyzing the representations of hidden layers in neural networks is the dispersion of features across neurons, with sizes often larger than the input dimension and varying between layers or models.
CKA facilitates quantitative comparisons of representations both within individual networks and across different models. This can be done by considering the activation matrices $X$ and $Y$ of two hidden layers evaluated on the same input dataset; when the data size is $d$, and $P_1$ and $P_2$ are the numbers of neurons of the two hidden layers, $X \in \mathbb{R}^{d \times P_1}$ and $Y \in \mathbb{R}^{d \times P_2}$. The CKA similarity is defined as

$$\mathrm{CKA}(M, N) = \frac{\mathrm{HSIC}(M, N)}{\sqrt{\mathrm{HSIC}(M, M)\,\mathrm{HSIC}(N, N)}},$$

where $M = XX^{T}$ and $N = YY^{T}$ are the Gram matrices of the two hidden layers, each of dimension $d \times d$. The size of the Gram matrices depends only on the number of inputs; therefore, $\mathrm{CKA}(M, N)$ can be used to compare any layers with different numbers of neurons or networks of different models. The Hilbert-Schmidt Independence Criterion (HSIC) [65] between two matrices is defined as

$$\mathrm{HSIC}(M, N) = \frac{1}{(d-1)^{2}}\,\mathrm{Tr}\left(M H N H\right),$$

where the $d \times d$ centering matrix $H$ is defined as

$$H = I_d - \frac{1}{d}\,\mathbf{1}\mathbf{1}^{T},$$

with $I_d$ the identity matrix and $\mathbf{1}$ the $d$-dimensional vector of ones. Centering the matrices ensures that the CKA similarity is not overly influenced by outliers or extreme values in the data, leading to more robust comparisons between representations.
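A compact NumPy sketch of this quantity is given below. It assumes the commonly used linear-kernel form, with Gram matrices M = XX^T and N = YY^T, HSIC(M, N) = Tr(MHNH)/(d−1)^2 and centering matrix H = I − (1/d) 11^T; the function names and the random activations are placeholders, not outputs of the trained network.

```python
import numpy as np


def hsic(m, n):
    """HSIC estimate between two (d, d) Gram matrices."""
    d = m.shape[0]
    h = np.eye(d) - np.ones((d, d)) / d          # centering matrix H
    return np.trace(m @ h @ n @ h) / (d - 1) ** 2


def linear_cka(x, y):
    """CKA between activations x (d, P1) and y (d, P2) on the same d inputs."""
    m, n = x @ x.T, y @ y.T                      # Gram matrices, both (d, d)
    return hsic(m, n) / np.sqrt(hsic(m, m) * hsic(n, n))


rng = np.random.default_rng(0)
acts_a = rng.normal(size=(50, 20))               # layer with 20 neurons
acts_b = rng.normal(size=(50, 30))               # layer with 30 neurons
q, _ = np.linalg.qr(rng.normal(size=(20, 20)))   # random orthogonal matrix

cka_self = linear_cka(acts_a, acts_a)            # identical representations
cka_rot = linear_cka(acts_a, acts_a @ q)         # rotated copy of acts_a
cka_ab = linear_cka(acts_a, acts_b)              # unrelated layers
```

Since only the d × d Gram matrices enter, the two layers may have different widths, and a rotation of the neuron basis leaves the value unchanged.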
The value of the CKA ranges between 0 and 1. A higher CKA value between two layers suggests that they have captured redundant information from the input features. If two subsequent layers are similar in CKA, the second layer leads to negligible improvement in classification accuracy. In such instances, trimming these layers can reduce model complexity without compromising classification performance. Conversely, layers with lower CKA values have captured distinct information from the data and enhanced the classification performance.
The CKA results are depicted in Fig. 9, showing the top jet events in the upper plot and QCD jet events in the lower plot. The analysis is based on a sample of 5000 test events, with the subjets dataset clustered using the CA algorithm with R = 0.3. CKA values are computed for distinctive model layers, including the embedding layer, the two FC layers of the first and second MLP mixer, the multi-head cross-attention layer, and the final FC layer.
In general, layers with low correlations capture information independent of each other, underscoring their significance in the network's decision-making process; see, for example, figure 3 in [63].
The multi-head cross-attention layer shows low similarity with the two MLPs, with CKA values of 30% for the top jets and 57% for the QCD jets. The top jet CKA values are lower than the QCD ones, which suggests that the network layers are adept at capturing distinct information and are capable of learning the substructure of the top jet. The MLP mixer layers must have focused on the other features of the model. The first and second MLP mixers also exhibit low similarity. Specifically, for top events, the two MLPs have CKA values of around 58%, compared with 76% for the QCD events, suggesting that the network has learned a specific internal structure unique to top events.

B. Attention maps
Attention maps visualize the attention scores assigned to each particle token in the input sequence, providing a representation of where the model focuses its attention during the decision-making process [66].
They also reveal the relation between particle tokens. For instance, they highlight the information extracted from the jet constituents that is relevant to the clustered subjets. Fig. 10 presents the cross-attention maps for a sample of 50,000 test events, showing top jet events in the upper plot and QCD jet events in the lower plot. As mentioned, the model is trained on two input datasets: jet constituents with dimensions (100, 7) and a subjets dataset with dimensions (15, 7) clustered by the CA algorithm with R = 0.3. The network comprises 15 cross-attention heads that operate in parallel, with Fig. 10 displaying the average output of these 15 heads. The cross-attention maps for all attention heads individually are illustrated in Appendix B.
The visualizations reveal a distinct focus of the network; it concentrates on three subjets to identify top events, while it directs its attention to only one subjet for identifying QCD events.This observation underscores the effectiveness of the multi-head cross-attention mechanism, particularly in conjunction with the subjets dataset, for capturing the substructure inherent in top jets.In the appendix, we also see the 4th-8th subjets contribute to individual attention heads, consistent with the previous ML study based on the substructure variables.
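How such maps can be obtained and averaged is sketched below in NumPy. The sketch assumes that the queries are built from the jet constituents and the keys from the subjets, so that each map has the layout (constituents, subjets); the random projection weights, the per-head dimension of 64, the small number of toy events, and the function names are illustrative assumptions, not the trained model.

```python
import numpy as np


def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_maps(constituents, subjets, wq, wk):
    """Attention scores with shape (heads, n_constituents, n_subjets)."""
    q = np.einsum('pc,hcd->hpd', constituents, wq)   # queries per head
    k = np.einsum('sc,hcd->hsd', subjets, wk)        # keys per head
    scores = np.einsum('hpd,hsd->hps', q, k) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1)                  # normalise over subjets


rng = np.random.default_rng(2)
n_events, n_heads, d_head = 8, 15, 64
wq = rng.normal(0.0, 0.1, (n_heads, 7, d_head))
wk = rng.normal(0.0, 0.1, (n_heads, 7, d_head))

maps = np.stack([
    cross_attention_maps(rng.normal(size=(100, 7)),  # toy constituents
                         rng.normal(size=(15, 7)),   # toy subjets
                         wq, wk)
    for _ in range(n_events)
])                                   # (events, heads, constituents, subjets)

per_head = maps.mean(axis=0)         # one averaged map per attention head
head_avg = per_head.mean(axis=0)     # single map averaged over all heads
```

Averaging over heads gives a single summary map, at the cost of hiding patterns that only individual heads pick up; the per-head averages retain them.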

VII. CONCLUSION
In this paper, we present a simple permutation-invariant network, a "Mixer network" assisted by subjet information. The "Mixer network" has a simple structure with fewer tunable parameters and is approximately 20 times faster than state-of-the-art networks, such as ParticleNet and ParT, with comparable classification performance.
The network comprises mixer layers designed to maintain the dimensions of both input and output datasets. This allows for easy scalability of model complexity by stacking additional mixer layers, thereby enhancing expressivity when analyzing intricate data. Each mixer layer consists of two MLPs, with weight sharing across the entire dataset, facilitating the mixing of particle and feature tokens within the particle cloud. As a result, these MLPs can capture global and local features, including relationships between individual particles and their nearest neighbours. Subjet information is utilized to further improve the network's ability to learn the local structure of the jet. Cross-attention mechanisms are employed to analyze both the subjets and the jet constituents datasets, enabling the extraction of relevant local information and updating model weights to ensure that each jet constituent incorporates updated information about its nearby particles.
The secondary dataset is generated using three jet clustering algorithms: Anti-kt, CA, and HDBSCAN. HDBSCAN is a density-based clustering algorithm which does not depend on any radius parameters. This paper introduces HDBSCAN in collider analysis for the first time and presents a detailed comparison with recursive clustering algorithms such as Anti-kt and CA. Employing the Mixer network with a second dataset clustered by HDBSCAN demonstrates relatively higher classification performance than Anti-kt or CA.
To elucidate the network results, CKA similarity and attention map visualization are utilized. CKA analysis confirms that the cross-attention and mixer layers capture different information. Moreover, the first MLP layers, responsible for mixing particle tokens, and the second MLP layers, responsible for mixing feature tokens, learn distinct information, thereby ensuring their effectiveness in capturing the substructure of the jet. Furthermore, visualization of the cross-attention maps reveals that the network assigns higher weights to the leading subjets and their associated jet constituents, underscoring the efficacy of cross-attention mechanisms in incorporating neighbouring information into each jet constituent.
The merit of using cross-attention arises from the fact that the jet substructure is conditioned by a few partons originating from the hard process; therefore, the estimated probability of the event distribution can be expressed as the product of the hard parton distribution (∼ jet or subjet distribution) and the hadron distribution in the jet conditioned on the parton distribution. Cross-attention guarantees that subjet and jet constituent information appear as multiplicative matrices.
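Schematically, the factorisation described in this paragraph can be written as follows. The symbols are notation introduced here for illustration only: $x$ stands for the set of jet constituents and $s$ for the set of subjets, which act as a proxy for the hard partons.

```latex
p(x, s) \;\simeq\; p(s)\, p(x \mid s)
```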
In [41], we used cross-attention heads to connect the outputs of the transformer layers for global event kinematics and those for jet substructure. It was shown that the network effectively learned the correlation between jet substructure and event kinematics, significantly improving the event classification task of pp → H → hh. One of the drawbacks of this approach is the computational complexity of using transformer layers for the jet encoding step. In this paper, we drastically reduce the network complexity by replacing the transformer layer with the subjet-assisted MLP mixer. The output of the mixer layer has the same dimensions as the input dataset. Therefore, we can replace the transformer layer of the subjet analysis in [41] with the mixer layer, keeping the same performance. In short, the network proposed in this paper is a step toward global event analysis encoded in the particle cloud that is much more efficient than the previous state-of-the-art approaches.
We employ a straightforward attention-based transformer model with multi-head self-attention to analyze the subjet dataset clustered using the CA algorithm. The model takes as input a subjet dataset with dimensions (15, 7), where 15 represents the number of subjet tokens and 7 denotes the corresponding features.
The model comprises three stacked transformer layers and an MLP with three FC layers. Each transformer layer features 8 self-attention heads operating in parallel, with a hidden (query) dimension of 256. The output from these attention heads is then combined with the original input data via a skip connection layer. The resulting output from the skip connection is flattened and passed through two fully connected layers with 128 and 7 neurons, respectively, utilizing the GELU activation function. Subsequently, the output from the final fully connected layer is combined with the self-attention output via a second skip connection layer.
Following this, the final output of the transformer layer is normalized and retains the same dimensions as the input dataset. The output of the transformer layers is then fed through an MLP comprising three fully connected layers with dimensions of 256, 128, and 64, employing the GELU activation function. After each fully connected layer, a dropout layer with a dropout rate of 20% is applied to prevent overfitting. Finally, the output is passed to the output layer with two neurons and softmax activation for classification. The model is trained for 30 epochs with a batch size of 1024.

B. ATTENTION HEADS VISUALIZATION
In this appendix, we present the cross-attention maps of the 15 utilized attention heads individually. The output of these cross-attention heads possesses dimensions of (N, 15, 80, 15), where N denotes the number of test events, 15 signifies the number of parallel attention heads, and the last two dimensions indicate the numbers of jet constituents and subjet tokens, respectively.
In Fig. 11, we depict the accumulated averages of 50,000 test events for each attention head separately. For clarity, we keep the first 30 jet constituents along the X axis, with the Y axis representing the subjet tokens. Notably, Fig. 11 reveals that the Mixer network assigns considerable attention to a larger number of subleading subjets. This detail was not apparent in Fig. 10, where we averaged across all heads.

FIG. 1. Schematic figure of the mixer layer in the Mixer network. A hard parton from the hard process goes through the parton shower, creating subjets. The information of the jet constituents is processed by two MLPs, then analyzed together with the subjet information via the cross-attention layer.

FIG. 2. Left: full network structure. Right: structure of the mixer layer block. Each MLP within the mixer layer shares its weights and comprises two FC layers and a Gaussian Error Linear Unit (GELU) layer.
FIG. 3. Receiver Operating Characteristic (ROC) curves for the mixer network trained on a secondary subjets dataset clustered with CA with different R values.

FIG. 4. Example event for top jet clustering. Left top: distribution of all top jet constituents in the η − ϕ plane. Right top: the results of the clustering of the subjets using HDBSCAN. Left bottom: the results of the clustering of the subjets using Anti-kt with R = 0.3. Right bottom: the results of the clustering of the subjets using CA with R = 0.3. Points with the same colour belong to the same cluster.

FIG. 5. Right: the minimum spanning tree for the top example event. The colour bar represents the mutual reachability distance between the reachable particles. Left: dendrogram for the Single Linkage Tree.

FIG. 7. Properties of the leading subjets of the top jet events clustered with Anti-kt (blue), CA (orange) and HDBSCAN (green). Anti-kt and CA are considered with radius parameter R = 0.3.

FIG. 9. The CKA similarity of top jet events (top plot) and QCD jet events (bottom plot). Axes represent the network layers. FC(MLP1) and FC(MLP2) are the fully connected layers in the first and second MLP of the mixer layers, respectively. The last FC represents the last FC layer in the network, and Attention is the multi-head cross-attention.

FIG. 10. Cross-attention maps for 50,000 test events of top (top plot) and QCD (bottom plot) averaged over 15 attention heads. The X-axis shows the attention score for the first 30 transformed jet constituents, while the Y-axis shows the attention score for the transformed subjets.

FIG. 11. Cross-attention maps of the 15 attention heads for 50,000 test events. The upper three rows are for top jet events, and the bottom rows are for QCD events.
Training time is per epoch with a batch size of 1024.The GPU training time is measured on an NVIDIA RTX A6000 card.