1 Introduction

In high-energy physics experiments, tagging jets, which are collimated sprays of particles produced in high-energy collisions, is a crucial task for discovering new physics beyond the Standard Model. Jet tagging involves distinguishing jets initiated by boosted heavy particles from QCD-initiated quark and gluon jets. Since jets initiated by different particles exhibit different characteristics, two key issues arise: how to represent a jet and how to analyze its representation. Conventionally, jet tagging has been performed using hand-crafted jet substructure variables motivated by physics considerations. Nevertheless, these methods often fall short in capturing the intricate patterns present in the raw data.

Over the past decade, deep learning approaches have been extensively adopted to enhance jet tagging performance [19]. Various jet representations have been proposed, including image-based representations using Convolutional Neural Networks (CNNs) [2, 8, 11, 17, 20, 21, 25, 32], sequence-based representations with Recurrent Neural Networks [1, 10], tree-based representations with Recursive Neural Networks [7, 23] and graph-based representations with Graph Neural Networks (GNNs) [3, 4, 14, 16, 24, 33]. More recently, one representation that has gained significant attention views the set of constituent particles inside a jet as points in a point cloud. Point clouds represent a set of objects in an unordered manner within a defined space. Under this approach, each jet can be interpreted as a particle cloud, i.e., a permutation-invariant set of particles, from which meaningful information can be extracted with deep learning methods. Based on the particle cloud representation, various deep learning architectures have been introduced, such as the Deep Set Framework [18], ABCNet [26], LorentzNet [14] and ParticleNet [30]. The Deep Set Framework provides a comprehensive prescription for parametrizing permutation-invariant functions of variable-length inputs while taking both infrared and collinear safety into consideration. ParticleNet adapts the Dynamic Graph CNN architecture [37], while ABCNet takes advantage of attention mechanisms to enhance local feature extraction. LorentzNet focuses on incorporating inductive biases derived from physics principles into the architecture design, utilizing an efficient Minkowski dot-product attention mechanism. All of these architectures achieve substantial performance improvements on top tagging and quark/gluon discrimination benchmarks.

Over the past few years, attention mechanisms have emerged as a powerful tool for capturing intricate patterns in sequential and spatial data. The transformer architecture [35], which leverages attention mechanisms, has been highly successful in natural language processing and in computer vision tasks such as image recognition. However, point cloud representations inherently lack a specific order, so modifications to the original transformer structure are required to establish a self-attention operation that is invariant under input permutations. To address this issue, the point cloud transformer (PCT) [15, 27] was recently proposed: input points are passed through a feature extractor to create a high-dimensional representation of particle features, and the transformed data is then passed through a self-attention module that introduces attention coefficients for each pair of particles. Another notable approach is the particle transformer [31], which incorporates pairwise particle interactions within the attention mechanism; it obtains higher tagging performance than a plain transformer and surpasses the previous state of the art, ParticleNet, by a large margin.

In recent studies, the dual attention vision transformer (DaViT) [12] has exhibited promising results for image classification. DaViT introduces a dual attention mechanism, comprising spatial window attention and channel group attention, enabling the effective capture of both global and local features in images. In this paper, we utilize the dual attention mechanism for jet tagging based on the point cloud representation. We extend the particle self-attention established in existing works by introducing channel self-attention. In particle self-attention, the particle number defines the scope and the dimension of the particle features defines the feature dimension, whereas in channel self-attention the channel dimension defines the scope and the particle number defines the feature dimension. Each channel thus contains an abstract representation of the entire jet. By performing self-attention on these channels, we capture global interactions, since all particles are considered when computing attention scores between each pair of channels. Compared to the existing particle self-attention, channel self-attention is naturally imposed from a global jet perspective rather than a particle one. To realize the dual attention mechanism, we introduce the channel attention module. By alternately applying the particle attention module and the channel attention module to combine the local information of the particle representation with the global information of the jet representation, we build a new network structure called the particle dual attention transformer (P-DAT). Furthermore, inspired by Ref. [31], we design a pairwise jet feature interaction. We incorporate both the pairwise particle interactions and the pairwise jet feature interactions to increase the expressiveness of the attention mechanism. We evaluate the performance of P-DAT on top tagging and quark/gluon discrimination tasks and compare it against other baseline models. Our analysis demonstrates the effectiveness of P-DAT in jet tagging and highlights its potential for future applications in high-energy physics experiments.

This article is organized as follows. In Sect. 2, we introduce the particle dual attention transformer for jet tagging, providing a detailed description of the model implementation. In Sect. 3, we present the performance of P-DAT and existing algorithms on the top tagging and quark/gluon discrimination tasks, utilizing several evaluation metrics, and provide an extensive discussion of these results. In Sect. 4, we conduct a comprehensive comparison of the computational resources required to evaluate each model, including the number of trainable weights and the number of floating-point operations (FLOPs). Finally, our conclusions are presented in Sect. 5.

2 Model architecture

The focus of this paper is to introduce the particle dual attention transformer (P-DAT), which is designed to capture both local particle-level information and global jet-level information. In this section, we first introduce the overall structure of the model architecture. Then we delve into the details of the channel attention module and its combination with the particle attention module. Finally, we present the model implementation.

2.1 Overall structure

Fig. 1 Illustration of the whole model architecture

The whole model architecture is illustrated in Fig. 1. It contains three key components, namely the feature extractor, the particle attention module and the channel attention module.

First of all, we employ the same feature extractor as in Ref. [27] to transform the inputs from \(P\times 7\) to a higher-dimensional representation \(P\times N\), where P represents the number of particles within the jet and N denotes the dimension of the embedded features for each particle. As shown in Fig. 2 (left), the feature extractor block incorporates an Edge Convolution (EdgeConv) operation [36] followed by three two-dimensional convolutional (Conv2D) layers and an average pooling operation across all neighbors of each particle. The EdgeConv operation adopts a k-nearest-neighbors approach with \(k=20\) to extract local information for each particle based on proximity in the \(\eta -\phi \) space. All convolutional layers use a stride and kernel size of 1 and are followed by a batch normalization operation and a GELU activation function. As in Ref. [27], we employ two feature extractors with N = 128 and N = 64, respectively.
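For concreteness, a minimal PyTorch sketch of such a feature extractor block is shown below. The neighbor-gathering details and intermediate layer widths are our assumptions; only the \(k=20\) nearest neighbors in the \(\eta -\phi \) plane, the three Conv2D layers with kernel size and stride of 1, batch normalization, GELU, and the average pooling over neighbors follow the description above.

```python
import torch
import torch.nn as nn

def knn_indices(eta_phi, k=20):
    """Indices of the k nearest neighbors of each particle in the eta-phi plane."""
    dist = torch.cdist(eta_phi, eta_phi)                       # (B, P, P)
    return dist.topk(k, dim=-1, largest=False).indices         # (B, P, k)

class FeatureExtractor(nn.Module):
    """EdgeConv-style feature extractor: P x 7 inputs -> P x N embeddings (sketch)."""
    def __init__(self, in_dim=7, out_dim=128, k=20):
        super().__init__()
        self.k = k
        channels = [2 * in_dim, out_dim, out_dim, out_dim]      # assumed widths
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=1, stride=1),
                       nn.BatchNorm2d(c_out), nn.GELU()]
        self.convs = nn.Sequential(*layers)

    def forward(self, x, eta_phi):
        # x: (B, P, 7) particle features, eta_phi: (B, P, 2)
        B, P, C = x.shape
        idx = knn_indices(eta_phi, self.k)                      # (B, P, k)
        neighbors = torch.gather(
            x.unsqueeze(1).expand(B, P, P, C), 2,
            idx.unsqueeze(-1).expand(B, P, self.k, C))          # (B, P, k, C)
        center = x.unsqueeze(2).expand_as(neighbors)
        edge = torch.cat([center, neighbors - center], dim=-1)  # (B, P, k, 2C)
        edge = edge.permute(0, 3, 1, 2)                         # (B, 2C, P, k)
        out = self.convs(edge).mean(dim=-1)                     # average over neighbors
        return out.transpose(1, 2)                              # (B, P, out_dim)
```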

Subsequently, we alternately stack two particle attention modules and two channel attention modules to combine the local information of the particle representation with the global information of the jet representation. A dropout rate of 0.1 is applied to all particle attention blocks and channel attention blocks. Furthermore, inspired by Ref. [31], we design a channel interaction matrix based on physics principles. We then incorporate the particle interaction matrix into the particle attention module and the channel interaction matrix into the channel attention module. For the particle interaction matrix, we utilize a three-layer two-dimensional convolution with (32, 16, 8) channels, each with stride and kernel size of 1, to map the particle interaction matrix to a new embedding \(P\times P \times N_h\), where \(N_h\) is the number of heads in the particle self-attention module. For the channel interaction matrix, we utilize an upsampling operation and a three-layer two-dimensional convolution to map the channel interaction matrix to a higher-dimensional representation \(N\times N\), with N the input particle embedding dimension. Therefore, to process a jet of P particles, the P-DAT requires three inputs: the jet dataset, the particle interaction matrix and the jet feature interaction matrix, the latter two derived from the kinematic information of each particle inside the jet.
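A minimal sketch of these two embedding maps is given below, assuming 5 raw pairwise particle features and an 8 × 8 jet feature interaction matrix as inputs. The (32, 16, 8) Conv2D channels for the particle interaction embedding follow the text, whereas the channel counts of the second branch and the bilinear upsampling are our assumptions.

```python
import torch
import torch.nn as nn

class PairEmbeddings(nn.Module):
    """Map raw interaction matrices to the biases U1 (P x P x N_h) and U2 (N x N)."""
    def __init__(self, n_pair_feats=5, n_heads=8, embed_dim=64):
        super().__init__()
        self.embed_dim = embed_dim
        # particle interaction: (B, n_pair_feats, P, P) -> (B, N_h, P, P)
        self.particle_embed = nn.Sequential(
            nn.Conv2d(n_pair_feats, 32, 1), nn.GELU(),
            nn.Conv2d(32, 16, 1), nn.GELU(),
            nn.Conv2d(16, n_heads, 1))
        # jet feature interaction: (B, 1, 8, 8) -> (B, 1, N, N); widths assumed
        self.channel_embed = nn.Sequential(
            nn.Conv2d(1, 8, 1), nn.GELU(),
            nn.Conv2d(8, 4, 1), nn.GELU(),
            nn.Conv2d(4, 1, 1))

    def forward(self, pair_matrix, jet_matrix):
        u1 = self.particle_embed(pair_matrix)                   # (B, N_h, P, P)
        up = nn.functional.interpolate(
            jet_matrix, size=(self.embed_dim, self.embed_dim),
            mode='bilinear', align_corners=False)               # upsample to N x N
        u2 = self.channel_embed(up).squeeze(1)                  # (B, N, N)
        return u1, u2
```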

Next, the outputs of the particle attention blocks and channel attention blocks are concatenated, followed by a one-dimensional convolutional (Conv1D) layer with 256 nodes and an average pooling operation across all particles. This output is then fed directly into a 3-layer MLP with (256, 128, 2) nodes, as shown in Fig. 2 (right). In addition, a batch normalization operation, a dropout rate of 0.5 and a GELU activation function are applied to the second layer. Finally, the last layer employs a softmax operation to produce the final classification scores. It is worth noting that including class attention blocks, as described in Ref. [31], did not improve the performance of P-DAT in our experiments.
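The classification head can be sketched as follows. The feature-axis concatenation of the four block outputs and the absence of an activation after the first MLP layer are our assumptions; the Conv1D width, the average pooling over particles, the (256, 128, 2) MLP with batch normalization, dropout of 0.5 and GELU on the second layer, and the final softmax follow the description above.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Sketch of the P-DAT head: concatenated block outputs -> Conv1D(256) ->
    average pooling over particles -> (256, 128, 2) MLP -> softmax."""
    def __init__(self, embed_dim=64, n_blocks=4, n_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(n_blocks * embed_dim, 256, kernel_size=1)
        self.mlp = nn.Sequential(
            nn.Linear(256, 256),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.Dropout(0.5), nn.GELU(),
            nn.Linear(128, n_classes))

    def forward(self, block_outputs):
        # block_outputs: list of (B, P, embed_dim) tensors from the four blocks
        x = torch.cat(block_outputs, dim=-1).transpose(1, 2)   # (B, n_blocks*C, P)
        x = self.conv(x).mean(dim=-1)                          # average over particles
        return torch.softmax(self.mlp(x), dim=-1)              # class probabilities
```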

Fig. 2 Illustration of the feature extractor block and the MLP block

Fig. 3 Illustration of the particle multi-head attention block

2.2 Particle attention module

The particle self-attention block, already established in existing works, aims to model the relationships among all particles within the jet using an attention mechanism. As presented in Fig. 3, three matrices, called query (Q), key (K), and value (V), are built from linear transformations of the original inputs. Attention weights are computed by matrix multiplication between Q and K, representing the matching between them. As in the particle transformer work [31], we incorporate the particle interaction matrix \(U_1\) as a bias term to enhance the scaled dot-product attention. This incorporation of particle interaction features, designed from physics principles, modifies the dot-product attention weights and thereby enhances the expressiveness of the attention mechanism. The same \(U_1\) is shared across the two particle attention blocks. After normalization, these attention weights reflect the weighted importance between each pair of particles. The self-attention output is then obtained as the weighted elements of V, resulting from multiplying the attention weights by the value matrix. Here P represents the number of particles and N denotes the total number of features. The attention weights are computed as:

$$\begin{aligned} {\mathcal {A}}({\textbf{Q}}, {\textbf{K}}, {\textbf{V}})&= \textrm{Concat}(\text{ head}_1,\ldots ,\text{ head}_{N_h}) \nonumber \\ \text {where}~~\text{ head}_i&= \textrm{Attention}({\textbf{Q}}_i, {\textbf{K}}_i, {\textbf{V}}_i) \nonumber \\&= \textrm{softmax} \left[ \frac{{\textbf{Q}}_i({\textbf{K}}_i)^\textrm{T}}{\sqrt{C_h}}+\mathbf {U_1}\right] {\textbf{V}}_i \end{aligned}$$
(1)

where \({\textbf{Q}}_i={\textbf{X}}_i{\textbf{W}}_i^Q\), \({\textbf{K}}_i={\textbf{X}}_i{\textbf{W}}_i^K\), and \({\textbf{V}}_i={\textbf{X}}_i{\textbf{W}}_i^V\) are \({\mathbb {R}}^{P \times C_h}\)-dimensional particle features for each of the \(N_h\) heads, \({\textbf{X}}_i\) denotes the \(i{\text {th}}\) head of the input feature, \({\textbf{W}}_i\) denotes the projection weights of the \(i{\text {th}}\) head for \({\textbf{Q}}, {\textbf{K}}, {\textbf{V}}\), and \(N = C_h \times N_h\). The particle attention block incorporates a LayerNorm layer both before and after the multi-head attention module. A two-layer MLP, with LayerNorm preceding each linear layer and a GELU nonlinearity in between, follows the multi-head attention module. Residual connections are applied after the multi-head attention module and after the two-layer MLP. In our study, we set \(N_h=8\) and \(N=64\).
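A compact sketch of this biased multi-head attention (Eq. 1), excluding the surrounding LayerNorm layers, MLP and residual connections, could look as follows; the fused QKV projection is a standard implementation choice rather than a detail taken from the text.

```python
import torch
import torch.nn as nn

class ParticleAttention(nn.Module):
    """Multi-head particle self-attention with the interaction bias U1 (Eq. 1)."""
    def __init__(self, embed_dim=64, n_heads=8):
        super().__init__()
        assert embed_dim % n_heads == 0
        self.n_heads, self.c_h = n_heads, embed_dim // n_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, u1):
        # x: (B, P, N) particle embeddings, u1: (B, N_h, P, P) interaction bias
        B, P, N = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (B, N_h, P, C_h)
        shape = (B, P, self.n_heads, self.c_h)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.c_h ** 0.5 + u1, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, P, N)       # merge heads
        return self.proj(out)
```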

Fig. 4 Illustration of the channel attention block

2.3 Channel attention module

The main contribution of this paper is to explore the self-attention mechanism from another perspective and to propose a channel-wise attention mechanism for jet tagging. Unlike the particle self-attention mechanism, which computes attention weights between each pair of particles, we apply attention to the transpose of the particle-level inputs and compute attention weights between each pair of particle features. In this way, the channel-wise attention mechanism naturally captures the global interaction of each pair of particle features by taking all particles into account, which can be viewed as the interaction of each pair of jet features. Additionally, taking inspiration from Ref. [31], we devise a jet feature interaction matrix based on physics principles, which is added to enhance the expressiveness of the channel attention mechanism.

As depicted in Fig. 4, the channel self-attention block applies the attention mechanism to the jet features, enabling interactions among the channels. To capture global information along the particle dimension, we set the number of heads to 1, so that each channel represents a global jet feature and all channels interact with one another. This global channel attention mechanism is defined as follows:

$$\begin{aligned} {\mathcal {A}}({\textbf{Q}}_i, {\textbf{K}}_i, {\textbf{V}}_i)&= \textrm{softmax} \left[ \frac{{\textbf{Q}}_i^\textrm{T}{\textbf{K}}_i}{\sqrt{C}}+\mathbf {U_2}\right] {\textbf{V}}_i^T \end{aligned}$$
(2)

where \({\textbf{Q}}_i, {\textbf{K}}_i, {\textbf{V}}_i \in {\mathbb {R}}^{P \times C}\) are the queries, keys, and values, whose transposed forms act as channel-wise jet-level tokens. Note that although we perform the transpose in the channel attention block, the projection layers \({\textbf{W}}\) and the scaling factor \(\frac{1}{\sqrt{C}}\) are computed along the channel dimension rather than the particle dimension, so the resulting attention map has dimension \(C \times C\). Similar to the particle self-attention block, we incorporate the designed channel interaction matrix \(U_2\) as a bias term to enhance the scaled dot-product attention. The same \(U_2\) matrix is shared across the two channel attention blocks. After normalization, the attention weights indicate the weighted importance of each pair of global features. The self-attention output consists of the weighted elements of V, obtained by multiplying the attention weights by the value matrix. Additionally, the channel attention block includes a LayerNorm layer before and after the attention module, followed by a two-layer MLP. Each linear layer is preceded by a LayerNorm layer, and a GELU nonlinearity is applied between them. Residual connections are added after the channel attention module and after the two-layer MLP.
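A corresponding single-head sketch of the channel attention (Eq. 2), again omitting the LayerNorm layers, MLP and residual connections, is given below; the fused QKV projection and the output projection are implementation assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Single-head channel self-attention with the jet-feature bias U2 (Eq. 2)."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.scale = embed_dim ** -0.5
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # projections along the channel dim
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, u2):
        # x: (B, P, C) particle embeddings, u2: (B, C, C) jet-feature interaction bias
        q, k, v = self.qkv(x).chunk(3, dim=-1)                                  # each (B, P, C)
        attn = torch.softmax(q.transpose(1, 2) @ k * self.scale + u2, dim=-1)   # (B, C, C)
        out = attn @ v.transpose(1, 2)                                          # (B, C, P)
        return self.proj(out.transpose(1, 2))                                   # back to (B, P, C)
```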

2.4 Combination of particle attention module and channel attention module

Throughout the architecture, the particle attention modules and channel attention modules are stacked while maintaining a consistent feature dimension of \(N=64\). The channel attention module captures global information and interactions, while the particle attention module extracts local information and interactions. In the channel self-attention mechanism, a \(C\times C\)-dimensional attention map is computed, involving all particles, through a product of the form \((C\times P) \cdot (P\times C)\). This global attention map enables the channel attention module to dynamically fuse multiple global perspectives of the jet. Subsequently, a transpose operation is performed, yielding outputs with new channel information, which are then passed to the subsequent particle attention module. Conversely, in the particle self-attention mechanism, a \(P\times P\)-dimensional attention map is computed by considering all particle features, through a product of the form \((P\times C) \cdot (C\times P)\). This local attention map empowers the particle attention module to dynamically fuse multiple local views of the jet, generating new particle features and passing the information to the following channel attention module. By alternately applying these two types of modules, the local information and global information complement each other.
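The alternating arrangement can be summarized by the following sketch, where the function and variable names are ours and the blocks are assumed to handle their internal transposes, as in the module sketches above.

```python
def dual_attention_backbone(x, u1, u2, particle_blocks, channel_blocks):
    """x: (B, P, N) embedded particles; u1/u2: particle/channel interaction biases."""
    outputs = []
    for p_block, c_block in zip(particle_blocks, channel_blocks):
        x = p_block(x, u1)      # local view: P x P attention over particles
        outputs.append(x)
        x = c_block(x, u2)      # global view: C x C attention over channels
        outputs.append(x)
    return outputs              # concatenated later by the classification head
```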

2.5 Model implementation

The PYTORCH [29] deep learning framework is used to implement the model architecture on the CUDA platform. The training and evaluation steps are accelerated with an NVIDIA GeForce RTX 3070 GPU. We adopt binary cross-entropy as the loss function. To optimize the model parameters, we employ the AdamW optimizer [22] with an initial learning rate of 0.0005, with gradients computed on mini-batches of 64 training examples. The network is trained for up to 100 epochs, with the learning rate halved every 10 epochs down to a minimum of \(10^{-6}\). In addition, we employ early stopping to prevent over-fitting.
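The corresponding training setup can be sketched as follows, with the model and data loader treated as placeholders; the manual learning-rate floor is one simple way to realize the \(10^{-6}\) minimum described above.

```python
import torch

def train(model, train_loader, epochs=100):
    """Training loop following the setup above; train_loader is assumed to yield
    (features, U1, U2, labels) mini-batches of 64 jets."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
    # halve the learning rate every 10 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    loss_fn = torch.nn.BCELoss()        # binary cross-entropy on the signal probability
    for epoch in range(epochs):
        for x, u1, u2, labels in train_loader:
            optimizer.zero_grad()
            prob = model(x, u1, u2)[:, 1]           # softmax output of the signal class
            loss = loss_fn(prob, labels.float())
            loss.backward()
            optimizer.step()
        scheduler.step()
        for group in optimizer.param_groups:        # keep the learning rate above 1e-6
            group['lr'] = max(group['lr'], 1e-6)
        # early stopping on the validation loss is applied in practice (omitted here)
```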

Furthermore, as mentioned in Ref. [31], introducing pairwise interaction matrices based on physics principles significantly increases the computational time and memory consumption, which limits how much of this physics-motivated prior knowledge can be included. In this paper, to address the memory issue caused by the large input data, we implement a chunk loading strategy, a commonly used technique for data loading in deep learning. This approach continuously imports and deletes data during the training, validation and test phases, enabling us to train our model on a large dataset while mitigating the memory load. We describe this approach in detail in the following:

Within a loop, input data batches are dynamically loaded for training, validation, and testing. Each batch contains 1280 events. The data loading process is identical for training, validation, and testing; the iteration counts may differ, but the data-handling approach remains the same. During each iteration, we employ NumPy's memory-mapped file access to efficiently retrieve the training data, the corresponding labels, the particle interaction matrices, and the jet interaction matrices. Once a batch has been processed, the loaded data is removed to free up memory, and the next batch is loaded for the following iteration. This method significantly reduces memory consumption because the necessary data can be accessed without loading the entire dataset into memory at once. It optimizes memory utilization, mitigates the challenges associated with handling large inputs, and allows us to train the model efficiently while preventing memory exhaustion.
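A minimal sketch of this chunk loading scheme using NumPy memory mapping is shown below; the file names, array shapes and the chunk size interface are illustrative placeholders rather than the actual dataset layout.

```python
import numpy as np

def iterate_chunks(prefix, n_events, chunk_size=1280):
    """Yield successive chunks of events without loading the full dataset into memory."""
    # memory-mapped access: only the requested slices are read from disk
    features = np.load(f"{prefix}_features.npy", mmap_mode="r")   # (n_events, P, 7)
    labels   = np.load(f"{prefix}_labels.npy",   mmap_mode="r")   # (n_events,)
    pair_u1  = np.load(f"{prefix}_pair.npy",     mmap_mode="r")   # (n_events, 5, P, P)
    jet_u2   = np.load(f"{prefix}_jet.npy",      mmap_mode="r")   # (n_events, 8, 8)
    for start in range(0, n_events, chunk_size):
        sl = slice(start, min(start + chunk_size, n_events))
        # np.array(...) copies only this slice into RAM; the copy is freed after use
        yield (np.array(features[sl]), np.array(labels[sl]),
               np.array(pair_u1[sl]), np.array(jet_u2[sl]))

# usage: for x, y, u1, u2 in iterate_chunks("train", n_events=1_600_000): ...
```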

Table 1 The jet feature pairwise interaction matrix used as the inputs for the P-DAT. Here PID represents the particle identification

3 Results of jet classification

The P-DAT architecture processes input data consisting of the particles inside jets. Based on the point cloud representation, we regard each constituent particle as a point in the \(\eta -\phi \) space and the whole jet as a point cloud. To ensure consistency and facilitate meaningful comparisons, we first sort the particles inside each jet by transverse momentum and keep at most 100 particles per jet. A jet is truncated if it contains more than 100 particles and zero-padded up to 100 particles if it contains fewer. The zero-padded constituents are fed into the model directly as zeros, without any additional masking. This selection of 100 particles covers the vast majority of jets in all datasets, ensuring comprehensive coverage. Each jet is characterized by the four-momenta of its constituent particles, from which we reconstruct 7 features for each particle. Additionally, for the quark–gluon dataset, we include the particle identification (PID) information as an eighth feature. These features are as follows:

$$\begin{aligned} \left\{ \log E,\ \log p_{\text {T}},\ \frac{p_{\text {T}}}{p_{\text {TJ}}},\ \frac{E}{E_J},\ \varDelta \eta,\ \varDelta \phi,\ \varDelta R,\ \text {PID} \right\} . \end{aligned}$$
(3)
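As an illustration, the 7 kinematic features of Eq. (3) can be computed from the constituent four-momenta as in the following sketch for a single unpadded jet; the handling of zero-padded entries and of the PID feature is omitted.

```python
import numpy as np

def particle_features(px, py, pz, e):
    """Per-particle inputs of Eq. (3) from constituent four-momenta (1D arrays)."""
    pt = np.hypot(px, py)
    p = np.sqrt(px**2 + py**2 + pz**2)
    eta = 0.5 * np.log((p + pz) / (p - pz))                 # pseudorapidity
    phi = np.arctan2(py, px)
    # jet four-momentum from the sum of constituents
    jet_px, jet_py, jet_pz, jet_e = px.sum(), py.sum(), pz.sum(), e.sum()
    jet_pt = np.hypot(jet_px, jet_py)
    jet_p = np.sqrt(jet_px**2 + jet_py**2 + jet_pz**2)
    jet_eta = 0.5 * np.log((jet_p + jet_pz) / (jet_p - jet_pz))
    jet_phi = np.arctan2(jet_py, jet_px)
    d_eta = eta - jet_eta
    d_phi = (phi - jet_phi + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    d_r = np.hypot(d_eta, d_phi)
    return np.stack([np.log(e), np.log(pt), pt / jet_pt, e / jet_e,
                     d_eta, d_phi, d_r], axis=-1)           # (P, 7)
```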

For the pairwise particle interaction matrix, we adopt the same four features as employed in Refs. [13, 31] and include the difference in transverse momentum as an additional feature. In summary, we calculate the following 5 features for any pair of particles a and b with four-momenta \(p_a\) and \(p_b\), respectively:

$$\begin{aligned} \varDelta&= \sqrt{(y_a - y_b)^2 + (\phi _a - \phi _b)^2}, \nonumber \\ k_{\text {T}}&= \min (p_{\text {T},a}, p_{\text {T},b})\, \varDelta , \nonumber \\ z&= \min (p_{\text {T},a}, p_{\text {T},b}) / (p_{\text {T},a} + p_{\text {T},b}), \nonumber \\ m^2&= (E_a+E_b)^2 - \Vert {\textbf{p}}_{a}+{\textbf{p}}_{b}\Vert ^2, \nonumber \\ \varDelta p_{\text {T}}&= |p_{\text {T},a}-p_{\text {T},b}| \end{aligned}$$
(4)

where \(y_i\) represents the rapidity, \(\phi _i\) the azimuthal angle, \(p_{\text {T},i} = (p_{x, i}^2+p_{y, i}^2)^{1/2}\) the transverse momentum, \({\textbf{p}}_i=(p_{x,i}, p_{y,i}, p_{z,i})\) the momentum three-vector and \(\Vert \cdot \Vert \) the norm, for \(i=a\), b. As in Ref. [31], we take the logarithm and use \((\ln \varDelta , \ln k_{\text {T}}, \ln z, \ln m^2, \ln \varDelta p_{\text {T}})\) as the interaction features for each particle pair to avoid the long-tail problem. Moreover, in addition to these 5 interaction features, we design one more feature for the quark–gluon benchmark dataset, defined as \(\delta _{i,j}\), where i and j are the particle identifications of particles a and b, i.e., it equals 1 if the two particles are of the same type and 0 otherwise.
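The five logarithmic pairwise features can be computed for all particle pairs at once as in the following sketch; the PID-matching feature \(\delta _{i,j}\) for the quark–gluon dataset and the handling of zero-padded particles are omitted.

```python
import numpy as np

def pairwise_features(pt, rapidity, phi, px, py, pz, e):
    """The five logarithmic pairwise features of Eq. (4) for all particle pairs."""
    d_phi = (phi[:, None] - phi[None, :] + np.pi) % (2 * np.pi) - np.pi
    delta = np.hypot(rapidity[:, None] - rapidity[None, :], d_phi)       # Delta
    pt_min = np.minimum(pt[:, None], pt[None, :])
    kt = pt_min * delta
    z = pt_min / (pt[:, None] + pt[None, :])
    e_sum = e[:, None] + e[None, :]
    p_sum2 = ((px[:, None] + px[None, :])**2 + (py[:, None] + py[None, :])**2
              + (pz[:, None] + pz[None, :])**2)
    m2 = e_sum**2 - p_sum2                                               # pair invariant mass^2
    d_pt = np.abs(pt[:, None] - pt[None, :])
    eps = 1e-8                                   # guard the logarithm on the diagonal
    feats = [delta, kt, z, m2, d_pt]
    return np.stack([np.log(np.clip(f, eps, None)) for f in feats], axis=0)  # (5, P, P)
```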

Furthermore, as mentioned in Sect. 2, we design a pairwise jet feature interaction matrix, drawing inspiration from Ref. [31]. The list of all jet features used in this study is presented below. Note that all jet features are calculated from the four-momenta of the constituent particles within the jet. The interaction matrix is constructed from a straightforward yet effective ratio relationship, as illustrated in Table 1.

$$\begin{aligned} \left\{ \text {E},\ p_{\text {T}},\ \sum p_{Tf},\ \sum E_f,\ \overline{\varDelta \eta },\ \overline{\varDelta \phi },\ \overline{\varDelta R},\ \text {PID} \right\} . \end{aligned}$$
(5)

To explain the concept of the jet feature pairwise interaction matrix more clearly, we now describe it in detail. The first variable E represents the energy of the input jet. \(p_{\text {T}}\) denotes the transverse momentum of the input jet, while \(\sum p_{Tf}\) and \(\sum E_f\) represent the sums of the transverse momentum fractions and of the energy fractions of all constituent particles inside the input jet, respectively. Additionally, \(\overline{\varDelta \eta }\), \(\overline{\varDelta \phi }\) and \(\overline{\varDelta R}\) correspond to the transverse-momentum-weighted sums of \(\varDelta \eta \), \(\varDelta \phi \) and \(\varDelta R\) of all constituent particles inside the input jet, respectively, where \(\varDelta \eta \), \(\varDelta \phi \) and \(\varDelta R\) refer to the distances in the \(\eta -\phi \) space between each constituent particle and the input jet. Furthermore, for the quark–gluon dataset, we incorporate an eighth feature based on the particle identification information: it is the particle identification of the particle type whose summed transverse momentum accounts for the largest proportion of the total jet transverse momentum. The entire jet feature pairwise interaction matrix is defined as a symmetric block matrix with ones on the diagonal. For convenience, we call \(\{\text {E},\ {p_\text {T}},\ \sum p_{Tf},\ \sum E_f\}\) variable set 1 and \(\{\overline{\varDelta \eta }, \ \overline{\varDelta \phi },\ \overline{\varDelta R}\}\) variable set 2, and we build the pairwise interactions within variable set 1 and variable set 2, respectively. First, we employ a ratio relationship to define the interaction between E and \(p_\text {T}\). Additionally, we set the interaction between \(\sum E_f\) and E to 1, while no interactions exist between \(\sum E_f\) and any other variables except E and the particle identification. Similarly, we set the interaction between \(\sum p_{Tf}\) and \(p_T\) to 1, with no interactions between \(\sum p_{Tf}\) and any other variables except \(p_T\) and the particle identification.

Table 2 Comparison between the performance reported for P-DAT and existing classification algorithms on the quark–gluon discrimination dataset. The uncertainty is calculated by taking the standard deviation of 5 training runs with different random weight initialization

Second, we apply a ratio relationship to define the interactions between \(\overline{\varDelta R}\) and \(\{\overline{\varDelta \eta }, \overline{\varDelta \phi } \}\), while no interaction is specified between \(\overline{\varDelta \eta }\) and \(\overline{\varDelta \phi }\). Finally, we define the interactions between the particle identification and all other variables as the ratio of the sum of the corresponding variable over the particles associated with that particle identification to the variable of the jet.
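The following sketch illustrates one possible reading of this construction; the ordering of the features, the direction of each ratio, and the handling of the PID row are our assumptions and are not uniquely fixed by the text.

```python
import numpy as np

def jet_interaction_matrix(jet_feats, pid_feats):
    """Illustrative 8 x 8 jet-feature interaction matrix in the spirit of Table 1.
    jet_feats = [E, pT, sum_pTf, sum_Ef, d_eta_bar, d_phi_bar, d_R_bar] for the jet;
    pid_feats = the same seven quantities restricted to particles of the dominant PID."""
    E, pt, sum_ptf, sum_ef, deta, dphi, dr = jet_feats
    u = np.eye(8)                                   # symmetric, ones on the diagonal
    u[0, 1] = u[1, 0] = pt / E                      # ratio between E and pT (direction assumed)
    u[0, 3] = u[3, 0] = 1.0                         # sum E_f interacts only with E (and PID)
    u[1, 2] = u[2, 1] = 1.0                         # sum pT_f interacts only with pT (and PID)
    u[4, 6] = u[6, 4] = deta / dr                   # ratio between d_eta_bar and d_R_bar
    u[5, 6] = u[6, 5] = dphi / dr                   # ratio between d_phi_bar and d_R_bar
    for j in range(7):                              # PID row/column: dominant-PID fraction
        u[7, j] = u[j, 7] = pid_feats[j] / jet_feats[j]
    return u
```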

3.1 Quark/gluon discrimination

The quark–gluon benchmark dataset [18] was produced with Pythia8 [34] without detector simulation. It includes quark-initiated samples \(q{\overline{q}} \rightarrow Z(\rightarrow \nu {\overline{\nu }})+(u,d,s)\) as signal and gluon-initiated samples \(q{\overline{q}} \rightarrow Z(\rightarrow \nu {\overline{\nu }})+g\) as background. Jet clustering was performed using the anti-kT algorithm with R = 0.4. Only jets with transverse momentum \(p_T \in \) [500, 550] GeV and rapidity \(|y| < 1.7\) were selected for further analysis. Each particle in the dataset carries not only the four-momentum but also the particle identification information, which classifies the particle type as electron, muon, charged hadron, neutral hadron, or photon. The official dataset comprises 1.6M training events, 200k validation events and 200k test events. In this paper, we focus on the leading 100 constituents within each jet, utilizing their four-momenta and particle identification information for training. For jets with fewer than 100 constituents, zero-padding is applied. For each particle, a set of 8 input features is used, based solely on the four-momenta and identification information of the particles clustered within the jet. The accuracy, area under the curve (AUC), and background rejection results are presented in Table 2.

From Table 2, we can see that for the quark/gluon discrimination task, P-DAT exhibits strong classification performance, surpassing the majority of models while falling slightly behind the other two transformer-based models, PCT and ParT. The superior results of ParT can be attributed to its significantly more complex architecture, with a total of L = 8 particle attention blocks and 2 class attention blocks; its model complexity exceeds that of P-DAT by a substantial margin. As for the PCT model, all self-attention layers employ query, key, and value matrices obtained through one-dimensional convolutional layers, resulting in a larger number of FLOPs than our model. P-DAT thus strikes a favorable balance between performance and model complexity. Additionally, our P-DAT model incorporates the channel attention module, offering greater flexibility in leveraging the abundant jet information compared to the other two methods.

3.2 Top tagging

The benchmark dataset [5] used for top tagging comprises hadronic tops as the signal and QCD di-jets as the background. Pythia8 [34] was employed for event generation, while Delphes [9] was used for detector simulation. All particle-flow constituents were clustered into jets using the anti-kT algorithm [6] with a radius parameter of R = 0.8. Only jets with transverse momentum \(p_T \in \) [550, 650] GeV and rapidity \(|y| < 2\) were included in the analysis. The official dataset contains 1.2M training events, 400k validation events and 400k test events. Only the energy-momentum four-vectors of the particles inside the jets are provided. In this paper, the leading 100 constituent four-momenta of each jet were used for training. For jets with fewer than 100 constituents, zero-padding was applied. For each particle, a set of 7 input features based solely on the four-momenta of the particles clustered inside the jet was used. The accuracy, area under the curve (AUC), and background rejection results can be found in Table 3.

Table 3 Comparison between the performance reported for P-DAT and existing classification algorithms on the top tagging dataset. The uncertainty is calculated by taking the standard deviation of 5 training runs with different random weight initialization

From Table 3, a similar pattern emerges in the top tagging task. P-DAT exhibits competitive classification performance. While the other two transformer-based models, PCT and ParT, achieve modestly better performance, especially in terms of background rejection rates, which reach nearly twice those of our P-DAT model, this advantage comes at the cost of increased model complexity and resource demands.

Furthermore, given that our P-DAT model includes the channel attention module, and considering the distinct jet substructure characteristics observed in boosted top jets and boosted QCD jets, we have the opportunity to formulate a set of jet substructure variables and develop an additional self-attention module to compute attention weights for every pair of these variables. The resulting attention weight matrix could be employed as a bias term to augment the channel scaled dot-product attention, which is an interesting direction for enhancing top tagging performance in the future. While we acknowledge that ParticleNet Lite achieves higher background rejection rates with smaller model complexity on the top tagging task, we believe that the adaptability and innovation inherent in the P-DAT model, combining global jet information and local particle information, pave the way for exciting possibilities in this field.

4 Computational complexity

In addition to evaluating the algorithm's performance, it is crucial to consider the computational cost involved. To gauge the computational resources needed to evaluate each model, we calculate both the number of trainable parameters and the number of floating-point operations (FLOPs). Table 4 presents a comparative analysis of these factors across various algorithms.

Table 4 Comparison between the number of trainable weights and floating point operations (FLOPs) reported for P-DAT and existing classification algorithms

In this comparison of computational complexity, our P-DAT model emerges as a notable candidate. While the number of trainable parameters of P-DAT is more than 2.6 times that of PCT, the number of floating-point operations (FLOPs) is 45\(\%\) lower. Notably, when compared to ParticleNet, PCT, and ParT, P-DAT has the smallest number of FLOPs. P-DAT distinguishes itself by maintaining a comparatively modest parameter count of 498k while offering a reasonable level of computational efficiency with 144 M FLOPs. This balance between model complexity and computational demands positions P-DAT as an attractive choice for practical applications, where it can potentially deliver competitive performance with fewer computational resources, making it a promising option for deployment and further research.

5 Conclusion

In this study, we introduced the particle dual attention transformer (P-DAT) as an innovative model architecture for jet tagging. We designed the channel attention module and alternately employed the particle attention module and the channel attention module to capture both jet-level global information and particle-level local information, while maintaining computational efficiency. Additionally, we incorporated both the pairwise particle interactions and the pairwise jet feature interactions into the attention mechanism. We evaluated the P-DAT architecture on the classic top tagging task and the quark–gluon discrimination task and achieved competitive results compared to other benchmark strategies. Notably, P-DAT maintains a relatively modest parameter count of 498k while delivering a reasonable level of computational efficiency with 144 M FLOPs, striking a balance between computational complexity and model performance. Besides, given the substantial time and memory demands of introducing pairwise interaction matrices based on physics principles, we introduced a chunk loading strategy that dynamically imports and deletes data throughout the training, validation, and testing phases, effectively addressing memory constraints.

Finally, the channel attention module opens up further possibilities for future exploration. In this study, we proposed the channel attention module and designed the jet feature interaction matrix as our primary contributions. As an alternative to the simple ratio-based interaction matrix, one could construct a dedicated attention module for jet features; incorporating the resulting attention weight matrix into the channel attention module may further enhance performance. This strategy offers the advantage of incorporating valuable supplementary jet information and leveraging the intrinsic patterns within jet features revealed by the jet feature attention mechanism.