Multi-scale cross-attention transformer encoder for event classification

We deploy an advanced Machine Learning (ML) environment, leveraging a multi-scale cross-attention encoder for event classification, towards the identification of the $gg\to H\to hh\to b\bar b b\bar b$ process at the High Luminosity Large Hadron Collider (HL-LHC), where $h$ is the discovered Standard Model (SM)-like Higgs boson and $H$ a heavier version of it (with $m_H>2m_h$). In the ensuing boosted Higgs regime, the final state consists of two fat jets. Our multi-modal network can extract information from the jet substructure and the kinematics of the final state particles through self-attention transformer layers. The diverse learned information is subsequently integrated to improve classification performance using an additional transformer encoder with cross-attention heads. We ultimately show that our approach outperforms current alternative methods used to establish sensitivity to this process, whether based solely on kinematic analysis or on a combination of this with mainstream ML approaches. We then employ various interpretive methods to evaluate the network results, including attention map analysis and visual representation of Gradient-weighted Class Activation Mapping (Grad-CAM). Finally, we note that the proposed network is generic and can be applied to analyse any process carrying information at different scales. Our code is publicly available for generic use.


Introduction
Information about jet identification provides powerful insights into collision events and can help to separate the different physics processes originating them. This information can be extracted from the elementary particles localized inside a jet. Recently, various methods have been used to exploit the substructure of a jet to probe new physics signatures using advanced Machine Learning (ML) techniques [1][2][3][4][5].
Conversely, using the reconstructed kinematics of the final state jets for event classification spans the full phase space and yields strong classification performance [6][7][8][9][10][11][12][13][14][15][16][17][18]. Such high-level kinematics (i.e., encoding the global features of the final state particles), possibly together with the knowledge of the properties of (known or assumed) resonant intermediate particles, nevertheless remains blind to the information encoded inside the final state jets.
A possible way to extract information from both jet substructure and global jet kinematics is to concatenate the information extracted from a multi-modal network [19][20][21][22][23][24]. However, such a simple concatenation leads to an imbalance of the extracted information, within which the kinematic information generally dominates [25].
In this paper, we present a novel method for incorporating different-scale information extracted from both global kinematics and substructure of jets via a transformer encoder with a cross-attention layer. The model initially extracts the most relevant information from each dataset individually using self-attention layers before incorporating these using a cross-attention layer. The method demonstrates a larger improvement in classification performance compared to the simple concatenation method.
To assess our results, we analyze the information learned by the transformer layers through an examination of the attention maps of the self- and cross-attention layers. Attention maps provide information about the (most) important embedded particles the model focuses on when classifying signal and background events. However, they cannot highlight the region in the feature (e.g., phase) space crucial for the model classification. For this purpose, we utilize Gradient-weighted Class Activation Mapping (Grad-CAM) to highlight the geometric region in the η − ϕ (detector) plane on which the model focuses to classify events.
We test our approach for the dominant decay channel of Higgs boson pairs with Standard Model properties (hh) produced at the LHC, that is, into four b-(anti)quarks. This signal has historically proved to be extremely challenging to extract owing to a significant QCD background. Lately, there have been several attempts to tackle this signature using both standard [26][27][28] and ML [29] approaches. Furthermore, in the case that the hh intermediate state emerges from the (resonant) decay of a heavier Higgs state (H), the kinematics becomes very challenging, as each pair of would-be (slim) b-jets produced by the two h decays actually merges into one (fat) jet, since the two h states can be very boosted. In such a case, the final states in the detector bear little resemblance to the primary parton kinematics of the underlying physics. Finally, we assume the HL-LHC collider environment. This offers the additional challenge of an increased presence of tracks in the final state not emerging from the aforementioned hard scattering and subsequent parton-to-jet dynamics, e.g., pile-up, soft underlying event, multi-parton scattering, etc.
The plan of this paper is as follows. In the next section, we describe the basics of a transformer encoder. Then, in Sect. 3, we introduce the physics process that we use as an example. In Sect. 4, we present our numerical results. In Sect. 5, we interpret the classification results using various methods. Sect. 6 contains our conclusions. The details of our network structure can be found in the appendix.

Transformer encoder
Transformers were originally proposed as sequence-to-sequence models for machine translation [30]. The main ingredient of the original transformer model is the encoder-decoder block. However, encoder-only models often appear in event classification analyses at the LHC [31][32][33].
Inherited from the word tokens of the original transformer model, transformer encoders are used to analyze events in terms of clouds of particles in High Energy Physics (HEP) analyses [34,35]. Particle clouds represent the final state particles as a permutation invariant sequence of particles. Such a representation shares the advantages of particle-based representations, especially the flexibility to include arbitrary features for each particle.
The motivation to apply transformer encoders to particle clouds stems from their inherent ability to model interactions between particles irrespective of their spatial proximity. By leveraging self-attention mechanisms, transformer encoders enable each particle to dynamically weigh the influence of other particles within the entire cloud, thus capturing both local and global dependencies. This can potentially revolutionize the analysis of HEP systems, particularly by offering a more holistic understanding of their behavior and interactions.
Understanding the scientific operation of transformer encoders in the context of particle clouds requires diving into the core components of these models. At the heart of the transformer architecture is the attention mechanism, an algorithm that allows the model to focus on different parts of the input sequence when making predictions. An attention mechanism operates by assigning attention weights to different particles based on their relevance to the current particle being processed. This allows the model to consider global relationships and dependencies, enabling it to capture emergent behaviors, interactions, and patterns that may not be apparent in filter-based methods, e.g., Convolutional Neural Networks (CNNs), which mainly extract local information.

Attention mechanism
The attention mechanism is an essential component of transformer models, playing a crucial role in capturing information and dependencies amongst particles. In the transformer architecture, the attention mechanism enables the model to focus selectively on different parts of the input sequence, allowing for the modelling of complex relationships and dependencies. In general, the attention mechanism operates by assigning different weights to different elements in the input sequence, emphasizing the more relevant parts while downplaying the less relevant ones. Attention mechanisms broadly span two types, as follows.
• Self-attention is a more advanced form of attention where the model attends to different positions in the input sequence to weight their importance concerning the current position. In the context of the transformer model, self-attention allows each element in the sequence to attend to all other elements, capturing both local and global dependencies. Attention scores are calculated and used to combine the values associated with different positions linearly.
The self-attention mechanism enables the model to consider the entire context, making it particularly effective for tasks where long-range dependencies are crucial.
• Cross-attention extends the self-attention mechanism to handle input sequences from different sources. In the transformer architecture, it is often used when processing pairs of sequences of different structures. Cross-attention allows each element in the first sequence to attend to all elements in the second sequence. This facilitates modeling the relationships between different modalities or extracting the relevant information from sequences with different scales.
Consider the input data sets $(x_i, x_j)$ that have first been passed through a linear fully connected NN layer to generate the weight matrices as follows:
$$ Q = x_i W^Q, \qquad K = x_j W^K, \qquad V = x_j W^V, \qquad (1) $$
where $K$, $Q$ and $V$ are called the key, query and value vectors, respectively, and are used to compute the attention over the whole data set. Scaled dot-product attention can then be defined as
$$ \alpha_{ij} = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right), \qquad (2) $$
while the attention output is computed as a weighted sum of the attention scores,
$$ \mathrm{Attention}(Q, K, V) = \alpha_{ij}\, V. \qquad (3) $$
This is called self-attention if the attention is computed for the same data set, i.e., $x_i = x_j$. In this case the weight matrices have dimensions $W^{Q,K,V} \in \mathbb{R}^{d \times d}$, with $d$ the feature dimension of the embedded input, which mixes the features of the input data while retaining the dimension of the embedded input. In contrast, if the two input data sets differ, i.e., $x_i \neq x_j$, cross-attention is needed; in this case the weight matrices have different dimensions, with $W^Q \in \mathbb{R}^{d_i \times d_k}$, $W^K \in \mathbb{R}^{d_j \times d_k}$ and $W^V$ projecting onto the feature dimension of $x_i$. The attention output is used to scale the input data set via a skip connection,
$$ \tilde{x}_i = x_i + \mathrm{Attention}(Q, K, V). \qquad (4) $$
The new transformed data set $\tilde{x}_i$ indicates the attention importance of each element in the data set with respect to all elements in the set. Although the attention output mixes the input and feature tokens, the skip connection keeps the reference to the order of the original input data set. At its basic level, each transformer layer includes a multi-head attention, which combines different attention heads, allowing for parallel multi-dimensional processing of the inputs. Multi-head attention is a key innovation in the transformer model architecture, enhancing expressive power and capturing complex patterns in data by allowing the model to attend to different aspects of the input sequence simultaneously. Therefore, this mechanism eases the understanding of varied and subtle connections within the data, offering a more thorough representation.
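For concreteness, the following is a minimal NumPy sketch of Eqs. (1)-(4): it builds the query, key and value projections, computes the scaled dot-product attention map $\alpha_{ij}$ and adds the output back through the skip connection. The random projection matrices, the embedding dimension $d_k = 16$ and the toy shapes (50 constituents with 4 features, 3 reconstructed objects with 6 features) are illustrative stand-ins, not the trained weights or exact dimensions of our model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x_i, x_j, d_k, rng):
    """Scaled dot-product attention with queries from x_i and keys/values from x_j.
    Self-attention when x_i is x_j, cross-attention otherwise."""
    W_Q = rng.normal(size=(x_i.shape[-1], d_k))             # stand-ins for learnable projections
    W_K = rng.normal(size=(x_j.shape[-1], d_k))
    W_V = rng.normal(size=(x_j.shape[-1], x_i.shape[-1]))   # value dim matches x_i for the skip connection
    Q, K, V = x_i @ W_Q, x_j @ W_K, x_j @ W_V               # Eq. (1)
    alpha = softmax(Q @ K.T / np.sqrt(d_k))                 # Eq. (2): attention map alpha_ij
    out = alpha @ V                                         # Eq. (3): weighted sum of values
    return x_i + out, alpha                                 # Eq. (4): skip connection keeps token order

rng = np.random.default_rng(0)
jets = rng.normal(size=(50, 4))   # e.g. 50 jet constituents with 4 features
kins = rng.normal(size=(3, 6))    # e.g. 3 reconstructed objects with 6 features
x_self, a_self = attention(jets, jets, d_k=16, rng=rng)    # self-attention: alpha is (50, 50)
x_cross, a_cross = attention(kins, jets, d_k=16, rng=rng)  # cross-attention: alpha is (3, 50)
```

Note how the cross-attention output inherits the token count of the query set, so the transformed data keep the dimensions of the stream that supplies the queries.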
As explained, a single set of attention weights is computed for the entire input sequence. Multi-head attention extends this concept by employing multiple attention heads, each responsible for learning different aspects of the relationships within the data. Each attention head independently processes the input sequence, producing a set of output values. These outputs are then linearly combined to form the final output of the multi-head attention layer.
Mathematically, if $h$ represents the number of attention heads and $\mathrm{head}_i$ denotes the $i$th attention head, the output $O$ is obtained by concatenating the outputs of each attention head and linearly transforming these,
$$ O = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, $$
with $W^O$ the learnable linear transformation matrix, whose dimensions are chosen such that the output retains the same dimensions as the original input. This enables the model to capture different aspects of relationships and dependencies simultaneously. The choice of the number of attention heads, $h$, is a crucial hyperparameter in designing a transformer model. Increasing the number of attention heads has several implications, such as enhancing the model's capacity to capture complex relationships. However, a higher number of attention heads also increases the computational complexity: training and inference times as well as memory requirements grow. Therefore, the number of attention heads should be balanced against the task requirements and the available computational resources.
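In practice, the multi-head version need not be implemented by hand. The sketch below shows how a stock Keras MultiHeadAttention layer can act as either a self- or cross-attention block; the head counts (5 self-attention and 8 cross-attention heads) follow the configuration quoted later in the text, while the random tensors and key_dim are placeholders for illustration.

```python
import tensorflow as tf

jets = tf.random.normal((32, 50, 4))   # toy batch: 32 events, 50 constituents, 4 features
kins = tf.random.normal((32, 3, 6))    # 3 reconstructed objects, 6 kinematic features

mhsa = tf.keras.layers.MultiHeadAttention(num_heads=5, key_dim=16)   # self-attention heads
mhca = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=16)   # cross-attention heads

# Self-attention: query, key and value all come from the same stream.
jets_sa, sa_maps = mhsa(query=jets, value=jets, key=jets, return_attention_scores=True)
# sa_maps: (32, 5, 50, 50) -- one constituent-vs-constituent map per head

# Cross-attention: queries from one stream, keys/values from the other.
kins_ca, ca_maps = mhca(query=kins, value=jets, key=jets, return_attention_scores=True)
# kins_ca: (32, 3, 6); ca_maps: (32, 8, 3, 50)
```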
In this particular context, we present an innovative methodology aimed at integrating inputs characterized by distinct scales within a multi-modal transformer model featuring cross-attention layers. The schematic representation of the network architecture is shown in Fig. 1. Considering the specific case of the HEP process to be studied at the LHC, $gg \to H \to hh \to b\bar b b\bar b$, the model comprises three streams of self-attention transformer layers, devoted to analyzing the leading jet, the second-leading jet and the reconstructed kinematics, respectively. At this juncture, the model independently extracts pivotal information from each data set, leveraging self-attention mechanisms, before their collective processing through a cross-attention layer.
The main role of the cross-attention layer is to extract the local jet substructure information effectively and incorporate it into the extracted kinematic information. Notably, the adaptability of the cross-attention layer in merging information from one data set into another affords flexibility in determining how to integrate the extracted information, providing the option to accentuate jet information for enhancing kinematics. Once the most relevant information from the data sets is extracted and combined via the cross-attention layer, we feed the output to fully connected NNs to analyze the captured information and compute the classification probability. The inclusion of self-attention layers in the model holds significance, as it allows for the independent extraction of the most relevant information from each data set prior to their amalgamation using the cross-attention mechanism. This characteristic makes the model proficient in analyzing multi-scale data characterized by intricate structures.
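A minimal Keras sketch of this three-stream layout is given below. The encoder depth, head counts, projection sizes and the choice of which stream supplies the queries of the cross-attention layer are illustrative assumptions; the exact configuration used in this work is described in Appendix A.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, num_heads=5, key_dim=16):
    # One self-attention transformer encoder block: MHSA, skip connection, normalization.
    att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(x, x)
    return layers.LayerNormalization()(x + att)

jet1 = tf.keras.Input(shape=(50, 4))   # leading-jet constituents
jet2 = tf.keras.Input(shape=(50, 4))   # second-leading-jet constituents
kins = tf.keras.Input(shape=(3, 6))    # reconstructed kinematics

x1, x2, xk = jet1, jet2, kins
for _ in range(2):                     # N_i stacked encoders per stream (illustrative depth)
    x1, x2, xk = encoder_block(x1), encoder_block(x2), encoder_block(xk)

jets_sum = layers.Add()([x1, x2])      # merge the two jet-substructure streams

# Cross-attention merges the jet-substructure information into the kinematic stream.
ca = layers.MultiHeadAttention(num_heads=8, key_dim=16)(query=xk, value=jets_sum, key=jets_sum)
xk = layers.LayerNormalization()(xk + ca)

h = layers.Flatten()(xk)
h = layers.Dropout(0.2)(layers.Dense(128, activation="gelu")(h))
h = layers.Dropout(0.2)(layers.Dense(64, activation="gelu")(h))
out = layers.Dense(2, activation="softmax")(h)

model = tf.keras.Model([jet1, jet2, kins], out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```

Replacing the cross-attention call with a simple concatenation of the flattened streams gives a baseline analogous to the fourth model discussed below.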

Physics example
We analyse SM-like di-Higgs boson (hh) production at the HL-LHC with an integrated luminosity of 3000 fb−1 within the framework of the 2-Higgs Doublet Model (2HDM). In the boosted regime, where the di-Higgs system is produced from an on-shell heavy Higgs boson, H, the final state features two fat jets, as illustrated in Fig. 2 by the two red cones therein. Therefore, to start with, in this section, we provide a brief review of the 2HDM with Type-II Yukawa couplings, focusing on the aspects that are relevant to our analysis. We then describe the strategy behind our numerical analysis, together with its constituent elements, i.e., the event generation and detector simulation procedures, as well as the signal and background properties, in terms of the overall kinematics and internal dynamics of jets. We adopt different transformer encoder configurations to analyze the kinematics and jet substructure individually and to efficiently combine the information from both of these.

The 2HDM
The 2HDM is an extension of the SM through a second $SU(2)_L$ Higgs doublet with the same quantum numbers under the SM gauge symmetry group [36,37]. The most general 2HDM Higgs potential is given by
$$ V(\phi_1,\phi_2) = m_{11}^2\,\phi_1^\dagger\phi_1 + m_{22}^2\,\phi_2^\dagger\phi_2 - \left[m_{12}^2\,\phi_1^\dagger\phi_2 + \mathrm{h.c.}\right] + \frac{\lambda_1}{2}(\phi_1^\dagger\phi_1)^2 + \frac{\lambda_2}{2}(\phi_2^\dagger\phi_2)^2 + \lambda_3(\phi_1^\dagger\phi_1)(\phi_2^\dagger\phi_2) + \lambda_4(\phi_1^\dagger\phi_2)(\phi_2^\dagger\phi_1) + \left[\frac{\lambda_5}{2}(\phi_1^\dagger\phi_2)^2 + \lambda_6(\phi_1^\dagger\phi_1)(\phi_1^\dagger\phi_2) + \lambda_7(\phi_2^\dagger\phi_2)(\phi_1^\dagger\phi_2) + \mathrm{h.c.}\right]. $$
The potential structure allows for Flavor Changing Neutral Currents (FCNCs) at the tree level, which are strongly constrained by experimental measurements. Applying a global $Z_2$ symmetry to the scalar potential, with $(\phi_1,\phi_2)\to(\phi_1,-\phi_2)$ transformations, prevents the existence of such FCNC sources [38]. However, the most general Yukawa interaction violates such a $Z_2$ symmetry, thus leading to potential FCNCs at the tree level, as pointed out in Ref. [39]. Therefore, to tame the latter, only specific Yukawa structures, known as Types [36], are allowed. Yet, to enable Electro-Weak Symmetry Breaking (EWSB) consistent with the measured particle spectrum of the SM, a softly broken $Z_2$ symmetry should eventually be enabled by requiring a small but non-vanishing term $m_{12}^2(\phi_1^\dagger\phi_2)$ and setting $\lambda_6=\lambda_7=0$. (Herein, softly means that the model still respects the $Z_2$ symmetry at small distances through all orders of perturbation theory.) The soft mass $m_{12}^2$ and $\lambda_5$ are in general complex, though [40,41]. In the following, we will consider a real potential that thus preserves the CP symmetry, $\mathrm{Im}(m_{12}^2)=\mathrm{Im}(\lambda_5)=0$. In such a configuration of the 2HDM, 7 independent parameters remain: the $\lambda_i$'s, with $i=1,\ldots,5$, $\tan\beta = v_2/v_1$ and $m_{12}^2$, from which the physical parameters, i.e., Higgs boson masses and couplings, are obtained, with the constraint that one of the two CP-even Higgs fields should be the discovered one with mass of 125 GeV or so (which in our case is the $h$ field). Finally, as mentioned already, we restrict our study to the Type-II among the possible Yukawa structures.
The tree-level mass matrix squared for the Higgs fields can be obtained from the quadratic part of the potential in terms of the $h_i$'s ($i=1,\ldots,4$), the four components of the complex doublet fields. Upon EWSB, three physical neutral scalars are obtained after diagonalizing the corresponding mass matrices: as intimated, two CP-even (scalar) ones ($h$, $H$) and a CP-odd (pseudoscalar) one ($A$), whose masses follow from this diagonalization, with the VEVs satisfying the relation $v=\sqrt{v_1^2+v_2^2}$ (with $v$ being the SM one). To stay with the neutral Higgs sector, the imposed CP conservation only allows for tree-level couplings between two massive gauge bosons and the CP-even Higgs states, while the CP-odd Higgs state can only couple to a gauge boson and a CP-even Higgs one. Furthermore, all neutral Higgs states can couple to fermions. The coupling strengths of the neutral Higgs bosons to both matter and forces are parameterized in terms of $\tan\beta$ and another parameter, $\alpha$, which is the mixing angle between the CP-even Higgs states [36]. Furthermore, the triple scalar coupling is independent of the Yukawa structure and can be written in terms of the electric charge $e$ and the sines and cosines ($s$, $c$) of these mixing angles [42]. The 2HDM free parameters are constrained by various theoretical considerations and experimental observations, as described in [43]. Thus, profiting from the scan results performed therein, we adopt four Benchmark Points (BPs), with $m_H = 600$, 800, 1000 and 2000 GeV, that satisfy all the current bounds. In Tab. 1, we show the parameter values of these points, while the last column shows the production cross section $\sigma_{\rm prod}$ of our target process (prior to the two $h\to b\bar b$ decays) at $\sqrt{s}=14$ TeV.

Analysis strategy
With the theoretical setup clarified, we now proceed to a phenomenological study of di-Higgs boson production, focusing on final states with two boosted fat jets. We align our analysis with the boosted analysis presented in the latest ATLAS paper [44]. The primary background contamination arises from QCD multijet processes, specifically $pp \to jjjj$, contributing an estimated 90% of the total background, while the top quark pair ($t\bar t$) process contributes at the 10% level. Other background processes, including SM h, hh and EW di-boson production, have been assessed to make negligible contributions to the selected event yields; therefore, they are not included in our analysis. Given that BSM di-Higgs events suffer from huge background contamination and it is not trivial to extract the signal information, we employ various configurations of transformer encoders for this analysis.
Commencing with the analysis of the global information encoded in the high-level reconstructed kinematics of both the signal and relevant background events, we employ a transformer encoder with multi-head self-attention to optimize the separation power between signal and background events. However, the presence of kinematic structures similar to the signal in some background processes poses a challenge to the classification efficiency of this network. Additionally, the substantial cross section of the background events diminishes the signal significance, even after optimizing the cut on the output score. To enhance the network performance, one may consider applying initial cuts on certain variables before inputting the distributions into the network, aiming to amplify the signal and suppress the backgrounds.
In this analysis, we adhere to the pre-selection cuts outlined in [44], requiring two fat jets with a double b-tagging each. Moreover, each event must have at least two fat jets with radius R = 1.0 and pT > 450 GeV for the leading jet and pT > 250 GeV for the second-leading jet. Each of the two fat jets is required to have a pseudorapidity |η(J)| < 2.5 and a lower mass cut of m(J) > 50 GeV, and a mass window of 200 GeV is applied for the m_H reconstruction for m_H ≤ 1 TeV, relaxed for higher masses to allow for more statistics. Unlike the ATLAS analysis, we do not consider pile-up effects in this analysis.
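As an illustration, the pre-selection can be expressed as a simple boolean mask over per-event fat-jet records; the toy values below, and the reading of the 200 GeV window as ±100 GeV around the target mass, are our assumptions.

```python
import numpy as np

# Toy per-event fat-jet records: (pT [GeV], eta, mass [GeV]) for the two leading fat jets.
j1 = np.array([[620., 0.3, 118.], [480., 1.1, 45.], [900., 2.7, 130.]])
j2 = np.array([[310., -0.8, 122.], [260., 0.2, 60.], [350., 1.9, 125.]])
m_H_reco = np.array([980., 700., 1900.])   # reconstructed di-jet invariant mass
m_H_target = 1000.                         # hypothesis mass (<= 1 TeV, so a 200 GeV window)

sel = (
    (j1[:, 0] > 450.) & (j2[:, 0] > 250.)                    # leading / second-leading pT cuts
    & (np.abs(j1[:, 1]) < 2.5) & (np.abs(j2[:, 1]) < 2.5)    # pseudorapidity acceptance
    & (j1[:, 2] > 50.) & (j2[:, 2] > 50.)                    # lower fat-jet mass cut
    & (np.abs(m_H_reco - m_H_target) < 100.)                 # +-100 GeV, i.e. a 200 GeV window
)
print(sel)   # [ True False False]
```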
In addition to the global kinematical variables, we can utilize the substructure of the jets to distinguish between signal and background events. This naturally arises from the fact that jets initiated by different particles exhibit distinct characteristics. While heavy boosted particles, such as $W^\pm$, Z and Higgs bosons, can result in jets with a distinctive multi-prong structure, quark and gluon jets are unlikely to have such a structure. Furthermore, a boosted colour-singlet particle is isolated in colour flow; therefore, the two b-jets from a Higgs decay are colour-connected only among themselves, unlike QCD jets.
Consequently, the features of the parent particles can be inferred from the structure of the jet constituents. This information enables the recovery of various local details about events from different processes, serving as a discriminating variable between signal and background events. The study of jet substructure to identify the parent particle initiating a jet, thereby distinguishing jets initiated by heavy boosted particles from QCD jets, was introduced in [45][46][47][48][49][50][51][52][53][54][55][56][57][58] (and references therein). Recently, improvements in jet identification have continued through ML methods for jet image analysis [59][60][61][62][63][64][65][66][67], graph-based analysis [68][69][70] or sequence-based analysis [71][72][73][74][75][76]. To exploit the substructure of the Higgs jets to discriminate between signal and background events, in this paper we employ a multi-modal transformer encoder with self-attention multi-heads to analyze the jet contents. The different modalities are designed to extract information from the leading and second-leading jet contents in parallel before a simple concatenation is performed for classification purposes. Without cross-attention to the high-level kinematical variables discussed next, the classification performance is based solely on the information localized inside the fat jets.
To integrate inputs of varying scales, encompassing both kinematics and jet substructure information, we utilize a multi-modal transformer encoder equipped with three streams and cross-attention heads.
The first and second streams process information from the leading and second-leading jet contents. Each of them features a transformer encoder with self-attention heads. Once important features are extracted from the jets, they are aggregated in an addition layer. The third stream, dedicated to the high-level kinematics, employs a transformer encoder with self-attention heads. The output from the addition layer and the final layer of the third transformer are fed into a cross-attention layer. This cross-attention layer is pivotal in connecting information extracted from the jet constituents to the corresponding kinematics, enhancing the overall classification performance. To elucidate the impact of the cross-attention layer, we introduce a fourth model wherein we substitute the cross-attention layer with a straightforward concatenation layer.
For event simulation, we use MadGraph5 [77] to estimate multi-parton amplitudes and to generate signal and background events for subsequent processing. The background processes $pp \to b\bar b b\bar b$ and $pp \to t\bar t$ are computed at Leading Order (LO), while Higgs production from gluon-gluon fusion is calculated at Next-to-LO (NLO) in QCD using an effective coupling calculated by SPheno [78,79]. PYTHIA [80] is used for parton shower, hadronization, heavy flavor decays and for adding the soft underlying event. The DELPHES package [81] is used for detector simulation.
DELPHES parameterizes the detector response, including tracks, calorimeter deposits and high-level objects such as isolated electrons, jets, taus and Missing $E_T$ (MET). We use the default ATLAS card for the resolution, but the (fat) jets are reconstructed from the E-flow objects, combining tracks and calorimeter information. The $t\bar t$ background is simulated at LO with up to two additional jets, with a matching scale of 20 GeV, via the MLM method [82,83]. Fat jets are clustered using the anti-$k_T$ algorithm [84,85] with cone radius R = 1 and, to ensure further suppression of pile-up noise, jet trimming is performed [86].
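For orientation, the clustering step can be reproduced outside DELPHES with standard anti-kT tools; the sketch below uses the pyjet package on a random toy event, with a lowered pT threshold and without the trimming step, and is not the toolchain used in this analysis.

```python
import numpy as np
from pyjet import cluster, DTYPE_PTEPM

# Toy event: E-flow-like candidates as (pT [GeV], eta, phi, mass) records.
rng = np.random.default_rng(1)
n = 200
event = np.zeros(n, dtype=DTYPE_PTEPM)
event['pT'] = rng.exponential(20., n) + 1.
event['eta'] = rng.normal(0., 1.5, n)
event['phi'] = rng.uniform(-np.pi, np.pi, n)

seq = cluster(event, R=1.0, p=-1)          # p = -1 selects the anti-kT algorithm, R = 1.0
jets = seq.inclusive_jets(ptmin=20.)       # toy threshold; the analysis uses 450/250 GeV fat jets
for j in jets:
    print(j.pt, j.eta, j.phi, j.mass, len(j.constituents()))
```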

Data pre-processing
Particle clouds enable configuring diverse data into the network, emphasizing the permutation symmetry of the inputs to yield a promising representation of jets. Initially, we pre-process the data sets for the leading and second-leading jet contents, keeping up to 50 constituents each. The particles are arranged in descending order based on their transverse momentum. For events with fewer constituents, the remaining positions are padded with zeros, ensuring conformity with the stipulated count. For each jet constituent we store 4 features: pT, η, ϕ and log(pT/pT(jet)) [35]. Fig. 3 shows the four features averaged over the number of jet constituents for 10000 events of the leading jet (left) and second-leading jet (right).
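The per-jet input can be assembled as in the following sketch; the use of the scalar sum of constituent pT as the jet pT in the logarithmic ratio is a simplification (in the analysis the reconstructed fat-jet pT is available from the jet clustering).

```python
import numpy as np

def constituent_features(pt, eta, phi, n_max=50):
    """Build the (n_max, 4) per-jet input: pT-ordered constituents with
    features (pT, eta, phi, log(pT / pT_jet)), zero-padded to n_max."""
    order = np.argsort(pt)[::-1][:n_max]               # descending pT, truncate to n_max
    pt, eta, phi = pt[order], eta[order], phi[order]
    pt_jet = pt.sum()                                  # scalar-sum proxy for the jet pT (assumption)
    feats = np.stack([pt, eta, phi, np.log(pt / pt_jet)], axis=1)
    out = np.zeros((n_max, 4))
    out[:len(feats)] = feats                           # zero-pad jets with fewer constituents
    return out

# Toy jet with 7 constituents
rng = np.random.default_rng(2)
x = constituent_features(rng.exponential(30., 7), rng.normal(0., .4, 7), rng.normal(0., .4, 7))
print(x.shape)   # (50, 4)
```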
To optimize the network's discriminative accuracy, it is imperative to pre-process the jet contents, ensuring the manifestation of the multi-prong structure specific to signal events. We use the pre-processing steps that were introduced for jet image analysis. The pre-processing allows learning from small input data and considerably speeds up the learning process. For this purpose, the following transformations are applied before inputting the data into the network.
• Translation: Jet contents are shifted in the η − ϕ directions such that the jet axis is at the center of the η − ϕ plane.
• Rotation: A rotation is executed to mitigate the stochastic nature inherent in the decay angle with respect to the (η − ϕ) coordinate system. This alignment is achieved by ascertaining the principal axis of the original data and subsequently rotating around the jet-energy centroid, ensuring that the principal axis consistently aligns vertically. The rotation transformation is performed by first computing the leading eigenvector of the covariance matrix as the principal axis of the jet. A rotation angle, θ, is then defined as arctan2(x1, x2), where x1 and x2 are the first and second components of the eigenvector, respectively. Finally, the rotation angle is used to rotate the (η − ϕ) coordinates of the jet constituents to new non-physical coordinates, (η′ − ϕ′), in which the principal axis of the jet is always vertical.
• Flipping: Jet constituents are reflected over the vertical axis such that the right side of η′ always has the highest momentum. This ensures that the hardest radiation always appears in similar locations, which can be exploited to enhance the classification performance.
After pre-processing transformations, input data sets for the leading and second leading jets have the dimensions of (n, 50, 4), where n is the number of events, 50 is the number of jet constituents, and 4 is the number of pre-processed features.
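A compact sketch of the three pre-processing transformations is given below; the pT-weighted covariance matrix and the pT-sum criterion for the flipping are our concrete choices for the "principal axis" and "hardest side", which the text leaves implicit.

```python
import numpy as np

def preprocess(pt, eta, phi, jet_eta, jet_phi):
    """Translate, rotate and flip jet constituents as described above.
    Returns the transformed (eta', phi') coordinates and the rotation angle theta."""
    # Translation: centre the constituents on the jet axis.
    d_eta = eta - jet_eta
    d_phi = np.arctan2(np.sin(phi - jet_phi), np.cos(phi - jet_phi))   # wrap into (-pi, pi]
    # Rotation: principal axis from the pT-weighted covariance matrix.
    pts = np.stack([d_eta, d_phi])
    cov = np.cov(pts, aweights=pt)
    evals, evecs = np.linalg.eigh(cov)
    v = evecs[:, np.argmax(evals)]                    # leading eigenvector = principal axis
    theta = np.arctan2(v[0], v[1])                    # rotate so the principal axis is vertical
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    eta_r, phi_r = rot @ pts
    # Flipping: reflect so the harder side sits at positive eta'.
    if pt[eta_r > 0].sum() < pt[eta_r < 0].sum():
        eta_r = -eta_r
    return eta_r, phi_r, theta

# Example: 6 toy constituents around a jet axis at (eta, phi) = (0.5, 1.0)
rng = np.random.default_rng(3)
e, p, th = preprocess(rng.exponential(30., 6), 0.5 + rng.normal(0, .3, 6),
                      1.0 + rng.normal(0, .3, 6), jet_eta=0.5, jet_phi=1.0)
```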
Fig. 4 presents the cumulative average of 50000 pT distributions for both the leading (upper row) and second-leading (lower row) jets. The impact of the pre-processing transformations is evident in revealing the multi-prong structure characteristic of signal events, wherein the leading and second-leading subjets are localized in specific regions within the (η′ − ϕ′) plane. In contrast, subjets from QCD multi-jets exhibit a broad energy range, lacking a discernible prong structure. Conversely, $t\bar t$ events show a distinct three-prong structure attributable to the fully hadronic decays of the top quark. Notably, despite the multi-prong structure in $t\bar t$ background events, their contribution to the overall background is merely 10%. We will see later that the background rejection efficiency is high; therefore, the $t\bar t$ background can be important when estimating the accessibility of the signal. The kinematics data sets have dimension (n, 3, 6), with n the number of events, 3 the number of reconstructed particles (leading jet, second-leading jet and heavy Higgs) and 6 the number of kinematic variables for each reconstructed particle. The 6 kinematic variables are the mass m, pT, η, ϕ and energy E of the jet and the rotation angle θ of the jet. Note that we assign 5 inputs to describe the 4-momentum of each reconstructed object. Because of the kinematical constraints $p^2 = m^2$ and $p_H = p_{J_1} + p_{J_2}$, there are only 8 physically independent observables among the 15 kinematical inputs; these redundant inputs help the network to figure out the relevant features for the classification. Fig. 5 shows the normalized kinematic distributions for the signal point with $m_H$ = 1 TeV and the backgrounds. In addition to the reconstructed high-level kinematics, we incorporate the θ distributions for the leading and second-leading jets (but not the heavy Higgs), which are the rotation angles of the leading and second-leading jet contents.
We incorporate the data sets as input to the networks as follows: the inputs to the first and second transformer encoders have dimensions (n, 50, 4), while the input to the third transformer encoder has dimension (n, 3, 6). Once the data sets are pre-processed, we stack signal and background events in each data set separately, attaching labels of Y = 1 for the signal events and Y = 0 for the background events. During the training of the network, the model minimizes a categorical cross-entropy loss function, reducing the difference between the model prediction and the assigned labels. In this analysis, we use equal-size data sets for signal and background events, with 1 million events for training and 100000 events for testing.
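Schematically, the stacking and labelling step looks as follows; the array names, optimizer and validation split are placeholders, while the loss, the 30 epochs and the batch size of 500 follow the settings quoted in Appendix A.

```python
import numpy as np
import tensorflow as tf

def build_dataset(sig_j1, sig_j2, sig_kin, bkg_j1, bkg_j2, bkg_kin):
    """Stack pre-processed signal and background arrays and attach one-hot labels."""
    X = [np.concatenate([s, b]) for s, b in
         [(sig_j1, bkg_j1), (sig_j2, bkg_j2), (sig_kin, bkg_kin)]]
    y = np.concatenate([np.ones(len(sig_j1)), np.zeros(len(bkg_j1))])   # Y = 1 signal, Y = 0 background
    return X, tf.keras.utils.to_categorical(y, 2)

# X, y = build_dataset(...)
# model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(X, y, epochs=30, batch_size=500, validation_split=0.1)
```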

Results
We now present the analysis results for probing the signature of the heavy scalar in the process of boosted di-Higgs boson production, $gg \to H \to hh \to b\bar b b\bar b$, at the HL-LHC with an integrated luminosity of 3000 fb−1. The discriminating power of each network will measure how well the signal and background may be characterised through their different features, all entangled together into several kinematic distributions and jet substructure information.
For this purpose, we utilize four different attention-based transformer models. Two of them analyze the reconstructed high-level kinematics or the jet substructure individually via transformer encoders with a self-attention mechanism, while the other two are multi-modal transformer encoders that analyze the combined information of kinematics and jet substructure.
In the latter, we incorporate the different information using either a simple concatenation layer or a cross-attention layer. A full description of the networks used is given in Appendix A.
The classification performance of the utilized networks is presented in Fig. 6. In the left plot, we show the Receiver Operating Characteristic (ROC) curves of the employed networks for a signal with $m_H$ = 1 TeV. The multi-modal transformer encoder with cross-attention layers performs best, achieving an Area Under the Curve (AUC) of 98.8%. In contrast, the transformer encoder trained solely on the jet substructure information exhibits the lowest performance, with an AUC of 84.4%. It is crucial to highlight the impact of the cross-attention layer, which enhances the performance by 7% over the transformer model trained exclusively on kinematic information. Replacing the cross-attention layer with a simple concatenation layer results in a degradation of the classification performance by approximately 4%, as depicted by the green line in the plot.
In the right plot, we present the 95% upper limit on the production cross section at the HL-LHC for heavy scalar masses in the range 600−2000 GeV. The dashed black line represents the limit from the ATLAS analysis [44], with linear scaling of the integrated luminosity to 3000 fb−1. For lower masses, $m_H$ ≤ 1 TeV, all the transformer models used show enhanced performance over the ATLAS analysis, exhibiting over 10 times better sensitivity. For larger masses, for which the reconstructed kinematics of the signal are faithful to its true structure with vanishing background events, the performance of the transformer models saturates. In fact, at the highest masses, e.g., $m_H$ = 2 TeV, the background events can easily be removed with a simple cut on the reconstructed distributions, which exhibit a clear difference between signal and background. The performance of the transformer network trained on the jet constituents only shows little dependence on the heavy scalar mass.
The network performance is subject to training and statistical uncertainties from the limited training and testing samples. For example, the network performance can be influenced by the random partitioning of the training and test data sets, and it varies when repeating the training and test steps with new splits.
We repeat the experiment k times and report the results as bands between the highest and lowest values. In our results, we use k = 5, and the bands span the values obtained across the repeated experiments.
As for optimizing the signal-to-background yield, we enforce a cut on the network output score so as to keep only 20 background events. With this choice, we alleviate the statistical errors that may occur for lower background yields [87].
The optimized signal and background event numbers are used to derive the upper limit using the following formula [88]:
$$ Z_A = \left[ 2\left( (N_s + N_b)\,\ln\frac{(N_s + N_b)(N_b + \sigma_b^2)}{N_b^2 + (N_s + N_b)\,\sigma_b^2} - \frac{N_b^2}{\sigma_b^2}\,\ln\!\left(1 + \frac{\sigma_b^2\, N_s}{N_b\,(N_b + \sigma_b^2)}\right) \right) \right]^{1/2}, $$
with $N_s$ and $N_b$ being the number of signal and background events, respectively, and where $\sigma_b$ characterizes the uncertainty on the background events, chosen to be the conservative value of 10% [89]. In this approximation, one expects to exclude regions with a total significance of $Z_A > 2$.
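A small helper implementing this significance is sketched below, assuming the standard Asimov expression with $\sigma_b$ taken as 10% of the background yield; the event numbers are purely illustrative.

```python
import numpy as np

def z_asimov(n_s, n_b, rel_unc=0.10):
    """Median significance Z_A for n_s signal events on top of n_b background events,
    with an absolute background uncertainty sigma_b = rel_unc * n_b (assumed form)."""
    sb = rel_unc * n_b
    n = n_s + n_b
    t1 = n * np.log(n * (n_b + sb**2) / (n_b**2 + n * sb**2))
    t2 = (n_b**2 / sb**2) * np.log(1. + sb**2 * n_s / (n_b * (n_b + sb**2)))
    return np.sqrt(2. * (t1 - t2))

print(z_asimov(n_s=30., n_b=20.))   # ~ 5.0 for this illustrative choice
```

In our setting the score cut is tuned so that $N_b = 20$, as described above, and the signal yield after the cut enters as $N_s$.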

The influence of cross-attention
To evaluate the impact of the cross-attention layer on the classification performance, Fig. 7 presents the attention output, as defined in Equation 3, for both the multi-modal transformer with cross-attention trained on kinematics plus jet constituents and the transformer network trained on kinematics only. In both networks, the attention output has dimensions (3, 6), where 3 represents the reconstructed particles (leading jet, second-leading jet and heavy Higgs) and the last dimension accounts for the utilized features. We stress that in the network structure shown in Fig. 1, we adjust the Query matrix to be the output of the transformer layers for the jet constituents, while the Key and Value matrices are the output from the transformer layers for the kinematics. Accordingly, the output of the cross-attention layers has the same dimensions as the kinematics dataset. In principle, we have the freedom to choose how to add the jet information to the kinematics by fixing the assigned Query, Key and Value matrices, but we opted to incorporate the jet information into the high-level kinematics. Fig. 7 displays the distributions of the attention output for each transformed particle individually, averaged over the used features. The top row shows the attention output for signal and background events using a transformer encoder trained on kinematics only.
Conversely, when the information of the jet constituents is included using the cross-attention layer, the attention output distributions for background events are broader and the signal distributions are narrower. The fact that background jets lack a multi-prong structure, featuring broader soft radiation, influences the attention output for background events, increasing the output variations in the feature space.
Finally, we include, alongside the described kinematical information, also the rotation angle θ aligning the fat jet axis to the ϕ direction after shifting the jet η and ϕ to the center of the η − ϕ plane.This information allows the network to reconstruct the full events and access the correlation of the jet shape to the other fat jet and the beam axis.
In Table 2, we compare the AUC values of the network using kinematical inputs (Kins), Kins + θ, Kins + jet substructure inputs (Jet str.) and Kins + Jet str. + θ. Adding θ to Kins improves the AUC by 0.59, while adding θ to Kins + Jet str. improves the AUC by 1.45. This indicates that the correlation between all inputs (Jet str., θ and Kins) contributes to the signal and background classification. In Fig. 8 (left), we show the ROC curve of the network trained without the θ inputs (red) compared to the ROC curve of our cross-attention model (blue). The improvement in background rejection is a factor of four at a signal efficiency of 80%; therefore, including θ results in drastically increased performance. In Fig. 8 (middle and right), we show the efficiency in rejecting background for the model with and without the θ inputs. The model with θ has higher efficiency at $m_{J_1} \sim m_h$ and $p_T \sim m_H/2$. In short, the model can focus more on the $H \to hh$ kinematics with the θ inputs. We also looked for simple correlations among θ and the other kinematical variables, such as $\eta_J$ and $\phi_J$, but did not find any apparent ones contributing to the selection improvement, consistent with the modest improvement from adding θ to the model using Kins only. The correlations within the internal structures of the jet will be investigated in future publications.

Interpretation of the transformer encoder results
In the following section, we discuss additional methods to interpret and analyze the results of the transformer encoder with cross-attention, which performs best in Fig. 6. The interpretation methods are generic and can be applied to other networks as well. As attention-based transformer models excel in capturing intricate spatial relationships and global context within data, their interpretability becomes paramount. Interpretation methods for attention-based transformers aim to elucidate the visual cues, features and regions that contribute significantly to the model's predictions. Common interpretation methods are the following.
• Attention maps: These visualize the focus of the model by highlighting the particles in the cloud that receive higher attention. They provide a direct view into which particles are considered most relevant for a prediction, facilitating an intuitive understanding of the model's decision-making process.
• Grad-CAM: This generates class-specific activation maps by weighting the gradients of the predicted class score with respect to the final transformer layer [90]. The technique highlights the regions in the feature space that are crucial for the model's classification decision and thus can provide a geometrical interpretation, in the η − ϕ plane, of the information learned by the network.
• Saliency maps: Saliency maps for transformer models are a form of interpretability technique used to understand and visualize the importance of different parts of the input sequence with respect to the model's predictions. They highlight the regions of the input that most significantly influence the model's output, providing insights into the model's decision-making process [91][92][93]. By examining the saliency map, users can gain insights into which parts of the input sequence are crucial for the model's predictions.
• Layer-wise Relevance Propagation (LRP): The primary goal of LRP is to assign relevance scores to input features, indicating their contribution to the model's output [94]. However, LRP has limitations, and its effectiveness can vary depending on the specific neural network architecture and the nature of the task. Different variants of LRP have been proposed to address specific challenges and improve its applicability to various models.
The interpretation of attention-based transformer models is pivotal for unlocking their full potential and ensuring their responsible deployment in real-world applications.Among all the mentioned methods, we adopt the attention maps and Grad-CAM to interpret the learned information using the transformer model.

Attention maps
Attention maps serve as a bridge between the abstract nature of neural network computations and the desired interpretability. These maps visualize the attention scores assigned to each particle token in the input sequence, providing a representation of where the model focuses its attention during processing. The analysis of the attention maps highlights the particle tokens that receive higher attention scores, indicating their significance in the model's decision. Also, it reveals how particle tokens relate to each other. For instance, it highlights the information extracted from the jet constituents relevant to the reconstructed objects. Importantly, examining attention maps can pinpoint areas where the model might struggle or make mistakes.
In this context, we utilize attention maps to analyze the information acquired by the last transformer layer of the jet constituents. Our focus centers on the output of the network shown in Fig. 1. We begin by examining the attention maps of the Add() layer, which contains information about the jet substructure. In this case, the attention maps, denoted as α_ij in Equation 2, have dimensions (n_heads, 50, 50), where 50 represents the number of constituents in the jet and n_heads denotes the number of self-attention heads. We take n_heads = 5 (see Appendix A for details).
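In practice, the α_ij maps can be read out directly from the attention layers at test time; the sketch below shows the idea for a stand-alone MultiHeadAttention layer with five heads applied to toy embedded constituents (in the full model the same scores are returned by the corresponding layer of the trained network).

```python
import tensorflow as tf

mhsa = tf.keras.layers.MultiHeadAttention(num_heads=5, key_dim=16)
jets = tf.random.normal((1000, 50, 4))                      # toy embedded jet constituents

_, scores = mhsa(jets, jets, return_attention_scores=True)  # alpha_ij, shape (1000, 5, 50, 50)
mean_maps = tf.reduce_mean(scores, axis=0).numpy()          # event-averaged map per attention head
print(mean_maps.shape)                                       # (5, 50, 50)
```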
Fig. 9 displays the values of the attention maps for each attention head individually, with signal events in the top row and background events in the bottom row. Given that jet constituents are originally ordered by their momentum, the X and Y axis ticks represent the attention values of the jet constituents in descending order (where the zero tick represents the leading transformed jet constituent). The attention map values reveal that the model concentrates on the leading and second-leading jet constituents to identify events as signal-like, particularly evident in attention heads 1, 2, 4 and 5. While the transformer layers intermingle particle and feature tokens, the skip connection still preserves the order of the attention output in relation to the original input data. In fact, this reflects the efficiency of the network in capturing the two-prong structure of the signal events. Conversely, the network assigns high attention to a broad momentum range of jet constituents when identifying the input as a background event. The attention maps for background events exhibit significant agreement with the jet substructure of the background events presented in Fig. 4. In summary, through an analysis of the attention scores from the last transformer layer of the jet constituents, we confirm that the transformer model adeptly extracts the correct multi-prong structure of signal events, while for background events, dominated by QCD processes, the model exhibits high attention across a wide momentum range of jet constituents.
The attention maps of the cross-attention layer illustrate the attention scores between the jet constituents and the reconstructed particles, including the leading and second-leading jets and the heavy Higgs. The dimension of the attention score in the cross-attention layer is (n_heads, 3, 50), where 3 represents the number of reconstructed particles, 50 is the number of jet constituents and n_heads is the number of cross-attention heads, set at 8. Fig. 10 displays the cross-attention maps for signal events (top) and background events (bottom), averaged over the used cross-attention heads.
The cross-attention maps for signal events exhibit a stronger correlation between the highest-momentum jet constituents and the heavy Higgs. In contrast, for background events, the heavy Higgs displays a flat attention pattern across jet constituents of different momenta. Indeed, the results from the cross-attention maps, along with the cross-attention output shown in Fig. 7, provide a comprehensive overview of the impact of the cross-attention layer. This layer effectively assigns information from the jet constituents to the kinematics of the reconstructed particles, enhancing the classification performance.

Grad-CAM
Grad-CAM is a technique designed to visualize and interpret the decisions made by DNN models. It builds upon the idea of class activation maps (CAMs) [95,96] but extends it to models with arbitrary architectures. The primary objective of Grad-CAM is to highlight the important regions in a transformed input feature space, here the η − ϕ plane, that contribute to the prediction of a specific class [97].
Let $F^k(\eta, \phi)$ represent the activation of the $k$th feature map of the final transformer layer. The gradient of the predicted class score $Y^c$ with respect to the activation output is computed as
$$ \frac{\partial Y^c}{\partial F^k(\eta, \phi)}. $$
This gradient is then globally averaged to obtain the importance weights,
$$ \alpha_k^c = \frac{1}{Z} \sum_{\eta, \phi} \frac{\partial Y^c}{\partial F^k(\eta, \phi)}, $$
where $Z$ is the size of the feature activations and the sum runs over the jet constituents; here $\eta$, $\phi$ and $p_T$ are the transformed features. The final Grad-CAM heatmap is the weighted sum of the activation maps,
$$ L^c_{\rm Grad-CAM} = \sum_k \alpha_k^c\, F^k(\eta, \phi). $$
This heatmap highlights the regions of the input that contribute the most to the prediction of the target class.
In general, Grad-CAM operates by utilizing the gradient information flowing into the final transformer layer in the following way. During the forward pass, the neural network processes the input particle cloud and the activations of the final transformer layer are obtained. The gradients of the predicted class score with respect to the final transformer layer activations are computed during the backward pass. These gradients are then used to calculate the importance of the activation map; the importance scores are essentially the weights assigned to each spatial location of the final transformer layer. The weighted sum of the particle tokens is computed, creating the Grad-CAM map, which highlights the regions that contributed the most to the final prediction. Additionally, upsampling is often employed to match the Grad-CAM dimensions with the original input features.
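A minimal TensorFlow sketch of this procedure, adapted to a token sequence rather than an image, is shown below. The layer name is a hypothetical placeholder for the Add() layer of the trained model, and the final ReLU follows the original Grad-CAM proposal.

```python
import tensorflow as tf

def grad_cam(model, inputs, layer_name, class_index=1):
    """Grad-CAM for a transformer stream: weight the activations of the chosen layer
    by the globally averaged gradient of the class score (layer_name is hypothetical)."""
    sub = tf.keras.Model(model.inputs, [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        acts, preds = sub(inputs)               # acts: (batch, tokens, features)
        score = preds[:, class_index]           # predicted class score Y^c
    grads = tape.gradient(score, acts)          # dY^c / dF
    weights = tf.reduce_mean(grads, axis=1, keepdims=True)    # global-average importance alpha
    cam = tf.nn.relu(tf.reduce_sum(weights * acts, axis=-1))  # weighted sum over feature maps
    return cam                                  # one heat value per particle token

# heat = grad_cam(model, [jet1_test, jet2_test, kin_test], layer_name="add")
# The per-token heat values are then scattered onto the (eta', phi') positions of the
# constituents to obtain a detector-plane heat map like the one in Fig. 11.
```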
To visualize the geometrical interpretation of the information learned from the jet constituents, we utilize Grad-CAM on the final self-attention layer of the jets, specifically the Add() layer depicted in Fig. 1. The results are shown in Fig. 11 for 5000 test events. The left panel illustrates the pT distribution of the events predicted as signal (top) or background (bottom). Signal events are considered for the benchmark point with $m_H$ = 1 TeV. The right panel displays the heat map of the Grad-CAM output for the predicted signal (top) and the predicted background (bottom).
The visualization of the heat map clarifies that the transformer model focuses on the two-prong structure to classify an input event as signal, while it relies on the soft-radiation pattern to classify an input event as background. Interestingly, the result also highlights that the model focuses on the positive η direction when making predictions, which is due to the flipping transformation performed in the pre-processing step.
While Grad-CAM has the power to explain the regions in the feature space considered for the network predictions, one of its drawbacks is that it relies on gradient information from the final transformer layer. In cases where global context is crucial for decision-making, Grad-CAM may not capture long-range dependencies effectively. Moreover, Grad-CAM might be sensitive to small changes in the input, potentially making it less robust in the presence of adversarial examples.

Figure 11: Grad-CAM results for 5000 test events of the transformer model with cross-attention. Left: pT distribution of the jet constituents in the η − ϕ plane when events are predicted as signal events (top) and background events (bottom). Note that the asymmetric pattern is due to the flipping transformation in the pre-processing steps, in which all constituents with larger momentum are reflected in the positive η direction. Right: heat map of the Grad-CAM results in the η − ϕ plane.

Conclusion
In conclusion, this paper introduces an innovative method for enhancing event classification by effectively incorporating information from both the global kinematics and the substructure of jets in an event. Conventional approaches, using simple concatenation to combine the event information, have limitations, especially in scenarios where kinematical structures dominate.
Specifically, the proposed method utilizes a transformer encoder with cross-attention layers, enabling the extraction of different-scale information from both global kinematics and jet substructure. The results demonstrate a substantial improvement in classification performance compared to traditional concatenation methods. Indeed, the analysis of the learned information, conducted through attention maps and a Grad-CAM algorithm for visual representation, provides valuable insights into the model's focus on the important particles and geometric regions in the transformed η − ϕ plane that are crucial for event classification.
We have validated this approach by focusing on the dominant decay channel, i.e., into four b-jets, of SM-like Higgs boson pairs produced in the resonant decay of a heavier CP-even Higgs state at the HL-LHC. This challenging scenario involves the merging of would-be slim b-jets into a fat jet, due to the boosted nature of the lighter Higgs states, so that the possibility of accessing the partonic dynamics is apparently lost at the detector level. Furthermore, this occurs in an environment rich in tracks and calorimetric information not directly pertaining to the hard scattering sought, as typical of this CERN machine upgrade. All these aspects add complexity to the classification task. Despite these challenges, the proposed method effectively addresses the intricacies of the final state in the detectors, ultimately outperforming mainstream signal selection procedures, whether based solely on kinematical analysis or on less advanced ML tools. In the broader context, this research contributes to utilizing advanced jet identification techniques for global event reconstruction, towards the understanding of collision events consisting of dynamics acting at various physics scales. Thus, the proposed method offers a promising avenue for improving the accuracy and efficiency of event classification in the potentially many more complex scenarios encountered in high energy physics experiments.
The output of the transformer layer undergoes a normalization layer and has the same dimension as the input dataset. The normalized output from each transformer layer is combined through an addition layer. This output then passes through a Multi-Layer Perceptron (MLP) comprising two fully connected layers with dimensions 128 and 64, employing the GELU activation function. Following each fully connected layer, a dropout layer with a dropout rate of 20% is applied. The output is then passed to the output layer for classification. The model is trained for 30 epochs with a batch size of 500, in 1421 seconds.
• A single-stream self-attention transformer encoder is employed for kinematics analysis.
The network exclusively utilizes the kinematics dataset as input. To achieve this, we adopt the identical structure of the self-attention transformer encoder designed for jet substructure, but with a single stream. The model is trained for 30 epochs with a batch size of 500, in 1390 seconds.
• A three-stream transformer encoder is employed to analyze the leading and subleading jet constituents and the reconstructed kinematics. In this approach, we adopt the transformer layers for the leading and subleading jets from the first network, while the transformer layers for the kinematics are adapted from the latter network. The output of the self-attention transformer encoder layers for the jet constituents is added via an addition layer. The resulting output from the addition layer, along with the output from the self-attention transformer layers of the kinematics, is then fed to a cross-attention transformer layer. This cross-attention transformer layer is repeated twice, and the output has the same dimensions as the input kinematics dataset, i.e., (3, 6). Subsequently, this output passes through an MLP consisting of two fully connected layers with dimensions 128 and 64, utilizing the GELU activation function. After each fully connected layer, a dropout layer with a dropout rate of 20% is applied. The resulting output is then forwarded to the output layer for classification. The model is trained for 30 epochs with a batch size of 500, in 1576 seconds.
• The final network is configured to mirror the three-stream transformer encoder, with the only modification being the substitution of the cross-attention transformer layers with a single concatenation layer. The model is trained for 30 epochs with a batch size of 500, in 1282 seconds.
For the training of all models, we use two NVIDIA RTX 6000 GPU cards with the TensorFlow mirrored strategy, with a utilization of 80% and 30% for the first and second cards, respectively, and a memory consumption of 96% (48 GB) of both cards.

Figure 1 :
Figure 1: Structure of the transformer model used. Here, P_j1 and P_j2 are the numbers of leading and second-leading jet constituents, while the P_m's are the reconstructed particles, j_1, j_2 and H. Also, MHSA stands for multi-head self-attention layers and MHCA stands for multi-head cross-attention layers. Finally, the N_i's are the numbers of transformer encoders used. The transformer layers are stacked and work sequentially, as indicated by the black arrow.

Figure 2 :
Figure 2: Feynman diagram for the signal process.

Figure 3 :
Figure 3: Left: distributions for the averaged constituents of 10000 leading jets. Right: distributions for the averaged constituents of 10000 second-leading jets. Signal distributions are for the BP with m_H = 1 TeV.

Figure 4 :
Figure 4: For illustration purposes, we show the accumulated average of 50000 pT distributions of the leading (second-leading) fat jet contents in the upper row (lower row) after the pre-processing steps, for both signal and backgrounds. The signal events (left) are simulated for the BP with m_H = 1 TeV and shown against the yield of the bbbb (center) and tt (right) background events. Here, the X and Y ticks indicate the bins in the η and ϕ directions.

Figure 5 :
Figure 5: Kinematic distributions of 10000 events for the signal BP with m_H = 1 TeV and the corresponding backgrounds after applying the pre-selection cuts.

Figure 6 :
Figure 6: Left: The Receiver Operating Characteristic (ROC) curves for the four networks for the signal BP with m_H = 1 TeV. Right: 95% upper limit on the total cross section for the process gg → H → hh (having factored out the SM-like $h \to b\bar b$ decays) at the HL-LHC with an integrated luminosity of 3000 fb−1 for the different ML analyses. The band for each curve represents the upper and lower values for 5 independent trainings with different random number seeds, and the middle line represents the central values. The ATLAS limits are extracted from the latest analysis in [44] and linearly scaled to the integrated luminosity of 3000 fb−1.

Figure 7 :
Figure 7: Top: output of the self-attention layer when trained on kinematics only. Bottom: output of the cross-attention layer when trained on kinematics and jet information. The attention output has dimensions of (reconstructed particles × features); for both plots we use 10000 test events and average over the features, for the background and the signal point with m_H = 1 TeV. J_1, J_2 and H represent the transformed particles as in Equation 4.

Figure 8 :
Figure 8: Left: The ROC curve and error band of the full model using the θ input (blue) and of the model without the θ input (red). The band indicates the maximum and minimum of the 5 independent trainings. The ROC is obtained using 20,000 signal and background testing events. The error is estimated as in Fig. 6. Middle (Right): signal efficiency as a function of m_J1 (p_T of J_1) for the best training results. The ratio is calculated with the score cut at 80% signal efficiency for 20,000 signal samples. The efficiency with (without) θ is shown by blue (red) bars, indicating statistical errors without taking into account training errors. The acceptance of the full model is higher than that without the θ input at m_J1 ∼ m_h and p_J1 ∼ m_H/2.

Figure 10 :
Figure 10: Cross-attention maps of the cross-attention transformer layer averaged over the 8 cross-attention heads, which processes the jet substructure and the event kinematics, for the signal (top) and backgrounds (bottom) for 120K test events. The X-axis shows the attention scores for the first 20 transformed jet constituents, while the Y-axis shows the attention scores for the transformed reconstructed final state particles.

Table 1 :
Input parameters for our four BPs. The last column shows the production cross section for the process gg → H → hh.

Table 2 :
Area Under the ROC (AUC) for the networks using Kinematics or Kinematics + Jet structure information with/without θ.