Detecting an axion-like particle with machine learning at the LHC

Axion-like particles (ALPs) appear in various new physics models with spontaneous global symmetry breaking. When the ALP mass is in the range of MeV to GeV, the cosmology and astrophysics bounds are so far quite weak. In this work, we investigate such light ALPs through the ALP-strahlung production processes $pp \to W^\pm a, Z a$ with the sequential decay $a \to \gamma\gamma$ at the 14 TeV LHC with an integrated luminosity of 3000 fb$^{-1}$ (HL-LHC). Building on the concept of jet image which uses calorimeter towers as the pixels of the image and measures a jet as an image, we investigate the potential of machine learning techniques based on convolutional neural network (CNN) to identify the highly boosted ALPs which decay to a pair of highly collimated photons. With the CNN tagging algorithm, we demonstrate that our approach can extend current LHC sensitivity and probe the ALP mass range from 0.3~GeV to 5~GeV. The obtained bounds are stronger than the existing limits on the ALP-photon coupling.


Introduction
Many extensions of the Standard Model (SM) predict the existence of light pseudoscalars, the so-called axion-like particles (ALPs). In general, they are predicted in any model with spontaneous breaking of a global $U(1)$ symmetry [1][2][3][4], and they also appear in supersymmetry (SUSY) with dynamical SUSY breaking [5] or spontaneous R-symmetry breaking [6], as well as in compactifications of string theory [7][8][9]. Besides, they may play a crucial role in solving the hierarchy problem [10] and be related to the electroweak phase transition [11]. ALPs have been searched for at the Large Hadron Collider (LHC), and their phenomenology is being studied extensively.
Of course, for phenomenological studies, the ALP mass and its couplings to SM particles are the most important parameters. A considerable region of this parameter space has already been probed by cosmological observations, low-energy experiments and high-energy colliders [12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]:
(i) For an ALP with mass $m_a \gtrsim 5$ GeV, the LEP and current LHC experiments have good sensitivity. For instance, the $e^+e^- \to \gamma a$ ($a \to \gamma\gamma$) and $Z \to a\gamma$ processes at LEP [28] have been exploited to search for ALPs. The $\gamma\gamma \to a \to \gamma\gamma$ process has been searched for in electromagnetic PbPb collisions at the LHC by CMS and ATLAS [29]. The rare Higgs boson decay channels $h \to Za$ ($a \to \gamma\gamma$) and $h \to a(\to \gamma\gamma)\,a(\to \gamma\gamma)$ at the LHC [30] have also been used to probe the ALP-photon coupling $g_{a\gamma\gamma}$ versus the ALP mass $m_a$.
(ii) For an ALP with mass below the MeV scale, cosmological and astrophysical observations [31,32], such as Big Bang Nucleosynthesis (BBN), the Cosmic Microwave Background (CMB) and Supernova 1987A, have already produced many constraints on the ALP couplings. In addition, such light ALPs can be cold dark matter (DM) [33][34][35] and may be probed through various astrophysical and terrestrial anomalies [36], including the unexpected X-ray emission line around 3.5 keV [37] and the excess of electronic recoil events in XENON1T [38].
(iii) An ALP in the mass range of MeV to GeV may make sizable contributions to low-energy observables in particle physics, and many searches at the intensity frontier [39][40][41][42] have been performed recently, such as searches for lepton flavor violating decays [43], rare meson decays [39,40,44,45] and ALP production in beam dump experiments [46]. Besides, such an ALP has been proposed to explain the muon anomalous magnetic moment [41,47] and may also offer a plausible explanation for the KOTO anomaly [48].
(iv) For an ALP mass between 0.1 GeV and 10 GeV, Belle II [49] recently searched for the process $e^+e^- \to \gamma a$, $a \to \gamma\gamma$ in the mass range 0.2 GeV $< m_a <$ 9.7 GeV using data corresponding to an integrated luminosity of $(445 \pm 3)$ pb$^{-1}$.
Note that a rather light ALP can be highly boosted, so the two photons from its decay can be highly collimated and lead to an interesting "diphoton-jet" signature at the LHC. Distinguishing such diphoton-jets from the overwhelming QCD-jets and single photons is crucial for the ALP search at the LHC. In the literature there are some studies of these diphoton-jets that exploit the jet substructure [16][50][51][52][53]. We will concentrate on an ALP that couples only to the electroweak gauge bosons, with a mass ranging from a few hundred MeV to 10 GeV. Such a light ALP with couplings to gauge bosons has attracted much attention recently. Different from the existing studies [30,54], we neglect the ALP-gluon coupling, so that the dominant production channels at the LHC are the ALP-strahlung processes $pp \to Va$ ($V = W, Z$), as shown in Fig. 1.
Besides the conventional analysis methods, machine learning techniques have been used to search for new physics [55][56][57][58][59][60][61][62][63]. In our work we exploit low-level jet information directly by utilizing machine learning (ML) and computer vision (CV) techniques, rather than using physics-inspired features, in order not only to improve the discrimination power but also to gain new insight into the underlying physical processes. A convolutional neural network (CNN) is designed to analyze the "diphoton-jet" events from ALP decays, based on the notion of the jet image, which was first introduced in Ref. [64] and then studied in Refs. [65][66][67][68][69][70][71][72][73][74][75][76][77][78]. In this approach, the calorimeter is regarded as a camera and the jets are represented as images in which the pixel intensities are the energy depositions of the particles within the jet. It was found in the literature that a CNN can learn rich high-level abstract features of jet images and greatly enhance the discrimination power. Based on the diphoton-jet tagging, we further use another simple neural network (NN) to obtain the optimal detection significance for the two ALP-strahlung processes in our study.
Figure 1. Feynman diagrams of the ALP-strahlung production processes $pp \to W^\pm a$ and $pp \to Za$ with $a \to \gamma\gamma$ at the LHC.

This paper is organized as follows. In Section II, we describe the effective Lagrangian of the ALP interactions and the simulation details. In Section III, we first introduce the jet-image pre-processing techniques, then design a CNN architecture to identify the diphoton-jet events, and also build an NN to optimize the detection significance. In Section IV, we present and discuss the numerical results. Finally, we draw our conclusions in Section V.

Model
We consider an effective Lagrangian consisting of ALP interactions with electroweak gauge bosons up to dimension 5, given by Eq. (2.1) [21], where $a$ represents the ALP field, $f_a$ is the ALP decay constant, which is set to 1 TeV$^{-1}$ in this paper, and $V_{\mu\nu}$ and $\tilde{V}_{\mu\nu}$ represent the field strength of an SM gauge boson and its dual, defined as $V_{\mu\nu} \equiv \partial_\mu V_\nu - \partial_\nu V_\mu$ and $\tilde{V}_{\mu\nu} \equiv \epsilon_{\mu\nu\rho\sigma} V^{\rho\sigma}$, with $W_{\mu\nu}$ and $B_{\mu\nu}$ being the field strengths of $SU(2)_L$ and $U(1)_Y$, respectively.
The third and fourth terms in Eq. (2.1) induce the following dimensionful couplings, $g_{a\gamma\gamma}$, $g_{aWW}$, $g_{aZZ}$ and $g_{a\gamma Z}$, which control the strength of the ALP interactions with the gauge bosons; in their expressions, $\theta_W$ is the Weinberg angle. For simplicity, we set $C_{WW} = C_{BB}$ in our study.
In this work, we consider two ALP-strahlung production processes, $pp \to W^\pm(\to \ell^\pm \nu)\,a(\to \gamma\gamma)$ and $pp \to Z(\to \ell^+\ell^-)\,a(\to \gamma\gamma)$, as our signals, in which $\ell^\pm$ denotes $e^\pm$ or $\mu^\pm$. The SM backgrounds of the $Za$ signal are the $Z\gamma$ and $Zj$ processes, while the SM backgrounds of the $W^\pm a$ signal are mainly the $W^\pm\gamma$ and $W^\pm j$ processes ($j$ stands for a light jet). Besides, QCD di-jet events can also be an important SM background whenever one jet fragments mostly into electromagnetic energy and the other jet fragments into a final state with an energetic isolated lepton. Although the jet fake rate is quite small, the large cross section of QCD di-jet production makes its contribution non-negligible.

In the LHC experiments, when $m_a \gtrsim 10$ GeV, the ALP decays into two well-separated photons that are identified as $2\gamma$ events by the detector. On the other hand, if the ALP is lighter than a few hundred MeV, it decays into a highly collimated photon pair that deposits energy in the electromagnetic calorimeter like a single photon. Between these two mass bounds, the two photons are seen as a "diphoton-jet". In this case, we can use calorimeter towers as the pixels of an image and measure a jet as an image. We then use machine learning techniques based on convolutional neural networks to discriminate diphoton-jets from single photons and QCD-jets.

Simulation details
For Monte Carlo signal simulations, we implement the effective Lagrangian of Eq. (2.1) in FeynRules [79] to generate the corresponding UFO model file. The parton-level signal and background events are generated with MadGraph5_aMC@NLO [80] at leading order. The following cuts are applied in the parton-level event generation: $p_T^j > 20$ GeV and $|\eta_j| < 5$. We perform the parton shower and fast detector simulation with Pythia 8.243 [81] and Delphes 3.4.3 [82]. FastJet 3.3.2 [83] is used for jet clustering. The NNPDF2.3QED parton distribution function (PDF) set [84] is used in our calculations. The scale uncertainty is determined through independent restricted variation of the factorization scale $\mu_F$ and the renormalization scale $\mu_R$. We use the built-in systematics tool in MadGraph5_aMC@NLO to evaluate the PDF and scale uncertainties. The cross sections of the backgrounds are given in Table 1.

In Delphes, stable hadrons such as neutrons and charged pions are assumed to deposit all their energy in the hadron calorimeter (HCAL). Short-lived particles such as neutral pions, which decay into a pair of photons, are assumed to deposit all their energy in the electromagnetic calorimeter (ECAL). Long-lived particles such as kaons and $\Lambda$ with $c\tau$ smaller than 10 mm are assumed to share their energy deposit between the ECAL and HCAL with default fractions f ECAL =0. Based on the energy-flow algorithm [85], the EflowPhotons, EflowNeutralHadrons and ChargedHadrons, which are composed of deposits in the ECAL and HCAL, are clustered into jets using the anti-$k_t$ algorithm [86] with $R_j = 0.4$. Only the leading jet in each event is retained for further analysis, and it is required to have $p_T > 50$ GeV and $|\eta| < 2.5$.

Besides, we also include pile-up events to perform a pile-up robust analysis. The numerous soft QCD pile-up events are generated with Pythia8 and then simulated by Delphes. In the CMS card, we set the average number of pile-up events per bunch crossing to 40. We take the default parametrization implemented in the CMS card to distribute the hard scattering events and pile-up events randomly in time and $z$ position.

The ground truth label of the leading jets in signal events is "diphoton-jet". The leading jets in the $W^\pm\gamma$ and $Z\gamma$ background events are labelled as "single photon", while the leading jets in the other background events are labelled as "QCD-jet".
In terms of the effective ALP-photon coupling $g_{a\gamma\gamma}$, the decay width of the ALP is given by
$$\Gamma_a = \frac{g_{a\gamma\gamma}^2\, m_a^3}{64\pi}.$$
Then the decay length in the limit $E_a \gg m_a$ is approximated by [46]
$$l_a \approx \frac{E_a}{m_a}\,\frac{1}{\Gamma_a} = \frac{64\pi\, E_a}{g_{a\gamma\gamma}^2\, m_a^4}.$$
Since the typical transverse momentum of our ALP is in the range of 50 GeV to 1 TeV, we have a decay length of 0.1 mm $< l_a <$ 4.9 mm for $m_a = 0.3$ GeV and $g_{a\gamma\gamma} = 1$ TeV$^{-1}$. On the other hand, if the ALP is very light, the ALP produced at the LHC can be long-lived and may either escape detection (missing energy) or decay away from the primary vertex. In that case, other methods have to be used to observe such a light ALP signature [87].
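The decay-length estimate above can be checked numerically. The following is a minimal sketch, assuming the standard two-photon width $\Gamma_a = g_{a\gamma\gamma}^2 m_a^3/(64\pi)$ and approximating the boost by $E_a/m_a$; the function names are illustrative and not taken from the analysis code.

```python
import math

# hbar*c in GeV*mm, used to convert a width in GeV into a proper decay length
HBAR_C_GEV_MM = 1.973269804e-13


def alp_width_gev(m_a_gev, g_agg_per_tev):
    """Assumed two-photon width: Gamma = g^2 m^3 / (64 pi), with g in TeV^-1."""
    g_per_gev = g_agg_per_tev * 1e-3  # convert TeV^-1 to GeV^-1
    return g_per_gev**2 * m_a_gev**3 / (64.0 * math.pi)


def alp_decay_length_mm(e_a_gev, m_a_gev, g_agg_per_tev):
    """Lab-frame decay length for E_a >> m_a: (E_a / m_a) * hbar*c / Gamma."""
    boost = e_a_gev / m_a_gev
    return boost * HBAR_C_GEV_MM / alp_width_gev(m_a_gev, g_agg_per_tev)


# Benchmark from the text: m_a = 0.3 GeV, g = 1 TeV^-1, E_a = 1 TeV
length_at_1tev_mm = alp_decay_length_mm(1000.0, 0.3, 1.0)
```

Under these assumptions, an ALP energy of 1 TeV gives a decay length of about 4.9 mm, in line with the upper end of the range quoted above.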
In order to discriminate a diphoton-jet from a single photon and a QCD-jet, we reorganize the information on the jet constituents provided by the ECAL and HCAL into digital images, the so-called jet images. In this work, the jet images have a resolution of $40 \times 40$ pixels. They cover the area $[-0.4, 0.4] \times [-0.4, 0.4]$ in the $\eta \times \phi$ plane, centered on the reconstructed jet axis. Thus, each pixel corresponds to $\Delta\eta \times \Delta\phi = 0.02 \times 0.02$, which matches the simulated CMS electromagnetic calorimeter granularity. As illustrated in Fig. 6, for each pixel we sum the transverse momentum separately for all the particles, for the charged hadrons, for the photons and for the neutral hadrons falling into the pixel, and use these sums as the pixel intensities in four separate image channels. Therefore, each image channel is one jet observable.
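The pixelation described above can be sketched with a weighted two-dimensional histogram. This is a minimal illustration, assuming constituents are given as coordinates relative to the jet axis with a transverse momentum and a particle-type label; the labels in `KINDS` and the function name are hypothetical.

```python
import numpy as np

N_PIX = 40          # pixels per side
HALF_WIDTH = 0.4    # image covers [-0.4, 0.4] in both eta and phi
KINDS = ("charged", "photon", "neutral")  # hypothetical constituent labels


def make_jet_image(deta, dphi, pt, kind):
    """Return a (4, 40, 40) image: all particles, charged hadrons, photons, neutral hadrons."""
    deta, dphi, pt = (np.asarray(x, dtype=float) for x in (deta, dphi, pt))
    kind = np.asarray(kind)
    window = [[-HALF_WIDTH, HALF_WIDTH], [-HALF_WIDTH, HALF_WIDTH]]
    masks = [np.ones(pt.shape, dtype=bool)] + [kind == k for k in KINDS]
    image = np.zeros((4, N_PIX, N_PIX))
    for channel, mask in enumerate(masks):
        # Sum the pT of the selected constituents falling into each 0.02 x 0.02 pixel
        image[channel], _, _ = np.histogram2d(
            deta[mask], dphi[mask], bins=N_PIX, range=window, weights=pt[mask]
        )
    return image


example_image = make_jet_image(
    deta=[0.01, -0.03, 0.11, -0.21],
    dphi=[0.01, 0.03, -0.21, 0.31],
    pt=[60.0, 45.0, 5.0, 2.0],
    kind=["photon", "photon", "charged", "neutral"],
)
```

The first channel is, by construction, the sum of the other three when every constituent carries one of the three labels.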

Jet image pre-processing
To make the CNN learn highly discriminative features between signal and background events, we pre-process all the jet images by translation, rotation and normalization. Note that since the transverse momentum $p_T$ is invariant under longitudinal boosts, the pixel intensities in the four channels, which are given by the sum of the transverse momentum of all the particles, the charged hadrons, the photons and the neutral hadrons inside the jet, respectively, are invariant under translation and rotation in $\eta$ and $\phi$.
• Translation: First of all, we define a new coordinate system in the $\eta$-$\phi$ plane centered on the jet. The coordinates of each particle $k$ inside the jet in the new coordinate system are given by
$$\eta'_k = \eta_k - \eta_0, \qquad \phi'_k = \phi_k - \phi_0,$$
where $(\eta_0, \phi_0)$ are the jet coordinates in the frame with the interaction point of the $pp$ collision as the origin.
• Rotation: The second step of the pre-processing is rotating the jet image around the image center. We define the "barycenter" of a jet image as
$$\eta_c = \frac{1}{p_T}\sum_k \eta'_k\, p_{Tk}, \qquad \phi_c = \frac{1}{p_T}\sum_k \phi'_k\, p_{Tk},$$
which is the $p_T$-weighted average position of all the constituents inside the jet, where $p_T = \sum_k p_{Tk}$ is the total transverse momentum of the jet and $p_{Tk}$ is the transverse momentum of the $k$-th particle inside the jet. Then, we rotate the whole jet image such that the barycenter of the jet is at the 12 o'clock position.
• Normalization: The last step is normalization. In order to make the neural network learn more efficiently, we scale the pixel intensity of each channel to [0,1] by dividing by a constant. In this way, the dependence of the jet images on the absolute energy scale is removed, which allows for comparisons of jet images with different energies. We take different normalization constants for different channels. Most of the pixel intensity of the diphoton-jet images and single photon images is carried by the pair of collimated photons and by the single photon, respectively, with transverse momenta of tens of GeV. Therefore, for the channel given by the sum of the transverse momentum of all the particles and the channel given by the sum of the transverse momentum of all the photons, the normalization constant is set to 100. The charged hadron and neutral hadron distributions in all the jet image samples are diffuse, and the transverse momenta of most of these particles are less than 10 GeV. Therefore, for the channels given by the sum of the transverse momentum of all the charged hadrons or of all the neutral hadrons, the normalization constant is set to 10.
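The three pre-processing steps can be sketched as follows. This is an illustration under stated assumptions: the "12 o'clock position" is taken to mean the positive $\phi'$ axis with $\eta' = 0$, the azimuthal difference is wrapped into $[-\pi, \pi)$, and the channel order is (all particles, charged hadrons, photons, neutral hadrons). The function names are hypothetical.

```python
import numpy as np

# Normalization constants per channel (all, charged hadrons, photons, neutral hadrons)
NORM_CONSTANTS = np.array([100.0, 10.0, 100.0, 10.0])


def translate_and_rotate(eta, phi, pt, eta0, phi0):
    """Center constituents on the jet axis and rotate the pT barycenter onto the +phi axis."""
    eta, phi, pt = (np.asarray(x, dtype=float) for x in (eta, phi, pt))
    # Translation: coordinates relative to the jet axis (phi wrapped to [-pi, pi))
    deta = eta - eta0
    dphi = (phi - phi0 + np.pi) % (2.0 * np.pi) - np.pi
    # Barycenter: pT-weighted average position of the constituents
    eta_c = np.sum(deta * pt) / np.sum(pt)
    phi_c = np.sum(dphi * pt) / np.sum(pt)
    # Rotation angle of the barycenter measured from the +phi axis (assumed 12 o'clock)
    alpha = np.arctan2(eta_c, phi_c)
    cos_a, sin_a = np.cos(alpha), np.sin(alpha)
    eta_rot = deta * cos_a - dphi * sin_a
    phi_rot = deta * sin_a + dphi * cos_a
    return eta_rot, phi_rot


def normalize_channels(image):
    """Divide each of the four channels of a (4, H, W) image by its constant."""
    return np.asarray(image, dtype=float) / NORM_CONSTANTS[:, None, None]


pt_example = np.array([60.0, 45.0, 5.0])
eta_rot, phi_rot = translate_and_rotate(
    eta=[1.02, 0.95, 1.10], phi=[0.50, 0.52, 0.30], pt=pt_example, eta0=1.0, phi0=0.5
)
```

Because the rotation is linear, the barycenter of the rotated constituents has $\eta' = 0$ and $\phi' \geq 0$ by construction.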

Jet-tagging neural network
An illustration of the architecture of the jet-tagging CNN is shown in Fig. 2. The input is a jet image of $40 \times 40$ pixels with four channels. Following the input layer, a normal convolutional layer and a depthwise convolutional layer are designed to extract a pixel-level feature map of the jet image. The kernel size of the normal convolutional layer is $5 \times 5$ and the number of kernels is 32. The inner structure of the depthwise convolutional layer [88] is given in Fig. 5, where a depthwise convolution with a $5 \times 5$ kernel is first applied to each channel of the input feature map, and then a $1 \times 1$ pointwise convolution is applied to aggregate the features along the channel direction. Besides, the leaky ReLU [89] is adopted as the activation after both the normal convolutional layer and the depthwise convolutional layer.

Then, we add an attention block to weight the feature map, as shown in Fig. 3(a), so as to make the CNN concentrate on the significant features distinguishing signals from backgrounds. First, the feature map obtained from the previous layer is taken as the input of an efficient channel attention (ECA) module [90]. As shown in Fig. 3(b), the global average pooling independently averages all the pixel values of each feature channel, with each average value representing the corresponding channel, and results in a feature vector whose dimension equals the number of input channels, which is 32 in our model. Then, a one-dimensional convolution of kernel size 5 is adopted to aggregate the local information in the feature vector, and the result is activated by a sigmoid function to generate the weight of each feature channel. Finally, the input feature map is weighted by multiplying each feature channel by its corresponding weight. As a result, the ECA module suppresses less important feature channels. In short, the ECA attention is computed as
$$\omega_m = \sigma\!\left(\sum_{j=1}^{k} w^j\, y_m^j\right), \qquad y_m^j \in \Omega_m^k,$$
where $\Omega_m^k$ indicates the set of $k$ adjacent channels of $y_m$ and $\sigma$ denotes the sigmoid function. The weight of the $m$-th channel is calculated by considering the local cross-channel interaction between $y_m$ and its $k$ neighbors. In this way, the ECA module first captures the local information exchanged between each channel and its $k$ neighbors. Then the global cross-channel interaction is captured by a fast one-dimensional convolution of size $k$. The size $k$ represents the coverage of the local cross-channel interaction, i.e., how many neighbors participate in the attention prediction of one channel. In order to avoid manual tuning of $k$ via cross-validation, $k$ is determined adaptively.

Secondly, the weighted feature map obtained from the ECA module is further fed into a spatial attention module [91]. As shown in Fig. 3(c), both a global max pooling and a global average pooling are applied to the input feature map along the channel direction to generate two aggregated feature channels. Then, the spatial weights are generated by applying a convolution of kernel size $3 \times 3$ and a sigmoid activation. Finally, the input feature map is multiplied pixel-wise by the spatial weights. In short, the spatial attention is computed as
$$M_s(F) = \sigma\!\left(f^{3\times3}\!\left(\left[F^s_{\rm avg};\, F^s_{\rm max}\right]\right)\right),$$
where $\sigma$ denotes the sigmoid function and $f^{3\times3}$ represents a convolution operation with a filter size of $3 \times 3$. $F^s_{\rm avg} \in \mathbb{R}^{1\times40\times40}$ and $F^s_{\rm max} \in \mathbb{R}^{1\times40\times40}$ represent the two 2D maps generated by the two pooling operations.
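The arithmetic of the two attention steps can be illustrated with a short NumPy sketch. It is not the PyTorch implementation used in this work: the convolution weights are passed in explicitly instead of being learned, zero padding is assumed at the borders, and all function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def eca_attention(feature_map, kernel_1d):
    """Channel attention: global average pool -> 1D conv across channels -> sigmoid weights."""
    pooled = feature_map.mean(axis=(1, 2))                   # one value per channel, shape (C,)
    mixed = np.correlate(pooled, kernel_1d, mode="same")     # local cross-channel interaction
    weights = sigmoid(mixed)                                 # shape (C,)
    return feature_map * weights[:, None, None]


def conv3x3_same(maps, kernel):
    """3x3 convolution with zero padding; maps has shape (2, H, W), kernel (2, 3, 3)."""
    height, width = maps.shape[1:]
    padded = np.pad(maps, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((height, width))
    for c in range(maps.shape[0]):
        for i in range(3):
            for j in range(3):
                out += kernel[c, i, j] * padded[c, i:i + height, j:j + width]
    return out


def spatial_attention(feature_map, kernel):
    """Spatial attention: channel-wise avg and max maps -> 3x3 conv -> sigmoid mask."""
    pooled = np.stack([feature_map.mean(axis=0), feature_map.max(axis=0)])  # (2, H, W)
    mask = sigmoid(conv3x3_same(pooled, kernel))                            # (H, W)
    return feature_map * mask[None, :, :]


rng = np.random.default_rng(0)
features = rng.normal(size=(32, 40, 40))          # 32 channels of 40 x 40 pixels
attended = spatial_attention(
    eca_attention(features, rng.normal(size=5)),  # 1D kernel of size 5
    rng.normal(size=(2, 3, 3)),
)
```

With all-zero kernels every sigmoid evaluates to 0.5, so each module simply halves its input, which gives a quick sanity check of the shapes and broadcasting.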
Next, we apply a stack of $2 \times 2$ max-pooling layers and $5 \times 5$ depthwise convolutional layers. All the feature maps have 32 channels and are activated by a leaky ReLU.
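One motivation for depthwise (separable) convolutions is their small parameter count. The following sketch compares the number of weights, ignoring biases, of a normal $5 \times 5$ convolution and of a depthwise $5 \times 5$ convolution followed by a $1 \times 1$ pointwise convolution, both mapping 32 channels to 32 channels. The helper names are illustrative.

```python
def conv_weights(kernel_size, c_in, c_out):
    """Weights of a normal convolution: one k x k x c_in filter per output channel."""
    return kernel_size * kernel_size * c_in * c_out


def depthwise_separable_weights(kernel_size, c_in, c_out):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = kernel_size * kernel_size * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out                       # 1 x 1 mixing across channels
    return depthwise + pointwise


normal_count = conv_weights(5, 32, 32)
separable_count = depthwise_separable_weights(5, 32, 32)
```

For these layer sizes the normal convolution has 25600 weights, while the depthwise separable version has 1824, roughly a factor of 14 fewer.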
Afterwards, we flatten the final feature map into a single vector and feed it into a fully connected network with two hidden layers. The hidden layers have 128 and 32 neurons, respectively, with leaky ReLU activations. In the output layer, we apply the SoftMax activation to generate three class probabilities for the diphoton-jet, the single photon and the QCD-jet, respectively. The sum of the three probabilities is one.
To optimize the model parameters of the jet-tagging CNN, we choose the cross-entropy as the loss function. The CNN is trained using the Adam [92] optimizer with a constant learning rate of 0.001, based on the gradients calculated on mini-batches of 64 training examples. The network is trained for up to 100 epochs, and we adopt the early-stopping technique to prevent over-fitting.
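For concreteness, the SoftMax output and the cross-entropy loss for a single jet can be written in a few lines. This is a plain-Python sketch of the standard definitions, not the PyTorch code used for training; the class ordering in `CLASSES` is an assumption made for illustration.

```python
import math

CLASSES = ("diphoton-jet", "single photon", "QCD-jet")  # assumed ordering


def softmax(logits):
    """Convert three raw network outputs into class probabilities that sum to one."""
    shift = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[true_index])


probabilities = softmax([2.0, 0.5, -1.0])
loss_if_diphoton = cross_entropy([2.0, 0.5, -1.0], CLASSES.index("diphoton-jet"))
```

The loss is small when the network assigns a high probability to the true class and grows as that probability decreases.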

Jet selection neural network
The detection significance depends on the event selection efficiency. In order to obtain an optimal detection significance, another simple neural network is built to learn an optimal selection cut on the jet-tagging probabilities. As shown in Fig. 4, it takes the three jet-tagging probabilities obtained from the jet-tagging neural network as input. A hidden layer with 10 neurons and a leaky ReLU activation is used to enhance the non-linearity of the optimized cut. The output layer is activated by a sigmoid function. This network predicts the probability that the jet is selected as a signal.
Figure 4. Architecture of the jet selection neural network, with an input layer, a hidden layer and an output layer.

Because the detection significance is calculated by counting the signal and background events, it cannot be used directly to build the loss function of the optimization. Since the loss function must be differentiable, we first approximate the signal and background event counts using the jet selection probabilities obtained from the neural network for all events. Denoting the jet selection probability for event $i$ as $p_i$, the jet selection efficiency can be estimated by
$$\epsilon_x = \frac{1}{N_x}\sum_{i \in x} p_i, \qquad x \in \{pj,\ p,\ j\},$$
where "pj", "p" and "j" represent the diphoton-jet signal, the single photon background and the QCD-jet background events, respectively, and $N_x$ is the number of events of type $x$. The sum runs over all the corresponding events. Then the signal and background event counts $S$ and $B$ can be approximated by
$$S = \sigma_S\, \epsilon_{pj}\, \mathcal{L}, \qquad B = \left(\sigma_p\, \epsilon_p + \sigma_j\, \epsilon_j\right) \mathcal{L},$$
where $\mathcal{L}$ is the integrated luminosity and $\sigma_S$, $\sigma_p$ and $\sigma_j$ represent the cross sections of the diphoton-jet signal, the single photon background and the QCD-jet background after the basic selection, which will be described in the next section. Then, the simple detection significance $Z$ is expressed in terms of $S$ and $B$. We take $-Z$ as the loss function to optimize the NN model weights. After the optimization, we take 0.5 as the selection threshold to count the number of diphoton-jets, single photons and QCD-jets (those whose selection probabilities are greater than 0.5) and use the Poisson formula to accurately evaluate the detection significance. In our analysis, we use the Adam optimizer with a learning rate of 0.02 to optimize the model weights based on the gradients calculated over 5000 epochs.
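The differentiable surrogate for the event counts can be sketched as follows. The soft efficiencies and counts follow the description above; the significance $Z = S/\sqrt{B}$ used in `neg_significance` is an assumption made only for this illustration, since other simple choices are possible, and the function names and example cross sections are hypothetical.

```python
import math

PB_TO_FB = 1000.0


def soft_efficiency(probabilities):
    """Mean selection probability, a smooth stand-in for the fraction of selected events."""
    return sum(probabilities) / len(probabilities)


def soft_counts(p_signal, p_photon, p_qcd, xs_signal_pb, xs_photon_pb, xs_qcd_pb,
                lumi_fb=3000.0):
    """Approximate S and B from cross sections (pb), soft efficiencies and luminosity (fb^-1)."""
    s = xs_signal_pb * PB_TO_FB * lumi_fb * soft_efficiency(p_signal)
    b = PB_TO_FB * lumi_fb * (
        xs_photon_pb * soft_efficiency(p_photon) + xs_qcd_pb * soft_efficiency(p_qcd)
    )
    return s, b


def neg_significance(s, b):
    """Loss to minimize; ASSUMES the simple form Z = S / sqrt(B)."""
    return -s / math.sqrt(b)


# Toy selection probabilities for a handful of events of each type
s_count, b_count = soft_counts(
    p_signal=[0.9, 0.8, 0.95], p_photon=[0.2, 0.1], p_qcd=[0.01, 0.02, 0.0],
    xs_signal_pb=0.04, xs_photon_pb=2.5, xs_qcd_pb=1.3e5,
)
loss = neg_significance(s_count, b_count)
```

In a framework with automatic differentiation, the same expressions evaluated on tensors give gradients with respect to the network weights, which is what makes the soft counts usable as a training objective.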
Both the CNN and the simple NN were implemented in the deep learning framework PyTorch [93] with GPU acceleration. Xavier-uniform initialization [94] is used to initialize the model weights, and the model biases are initialized to zero.

Analysis and results
For the ALP-strahlung production process $pp \to W^\pm a$, the final state is identified as one lepton and one diphoton-jet. The main SM backgrounds are from QCD di-jet, $W^\pm j$, $W^\pm\gamma$, $t\bar{t}$ and $tj$ production. According to the above analysis, we adopt the following basic selection criteria to select signal events: (i) there is exactly one lepton (electron or muon) with $p_T > 20$ GeV and $|\eta| < 2.5$; (ii) the hardest jet is required to have $p_T > 50$ GeV and $|\eta| < 2.5$ (since our signal contains a hard diphoton-jet).
For the ALP-strahlung process $pp \to Za$, the final state is characterized by an opposite-sign, same-flavor charged lepton pair and a diphoton-jet. The SM backgrounds are dominated by $Z\gamma$ and $Zj$ production. We distinguish the signal from the background by imposing the following basic selection criteria: (i) there are exactly two oppositely charged leptons with $p_T > 20$ GeV and $|\eta| < 2.5$; (ii) the invariant mass of the oppositely charged, same-flavor lepton pair is required to be in the range 70 GeV $< m_{\ell\ell} <$ 110 GeV; (iii) the hardest jet is required to have $p_T > 50$ GeV and $|\eta| < 2.5$ (for the reason stated above).
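The two sets of basic selection criteria can be expressed as simple event filters. This is a minimal sketch, assuming leptons are treated as massless for the invariant mass and that jets are given as (pT, eta) pairs; the `Lepton` class and function names are hypothetical and not part of any analysis framework.

```python
import math
from dataclasses import dataclass


@dataclass
class Lepton:
    pt: float      # GeV
    eta: float
    phi: float
    charge: int    # +1 or -1
    flavor: str    # "e" or "mu"


def good_leptons(leptons):
    return [l for l in leptons if l.pt > 20.0 and abs(l.eta) < 2.5]


def hardest_jet_ok(jets):
    """jets: list of (pt, eta); the leading jet must have pT > 50 GeV and |eta| < 2.5."""
    if not jets:
        return False
    pt, eta = max(jets, key=lambda jet: jet[0])
    return pt > 50.0 and abs(eta) < 2.5


def dilepton_mass(l1, l2):
    """Invariant mass of two massless leptons from (pT, eta, phi)."""
    m2 = 2.0 * l1.pt * l2.pt * (math.cosh(l1.eta - l2.eta) - math.cos(l1.phi - l2.phi))
    return math.sqrt(max(m2, 0.0))


def passes_wa_selection(leptons, jets):
    return len(good_leptons(leptons)) == 1 and hardest_jet_ok(jets)


def passes_za_selection(leptons, jets):
    selected = good_leptons(leptons)
    if len(selected) != 2:
        return False
    l1, l2 = selected
    if l1.charge * l2.charge >= 0 or l1.flavor != l2.flavor:
        return False
    return 70.0 < dilepton_mass(l1, l2) < 110.0 and hardest_jet_ok(jets)


# Two back-to-back muons with pT = 45 GeV at eta = 0 give m_ll = 90 GeV
muons = [Lepton(45.0, 0.0, 0.0, +1, "mu"), Lepton(45.0, 0.0, math.pi, -1, "mu")]
za_pass = passes_za_selection(muons, jets=[(60.0, 0.3), (25.0, -1.0)])
```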
According to the L1 trigger menu of CMS with tracking information [95], the current offline thresholds of the single electron (muon) + jet trigger are 23 GeV (16 GeV) for the lepton and 66 GeV for the jet. The offline thresholds of the single electron + photon trigger are 22 GeV and 16 GeV, respectively. These triggers will be able to collect all events of interest for our study. The higher event rates and event sizes at the HL-LHC will be a challenge for the trigger and data acquisition systems. With a complex series of upgrades, including the installation of new detectors and the replacement of ageing electronics, we expect that the future trigger menu could be comparable with that of Run-2.

| Basic selection | signal | $jj$ | $W^\pm\gamma$ | $W^\pm j$ | $t\bar{t}$ | $tj$ |
|---|---|---|---|---|---|---|
| No cut | $4.6 \times 10^{-1}$ | $2.5 \times 10^{7}$ | $9.9 \times 10^{1}$ | $4.1 \times 10^{4}$ | $6.0 \times 10^{2}$ | $2.0 \times 10^{2}$ |
| 1 lepton with $p_T > 20$ GeV and $\lvert\eta\rvert < 2.5$ | $6.0 \times 10^{-2}$ | $1.9 \times 10^{5}$ | $1.2 \times 10^{1}$ | $4.4 \times 10^{3}$ | $1.5 \times 10^{2}$ | $2.9 \times 10^{1}$ |
| The hardest jet with $p_T > 50$ GeV and $\lvert\eta\rvert < 2.5$ | $4.0 \times 10^{-2}$ | $1.3 \times 10^{5}$ | $2.5 \times 10^{0}$ | $1.6 \times 10^{3}$ | $1.4 \times 10^{2}$ | $1.9 \times 10^{1}$ |

Table 2. The basic selection cut-flow of the cross sections (in units of pb) for the ALP-strahlung production process $pp \to W^\pm a$ with $m_a = 3$ GeV at the 14 TeV LHC, where $g_{a\gamma\gamma}$ is set to 0.64 TeV$^{-1}$. For comparison, the corresponding results for the backgrounds are also listed.
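To relate the cut-flow cross sections to event yields, the signal column of Table 2 can be converted into per-step efficiencies and an expected number of events at 3000 fb$^{-1}$. The sketch below uses only the signal numbers quoted in Table 2; the variable and function names are illustrative.

```python
# Signal cross sections (pb) after each step of the basic selection, from Table 2
SIGNAL_CUTFLOW_PB = [
    ("No cut", 4.6e-1),
    ("1 lepton", 6.0e-2),
    ("hardest jet", 4.0e-2),
]
LUMI_FB = 3000.0
PB_TO_FB = 1000.0


def expected_events(xs_pb, lumi_fb=LUMI_FB):
    """Expected event count N = sigma * L, with sigma converted from pb to fb."""
    return xs_pb * PB_TO_FB * lumi_fb


def step_efficiencies(cutflow):
    """Efficiency of each cut relative to the previous step."""
    return [
        (name, xs / previous_xs)
        for (_, previous_xs), (name, xs) in zip(cutflow, cutflow[1:])
    ]


signal_events = expected_events(SIGNAL_CUTFLOW_PB[-1][1])
signal_steps = step_efficiencies(SIGNAL_CUTFLOW_PB)
overall_efficiency = SIGNAL_CUTFLOW_PB[-1][1] / SIGNAL_CUTFLOW_PB[0][1]
```

With these inputs, the signal after the basic selection corresponds to about $1.2 \times 10^{5}$ events, with an overall selection efficiency of roughly 9%.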
As an example, we consider a benchmark signal point with $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$ and $m_a = 3$ GeV. The cut-flows of our basic selection for the ALP-strahlung production processes $pp \to W^\pm a$ and $pp \to Za$ and for the main backgrounds are given in Tables 2 and 3. Next, we use the signal and background events after the basic selection to train the CNN. During the 100 training epochs, the model with the minimal validation loss is chosen as the best model.

| Basic selection | signal | $Z\gamma$ | $Zj$ |
|---|---|---|---|
| The hardest jet with $p_T > 50$ GeV and $\lvert\eta\rvert < 2.5$ | $2.9 \times 10^{-3}$ | $4.4 \times 10^{-1}$ | $1.0 \times 10^{2}$ |

Table 3. Same as Table 2, but for the ALP-strahlung production process $pp \to Za$ and the corresponding backgrounds.
In Fig. 6, we present the jet images of the diphoton-jet from the ALP, the single photon and the QCD-jet after applying the above translation and rotation in the pre-processing. The benchmark signal events are generated for the ALP with $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$ and $m_a = 3$ GeV. The single photon and QCD-jet are taken from $W^\pm\gamma$ and $W^\pm j$ background events, respectively. The pixel intensities $p_T^{(a)}$, $p_T^{(b)}$, $p_T^{(c)}$ and $p_T^{(d)}$ (namely the four image channels) correspond to the transverse momentum of all the particles, the photons, the charged hadrons and the neutral hadrons falling in each pixel, respectively, averaged over the total number of events. As shown in Fig. 6, the diphoton-jet and the single photon events have much higher $p_T^{(a)}$ and $p_T^{(b)}$ than the QCD-jet events. Meanwhile, the QCD-jet background has much higher $p_T^{(c)}$ and $p_T^{(d)}$ than the diphoton-jet and the single photon events. In all four image channels, the spread of the pixel intensity around the image center is wider for the QCD-jet events than for the diphoton-jet and the single photon events.

Fig. 7 shows the attention images for a QCD-jet, a single photon and a diphoton-jet sample, respectively, for $m_a = 3$ GeV. It shows that the network can learn to automatically extract the most distinguishable image regions and pay more attention to them. For a diphoton-jet sample, the network focuses on two image regions that contain a pair of collimated photons, and the leading subjet gets relatively stronger attention. For a single photon sample, the network pays strong attention to the pixel where the leading subjet is located. For a QCD-jet sample, the attention is weak and more diffuse.
After training, the jet-tagging CNN with the optimal model parameters is tested on the jets in the remaining signal and background events. The resulting jet-tagging probabilities are fed into the jet selection NN to optimize the jet selection efficiency for the diphoton-jet, the single photon and the QCD-jet, respectively. We adopt the widely used ternary plot to present the jet-tagging probabilities for each jet in the validation set. As shown in Fig. 8, the red, green and blue points denote the diphoton-jet, the single photon and the QCD-jet, respectively. As before, the benchmark point is chosen as $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$ and $m_a = 3$ GeV. Note that the validation data set for the CNN has the same size as the training data set for the jet selection NN.

Figure 8. A ternary plot for the diphoton-jet events, the single photon events and the QCD-jet events. The red, green and blue points represent the diphoton-jet events, the single photon events and the QCD-jet events, respectively.

Figure 9. The optimal cut for the ALP-strahlung production process $pp \to W^\pm a$ based on the selection NN with $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$ and $m_a = 3$ GeV. The red, green and blue points represent the diphoton-jet events, the single photon events and the QCD-jet events, while the black line is the optimal cut.
In the following, we employ a simple NN to obtain the optimal cut that maximizes the statistical significance for the $pp \to W^\pm a$ and $pp \to Za$ processes. The integrated luminosity is set to 3000 fb$^{-1}$ at the 14 TeV LHC. In Fig. 9, we present the optimal cut for the ALP-strahlung production process $pp \to W^\pm a$ based on the selection NN with $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$ and $m_a = 3$ GeV.
In order to estimate the statistical error, we train seven selection neural networks to obtain the mean detection significance and its standard deviation. For the diphoton-jet signal events and the single photon background events, we use seven different data sets, each containing 300k events. The single photon background sample in the training data set, which contains the $W^\pm\gamma$ and $Z\gamma$ processes, also has 300k events. The numbers of events from the different processes are determined by their cross sections. Note that after the jet selection cut only a limited number of QCD-jets remain, while generating QCD-jet events is computationally expensive. To ensure stable statistics, we adopt the bootstrap sampling method for the QCD-jet background events, which contain the $jj$, $W^\pm j$, $t\bar{t}$, $tj$ and $Zj$ processes. Each time, 3M QCD-jets are randomly sampled from all the QCD-jet samples. The numbers of events from the different processes are again determined by their cross sections. The number of events in the training data set ensures that at least 10 events are left after any cut imposed by the neural network, for both signal and backgrounds. The optimal detection significances for the ALP-strahlung production processes $pp \to W^\pm a$ and $pp \to Za$ obtained from the seven simple NNs are shown in Tables 4 and 5.

Table 4. The optimal significance for the ALP-strahlung production process $pp \to W^\pm a$ obtained from the trainings of seven significance neural networks. The benchmark point is chosen as $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$.
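The bootstrap sampling and the mean and standard deviation over the trained networks can be sketched with the Python standard library. The allocation of events in proportion to the cross sections follows the description above; the use of the sample standard deviation, the function names and the toy pools are assumptions made for illustration.

```python
import random
import statistics


def allocate_counts(cross_sections_pb, n_total):
    """Number of events per process, proportional to its cross section."""
    total = sum(cross_sections_pb.values())
    return {name: round(n_total * xs / total) for name, xs in cross_sections_pb.items()}


def bootstrap_qcd_sample(pools, cross_sections_pb, n_total, rng):
    """Draw events with replacement from each process pool, weighted by cross section."""
    counts = allocate_counts(cross_sections_pb, n_total)
    sample = []
    for name, n_draw in counts.items():
        sample.extend(rng.choices(pools[name], k=n_draw))
    return sample


def summarize(significances):
    """Mean and sample standard deviation over the independently trained networks."""
    return statistics.mean(significances), statistics.stdev(significances)


# Toy pools of event identifiers standing in for simulated QCD-jet events
toy_pools = {"jj": list(range(100)), "Wj": list(range(100, 120))}
toy_sample = bootstrap_qcd_sample(
    toy_pools, {"jj": 3.0, "Wj": 1.0}, n_total=40, rng=random.Random(0)
)
```

Sampling with replacement lets a limited pool of expensive simulated events be reused for several training sets, at the cost of correlations between those sets.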
In Fig. 10, we present the $2\sigma$ bounds on the ALP-photon coupling $g_{a\gamma\gamma}$ versus $m_a$ from the ALP-strahlung production processes $pp \to W^\pm a$ and $pp \to Za$. Based on the results of the seven simple NNs, we calculate the mean significance and the corresponding standard deviation for all benchmark points, and we draw two exclusion bands, each with a width of 10 standard deviations. The red band indicates the $2\sigma$ bounds for the $pp \to W^\pm a$ process, while the blue band indicates the $2\sigma$ bounds for the $pp \to Za$ process. As mentioned in Section II, the diphoton-jet defined in this paper applies only to the ALP mass range from a few hundred MeV to 10 GeV. When the ALP mass is larger than 10 GeV, the ALP decays to two separated photons, which are detected as $2\gamma$ events. When the ALP mass is less than a few hundred MeV, the ALP is highly boosted and decays into a pair of highly collimated photons. The reconstruction of such a low-mass resonance in our phenomenological study is very challenging, so we focus on the ALP mass range of 0.3 GeV to 10 GeV.

Table 5. The optimal significance for the ALP-strahlung production process $pp \to Za$ obtained from the trainings of seven significance neural networks. The benchmark point is chosen as $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$.
For comparison, we also present the previous constraints, such as the beam dump searches for short-lived axions, the LEP searches for ALPs via $e^+e^- \to 2\gamma/3\gamma$ processes, the LHCb diphoton resonance search around the $B_s$ mass between 4.9 GeV and 6.3 GeV, and the L3 search for isolated and energetic photons produced in hadronic decays of the Z boson. Since the resonance searches based on the Upsilon meson decay $\Upsilon \to a\gamma$ at BaBar apply only to scenarios with a non-zero ALP-gluon coupling, while in our analysis we turn off the ALP-gluon coupling, we do not include the BaBar limits. Note that the searches for γγ resonances in photon and weak boson fusion processes by ATLAS and CMS [18,28,96,97] are sensitive only when $m_a > 10$ GeV. Recently, the $\gamma\gamma \to a \to \gamma\gamma$ searches in electromagnetic PbPb collisions at the LHC by CMS and ATLAS [98,99] have provided the most competitive current ALP limits in the mass range of 5 GeV < $m_a$ < 100 GeV, as shown in Fig. 10. In addition, the processes $e^+e^- \to \gamma a$, $a \to \gamma\gamma$ have been searched for in the mass range 0.2 GeV < $m_a$ < 9.7 GeV at Belle II [49], as also shown in Fig. 10. Moreover, we show the constraints from Ref. [53], in which the electroweak ALP is probed via the ALP-strahlung production processes $pp \to W^\pm a$ and $pp \to Za$ in the mass range of 0.3 GeV < $m_a$ < 10 GeV at the 14 TeV HL-LHC, based on jet substructure variables and the BDT method. These constraints are depicted as a red dashed line and a blue dashed line, representing the 2σ bounds for the $pp \to W^\pm a$ process and the $pp \to Za$ process, respectively. Since the results in Ref. [53] are based on conventional jet substructure variables and BDT, we label these two dashed lines "pp → W±a without CNN" and "pp → Za without CNN". As shown in Fig. 10, the shapes of the red and blue bands obtained in this work with the CNN are similar to the shapes of the dashed red and blue lines of Ref. [53] based on jet substructure variables, which could cover part of the triangle region between the Belle II bound and the ATLAS/CMS(PbPb) bound. The exclusion limits are weaker when $m_a$ is close to 0.3 GeV or 10 GeV, since the diphoton-jet signature becomes ambiguous there: the ALP is either detected as a 2γ event or tagged as a single photon in our analysis. The best exclusion limits from the processes $pp \to W^\pm a$ and $pp \to Za$ are obtained when $m_a$ is in the range of 5 to 7 GeV and 3 to 5 GeV, respectively. For the same ALP-strahlung process, the diphoton-jet tagging based on the CNN in this work yields much stronger limits than the tagging based on the jet substructure variables analysis in Ref. [53].
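The weakening of the limits at both ends of the mass range can be understood from the opening angle of the two photons. For a boosted two-body decay with $p_T \gg m_a$, the angular separation is roughly $\Delta R \approx 2 m_a / p_T$. The following Python sketch classifies an ALP decay by comparing this estimate with two thresholds. The function names, the transverse momentum of 50 GeV, the jet radius of 0.4 and the minimum resolvable separation of 0.01 are illustrative assumptions, not values taken from the analysis.

```python
def diphoton_separation(m_a, pt):
    """Approximate Delta R between the two photons from a -> gamma gamma.

    Uses the rule of thumb Delta R ~ 2 m_a / pT, valid only for pT >> m_a.
    Both arguments are in GeV.
    """
    return 2.0 * m_a / pt


def classify_decay(m_a, pt, jet_radius=0.4, min_resolvable=0.01):
    """Classify the decay topology from the estimated photon separation.

    ASSUMPTION: jet_radius and min_resolvable are hypothetical thresholds
    chosen only to illustrate the three regimes discussed in the text.
    """
    delta_r = diphoton_separation(m_a, pt)
    if delta_r > jet_radius:
        return "resolved 2-photon event"      # photons reconstructed separately
    if delta_r < min_resolvable:
        return "single-photon-like"           # photons merge into one object
    return "diphoton-jet"                     # both photons inside one jet


# Hypothetical scan over ALP masses at a fixed pT of 50 GeV.
regimes = {m: classify_decay(m, pt=50.0) for m in (0.1, 3.0, 20.0)}
```

With these placeholder thresholds, a 3 GeV ALP falls in the diphoton-jet regime, while much lighter or much heavier ALPs drift toward the single-photon-like or resolved regimes, which mirrors the qualitative behaviour described above.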
We find that our approach, based on the jet-tagging CNN and the detection significance optimization NN, can extend the current LHC sensitivity in the ALP mass from 5 GeV down to 0.3 GeV in the case of vanishing ALP-gluon coupling. Moreover, it can greatly surpass the existing LEP bounds, the Belle II constraints and the limits from Ref. [53]. For instance, $g_{a\gamma\gamma}$ in our study can be excluded down to 1.1 TeV$^{-1}$ at $m_a = 0.3$ GeV and 0.5 TeV$^{-1}$ at $m_a = 5$ GeV. It should be noted that our limit from the HL-LHC is stronger than that from the current Belle II data. However, the future Belle II with the full integrated luminosity of 50 ab$^{-1}$ may provide more stringent constraints than ours [40].
In this work, we mainly investigate the potential of the CNN method to distinguish the ALP diphoton-jet events from the single-photon and QCD-jet events. It is worth noting that additional standard kinematic cuts exploiting the full kinematic properties of the signal and background events could be used to further suppress the backgrounds and enhance the sensitivity. For example, the QCD-dijet events are the dominant background for the signal process $pp \to W^\pm a$. Therefore, a photon isolation requirement and a cut on the missing transverse energy would help to reduce this background in our analysis of the process $pp \to W^\pm a$. The results obtained in this paper should be considered as a lower limit, on top of which further improvements can be implemented. In addition, our sensitivity analysis would be affected by systematic uncertainties, such as the calibration of the jet energy scale. Estimating them, however, requires a full detector simulation and real data. Since the realistic detector performance of the HL-LHC is not yet available, we do not include the systematic uncertainties in the current calculation. An updated analysis will be done in our future work.
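To give a feeling for how a background systematic uncertainty could degrade the sensitivity, the sketch below uses a common approximation, $Z = S / \sqrt{B + (\delta B)^2}$, where $\delta$ is a relative uncertainty on the background yield. This is not the calculation performed in this work, which neglects systematic uncertainties; the formula choice, the function name and the event yields are assumptions made purely for illustration.

```python
import math


def significance(n_signal, n_background, rel_syst=0.0):
    """Approximate significance with a relative background systematic.

    Z = S / sqrt(B + (rel_syst * B)^2); with rel_syst = 0 this reduces
    to the purely statistical estimate S / sqrt(B).
    ASSUMPTION: this simple formula is an illustrative stand-in, not the
    significance definition used in the analysis.
    """
    variance = n_background + (rel_syst * n_background) ** 2
    return n_signal / math.sqrt(variance)


# Hypothetical yields: 100 signal and 2500 background events after all cuts.
z_stat_only = significance(100.0, 2500.0)
z_with_syst = significance(100.0, 2500.0, rel_syst=0.02)
```

In this toy example a 2% background uncertainty already doubles the effective variance, so the larger the surviving background, the more a given relative systematic matters.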

Conclusions
In this work, we studied the ALP-strahlung production processes $pp \to W^\pm a$, $Za$ in the mass range of 0.3 GeV < $m_a$ < 10 GeV at the 14 TeV HL-LHC. Since the two photons from the ALP decay are highly collimated for such a light ALP, we designed a jet-tagging CNN, based on the jet-image notion, to discriminate our signal from the QCD-jet and single-photon backgrounds, and we proposed a detection significance optimization NN to search for the optimal cut that maximizes the statistical significance for the ALP-strahlung production processes $pp \to W^\pm a$, $Za$. With the help of machine learning techniques, we obtained the 2σ bounds on the ALP-photon coupling $g_{a\gamma\gamma}$ versus the ALP mass $m_a$. The coupling $g_{a\gamma\gamma} > 1.1$ TeV$^{-1}$ at $m_a = 0.3$ GeV and $g_{a\gamma\gamma} > 0.5$ TeV$^{-1}$ at $m_a = 5$ GeV can be excluded at the 2σ level at the 14 TeV LHC with an integrated luminosity of 3000 fb$^{-1}$. This shows that our approach can extend the current LHC bounds on the ALP mass from 5 GeV down to 0.3 GeV, and the obtained bounds are stronger than the other existing limits.

Figure 10. The 2σ bounds in the ALP-photon coupling $g_{a\gamma\gamma}$ versus $m_a$ plane. Regions above the red and blue dashed lines are excluded by $pp \to W^\pm a$ and $pp \to Za$ at the 14 TeV LHC with an integrated luminosity of 3000 fb$^{-1}$, without using CNN [53]. Regions above the red and blue bands are excluded by this work using CNN. Other bounds shown are from LEP [17,28], L3 [100], Belle II [49], LHCb [101], ATLAS/CMS [18,28,96,97], ATLAS/CMS(PbPb) [29] and beam dumps [46].

Figure 2. Illustration of the jet-tagging CNN.

Figure 3. The attention block designed in our model.

Figure 4. Architecture of the selection neural network.

Figure 5. Illustration of the depthwise convolution layer and the pointwise convolution layer.

Figure 6. The jet images for three kinds of jets after applying the translation and rotation steps of the pre-processing. The three kinds of jets are the leading jets of the benchmark signal events with $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$ and $m_a = 3$ GeV, the leading jets of the $W^\pm\gamma$ background events and the leading jets of the $W^\pm j$ background events after basic selection. The pixel intensities of the four channels, defined by the sum of the transverse momentum of all the particles, the photons, the charged hadrons and the neutral hadrons, are labeled as $p_T^a$, $p_T^b$, $p_T^c$ and $p_T^d$, respectively.

Figure 8.A ternary plot for the diphoton-jet events, the single photon events and the QCD-jet events.The red, green and blue points represent the diphoton-jet events, the single photon events and the QCD-jet events, respectively.

Figure 9. The optimal cut for the ALP-strahlung production process $pp \to W^\pm a$ based on the selection NN with $g_{a\gamma\gamma} = 0.64$ TeV$^{-1}$ and $m_a = 3$ GeV. The red, green and blue points represent the diphoton-jet events, the single photon events and the QCD-jet events, while the black line is the optimal cut.
[Figure 10 plot. Horizontal axis: $m_a$ [GeV]; vertical axis: $g_{a\gamma\gamma}$ [TeV$^{-1}$]. Legend labels: LEP (JHEP 2015), LEP (PLB 2016), LHCb, Beam Dumps, ATLAS/CMS (PbPb), LHC Bounds (photon fusion + VBF), LHC Bounds (photon and weak boson fusion).]

Table 1. The fiducial cross sections of the SM backgrounds with the theoretical uncertainties at the 14 TeV LHC.
The training set and the validation set each contain 150k events.