Portraying Double Higgs at the Large Hadron Collider II

The Higgs potential is vital to understanding the electroweak symmetry breaking mechanism, and probing the Higgs self-interaction is arguably one of the most important physics targets at current and upcoming collider experiments. In particular, the triple Higgs coupling may be accessible at the HL-LHC by combining results in multiple channels, which motivates the study of all possible decay modes of double Higgs production. In this paper, we revisit double Higgs production at the HL-LHC in the final state with two b-tagged jets, two leptons and missing transverse momentum. We focus on the performance of various neural network architectures with different input features: low-level (four-momenta), high-level (kinematic variables) and image-based. We find it possible to bring a modest increase in the signal sensitivity over existing results via careful optimization of machine learning algorithms making full use of novel kinematic variables.


Introduction
The discovery of the Higgs boson at the Large Hadron Collider (LHC) launched a comprehensive program of precision measurements of all Higgs couplings. While the current data show that the Higgs couplings to fermions and gauge bosons are consistent with the predictions of the Standard Model (SM) [1], the Higgs self-couplings are yet to be probed at the LHC and at future colliders. The measurement of the Higgs self-couplings is vital to understanding the electroweak symmetry breaking mechanism. In particular, the triple Higgs coupling is a guaranteed physics target that can be probed at the high-luminosity (HL) LHC, and the resulting experimental bounds on the self-couplings will have an immediate and long-lasting impact on model-building for new physics beyond the SM. The Higgs (h) self-interaction is parameterized as

V ⊃ (1/2) m_h² h² + κ_3 λ_3^SM v h³ + (1/4) κ_4 λ_4^SM h⁴ ,

where m_h is the Higgs mass and v is the vacuum expectation value of the Higgs field. λ_3^SM = λ_4^SM = m_h²/(2v²) are the SM triple and quartic Higgs couplings, while κ_i (i = 3, 4) parameterize the deviation from the corresponding SM coupling. In order to access the triple (quartic) Higgs coupling, one has to measure double (triple) Higgs boson production. In this paper we focus on probing the triple Higgs coupling at the HL-LHC, which is likely achievable when combining both ATLAS and CMS data [2][3][4][5] with a potential improvement in each decay channel¹. Double Higgs (hh) production has been extensively discussed in many different channels, such as bbγγ [21-38], bbττ [23-25, 27, 39-43], bbbb [24, 44-50], bbW⁺W⁻/ZZ [51-59], [60, 61], W⁺W⁻ττ [60], and ττττ [60] (see also Refs. [2, 5] and references therein).
On the other hand, less attention has been given to the final state with two b-tagged jets, two leptons and missing transverse momentum, as it suffers from large SM backgrounds, primarily due to top quark pair production (tt). Several existing studies in this channel therefore made use of sophisticated algorithms (e.g. a neural network (NN) [54], boosted decision trees (BDT) [55, 62], and deep neural networks (DNN) [63, 64]) to increase the signal sensitivity, although they led to somewhat pessimistic results, with a significance much smaller than 1σ at the HL-LHC with 3 ab⁻¹ of luminosity. More recent studies claim that the significance can be greatly improved by utilizing novel kinematic methods [56], or by adopting more complex NNs such as convolutional neural networks (CNN) with jet images [65] and message passing neural networks (MPNN) with four-momentum information [66].
In particular, Ref. [56] introduced two new kinematic variables (Topness and Higgsness) and investigated their impact on the reduction of the tt background, along with M_T2 and ŝ_min. It showed that the substantial improvement in the signal significance was due to the use of kinematic variables that contain mass information. In a follow-up study, Ref. [65], the authors performed a more dedicated analysis using Delphes simulation, and included the tW background which was missing in the previous study. Ref. [65] utilized a simple convolutional neural network with the newly introduced kinematic variables along with existing ones (16 variables in total) to improve the signal significance. They also studied jet images (with charged and neutral hadrons only) in double Higgs production and showed that additional improvement was possible.
The goal of this article is to extend the scope of the existing studies on double Higgs production at the HL-LHC in the final state with (h → bb̄)(h → W±W∓* → ℓ⁺ν ℓ⁻ν̄), by studying the performance of various NN architectures. In particular, we would like to address the following important points, which were not answered properly in Refs. [56, 65].
1. The performance of NNs with different types of input features: low-level (four momenta), high-level (kinematic variables), and image-based inputs.
2. Ref. [65] used a CNN with jet images, which are the energy deposits of charged and neutral hadrons in the hadronic calorimeter. How robust are these results? How much error do we make due to a different choice of hyper-parameters?
3. In principle, the lepton momenta are correlated with the momenta of the two b-quarks (and therefore with their hadronic activities), so one could consider the image of the leptons simultaneously. Can image-based NNs catch the non-trivial correlation between leptons and b-quarks?
4. Ref. [56] introduced two novel kinematic variables (Topness and Higgsness), which provide good signal-background separation. As a byproduct, one obtains the momenta of the two missing neutrinos. What would be the best way to utilize the neutrino momentum information along with the visible particles? Would the "image" of the neutrinos provide an additional handle for signal-background separation?
5. What are the signal efficiency and background rejection rate of different NN algorithms?
6. How much improvement does the bbWW channel bring to the combination of individual channels?
This paper is organized as follows. We begin in section 2 with a discussion of the event generation for the signal and backgrounds, followed by the data preparation for the NN analysis in section 3. In section 4, we examine several NN architectures, including deep neural networks (DNNs), convolutional neural networks (CNNs), residual neural networks (ResNets), graph neural networks (GNNs), capsule neural networks (CapsNets) and matrix capsule networks. We study their performance with low-level (four-momenta), high-level (kinematic variables), and image-based input data, which is summarized in section 5. Section 6 is reserved for a discussion and outlook. We provide a brief review of various kinematic variables in appendix A.

Theoretical setup and simulation
The signal (hh with κ_3 = 1) and backgrounds are generated at a center-of-mass energy of √s = 14 TeV using MadGraph5_aMC@NLO [67, 68], which allows for both tree- and loop-level event generation. We use the default NNPDF2.3QED parton distribution functions [69] with dynamical renormalization and factorization scales. We normalize the double Higgs production cross section to 40.7 fb at next-to-next-to-leading order (NNLO) accuracy in QCD [70]. The dominant background is tt production, whose tree-level cross section is rescaled to the NNLO cross section of 953.6 pb [71]. We consider only the leptonic decays of the tops, tt → (b ℓ⁺ν)(b̄ ℓ⁻ν̄) with ℓ = e, µ, τ, which includes off-shell effects for the top and the W. The next dominant background is tW + j production, matched (in the five-flavor scheme) up to one additional jet in order to partially include next-to-leading order (NLO) effects². Both the top and the W are decayed leptonically, as for the tt sample. The sub-dominant backgrounds consist of tth and ttV (V = W±, Z), whose cross sections are normalized to 611.3 fb [77] and 1.71 pb [78] at NLO, respectively. We also include the Drell-Yan backgrounds ℓℓbj and ττbb, where j denotes partons in the five-flavor scheme. The NNLO k-factor of Drell-Yan production [79] is close to unity (k-factor ≈ 1). The hard-scattering events are decayed (unless mentioned otherwise), showered, and hadronized using Pythia8 [80]. Detector effects are simulated with Delphes 3.4.1 [81] based on modified ATLAS configurations [65]. Jets are reconstructed with the Fastjet 3.3.1 [82] implementation of the anti-k_T algorithm [83] with a cone radius of r = 0.4. We take advantage of the improved b-tagging efficiency reported by ATLAS, associated with the central tracking system for operation at the HL-LHC [84]. We use a flat b-tag rate of ε_b→b = 0.8, and mistag rates for a c-jet (light-flavor jet) being misidentified as a b-jet of ε_c→b = 0.2 (ε_j→b = 0.01). Events with exactly two b-tagged jets passing the minimum cuts p_T(b) > 30 GeV and |η(b)| < 2.5 are considered. The two b-tagged jets are further required to satisfy a proximity cut ∆R_bb < 2.5 and an invariant mass cut 70 GeV < m_bb < 140 GeV.

² The NLO tW process mixes with LO tt, which requires a careful treatment to avoid double-counting. Several methods have been presented in the literature [72-75] to resolve the issue, including the diagram removal (DR) scheme adopted in the ATLAS [64] and CMS [76] analyses. In this paper, we follow the same DR scheme, where the tW background is generated with up to one additional jet, excluding diagrams that overlap with the tt process.

Table 1. Signal and background cross sections after the baseline selection described in section 2, including appropriate k-factors as well as taking into account the improved b-tagging efficiency and fake rates.
A lepton ℓ is declared isolated if it satisfies p_T(ℓ)/(p_T(ℓ) + Σ_i p_T,i) > 0.7, where Σ_i p_T,i is the sum of the transverse momenta of nearby particles with p_T,i > 0.5 GeV and ∆R_ℓi < 0.3. Events with exactly two isolated leptons passing the minimum cuts p_T(ℓ) > 20 GeV and |η(ℓ)| < 2.5 are selected. The two leptons are further required to pass a proximity cut ∆R_ℓℓ < 1.5 and an invariant mass cut m_ℓℓ < 70 GeV. Events are required to pass a minimum missing transverse momentum (defined as in Ref. [65]) of P̸_T = |P̸⃗_T| > 20 GeV. After this baseline selection, the signal and background cross sections are summarized in Table 1, including appropriate k-factors and taking into account the improved b-tagging efficiency and fake rates. The dominant background is tt (97%), followed by tW (2%). The background-to-signal cross section ratio is about 9250 after the baseline selection. Throughout this paper, we assume an integrated luminosity of L = 3 ab⁻¹, unless otherwise mentioned.
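As an illustration, the baseline selection above can be sketched as a simple cut flow. This is a minimal sketch under stated assumptions, not the analysis code used in the paper: the event format (dicts of (p_T, η, φ) triples plus a scalar `met`) is a hypothetical toy structure, and the invariant masses use the massless approximation.

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance ΔR = sqrt(Δη² + Δφ²), with Δφ wrapped into [-π, π]."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)

def m_inv(p1, p2):
    """Invariant mass of two objects in the massless approximation."""
    (pt1, eta1, phi1), (pt2, eta2, phi2) = p1, p2
    return math.sqrt(max(0.0, 2.0 * pt1 * pt2 *
                         (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))))

def passes_baseline(event):
    """Baseline selection of section 2 on a toy event record:
    event = {'bjets': [(pT, eta, phi), ...], 'leptons': [...], 'met': float}."""
    bjets, leps = event["bjets"], event["leptons"]
    # exactly two b-tagged jets with pT > 30 GeV and |eta| < 2.5
    if len(bjets) != 2 or any(pt <= 30 or abs(eta) >= 2.5 for pt, eta, _ in bjets):
        return False
    # b-jet pair: proximity cut and invariant-mass window
    if delta_r(bjets[0][1], bjets[0][2], bjets[1][1], bjets[1][2]) >= 2.5:
        return False
    if not 70.0 < m_inv(*bjets) < 140.0:
        return False
    # exactly two isolated leptons with pT > 20 GeV and |eta| < 2.5
    if len(leps) != 2 or any(pt <= 20 or abs(eta) >= 2.5 for pt, eta, _ in leps):
        return False
    # lepton pair: proximity and invariant-mass cuts
    if delta_r(leps[0][1], leps[0][2], leps[1][1], leps[1][2]) >= 1.5:
        return False
    if m_inv(*leps) >= 70.0:
        return False
    # minimum missing transverse momentum
    return event["met"] > 20.0
```

The lepton isolation requirement is assumed to have been applied upstream when filling the `leptons` list.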

Pre-processing input data
Data preparation or preprocessing is an important part of an ML analysis. In particular, to fully understand the performance of NNs with different types of inputs, we categorize the input features used in this paper as follows. The most basic information (low-level features) is the four-momenta of the four visible particles (two leptons and two b-jets), with dimension dim(V_pµ^(vis)) = 16. There are various kinematic methods which provide approximate momenta of the missing neutrinos. In this paper, we adopt Topness (in Eq. (A.5)) [56, 85] and Higgsness (in Eq. (A.8)) [56], which were proposed for double Higgs production in particular. As a result of the minimization procedures, these momenta approximately carry the kinematic features of the missing neutrinos. For example, the neutrino momenta obtained using Topness are consistent with the top and W mass constraints, while those obtained using Higgsness are consistent with double Higgs production. Therefore we include them in our input variables as V_pµ^(ν), with dim(V_pµ^(ν)) = 8. With this basic four-momentum information, we can construct sets of 11 and 15 low-level kinematic variables, V_11-kin and V_15-kin, where min[∆R_bℓ] denotes the smallest angular distance between a b-jet and a lepton, and ∆R^H_νν (∆R^T_νν) and m^H_νν (m^T_νν) represent the angular distance and the invariant mass between the two neutrinos reconstructed using the Higgsness (H) or Topness (T) variables. We confirmed that the distributions of the first 10 variables in V_11-kin are similar to those in Ref. [65]. Although Ref. [65] did not utilize the last 5 variables in V_15-kin, min[∆R_bℓ] delivers direct angular information between a b-jet and a lepton, and the other 4 variables provide additional information on the neutrino sector. Therefore we include them in our analysis (see Fig. 17 for their distributions). Note that some kinematic variables in V_15-kin are strongly correlated with each other due to the pencil-like event shape of hh production, as compared to the isotropic dominant background (tt) [56, 65]. See the Appendix for more details on the event shape variables for the signal and backgrounds. Once a few cuts are imposed, the effect of the remaining cuts is significantly diminished. Therefore, it is important to investigate new kinematic variables which are less correlated with those introduced in the earlier literature.
Among the low-level kinematic variables defined in V_15-kin, the invariant mass is the only mass variable. For more efficient background suppression utilizing the mass information of the top quark and the Higgs, we use the following six high-level kinematic variables introduced in Refs. [56, 65]: √ŝ_min^(bbℓℓ), √ŝ_min^(ℓℓ), the two M_T2 variables, T and H, where √ŝ_min^(bbℓℓ) and √ŝ_min^(ℓℓ) are the minimum energies required to be consistent with the produced events in the specified subsystem, M_T2 are the stransverse mass variables, and T/H denote Topness and Higgsness. These six high-level variables (V_6-kin) are defined in Appendix A. It has been shown that these high-level variables allow neural networks to learn the key features of the processes faster and more accurately. We will use these 6 high-level kinematic variables along with V_15-kin, which together form the 21-variable set V_21-kin.

A great breakthrough of deep neural networks in image recognition opens up the possibility of better background rejection when the energy and momentum of the final state particles are converted into image pixels. Jet images [86] are based on the particle-flow information [87] in each event. We divide the particle flow into charged and neutral particles. The charged particles include charged hadrons, while the neutral particles consist of neutral hadrons as well as non-isolated photons. Leptons are removed from both samples. Since these particles are spread over the (η, φ) plane, it is challenging to identify key features such as the color flows and the associated hadronization patterns of the signal and backgrounds. It is therefore instructive to rearrange them to make these features more accessible and to allow for a robust identification. Here we define the origin of the (η, φ) coordinates to be the center of the two reconstructed b-tagged jets. All particles are translated accordingly in the (η, φ) plane. Jet images are discretized into 50 × 50 calorimeter grids within the region −2.5 ≤ η ≤ 2.5 and −π ≤ φ ≤ π.
The intensity of each pixel is given by the total transverse momentum of the particles passing through the pixel. The final jet images have a dimension of (2 × 50 × 50), where 2 denotes the number of channels, the charged and neutral particle images, which are shown in the first and second columns of Fig. 1 for the signal and backgrounds. In the case of the signal (hh in the first row), the two b-tagged jets descend from the color-singlet Higgs; therefore their hadronization products lie in the direction of the two b-tagged jets (toward the center). The empty region around the origin is due to the ∆R_bb > 0.4 requirement. For the dominant background (tt in the second row), on the other hand, the jet images tend to be wider than for the signal, as the two top quarks are color-connected to the initial states. The cumulative distributions clearly demonstrate their differences.
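The pixelation step above can be sketched with NumPy as follows. The function names (`to_image`, `jet_image`) are ours, and the binning matches the 50 × 50 grid over −2.5 ≤ η ≤ 2.5 and −π ≤ φ ≤ π described in the text; this is a schematic version, not the paper's preprocessing code.

```python
import numpy as np

def to_image(particles, n_bins=50, eta_max=2.5):
    """Pixelate (eta, phi, pT) triples into an n_bins x n_bins pT image.

    Coordinates are assumed to be already translated so that the origin is
    the center of the two reconstructed b-tagged jets, as described above.
    """
    img = np.zeros((n_bins, n_bins))
    eta_edges = np.linspace(-eta_max, eta_max, n_bins + 1)
    phi_edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    for eta, phi, pt in particles:
        i = np.searchsorted(eta_edges, eta, side="right") - 1
        j = np.searchsorted(phi_edges, phi, side="right") - 1
        if 0 <= i < n_bins and 0 <= j < n_bins:
            img[i, j] += pt  # pixel intensity = total pT through the pixel
    return img

def jet_image(charged, neutral):
    """Stack the charged and neutral channels into a (2, 50, 50) jet image."""
    return np.stack([to_image(charged), to_image(neutral)])
```

Additional channels (leptons, reconstructed neutrinos) are produced the same way and stacked along the first axis.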
In order for the neural networks to fully take into account the spatial correlation between the images of the final state particles, we project the two isolated leptons onto the discretized 50 × 50 calorimeter grids as well. Combined with the jet images, we have a set of images for the visible particles, V^(C,N,ℓ)_image, whose data structure is (3 × 50 × 50), where 3 denotes the charged (C) and neutral (N) particle images and the lepton (ℓ) images, shown in the third column of Fig. 1 for the signal and backgrounds. In the case of the signal (first row), the leptons are scattered around φ ≈ ±π, opposite to the direction of the two b-tagged jets (at the origin), while for the dominant background (tt in the second row) the leptons are more spread out. This is consistent with the observation made in Refs. [56, 65] using the ŝ_min variable or the invariant mass. Double Higgs production resembles a pencil-like event shape (the two leptons and two b-quarks are approximately back-to-back), while tt production is more or less isotropic. The lepton image also explains the shadow in the (0, ±π) region of the two hadron images (first and second columns) for the signal and backgrounds. See the Appendix for more information on the event shapes. Similarly, one can create images of the two reconstructed neutrinos using Topness and Higgsness, which are shown in the fourth and fifth columns of Fig. 1. As expected from the kinematics, the neutrino images resemble the lepton images, which, in principle, should help the signal-background separation. To assess the importance of these neutrino images for the signal sensitivity, we consider a complete set of images for all final state particles, V^(C,N,ℓ,ν_H,ν_T)_image, whose data structure is (5 × 50 × 50), where 5 denotes the number of channels including all jet, lepton, and neutrino images. As clearly shown in Fig. 1, the kinematic correlations among the decay products are mapped onto these images, including the missing transverse momentum. To catch the non-trivial correlations, more complex and deeper NNs will be considered. Each neural network takes a different set of input features for the classification problem between the signal and backgrounds. More details of the NN architectures will be described in the next section.

Figure 1. The cumulative average of various particle images for the signal and the different background processes after the baseline selection. The particle images are shown in order from left to right: charged hadrons (1st column), neutral hadrons (2nd), isolated leptons (3rd), reconstructed neutrinos using Higgsness (4th), and reconstructed neutrinos using Topness (5th), for the signal (hh in the first row), tt (2nd), tW + j (3rd), tth (4th), ttV (5th), ℓℓbj (6th), and ττbb (7th). The origin of (η, φ) is taken to be the center of the two reconstructed b-tagged jets.

Performance of machine learning algorithms
With the increasing collision rate at the LHC, a typical collider analysis requires a significant dimensional reduction of the complex raw data to a handful of observables, which are then used to determine the parameters in the Lagrangian. Deep learning-based approaches offer very efficient strategies for such dimensional reduction, and have become an essential part of analyses in high energy physics. However, important questions still remain regarding how to best utilize such tools. Specifically we ask i) how to prepare the input data, ii) how to design suitable neural networks for a given task, iii) how to account for systematic uncertainties, and iv) how to interpret the results. In this section, we scrutinize some of these questions by exploring various neural networks in the context of double Higgs production. Here, we discuss the essence of each network and summarize the results briefly, leaving a more detailed comparison to Section 5. The implementations of the neural networks used in this paper are based on the PyTorch framework [88] and can be found at https://github.com/junseungpi/diHiggs/. The events which pass the baseline selection described in Section 2 are divided into 400k training and 250k test data sets.
To make a fair comparison of different NN structures, we consider the discovery reach of the signal at the LHC by computing the signal significance (σ_dis) using the likelihood ratio [89],

σ_dis ≡ √( −2 ln [ L(B|S+B) / L(S+B|S+B) ] ),    L(x|n) = xⁿ e^(−x) / n! ,

where S and B are the expected numbers of signal and background events, respectively.
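Evaluating the ratio of Poisson likelihoods in closed form gives σ_dis = √(2[(S+B) ln(1 + S/B) − S]), which can be coded directly (a minimal sketch; the function name is ours):

```python
import math

def significance(s, b):
    """Discovery significance from the Poisson likelihood ratio:
    sigma_dis = sqrt(-2 ln [L(B|S+B) / L(S+B|S+B)])
              = sqrt(2 [(S + B) ln(1 + S/B) - S]),
    which tends to S/sqrt(B) in the large-background limit."""
    if s <= 0:
        return 0.0
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))
```

For S ≪ B this reduces to the familiar S/√B estimate, while for larger S/B it is noticeably more conservative than S/√B would suggest.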

Deep Neural Networks
A fully-connected network (FC or DNN) is the most basic type of neural network, in which every neuron in each layer is connected to all neurons in the two adjacent layers. The input layer is composed of the four-momenta of the reconstructed particles in Eqs. (3.1-3.4), or of the kinematic variables in Eqs. (3.5-3.8). It is followed by 8 hidden layers with the number of neurons decreasing from 2400 to 300, as shown in Fig. 2, and the ReLU (Rectified Linear Unit) function [90] is used to activate each neuron. The final hidden layer is connected to the output layer, which contains two neurons, representing a label of 1 for the signal and 0 for the backgrounds. A softmax activation function is used for the output layer.³ We use the Adam optimizer [91] with a learning rate of 10⁻⁴ to minimize the binary cross-entropy loss function. To prevent overfitting, we add an L₂ regularization term to the loss function.

Fig. 3 shows the performance of the DNN with various inputs. Our first observation is that the DNN with normalized input features⁴ leads to a slightly higher significance than with the bare inputs for most NNs. The variation in the significance for different inputs is also slightly narrower with the normalized features. Secondly, the performance of the DNN increases when more kinematic variables are used; here V_10-kin^Ref.[62] and V_8-kin^Ref.[63] denote the sets of kinematic variables used in Ref. [62] and Ref. [63], respectively. We find that when the DNN is trained with the 11 kinematic variables (V_11-kin), the significance increases by up to 10%-50% compared to the results using the four-momenta, over a wide range of the number of signal events (see the green dotted curve in the left panel). Interestingly, the 15 kinematic variables (V_15-kin), which include the kinematic variables built from the momenta of the reconstructed neutrinos, provide an additional steady ∼10% improvement in the significance (see the red and green curves in the left panel). It is worth noting that the 6 high-level variables (V_6-kin) add an orthogonal set of information to V_15-kin, which enables the DNN to better disentangle the backgrounds from the signal and brings an additional ∼10% improvement. Finally, as mentioned previously, while the relative improvement is diminished with the normalized input features (as shown in the right panel), the importance of the kinematic variables remains.

³ For each layer, we apply the ReLU activation function. The configuration of the output layer is the same for all other neural networks throughout this paper, unless otherwise mentioned.
⁴ We perform the linear transformation x_i → x'_i = a x_i + b for each input x_i such that x'_i ∈ [0, 1].
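A minimal NumPy sketch of this fully-connected architecture (forward pass only) is given below. The actual network is trained in PyTorch with Adam and a cross-entropy loss; here the intermediate layer widths are our own guess at an even 2400 → 300 taper, since only the endpoints are quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 8 hidden layers tapering from 2400 to 300 neurons, as described in the text;
# the intermediate widths are an assumption (even spacing).
HIDDEN = [2400, 2100, 1800, 1500, 1200, 900, 600, 300]

def init_network(n_inputs, n_classes=2):
    """He-initialized weights: input -> 8 hidden layers -> 2-class output."""
    sizes = [n_inputs] + HIDDEN + [n_classes]
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """ReLU through the hidden layers, softmax on the two output neurons."""
    for w, b in layers[:-1]:
        x = relu(x @ w + b)
    w, b = layers[-1]
    return softmax(x @ w + b)  # per event: [P(background), P(signal)]

# e.g. a batch of 4 events with the 16-dimensional V_pmu^(vis) input
probs = forward(init_network(16), rng.normal(size=(4, 16)))
```

Swapping the 16-dimensional input for the 21 kinematic variables only changes `n_inputs`; the rest of the architecture is unchanged.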

Convolutional Neural Networks
When the final state is fully represented by a set of images as in Eqs. (3.9)-(3.10), deep neural networks specialized for image recognition provide useful handles. One of the most commonly used algorithms is the convolutional neural network (CNN), shown in Fig. 4. The input to the CNN is the 3D image V^(C,N,ℓ,ν_H,ν_T)_image (V^(C,N,ℓ)_image), whose dimension is given by 5 × 50 × 50 (3 × 50 × 50), where 5 (3) denotes the number of channels. In order to exploit the spatial correlation among the different channels, we first apply a 3D convolution with a kernel size of 5 × 3 × 3 (3 × 3 × 3), a stride of 1, a padding size of 1, and 32 feature maps⁵. Next, we apply max-pooling with a kernel size of 2 × 2 × 2, without stride and padding, which reduces the image dimension down to 32 × 25 × 25.
Since this is effectively the same dimension as a 2D image with 32 feature maps, in what follows we apply 2D convolutions with a kernel size of 3 × 3, a stride of 1, and 32 feature maps, followed by max-pooling with a kernel size of 2 × 2. We repeat this procedure until the image dimension is reduced to 32 × 6 × 6. Each of these neurons is connected to 3 hidden DNN layers with 600 neurons, and the final DNN layer is connected to the output layer. To study the effect of the kinematic variables along with the image inputs, we slightly modify the NN structure, in which case these 3 hidden DNN layers are not used. Instead we construct a separate DNN consisting of 6 hidden layers with the number of neurons decreasing from 1200 to 600 (as shown in the upper-right corner of Fig. 4). The last layer with 600 neurons for the kinematic variables is combined with the output of the CNN of dimension 32 × 6 × 6. When training the network, we use the Adam optimizer, a learning rate of 10⁻⁴, a regularization term of weight_decay=5 × 10⁻⁴, a mini-batch size of 20, and 24 epochs. Fig. 5 shows the performance of CNNs with various inputs. First, we roughly reproduce the results (green, solid) presented in Ref. [65], taking V^(C,N)_image as input. It is important to check this, as all event generations and simulations are performed completely independently. Adding the lepton images, an overall significance of 0.9 ∼ 1 can be achieved with the V^(C,N,ℓ)_image image information alone, without kinematic variables (red, dotted), which is substantially larger than with V^(C,N)_image. Albeit not a substantial impact, the addition of the neutrino images (V^(C,N,ℓ,ν_H,ν_T)_image) helps improve the significance by up to ∼5% (red, solid). Finally, the addition of the V_21-kin kinematic variables increases the significance substantially, making the CNN with V_21-kin + V^(C,N,ℓ,ν_H,ν_T)_image the best network in terms of the signal significance (blue, solid).
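The dimension bookkeeping quoted above (50 → 25 → 12 → 6 after each convolution-plus-pooling block) follows from the standard output-size formula, assuming the padding of 1 is kept for the 2D convolutions (the text does not quote their padding explicitly):

```python
def conv_out(n, k=3, s=1, p=1):
    """Spatial output size of a convolution: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * p - k) // s + 1

def pool_out(n, k=2):
    """Max-pooling with kernel k and implicit stride k divides the width by k."""
    return n // k

n = 50                     # input images are 50 x 50
for _ in range(3):         # conv (keeps the width) followed by pool (halves it)
    n = pool_out(conv_out(n))
# n is now 6, matching the quoted final 32 x 6 x 6 feature maps
```

With a 3 × 3 kernel, stride 1 and padding 1, each convolution preserves the width, so only the three pooling steps shrink the image: 50 → 25 → 12 → 6.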

Residual Neural Networks
The image sets that we feed into the CNNs contain only a few activated pixels in each channel. Given such sparse images, the performance of the CNNs could be greatly diminished. One possibility to ameliorate this problem is to design much deeper neural networks. As a CNN goes deeper, however, its performance saturates or even starts degrading rapidly [92]. A residual neural network (ResNet) [93, 94] is one of the alternatives to the CNN, which introduces skip-connections that bypass some of the neural network layers, as shown in Fig. 6.
The input to the ResNet is the 3D image V^(C,N,ℓ,ν_H,ν_T)_image (V^(C,N,ℓ)_image). We apply a 3D convolution with a kernel size of 5 × 3 × 3 (3 × 3 × 3), a stride of 1, and 32 feature maps. Note that we apply batch normalization and the ReLU activation function after each convolutional layer, but we do not use max-pooling. We introduce a three-pronged structure: i) three successive 2D convolutions with a kernel size of 3 × 3, a stride of 1, and 32 feature maps, ii) two successive 2D convolutions with the same configuration, and iii) the skip-connection. All three paths are merged into a single node, which enables the ResNet to learn various features of the convolved images while keeping the image dimension unchanged. One way to reduce the dimensionality of the image is to change the striding distance, so we also consider a two-pronged structure: i) a 2D convolution with a stride of 2, and ii) a 2D convolution with a stride of 2 followed by another 2D convolution with a stride of 1. Both paths are again merged into a single node. These are the basic building blocks of our ResNet, and they are repeated several times in hybrid ways until the image dimension is brought down to 32 × 4 × 4. In parallel to the ResNet pipeline, we construct a DNN consisting of 6 hidden layers with the number of neurons decreasing from 1200 to 600. The inputs to the DNN are the kinematic variables. This step is similar to what was done for the CNN when including the kinematic variables in addition to the image inputs. The final neurons of the two pipelines are merged into the same layer, and subsequently connected to the output layer. We use a learning rate of 10⁻⁴, a regularization term of weight_decay=5 × 10⁻⁴, a mini-batch size of 20, and 11 epochs.
An overall significance of ∼1 − 1.25 can be achieved, as shown in Fig. 5, which marks a very high significance alongside the CNNs. The impact of the neutrino images turns out to be mild when the kinematic variables are included. This is partially because the reconstructed neutrino momenta (and the corresponding images) are byproducts of the Higgsness and Topness variables, which are already included in V_6-kin. The additional neutrino images are therefore largely redundant information for the ResNets and CNNs.
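The skip-connection at the heart of the ResNet can be illustrated with a toy single-channel residual block in NumPy. This is our own minimal sketch, not the network of Fig. 6: the block computes a residual F(x) with two convolutions and adds the unmodified input back before the final activation.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 3x3 'same' convolution (stride 1, zero padding 1, one channel)."""
    out = np.zeros_like(x)
    xp = np.pad(x, 1)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    """Two stacked convolutions plus the identity skip-connection.

    The block learns only a residual F(x) and outputs relu(F(x) + x), so the
    input (and the gradient) can bypass the convolutions entirely.
    """
    y = np.maximum(conv2d_same(x, w1), 0.0)
    y = conv2d_same(y, w2)
    return np.maximum(y + x, 0.0)  # skip-connection adds the input back
```

If the learned residual is zero, the block reduces to the identity on non-negative inputs, which is why very deep stacks of such blocks do not degrade the way plain convolutional stacks do.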

Capsule Neural Networks
The max-pooling method in a CNN selects the most active neuron in each region of the feature map and passes it to the next layer. This entails a loss of spatial information about where things are, so that the CNN is agnostic to the geometric correlations between pixels at higher and lower levels. The capsule neural network (CapsNet) [95, 96] was proposed to address these potential problems of the CNN, and Fig. 7 shows the schematic architecture of the CapsNet used in our analysis.
The input to the CapsNet is the 3D image V^(C,N,ℓ,ν_H,ν_T)_image (V^(C,N,ℓ)_image). We apply a 3D convolution with a kernel size of 5 × 3 × 3 (3 × 3 × 3), a stride of 1, and 32 feature maps. Note that we do not use max-pooling. We then apply a series of 2D convolutions with a kernel size of 3 × 3, a stride of 1, and 256 feature maps, until the image dimension is reduced to 256 × 40 × 40. The output neurons of the 2D convolutions are reshaped to obtain a collection of 8-dimensional primary capsule vectors, which contain the lower-level information of the input image. For each feature map there is a 40 × 40 array of primary capsules, and there are 32 feature maps in total. To sum up, there are N_caps = 40 × 40 × 32 = 51200 primary capsules formed in this way. We denote the primary capsule vectors as u_i, where the index i runs from 1 to N_caps. Each primary capsule is multiplied by a 16 × 8 weight matrix W_ij to obtain a 16-dimensional vector û_j|i which predicts the high-level information of the image, where j denotes a class label, 0 or 1. To construct a digital capsule vector v_j, we first take the linear combination of the prediction vectors û_j|i from all capsules in the lower layer and define the higher-level capsules s_j = Σ_i c_ij û_j|i, where the summation runs over the index i, and the routing weights c_ij are obtained by a softmax over coefficients b_ij, all of which are initialized to 0. The digital capsule vector is defined by applying a squash activation function to s_j,

v_j = ( ||s_j||² / (1 + ||s_j||²) ) ( s_j / ||s_j|| ),

where j again denotes a class label, 0 or 1. The final length of each digital capsule vector, ||v_j||, represents the probability of a given input image being identified as class j. The CapsNet adjusts the routing weights c_ij by updating the coefficients b_ij such that the prediction capsules û_j|i having larger inner products with the high-level capsules v_j acquire larger weights. The procedure from Eq. (4.2) to Eq. (4.6) is referred to as the routing-by-agreement algorithm [95], which we repeat three times in total to adjust the routing weights c_ij. The output digital capsule vectors v_j are used to define the margin loss function, with T_1 = 1 and T_0 = 0 for the signal and backgrounds respectively, m⁺ = 0.9, m⁻ = 0.1, and λ = 0.5. Another parallel routine, called a decoder, attempts to reconstruct the input image out of the digital capsule vectors v_j. The digital capsule vectors are fed into 3 DNN hidden layers with the number of neurons increasing from 512 to 12500 (7500), and reshaped into the input image size of 5 × 50 × 50 (3 × 50 × 50). The reconstruction loss function for the decoder is defined by the sum of squared differences in pixel intensities between the reconstructed (I_k^(reco)) and input (I_k^(input)) images, where the index k runs from 1 to the total number of pixels in the image, and N is a normalization factor defined by the total number of pixels times the total number of training events.

Figure 7. A schematic architecture of the capsule network model (CapsNet) used in this paper.
In order for the capsule vectors to learn features that are useful to reconstruct the original image, we add the reconstruction loss to the total loss function, modulated by an overall scaling factor α. Following the choice of Ref. [95], we set α = 5.0 × 10⁻⁴. When training the network, we used the Adam optimizer, a learning rate of 10⁻⁴, a regularization term of weight_decay=0, a mini-batch size of 20, and 11 epochs.

Figure 9. A schematic architecture of the matrix capsule network model (Matrix CapsNet) used in this paper.

Fig. 8 shows the performance of the CapsNet, where an overall significance of 0.8-0.9 can be achieved. This is slightly lower than the CNN with the same image inputs, but better than the CNN with V^(C,N) (see Fig. 5). Albeit not a substantial impact, the additional neutrino images help to improve the significance by up to ∼ 5%.

Matrix Capsule Networks
Despite the progress represented by the CapsNet, it has a number of deficiencies. First, it uses vectors of length n to represent features of an input image, which introduces n² parameters for each weight matrix. Second, its routing-by-agreement algorithm uses inner products to measure the accordance between two pose vectors (cf. Eq. (4.6)); this measure becomes insensitive when the angle between the two vectors is small. Third, it uses the ad hoc non-linear squash function to force the length of the digital capsule vector to be smaller than 1 (cf. Eq. (4.5)).
The aim of the Matrix CapsNet [97] is to overcome the above problems by generalizing the concept of the capsule. The major difference with respect to the original CapsNet is that each capsule is no longer a vector, but an entity which contains an n × n pose matrix M and an activation probability a. The pose matrix serves to recognize objects viewed from various angles, and it requires a smaller number of hyper-parameters in its transformation matrix.⁶

⁶ The same number of components as in a vector of length n can be contained in a √n × √n matrix. The former requires an n × n weight matrix, but the latter requires only a √n × √n weight matrix, giving rise to a significant reduction in the number of hyper-parameters.

The input to the Matrix CapsNet is again the 3D image V_image. We apply a 3D convolution with a kernel size of 5 × 3 × 3 (3 × 3 × 3), a stride of 1, and 64 feature maps. It is followed by a series of 2D convolutions with a kernel size of 3 × 3, a stride of 1 or 2, and 64 feature maps, until the image dimension is reduced to 64 × 22 × 22. Next, we apply a modified 2D convolution with a kernel size of 1 × 1 and a stride of 1, where each stride of the 1 × 1 convolution transforms the 64 feature maps into 8 capsules. As a result of this operation, we obtain a layer L of capsules whose dimension is 8 × 22 × 22 in units of capsules.
Each capsule i in the layer L encodes low-level features of the image, and is composed of a 2 × 2 pose matrix M_i and an activation probability a_i. To predict the capsule j in the next layer L + 1, which encodes high-level features, we multiply the pose matrix M_i by a 2 × 2 weight matrix W_ij,
V_ij = M_i W_ij,
where V_ij is the prediction for the pose matrix of the parent capsule j. Since each capsule i attempts to guess the parent capsule j, we call V_ij the vote matrix.
The 2 × 2 pose matrix of the parent capsule j, on the other hand, is modeled by 4 parameters μ_j^h (with h = 1, 2, 3, 4), which serve as the mean values of 4 Gaussian distribution functions with standard deviations σ_j^h. In this model, the probability of V_ij belonging to the capsule j is computed as
p(V_ij^h) = exp( −(V_ij^h − μ_j^h)² / (2(σ_j^h)²) ) / √(2π(σ_j^h)²),
where V_ij^h denotes the h-th component of the vectorized vote matrix. Using this measure, we define the cost C_ij of activating the parent capsule j, such that the lower the cost, the more probable it is that the parent capsule j in the layer L + 1 will be activated by the capsule i in the layer L. We take the linear combination of the costs from all the capsules i in the layer L,
C_j = Σ_i r_ij C_ij,
where each cost is weighted by an assignment probability r_ij. To determine the activation probability a_j in the layer L + 1, we use
a_j = logistic( λ (b_j − C_j) ),
where b_j and λ are a benefit parameter and an inverse temperature parameter, respectively.
Hyper-parameters such as W_ij and b_j are learned through back-propagation, while μ_j^h, σ_j^h, r_ij, and a_j are determined iteratively by the Expectation-Maximization (EM) routing algorithm.⁷ The inverse temperature parameter λ, on the other hand, is fixed to 10⁻³. After repeating the EM iteration 3 times, the last a_j is taken as the activation probability, and the final parameters μ_j^h (with h = 1, 2, 3, 4) are reshaped to form the 2 × 2 pose matrix of the layer L + 1.
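The EM routing iteration can be sketched as follows. This is a minimal, unbatched numpy illustration loosely following the EM routing of Ref. [97]: the benefit b_j is treated here as a fixed scalar rather than a learned parameter, and the cost expression is schematic, so this is not the exact implementation used in the analysis:

```python
import numpy as np

def gaussian_prob(V, mu, sigma):
    """p(V_ij | capsule j): product over the h components of 1D Gaussians."""
    return np.prod(np.exp(-(V - mu) ** 2 / (2 * sigma ** 2))
                   / np.sqrt(2 * np.pi * sigma ** 2), axis=-1)

def em_routing(V, a_in, n_iter=3, lam=1e-3, b_j=1.0):
    """V: (n_in, n_out, 4) vectorized 2x2 vote matrices; a_in: (n_in,) activations.
    Returns (mu, a_out): parent pose parameters and activation probabilities."""
    n_in, n_out, h = V.shape
    r = np.full((n_in, n_out), 1.0 / n_out)      # assignment probabilities r_ij
    for _ in range(n_iter):
        # M-step: Gaussian parameters of each parent from activation-weighted votes
        w = r * a_in[:, None]
        w_sum = w.sum(axis=0) + 1e-9
        mu = np.einsum('ij,ijh->jh', w, V) / w_sum[:, None]
        sigma = np.sqrt(np.einsum('ij,ijh->jh', w, (V - mu) ** 2)
                        / w_sum[:, None]) + 1e-9
        # schematic cost of activating parent j, then its activation probability
        cost = (w_sum[:, None] * (np.log(sigma) + 0.5)).sum(axis=-1)
        a_out = 1.0 / (1.0 + np.exp(-lam * (b_j - cost)))   # logistic(lam*(b_j - C_j))
        # E-step: reassign each vote according to the Gaussian likelihood
        p = gaussian_prob(V, mu[None], sigma[None]) * a_out[None]
        r = p / (p.sum(axis=1, keepdims=True) + 1e-9)
    return mu, a_out
```

After the final iteration, mu is reshaped into the 2 × 2 pose matrices of layer L + 1, as described in the text.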
The above prescription for computing the capsules in the layer L + 1 from the layer L can be systematically combined with the convolutional operation. Recall that we ended up with a capsule layer of dimension 8 × 22 × 22. Here, we apply a convolutional capsule operation with a kernel size of 5 × 5, a stride of 2, and 16 feature maps. This operation is similar to a regular CNN, except that it uses the EM routing algorithm to compute the pose matrices and the activation probabilities of the next layer. It is followed by another convolutional capsule operation, until the image dimension is brought to 16 × 7 × 7 in units of capsules. The capsules in the last layer are connected to the class capsules, which output one capsule per class. The final activation value of the capsule of a class j is interpreted as the likelihood of a given input image being identified as class j.
The final loss function is defined by
L = (1/N_train) Σ_events Σ_{j ≠ t} max(0, m − (a_t − a_j))²,
where a_t denotes the activation for the true class label t, a_j is the activation for class j, and N_train denotes the number of training events. If the difference between the true-class activation a_t and the activation for a wrong class j (≠ t) is smaller than the threshold m, the loss receives a penalty of (m − (a_t − a_j))² [97]. The threshold m is initially set to 0.1 and is linearly increased by 1.6 × 10⁻² per epoch of training, until it reaches 0.9 at the final epoch. When training the network, we used the Adam optimizer, a learning rate of 10⁻⁴, a regularization term of weight_decay=5 × 10⁻⁵, a mini-batch size of 20, and 20 epochs. Fig. 8 shows the performance of the Matrix CapsNet, where an overall significance of 0.7-0.8 can be achieved, slightly lower than that of the CapsNet. We find that the performance of the CapsNet and the Matrix CapsNet is comparable to or slightly worse than that of the CNN with the same image inputs (see Fig. 5).
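The spread loss and the growing margin can be sketched in a few lines (a numpy illustration for a single event; the function names and the per-epoch schedule helper are ours, with the 1.6 × 10⁻² step quoted above):

```python
import numpy as np

def spread_loss(a, target, m):
    """a: (n_classes,) activations; target: true class index; margin m in [0.1, 0.9].
    Penalizes wrong-class activations that come within m of the true one."""
    a_t = a[target]
    others = np.delete(a, target)
    return np.sum(np.maximum(0.0, m - (a_t - others)) ** 2)

def margin_schedule(epoch, m0=0.1, step=1.6e-2, m_max=0.9):
    """Linear growth of the margin per epoch, capped at m_max."""
    return min(m0 + step * epoch, m_max)
```

A well-separated event (a_t − a_j larger than m) contributes zero loss, so the slowly growing margin progressively tightens the classification requirement during training.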

Graph Neural Networks
Instead of using image-based neural networks, which may suffer from the sparsity of the image datasets, one can encode the particle information into a graph structure consisting of nodes and edges. Graph neural networks (GNNs) [98,99] take into account topological relationships among the nodes and edges in order to learn graph-structured information from the data. Each reconstructed object (in Eq. (3.1)), including the neutrinos obtained from the Higgsness (in Eq. (3.3)), is represented as a single node. Each node i has an associated feature vector x_i which collects the properties of a particle. The angular correlation between two nodes i and j is encoded in an edge vector e_ij. The first type of GNN architecture that we consider is a modified edge convolutional neural network (EdgeConv) [100], which efficiently exploits Lorentz invariants, such as an invariant mass, from the data [66,101]. Its schematic architecture is shown in Fig. 10. All nodes are connected with each other, and the node i in the input layer is represented by the 4-momentum feature vector x_i^0 = (p_x, p_y, p_z, E)_i, while the edge vectors e_ij are left unspecified. The EdgeConv predicts the features of a node in the next layer as a function of the neighboring features: each feature vector of a node i in the layer n is built from the direct sum (⊕) of the feature vectors in E, the set of nodes j in the layer n − 1 that are connected to the node i, where • and W^n denote the element-wise multiplication and a weight matrix, respectively. We do not include a bias vector or the ReLU activation function. This completes one round of the graph convolution, and we repeat it N = 3 times.
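One round of a fully connected edge convolution can be sketched as follows. Since the exact update rule is given in Refs. [66,100,101], this is only a generic numpy simplification: the direct sum is realized as a concatenation of the node's own features with the sum over its neighbors, and the element-wise-multiplication term of the paper's exact operator is omitted:

```python
import numpy as np

def edgeconv_layer(x, W):
    """One fully connected edge-convolution round (schematic).
    x: (n_nodes, d) node features; W: (d_out, 2*d) weight matrix.
    Each node aggregates (x_i (+) sum_{j != i} x_j) over its neighbors."""
    n, d = x.shape
    out = np.empty((n, W.shape[0]))
    for i in range(n):
        neigh = np.sum(np.delete(x, i, axis=0), axis=0)   # sum over neighbors j != i
        out[i] = W @ np.concatenate([x[i], neigh])
    return out
```

Repeating such a round N = 3 times, as in the text, lets each node's features depend on increasingly non-local combinations of the event's four-momenta.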
In the output layer, all feature vectors are concatenated and multiplied by a weight matrix, leading to a vector p̂ of dimension 2. We apply the sigmoid activation function to each component of p̂, p_k = sigmoid(p̂_k), where k stands for the class label, 0 or 1, and p_k represents the probability used to classify the signal and backgrounds. We use the cross entropy as the loss function, with target labels y_1 = 1 and y_0 = 0 for the signal and backgrounds, respectively. When training the network, we adopted the Adam optimizer with a learning rate of 9.20 × 10⁻⁷ and momentum parameters β_1 = 9.29 × 10⁻¹ and β_2 = 9.91 × 10⁻¹. We used a regularization term with weight_decay=3.21 × 10⁻², a multiplicative learning-rate decay factor γ = 1, a mini-batch size of 130, and 70 epochs.
The second type of GNN architecture is a message passing neural network (MPNN) [102], shown in Fig. 11. The node i in the input layer is represented by a feature vector x_i^0 = (I_ℓ, I_b, I_ν, m, p_T, E)_i, where m, p_T, and E denote the invariant mass, transverse momentum, and energy of the particle, respectively. The default values of the indicators I are zero: I_ℓ = 1 for the hardest lepton and I_ℓ = −1 for the second hardest lepton, I_b = 1 (−1) for the hardest (second hardest) b-jet, and I_ν = 1 (−1) for the hardest (second hardest) neutrino (note that we are using the reconstructed neutrino momenta from the Higgsness). All nodes are connected to each other, and the single-component edge vector e_ij is given by the angular separation ∆R_{x_i,x_j} between the two particles in the nodes i and j. The MPNN first preprocesses each input node i by multiplying x_i^0 by a weight matrix W^0. Each feature vector of a node i in the layer n is then built from the previous layer using two weight matrices, W^n and W′^n. This completes one round of the graph convolution, and we repeat it N = 3 times. In the output layer, each feature vector of a node i is multiplied by a weight matrix to become a vector p̂_i of length 2. We apply the sigmoid activation function to each component of p̂_i, where k stands for the class label, and p_{i1} represents the probability of the node i being identified as the signal. To identify the signal, we require the graph to pass a cut p_1 > p_1^cut, where p_1^cut is optimized to yield the best significance. We use the cross entropy as the loss function. When training the network, we used the Adam optimizer, a learning rate of 1.44 × 10⁻⁵, a regularization term of weight_decay=9.67 × 10⁻², β_1 = 9.16 × 10⁻¹, β_2 = 9.92 × 10⁻¹, γ = 9.81 × 10⁻¹, a mini-batch size of 164, and 21 epochs. We find that the EdgeConv and the MPNN show very good performance with the basic momentum input; their signal significance is shown in the right panel of Fig.
8, which clearly surpasses the performance of the FC network with the same input features (shown in Fig. 3). Moreover, before combining kinematic variables and image-based input features, the performance of the EdgeConv and the MPNN is comparable to or slightly better than that of the image-only networks such as the CapsNet, Matrix CapsNet, or CNN (see Fig. 5 and the left panel of Fig. 8). This comparison illustrates a couple of important points. First, one can improve upon a vanilla DNN with basic four-momentum inputs by introducing the evolution of nodes and edges in the MPNN/EdgeConv. Second, to bring an additional improvement in the signal significance, it is crucial to consider different types of inputs, such as image-based features and high-level kinematic variables in addition to the basic four-momenta, and to develop an NN architecture suitable for the corresponding input features.
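The MPNN input features described above can be sketched directly from a particle's four-momentum (a numpy illustration; node_features is our helper, with the indicator convention taken from the text):

```python
import numpy as np

def node_features(p4, kind, rank):
    """Build the input vector x_i = (I_l, I_b, I_nu, m, pT, E) for one particle.
    p4 = (px, py, pz, E); kind in {'lep', 'b', 'nu'}; rank 0 (hardest) or 1."""
    px, py, pz, E = p4
    m2 = max(E**2 - px**2 - py**2 - pz**2, 0.0)   # clamp tiny negative values
    flags = {'lep': [1, 0, 0], 'b': [0, 1, 0], 'nu': [0, 0, 1]}[kind]
    sign = 1.0 if rank == 0 else -1.0             # hardest: +1, second hardest: -1
    I = [sign * f for f in flags]
    return np.array(I + [np.sqrt(m2), np.hypot(px, py), E])
```

For example, the second hardest b-jet sets I_b = −1 with all other indicators zero, followed by its invariant mass, p_T, and energy.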

Comparison of different networks
In this section, we summarize the results of our exploration of different NN structures. We have tried (i) fully connected NNs with four-momenta and kinematic variables, (ii) CNN, ResNet, CapsNet and Matrix CapsNet with kinematic variables and image data, and (iii) EdgeConv and MPNN with four-momenta.
In the left panel of Fig. 12, we summarize the signal significance of double Higgs production at the HL-LHC with L = 3 ab⁻¹ for the various NNs, taking the best model for each type. We find that the CNN performs the best, followed by the ResNet with 21 kinematic variables and image inputs. The ResNet result is very comparable to the DNN with 21 kinematic variables (no image data is used for the fully connected NNs). These three NN algorithms lead to similar performance, with a signal significance of 1.2-1.3 for a signal event count around 20. As shown in Fig. 5 and in Table 2, the addition of the lepton image (V^(C,N,ℓ)) to the charged- and neutral-hadron images (V^(C,N)) helps improve the significance (see green-solid and red-dotted). Moreover, the full set of images (V^(C,N,ℓ,ν_H,ν_T)), including the neutrino momentum information, brings a further increase (see red-solid). This substantial improvement is attributed to the inclusion of the 21 kinematic variables along with all image inputs, which leads to a significance of about 1.3 for N_S = 20 signal events. This gain in the signal significance over the existing results [65,66] is due to the interplay between novel kinematic variables and machine learning algorithms. It is noteworthy that one can form image data out of leptons and reconstructed neutrinos and obtain an improved result. We have checked that the CNN with 21 kinematic variables and image inputs outperforms the network structures used in the literature, labeled as V^Ref.[62]_10-kin and V^Ref.[63]_8-kin in Table 2. The FC network with more kinematic variables leads to a higher significance, even without four-momentum inputs, as illustrated in Table 2. When using V_11-kin, the significance drops, which indicates that it is crucial to choose the right kinematic variables to reach the maximum significance, and V_21-kin is the right choice. The second class of algorithms includes the CapsNet, Matrix CapsNet and MPNN, which lead to a signal significance of around 0.8-1. With four-momentum inputs only (without any kinematic variables or images), we find that the MPNN performs the best, reaching a significance of ∼ 1.

Table 2. Signal and background cross sections (in fb) after cutting on the NN score, for various combinations of NN architectures and inputs. The NN score is chosen such that the number of signal events is approximately 20 (N_S ≈ 20). The significance σ is calculated using the log-likelihood ratio for a luminosity of 3 ab⁻¹ at the 14 TeV LHC. Note that the efficiency of the NN score cut can be calculated by taking the ratio of the cross sections in this table to those in Table 1.
In the right panel of Fig. 12, we show the variation of the final results over 10 independent runs of the same NNs with different initial values of the weights, for the various NNs shown in the left panel. This exercise serves as an estimate of the uncertainties associated with the NNs. As illustrated in Fig. 12, our results are stable under multiple runs.
Taking N_S ≈ 20 as the benchmark number of signal events, we summarize in Table 2 the signal and background cross sections (in fb) as well as the significance for various combinations of NN architectures and inputs, after cutting on the NN score. The significance σ is calculated using the log-likelihood ratio (Eq. (4.1)) for a luminosity of 3 ab⁻¹ at the 14 TeV LHC. Note that the signal and background efficiencies with respect to the cut on the NN score can be estimated by taking the ratio of each cross section in this table to the one in Table 1.
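For Poisson counts, the log-likelihood-ratio significance reduces to the familiar Asimov form; a sketch, assuming this standard expression is what Eq. (4.1) evaluates to for a single bin:

```python
import numpy as np

def significance(s, b):
    """Median discovery significance from the Poisson log-likelihood ratio:
    Z = sqrt(2 [ (s + b) ln(1 + s/b) - s ])."""
    return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))
```

With the benchmark of this analysis, N_S = 20 signal events over N_B = 220 background events, this expression gives a significance of about 1.3, consistent with the numbers quoted above.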
As an illustration, Fig. 13 shows the NN score distribution for the CNN with V_21-kin + V^(C,N,ℓ,ν_H,ν_T), which gives the best significance. We scan over the lower bound on the NN score and count the surviving numbers of signal and background events. For N_S = 20 signal events, we obtain 220 total background events, comprising (96, 52, 39, 10, 22, 1) events for the individual backgrounds (tt, tW, tth, ttV, bj, ττbb), respectively. The tt process makes up about 44% of the total background after the cut on the NN score, while tW and tth + ttV account for about 24% and 22%, respectively; the bj + ττbb background makes up roughly 10%. After the baseline selection (before cutting on the NN score), the tt contribution was 97%, while tW was ∼ 2%, as shown in Table 1. The signal efficiency of the CNN is ε_S ≈ 23%, while the background rejection is 1/ε_B ≈ 3500, with ε_B = 2.8 × 10⁻⁴. The background-to-signal ratio is therefore reduced to σ_bkgd/σ_hh ≈ 11 from σ_bkgd/σ_hh ≈ 9250 (see Table 1).
We show in Fig. 14 the receiver operating characteristic (ROC) curves for selected NN architectures evaluated on the same test sample, taking the signal efficiency ε_S for the x-axis and the background rejection 1/ε_B for the y-axis. The area under the ROC curve (AUC) is another commonly used quantity to assess the performance of a classification model; we quote a few sample AUC values in the (ε_S, 1 − ε_B) plane in Fig. 14.
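The ROC curve and its AUC can be computed directly from per-event NN scores; a small numpy sketch (function name ours, trapezoidal integration over the background efficiency):

```python
import numpy as np

def roc_auc(scores_sig, scores_bkg):
    """ROC curve (eps_S, eps_B at each score threshold) and its AUC."""
    thresholds = np.sort(np.concatenate([scores_sig, scores_bkg]))[::-1]
    eps_s = np.array([(scores_sig >= t).mean() for t in thresholds])
    eps_b = np.array([(scores_bkg >= t).mean() for t in thresholds])
    # trapezoidal area under eps_S as a function of eps_B
    auc = np.sum((eps_b[1:] - eps_b[:-1]) * (eps_s[1:] + eps_s[:-1]) / 2.0)
    return eps_s, eps_b, auc
```

A perfect classifier approaches AUC = 1, while a random one gives AUC = 0.5.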

Discussion and outlook
In this paper, we have investigated the discovery potential of the HL-LHC for double Higgs production in the final state with two b-tagged jets, two leptons and missing transverse momentum. We have utilized novel kinematic variables and particle images (including reconstructed neutrinos) along with various machine learning algorithms to increase the signal sensitivity over the large backgrounds.
Our study provides a number of meaningful and interesting results. First, from the collider kinematics point of view, we have shown the importance of high-level kinematic variables in NNs. Without suitable kinematic variables, NNs with images only cannot achieve optimal results. We also found a minor improvement in the final significance when using kinematic variables constructed with the neutrino momenta obtained via the Topness and Higgsness minimization. Second, for the various machine learning methods, we have made a dedicated comparison with different types of input features: low-level four-momenta, high-level kinematic variables and particle images. We have observed that the CNN outperforms most other NNs, and is comparable to or slightly better than the DNN. We also illustrated the importance of high-level kinematic variables in the DNN as well as in the CNN/ResNet. One of the important results of our study is that the signal significance is roughly stable around ∼ 1 for the various machine learning algorithms with different choices of input features. We have checked that the variation of the NN performance within each algorithm is also stable when repeating the same training with new random seeds. Finally, from the physics point of view, we have shown that the expected significance is ∼ 1.3 for ∼ 20 signal events with the improved b-tagging efficiency. This is due to the interplay of clever kinematic variables and flexible NNs.

Table 3. Summary of the significance of the individual channels and their combination at the HL-LHC. Results for the first 5 channels and their combinations are taken from Ref. [5], while the last two rows are combined results, replacing the existing result with the new results presented in this paper. We use 1.3 for the significance of hh → bbVV(ℓνℓν) from this study, and assume that the systematics bring about a 10% reduction in the significance. As in Ref. [5], we assume that ATLAS and CMS would have the same significance, as the analyses in the bbVV(ℓνℓν) and bbZZ(4ℓ) channels have only been performed by the CMS collaboration.
Such a high and stable significance has a large impact on the double Higgs program. In particular, the dilepton channel would make a sizable contribution to the combined significance. Our study also motivates a similar analysis in the semi-leptonic channel, whose significance is known to be much smaller than that in the dilepton channel.

Figure 15. Expected 3σ significance of observing Higgs boson pair production (left) and 95% C.L. exclusion (right) in the (κ_3, α) plane at the HL-LHC with 3 ab⁻¹. We used the binned log-likelihood analysis with statistical uncertainties only, assuming the same efficiencies for all (κ_3, α) values as for (κ_3, α) = (1, 0) (the SM point, marked in the figure). Contours of the double Higgs production cross section (in fb) are shown as black solid curves. The yellow shaded region is obtained using the results of this study for the dilepton channel (hh → bbWW* → bbℓνℓν). The red dashed curve is obtained by combining the three channels bbbb + bbτ⁺τ⁻ + bbγγ following Ref. [45], while the blue dashed curve includes all four channels. The horizontal black dotted line represents a sample 95% exclusion on the CP angle, |α| ≲ 35°, from the dilepton channel of tth production with h → bb [103].

Table 3 shows the expected
signal significance of the five individual channels and their combination at the HL-LHC with 3 ab⁻¹, taken from Ref. [5]. The significances are added in quadrature, with the channels treated as uncorrelated, assuming that the systematic uncertainties which are expected to be correlated between the experiments, such as the theory and luminosity uncertainties, have little impact on the individual results. Since the analyses in the bbVV(ℓνℓν) and bbZZ(4ℓ) channels have only been performed by the CMS collaboration, the likelihoods for those two channels are scaled to 6 ab⁻¹ in the ATLAS+CMS combination, leading to 4.5 without systematics and 4.0 with systematics. Similar results are found in Ref. [2]. In the two bottom rows, we show the updated combined results, replacing the existing results (0.59 without systematics and 0.56 with systematics) with the new result (1.3 without systematics) presented in this paper. The total systematic uncertainties in current analyses are about 20% [63,64], while Ref. [54] considers somewhat optimistic scenarios and scans the systematic uncertainty up to 5% in the expected relative uncertainty of the signal yield. A detailed estimation of the systematic uncertainty is beyond the scope of our study. Instead, we simply assume a 10% reduction in the signal significance to account for the systematics in the bbVV(ℓνℓν) channel, which is not unreasonable, since the significance reduction in the CMS study of the same channel is about 5%. This naive estimate of the systematic uncertainty reduces the ATLAS (CMS) significance from 3.8 to 3.2 (from 3.0 to 2.8). Considering both ATLAS and CMS, the expected significance including our results becomes 4.8 without systematics and 4.2 with systematics.
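The quadrature combination can be verified numerically (a sketch; following the text, we assume the bbVV(ℓνℓν) entry contributes twice, once per experiment, in the statistics-only combination):

```python
import numpy as np

def combine_in_quadrature(z_channels):
    """Combined significance for uncorrelated channels."""
    return float(np.sqrt(np.sum(np.square(z_channels))))

# Replace the existing bbVV(lnulnu) entry (0.59 per experiment) with this
# study's 1.3, starting from the Ref. [5] combination of 4.5 (stat. only):
z_old, z_new, z_comb = 0.59, 1.3, 4.5
z_updated = np.sqrt(z_comb**2 - 2 * z_old**2 + 2 * z_new**2)   # ATLAS + CMS
```

This swap reproduces the updated statistics-only combination of about 4.8 quoted above.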
The double Higgs production exhibits a non-trivial interference between the box and triangle diagrams. In Fig. 15, we show the production cross section of the double Higgs (black solid) in the two-dimensional parameter space (κ_3, α), where α is the CP angle in the tth coupling (see Ref. [104] for the impact of double Higgs production on the top quark Yukawa coupling). The marker represents the SM point (κ_3, α) = (1, 0), which happens to be near the minimum of the cross section at (κ_3, α) ≈ (2.5, 0). Although the signal cross section is too small to measure double Higgs production with this single channel alone, one can set an exclusion limit, as shown in the right panel of Fig. 15. A significance for exclusion (σ_excl) can be calculated using a different likelihood ratio. For an exclusion at 95% C.L., we demand σ_excl ≥ 2, leading to σ^bbWW_excl ≈ 105 fb; the corresponding yellow shaded region in Fig. 15 would be excluded at 95% C.L. at the HL-LHC with 3 ab⁻¹, assuming the efficiencies are the same as for the SM production. Although this is roughly a reasonable approximation, a dedicated study should be performed to obtain more precise bounds. After reproducing the signal significances (1.4, 2.5 and 2.1) for the bbbb, bbτ⁺τ⁻ and bbγγ channels shown in Table 3 following Ref. [45], we have confirmed the individual 2σ bounds on the double Higgs production cross section: σ^bbbb_excl ≈ 100 fb, σ^bbγγ_excl ≈ 95 fb, and σ^bbττ_excl ≈ 76 fb. We have then calculated the combined 95% C.L. exclusion, σ^(bbbb+bbγγ+bbττ)_excl ≈ 66 fb, using the binned likelihood ratio, shown as the red dashed curve in Fig. 15. The addition of our results in bbWW* → bbℓνℓν improves this further, σ^(bbbb+bbγγ+bbττ+bbWW)_excl ≈ 64 fb, shown as the blue dashed curve. Independent bounds on the top-Higgs CP phase α can be obtained by studying tth production.
As an illustration, we have added the horizontal black dotted line, which represents the 95% exclusion on the CP angle from the dilepton channel of tth production with h → bb [103]. Similarly, the expected 3σ significance of observing Higgs boson pair production is shown in the left panel, corresponding to σ^bbWW_3σ ≈ 92 fb, σ^(bbbb+bbγγ+bbττ)_3σ ≈ 34 fb, and σ^(bbbb+bbγγ+bbττ+bbWW)_3σ ≈ 31 fb. The novel kinematic variables that we have introduced can be used in other channels. For example, the semi-leptonic channel hh → bbWW* → bbℓνjj is known to give a very poor signal significance [62], although a somewhat promising result was obtained in Ref. [105] using jet substructure techniques. A recent study [52] applied the Topness and Higgsness to hh → bbWW* → bbℓνjj and showed that a suitable cut retains 69% of the signal while keeping only 1.2% of the background. We expect that these new kinematic variables could help improve the signal-background separation in the semi-leptonic channel as well as in other channels such as hh → γγWW* and hh → WW*WW*. Finally, we note that machine learning methods could also help in the semi-leptonic and fully hadronic channels, given their effectiveness in resolving the combinatorial problems there [106-110].

A A brief review on kinematic variables
In this appendix, we provide a short review of the kinematic variables used in this paper. For more details, we refer to Refs. [111,112]. We consider the following 21 kinematic variables in total:
• p_T(ℓ_1) and p_T(ℓ_2) are the transverse momenta of the hardest and the second hardest lepton, respectively.
• p_T^ℓℓ and p_T^bb are the transverse momenta of the two-lepton system and of the two b-tagged jets, respectively.
• /P_T is the missing transverse momentum.
• ∆R_ℓℓ and ∆R_bb are the angular separations between the two leptons and between the two b-tagged jets, respectively. The angular distance is defined by
∆R_ij = √( (∆φ_ij)² + (∆η_ij)² ),
where ∆φ_ij = φ_i − φ_j and ∆η_ij = η_i − η_j denote the differences in the azimuthal angles and rapidities, respectively, between particles i and j.
• ∆φ_{bb,ℓℓ} is the angular separation between the bb system and the two-lepton system.
• m_ℓℓ and m_bb are the invariant masses of the two leptons and of the two b-tagged jets, respectively.
• min[∆R_bℓ] is the smallest angular distance between a b-jet and a lepton out of the 4 possible combinations, shown in the upper-left corner of Fig. 17.
• √ŝ_min^(bbℓℓ) and √ŝ_min^(ℓℓ) are the minimum √ŝ of the (bbℓℓ) subsystem and the (ℓℓ) subsystem, respectively. The ŝ_min variable [113,114] is defined (for massless invisible particles) as
√ŝ_min^(v) = √( m_v² + (P_T^v)² ) + /P_T,
where (v) represents a system of visible particles, and m_v (P_T^v) is the corresponding invariant mass (transverse momentum). It provides a way to approximate the Mandelstam variable ŝ of the system (v) in the presence of missing momenta. Here we consider two systems, v = {bbℓℓ} and v = {ℓℓ}. First, √ŝ_min^(bbℓℓ) represents the minimum energy required to produce two Higgs bosons (two top quarks) in the signal (tt) events. Second, √ŝ_min^(ℓℓ) represents the minimum energy required to produce two W bosons; in the case of the tt background, the peak of the distribution appears near 2m_W, while in the case of the signal it appears around the Higgs mass.
• M_T2 is defined by the minimization over the transverse masses of the parent particles M_TP_i (i = 1, 2), performed by scanning the unknown transverse neutrino momenta p_νT and p_ν̄T under the /P_T constraint,
M_T2 = min_{p_νT + p_ν̄T = /P_T} [ max( M_TP_1, M_TP_2 ) ],
where m̃ is the test mass for the daughter particle. See Refs. [112,116-123] for more information and other variants of M_T2.
When M_T2 is applied to the bb visible system, we abbreviate it as M_T2^(b). In this case, the daughter particles are the W bosons, and hence we set m̃ = m_W = 80 GeV. By construction, M_T2^(b) is bounded by the mass of the corresponding parent particle, so that its distribution exhibits a sharp drop around M_T2^(b) = m_t for tt events. Double Higgs production, on the other hand, does not obey this bound, which in turn can be used to disentangle the two samples.
When M_T2 is applied to the ℓ⁺ℓ⁻ visible system, we abbreviate it as M_T2^(ℓ).
• Topness measures the degree of consistency of a given event with tt production. Topness itself is a minimized chi-square value constructed from four on-shell constraints, m_t, m_t̄, m_W⁺ and m_W⁻, and 6 unknowns (the three-momenta of the two neutrinos, p_ν and p_ν̄) subject to the constraint /P_T = p_νT + p_ν̄T. Due to the twofold ambiguity in pairing a b-jet and a lepton, we define Topness as the smaller of the two chi-square values.
• Higgsness, analogously, measures the degree of consistency of a given event with the hh → bbWW* → bbℓνℓν hypothesis. The off-shell W mass is bounded from above, m_W* ≤ m_h − m_W, and the location of its peak can be estimated analytically. The peak of the invariant mass distribution m_νν of the two neutrinos appears at m_νν^peak ≈ 30 GeV. The definitions of Topness and Higgsness involve σ hyper-parameters which account for experimental uncertainties and particle widths. In our numerical study, we use σ_t = 5 GeV, σ_W = 5 GeV, σ_W* = 5 GeV, σ_h = 2 GeV, and σ_ν = 10 GeV; our results are not sensitive to these numerical values. The Topness and Higgsness distributions are shown in Fig. 16. See Ref. [124] for a heavy Higgs decaying to two W bosons.
• ∆R_νν^H (left-middle panel in Fig. 17) and ∆R_νν^T (right-top panel in Fig. 17) are the angular separations of the two neutrinos reconstructed using the Higgsness and Topness, respectively. They are expected to resemble ∆R_ℓℓ.
• m_νν^H (left-bottom panel in Fig. 17) and m_νν^T (right-middle panel in Fig. 17) are the invariant masses of the two neutrinos reconstructed using the Higgsness and Topness, respectively. They are expected to resemble m_ℓℓ.
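Three of the variables above can be sketched in a few lines of numpy. These are illustrative helpers (all names ours): the visible particles are treated as massless in the M_T2 helper, and the brute-force grid scan over MET splittings is a crude stand-in for the dedicated minimizers of Refs. [116-123]:

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the azimuthal difference wrapped into (-pi, pi]."""
    dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)

def sqrt_s_min(m_v, p_vT, met):
    """sqrt(s_min) for a visible system of invariant mass m_v and transverse
    momentum p_vT, assuming massless invisible particles carrying met."""
    return np.sqrt(m_v**2 + p_vT**2) + met

def transverse_mass(p_vis, q_inv, m_test):
    """M_T of one massless visible particle and an invisible of test mass m_test."""
    et_v = np.linalg.norm(p_vis)
    et_i = np.hypot(np.linalg.norm(q_inv), m_test)
    return np.sqrt(max(m_test**2 + 2.0 * (et_v * et_i - p_vis @ q_inv), 0.0))

def mt2(p_vis1, p_vis2, met, m_test, n=61, span=200.0):
    """Brute-force MT2: scan a grid of splittings of the MET between the two
    invisible particles and minimize the larger transverse mass."""
    best = np.inf
    for qx in np.linspace(-span, span, n):
        for qy in np.linspace(-span, span, n):
            q1 = np.array([qx, qy])
            q2 = met - q1
            best = min(best, max(transverse_mass(p_vis1, q1, m_test),
                                 transverse_mass(p_vis2, q2, m_test)))
    return best
```

Since E_T,vis E_T,inv ≥ p_vis · q_inv for every splitting, the M_T2 value never falls below the test mass m̃, in accordance with its use as a parent-mass bound.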
Finally, we would like to comment on the use of the reconstructed neutrinos. Some of the 21 kinematic variables defined in the laboratory frame reflect the global properties of the signal and background processes. For example, the distributions of the invariant masses (m_ℓℓ and m_bb) and the angular variables (∆R_ℓℓ, ∆φ_{bb,ℓℓ}, min[∆R_bℓ], etc.) show the pencil-like production of hh and the isotropic production of tt. These properties are better measured via shape variables in the center-of-momentum (CM) frame. Since we have obtained approximate momenta of the missing neutrinos, we are able to Lorentz-boost to the CM frame of the production event by event. Among the many shape variables, we consider the sphericity and the thrust.
The thrust (T) is defined as
T = max_n ( Σ_i |p_i · n| / Σ_i |p_i| ),
where the unit vector n which maximizes T is identified as the thrust axis n_T. This definition implies that for T = 1 the event is perfectly back-to-back, while for T = 1/2 the event is spherically symmetric. The sphericity (S) provides global information via the momentum tensor
S^ij = Σ_a p_a^i p_a^j / Σ_a |p_a|²,
where i, j are the spatial indices and the sum runs over all jets. The ordered eigenvalues λ_i (λ_1 > λ_2 > λ_3), with the normalization condition Σ_i λ_i = 1, define the sphericity
S = (3/2)(λ_2 + λ_3). (A.11)
The sphericity axis n_S is defined along the direction of the eigenvector of λ_1. The sphericity measures the total transverse momentum with respect to the sphericity axis defined by the four-momenta in the event; in other words, the sphericity of an event is a measure of how close the spread of energy in the event is to spherical. The allowed range is 0 ≤ S < 1: S ≈ 0 for a pencil-like production, while S ≈ 1 for an isotropic production. We obtain the neutrino momenta using the Higgsness and Topness, which in fact give statistically similar distributions, as seen in the neutrino images. Fig. 18 shows the thrust (left), the sphericity (middle) and thrust vs. sphericity (right) in the lab frame (top) and the CM frame (bottom) for the hh (red) and tt (blue) production, using the neutrino momenta obtained from the Topness minimization. We do not show the results with the Higgsness, as they are very similar.
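The two event shapes can be sketched as follows (numpy; the random-direction search for the thrust axis is our crude approximation, since exact thrust algorithms enumerate momentum combinations, while the sphericity follows Eq. (A.11) directly):

```python
import numpy as np

def thrust(momenta, n_dirs=2000, seed=1):
    """T = max_n sum_i |p_i . n| / sum_i |p_i|, maximized over random unit vectors."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_dirs, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    num = np.abs(momenta @ dirs.T).sum(axis=0)      # sum_i |p_i . n| for each n
    return num.max() / np.linalg.norm(momenta, axis=1).sum()

def sphericity(momenta):
    """S = 3/2 (lambda_2 + lambda_3) from the normalized momentum tensor."""
    S_tensor = (momenta.T @ momenta) / np.sum(momenta ** 2)
    lam = np.sort(np.linalg.eigvalsh(S_tensor))[::-1]   # lambda_1 > lambda_2 > lambda_3
    return 1.5 * (lam[1] + lam[2])
```

A perfectly back-to-back (pencil-like) event gives T ≈ 1 and S = 0, while an isotropic configuration gives S ≈ 1, matching the limits quoted above.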