Portraying Double Higgs at the Large Hadron Collider

We examine the discovery potential for double Higgs production at the high luminosity LHC in the final state with two $b$-tagged jets, two leptons and missing transverse momentum. Although this dilepton final state has been considered a difficult channel due to the large backgrounds, we argue that it is possible to obtain sizable signal significance, by adopting a deep learning framework making full use of the relevant kinematics along with the jet images from the Higgs decay. For the relevant number of signal events we obtain a substantial increase in signal sensitivity over existing analyses. We discuss relative improvements at each stage and the correlations among the different input variables for the neutral network. The proposed method can be easily generalized to the semi-leptonic channel of double Higgs production, as well as to other processes with similar final states.


Introduction
The discovery of the Higgs boson [1,2] jumpstarted the comprehensive program of precision measurements of all Higgs couplings. While the Higgs boson couplings to fermions and gauge bosons are in good agreement with the Standard Model (SM) predictions [3], the Higgs self-couplings are difficult to measure experimentally [4][5][6][7][8][9][10][11][12]. Yet, the knowledge of those couplings is crucial for understanding the exact mechanism of electroweak symmetry breaking and the origin of mass in our universe. It is also a guaranteed physics target which can be probed at the upgraded Large Hadron Collider (LHC) or at future colliders. The resulting experimental constraints on the Higgs self-couplings will have an immediate and long-lasting impact on model-building efforts beyond the SM.
We parameterize the Higgs self-interaction as follows: where m h is the mass of the SM Higgs boson (h), v ≈ 256 GeV is the Higgs vacuum expectation value, h 2v 2 are the SM values for the Higgs self-couplings, while κ 3 and κ 4 parametrize the corresponding deviations from them. In order to access κ 3 (κ 4 ), one has to measure the process of JHEP09(2019)047 double (triple) Higgs boson production at the LHC, possibly with high luminosity (HL), or at future colliders.
The recent study in ref. [39] introduced some new ideas for reducing the SM backgrounds in this channel. For example, the new variables Topness and Higgsness were designed to test whether the event kinematics is consistent with tt or hh, respectively. The use of Topness and Higgsness already effectively reduced the tt background to a manageable level, and additional variables were then employed to handle the remaining SM background processes -e.g., the subsystem variable M ( ) T 2 is effective in eliminating background arising from τ decays. In this paper, we supplement the novel kinematic method from ref. [39] with the analysis of the jet image in the h → bb decay, where the basic idea is to treat the detector as a camera and the streams of jets as an image [44][45][46][47][48][49][50][51]. In our case, the collimated nature of the Higgs decay will hopefully differ from the patterns obtained in SM production processes. In addition, we adopt a deep learning framework in our main analysis, since it is known that modern deep learning algorithms trained on jet images provide improved signal-to-background discrimination [45][46][47][48][49][50][52][53][54].
The analysis presented in this paper contains a number of improvements in comparison to previous studies: • Unlike the customized detector simulation performed in ref. [39], here we employ Delphes [55] to simulate detector effects such as detector resolution, reconstruction efficiency, etc., and Fastjet [56] for jet-reconstruction.
• We use deep learning framework to optimize the cuts, which further increases the significance compared to the conventional cut-and-count as performed in ref. [39].
• We exploit an enlarged set of relevant variables which consists of the 10 variables originally considered in ref. [43]: p T 1 , p T 2 , / P T , m , m bb , ∆R , ∆R bb , p T bb , p T , and ∆φ bb, , supplemented with the six recent variables from ref. [39]: Topness, Higgsness, M • We include a SM background process, tW production, which was missing from all previous discussions of this channel, yet it turns out to be the next dominant background once the tt background is under control.

JHEP09(2019)047
• The fact that the Higgs boson h is a color-singlet allows us to use the jet image of the h → bb decay for further background suppression [45][46][47][48]50].
• We examine the effect of pile-up, which was missing from previous studies. The expected average number of pile-up µ at the HL-LHC is O(200) collisions per bunch crossing [57]. Thus for any precision measurements, it is crucial to have a strategy in place to ensure that pile-up effects do not jeopardize the analysis. Here we choose to apply the Soft Drop algorithm [58] for QCD analyses, which is a powerful pile-up mitigation technique. In order to reduce pile-up effects on the relevant kinematic variables, we adopt the definition for a missing transverse momentum from ATLAS, which excludes contributions from soft neutral particles [59].
Our results show that the dominant tt background can be significantly reduced until it is comparable to the other subdominant backgrounds, i.e., after all cuts, we find that all SM backgrounds contribute at similar levels. This reduction can be accomplished without sacrificing too much of the signal rate, which leads to an improved signal significance. Our study indicates that the dilepton channel from hh → bbW + W − could contribute to the combined significance for hh discovery on par with the other final states, making double Higgs production sooner accessible at the HL-LHC.
This paper is structured as follows. We begin our discussion of the SM backgrounds and present the details of our simulation in section 2. In the following two sections 3 and 4, we provide some basic information on the kinematic variables used later in the analysis and on jet images, respectively. Then in section 5 we discuss how we set up our analysis in a deep learning framework. Section 6 presents our results, while section 7 is reserved for the discussion and conclusions. We include a brief review on deep neural networks in appendix A.

Event generation and detector simulation
Parton-level signal and background events were generated using MadGraph5 aMC@NLO v2.6 [60] with the default NNPDF2.3QED parton distribution functions [61] at leading order QCD accuracy at the √ s = 14 TeV LHC. The default dynamical renormalization and factorization scales were used. We assume 3ab −1 of luminosity throughout this paper. Parton-level events were generated with the following cuts: p T j > 20 GeV, p T b > 20 GeV, p T γ > 10 GeV, p T > 10 GeV, η j < 5, η b < 5, η γ < 2.5, η < 2.5, ∆R bb < 1.8, ∆R < 1.3, 70 GeV < m jj , m bb < 160 GeV and m < 75 GeV. For jj νν, bj and tW + j backgrounds, we impose 5 GeV < m < 75 GeV additionally. Here the angular distance ∆R ij is defined by where ∆φ ij = φ i − φ j and ∆η ij = η i − η j are respectively the differences of the azimuthal angles and rapidities between particles i and j.
The double Higgs production cross-section is normalized to σ hh = 40.7 fb, the next-tonext-to-leading order (NNLO) accuracy in QCD [62]. Considering all relevant branching JHEP09(2019)047 fractions, we obtain signal cross section σ hh · 2 · BR(h → bb) · BR(h → W W * → + − νν) = 0.648 fb, where denotes an electron or a muon, including leptons from tau decays. The major background is tt production, whose cross section is normalized to the NNLO QCD cross-section 953.6 pb [63]. Another important background is tth, which is normalized to the next-to-leading order (NLO) QCD cross-section of 611.3 fb [64]. For the ttV (V = W ± , Z) background, we apply an NLO k-factor of 1.54, resulting in a cross-section of 1.71 pb [65]. We apply an NLO k-factor of 1.0 for the Drell-Yan type backgrounds bj and τ τ bb, where j denotes partons in the five-flavor scheme. Note that a recent study indicates that k N N LO,DY QCD⊗QED ≈ 1 [66]. The irreducible jj νν background from the mixed QCD+EW process is included with k N LO = 2. Finally, we generate tW + j events with up to one additional matched jet (in the five-flavor scheme), whose cross-section turns out to be 0.51 pb (after the cuts) including all relevant branching fractions. As we try to reconstruct events, off-shell effects for the top quark and W boson need to be taken care of properly. We generate parton level events with MadGraph5, which includes the proper treatment of the off-shell effects for the top quark and the W boson for both signal and all backgrounds.
Events are further processed for parton-shower/hadronization using Pythia8235 [67]. We use Delphes 3.4.1 [55] for simulating the detector effects and Fastjet 3.3.1 [56] for jet-reconstruction, with modified ATLAS settings as follows.
• Jets are clustered with the anti-k T algorithm [68] with cone-size ∆R = 0.4, where ∆R is the distance (2.1) in the (φ, η) space. For the analysis, we consider jets with p T j > 20 GeV and |η j | < 2.5.
• We use the a flat b-tagging efficiency, b→b = 0.75, and flat mis-tagging rates for non-b jets of c→b = 0.1 and j→b = 0.01 [57].
• For lepton isolation, we require where the sum is taken over the transverse momenta p T i of all final states particles i, i = , with p T i > 0.5 GeV and within ∆R i < 0.3 of the lepton candidate . Leptons are also required to have p T > 10 GeV and |η | < 2.5.
• For photon isolation, we analogously require i p T i p T γ < 0.12 for particles within ∆R iγ < 0.3 of the photon candidate γ. Photons are also required to have p T γ > 25 GeV and |η γ | < 2.5.
• The missing transverse momentum / P T is defined as the negative vector sum of the transverse momenta of the accepted leptons, photons, jets and soft tracks as follows [59]; Here the last term is added to consider unused soft tracks. These tracks are required to have p T > 0.4 GeV, |η| < 2.5 and transverse (longitudinal) impact parameter |d 0 | < 1.5 mm (|z 0 sin θ| < 1.5 mm). To reduce effects from pile-up, we only use particles which have track information.

JHEP09(2019)047
After particle reconstruction, we employ the following baseline selection cuts 1 from ref. [39]: • the two leading jets must be b-tagged, each with p T > 30 GeV, • exactly two isolated leptons of opposite sign, each with p T > 20 GeV, GeV for the reconstructed missing transverse momentum, • proximity cut of ∆R < 1.0 for the two leptons, • proximity cut of ∆R bb < 1.3 for the two b-tagged jets, • m < 65 GeV for the two leptons, • 95 GeV < m bb < 140 GeV for the two b-tagged jets.
For those events which passed the baseline cuts, we form 16 kinematic variables, as well as jet images. As we will see later, the jet images can capture additional features which are not already contained in the 16 standard kinematic variables. Therefore one can obtain better performance by combining kinematics and jet images, which is one of the main ideas of this paper.

Kinematics in signal and backgrounds
In this section we introduce the 16 kinematic variables used in this analysis. Their kinematic distributions (for signal and all relevant backgrounds) are shown in figure 1 and will be discussed shortly.
We begin with ten standard kinematic variables, which were previously considered in refs. [9,43] (their distributions are shown in the first ten panels of figure 1): • m bb , the invariant mass of the two b-tagged jets (1st plot in the 1st row). This is expected to be a good variable, since for signal events, the two b-jets originate from the decay of a narrow resonance (the Higgs boson) and would therefore reconstruct to the Higgs mass, up to resolution effects: m bb ∼ m h . This justifies the baseline cut of 95 GeV < m bb < 140 GeV, as indicated with the vertical dotted lines. In contrast, no such correlations exists for backgrounds events: the two b-jets either originate from different decay chains and are uncorrelated (as in the case of tt, for example), or they reconstruct to the mass of a Z-boson or an off-shell gluon, with a mass lower than m h . The plot in figure 1, while confirming those expectations, also shows that the total background happens to peak at a value of m bb which, unfortunately, is not too far away from m h , providing the motivation to explore other variables. JHEP09(2019)047 Distributions of the 16 kinematic variables for signal (hh) and different types of backgrounds (tt, tW , ttV , tth, τ τ bb, bj and jj νν) before baseline cuts. The y-axis represents the number of events for each process and all individual distributions are normalized properly according to their respective cross-sections assuming 3 ab −1 at the 14 TeV LHC. The dotted vertical lines indicate the baseline cuts introduced in section 2.
• m , the invariant mass of the two leptons (2nd plot in the 1st row). For the case of the signal, the two leptons ultimately originate from the Higgs boson decay, and therefore their invariant mass m is bounded from above, hence the baseline cut of m < 65 GeV. Note that the m distribution (which is observable) should be the same as the distribution of m νν (which is unobservable).

JHEP09(2019)047
• ∆R bb , the angular separation (2.1) between the two b-tagged jets (3rd plot in the 1st row). Given the relatively low Higgs mass, the two Higgs particles in hh production have sizable transverse momentum and their respective decay products (e.g., the two b-quarks) tend to go in the same direction. This and the next four variables try to exploit this kinematic property of the signal. For example, the Higgs boost implies that ∆R bb is relatively small for signal events, and this motivates the proximity cut of ∆R bb < 1.3.
• ∆R , the angular separation (2.1) between the two leptons (4th plot in the 1st row). Here the same arguments apply as in the case of ∆R bb just discussed. The corresponding plot in figure 1 confirms that the signal ∆R distribution peaks well below most of the background processes, prompting the baseline cut of ∆R < 1.0.
• ∆φ bb, , the azimuthal angle in the transverse plane between the two b-jet system and the two lepton system (1st plot in the 2nd row). This is yet another way to capture the back-to-back boost of the two Higgs bosons in double Higgs production. Figure 1 shows that the signal peaks at ∆φ bb, = ±π more sharply than the background, which could be exploited later in the neural network analysis. However, no baseline cut was applied in this case, since ∆φ bb, is expected to be largely correlated with ∆R bb and ∆R .
• p T bb , the transverse momentum of the two b-jet system (2nd plot in the 2nd row). Like the previous three variables, this variable is motivated by the significant boost of the Higgs bosons in the signal, but no baseline cut was applied.
• p T , the transverse momentum of the two lepton system (3rd plot in the 2nd row). This variable behaves similarly to p T bb , but to a lesser extent, since the two leptons come from separate W s, while the two b-quarks are direct decay products of the Higgs boson.
• / P T = | / P T |, the magnitude of the missing transverse momentum (4th plot in the 2nd row). A / P T cut is routinely applied in order to fight the QCD backgrounds (not shown in figure 1). Following ref. [9], here we use a baseline cut of / P T > 20 GeV.
• p T 1 , the transverse momentum of the hardest lepton (1st plot in the 3rd row).
• p T 2 , transverse momentum of the next-hardest lepton (2nd plot in the 3rd row). As shown in figure 1, the individual transverse momenta of the two leptons are similar for both signal and backgrounds. Therefore, the lepton p T 's may be good for triggering purposes, but not for background rejection.
We note that for the signal, many of these 10 variables are strongly correlated to each other. 2 This implies that cutting on one variable significantly reduces the power of other JHEP09(2019)047 variables. At the same time, while these 10 variables are among the most commonly used in high energy physics, it is not guaranteed that they fully capture all kinematic differences between signal and background. This is why we introduce six additional variables [39]: Topness, Higgsness, √ŝ (bb ) min , T 2 , shown in the last six panels of figure 1, which are meant to take full advantage of the kinematic differences between the signal and background event topologies.
The Topness variable measures the degree of consistency of a given event with the kinematics of dilepton tt production, where there are 6 unknowns (the three-momenta of the two neutrinos, p ν and pν) and four on-shell constraints, m t , mt, m W + and m W − . Here m t = mt is the mass of top or antitop quark, and m W ± = m W is the mass of the W boson. Then the neutrino momenta can be fixed by minimizing the following quantity subject to the missing transverse momentum constraint, / P T = p T ν + p Tν . The parameters σ t and σ W are indicative of the corresponding experimental resolutions and intrinsic particle widths. In principle, they can be treated as free parameters and one can tune them using NN, BDT, etc. In our numerical study, we shall use σ t = 5 GeV and σ W = 5 GeV. Since there is a twofold ambiguity in the paring of a b-quark and a lepton, Topness is defined as the smaller of the two χ 2 s [39], The Topness distributions for both signal and backgrounds before baseline cuts are shown in figure 1 (3rd plot in the 3rd row). We observe that, as expected, T tends to have smaller values for the main background (tt) than for signal. In our signal of hh production, the two b-quarks arise from a Higgs decay (h → bb), and therefore their invariant mass m bb can be used as a first cut to enhance the signal sensitivity. For the decay of the other Higgs boson, h → W ± W * ∓ , Higgsness is defined as follows [39] H ≡ min /  mass distributions expected for the off-shell W -boson, W * , and the neutrino pair, νν. The invariant mass m W * is bounded by 0 ≤ m W * ≤ m h − m W and the peak of its distribution is at The left panel of figure 2 shows the unit-normalized invariant mass distribution of the proper lepton-neutrino system (m ν ). The distribution has a bimodal shape -the narrow peak on the right near 80 GeV corresponds to the on-shell W -boson resonance, while the broader hump to the left is due to the off-shell W * , with a clear end-point at m h − m W = 45 GeV and a maximum near m peak W * = 40 GeV in accordance with (3.4). The definition of Higgsness (3.3) also includes a term which tests for consistency with the expected invariant mass distribution dσ dmνν for the neutrino pair, 3 which is shown in the right panel of figure 2. The red solid curve gives the pure phase space prediction where λ(x, y, z) = x 2 + y 2 + z 2 − 2xy − 2yz − 2zx is the two-body phase space function and f (m) is the invariant mass distribution of the antler topology with h → W W * → + − νν: where the endpoint E and the parameter η are defined in terms of the particle masses as  Note that by allowing one of the W -bosons to be on-shell, eqs. (3.5)-(3.8) generalize the results previously derived in refs. [69][70][71][72] for the purely on-shell case. The blue histogram in the right panel of figure 2 shows the actual m νν distribution, whose shape is slightly different from the pure phase space result (3.5), due a helicity suppression in the W --ν vertex. In particular, we observe that the actual peak is at m peak νν ≈ 30 GeV, which is the value that we shall use in the definition of Higgsness (3.3). 4 The definition of Higgsness (3.3) contains some additional resolution parameters: σ h for the reconstructed mass of the Higgs boson, σ W * for the reconstructed mass of the offshell W boson, and σ ν for the m νν resolution. In what follows, we shall take σ W * = 5 GeV, σ h = 2 GeV, and σ ν = 10 GeV. 5 The Higgsness distributions for both signal and backgrounds before baseline cuts are shown in figure 1 (4th plot in the 3rd line). The two dimensional map of (Higgsness, Topness) on a log-log scale is depicted in figure 3. The Higgsness and Topness distributions in figure 1 are projections of this two dimensional scatter plot onto the x-axis and yaxis, respectively. Although the signal and the backgrounds do not exhibit a very clean 4 We note that other variants of Higgsness are also possible -for example, instead of penalizing the function H by the distances to the peaks in the corresponding distributions, one can introduce penalty terms which take advantage of the knowledge of the exact probability distributions (the blue histograms in figure 2). 5 We have checked that our results are not very sensitive to these choices.

JHEP09(2019)047
separation in the individual one-dimensional projections in figure 1, their two dimensional correlation plots show some visible differences. We note that even after employing the baseline cuts, one can still see a difference in the two dimensional correlation of Higgsness and Topness (bottom row plots). Along with Higgsness and Topness, we also consider two versions of theŝ min variable [73,74], which is defined aŝ where (v) represents a set of visible particles under consideration, while m v and P v T are their invariant mass and transverse momentum, respectively. The variable (3.9) characterizes the system comprising of the visible particles (v) and the invisible particles (here assumed to be massless) which are responsible for the measured missing transverse momentum / P T . It provides the minimum value of the Mandelstam invariant massŝ for the system which is consistent with the observed visible 4-momentum vector. We shall apply (3.9) to the whole event, where v = {bb }, or to the subsystem resulting from the decay h → W ± W * ∓ → + − νν, where v = { }. The distributions of the resulting variablesŝ min variable represents the minimum energy required to produce the two original parent particles (the two Higgs bosons in the case of the signal and the two top quarks in the case of the major tt background). This is why one would expect the distribution to peak around the parent mass threshold, 2m h for the signal and 2m t for the background [73]. However, the first panel in the fourth row of figure 1 shows that while the backgroundŝ (bb ) min distribution peaks near 2m t , which is expected, the signalŝ (bb ) min distribution peaks around 400 GeV, which is substantially higher than 2m h . This implies that the two top quarks are produced more or less at rest, while the two Higgs bosons have a sizable boost. Similarly, the variablê s ( ) min is the minimum energy required to produce the two W bosons. For the tt background, where both W bosons are on-shell, the peak is expected to occur around 2m W . On the other hand, the signal distribution should be softer, since one of the W bosons is off-shell, and furthermore, the peak should be located slightly below the Higgs boson mass. These kinematic differences are illustrated in the second plot on the fourth row of figure 1, and motivate the use ofŝ T 2 ), and then when it is applied to the + − visible system resulting from the W → ν decays (M ( ) T 2 ). In principle, M T 2 is defined as [76] where the minimization over the transverse masses of the parent particles M T P i (i = 1, 2) is performed over the transverse neutrino momenta p νT and pν T , subject to the / P T constraint. 6 The parameterm in (3.10) is the test mass for the daughter particle: in < l a t e x i t s h a 1 _ b a s e 6 4 = " w W n g k 3 e x 7 C H s N p E 7 E W a t K 6 s l 2 S 0 = " T 2 ) typically require a few extra steps to compute them, thus we shall refer to them as high-level kinematic variables, while the remaining 10 traditional variables will be called low-level kinematic variables. We will perform two independent analyses -one with and one without the high-level kinematic variables, in order to estimate the performance benefit from adding the additional 6 variables.

Color flow in signal and backgrounds
We note that the two b-quarks in the signal result from the decay of a single non-colored object, the Higgs boson. In contrast, the two b-quarks in tt production (which is the dominant background) arise from the decays of top quarks, which in turn are produced via the strong interactions from a gluon-gluon initial state. This distinction is pictorially illustrated in figure 4. The different color-flow [86] will lead to different hadronization patterns, which can be used to discriminate a color singlet particle from a color octet (or triplet) at hadron colliders such as the LHC. Since the quarks which originate from a color singlet particle are color-connected to each other, their hadronization will not involve the initial state partons. On the contrary, the quarks which originate from a color octet particle are color-connected to the annihilating partons in the initial state, and consequently their hadronization is correlated with these initial state partons, see figure 4.

JHEP09(2019)047
Figure 5. Cumulative p T distributions resulting from showering 10,000 times a single partonic event for the signal (left) and tt production (right). The two b quarks from h → bb are colorconnected to each other and the soft radiation tends to fill in the region between them (left panel), while the two b quarks from tt production are not color-connected and the two clusters from their hadronization tend to be more isolated (right panel).
The difference in color flow will be reflected in the resulting hadron distributions. Hadrons coming from a color-singlet object will tend to be closer to the direction of the original mother particle, and as a result, the soft radiation will tend to populate the region between the two b quarks. On the other hand, hadrons from the decay of a color-octet particle will not be so narrowly focused, due to the influence of the initial state partons. These features are illustrated in figure 5, where we show the cumulative p T distributions in the (η, φ) plane after showering the same partonic event 10,000 times. In the left panel we used a signal event, while in the right panel we used an event from tt production. We see that the b-jet clusters in the right panel tend to be better defined and more isolated, since they are not color-correlated among themselves. On the other hand, in the left panel we observe quite a bit of soft radiation in the region between the two b jets, due to the existing color connection between them.
Of course, the results in figure 5 are only valid in the statistical sense, since we took the same parton-level event and hadronized it multiple times. In reality, only one instance of this hadronization will be realized, as illustrated in figure 6. The top row of plots shows the hadronization patterns for charged particles (left panel) and neutral particles (right panel) in the case of one signal event, while the bottom row shows the same, but for one tt event. The parton-level event information is quoted (in GeV) to the right of each row of panels, and then each event is translated in the (η, φ) plane until the origin is aligned with the direction of the b-quark pair. The color scheme indicates the total p T in each pixel, while the dotted circles represent the ∆R = 0.4 cones for reconstruction of the corresponding b jets.
As an alternative to figure 5, in figures 7 and 8, we illustrate the effects of colorconnection by showing the average of the jet images for the signal and the different background processes before and after the baseline cuts, respectively (some basic generation-JHEP09(2019)047 level cuts were imposed on the events in figure 7). The origin of the (η, φ) plane plane is taken to be the center of the b quark pair and the color scheme indicates the total p T in each pixel. The black dotted line delineates the region 1.6 ≤ η ≤ 1.6 and −2.01 ≤ φ ≤ 2.01 used in the analysis. One can observe a striking difference in density between signal and background events in figure 7 -the two b quarks tend to be more collimated in the signal and more spread out in the background.
Unfortunately, after imposing the baseline cuts introduced in section 2, this distinction tends to be washed out and the backgrounds start mimicking the signal: one can see a similar structure emerging in all panels in figure 8, albeit with some subtle differences. Although one may find it difficult to discriminate signal from backgrounds simply by looking at a particular event, the patterns in the average jet images are different, and have been used actively for signal versus background separation [45,47,87]. In this paper, instead of quantifying the difference (e.g., with a pull vector [45]) we will use the images themselves on deep neural networks (DNNs), along with the 16 kinematic variables introduced in the previous section.

JHEP09(2019)047
5 Analysis using deep learning DNN is known to be very efficient and powerful in image recognition [88,89] and the particle physics community has used it for various applications. 7 For instance, one can map the information about the direction and the energy (or transverse momentum) of a particle onto a pixel in an image. DNN then provides excellent classification between signal and background in the jet image [48,50,53,54]. It also shows performance gains in multrivariate analyses over traditional cut-and-count analyses or BDTs [52,93]. In this section, we describe how we organize our analysis in a DNN framework. In the following three subsections, we address the issues of data pre-processing, DNN architecture and training of the NN.

Data pre-processing
In order to achieve the improved DNN learning performance and to minimize the error, it is important to properly process signal and backgrounds events before feeding them into a DNN framework. For each event passing the baseline cuts, the jet images are processed as follows.
1. Input data: we use the particle flow for our input data [94].
2. Particle classification: we divide the particle flow into two groups: neutral particles and charged particles. Neutral particles include photons and neutral hadrons, while charged particles include charged hadrons.
3. Lepton removal: if there is a lepton, we remove it.
4. Shift: we shift all particle coordinates in the (η, φ) plane with respect to the center of the reconstructed b-quark pair, i.e., we set ( 2 ) as the new origin, (0,0).

5.
Pixelization: we discretize the rectangular region in the (η, φ) plane defined by −2.5 ≤ η ≤ 2.5 and −π ≤ φ ≤ π into a grid of 50 × 50 pixels for each particle classification (charged particle set and neutral particle set). In each pixel, we record the total transverse momentum as the pixel's intensity (in case of more than one particle, we add the transverse momenta and record the total sum). We refer to this 50 × 50 discrete image as the jet image [48].
6. Normalization: we rescale each jet image intensity as I ij → I ij /I max , where i, j = 1, 2, . . . , 50, and I ij represents the intensity value in the (i, j) pixel. I max is defined to be the largest value of pixel intensity found in the two 50 × 50 pixel images.
7. Cropping: we crop the jet image to 32 × 32 pixels, by further restricting to the (η, φ) rectangular range of −1.6 ≤ η ≤ 1.6 and −2.01 ≤ φ ≤ 2.01. 7 The use of neural networks for data analysis in high energy physics can be traced back to the pioneering work by R. Field and his students in the mid-nineties [90][91][92]. The final jet image has dimension 2 × 32 × 32 and is comprised of one charged particle channel with dimension 1×32×32 and a neutral particle channel with dimension 1×32×32. This pre-processed jet-image is the input to the DNN. We note that Figures 7 and 8 showed the combined 1 × 50 × 50 jet-image obtained by adding the neutral and charged particle layers. The black dotted rectangular area in those figures showed the restricted 1 × 32 × 32 pixel area.

DNN architecture
Our DNN architecture consists of three sub-architectures, which will merge later, as illustrated in figure 9. Combined deep learning (DL) is not yet very common, 9 but recently there have been several studies in particle physics [50], as well as in other areas [96,97], which showed improved results over simple DL. In this subsection, we provide some details of our DNN layer architecture as follows: 0. Initialization. Since DNN has a lot of parameters, it is important to give non-biased initial values for the (weight, bias) before running DNN with all input data. We use the He uniform initialization method as in ref. [98], among several other algorithms for parameter initialization [98][99][100].

JHEP09(2019)047
1. Jet images. They are represented by the top panel in figure 9.
(a) Input data: we use pre-processed jet images as inputs.
(b) Convolutional neural networks layers: we use three layers of convolutional neural networks (CNN). Each layer has a 32×2×2 filter with no stride and no padding. We proceed with the batch normalization process after filtering [101], using the ReLU function as our activation function [102]. After activation, we introduce the max pooling layer which has a 2 × 2 shape with 2 × 2 strides and padding.
(c) Dense layers: we feed the output of the CNN into two fully connected 1 × 64 dense layers, using ReLU as the activation function.
2. The 6 high level variables. Those are illustrated by the middle panel in figure 9. (a) Input data: p T 1 , p T 2 , / P T , m , m bb , ∆R , ∆R bb , p T bb , p T , ∆φ bb, .
(b) Dense layers: we follow the same procedure as in the case with the 6 high level variables above.

Combination
(a) Merge: we apply three single (1×1) dense layers to the jet image, the 6 high level variables and the 10 low level variables. These layers are denoted as α, β, and γ, respectively, as shown in figure 9. To merge the three sub-architectures, we introduce the final dense layer of dimension 1 ×3 without an activation function.
(b) Final output: to distinguish signal from backgrounds, we apply a layer of dimension 1 × 2 without an activation function.

DNN training
We now proceed with deep learning on the DNN architecture described in section 5.2, using the pre-processed input data. We use Microsoft CNTK [103] as the main DNN library on GPU with an Nvidia CUDA platform. We use the Adam optimizer [104] with cross entropy with SoftMax loss function and classification error function. The sizes of the training data set and the testing data set are about 40k and 17k, respectively. The size of the mini-batch is 128 and that of the epoch is 30.
For each event, we prepare the jet images and the 16 variables. The dimension of the final output is 1 × 2, (P sig , P bknd = 1 − P sig ). If the deep learning score is equal to 1, i.e., P sig = 1 (P bknd = 0), the corresponding event is taken to be a signal event. If P sig = 0 (P bknd = 1), the event is considered to be background.

Results
In this section we present our results. First we validate our framework by repeating the analysis performed in ref. [39] under similar assumptions. 10 We obtained consistent results for the conventional cut-and-count method with Delphes detector simulation. When we added deep learning, the signal significance improved slightly by 5-10%. Now considering all relevant backgrounds and using all 16 variables and jet images, we show the deep learning score for the signal and the individual background processes in figure 10. The signal should peak near P sig = 1 by construction, and indeed this is what is observed in the figure. Note that the tt and tW processes are well separated from the signal and both peak near P sig = 0. This is direct consequence of the improvements made in our analysis -introducing the proper kinematic variables and jet images, which were meant to target the dominant background (tt production), as evidenced in figures 1, 3 and 8. Although the subdominant backgrounds are also reduced in this process, they remain rather flat in figure 10.
The deep learning score shown in in figure 10 can now be used as a signal-to-background discriminator. By placing a lower cut and counting the number of surviving signal and 10 Our current analysis has several notable improvements over the one carried out in ref. [39]. First, the detector simulation is different -in the current study, we use Delphes, which assumes (on average) ∼ 90% (∼ 80%) reconstruction efficiency for leptons (b-jets), while ref. [39] assumed 100% reconstruction efficiency for both. In addition, the Delphes detector resolution itself is slightly different from one used in ref. [39]. In particular we find that the resolution of the missing transverse momentum is worse in Delphes and hence our current results are more conservative (if not more realistic). Finally, as mentioned earlier, we are now including tW + j production, which turned out to be the next dominant background, yet was missing from all previous studies. These effects should be kept in mind when comparing our results here to previous results in the literature. background events, one obtains the efficiency curve (also known as a receiver operating characteristic (ROC)) shown in the left panel of figure 11. The curve contains several independent runs of deep learning and shows the signal efficiency ( signal ) versus the fraction of rejected background events, i.e., 1 − bknd , where bknd is the background efficiency. The efficiency corresponding to the results in figure 10 is shown with the red solid curve labeled "with jetimage DNN". The other two solid lines show the efficiencies which would be obtained if we were to remove the jet images from the analysis: the purple solid curve (labelled "10var only DNN") is obtained with the help of the 10 low-level kinematic variables, while the blue solid curve (labelled "16var only DNN") shows the improvement when we add the 6 high-level variables and use the full set of 16 variables from section 3, but still without jet images. The black dotted curve (labeled "jet image only DNN") shows the result when we use jet images alone, with no help from any of the 16 kinematic variables. Finally, the blue dashed line (labelled "10var with jetimage DNN") shows the result from an analysis combining jet images with the 10 low-level kinematic variables only. The corresponding signal significances are shown as a function of the number of events in the right panel of figure 11. Note that the right panel contains an additional curve (the purple dashed line labeled "10var only BDT") where we use the 10 low-level variables and adopt a BDT algorithm using the TMVA tool kit [105]. The comparison of the latter line against the "10var only DNN" result (purple solid line) reveals the relative performance of DNN versus BDT.

JHEP09(2019)047
In order to examine the effects of pile-up, we use several methods as follows. In the first method, we use the Soft Drop algorithm [58] to remove soft jet activity which is exacerbated by pile-up. We set β = 0 and z cut = 0.1 with R = 1.2 anti-k T clustered fatjets. Then we select the closest fatjet to the bb momentum in the η-φ plane and replace the particle flow data with the charged and neutral jet constituents of the selected fatjet.  Table 1. Signal and background cross sections in fb after baseline cuts (first row) and at different stages of analysis, using a combination of kinematic variables and jet images while requiring N = 20 signal events. The significance σ is calculated using the log-likelihood ratio for a luminosity of 3 ab −1 at the 14 TeV LHC. method, we remove the neutral jet image layer in the analysis. Unlike charged particles, which can be cleaned up from pile-up relatively easily by checking the longitudinal vertex information [106], neutral particles cannot be treated the same way and suffer from nonremovable pile-up effects. The corresponding results with these two pile-up mitigation methods are also shown in figure 11 with the red dotted line labelled "16var with jetimage DNN, SoftDrop" and the red, dashed line labelled "16var with jetimage DNN, no neutral layer", respectively.
We also examine the performance of the DNN with four momentum information as input. The corresponding results are shown in figure 11, where the green-dashed (green-solid) curve represents the significance with four momentum information only (four momentum information plus jet images). The inputs are 18 real numbers, i.e. the four momenta of the two leptons and the two b-tagged jets and the missing transverse momentum. For this exercise, we use a 4 × 128 dense layer instead of a 4 × 64 dense layer. We notice that the DNN performance with kinematic variables is better. This is because, in general, the use of four momenta requires a large training sample in order to be effective, while the kinematic variables already perform efficiently with a smaller data set. If the architecture is deep enough with a large amount of data, the DNN performance with four momentum information would be comparable (or better) to that with kinematic variables only. This exercise illustrates the importance of the appropriate use of kinematic variables.
In summary, figure 11 demonstrates that jet images (which capture the effects of color flow) can improve performance over the baseline selection cuts. At the same time, DL with jet image substructure alone does not show the best performance, and becomes fully effective (and still stable under pile-up) only when it is combined with the full set of 16 variables, including the high-level ones. Table 1 summarizes the signal and background cross sections in fb at different stages of the analysis for the case of N = 20 signal events. The last two columns show the signal significance σ and the signal-to-background ratio S/B. The significance is calculated using the log-likelihood ratio for a luminosity of 3 ab −1 at the 14 TeV LHC. In order to understand the correlation between jet images and the 16 kinematic variables, we performed two independent runs with "jet images only; no kinematic variables" and "16 kinematic variables; no jet images". The corresponding results are shown in figure 12. Since the two DLs are trained separately, both the x-axis and the y-axis are normalized to unity. As expected, figure 12 reveals a degree of correlation between the jet images and the 16 kinematic variables, which is somewhat stronger for the signal and less so for the background.

JHEP09(2019)047
In our main analysis, we performed simultaneous runs as shown in the deep learning architecture in figure 9. Before calculating our final deep learning score, we obtain three intermediate values, α, β, and γ, which represent the DL scores for the respective substructure corresponding to the jet images, the 6 high level variables and the 10 low level variables. The first 6 panels in figure 13 show the pair-wise correlations between these three intermediate scores for the signal (top row) and the background (middle row). The bottom three panels in the figure show the one-dimensional distributions of the intermediate scores for signal (blue histograms) and background (red histograms). We observe that the score from jet images (α) is relatively uncorrelated to the kinematic variables scores β and γ, which motivates the simultaneous training on jet images and kinematic variables together.
Finally, in figure 14 we scan over different values of the triple Higgs coupling κ 3 and show the discovery significance (left panel) and precision (middle panel) as a function of κ 3 . Both the significance σ and the precision ∆χ 2 are calculated fixing DL cuts that would give a certain number of signal events (N = 15,20,25,30) for the SM at κ 3 = 1 (marked with the dotted vertical line). For the significance, we used the log-likelihood-ratio where S and B are the expected number of signal and background events, respectively. We The shape of the significance roughly follows the cross section ratios between the case of κ 3 = 1 to the case of κ 3 = 1. This is illustrated in the rightmost panel of figure 14, which shows the cross section scaled as σ(κ 3 )/min σ(κ 3 ) , i.e., normalized with respect to the minimum cross section for each curve. The blue curve represents the double Higgs production cross section before cuts, and in this case we find the minimum of the cross section somewhere between κ 3 = 2 and κ 3 = 3. After baseline cuts (the red solid line), the minimum shifts to around κ 3 ∼ 4, and after DL cuts (the green solid line), the minimum shifts even further out to around κ 3 ∼ 5. In the latter case, we observe that the signal cross sections for κ 3 = 1 and κ 3 = 8 are numerically very close, as indicated by the two vertical dotted lines in the right panel. This provides an explanation for the double dip structure seen in the middle panel of figure 14.  15,20,25,30) for the SM at κ 3 = 1 (marked with the dotted vertical line). The shape of the significance roughly follows the cross section ratios between the case of κ 3 = 1 to the case of κ 3 = 1. This is illustrated in the right panel, which shows the scaled cross section, computed as σ(κ 3 )/min σ(κ 3 ) for each curve.

JHEP09(2019)047
As demonstrated in the rightmost panel of figure 14, the analysis cuts modify the signal cross section so that the location of its minimum shifts to higher values of κ 3 . This can be understood as follows. At leading order, the Higgs pair production cross section is given by before convoluting with the parton distribution functions [107,108]. Here F 1 represents a parity-even triangle and box diagram contribution, while F 2 is a parity-odd box diagram contribution. Now F 1 can be rewritten as F 1 = κ 3 F + F , where F is the triangle diagram contribution and F is the box diagram contribution. Therefore the cross section can be parameterized as a quadratic function of κ 3 , where the c coefficients are related to contributions from and diagrams. The observation that the baseline cuts and the DL cut shift the minimum cross section to a larger κ 3 value implies that the effects of the cuts are stronger on c than c , . In other words, our cuts are more likely to affect the triangle diagram which contains the triple Higgs coupling. Unlike the box diagram, the triangle diagram includes an off-shell Higgs in the s-channel. Since it is harder to produce a Higgs pair from an s-channel off-shell Higgs, the Higgs pair generated from the triangle diagram is not as energetic as the one coming from the box diagram, and will therefore tend to have lower transverse momentum. As discussed in section 4, several of the cuts on our kinematic variables, namely, ∆R bb , ∆R , ∆φ bb, , p T bb and p T , rely on the fact that the Higgs bosons are produced with a significant boost. Consequently, the effect of the cuts will be to suppress the c term and enhance the box diagram contribution, which in turn shifts the location of the minimum to a larger value of κ 3 .
Note that the results for the significance and the precision in figure 14 do not change dramatically when we require a different number of signal events at the SM point. This  means that the dependence on the DL cut is relatively mild, since the kinematics remains similar when we vary κ 3 , so that the dependence on the cross section is more important. This is illustrated in figure 15, which shows the cross section (in pb) after cutting on the DL score, σ DL (pp → hh → bb + − νν), (left panel) and the ratio σ DL /σ baseline between the cross section σ DL after the DL cut and the cross section σ baseline after baseline cuts (right panel).

Discussion
In this paper, we investigated double Higgs production in the hh → bbW W * → bb + / P T final state. It is known to be one of the difficult channels due to the large backgrounds, σ bknd /σ hh ∼ 10 5 . We performed a detailed analysis by adopting a deep learning framework and successfully combining new kinematic variables and jet image information. As a result, we obtained a sizable increase in signal sensitivity and an improved signal-to-background ratio compared to the existing analyses.
Our results showed that the dominant tt background can be brought down to the level of the other remaining backgrounds, without sacrificing too much in the signal rate. This is mostly due to the use of Higgsness, Topness and the subsystem variable M Other backgrounds like bbτ τ can be reduced further by the use of M ( ) T 2 . Finally, additional improvements are possible with the use of jet images. After all cuts, we find that all backgrounds contribute at similar levels.
We find from recent CMS and ATLAS analyses with 36 fb −1 of LHC data at 13 TeV that the 95% confidence level observed (expected) upper limit on the production cross section is 22.2 (12.8) times the standard model value [7] for CMS and 6.7 (10.4) times the predicted Standard Model cross-section [109] for ATLAS. The leading channel in CMS is bbγγ followed by bbτ τ , while the leading channels in ATLAS are bbτ τ and bbbb, followed JHEP09(2019)047 by bbγγ. The main difference arises due to the superior b-tagging efficiency for the ATLAS detector [14]. In both studies, the bbW W * channel was largely overlooked due to the expected poor significance. However, our study suggests that double Higgs production may be probed in the dilepton bbW W * channel as well, and would contribute to the combined analysis on par with the other final states, increasing the overall significance. For example, in ref. [13], the ATLAS collaboration showed that the combined significance of hh → bbbb, hh → bbτ τ , and hh → bbγγ is 3.5 (3.0) without (with) systematic uncertainties at the 14 TeV LHC with 3 ab −1 . Their individual significance is 1.4, 2.5 and 2.1 (0.61, 2.1 and 2.0), respectively without (with) systematics. They did not combine with the hh → bbW W * channel but a naive estimate shows that when including our channel, the combined significance would be about 3.7.
We urge the experimental collaborations to consider the ideas presented in this paper and test them in the LHC data. We would also like to mention that the proposed method can be easily generalized to the semi-leptonic channel from hh → bbW W * production, as well as to other processes with similar final states.

A Deep Neural Network
The artificial neural network (ANN) is one of the most popular approaches to pattern recognition in machine learning algorithms. The structure of an ANN is defined by a succession of non-linear and linear transformations between nodes or artificial neurons, which are located on input, output or hidden layers. A hidden layer which uses an ordinary one-dimensional layer is called a dense layer or a fully connected layer.
The linear operation consists of weights and bias: where the corresponding dimension of the input would be n I = n 2 . The non-linear transformation is often called activation function, which imitates the action potential of biological neurons. Similar to how each neuron adjusts how much signal it needs to deliver to the next neuron using an electric action potential, the activation function determines the output of a particular neuron for a set of given inputs from neurons on the previous layer, and the output is then used as input for the next artificial neuron. If the neural network has sufficiently many hidden layers, the network is called deep neural network (DNN). DNN can learn from the input data to obtain the desirable output by adjusting the parameters in the hidden layers. We note that the proper normalization of the input data helps improve convergence during training. The goal of the training is to determine the parameters (weights and biases) by minimizing the loss, which represents the difference between the target output and the actual DNN output. There are various algorithms for optimization of the parameters [104,110,111]. Instead of feeding the entire data into the DNN all at once, one splits the input data into several subsets with random selection and takes one subset, called mini-batch, for a given iteration, which helps avoid the over-fitting problem [112,113]. When the full training set is used, the cycle is called epoch, and one uses several epochs to obtain a welltrained DNN. When training DNN with a mini-batch, the corresponding loss is defined by the sum of all losses over the mini-batch or by their average. Once the training is over, for testing one uses a different data set from the one used in the DNN training, in order to avoid the over-fitting problem. In order to test the trained JHEP09(2019)047 DNN model one can use either the loss function or the classification error function. If the number of test events is n, the classification error function is defined by where ArgMax({y[i]}) gives the position i max where the value of {y[i]} is maximized. j represents the j-th test event, the δ is Kronecker delta function. Often one takes additional steps such as dropout for reducing over-fitting in neural networks [114] and batch normalization for improving the performance and stability of artificial neural networks [101]. Dropout makes a random drop of units (both hidden and visible) in a neural network and is considered an efficient way of performing model averaging. The batch normalization procedure normalizes the input layer by adjusting and scaling the activations: where α represents the α-th input in a mini-batch and n is the size of the mini-batch. The dimensions of input and output are the same. Note that (γ, β) are the learned parameters during the training and is a parameter added to avoid a divergence in the denominator. The batch normalization allows each layer of a network to learn by itself independently of the other layers. A convolutional neural network (CNN) is a class of DNN, most commonly used to analyze images. CNN utilizes filters made of a set of neurons with a fixed size. The value of parameters in each filter is learned during the training process. By varying the position of the filters on the input and learning the values of different filters, CNN can find local features of the input data. This process is called convolution and a hidden layer which uses convolution is called a convolutional layer. With n f filters whose size is (n f s × n f s ), the convolution is defined as follows where the dimension of the input is n f ×(n×n) and the dimension of output is n f ×(n ×n ). The corresponding ranges of the parameters are k = {0, · · · , n f − 1}, α = {i, · · · , i + n fs }, β = {j, · · · , j + n fs }, i, j = {0, n s , 2n s , · · · , n }, n = n/n s − n f s + n s , and n s is called the stride.

JHEP09(2019)047
Since each filter has a finite size, the output size decreases, after applying the convolution (A.14) on the input or on the output from a previous layer. In order to prevent the size reduction, CNN incorporates the padding process: which increases the size of the original input by adding zeros around it. Usually the padding is used before applying convolution or pooling. CNN may include local or global pooling layers (often called sub-sampling), which combine the output of several neurons at one layer to a single neuron in the next layer. For example, max (average) pooling takes the maximum (average) value from a set of neurons at the previous layer and passes it to next layer. For a pooling dimension n p , the relation between the output with dimension n f ×(n ×n ) and the input with dimension n f ×(n×n) where α = {i, · · · , i + n p }, β = {j, · · · , j + n p }, k = {0, · · · , n f − 1}, i, j = {0, n s , 2n s , · · · , n }, n = n/n s − n p + n s , and n s is the stride. Another beneficial feature of a CNN is the reduction of the number of parameters via convolution and pooling, which effectively increases the learning speed in deep neutral networks. A typical DNN architecture consists of a combination of convolutional layers and dense layers, which provides better performance compared to a NN with only one type of layers [115].
Open Access. This article is distributed under the terms of the Creative Commons Attribution License (CC-BY 4.0), which permits any use, distribution and reproduction in any medium, provided the original author(s) and source are credited.