Higgs self-coupling measurements using deep learning in the bb¯bb¯\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ b\overline{b}b\overline{b} $$\end{document} final state

Measuring the Higgs trilinear self-coupling λhhh is experimentally demanding but fundamental for understanding the shape of the Higgs potential. We present a comprehensive analysis strategy for the HL-LHC using di-Higgs events in the four b-quark channel (hh → 4b), extending current methods in several directions. We perform deep learning to suppress the formidable multijet background with dedicated optimisation for BSM λhhh scenarios. We compare the λhhh constraining power of events using different multiplicities of large radius jets with a two-prong structure that reconstruct boosted h → bb decays. We show that current uncertainties in the SM top Yukawa coupling yt can modify λhhh constraints by ∼ 20%. For SM yt, we find prospects of −0.8 <λhhh/λhhhSM\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\lambda}_{hhh}/{\lambda}_{hhh}^{\mathrm{SM}} $$\end{document}< 6.6 at 68% CL under simplified assumptions for 3000 fb−1 of HL-LHC data. Our results provide a careful assessment of di-Higgs identification and machine learning techniques for all-hadronic measurements of the Higgs self-coupling and sharpens the requirements for future improvement.

This paper synthesises these separate advances to present a comprehensive assessment of HL-LHC analysis strategies across all kinematic regimes of the hh → 4b channel to evaluate and improve λ hhh constraints. We now discuss the specific novelties of this paper.
We show that analyses optimised for discovery sensitivity of hh with SM couplings are suboptimal for constraining λ hhh at the boundaries of projected limits. This is because SM hh production has more boosted Higgs bosons than scenarios where λ hhh /λ SM hhh ∼ 5 due to the relative contributions of the interfering production amplitudes (figure 1). Reconstructing Higgs bosons with lower boosts is more limited by trigger thresholds, but we nonetheless demonstrate that optimising for signals with non-SM λ hhh rather than SM values can improve λ hhh constraints. Furthermore, due to the small signal-to-background ratios, our sensitivity is limited by systematics, whose size we vary to quantify its impact on λ hhh constraints. Reducing the formidable multijet background can mitigate the impact of systematics, whose suppression relies on modern b-tagging algorithms [71][72][73]. Our study adopts expected improvements of the impact parameter resolution from inner tracker upgrades crucial for b-tagging.
Additionally, we investigate the impact of experimental effects such as the finite resolution of jet reconstruction and trigger limitations. This presents a more complete picture of the limitations on the sensitivity of the hh → 4b channel beyond previous studies. We also study the composition of the background in detail, including small backgrounds (such as single Higgs production) and the impact of a neural network selection on this composition and shape of background distributions. This is important as uncertainties in background composition affect the experimental systematics of current analyses.
This paper is structured as follows. Section 2 outlines the Monte Carlo simulation of the signal and background processes together with discussion of detector emulation. Section 3 describes the baseline and neural network analysis strategies. Section 4 presents the results and statistical analysis before section 5 summarises our conclusions.

Signal and background modelling
In the SM after electroweak symmetry breaking, the Higgs self-coupling is determined by the Higgs potential This paper focuses on measuring the trilinear coupling λ hhh . This will directly test the SM prediction m 2 h = λ hhh v 2 , and any deviation from this value is indicative of BSM physics. The Higgs boson mass m h = 125 GeV and electroweak vacuum expectation value v = 246 GeV are fixed to their independently measured values. Directly probing the quartic self-coupling λ hhhh requires triple Higgs boson production [108][109][110][111], which is beyond the scope of this work due to its highly suppressed cross-section. This section details the Monte Carlo simulation of the signal and background used for our analyses. We outline the phenomenology of Higgs pair production along with its simulation (subsection 2.1), background processes (subsection 2.2), and detector emulation (subsection 2.3).

Higgs pair production
At leading order, the Higgs pair production cross-section has contributions from amplitudes we refer to as 'triangle' and 'box', shown in figure 1, and their interference. These three contributions scale with the top Yukawa and trilinear self-coupling as σ triangle ∼ λ 2 hhh y 2 t , σ box ∼ y 4 t , σ interference ∼ −λ hhh y 3 t . (2.2) Sensitivity to λ hhh therefore depends on probing the triangle and interference amplitudes. The comparatively stronger dependence of the total cross-section on y t highlights the importance of top Yukawa constraints for λ hhh determination. The interference term probes the sign of λ hhh , as it has a relative negative sign due to an additional fermion propagator in the box loop with respect to the triangle. This implies destructive interference for λ hhh > 0. In the |λ hhh |/y t → ∞ limit, the total cross-section asymptotes to the same value for either sign of λ hhh as the triangle contribution dominates over the interference.
To modify couplings (λ hhh , y t ) away from SM values λ SM hhh , y SM t , keeping all other SM parameters fixed, we use the HeavyHiggsTHDM model [114]. This adopts the 'improved effective field theory' prescription, which uses gluon-Higgs vertices in the large quark mass limit and is improved by accounting for finite top mass effects as discussed in ref. [115]. Such corrections are important given loop momenta are comparable to the top mass pole. We set couplings to all BSM particles to zero. We vary the top Yukawa y t while fixing the top mass to m t = 172 GeV. We define κ λ = λ hhh /λ SM hhh and κ t = y t /y SM t to parameterise variations from the SM couplings, as is conventional in the so-called 'kappa framework' [2]. This is a standard prescription, but to consistently study global constraints with other measurements, a full effective field theory treatment is recommended [116][117][118]; this is deferred to future work.
To sample points in the two-dimensional signal parameter space (λ hhh , y t ), we employ MadGraph 2.6.2 [112,113]. We consider only the dominant production mode via gluon fusion gg → hh, inclusive of all Higgs boson decays. The NNPDF3.0 parton distribution functions [119] at next-to-leading order (NLO) are used from the LHAPDF package [120]. We generate 100k Monte Carlo (MC) events per point and calculate cross-sections at leading order (LO) in the strong coupling constant α s . Figure 2 illustrates the sampled points and LO cross-sections calculated by MadGraph, where the SM value is around σ SM LO = 16 fb at √ s = 14 TeV, and table 1 shows LO cross-sections for example points. For κ t = 1, the LO cross-section falls to a minimum of 6.2 fb at around κ λ = 2.5 due to destructive interference. The cross-section gradient is also shallow around this minimum, rising by only ∼ 15% to 7.0 fb and 7.3 fb for κ λ = 2 and κ λ = 3 respectively. This makes λ hhh constraints around such values challenging using inclusive cross-section measurements alone. For the training of the neural network (subsection 3.2), we use an identical generator configuration to produce a dedicated set of high statistics samples with 250k events per κ λ variation for fixed κ t = 1 and use MadGraph to decay both Higgs bosons via h → bb.
We normalise the signal rate to NNLO. 1 We start from the LO MadGraph crosssections and improve their accuracy by applying λ hhh -dependent LO-to-NLO k-factors 1 The cross-section for κ λ = 1 has recently been calculated to N 3 LO accuracy [121,122].  [112,113] as a function of the variation of Higgs trilinear self-coupling κ λ and top Yukawa coupling κ t from their SM values λ SM hhh , y SM t using the HeavyHiggsTHDM model [114]. The orange markers indicate the points sampled from this parameter space.
Turning to the differential distributions, variations in (λ hhh , y t ) away from SM values induce striking features illustrated in figure 3. These are shown at parton level ('parton level' refers to the b-quarks produced by the Higgs decays, before hadronisation or showering), with other variables found in appendix A. Ideally, analyses should have a high signal-to-background ratio S/B across different kinematic regimes of the Higgs boson as their boosts can vary rapidly with variations in coupling. Figure 3a shows κ λ values between 1 and 4 inclusive, where destructive interference is near maximal and small κ λ variations cause dramatic changes to the m hh and p T (h) distributions. For SM couplings, events are suppressed at low values 250 < m hh 300 GeV and peaks at m hh 400 GeV, resulting in higher p T Higgs bosons. For κ λ = 2, destructive interference causes the m hh shape to vanish at m hh 320 GeV. The signal instead occupies localised regions on either side of this minimum: near the kinematic threshold and at higher  values where the Higgs bosons are more boosted. As κ λ increases to κ λ = 3 and 4, many events shift to low m hh 350 GeV, where the Higgs bosons have comparatively lower p T . Figure 3b shows how the sign of κ λ impacts the m hh and p T (h) distributions. For scenarios with destructive interference κ λ > 0, the signals κ λ ≥ 5 all occupy low m hh values peaking near 250 GeV. Interestingly, as κ λ increases from 5 to 10, the m hh and p T (h) spectra become slightly harder. Meanwhile, scenarios with constructive interference κ λ < 0 have a greater proportion of signal events occupying comparatively higher m hh and p T (h) values. These qualitatively different kinematic features can help lift degeneracies in the sign and value of λ hhh that give the same total cross-sections.
Physically, these features arises due to kinematic thresholds of the interfering amplitudes. The triangle amplitude tends to dominate at lower m hh , while the box diagram has a kinematic threshold at approximately twice the top mass 2m top and impacts larger m hh values. Reconstructing these differential features, especially where the cross-section gradient is small, can improve λ hhh sensitivity.  The columns indicate any explicit parton level generator preselection on the leading parton p T (j gen 1 ), the leading order MadGraph crosssection σ MG LO , the number of events generated N gen events , and an estimate of the effective luminosity L eff = N gen events /σ MG LO . The k-factor is a scaling factor applied to account for NLO corrections for the dominant backgrounds, and overall higher order corrections for the signal described in the main text. Information for a representative set of signals are displayed in the bottom five rows. For the signals, higher statistics samples of 250k events per point are generated for neural network training as discussed in the main text.

Background processes
We calculate cross-sections and produce MC events at LO for background processes using the same generator configuration as the signals, which are summarised in table 1. We refer to processes with four real b-quarks at parton level as 'irreducible' backgrounds. Processes with fewer than four real b-quarks can enter the analysis selection if a jet is misidentified as a b-jet, which we refer to as 'reducible' backgrounds. We apply k-factors from ref. [46] to account for NLO corrections for the dominant backgrounds shown in table 1.
We generate two sets of multijet processes: irreducible bbbb (4b) and reducible bbjj (2b2j). We define light-flavour partons j as all quarks and gluons except bottom and top -7 -JHEP12(2020)115 j ∈ {u, d, s, c, g}. The 2b2j cross-section is two orders of magnitude larger than 4b, so is substantial even with powerful light jet rejection from b-tagging algorithms. We do not simulate 4j to preserve computational resources and ref. [46] showed that this process is subdominant compared to 4b and 2b2j. To improve statistics in high p T tails, we generate events for ranges of leading parton p T (j gen 1 ) in MadGraph, sliced with lower bin edges at 20, 200, 500, 1000 GeV, where leading parton refers to b-quarks or light partons j.
We generate top quark pairs tt and the irreducible tt+bb process at the matrix element level, which together are expected to comprise around 10% of the background rates. For single Higgs processes in associated production, we consider W h, Zh, tth, bbh. These processes typically have cross-sections that are an order of magnitude greater than di-Higgs, and the presence of a Higgs boson increases the probability of events passing analysis selections than other electroweak processes. Finally, we consider the diboson ZZ process, which has a modest cross-section and its 4b final state constitutes an irreducible background. We generate at least one million events for each of these processes.

Detector emulation
For all signal and background samples, the decay, parton shower, hadronisation, and underlying event are modelled by Pythia 8.230 [131]. To save computational resources for our simplified study, we do not emulate pileup. We expect that detector upgrades [132][133][134] and improvements in pileup mitigation techniques [135][136][137][138][139] will reduce the impact of pileup. To emulate reconstruction effects, we use Delphes 3.4.1 [140] and assume the default AT-LAS configuration card unless stated otherwise. We define three sets of reconstructed jets using the anti-k t clustering algorithm [141,142] with radius parameter R: • Small jets (j S ) are defined with R = 0.4 and cluster only calorimeter towers. We impose p T > 40 GeV on all these jets to emulate the detector trigger requirement. We require these to be within a tracking acceptance of |η| < 2.5 to allow b-tagging, where η is the pseudorapidity.
• Large jets (j L ) are defined with R = 1.0, also clustering only calorimeter towers. Based on the expected kinematics of boosted Higgs bosons p T 2m h , we require these to satisfy p T > 250 GeV and be central |η| < 2.0.
• Track jets (j T ) are defined by clustering only tracking information using R = 0.2.
We impose kinematic requirements of p T > 20 GeV and |η| < 2.5. We associate these track jets to large jets if their distance satisfies For simplicity, we tag the flavour of both small jets and track jets using the same default implementation in Delphes, which parameterises tagging efficiencies based on the truth quark flavour of the jet. The p T -dependent efficiencies of bottom, charm, and light jet (u, d, s, g) mistag rates are based on the '70% working point' of the ATLAS multivariate MV2c20 algorithm [71]. The b-tagging efficiency peaks at 74% for p T 150 GeV, falling to 50% at p T 500 GeV. The charm mistag rate is around 10%, peaking at 14% for -8 -

JHEP12(2020)115
p T 100 GeV before falling to 7% at p T 500 GeV. The light mistag rate remains on the order of 1% throughout the p T regimes of interest. We do not increase the b-tagging range to |η| < 4.0 expected from the detector upgrades [143] because we expect the region |η| > 2.5 to be dominated by the gluon splitting background g → bb whereas b-quarks from Higgs are more central. However, we follow ref. [57] to emulate expected b-tagging improvements of 8% per jet in |η| < 2.5 by scaling the 4b background and hh signals by a factor of 1.36, and the 2b2j and tt backgrounds by 1.17. To improve MC statistics after b-tagging, we multiply events by these efficiencies as a weight rather than discarding non-b-tagged events.
For processes with a low number of real b-quarks such as 2b2j, this can improve the number of raw MC events by up to three orders of magnitude when requiring four b-tagged jets.
Electrons and muons must be within the tracking acceptance |η| < 2.5, satisfy p T > 10 GeV, together with default Delphes efficiencies, smearing and isolation. We modify the default calculation of missing transverse momentum (p miss T ) to ensure muons are included. It is relevant to study how experimental effects such as jet clustering, missing energy due to neutrinos and finite detector resolution impact the discriminating variable m hh . As m hh is sensitive to κ λ , this can motivate improvements to its reconstruction for future work. Figure 4 shows the impact of these effects for the κ λ = 2.5 signal with near-maximal destructive interference to accentuate the shape differences. First, four small jets (with the p T and η requirements defined above applied) are identified by association to the bquarks from the Higgs bosons found in the truth record. For these jets, we then compare their parton level m hh distribution (shaded grey) to three classes of jets at different levels of reconstruction: 1) truth-level jets clustered from all final state truth particles (yellow line), 2) truth-level jets clustered from all final state truth particles except neutrinos (blue line), and 3) reconstructed-level jets (red line). The shape of m hh for all three jet definitions is mildly distorted to lower values than the parton level, with the reconstructed jets being generally lower than their truth counterparts. While this shows the degradation of m hh resolution, the qualitative features of the non-trivial m hh shape are preserved. Figure 4a considers a p T > 20 GeV threshold on the reconstructed jets, while figure 4b raises this threshold to the nominal p T > 40 GeV considered in this work. This illustrates that the signal rate at low m hh is substantially driven by trigger thresholds. Future work may also consider jets with variable radius, which may improve signal-to-background ratios [144]. This comparison highlights the following important experimental considerations: • Maintaining sufficiently low trigger thresholds for the HL-LHC upgrades [132] is of key importance for both discovery of the di-Higgs process as well as constraining λ hhh . In particular, figure 3a shows the 1 ≤ κ λ ≤ 4 scenarios exhibit large qualitative changes in shape at low m hh 400 GeV. A rise in trigger thresholds can reduce sensitivity to this region. Figure 3b shows the majority of κ λ ≥ 5 signals reside at m hh 350 GeV, but this is also compensated by the faster changes in total cross-section.
• Corrections should be applied to compensate for energy loss in b-jets, which is not accounted for by standard jet calibration. These techniques have been recently deployed by the LHC collaborations [145] but are beyond the scope of this work. Applying these corrections at the trigger level remains an open problem.

Analysis strategies
This section presents our analysis strategies to constrain the trilinear self-coupling λ hhh in the hh → 4b channel. We first outline the figures of merit that motivate our analysis design. The goal of measurement is to maximise discrimination between different (λ hhh , y t ) values, which requires solving two conceptually distinct classification problems: • signal characterisation i.e. the measurement power to discriminate λ i hhh vs λ j =i hhh . This is most simply quantified by the difference (squared) of the signal rates (S i − S j =i ) 2 .
• background suppression i.e. signal S vs background B discrimination. Intuitively, if the changes (S i − S j =i ) 2 between two couplings are comparable or smaller than the background uncertainties, this reduces the ability to discriminate λ i hhh vs λ j =i hhh .
We quantify how well our analyses simultaneously achieves these goals in a statistically meaningful way using the chi-square Here, S denotes the signal yields after analysis selections for the nominal (λ i hhh , y i t ) and alternative (λ j hhh , y j t ) coupling hypotheses. The ς S (ς B ) are the combined absolute uncertainties on the signal (background), where ς B ς S i is typical in the 4b channel. To benchmark SM measurement precision, we can fix S j = S SM , which assumes the observed data will correspond to that of the SM. Nonetheless, it is important to consider dedicated optimisation assuming S i = S BSM should nature prefer BSM couplings. Higher χ 2 values indicate better discrimination power between two coupling hypotheses, which is achieved by maximising the numerator (S i − S j ) 2 while minimising the uncertainties ς S,B . As background rates are large, the systematic component of the uncertainty dominates the denominator. One way to reduce the impact of background systematics is to suppress B. We design two classes of analysis strategies to fulfil these objectives: • The baseline analysis (subsection 3.1) uses conventional rectangular cuts on variables reconstructing Higgs bosons, inspired by recent ATLAS and CMS strategies. This serves as a baseline to benchmark the performance of neural network optimisation.
• The neural network analysis (subsection 3.2) demonstrates the use of a multivariate strategy to optimise sensitivity beyond the baseline analysis. This consists of an additional selection requirement based on the output of an artificial neural network.

Baseline analysis
The baseline analysis is loosely inspired by a recent ATLAS analysis [41]. Our event selection is summarised in table 2. In all categories, we require at least four b-tagged small or track jets to suppress multijet backgrounds. We also require the pseudorapidity difference of the two reconstructed Higgs candidates to be small |∆η(h 1 , h 2 )| < 1.5 because high mass objects occupy more central regions than multijet backgrounds dominated by gluon-gluon scattering. To suppress W → ν decays from top quarks, we veto electrons or muons and require E miss T < 150 GeV in all categories. To probe different regimes of Higgs boson kinematics, we define three categories based on the exclusive number of large jets in each event, which we refer to as resolved, intermediate, and boosted. Figure 5 displays the multiplicity of resolved, intermediate and boosted events for different κ λ , as well as reconstructed m hh distributions inclusive of number of large jets. The multiplicity distribution shows that this categorisation has λ hhh discrimination. Interestingly, κ λ = 2 and κ λ = 3 have more events falling in the intermediate and boosted categories N (j L ) ≥ 1 than the SM κ λ = 1, consistent with the higher m hh and -11 -JHEP12(2020)115 p T (h) tails (figure 3a). This orthogonal categorisation in N (j L ) enables straightforward statistical combination to enhance sensitivity. For the three categories, the final signal region (SR) is defined by mass window requirements on the reconstructed h → bb systems of m(h 1,2 ) ∈ [90,140] GeV, based on the mass resolution and background rejection.
Resolved. The resolved category requires exactly zero reconstructed large jets. This targets Higgs bosons with low transverse momenta, p T 2m h , which we reconstruct as four distinct small jets. To identify pairs of jets consistent with a h → bb decay, we consider the leading four small jets and construct Higgs boson candidates from small jet pairs that minimise the mass difference between the candidates ∆m(h 1 , h 2 ). The (sub)leading Higgs candidate is defined as the system of a pair of small jets with the (lower) higher p T . As the triangle amplitude dominates at low m hh and therefore low Higgs p T , this category is particularly important for λ hhh sensitivity. We implement a selection defined in ref. [41], where the angular distance ∆R h 1,2 jj between the jet pair of the Higgs candidates satisfy This adjusts the angular distance between the jets of each Higgs candidate according to the boost of the system characterised by the invariant mass of the 4-jet system m 4j .

Intermediate. The intermediate category requires exactly one large jet in the event.
This targets regimes where exactly one Higgs boson is sufficiently boosted (p T 2m h ) that the b-jets in the h → bb system become merged so are more efficiently reconstructed as one large jet. The two b-quarks are reconstructed as two track jets j T , which we require to be associated to this large jet by ∆R(j T , j L ) < 1.0. The remaining small jets j S , separated from the large jet by ∆R(j S , j L ) > 1.2, are paired to form the subleading Higgs candidate, where the j S pair is chosen to minimise the mass difference of the two Higgs candidates.
Boosted. The boosted category requires exactly two large jets targeting hh → 4b events where both Higgs bosons have high Lorentz boosts. These events typically reside in the tails of the m hh distribution (m hh 500 GeV), which can be important for probing BSM λ hhh couplings that enhance m hh and p T (h) at high values. Importantly, the high expected signal rate due to the large branching ratio B(h → bb) 58% implies greater statistical power in these tails compared with lower rate channels such as bbτ τ or bbγγ. We require each large jet to have two b-tagged track jets associated to this large jet. Figure 6 shows the signal acceptance times efficiency A × ε for the baseline analysis signal region in the three categories. Due to the different Higgs boson kinematics when varying κ λ , the A×ε strongly depends on κ λ . We find the highest A×ε = 0.55% at κ λ = 1.  of magnitude lower than those of the resolved category. In particular, we find the lowest A × ε = 0.26%, 0.019%, 0.007% at κ λ = 4, 6, 9 for the resolved, intermediate, and boosted categories respectively. This suppression is dominantly due to the lower h → bb boosts causing events to pass the jet (trigger) p T requirements with lower efficiency. Figure 6b shows that the acceptance decreases slowly for increasing κ t due to the m hh distribution being shifted toward lower values, as displayed in appendix A. Figure 7 shows the m hh and leading Higgs p T distributions for signal and background in the signal region of the three categories. Notably, the resolved category has the greatest shape discrimination between κ λ hypotheses, where κ λ = 5 has a visibly higher proportion of events at lower m hh values than κ λ = 1. To exploit this, our final selection divides events into non-overlapping m hh bins whose lower edges are defined in table 2. The resolved category uses six bins chosen for simplicity, while fewer bins are used for the intermediate and boosted categories due to lower expected event rates. When performing the statistical analysis in section 4, these orthogonal m hh bins are combined to enhance sensitivity. Future work could consider dedicated optimisation of this binning scheme. Appendix A shows further distributions involving the subleading Higgs p T and di-Higgs system p T (hh).
nodes use the rectified linear unit (ReLU) defined as max(0, x) for the activation function, whose advantages over the traditionally used sigmoid function are discussed in ref. [148]. As input, the network uses a comprehensive set of 20 variables summarised in table 3. This comprises the four-momenta of the two Higgs candidates, the ∆R distance between the two subjets associated to each Higgs candidate, the b-tagging state of these subjets, the missing transverse momentum with magnitude E miss T and azimuthal angle φ, the number of reconstructed electrons and muons, and the mass and transverse momentum of the di-Higgs system. These variables are chosen for their signal vs background discrimination power. A separate neural network is trained for each of the resolved, intermediate and boosted categories using the preselection defined in table 2. For the training, the input samples are normalised such that equal weight is given to signal, multijet (2b2j and 4b) background, and tt background.
For outputs, we construct three nodes corresponding to signal, multijet (2b2j and 4b) background, and tt background. This is referred to as a 'multi-class classification network'. The model assigns a score p i corresponding to how likely an event corresponds to one of the three processes. The output nodes use a normalised exponential activation function known as 'softmax', which is a standard choice for multi-class configurations. This constrains each output score 2 to p i ∈ [0, 1] and their sum to unity i p i = 1. We constructed a threeoutput classifier to explore its utility in classifying background processes for designing control regions, but section 4 will only use the binary signal classifier for simplicity. In typical experimental implementations, multijet processes are estimated using data-driven methods while tt employs MC, which have different systematics.
Half of the MC events for the multijet and tt background samples, and all of the high statistics hh → 4b signal samples are set aside for training (20% of the training events are used for training validation and are not used for actual training). The categorical cross-entropy is used as the loss function, which quantifies the accuracy of the model predictions at each training step, and is minimised using the Adamax algorithm [149]. The step size used in this minimisation is controlled by a learning rate hyperparameter. The cross-entropy H between the network prediction and the true classes in N events from the training set is defined by Here, p label is 2 Despite the notation and its properties, pi is not a true posterior probability. the vector containing the class of each event (1 for the true class and 0 for the other two) and p model is a vector containing the score assigned to each class for each event. The initial learning rate in the Adamax optimiser is set to 5 × 10 −3 for the resolved and intermediate neural networks, and 5×10 −5 for the boosted ones. The training set is divided into batches during training. Once all batches are finished processing, one training epoch is complete and the next one starts by processing the first batch again. The number of events in each batch is a tunable hyperparameter, which we set to 100 in this study. To mitigate overfitting unphysical features such as statistical fluctuations, we apply dropout [150] at a rate of 30% to both internal layers. This means 30% of internal nodes are randomly masked during each training iteration. We find 20 training epochs gives close to optimal performance. The learning rate, batch size and dropout rate are optimised using a random search method. Figure 8 shows the signal score p DNN signal distributions for the DNN trained on κ λ = 1 and κ λ = 5 for background and benchmark signals in the three categories. The signal vs background discrimination is improved across the categories, suggesting that our neural networks capture kinematic information beyond the cuts of the baseline analysis. However, this depends on the value of κ λ . For example, the upper-left plot shows that the DNN trained on κ λ = 1 adds substantial discrimination power for a κ λ = 1 signal, but not for a κ λ = 5 signal. This will be further discussed in section 4. Our neural network analysis imposes a universal requirement of p DNN signal > 0.75 for simplicity. Future work could consider optimising requirements on p DNN signal that are different for each category. Additionally, one could extend the analysis by fitting the p DNN signal variable instead of using only one bin. Figure 9 shows the m hh distributions after the p DNN signal > 0.75 requirement is imposed for the DNN trained on κ λ = 1 and κ λ = 5. We note that the shape of the κ λ = 5 signal (as well as that of the background) in the resolved category is particularly different between the two trainings. Since the low-m hh regime offers the most discrimination between different κ λ values, we use κ λ = 5 as the nominal signal training sample to ensure the DNN gives more weight to these events with respect to a SM optimisation. As we will see in section 4, this indeed performs better than training on κ λ = 1 when setting κ λ coupling limits. Note that the same is not necessarily true when setting cross-section limits assuming SM kinematics.

What is the DNN learning?
As machine learning techniques become increasingly widespread, it is important to understand how our neural network exploits the given physics information [82]. Specifically, we evaluate the ranked feature importance to quantify how much each input variable (model feature) changes the signal score in both signal and background events. Interpretability of deep neural networks has seen rapid development in the computational sciences. We adopt a recently developed framework for interpreting predictions called SHapley Additive exPlanations (SHAP) [88]. This framework combines several feature importance tests available for machine learning models in the literature into a single value as detailed in ref. [88]. These values are consistent for all types of inputs to the neural network, and can be compared with one another. A heuristic description of SHAP values can be found in appendix B.1 that provides intuition for this approach. Events / 0.05  Events / 0.05   Events / 20 GeV Events / 40 GeV Events / 20 GeV Events / 40 GeV Events / 50 GeV -19 -

JHEP12(2020)115
The SHAP framework requires that the trained model is applied to a specific set of events and only evaluates the feature importance for these events. We construct a new subset of events with half signal and half background to reflect the importance of distinguishing them from each other rather than discrimination between the background components had the S/B mirrored the preselection rates. For this evaluation, we construct the background sample to be composed of 80% 2b2j, 15% 4b and 5% tt events to approximate the background composition of the three categories. Figure 10 shows the 20 input variables of the neural networks, ranked by their impact on the final signal score for the κ λ = 5 training. This impact is measured by the magnitude of their SHAP values averaged over the whole dataset given to the framework, which is plotted on the x-axis. Each point plotted per row corresponds to one event fed to the framework, and its location along the x-axis represents what impact that variable has on the signal score of the event. The relative magnitude represents how much the value of that variable changes the signal score compared to all other variables in all events. Points with a larger positive SHAP value increase the signal score more while the inverse is true for negative SHAP values. The absolute scale of the SHAP value is arbitrary in these plots. The colour scale indicates the value of the feature on the specific event e.g. a blue dot on the b-tag(h cand 1 , j 1 ) indicates the leading subjet associated to the leading Higgs candidate is not b-tagged. Meanwhile, a pink dot in the m hh row indicates that event has high di-Higgs invariant mass for the plotted SHAP value.
The b-tagging state of (sub-)jets are among the most important features in the resolved and intermediate categories. This is less important in the boosted category possibly due to lower b-tagging efficiencies at high p T , which could be improved by future work in novel b-tagging techniques [151]. Angular and mass variables are stronger discriminants against multijet processes in this regime. The opening angles between these (sub-)jets and the invariant mass of the di-Higgs system carry a large amount of information in all three categories. Variables sensitive to semi-leptonic top decays, such as the number of leptons or missing transverse momentum, are effective at rejecting background (large negative SHAP value for high feature value) but less so at identifying signal (small positive SHAP values for any feature value).
Appendix B presents supplementary material characterising the performance of our neural networks. This includes receiver operator characteristic (ROC) curves, signal acceptance times efficiency, ranked feature importance for the DNN trained on κ λ = 1, together with correlations between the neural network scores and reconstructed (di-)Higgs mass variables m(h 1 ) and m hh .

Higgs self-coupling constraints
This section presents the results and discussion for the baseline and neural network analyses. We compare the performance of the resolved, intermediate, boosted categories, and their combined constraints. We set the luminosity to the target HL-LHC dataset of L = 3000 fb −1 .     λ hhh constraints assuming κ t = 1 and discusses the impact of systematic uncertainties. In subsection 4.3, we evaluate λ i hhh vs λ j =i hhh discrimination power and discuss strategies for improvement. Finally, subsection 4.4 lifts the assumption of κ t = 1 to examine the impact of top Yukawa y t uncertainties on κ λ constraints. Table 4 shows the signal and background yields in the final signal region for the baseline ('Baseline' columns) and neural network ('DNN' columns) analyses prior to binning in m hh . For the baseline analysis in the resolved category, we find the reducible 2b2j process comprises 49% of the background due to its formidable cross-section and only modest suppression from b-tagging. The irreducible 4b rate is comparable, comprising 45% of the background. These multijet processes constitute 94% of the backgrounds, similar to recent experimental analyses [41]. They also dominate in the intermediate and boosted categories -22 -JHEP12(2020)115 but interestingly, the 4b/2b2j ratio reduces to 0.25 and 0.21 respectively, suggesting light flavour jet systems have higher boost than those from b-jets. The top quark contribution, dominated by tt, is around 3% for the resolved and intermediate categories. The single Higgs backgrounds contribute less than 1% of the total, with the most dominant contributions arising from the irreducible associated production tth and the electroweak Zh processes. For the neural network analyses, the DNN signal score p signal is required to satisfy p signal > 0.75. Two DNNs are considered: one trained on the SM κ λ = 1 signal and the other on κ λ = 5 where we expect the boundary of sensitivity to be for κ λ limits. In table 4 for the resolved category, we find the p signal > 0.75 requirement of the DNN trained on κ λ = 1 gives a 61% SM κ λ = 1 signal efficiency for 86% background rejection compared to the baseline analysis, providing a signal-to-background of S κ λ =1 /B = 9.6 × 10 −4 and significance of S κ λ =1 / √ B = 0.49. When a different signal hypothesis κ λ = 5 is considered to that used in DNN training, we find the significance is lower S κ λ =5 / √ B = 0.34, as one would expect. The background rejection is dominated by the reduction in the 2b2j process such that the irreducible 4b component of the background now becomes dominant for the resolved category. For the DNN trained on κ λ = 5, we find the p signal > 0.75 requirement results in an 82% background rejection for 39% κ λ = 5 signal efficiency, giving S κ λ =5 /B = 8.1 × 10 −4 and S κ λ =5 / √ B = 0.46. As one would expect, this is higher for the κ λ = 5 signal than using the DNN trained on κ λ = 1, but this is only the case for the resolved category. Interestingly for the intermediate and boosted categories, the DNN trained on κ λ = 1 results in higher signal yields for all signal hypotheses listed in table 4. Such categories have a greater signal acceptance times efficiency for κ λ = 1 than for κ λ = 5 scenarios (figure 6) as the former κ λ case has a harder Higgs p T spectrum (figure 3).

Signal and background rates
Turning to the yields binned in m hh , table 5 compares these across the three analyses and three categories for the κ λ = 1, 2, 5 signals. For the DNN trained on κ λ = 1, the yields are suppressed in the low m hh bins such that there is no signal in m hh ∈ [200, 250] GeV. This is expected given the background rate is high but the signal rate is low in these bins. Therefore, only training the DNN on κ λ = 1 signals is suboptimal for signals that occupy lower m hh values such as κ λ = 5. We see that the κ λ = 5 training retains the κ λ = 5 signal in the first m hh ∈ [200, 250] bin, which is important for κ λ constraints that will be further discussed in the next subsections.  Table 5. Summary of signal region event yields binned by m hh for the total background B and benchmark signals for different κ λ couplings S κ λ =i . These are displayed for the baseline and neural network (DNN) trained on κ λ = 5, and trained on κ λ = 1 for comparison. The yields are normalised to an integrated luminosity of L = 3000 fb −1 for the three categories.
We now evaluate the binned statistical significances defined by Z = S/ B + (ζ b B) 2 including background systematics ζ b and display these in table 7. For each category, the lowest two rows show the significances combined in two ways for comparison: • Quadrature sum. The significance for each m hh bin i is summed in quadrature to give Z = i Z 2 i . This accounts for m hh shape information, where for simplicity we assume no correlations in systematics.
• One bin. The yields in each bin i are summed S = i S i , B = i B i before the significance Z is calculated such that no m hh shape information is considered.
Finally, the bottom three rows of table 7 combine the significances across all three categories. The quadrature sum considers all the m hh bins in all three categories, while the combined one bin approach is the quadrature sum of the one bin significances from each category. For simplicity, we neglect statistical and systematic uncertainties on the signal in the significances as analyses are in background-dominated S B regimes. Given B 1, we assume statistical uncertainties follow the Gaussian approximation ς B stat = √ B. For the systematic uncertainties, we adopt a simple approach that assumes the nominal benchmark values ζ b summarised in table 6. We assume ζ b = 0.3% for the resolved category, which is very demanding but in accord with ref. [57]; this is a HL-LHC extrapolation of the recent ATLAS analysis which reports 2% systematics [41]. For the boosted category, a similar ATLAS analysis [41] currently reports up to 18% systematics, dominated by the data-driven background estimate method and jet mass resolution [41], while the current CMS analysis report up to 7% systematics [43]. For our work, we assume that this can be controlled to the level of ζ b = 5% at the HL-LHC. For the intermediate

JHEP12(2020)115
Category Systematic ζ b Resolved 0.3% Intermediate 1% Boosted 5% Table 6. The nominal relative systematic uncertainties on the background estimate ζ b assumed when performing the χ 2 analysis for the different analysis categories.
category, we assume a benchmark value between what we apply for resolved and boosted of ζ b = 1%. In experimental implementations [39][40][41][42][43][44][45], the multijet background is typically modelled via data-driven methods so is not subject to theoretical uncertainties such as renormalisation scale, parton shower and parton distribution functions. Nonetheless, data-driven procedures have systematics due to unknown composition and extrapolation over correlated variables.
For the baseline analysis significances in table 7, the total significance is higher for the quadrature sum approach compared to the one bin counterpart, demonstrating the importance of m hh shape information for sensitivity. With ζ b = 0%, the Z κ λ =1 significance evaluated using the κ λ = 1 signal for the resolved category is 0.54 compared to 0.37 with the nominal 0.3% systematics and just 0.08 without binning in m hh . The intermediate and boosted categories have comparable significances of 0.29 and 0.31 respectively with ζ b = 0%, but dilutes to 0.12 and 0.02 with their nominal systematics of table 6. The significance combining all categories is 0.69 for ζ b = 0% and 0.39 with nominal systematics.
Moving to the neural network analyses, we find the significances are higher than those of the corresponding baseline analysis across all m hh bins. This is especially important when systematics are included, where we see the resolved category Z κ λ =1 significance improve from 0.54 to 0.59 for the DNN trained on κ λ = 1 with 0% systematics but the improvement is 0.37 to 0.48 with the nominal 0.3% systematics. When systematics are included, this suggests that the background rejection is more important to improve significance. We find that the DNN trained on κ λ = 5 provides higher significances for κ λ = 5 signals (Z κ λ =5 = 0.44) than the DNN trained on κ λ = 1 (Z κ λ =5 = 0.30) in the resolved category. The intermediate category has significances that are numerically about half those of the resolved category: Z κ λ =1 = 0.24 (vs 0.48) for DNN trained on κ λ = 1 intermediate (vs resolved). With all categories combined (bottom rows of table 7), the DNN trained on κ λ = 1 gives Z κ λ =1 = 0.75 with 0% systematics and is diluted to 0.53 with systematics, which improves upon the baseline analysis significance of 0.39. The significance without binning in m hh is lower at 0.32, suggesting that additional discrimination power remained in the m hh shape not captured by a cut on the DNN score.

χ 2 analysis of λ hhh fixing κ t = 1
We now perform a χ 2 analysis to evaluate projections of Higgs self-coupling λ hhh constraints. We adapt the definition in eq. (3.1) into one for benchmarking our analyses: (4.1)   Here, B is the total background rate, S SM is the SM signal rate, and S is the signal rate we wish to distinguish from the SM counterpart. These are evaluated in each m hh bin i using the yields in      For the baseline analysis, figure 11a shows the χ 2 vs κ λ distribution without systematics. The resolved category is most constraining κ λ ∈ [−0.75, 6.5] at 68% CL, followed by the intermediate κ λ ∈ [−3.3, 10.2] at 68% CL with boosted being the weakest category. To gain intuition for the sensitivity between the three categories, we note that figure 3 shows -27 -

JHEP12(2020)115
that signals with κ λ 5 have the lowest Higgs p T spectrum. Therefore, the intermediate and boosted scenarios have lower acceptance for these scenarios where the upper boundary of κ λ sensitivity is expected. Meanwhile, for negative κ λ , we see that the Higgs bosons in κ λ −1 scenarios have modest boost ( figure 3). Therefore, the acceptance is higher for all categories and the constraints are stronger for negative κ λ . Although scenarios near the SM value κ λ = 1 have the most boosted Higgs bosons and signal acceptance is high (figure 6), these κ λ values are very challenging to constrain in a χ 2 analysis because the total signal cross-sections are small and change slowly with respect to κ λ due to destructive interference (subsection 2.1). Inspecting table 4, we find factors of two change in signal yield between the challenging hypotheses of κ λ = 1, 2, 5. However, the yields binned in m hh (table 5) suggest that our shape analysis does have modest sensitivity to signal distribution changes for these κ λ scenarios even if it is insufficient for 68% CL sensitivity. Figure 11b shows the χ 2 vs κ λ distributions assuming the nominal systematics of table 6. The 0.3% systematics for the resolved category dilutes the constraints to κ λ ∈ [1.6, 7.9] at 68% CL. Meanwhile, the 1% systematics for the intermediate category weakens the constraint to κ λ ∈ [−5.8, 13.6] at 68% CL. However, the boosted category loses all constraining power in the considered κ λ range for the assumed 5% systematics. Overall for the baseline analysis, the statistical combination of all three categories leads to a 68% CL constraint of κ λ ∈ [−0.6, 6.4] for no systematics. This degrades to κ λ ∈ [−1.6, 7.9] when assuming the nominal systematics, driven by the sensitivity of the resolved category.
Turning to the neural network analysis, we evaluate the χ 2 vs κ λ distribution assuming the nominal systematics with the DNN trained on κ λ = 1 (figure 11c) and κ λ = 5 (figure 11d). Statistically combining all three categories achieves 68% CL constraints of κ λ ∈ [−1.0, 7.4] and κ λ ∈ [−0.8, 6.6] for the DNN trained on κ λ = 1 and κ λ = 5 respectively. These improve upon the baseline analysis of κ λ ∈ [−1.6, 7.8], with the κ λ = 5 DNN training surpassing those achieved by the κ λ = 1 DNN training. This shows that dedicated optimisation for a signal closer to the boundary of sensitivity can improve κ λ constraints. We choose to train on the κ λ = 5 signal due to its substantial triangle diagram contribution and therefore has a significant fraction of events at low m hh . The kinematics are also reasonably representative of higher κ λ values (figure 3b). Training on κ λ = 5 instead of κ λ = 1 encourages the neural network to retain low m hh events, which has larger shape differences between different κ λ values. This underscores the importance of optimisation away from the signals with SM couplings for κ λ constraints in the hh → 4b final state.
A future improvement may consider designing multiple independent neural networks, each optimised for one sampled value of κ λ . A more sophisticated approach may extend the idea of a parameterised neural network [80] to construct an observable that is a wellbehaved function of κ λ and provides good signal discrimination power, but this is beyond the scope of this work. While we find that sensitivity is driven by the resolved category, it is noteworthy that the DNN substantially improves the κ λ constraints of the intermediate category from κ λ ∈ [−5.8, 13.6] of the baseline analysis to κ λ ∈ [−3.8, 10.4] for the DNN trained on κ λ = 1 and κ λ ∈ [−3.4, 11.3] for the DNN trained on κ λ = 5. In a wider context, we note that intermediate category has close to comparable λ hhh constraints with projected HL-LHC constraints for the di-b-quark plus diboson hh → bbV V final states [58].   Figure 12. Summary of 68% CL intervals (χ 2 < 1) on κ λ with fixed κ t = 1 for different assumed background systematics from 0% to 5% for the neural network analysis trained on the κ λ = 5 signal. This is shown for the resolved (dark blue), intermediate (medium blue), boosted (light blue) categories along with their combination (orange). The luminosity is assumed to be 3000 fb −1 . Lines without endcaps mean the 68% CL limit is outside the considered range of κ λ ∈ [−20, 20]. The vertical grey line denotes the SM value of κ λ . The nominal systematics assumed in this work are shown in table 6. In appendix C, figure 27 shows the version of this plot with the DNN trained on the κ λ = 1 signal.
While we have assumed the nominal systematics of table 6, figure 12 now evaluates the 68% CL κ λ limits for different systematics ranging from 0% to 5% with the DNN trained on κ λ = 5. We find that the constraints for the resolved category degrade to κ λ ∈ [−5.5, 11.9] for ζ b = 5% and the intermediate category constraints weaken to κ λ ∈ [−8. 2, 15.8]. This shows how important the assumed level of systematics are for constraining λ hhh due to small S/B 10 −3 . Meanwhile, the boosted category has no constraining power within κ λ ∈ [−20, 20] when systematic uncertainties are assumed at the nominal 5% level. If these could be improved to the 2% level, the boosted category can constrain κ λ > −12.4. We also find that if the systematics are controlled to 1%, there is additionally sensitivity to positive κ λ < 17.6. The difficulty in forecasting experimental systematics at the HL-LHC leads to significant uncertainty for projections of κ λ constraints. One can also interpret these systematics as benchmark targets for future analyses to improve λ hhh constraints.

λ i hhh vs λ j =i hhh discrimination power
So far, our χ 2 distributions (figure 11) have been evaluated with respect to the SM κ λ = 1 signal i.e. assuming we observe the SM signal in data. We now generalise this to evaluate the discrimination power between any two κ λ hypotheses, which we present in figure 13. While we train our DNN analysis on one class of signal (here κ λ = 5), it is important to quantify the signal characterisation capability should λ hhh deviate from the SM or the trained signal scenario. We calculate the χ 2 ij between two signal hypotheses (i, j) shown on the axes of figure 13. We present this for the neural network analysis trained on κ λ = 5, combining the three categories and assuming the nominal systematics in table 6. Larger values of χ 2 ij indicate that it is easier to discriminate between two hypotheses (i, j), while χ 2 ij values around unity or below indicate difficulty in discriminating. Overall, we find figure 13 shows good discrimination power where χ 2 ij 1 for scenarios with κ λ ≥ 10 and κ λ ≤ −5, whose sensitivity is driven by the total cross-section. This corresponds to κ λ values far away from destructive interference and the cross-section changing rapidly with respect to κ λ . We also note small χ 2 ij < 1 occurs around κ λ = −2 and κ λ = 7, which has χ 2 ij 0.5. This arises physically due to the total cross-sections being  similar σ(κ λ = −2)/σ(κ λ = 7) 0.84 combined with acceptance being higher for negative κ λ (figure 6) due to harder Higgs p T spectra ( figure 3). These sign degeneracies are not symmetric about zero due to maximal destructive interference being near κ λ ∼ 2.5. Generally, scenarios that are difficult to distinguish fall along a northwest-to-southeast corridor of low χ 2 ij values. There is also a cluster around 1 ≤ κ λ < 4 with χ 2 ij < 1 arising from both total cross-sections and changes with respect to κ λ being suppressed due to near-maximal destructive interference (subsection 2.1).
For future work, it would be interesting to exploit m hh shape information better via dedicated optimisations to break degeneracies in the κ λ parameter space further. Moreover, one could imagine a DNN trained just on a single κ λ hypothesis κ λ = 5 is suboptimal for discriminating against two signal hypotheses unrelated to κ λ = 5. It would be desirable to design analyses, such as generalising the current neural network strategy, to optimise the discrimination power across a wider variety of κ λ values.

κ λ constraints allowing κ t = 1
Given the rapid scaling of di-Higgs cross-sections with the top Yukawa coupling y t shown by eq. (2.2), we expect modifications in y t to impact the projected λ hhh constraints, which we now explore. Figure 14a shows the χ 2 distribution in the κ t vs κ λ plane for the baseline analysis after combining the resolved, intermediate and boosted categories, assuming the nominal systematics of table 6. The contours are displayed after linear interpolation of the points sampled from this parameter space with the density shown in figure 2. The grey line marks out the χ 2 = 1 contour that corresponds to 68% CL. Consistent with the one-dimensional distributions (figure 11b), we see the red colours representing regions that have high χ 2 values and are statistically excluded, while the blue shades indicate parameters that have low sensitivity. The dark blue strip of very low χ 2 that runs from -31 -   around (κ λ , κ t ) ∼ (0, 0.8) to (3, 1.2) indicates a region that is very similar to the SM (1, 1) parameters, given this is our reference value for calculating the χ 2 .

JHEP12(2020)115
Of particular interest is the shape of the χ 2 = 1 contour in the two-dimensional κ t vs κ λ plane: as κ t increases, the 68% CL constraint on κ λ improves, indicated by the region satisfying χ 2 < 1 decreasing. To make quantitative comparisons, we choose benchmark κ t values of 0.8 and 1.2 based on current uncertainties from direct pp → tth measurements of y t being on the order of 20% [67,69]. We note that while measurements of gluon fusion production have smaller uncertainties, these indirectly probe y t via the one-loop gluon-gluon-Higgs interaction.
A further observation in figure 14a is that we can extract a 68% CL upper limit on the top Yukawa κ t < 1.25 assuming fixed κ λ = 1. This improves slightly to κ t < 1.22 using the DNN trained on κ λ = 5. It is interesting that this additional result of our work arises without considering dedicated optimisation for constraining κ t . That the di-Higgs process has constraining power to put a better than 25% upper bound on κ t is helped by the fast quartic dependence of the box diagram contribution to the cross-section σ box ∝ y 4 t . The ability to constrain κ t depends strongly on the value of κ λ : for example, we find no κ t constraint if κ λ = 5. Although our hh → 4b analysis is not expected to compete with other higher statistics channels in constraining y t (directly via tth or indirectly via gluon fusion), it provides an additional independent probe that could contribute in a global fit. Now we extend our discussion by using figure 15, which overlays two-dimensional 68% CL contours to compare the baseline analysis (dashed lines) with DNN analysis trained on κ λ = 5 (solid lines). This figure also separates the contours by category, but as the boosted has little constraining power (in part due to the conservative 5% systematic), only comparisons between resolved and intermediate are of interest. Similar to the onedimensional χ 2 distributions, sensitivity is driven by the resolved category across κ t = 1 with the intermediate category contribution being non-negligible for negative κ λ .
Assuming κ t = 1, the resolved category baseline analysis is κ λ ∈ [−1.6, 7.9] and improves by 20% (in δκ λ ) to κ λ ∈ [−1.0, 6.6] for the DNN trained on κ λ = 5, which also is summarised in table 8. While the intermediate category has comparably weaker constraints of κ λ ∈ [−5.8, 13.6] for the baseline analysis, the DNN trained on κ λ = 5 improves this by 24% to κ λ ∈ [−3.4, 11.3]. This improvement is largely uniform for different κ t . For negative κ λ , the slope of the intermediate contour in the κ t vs κ λ plane is similar to that of the resolved. This suggests that the two categories are sensitive to similar effects of the constructive interference occurring for negative κ λ . By contrast, the fact that the intermediate contour is nearer vertical for positive κ λ shows it is less sensitive to changes in κ t even if the constraints are weaker than resolved. Additionally, we see that it is the resolved category that drives the κ t < 1.22 68% CL constraint for κ λ = 1.

Conclusion
This paper presented a comprehensive analysis of Higgs pair production in 4b final states to evaluate and improve Higgs self-coupling (λ hhh ) constraints at the HL-LHC. We extended current strategies in several directions and now summarise our conclusions.
We performed a detailed comparison of λ hhh constraints across event categories that have zero (resolved), exactly one (intermediate) and two or more (boosted) boosted Higgs bosons in the 4b channel for the first time. The resolved analysis was found to be most constraining, followed by the intermediate and then the boosted. We used deep neural networks for signal-background separation and achieved an 86% background rejection for 61% SM signal efficiency in the resolved category compared to the baseline analysis. This improved the signal significance from 0.69 (0.39) to 0.75 (0.53) without (with) the nominal systematics assumed, which can contribute in combination with other channels. Importantly, analyses optimised for discovery of SM hh production can be suboptimal for constraining λ hhh in the 4b channel. We improved constraints by optimising on signals with λ hhh values that are five times the SM and have lower boosts than those closer to SM values. We showed the impact of experimental limitations on jet reconstruction and triggering for reconstructing the principal discriminating variable m hh .
Assuming SM top Yukawa y t , our baseline analysis provides 68% CL constraints of −1.6 < κ λ < 7.9, while the corresponding constraint for the neural network analysis is −0.8 < κ λ < 6.6. This assumes 3000 fb −1 of luminosity, background systematics controlled to 0.3%, 1% and 5% for resolved, intermediate and boosted categories respectively, and no further advances in the analysis strategy. Interestingly, we can extract a 68% CL upper -34 -JHEP12(2020)115 limit on the top Yukawa of κ t < 1.22 assuming κ λ = 1 despite no dedicated optimisation, though this constraint changes rapidly for different κ λ values. We quantified that current uncertainties on y t constraints can modify λ hhh limits by ∼ 20%. We caution that these constraints are not necessarily directly comparable with other projections in the literature due to different simplifying assumptions.
While our study suggests that the 4b channel is challenging for probing λ hhh compared with for example bbγγ, information from all independent channels are experimentally important in the final statistical combination. Our conclusions sharpen the experimental requirements to improve λ hhh constraints using the hh → 4b channel, providing key performance targets for the HL-LHC programme. Given the dominant reducible 2b2j background, reducing the flavour mistag rate is critical, which requires both hardware (upgraded trackers improve vertex resolution) and software (e.g. deep learning) solutions. We quantified the impact of different scenarios of background systematic uncertainties from 5% to statistical-only for all three categories. This showed that controlling systematics on multijet background estimates to demanding percent level or better is crucial for improving λ hhh sensitivity. For scenarios where the boost of the Higgs bosons is low, notably around the boundary of expected sensitivity for positive κ λ ∼ 5 values, maintaining low p T jet triggers will be particularly important. Experimentally, applying and improving this rich confluence of techniques explored will accelerate progress towards constraining λ hhh beyond gains from increased statistics alone.

A Additional kinematic distributions
This appendix collects additional kinematic distributions. Figure 16a shows the impact at parton level of top Yukawa variations for fixed κ λ = 1. In particular, κ t = 1.5 has little impact on the m hh shape, but lowering κ t to 0.5 makes the destructive interference induce a similar m hh shape as (κ λ , κ t ) = (2, 1). Figure 16b displays the rapidity difference between the Higgs bosons |∆η(h 1 , h 2 )|, which has mild discriminating power between couplings. Figure 17 displays distributions at reconstructed level for both backgrounds and signals in the baseline analysis signal regions normalised to cross-sections. The subleading Higgs candidate p T (h cand 2 ) and transverse momentum of the di-Higgs system p T (hh) are shown. We note that in the resolved category, the p T (h cand 2 ) retains some discrimination power between different κ λ hypotheses. The p T (hh) is also lower for the resolved than the intermediate and boosted analyses. This suggests that a modest amount of the boost that the Higgs bosons receives arises from the recoil of the di-Higgs system against other activity such as initial state radiation jets. except for requirements on the variable being plotted. The magnitude of the missing transverse momentum E miss T and the pseudorapidity difference of the two Higgs candidates |∆η(h 1 , h 2 )| is shown. A mild sensitivity to different κ λ values is observed for |∆η(h 1 , h 2 )|, however, this is subdominant compared to the discriminating power of m hh or p T (h). Other variables employed in our analysis selection are used to reduce background and are physically insensitive to κ λ . For example, we cut on E miss T to reduce semileptonic tt which has neutrinos in the final state, but this has no physical correlation with kinematics that are affected by modifying κ λ .

B Neural network performance
This appendix provides further performance benchmarks of the neural network analysis for the DNN trained on κ λ = 1 and κ λ = 5 signals. All the plots in this section are made with an independent set of events compared with those used for training. The selection criteria applied for the DNN training, and also for all plots in this section, are looser than the final analysis selection in order to retain sufficient event statistics for reliable training. One significant difference is that this dataset does not include any requirements on the number of b-tagged jets. Figure 19 shows the true positive vs false positive rate known as receiver operator characteristic (ROC) curves. These are computed on the whole test set (i.e. the set of events not used for training) for these networks in the resolved, intermediate, and boosted categories for signal classification only. In the general case of multi-class classification, the Figures 20 also displays the signal acceptance times efficiency A × ε as a function of κ λ for the different categories in the neural network analysis. One can see that the signal acceptance is higher around κ λ ∼ 5 when the DNN is trained on κ λ = 5. Finally, figure 21 shows the ranked feature importance for the neural network analysis trained on κ λ = 1. Discussion of this figure follows that presented in section 3.2.1 of the main text.

B.1 Heuristics of Shapley values
Here we provide a heuristic overview of the SHAP method for model interpretation; for technical details and definitions, see ref. [88]. The SHAP method is based on Shapley values φ, which were originally introduced in the context of game theory. These calculate the model prediction (neural network score), f (S ∪ x), when including a variable x (feature) from a set S. The prediction is computed without this variable f (S). From these, the difference, f (S ∪ x) − f (S), intuitively characterises the importance of that variable and

B.2 Network correlation plots
We validate the architecture and training of the neural network analysis has the expected behaviour by testing their performance on a statistically independent set of signal and background events. Figure 22 show the leading Higgs candidate mass as a function of the signal score trained on κ λ = 1 and κ λ = 5 signals when tested on the corresponding pp → hh signal that exclusively decays to b-quarks. For the resolved category, a prominent peak appears at high signal score trained on κ λ = 1 and around m(h cand 1 ) ∼ 125 GeV. There is a tail of events at lower signal score where it is difficult to kinematically distinguish the signal from background.
An interesting feature is observed in figure 22 depicting the κ λ = 5 DNN score computed in the κ λ = 5 signal sample versus the leading Higgs candidate mass. Three distinct peaks can be seen. We find each peak corresponds to events with different numbers of b-tags. The most prominent peak with the highest signal score corresponds to events with three or four b-tags. Meanwhile, the two peaks with signal scores around 0.3 and 0.5 correspond to events with one and two b-tags respectively. Figures 23 show similar correlation plots, now testing on a tt sample. We see low values of signal scores, indicating that the networks are effective at classifying this background. More structure is seen in these two-dimensional plots due to the inclusive decay of the b-quarks in this sample. Several of these plots feature two distinct peaks, corresponding to semi-leptonic and hadronic b-decays. Hadronic decays correspond to the peak with higher signal score, since these are more similar to the hh → 4b signal. Furthermore, the plots for the intermediate and boosted categories feature two distinct peaks in the leading Higgs mass around ∼ 80 GeV and ∼ 170 GeV. This suggests that the large jet is capturing the boosted decay products of the W boson and top quarks. Figures 24 and 25 show the corresponding correlation plots for the di-Higgs invariant mass m hh as a function of DNN signal score. We see that the signal score is not strongly correlated with m hh .

C Additional χ 2 distributions
This appendix collects χ 2 distributions supplementing those in the main text. Figure 26 shows the discrimination power χ 2 ij matrix between different κ λ hypotheses for κ t = 1 for the baseline analysis and neural network analysis trained on κ λ = 1. We find a similar pattern of regions with lower χ 2 ij values as figure 13 in the main text, where further discussion can be found.  Figure 27 shows a summary of the 68% CL limits for different systematics when the DNN is trained on the κ λ = 1 signal. The qualitative features are similar to the corresponding figure 12 where the DNN is trained on the κ λ = 5 signal. As noted in the main text, the DNN trained on κ λ = 1 for the intermediate and boosted categories are slightly more constraining than that trained on κ λ = 5.        Figure 27. Summary of 68% CL intervals (χ 2 < 1) on κ λ with fixed κ t = 1 for different assumed background systematics from 0% to 5% for the neural network analysis trained on the κ λ = 1 signal. This is shown for the resolved (dark blue), intermediate (medium blue), boosted (light blue) categories along with their combination (orange). The luminosity is assumed to be 3000 fb −1 . Lines without endcaps mean the 68% CL limit is outside the considered range of κ λ ∈ [−20, 20].  analysis and (b) neural network analysis trained on the κ λ = 5 signal. This is displayed in the κ t vs κ λ plane at L = 3000 fb −1 . This is shown for the resolved (upper), intermediate (middle), boosted (lower) categories assuming 0.3%, 1% and 5% systematic uncertainties, respectively. The grey contour indicates χ 2 = 1 corresponding to 68% CL.