Constraints on the trilinear and quartic Higgs couplings from triple Higgs production at the LHC and beyond

Experimental information on the trilinear Higgs boson self-coupling $\kappa_3$ and the quartic self-coupling $\kappa_4$ will be crucial for gaining insight into the shape of the Higgs potential and the nature of the electroweak phase transition. While Higgs pair production processes provide access to $\kappa_3$, triple Higgs production processes, despite their small cross sections, will provide valuable complementary information on $\kappa_3$ and first experimental constraints on $\kappa_4$. We investigate triple Higgs boson production at the HL-LHC, employing efficient Graph Neural Network methodologies to maximise the statistical yield. We show that it will be possible to establish bounds on the variation of both couplings from the HL-LHC analyses that go significantly beyond the constraints from perturbative unitarity. We also discuss the prospects for the analysis of triple Higgs production at future high-energy lepton colliders operating at the TeV scale.


I. INTRODUCTION
Since the discovery of a Higgs boson with a mass of about 125 GeV in 2012 [1,2], a tremendous and ongoing effort has been made to gain insights into the properties and interactions of the detected state. Its couplings to third-generation fermions and weak gauge bosons, as well as the loop-induced couplings to gluons and photons, have been investigated in detail, indicating agreement with the predictions of the Standard Model (SM) within the present experimental and theoretical uncertainties. In view of the plethora of possible connections of the detected Higgs boson to sectors of physics beyond the SM (BSM), probing the Higgs interactions with respect to possible effects of BSM physics will be of central importance at the present and future runs of the LHC and at any future collider.
In this context the Higgs boson self-couplings are of particular relevance, while experimentally these couplings are very difficult to access. Experimental information about the trilinear and quartic Higgs couplings is needed to gain insights into the shape of the Higgs potential, which has implications for a better understanding of the electroweak phase transition in the early universe and may be instrumental for explaining the observed asymmetry between matter and anti-matter in the universe. In the SM the Higgs potential is given in terms of the single Higgs doublet field Φ by

V(Φ) = −μ² Φ†Φ + λ (Φ†Φ)² .   (1)

In extended scalar sectors the potential can have a much richer structure. While the cubic and quartic Higgs couplings arising from Eq. (1) are correlated in the SM and can be predicted in terms of the known experimental values of the mass of the detected Higgs boson and the vacuum expectation value, large deviations from the SM predictions for the Higgs self-couplings are possible even in scenarios where the other couplings of the Higgs boson at 125 GeV are very close to the SM predictions (see e.g. Ref. [3] for a recent discussion of this point for the case of the trilinear Higgs coupling). Experimental constraints on the trilinear and quartic Higgs self-couplings can be expressed in terms of the so-called κ-framework, where κ_3 (κ_4) denotes the coupling modifier of the cubic (quartic) coupling relative to its SM value at lowest order, i.e. κ_i = g_i/g_i^SM with i = 3, 4, where g_i denotes the value of the coupling and g_i^SM its lowest-order SM prediction.
The most direct probe of the trilinear Higgs coupling at the LHC is the production of a pair of Higgs bosons, where κ_3 enters at leading order (LO). Both the ATLAS [4] and CMS [5] collaborations determine limits on κ_3 from both gluon fusion and weak boson fusion (WBF) in different decay channels of the Higgs boson. At next-to-leading order (NLO), the trilinear Higgs coupling contributes to the Higgs-boson self-energy and also enters in additional one-loop and two-loop diagrams in WBF and gluon fusion, respectively, enabling the possibility of an indirect measurement through single-Higgs production [6-10]. The inclusion of single-Higgs information by the ATLAS collaboration results in the most stringent bound to date, κ_3 ∈ [−0.4, 6.3]. Triple-Higgs production is known to suffer from very small cross sections, but yields additional information on κ_3 which could be used in combination with the aforementioned searches. Furthermore, it can provide the first experimental constraints on the quartic Higgs coupling κ_4.
The paper is structured as follows. In Sect. II we discuss the allowed values of κ_3 and κ_4 from the perspective of perturbative unitarity and show that sizeable contributions to κ_4 can occur, especially if κ_3 deviates from the SM value. We explore in Sect. III how well the HL-LHC will be able to constrain κ_3 and κ_4 from the 6b and 4b2τ channels. Lepton colliders are additionally explored in Sect. IV before conclusions are presented in Sect. V.

II. CURRENT BOUNDS, UNITARITY AND THEORETICAL MOTIVATION
Besides the experimental constraints from Higgs pair and triple Higgs production processes, which will be discussed below, theoretical bounds can be placed on the Higgs self-couplings from the requirement of perturbative unitarity. In our analysis we employ the unitarity constraints obtained at tree level.
A general matrix element for 2 → 2 scattering with initial and final states |i⟩ and |f⟩, respectively, can be decomposed in terms of partial waves through the Jacob–Wick expansion [11],

M_fi(θ, φ) = 16π Σ_J (2J + 1) a^J_fi D^J_{λ_i λ_f}(θ, φ) ,   (2)

where J indicates the total angular momentum of the corresponding amplitude, and λ_{i,f} denote the helicities of the initial and final states. The most relevant channel at tree level for constraining κ_3 and κ_4 is HH → HH scattering, where the Wigner D-functions D^J_{λ_i λ_f} reduce to unity for the zeroth partial wave. Conservation of probability leads to the requirement that the perturbative expansion must satisfy the optical theorem, which can be used to obtain an upper bound on the zeroth partial wave,

|Re a^0_fi| ≤ 1/2 ,   (3)

which if violated indicates inconsistencies in the perturbative calculation.
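As an illustration of how such a unitarity scan can be set up, the sketch below evaluates the zeroth partial wave of the tree-level HH → HH amplitude numerically, with a contact term proportional to κ_4 and s-, t- and u-channel Higgs exchange proportional to κ_3². The couplings, the identical-particle treatment and the overall normalisation are simplified assumptions for illustration, not the exact expressions used in the paper.

```python
import math

MH = 125.0   # Higgs mass in GeV
V = 246.0    # electroweak VEV in GeV

def a0_hh(s, k3, k4, npts=2000):
    """Illustrative zeroth partial wave for HH -> HH at tree level:
    contact term proportional to kappa_4 plus s-, t- and u-channel
    Higgs exchange proportional to kappa_3^2 (conventions simplified)."""
    g3 = 3 * MH**2 / V * k3       # trilinear hhh coupling (assumed form)
    g4 = 3 * MH**2 / V**2 * k4    # quartic hhhh coupling (assumed form)
    p2 = s / 4 - MH**2            # CM three-momentum squared
    total = 0.0
    for n in range(npts):
        c = -1 + 2 * (n + 0.5) / npts   # midpoint grid in cos(theta)
        t = -2 * p2 * (1 - c)
        u = -2 * p2 * (1 + c)
        amp = -g4 + g3**2 * (1 / (s - MH**2) + 1 / (t - MH**2) + 1 / (u - MH**2))
        total += amp * (2 / npts)
    return total / (32 * math.pi)   # a0 = (1/32pi) int dcos(theta) M
```

Scanning `a0_hh` over √s for a grid of (κ_3, κ_4) values reproduces the qualitative behaviour described in the text: at high energies the exchange terms fall like 1/s and a0 becomes a constant proportional to κ_4, while at lower energies the κ_3 terms can produce a peak.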
The zeroth partial wave at tree level can be calculated as

a^0_fi = (1/32π) ∫_{−1}^{1} d cos θ M_fi(θ) .   (4)

In the limit where the centre-of-mass energy is high, a^0_ii solely depends on κ_4, while at lower energies a sizeable contribution from κ_3 can yield a peak in a^0_ii that surpasses the allowed limit. We have calculated the zeroth partial wave for different values of κ_3 and κ_4 for a large range of energies in order to identify the parameter regions that are allowed by tree-level perturbative unitarity. Fig. 1 shows the bounds from perturbative unitarity along with the current experimental bounds on the trilinear coupling from ATLAS, κ_3 ∈ [−0.4, 6.3] at the 95% C.L. [4], and the 95% C.L. combined ATLAS and CMS HL-LHC projection under the SM hypothesis, κ_3 ∈ [0.1, 2.3]. The unitarity bounds on κ_4 are significantly weaker than the ones on κ_3. This feature can be understood from the Effective Field Theory (EFT) perspective, where effects of higher-dimensional operators on the potential are incorporated as an expansion in inverse powers of a UV scale Λ [20] (see also the discussion in Refs. [18,21]),

V(Φ) = −μ² Φ†Φ + λ (Φ†Φ)² + (c_6/Λ²) (Φ†Φ − v²/2)³ + (c_8/Λ⁴) (Φ†Φ − v²/2)⁴ + … .   (5)

We use the convention where in the unitary gauge Φ = (0, (v + H)/√2)ᵀ, v denotes the electroweak vacuum expectation value (VEV), and H is the 125 GeV Higgs boson. The benefit of this parameterisation of the higher-dimensional operators is that κ_3 receives corrections purely from the dimension-six operator, while κ_4 receives them only from the dimension-six and dimension-eight operators (interaction vertices with more Higgs legs would additionally receive corrections from O(1/Λ⁶) terms, and so on). With the definitions of κ_i as before, the coupling modifiers receive the corrections

κ_3 − 1 = c_6 v²/(λ Λ²) ,   κ_4 − 1 = 6 (κ_3 − 1) + 4 c_8 v⁴/(λ Λ⁴) .   (6)

Thus, if a small correction is induced in κ_3, one should expect that in an EFT with a high cutoff scale, where the dimension-eight terms are negligible, the deviation in κ_4 from the SM expectation is six times larger.
Even if the higher-dimensional contributions are relevant, |(κ_4 − 1) − 6(κ_3 − 1)| < 6|κ_3 − 1| needs to be satisfied in order to maintain a well-behaved expansion in powers of 1/Λ. Although in this work we choose to work in full generality without any EFT assumptions on the κ_3 and κ_4 modifiers, we indicate the region where this condition is fulfilled in Fig. 1.
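The EFT-validity condition above is straightforward to check for a given point in the κ_3-κ_4 plane; a minimal sketch (function name hypothetical):

```python
def eft_expansion_ok(k3, k4):
    """Check |(kappa_4 - 1) - 6(kappa_3 - 1)| < 6 |kappa_3 - 1|,
    i.e. that the dimension-eight shift in kappa_4 stays subdominant
    to the dimension-six one, so the 1/Lambda expansion is well behaved."""
    dim8_piece = (k4 - 1.0) - 6.0 * (k3 - 1.0)
    return abs(dim8_piece) < 6.0 * abs(k3 - 1.0)
```

For example, (κ_3, κ_4) = (2, 7) satisfies the condition (the dimension-eight piece vanishes), while (κ_3, κ_4) = (1, 5) does not, since any shift in κ_4 at κ_3 = 1 must be purely dimension-eight.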
In order to present an example where Eq. (6) can be realised, we consider the Two-Higgs-Doublet Model (2HDM), in which the cubic and quartic self-couplings can receive significant contributions beyond tree level, as shown in Ref. [3] (see also Ref. [22]). A review of the 2HDM can be found in Ref. [23]. We work in the alignment limit with the lightest scalar identified as the 125 GeV Higgs boson (after electroweak symmetry breaking (EWSB) and rotation to the Higgs basis) and perform a one-loop calculation of the trilinear and quartic couplings employing the on-shell renormalisation scheme. As a motivated example we pick a benchmark point from Refs. [3,26] which is compatible with the latest experimental results while also receiving sizeable trilinear-coupling corrections. We reproduce the one-loop result of Refs. [3,26] and also show the quartic coupling in Fig. 2. As expected, the prediction for the quartic coupling quickly rises to values even beyond what is allowed by tree-level perturbative unitarity in the κ-framework if the splitting between the mass of the CP-odd Higgs boson in the 2HDM, m_A, and the BSM Higgs scale M, with M² = m²_12/(s_β c_β), increases. In the displayed example the unitarity bound is violated if m_A surpasses ∼1100 GeV, and further two-loop contributions would tighten the bound on m_A.
In Fig. 3 the linear relation between κ_3 and κ_4 in the 2HDM is shown for variations of the scale M and for masses m_A ≤ 1.5 TeV. Varying the values of m_A and M shifts the relation between the self-couplings while maintaining a linear correlation between them. For κ_3 = 6 the corresponding results for κ_4 vary between κ_4 ≈ 22 and κ_4 ≈ 31 for the displayed scenarios. Thus, the largest values of κ_3 allowed by the present bounds are correlated in the 2HDM with very large shifts in κ_4. As indicated by the shaded light-blue region in the plot, these predictions for κ_3 and κ_4 are associated with a well-behaved power expansion within an EFT framework. While it would also be of interest to explore which models can induce an even larger deviation of κ_4 for relatively small values of κ_3, potentially resulting in regions that require a non-linear effective prescription (for instance the electroweak chiral Lagrangian), we leave such an investigation for future work.

III. TRIPLE HIGGS PRODUCTION AT THE HL-LHC
The production of three Higgs bosons at the LHC and future colliders is highly suppressed compared to single and double Higgs production, severely limiting the final states that can be explored at the LHC. In order to obtain the highest values for the product of cross section and branching ratios, one needs to consider the dominant production mode through gluon fusion together with the main decay channel to a b-quark pair. The latter is difficult in hadron collisions due to the sizeable multi-jet background from QCD processes. It can be problematic for typical cut-and-count analyses to sufficiently suppress the background while at the same time avoiding a large reduction of signal events in order to maximise the significance. In this work we resort to Machine Learning (ML) techniques for appropriately selecting the signal region of the considered channels.
In order to identify which of the decay channels of the on-shell Higgs bosons can be utilised for the analysis at the LHC, we start with an optimistic estimate of the number of events for the 6b, 4b2τ, 2b4τ and 4b2γ final states. Within the SM the involved branching ratios are given by BR(H → bb̄) ≈ 0.58, BR(H → τ⁺τ⁻) ≈ 0.063 and BR(H → γγ) ≈ 0.0023. We note that the 4b2γ and 2b4τ final states only produce a few events at 3/ab, even at relatively large coupling modifiers κ_3 ≳ 4.5, κ_4 ≳ 30 (taking into account K-factors of 1.7 [28] and tagging efficiencies for all taus and all but one of the b-quarks). It is therefore unlikely that these channels will be statistically significant at the HL-LHC, even though they can be highly relevant for colliders operating at higher energies, as shown in Refs. [27,29-31]. We will therefore not consider these channels further, and instead focus on the 6b and 4b2τ channels.
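With SM branching-ratio values of this size, the combined branching ratios for the different triple-Higgs final states follow from simple combinatorics; a small sketch (branching-ratio values approximate, function name hypothetical):

```python
from math import factorial

# Approximate SM branching ratios for a 125 GeV Higgs boson
BR_BB = 0.582   # H -> b bbar
BR_TT = 0.0627  # H -> tau+ tau-
BR_AA = 0.00227 # H -> gamma gamma

def channel_br(n_bb, n_tautau=0, n_gamgam=0):
    """Combined branching ratio for three Higgs bosons, including the
    multinomial factor counting which of the three Higgses takes which decay."""
    assert n_bb + n_tautau + n_gamgam == 3
    mult = factorial(3) // (factorial(n_bb) * factorial(n_tautau) * factorial(n_gamgam))
    return mult * BR_BB**n_bb * BR_TT**n_tautau * BR_AA**n_gamgam
```

For instance, `channel_br(3)` ≈ 0.20 for 6b and `channel_br(2, 1)` ≈ 0.064 for 4b2τ, while `channel_br(1, 2)` (2b4τ) and `channel_br(2, 0, 1)` (4b2γ) fall below 0.007, illustrating why the latter two channels yield so few events.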
The background processes for the 6b final state have been thoroughly discussed in Ref. [34] (see also Ref. [35]), and the dominant contribution is expected to arise from multi-jet QCD 6b events. This is the only background taken into consideration for this final state in this work; subdominant channels are neglected.
In the 4b2τ channel, the dominant backgrounds arise from the production of four b-quarks together with two W bosons (WWbbbb) or one Z boson (Zbbbb). The former includes the production of a top and bottom pair (ttbb) with subsequent decays t → Wb. The production of a top pair in association with a Higgs boson (ttH) or a Z boson (ttZ) also yields noteworthy contributions. Here the ttH channel is particularly problematic if a reconstructed resonance close to 125 GeV is required in the analysis to isolate the triple-Higgs signal. The final background included in our analysis is four-top production (tttt).
A. Analysis

Event generation and pre-selection
We use MadGraph5_aMC@NLO [37,38] for event generation and modify the provided SM model file in the UFO [39] format to introduce the modifications κ_3 and κ_4 of the trilinear and quartic Higgs couplings, respectively. Signal events are generated for pp → hhh and are subsequently decayed on-shell with MadSpin [42] in order to obtain the cross-section rates. Due to the complexity of the multi-particle final states we generate events with a minimum transverse momentum for the b-quarks of p_T(b) > 28 GeV and within the pseudorapidity region |η| < 2.5, while we later impose stricter cuts in the analysis. Additionally, since the signal consists of three on-shell Higgs bosons, we impose a generation-level cut on the invariant mass of the process of √ŝ > 350 GeV.
While in principle one could explore different cuts in order to efficiently identify the signal region, the complexity of the final states would render this a cumbersome and difficult procedure, possibly requiring the use of complicated observables. Instead, we resort to Graph Neural Networks (GNNs) for an efficient discrimination between signal and background events. This requires an appropriate embedding of particle events into graphs. Before we address the ML aspects of the analysis, we define the pre-selection conditions that each event passed to the network is required to satisfy.
Showering and hadronisation are performed with Pythia8 [43], and the resulting events are saved as HepMC files [44].
FastJet [45,46] is interfaced through Rivet [47,48], and jets are clustered using the anti-k_T algorithm [49] with radius 0.4, requiring a transverse momentum of p_T(j) > 30 GeV. We use Rivet to select the events that pass the pre-selection, assuming a b-tagging efficiency (independent of p_T) of 0.8. For the 6b (4b2τ) channel, at least five (three) b-quarks are required, satisfying the conditions p_T(b) > 30 GeV and |η(b)| < 2.5. For the 4b2τ channel two τ leptons must also be identified in the central part of the detector, |η(τ)| < 2.5, with p_T(τ) > 10 GeV. The τ leptons are identified with the TauFinder class of Rivet, and at least one τ must decay hadronically. We apply an efficiency of 0.8 for both leptonic and hadronic taus. The invariant mass of the sum of the four-momenta of the above final states must exceed 350 GeV, otherwise the event is vetoed. Finally, we form combinations of b-quark pairs, and at least one pair is required to have an invariant mass close to the mass of the Higgs boson, m_bb ∈ [110, 140] GeV. In the case of the 4b2τ channel the event also passes the pre-selection if the invariant-mass criterion is satisfied by the invariant mass of the di-tau system, m_ττ.
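The pre-selection described above can be summarised schematically as follows; the event representation (dicts of pt, eta and four-momentum) and the helper names are hypothetical, while the thresholds follow the text:

```python
def passes_preselection(bjets, taus, channel="4b2tau"):
    """Sketch of the pre-selection: object counting, kinematic cuts,
    a 350 GeV total-invariant-mass requirement, and at least one pair
    near the Higgs mass. `bjets`/`taus` are lists of dicts with 'pt' (GeV),
    'eta' and 'p4' = (E, px, py, pz); the representation is illustrative."""
    good_b = [b for b in bjets if b["pt"] > 30.0 and abs(b["eta"]) < 2.5]
    good_t = [t for t in taus if t["pt"] > 10.0 and abs(t["eta"]) < 2.5]
    n_b_min = 5 if channel == "6b" else 3
    if len(good_b) < n_b_min:
        return False
    if channel == "4b2tau" and len(good_t) < 2:
        return False
    # total invariant mass of the selected objects must exceed 350 GeV
    objs = good_b + (good_t if channel == "4b2tau" else [])
    tot = [sum(o["p4"][i] for o in objs) for i in range(4)]
    if tot[0]**2 - tot[1]**2 - tot[2]**2 - tot[3]**2 < 350.0**2:
        return False

    def inv_m(p, q):
        s = [p["p4"][i] + q["p4"][i] for i in range(4)]
        return max(s[0]**2 - s[1]**2 - s[2]**2 - s[3]**2, 0.0) ** 0.5

    # at least one b-pair (or, for 4b2tau, the tau pair) near m_H
    pairs = [(good_b[i], good_b[j]) for i in range(len(good_b))
             for j in range(i + 1, len(good_b))]
    if channel == "4b2tau":
        pairs.append((good_t[0], good_t[1]))
    return any(110.0 <= inv_m(p, q) <= 140.0 for p, q in pairs)
```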

Graph Embedding and Neural Network Architecture
GNNs, stemming from the idea that certain types of data can be efficiently represented as graphs, have been increasingly utilised in particle physics. Various works have indicated their applicability for BSM-relevant tasks such as event classification [52,53], jet-tagging [54,55], particle reconstruction [56], identifying anomalies in data arising from BSM interactions [57,58] and obtaining constraints on parameters in SMEFT or the κ-framework [59,60]. The latter is what we aim to achieve by performing a fit within the κ_3-κ_4 plane after the efficient selection of a signal region with the GNN. Similar architectures using graphs have also recently been utilised by experiments at the LHC, see e.g. Ref. [62].
The generated events need to be embedded into graphs before they are passed to the neural network. We explore two different paths:

1. Fully Connected (FC): Add nodes for all the considered final states (i.e. b quarks and τ leptons, denoted as b_i and τ_i according to their p_T ordering) and edges connecting all the nodes. We use the transverse momentum, pseudorapidity, azimuthal angle, energy, mass and PDG identification number as node features, [p_T, η, φ, E, m, PDGID], while no edge features are introduced. A node is also added for the missing momentum of the event.
2. Reconstructed Nodes (RN): Add fully connected nodes for b quarks (and τ leptons for the 4b2τ final state) as before, but additionally add nodes H_i for reconstructed pairs of particles i, j that are (relatively well) compatible with the Higgs-boson mass, m_ij = 125 ± 25 GeV. This is achieved by forming combinations between all the b-quarks and (if applicable) the τ-pair. The H_i nodes carry the four-momentum and mass of the reconstructed pair, ordered according to closeness to the Higgs-boson mass of 125 GeV. All nodes have [p_T, η, φ, E, m, PDGID] as associated features, where the PDGID for H_i is zero.
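The two embeddings differ only in the node set; the RN construction can be sketched as follows, with a simplified tuple-based event representation and abbreviated kinematic features for the H_i nodes (both simplifying assumptions for illustration):

```python
def rn_embedding(particles):
    """Build the 'Reconstructed Nodes' graph: one node per final-state
    object plus one node per pair compatible with m_H = 125 +- 25 GeV.
    `particles` is a hypothetical list of (pt, eta, phi, E, m, pdgid, p4)
    tuples with p4 = (E, px, py, pz); returns (node_features, edge_list)
    for a fully connected graph."""
    def pair_mass(p, q):
        s = [p[6][i] + q[6][i] for i in range(4)]
        return max(s[0]**2 - s[1]**2 - s[2]**2 - s[3]**2, 0.0) ** 0.5

    nodes = [list(p[:6]) for p in particles]
    # reconstructed Higgs candidates within the 125 +- 25 GeV window,
    # ordered by closeness to 125 GeV
    cands = []
    for i in range(len(particles)):
        for j in range(i + 1, len(particles)):
            m = pair_mass(particles[i], particles[j])
            if 100.0 <= m <= 150.0:
                cands.append(m)
    for m in sorted(cands, key=lambda m: abs(m - 125.0)):
        # full pair kinematics would enter here; PDGID of H_i is 0
        nodes.append([0.0, 0.0, 0.0, 0.0, m, 0])
    n = len(nodes)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]
    return nodes, edges
```

Dropping the candidate-node loop recovers the FC embedding, so the same helper can serve both cases.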
Such physics-inspired approaches, which follow the expected decay chain of the event, have been shown to improve results in semi-leptonic top decays [59] and are actively being explored [63].
GNNs operate by calculating messages using node features (and edge features if these exist) and iteratively updating the node features at each message-passing layer. We rely on the EdgeConv [64] operation for message passing, where the message of node i at the l-th message-passing layer is calculated as

m_ij^(l) = Θ(x_j^(l) − x_i^(l)) + Φ(x_i^(l)) ,   (7)

where Θ and Φ indicate linear layers. The node features for l = 0 are the kinematical quantities we have defined as inputs, and the updated node features are obtained from the messages by averaging over the neighbouring nodes,

x_i^(l+1) = (1/|N(i)|) Σ_{j ∈ N(i)} m_ij^(l) .   (8)

The final node features after all EdgeConv operations are aggregated into a single vector using a 'mean' graph-readout operation. In principle, it is possible to additionally include further (non-graph-related) layers at this stage.
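The EdgeConv message-passing step described above (a message Θ(x_j − x_i) + Φ(x_i) followed by a mean over neighbours, matching the convention of the DGL implementation) can be sketched in plain Python:

```python
def linear(W, b, x):
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(W[r][c] * x[c] for c in range(len(x))) + b[r]
            for r in range(len(W))]

def edgeconv(features, neighbours, Theta, Phi, bT, bP):
    """One EdgeConv message-passing step:
        m_ij = Theta(x_j - x_i) + Phi(x_i)
        x_i' = mean over j in N(i) of m_ij
    `features` is a list of node feature vectors and `neighbours[i]`
    the list of nodes connected to node i."""
    updated = []
    for i, xi in enumerate(features):
        msgs = []
        for j in neighbours[i]:
            diff = [features[j][k] - xi[k] for k in range(len(xi))]
            m = linear(Theta, bT, diff)
            phi = linear(Phi, bP, xi)
            msgs.append([m[k] + phi[k] for k in range(len(m))])
        updated.append([sum(m[k] for m in msgs) / len(msgs)
                        for k in range(len(msgs[0]))])
    return updated
```

In the actual analysis the linear layers are trained PyTorch modules and the loop is vectorised by DGL; this sketch only makes the update rule explicit.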
The final network score is obtained with a linear layer with SoftMax activation [65] that reduces the resulting features to a two-dimensional vector, with each entry representing the probability that the event is signal or background. The number of EdgeConv layers and of the subsequent linear layers needs to be optimised to achieve high performance while avoiding overfitting.
After experimenting with different setups, we settled on using two EdgeConv layers with hidden features of size 96 before the output layer for both channels.

Training and comparison of graph embeddings
The data is split into subsamples of ∼56%, 19% and 25% for training, validation and testing, respectively, and we minimise the cross-entropy loss function in order to train the network using the Adam optimiser [66]. The learning rate is one of the hyperparameters requiring tuning; in our case a value of 0.001 (0.01) performs best for the 3b2τ (5b) analysis. If the loss has not decreased for three epochs in a row, the learning rate is reduced by a factor of 0.1. In principle the training can run for up to 200 epochs, although we impose an early-stopping condition if the loss has not improved for ten consecutive epochs. A batch size of 128 is used for every update of the loss function.
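The learning-rate reduction and early-stopping policy can be replayed in isolation; the actual training uses Adam in PyTorch, so the sketch below (with hypothetical names) only mimics the scheduling logic on a given sequence of epoch losses:

```python
def train_schedule(losses, lr0=0.001, patience_lr=3, patience_stop=10,
                   factor=0.1, max_epochs=200):
    """Replay the scheduling policy described above: multiply the learning
    rate by `factor` after `patience_lr` epochs without improvement, and
    stop after `patience_stop` epochs without improvement.
    Returns (epochs_run, final_lr)."""
    best = float("inf")
    lr = lr0
    since_best_lr = since_best_stop = 0
    epochs = 0
    for loss in losses[:max_epochs]:
        epochs += 1
        if loss < best:
            best = loss
            since_best_lr = since_best_stop = 0
        else:
            since_best_lr += 1
            since_best_stop += 1
            if since_best_lr >= patience_lr:
                lr *= factor       # reduce the learning rate
                since_best_lr = 0
            if since_best_stop >= patience_stop:
                break              # early stopping
    return epochs, lr
```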
The GNN for the 5b analysis is trained on two classes (signal and background). The situation is more involved for the 3b2τ case, where the analysis benefits significantly from multi-class classification, which allows identifying different thresholds for the different background scores. In particular, we choose to train on the WWbbbb, Zbbbb and tt(H → τ⁺τ⁻) contributions. The signal events used for training are always for the (κ_3, κ_4) = (1, 1) point (using different values does not significantly alter the performance of the network).
We use the EdgeConv implementation from the Deep Graph Library [67] with PyTorch [68] as backend. The graph embedding relies on PyLHE [69] to extract events from the Les Houches Events (LHE) files [70]. In order to compare the different graph embeddings, we use functionality from scikit-learn [71] to calculate the true and false positive rates at different thresholds, and we show the corresponding Receiver Operating Characteristic (ROC) curves for both channels in Fig. 4.
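While the analysis relies on scikit-learn for the ROC curves, the area under the curve can equivalently be computed as a rank statistic: the probability that a randomly chosen signal event scores above a randomly chosen background event. A self-contained sketch:

```python
def roc_auc(scores_sig, scores_bkg):
    """Rank-based AUC: fraction of (signal, background) score pairs in
    which the signal event scores higher, counting ties as one half.
    This equals the area under the ROC curve."""
    wins = 0.0
    for s in scores_sig:
        for b in scores_bkg:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(scores_sig) * len(scores_bkg))
```

A perfectly separating classifier gives 1.0 and an uninformative one 0.5, which is the scale against which the quoted areas for the RN and FC embeddings should be read.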
The ROC curves and the distributions allow one to conclude that the RN embedding utilising the reconstructed Higgs-boson mass can lead to significant improvements. This is not unexpected, as additional information (available at detector level) is passed to the network to aid classification. While in principle a sufficiently deep neural network with fully connected graphs could also eventually learn to map the input features of the b-jets (and taus) to the masses of the reconstructed Higgs bosons, including this information in the graph embedding allows easier optimisation and quick convergence with a shallow network. We therefore utilise only the RN embedding for performing the final signal-region selection.

FIG. 4: ROC curves for the 5b and 3b2τ analyses are displayed on the left and right, respectively, showing the performance of the two embedding cases. For the 3b2τ case we binarise in a one-vs-rest scheme, grouping together the background classes.
The areas under the 'RN' and 'FC' curves for the 5b case are 0.862 and 0.823, respectively.For the 3b2τ analysis the 'RN' area is 0.909 and the 'FC' area is 0.833.

HL-LHC Results
For simplicity we use the signal efficiency of the network for the (κ_3, κ_4) = (1, 1) point and assume that it remains largely the same irrespective of the coupling-modifier values, as our analysis is mostly dependent on the cross-section rates. Ideally one could train and optimise a network on each point, or alternatively train on event samples from topologies that depend differently on κ_3 and κ_4.13 For the 5b analysis we optimise the signal selection to reduce the false-positive rate to ∼0.6%. In the 3b2τ channel we instead impose thresholds on each of the background network scores. It should be noted that even though the network was trained only on a subset of the possible background contributions, it still performs well, as discussed below, and manages to remove background contributions from other sources as well. We calculate the efficiencies for each contribution and show the reduction of the cross sections in Tab. I. Our results for both channels include a K-factor for the signal of 1.7 [28] and a conservative estimate of the higher-order contributions to the background processes in terms of a K-factor of 2.

13 As a simple test we also trained on a sample that includes two signal classes for (κ_3, κ_4) = (1, 0) and (0, 1) for the 3b2τ analysis (effectively separating diagrams with κ_3 and κ_4 insertions) and tested this setup on different points. While there is an improvement in efficiency for certain points (up to 10%), for most (κ_3, κ_4) values the efficiency is close to that of the simpler setup with one signal class. We chose to utilise the latter, since it also enables an easier interpretation of our results in Sec. III B.
We define the significance for our analysis according to Ref. [72], where S and B denote the signal and background events, respectively. This allows us to obtain 1σ and 2σ bounds within the κ_3-κ_4 plane (which roughly correspond to 68% and 95% C.L., respectively), as shown in Fig. 5 for the 5b and 3b2τ analyses. We assume an integrated luminosity at the HL-LHC of 3/ab and a combined ATLAS and CMS luminosity of 6/ab. Overall, we observe that the 3b2τ analysis is more sensitive than the 5b analysis, and the latter will additionally suffer from further subdominant electroweak contributions to the background that have not been included. However, both channels should be utilised in combination to maximise the significance. Assuming for simplicity zero correlations between the channels, we combine the significances as Z_comb = √(Z²_5b + Z²_3b2τ), giving rise to the contours shown in Fig. 6 (left). While the projected bounds of about ±20 times the predicted value for the quartic Higgs self-coupling in the SM may appear to be quite weak, in view of our discussion above we emphasise that such bounds go much beyond the existing theoretical bounds. Furthermore, deviations of this size in κ_4 are well compatible with the existing experimental bounds on κ_3 according to the correlations between κ_3 and κ_4 that are present in the BSM scenarios analysed above. Regarding the sensitivity to κ_3 from triple Higgs boson production at the HL-LHC, Fig. 6 shows that the expected sensitivity in this channel is weaker than the present experimental limits derived from di-Higgs production. Combining this independent set of experimental information on κ_3 with the experimental results from di-Higgs production may nevertheless turn out to be useful. While our analysis may be optimistic in some respects (e.g. we neglect fake taus), we note on the other hand that further developments of the triggers, tagging and reconstruction algorithms for the final states could result in higher efficiencies than the values adopted in our analysis, enhancing the significance. The ability to discriminate between jet flavours is highly important for HHH studies (as well as HH studies) and could also allow experiments to study fully hadronic final states where H decays to W bosons. On the other hand, we note that even if the backgrounds are increased by 50%, the resulting constraints on κ_3 and κ_4 degrade only slightly, as shown in Fig. 6 (right).
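The significance definition of Ref. [72] is commonly the Asimov median discovery significance; assuming that form here, the per-channel and combined significances can be sketched as:

```python
import math

def asimov_Z(s, b):
    """Median discovery significance for s signal events over b background
    events, Z = sqrt(2 [(s + b) ln(1 + s/b) - s]); this reduces to the
    familiar s / sqrt(b) in the limit s << b."""
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))

def combined_Z(*zs):
    """Combine independent channels in quadrature, as done for the
    5b and 3b2tau analyses (zero correlations assumed)."""
    return math.sqrt(sum(z * z for z in zs))
```

For example, with S = 1 over B = 10⁶ the formula returns essentially S/√B = 10⁻³, while for sizeable S/B it is noticeably more conservative than S/√B.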

B. Interpretability of NN scores
Understandably, NN techniques are often viewed as "black boxes", due to the difficulty of identifying the input features that are most important for determining their predicted scores. In order to address this shortcoming, various approaches have been explored in recent years with the goal of providing interpretability, allowing efficient debugging of the network, better understanding the mapping between input and output, and ultimately identifying ways to improve it. Such methods have gained traction in particle physics in recent years for obtaining better insight into various tasks such as jet- and top-tagging and detector triggers [73-79].
There are various techniques for gaining interpretability in ML, but in general they can be separated into two categories: intrinsically interpretable models, which are specifically designed for transparency and intuition, and post-hoc explanation methods, which were developed to enhance our understanding of generic ML models. The latter category applies to this work. However, many post-hoc techniques lack certain properties that are beneficial to maintain; for example, one could directly use the product of the input and the gradients computed during backpropagation in order to attribute the most relevant features [80,81]. As the gradients of the network hold information about variations of the inputs, it should be possible to use them to quantify the dependence of the score on the features. It is known, though, that gradient methods can yield the same attribution for an input and a baseline that differ from each other and have different outputs (for an example see Ref. [82]), due to the gradient becoming flat (this is often the case, as NNs are trained until the loss saturates). Shapley values [83] (originating from game theory) are formulated on the basis of certain axioms to distribute the attributions among the participating variables of an ML approach and have been applied for obtaining interpretations [84] (for applications in particle physics, see Refs. [85-87]). Their attractiveness stems from the fact that they follow axiomatic principles, unlike earlier methods (e.g. DeepLift [88] or Layer-wise Relevance Propagation [89]). However, their evaluation is often computationally expensive and requires multiple calls of the neural network.
Integrated Gradients (IGs) are an alternative approach, designed in Ref. [82] using axiomatic considerations, which requires significantly fewer calls to the network function. The trade-off is the requirement that the ML technique must be differentiable, which is the case for NNs optimised through gradient descent, and the application of IGs also requires access to the gradient of the model.15 Let F : Rⁿ → [0, 1] denote a generic classification NN with input features x ∈ Rⁿ, and let x′ ∈ Rⁿ denote an appropriate baseline (e.g. a zero vector). Integrating over the gradients of F along a straight path from x′ to x defines the IG for the i-th feature as

IG_i(x) = (x_i − x′_i) ∫₀¹ [∂F(x′ + α(x − x′))/∂x_i] dα .   (9)

We thus utilise IGs, implemented in the Captum [90] library, in order to obtain attributions for our predictions and identify the most relevant inputs for our processes.
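Numerically, the path integral underlying IGs is approximated by a Riemann sum; a minimal sketch, in which the gradient function is supplied by hand (in practice, e.g. with Captum, it comes from automatic differentiation):

```python
def integrated_gradients(grad_F, x, baseline, steps=200):
    """Riemann-sum approximation to Integrated Gradients:
        IG_i = (x_i - x'_i) * int_0^1 dF/dx_i (x' + a (x - x')) da
    `grad_F(point)` returns the gradient of the scalar model at `point`."""
    n = len(x)
    acc = [0.0] * n
    for k in range(steps):
        a = (k + 0.5) / steps   # midpoint rule along the straight path
        point = [baseline[i] + a * (x[i] - baseline[i]) for i in range(n)]
        g = grad_F(point)
        for i in range(n):
            acc[i] += g[i] / steps
    return [(x[i] - baseline[i]) * acc[i] for i in range(n)]
```

A useful sanity check is the completeness axiom: the attributions sum to F(x) − F(x′), which also makes the 'dummy' axiom discussed below manifest, since a feature the model ignores has zero gradient everywhere along the path.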
The attributions obtained from IGs allow us to interpret the results of the network in terms of the input parameters for each node, as shown in Fig. 7, although some care is necessary when interpreting such results. Quite intuitively, the transverse momenta and the energy of the b-jets are relevant parameters that receive high attributions. This is expected, since restricting to higher values of p_T can help in the discrimination between signal and background (this was also the reason for applying a pre-selection momentum cut). The angular variables are not as helpful for discrimination; this is not unexpected, as we are dealing with scalars. The network additionally utilises the PID of the tau leptons more than the identification of the b quarks; this is likely due to the fact that the di-tau state is correlated with the highly discriminative reconstructed invariant mass of the Higgs boson. We clearly see that the introduction of the reconstructed masses significantly boosts the performance of the network, being the most important observable for the signal events. We note that, as a reconstructed particle, the Higgs node has been assigned a PID of zero, which, as required by the 'dummy' axiom,16 has no attribution and thus no contribution to the network results.

15 Often techniques such as Shapley values are called "black-box" approaches, as they have no access to anything other than the output of the ML approach, while IGs and similar techniques are referred to as "white-box" approaches.

16 The 'dummy' axiom states that a variable that does not contribute to the output of the network should have no attribution, ensuring that the attribution is insensitive to irrelevant inputs. It is a standard axiom imposed by interpretation methods (see e.g. Refs. [82,83]).
Taking a closer look at the reconstructed masses and their attributions, we see in Fig. 8 that the node with a reconstructed mass closest to 125 GeV receives a sizeable attribution. The attributions from the mass of the H_1^reco node indicate that, due to the similarity of the tt(H → τ⁺τ⁻) background to the signal, the network is unable to clearly discriminate the two classes based on this feature alone. The inclusion of the mass of the second reconstructed Higgs boson, however, helps the network, as indicated by the higher attributions assigned to the signal events compared to the other sources of background. This implies that the inclusion of reconstructed observables can enhance the performance of GNNs in certain analyses, as also expected from the discussion in Sec. III A 3.
We stress that while the IG attributions provide an indication of the most important variables, our approach does not yield detailed information on how the specific correlations between the input features impact the network score. While in many cases this would be desired, it is beyond the scope of our work, where we use IGs as a method to verify that the reconstructed Higgs-boson mass is indeed the most relevant variable. We leave explorations of alternative techniques (also specific to GNNs) pinpointing important connections between input features and nodes for future work.
In our work we utilised interpretation methods mostly to ensure that the GNN works as expected and in order to identify potential issues during the implementation of the network. However, the usefulness of such techniques extends well beyond this. For example, in the case of limited computing resources one could check which features are irrelevant and remove them from the analysis before scaling the network up. Indeed, in our analysis we checked that if we remove the seemingly unused angular information, we obtain similar results as before (resulting in no visible changes in Fig. 5 for 3b2τ). Additionally, for analyses with multiple final states, the most practical observable that can be exploited is not always straightforward to identify. Interpretation techniques could therefore be used as a first step towards identifying the most relevant observables before optimising the analysis to enhance its significance.

IV. REACH ASSESSMENT FOR LEPTON COLLIDERS AND COMPARISON WITH THE HL-LHC
For comparison with the prospects of the HL-LHC, we finally consider the expected upper limits on κ3 and κ4 from possible future lepton colliders.[18] We consider an inclusive analysis of ℓℓ → HHH + other, which includes both the associated ZHHH production and the production through WBF. In principle one could consider dedicated analyses for each channel, optimising the selection of final states; however, we choose to perform an inclusive analysis to avoid further assumptions on the identification of other states, which could vary depending on the collider concept and the detector. We consider the decay H → bb̄ of the Higgs boson, which yields the largest possible cross section for the signal, and assume throughout a b-tagging efficiency of 0.8. Our analysis relies solely on identifying 5b jets in the clean environment provided by lepton collisions. We apply an additional efficiency of ∼0.83, which arises from requiring the pT of the b jets to be larger than 30 GeV. We note that in practice the results for an electron or a muon collider would be similar, i.e. the obtained contours for the limits in the κ3–κ4 plane for a given collider c.m. energy and integrated luminosity would not be expected to differ significantly between the two collider types. Therefore we refer to generic lepton colliders in the following, although we use the centre-of-mass energies of 1 and 3 TeV envisaged for the ILC and CLIC, as well as 10 TeV collisions that could be realised at a muon collider. We scan over different values of κ3 and κ4 for the aforementioned energies and subsequently apply the relevant tagging efficiencies. An important limitation of high-energy lepton collisions in this case, however, arises from the region in which the detectors can tag b jets. While for energies of ∼1 TeV the b quarks are in the central part of the detector, the situation is significantly different for 10 TeV collisions, as shown in Fig. 9. It is thus necessary to explore possibilities for extending the tagging capabilities of future detectors up to |η| ∼ 4 in order to avoid a significant loss of events.

[18] This topic has previously been explored in Refs. [21, 91, 92].
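For orientation, the quoted efficiencies combine into an overall per-event selection efficiency roughly as follows. This is a sketch under stated assumptions: we take the 0.83 factor to be per event, and read the "5b" requirement as at least five b tags among the six b quarks from HHH → (bb̄)(bb̄)(bb̄); neither reading is spelled out above.

```python
from math import comb

eps_b = 0.8    # per-jet b-tagging efficiency quoted in the text
eps_pt = 0.83  # efficiency of the pT(b) > 30 GeV requirement (assumed per event)

# Probability of at least five b tags out of six b quarks (binomial counting).
p_ge5 = comb(6, 5) * eps_b**5 * (1 - eps_b) + eps_b**6

# Overall per-event selection efficiency under these assumptions.
eff_event = p_ge5 * eps_pt
```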
Leptonic collisions deliver clean signals, avoiding the large background contamination from QCD that is present at hadron colliders. We checked the leading-order QCD background of the signature (5b + other) and found that the cross sections of these background processes are small. Assuming that the selection of the signal region will enforce pT(b) > 30 GeV, the requirement that one di-bottom pair should be compatible with the mass of the observed Higgs boson, together with a cut ensuring that the total invariant mass of the final-state particles produced in the process is at least 350 GeV, would result in no remaining background events (even with more relaxed cuts, the number of events is negligible when taking b-tagging efficiencies into account). However, similarly to Refs. [21, 91, 92], we do not take into account electroweak backgrounds, which could be dominant and deserve a dedicated study. We turn to a Poissonian analysis as described in Ref. [93], where n corresponds to the number of events expected from the SM, i.e. for (κ3, κ4) = (1, 1). Upper limits on the mean value µ of the Poisson distribution are then calculated with

$\mu_{\mathrm{up}} = \tfrac{1}{2}\, F^{-1}_{\chi^2}\!\left(\mathrm{CL};\, 2(n+1)\right),$

where $F^{-1}_{\chi^2}$ denotes the inverse of the cumulative distribution function of the χ² distribution with 2(n + 1) degrees of freedom, and CL is the confidence level (e.g. 95%).
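The standard no-background Poisson upper limit (see e.g. Ref. [93]) can be evaluated directly via the χ² quantile function. A minimal sketch, assuming scipy is available and treating the event count as a placeholder:

```python
from scipy.stats import chi2

def poisson_upper_limit(n, cl=0.95):
    """Standard upper limit on a Poisson mean given n events and no
    background: mu_up = 0.5 * F_chi2^{-1}(CL; 2(n+1))."""
    return 0.5 * chi2.ppf(cl, 2 * (n + 1))

# Sanity check: for n = 0 this reproduces the familiar mu_up ≈ 3.0 at 95% CL,
# i.e. -ln(0.05) ≈ 2.996.
limit_n0 = poisson_upper_limit(0)
```

The limit grows with n, so larger SM-expected yields translate into weaker exclusions on the Poisson mean but stronger relative constraints on coupling modifiers.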
The resulting bounds at 95% CL are shown in Fig. 10 for different centre-of-mass energies and integrated luminosities. The plots show that the large integrated luminosities expected at colliders operating at 3 TeV and 10 TeV (as envisaged for CLIC and muon colliders, see Refs. [94, 95]) provide significant constraining power on κ3 and κ4 via the triple Higgs production process.
The lepton collider projections are compared with our results for the HL-LHC in Fig. 11. We find that the HL-LHC sensitivity to κ4 is competitive with the one achievable at a 1 TeV lepton collider such as the ILC. In particular, the comparison shows that for negative κ4 the HL-LHC is expected to have a better sensitivity than a 1 TeV lepton collider.
As discussed above, further developments in ML could increase both the tagging and selection efficiencies beyond our assumptions, and additional channels will provide complementary information.

V. CONCLUSIONS
Our investigation of the prospects at the HL-LHC shows that even though triple Higgs production is limited by low rates at the LHC, its exploration provides interesting information even if it does not receive additional contributions from new scalar resonances. Bounds can be placed on κ4 significantly beyond the theoretical constraints from perturbative unitarity.
While, as expected, the bounds on κ3 will be much weaker than the ones from double-Higgs production, they should be useful for improving the sensitivity through combinations. Additionally, if deviations from the SM are found, the correlation between the Higgs self-couplings can shed light on the possible scenarios of physics beyond the SM.
If an excess in the triple Higgs production process is observed, the correlation with the result for double-Higgs production will be immensely informative. On the one hand, if no deviation from the SM value of κ3 is identified in other channels, an indication of a large deviation in κ4 would likely imply the presence of non-linear effects that cannot be described consistently within an effective field theory approach via the expansion in terms of a heavy scale. On the other hand, a deviation in both coupling modifiers could indicate a correlation between κ3 and κ4 that can be confronted with the predictions of specific models, such as the 2HDM, and of effective field theories.
The physics gain that can be achieved via the statistically limited channel of triple Higgs production at the HL-LHC crucially depends on an efficient signal-background discrimination. For this purpose we have employed GNNs in our analysis. It is already evident from current experimental searches that such ML techniques will be the centerpiece of future studies. However, it is especially important in particle physics to be able to identify the relevant kinematical features that contribute to the identification of the signal. An unintuitive behaviour (e.g. a high attribution for a quantity that is already known to be irrelevant) could indicate a possible issue in the learning framework. Alternatively, potentially interesting quantities could be identified that could provide discriminative power even in simpler analyses that do not use ML algorithms. We have explored interpretability within GNNs using IGs, which satisfy the necessary axioms. We have shown that, as expected, the invariant mass of bottom and tau pairs is the most important feature in the data that is utilised for discrimination. We expect that such techniques will play an important role not only in the development of analyses for BSM searches but also in further applications in particle physics.

Our comparison of the prospects at the HL-LHC with those of future lepton colliders shows that the sensitivity to κ4 at the HL-LHC should be competitive with that of a 1 TeV lepton collider such as the ILC. While the sensitivities of lepton colliders at 3 and 10 TeV (e.g. CLIC or a possible muon collider) are expected to be considerably higher, these results will presumably become available only on a longer time scale, similar to the one for a future higher-energy hadron collider. Thus, it can be expected that the HL-LHC will be able to establish the first bounds on κ4 beyond theoretical considerations.

FIG. 1: Bounds from perturbative unitarity on κ3 and κ4 as obtained from HH → HH scattering. In addition, the current experimental bounds on κ3 are shown (black dashed lines), as well as the expected projections for the HL-LHC (black solid lines). The shaded light blue region indicates where the dimension-eight contributions to κ4 are smaller than the dimension-six ones, while the dotted blue line corresponds to the case where the dimension-eight contributions vanish, κ4 − 1 ≃ 6(κ3 − 1).

FIG. 2: The left plot shows the impact of an increasing splitting between the masses of the BSM Higgs bosons on the one-loop prediction for the trilinear Higgs coupling κ3, where M = mH = 600 GeV and mA = mH± is varied, in agreement with the results of Ref. [3] for the quoted benchmark point. On the right, the corresponding plot for κ4 is shown. The shorthand notations sβ, cβ and tβ denote sin β, cos β and tan β, respectively.

FIG. 6: The left plot shows the projected contours indicating the 1σ and 2σ bounds in the κ3–κ4 plane obtained from a combination of the 5b and 3b2τ channels under the assumption that there are no correlations. The right plot shows the corresponding result where the backgrounds for both channels are increased by 50%.

FIG. 7: Attributions for the features of the b jets, τ leptons and the reconstructed invariant Higgs-boson mass that is closest to the 125 GeV resonance. The height of the attribution value indicates to what extent the network is using the particular feature in order to discriminate between signal and background. In the figure we collectively denote all the background classes as 'B' and the signal as 'S'.

FIG. 8: Histograms showing the attribution for different events against the value of the reconstructed mass for the (true) signal and backgrounds. The plot on the left (right) shows the reconstructed mass of the Higgs boson that is closest (second-closest) to the 125 GeV resonance. A positive attribution close to 1 indicates events with a high output score (i.e. identified as signal), while lower values of the attribution imply a low output score.

FIG. 9: Pseudorapidity distribution for the leading b quark for different collision energies.

FIG. 10: On the left, the projected 95% CL contours for lepton colliders at different energies and integrated luminosities are shown, mainly focusing on the energies of the ILC, CLIC and a possible muon collider. The SM value is shown as a black dot. The plot on the right shows a zoomed-in version.

FIG. 11: Comparison of the lepton collider projections with our results for the HL-LHC.

TABLE I: Background contributions included in the 3b2τ analysis and reduction of the generated cross sections (labelled "gen.") after pre-selection cuts ("sel.") and GNN selection ("NN"). W bosons arising from top quarks are allowed to decay hadronically, and c jets can be mis-tagged as b jets with a probability of 0.2.