Variational Autoencoders for Regression: Recovering Fully Leptonic $b\bar{b}W^+W^-$ in Di-Higgs Searches

The search for double Higgs production in $b\bar{b}W^+W^-$, where both $W$ bosons decay to leptons, has been rehabilitated as a good option to look for that key process to the Standard Model scalar sector study in the LHC. The missing neutrinos, however, hinder the reconstruction of useful information like the Higgs pair mass, which is very sensitive to the trilinear Higgs self-coupling. We present a solution to that problem using a Variational Autoencoder for Regression (VAER) to reconstruct the Higgs and top pairs decays $hh,t\bar{t}\to b\bar{b}W^+W^-\to b\bar{b}\ell^+\ell^{\prime -}\nu_\ell\bar{\nu}_{\ell^\prime}$. The algorithm predicts the invariant mass of non-resonant $hh$ irrespective of the trilinear coupling, even for events whose Higgs self-couplings were never presented to it. VAER is also able to identify a new Higgs resonance in an unsupervised way, showing generalization power for events not presented in its training phase. Finally, we demonstrate that VAER prediction is as useful to statistical inference as ground truth simulated distributions by computing a $\chi^2$ between trilinear coupling hypotheses based on binned invariant mass distributions of $b\bar{b}\ell^+\ell^{\prime -}\nu_\ell\bar{\nu}_{\ell^\prime}$.


INTRODUCTION
A challenging problem in high-energy physics phenomenology is recovering information lost in collisions that produce feebly interaction particles that escape detection like neutrinos.In particular, for kinematics reconstruction, missing neutrinos pose a problem whenever we want to detect resonances or measure theory parameters that are sensitive to that kinematics.For example, to measure the W boson mass, we rely only on the kinematic distributions of the charged lepton that accompany the neutrino in the leptonic decay mode since it is not possible to reconstruct its four-momentum in this case, and because two jet decay is plagued by overwhelming QCD backgrounds.In the absence of a resonant peak, the transverse mass, a W mass-sensitive variable, is used to compare data against prediction.As an outcome, the W mass is measured with much less precision than the Z mass whose resonance peak is available through the lepton pair invariant mass [1].
Transverse mass is a typical feature that is engineered to substitute for the missing information that prevents us from building an optimal variable to measure a theory parameter.Many examples exist in other contexts.For instance, in models with dark matter, measurements of the intermediate particles that produce them are hindered, like in SUSY models, where sleptons might decay promptly to a charged lepton and a stable neutralino that escapes detection and carries away the information on the slepton's mass.Instead of a clear peak from where the mass can be read, one needs to look up the information in the endpoints [2] of suitable kinematic distributions at the cost of precision.Other ingenious solutions and variables are devised to solve that kind of problem, but, of course, it would be much better if we could somehow recover the information lost to build the most sensitive variables to measurements.For a good review of kinematic variables engineering, see Ref. [3].
In the SM context, missing particles also get in the way of accessing vital information.Among the SM measurements, the scalar potential is of prime importance, even more so now that gravitational wave astronomy opened up the possibility of giving hints about the electroweak phase transition [4].Apart from that, anyway, new physics might lurk in deviations of the SM scalar parameters.The most straightforward way to access that information is by measuring the Higgs selfcouplings in double and triple Higgs production at colliders.In the SM, the Higgs self-interactions, after electroweak symmetry breaking, are given by where λ SM = m 2 h /2v 2 ≈ 0.13, and m h = 125 GeV, and v = 246 GeV represent the SM Higgs mass and the vacuum expectation value.Here, κ 3 and κ 4 parametrize deviations from the SM values.
As we are interested in studying trilinear self-couplings, we define κ 3 ≡ κ λ from now on.
In the LHC, the prospects of detecting Higgs self-interactions in single channels until the end of the experiment are not particularly bright, especially for the quartic coupling.Only by combining several search channels a 68% confidence limit (CL) of 0.57 ≤ κ λ ≤ 1.5 can be reached [5].
Among the decay channels for hh studies, b bγγ is the most promising one and dominates the combination, while b bW + W − and b bZZ are the less important ones [5].Recently, however, the authors of Ref. [9] rehabilitated b bW + W − by computing new features that can efficiently discern between b bW + W − , with leptonic W bosons, from double Higgs and its backgrounds, mainly the t t events, increasing the statistical significance by a factor of ∼ 4 and reaching ∼ 2.1σ after 3 ab −1 .
This makes the fully leptonic b bW + W − as competitive as the best channels to look for hh.
The hh production rate is sensitive to λ, and an inference of this parameter can be made by counting the number of events in excess of expected backgrounds.However, the dependence of the total cross section on λ is polynomial, causing a twofold ambiguity in the determination of the trilinear coupling for a given number of measured events.That ambiguity will probably not be lifted at the 95% CL even after 3 ab −1 for a single experiment, so a combination of the ATLAS and CMS results is important [5].Better prospects are expected at the next linear collider generation [10,11] where both the total rates and the shape of suitable distributions can be used to constrain the λ parameter.
In fact, the same strategy can be employed at hadron colliders.In this respect, the hh invariant mass distribution shows good sensitivity to the λ parameter due to the contributions from a triangle and a box diagram to the total amplitude.The exact dependence on the trilinear coupling and the top quark Yukawa coupling determines the interference pattern of the two contributions shaping the hh mass.That shape can be used to further test the coupling hypotheses.However, in the case of final states where neutrinos are present, like fully leptonic b bW + W − , for example, the hh mass cannot be reconstructed.Moreover, detector and hadronization effects smear the hh mass distributions, blurring the distinction between two sets of couplings and diminishing the advantage of using the shape of the distribution.
In this work, we propose a neural network solution -a Variational Autoencoder for Regression (VAER) algorithm -that addresses the difficulties in recovering the hh and t t masses from the observable kinematics from detector-level events.We will show that VAER has a very good generalization power predicting distributions of events never presented at the learning phase of the algorithm both for non-resonant and resonant hh production.We will demonstrate that the predicted distributions can be used for practical statistical purposes, for example, in a χ 2 test between coupling hypotheses based on partonic binned b bℓ + ℓ − ν ℓ νℓ ′ mass.The proposed algorithm can be used in many other contexts, like dark matter searches and long-lived particles that escape detectors.It can also be used as an unfolding algorithm to discount for detector effects and difficulties brought by hadronization of jets once it learns the partonic underlying information from simulated events.Finally, we envisage applications to recover other variables hidden by information leakage, such as W and Z polarization studies and spin and mass measurements that need a full reconstruction of kinematic variables.
Our paper is organized as follows.In section 2, we describe the VAER algorithm; in section 3, details of our simulations are provided; in sections 4 and 5, our results for the non-resonant and the resonant hh production are presented, respectively; in section 6, we present our conclusions and an outlook of possible applications and future work using VAER.

VARIATIONAL AUTOENCODER FOR REGRESSION
The VAER algorithm was originally designed to predict the age of a person from the 3D structural brain magnetic resonance image [12].The authors of that work also demonstrate that the regression task works even for tabular data representing other types of measurements of the brain.
To understand how VAER works, we need to recall the basics of autoencoders and variational autoencoders.
An autoencoder works by learning a dimensionally reduced representation of the data, encoding the original data, x, into a latent space, z, through a neural network z = E θ (x), where θ represents the parameters of the neural net encoder.The encoder is stimulated to produce good latent representations of the original data by decoding the latent representation of the data back to x ′ through another neural net x ′ = D ϕ (z), where ϕ represents the parameters of the neural net decoder, and minimizing the dissimilarity between x and x ′ , for example, their mean squared error The Variational Autoencoder (VAE) [13], by its turn, is a generative neural network model that learns the probability distribution of a dataset, D. As such, it can be used to draw new instances from that distribution and that resemble the data.The variational aspect of the algorithm refers to the probabilistic nature of the latent space.Instead of a static encoding, the encoder is built as a Gaussian function that learns the mean and the standard deviation of the data, that is, a neural net, µ θ , is trained to encode the multidimensional mean of the data set, and another neural net, σ 2 θ , to capture the variance of the dataset.This way, given a data point, x, its latent representation is z ∼ N (x; µ θ (x), σ 2 θ (x)).Once the latent representation has been learned, creating new instances is easy.Draw a z and decode it with the neural net decoder such that x ′ = D ϕ (z) is a brand new instance, not contained in the dataset, but hopefully emulating a true member of D. Notice that, in VAE, E θ is probabilistic, but D ϕ , is deterministic.
Let us start with the distribution of the data conditioned on a latent representation vector, z, We know neither the prior p(z) nor the likelihood P (x|z), so we use neural networks to learn them from data.The problem is that this process is very inefficient as the majority of latent points are not likely to produce x that resembles the data.Instead, we can learn a function, q ϕ (z|x), that is conditioned on x and write P (x) as Here, q ϕ (z|x) and P θ (x|z) now represent the encoder and the decoder models, respectively.To produce a generative model, we just need to have a pdf for the latent space from which we draw latent vectors that can be decoded into instances that emulate drawing from P (x) itself.This can accomplished with q ϕ (z|x) = N (z; µ ϕ (x), σ 2 ϕ (x)), where µ ϕ and σ 2 ϕ are modeled by neural networks.There is an important computational detail here, though: z should be randomly generated in the training phase, as Eq. ( 4) suggests, but backpropagation does not work in sampling nodes.The solution is the reparametrization trick, calculating points of the latent space as z = µ ϕ (x) + σ ϕ (x) ⊙ ϵ, ϵ ∼ N (0, 1), with deterministic mean and variance.But how to learn the mean, µ ϕ , and the variance, σ 2 ϕ , models?We calculate the following Kullbach-Liebler (KL) divergence [14] D KL (q ϕ (z|x)||P (z|x)) = E z∼q ϕ log q ϕ (z|x) using the Bayes' rule for P (z|x).This expression can be rearranged as follows log P (x) = E z∼q ϕ log P θ (x|z)p(z) where E z∼q ϕ [log P (x)] = log P (x) once P (x) does not depend on z.
The first term on the right side of this expression is called the Evidence Lower Bound (ELBO), L(x; θ, ϕ).Because KL divergence is always non-negative, log P (x) ≥ L(x; θ, ϕ).This inequality is very convenient for obtaining an objective function for the learning process.The posterior distribution P (z|x) is probably a too difficult multidimensional distribution to be learned, but P θ (x|z) is the deterministic neural network decoder while p(z) = N (0, 1) is a prior distribution that can be taken as a simple normal distribution, for example.Thus, the first term of Eq. ( 6) can be modeled.
All this leads us to carry the inference process via a Maximum Likelihood Estimation (MLE).
The goal is to maximize log P (x), which is the same as maximizing the ELBO with respect to the neural net parameters θ and ϕ, This is valid as long as q ϕ (z|x) approaches the true posterior distribution P (z|x) and saturates the lower bound as D KL (q ϕ (z|x)||P (z|x)) → 0. Now, we are ready to answer the question made previously: how to learn µ ϕ and σ 2 ϕ ?The MLE posed above can be solved by minimizing the loss function where x ′ (θ) = P θ (x|z = µ ϕ (x) + σ ϕ (x) ⊙ ϵ), ϵ ∼ N (0, 1).L R = ||x − x ′ || is the reconstruction loss, and the distance measure between x and x ′ can be chosen as the mean absolute error, the mean square error, or a cross-entropy measure, for example.The KL divergence can be calculated analytically when q ϕ (z|x) and p(z) are Gaussian functions as discussed earlier, resulting in the KL-loss, the L KL term.This is the standard VAE loss.
How can this algorithm be used for a regression task?The key ingredient is to build an orthogonal dimension in the latent space that is sensitive to variations of the target.Embedding this dimension into the latent space, hopefully, correlates the target variable to the data representation.
The latent representation is then said to be disentangled.
In practice, VAER 1 works via the variational inference of a probabilistic regressor for the target vector, r.The likelihood distribution is now given by and taking the same steps that led us to Eq. ( 6), gives us the ELBO for VAER The novelty is that the variables are now conditioned to r.Assuming that z and r are independent variables, we have Q ϕ (z, r|x) = q ϕ (z|x)q φ (r|x), where q φ (r|x) is a neural network regressor.
Working on the ELBO expression above, we have (denoting parameters collectively as {θ}) P ϑ (z|r) is the latent generator [12], an essential component to correlate the latent vector to the regression target through P ϑ (z|r) = N (z; u T ⊙ r, σ 2 1) where u is a normalized vector.Note that the mean is a linear model of r: µ ϑ (r) = u T ⊙ r.This is sufficient to correlate r to a disentangled dimension from z such that traversing u yields r-specific latent representations.Just like VAEs, here q ϕ (z|x) is a Gaussian whose mean, µ ϕ (x), and variance, σ 2 ϕ (x), are neural net models while P θ (x|z) is a neural net decoder.The regressor q φ (r|x) is actually a probabilistic regressor within this variational inference approach, and it is also modeled as a Gaussian distribution: q φ (r|x) = N (r; µ φ (x), σ 2 φ (x)1) where µ φ and σ 2 φ are neural nets.The prior on r is assumed to be a simple standard Gaussian distribution, p(r) = N (r; 0, 1).The loss function of VAER can now be derived, where, again, x ′ (θ) = P θ (x|z = µ ϕ (x) + σ ϕ (x) ⊙ ϵ), ϵ ∼ N (0, 1).We depict a graphical diagram of VAER in Figure 1.The predicted target can be taken from r = µ φ (x) or, when convenient, as r ∼ N (r; µ φ (x), σ 2 φ (x)1).Let us now discuss the practical application of VAER to our problem.cyan, are the input and output spaces.The purple rectangle, q φ (r|x), is the probabilistic regressor from which we make predictions.

SIMULATION DETAILS
We simulate partonic level events with MadGraph5 [15] at the 14 TeV LHC for two types of process: 1. Double Higgs production and decay up to one extra jet.The W boson's leptonic decays comprise electrons and muons, ℓ = e, µ.
The trilinear coupling is treated as a free parameter that controls the interference between the triangle and the box diagrams, and the Yukawa couplings are kept fixed at their SM values.We simulate 100k events for each λ = κλ SM , κ from −3 to 3 with steps of 0.5.
2. The main background source, the top quark pair production at the next-to-leading order QCD.
Hadronization of jets was performed with Pythia8 [16], and detector effects were simulated with Delphes3 with default settings, while jet reconstruction and clustering were performed with Fastjet [17].The MLM merging scheme [18] was adopted to merge hard and soft radiation from MadGraph5 and Pythia8, respectively.The following basic selection criteria were imposed to generate the events 2 b-tagged jets, 2 opposite charged leptons We also recorded the four-momenta of up to two leading non-b jets (j) of the events with p T (j) > 20 GeV, and |η j | < 3.

Kinematic Variables and Representation of Events
The target of the reconstruction is the double Higgs and the top pair invariant masses so besides the two b-jets, the two hardest non-b jets, the two opposite charged leptons, and the missing transverse momentum at the detector level, we also kept the four-momenta, in the laboratory frame, of the intermediate Higgs bosons and top quarks of the event as generated at the parton level.Note that NLO QCD radiation effects are taken into account in these four-momenta.It would be possible to reconstruct the partonic center-of-mass energy, √ ŝ, of the collision once we have the four-momenta of the initial state partons at our disposal.This variable also accounts for the energy of all the radiation emitted alongside hh or t t, which would require a more careful simulation of high-order effects.
The basic representation of the events thus comprises 34 low-level features.This low-level representation is augmented by high-level features described below.
• the transverse momentum, p T , and rapidity, η, of the two b-jets and the two leptons, • the transverse momentum of the pairs b b, ℓ + ℓ ′− , jj.In events where only one non-b jet is identified, the transverse momentum is just p T (j).When the event contains no jets besides the bottom jets, the entries corresponding to those jets are filled with zeroes, • the rapidity of the pairs b b and ℓ + ℓ ′− , • the energy and z-component of the three-momentum of the pairs b b and ℓ + ℓ ′− , and of the • the invariant masses of the combinations b b, ℓ + ℓ ′− , b bℓ + ℓ ′− , b bjj, jjℓ + ℓ ′− .Again, when just one or no jet j is present, the invariant masses are calculated accordingly.In events where no jets appear, some redundancy between these variables occurs, • the distance in the η × ϕ plane: ∆R ij = (∆η ij ) 2 + (∆ϕ ij ) 2 , between the pairs b b, ℓ + ℓ ′− , bℓ, bj, and jj, • the azimuth angle difference, ∆ϕ bb , between b and b, and, ∆ϕ ℓℓ , between ℓ + and ℓ ′− , • the Barr variable [19]: cos θ * bbℓℓ = tanh 1 2 ∆η(bb, ℓℓ) between the b b and ℓ + ℓ ′− systems, • the missing transverse momentum, ̸ p T , Besides all these kinematic variables, we also compute the Higgsness, H, and the Topness, T , of the events [9].Higgsness is an adimensional variable defined as where σ h , σ W , and σ ν might represent experimental uncertainties (in GeV), but for our purposes, they can be treated as free parameters.In the process of construing Higgsness, the four-momentum of the neutrino and the anti-neutrino must be searched to achieve the maximum compatibility with the decay chain h → W + W − , W + → ℓ + ν ℓ , W − → ℓ ′− νℓ ′ where one of the W bosons is off its mass shell.The peak of the M ν ν and M W * distributions occur approximately at 37 and 31 GeV, respectively.We fixed σ h = 2 GeV, σ W = σ W * = 5 GeV, and σ ν = 10 GeV as in Ref. [9].
By its turn, we define Topness as follows ) where σ t = 5 GeV, as in Ref. [9].In this case, as we do not know the b-jet charge, we have to test between two options to get the better consistency of the event with the t t production and decay chain The minimization process was performed with a simplex method from Scipy [20].
We show, in Figure 2, the joint Higgsness and Topness distributions for the SM double Higgs production and the t t.We see a clear distinction between the two kinds of events with Higgs pairs concentrating in the region log(T ) > 5 and log(H) < 5.This behavior is largely independent of the strength of the trilinear Higgs self-coupling and also shows a similar pattern for resonant hh production.The challenge, however, is to recover the information carried away by the neutrinos from W s.
Neural networks offer the possibility to fit a parametrized function of the observable information brought by leptons, jets, and b-jets from data.Our solution is to train a probabilistic neural net regressor from a variational inference process as described in Section 2.
We tested two types of target: (1) a single-valued one, the hh or t t mass, denoted collectively as M bbℓℓνν ; (2) a 2-component vector, (M bbℓℓνν , p T (ℓℓνν)), where p T (ℓℓνν) denotes the transverse momentum of hardest leptonic W boson.We observed better performance of the vector target across our experiments and tuning, so from now on, we will present the results and analysis for this target.Because p T (W ) is strongly correlated to p T (b b), especially in the case of double Higgs, we conjecture that including p T (ℓℓνν) in the target of the regression task helps to create ties with the vector feature of the events what could explain the better performance of the algorithm.Our focus, however, is the M bbℓℓνν mass of the event.Let us discuss the preparation of the data to feed the neural networks.We generated around 10 6 events to train and test VAER.The dataset was split into 75% for training and 25% for testing.A 5-fold cross-validation was performed to evaluate the error in prediction caused by statistically independent test sets.The training set comprises t t and hh for κ λ = −3, −2, −1, 0.1, 1, 2, 3 trilinear couplings.We will refer to this coupling set as the support couplings.The test set contains the same types of events and hh events with the addition of intermediate trilinear couplings κ λ = −2.5, −1.5, −0.5, 0.5, 1.5, 2.5.This is the interpolated couplings set.We also generated events for new heavy Higgs bosons from xSM [21,22], with masses from 300 to 1000 GeV, decaying to hh → b bW + W − .We will discuss the resonant case in detail ahead.
To Training a neural network with signal events of different model parameters to help it to generalize across the parameters space was shown to be successful in Ref. [23].The parametrized neural networks obtained from this framework are fed with physics parameters and then used to classify events for intermediate points of the parameters space for which the algorithm was not trained, saving time and computational resources.In our case, we do not provide any physics parameters to the algorithm, neither trilinear couplings nor masses.Nonetheless, as we are going to show, VAER learns the target variables across those parameter spaces.
To reduce the magnitude of the target variables, we took their logarithm for the regression task.The features and target vector were scaled with the RobustScaler from scikit-learn [24].
This scaler removes the median of the data feature-wise and scales them with the interquantile range between the first and third quartiles of the data, making the dataset less sensitive to outliers events.
The algorithm is trained for 2000 stochastic gradient descent iterations in batches of 1024 examples.A stopping criterion is adopted, halting the training if no reduction in the loss function is observed after 20 iterations.The learning rate is reduced by half if no improvement is observed after 10 iterations.The initial learning rate is 10 −3 .The neural networks were built with Keras [25] and Tensorflow [26].The optmizer adopted was the AdamW [27] with a weight decay of 10 −4 .
We tested several architectures and hyperparameters, but no extensive tuning was performed.
Improvements in the performance of the algorithm can thus be achieved.We display, in Table 1, the architecture and the hyperparameters of the various components of VAER.The dimension of the latent space was 3. The target loss, L reg in Eq. ( 12), was multiplied by β = 10 to encourage the algorithm to better predict the target variables.

M bbℓℓνν in the Standard Model
We now present the results for the reconstruction of M bbℓℓνν in the Standard Model.In Figure 3, we depict the M bbℓℓνν mass for the SM hh production (left panel) and the t t background (right panel).The lower panels show the true-to-predicted ratio.The blue shaded area in the histograms The blue-shaded regions represent the cross-validation uncertainties in the prediction.The ratio between VAER prediction and ground truth is also depicted in these plots.Lower panels: scatter plots for true versus predicted masses.
represents the variation of predictions from the 5-fold cross-validation where the test set is split into 5 independent sets of events.The dashed blue line is the mean prediction from the five test sets.The agreement between true and predicted invariant masses is within a few percent both for hh and t t production up to 1 TeV.The uncertainty increases for higher invariant masses as the number of events in the tail of the distributions drops.
A quantitative assessment of our results can be read in Table 2. Let us call the true M bbℓℓνν mass of an event, t, and the predicted one as p.To quantitatively access the performance of the algorithm, we compute the root mean squared error, √ MSE = (t − p) the fraction of events whose relative difference between true and predicted invariant mass is less than 10%; and the correlation coefficient, R 2 , defined as follows where t is the mean of the true target.Except for the JS D , which is computed from binned invariant mass distributions, all the other metrics are evaluated on an event-by-event basis.
Corroborating the visual agreement we see in Figure 3, the purple entries of Table 2 show the excellent performance of VAER in predicting the b bℓℓνν mass for signal and background.The The disentangled dimension associated with the latent space representations for SM hh and t t predictions.
MAE of both SM hh and t t events are both comparable to the bin width of the distributions.The MSE of t t events are larger than hh ones as the background presents a harder spectrum.In both cases, the correlation coefficient is high, especially for Higgs events.The lower panels of Figure 3 show the scatter plots of true versus predicted M bbℓℓνν and confirm the high correlation coefficient.
As discussed in Section 2, a VAE for regression associates a disentangled dimension to the latent space representation of the events.In Figure 4 we show log(M bbℓℓνν ) as a function of two out of the three latent space dimensions.As anticipated, the linear regression model in terms of the latent dimension suffices for a good prediction.

Varying the Trilinear Higgs Coupling: Support Couplings
Besides the SM double Higgs production and its main background source, VAER is also able to predict M bbℓℓνν for non-SM trilinear couplings spanned in the training phase.The hh invariant mass shape changes considerably from λ = −3λ SM to λ = +3λ SM due to the relative importance of the triangle amplitude in the interference with the box contribution.
In Leading Order, the differential cross section of hh production can be expanded in powers of 1/m 2 t [28].Ignoring symmetry factors, charges, and couplings, it reads where F △ is the loop function of the triangle contribution that contains the trilinear coupling, λ, and the top quark Yukawa coupling, y t , while F □ and G □ come from the box diagram and are proportional to y 2 t ; p T is the Higgs boson transverse momentum, s = (p 1 + p 2 ) 2 = M 2 hh , and t = (k 1 − p 1 ) 2 .This is a crude approximation to the partonic invariant mass distribution, and it is shown that including 1/m 4 t terms actually worsens the agreement with the exact results [28].However, this approximation captures the main features of the hh mass for m hh ≲ 1 TeV.
Taking into account the gluon luminosity, the differential M hh distribution is where τ = M 2 hh /S, √ S = 14 TeV, x and τ /x are the fractions of the protons' momenta brought by the gluons to the hard scattering.For our purposes, a simplified gluon distribution function might be taken as g(x, µ F ) ≈ 1/x δ for x ≪ 1.We took δ = 2 to mimic the SM distribution as closely as possible.Again, these approximations are crude but capture the basic dynamics that build the M hh distribution.
To understand how λ affects the distributions, first notice that the triangle contribution is enhanced compared to the box contribution towards the hh production threshold due to the propagator 1/(s − m 2 h ).Second, the role played by the interference term is dictated by κ λ and the relative sign between F △ and F □ .Finally, G □ effectively contributes only to high p T .
In Figure 5, we show dσ/dM hh (in arbitrary units) as a function of M hh for some trilinear couplings.Negative λ turns the interference constructive with all the contributions reinforcing each other once the interference inherits a similar kinematic structure from the triangle and box contributions and, in special, the 1/(s − m 2 h ) propagator that makes it also peak towards M hh ∼ 2m h .When the trilinear coupling is positive, however, the interference term is negative and contributes destructively to dσ/dM hh .For the SM production, the cancellation of the peak near the threshold is almost exact, and M hh increases, reaching a peak right after 400 GeV.For larger λ, on the other hand, the interference term cancels the other contributions for larger M hh , carving a  21)- (24).The histograms of κ λ = 2 and 3 were multiplied by 30 for better visualization.
dip in the distribution, causing a kind of amplitude-zero situation where no events are expected for certain M hh values.Large λ of both signs tend to resemble each other once the triangle contribution dominates.
The behavior of the contributions is also important to predict the impact of cuts.For example, large transverse momentum cuts favor the box contribution because the triangle amplitude is an s-channel diagram where Higgs bosons are mainly produced near the production threshold.
Figure 6 shows the true and the predicted dσ/dM hh for some support couplings.The agreement is good in all cases.A quantitative assessment the predictions is given in the black entries of Table 2.For κ λ = 2, VAER predicts the shape of the dip in the distribution with good accuracy, as we see in the rightmost panel of Figure 6.The bin right at the local minimum of the distribution, where the disagreement is the largest, differs by ∼ 20%.The disagreement increases for κ λ = 3, reaching an excess of 50% compared to truth.Interestingly, this prediction is expected if we take detector smearing and higher-order corrections into account.
Predicted Mbb νν [GeV] λ = 2λ SM FIG. 6. Upper panels: the true and predicted hh invariant masses for the support couplings κ λ = −2, 0.1, and 2. The blue-shaded regions represent the cross-validation uncertainties in the prediction.The ratio between VAER prediction and ground truth is also depicted in these plots.Middle panels: scatter plots for true versus predicted masses.

Varying the Trilinear Higgs Couplings: Interpolated Couplings
Collision events associated with the true trilinear coupling might not pertain to the training set of the algorithm.The solution is to cover a finite grid of couplings during the learning process and expect the neural networks to generalize for intermediate couplings that were not presented to the regressor.In principle, this can be achieved with parametrized neural networks [23] where the value of the coupling is concatenated with the features matrix.This approach is very useful for training a classifier that depends on theory parameters that affect the kinematic distributions and change the label prediction of the algorithm.It saves an enormous time in generating events during the training phase but it cannot be used, of course, in predicting the labels of data without knowing the true theory parameters.The same caveat applies to a regression problem.
Contrary to unsupervised classification algorithms, like anomaly detection, for example, predicting a real-valued target function that depends on unknown theory parameters is a hard task.
In our case, there is also the issue related to the missing neutrinos that carry information away.
The target we need to predict is a function M bbℓℓνν (x obs |θ) where x obs comprises only observable  information and θ, the model parameters, are unknown.Moreover, the background must be taken into account in a joint learning process, that is it, we also want a single regressor to be able to correctly predict the background and the signal irrespective of the unknown theory parameters.
In Ref. [29], a combination of neural networks for signal versus background separation and knearest neighbors regressor are used for a post-discovery regression.In kNNNN, a pre-classification step to separate signal and background precedes the regression of the invariant mass.Once the event is classified, the algorithm uses a dedicated regressor for that specific class.The regressor is a simple kNN that precludes a training phase.As a clustering algorithm, it is unsupervised, which is a good feature but its weakness is needing a classification step to guide the regression.Deep learning regressors were used in the reconstruction of tops from semi-leptonic t t events [30] and t th reconstruction [31].In those cases, the jet combinatorics have to be solved to correctly assign the jets to top quarks for their reconstruction, enabling mass and top-Higgs coupling measurements, respectively.Contrary to our task, that reconstruction assumes a pure and unambiguous identification of samples.If some other type of events other than those the neural networks were trained to recognize are present, there is no guarantee that they will generalize properly.The examples cited above show some of the difficulties in the task of machine learning-assisted regression of kinematic variables without some previous knowledge of the events.What VAER tries to emulate is a function of observable information that predicts the b bℓ + ℓ ′− ν ℓ νℓ ′ mass with less previous information about the nature of the events.The framework is not completely unsupervised, though.
We trained the algorithm with some of the types of events we guess that might appear in that channel, signals, and background.However, the regressor training occurs in a single stage, and no previous classification or label assignment is needed.In this respect, VAER offers a step ahead in the solution of reconstructing events with missing information.
Concerning the signals, a useful algorithm recognizes events associated with new trilinear couplings, and possibly other model parameters, that did show up in the training phase.It must generalize the reconstruction to other parameters never seen, at least for parameters inside the range of the support grid couplings.In Figure 7, we depict M bbℓℓνν for some intermediate λ.None of them were previously presented to VAER.In the upper plots of Figure 7, we show the distributions for κ λ = −2.5, 0.5, and +2.5.The lower plots show the true versus predicted masses.The visual agreement is again corroborated by the qualitative assessment of the performance displayed in the cyan entries of Table 2.A general feature that might be improved is that the prediction deteriorates at the extremes of the distribution, in the first bin, at the onset of the distribution, and in the last bins, in the tail.This might be mitigated by choosing larger bins and possibly by increasing the number of examples at the training phase.A coarser grid of support couplings can also help to bring the predictions closer to the ground truth in those bins.As in the case of κ λ = +3, the algorithm correctly identifies the dip in the distribution caused by the destructive interference when κ λ = +2.5 as we see in the rightmost panel of Figure 7, but it is shallower than the true distribution.In all cases, the bulk of the distribution around the peak value is very well predicted.Overall, however, the interpolated predictions present a diminished quality compared to the support ones, although they are still good.

Robustness against Kinematic Cuts
If VAER truly emulates M bbℓℓνν as a function of observable kinematics, it should reconstruct the event in the whole of the phase space, just like any parametric function.We tested VAER in Delphes3 [32] reaches a higher b-jet tagging efficiency of around 70% mimicking the detector's true efficiency.In Figure 8, we show the SM hh and t t mass distributions for the cuts of Eq. ( 28).The agreement remains good, especially for hh events.In the case of harder cuts in log(H) and log(T ) to isolate t t events, we observe a somewhat harder predicted spectrum compared to truth.In all cases, though, a very good agreement is achieved for masses up to 800 GeV.Moreover, the true distribution always lies within the error band of the cross-validation.
These experiments give us confidence that the VAER prediction can be useful in helping the phenomenological analysis of these types of events by providing another distinctive kinematic variable to isolate the signal events.We reinforce that the training dataset just contains events with the basic selection requirements of Eq. ( 15).

Chi-Square Computation with VAER distributions
The sensitivity of M hh to λ makes it a good target for inferring the Higgs trilinear self-coupling offering its shape along with the number of events of its normalization to measure that theory parameter.In the case where hh → b bW + W − → b bℓ + ℓ ′− ν ℓ νℓ ′ , the M bbℓℓνν distribution inherits that sensitivity but, of course, it must be reconstructed despite the missing neutrinos components.
VAER, as we have shown, provides accurate histograms of M bbℓℓνν that can be used for statistical inference of λ.
To demonstrate its usefulness for practical purposes, we show, in Figure 9, a simplified χ 2 computation ignoring backgrounds after imposing a hard cut on Higgsness and Topness variables of log(H) < 5, log(T ) > 5.5.We checked that no t t survives to those cuts.The number of signal events, however, is also small, a few tens at most, and other cuts might be needed to surely ignore the backgrounds [9].We do not intend to calculate bounds to λ in this work but just to demonstrate that the VAER prediction can be used for that purpose.The computation was performed using a 10-bins histogram of M bbℓℓνν .
The χ 2 is thus computed as to test an alternative λ hypothesis against the SM one.In this formula, S(κ λ ) is the number of signal events for a given κ λ , after the hard Higgsness and Topness cuts mentioned in the previous paragraph.
As we see in Figure 9, the agreement between the χ 2 computed from the true and the VAER True M bb νν VAER M bb νν FIG. 9.The χ 2 between S(κ λ ̸ = 1) and the SM S, the number of signal events for each λ hypothesis.We depict both the true and the predicted curves, adding 1 to χ 2 just to enable us to show them in log scale.
predicted M bbℓℓνν distributions is very good.The VAER χ 2 curve is slightly above the true curve, making the inference a bit conservative.To test VAER in resonant hh production, we generated events for gg → H → hh → b bℓ + ℓ ′− ν ℓ νℓ ′ with MadGraph5, Delphes3, and Pythia8.We tested four hypothetical masses, m H : 400, 600, 800, and 1000 GeV.In all cases, we fixed the total width of the new boson to m H /10. All simulation parameters were fixed as the non-resonant cases.
We also hardened the selection cuts to mimic the possible experimental searches in that channel.
are reconstructed using the Heavy Mass Estimator (HME) technique [34].The HME technique resembles the Higgsnes calculation but it keeps the solutions to the neutrinos' momenta and uses them to calculate the mass of H.The results from Ref. [33] for heavy Higgses of masses comparable to those we simulated in this work show a similar accuracy and peak resolution, however, they do not generalize to background events.
We postpone to a future investigation a detailed statistical estimate of the signal significance that can be achieved by searching for such a resonance in the tail of the background M bbℓℓνν , but with the help of Higgsness and Topness variables, we believe that a statistical analysis may benefit from the VAER reconstruction of the peaks, possibly enabling an estimate of the mass of the resonance.

CONCLUSIONS E OUTLOOK
Recovering information leaked in the emission of neutrinos, dark matter, or long-lived particles is a research field of its own.Much effort has been put into reconstruction algorithms and proxy functions that might capture the kinematics of the missing components in collision events at highenergy colliders.In this work, we proposed a parametrized function of the observable momenta in the reconstruction of the final state b bℓ + ℓ ′− ν ℓ νℓ ′ from double Higgs production and its leading background source, t t pairs.The parameterization is provided by neural networks in a variational autoencoder algorithm designed for regression tasks and trained with a dataset comprising detectorlevel events generated from a grid of trilinear couplings for hh simulated data besides t t data.
We showed that VAER presents a very good generalization power, accurately predicting the partonic invariant mass M bbℓℓνν of the t t background and across events associated with the various trilinear coupling of the support grid of the training set.Moreover, it also provides good predictions of M bbℓℓνν in events of trilinear couplings and resonant new Higgs production and decay into hh which were not present in the training phase corroborating its generalization performance.
Its usefulness was tested against harder selection cuts beyond those used to select the training dataset and, once more, confirmed that VAER is capable of learning a function of the observable kinematics to output a variable that encompasses missing momenta.The algorithm is easy to train, not requiring extensive tuning or a large amount of data.All our predictions were validated through statistically independent cross-validation sets and showed a good degree of robustness.
Reconstructing M bbℓℓνν opens the possibility of using the shape of a distribution that is very sensitive to λ in measuring the trilinear coupling, besides the cross section measurement, in the b bW + W − channel that has been recently rehabilitated as a competitive channel for double Higgs studies [9].In conjunction with powerful variables like Higgsness and Topness [9], for example, VAER could provide a variable to compare data and theory to measure the trilinear coupling of the SM scalar potential.As a practical evaluation of the algorithm, we showed that a χ 2 computation based on VAER M bbℓℓνν histograms can be used as a reliable estimate of the statistic.
We envisage other applications, though.For example, VAER can be used to reconstruct final states with dark matter particles, including intermediate particles that decay into them.Recovering partonic distributions from detector-level events should also be easy, making the algorithm an option for unfolding.Mass and spin measurements could also benefit from fully available kinematic variables.Of course, without mentioning the original application that motivated us, the regression of a variable from images, as in Ref. [12], where VAER could be adapted to infer properties of jets from their images, for example.

FIG. 1 .
FIG. 1.The graphical diagram of the Variational Autoencoder for Regression (VAER).Yellow triangles represent neural networks.The central rectangle represents the latent space, while the external ones, in

FIG. 2 .
FIG. 2. The joint distribution of the logarithm of Higgsness and Topness for Higgs pairs events (left panel), and t t events (right panel).

4 .
RECONSTRUCTION OF FULLY LEPTONIC b bW + W − EVENTS: NON-RESONANTCASEThe double Higgs invariant mass is sensitive to the Higgs self-coupling.Besides the total cross section expected at the collider, the shape of M hh might help to measure λ and possible deviations from the SM.As discussed before, with the help of powerful discerning variables, like Higgsness and Topness, the fully leptonic b bW + W − mode becomes an interesting option to measure the trilinear Higgs coupling at the LHC.If not used for fits, M hh and M t t can be used to further discern between Higgs pairs and top pairs in a cut-based or multivariate analysis.

4. 1 .
Data Preparation, Training and Validation, and Algorithm Structure establish the generalization power of VAER, intermediate couplings and heavy Higgs events are not presented to the algorithm during the training phase.The intermediate coupling events test the interpolation ability of the algorithm, which is supposed to learn the b bℓ + ℓ ′− ν ℓ νℓ ′ mass from the observable information.For that purpose, diversity is essential.The heavy Higgs events test the extrapolation power of the algorithm once they populate regions of the representation space that are poorly populated by training examples.We should expect that extrapolation works significantly worse than interpolation.

FIG. 3 .
FIG.3.Upper panels: the true and predicted SM hh and the t t invariant masses at left and right, respectively.

FIG. 7 .
FIG.7.Upper panels: the true and predicted hh invariant masses for the interpolated couplings κ λ = −2.5, 0.5, and 2.5.The blue-shaded regions represent the cross-validation uncertainties in the prediction.The ratio between VAER prediction and ground truth is also depicted in these plots.Lower panels: scatter plots for true versus predicted masses.

5 .
RECONSTRUCTION OF FULLY LEPTONIC b bW + W − EVENTS: HEAVY HIGGS DECAYIn extended scalar models, like xSM[21,22,33], besides shifts in trilinear couplings, new heavy Higgs bosons, H, might appear in the particle spectrum.If the new scalar has a sizeable decay into SM Higgs bosons, a resonance in hh mass would be a smoking gun signature.Of course, the resonance is missing if the W bosons of hh → b bW + W − decay leptonically so VAER can be used to reconstruct the peak of the H decay.

TABLE 1 .
Hyperparameters and architecture of the decoder, the encoder, and the regressor neural networks.No dropout layers were needed.The total number of parameters of this VAER configuration is ∼ 1.5 × 10 6 .