Reconstruction of Missing Resonances Combining Nearest Neighbors Regressors and Neural Network Classifiers

Neutrinos, dark matter, and other long-lived neutral particles traverse particle detectors unnoticed, carrying away information about their parent particles and interactions that is needed to reconstruct key variables like resonance peaks in invariant mass distributions. In this work, we show that a $k$-nearest neighbors ($k$NN) regressor combined with deep neural network classifiers, an algorithm we dub kNNNN, is able to accurately recover the binned distribution of the fully leptonic $WW$ mass of a new heavy Higgs boson and of its Standard Model backgrounds from the observable detector-level information at our disposal. The output of the regressor can be used to train even stronger classifiers to separate signals from backgrounds in the fully leptonic case and to guarantee the selection of on-mass-shell Higgs bosons with enhanced statistical significance. The method assumes prior knowledge of the event classes and model parameters and is thus suited to post-discovery studies.

The mass and width of a resonance could be even more accurately measured if the information from the fully leptonic $WW, ZZ \to \ell^+\ell^-\nu\bar{\nu}$, $\ell = e, \mu$, modes were recoverable. Instead, apart from $ZZ \to 4\ell$, we need to rely upon the semi-leptonic or fully hadronic modes to perform those measurements, with a significantly higher level of backgrounds. Identifying bumps and sharp thresholds in the invariant mass distribution of observable and dark states would also help disentangle new physics signals like heavy Higgs bosons [2], Higgs pair production with one invisible Higgs [3,4], sleptons and charginos [5,6], and new gauge boson decays to neutrinos and/or dark matter [7,8] from their associated backgrounds, to name a few possibilities. Another important example where a fully leptonic mode benefits from a clean environment is the measurement of the scattering angles of $W, Z$ bosons in polarization studies [9].
In processes where $N_\nu$ neutrinos are produced in the hard scattering, there are $4N_\nu$ unknowns that should be recovered to reconstruct the parent particles. The negative of the sum of the transverse momentum vectors of all the observed objects in the event furnishes two constraints, despite not being exactly equal to the sum of the neutrinos' transverse momenta due to detector effects and contamination from neutrinos and other missing particles inside hadronic jets, for example. Mass constraints must provide the complementary information necessary for reconstruction. The number of mass constraints, $N_m$, is process dependent, though, and, in many cases, they do not suffice to recover the four-momenta of the neutrinos if $4N_\nu \geq N_m + 2$. Even in cases where sufficient mass constraints exist, like fully leptonic $t\bar{t}$ signals [10,11], where the eight unknowns of the two neutrino four-momenta are matched by the two transverse momentum constraints plus mass-shell conditions on the $W$ bosons, the top quarks, and the neutrinos, the misreconstruction of the neutrinos' transverse momenta, combinatorial particle assignment, and ambiguities arising from the quadratic nature of the equations do not guarantee meaningful solutions for all events.
In a process-independent way, one approach to circumvent the impossibility of recovering the four-momenta of all the escaping particles is to design kinematic variables and methods that correlate with the lost information, for example, with the masses of the parent particles. Many such variables are cleverly crafted to provide useful hints about decaying particles in a variety of situations [12][13][14][15][16][17][18][19][20][21][22]. Yet, none of them, by construction, is capable of recovering a resonance peak.
Another approach is to use a regression algorithm to predict the neutrinos' four-momenta, or some variable of interest, from the observed information. One might tackle tasks of this type by training an algorithm to parameterize a multivalued function $f : \mathbb{R}^n \to \mathbb{R}^m$, with a neural network, for example [23][24][25][26]. Methods of density estimation [27] might also be useful. The fundamental difficulty in these cases is that essential information for the reconstruction of resonances, like masses and widths, is encoded in the signal events but not in the backgrounds, which are the only ones sufficiently well known to permit the training of an algorithm. In other words, it is necessary to rely on supervised algorithms with prior knowledge of the parameters to train a regressor to recover a resonance peak; otherwise, there is no guarantee that regressors trained on backgrounds will generalize.
Contrary to classification problems, whose targets are mutually exclusive categorical attributes, a regression task targets a real number representing a continuous data attribute. For this reason, it is much easier to build a weakly supervised or even an unsupervised algorithm for classification than for regression.
Assuming prior knowledge about the signals, the most straightforward approach to reconstructing a mass variable involving escaping neutrinos is to interpolate a support set of events from simulations instead of adjusting the parameters of some universal function that should generalize from training to test datasets. An accurate and efficient algorithm for such supervised regression is the $k$-nearest neighbors algorithm, as we demonstrate in this work. As we argued, the caveat of this approach, like that of any other supervised regression algorithm, is that we need to know beforehand what type of event is produced in the collisions in order to select the correct support set for interpolation of the variable. Our approach takes advantage of the exquisite power of neural networks to classify the events. In principle, it is possible to identify signal events without any previous knowledge using outlier detection and unsupervised methods; however, as we discussed, without knowing the mass parameters, reconstructing a mass peak is challenging.
In this work, we show that combining neural networks for classification with $k$NN regression is useful for reconstructing a new heavy Higgs boson decaying to $W^+W^- \to \ell^+\ell^-\nu\bar{\nu}$, $\ell = e, \mu$, a fully leptonic final state with two escaping neutrinos, and its main SM backgrounds. We will show that the predicted mass of the charged leptons and neutrinos can be reliably used as a powerful new attribute to further clean up the backgrounds while enabling the selection of on-mass-shell Higgs bosons.
The work is organized as follows. In Section 2, we describe the kNN regression algorithm; in Section 3, we provide details of the combined construction of regressors and classifiers to identify the heavy Higgs boson and its main SM backgrounds, while in Section 4 we present our final results in terms of improvement of the statistical significance of the signal hypothesis; Section 5 is devoted to conclusions and prospects.

DETAILS OF THE kNN REGRESSION
The $k$-nearest neighbors regressor [32] is a simple but effective algorithm for interpolation. First of all, we define a support dataset $S = \{(X_i, F(X_i)),\ i = 1, \cdots, N_s\}$; these are the exemplars that will be used to predict the value of the function of interest. Second, we define a distance metric, $\mathrm{Dist}(X, Y)$, to decide which exemplars of $S$ are closest to a new point, $X_{\mathrm{new}}$, where $X$, in our case, is an $\mathbb{R}^n$ vector. Third, we choose how many nearest neighbors of $X_{\mathrm{new}}$ will be used to compute $F(X_{\mathrm{new}})$, the target of our regression, according to a weighted mean

$$F(X_{\mathrm{new}}) = \frac{\sum_{m=1}^{k} F(X_m)/\mathrm{Dist}(X_{\mathrm{new}}, X_m)}{\sum_{m=1}^{k} 1/\mathrm{Dist}(X_{\mathrm{new}}, X_m)}. \qquad (1)$$

Substituting $\mathrm{Dist}(X_{\mathrm{new}}, X_m) = 1$ in the formula above corresponds to an arithmetic mean estimator for $F$. The weighted or arithmetic option is decided in the tuning stage of the analysis.
In principle, once we have chosen the distance metric, the number of nearest neighbors, $k$, used to compute $F(X_{\mathrm{new}})$ is the only hyperparameter of the algorithm. Note that, contrary to a neural network, this model has no parameters to be adjusted; this is the reason we do not need a training phase. However, the distance metric, $k$, and possibly other hyperparameters should be tuned to get a good regressor by minimizing some error function. All $F(X_m)$, $m = 1, \cdots, N_s$, are known; thus, we are in the realm of supervised learning.
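To make the procedure concrete, the following minimal Python sketch implements the weighted mean of Eq. (1) with scikit-learn's `KNeighborsRegressor`; the synthetic support set and all variable names are illustrative, not the analysis code used in this work.

```python
# Minimal sketch of the weighted kNN regressor of Eq. (1) using scikit-learn;
# the synthetic support set and all variable names are illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
support_X = rng.normal(size=(10_000, 8))        # support exemplars X_i
support_F = np.linalg.norm(support_X, axis=1)   # known targets F(X_i)

# weights="distance" gives the inverse-distance weighted mean of Eq. (1);
# weights="uniform" reproduces the arithmetic-mean estimator instead.
knn = KNeighborsRegressor(n_neighbors=5, weights="distance",
                          metric="euclidean")
knn.fit(support_X, support_F)                   # stores S; no training proper
print(knn.predict(rng.normal(size=(3, 8))))     # F(X_new) via Eq. (1)
```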
In our case, the target function of the regression, $F$, is the leptonic $\ell^+\ell^-\nu\bar{\nu}$ invariant mass, $M_{\ell\ell\nu\nu}$. The input of this function is the observable information obtained from the electron and muon four-momenta, $p_e$ and $p_\mu$, respectively. The events are represented by the energies and 3-momenta of the charged leptons plus high-level functions constructed from that low-level information: $X = (f_{ij}(p_\ell, p_{\bar{\ell}}),\ i = 1, \cdots, N_{\mathrm{ev}},\ j = 1, \cdots, M)$, representing $N_{\mathrm{ev}}$ events with $M$ features.
If the number of dimensions of the feature space is large, distance-based models like $k$NN might perform poorly. For that reason, it is usual to project the feature space onto a latent space of reduced dimensionality. There are various ways to do that. We chose to linearly transform the original features using a principal component analysis (PCA) [33], looking for the nearest neighbors in the transformed space of the first $P < M$ variables which best explain the variance of the data, $X_{\mathrm{pca}} = T_P(X)$. We also adjust $P$ to obtain the best regressors.
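In scikit-learn, the projection and the nearest-neighbor search compose naturally into a pipeline; a sketch with illustrative choices of scaler and latent dimension follows.

```python
# Sketch of the PCA projection onto the first P components preceding the
# nearest-neighbor search; P and the scaler are illustrative choices.
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

P = 2                                       # tuned per class in practice
pipe = Pipeline([
    ("scale", StandardScaler()),            # common scale before distances
    ("pca", PCA(n_components=P)),           # X_pca = T_P(X)
    ("knn", KNeighborsRegressor(n_neighbors=1, weights="distance")),
])
# pipe.fit(support_X, support_targets); pipe.predict(X_new)
```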
In practice, we do not regress on the raw value of $M_{\ell\ell\nu\nu}$ but on the bin of its histogram into which the event falls, with bins $\{[M_i, M_{i+1}),\ i = 1, \cdots, N_{\mathrm{bins}}\}$. We found that predicting the bin where the event falls in histograms of $M_{\ell\ell\nu\nu}$ works better than predicting the value of $M_{\ell\ell\nu\nu}$ itself.
We chose relatively large but fixed-size bins. The binning itself, therefore, could be adjusted for performance, mainly for large $M_{\ell\ell\nu\nu}$, where the number of expected events drops sharply. The regressor for the bins of the histogram of $M_{\ell\ell\nu\nu}$ is given, in analogy with Eq. (1), by

$$B(X_{\mathrm{new}}) = \frac{\sum_{m=1}^{k} B(X_m)/\mathrm{Dist}(X_{\mathrm{new}}, X_m)}{\sum_{m=1}^{k} 1/\mathrm{Dist}(X_{\mathrm{new}}, X_m)}, \qquad (2)$$

where $B(X_m)$ denotes the bin of the $m$-th nearest neighbor. Let us now construct the regressors for the signal and the backgrounds.
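A minimal sketch of the binned-target variant follows; the 100 GeV bin width below is an assumption purely for illustration.

```python
# Sketch of the binned-target regression: the kNN predicts the M_llvv
# histogram bin rather than the raw mass. The 100 GeV bin width is assumed.
import numpy as np

edges = np.arange(0.0, 3100.0, 100.0)                 # fixed-size bins (GeV)

def to_bin(mass):
    """Map a mass value (or array) onto its histogram bin index."""
    return np.clip(np.digitize(mass, edges) - 1, 0, len(edges) - 2)

# bin_targets = to_bin(support_M)                     # targets for Eq. (2)
# knn.fit(support_X_pca, bin_targets)
# pred_bins = np.rint(knn.predict(X_new_pca)).astype(int)  # back to bins
```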

RECONSTRUCTION OF FULLY LEPTONIC RESONANCES
The dataset consists of 400000 simulated signal events, $pp \to H \to W^+W^- \to \ell^+\ell^-\nu\bar{\nu}$, for each mass point, plus the corresponding SM background samples. The partonic events are used to obtain the ground truth $M_{\ell\ell\nu\nu}$ distributions once the neutrino momenta are available. Note that this distribution explicitly assumes that the missing energy is entirely due to escaping neutrinos produced in the hard scattering, and not to the misreconstruction of observable momenta or to other missed particles. However, the lepton momenta and the event's missing energy, which feed the algorithms, include all the simulated effects. This is another reason to construct a regressor for the distribution bins: for sufficiently large bins, the mismatch between the partonic $M_{\ell\ell\nu\nu}$ and the one inferred from the measured missing energy can be more easily accommodated without affecting the quality of the regression.
We adopt basic acceptance cuts to select events with two opposite-charge leptons and missing energy, imposed on $p_{T,\ell}$ and $\eta_\ell$, the transverse momentum and pseudo-rapidity of the leptons, respectively, and on $M_{\ell\ell}$, $\not\!\!E_T$, $\Delta R_{\ell\ell}$, and $\Delta\eta_{\ell\ell}$, the invariant mass of the charged leptons, the missing transverse energy, the distance in the $\eta \times \phi$ plane between the leptons, and their rapidity gap, respectively. The last cut, on the rapidity gap between the charged leptons, was imposed to suppress weak boson fusion backgrounds, which are neglected in the subsequent analysis. The $M_{\ell\ell}$ cut helps to suppress the low-mass lepton backgrounds from $Z\gamma^*$, which proved to be a source of contamination among the event classes.
In Fig. (1), we show the distributions of some features chosen to represent the events and predict their classes and $M_{\ell\ell\nu\nu}$. Along with the energies and the components of the 3-momenta of the charged leptons, we also include their transverse momenta and the following variables:
• $M_{\ell\ell}$, the charged leptons' invariant mass,
• $\not\!\!E_T$, the missing transverse energy,
• $\Delta R_{\ell\ell} = \sqrt{(\Delta\eta_{\ell\ell})^2 + (\Delta\phi_{\ell\ell})^2}$, where $\Delta\eta_{\ell\ell}$ and $\Delta\phi_{\ell\ell}$ represent the rapidity and azimuthal angle differences between the charged leptons,
• $\cos\theta^* = \tanh\left(\Delta\eta_{\ell\ell}/2\right)$, proposed in Ref. [37],
• the number of jets tagged as bottom jets, to suppress $t\bar{t}$ events.
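For concreteness, the sketch below assembles such a representation from the lepton four-momenta; the helper name, argument layout, and exact feature ordering are assumptions based on the list above.

```python
# Illustrative assembly of the event representation from the lepton
# four-momenta (E, px, py, pz); the helper name, argument layout, and the
# exact feature ordering are assumptions based on the list above.
import numpy as np

def event_features(p_lp, p_lm, met_x, met_y, n_btags):
    (E1, *p1), (E2, *p2) = p_lp, p_lm
    p1, p2 = np.array(p1), np.array(p2)
    eta = lambda p: np.arctanh(p[2] / np.linalg.norm(p))
    phi = lambda p: np.arctan2(p[1], p[0])
    d_eta = eta(p1) - eta(p2)
    d_phi = abs(phi(p1) - phi(p2))
    d_phi = min(d_phi, 2.0 * np.pi - d_phi)          # wrap to [0, pi]
    m_ll = np.sqrt(max((E1 + E2) ** 2 - np.sum((p1 + p2) ** 2), 0.0))
    met = np.hypot(met_x, met_y)                     # missing E_T
    return np.array([E1, *p1, E2, *p2,
                     np.hypot(*p1[:2]), np.hypot(*p2[:2]),  # lepton pT's
                     m_ll, met, np.hypot(d_eta, d_phi),     # Delta R
                     np.tanh(d_eta / 2.0),                  # cos(theta*)
                     n_btags])
```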
For each class, we construct a regressor function according to Eq. (2). At this stage, we employed 0.9 and 1.2 million events for the signals and backgrounds, respectively. To ensure that the dataset's size would not play a role in the results, we set aside 80% of that data for tuning the regressors.
We adjusted, with a grid search, the number of nearest neighbors, $k$, the distance metric, Dist, the weighting scheme of Eq. (1), and the number of PCA dimensions, $P$. The kNN regressor is robust against most hyperparameter variations while being very accurate for predictions. Overall, for all backgrounds and signals, the nearest neighbor to a new point in the latent space of the PCA transformation is the most accurate prediction for our target variable. We tested various alternatives to kNN, such as gradient boosting and neural network regressors, and the nearest neighbors approach proved superior in approximating the true distribution of masses.
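A grid search of this kind takes a few lines on top of the pipeline sketched in Section 2 (reusing its `pipe` object); the grid values below are assumptions matching the hyperparameters named in the text.

```python
# Sketch of the hyperparameter grid search over the pipeline of Section 2;
# the grid values are assumptions matching the hyperparameters in the text.
from sklearn.model_selection import GridSearchCV

param_grid = {
    "pca__n_components": [1, 2, 3, 5, 10],            # latent dimension P
    "knn__n_neighbors": [1, 3, 5, 10, 50],            # k
    "knn__weights": ["uniform", "distance"],          # Eq. (1) vs. mean
    "knn__metric": ["euclidean", "manhattan", "chebyshev"],
}
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1,
                      scoring="neg_mean_squared_error")
# search.fit(support_X, bin_targets); best = search.best_params_
```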
We also found that neural networks present improved generalization performance across classes compared to other algorithms, especially the kNN algorithm, which is very dependent on the class of the event. For example, we found that training a neural network on $WW$ backgrounds might be useful to obtain $M_{\ell\ell\nu\nu}$ for the other classes, especially the backgrounds, but its performance on signal events is still not competitive with much simpler proxy variables that correlate with the resonance mass, such as $\sqrt{\hat{s}}^{(0)}$ [15] and other transverse mass variables. The significant advantage of algorithms with good generalization performance is being agnostic towards the other classes, depending less on previous knowledge of the types of events.
The number of PCA dimensions onto which the original data representation is projected showed a more significant variation. While for the $ZZ$ background and the 1 TeV Higgs the smallest MSE could be reached with just a one-dimensional latent space, the $WW$ background performed better in a two-dimensional PCA space, and the $t\bar{t}$ background and the heavier Higgs signals preferred larger latent spaces. Comparing the true and predicted $M_{\ell\ell\nu\nu}$ distributions for the signals, with $\Gamma_H/m_H = 10\%$, and for the $WW$, $ZZ(\gamma^*)$, and $t\bar{t}$ backgrounds, we see that the regressors work very well for each class of events. The binning of the distributions also affects the quality of the regression.
We found that the mean squared error between the true and predicted distributions gets larger as the bin widths get smaller, as expected: it is easier to predict which bin an event will fall into when the bin is wide. We checked that the width of the resonance affects the accuracy of the regression very little, from $\Gamma_H/m_H = 1\%$ up to 10%.

Pre-regression classification
The constructed $M_{\ell\ell\nu\nu}$ regressor of a given class can predict the target distribution only for events that belong to that class. If one feeds a background regressor with a signal event, for instance, the background regressor will return the target value of the background distribution that is closest to the signal event. In order to predict the classes' targets correctly, we need first to predict the classes as accurately as possible. We also need to know the mass of the resonance.
The classification of events was performed with neural networks (NN) [30,38] based on the same features used for regression. We took 1.5 million signal and 4 million background events to tune, train, and test the algorithms. As we will discuss later, this body of data was further split to independently adjust, train, and test a second neural network; that is why we need such a large number of simulated events.
Besides the kinematic variables described in the previous section, we also constructed a new one, which we describe now. In Ref. [39], a kinematic variable called Higgsness is introduced to flag the presence of a SM Higgs boson decaying to $W^\pm W^{*\mp} \to \ell^+\ell^- + \nu\bar{\nu}$. The idea is to search for the neutrino 4-momenta of an event which minimize

$$\chi^2_H = \frac{\left(m_{\ell^+\ell^-\nu\bar{\nu}}^2 - m_H^2\right)^2}{\delta_H^4} + \frac{\left(m_{\ell^+\nu}^2 - m_W^2\right)^2}{\delta_W^4} + \frac{\left(m_{\ell^-\bar{\nu}}^2 - m_W^2\right)^2}{\delta_W^4},$$

where $\delta_H$ and $\delta_W$, in principle, represent experimental uncertainties, but for our purposes they can be treated as free parameters. In fact, the value of these parameters matters for the Higgsness distributions, and we adjust them for maximum discernment among the classes.
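The sketch below illustrates a Higgsness-style minimization over the neutrino momenta under the missing transverse momentum constraint; the simplified on-shell $\chi^2$, the parameter values, and the starting point are assumptions for illustration, not the exact construction of Ref. [39].

```python
# Minimal sketch of a Higgsness-style minimization over the two neutrino
# momenta, constrained to reproduce the measured missing transverse momentum.
# The chi^2 and the parameter values are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

M_W, M_H = 80.4, 2000.0            # GeV; heavy Higgs hypothesis (assumed)
delta_H, delta_W = 100.0, 20.0     # treated as free parameters in the text

def m2(*moms):
    """Invariant mass squared of a sum of (E, px, py, pz) four-vectors."""
    E, px, py, pz = np.sum(moms, axis=0)
    return E * E - px * px - py * py - pz * pz

def chi2_higgsness(params, p_lp, p_lm, met_x, met_y):
    pz1, pz2, px1, py1 = params
    px2, py2 = met_x - px1, met_y - py1          # MET constraint
    nu1 = np.array([np.sqrt(px1**2 + py1**2 + pz1**2), px1, py1, pz1])
    nu2 = np.array([np.sqrt(px2**2 + py2**2 + pz2**2), px2, py2, pz2])
    return ((m2(p_lp, p_lm, nu1, nu2) - M_H**2) ** 2 / delta_H**4
            + (m2(p_lp, nu1) - M_W**2) ** 2 / delta_W**4
            + (m2(p_lm, nu2) - M_W**2) ** 2 / delta_W**4)

# res = minimize(chi2_higgsness, x0=[0.0, 0.0, met_x / 2, met_y / 2],
#                args=(p_lp, p_lm, met_x, met_y), method="Nelder-Mead")
```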
In Fig. (4), we show the Higgsness distributions of the event classes. The $t\bar{t}$ events, in particular, are difficult to discern against $W$ pair production. This similarity is summarized in the right panel of Fig. (5), where we see that the score distributions of $WW$ and $t\bar{t}$ events overlap.
However, the most mistagged class is ZZ, where 19% of the sample is classified as a W W event.
On the other hand, only 3% of signal events are wrongly assigned to background classes. The same behaviour was observed for the other two mass values. This somewhat large misidentification of ZZ(γ * ) events might be explained by the introduction of Higgsness as a feature of the dataset.
As we see in Fig. (4), while being very powerful for discerning the signals, Higgsness is very similar across the background classes. Withdrawing Higgsness from the data representation decreases the true positive rate of the 2 TeV Higgs boson from 97% to 94%, while also decreasing the proportion of $ZZ$ events labeled as $WW$ events from 19% to 6%. Apparently, singling out signal events with Higgsness makes the background classes less discernible among themselves.
With the NN classifier in hand, we can reconstruct the $M_{\ell\ell\nu\nu}$ mass of the events. We emphasize that it is necessary to know the class of the events before the regression, since the target variable can only be correctly estimated when interpolated over the proper support dataset of the kNN algorithm. In other words, the nearest neighbors regressor does not generalize from one class to another: if one presents instances never seen by the regressor, the lack of the necessary correlations will result in meaningless outputs. For example, in the Higgs rest frame, the sum of the charged leptons' energy and the energy of the neutrinos equals the Higgs mass, $E^*_{\ell\ell} + E^*_{\nu\nu} = m_H$. In this case, the regressor can only learn the simple relation $E^*_{\nu\nu} = m_H - E^*_{\ell\ell}$ to recover the missing information from the observed one if it is trained on signal events with known $m_H$. In Fig. (6), we show the predicted $M_{\ell\ell\nu\nu}$ mass of the events classified by the neural network model for a 2 TeV Higgs. Again, the results for other masses and total widths are nearly the same.
We note a clear contamination by signal events in the tails of the $WW$ and $t\bar{t}$ distributions. This is expected, since 1.7% and 1.4% of signal events are classified as $WW$ and $t\bar{t}$ events, respectively.
Only 0.19% of $H_2$ events are classified as $ZZ$ events, though, which is why we do not observe a clear peak in the tail of the $ZZ$ distribution. In turn, 4.1% and 3.7% of the $WW$ and $t\bar{t}$ samples, respectively, are mistagged as signal events, populating the low-mass bins of the $H_2$ distribution above the true distribution. In practice, if one is interested in identifying Higgs bosons, requiring the score to be 0.5 or larger is effective in mitigating the contamination of the background distributions, permitting a reliable estimate of the backgrounds in the resonance region. The signal contamination is not much affected; yet, since the contamination of the signal distribution occurs in the low-mass bins, the estimate in the resonance region remains reliable.
A way around these contaminations that improves the confidence in the mass estimates is presented in the next section.

Post-regression classification and the kNNNN algorithm
How can we get rid of the mistagged contamination in backgrounds and signal distributions?
In Ref. [44], an ensemble of classifiers was used to boost the classification accuracy of Higgs boson events, with a performance almost as good as deep neural networks [45]. We used the same idea to boost the performance of our classifier by stacking another neural network model on top of the first classifier described in the previous section. For a good review of ensemble methods, see [46].
We show a flowchart of our proposed algorithm from beginning to end in Fig. (7). The original dataset, comprising the kinematic features, X, described in Section 3, plus the Higgsness variable, is first split into several subsets to train/validate the classifiers and the regressor. Two subsets are used to train the first classifier, depicted as NN 1 in Fig. (7), and the kNN Regressor. In this scheme, the Regressor is fed with the kinematic features, except Higgsness, and also with the output scores of NN 1. The second classifier, NN 2, is then trained on the kinematic features augmented by the NN 1 scores and the predicted $M_{\ell\ell\nu\nu}$. In Fig. (8), we display the confusion matrix and the score outputs of the NN 2 classifier. The separation of the classes is improved after the second classification. To confirm that improvement, we calculate the overall accuracy and the score asymmetry, defined as

$$A = \frac{N_{s>0.5} - N_{s<0.5}}{N},$$

where $N$ is the number of events of the class and $N_{s \gtrless 0.5}$ counts the events of the class whose true-class score lies above/below 0.5.
We also compute the positive and negative likelihood ratios, as defined in Ref. [47],

$$LR_{+} = \frac{\mathrm{Sensitivity}}{1 - \mathrm{Specificity}} = \frac{\mathrm{Sensitivity}}{\mathrm{False\ Positive\ Rate}}, \qquad LR_{-} = \frac{1 - \mathrm{Sensitivity}}{\mathrm{Specificity}} = \frac{\mathrm{False\ Negative\ Rate}}{\mathrm{Specificity}}.$$

These two metrics measure how effective a classifier is at predicting the classes in a binary problem. Sensitivity, the ratio between the number of events correctly classified as positive and the total number of positive events, measures how good the classifier is at identifying the positive class, our $H_2$ events. Specificity, in turn, is the ratio between the number of events correctly classified as negative and the total number of negative events, our backgrounds; analogously to sensitivity, it measures how competent the classifier is at correctly identifying negative instances. In order to apply these metrics, we gather all background events into a single negative class.
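These quantities follow directly from the confusion matrix; a small helper, assuming boolean signal masks as the input format, could read as follows.

```python
# Helper for the binary metrics above: signal (H2) as the positive class,
# all backgrounds merged into the negative class; boolean inputs assumed.
import numpy as np

def likelihood_ratios(is_signal, tagged_signal):
    tp = np.sum(is_signal & tagged_signal)
    fn = np.sum(is_signal & ~tagged_signal)
    fp = np.sum(~is_signal & tagged_signal)
    tn = np.sum(~is_signal & ~tagged_signal)
    sensitivity = tp / (tp + fn)                  # true positive rate
    specificity = tn / (tn + fp)                  # true negative rate
    lr_plus = sensitivity / (1.0 - specificity)   # to be maximized
    lr_minus = (1.0 - sensitivity) / specificity  # to be minimized
    return lr_plus, lr_minus
```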
For the signals, $LR_{+}$ summarizes how many times more likely signals are to be correctly predicted as signals than backgrounds are to be wrongly predicted as signals. On the other hand, $LR_{-}$ summarizes how many times less likely signals are to be wrongly predicted as backgrounds than backgrounds are to be correctly predicted as backgrounds. A better classifier must therefore maximize $LR_{+}$ and minimize $LR_{-}$. In the comparison of two classifiers, say NN 1 and NN 2, if $LR_{+}(\mathrm{NN}_2) > LR_{+}(\mathrm{NN}_1)$ and $LR_{-}(\mathrm{NN}_2) < LR_{-}(\mathrm{NN}_1)$, then NN 2 is better than NN 1 in the confirmation of both positives and negatives. If the first inequality holds but the second flips, NN 2 is better than NN 1 in the confirmation of the positive class but worse for the negative class; conversely, if the first inequality flips but the second holds, NN 2 is worse for the positive class but better for the negative one. In Table 2, we display the accuracy, the asymmetry, and the positive and negative likelihood ratios just described for the three Higgs boson masses investigated in our work. All metrics indicate an overall improvement of NN 2 over NN 1, but the gain in performance is more pronounced in the 1 TeV case. Lighter masses present attributes less discernible from the backgrounds and so profit more from an ensemble of classifiers that uses more distinctive features like the classification scores and the $M_{\ell\ell\nu\nu}$ mass.
The improvement is more significant for the signals and the $WW$ background compared to the $ZZ(\gamma^*)$ and $t\bar{t}$ events. This can be further confirmed by looking at Fig. (9), the difference between the confusion matrices of the NN 1 and NN 2 classifiers. First of all, we want the diagonal of Fig. (9) to be all positive, which means that NN 2 increases the true positive rates compared to NN 1. At the same time, negative off-diagonal entries mean less misclassification among classes.
Overall, taking into account the results for the three Higgs masses, we see a clear improvement of NN 2 over NN 1. Except for the $ZZ(\gamma^*)$ class in the 1 and 1.5 TeV cases, all diagonal entries are positive, with a major improvement in $WW$ classification. Moreover, the 1 TeV signal class benefits more from NN 2 than the heavier masses. This is a good feature of kNNNN: it helps in the more difficult cases for the signals. Concerning the off-diagonal entries, we observe a clear trend: the $ZZ(\gamma^*)$ class is more accurately identified by models whose task is to separate heavier Higgs signals, while the other classes are less confused among themselves by NN 2. On the other hand, the more accurate $WW$ and $ZZ(\gamma^*)$ classification comes at the cost of a slight increase in the mistagging of $t\bar{t}$ events as $WW$.
After the second classification, using the class scores of NN 1 and the predicted $M_{\ell\ell\nu\nu}$ mass from the Regressor, the second neural network, NN 2, provides more accurate predictions to inform the Regressor which support set to use for the regression task. As a result, the contamination from other classes is reduced, and the prediction of $M_{\ell\ell\nu\nu}$ improves. We show the predicted $\ell\ell\nu\nu$ invariant mass after the second classification in Fig. (10).
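Schematically, the inference path of the full kNNNN stack can be written as follows, assuming already trained NN 1 and NN 2 models with a `predict_proba` interface, the fitted PCA projection of Section 2, and a dict mapping each class label to its kNN regressor; all names are illustrative.

```python
# Inference path of the kNNNN stack, assuming trained models: nn1 and nn2
# are classifiers with predict_proba, pca is the fitted projection of
# Section 2, and regressors maps class labels to per-class kNN regressors.
import numpy as np

def knnnn_predict(X, nn1, pca, regressors, nn2):
    scores = nn1.predict_proba(X)              # NN1 class scores
    pred_class = scores.argmax(axis=1)         # provisional class labels
    X_pca = pca.transform(X)                   # latent space for the kNN
    pred_mass = np.zeros(len(X))
    for c, reg in regressors.items():          # one regressor per class
        mask = pred_class == c
        if mask.any():                         # predicted M_llvv bin
            pred_mass[mask] = reg.predict(X_pca[mask])
    X2 = np.column_stack([X, scores, pred_mass])
    return nn2.predict_proba(X2)               # refined NN2 scores
```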

IMPROVEMENT OF THE SIGNAL SIGNIFICANCE
Now that we have established a working algorithm to predict the $M_{\ell\ell\nu\nu}$ mass, we want to investigate whether it helps to boost the statistical signal significance when employing a machine learning classifier. The signal significance is computed according to

$$\frac{\epsilon^{(S)}_{\mathrm{cut}} N_S}{\sqrt{\sum_i \epsilon^{(i)}_{\mathrm{cut}} N_{B_i} + \left(\varepsilon_B \sum_i \epsilon^{(i)}_{\mathrm{cut}} N_{B_i}\right)^2}},$$

where $N_S$ and $N_{B_i}$, $i = WW, ZZ, t\bar{t}$, denote the number of signal and background events, respectively; $\epsilon^{(S)}_{\mathrm{cut}}$ and $\epsilon^{(i)}_{\mathrm{cut}}$, $i = WW, ZZ, t\bar{t}$, denote the signal and background cut efficiencies (both on kinematic variables and score outputs), respectively; finally, $\varepsilon_B$ represents a systematic uncertainty in the background rates, assuming, for simplicity, a common uncertainty for all background sources. As discussed previously, we are interested in showing the boost in the signal significance that our proposed algorithm produces by including the predicted $M_{\ell\ell\nu\nu}$ mass in the data representation. Moreover, we also wish to check whether the predicted masses cause an underestimation or overestimation of the statistical significance compared to what we could get if we knew the true $M_{\ell\ell\nu\nu}$ distribution. In the left panels of Fig. (11), we show the statistical significance, assuming a $\varepsilon_B = 10\%$ systematic uncertainty in the background rates, for a new Higgs boson of 1, 1.5, and 2 TeV mass, from top to bottom rows, respectively. To raise the significance, we cut on the classifiers' signal score output, represented on the plots' horizontal axis. The 1st NN and 2nd NN lines depict the significance of NN 1 and NN 2, respectively, without including the $M_{\ell\ell\nu\nu}$ prediction. Even without reconstructing the resonance, the stacking of the neural networks boosts the significance, as expected. The statistical significance is much enhanced when including the predicted mass, as shown by the top lines in all the left panels. As we see from the dashed lines, the agreement with what should be expected using the true masses in the data representation is good. The agreement is better for lower masses, while a more pronounced overestimation is observed in the 2 TeV case.
An insufficient number of simulated background samples might cause that effect. Yet, the quality of the resonance reconstruction enables us to employ the method to select the signal events better.
In the right panels of Fig. (11), we show the significance gain relative to the first neural network classifier, NN 1. While not including the predicted $M_{\ell\ell\nu\nu}$ mass leads to gains of around 2, including it boosts the gains up to 6, 8, and 10 for the 1, 1.5, and 2 TeV masses, respectively. As noted in the left panels, there is a more pronounced overestimation, of around 20 to 25% depending on the score cut, for 2 TeV Higgs bosons. Similar gains were observed when we varied the Higgs boson widths down to $\Gamma_H/m_H = 1\%$. The train/test/validation dataset was randomly split five times to assess the robustness of these results, and only tiny variations were observed in this cross-validation.
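A minimal sketch of the significance scan over the signal-score cut, following the formula above, is shown below; the yields, score arrays, and the 10% systematics are placeholders, not the simulated samples of this work.

```python
# Sketch of the significance scan over the signal-score cut, following the
# formula above; yields and score arrays are placeholders.
import numpy as np

def significance(n_s, eff_s, n_b, eff_b, eps_b=0.10):
    """n_b, eff_b: per-background (WW, ZZ, ttbar) yields and efficiencies."""
    S = n_s * eff_s
    B = float(np.dot(n_b, eff_b))
    return S / np.sqrt(B + (eps_b * B) ** 2)

def scan(scores_s, scores_b, n_s, n_b, cuts=np.linspace(0.0, 0.99, 100)):
    """Significance as a function of the cut on the signal score."""
    sig = []
    for c in cuts:
        eff_s = np.mean(scores_s > c)                 # signal efficiency
        eff_b = [np.mean(s > c) for s in scores_b]    # per-background
        sig.append(significance(n_s, eff_s, n_b, eff_b))
    return cuts, np.array(sig)
```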

CONCLUSIONS AND PROSPECTS
As the search for new physics intensifies following the LHC program schedule, new ways to identify particles that hide information through invisible decays are surely welcome. In this work, we designed an algorithm capable of reconstructing the mass of a new heavy Higgs boson decaying to $W^+W^- \to \ell^+\ell^-\nu\bar{\nu}$, and of its main SM backgrounds, using a simple but adequately tuned nearest neighbors algorithm. The algorithm assumes prior knowledge of the event classes and the Higgs boson mass; therefore, it is useful for post-discovery studies, for example, an analysis that requires a selection of on-mass-shell Higgs bosons.
More importantly, including the predicted kNN $M_{\ell\ell\nu\nu}$ mass as an attribute for neural network classification improves the accuracy, the true and false positive/negative rates, and the likelihood ratios of true class assignments when compared to a neural network that has no clue about the masses. The gain in statistical significance is the ultimate test of the proposed algorithm. We found a gain in significance of up to a factor of 10 for a 2 TeV Higgs boson mass. For lighter masses, of 1 and 1.5 TeV, the gains are less pronounced but still high, up to 6 and 8, respectively, depending on the cut placed on the signal class score. We checked that the predicted mass is reliable and robust as a new feature for classification by comparing our results against classifiers trained with the true $M_{\ell\ell\nu\nu}$ masses. Not only do the binned invariant mass distributions agree, but the final statistical significances also agree, within a few tens of percent at most.
The kNNNN algorithm can be applied to other observable variables as well. For example, the scattering angle of the $W$ bosons can be obtained in the fully leptonic channel besides the charged leptons' angles. The masses of particles in different topologies can also be obtained; for example, we anticipate that sparticle mass distributions from decay chains of various lengths might be recovered after their determination with other methods. The next step in this kind of investigation is to relax the prior knowledge of the mass parameters and weaken the level of supervision when training the classifiers and regressors. Outlier detection and other unsupervised techniques can readily be used to dismiss prior knowledge of the signal class; yet, using kNN for regression requires knowledge of the mass parameters. A completely weakly supervised regression algorithm that assumes just the knowledge of the background classes is challenging, since it involves generalization across classes with essential information loss.
We are currently investigating deep neural networks and variational autoencoders as regression algorithms trained on a single background class, but still assuming prior knowledge of the signal mass parameters. These results will be presented elsewhere.