Comparing Traditional and Deep-Learning Techniques of Kinematic Reconstruction for Polarisation Discrimination in Vector Boson Scattering

Measuring longitudinally polarised vector boson scattering in the WW channel is a promising way to investigate unitarity restoration through the Higgs mechanism and to search for possible physics beyond the Standard Model. In order to perform such a measurement, it is crucial to develop an efficient reconstruction of the full W boson kinematics in leptonic decays, with a focus on polarisation measurements. We investigated several approaches, ranging from traditional ones up to advanced deep neural network architectures, and compared their ability to reconstruct the W boson reference frame and consequently to measure the longitudinal fraction W_L in both the semi-leptonic and fully-leptonic WW decay channels.


1 Introduction
The Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) has, in recent years, delivered unprecedented high-energy proton-proton collisions that have been collected and studied by two multipurpose experiments, ATLAS and CMS. The two collaborations are exploiting these data to explore the fundamental nature of matter and the basic forces that shape the Universe, testing the Standard Model of particle physics (SM) in regimes never investigated before. One process in particular, Vector Boson Scattering (VBS), i.e. the scattering of two massive vector bosons, VV → VV with V = W or Z, is key to probing the electroweak sector of the SM in the TeV regime and the mechanism of electroweak symmetry breaking (EWSB) [1,2]. Experimentally, both the validation of SM predictions and the discovery of new physics through unexpected deviations are of great importance in the short and medium term. In the short term, such studies are being performed by the experiments on the currently collected LHC data. In the medium term, the LHC will be upgraded to the High Luminosity LHC (HL-LHC), an ESFRI landmark, providing even larger statistics.
The VBS measurement is extremely challenging because of its low signal yields, complex final states and large backgrounds. Its understanding requires a coordinated effort of experimental and theoretical physicists with a wide spectrum of skills: detector knowledge, event reconstruction and simulation, and data mining. Only such an interdisciplinary approach, as is indeed currently taking place, allows the best exploitation of the LHC data.
In more detail, VBS production involves quartic gauge boson self-interactions and the s- and t-channel exchanges of a gauge boson or of a Higgs boson. In the SM, the contribution of the Higgs boson (H), discovered at the LHC [3,4], regularizes the VBS amplitude by cancelling divergences arising from longitudinally polarised vector bosons at high energy, as an explicit consequence of the EWSB [5,6]. Furthermore, any enhancement of the VBS cross section beyond the SM prediction, as studied in many scenarios of physics beyond the SM, could potentially be detected at the LHC [7,8].
At the LHC, the dominant VBS processes result in final states with two gauge bosons and two jets (VVjj). The same final state can, however, also be reached through processes that do not involve VBS but arise from other electroweak-mediated and/or QCD-mediated contributions. Studies have shown that same-sign W±W±jj production has the largest VBS contribution to the production cross section [9], because tree-level Feynman diagrams not involving the self-interactions are absent in the s-channel and suppressed in the other channels. The same-sign W±W±jj production is thus well suited for EWSB and new-physics studies involving VBS at the LHC. The observation of W±W±jj electroweak production was already reported by the ATLAS [10] and CMS [11] collaborations using data recorded in Run 2 at a centre-of-mass energy of √s = 13 TeV.
The main aim of the study presented in this paper is to find an optimal method to identify the contribution of longitudinally polarised W bosons in the VBS process, by investigating strategies to reconstruct the event kinematics and thereby developing a technique that efficiently discriminates between the longitudinal contribution and the rest of the processes participating in VBS. Several approaches have been studied and are presented in the following. The results demonstrate the advantages of machine-learning (ML) based techniques, which profit from optimal phenomenological observables in several stages of the procedure. The studies presented here have been performed on a VBS prototype analysis, to demonstrate the validity of the approach, but the same strategies can be implemented in any ATLAS or CMS analysis involving VBS and/or W bosons.

2 Vector Boson Scattering kinematics and polarisation properties
A representative Feynman diagram of the W±W±jj VBS process at the LHC is shown in Fig. 1, where the gauge bosons are radiated off the incoming quarks and then scatter via the quartic self-interaction vertex. In this process, the two jets, produced by the fragmentation of the scattering partons, are expected to fly predominantly in the forward directions of opposite hemispheres, i.e. with a large rapidity difference. This is a typical signature of this type of scattering process [12]. The two W bosons can decay either leptonically or hadronically, whereby leptonic decays into light leptons (electrons and muons) are experimentally preferred due to a lower background contamination and more efficient data-acquisition triggers. In particular, the first observation papers of the ATLAS and CMS collaborations, cited above, required same-sign charged di-lepton pairs in the final state. In this paper, we evaluate two scenarios: the semi-leptonic case, where one of the W bosons decays into light leptons, and the fully-leptonic case, where both W bosons decay leptonically.
The EW scattering of the two W bosons is a non-resonant process, with no strong kinematic constraints on the W±W± rest frame; the opposite would be true if the W±W± pair were produced in a heavy resonance decay. Consequently, the constraints on the event kinematics are less stringent, which, as we will show in the following, makes the full reconstruction from the experimentally measurable kinematic quantities (i.e. angles and charged-particle four-momenta) much more difficult in the presence of the non-measurable (one or two) neutrinos produced in the decay of the W±W± pair. Full kinematic reconstruction of the event is essential in order to efficiently disentangle the different polarisation contributions of the W bosons in the VBS process. The polarisation fractions of the W boson are reflected directly in the angular distributions of the resulting leptons. In the absence of kinematic cuts on the leptons, the relation is very straightforward [13], as one can write the angular distribution in the W boson rest frame as

$$\frac{1}{\sigma}\frac{\mathrm{d}\sigma}{\mathrm{d}\cos\vartheta} = \frac{3}{8}\, f_L\, (1 \mp \cos\vartheta)^2 + \frac{3}{8}\, f_R\, (1 \pm \cos\vartheta)^2 + \frac{3}{4}\, f_0 \sin^2\vartheta \,, \quad (1)$$

with the upper (lower) signs holding for W+ (W−), where the polarisation fractions, satisfying ∑_{i=0,L,R} f_i = 1, weight the Legendre polynomials of the lepton polar angle ϑ, defined as the angle of the lepton in the W rest frame with respect to the W direction in the laboratory frame. As argued in the same paper, Ref. [13], these relations become more complex in the presence of kinematic cuts as well as reconstruction inefficiencies and ambiguities. The principal challenge is thus to reconstruct the leptonically-decaying W boson rest frame(s), i.e. to fully reconstruct the neutrino momenta.
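To make the connection concrete: integrating cos²ϑ against Eq. (1) gives ⟨cos²ϑ⟩ = 2/5 − f₀/5 independently of the sign convention, so the longitudinal fraction follows from a simple second moment, f₀ = 2 − 5⟨cos²ϑ⟩. The snippet below is a minimal sketch of this estimator (the function name is ours; the relation holds only in the absence of lepton cuts, as stated above):

```python
import numpy as np

def f0_from_second_moment(cos_theta):
    """Estimate the longitudinal fraction f_0 from the second moment of cos(theta).

    Integrating cos^2(theta) against Eq. (1) gives
    <cos^2 theta> = 2/5 - f_0/5, hence f_0 = 2 - 5 <cos^2 theta>.
    Valid only without kinematic cuts on the lepton.
    """
    cos_theta = np.asarray(cos_theta)
    return 2.0 - 5.0 * np.mean(cos_theta**2)
```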
In what follows, we present several approaches for trying to achieve this goal, in both the semi-leptonic and fully-leptonic VBS cases, making use of simple cuts, advanced kinematic variables and machine-learning approaches.

3 Simulated sample definition
VBS events are generated with PHANTOM [14], a tree-level Monte Carlo generator for six-parton final states at O(α_ew^6) and O(α_ew^4 · α_s^2) in perturbation theory, including possible interferences between the two sets of diagrams (see Appendix A for more details on the generator and Appendix B for the event-generation configuration). In order to study the impact of experimental measurements, generated events are processed with Delphes [15], a framework for fast simulation of collider experiments, which mimics the detector effects on the measured quantities, such as final-state physics objects. This fast simulation includes a generic tracking system embedded in a magnetic field, electromagnetic and hadronic calorimeters, and a muon system. Each of the subsystems can be customized to reproduce the behaviour of any specific experiment sub-detector. In the present study, the ATLAS configuration (an approximate ATLAS detector description, https://cp3.irmp.ucl.ac.be/projects/delphes/browser/git/cards/delphes_card_ATLAS.tcl) has been adopted. These studies show that our findings at truth (generator) level and with the (approximate) Delphes simulation are qualitatively the same: since we use only measurable quantities, the detector simulation merely degrades the resolution of these quantities through the experimental uncertainties.
4 Reconstruction of the W boson reference frame using kinematic cuts

4.1 Single W boson decay kinematics
As already stated, the approach evaluated in this paper consists in extracting the polarisation fractions from the angular distribution of the leptons in the W rest frame. In order to reconstruct the W boson rest frame, one needs to reconstruct the full event kinematics, including the momenta of the neutrinos resulting from the W boson decays. It is important to note that with only one neutrino present, as in the semi-leptonic decays, the transverse neutrino components are measurable, i.e. they can be evaluated from the measured missing transverse momentum (MET) in an event, but the longitudinal component is not. In fully-leptonic decays, both neutrinos contribute to the measured MET and the kinematic reconstruction is consequently even more challenging.
To this end, let us first consider a single W boson decaying leptonically. Requiring four-momentum conservation of the W boson decay products in the ultra-relativistic limit, and solving the constraint for the longitudinal component of the neutrino momentum, a second-order equation in p_νL is easily obtained:

$$p_T^2\, p_{\nu L}^2 - 2\,\mu\, p_L\, p_{\nu L} + \left(E^2 p_{\nu T}^2 - \mu^2\right) = 0\,, \qquad \mu \equiv \frac{m_W^2}{2} + \vec p_T \cdot \vec p_{\nu T}\,, \quad (2)$$

where m_W is the W boson invariant mass, p_L and p_T are respectively the longitudinal and transverse components of the lepton momentum and E is its energy, while p_νT is the transverse component of the neutrino momentum. Eq. 2 defines the unknown p_νL as the zeros of a second-order polynomial, so the problem of a negative discriminant must also be considered. Some ad-hoc solutions have been adopted by the experimental analysis groups, such as setting the discriminant to zero or recalculating it by applying a W transverse-mass constraint; in this context, the meaning of the imaginary solutions has also been speculated upon. In the present study we ignore the negative-discriminant cases and focus instead only on the sign ambiguity (+/−) between the two solutions in the positive-discriminant (∆ = b² − 4ac > 0) case. There is a priori no physical reason to prefer one solution over the other.
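As an illustration, the two solutions of Eq. 2 can be computed as in the following sketch, which assumes a massless lepton and identifies the neutrino transverse momentum with the measured MET components (all names are ours; the zero-discriminant fallback is one of the ad-hoc recipes mentioned above):

```python
import numpy as np

M_W = 80.38  # GeV; on-shell W mass constraint (PDG value quoted later in the text)

def neutrino_pz_solutions(lep_px, lep_py, lep_pz, lep_e, met_x, met_y, m_w=M_W):
    """Return the two solutions of the quadratic W-mass constraint for p_nuL.

    Writing mu = m_W^2/2 + pT(lep).pT(nu), Eq. 2 gives
    p_nuL = (mu * p_L +/- E * sqrt(mu^2 - pT^2 * p_nuT^2)) / pT^2.
    """
    mu = 0.5 * m_w**2 + lep_px * met_x + lep_py * met_y
    pt2 = lep_px**2 + lep_py**2            # lepton transverse momentum squared
    pnu_t2 = met_x**2 + met_y**2           # neutrino transverse momentum squared
    disc = mu**2 - pt2 * pnu_t2            # proportional to the discriminant
    disc = max(disc, 0.0)                  # ad-hoc fix: set a negative discriminant to zero
    base = mu * lep_pz / pt2
    shift = lep_e * np.sqrt(disc) / pt2
    return base + shift, base - shift
```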
In the following, we will comment on how this ambiguity could be addressed and resolved, by adopting specific physics selection criteria.

4.2 Semi-leptonic VBS channel
The VBS semi-leptonic channel considered here consists of a hadronically-decaying W boson with large transverse momentum p_T, a W boson decaying to a light charged lepton (electron or muon) and a neutrino, plus jets in the forward regions of the detector. This signature benefits from a larger branching ratio with respect to the fully-leptonic one, thanks to the hadronic decay of one vector boson, but it faces a higher reducible background from W+jets production and an intrinsic ambiguity at detector level in identifying the jets coming from the W boson decay. Moreover, due to the presence of the neutrino, which escapes any form of detection, the leptonically-decaying W boson can be reconstructed only from the MET combined with the lepton four-momentum, applying the W mass constraint.
In the VBS semi-leptonic channel, the sign ambiguity in the longitudinal neutrino momentum can be partially resolved by adopting certain decision criteria, which select the appropriate sign choice in Eq. 2. The decision criteria impose requirements for the correct solution to pass a certain kinematic selection:

- Selection 0: the sign is chosen randomly (this is the worst-case scenario);
- Selection 1: the correct solution is required to have the absolute value of the scalar product of the reconstructed neutrino three-momentum with the reconstructed W three-momentum smaller than 5000 GeV²; a random solution is chosen if both solutions pass/fail this criterion;
- Selection 2: the correct solution is required to have an absolute value |p_νL| smaller than 50 GeV; a random solution is chosen if both solutions pass/fail this criterion;
- Selection 3: the correct solution is required to have the absolute value of the scalar product of the reconstructed neutrino three-momentum with the reconstructed W three-momentum

The effect of the different selection criteria on the reconstruction of the longitudinal neutrino momentum p_νL from truth-level quantities can be appreciated in Fig. 2, which shows the relative deviation of this quantity from the true (generator-level) value p_νL^truth. A qualitative analysis of this plot shows that selections 2 and 4 remove the peak at 0, whereas selection 1 lies under selection 3. In general, the widths are limited from below by the resolution on the reconstructed W momentum. Comparisons of the truth and reconstructed lepton angular distributions from generator-level measurable quantities for the longitudinally and transversely polarised samples are shown in Fig. 3. An alternative metric to assess the performance of the selection criteria is the fraction of correct solution choices, where correct denotes the solution giving a p_νL with the smaller absolute error with respect to the true value. The results for different polarisations of the W boson, for an unpolarised W boson generated in the OSP framework (see Appendix A), and for the full computation (unpolarised W boson, with non-resonant production modes included) are reported in Table 1.
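As an example, Selection 2 can be sketched as follows (a toy implementation under our own naming; the 50 GeV threshold is the one quoted above):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def selection_2(sol_plus, sol_minus, threshold=50.0):
    """Selection 2: keep the solution with |p_nuL| below the threshold (GeV);
    fall back to a random choice if both solutions pass or both fail."""
    pass_plus = abs(sol_plus) < threshold
    pass_minus = abs(sol_minus) < threshold
    if pass_plus != pass_minus:            # exactly one solution passes
        return sol_plus if pass_plus else sol_minus
    return sol_plus if rng.random() < 0.5 else sol_minus
```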
To summarise, in the semi-leptonic channel the reconstruction of the W reference frame can be obtained up to a neutrino sign ambiguity. The selection criteria presented here are not fully effective at picking the correct solution and could be improved by using machine-learning techniques, as described in Section 5.

4.3 Fully-leptonic VBS channel
The approach described in the previous section for the semi-leptonic channel can be extended to the fully-leptonic VBS case, which is the most promising process in terms of background discrimination, despite its lower cross section due to the W boson decay branching ratios. In this case the kinematics is more complex, with two neutrinos in the final state, and the goal is to find which measurable variables are most suitable to distinguish the W decay components in terms of polarisation. The reconstruction of the two W boson rest frames represents a real challenge from an experimental point of view: there are eight unknown parameters (four for each neutrino four-momentum) and only six equations defining kinematic constraints.
Several algorithms have been developed to fully reconstruct the kinematics of a process with two invisible particles in the final state. The MT2-Assisted On-Shell (MAOS) algorithm used here exploits the fact that most processes involving the exchange of a W boson (or another heavy decaying particle) prefer the phase-space region with low heavy-particle transverse mass. This approach has been successfully used in SUSY searches and also in H → WW reconstruction [16,17,18].
In order to simplify the notation, let us restrict ourselves to a VBS process where one W decays to a muon and the other to an electron. The MAOS algorithm introduces a functional of the trial neutrino transverse momenta p⃗_1 and p⃗_2,

$$f(\vec p_1, \vec p_2) = \max\!\left\{\, m_T(\vec p_T^{\,\mu}, \vec p_1),\; m_T(\vec p_T^{\,e}, \vec p_2) \,\right\}, \qquad \vec p_1 + \vec p_2 = \vec p_T^{\,\mathrm{miss}}\,, \quad (3)$$

with

$$m_T^2(\vec p_T^{\,\ell}, \vec p_T^{\,\nu}) = 2\left(\, |\vec p_T^{\,\ell}|\,|\vec p_T^{\,\nu}| - \vec p_T^{\,\ell} \cdot \vec p_T^{\,\nu} \,\right)\,, \quad (4)$$

where p⃗_T^μ and p⃗_T^e represent the muon and electron transverse momenta, respectively. The minimum of this functional defines the m_T2 variable and represents a compromise, minimizing both W transverse masses at the same time:

$$m_{T2} = \min_{\vec p_1 + \vec p_2 = \vec p_T^{\,\mathrm{miss}}} \max\!\left\{\, m_T(\vec p_T^{\,\mu}, \vec p_1),\; m_T(\vec p_T^{\,e}, \vec p_2) \,\right\}\,, \quad (5)$$

with the minimizing trial momenta p⃗_T^{ν_e} and p⃗_T^{ν_μ} giving the estimates of the neutrino transverse momenta. In general, there are two classes of solutions to the minimization problem (5), balanced and unbalanced; the difference between the two cases is shown diagrammatically in Fig. 4. When the masses of the final-state particles are negligible with respect to the W mass, the unbalanced case is not feasible and the minimum has to lie on the intersection of the two transverse-mass surfaces, m_T(p⃗_T^μ, p⃗_1) = m_T(p⃗_T^e, p⃗_T^miss − p⃗_1). In the case of leptonic W decays we find ourselves in this regime. The intersection curve can be expressed analytically, while the m_T2 solution has to be estimated numerically along this curve. While the transverse momenta are reconstructed using the MAOS procedure, the W mass constraint from the semi-leptonic case is adopted for the reconstruction of both longitudinal neutrino momenta; the random choice and selection 4, as described in Section 4.2, were studied as strategies to resolve the solution ambiguities. The next kinematic reconstruction steps for the longitudinal neutrino momenta then follow the same approach as in the semi-leptonic case described in Section 4.2, for each W boson separately. In Fig. 5, the relative deviations of the reconstructed longitudinal and transverse neutrino momenta p_νL,T from the true (generator-level) values p_νL,T^truth are shown inclusively for both neutrinos, since a complete symmetry is expected. A comparison of the true and reconstructed lepton angular distributions for the longitudinally and transversely polarised samples is shown in Fig. 6.
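A brute-force numerical sketch of the m_T2 minimization is given below; it scans the MET split directly with a generic minimizer rather than following the analytic intersection curve described above (all names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def m_t(lep_pt, nu_pt):
    """Transverse mass for massless particles, Eq. 4; inputs are 2D pT vectors."""
    m2 = 2.0 * (np.linalg.norm(lep_pt) * np.linalg.norm(nu_pt) - np.dot(lep_pt, nu_pt))
    return np.sqrt(max(m2, 0.0))

def m_t2(mu_pt, e_pt, met):
    """Numerical m_T2, Eq. 5: minimize the larger of the two transverse masses
    over the split of the missing transverse momentum."""
    def objective(p1):
        return max(m_t(mu_pt, p1), m_t(e_pt, met - p1))
    # Nelder-Mead copes with the non-smooth max(); start from an even MET split
    res = minimize(objective, x0=0.5 * met, method="Nelder-Mead")
    return res.fun, res.x, met - res.x  # m_T2 and the two trial neutrino momenta

# Example (GeV): mt2, p_nu_mu, p_nu_e = m_t2(np.array([40., 10.]),
#                                            np.array([-30., 5.]),
#                                            np.array([20., -15.]))
```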
To sum up: in the fully-leptonic case, the reconstruction is more complicated. While the MAOS framework considered here performs relatively well compared to other (non-machine-learning) approaches, the W polarisation separation in the lepton angular distributions (evaluated in the reconstructed W reference frame) is still quite poor, obscuring the differences between the transverse and longitudinal polarisations. Consequently, as in the semi-leptonic case, a machine-learning approach again seems worthy of investigation.

5 Extracting polarisation fractions using Neural Networks
As already stated, one of the goals of this work is the investigation of a deep-learning multivariate technique to reconstruct the W boson rest frame(s) and consequently the polarisation fractions. Particular attention has been dedicated to determining the optimal network architecture and configuration in terms of performance and reliability.
Abundant literature exists about machine-learning (ML) and deep-learning (DL) applications in high-energy physics (see, for instance, [19,20]), including the specific case of VBS addressed here [21]. In the present work, we exploit a DL technique, namely Deep Neural Networks (DNN), developed using two different approaches: binary classification (Section 5.1) and regression (Section 5.2). The two approaches, explained in the following sections, make use of the same dataset for the training, test and validation phases; the only difference between them lies in the formulation of the neural-network problem. All details about the dataset size, variables and assumptions can be found in Appendix D.

5.1 Binary classification technique
The first DNN implementation makes use of a binary classifier to determine the correct solution of Eq. 2, which is then used to reconstruct the W boson reference frame. The binary classifier is a neural network with one neuron in the output layer. The model deduces which of the two (+/−) solutions is chosen by nature (referred to in the following as the correct solution), based on the training variables. In this way the kinematics of the boson decay is fully reconstructed, as in Section 4.2, but with a set of selection criteria configured by the DL approach.
The DNN input is a set of 27 variables, including simple variables related to the momentum and geometrical disposition of the physical objects, as well as more complex variables that combine several features. The detailed list is given in Appendix D.
Correct solutions are determined event by event by analysing PHANTOM events. A label is assigned on a per-event basis following this criterion: events where the negative (positive) sign gives the result closer to the truth p_νL are assigned label 0 (1); events with a negative discriminant are discarded (in this approximation the W boson mass is fixed to the value of 80.38 GeV, according to the Particle Data Book [22]). Our DNN is trained on the truth solution labels to return a score bounded between 0 and 1. The score distribution obtained when training on the transversely polarised PHANTOM sample is shown in Fig. 7. Optimal working points for classification have been selected for each model based on the Receiver Operating Characteristic (ROC) curves [23]. An example of the ROC curves for the DNN model using 60 neurons and 4 hidden layers is shown in Fig. 8. Values of the area under the curve (AUC) and the fractions of correct solution choices are summarized in Table 2.
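A minimal sketch of such a classifier in Keras is shown below (the paper does not specify the framework or optimizer settings, so these are illustrative choices; only the input dimension, the single sigmoid output and the 60-neuron/4-layer example configuration follow the text):

```python
import tensorflow as tf

def build_classifier(n_inputs=27, n_neurons=60, n_hidden=4):
    """Binary classifier returning a score in [0, 1] for the sign choice of Eq. 2."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_inputs,)))
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # one output neuron
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    return model
```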
A dedicated study has been made to evaluate the dependence of the DNN on the size of the training set. Different sizes of the training sets have been considered, spanning from 500 events to 10M events. The resulting ROC curves are shown in Fig. 9. One can see that the performance improves with the increase of the training-set size, but converges quite well at a size of 1M events, since the difference between 1M and 10M can be considered negligible. Given this feature, we chose a training-set size of 1M to perform further optimizations of the DNN parameters.
Following Eq. 2, p_νL and E_ν in the laboratory frame are calculated. Finally, the lepton momentum is boosted into the reconstructed W rest frame, defined by the vector sum of the neutrino and lepton momenta. The resulting angular distribution in the W boson rest frame, obtained using the binary classification technique, is shown in Fig. 10 as a function of cos ϑ, for a DNN with a fixed number of neurons (60) and different hidden-layer configurations.
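The boost into the reconstructed W rest frame can be sketched as follows (a generic Lorentz boost under our own naming; four-vectors are (E, px, py, pz) numpy arrays):

```python
import numpy as np

def cos_theta_w_frame(lep_p4, nu_p4):
    """Boost the lepton into the reconstructed W rest frame and return
    cos(theta) with respect to the W flight direction in the lab frame."""
    w_p4 = lep_p4 + nu_p4                        # W = lepton + neutrino
    beta = w_p4[1:] / w_p4[0]                    # W boost vector (assumed non-zero)
    b2 = np.dot(beta, beta)
    gamma = 1.0 / np.sqrt(1.0 - b2)
    bp = np.dot(beta, lep_p4[1:])
    # standard Lorentz boost of the lepton three-momentum into the W rest frame
    lep_star = lep_p4[1:] + beta * ((gamma - 1.0) * bp / b2 - gamma * lep_p4[0])
    w_dir = w_p4[1:] / np.linalg.norm(w_p4[1:])  # W direction in the lab frame
    return np.dot(lep_star, w_dir) / np.linalg.norm(lep_star)
```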
In addition to the standard open-source platforms, we used a specific one available on the IBM Cloud, named AutoAI, which exploits artificial-intelligence routines to generate candidate model pipelines customized for a given predictive-modelling problem. These model pipelines are created iteratively as AutoAI analyses the dataset and discovers data transformations, algorithms, and parameter settings that work best for the problem at hand. It covers all the steps of a typical machine-learning project, from data pre-processing to model selection, feature optimization and model deployment. In particular, it automatically detects and categorizes features based on data type, such as categorical or numerical, and then transforms the raw data into the combination of features that best represents the problem, in order to achieve the most accurate prediction. AutoAI explores various feature-construction choices in a structured, non-exhaustive manner, while progressively maximizing model accuracy using reinforcement learning. Finally, a hyper-parameter optimization step refines the best-performing model pipelines, using an optimization algorithm designed for costly function evaluations, such as the model training and scoring typical of machine learning; this enables fast convergence to a good solution despite the long evaluation time of each iteration. Comparing Figs. 10 and 11, we see no further improvement in our binary classification, indicating that we have already extracted all the information available in the training data. We made this comparison for the binary classification, where the AUC value represents a clear statistical indicator, although we reached a similar conclusion for the technique discussed in the next section.

5.2 Regression technique
A second approach can be applied: instead of using the neural network to select the solution of Eq. 2, we train the algorithm to determine directly the reconstructed angular distribution of the lepton in the W rest frame, with respect to the W direction in the laboratory frame. The DNN input is again a set of 27 variables, from simple variables related to the momentum and geometrical disposition of the physical objects to more complex variables that combine several features; the detailed list is given in Appendix D. Differently from the binary classification approach, here the DNN is trained with the true value of the cos ϑ variable, which it aims to reproduce. With this approach, we tested as many different neural-network topologies as in the case of binary classification. The resulting angular distribution in the W boson rest frame as a function of cos ϑ is shown in Fig. 12 for a DNN with a fixed number of neurons (60) and different hidden-layer configurations. To evaluate the performance of this approach, the distribution of the reconstructed cos ϑ for the longitudinally and transversely polarised samples is compared to the truth values.
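The regression variant differs from the classifier sketch of Section 5.1 only in the output head and the loss; a hedged Keras sketch (illustrative settings, not the paper's exact configuration):

```python
import tensorflow as tf

def build_regressor(n_inputs=27, n_neurons=60, n_hidden=4, n_outputs=1):
    """Regression DNN trained against the true cos(theta) with an MSE loss."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(n_inputs,)))
    for _ in range(n_hidden):
        model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
    model.add(tf.keras.layers.Dense(n_outputs, activation="linear"))  # unbounded output
    model.compile(optimizer="adam", loss="mse")
    return model
```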
One additional way to evaluate the performance of the model for different neural-network topologies is presented in Fig. 13, where the distance of the points from the main diagonal represents the ability of the network to reconstruct cos ϑ. Similarly to the binary classification, a dedicated study has been made to evaluate the dependence of the DNN on the size of the training set. Different sizes, spanning from 500 events to 10M events, have been considered; the results are shown in Fig. 14. As expected, the larger the training-set size, the better the performance, although one should also consider the possibility of over-fitting of the DNN.

5.3 Neural network approach to the fully-leptonic channel
Following the approach used for the semi-leptonic case, we applied the DNN study also to the fully-leptonic channel. From a technical perspective, this channel can be thought of as an extension of the semi-leptonic case presented before, where only the regression approach can be considered. The fully-leptonic kinematic reconstruction can be supplemented with the MAOS algorithm, as described in Section 4.3. Since we wanted to check whether the information obtained from MAOS could improve the DNN reconstruction, we included the MAOS algorithm outputs among the training features, namely the MAOS-reconstructed neutrino transverse momenta (p_T^ν from Eq. 5) and m_T2. In the fully-leptonic channel, our methodology lets the network reconstruct cos ϑ for the muon and the electron directly, and then combines these two contributions to obtain the total angular distribution of a given polarisation component. This approach will be referred to as the direct approach from here on.
As a further step towards improving our ability to extract a model for the angular distribution, we also implemented an indirect approach by introducing an intermediate step: the DNN training labels are now the six components of the neutrino momenta, so the neural network has to solve a regression problem with a six-dimensional output. Once the DNN model outputs the six reconstructed neutrino components, we calculate and reconstruct the cos ϑ distribution. A summary of the performance of both reconstruction approaches, with and without the inclusion of the MAOS quantities among the training features, is shown in Fig. 15.
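In terms of network structure, the indirect approach only changes the dimensionality of the output head; with the hypothetical builder sketched in Section 5.2, this amounts to:

```python
# Indirect approach: six-dimensional regression output (the six neutrino momentum
# components). The input count, here left at 27, would grow when the MAOS features
# (trial neutrino transverse momenta and m_T2) are appended to the training variables.
model = build_regressor(n_inputs=27, n_outputs=6)
```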
As a further validation of the implemented reconstruction procedure, in Fig. 16 we plot the reconstructed transverse momentum of the neutrino pair (p_Tνν), which represents the total missing momentum of the process. From a deep-learning perspective, we tested more network topologies than in the semi-leptonic case, as we noticed that some particular combinations of neurons and hidden layers were not able to learn. We realized that the loss functions were in some cases flat, or that we had to increase the number of training epochs, suggesting a complex convergence structure: the evolution is in some cases represented by a negative exponential decay of the loss function, and in other cases by a composition of step-like drops.

6 Comparison and fit results
We estimated the goodness of a particular method in terms of its ability to extract the longitudinal polarisation fraction from the unpolarised distribution. This is done by means of a maximum-likelihood fit of the polarised angular distributions to the full computation. The full computation contains not only admixtures of the longitudinal and transverse polarisations, but also the interference among them and non-resonant contributions. Since the interference can be constructive as well as destructive, we adopted an approach in which all the non-purely-longitudinal components (transverse polarisation, interference and non-resonant contributions) are summed up and treated as a single distribution, obtained by subtracting the longitudinal distribution from the full computation. With this approach we avoid fitting histograms with negative entries, which cannot be handled in a maximum-likelihood fit. From here on, the mixture of these three components will be referred to as the background distribution.
In the fit, the full computation is hypothesized to be well described by the sum of the longitudinal and background distributions, whose normalizations are chosen as the fit parameters. There is of course a very strong anti-correlation between the fitted parameters, because the benchmark is a direct sum of both. This also results in the exact convergence of the best-fit values to the truth polarisation fractions, as predicted by the generator.
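A hedged sketch of such a two-template binned maximum-likelihood fit is given below (a Poisson likelihood minimized with scipy; the paper does not specify its fit machinery, and all names are ours):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def fit_yields(data, templ_long, templ_bkg):
    """Fit the longitudinal and background normalizations to the full computation.

    All arguments are binned cos(theta) histograms (numpy arrays of counts).
    """
    t_long = templ_long / templ_long.sum()      # unit-normalized templates
    t_bkg = templ_bkg / templ_bkg.sum()

    def nll(params):                            # binned Poisson negative log-likelihood
        n_long, n_bkg = params
        mu = np.clip(n_long * t_long + n_bkg * t_bkg, 1e-9, None)
        return np.sum(mu - data * np.log(mu) + gammaln(data + 1.0))

    x0 = [0.5 * data.sum(), 0.5 * data.sum()]   # start from an even split
    return minimize(nll, x0, method="Nelder-Mead").x
```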
The quantities of interest are the confidence intervals (CI) for the longitudinal and background normalizations. The resulting CI widths at 95% CL, using different sources of angular distributions (random solution choice, binary classification, regression and truth), are summarized in Table 3.
Both approaches, the analytic solution supplemented with binary classification and the regression, show an improvement with respect to the random solution choice, as expected. An interesting observation emerges from the comparison of the fit performance of the different reconstruction techniques with the truth information: while the binary classifier very closely approaches the truth performance, the regression model even outperforms the fit using the truth distributions. What we might have considered a failure of the DNN in reproducing the truth angular distribution seems to encode further information, important for distinguishing the longitudinal polarisation from the background, which cannot simply be extracted from the truth angular distribution in a single variable. We summarize the performance of the different methods exploited in this paper in Tables 4, 5 and 6, where the RMSE is used as a general metric. The DNN technique gives smaller values than the classic selection criteria, confirming the potential of the method across different vector boson scattering processes.

7 Conclusion
Several approaches aimed at optimally reconstructing the full kinematics of W boson leptonic decays in same-sign WW scattering, in the semi-leptonic and fully-leptonic decay channels, have been presented in this paper, targeting the extraction of the longitudinal polarisation fraction of the W bosons in this physics process. The binary classification and regression ML (DNN) approaches were shown to improve on the traditional kinematic selection criteria, with the regression approach even outperforming the fit based on the truth angular distributions.

Appendix A: The PHANTOM generator

PHANTOM is an event generator capable of simulating any set of reactions with six partons in the final state at pp colliders at order O(α_ew^6) and O(α_ew^4 · α_s^2), including possible interferences between the two sets of diagrams. This includes all purely electroweak contributions as well as all contributions with one virtual or two external gluons. In the semi-leptonic channel we define the channel mu vm, whereas for the fully-leptonic case the channel is defined as mu e vm ve. In this generation we removed all b-quark and t-quark contributions. PHANTOM performs an exact calculation of the matrix element at tree level, without using any production-times-decay approximation for the W boson interactions. Moreover, it is computationally efficient, since it is a dedicated generator for six-fermion final states. PHANTOM implements a method called On-Shell Projection (OSP) to compute the amplitude of resonant contributions, projecting (in the numerator) the four-momenta of the decay particles on shell, as shown in Eq. A.2. The formula can be seen as an on-shell production times decay, modulated by the Breit-Wigner width shape, with all exact spin correlations.
With this method, PHANTOM conserves the total four-momentum of the WW system, the directions of the two W bosons in the WW centre-of-mass frame, and the direction of each charged lepton in its W centre-of-mass frame. In the case of the semi-leptonic single-polarisation/resonant study, we use the flag OSP1, which applies the projection to only one boson, while the other boson keeps a non-zero width; this configuration is not gauge invariant.

Appendix B: Event generation configuration
PHANTOM events used in this study have been produced with program version 1.6, configuring the generator according to the following parameters and sets of cuts, in particular cuts on forward/backward jets, missing p_T and leptons, to ensure the validity of the generator polarisation definition and selection.
- Calculation type: O(α_ew^6) at 13 TeV centre-of-mass energy;
- p_T > 20 GeV;
- |η| < 2.5;
- p_T^miss > 40 GeV;
- p_T^j > 30 GeV;
- |η^j| < 4.5;
- |Δη_jj| > 2.5;
- m_jj > 500 GeV;
- ΔR > 0.3;
- ΔR_j > 0.3.

Appendix D: Dataset and training details

Table 7: Number of events used for each sample in the training, validation and test datasets

sample             training   validation   test
unpolarised        530219     530219       706959
longitudinal       264052     264052       352070
transverse         269769     269769       359692
full computation   535236     535236       713648

Increasing the number of neurons and hidden layers allows the network to accomplish more difficult classification tasks, because its approximation power increases [24]. Before starting the training procedure, a Gaussian standardization is applied to the whole training dataset, signal and background: each variable is centred at 0 and scaled to unit variance independently. This procedure helps the neural-network training because it reduces the scale differences among its inputs.
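This standardization step can be sketched with scikit-learn (an assumption about tooling; the paper only describes the transformation itself):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-ins for the 27 training variables; in the real workflow these come
# from the pandas datasets described in Appendix E.
X_train = np.random.default_rng(0).normal(10.0, 3.0, size=(1000, 27))
X_test = np.random.default_rng(1).normal(10.0, 3.0, size=(200, 27))

scaler = StandardScaler()                    # centre each variable at 0, unit variance
X_train_std = scaler.fit_transform(X_train)  # parameters estimated on the training set
X_test_std = scaler.transform(X_test)        # the same transformation applied elsewhere
```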
For the sake of completeness, we include representative plots of the loss-function behaviour in Figs. 17 and 18 for different DNN scenarios and configurations, in order to illustrate the performance of our framework.

Appendix E: Computation details
All activities have been performed in two main steps. First, all the events were generated on a PC computing cluster using HTCondorCE, which allows the computation to be parallelized by splitting it across several nodes.
Secondly, once all the events were created, we converted them into ROOT files [25] to perform common data preparation, and finally into pandas DataFrames to obtain the training, test and validation datasets. These data were then moved to an IBM Power9 machine, where all the training and evaluation steps were performed. The IBM Power System AC922 is widely used for analytics, artificial intelligence (AI) and modern HPC, providing the data- and compute-intensive infrastructure needed for fast model training.

Fig. 18 Shape of the loss function for specific neural-network topologies used for regression in the semi-leptonic and fully-leptonic channels. In these cases the loss function is not converging: in the first case the network is not improving its parameters at all, giving a flat loss function and relatively unstable results on the validation dataset; in the second case the network starts to improve only after a relatively high number of epochs. For the optimal treatment of each particular convergence scenario, an early-stop automation can be employed, so that the number of training epochs in each case is tuned according to the learning rate and the loss-function behaviour.
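The early-stop automation mentioned in the caption could be realized, for instance, with a Keras callback (an illustrative sketch; the monitored quantity and patience are our choices):

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",         # watch the validation loss
    patience=20,                # stop after 20 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch seen
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=500, callbacks=[early_stop])
```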