Amplitude-assisted tagging of longitudinally polarised bosons using wide neural networks

Extracting longitudinal modes of weak bosons in LHC processes is essential to understand the electroweak-symmetry-breaking mechanism. To that end, we propose a general method, based on wide neural networks, to properly model longitudinal-boson signals and hence enable the event-by-event tagging of longitudinal bosons. It combines experimentally accessible kinematic information and genuine theoretical inputs provided by amplitudes in perturbation theory. As an application we consider the production of a Z boson in association with a jet at the LHC, both at leading order and in the presence of parton-shower effects. The devised neural networks are able to extract reliably the longitudinal contribution to the unpolarised process. The proposed method is very general and can be systematically extended to other processes and problems.


Introduction
Accessing the polarisation of electroweak (EW) bosons at high-energy colliders is crucial to gain insight into the electroweak-symmetry-breaking (EWSB) mechanism, whose nature is currently explained by the Higgs mechanism [1][2][3]. By means of the EWSB, the W and Z bosons are given a mass and a longitudinal-polarisation state. Therefore, any deviation in the production of longitudinal bosons in scattering processes would suggest the presence of new-physics effects, implying a different realisation of the EWSB compared to the Standard Model (SM) one.
The investigation of polarised-boson signals in LHC processes is becoming an important part of the analysis programme of the ATLAS and CMS collaborations with Run-2 data, as shown in recent measurements of di-boson production and vector-boson scattering (VBS) [4][5][6][7]. The increase in statistics of Run-3 and the High-Luminosity phase will drastically improve the precision of current analyses and give access to polarised signals in complex multi-boson processes [8][9][10].
The analysis paradigm for the measurement of polarisations of EW bosons with Run-1 data at 7/8 TeV was the evaluation of angular coefficients of the boson decay rate, which are related to the polarisation fractions. The extraction of angular coefficients in LHC processes was proposed in seminal phenomenological works [11][12][13] and applied in experimental analyses of W+j [14,15] and Z+j [16,17] events, as well as of top-quark decays [18][19][20]. Owing to its simplicity, this strategy has been further investigated and extended in more recent phenomenological studies [21][22][23][24][25][26][27][28]. However, its application is limited [11-13, 25, 26, 29] to inclusive decays (i.e. without selections on single decay products of bosons) and to two-body decays [i.e. without radiative corrections to the decay (e.g. EW corrections)].
A number of recent studies have been carried out targeting the polarisation extraction in the presence of hadronic decays of the weak bosons [44][45][46][47][48]40], both with the polarised-template method and with machine-learning techniques. The usage of machine learning was also proposed to extract polarisation fractions in VBS, starting from the kinematic structure of the events [49][50][51][52][53]45].
Independently of the specific approach that may be used for the interpretation, the polarisation state of an unstable particle, like an EW boson, is not directly accessible in the detectors. Therefore, the information about it can only be reconstructed (in a probabilistic way) from the stable decay products. In other words, the polarisation of EW bosons is a pseudo-observable. On the other hand, from a theoretical perspective, the whole information regarding the fundamental quantum field theory, and therefore the polarisation properties, is encoded in the amplitude. Therefore, accessing the amplitude of scattering processes at experiments would give the maximal information possible, i.e. the maximal predictive power. However, in a realistic experimental environment only the momenta of the visible final states can be reconstructed. This means that the momenta of the initial states as well as their parton type, both needed for the exact evaluation of the amplitude, are a priori unknown. As explained in the rest of the article, machine learning (ML) can actually be used to approximate the amplitudes well, based only on the partial information available experimentally.
It is worth pointing out that in this article we exclusively refer to amplitudes and not matrix elements, which are actually equivalent for our purposes. Our method, though close in spirit, should not be confused with the matrix-element method [54,55] and the optimal-observable method [56][57][58].
The article is organised as follows. In Sect. 2 we explain the difficulties in tagging longitudinal bosons and propose a solution based on wide neural networks and amplitudes. A concrete application of the proposed method to Z + j at the LHC is then detailed and discussed in Sect. 3. In Sect. 4 we draw the conclusions of our work.

Definition of the problem
A generic (unpolarised) amplitude featuring a resonant gauge boson decaying into a lepton-neutrino pair can be written as in Eq. (1) (in the unitary gauge), where M P and M D describe the production and decay part of the amplitude, respectively. The quantities M V and Γ V represent the gauge-boson mass and width, respectively. In particular, the tensor part of the propagator can be cast into a sum over polarisation states, where the {ε µ λ (k)} represent polarisation vectors of the massive gauge boson. The sum runs over four polarisation states, namely the three physical states and a fourth one, whose structure depends on the EW-gauge choice and is thus unphysical. Throughout the article, we use the labels L, +, and − for the longitudinal, right-handed, and left-handed states, respectively. Notice that the polarisation vectors are defined in such a way that they are transverse w.r.t. the boson four-momentum, but do not transform as Lorentz covariants; therefore, they must be defined in a specific Lorentz frame. Also, we would like to emphasise that the definition of the polarisation of the massive gauge bosons is only meaningful when gauge bosons are on the mass shell or when the resonant contributions are treated in the narrow-width [59,60] or pole approximation [61][62][63][64][65][66]. The reason for this is to guarantee gauge invariance.
The amplitude in Eq. (1), including both production and decay parts, can therefore be written as a sum over the polarisation states λ of amplitudes M λ with a polarised intermediate gauge boson. Squaring the unpolarised amplitude leads to Eq. (5), where the first sum represents the incoherent sum over polarised squared amplitudes, while the second one includes all interference terms. For phenomenological purposes it is convenient [30] to define a transverse (T) contribution as the coherent sum of the left- and right-handed contributions to the squared amplitude, leading to the simpler structure of Eq. (7). The term |M L | 2 defines the longitudinally polarised squared amplitude, which is the focus of the present work. Note that the interference terms of Eq. (5) [or of Eq. (7)] are in general non-vanishing and can take either positive or negative values. Hence, the fully differential unpolarised and longitudinal cross sections are schematically built from the corresponding squared amplitudes, the flux factor F, and the phase-space measure dΦ. The differential longitudinal fraction in a generic observable O, which ought to be extracted experimentally, is therefore defined as the ratio of the longitudinal to the unpolarised differential cross section in O [Eq. (9)].
Problem
The main challenge is to extract the longitudinal fraction experimentally for arbitrary observables, i.e. in a fully differential way. In other words, we would like to answer the question: how can we infer on an event-by-event basis the probability for an LHC event to be longitudinally polarised? In what follows, we will address exactly this problem.
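For orientation, the polarisation decomposition described in words above can be sketched in LaTeX as follows. This is a schematic reconstruction rather than the article's own typesetting: it assumes that only the three physical polarisation states contribute, that the transverse amplitude is the coherent sum of the two helicity amplitudes, and that all normalisation factors are left implicit. The symbol names are chosen here for illustration.

```latex
\begin{align}
\mathcal{M} &= \sum_{\lambda} \mathcal{M}_{\lambda}, \\
|\mathcal{M}|^2 &= \sum_{\lambda} |\mathcal{M}_{\lambda}|^2
  + \sum_{\lambda \neq \lambda'} \mathcal{M}_{\lambda}^{*}\,\mathcal{M}_{\lambda'}, \\
|\mathcal{M}_{\mathrm{T}}|^2 &= |\mathcal{M}_{+} + \mathcal{M}_{-}|^2, \\
|\mathcal{M}|^2 &= |\mathcal{M}_{\mathrm{L}}|^2 + |\mathcal{M}_{\mathrm{T}}|^2
  + 2\,\mathrm{Re}\!\left(\mathcal{M}_{\mathrm{L}}^{*}\,\mathcal{M}_{\mathrm{T}}\right), \\
f_{\mathrm{L}}(O) &= \frac{\mathrm{d}\sigma_{\mathrm{L}}/\mathrm{d}O}
                          {\mathrm{d}\sigma_{\mathrm{unpol}}/\mathrm{d}O}.
\end{align}
```

The last interference term in the fourth line is the one that can take either sign, which is why the longitudinal and transverse pieces alone do not need to add up to the unpolarised result.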
As briefly mentioned in Sect. 1, a number of methods have been proposed in past and recent years to address the issue of longitudinal-event tagging.In the rest of this section, we review some of them.
For the interpretation of LHC Run-1 data, the angular-coefficient method was typically applied with polarisation-extraction purposes. It relies on the functional structure of the tree-level decay rate of EW bosons, which can be written as in Eq. (10) [11,12], where θ* and ϕ* are the polar and azimuthal decay angles of a decay lepton in the decayed-boson rest frame, calculated w.r.t. the boson trajectory in a certain Lorentz frame (the one where polarisation states are defined). The coefficients {A 0 , . . ., A 7 }, which are functions of the observable O (independent of decay angles), are related to polarisation fractions f L and f ± via linear combinations. Projecting Eq. (10) onto suitable spherical harmonics of rank 2, the polarisation fractions can then be easily extracted. This strategy is valid for a single boson, in the absence of radiative corrections to the decay, and in a fully inclusive decay-phase-space measure, i.e. without any cut on individual decay products. Applying the projections in the presence of transverse-momentum and/or rapidity cuts on decay products, as done in many experimental analyses, may give results that are far from describing the polarisation structure of a process [12,13,29,22,26]. Extensions of the method to account for multi-boson spin correlations [23,24,27,67] are also limited for similar reasons. The extraction of angular coefficients from decay rates can also be applied differentially in any LHC observable, providing a way to reweight unpolarised LHC events according to polarisation fractions and therefore split the events into longitudinal and transverse samples. This approximate method has been applied for Run-1 V + j events [14][15][16][17] but was proven to fail in certain kinematic regimes, especially due to the wrong assumption that polarisation fractions are the same in the presence and absence of decay-product selections [30].
The most prominent way to extract polarised signals out of Run-2 LHC data is the so-called polarised-template method. Building on a theoretically sound definition of polarised signals at amplitude level [29] that can be systematically extended to higher orders in perturbation theory [33][34][35][36][37][38][39], the method relies on separate templates for each physical polarisation state and for the interference terms. Upon a previous subtraction of reducible and irreducible backgrounds, the (unpolarised) signal events are simultaneously fitted with fully independent polarised templates. This can be applied differentially to any LHC observable. In practice, these fits are restricted to those observables that are thought to be the most sensitive to discriminate between longitudinal and transverse modes. This does not guarantee that the polarisation information is fully exploited from the available data. In addition, the fitting procedure requires quite intensive theoretical calculations that only recently achieved (N)NLO accuracy [33][34][35][36][37][38][39]. To be of any use, such calculations should be performed in a fiducial phase-space volume which is exactly the one used in the experimental analysis, a task that can turn out to be far from trivial.
The idea of using ML methods to facilitate the extraction of polarisation fractions has already been explored in the literature. It has been applied in the case of EW bosons produced in VBS and inclusive di-boson production, both with leptonic [49-51, 45, 52, 53] and hadronic decays [45,46]. The proposed ML approaches typically rely on kinematic observables approximating decay angles in the case of leptonic decays of W bosons, and on jet-substructure observables to treat hadronic decays, with the aim of performing an event-by-event classification, possibly accounting for new-physics effects that may distort the underlying dynamics.
The method we propose to tag longitudinal bosons lies at the intersection of the aforementioned methods, complementing accessible kinematic information of LHC events with a genuine theoretical input given by amplitudes describing the process dynamics.
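The projection idea behind the angular-coefficient method can be illustrated with a toy example in Python. The sketch below is not the procedure used in the cited analyses: it assumes the standard tree-level polar-angle shapes, (3/4)(1 − cos²θ*) for the longitudinal mode and (3/8)(1 + cos²θ*) for the sum of the two transverse modes with equal left- and right-handed fractions, integrates over the azimuthal angle, and applies no cuts on the decay products. Under these assumptions the second moment of cos θ* gives f L = 2 − 5⟨cos²θ*⟩. All function names are hypothetical.

```python
import random


def sample_cos_theta(f_long, rng):
    """Draw cos(theta*) from f_L*(3/4)(1-c^2) + (1-f_L)*(3/8)(1+c^2) by accept-reject."""
    while True:
        c = rng.uniform(-1.0, 1.0)
        pdf = f_long * 0.75 * (1.0 - c * c) + (1.0 - f_long) * 0.375 * (1.0 + c * c)
        # 0.75 is an upper bound of the mixed density on [-1, 1]
        if rng.uniform(0.0, 0.75) < pdf:
            return c


def longitudinal_fraction_from_moment(cos_values):
    """Inclusive-decay projection: f_L = 2 - 5 <cos^2(theta*)>."""
    mean_c2 = sum(c * c for c in cos_values) / len(cos_values)
    return 2.0 - 5.0 * mean_c2


rng = random.Random(1)
true_f_long = 0.3  # toy input value, not a result from the article
sample = [sample_cos_theta(true_f_long, rng) for _ in range(100_000)]
f_long_estimate = longitudinal_fraction_from_moment(sample)
print(f"generated f_L = {true_f_long}, extracted f_L = {f_long_estimate:.3f}")
```

The relation between the moment and f L holds only for the full angular range. Once transverse-momentum or rapidity cuts remove part of the cos θ* distribution, the same projection returns a biased value, which is the limitation discussed above.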
As a last remark before detailing our strategy, we stress that the polarisation structure of a process is model dependent, owing to possibly different dynamics at production and decay level.The advantage of the solution we present in the following is that the model dependence is uniquely encoded in the amplitudes.

A machine-learning-based solution
Equation (9) implies that at the phase-space-point level, the polarisation fraction is equal to the ratio of the longitudinally polarised squared amplitude over the unpolarised one,
r_L = |M_L|^2 / |M|^2. (11)
This statement is exact at leading order (LO) in perturbation theory for each partonic channel occurring in the process. It means that computing r L for a given process requires the knowledge of the full kinematics as well as of the flavours of the partonic-process external particles.
Considering an unpolarised event sample, this ratio can be computed for each event separately. One can therefore tag each event as longitudinally polarised or not by sampling on the value of r L . From the unpolarised event sample, one can therefore obtain a longitudinally polarised sample. This procedure is completely equivalent to generating a longitudinal sample from scratch. It follows that the fully differential knowledge of r L allows for the assessment of longitudinal polarisation on an event-by-event basis. We note that while Eq. (5) does not guarantee this ratio to lie between 0 and 1, in practice it does, which allows a straightforward sampling without having to determine the minimum and maximum of r L beforehand.
As mentioned above, this procedure is exact at LO accuracy. It can actually be extended to a LO sample with parton-shower (PS) corrections using a similar procedure. Considering an unpolarised sample at LO+PS accuracy, one can compute r L with the original event (before showering) and tag the event after showering based on the value of r L . Again, this procedure is equivalent to generating a longitudinal sample at LO+PS accuracy from scratch.
Key concept
Given the possibility to evaluate r L , one can tag events as longitudinally polarised. While doable theoretically, this is unfortunately not possible experimentally, as the evaluation of r L at the event level requires the knowledge of all momenta and flavours of the initial and final partons, information which is not available experimentally. Instead, what is available experimentally is the knowledge of the final-state momenta, which therefore constitutes only partial information. The central idea is therefore to bypass this lack of information by using a neural network (NN) to obtain an approximate value of r L called rL which in turn depends only on the experimental information available. In other words, the NN is trained to mimic r L with incomplete information. Later, we show that this method is applicable in practice.
It implies, therefore, that one can use rL to tag experimental events as longitudinal. The longitudinal fraction extracted in this way can then be compared against theoretical predictions. If r L is computed within the SM, an agreement between the theoretical predictions and the extracted value of the fraction indicates that the data is compatible with SM expectations. On the other hand, a disagreement would be the sign of a failure of the SM to describe the physics at hand. The procedure can be applied not only to the SM but also to any UV-complete model as well as to rather model-independent frameworks like simplified models or effective-field theories.
The method we propose does not require any fitting procedure. It is by definition multidimensional and therefore ensures that all possible information available experimentally is used. It is also very flexible with respect to the phase-space requirements. In fact, if the training is done in an inclusive phase space, the trained model can be used in any fiducial volume that is more restrictive.
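The tagging step described above can be sketched in a few lines of Python. The example is a toy under stated assumptions: the per-event ratios are drawn from an arbitrary beta distribution instead of being computed from amplitudes, they are assumed to lie in [0, 1], and the function name is hypothetical. It shows that keeping each event with probability r L selects, within statistical fluctuations, the same fraction of events as summing the r L weights.

```python
import random


def tag_longitudinal(r_values, rng):
    """Keep event i with probability r_values[i] (assumed to lie in [0, 1])."""
    return [rng.random() < r for r in r_values]


rng = random.Random(7)
# Hypothetical per-event ratios; in the article they come from LO amplitudes.
r_long = [rng.betavariate(2.0, 5.0) for _ in range(50_000)]

tags = tag_longitudinal(r_long, rng)
fraction_tagged = sum(tags) / len(tags)        # unweighted, tagged events
fraction_weighted = sum(r_long) / len(r_long)  # same sample, r_L as weights
print(f"tagging: {fraction_tagged:.4f}  reweighting: {fraction_weighted:.4f}")
```

For a showered sample the same function would be applied with the ratio computed from the momenta before showering, while the tagged events keep their showered kinematics.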
In summary, the key idea of our approach is to relate the tagging of LHC events to a single theoretically clean quantity, using machine learning to cope with incomplete information.

Application: Z+jet
In order to illustrate the newly devised method, we apply it to the extraction of the longitudinal polarisation of a Z boson in Z+j production at the LHC. We would like to emphasise that, in spite of possibly different input features for the training of the NN for a different process, our method is fully general and can be applied to any process featuring one or several Z or W boson(s). The process we consider is
pp → Z(→ µ+ µ−) + j. (12)
While providing a non-trivial test-bed, this reaction is particularly suited for polarisation studies as it has a very high cross section and allows for the full reconstruction of the final state. Note that the production of a muon-antimuon pair is mediated by a photon and a Z boson. However, since we aim at extracting the polarisation of an intermediate Z boson, the photon contribution is regarded as an irreducible background to be subtracted before any polarisation analysis [36,40]. In the presence of a cut on the lepton-pair invariant mass around the Z pole mass, the photon background (as well as the photon-Z interference) is typically small. In the setups considered here (see Sect. 3.1), this irreducible background is at the level of 1%, estimated from a comparison between the Z-mediated signal of Eq. (12) and the full off-shell calculation of pp → j + µ+ µ−.
Since we are interested in polarised signals, we choose to define polarisation vectors in the Lorentz frame where the Z boson and the jet are back to back, which coincides (at LO) with the partonic centre-of-mass frame. This reference frame is the one where the 2 → 2 scattering happens and can be entirely reconstructed up to experimental uncertainties. Therefore, this choice is well motivated both from a theoretical and from an experimental viewpoint. We stress that any polarisation extraction from simulated events or experimental data is frame dependent. This means that, although the general strategy we propose can be applied to any polarisation-frame definition, the application considered here depends on the specific choice of the polarisation frame. In practice, the r L quantity defined in Eq. (11) takes different values when computed for the same phase-space point but for different polarisation-frame choices; therefore, the NN-training stage is tailored to the specific choice of polarisation frame.
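To make the frame choice concrete, the following Python sketch boosts the final-state momenta into the frame where the Z boson and the jet are back to back, and then evaluates the polar decay angle of the antimuon in the Z rest frame with respect to the Z direction in that frame. It is an illustration only: the function names, the (E, px, py, pz) ordering, the choice of the antimuon as reference lepton, and the numerical momenta are assumptions made here, not taken from the article.

```python
import math


def add(p, q):
    """Component-wise sum of two four-vectors (E, px, py, pz)."""
    return tuple(a + b for a, b in zip(p, q))


def boost_to_rest_frame(p, frame):
    """Boost four-vector p into the rest frame of the massive four-vector `frame`."""
    e, px, py, pz = frame
    mass = math.sqrt(e * e - px * px - py * py - pz * pz)
    gamma = e / mass
    beta = (px / e, py / e, pz / e)
    beta2 = sum(b * b for b in beta)
    if beta2 == 0.0:
        return tuple(p)  # `frame` is already at rest
    beta_dot_p = sum(b * c for b, c in zip(beta, p[1:]))
    coeff = (gamma - 1.0) * beta_dot_p / beta2 - gamma * p[0]
    energy = gamma * (p[0] - beta_dot_p)
    return (energy,) + tuple(c + coeff * b for c, b in zip(p[1:], beta))


def cos_theta_star(jet, antimuon, muon):
    """Polar decay angle of the antimuon w.r.t. the Z direction in the Z+jet frame."""
    z_lab = add(antimuon, muon)
    system = add(z_lab, jet)
    z_cm = boost_to_rest_frame(z_lab, system)        # Z in the Z+jet rest frame
    lep_cm = boost_to_rest_frame(antimuon, system)   # lepton in the same frame
    lep_rest = boost_to_rest_frame(lep_cm, z_cm)     # lepton in the Z rest frame
    dot = sum(a * b for a, b in zip(lep_rest[1:], z_cm[1:]))
    norm = math.sqrt(sum(a * a for a in lep_rest[1:]) * sum(b * b for b in z_cm[1:]))
    return dot / norm


# Hypothetical massless momenta in GeV, chosen only to exercise the functions.
jet = (40.0, -10.0, -30.0, math.sqrt(600.0))
antimuon = (50.0, 30.0, 40.0, 0.0)
muon = (60.0, -20.0, 10.0, math.sqrt(3100.0))
c_star = cos_theta_star(jet, antimuon, muon)
print(f"cos(theta*) = {c_star:.4f}")
```

Replacing `system` by a different reference four-vector (for instance the laboratory frame) changes the value of the angle for the same event, which is the frame dependence stressed above.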

Input parameters and event selections
In this section, we list the input SM parameters used for the numerical computations and the event selections considered for the phenomenological analysis.
The simulations are performed at a centre-of-mass energy of √s = 13.6 TeV for proton-proton collisions at the LHC. The parton distribution function NNPDF31_nlo_as_0118 [68] has been used through Lhapdf [69]. The renormalisation and factorisation scales are kept fixed, and the G µ scheme [65,70] is used for the electroweak coupling. Fixed input values are taken for the relevant masses and widths, while the masses or widths of all other particles do not play a role in this process or have been set to zero. Note that these parameters are essentially the default ones in MG5_aMC@NLO [71].
A number of different event selections are used in this work. The first one, which we label generation-level, is characterised by a transverse-momentum and rapidity cut on the leading jet, as well as an invariant-mass cut on the charged-lepton pair. With this setup we have generated the initial parton-level event samples for both unpolarised and longitudinally polarised Z bosons.
The second selection, labeled inclusive, used for some of the phenomenological results with and without PS effects, is characterised by slightly more restrictive cuts, in order to avoid biasing the PS application. Notice that both the generation-level and inclusive setups avoid any additional cut on the Z-boson decay products, making the selections not realistic in a collider environment. However, this allows the training of the NNs to be as inclusive as possible, ensuring that any realistic selection will be enclosed in the phase-space region. Finally, the third selection, which we dub fiducial, is used to mimic a realistic setup at the LHC. In addition to the cuts in Eq. (17), transverse-momentum and rapidity cuts are applied on the charged leptons.

Tools
For the generation of longitudinal and unpolarised parton-level events, we have used version 2.7.3 of MG5_aMC@NLO [32], which allows intermediate-resonance helicity states to be selected in the narrow-width approximation [60]. As a validation of the longitudinal signal, we have compared the MG5_aMC@NLO results against those obtained with the private Monte Carlo framework MoCaNLO, which uses the pole-approximation approach detailed in Refs. [72,33,34,36,40]. Good agreement has been found. In order to compute PS effects, we have used version 8.244 of the Pythia8 program [73] with standard settings. The space-like and time-like showers have been applied with both QCD and QED effects. Concerning QED effects, we veto further photon splittings into fermion pairs in the shower. Note that we have not included multi-parton interactions and hadronisation effects. The reason is to keep this example, while non-trivial, as simple as possible. We argue that these extra effects could simply be included upon performing a new training of the NN. The principle of the ML technique we propose would not be hampered by different scale choices, matching and PS settings that may be needed especially when including higher-order effects. Finally, in order to compute r L for the various partonic channels, we have used the matrix-element provider Recola 1 [74,75].
To obtain an approximate value rL , a machine-learning approach was employed, utilising a feed-forward NN. The model is built from the dataset characterised by the following twelve input features, namely the components of the four-momenta of the (leading) jet, the antimuon, and the muon, along with the quantity r L defined in Eq. (11), representing the continuous label.
The datasets used for training and testing contain 286 073 and 285 187 elements, respectively. As part of the data-preparation process, the dataset has been standardised according to a general procedure where each feature is normalised by subtracting its average value and dividing by its standard deviation. This approach is more suited than a min-max normalisation, due to its lower sensitivity to outliers. The architecture of the NN is wide, and this represents a crucial aspect of the proposed technique. Four hidden layers, consisting of 1000 nodes each, were employed. The mathematical representation of this NN involves a series of transformations. The input x is a vector in R^{12×1}. The subsequent layers, indexed by i ranging from 0 to 4, are computed by applying an affine transformation, with weights W i and biases b i whose dimensions are fixed by the layer widths, followed by the activation function σ. The activation function used was the Rectified Linear Unit (ReLU) [77], defined as σ(x) = max{0, x}. The structure of this NN is depicted in Figure 1.
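A network with the stated shape (twelve inputs, four hidden layers of 1000 nodes, ReLU activations, one output) can be written in PyTorch as sketched below. The class name `WideNet`, the helper `standardise`, the linear output layer, and the toy input tensor are assumptions made for illustration; the article does not specify the output activation or the code layout.

```python
import torch
from torch import nn


class WideNet(nn.Module):
    """Feed-forward network: 12 inputs, four hidden layers of 1000 nodes, 1 output."""

    def __init__(self, n_features=12, width=1000, n_hidden=4):
        super().__init__()
        layers = []
        in_dim = n_features
        for _ in range(n_hidden):
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        # Linear output layer: an assumption, the article does not state it.
        layers.append(nn.Linear(in_dim, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)


def standardise(features, mean, std):
    """Subtract the per-feature mean and divide by the per-feature standard deviation."""
    return (features - mean) / std


torch.manual_seed(0)
# Toy stand-in for the twelve momentum components (arbitrary numbers, not real events).
raw = torch.randn(32, 12) * 50.0 + 10.0
mean, std = raw.mean(dim=0), raw.std(dim=0)

model = WideNet()
prediction = model(standardise(raw, mean, std))
n_parameters = sum(p.numel() for p in model.parameters())
print(prediction.shape, n_parameters)
```

In a real application the mean and standard deviation would be computed once on the training set and reused unchanged for the test set and for any later evaluation.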
The design of the machine learning model in this study follows the principles outlined in Ref. [78].According to the theory presented, the effectiveness of a NN is influenced by the dynamics of its training process.It suggests that NNs with a "deep and narrow" architecture exhibit chaotic dynamics during training, while those with a "shallow and wide" architecture are easier to train.In the asymptotic case, infinitely wide NNs possess a convex loss landscape, enabling the optimal solution to be found through gradient descent.However, such models essentially become linear, losing the non-linear expressivity of the original network and potentially limiting its representational capacity.Therefore, a compromise must be made between the ease of training and the network's expressivity.In our experiments, we achieved satisfactory performance by employing a wide NN with a width-to-depth ratio of 200.Such a choice has been made by empirical experimentation and refinement.
The training of the NN model was performed using the RMSprop algorithm [79], an adaptive learning-rate optimisation method specifically designed for mini-batch learning. The algorithm's parameters were set as follows: learning rate η = 0.001, smoothing constant α = 0.99, weight decay of 0, and momentum of 0. The training process spanned 1000 epochs, with batches of size 500. The code implementing the model was developed in Python 3, making use of the PyTorch library [80].
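The optimiser settings quoted above map directly onto the arguments of `torch.optim.RMSprop`. The loop below is a minimal sketch of mini-batch training with these settings, run on toy data and with a much narrower network so that it finishes quickly. The mean-squared-error loss, the toy label, the network width, and the number of epochs are assumptions made here; the article does not state the loss function.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-in data: 12 standardised features and a continuous label in [0, 1].
features = torch.randn(2000, 12)
labels = torch.sigmoid(features[:, 0] - 0.5 * features[:, 1])

# Narrow network only to keep the example fast; the article uses width 1000.
model = nn.Sequential(nn.Linear(12, 64), nn.ReLU(), nn.Linear(64, 1))

# Optimiser parameters as quoted in the text.
optimiser = torch.optim.RMSprop(
    model.parameters(), lr=0.001, alpha=0.99, weight_decay=0, momentum=0
)
loss_fn = nn.MSELoss()  # assumed loss for a regression on r_L
batch_size = 500


def run_epoch():
    """One pass over the shuffled data in mini-batches; returns the mean loss."""
    permutation = torch.randperm(features.shape[0])
    total = 0.0
    for start in range(0, features.shape[0], batch_size):
        idx = permutation[start:start + batch_size]
        optimiser.zero_grad()
        loss = loss_fn(model(features[idx]).squeeze(-1), labels[idx])
        loss.backward()
        optimiser.step()
        total += loss.item() * idx.shape[0]
    return total / features.shape[0]


losses = [run_epoch() for _ in range(30)]
print(f"first epoch loss {losses[0]:.4f}, last epoch loss {losses[-1]:.4f}")
```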

Results
The key quantity in this application is the ratio r L defined in Eq. (11). It encodes the whole information on the longitudinal-polarisation dynamics, relative to the polarisation balance in the unpolarised process. As such, it determines the shape and normalisation of the kinematic distributions for the longitudinal signal. It is therefore a multi-dimensional function with as many dimensions as the number of random variables needed to generate the momenta of the full final state. As explained above, this ratio is actually different for each partonic channel. It means that in order to evaluate it on an event-by-event basis, one needs not only the full kinematics but also the knowledge of the partonic channel. In order to have a feeling about the structure of r L , we show in Fig. 2 the differential distribution in r L for unpolarised events, in the generation-level setup. We also show, for comparison purposes, the distribution in the corresponding r T and r int ratios, which are defined in Eq. (21). From Eq. (21) and Fig. 2, it is clear that, since all amplitudes are complex numbers, the interference term can take negative values, and both r L and r T can exceed unity. Owing to a peak at 1, the r T distribution in Fig. 2 suggests that in the considered process the transverse-polarisation component is much larger than the longitudinal one. This is a well-known result in the SM [11,12].
It also shows that the proposed method is particularly efficient as it can make full use of this discriminating power.
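The relation between the three ratios can be sketched in LaTeX as follows. This is a reconstruction under the assumption that r T and r int are defined in analogy with r L in Eq. (11), with the interference written in terms of the coherent transverse amplitude; the article's own Eq. (21) may differ in notation.

```latex
\begin{align}
r_{\mathrm{T}} &= \frac{|\mathcal{M}_{\mathrm{T}}|^2}{|\mathcal{M}|^2}, \qquad
r_{\mathrm{int}} = \frac{2\,\mathrm{Re}\!\left(\mathcal{M}_{\mathrm{L}}^{*}\,\mathcal{M}_{\mathrm{T}}\right)}{|\mathcal{M}|^2}, \\
r_{\mathrm{L}} + r_{\mathrm{T}} + r_{\mathrm{int}} &= 1
\quad\Longrightarrow\quad
r_{\mathrm{L}} + r_{\mathrm{T}} = 1 - r_{\mathrm{int}} .
\end{align}
```

With these definitions a negative interference at a given phase-space point forces r L + r T above one, so that an individual ratio is not bounded by unity on general grounds.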
r L -reweighting/tagging
The first key observation that we have made above is that unpolarised event samples can be reweighted/tagged using r L to obtain longitudinally polarised samples. (In the tagging approach, the longitudinal sample is extracted from the unpolarised one by means of a one-dimensional sampling according to r L (or rL) weights. The selected events are then used to compute the longitudinal distributions. Notice that the two methods are equivalent within statistical uncertainties.) This statement does not only hold at LO but also when including PS effects, thanks to the factorisation of the radiative corrections as implemented in a PS, i.e. adding multiple QCD and QED radiations in the collinear approximation at leading-logarithmic accuracy. This can be seen in Fig. 3, where two differential distributions are shown at LO and LO+PS accuracy. The distributions obtained with the r L -reweighting reproduce very well those obtained with longitudinal events generated with the Monte Carlo (MC truth). This confirms that a longitudinally polarised sample can be obtained by simply reweighting an unpolarised one with r L factors. In addition, one can observe that while the PS corrections are sizeable (both in the overall normalisation and in the distribution shapes), the statement about the reweighting is equally true in the presence of PS corrections. From the results of Table 1, it can also be appreciated how the r L -reweighting performs well both in inclusive setups and in the presence of more exclusive selection cuts. The differential results analogous to those of Fig. 3 but in the fiducial setup (not shown here) also highlight an almost perfect behaviour of the reweighting method, as in the inclusive setup.
Leading order
Turning the problem around, the results detailed in the previous paragraph imply that experimental data (here idealised by LO+PS unpolarised events) can be used to extract polarisation fractions, provided that r L is known and can be computed on an event-by-event basis. Actually, r L cannot be computed from experimental data, which do not give access to the full kinematic dependence (including the initial state) and to the flavour of all external particles. To bypass this issue, one can use NNs to obtain an approximation rL of the true ratio, based on incomplete information, namely the one available experimentally, which consists of the visible final-state momenta. Along this line, the first step is therefore to check if one can obtain a good approximation of r L at LO by training a NN in a supervised setting, as described above, with r L as input label and the final-state jet momentum and lepton momenta as incomplete information for the training features.
In Table 2 and Figs. 4-5, two different NNs, predicting rL factors as approximations of the r L ones, are compared against the true Monte Carlo results, both at the level of polarisation fractions and at the level of differential cross sections. The first network (labeled NN 1 , purple curves in plots) underwent training for 10^6 epochs, employing a batch size of 100 and a learning rate of 10^−4. The second network (labeled NN 2 , olive curves in plots) was trained for 2 × 10^3 epochs, using a batch size of 500 and the same learning rate. The generation-level events were used for the training.
The results of [...] model underestimates the longitudinal fraction by 3-4%. In Figs. 4 and 5, the cosine of the angular difference between the positive lepton and the jet as well as the transverse momentum of the jet are shown. The two figures differ in their phase-space regions: Fig. 4 is for the inclusive setup while Fig. 5 is for the fiducial one. One observes that the first NN reproduces the true result better. In general, the agreement is at the per-cent level for the phenomenologically relevant part of the phase space and therefore good enough for our purpose. Also, it is worth pointing out that in suppressed regions of phase space where the statistics is low, the agreement degrades substantially. The limited statistics used for the training stage in this suppressed region does not constrain the NN model strongly enough, leading therefore to a systematic error in the NN-model prediction for rL . For example, above 150 GeV for the transverse momentum of the jet in Fig. 5, the agreement is worse than 20%. This is nonetheless not an issue given that this region is suppressed by two orders of magnitude, meaning that it contributes about 1% to the cross section and therefore introduces only a per-mille error or less in total.
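The differential version of the reweighting can be sketched as two histograms of the same observable, one filled with unit weights and one filled with the per-event ratio as weight; their bin-by-bin ratio is the longitudinal fraction in that observable. The Python example below is a toy: the transverse-momentum values, the linear dependence of the ratio on the observable, the binning, and the function name are hypothetical and serve only to show the bookkeeping.

```python
import random


def weighted_histogram(values, weights, edges):
    """Sum the weights of the values falling in each [edges[i], edges[i+1]) bin."""
    counts = [0.0] * (len(edges) - 1)
    for value, weight in zip(values, weights):
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += weight
                break
    return counts


rng = random.Random(3)
# Hypothetical unpolarised events: jet transverse momentum in GeV.
pt_jet = [rng.uniform(30.0, 230.0) for _ in range(20_000)]
# Hypothetical per-event ratio; in the article it is r_L or its NN approximation.
r_long = [0.1 + 0.001 * pt for pt in pt_jet]

edges = [30.0, 80.0, 130.0, 180.0, 230.0]
unpolarised = weighted_histogram(pt_jet, [1.0] * len(pt_jet), edges)
longitudinal = weighted_histogram(pt_jet, r_long, edges)
fractions = [lon / unp for lon, unp in zip(longitudinal, unpolarised)]
print([round(f, 3) for f in fractions])
```

Applying selection cuts before filling both histograms gives the fraction in a more exclusive setup without any change to the weights themselves.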
Finally, we note that the results for the inclusive and fiducial setups are equally good.The only difference that one can notice is that the fiducial results suffer from larger fluctuations.This can be attributed to the lower statistics used in the fiducial case (≈ 220k events), owing to more restrictive selections that cut away more than half of unpolarised events used in the inclusive setup (≈ 480k events).
Parton-shower effects
Overall, the above results prove that the method is reliable also in typical experimental regions. Nonetheless, this is a simplified version of the problem, as this exercise was performed at LO, i.e. for events of identical multiplicity. A more realistic description of the data necessarily requires PS corrections. Indeed, LHC events are typically affected by several effects such as multi-particle interactions, beam remnants, hadronisation, extra QCD and QED radiations, etc. These phenomena are well described by multi-purpose PS programs like Pythia [73]. In our case, we have included QCD and QED radiations but other effects could equally be included.
In order to account for PS effects in our method, one can try to use the previous NN trained with LO events and apply it to events modeled with PS corrections.The results of this procedure are shown in Fig. 6 for the cosine of the angular separation between the antimuon and the leading jet, and for the leading-jet transverse momentum.Notice that in this case, we have only included effects from QCD PS, avoiding further photon radiations.From the plots, it is rather clear that this approach is failing.The reason for this is that the PS generates more QCD radiations leading to a sizeable distortion of the event kinematics.In fact, comparing Fig. 6 with the fixed-order results in Fig. 3, one observes that applying the NN trained with LO events to LO+PS events tends to reproduce the LO distribution shapes rather than the LO+PS ones.
As shown previously, one can apply a reweighting of the unpolarised sample and then apply the PS procedure, or vice versa, in order to obtain a longitudinally polarised sample with PS effects. This also means that for each showered event one can compute r L with the original LO momenta before PS and therefore associate a meaningful r L to each showered event. One can therefore train a new NN with the original r L (computed before showering) along with the momenta after showering. Given that showered events possess more than one jet, only the four-momentum of the jet with the largest transverse momentum is provided as a feature for the NN-model training.
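The construction of a training example at LO+PS described above can be sketched as follows. This is only an illustration: the (E, px, py, pz) tuple layout, the function names, and the choice of concatenating the two lepton momenta with the leading-jet momentum as the feature vector are assumptions made for the example, not a description of the actual implementation.

```python
import math

def transverse_momentum(p):
    """Transverse momentum of a four-momentum given as (E, px, py, pz)."""
    return math.hypot(p[1], p[2])

def leading_jet(jets):
    """Return the jet four-momentum with the largest transverse momentum."""
    return max(jets, key=transverse_momentum)

def training_example(lep_plus, lep_minus, showered_jets, r_l_lo):
    """Pair post-shower kinematics with the ratio computed before showering.

    Features: showered lepton momenta plus the leading showered jet only.
    Label: r_L evaluated with the original LO momenta of the same event.
    """
    features = list(lep_plus) + list(lep_minus) + list(leading_jet(showered_jets))
    return features, r_l_lo

# Toy showered event (hypothetical numbers, in GeV).
jets = [(50.0, 30.0, 0.0, 40.0), (100.0, 0.0, 60.0, 80.0)]
features, label = training_example(
    (45.0, 20.0, 10.0, 39.0), (40.0, -25.0, -5.0, 30.0), jets, r_l_lo=0.18
)
print(len(features), label)
```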
In order to tackle this problem, we adopted the following training strategy. First, a wide neural network (labeled NN ws ) is trained using a warm-start initialisation [81]. The warm-start approach consists in initialising the model with the weights of a previously trained model. This strategy potentially accelerates convergence, enhances performance, and reduces the need for extensive data, creating an efficient framework for model training. From a physics viewpoint, in the generation chain that starts with the LO process and moves to LO+PS, the second step does not represent a completely new learning task; rather, it is a perturbation of the original process. Accordingly, the procedure takes the network previously trained on the LO events, with its optimal architecture and weights, as the starting point for the training with the LO+PS events. This procedure, which is well known in other machine-learning applications, has not yet been fully exploited in high-energy physics. Skipping a complete NN architecture optimisation is advantageous because the best model is identified faster and computational resources are saved. For the sake of comparison, a second general NN (labeled NN no ws ) is built from scratch, looking for the best depth-to-width ratio with a randomly chosen initial configuration.

Table 3: Longitudinal-polarisation fractions at LO+PS determined from Monte Carlo-truth longitudinal events (MC) and from reweighting of unpolarised events with NN-predicted rL (NN), in the inclusive and fiducial setups. Monte Carlo uncertainties on the fractions are shown in parentheses. The NN-predicted fractions are assigned Monte-Carlo-like uncertainties according to the number of events at testing level.
The results provided by these two models are reported in Table 3 and Figs. 7-8 (green curves for NN ws , blue curves for NN no ws ). As one can see from both integrated and differential results, the NN with warm start outperforms the one built from scratch. From the results, it is clear that the warm start has beneficial effects on the NNs. Firstly, it biases the training towards solving a similar task, allowing the network to adjust its parameters to the new data, which limits the search space and leads to faster convergence. Additionally, as the problem becomes easier to solve, the quality of the solution improves. From a physics viewpoint, the good behaviour of the NN ws implies that the LO step is actually of crucial importance to be able to use this method in an experimental analysis. In particular, the results are per-cent accurate at the level of polarisation fractions, which is good enough for the level of precision of this study. Considering differential observables, Fig. 7 refers to the inclusive case while Fig. 8 refers to the fiducial case. The same conclusions as at fixed order hold, namely that the limited statistics do play a role in the accuracy of the method, as can be observed in the far tails of the transverse-momentum distribution in Fig. 8 or in other phase-space regions which are the least populated ones. However, at 100 GeV in the transverse-momentum distribution of the leading jet, a 10-20% mismodelling can be observed. These effects cannot be solely attributed to the statistics but should be considered as a systematic error of the NN. This mismodelling might originate from PS effects, as shown in Fig. 3, where the region around 100 GeV marks a quantitative change in the PS corrections. This could also be interpreted as a limitation of the NN model in capturing all features. Nonetheless, these 10-20% discrepancies appear in bins that are suppressed by almost two orders of magnitude and therefore they are not physically significant when integrating over the whole transverse-momentum spectrum. Hence, the method proposed here is still per-cent accurate.
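The difference between the warm-start and from-scratch trainings compared above can be illustrated with a deliberately small toy. The sketch below is not the network used in this work: it assumes a single wide hidden layer, a one-dimensional input, per-event gradient descent, and invented target functions standing in for the LO and LO+PS labels. All names are hypothetical.

```python
import copy
import math
import random

def init_params(n_in, width, rng):
    """Cold start: random weights for a network with one wide hidden layer."""
    return {
        "W1": [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(width)],
        "b1": [0.0] * width,
        "w2": [rng.gauss(0.0, 1.0 / math.sqrt(width)) for _ in range(width)],
        "b2": 0.0,
    }

def forward(params, x):
    """Return the network output and the hidden activations for one input."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    out = sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]
    return out, hidden

def mse(params, xs, ys):
    """Mean squared error of the network on a data set."""
    return sum((forward(params, x)[0] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(params, xs, ys, epochs, lr):
    """Per-event gradient descent on the squared error; updates params in place."""
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            out, hidden = forward(params, x)
            err = out - y
            for j, h in enumerate(hidden):
                # Gradient w.r.t. the hidden pre-activation, using the old w2[j].
                grad_pre = err * params["w2"][j] * (1.0 - h * h)
                params["w2"][j] -= lr * err * h
                params["b1"][j] -= lr * grad_pre
                for i, xi in enumerate(x):
                    params["W1"][j][i] -= lr * grad_pre * xi
            params["b2"] -= lr * err
    return params

def warm_start(trained_params):
    """Warm start: reuse architecture and weights of an already trained network."""
    return copy.deepcopy(trained_params)

rng = random.Random(0)
xs = [[rng.uniform(-1.0, 1.0)] for _ in range(64)]
y_lo = [0.3 + 0.2 * math.sin(2.0 * x[0]) for x in xs]   # toy stand-in for LO labels
y_ps = [y + 0.05 * x[0] for x, y in zip(xs, y_lo)]      # toy perturbed (showered) task

nn_lo = train(init_params(1, 32, rng), xs, y_lo, epochs=50, lr=0.01)
nn_ws = train(warm_start(nn_lo), xs, y_ps, epochs=10, lr=0.01)
nn_no_ws = train(init_params(1, 32, rng), xs, y_ps, epochs=10, lr=0.01)
print(mse(nn_ws, xs, y_ps), mse(nn_no_ws, xs, y_ps))
```

The only difference between the two final trainings is the starting point: a copy of the weights learned on the first task versus a fresh random draw. The printed losses depend on the toy setup and carry no physical meaning.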
Event tagging with rL
As already discussed in Sect. 2.2, the probabilistic interpretation of r L leads to the expectation that the rL predicted by the NN models is positive. However, the NN models have no physics insight about this constraint and rL is not always positive.
It turns out that at LO+PS level the NNs are able to predict positive rL for more than 99% of the events, both in the inclusive and in the fiducial setup. Interestingly, at fixed order the performance is worse, with positive longitudinal weights predicted for only roughly 95% of the events. The events for which the NN predicts negative weights give a harder p T,j 1 spectrum compared to the events with positive rL , highlighting that a dedicated training with boosted events is needed in order to improve the accuracy of the NN also in boosted regimes. We checked that discarding the events with negative rL , in spite of a partial improvement in the reproduction of the transverse-momentum shapes, overestimates the overall longitudinal fraction by several per cent. A viable strategy could be to include a suppression function for negative rL at the level of the last layer of the NN models, as a small step toward physics-informed approaches. Such approaches have already been applied to classification tasks in particle physics, where enforcing symmetry under Lorentz-group transformations provides a much more physically interpretable model [82]. However, there is no guarantee of improved accuracy in the NN. In our specific case, enforcing the positivity of the label actually worsens the overall performance. This is due to the introduction of constraints that complicate the landscape of the loss function, resulting in more challenging geometries with multiple local minima. As a result, training becomes less effective, leading to poorer predictions from the model. We have refrained from investigating this aspect further, as the LO+PS results are satisfactory for the present application.
So far we have used the expressions reweighting and tagging interchangeably. However, while the reweighting strategy can be applied also in the presence of negative weights, the event tagging is no longer well defined in that case. In other terms, performing a longitudinal tagging according to the NN predictions is not possible for events with negative rL . While at LO accuracy this means throwing away 5% of the events, at LO+PS accuracy, which is the most important case as it mimics the experimental environment, less than a per cent of the events have to be thrown away, which is good enough for our purposes.
To illustrate the applicability of the method, we show final results for the event tagging at LO+PS accuracy, which turn out to be equivalent to the reweighting ones. In Table 4, the polarisation fractions obtained with tagging are compared to the MC-truth ones. As expected, these results are good and almost equivalent to the reweighting results provided in Table 3, since only very few events have a negative rL . Notice that for this comparison we have only considered the NN that employs the warm-start approach (NN ws ). The results are equally good at the differential level, as can be observed in Figs. 9 and 10 for the inclusive and fiducial setup, respectively. In particular, in these plots, the MC-truth longitudinal distributions are compared with those obtained by reweighting and tagging according to rL . Both are equivalent up to statistical fluctuations. This finally demonstrates that using rL with experimental inputs enables an actual longitudinal-polarisation tagging on an event-by-event basis.
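The tagging step discussed above can be pictured as a sampling of each unpolarised event according to its predicted ratio. The snippet below is a sketch under stated assumptions: the accept-reject rule, the function names, and the choice of normalising the tagged fraction to the taggable events only are illustrative and are not confirmed details of the procedure used for Table 4.

```python
import random

def tag_longitudinal(r_l, rng):
    """Tag one event as longitudinal (True) with probability r_l, else False.

    Events with a negative predicted ratio cannot be tagged and yield None.
    """
    if r_l < 0.0:
        return None
    return rng.random() < r_l

def tagged_fraction(weights, r_l_values, rng):
    """Weight fraction of events tagged as longitudinal.

    Assumption: untaggable events (negative ratio) are removed from both
    the numerator and the denominator.
    """
    tagged = 0.0
    taggable = 0.0
    for w, r in zip(weights, r_l_values):
        tag = tag_longitudinal(r, rng)
        if tag is None:
            continue  # negative ratio: the event is discarded
        taggable += w
        if tag:
            tagged += w
    return tagged / taggable

# Toy usage with hypothetical predicted ratios.
rng = random.Random(42)
r_l_pred = [0.21, 0.35, -0.02, 0.10, 0.27]
print(tagged_fraction([1.0] * len(r_l_pred), r_l_pred, rng))
```

For a large sample with few negative ratios, the tagged fraction fluctuates around the reweighted one, which is the equivalence up to statistical fluctuations noted above.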

Discussion
With the non-trivial LHC application detailed in Sect. 3, we have shown that one can assess the polarisation fraction on an event-by-event basis using amplitude information by resorting to machine learning. The method is per-cent accurate and particularly versatile. In this section we discuss limitations of the method as well as possible extensions, generalisations, and further applications.

Validity of the method
In the present example the training phase has been performed on events spanning a very inclusive sample. The trained models have then been used on a reduced phase space, as in typical experimental analyses. This ensures that the method is used in its region of validity. We therefore recommend always performing the training on a more inclusive phase space than the one actually used in the analysis. If this is not the case, it is not guaranteed that the method will still work, as the network has not been trained (and thus validated) in the whole region considered at testing level. While it is not excluded that the NN can perform some extrapolation outside its training region, this has to be carefully verified. In particular, relying on the extrapolation power of the NN might require a different NN and potentially a dedicated study, for instance to recast the out-of-support extrapolation problem into a problem of within-support generalization.

While the specific application considered in this work concerns pp → Z + j at the LHC, the general idea of training a NN with experimentally accessible kinematic information and squared polarised amplitudes can be applied to any single- or multi-boson process at colliders. We stress that in order to apply this strategy to another process, a new NN has to be constructed, relying on the corresponding input features that depend on the experimental signature. For example, for processes with final-state neutrinos, whose momenta cannot be fully reconstructed, the input feature would be the missing transverse momentum instead of the complete momenta of the neutrinos.
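Returning to the recommendation that the analysis phase space be contained in the training one, a crude check can be sketched as follows. Using per-feature minima and maxima as a proxy for the training support is an assumption made here for simplicity; it is a necessary rather than sufficient condition, and the function names are hypothetical.

```python
def feature_ranges(training_features):
    """Per-feature (min, max) over the training sample."""
    columns = list(zip(*training_features))
    return [(min(col), max(col)) for col in columns]

def inside_support(event, ranges):
    """True if every feature of the event lies within the training range."""
    return all(lo <= x <= hi for x, (lo, hi) in zip(event, ranges))

def fraction_outside(test_features, ranges):
    """Fraction of test events with at least one feature outside the ranges."""
    outside = sum(1 for ev in test_features if not inside_support(ev, ranges))
    return outside / len(test_features)

# Toy usage: two features per event (hypothetical numbers).
train_sample = [[20.0, -1.0], [150.0, 0.5], [80.0, 1.0]]
test_sample = [[30.0, 0.0], [200.0, 0.2]]
ranges = feature_ranges(train_sample)
print(ranges, fraction_outside(test_sample, ranges))
```

A non-negligible fraction of events outside these ranges would signal that the network is being asked to extrapolate.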
Error propagation
As formulated here, the method provides a numerical value corresponding to the experimentally extracted fraction, which can be compared against theoretical predictions. Nonetheless, this extracted value carries uncertainties from different sources: the accuracy of the theory prediction it relies on, the limited statistics of the training data set, and the experimental accuracy (both statistical and systematic) of the data.
Usually, the theoretical uncertainty on the prediction is assessed by means of variations of the factorisation and renormalisation scales. The envelope of the values of r L extracted for different scale combinations would then provide the theory uncertainty associated to r L . This quantity being a ratio of squared amplitudes, we expect the correlated scale uncertainties to be rather small, owing to cancellations between the numerator (longitudinal matrix element) and the denominator (unpolarised).
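The envelope prescription mentioned above can be written in a few lines. In this sketch the dictionary keys are (renormalisation, factorisation) scale factors relative to the central choice; the specific set of scale combinations and all numerical values are hypothetical and only serve the illustration.

```python
def scale_envelope(r_l_by_scale, central=(1.0, 1.0)):
    """Return the central value with its downward and upward envelope shifts.

    r_l_by_scale maps (mu_R factor, mu_F factor) to the value of the ratio
    (or of a fraction derived from it) obtained with that scale choice.
    """
    values = list(r_l_by_scale.values())
    c = r_l_by_scale[central]
    return c, min(values) - c, max(values) - c

# Toy usage with invented numbers for a correlated variation of both scales.
variations = {
    (1.0, 1.0): 0.250,
    (2.0, 2.0): 0.252,
    (0.5, 0.5): 0.249,
}
central_value, shift_down, shift_up = scale_envelope(variations)
print(central_value, shift_down, shift_up)
```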
The uncertainty related to the finite size of the training sample can be inferred by performing the training with different sample sizes or by performing error propagation in the NN. The same applies to the experimental error associated to the reconstructed event kinematics, and it can be estimated by repeating the method using pseudo-data.
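One simple way to picture the use of pseudo-data is a bootstrap over the tested events. This is a swapped-in, generic resampling technique chosen for illustration, not the procedure of this work: it assumes unit-weight events, it only probes the statistical spread of the extracted fraction, and the function name is hypothetical.

```python
import math
import random

def bootstrap_fraction_uncertainty(r_l_values, n_replicas, rng):
    """Mean and standard deviation of the fraction over bootstrap replicas.

    Each replica (pseudo-data set) is drawn with replacement from the
    predicted per-event ratios and has the same size as the original sample.
    """
    n = len(r_l_values)
    fractions = []
    for _ in range(n_replicas):
        replica = [r_l_values[rng.randrange(n)] for _ in range(n)]
        fractions.append(sum(replica) / n)
    mean = sum(fractions) / n_replicas
    variance = sum((f - mean) ** 2 for f in fractions) / (n_replicas - 1)
    return mean, math.sqrt(variance)

# Toy usage with hypothetical predicted ratios.
rng = random.Random(7)
r_l_pred = [rng.uniform(0.0, 0.5) for _ in range(1000)]
print(bootstrap_fraction_uncertainty(r_l_pred, n_replicas=200, rng=rng))
```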
It is important to consider that NNs are complex models, and training them using stochastic gradient descent over a non-convex landscape does not guarantee optimal parameter quality upon convergence. In contrast, linear and kernel models can be trained more efficiently due to their convex and low-dimensional loss landscapes, albeit resulting in simpler predictors. The non-linearity of NNs allows them to learn new and more effective representations of the data, a process known as feature learning [78]. This feature-learning effect makes NNs more powerful but also presents challenges in their training process. In our approach, we have opted for wide NNs that strike a balance between linear and nonlinear models. While infinite-width NNs are equivalent to linear models and enjoy convex optimization landscapes [83], wide networks with finite width exhibit a slightly more challenging training landscape. Nevertheless, training wide NNs remains effective, with the difficulty of the landscape increasing as the depth-to-width ratio grows. Considering all these aspects related to the complex behavior of ML models, we refrain from assigning any intrinsic uncertainty to the predicted output.
Model independence
As already mentioned, the polarisation of weak bosons is a pseudo-observable and its extraction necessarily implies some degree of model dependence. In the method we propose, the model dependence is encoded into the r L function. In the present work, the SM is considered: this means that events are tagged according to SM expectations and that the extracted longitudinal fractions should be compared against theoretical predictions within the SM. Such a model dependence is impossible to avoid. In fact, even a simplified version of r L relying only on the boson-decay matrix elements still depends on the polarisation fractions determined by the model-specific production mechanism. However, since the method proposed in this work is model agnostic, the same procedure can be performed with more general models, i.e. simplified models or effective field theories, or even with UV-finite theories.
Extension to higher orders
In the present work, we have restricted our analysis to LO+PS accuracy. Nonetheless, it is in principle possible to extend this to higher orders in perturbation theory, at least for QCD corrections to processes with leptonically decaying bosons. If one can produce a sample of unweighted events at a given order in perturbation theory, the presented method can be extended.
Having unweighted events at fixed order implies having events with different jet multiplicity (depending on the order considered). It means that for each multiplicity i = 0, 1, ..., the exact r i L can be computed using loop and/or tree amplitudes, depending on the accuracy of the sample. As ratios, they should actually be free of infrared singularities if QCD dependencies factorise from the polarisation effects. As in the presented application, the NN can learn a single approximate rL based on the experimental input available, in particular by feeding only the leading jet(s) in transverse momentum. Adding PS or further corrections can then be achieved as shown in the previous sections.
We stress that we have not explicitly tested this method with higher orders and therefore, if one wants to use this proposal, it should be carefully checked first. In particular, the main assumption here is that QCD corrections and polarisation effects factorise to a large extent (as PS and polarisation effects do). This implies that the inclusion of EW corrections would probably require a more refined analysis, given that they are known not to factorise.
Generalisation to other problems
As highlighted several times, the key aspect of the method is to encode the whole physics problem in one single ratio (in the present application r L ) which can be approximately reconstructed using incomplete information thanks to NN methods. It therefore implies that the method can be applied to any physics problem that can be cast in this form. The only requirement is that the key quantity is bounded, as is the case for ratios of amplitudes. It also means that appropriate problems for this method include the extraction of a signal over a background, which is very common in experimental particle physics.

Conclusions
The polarisation of heavy gauge bosons encodes the intricate structure of the electroweak sector of the Standard Model. The theoretical study and the experimental extraction of such pseudo-observables are thus of prime importance for the present and upcoming physics programme of the LHC. It is therefore key to combine our theoretical understanding with all the information available in experimental data in order to probe the structure of the Standard Model at the deepest level.
In this work, we have designed an original method to extract polarisation fractions using the maximal information encoded in the amplitude, thanks to the versatility of neural networks. The key feature is that all information is encapsulated in a single number which can be computed on an event-by-event basis. In particular, the neural network is able to construct a particularly good approximation of this quantity, which can then be evaluated with incomplete information, namely the one available in experiments. This number allows one to assess whether an event is most likely longitudinally polarised or not. In this way, all information, i.e. the fully differential information, is exploited and not only the information contained in one or several observables, as is the case for other methods. It also means that no fitting procedure is required. Another advantage is that the theory dependence is clearly identified, as it is only encoded in the amplitude. Finally, the amplitude considered can be that of arbitrarily general or specific models of quantum field theory.
To illustrate the method, we have applied it to the extraction of the longitudinal polarisation of a Z boson in the hadronic process pp → Z + j, in the leptonic decay channel at the LHC. We have demonstrated that the idea works with per-cent accuracy by resorting to the sequential training of a neural network. In particular, when the method is used in actual experimental analyses, the closure tests that we have presented here should be carried out to ensure the correctness of the results.
Finally, we point out that the method we have developed is very general. It can therefore be applied to other problems and/or generalised. In particular, the method seems to be particularly appropriate for the extraction of signals over irreducible or even reducible background.

Figure 3: Longitudinal reweighting of unpolarised events with r L (orange) compared to MC-truth longitudinal events (red) at LO (solid) and LO+PS (dashed). Absolute differential cross sections are shown in the top panel, ratios of reweighted results over MC-truth ones are shown in the bottom panel. The following observables are considered: cosine of the angular separation between the antimuon and the leading jet (left), leading-jet transverse momentum (right). The inclusive setup is understood here.

Figure 4: Longitudinal reweighting of unpolarised events with rL predicted by two different NN models (olive and purple curves) compared to MC-truth longitudinal events (red curve) at LO. Absolute differential cross sections are shown in the top panel, ratios of reweighted results over MC-truth ones are shown in the bottom panel. The following observables are considered: cosine of the angular separation between the antimuon and the leading jet (left), leading-jet transverse momentum (right). The inclusive setup is understood.

Figure 5: Same structure as Fig. 4. The fiducial setup is understood.

Figure 6: Longitudinal reweighting of unpolarised LO+PS events with rL from the NN model trained with LO events (olive curve), compared to MC-truth longitudinal events (red curve) at LO+PS (QCD shower only). Absolute differential cross sections are shown in the top panel, ratios of reweighted results over MC-truth ones are shown in the bottom panel. The following observables are considered: cosine of the angular separation between the antimuon and the leading jet (left), leading-jet transverse momentum (right). The inclusive setup is understood.

Figure 7: Longitudinal reweighting of unpolarised events with rL from two different NN models (blue and green curves) compared to MC-truth longitudinal events (red curve) at LO+PS (both QCD and QED showers included). Absolute differential cross sections are shown in the top panel, ratios of reweighted results over MC-truth ones are shown in the bottom panel. The following observables are considered: cosine of the angle between the antimuon and the leading jet (left), transverse momentum of the leading jet (right). The inclusive setup is understood.

Figure 8: Same structure as Fig. 7. The fiducial setup is understood.

Figure 9: Longitudinal reweighting (solid green) and tagging (dashed green) of unpolarised events with rL predicted by the NN ws model compared to MC-truth longitudinal events (solid red) at LO+PS (both QCD and QED showers included). Absolute differential cross sections are shown in top panels, ratios over MC-truth ones are shown in bottom panels. The following observables are considered: cosine of the angle between the antimuon and the leading jet (top left), rapidity separation between the antimuon and the leading jet (top right), transverse momentum of the antimuon (bottom left), transverse momentum of the leading jet (bottom right). The inclusive setup is understood.

Figure 10: Same structure as Fig. 9. The fiducial setup is understood.

Distributions in the r L , r T and r int quantities defined in Eqs. (11) and (21), all normalised to the unpolarised total cross section. The generation-level setup is understood.

Table 1: Longitudinal-polarisation fraction determined from MC-truth longitudinal events and from r L -reweighting of unpolarised events, in the inclusive and fiducial setups. Monte Carlo uncertainties on the fractions are shown in parentheses.

Table 2: Longitudinal-polarisation fractions at LO determined from Monte Carlo-truth longitudinal events (MC) and from reweighting of unpolarised events with NN-predicted rL (NN), in the inclusive and fiducial setups. Monte Carlo uncertainties on the fractions are shown in parentheses. The NN-predicted fractions are assigned Monte-Carlo-like uncertainties according to the number of events at testing level.

Table 4: Longitudinal-polarisation fractions at LO+PS determined from MC-truth longitudinal events (MC) and sampling (NN-sampl) of unpolarised events according to rL predicted with the NN ws model, in the inclusive and fiducial setups. Monte Carlo uncertainties on the fractions are shown in parentheses. The NN-predicted fractions are assigned Monte-Carlo-like uncertainties according to the number of events at testing level.