From Optimal Observables to Machine Learning: an Effective-Field-Theory Analysis of e + e − → W + W − at Future Lepton Colliders

We apply machine-learning techniques to the effective-field-theory analysis of the e + e − → W + W − processes at future lepton colliders, and demonstrate their advantages in comparison with conventional methods, such as optimal observables. In particular, we show that machine-learning methods are more robust to detector effects and backgrounds, and could in principle produce unbiased results with sufficient Monte Carlo simulation samples that accurately describe experiments. This is crucial for the analyses at future lepton colliders given the outstanding precision of the e + e − → W + W − measurement ( ∼ 10 − 4 in terms of anomalous triple gauge couplings or even better) that can be reached. Our framework can be generalized to other effective-field-theory analyses, such as the one of e + e − → t ¯ t or similar processes at muon colliders.

For many processes, including e + e − → W + W − and e + e − → t t, it is essential to extract the information from differential distributions, which are sensitive to the new physics contributions.Different methods vary in terms of complexity and effectiveness (i.e.how much information is extracted).One of the simplest approaches is to bin the distribution and combine the likelihood (chi-square) from the rate measurements in all bins. 1 However, it becomes impractical if the phase space of the process has a large dimension.For example, assuming both W s are on shell, each e + e − → W + W − event can be characterized by five angles, one for the production and two from each W decay.A 5-dimensional binned distribution can have too many bins for a practical analysis, while projecting the distribution to a lower phase-space dimension necessarily throws away part of the information.The other extreme is to construct the likelihood from all the measurement events directly, given that one could calculate the differential cross section (in terms of the model parameters), which is essentially the probability density function up to normalization.This method is often denoted as the matrix-element method, as one needs to first calculate the matrix element.It could, in principle, extract the maximum amount of information from data and provide the best statistical reach, but it can be time and computational-power-consuming.
There also exists a method that is both simple and effective, but with the requirement that the observables depend on (new physics) parameters only at the linear level [33].In this case, one only needs to measure a set of so-called "optimal observables", from which the full likelihood (as in the matrix-element method) can be constructed, and the method is thus statistically optimal.More specifically, assuming the differential cross section is of the form where S 0 and S 1 are functions of the phase space Ω, and g i are a set of physics parameters, it was shown in Ref. [33] that the inverse covariance matrix of the g i corresponding to the optimal reach are given by where L is the integrated luminosity.The c −1 ij can be obtained from the optimal observables, defined as O i = S 1,i S 0 , with the relation c −1 ij = nV ij where V ij is the covariance matrix of the optimal observables and n is the number of events.The central values of g i are given by where E is the expectation and E 0 is the SM expection with all g i set to zero.In the SMEFT framework, assuming the new physics scale Λ is sufficiently larger than the electroweak (EW) scale and the energy scale of the experiment, the leading new physics contribution to observables is given by the interference term between the SM amplitude and the one of dimension-6 operators.Keeping only this leading contribution, and discarding the sub-leading contributions such as the dimension-6 squared terms and the interference between SM and dimension-8 operators, observables can be written in terms of Eq. (1.1) where S 0 is the SM differential cross section and g i are the dimension-6 Wilson coefficients.Importantly, this linear approximation gives a very good parameterization of almost all the Higgs, EW and top measurements at future lepton colliders, as a result of the outstanding precision that can be achieved. 2oth the matrix-element method and the optimal-observable one rely on accurate knowledge of the differential cross section for the measured process, which can not be achieved in practice.Detector effects, backgrounds, as well as initial and final state radiations, all need to be taken into account.While some of these effects can be calculated (to some extent) and convoluted into the differential cross section, doing so would easily generate very complicated expressions for the differential cross section, making the calculation impractical.Generally speaking, there is a distinction between the "truth-level" differential cross section (for which one could calculate an analytic expression) and the measured one.For the latter, one could at best have a sample from Monte Carlo simulation, assuming it accurately simulates all the effects mentioned above. 3One may argue that, with the clean environment at future lepton colliders, systematic effects are subdominant, and the truth-level differential cross sections would be very good approximations of the measured ones.However, for exactly the same reason, lepton colliders can reach a high measurement precision, demanding the precision of theoretical predictions to be at least at a similar level.As we will show explicitly later, an unacceptable bias in the reach of the Wilson coefficients could be introduced if (small) systematic effects are not taken into account.It is therefore desirable to extract a "detector level" differential cross section, or equivalent statistical quantities such as the likelihood ratios, directly from the Monte Carlo samples.
This goal, while difficult for conventional methods, has become increasingly achievable in recent years with the rapid development of machine learning techniques in high energy physics  (see also Refs.[64][65][66][67][68] for reviews of the subject).With powerful computing tools and enough training samples, a neural network can be trained as a good estimator of the detector-level likelihood ratios, which can be used on real data to obtain the optimal reaches on the model parameters (Wilson coefficients), at least in the limit of large MC statistics and perfect training.Furthermore, the Monte Carlo tools also allow one to exploit the truth-level information that could greatly improve the training efficiency and results [36][37][38].
In this paper, we apply machine learning methods to a well-studied process at lepton colliders, e + e − → W + W − .Recent SMEFT analysis using optimal observables have shown that the measurements of this process at future colliders can reach a precision of ∼ 10 −4 or even better in terms of anomalous triple gauge couplings (aTGCs), under somewhat idealistic assumptions [18,22].We consider the tree-level process with contributions from CP-even dimension-6 operators and focus on the semi-leptonic channel.On top of the truth-level process, we apply detector simulations and include e + e − → ZZ as a possible background.We compare the performances of the optimal observables and machine learning methods, and in particular, a machine-learning version of the optimal observables (denoted as SALLY in [36,37]) and its variations.We found that, while at the parton tails.See e.g.Refs.[34,35] for more discussions.level, the optimal observables give the optimal reach and the machine-learning methods have no advantage, with detector effects and backgrounds, the optimal observables easily generate an unacceptable bias, while a much smaller bias (subject to limited MC training samples and imperfect training) can be obtained with appropriate machine-learning methods.This demonstrates, as proof of principle, that machine-learning methods are much more robust under detector effects and backgrounds, which can be crucial for future collider analyses.
The rest of this paper is organized as follows: In Section 2 we provide a brief overview of the theoretical and phenomenological aspects of the e + e − → W + W − process.In Section 3, we lay out the framework of our machine learning analyses.The details of the sample preparation and training are described in Section 4. Our results are presented in Section 5, which are based on a circular collider running at 240 GeV.Finally, we conclude in Section 6.In Appendix A, we provide the numerical χ 2 for the full EFT parameterization (excluding modifications of m W ) for several scenarios considered in Section 5.In Appendix B, we present the results of the individual neutral network models which provides a measure of the systematic error due to imperfect training and illustrates the importance of the averaging step described in Section 4.
2 The e + e − → W + W − process Measuring e + e − → W + W − process is an important part of the precision EW and Higgs program at future Lepton colliders.Conventionally, the new physics contributions are parameterized in terms of the aTGCs, assuming that they enter only the 3-gauge-boson vertex in the left diagram of Fig. 1.In a more general framework, all contributions from dimension-6 operators in SMEFT should be considered [69].Focusing on CP-even treelevel effects, contributions from dimension-6 operators to e + e − → W + W − can be written in terms of effective couplings, including the 3 independent aTGCs, δg 1Z , δκ γ and λ Z , and modifications to the electron gauge couplings δg ℓ W , δg e Z,L and δg e Z,R (or the muon ones for muon colliders), where s w ≡ sin θ W , c w ≡ cos θ W and θ W is the weak mixing angle.Dimension-6 operators could also modify the W -boson mass, m W .Here we simply set m W to be SM-like, given that it will be measured to a precision of O(10 −5 ) (at the MeV level) at future lepton colliders.Note that a large set of operators could modify the W branching ratio.While they need to be included in a more general global-fit framework, here we ignore their effects for simplicity.Practically, this can be done by including only the differential information and discarding the likelihood (chi-squared) of the total rate measurement since modifications of the W branching ratio only affect the latter.With these assumptions, the new physics effects in e + e − → W + W − can be fully captured by the following six parameters: Our machine-learning analysis in the next section is based on this 6-parameter framework.Furthermore, in Section 5 we mainly focus on the 3 aTGCs, δg 1Z , δκ γ and λ Z , assuming other parameters are already sufficiently constrained by Z-pole and W (width and branching ratio) measurements.It is straightforward to extend the framework and include more parameters, including coefficients of CP-odd operators.However, the focus of this paper is on the machine learning methods rather than providing the most general global-fitting results.We adopt the narrow width approximation (NWA) and assume both W bosons are on shell.As such, each W W event can be characterized by five angles, as shown in Fig. 2. θ is the production polar angle between the incoming e − and the outgoing W − .4θ 1 is the angle between f 1 and the W − (with W − → f 1 f2 , where f 1 = ℓ − , d, c) in the rest frame of W − , ϕ 1 is the angle between the decay plane and the production plane formed by e − , e + , W − , W + ; For W + , a similar pair of angles θ 2 , ϕ 2 can also be defined with , is universal for all W decay channels at the parton level with massless quarks and leptons.However, for hadronic decays, the strong QCD shower creates multiple hadrons and forms jets, making the two quarks highly indistinguishable in practise 5 .For the rest of this work, Figure 2: Assuming the W bosons are on shell, each W W event can be characterized by 5 angles, including the production polar angle θ and 2 angles from each W decay.Note that θ 1 and θ 2 are defined in the rest frames of W − and W + , respectively.the corresponding differential cross sections are "folded" by taking the average of two different quark configurations.We also found the impact of NWA to be minimal from our numerical result.This is due to the difference between NWA and full matrix element results largely cancel after taking joint likelihood ratios.

Simulation-based inference by machine learning
In this section, we will introduce the simulation-based (or equivalently likelihood-free) inference [66] framework of our machine learning analysis.We first list here the essential statistical quantities and the notations we use for them, which mainly follow the ones in Refs.[36][37][38].For simplicity, we will often use the unbolded symbols θ, x and z instead of θ, x and z, and they still generally denote a column of quantities rather than a single one.
• θ is the set of parameters of interest, which we wish to determine from experiments.
In SMEFT, they are simply the Wilson coefficients of the higher dimensional operators.More explicitly, in our study, we have For most of the results presented in Section 5, we will focus on the 3-aTGC parameterization, θ = { δg 1Z , δκ γ , λ Z }.
• x is the set of detector-level observables that we can measure.In principle, one could have a very large set of observables at multiple levels, including the lowlevel observables such as detector readout, etc.Here we focus on the particle-level observables, assuming some basic particle reconstructions (e.g.jet clustering) has already been done.These include the IDs (if any) and 4-momenta of visible particles, as well as observables constructed from them, such as the reconstructed 5 angles.For a given event e, the values of x are denoted as x e .
• z is the set of truth-level observables, including the ID and 4-momenta of all particles and the observables constructed from them.They cannot be measured in experiments, and are called "latent variables" in Refs.[36][37][38].However, in MC simulations, we have access to them (which are just the parton level events) and also know the mapping between x e and z e for any given event.
• p(z|θ) is the truth-level probability density function for a given parameter θ.It is the differential cross section normalized by the total cross section, 1 σ dσ dΩ .This can be computed analytically, at least in principle.Given a set of N events, we can construct a total likelihood from p(z|θ), given by N Π e=1 p(z e |θ), with z e the values of observables of the event e.In practice, it is more convenient to use the log likelihood, which turns the product into a sum.
• p(x|θ) is the detector-level probably density function.Formally, it can be convoluted from the truth-level probably density function as where p(x|z) contains the information on the parton showering, detector simulations, etc. Due to the stochastic nature of the generative process from z → x, it is generally not possible to directly calculate p(x|θ).For N events one could construct a total likelihood, N Π e=1 p(x e |θ).
• p(x, z|θ) = p(x|z)p(z|θ) is the joint probability density of x and z.
• r(x|θ 0 , θ 1 ) = p(x|θ 0 ) p(x|θ 1 ) is the likelihood ratio between two hypothesis, θ 0 and θ 1 .According to the Neyman-Pearson lemma, the likelihood ratio provides the optimal statistical test for a binary hypothesis problem.By custom, θ 1 usually denotes the SM, with all Wilson coefficients of higher dimensional operators set to zero.For a given set of events, we can directly construct the total likelihood ratio between θ 0 and θ 1 as is the joint likelihood ratio of x and z.Note that we assume p(x|z) is universal for all events and thus cancels in the expression, making this quantity calculable with relatively low computational cost.Following the discussion in [37], the knowledge on the joint-likelihood r(x, z|θ 0 , θ 1 ) is able to determine r(x|θ 0 , θ 1 ) with sufficient data, which will be explained in detail below.
• For a given test sample with N events, one could also include the information of the total rate in the likelihood ratio.Assuming the total rates follow Poisson distributions, the total likelihood ratio can be written as where X denotes collectively all the detector-level observables in the full data set, n 0 (θ 0 ) and n 1 (θ 1 ) are the expected number of events for hypothesis θ 0 and θ 1 .dσ ≡ dσ dΩ is the differential cross section.(Note that n 0 = σ(θ 0 ) • I and n 1 = σ(θ 1 ) • I, where the integrated luminosity I cancels in the ratio).It is then convenient to define a modified likelihood ratio [41], and similarly the modified joint likelihood ratio, r(x, z|θ 0 , θ 1 ) ≡ dσ(z|θ 0 ) dσ(z|θ 1 ) .Note that dσ(z) and dσ(x) correspond to the parton-level and detector-level differential cross sections, respectively.The modified (joint) likelihood ratio r differs from the unmodified r only by a factor of σ(θ 0 )/σ(θ 1 ).The total log likelihood ratio for N events can then be written as 6log R(X|θ 0 , θ 1 ) = N e=1 log r(x e |θ 0 , θ 1 ) + n 1 − n 0 . (3.5) In reality, we do not have access to p(x|θ), but only MC simulated events sampled with p(x|θ).Our goal is to obtain an estimator of the (modified) likelihood ratio, r(x|θ 0 , θ 1 ) or r(x|θ 0 , θ 1 ),7 from a set of (or multiple sets of) MC samples which we call the training samples.Ideally, this estimator should closely resemble the true likelihood ratio.This is done by constructing a so-called "loss function(al)" L, which is a functional of the estimator r(x|θ 0 , θ 1 ) or some variations of it.A required feature of the loss function is that it is minimized by the true likelihood ratio in the large statistics limit.Many ways of constructing the loss function exist, some of which rely only on MC samples.However, in MC simulation, the parton level events sampled by dσ(z|θ) are also available and can be exploited to increase the training efficiency, as pointed out in Refs.[36][37][38].This requires the knowledge of dσ(z|θ) and the mapping between x e and z e , both of which are accessible from MC simulations.
In practice, the integration of dxdz p(x, z|θ 1 ) from Eq. (3.6) is represented by the MC generation/sampling procedure with parameter θ 1 .The loss functional L is recognized as the loss function evaluated event-by-event correspondingly.In the context of machine learning, one is able to construct a general ansatz for r(x|θ 0 , θ 1 ) and obtain a good estimator by numerically minimizing the loss function of a set of training samples with respect to the neural network parameters, which is often referred to as the training process.
In the most general case, each training only gives us the likelihood ratio of two points (θ 0 and θ 1 ) in the model parameter space.However, if additional assumptions are made, for instance, p(x|θ) is linear or quadratic in θ, then only a finite set of training is needed to obtain an estimator r(x|θ, θ 1 ) for the entire parameter space of θ.In SMEFT analyses, a quadratic dependence of the Wilson coefficients is generally a good assumption and has been implemented in e.g.Ref. [41,46].Furthermore, with the large statistics and high precision of the e + e − → W + W − measurement at future lepton colliders, it is a very good approximation to keep only the leading linear contributions of the dim-6 Wilson coefficients, in which case the likelihood ratio and the joint likelihood ratio could be simply written as where n is the number of model parameters (dim-6 Wilson coefficients) and α i (z) are just the parton-level optimal observables.Once we obtain an estimator αi (x) for each α i (x), we could construct an estimator r(x|θ, θ 1 ) for the entire parameter space θ.We could thus directly construct a loss function in terms of the αi (x).By plugging in Eq. (3.10) and replacing the integration dxdz p(x, z|θ 1 ) with the sum over the events sampled with p(x, z|θ 1 ), the loss function in Eq. (3.6) can be written as (3.11)One could then obtain a loss functional for each αi of the form e |α i (x e ) − α i (z e )| 2 by choosing a suitable set of θ 0,i (i.e. by switching on one Wilson coefficient at a time).Finally, the n loss functionals could simply be summed together into a more compact form, which is minimized when each one is minimized.We therefore have where, by default, z e and x e are always sampled according to p(x, z|θ 1 ).Eq. (3.12) is minimized for αi (x e ) = α i (x e ) in the large statistics limit.The dependence on (θ 0 − θ 1 ) is dropped out as it is independent of x and z.The method using the loss function in Eq. (3.12) is denoted as the Sally (Score Approximates Likelihood Locally) method in Ref. [37], where the α i s are the "scores."As we noted above, it is essentially a machinelearning version of the optimal observable analysis, and is the main method that we use.
We have compared it with several other methods and found Sally to give the best result in our analysis, which is expected since the linear approximation (for the dependence on Wilson coefficients) works very well in our case.Finally, with an estimator r(x|θ, θ 1 ), 8 for a set of test samples (either from actual experiments or from simulations) one could construct the χ 2 as a function of the model parameters θ, given by which can be used to set limits on the model parameters θ. ∆χ 2 (θ) = χ 2 (θ) − χ 2 min (θ) can be obtained by first finding the minimum of Eq. (3.13) and then subtract it from Eq. (3.13).

Machine Learning with Background events
In particle physics, background events are inevitable in most studies for various reasons.In general, background events will be involved in the analysis due to a finite detector resolution that projects events with distinct z representations into the same x vector.In many cases, the region in z where background events dwell may even has no overlap with that of signal events.This case becomes quite generic when the definition of z includes all truth-level information from an MC generator, such as the identities of intermediate states.Moreover, in practice, the simulation of signal and background processes does not have to be within the same framework or even handled by varying MC generators.
We first recognize the signal and background regions (z sig and z bkg ) in the z space so that the two regions cover all possible events and do not overlap with each other: We assume all backgrounds are insensitive to parameter θ so that ∂ ∂θ p(z|θ) ≡ 0 when z ∈ z bkg for simplicity.The likelihood of p(x|θ) is made up by the signal (p(x|θ) sig ) and background (p(x) bkg ) contribution.The θ dependence of the latter is omitted since p(z|θ) does not depend on θ.According to the above discussion, the key for simulationbased inference is to predict the likelihood ratio r(x|θ 0 , θ 1 ) and its modified counterpart r(x|θ 0 , θ 1 ).In the presence of backgrounds, their relationship reads where the subscripts indicate the contribution from z sig or z bkg accordingly.For precision measurements, the expansion in Eq. (3.9) on r(x|θ, θ 1 ) is modified as (3.16) The factor S(x|θ 1 ) ≡ dσ(x|θ 1 ) sig dσ(x|θ 1 ) sig +dσ(x) bkg ⩽ 1 modifying α stems from the impact of background events around a certain observable configuration x.Its value is not known in general from the MC sampling due to the highly stochastic projection from the latent z space to the observable x space.As a result, the value of S(x|θ 1 ) will be evaluated by machine learning methods.
Interestingly, the loss function in Eq. (3.12) can stabilize around the correct output if the training sample is weighted properly.Let's begin with a training sample being a mixture of signal and background events, with their number ratio the same as the ratio of their cross sections in the SM (θ 1 ).When the sample size is very large, for any coarsegrained observable point x ′ , the total weight of signal and background samples will be proportional to dσ(x ′ |θ 1 ) sig and dσ(x ′ ) bkg , respectively.For the subset of samples around x ′ , the loss function L(α) takes minimum when The above equation is satisfied when αi (x ′ ) = S(x ′ |θ 1 )α i (x ′ ), which can be directly used to calculate r(x ′ |θ, θ 1 ).Therefore, the Sally algorithm described in the last section is applicable to cases with non-trivial backgrounds when trained with the correctly weighted samples.Besides the method above, any other method that reproduces S(x|θ 1 )α(x) as its output is also valid for simulation-based inference.One of the simplest ways is to train a separate classifier model on a mixed sample of signal and background events since S(x|θ 1 ) is a typical output of a classifier with a binomial log-likelihood or quadratic loss [72].Such a classifier model can be trained separately and work together with a Sally inference model trained only on signal event samples.The product of the two models' output will be the final output for R(X|θ, θ 1 ) calculation.

Preparation of samples
The e + e − → W + W − signal events can be first classified into three cases depending on how many leptons are generated from the W pair decay.Although the dileptonic channel with both W decaying leptonically is potentially the cleanest one that is largely free from hadronic systematics, it has the smallest branching ratio (≲ 10%).In addition, with two missing neutrinos, their momenta can not be reconstructed unambiguously. 9Conversely, the fully-hadronic channel with both W decaying to quarks has a large branching ratio but suffers more from hadronic systematics.The charge information of outgoing W 's crucial for the analysis is also significantly washed out by QCD effects.In this paper, we focus on the semileptonic channel where only one of the W bosons decays leptonically, which enjoys a large branching ratio, uniquely reconstructed missing momentum, and W charge information kept by the isolated lepton track.
In order to evaluate the performance of different methods when various systematic effects are present, Monte Carlo (MC) simulations of e + e − → W + W − → qqℓν events are performed with proper data processing.In particular, we use Madgraph 5 [73] to generate unweighted e + e − → W + W − → qqℓν samples at √ s = 240 GeV. 10 The initial state radiation (ISR) from the electron beam is also incorporated using the method of Ref. [74].The parton shower and hadronization are handled by Pythia 8 [75].Delphes 3 [76] is used to incorporate detector effects with the detector profile derived from the ILD [77].A lepton is isolated if the net energy of other particles within its ∆R < 0.5 cone is smaller than 0.12 (electron) or 0.25 (muon) of the lepton energy.All jets are clustered with the antik t algorithm [78] with R = 0.6.During the procedure, only SM samples are generated as demonstrated in Eq. (3.6) (where θ 1 denotes the SM).The background samples are simulated in the same way, while in this case, we only consider ee → ZZ → qqℓℓ as the major background for simplicity.Such events will fake the semileptonic W W signal if one of the final state leptons is not identified.The reasons for being unable to identify a lepton may include the low lepton energy below the threshold, large |η| away from the detector coverage, or accidental escape from gaps between detector materials.By choosing the ZZ → qqℓℓ event as the representative background, we make the assumption that the background is insensitive to the SMEFT dim-6 contributions.This is not true in reality, but is nevertheless a good approximation as the leading dim-6 contribution to this process effectively only modifies the Zf f vertices, which are constrained by Z-pole measurements. 11We claim that the purpose of our simulation is not to derive the potential of future lepton colliders with ML in the most realistic context.Instead, we aim to provide a proof of concept that, with the help of ML, the future lepton colliders can overcome various sources of systematics and reach the target precision, while a full simulation of background events and their EFT dependence will be left to future studies.
For the event selection, we apply only the basic requirements that each event contains exactly two jets and one lepton, and all three particles are required to have p T > 5 GeV.Finally, all simulated samples are converted to a vector of data as the input for ML methods.In our case, the input will be dimension 21.The first 16 dimensions consist of the four-momenta of outgoing leptons, jets, and neutrinos.Here the neutrino fourmomentum is obtained by subtracting the vector sum of the momenta of the two jets and the lepton from the initial state with √ s = 240 GeV.The vector of input also contains the five angles (cos θ, cos θ 1 , ϕ 1 , cos θ 2 , ϕ 2 ) derived from the momenta above, forming an N × (16 + 5) = 21 × N -dimensional dataset, where N is the number of events.Notice that this is only a fraction of the original information of the whole event, which typically involves tens of outgoing particles.Including more detailed event information is certainly compatible with our framework, resulting in a much larger x space.Since we target validating the machine-learning framework in the context of precision tests of the SM, only basic jet and lepton kinematics are involved to reduce the complexity and cost of network training.
Given that the key observable in our study is the differential distribution of the five angles (cos θ, cos θ 1 , ϕ 1 , cos θ 2 , ϕ 2 ), it is important to see how it is modified from the parton level to the detector level.We show in Fig. 3 the individual distributions of the five angles at the parton level and the detector level (top panel), as well as the distributions of the difference between the parton level and the detector level angles.

Machine Learning Algorithms and Training Details
For the need of different strategies, three types of training and testing samples are prepared.
• Sample-T (Truth level) is the truth-level sample (before hadronization) with signal only.
• Sample-D (Detector level) is the detector-level sample with signal events only.
• Sample-DB (Detector level with Background) is the detector-level sample with both the e + e − → W + W − signal and the e + e − → ZZ background.
Each set of training sample has a total number of 2×10 6 events.For Sample-T and Sample-D, all events are semilepton W W signal.For Sample-DB, 90% are W W signal and 10% are ZZ background events (which pass the pre-selection).The signal-to-background ratio, 9:1, is slightly larger than the actual one, which we found to be around 15:1 after the preselection cuts.The ratio 9:1 is chosen for simplicity and being conservative, since our goal is to study the performance of machine-learning methods under background rather than estimating the actual reach.Each set of validation sample has 5 × 10 5 W W events, and again for Sample-DB this contain 10% ZZ events.As mentioned above, we use the Sally method to train the machine-learning models that outputs an estimator of the likelihood ratio.Three simulation-based inference networks are trained from the three different samples.They are denoted as Sally-T, Sally-D and Sally-DB models, respectively.For all background events, we set α i (z e ) = 0 in the loss function in (3.12), which implies the approximation that background is independent of θ (and is always SM-like).
In Sally-DB, the differential cross sections of signal and background are combined with the correct weighting (and assuming background is always SM-like) as described in Section 3.2.Conversely, the background can be filtered by a trained classifier network before the inference network.Sally-DC (Detector level with Classifer) combines Sally-D and a classifier that filters the background from the signal.The classifier networks is trained separately on the mixed Sample-DB, while the inference network is identical to the Sally-D model with no extra training.For comparison, we also consider the OOC method, which is optimal observables (OO) combined with the Classifier that removes backgrounds.
One of the key motivations of this work is to evaluate various ML strategies for EW precision tests at lepton colliders.A simple fully connected neural network (FCNN) is match the detector level jets to the parton-level ones (which are distinguishable) with the smaller overall angle differences.This almost always gives the correct pairing due to the small smearing effects.While the smearing effects are small and approximately symmetrical, they can still have nontrivial impacts on the angular distributions.For instance, the polar angle distributions are slightly flattened.The effects are barely visible in Fig. 3.However, due the high precision goal of our study, these small effects could have a significant impact on the results if not properly taken account of, as we will show in Section 5.
adopted instead of a more delicate network structure to reduce the training cost.Such a FCNN setup with O(10 5 ) parameters is able to handle the 21-dimensional data from semileptonic e + e − → W + W − events.All inference networks are of 9 layers and 200 nodes per layer for our inference network for Sally algorithms.The input of the neural network is the observable x, which has a size of N × 21, where N is the number of events in the training set, and the 21 features include the four-momentum of four final-state particles and five angles.The outputs are a set of αi (x), one for each model parameter.For the classifier, we simply change the output of the neural network to be a binary output, which is 1 for signal and 0 for background.In the training process, we use the ADAM optimization [81] algorithm for stochastic gradient descent and apply early stopping [82] to prevent over-fitting, as well as learning rate decay strategy.
To better evaluate machine learning strategies and in particular, to eliminate the random noise from the training phases, it is beneficial to consider the ensembles of individual models [83].We introduce Sally-DA (Detector level Average), Sally-DBA (Detector level with Background Average) and Sally-DCA (Detector level with Classifier Average), which are averaged versions of Sally-D, Sally-DB and Sally-DC, respectively.In each case, we trained 8 neural network models in parallel, which are different from each other in terms of the parameterization and training history.In particular, the initial randomization of weights and the random segmentation/batching of the training data are different for the 8 models, while the architecture of all 8 models are the same.The reconstructed α (obtained by minimizing the loss function in Eq. (3.12)) of the 8 models are averaged to produce the final result.Assuming the biases from different models are uncorrelated, the averaging would effectively reduce the model uncertainty.As we will show in Appendix B, results of the individual models do exhibits non-negligible systematic error (especially in the central values) which can be effectively reduced by the averaging step.

Results
With the setup in the previous section, we perform machine learning analyses to obtain an estimator for the likelihood ratio from training samples, which is used to construct a χ 2 of the validation sample.For better comparison between different algorithms, when presenting the limits on EFT parameters, the corresponding signal event number for each limit is scaled to 10 4 events.This is equivalent to a luminosity of 2.2 fb −1 .Note that this is only a tiny fraction of all proposed future lepton collider runs of O(ab) −1 .
While the χ 2 is obtained for the most general parameterization in Eq. ( 2.3), we choose to present the results from a global fit of 3 aTGCs, {δg 1Z δκ γ , λ Z }.This is because a global fit with all 6 parameters contains flat directions and is unbounded without the inclusion of other EW measurements, e.g.Z-pole observables.Such a 3-aTGC parameterization is well-justified for circular colliders with a tera-Z program, as shown in e.g.Ref. [18].The full χ 2 for each sample type (obtained from the method with the best performance) is provided in Appendix A.
Our main results are presented in Fig. 4, which shows the limits on the three aTGCs for various scenarios.The results in each row come from the same (3-parameter) global fit, which is projected on the (δg 1Z , δκ γ ), (δκ γ , λ Z ) and (λ Z , δg 1Z ) planes, respectively.Each contour shows the 68% confidence level (CL) bound for two d.o.f., which corresponds to ∆χ 2 = 2.28.For the first row, the parton-level Sample-T, which corresponds to the most ideal case, is used.Two relevant sets of results are shown, which are the one from applying Optimal Observables (OO, dashed blue), and the one from Sally-T (green).To better illustrate the performance of various algorithms, the minimal χ 2 point (i.e., the center) of the OO result is calibrated to the origin such that δκ γ = λ Z = δg 1Z = 0.This is because OO at the truth level is the ideal inference method.It will not be affected by systematic effects and can thus always recover the truth value. 13All other contours are evaluated with the same test sample set and calibrated by the same amount as the ideal ones, ensuring that deviations shown are due to systematic effects or training-induced quantities instead of MC fluctuations.In this case, Sally-T has no obvious advantage over OO since it suffers from imperfect training.Nevertheless, we found Sally-T to have a good performance, and its results are close to the ideal ones.
In the second row of Fig. 4, Sample-D is used, which includes parton shower, hadronization, various detector effects, and ISR radiations.Three sets of contours are presented, corresponding to the ideal limit (the same as the OO result in the first row), the results from Sally-DA, and OO (which is now directly applied on detector-level events).For OO, a significant bias in the central value of the fit is observed, which is the result of detector effects that make p(x|θ) different from p(z|θ).In particular, the five ϕ and θ angle recovered from detector level information deviates from the truth, leading to inaccurate α predictions.Conversely, Sally-T is replaced by Sally-DA, and its performance only slightly deviates from the ideal case.Such robustness of the Sally-DA model under detector effects is a clear advantage over Optimal Observables.
Finally, in the last row of Fig. 4, we show the performance on Sample-DB, which includes both detector effects and backgrounds from e + e − → ZZ.It is clear that, while the background only contributes 10% to the total events, failing to incorporate background information results in unacceptably large shifts of central values, as shown by the OO and Sally-D results.The OO results also have uncertainties that are even smaller than the ideal ones (the blue ellipses are smaller).This is another form of bias, suggesting an overestimation of the corresponding precision reach.Instead, Sally-DBA (red) can take account of the background effects and significantly reduce the systematic bias compared to the previous methods, disregarding the effects of backgrounds.
In Fig. 5, we further compare the performances of different ML-based strategies on sample-DB.These include OOC, Sally-DCA and Sally-DBA.By combining with a classifier filtering out background events, OOC is able to deliver a far better performance than OO.Due to the low background acceptance of the classifier trained, the overall perfor-  The central values of the 3 aTGCs in various scenarios, which provide a measure of the bias of different methods.Note that the ideal central values are zero by construction, and so are the ones for OO on Sample-T due to the calibration (see Fig. 4).mance of OOC is very similar to the OO method directly applied to Sample-D, which is, however, still unacceptable for the precision reach of future lepton colliders as we discussed previously.In contrast, Sally-DCA is able to handle the impact of both background events and noise from systematic effects, and has a slightly better performance than OOC.It is also notable that Sally-DBA delivers better performance than Sally-DCA with a smaller model thanks to its integrated structure handling all systematic effects simultaneously.Each input event, either signal or background ones, will be assigned with a proper target output α(z).Backgrounds that occasionally pass through the classifier network in Sally-DCA, instead, may sit in a location in the x space where the (signal-event-only) training sample has zero density, giving rise to unregulated α output.The reconstructed central values of all scenarios are presented in Table 1 for reference and illustrated (for Sample-DB) in Fig. 6.We also note that the averaging step in Sally-DBA and Sally-DCA (as well as Sally-DA when applied to Sample-D) is important for reaching the desired performance level.Figure 6: The results in Table 1 for Sample-DB presented as bar plots (with only the magnitude of the central values).Note that OO and Sally-DA are generically not applicable to samples with backgrounds, showing significant biases as expected.
Even with Sally-DBA, there is still a bias in the central values of the 3 aTGCs around O(10 −3 ).This bias is mainly due to limited MC sample size and imperfect training.Therefore, with more MC simulation and computing resource (which should be easily achievable in the future), this bias will hopefully be reduced to the desired level for the actual run scenarios of future lepton colliders, which typically has a precision reach of ∼ 10 −3 -10 −4 for the 3 aTGCs (so that the bias needs to be below O(10 −4 )).To illustrate this, we show in Fig. 7 the change of the bias in the central values of the 3 aTGCs with Sally-DB where the training sample size is varied from 2 × 10 3 to 2 × 10 6 events.We can see (up to statistical fluctuations) a clear pattern that the bias decreases with the increase of training statistics, and very roughly scales as 1/ √ N (where N is the total number of training events).It is thus reasonable to expect that the bias would continue to decrease as the sample size goes beyond ∼ 10 6 .Of course, further studies are required to check if the bias could indeed be reduced to the desired level.

Conclusion
Proposals of various future lepton collider projects have been actively investigated by the particle physics community.While we should remain optimistic that at least one of these proposals will be realized, it is likely that even in the most optimistic scenario, several decades are needed before a sizable amount of data is obtained from any future lepton collider.On the other hand, the field of machine learning, as well as its applications in high energy physics, has been highly active in recent years.With the anticipated developments in both machine-learning algorithms and computing resources, it is foreseeable that machine learning will play a more important role in future collider experiments and become an essential part of many physics analyses.Understanding the physics potential of machine-learning methods is thus an important aspect of future collider studies.
In this paper, we applied machine-learning methods to the SMEFT analysis of the e + e − → W + W − process at a 240 GeV lepton collider, and compared their performances with the ones of conventional methods, in particular the Optimal Observables.Our study is based on MC simulations at the detector level.We focus on the semi-leptonic channel, e + e − → W + W − → jjℓν, and consider the e + e − → ZZ → jjℓ + ℓ − background with a missing lepton.As a proof of principle, we have shown that optimal observables are subject to unacceptable biases if various systematic effects (such as ISR and detector effects) and backgrounds are not taken into account.On the other hand, the machinelearning version of the optimal observables, SALLY, and its variations, are much more robust under the systematic and background effects.Such methods could be ideal for the SMEFT analyses at future lepton colliders assuming that one has access to accurate MC simulation tools.
The ability to mitigate background effects is essential for future lepton colliders, as the target precision is often very high.Even relatively small SM backgrounds can cause big errors in results.The background events often have profiles similar to signal events but have distinct physics origins.The truth-level description and MC simulation framework of signal and backgrounds could differ a lot accordingly, but it is hard to tell from each other.Facing the challenge, we derived the algorithm to handle general backgrounds.In particular, two approaches are proposed to deliver EFT constraints with proper background mitigation.It is then demonstrated that by either combining a specific classifier network or training the inference network with proper samples, the impact from backgrounds is well under control.The numerical result also suggests that the latter approach with a single network has a moderately higher precision than the case with an extra classifier.
Limited by time and resources, a few interesting (and challenging) aspects remain to be studied in depth in this work.We hope that future work can address these open questions.One of the most obvious possibilities is to apply the machine-learning methods also to the dileptonic and fully-hadronic channels and evaluate the improvements.In particular, the hadronic channel has a sizable branching ratio but is more challenging due to jet measurement uncertainties and larger backgrounds.Secondly, it is desirable to achieve more realistic MC simulation and model training in future works.Due to the limitation on computing resources, we have used 2 × 10 6 events for training and 5 × 10 5 for validation, which are orders of magnitudes less than the actual sample size at future lepton colliders (∼ 10 8 ).To verify if the observed biases can be indeed decreased to the target level, a much larger signal and background sample size will be necessary.Beyond e + e − → W + W − , one could also apply our methods to other SMEFT analyses, with e + e − → t t being an obvious example.It will be interesting to explore different machinelearning techniques and more complex neural network architectures in the hope of finding more efficient and robust methods.As a final remark, we stress that machine learning is most beneficial when theoretical and experimental uncertainties are addressed correctly.In reality, MC simulation may not perfectly match experimental data, which could induce intrinsic (theoretical) uncertainties in the machine learning analysis.Taking account of such uncertainties could be crucial for studies at future lepton colliders.We leave these important directions to future studies.

B Results of individual neutral network models
As mentioned in Section 4 and Section 5, for each one of Sally-DA, Sally-DCA and Sally-DBA, we train 8 neural network models and take the average of the reconstructed α of the 8 models to produce the final result.This averaging step is important for reducing the systematic error of individual models due to imperfect training.To illustrate this, we show in Fig. 8 the limits on the 3 aTGCs obtained from the 8 individual models of Sally-DA (top row), Sally-DC (middle row) and Sally-DB (bottom row).In each case, it is clear from Fig. 8 that the results of the 8 individual models have a significant spread.

Figure 4 :Figure 5 :
Figure4: The 68%CL contours (∆χ 2 = 2.28) in the (δg 1,Z , δκ γ ) (left), (δκ γ , λ Z ) (middle) and (λ Z , δg 1,Z ) (right) planes obtained from the 3-aTGC fit, assuming a signal sample (semi-leptonic W W ) with 10 4 events.Each row corresponds to a different sample set, with the results obtained from several methods.Top: Sample-T with Optimal Observables (OO, which coincides with the ideal result in this case) and Sally-T methods.Middle: Sample-D with OO and Sally-D methods.Bottom: Sample-DB with OO, Sally-D and Sally-DBA methods.All contours are calibrated so that the ideal result is always centered at the SM prediction δκ γ = λ Z = δg 1Z = 0.The deviations of each contour's center, therefore, only stem from systematic effects or training-induced quantities instead of MC fluctuations.
The results for the individual Sally-D, Sally-DB and Sally-DC models are presented in Appendix B. The performances of the individual models are affected by their modeling and training history, which causes non-negligible systematic error, especially in the central values, as shown in Fig. 8.The averaging step of the α output reduces the variances picked up by individual models during training.

Figure 7 :
Figure 7: The bias of Sally-DBA (in terms of the central values of aTGCs) trained on different size training set, which shows an decay in the bias for all 3 aTGC parameters.

- 0 .---Figure 8 :
Figure 8: The 68%CL contours as in Fig. 4 but displaying the results of the 8 (un-averaged) neural network models separately.Sally-D applied to Sample-D.Middle: Sally-DC applied to Sample-DB.Bottom: Sally-DB applied to Sample-DB.