Unbinned multivariate observables for global SMEFT analyses from machine learning

Theoretical interpretations of particle physics data, such as the determination of the Wilson coefficients of the Standard Model Effective Field Theory (SMEFT), often involve the inference of multiple parameters from a global dataset. Optimizing such interpretations requires the identification of observables that exhibit the highest possible sensitivity to the underlying theory parameters. In this work we develop a flexible open source framework, ML4EFT, enabling the integration of unbinned multivariate observables into global SMEFT fits. As compared to traditional measurements, such observables enhance the sensitivity to the theory parameters by preventing the information loss incurred when binning in a subset of final-state kinematic variables. Our strategy combines machine learning regression and classification techniques to parameterize high-dimensional likelihood ratios, using the Monte Carlo replica method to estimate and propagate methodological uncertainties. As a proof of concept we construct unbinned multivariate observables for top-quark pair and Higgs+$Z$ production at the LHC, demonstrate their impact on the SMEFT parameter space as compared to binned measurements, and study the improved constraints associated to multivariate inputs. Since the number of neural networks to be trained scales quadratically with the number of parameters and can be fully parallelized, the ML4EFT framework is well-suited to construct unbinned multivariate observables which depend on up to tens of EFT coefficients, as required in global fits.


Introduction
The extensive characterization of the Higgs boson properties achieved at the LHC [1][2][3] following the tenth anniversary of its discovery [4,5] represents a powerful example of the unique potential of precision measurements to unveil hypothetical signals of beyond the Standard Model (BSM) physics in high-energy collisions. This potential motivates ongoing efforts within the theory and experimental communities to develop novel frameworks, tools, and analysis techniques that enhance the sensitivity of precision LHC measurements to BSM signals in comparison with more traditional approaches.
Since the early days of quantum field theory, effective field theories (EFTs) have proven to be a robust framework to describe the low-energy limits of theories whose ultraviolet completions are either unknown or too challenging to compute predictions with. Of particular relevance for the model-independent interpretation of LHC measurements is the Standard Model effective field theory (SMEFT) [6][7][8] (see also the reviews in [9][10][11][12][13]), which extends the SM while preserving its (exact) symmetries and its field content. In order to maximize the constraining power of this framework and explore the broadest possible region in the parameter space, it is advantageous to integrate the information contained in different types of processes within a coherent global SMEFT analysis. Several groups have presented combined SMEFT interpretations of LHC data from the Higgs, top-quark, and electroweak sectors, eventually complemented with the information from low-energy electroweak precision observables (EWPOs) and/or flavor data from B-meson decays, e.g. [14][15][16][17][18][19][20]. These analyses rely on unfolded binned distributions provided by the experiments; that is, they are based on reinterpreting "SM measurements" within the SMEFT framework.
In addition to such a combination of multiple datasets and processes, another avenue towards improved SMEFT analyses is provided by the design of tailored observables characterized by enhanced, or even maximal, sensitivity to the underlying Wilson coefficients for a given process. Optimal observables are able to maximally exploit the kinematic information contained within a given measurement, event by event, to carry out parameter inference by comparing with the corresponding theoretical predictions. The low-multiplicity final states present in electron-positron collisions make them particularly amenable to this strategy, and optimal observables have been used in the context of parameter fitting at LEP, e.g. [21,22], and for future lepton collider studies [23]. Constructing optimal observables is instead more difficult in hadron collisions, where the higher complexity and multiplicity of the final state, the significant QCD shower and non-perturbative effects, and the need to account for detector simulation make the evaluation of the event-by-event likelihood difficult. This is one of the reasons why most LHC measurements are presented as unfolded binned cross-sections, with the exact statistical model [24] replaced by a multi-Gaussian approximation.
Despite technical challenges associated to their definition and their presentation [25], there is growing evidence that at the LHC unbinned multivariate observables accounting for the full event-by-event kinematic information are advantageous to constrain the SMEFT parameters. As an illustration, the most stringent limits on top quark EFT operators from CMS data are those arising from unbinned detector-level observables [26,27]. As compared to traditional measurements, unbinned observables enhance the sensitivity to EFT coefficients by preventing the information loss incurred when adopting a specific binning or when restricting the analysis to a subset of the possible final-state kinematic variables. Constructing such observables for hadronic collisions can be achieved with the analytical evaluation of the event likelihood using e.g. the Matrix Element Method (MEM) [28][29][30][31][32], or numerically by means of Monte Carlo (MC) simulations. In the latter case, Machine Learning (ML) techniques provide a powerful toolbox to efficiently construct high-sensitivity observables for EFT studies [33][34][35][36][37][38][39][40][41][42][43][44][45][46][47][48], see also [49][50][51][52][53] for related work. Such optimal observables are relevant in other contexts beyond EFTs, such as PDF fits [54,55], see [56] for a recent example.
In this work we develop a general framework enabling the integration of tailored unbinned multivariate observables from LHC processes within global SMEFT fits. Our strategy, implemented in the python open source package ML4EFT, combines machine learning regression and classification techniques to parameterize high-dimensional likelihood ratios for an arbitrary number of kinematic inputs and EFT coefficients. Once the likelihood ratio is parametrized in terms of neural networks trained on MC simulations, the posterior probability distributions in the EFT coefficients can be inferred by means of Nested Sampling. The Monte Carlo replica method is used to estimate methodological uncertainties, such as those associated to the finite number of training events, and to propagate them to the inferred confidence level intervals. A key feature of ML4EFT is that the number of networks to be trained, which scales quadratically with the number of EFT parameters, can be fully parallelized. While previous studies of ML-assisted optimized observables for EFT fits consider relatively small operator bases, our framework is hence well-suited to construct general unbinned multivariate observables which depend on up to tens of EFT coefficients as required in global fits.
As a proof of concept of the ML4EFT framework, we construct unbinned multivariate observables for two processes relevant for global EFT interpretations of LHC data: inclusive top-quark pair production and Higgs boson production in association with a $Z$ boson, in the $b\ell^+\nu_\ell\,\bar{b}\ell^-\bar{\nu}_\ell$ (dilepton) and $b\bar{b}\ell^+\ell^-$ final states respectively. We consider fiducial regions where these measurements are statistically limited and systematic errors can therefore be neglected. Whenever possible, we compare the results based on the ML parametrization with those provided by the analytical evaluation of the exact event-by-event likelihood. We demonstrate the improved constraints that these unbinned multivariate observables provide on the SMEFT parameter space as compared to their binned counterparts, and study the information gain associated to the inclusion of multiple kinematic inputs. Our analysis motivates and defines a possible roadmap towards the measurement (and delivery) of unbinned observables tailored to SMEFT parameters at the LHC.
The outline of this paper is as follows. First of all, Sect. 2 introduces the statistical framework which is adopted to construct unbinned multivariate observables. Then Sect. 3 discusses how this general framework applies to the SMEFT and how machine learning is deployed to parametrize high-dimensional likelihood functions. Sect. 4 describes our pipeline for the MC simulation of LHC events in the SMEFT and the settings of the pseudo-data generation. Our results are presented in Sect. 5, which quantifies the constraints on the EFT parameter space provided by unbinned observables in $t\bar{t}$ and in $hZ$ production. Finally, in Sect. 6 we summarize and discuss possible future avenues. App. A presents the main features of the open source ML4EFT framework, while App. B discusses the Asimov dataset in the case of unbinned observables.

From binned to unbinned likelihoods
We begin by presenting the statistical framework that will be adopted in this work in order to construct unbinned observables in the context of global EFT analyses. While we focus on applications to the SMEFT, we emphasize that this formalism is fully general and can be deployed to also construct unbinned observables relevant e.g. to the determination of SM parameters such as the parton distribution functions.

Binned likelihoods
Let us consider a dataset D. The corresponding theory prediction T will in general depend on $n_p$ model parameters, denoted by $c = \{c_1, c_2, \ldots, c_{n_p}\}$, and hence we write these predictions as $T(c)$. The likelihood function is defined as the probability to observe the dataset D assuming that the corresponding underlying law is described by the theory predictions $T(c)$ associated to the specific set of parameters $c$,

$$ L(c) = P\left( D \,|\, T(c) \right) \, . \qquad (2.1) $$

This likelihood function makes it possible to discriminate between different theory hypotheses and to determine, within a given theory hypothesis $T(c)$, the preferred values and confidence level (CL) intervals for a given set of model parameters. In particular, the best-fit values of the parameters $\hat{c}$ are determined from the maximization of the likelihood function $L(c)$, with contours of fixed likelihood determining their CL intervals.

The most common manner of presenting the information contained in the dataset D is by binning the data in terms of specific values of selected variables characteristic of each event, such as the final-state kinematics. In this case, the individual events are combined into $N_b$ bins. Let us denote by $n_i$ the number of observed events in the $i$-th bin and by $\nu_i(c)$ the corresponding theory prediction for the model parameters $c$. For a sufficiently large number of events $n_i$ per bin (typically taken to be $n_i \gtrsim 30$) one can rely on the Gaussian approximation. Hence, the likelihood to observe $n = (n_1, \ldots, n_{N_b})$ events in each bin, given the theory predictions $\nu(c)$, is given by

$$ L(n; \nu(c)) = \prod_{i=1}^{N_b} \frac{1}{\sqrt{2\pi \nu_i(c)}} \exp\left( - \frac{\left( n_i - \nu_i(c) \right)^2}{2\nu_i(c)} \right) \, , \qquad (2.2) $$

where we consider only statistical uncertainties and neglect possible sources of correlated systematic errors in the measurement (uncorrelated systematic errors can be accounted for in the same manner as their statistical counterparts). This approximation is justified since in this work we focus on statistically-limited observables, e.g. the high-energy tails of differential distributions. The binned Gaussian likelihood, Eq. (2.2), can also be expressed as

$$ -2\log L(n; \nu(c)) = \sum_{i=1}^{N_b} \frac{\left( n_i - \nu_i(c) \right)^2}{\nu_i(c)} \, , \qquad (2.3) $$

that is, as the usual $\chi^2$ corresponding to Gaussianly distributed binned measurements. The most likely values of the parameters $\hat{c}$ given the theory hypothesis $T(c)$ and the measured dataset D are obtained from the minimization of Eq. (2.3). The Gaussian binned likelihood, Eq. (2.3), is not appropriate whenever the number of events in some bins becomes too small.
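In practice, evaluating the Gaussian binned figure of merit amounts to a simple sum over bins. The sketch below uses made-up bin counts purely for illustration; `chi2_gaussian` is a hypothetical helper name, not part of ML4EFT:

```python
import numpy as np

def chi2_gaussian(n, nu):
    """Binned Gaussian -2 log L up to c-independent terms:
    sum_i (n_i - nu_i(c))^2 / nu_i(c)."""
    n, nu = np.asarray(n, float), np.asarray(nu, float)
    return np.sum((n - nu) ** 2 / nu)

# toy bin counts (illustrative only): data vs. two candidate predictions
n_obs = [105.0, 48.0, 22.0]
chi2_close = chi2_gaussian(n_obs, [100.0, 50.0, 20.0])
chi2_far = chi2_gaussian(n_obs, [120.0, 60.0, 30.0])
assert chi2_close < chi2_far  # predictions closer to the data are preferred
```

Minimizing this quantity over the model parameters $c$ yields the best-fit point $\hat{c}$ discussed above.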
Denoting by $n_{\rm tot}$ the total number of observed events and by $\nu_{\rm tot}(c)$ the corresponding theory prediction, the appropriate likelihood is instead the product of a Poisson and a multinomial distribution:

$$ L(n; \nu(c)) = \frac{\left( \nu_{\rm tot}(c) \right)^{n_{\rm tot}} e^{-\nu_{\rm tot}(c)}}{n_{\rm tot}!} \times n_{\rm tot}! \prod_{i=1}^{N_b} \frac{1}{n_i!} \left( \frac{\nu_i(c)}{\nu_{\rm tot}(c)} \right)^{n_i} \, , \qquad (2.4) $$

where the total number of observed events (and the corresponding theory prediction) is given by the sum over all bins,

$$ n_{\rm tot} = \sum_{i=1}^{N_b} n_i \, , \qquad \nu_{\rm tot}(c) = \sum_{i=1}^{N_b} \nu_i(c) \, . \qquad (2.5) $$

When imposing these constraints, Eq. (2.4) simplifies to

$$ L(n; \nu(c)) = \prod_{i=1}^{N_b} \frac{\left( \nu_i(c) \right)^{n_i} e^{-\nu_i(c)}}{n_i!} \, , \qquad (2.6) $$

which is equivalent to the likelihood of a binned measurement in which the number of events $n_i$ in each bin follows an independent Poisson distribution with mean $\nu_i(c)$. As in the Gaussian case, Eq. (2.3), one often considers the negative log-likelihood, which for the Poissonian likelihood of Eq. (2.6) reads

$$ -2\log L(n; \nu(c)) = -2 \sum_{i=1}^{N_b} \left( n_i \log \nu_i(c) - \nu_i(c) \right) \, , \qquad (2.7) $$

where we have dropped the $c$-independent terms. In the limit of a large number of events per bin, $n_i \gg 1$, it can be shown that the Poisson log-likelihood, Eq. (2.7), reduces to its Gaussian counterpart, Eq. (2.3). Again, the most likely values of the model parameters, $\hat{c}$, are those obtained from the minimization of Eq. (2.7).
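The Poissonian log-likelihood and its Gaussian limit can be checked numerically in a few lines. The following sketch uses toy counts; `nll_poisson` is a hypothetical helper name introduced here for illustration only:

```python
import numpy as np

def nll_poisson(n, nu):
    """Binned Poisson -2 log L, dropping c-independent terms:
    -2 sum_i (n_i log nu_i(c) - nu_i(c))."""
    n, nu = np.asarray(n, float), np.asarray(nu, float)
    return -2.0 * np.sum(n * np.log(nu) - nu)

# large-count check (toy numbers): the Poisson likelihood *ratio* approaches
# the Gaussian chi2 when n_i >> 1
n_obs = np.array([10050.0, 9950.0])
nu = np.array([10000.0, 10000.0])
delta_poisson = nll_poisson(n_obs, nu) - nll_poisson(n_obs, n_obs)
chi2_gauss = np.sum((n_obs - nu) ** 2 / nu)  # = 0.5 for these counts
```

With these counts the two figures of merit agree at the per-mille level, illustrating the stated large-sample limit.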
Confidence level intervals. In order to determine confidence level intervals associated to the model parameters, for both the Gaussian and the Poisson likelihoods one can adopt, rather than the likelihood $L(c)$ itself, the profile likelihood ratio (PLR) as test statistic of choice. The PLR is defined as

$$ q_c = -2 \log \frac{L(c)}{L(\hat{c})} \, , \qquad (2.8) $$

where, as mentioned above, $\hat{c}$ denotes the maximum likelihood estimator of the theory parameters $c$ given the observed dataset D. By construction, the PLR $q_c$ is positive semi-definite for any value of the theory parameters. Larger values of $q_c$ indicate increasing incompatibility between the theory predictions $T(c)$ and the observed data D, while lower values (down to $q_c = 0$) correspond to improved compatibility. One important difference between the absolute likelihood $L(c)$ and the profile likelihood ratio $q_c$ is that the latter can only be constructed after having determined $\hat{c}$.
Adopting the profile likelihood ratio as test statistic is advantageous, particularly in light of the powerful result due to Wilks [57], stating that $q_c$ is asymptotically distributed according to a $\chi^2$ distribution under the null hypothesis, that is, for data D whose underlying law is described by the theory predictions $T(c)$. More precisely, in the large-sample limit and assuming specific regularity conditions, the profile likelihood ratio Eq. (2.8) follows a $\chi^2$ distribution with $n_p$ degrees of freedom, $\chi^2_{n_p}$. The main benefit of the profile likelihood ratio is hence that it allows for an efficient limit-setting procedure, given that one has direct access to the asymptotic probability distribution. For instance, for the Gaussian likelihood we can determine the endpoints of the $100(1-\alpha)\%$ confidence level (CL) intervals by imposing the condition on the p-value $p_c$,

$$ p_c = 1 - F_{\chi^2_{n_p}}(q_c) = \alpha \, , \qquad (2.9) $$

where

$$ F_{\chi^2_k}(y) = \int_0^y dy' \, \chi^2_k(y') \, , \qquad (2.10) $$

is the cumulative distribution function of the $\chi^2_k$ distribution with $k$ degrees of freedom, and we recall that both $\chi^2_k$ and $q_c$ are positive semi-definite quantities. For instance, $\alpha = 0.05$ corresponds to the calculation of the 95% CL intervals of the theory parameters $c$. Isolating $q_c$ from Eq. (2.9) gives

$$ q_c = F^{-1}_{\chi^2_{n_p}}(1 - \alpha) \, , \qquad (2.11) $$

and hence the resulting confidence level intervals satisfy

$$ -2 \log \frac{L(c)}{L(\hat{c})} \le F^{-1}_{\chi^2_{n_p}}(1 - \alpha) \, . \qquad (2.12) $$

The determination of the CL contours in the theory parameter space for the binned Gaussian likelihood is obtained by solving Eq. (2.12) for $c$. Working with $q_c$ directly, the same result is obtained by demanding Eq. (2.11). A similar derivation can be used to determine CL intervals in the case of the Poisson likelihood, Eq. (2.7).
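The required $\chi^2$ quantiles have simple closed forms for one and two degrees of freedom, which already covers the one- and two-parameter scans shown in many EFT fits. The stdlib-only sketch below (hypothetical helper `qc_threshold`) implements these special cases; a general fit would instead use a library routine such as `scipy.stats.chi2.ppf`:

```python
import math
from statistics import NormalDist

def qc_threshold(alpha, n_params):
    """Chi2 quantile F^{-1}_{chi2_k}(1 - alpha), in closed form for k = 1, 2."""
    if n_params == 1:
        z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
        return z * z            # a chi2_1 variable is the square of a standard normal
    if n_params == 2:
        return -2.0 * math.log(alpha)  # chi2_2 CDF is 1 - exp(-x/2)
    raise NotImplementedError("use scipy.stats.chi2.ppf for k > 2")

# 95% CL: q_c <= 3.84 for one parameter, q_c <= 5.99 for a 2-parameter contour
```

These are the familiar thresholds quoted for 95% CL intervals and contours.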

Unbinned likelihood
The previous discussion applies to binned observables, and leads to the standard Gaussian and Poisson likelihoods, Eqns. (2.2) and (2.6) respectively, in the case of statistically-dominated measurements. Any binned measurement entails some information loss by construction, since the information provided by individual events falling into the same bin is averaged out. To eliminate the effects of this information loss, one can construct unbinned likelihoods that reduce to their binned counterparts, Eqns. (2.2) and (2.6), in the appropriate limits. Instead of collecting the $N_{\rm ev}$ measured events into $N_b$ bins, when constructing unbinned observables one treats each event individually. We now denote the dataset under consideration as

$$ D = \left\{ x_1, x_2, \ldots, x_{N_{\rm ev}} \right\} \, , \qquad (2.13) $$

with $x_i$ denoting the array containing the values of the $n_k$ final-state variables that are measured for the $i$-th event. Typically the array $x_i$ will contain the values of the transverse momenta, rapidities, and azimuthal angles of the measured final-state particles, but it could also be composed of higher-level variables, such as in jet substructure measurements. Furthermore, the same approach can be applied to detector-level quantities, in which case the array $x_i$ contains information such as energy deposits in the calorimeter cells.
As in the binned case, we assume that this process is described by a theoretical framework $T(c)$ depending on the $n_p$ model parameters $c = \{c_1, c_2, \ldots, c_{n_p}\}$. The kinematic variables of the events constituting the dataset Eq. (2.13) are independent and identically distributed random variables following a given distribution, which we denote by $f_\sigma(x, c)$, where the notation reflects that this probability density will be given, in the cases we are interested in, by the differential cross-section evaluated under the null hypothesis (the theory $T(c)$ in this case). For such an unbinned measurement, the likelihood factorizes into the contributions from individual events, such that

$$ L(c) = \prod_{i=1}^{N_{\rm ev}} f_\sigma(x_i, c) \, . \qquad (2.14) $$

It is worth noting that in Eq. (2.14) the data enter as the experimentally observed values of the kinematic variables $x_i$ for each event, while the theory predictions enter at the level of the model $f_\sigma(x, c)$ adopted for the underlying probability density. By analogy with the binned Poissonian case, the likelihood can be generalized to the more realistic case where the measured number of events $N_{\rm ev}$ is not fixed but rather distributed according to a Poisson with mean $\nu_{\rm tot}(c)$, namely the total number of events predicted by the theory $T(c)$, see also Eq. (2.4). The likelihood Eq. (2.14) then receives an extra contribution accounting for the random size of the dataset D, which reads

$$ L(c) = \frac{\left( \nu_{\rm tot}(c) \right)^{N_{\rm ev}} e^{-\nu_{\rm tot}(c)}}{N_{\rm ev}!} \prod_{i=1}^{N_{\rm ev}} f_\sigma(x_i, c) \, . \qquad (2.15) $$

Eq. (2.15) defines the extended unbinned likelihood, with corresponding log-likelihood given by

$$ -2\log L(c) = 2\nu_{\rm tot}(c) - 2 \sum_{i=1}^{N_{\rm ev}} \log \left( \nu_{\rm tot}(c)\, f_\sigma(x_i, c) \right) \, , \qquad (2.16) $$

where again we have dropped all terms that do not depend on the theory parameters $c$, since these are not relevant to determine the maximum likelihood estimators and confidence level intervals. The unbinned log-likelihood Eq. (2.16) can also be obtained from the Poissonian binned likelihood Eq. (2.7) in the infinitely narrow bin limit, that is, when taking $n_i \to 1$ ($\forall i$) and

$$ \nu_i(c) \to \nu_{\rm tot}(c)\, f_\sigma(x_i, c)\, \Delta x \, , \qquad \Delta x \to 0 \, , \qquad (2.17) $$

for which Eq. (2.7) reduces, up to $c$-independent terms, to

$$ -2\log L(c) = 2\nu_{\rm tot}(c) - 2 \sum_{i=1}^{N_{\rm ev}} \log \left( \nu_{\rm tot}(c)\, f_\sigma(x_i, c) \right) \, , \qquad (2.18) $$

as expected, where we have used the normalization condition for the probability density,

$$ \int dx \, f_\sigma(x, c) = 1 \, . \qquad (2.19) $$

Hence one can smoothly interpolate between the (Poissonian) binned and unbinned likelihoods by reducing the bin size until there is at most one event per bin. Again, we ignore correlated systematic errors in this derivation.

As mentioned above, the probability density associated to the events that constitute the dataset Eq. (2.13) and enter the corresponding likelihood Eq. (2.16) is, in the case of high-energy collisions, given by the normalized differential cross-section,

$$ f_\sigma(x, c) = \frac{1}{\sigma_{\rm fid}(c)} \frac{d\sigma(x, c)}{dx} \, . \qquad (2.20) $$

For convenience of notation, let us define the ratio of differential cross-sections with respect to a reference hypothesis,

$$ r_\sigma(x, c) \equiv \frac{d\sigma(x, c)/dx}{d\sigma(x, 0)/dx} \, . \qquad (2.21) $$

The latter is especially useful in cases such as the SMEFT, where the reference hypothesis corresponds to the vanishing of all the theory parameters (the EFT Wilson coefficients), for which $T$ reduces to the SM. In terms of this notation, we can then express the profile likelihood ratio for the unbinned observables of Eq. (2.20) as

$$ q_c = 2\left( \nu_{\rm tot}(c) - \nu_{\rm tot}(\hat{c}) \right) - 2 \sum_{i=1}^{N_{\rm ev}} \log \frac{r_\sigma(x_i, c)}{r_\sigma(x_i, \hat{c})} \, . \qquad (2.22) $$

One can then use either Eq. (2.20) or Eq. (2.22) to derive confidence level intervals associated to the theory parameters $c$ in the same manner as in the binned case, namely by imposing Eq. (2.11) for a given choice of the CL range. Provided double counting is avoided, binned and unbinned observables can be used simultaneously in the context of parameter limit setting. In this general case one assembles a joint likelihood which accounts for the contributions of all available types of observables, namely

$$ L(c) = \prod_{k=1}^{n_{\rm b}} L_k^{\rm (b)}(c) \times \prod_{l=1}^{n_{\rm ub}} L_l^{\rm (ub)}(c) \, , \qquad (2.24) $$

with $L_k^{\rm (b)}$ and $L_l^{\rm (ub)}$ denoting the likelihoods of the binned and unbinned observables respectively, which can then be used to construct the profile likelihood ratio Eq. (2.8) in order to test the null hypothesis and determine confidence level intervals in the theory parameters $c$.

The main challenge for the integration of unbinned observables in global fits using the framework summarized by Eq. (2.24) is that the evaluation of $L_l^{\rm (ub)}(c)$ is in general costly, since the underlying probability density is not known in closed form and hence needs to be computed numerically using Monte Carlo methods. In the next section we discuss how to bypass this problem by adopting machine learning techniques to parametrize this probability density (the differential cross-section) and hence assemble unbinned observables which are fast and efficient to evaluate, as required for their integration into a global SMEFT analysis.
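The extended unbinned log-likelihood is cheap to evaluate once the density is known. The sketch below (hypothetical helper `nll_extended`, with a toy exponential density standing in for the normalized differential cross-section) illustrates that the generating density is indeed preferred by the extended likelihood:

```python
import numpy as np

def nll_extended(events, f_norm, nu_tot):
    """Extended unbinned -2 log L, up to c-independent terms:
    2 nu_tot(c) - 2 sum_i log( nu_tot(c) * f_sigma(x_i, c) ).
    `f_norm` is the normalized density for the tested parameter point."""
    dens = f_norm(np.asarray(events, float))
    return 2.0 * nu_tot - 2.0 * np.sum(np.log(nu_tot * dens))

# toy 1-D example: events drawn from f(x) = exp(-x); the generating density
# should yield a smaller -2 log L than a mismatched candidate
rng = np.random.default_rng(0)
events = rng.exponential(1.0, size=500)
nll_true = nll_extended(events, lambda x: np.exp(-x), nu_tot=500.0)
nll_wrong = nll_extended(events, lambda x: 3.0 * np.exp(-3.0 * x), nu_tot=500.0)
assert nll_true < nll_wrong
```

In an actual fit the density would come from the machine-learned parametrization discussed next, rather than a closed-form expression.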

Unbinned observables from machine learning
In this section we describe our approach to construct unbinned multivariate observables tailored for global EFT analyses by means of supervised machine learning. We discuss how neural networks are deployed as universal unbiased interpolants in order to parametrize likelihood ratios associated to the theoretical models of the SM and EFT differential cross-sections, making possible the efficient evaluation of the likelihood functions for arbitrary values of the Wilson coefficients as required for parameter inference. We emphasize the scalability and robustness of our approach with respect to the number of coefficients and to the dimensionality of the final state kinematics, and validate the performance of the neural network training.

Differential cross-sections
Following the notation of Sect. 2.2, we consider a given process whose associated measurement D consists of $N_{\rm ev}$ events, each of them characterized by $n_k$ final-state variables,

$$ D = \left\{ x_1, x_2, \ldots, x_{N_{\rm ev}} \right\} \, , \qquad x_i = \left( x_i^{(1)}, \ldots, x_i^{(n_k)} \right) \, . \qquad (3.1) $$

The kinematic variables (features) $x$ under consideration depend on the type of measurement that is being carried out. For instance, for a top-quark measurement at the parton level, the $x_i$ would be the transverse momenta and rapidities of the top quarks, while for the corresponding particle-level measurement one would use instead $b$-jet and leptonic kinematic variables. Likewise, $x_i$ could also correspond to detector-level kinematic variables for measurements carried out without unfolding. The inclusive cross-section case corresponds to $n_k = 0$, where one integrates over all final-state kinematics subject to fiducial cuts. The probability distribution associated to the events constituting D is given by the normalized differential cross-section,

$$ f_\sigma(x, c) = \frac{1}{\sigma_{\rm fid}(c)} \frac{d\sigma(x, c)}{dx} \, , \qquad (3.2) $$

expressed in terms of the model parameters $c$.
In general not all $n_k$ kinematic variables that one can consider for a given process will be independent. For example, $2 \to 2$ processes with on-shell particles (like $pp \to t\bar{t}$ before decay) are fully described by three independent final-state variables. For more exclusive measurements, $n_k$ grows rapidly, yet the final-state variables remain partially correlated with each other. The best choice of $x$ and $n_k$ should in this respect be studied separately from the impact associated to the use of unbinned observables as compared to their binned counterparts. Furthermore, in the same manner that one expects the constraints provided by a binned observable to approach those from unbinned ones in the narrow-bin limit, these constraints will also saturate once $n_k$ becomes large enough that adding more variables does not provide independent information.
In the specific case of the SMEFT, the parameters of the theory framework $T(c)$ are the Wilson coefficients associated to the $n_{\rm eft}$ higher-dimensional operators that enter the description of the processes under consideration for a given set of flavor assumptions. Given that a differential cross-section in the dimension-six SMEFT exhibits at most a quadratic dependence on the Wilson coefficients, one can write the differential probability density in Eq. (3.2) as

$$ f_\sigma(x, c) = f_\sigma(x, 0) + \sum_{j=1}^{n_{\rm eft}} c_j\, f_\sigma^{(j)}(x) + \sum_{j=1}^{n_{\rm eft}} \sum_{k \ge j}^{n_{\rm eft}} c_j c_k\, f_\sigma^{(j,k)}(x) \, , \qquad (3.3) $$

where $f_\sigma(x, 0)$ corresponds to the SM cross-section, $f_\sigma^{(j)}(x)$ indicates the linear EFT corrections arising from the interference with the SM amplitude, and $f_\sigma^{(j,k)}(x)$ corresponds to the quadratic corrections associated to the square of the EFT amplitude. We note that while $f_\sigma(x, 0)$ and $f_\sigma^{(j,j)}(x)$ arise from squared amplitudes and hence are positive-definite, this is not necessarily the case for the interference cross-section $f_\sigma^{(j)}(x)$. The terms $f_\sigma(x, 0)$, $f_\sigma^{(j)}(x)$, and $f_\sigma^{(j,k)}(x)$ can be evaluated in perturbation theory, and one can account for different types of effects such as parton shower, hadronization, or detector simulation, depending on the observable under consideration. The SM cross-sections $f_\sigma(x, 0)$ can be computed at NNLO QCD (eventually matched to parton showers) for most of the LHC processes relevant for global EFT fits, while for the EFT linear and quadratic corrections the accuracy frontier is NLO QCD [58]. The settings of the calculation should be chosen to reproduce as closely as possible those of the corresponding experimental measurement, while aiming to minimize the associated theoretical uncertainties. In this work we evaluate the differential cross-sections $f_\sigma(x, c)$ numerically, cross-checking with analytic calculations whenever possible.
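The quadratic structure above can be assembled generically once the SM, interference, and quadratic pieces are known at a phase-space point. The sketch below uses toy callables as stand-ins for those pieces; `dsigma_eft` and the toy functions are illustrative names, not part of ML4EFT:

```python
def dsigma_eft(x, c, f_sm, f_lin, f_quad):
    """Quadratic dim-6 structure:
    f_sm(x) + sum_j c_j f_lin[j](x) + sum_{j<=k} c_j c_k f_quad[(j,k)](x)."""
    val = f_sm(x)
    for j, cj in enumerate(c):
        val += cj * f_lin[j](x)
    for (j, k), f in f_quad.items():
        val += c[j] * c[k] * f(x)
    return val

# toy stand-ins for the SM, interference, and quadratic pieces (two coefficients)
f_sm = lambda x: 1.0
f_lin = [lambda x: 0.1, lambda x: -0.05]
f_quad = {(0, 0): lambda x: 0.01, (0, 1): lambda x: 0.002, (1, 1): lambda x: 0.01}
# at c = 0 the SM prediction is recovered exactly
assert dsigma_eft(0.0, [0.0, 0.0], f_sm, f_lin, f_quad) == f_sm(0.0)
```

Since the pieces are coefficient-independent, a single set of such functions covers the entire EFT parameter space.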
In order to construct unbinned observables in an efficient manner it is advantageous to work in terms of the ratio between the EFT and SM cross-sections, Eq. (2.21), which, accounting for the quadratic structure of the EFT cross-sections in Eq. (3.3), can be expressed as

$$ r_\sigma(x, c) = 1 + \sum_{j=1}^{n_{\rm eft}} c_j\, r_\sigma^{(j)}(x) + \sum_{j=1}^{n_{\rm eft}} \sum_{k \ge j}^{n_{\rm eft}} c_j c_k\, r_\sigma^{(j,k)}(x) \, , \qquad (3.4) $$

where we have defined the linear and quadratic ratios to the SM cross-section as

$$ r_\sigma^{(j)}(x) \equiv \frac{f_\sigma^{(j)}(x)}{f_\sigma(x, 0)} \, , \qquad r_\sigma^{(j,k)}(x) \equiv \frac{f_\sigma^{(j,k)}(x)}{f_\sigma(x, 0)} \, . \qquad (3.5) $$
Parameterizing the ratios between the EFT and SM cross-sections, Eq. (3.4), is beneficial as compared to directly parameterizing the absolute cross-sections, since in general EFT effects represent a moderate distortion of the SM baseline prediction. As indicated by Eq. (2.22), the profile likelihood ratio used to derive limits on the EFT coefficients can be expressed in terms of the ratio Eq. (3.4). Indeed, in the case of the dimension-six SMEFT the PLR reads

$$ q_c = 2\left( \nu_{\rm tot}(c) - \nu_{\rm tot}(\hat{c}) \right) - 2 \sum_{i=1}^{N_{\rm ev}} \log \frac{r_\sigma(x_i, c)}{r_\sigma(x_i, \hat{c})} \, , \qquad (3.6) $$

where $\hat{c}$ denotes the maximum likelihood estimator of the Wilson coefficients. We emphasize that in this derivation the SM serves as a natural reference hypothesis in the EFT parameter space; ratios expressed with respect to another reference point, say $c'$, are trivially equivalent according to the identity

$$ \frac{r_\sigma(x, c)}{r_\sigma(x, \hat{c})} = \frac{d\sigma(x, c)/d\sigma(x, c')}{d\sigma(x, \hat{c})/d\sigma(x, c')} \, . \qquad (3.7) $$

The main challenge in applying limit setting to unbinned observables by means of the profile likelihood ratio of Eq. (3.6) is that the evaluation of the EFT cross-section ratios $r_\sigma^{(j)}(x)$ and $r_\sigma^{(j,k)}(x)$ is computationally intensive, and in many cases intractable, specifically for high-multiplicity observables and when the number of events $N_{\rm ev}$ considered is large. As we explain next, in this work we bypass this challenge by parameterizing the EFT cross-section ratios in terms of feed-forward neural networks, with the kinematic variables $x$ as inputs, trained on the outcome of Monte Carlo simulations.
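Once the ratio function is available, the PLR itself reduces to a sum over events. The sketch below uses a hypothetical one-coefficient toy model (`r_toy`, `nu_toy` are illustrative stand-ins, not the fits of this work):

```python
import numpy as np

def plr_unbinned(events, r_sigma, nu_tot, c, c_hat):
    """Unbinned profile likelihood ratio:
    q_c = 2 (nu_tot(c) - nu_tot(c_hat)) - 2 sum_i log[r(x_i, c)/r(x_i, c_hat)]."""
    log_ratios = np.log([r_sigma(x, c) / r_sigma(x, c_hat) for x in events])
    return 2.0 * (nu_tot(c) - nu_tot(c_hat)) - 2.0 * np.sum(log_ratios)

# toy one-coefficient model: positive ratio and rate with quadratic dependence
r_toy = lambda x, c: (1.0 + 0.1 * c * x) ** 2
nu_toy = lambda c: 100.0 * (1.0 + 0.05 * c * c)
events = np.linspace(0.1, 5.0, 50)
# q_c vanishes identically at the maximum likelihood point
assert plr_unbinned(events, r_toy, nu_toy, c=0.7, c_hat=0.7) == 0.0
```

This makes explicit that evaluating limits requires only fast point-wise access to $r_\sigma(x,c)$, which is exactly what the neural-network parametrization provides.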

Cross-section parametrization
As first introduced in Sect. 2, the profile likelihood ratio provides an optimal test statistic, in the sense that no statistical power is lost when mapping the high-dimensional feature vector $x$ onto the scalar ratio $r_\sigma(x, c)$. Performing inference on the Wilson coefficients using the profile likelihood ratio of Eq. (3.6) requires precise knowledge of the differential cross-section ratio $r_\sigma(x, c)$ for arbitrary values of $c$. However, in general one does not have direct access to $r_\sigma(x, c)$, since MC event generators can only be run in the forward mode, i.e. used to generate samples. The inverse problem, namely statistical inference, is often rendered intractable by the many paths in parameter space that lead from the theory parameters $c$ to the final measurement in the detector. In the machine learning literature this intermediate (hidden) space is known as the latent space.
Feed-forward neural networks are suitable in this context as model-independent, unbiased interpolants to construct a surrogate of the true profile likelihood ratio. Consider two balanced datasets $D_{\rm eft}(c)$ and $D_{\rm sm}$ generated according to the theory hypotheses $T(c)$ and $T(0)$ respectively, where by balanced we mean that the same number of unweighted Monte Carlo events is generated in both cases. We would like to determine the decision boundary function $g(x, c)$ which can be used to classify an event $x$ into either $T(0)$, the Standard Model, or $T(c)$, the SMEFT hypothesis for point $c$ in parameter space. We can determine this decision boundary by using the balanced datasets $D_{\rm eft}(c)$ and $D_{\rm sm}$ to train a binary classifier by means of the cross-entropy loss functional, defined as

$$ L[g] = - \int dx \left[ \frac{d\sigma(x, c)}{dx} \log\left( 1 - g(x, c) \right) + \frac{d\sigma(x, 0)}{dx} \log g(x, c) \right] \, . \qquad (3.8) $$

In practice, the integrations required in the evaluation of the cross-entropy loss Eq. (3.8) are carried out numerically from the generated Monte Carlo events, such that

$$ L[g] \simeq - \frac{\sigma_{\rm fid}(c)}{N_{\rm ev}} \sum_{x \in D_{\rm eft}(c)} \log\left( 1 - g(x, c) \right) - \frac{\sigma_{\rm fid}(0)}{N_{\rm ev}} \sum_{x \in D_{\rm sm}} \log g(x, c) \, , \qquad (3.9) $$

where $\sigma_{\rm fid}(c)$ and $\sigma_{\rm fid}(0)$ represent the integrated fiducial cross-sections in the SMEFT and the SM respectively. Recall that we have two independent sets of $N_{\rm ev}$ events each, generated under $T(c)$ and $T(0)$ respectively, and hence in Eq. (3.9) the first (second) term on the RHS involves the sum over the $N_{\rm ev}$ events generated according to $T(c)$ ($T(0)$). It is also possible to adopt other loss functions for the binary classifier of Eq. (3.8), such as the quadratic loss used in [33]. The outcome of the classification should be stable with respect to alternative choices of the loss function, and indeed we find that both methods lead to consistent results, while the cross-entropy formulation benefits from a faster convergence due to the presence of stronger gradients as compared to the quadratic loss.
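The training loop can be illustrated end-to-end with a minimal sketch, assuming a logistic model $g(x) = \sigma(a + bx)$ in place of a deep network and two unit-width Gaussian toy samples standing in for the SM and EFT simulations (the fiducial cross-section prefactors of the balanced loss are set to one):

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

# toy stand-ins for the SM and EFT samples: two unit-width Gaussians
rng = np.random.default_rng(1)
x_sm = rng.normal(0.0, 1.0, 5000)
x_eft = rng.normal(0.5, 1.0, 5000)

# logistic decision boundary g(x) = sigmoid(a + b x), trained by plain
# gradient descent on the balanced cross entropy:
# L[g] = -mean_eft log(1 - g) - mean_sm log g
a, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    g_sm, g_eft = sig(a + b * x_sm), sig(a + b * x_eft)
    grad_a = np.mean(g_eft) - np.mean(1.0 - g_sm)
    grad_b = np.mean(g_eft * x_eft) - np.mean((1.0 - g_sm) * x_sm)
    a, b = a - lr * grad_a, b - lr * grad_b

# for these Gaussians the exact minimizer is g = sigmoid(0.125 - 0.5 x),
# so the fitted slope should come out close to -0.5
```

Here the model class happens to contain the exact log-likelihood ratio, so the fit converges to it; for realistic SMEFT cross-sections a flexible neural network plays the role of the linear model.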
In the limit of an infinitely large training dataset and a sufficiently flexible parametrization, one can take the functional derivative of $L[g]$ with respect to the decision boundary function $g(x, c)$ to determine that the minimum is given by

$$ g(x, c) = \frac{1}{1 + r_\sigma(x, c)} \, , \qquad (3.10) $$

and hence in this limit the solution of the classification problem defined by the cross-entropy loss Eq. (3.8) is given by the EFT ratios $r_\sigma(x, c)$ that need to be evaluated in order to determine the associated profile likelihood ratio. Hence our strategy will be to parametrize $r_\sigma(x, c)$ with neural networks, benefiting from the characteristic quadratic structure of the EFT cross-sections, and then to train these machine learning classifiers by minimizing the loss function Eq. (3.8).
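The relation between the decision boundary and the cross-section ratio is invertible in closed form, which is what allows a trained classifier to be reused as a likelihood-ratio estimator (the helper name below is illustrative):

```python
import numpy as np

def ratio_from_boundary(g):
    """Invert g(x, c) = 1 / (1 + r_sigma(x, c)): r_sigma = (1 - g) / g."""
    g = np.asarray(g, float)
    return (1.0 - g) / g

# closed-form consistency check
r_true = np.array([0.5, 1.0, 2.0, 10.0])
g = 1.0 / (1.0 + r_true)
assert np.allclose(ratio_from_boundary(g), r_true)
```

A boundary value of $g = 1/2$ corresponds to $r_\sigma = 1$, i.e. a phase-space point where the EFT and SM hypotheses are locally indistinguishable.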
In practice, one can only expect to obtain a reasonably good estimator $\hat{g}$ of the true result, due to finite-size effects in the Monte Carlo training data $D_{\rm eft}$ and $D_{\rm sm}$ and the finite capacity of the neural network architecture. Since the EFT and SM predictions largely overlap in a significant region of phase space, it is crucial to train the decision boundary as precisely as possible in order to obtain a reliable test statistic for inference. The situation is in this respect different from typical classification problems, in which an imperfect decision boundary parameterized by $g$ can still achieve high performance whenever most features are disjoint, so that a slight modification of $g$ does not lead to a significant performance drop. In order to estimate the uncertainties associated to the fact that the actual estimator $\hat{g}$ differs from the true result $g(x, c)$, in this work we use the Monte Carlo replica method described in Sect. 3.3.
Given the quadratic structure of the EFT cross-sections and their ratios to the SM prediction, Eqns. (3.3) and (3.4), once $r_\sigma^{(j)}(x)$ and $r_\sigma^{(j,k)}(x)$ are determined throughout the entire phase space one can straightforwardly evaluate the EFT differential cross-sections (and their ratios to the SM) for any point in the EFT parameter space. Here we exploit this property during the neural network training by decoupling the learning problem of the linear cross-section ratios from that of the quadratic ones. This allows one to extract $r_\sigma^{(j)}$ and $r_\sigma^{(j,k)}$ independently from each other, meaning that the neural network classifiers can be trained in parallel, and also that the training scales at most quadratically with the number $n_{\rm eft}$ of EFT operators considered.
To be specific, at the linear level we determine the EFT cross-section ratios r_σ^(j) from training data generated at linear order, O(Λ^-2), in the EFT expansion with all Wilson coefficients set to zero except for the j-th one, which we denote by c_j^(tr). For such a model configuration, the EFT cross-section ratio can be parametrized as r_σ(x, c_j^(tr)) = 1 + c_j^(tr) NN^(j)(x), (3.12) such that in the large-sample limit the trained network reproduces the exact linear ratio, NN^(j)(x) → r_σ^(j)(x). (3.13) In practice, this relation will only be met with a certain finite accuracy due to statistical fluctuations in the finite training sets. This limitation is especially relevant in phase-space regions where the cross-section is suppressed, such as the tails of invariant mass distributions, and indicates the importance of accounting for these methodological uncertainties associated with the training procedure. By means of the Monte Carlo replica method one can estimate and propagate these uncertainties, first to the parametrization of the EFT ratio r_σ and then to the associated limits on the Wilson coefficients.
Concerning the training of the EFT quadratic cross-section ratios r_σ^(j,k), we follow the same strategy as in the linear case, except that now we construct the EFT dataset at quadratic order without any linear contributions. By omitting the linear term, we reduce the learning problem at the quadratic level to a linear one. Specifically, we generate events at pure O(Λ^-4) level in the EFT, without the interference (linear) contributions, by switching off all Wilson coefficients except two of them, denoted by c_j^(tr) and c_k^(tr), cf. Eq. (3.14), and parametrize the cross-section ratio as r_σ(x, c_j^(tr), c_k^(tr)) = 1 + (c_j^(tr))^2 NN^(j,j)(x) + (c_k^(tr))^2 NN^(k,k)(x) + c_j^(tr) c_k^(tr) NN^(j,k)(x), (3.15) where only purely quadratic terms appear: for j ≠ k the mixed contribution depends only on the product c_j c_k, whereas for j = k it depends only on terms proportional to c_j^2. The cross-section ratio is parametrized in this way to facilitate separate training of the c_j^2, c_k^2, and c_j c_k terms, and we make use of training data in which the contributions from each of these terms have been generated separately, as discussed in more detail in Sect. 4.1. By the same reasoning as above, in the large-sample limit we will have that NN^(j,k)(x) → r_σ^(j,k)(x). (3.16) We note that in case the Monte Carlo generator used to evaluate the theory predictions T(c) does not allow the separate evaluation of the EFT quadratic terms, one can always subtract the linear contribution numerically by means of the outcome of Eq. (3.13). The similar structure shared between Eq. (3.12) and Eq. (3.15) implies that parametrizing the quadratic EFT contributions in this manner is ultimately a linear problem: redefining the product c_jk ≡ c_j c_k maps the quadratic learning problem back to a linear one, Eq. (3.18). By repeating this procedure n_eft times for the linear terms and n_eft(n_eft + 1)/2 times for the quadratic terms, one ends up with the set of functions that parametrize the EFT cross-section ratio of Eq. (3.4), Eq. (3.17).
Eq. (3.17) represents the final outcome of the training procedure, namely an approximate parametrization r̂_σ(x, c) of the true EFT cross-section ratio r_σ(x, c), valid for any point c in the model parameter space, as required to evaluate the profile likelihood ratio in Eq. (2.22) and to perform inference on the Wilson coefficients. Below we provide technical details about how the neural network training is carried out and how uncertainties are estimated by means of the replica method.
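As an illustration of how the independently trained linear and quadratic surrogates are reassembled into the full ratio, the following sketch evaluates r̂_σ(x, c) = 1 + Σ_j c_j NN^(j)(x) + Σ_{j≤k} c_j c_k NN^(j,k)(x) for arbitrary c. The toy stand-in functions and the names `nn_lin`/`nn_quad` are illustrative, not the ML4EFT API:

```python
import itertools
import numpy as np

def eft_ratio(x, c, nn_lin, nn_quad):
    """Assemble the cross-section ratio from per-coefficient surrogates
    that were trained independently (one per linear and quadratic term)."""
    r = np.ones_like(x, dtype=float)
    for j, cj in enumerate(c):
        r += cj * nn_lin[j](x)                       # linear terms
    for j, k in itertools.combinations_with_replacement(range(len(c)), 2):
        r += c[j] * c[k] * nn_quad[(j, k)](x)        # quadratic terms
    return r

# Stand-ins for trained networks in a toy n_eft = 2 example
nn_lin = [lambda x: x, lambda x: x**2]
nn_quad = {(0, 0): lambda x: 0.5, (0, 1): lambda x: x, (1, 1): lambda x: 2.0}

x = np.array([2.0])
c = (0.1, 0.3)
r = eft_ratio(x, c, nn_lin, nn_quad)
```

Because each surrogate enters additively, re-evaluating r̂_σ at a new point c requires no retraining, which is what makes the parametrization usable across the whole Wilson-coefficient space.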
Cross-section positivity during training While the differential cross-section f σ (x, c) (and its ratio to the SM) is positive-definite, this is not necessarily the case for the linear (interference) EFT term, and hence in principle Eq. (3.12) is unbounded from below. At the level of the training pseudo-data, we avoid the issue of negative cross-sections by generating our pseudo-data at fixed values of the Wilson coefficients, specifically chosen such that the differential cross sections are always positive. For example, in the case of negative interference between the EFT and the SM, we generate our training pseudo-data assuming a negative Wilson coefficient such that the net effect of the EFT is an enhancement relative to the SM. The choices of Wilson coefficients used in our study will be further discussed in Sect. 4.5 and in Table 4.5.
It then only remains to ensure that the physical requirement of cross-section positivity is satisfied at the level of the neural network training, and hence that the parameter-space region leading to negative cross-sections is avoided. Cross-section positivity can be implemented at the training level by adding, through a Lagrange multiplier, a penalty term to the loss function whenever the likelihood ratio becomes negative. That is, the loss function is extended as L → L + λ Σ_i ReLU(−r̂_σ(x_i, c)), (3.20) where ReLU stands for the Rectified Linear Unit activation function. Such a Lagrange multiplier penalizes configurations where the likelihood ratio becomes negative, with the penalty increasing the more negative r̂_σ becomes. The value of the hyperparameter λ should be chosen such that the training in the physically allowed region is not distorted. This is the same method used in the NNPDF4.0 analysis to implement PDF positivity and integrability [59,60] at the training level without having to impose these constraints at the parametrization level. However, the Lagrange multiplier method defined by Eq. (3.20) is not compatible with the cross-entropy loss function of Eq. (3.8), given that this loss function is only well defined for 0 < g(x, c) < 1, corresponding to positive likelihood ratios. We note that this is not the case for other loss functions for which configurations with r_σ(x, c) < 0 are allowed, such as the quadratic loss function used in [33], making them in principle compatible with the Lagrange multiplier method to ensure cross-section positivity.
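A minimal NumPy sketch of such a positivity penalty, with the sample mean and the value of λ as illustrative choices rather than the paper's settings:

```python
import numpy as np

def penalized_loss(base_loss, r_hat, lam=100.0):
    """Augment the loss with a Lagrange-multiplier term lam * ReLU(-r_hat),
    penalizing negative likelihood-ratio values, increasingly so the more
    negative r_hat becomes; positive r_hat contributes nothing."""
    relu = np.maximum(0.0, -np.asarray(r_hat, dtype=float))
    return base_loss + lam * relu.mean()
```

For example, `penalized_loss(0.0, [1.0, -0.5])` adds a penalty only for the negative entry; λ must be tuned so that the physically allowed region of the fit is not distorted.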
Instead of using the Lagrange multiplier method, in this work we introduce an alternative parametrization of the cross-section ratio r_σ such that cross-section positivity is guaranteed by construction. Specifically, we modify Eq. (3.12) to enforce positivity, namely the condition r_σ(x, c) > 0 for any value of x and c, by transforming the outcome of the neural network NN^(j)(x) into a transformed network NN̄^(j)(x) as in Eq. (3.22), where ϵ is an infinitesimal positive constant that ensures r_σ(x, c) > 0 when the linear contribution becomes negative, NN^(j)(x) < 0. The transformation of Eq. (3.22) can be thought of as adding a custom activation function at the end of the network such that the cross-entropy loss is well defined throughout the entire training procedure. We stress that it is the transformed neural network NN̄^(j) which is subject to training, and not the original NN^(j). Regarding cross-section positivity at the quadratic level, we note that the transformation of Eq. (3.22) applies just as well there by virtue of Eq. (3.18), and therefore the same approach can be taken. The main advantage of Eq. (3.22) as compared to the Lagrange multiplier method is that we always work with a positive-definite likelihood ratio, as required by the cross-entropy loss function.

Neural network training
Here we describe the settings of the neural network training leading to the parametrization of Eq. (3.7). We consider in turn the choice of neural network architecture, minimizer, and other related hyperparameters; how the input data is preprocessed; the settings of the stopping criterion used to avoid overfitting; the estimate of methodological uncertainties by means of the Monte Carlo replica method; the scaling of the ML training with respect to the number of EFT parameters; and finally the validation procedure where the machine learning model is compared to the analytic calculation of the likelihood ratio.
Table 3.1. Overview of the settings for the neural network trainings. For each of the processes described in Sect. 4, we specify the n_k kinematic features x used for the likelihood ratio parametrization, the architecture, the learning rate, the number of mini-batches, and the training time per network averaged over all replicas. As indicated by Eq. (3.19), given a process whose parameter space is spanned by n_eft Wilson coefficients, there are (n_eft^2 + 3 n_eft)/2 independent neural networks to be trained. For each choice of n_k kinematic features, these neural networks share the settings listed here.
Architecture, optimizer, and hyperparameters. Table 3.1 specifies the training settings adopted for each process, namely the kinematic features used for training, the architecture of the hidden layers, the learning rate η, and the number of mini-batches. Given a process for which the SMEFT parameter space is spanned by n_eft Wilson coefficients, there are at most N_nn = (n_eft^2 + 3 n_eft)/2 independent neural networks to be trained. In practice this number can be smaller due to vanishing contributions, in which case we mention it explicitly. We have verified that the selected architectures are redundant, meaning that the training results remain stable if a somewhat less flexible architecture is adopted. For every choice of n_k kinematic features, these N_nn neural networks share the hyperparameters listed there. The last column of Table 3.1 indicates the average training time per replica and the corresponding standard deviation, evaluated over the N_nn × N_rep networks to be trained for a given process. In future work one could consider an automated optimization of the hyperparameters listed in Table 3.1, along the lines of the strategy adopted for the NNPDF4.0 analysis [60,61].
We train these neural networks by performing (mini)-batch gradient descent on the cross-entropy loss function Eq. (3.9) using the AdamW [62] optimizer. Training was implemented in PyTorch [63] and run on AMD Rome with 19.17 HS06 per CPU core. We point the interested reader to App. A and the corresponding online documentation where the main features of the ML4EFT software framework are highlighted.
Data preprocessing. The kinematic features x that enter the evaluation of the likelihood function f σ (x, c) and its ratio r σ (x, c) cannot be used directly as an input to the neural network training algorithm and should be preprocessed first to ensure that the input information is provided to the neural nets in their region of maximal sensitivity. For instance, considering parton-level top quark pair production at the LHC, the typical invariant masses m tt to be used for the training cover the range between 350 GeV and 3000 GeV, while the rapidities are dimensionless and restricted to the range y tt ∈ [−2.5, 2.5]. To ensure a homogeneous and well-balanced training, especially for high-multiplicity observables, all features should be transformed to a common range and their distribution in this range should be reasonably similar.
A common data preprocessing method for Gaussianly distributed variables is to standardize all features to zero mean and unit variance. However, for typical LHC processes the kinematic distributions are highly non-Gaussian; in particular, invariant mass and p_T distributions are very skewed. In such cases one should instead perform a rescaling based on a suitable interquantile range, such as the 68% CL interval. This method is particularly attractive for our application because of its robustness to outliers at high invariant masses and transverse momenta, in the same way that the median is less sensitive to them than the sample mean. In our approach we use a robust feature scaler which subtracts the median and scales to an interquantile range, resulting in input feature distributions peaked around zero with their bulk well contained within the [−1, 1] region, which is not necessarily the case for the standardized Gaussian scaler. Further justification of this choice will be provided in Sect. 4. See also [64] for a recent application of feature scaling to the training of neural networks in the context of PDF fits.
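A minimal sketch of such a robust scaler, written from scratch (similar in spirit to scikit-learn's RobustScaler); the 16%/84% quantile choice mirrors the 68% interval mentioned above:

```python
import numpy as np

def robust_scale(X, q_low=16.0, q_high=84.0):
    """Subtract the per-feature median and divide by a 68% interquantile
    range, so heavy invariant-mass / pT tails do not dominate the
    normalization the way they would for mean/variance standardization."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    lo, hi = np.percentile(X, [q_low, q_high], axis=0)
    return (X - med) / (hi - lo)

# A skewed "invariant mass"-like feature: the outlier barely shifts the bulk
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
Xs = robust_scale(X)
```

The median of the scaled feature sits at zero and the bulk falls inside [−1, 1], while a single large-mass outlier is simply mapped far outside the unit interval instead of compressing the bulk, as a mean/variance scaler would.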
Stopping and regularization. The high degree of flexibility of neural networks in supervised learning applications has an associated risk of overlearning, whereby the model ends up learning the statistical fluctuations present in the data rather than the actual underlying law. This implies that for a sufficiently flexible architecture a training with a fixed number of epochs will result in either underlearning or overfitting, and hence that the optimal number of epochs should be determined separately for each individual training by means of a stopping criterion.
Here the optimal stopping point is determined separately for each trained neural network by means of a variant of the cross-validation dynamical stopping algorithm introduced in [65]. Within this approach, one randomly splits each of the input datasets D_sm and D_eft into two disjoint sets, known as the training and validation sets, in an 80%/20% ratio. The points in the validation subset are excluded from the optimization procedure, and the loss function evaluated on them, denoted by L_val, is used as a diagnostic tool to prevent overfitting. The minimization of the training loss function L_tr is carried out while monitoring the value of L_val. One continues training until L_val has stopped decreasing for n_p (patience) epochs with respect to its last recorded minimum. The optimal network parameters, those with the smallest generalization error, then correspond to the epoch at which L_val exhibits its global minimum within the patience threshold.
The bottom-left plot of Fig. 3.1 illustrates the dependence of the training and validation loss functions on the number of epochs in a representative training. While L_tr continues to decrease as the number of epochs increases, at some point L_val exhibits a global minimum and does not decrease further during n_p epochs. The position of this global minimum is indicated with a vertical dashed line, corresponding to the optimal stopping point. The parameters of the trained network are stored at each iteration, and once the optimal stopping point has been identified, the final parameters are assigned to be those of the epoch where L_val attains its global minimum.
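The stopping logic described above can be sketched as follows. This is a simplified illustration; `run_epoch` and `val_loss` are hypothetical stand-ins for one optimization epoch on L_tr and the validation-loss evaluation:

```python
def train_with_patience(run_epoch, val_loss, max_epochs, patience):
    """Cross-validation stopping: optimize on L_tr while monitoring L_val,
    stop once L_val has not improved for `patience` epochs, and return the
    parameters stored at the epoch where L_val was smallest."""
    best_loss, best_state, best_epoch, waited = float("inf"), None, -1, 0
    for epoch in range(max_epochs):
        state = run_epoch(epoch)      # one epoch of gradient descent on L_tr
        loss = val_loss(state)        # L_val on the held-out 20% split
        if loss < best_loss:
            best_loss, best_state, best_epoch, waited = loss, state, epoch, 0
        else:
            waited += 1
            if waited >= patience:    # n_p epochs without a new minimum
                break
    return best_state, best_epoch

# Toy run: validation loss decreases, then starts overfitting at epoch 3
losses = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8, 3.9]
state, epoch = train_with_patience(lambda e: e, lambda s: losses[s],
                                   max_epochs=len(losses), patience=3)
```

In the toy run the loop stops three epochs after the minimum at epoch 2 and restores the parameters stored there, mirroring the dashed-line stopping point in Fig. 3.1.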
Uncertainty estimate from the replica method. In general the ML parametrization r̂_σ(x, c) will differ from the true EFT cross-section ratio r_σ(x, c) for two main reasons: first, because of the finite statistics of the MC event samples used for the neural network training, leading to a functional uncertainty in the ML model; and second, due to residual inefficiencies of the optimization and stopping algorithms. In order to quantify these sources of methodological uncertainty and their impact on the subsequent EFT parameter inference, we adopt the neural network replica method developed in the context of PDF determinations [66][67][68][69].
The basic idea is to generate N_rep replicas of the MC training dataset, each of them statistically independent, and then train separate sets of neural networks on each of these replicas. As explained in Sect. 3.2, we train the decision boundary g(x, c) on a balanced sample of SM and EFT events. If we aim to carry out the training of r̂_σ on a sample of N_ev events (balanced between the EFT and SM hypotheses), one generates a total of N_ev × N_rep events and divides them into N_rep replicas, each containing the same amount of information on the underlying EFT cross-section ratio r_σ. Subsequently, one trains the full set of neural networks required to parametrize r̂_σ separately for each of these replicas, using in each case different random seeds for the initialization of the network parameters and the other settings of the optimization algorithm.
In this manner, at the end of the training procedure one ends up, instead of the single set of functions in Eq. (3.19), with an ensemble of N_rep replicas of the cross-section ratio parametrization, Eq. (3.23), which estimates the methodological uncertainties associated with the parametrization and training. Confidence-level intervals associated with these uncertainties can then be determined in the usual way, for instance by taking suitable lower and upper quantiles. In other words, the replica ensemble given by Eq. (3.23) provides a representation of the probability density in the space of NN models, which can be used to quantify the impact of methodological uncertainties at the level of EFT parameter inference. For the processes considered in this work we find that values of N_rep between 25 and 50 are sufficient to estimate the impact of these procedural uncertainties.
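The replica logic can be sketched as follows, with a generic estimator standing in for the full neural network training (a simplified illustration, not the ML4EFT pipeline itself):

```python
import numpy as np

def replica_band(events, n_rep, estimator, cl=0.68, seed=0):
    """Split the MC sample into n_rep statistically independent replicas,
    evaluate the estimator on each, and return the per-replica values
    together with the central-CL quantile band."""
    rng = np.random.default_rng(seed)
    replicas = np.array_split(rng.permutation(events), n_rep)
    values = np.array([estimator(rep) for rep in replicas])
    lo, hi = np.quantile(values, [(1.0 - cl) / 2.0, (1.0 + cl) / 2.0])
    return values, (lo, hi)

rng = np.random.default_rng(1)
events = rng.normal(loc=1.0, scale=1.0, size=25_000)   # toy observable
values, (lo, hi) = replica_band(events, n_rep=25, estimator=np.mean)
```

The spread of the 25 per-replica estimates quantifies the finite-statistics uncertainty, and the 68% quantile band plays the role of the one-sigma MC replica uncertainty used below.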
Scaling with the number of EFT parameters. If unbinned observables such as those constructed here are to be integrated into global SMEFT fits, their scaling with the number of EFT operators n_eft must not be too computationally costly, given that typical fits involve up to n_eft ∼ 50 independent degrees of freedom. In this respect, exploiting the polynomial structure of EFT cross-sections as done in this work allows for an efficient scaling of the neural network training and makes complete parallelization possible. We note that most related approaches in the literature, such as e.g. [70], are limited to a small number of EFT parameters and hence not amenable to global fits. In other approaches, e.g. [33], the proposed ML parametrization is such that the coefficients of the linear and quadratic terms mix, in which case no separation between linear and quadratic terms, or between different Wilson coefficients, is possible. This implies that in such approaches all neural networks parametrizing the likelihood functions need to be trained simultaneously, and hence that parallelization is not possible. Within our framework, assembling the parametrization of the cross-section ratio Eq. (3.7) involves n_eft independent trainings for the linear contributions followed by n_eft(n_eft + 1)/2 for the quadratic terms. Hence the total number of independent neural network trainings required is N_nn = (n_eft^2 + 3 n_eft)/2, which scales quadratically (∝ n_eft^2) for a large number of EFT parameters. Furthermore, since each neural net is trained independently, the procedure is fully parallelizable and the total computing time scales rather as n_eft^2/n_proc, with n_proc the number of available processors. Thanks to this property, even for n_eft ∼ 40 in a typical cluster with ∼ 10^3 nodes the computational effort required to construct Eq. (3.7) is only 50% larger than in the case n_eft = 1.
This means that our method is well suited for the large parameter spaces considered in global EFT analyses.
Furthermore, for each unbinned multivariate observable that is constructed, we repeat the training of the neural networks N_rep times to estimate methodological uncertainties. Hence the maximal number of neural network trainings involved is N_rep × (n_eft^2 + 3 n_eft)/2. For example, for hZ production with quadratic EFT corrections we have n_eft = 7 coefficients and N_rep = 50 replicas, resulting in a maximum of 1750 neural networks to be trained. While this number may appear daunting, these trainings are parallelizable, and the total integrated computing requirements end up being not too different from those of a single-network training.
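The counting of trainings can be encoded in a one-line helper (the function name is illustrative):

```python
def n_trainings(n_eft, n_rep=1):
    """Maximal number of independent NN trainings: n_eft linear terms plus
    n_eft (n_eft + 1) / 2 quadratic ones, i.e. (n_eft^2 + 3 n_eft) / 2 per
    replica, repeated for each of the n_rep Monte Carlo replicas."""
    return n_rep * (n_eft**2 + 3 * n_eft) // 2
```

For the hZ example above this reproduces the quoted count, `n_trainings(7, 50) == 1750`, while a single-coefficient fit needs just two networks (one linear, one quadratic) per replica.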
Validation with the analytical likelihood. As will be explained in Sect. 4, for relatively simple processes one can also evaluate the cross-section ratios of Eq. (3.7) in a purely analytic manner. In such cases, the PLR and the associated parameter inference can be evaluated exactly, without the need to resort to numerical simulations. The availability of such analytical calculations offers the possibility to independently validate their machine learning counterpart, Eq. (3.19), at various levels during the training process. Fig. 3.1 presents an overview of representative validation checks that we carry out whenever the analytical cross-sections are available. In this case the process under consideration is parton-level top quark pair production, to be described in Sect. 4, where the kinematic features are the top quark pair invariant mass m_tt and rapidity y_tt, that is, the feature array is given by x = (m_tt, y_tt). The neural network training shown corresponds to the quadratic term NN^(j,j), with j being the chromomagnetic operator c_tG.
First, we display a point-by-point comparison of the log-likelihood ratio in the ML model and the corresponding analytical calculation, namely comparing Eqs. (3.19) and (3.7) evaluated on the kinematics of the Monte Carlo events generated for the training in the specific case of c_tG = 2. One obtains excellent agreement within the full phase space considered. Then we show the median value (over replicas) of the ratio between the analytical and machine learning calculations of NN^(j,j) evaluated in the (m_tt, y_tt) kinematic feature space, with j again being the chromomagnetic operator c_tG. We also show the pull between the analytical and numerical calculations in units of the Monte Carlo replica uncertainty. From the median plot we see that the parametrized ratio r̂_σ reproduces the exact result to within a few percent, except in low-statistics regions (large |y_tt| and the high-m_tt tail), and that these differences are in general well contained within the one-sigma MC replica uncertainty band.
The bottom-right plot of Fig. 3.1 displays the resulting decision boundary g(x, c) at y_tt = 0 as a function of the invariant mass m_tt, in the training of the quadratic cross-section ratio proportional to c_tG^2, again for c_tG = 2. The band in the ML model is evaluated as the 68% CL interval over the trained MC replicas and is largest at high m_tt values, where the statistics are smallest. Again we find that the ML parametrization agrees within uncertainties with the exact analytical calculation, further validating the procedure. Similar good agreement is observed for other EFT operators, both for the linear and the quadratic cross-sections.

Theoretical modeling
We describe here the settings adopted for the theoretical modeling and simulation of unbinned observables at the LHC and their subsequent SMEFT interpretation. We consider two representative processes relevant for global EFT fits, namely top-quark pair production and Higgs boson production in association with a Z-boson. We describe the calculational setups used for the SM and EFT cross-sections at both the parton and the particle level, justify the choice of EFT operator basis, motivate the selection and acceptance cuts applied to final-state particles, present the validation of our numerical simulations with analytical calculations whenever possible, and summarize the inputs to the neural network training.

Benchmark processes and simulation pipeline
We apply the methodology developed in Sect. 3 to construct unbinned observables for inclusive top-quark pair production and Higgs boson production in association with a Z-boson in proton-proton collisions. We evaluate theoretical predictions in the SM and in the SMEFT for both processes at leading order (LO), which suffices in this context given that we are considering pseudo-data. For particle-level event generation we consider the fully leptonic decay channel of top-quark pair production, tt → bbℓ+ℓ−ν_ℓν̄_ℓ, and the channel in which the Higgs decays to a pair of bottom quarks and the Z-boson decays leptonically, hZ → bbℓ+ℓ−. (4.1) The evaluation of the SM and SMEFT cross-sections at LO is carried out with MadGraph5_aMC@NLO [71] interfaced to SMEFTsim [72,73], with NNPDF31_nnlo_as_0118 as the input PDF set [74]. As discussed in Sect. 3.2, at the quadratic level in the EFT we parametrize the cross-section ratio such that we have separate neural networks for terms proportional to c_j^2 and for the mixed quadratic terms c_j c_k. In addition to Eq. (4.1), we also carry out tt simulations at the undecayed parton level with the goal of comparing with the corresponding exact analytical calculation of the likelihood ratio for benchmarking purposes. Such an analytical evaluation becomes more difficult (or impossible) for realistic unbinned multivariate measurements presented in terms of particle-level or detector-level observables.
The flow chart in Fig. 4.2 describes the pipeline adopted to evaluate the analytical expressions for the parton-level analysis at LO in the QCD expansion. First, we generate a FeynArts [75] model file from the SMEFTsim top U3l UFO model in the {m_W, m_Z, G_F} input scheme [73]. This amounts to a U(3)_l × U(3)_e flavor symmetry in the leptonic sector and U(2)_q × U(2)_u × U(3)_d in the quark sector, consistent with the flavor assumptions made in the SMEFiT analysis [15]. Then we use FeynArts to construct the diagrams associated with a given production process before passing them on to FormCalc_v9.9 [76] interfaced to Mathematica, which ultimately produces the analytical differential cross-section in the SMEFT.
Satisfactory agreement is found between the analytical SM and SMEFT calculations and the outcome of the corresponding MadGraph5_aMC@NLO simulations, both at the linear and quadratic EFT level, for all processes considered here. This agreement is illustrated in Fig. 4.3, which compares analytical and simulated SMEFT predictions dσ(c) for the m_tt distribution in top-quark pair production for some of the values of the c_tG and c_tu^(8) coefficients used for the neural network training. Similar agreement is found for other distributions and other points in the EFT parameter space. These analytical calculations also make it possible to validate the accuracy of the neural network training, as exemplified in Fig. 3.1, and indeed the agreement persists at the level of the training of the decision boundary g(x, c).
Dominance of statistical uncertainties. As discussed in Sect. 2, we restrict our analysis to measurements dominated by statistical uncertainties, for which correlated systematic uncertainties can be neglected. This condition can be enforced by restricting the fiducial phase space such that the number of events per bin satisfies δσ_i^(stat)/σ_i = 1/√ν_i(0) ≥ δ_min^(stat), (4.3) where ν_i(0) is the number of expected events in bin i according to the SM hypothesis after applying selection, acceptance, and efficiency cuts. The threshold parameter δ_min^(stat) is set to δ_min^(stat) = 0.02 for our baseline analysis, and we have verified that our qualitative findings are not modified upon moderate variations of its value. Since Eq. (4.3) must apply for all possible binning choices, it should also hold for N_b = 1, namely for the total fiducial cross-section. Therefore, we require that the selection and acceptance cuts applied lead to a fiducial region satisfying (δσ^(stat)/σ)_fid = 1/√ν_tot(0) ≥ δ_min^(stat). (4.4) This condition implies that the requirement of Eq. (4.3) will also be satisfied for any particular choice of binning, including the narrow-bin limit, i.e. the unbinned case. Two handles are available to satisfy this condition. The first option is adjusting the integrated luminosity L_int corresponding to the measurement; in this work we take a fixed baseline luminosity L_int = 300 fb−1, corresponding to the integrated luminosity accumulated at the end of Run III. The second option is to adjust the fiducial region such that Eq. (4.4) is satisfied. In this work we take the second option, imposing kinematic cuts that restrict the events to the high-energy, low-yield tails of distributions, such as a strong m_tt cut in the case of top quark pair production, see Table 4.2. It is then possible to generalize the results presented here for L_int = 300 fb−1 to higher integrated luminosities by making the cuts that define the fiducial region more stringent.
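The per-bin condition of Eq. (4.3) is straightforward to check numerically; in the following sketch the bin yields are illustrative:

```python
import math

def stat_dominated(nu_bins, delta_min=0.02):
    """Check Eq. (4.3) bin by bin: 1/sqrt(nu_i) >= delta_min, equivalently
    nu_i <= 1/delta_min**2 (= 2500 for delta_min = 0.02) expected SM events,
    so that statistical uncertainties dominate over systematics."""
    return all(1.0 / math.sqrt(nu) >= delta_min for nu in nu_bins)
```

For instance, bins with 100 or 2500 expected SM events pass the δ_min = 0.02 requirement, while a bin with 3000 events would fail it, which is why the fiducial cuts push the analysis into the low-yield tails.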

Top-quark pair production: parton level
For inclusive top quark pair production with stable tops, the MadGraph5_aMC@NLO calculation is accompanied by and benchmarked against the analytical evaluation of the likelihood, see also Fig. 4.3. We consider the effects of two representative dimension-six SMEFT operators modifying inclusive top quark pair production, namely the chromomagnetic dipole operator O_tG and the two-light-two-heavy four-fermion color-octet operator O_tu^(8), defined as in [73] in the topU3l flavor scheme; the latter leads to a new qq̄tt̄ vertex. At the level of undecayed tops, a 2 → 2 process such as top quark pair production is uniquely determined by specifying three independent kinematic variables, since the four-momenta p_t^μ and p_t̄^μ satisfy the mass-shell conditions and transverse momentum conservation. We refer to these kinematic variables as features, in the context of ML classification problems. We choose the three independent features to be the transverse momentum of the top quark, p_T^t, and the invariant mass and rapidity of the top quark pair, m_tt and y_tt respectively. It can be verified that considering additional variables does not improve the sensitivity to the EFT parameters, given the redundancy of the extra features. No fiducial cuts are imposed in the MadGraph5_aMC@NLO calculation, to facilitate the comparison with the analytical result.
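As an illustration, the three features can be computed from the top and antitop four-momenta as follows (a sketch assuming (E, px, py, pz) ordering for the momenta):

```python
import numpy as np

def ttbar_features(p_t, p_tbar):
    """The three independent parton-level features (pT of the top, m_tt,
    y_tt) from the top and antitop four-momenta (E, px, py, pz)."""
    p_t, p_tbar = np.asarray(p_t, float), np.asarray(p_tbar, float)
    pt_top = np.hypot(p_t[1], p_t[2])                    # top transverse momentum
    p = p_t + p_tbar                                     # pair four-momentum
    m_tt = np.sqrt(p[0]**2 - p[1]**2 - p[2]**2 - p[3]**2)  # invariant mass
    y_tt = 0.5 * np.log((p[0] + p[3]) / (p[0] - p[3]))     # pair rapidity
    return pt_top, m_tt, y_tt

# Symmetric toy configuration (units of GeV)
pt_top, m_tt, y_tt = ttbar_features([200.0, 100.0, 0.0, 50.0],
                                    [200.0, -100.0, 0.0, 50.0])
```

Any further kinematic variable of the 2 → 2 system is a function of these three, which is the redundancy argument made above.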
Concerning the event generation settings, for each point in the EFT parameter space c that enters the neural network training we generate N_rep = 50 independent sets of events (replicas) containing N_ev = 10^5 events each, for a total of 5 × 10^6 events, see also the overview in Table 4.5. Note that we adopt the convention whereby N_ev denotes the Monte Carlo events generated to train the machine learning classifier, while ν_tot indicates the physical events that enter the EFT parameter inference. The former can be made as large as one wants, while the latter is fixed by the assumed integrated luminosity and the value of the fiducial cross-section, Eq. (4.4). In addition, for each replica we generate an independent set of 10^5 SM events. This is required so that the two terms of the cross-entropy loss function Eq. (3.9) are properly balanced, and results in a training set of 2 × 10^5 events per replica. Similar settings are used for the particle-level processes described next, and we have verified that the size of this Monte Carlo dataset is sufficient to ensure a stable and accurate parametrization of the likelihood ratio.

Top-quark pair production: particle level
In the particle-level case, where the top-quark events generated from the diagrams in Fig. 4.1 are decayed into the bℓ+ν_ℓ b̄ℓ−ν̄_ℓ final state, one considers a broader set of kinematic features. As in the parton-level case, SM and EFT events are simulated with MadGraph5_aMC@NLO at LO in the QCD expansion, though now the analytical calculation is not available as a cross-check. In order to select the relevant EFT operators, we adopt the following strategy. Since we consider a single process, it is only possible to constrain a subset of operators, which are taken to be the n_eft Wilson coefficients with the highest Fisher information value, namely those that can be best determined from the fit. In the upper part of Fig. 4.4 we display the diagonal entries of the Fisher information matrix corresponding to the operators that enter the calculation of the LHC tt production measurements at 8 TeV and 13 TeV from [15], where each row is normalized to 100. The dark blue entry indicates the dominant operator, while the operators listed in red are those with the highest Fisher information, selected to construct the unbinned observables. Constraining additional Wilson coefficients would require extending the analysis to unbinned observables for processes such as ttV, which span complementary directions in the parameter space. In Table 4.1 we indicate the SMEFT operators entering inclusive top-quark pair production and listed in Fig. 4.4. For each operator we provide its definition in terms of the SM fields and the notation used for the corresponding Wilson coefficients in SMEFTsim (in the topU3l flavor scheme), SMEFiT, and SMEFT@NLO [58]. These operator definitions are consistent with those used in the SMEFiT global analyses [15,77], as required for the eventual integration of the unbinned observables there.
The selection and acceptance cuts imposed on the final-state particles of the tt → bbℓ+ℓ−νℓνℓ process are adapted from the Run II dilepton CMS analysis [78] and listed in Table 4.2. Concerning the array of kinematic features x, it is composed of n k = 18 features: the transverse momenta of the lepton and of the antilepton, together with their leading and trailing values; the pseudorapidities of the lepton and of the antilepton, together with their leading and trailing values; the transverse momentum and invariant mass of the dilepton system; the absolute difference in azimuthal angle |∆ϕ| and the difference in absolute rapidity ∆η between the two leptons; the leading and trailing b-jet transverse momenta and pseudorapidities; and the transverse momentum and invariant mass of the bb system. These features are partially correlated among themselves, and hence the maximal sensitivity of the unbinned observables to the EFT coefficients will already be achieved for n k < 18.
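To make the construction of such features concrete, the sketch below computes a few of the leptonic variables listed above from (E, px, py, pz) four-momenta. The function name and input convention are illustrative and not part of ML4EFT:

```python
import numpy as np

def dilepton_features(p_lep, p_alep):
    """Compute a subset of the kinematic features listed above from the
    (E, px, py, pz) four-momenta of the lepton and antilepton."""
    pT  = lambda p: np.hypot(p[1], p[2])
    phi = lambda p: np.arctan2(p[2], p[1])
    # pseudorapidity; diverges for momenta exactly along the beam axis
    eta = lambda p: np.arctanh(p[3] / np.sqrt(p[1]**2 + p[2]**2 + p[3]**2))
    p_ll = p_lep + p_alep
    m_ll = np.sqrt(max(p_ll[0]**2 - p_ll[1]**2 - p_ll[2]**2 - p_ll[3]**2, 0.0))
    dphi = abs(phi(p_lep) - phi(p_alep))
    if dphi > np.pi:                       # wrap into [0, pi]
        dphi = 2 * np.pi - dphi
    return {
        "pT_lep": pT(p_lep), "pT_alep": pT(p_alep),
        "eta_lep": eta(p_lep), "eta_alep": eta(p_alep),
        "pT_ll": pT(p_ll), "m_ll": m_ll, "dphi_ll": dphi,
    }

# Example: two massless leptons back-to-back in the transverse plane
f = dilepton_features(np.array([50.0, 50.0, 0.0, 0.0]),
                      np.array([50.0, -50.0, 0.0, 0.0]))
```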
Since no parton shower or hadronization effects are included, the b-quarks can be reconstructed without the need of jet clustering and assuming perfect tagging efficiency. These simulation settings are not suited to describe actual data, but they suffice for the present analysis based on pseudo-data, whose goal is the consistent comparison of the impact on the EFT parameter space of unbinned multivariate ML observables with that of their binned counterparts.

Figure 4.5. Same as Fig. 4.1 for Higgs associated production with a Z-boson, hZ → bbℓ+ℓ−, indicating representative corrections arising from the SMEFT operators considered. For this process SMEFT effects also modify particle decays, in particular via the Yukawa interaction relevant for h → bb.

Higgs associated production with a Z-boson
The second process that we consider is Higgs production in association with a Z-boson in the bbℓ+ℓ− final state, for which representative Feynman diagrams indicating the impact of the EFT operators considered are displayed in Fig. 4.5. Following the same strategy as for top-quark pair production, the bottom panel of Fig. 4.4 indicates in red the selected n eft = 7 operators with the largest value of the Fisher information matrix when evaluated on the 13 TeV LHC hZ production data. The definition of these operators, again consistent with those in [15], is listed in Table 4.3. The selection and acceptance cuts imposed on the final-state particles of the hZ → bbℓ+ℓ− process are collected in Table 4.4 and have been adapted from the ATLAS Run II analysis [79]. In addition to these cuts, a cut on the jet cone radius ∆R in the bb system is applied depending on the value of p Z T, with ∆R(b1, b2) < 3.0, 1.8, and 1.2 for p Z T ∈ (75, 150] GeV, (150, 200] GeV, and (200, ∞) GeV, respectively. The array of kinematic features x for this process is composed of the following n k = 7 features: the transverse momentum of the Z boson p Z T, that of the b-quark p b T, that of the bb pair p bb T, the angular separation ∆R bb of the b-quarks, their azimuthal angle separation ∆ϕ b,bb, the rapidity difference between the dilepton and the bb system ∆η Z,bb, and the azimuthal angle separation ∆ϕ ℓb. Again, most of these features are correlated among themselves, and hence there will be a degree of redundancy in the analysis. In order to illustrate shape (rather than normalization) differences of the NN inputs, all distributions in Fig. 4.6 are normalized by their fiducial cross-sections. The corresponding comparisons at the level of the tt → bbℓ+ℓ−νℓνℓ process, displaying the complete set of n k = 18 kinematic features used to train the cross-section ratio at the quadratic-only level, are shown in Fig. 4.7. From Figs.
4.6 and 4.7 one can observe how each operator modifies the qualitative shape of the various kinematic features in different ways. Furthermore, in general the EFT quadratic-only corrections enhance the shift with respect to the SM distributions as compared to the linear ones. The complementarity of the information provided by each kinematic feature motivates the inclusion of as many final-state variables as possible when constructing unbinned observables, though as mentioned above the limiting sensitivity will typically be saturated before reaching the total number of kinematic features used for the training.
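The pT(Z)-dependent ∆R(b1, b2) requirement quoted above for the hZ selection can be encoded as a small helper function. The behaviour below the 75 GeV threshold is an assumption made here for illustration (no additional ∆R cut applied there):

```python
def passes_drbb_cut(delta_r_bb, pt_z):
    """pT(Z)-dependent upper cut on the b-bbar angular separation,
    following the values quoted in the text."""
    if pt_z <= 75.0:
        return True              # assumption: no additional Delta R cut below 75 GeV
    if pt_z <= 150.0:            # pT(Z) in (75, 150] GeV
        return delta_r_bb < 3.0
    if pt_z <= 200.0:            # pT(Z) in (150, 200] GeV
        return delta_r_bb < 1.8
    return delta_r_bb < 1.2      # pT(Z) > 200 GeV
```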

Inputs to the neural network training
As mentioned in Sect. 3.3, an efficient neural network training strategy demands that the kinematic features x entering the evaluation of the r σ (x, c) cross-section ratios are preprocessed, to ensure that the input information is provided to the neural networks in the region of maximal sensitivity. That is, all features should be transformed to a common range, and their distributions within this range should be reasonably similar. Here we use a robust scaler to ensure that this condition is satisfied. Fig. 4.8 compares two different preprocessing schemes applied to the input features before the neural network training of the hZ → bbℓ+ℓ− likelihood: a standardized Gaussian scaler, which rescales all features to have zero mean and unit variance, and a robust scaler, which subtracts the median and scales to the 95% inter-quantile range. The robust scaler leads to input feature distributions peaked around zero with their bulk contained within the [−1, 1] region, which is not the case in general for the standardized Gaussian scaler. Therefore our default robust scaler facilitates the incorporation of new kinematic features, since the shapes of the input distributions are such that their bulk belongs to the high-sensitivity region of the neural networks. Table 4.5 summarizes the settings adopted for the neural network training of the likelihood ratio function Eq. (3.19) for the processes considered. For each process, we indicate the number of replicas N rep generated, the values of the EFT coefficients that enter the training as specified in Eqs. (3.12) and (3.15), the number of Monte Carlo events N ev generated for each replica, and the number of neural networks to be trained per replica, N nn. The values of the Wilson coefficients are chosen to be sufficiently large so as to mitigate the effect of MC errors, which might otherwise dominate the SM-EFT discrepancy.
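The difference between the two preprocessing schemes can be illustrated on a long-tailed feature. This is a minimal numpy sketch of the scalers described above, not the ML4EFT implementation:

```python
import numpy as np

def standard_scale(x):
    # standardized Gaussian scaler: zero mean, unit variance
    return (x - x.mean()) / x.std()

def robust_scale(x, q=0.95):
    # robust scaler: subtract the median and divide by the 95%
    # inter-quantile range (central 95% of the distribution)
    lo, hi = np.quantile(x, [(1 - q) / 2, (1 + q) / 2])
    return (x - np.median(x)) / (hi - lo)

rng = np.random.default_rng(1)
# A long-tailed feature, qualitatively similar to a transverse momentum spectrum
x = rng.exponential(scale=100.0, size=50_000)

x_std, x_rob = standard_scale(x), robust_scale(x)
# The robust scaler keeps the bulk of the distribution inside [-1, 1]
bulk = np.mean(np.abs(x_rob) < 1.0)
```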
Furthermore, the sign of each Wilson coefficient is chosen such that the effect of the EFT is an enhancement relative to the SM, so that the differential cross-sections remain positive; for example, in the case of negative EFT-SM interference we select negative values of the Wilson coefficients. Cross-section positivity must also be maintained during the training of the neural networks, as discussed further in Sect. 3.2. The last column indicates the total number of trainings required to assemble the full parametrization including the N rep replicas, namely #trainings = N rep × N nn. For parton- and particle-level top-quark pair production and for hZ production, our procedure requires the training of 200, 1000, and 1500 networks, respectively, in the case of the quadratic EFT analysis. As discussed in Sect. 3, these trainings are parallelizable and the overall computational overhead remains moderate.

Figure 4.7. Same as Fig. 4.6 (quadratic-only EFT effects included) for the n k = 18 kinematic features entering the parametrization of the likelihood ratio in the tt → bbℓ+ℓ−νℓνℓ process.
The values listed in the last two columns of Table 4.5 correspond to the case of quadratic EFT fits since, as will be explained in Sect. 5, at the linear level the presence of degenerate directions requires restricting the subset of operators for which inference can be performed. For each process, the total number of Monte Carlo events in the SMEFT that need to be generated is therefore N ev × N rep × N nn, and in addition the training needs a balanced SM sample composed of N ev × N rep events. For example, in the case of hZ production the total number of SMEFT events to be generated is 10^5 × 50 × 30 = 1.5 × 10^8.
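The event and training bookkeeping described above amounts to simple arithmetic, sketched here with the hZ settings quoted in the text:

```python
def training_budget(n_rep, n_nn, n_ev):
    """Total number of network trainings and Monte Carlo events implied
    by the replica strategy described above."""
    trainings = n_rep * n_nn             # independent fits to run (parallelizable)
    eft_events = n_ev * n_rep * n_nn     # SMEFT events across all trainings
    sm_events = n_ev * n_rep             # one balanced SM sample per replica
    return trainings, eft_events, sm_events

# hZ production settings quoted in the text: N_rep = 50, N_nn = 30, N_ev = 10^5
trainings, eft_events, sm_events = training_budget(50, 30, 10**5)
# trainings = 1500, eft_events = 1.5e8, sm_events = 5e6
```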

EFT constraints from unbinned multivariate observables
We now present the constraints on the EFT parameter space provided by the unbinned observables constructed in Sect. 3, in comparison with those provided by their binned counterparts. We study the dependence of these results on the choice of binning and on the kinematic features. We also quantify how much the EFT constraints are modified when restricting the analysis to linear O(Λ^{-2}) effects as compared to when the quadratic O(Λ^{-4}) contributions are also included.
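For reference, the linear versus quadratic truncation of the EFT cross-section ratio can be sketched as follows. The expansion coefficients below are illustrative numbers chosen for the example, not fitted values:

```python
import numpy as np

def sigma_ratio(c, lin, quad=None):
    """Ratio sigma(c)/sigma_SM for Wilson coefficients c, truncated at
    O(Lambda^-2) if quad is None, or at O(Lambda^-4) otherwise.
    lin (length n) and quad (n x n, symmetric) are illustrative inputs."""
    c = np.asarray(c, dtype=float)
    r = 1.0 + lin @ c                # linear (interference) terms
    if quad is not None:
        r += c @ quad @ c            # quadratic terms, including cross terms
    return r

# Illustrative numbers only: linear truncation vs. adding the quadratic piece
lin = np.array([0.2, -0.1])
quad = np.array([[0.05, 0.01], [0.01, 0.08]])
c = np.array([1.0, 1.0])
r_lin  = sigma_ratio(c, lin)         # 1 + 0.2 - 0.1 = 1.1
r_quad = sigma_ratio(c, lin, quad)   # adds 0.05 + 0.01 + 0.01 + 0.08 = 0.15 -> 1.25
```

With n coefficients there are n linear and n(n + 1)/2 independent quadratic terms, which illustrates why the number of terms to parametrize, and hence of networks to train, grows quadratically with the number of coefficients.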
First, we describe the method adopted to infer the posterior distributions of the EFT parameters for a given observable, either binned or unbinned. Second, we present results for parton-level inclusive top-quark pair production, described in Sect. 4.2; the motivation is to validate our machine learning methodology by comparing it with the results of parameter inference based on the analytical calculation of the likelihood. Then we consider the same process at the particle level in the dilepton final state (Sect. 4.3), and quantify the information gain resulting from unbinned observables and its dependence on the choice of kinematic features used in the training. This is followed by the corresponding analysis for hZ production (Sect. 4.4). Finally, we discuss the impact of the methodological uncertainties introduced in Sect. 3.3 on the constraints we obtain on the EFT parameter space. The results presented here can be reproduced and extended to other processes by means of the ML4EFT framework, summarized in App. A.

EFT parameter inference
For each of the LHC processes considered in Sect. 4, Monte Carlo samples in the SM and the SMEFT are generated in order to train the decision boundary g(x, c) from the minimization of the cross-entropy loss function Eq. (3.8); see Tables 4.2 and 4.4 for the pseudo-data generation settings. The outcome of the neural network training is a parametrization of the cross-section ratio r̂ σ (x, c), Eq. (3.19), which in the limit of large statistics and perfect training reproduces the true result r σ (x, c), Eq. (3.4). To account for finite-sample effects and the finite flexibility of the networks, we use the Monte Carlo replica method described in Sect. 3.3 to estimate the associated methodological uncertainties. Therefore, the actual requirement that defines a satisfactory neural network parametrization r̂ σ is that it reproduces the exact result r σ within the 68% CL replica uncertainties evaluated from the ensemble Eq. (3.23). From the exact result for the cross-section ratio r σ (x, c), or alternatively its ML representation r̂ σ (x, c), one can carry out inference on the EFT Wilson coefficients by means of the profile likelihood ratio Eq. (2.22) applied to the N ev generated pseudo-data events. We emphasize that the N ev events entering Eq. (2.22), or alternatively Eq. (3.6), are the physical events expected after acceptance and selection cuts for a given integrated luminosity L int, see also the discussion in Sect. 4.1. The Monte Carlo events used to train the ML classifier are instead different: in general, the classifier is trained on a larger event sample than the one expected for the actual measurement. Once the machine learning parametrization of the cross-section ratio, Eq. (3.19), has been determined alongside its replica uncertainties, the profile likelihood ratio Eq. (2.22) can be used to infer the posterior probability distribution in the EFT parameter space and thus determine confidence level intervals associated to the unbinned observables.
The same method is applied to binned likelihoods and eventually to extended likelihoods which combine the information provided by binned and unbinned observables, Eq. (2.24).
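The classifier-based construction of the likelihood ratio can be illustrated with a one-dimensional toy example: a logistic model g(x) trained on the cross-entropy loss recovers the density ratio via r(x) = g/(1 − g). This is a minimal stand-alone sketch, not the neural-network parametrization used in ML4EFT:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for the classification task described above: events drawn
# from an "EFT" and an "SM" distribution (here two unit Gaussians),
# labelled y = 1 and y = 0 respectively.
x1 = rng.normal(1.0, 1.0, 20_000)   # "EFT" sample
x0 = rng.normal(0.0, 1.0, 20_000)   # "SM" sample
x = np.concatenate([x1, x0])
y = np.concatenate([np.ones_like(x1), np.zeros_like(x0)])

# Logistic model g(x) = sigmoid(a*x + b) trained by full-batch gradient
# descent on the cross-entropy loss (a neural network would add flexibility).
a, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    g = 1.0 / (1.0 + np.exp(-(a * x + b)))
    grad_a, grad_b = np.mean((g - y) * x), np.mean(g - y)
    a, b = a - lr * grad_a, b - lr * grad_b

# The trained decision function encodes the density ratio r(x) = g / (1 - g).
# For these two unit Gaussians the exact log-ratio is x - 1/2, so the fit
# should approach a = 1, b = -1/2 up to sampling fluctuations.
```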
In this work, the EFT parameter inference based on the machine learning parametrization of the PLR is carried out by means of the Nested Sampling algorithm, in particular via the MultiNest implementation [80]. We recall that this is the baseline sampling algorithm used in the SMEFiT analysis [15] to determine the posterior distributions in an EFT parameter space composed of n eft = 36 independent Wilson coefficients. The choice of MultiNest is motivated by its previous applications to EFT inference problems of comparable complexity to those relevant here, and by the goal of facilitating the integration of unbinned observables into global EFT fits. The posterior distributions provided by MultiNest are represented by N s samples in the EFT parameter space, where N s is determined by requiring a given accuracy in the sampling procedure. From this finite set of samples, contours of the full posterior can be reconstructed via the kernel density estimation (KDE) method, also used e.g. in the context of Monte Carlo PDF fits [81,82].
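The reconstruction of posterior contours from a finite set of samples can be sketched with scipy's Gaussian KDE. The correlated two-dimensional Gaussian below is a stand-in for the N s samples returned by a nested-sampling run:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Stand-in for the N_s posterior samples from nested sampling: here a
# correlated 2D Gaussian in a hypothetical (c1, c2) Wilson-coefficient plane.
n_s = 5000
cov = [[1.0, 0.6], [0.6, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0], cov, size=n_s)

# Kernel density estimate of the posterior; contours of `density` evaluated
# on a grid can then be drawn at the levels enclosing e.g. 68% and 95%.
kde = gaussian_kde(samples.T)
grid = np.mgrid[-3:3:61j, -3:3:61j]          # (2, 61, 61) evaluation grid
density = kde(grid.reshape(2, -1)).reshape(61, 61)
```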
Methodological uncertainties associated to the N rep replicas are propagated to the constraints on the EFT parameter space in two complementary ways. Firstly, after evaluating the neural networks on the N ev events entering the inference procedure, we calculate the median of each NN_i^{(j)}(x) and NN_i^{(j,k)}(x) in Eq. (3.23). The corresponding median profile likelihood ratio is then calculated and used in the Nested Sampling algorithm, from which contours of the posterior distribution are obtained. The results shown in the following Sects. 5.2, 5.3 and 5.4 have been produced using this method. Alternatively, one can assess the impact of the 2σ methodological uncertainties on these contours. We do so by defining N rep profile likelihood ratios, one from each of the N rep replicas, and performing Nested Sampling for each, resulting in N rep sets of N s samples from the EFT posterior distribution. The samples are then combined, and the KDE method is used to determine contours of the posterior distribution. In Sect. 5.5 we assess the differences observed between these two methodologies.

Top-quark pair production: parton level results
We consider first binned observables based on the double-differential (m tt, y tt) distribution, together with the corresponding observable based on a single feature, the invariant mass m tt, obtained by marginalizing the double-differential binning over the rapidity y tt. The contours derived from these two binned observables in Fig. 5.1 are compared with the corresponding unbinned observables, constructed either with the analytic likelihood calculation ("Unbinned exact") or with its machine learning parametrization ("Unbinned ML"). For the two-feature observable in the quadratic case, shown in the upper right panel of Fig. 5.1, we observe how the 95% CL contours shrink first when moving from a coarser to a finer binning, and then further when using the unbinned observable.
Good agreement is found between the inference carried out with the exact likelihood calculation and with its ML interpolation, with the small remaining differences covered by the spread of the replicas as discussed below. We observe that the constraints on the chromomagnetic operator c tG are similar for the various observables considered, as expected, since it is mostly determined from the absolute normalization of the tt fiducial cross-section and hence adding kinematic information on the shape of the distributions does not provide significant extra information. The main improvement is seen in the constraints on the two-light-two-heavy operator c (8) tu, which benefits from exploiting the kinematic information.
By comparing with the corresponding analysis based on the single-feature m tt observable (bottom right panel) one finds a moderate worsening of the obtained bounds. No major differences are observed, however, indicating that in this case the addition of y tt as a kinematic feature does not significantly modify the inference results. We have verified that good agreement is found between the exact likelihood calculation and the binned approach when a sufficiently fine binning is taken, as expected, since in the limit of infinitely fine bins the binned approach reproduces the analytical calculation exactly. This is illustrated by Binning 3, given by m tt ∈ [1.45, 1.5, 1.55, 1.6, 1.7, 1.8, . . . , 4.8, 4.9, 5.0, ∞) TeV.
We have also verified that in this relatively simple scenario (a 2 → 2 process) the addition of a third independent kinematic feature such as p t T to the unbinned observable does not affect the outcome. As we show next, for a more realistic particle-level configuration it is instead clearly beneficial to increase the number of kinematic features considered in the EFT parameter inference.
In the case of the linear EFT results, displayed in the left panels of Fig. 5.1, we note that there is a strong correlation between the two coefficients for the coarse binning, indicating a quasi-flat direction. This is removed first by using a finer binning and then by the unbinned observable, which in this case provides only a moderate improvement as compared to the finer binning. As in the quadratic analysis, adding y tt as a second feature does not significantly modify the results relative to the analysis of the single feature m tt.

Top-quark pair production: particle level results
We assess next the bounds on the EFT parameter space obtained from unbinned observables in the case of particle-level top-quark pair production, specifically in the dilepton final state described in Sect. 4.3, pp → tt → bℓ+νℓ bℓ−νℓ. As discussed there, this process is most sensitive to the n eft = 8 operators with the highest Fisher information in inclusive tt production, as listed in Table 4.1. While for the quadratic EFT analysis it is possible to derive the posterior distribution associated to the full set of n eft = 8 operators, at the linear level there are quasi-flat directions that destabilize the ML training of the likelihood ratio and the subsequent parameter inference. For this reason, in the linear EFT analysis of this process we consider a subset of n eft = 5 operators that can be simultaneously constrained from inclusive top-quark pair production [83], given by c tG, c qt, c Qu, c (1,8) Qq, and c td.
Figure 5.1. The 95% CL contours in the (c tG, c (8) tu) plane obtained for the parton-level top quark pair production process described in Sect. 4.2. We present results based on EFT calculations at the linear level (left) and also including quadratic contributions (right panels). The black cross indicates the SM result, which is the hypothesis assumed in the generation of the pseudo-data that enters the inference procedure. We compare the results from two binned analyses (Binning 1 and 2) with those of the corresponding unbinned observable, constructed either with the analytic likelihood calculation or with its machine learning parametrization. We demonstrate how the analytical calculation is reproduced for a sufficiently fine binning (Binning 3). We display the contours corresponding to two different observables, the first built upon two kinematic features (m tt, y tt) (upper) and the second based on the invariant mass m tt as a single feature (lower panels).

The impact on the kinematic distributions once quadratic EFT corrections are accounted for is displayed in Fig. 4.7, showing how different features provide complementary information to constrain the EFT, and thus making a fully-fledged multivariate analysis both interesting and necessary in order to fully capture the EFT effects. For example, the pseudo-rapidity distributions bend around η = 0, corresponding to the direction transverse to the beam pipe, while the transverse momenta are sensitive to energy-growing effects in the high-p T tails of the distributions.

Figure 5.2. Pair-wise 95% CL contours for the Wilson coefficients entering top quark pair production in the dilepton final state, see Sect. 4.3 for more details. These contours are obtained by marginalizing over the full posterior distribution provided by Nested Sampling. We consider here n eft = 5 Wilson coefficients that can be simultaneously constrained from inclusive top-quark pair production at the linear level in the EFT expansion. We compare the results obtained from binned and unbinned ML observables constructed on the (p ℓl T, η ℓ) kinematic features. As in Fig. 5.1, the black cross indicates the SM values used to generate the pseudo-data that enters the inference.

The comparison of the unbinned ML observable trained on (p ℓl T, η ℓ) with its counterpart trained on the full set of n k = 18 kinematic features is displayed in Fig. 5.3. As for the parton-level case, a cut in the invariant mass m tt is applied to ensure that Eq. (4.3) is satisfied and that statistical uncertainties dominate. Note that for all observables the pseudo-dataset used for the EFT parameter inference is the same and was not used during training.
From the comparisons in Fig. 5.2 one can observe how, for all observables and all operator pairings, the 95% CL intervals include the SM values used to generate the pseudo-data. The bounds obtained from the binned and unbinned observables are in general similar and compatible, and no major improvement is obtained with the adoption of the latter in this case. However, here we consider only n k = 2 kinematic features and hence ignore the potentially useful information contained in other variables that can be constructed from the final-state kinematics. In order to assess their impact, Fig. 5.3 displays the same pair-wise marginalized comparison, now between ML unbinned observables with only p ℓl T and η ℓ as input features and with the full set of n k = 18 kinematic variables displayed in Fig. 4.7. We now find a very significant change in the bounds on the EFT parameter space, which improve by an order of magnitude or more in all cases. Again, the 95% CL contours include the SM hypothesis used to generate the pseudo-data. Furthermore, the inclusion of the full set of kinematic features reduces the correlations between operator pairs that arise when only p ℓl T and η ℓ are considered, indicating a breaking of degeneracies in parameter space. This result indicates that a multivariate analysis improves the information on the EFT parameter space that can be extracted from this process as compared to that obtained from a subset of kinematic features. The size of this improvement may however be different in the context of a global analysis. For instance, the two-light-two-heavy operators entering tt production at the linear level are highly correlated, which means that adopting a multivariate analysis leads to a sizable effect partly because it breaks degeneracies in the parameter space. Hence, the actual impact of unbinned multivariate observables for EFT analyses depends on which other datasets and processes are considered and can only be assessed on a case-by-case basis.
In this respect, beyond the implications for the specific processes considered in this work, what our framework provides is a robust method to quantify the information on the EFT parameters provided by different types of observables constructed on exactly the same dataset: binned versus unbinned, different choices of binnings, and different numbers and types of kinematic features. As is well known, in top quark pair production quadratic EFT corrections are important for most operators, in particular due to energy-growing effects in the tails of distributions. These sizable effects are highlighted by the distortions with respect to the SM distributions induced by quadratic EFT effects, shown in Fig. 4.7, in the kinematic features used to train the ML classifier for this process. In order to investigate how the results based on linear EFT calculations vary once quadratic corrections are considered, in Figs. 5.4 and 5.5 we present the analogous comparisons to Figs. 5.2 and 5.3 respectively in the case of EFT calculations that also include the quadratic corrections. We display the results for pair-wise contours obtained from the marginalization of the posterior distribution in the space of the full set of n eft = 8 Wilson coefficients, given that once quadratic corrections are accounted for it becomes possible to constrain the full set of relevant operators simultaneously. We also include in Fig. 5.4 the results obtained from the unbinned ML analysis based on p ℓl T as the single input feature, while Fig. 5.5 also displays the two-feature binned contours as reference. We note that once quadratic effects are accounted for the 95% CL contours will in general not be elliptic, and may even be composed of disjoint regions in the case of degenerate maxima.
Comparing the constraints provided by the binned (p ℓl T, η ℓ) observable in Fig. 5.4 with those in Fig. 5.2, one observes how the bounds on the EFT coefficients improve when accounting for the quadratic EFT corrections. This improvement is consistent with the large quadratic corrections to the kinematic distributions entering the likelihood ratio of this process, which lead to an enhanced sensitivity, and with previous studies [15,77] within global SMEFT analyses. From the results of Fig. 5.4 we find that for all operator pairs considered, the most stringent bounds are provided by the unbinned observable built upon the (p ℓl T, η ℓ) pair of kinematic features. Furthermore, one can identify which factor brings in more information: using the full event-by-event kinematics for a given set of features, or increasing the number of kinematic features being considered. For some operator combinations, the dominant effect turns out to be that of adding a second kinematic feature: for instance, in the (c (8) tu, c (8) td) plane the binned and unbinned observables coincide when using the (p ℓl T, η ℓ) features, while the unbinned p ℓl T observable results in larger uncertainties. For other cases, it is instead the information provided by the event-by-event kinematics which dominates: for example, in the (c (3,8) Qq, c (8) Qu) plane the constraints from the binned (p ℓl T, η ℓ) observable are clearly looser than those from its unbinned counterpart.
The comparison between the constraints on the EFT parameter space provided by the unbinned observable based on the two kinematic features (p ℓl T, η ℓ) and those provided by its counterpart based on the full set of n k = 18 kinematic features is displayed in Fig. 5.5. It confirms that also at the quadratic level there is a marked gain in constraining power from increasing the number of kinematic variables considered in the analysis. For reference, the plot also displays the bounds obtained with the two-feature binned observable for the same process. Interestingly, when considering the full set of kinematic features, one obtains posterior contours which display a reduced operator correlation, highlighting how the unbinned multivariate observable is especially effective at removing (quasi-)degeneracies among the EFT coefficients. As in the linear case, the improvement in the bounds obtained with the multivariate observable can reach up to an order of magnitude as compared to the two-feature unbinned analysis.
One can compare the bounds obtained in the present study with those from the corresponding SMEFiT global analysis of LHC data. For the same EFT settings, the 95% CL marginalized bound on the chromomagnetic operator c tG is [0.062, 0.24] in SMEFiT, to be compared with a bound of ∆c tG ≃ 0.4 obtained from the binned and unbinned (p ℓl T, η ℓ) observables and of ∆c tG ≃ 0.1 from the multivariate unbinned observable trained on all features. While there are too many differences in the input dataset and other settings to allow a fully consistent comparison, this initial estimate suggests that unbinned multivariate measurements based on Run III data could provide competitive constraints on the EFT parameter space as compared to the available binned observables.
All in all, we find that for inclusive top quark pair production, deploying unbinned multivariate observables makes it possible to tap into a source of information on the EFT parameter space which is not fully exploited in binned observables, both at the level of linear and quadratic EFT analyses, and that increasing the number of kinematic features considered in the likelihood parametrization enhances the constraining power of the measurement.

Higgs associated production with vector bosons
We consider next the impact of unbinned multivariate observables in Higgs associated production with a Z-boson in the pp → hZ → bbℓ+ℓ− channel. The pseudo-data for this process has been generated with the settings and fiducial cuts described in Sect. 4.4, and the list of operators constrained is defined in Table 4.3. As for the case of tt production, we compare the bounds provided by binned and unbinned observables at the level of linear and quadratic EFT calculations, and study the dependence of the results on the number of kinematic features used to parametrize the likelihood ratio.

Figure 5.6. Pair-wise 95% CL contours obtained from two-parameter fits in the case of pp → hZ → bbℓ+ℓ− at the linear level in the EFT expansion, for the n eft = 7 Wilson coefficients relevant for the description of this process. For each panel, the contribution from the operators not displayed is set to zero. We compare the bounds obtained from a binned p Z T distribution with those from two ML unbinned observables, first when only p Z T is used for the training and then when n k = 7 kinematic features are used to parametrize the likelihood ratio.

Fig. 5.6 displays the pair-wise 95% CL contours obtained from two-parameter fits of pp → hZ → bbℓ+ℓ− pseudo-data at the linear level in the EFT expansion. In these two-parameter fits, for each panel the contribution from the operators not being shown is set to zero. The choice of displaying the outcome of two-parameter fits, rather than a full marginalized analysis, is motivated by the fact that the linear EFT analysis is not stable when all coefficients are simultaneously considered, due to quasi-flat directions and strong operator correlations. This restriction can be lifted once we account for quadratic effects.
We show the bounds obtained from a binned p Z T distribution, with p Z T being the dilepton transverse momentum, where the choice of binning is consistent with the definitions entering the Simplified Template Cross Sections (STXS) [84] adopted by the LHC Higgs Working Group. These binned bounds are compared with those from two ML unbinned observables, first when only the dilepton transverse momentum p Z T is used for the training and second when the full set of n k = 7 kinematic features is used to parametrize the likelihood ratio. As in the case of top-quark production, comparing the constraints from the binned observable with those from the unbinned one trained on the same feature quantifies the information loss incurred by the binning procedure, while comparing the single-feature with the multivariate unbinned observable determines the information loss associated to the use of a restricted set of kinematic features. At this linear EFT level, the improvement in the constraints provided by unbinned observables as compared to their binned p Z T counterparts is moderate, yet appreciable, for most operator pairs. The most stringent bounds are those associated with unbinned observables, and the constraints provided by the multivariate unbinned observable based on 7 kinematic features are similar to those associated to the single-feature version trained on p Z T. One observes strong correlations between all operator pairs being considered, indicating quasi-flat directions: these will be broken either by quadratic corrections or by adding other processes sensitive to the same SMEFT operators, such as vector-boson scattering and diboson data [16]. The analysis of Fig. 5.6 indicates that for the hZ → bbℓ+ℓ− process the chosen binning in p Z T provides the bulk of the information on the EFT operators, with minor additional improvements provided by the unbinned multivariate analysis.
Therefore, for this specific process the formalism developed in this work provides a diagnosis tool that determines when and under which circumstances a binned analysis provides information on the EFT parameter space close to the optimal one, represented here by the bounds associated to the unbinned multivariate observable. This finding illustrates the two-fold applicability of our formalism: whenever unbinned multivariate observables markedly improve over their binned counterparts, they can be included in the global SMEFT fit, otherwise one obtains quantitative confirmation that the information provided by the specific settings of the binned observables is sufficiently close to the optimal amount.
The results of the corresponding analyses of the impact of unbinned observables in the $hZ \to b\bar{b}\ell^+\ell^-$ process in the presence of quadratic EFT corrections are shown in Fig. 5.7 for the case of two-parameter fits, and in Fig. 5.8 for the case where all relevant operators are considered simultaneously, producing marginalized constraints on each operator pair. As mentioned above, the latter is possible since quadratic corrections break the EFT parameter degeneracies that destabilize the inference procedure in the linear case.
Considering first the results of Fig. 5.7, one finds that, on the one hand, the improvement associated to the use of unbinned observables is more marked as compared to the linear EFT case. On the other hand, the addition of other kinematic features on top of $p_T^Z$ does not modify the discrimination power of the unbinned observables, as also observed for the linear analysis. For specific operator pairs there is a clear reduction of the parameter bounds based on the unbinned observables, for instance for the $(c_{\varphi W}, c_{\varphi WB})$ pair. We note that the enhanced discrimination in the presence of quadratic EFT corrections can be partially attributed to energy-growing effects in the $p_T^Z$ distribution, which are more pronounced as compared to the linear level.
Another benefit associated to unbinned observables in this specific process is the breaking of degeneracies that lead to a double-maximum structure in the posterior distribution. Indeed, the unbinned observables remove the second minimum present in the binned analysis, which arises from a bimodal distribution in the $(c_{\varphi q}^{(3)}, c_{b\varphi})$ plane and corresponds to the scenario where the quadratic EFT contribution becomes of the same magnitude and opposite sign as the linear term, hence cancelling it. We also note that, as opposed to the $t\bar{t}$ production case, for the $hZ$ process one ends up, even with unbinned observables, with some operator pairs that exhibit relatively large correlations, which would thus only be broken by adding other processes to the fit. The results of Fig. 5.7 highlight how, in general, the relevance of considering unbinned observables depends not only on the process and on the order in the EFT expansion, but also on the specific directions in the EFT parameter space in which one is interested.
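The cancellation mechanism behind this degenerate minimum can be illustrated with a short numerical sketch; the coefficients a and b below are purely illustrative and not taken from the actual hZ analysis:

```python
# Toy cross section with linear and quadratic EFT corrections,
# sigma(c)/sigma_SM = 1 + a*c + b*c^2, with illustrative coefficients.
a, b = -2.0, 1.0

def sigma_ratio(c: float) -> float:
    return 1.0 + a * c + b * c**2

# The linear and quadratic terms cancel for c = -a/b, so a rate-level
# measurement cannot distinguish this point from the SM point c = 0,
# producing the double-minimum structure that unbinned observables break.
c_degenerate = -a / b
assert sigma_ratio(c_degenerate) == sigma_ratio(0.0) == 1.0
print(c_degenerate)  # 2.0
```

Kinematic shape information, as exploited by the unbinned observables, lifts this degeneracy because the two parameter points predict the same rate but different differential distributions.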
The results of Fig. 5.7 display good agreement between unbinned observables trained only on $p_T^Z$ and on the full set of $n_k = 7$ kinematic features, suggesting that a multivariate analysis does not appear to be beneficial for this process. However, this interpretation turns out to be an artifact of restricting the analysis to two-parameter fits, and a different picture emerges in the marginalized analysis displayed in Fig. 5.8. The impact of adopting unbinned observables on the EFT parameter space is more important now, leading in some cases to bounds improved by up to a factor of 5 as compared to the binned results, for instance for the $c_{\varphi W}$ operator. Furthermore, one can also appreciate the improved constraints associated to the multivariate observable based on the full set of kinematic features as compared to the single-feature unbinned observable. We also note that the degenerate minimum associated to the bottom Yukawa operator, present in the binned analysis for all operator pairs, goes away when unbinned observables are considered. The comparison of Figs. 5.7 and 5.8 emphasizes that bounds obtained in two-parameter fits, neglecting the effects of other operators, will in general overestimate the EFT constraints associated to a given dataset.
To summarize, as in the case of top-quark pair production, for $hZ$ production the use of unbinned multivariate observables is advantageous in constraining the EFT parameter space. Our analysis indicates that the choice of $p_T^Z$ binning adopted by the STXS analysis is close to optimal, especially in the linear case, at least for the assumed integrated luminosity and for two-parameter fits. When considering a full marginalized analysis in the presence of quadratic EFT corrections, marked improvements are obtained, associated first to the use of unbinned observables and second to that of multiple kinematic features. The use of unbinned observables also results in the removal of degenerate minima corresponding to solutions where the quadratic EFT correction approximately cancels the linear one. The comparison between the output of the two-parameter and the marginalized fits emphasizes the crucial importance of carrying out global SMEFT interpretations, since setting to zero a subset of operators that contribute to a given process is likely to overestimate its impact on the EFT parameter space.

Methodological uncertainties
The results for the unbinned observables derived so far are obtained from EFT parameter inference carried out with Eq. (3.19), the parametrization of the likelihood ratio, as described in Sect. 5.1. As explained in Sect. 3.3, within our approach, rather than a single best model we produce a distribution of models, denoted as replicas, each of them trained on a different set of Monte Carlo events. Hence the end result of our procedure is Eq. (3.23), the representation of the probability distribution of the likelihood ratio $\{r_\sigma(x, c)\}$ used to determine the bounds in the EFT parameter space. Here we assess the impact of methodological uncertainties by displaying results for parameter inference based on the full distribution of replicas of the likelihood ratio parametrization.
Figs. 5.9 and 5.10 display the bounds in the parameter space obtained for particle-level top-quark pair and $hZ$ associated production, respectively, from the unbinned multivariate observables based on the complete set of kinematic features and in the case of theory simulations that include quadratic EFT corrections. We compare results for parameter inference based on the median of the replica distribution $\{r_\sigma(x, c)\}$ with those based on the individual replicas, now for the 95% CL contours. These comparisons confirm that the procedural uncertainties, as estimated with the replica method, do not modify the qualitative results obtained from the median of the likelihood ratio. In particular, the observation that bounds obtained using multivariate unbinned observables are much stronger than those based on models trained on one or two kinematic features, which is common to both the $t\bar{t}$ and $hZ$ processes, remains valid once replica uncertainties are accounted for. Furthermore, we note that in principle it is possible to reduce the spread of the replica distribution by training on higher-statistics samples and by adopting more stringent stopping criteria. In this respect, our approach provides a strategy to quantify under which conditions the computational overhead required to achieve more accurate machine learning trainings is justified from the point of view of the impact on the EFT coefficients. This said, these results also indicate that in realistic scenarios based on finite training samples methodological uncertainties cannot be neglected, and should be accounted for in studies of the impact of ML-based observables in EFT analyses.
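The replica-based propagation of methodological uncertainties described above can be sketched as follows; a Gaussian toy stands in for the ensemble of trained likelihood-ratio networks, and all numbers are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Emulate an ensemble of n_rep likelihood-ratio replicas: each replica
# yields a (toy) log-likelihood curvature for a single Wilson coefficient.
n_rep = 100
curvatures = rng.normal(loc=4.0, scale=0.4, size=n_rep)

# Per-replica 95% CL half-width of the bound: |c| < 1.96 / sqrt(curvature)
bounds = 1.96 / np.sqrt(curvatures)

# Central result from the median; the spread quantifies the methodological
# uncertainty, which shrinks with higher-statistics training samples.
median_bound = np.median(bounds)
spread = np.percentile(bounds, 84) - np.percentile(bounds, 16)
print(median_bound, spread)
```

In the actual framework each "curvature" is replaced by the full likelihood-ratio parametrization of a trained replica, but the median-and-spread logic is the same.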

Summary and outlook
The general problem of identifying novel types of measurements which, given a theoretical framework, provide enhanced or even maximal sensitivity to the parameters of interest is ubiquitous in modern particle physics. Constructing such optimized observables has a two-fold motivation: on the one hand, to achieve the most stringent constraints on the model parameters from a specific process, and on the other hand, to provide bounds on the maximum amount of information that can be extracted from the same process.
Optimized observables can hence be used to design traditional observables in a way that approaches or saturates the limiting sensitivity by highlighting the best choices of binning and kinematic features.
In this work, we have presented a new framework for the design of optimal observables for EFT applications at the LHC, making use of machine learning techniques to parametrize multivariate likelihoods for an arbitrary number of higher-dimensional operators. To illustrate the reach of our method, we have constructed multivariate unbinned observables for top-quark pair production and Higgs production in association with a Z boson. We have demonstrated how these observables either lead to a significant improvement in the constraints on the EFT parameter space, or indicate the conditions upon which a traditional binned analysis already saturates the EFT sensitivity.
The ML4EFT framework presented in this work and the accompanying simulated event samples are made available as an open source code which can be interfaced with existing global EFT fitting tools such as SMEFiT, FitMaker [17], HepFit [14], and Sfitter [83]. Its scaling behavior with the number of parameters, together with its parallelization capabilities, makes it well-suited for integration within global EFT fits involving several tens of independent Wilson coefficients. While in its current implementation the parametrized likelihood ratios are provided in terms of the output of the trained network replicas, work in progress aims to tabulate this output in terms of fast interpolation grids, as is customarily done in PDF analyses [85][86][87][88]. The resulting interpolated unbinned likelihood ratios can then be combined with Gaussian or Poissonian binned measurements within a global fit. While currently ML4EFT can only be used in combination with simulated Monte Carlo pseudo-data, all of the ingredients required for the analysis of actual LHC measurements are already in place.
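The quadratic scaling with the number of parameters follows from the standard quadratic EFT parametrization of the likelihood ratio, $r(x, c) = 1 + \sum_i c_i\, r_i(x) + \sum_{i \le j} c_i c_j\, r_{ij}(x)$, which requires one network per linear term and one per quadratic (cross) term. A minimal counting sketch (the exact bookkeeping in ML4EFT may differ in detail):

```python
def n_networks(n_coeffs: int) -> int:
    """Networks needed for a quadratic likelihood-ratio parametrization:
    one per linear term c_i plus one per quadratic term c_i*c_j, i <= j."""
    return n_coeffs + n_coeffs * (n_coeffs + 1) // 2

# Quadratic growth with the number of Wilson coefficients; since every
# network is trained independently, the trainings parallelize trivially.
for n in (2, 8, 20, 50):
    print(n, n_networks(n))  # 2->5, 8->44, 20->230, 50->1325
```

The trivial parallelizability, rather than the raw count, is what keeps the approach tractable for fits with tens of coefficients.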
The results presented in this work could be extended along several directions. First, one could include experimental and theoretical correlated systematic uncertainties, as required for the interpretation of measurements which are not dominated by statistics. Second, other machine learning algorithms could be adopted which may offer performance advantages as compared to those used here: one possibility could be graph neural networks [89,90], which make it possible to vary the kinematic features used for the training on an event-by-event basis. Third, ML4EFT could be applied to more realistic final states including higher-order corrections, e.g. by means of NNLO+PS simulations of $t\bar{t}$ and $hV$ production [91][92][93], and accounting for detector effects to bridge the gap between the theory predictions and the experimentally accessible quantities. It would also be interesting to compare the performance of different approaches to construct ML-assisted optimized observables for EFT applications in specific benchmark scenarios.
Beyond hadron colliders, the framework developed here could also be relevant to construct optimal observables for EFT analyses at high-energy lepton colliders, such as electron-positron collisions at CLIC [94,95] or a multi-TeV muon collider [96][97][98]. As mentioned, statistically optimal observables were actually first designed for electron-positron colliders, where the simpler final state facilitates the calculation of the exact event likelihood for parameter inference [21][22][23]. These methods may not be suitable for the high-multiplicity environment of multi-TeV lepton colliders, in particular due to the complex pattern of electroweak radiation. Therefore, the ML4EFT method could provide a suitable alternative to construct unbinned multivariate observables achieving maximal EFT sensitivity at high-energy lepton colliders.
In addition to applications to the SMEFT, our framework could also be relevant to other types of theory interpretations of collider data such as global PDF fits. In particular, PDFs at large values of Bjorken-x are poorly constrained [99] due to the limited amount of experimental data available. This lack of knowledge degrades the reach of searches for both resonant and non-resonant new physics in the high-energy tail of differential distributions, as recently emphasized for the case of the forward-backward asymmetry in neutral-current Drell-Yan production [100]. Given that measurements of these high-energy tails are often dominated by statistical uncertainties, the ML4EFT method could be used to construct unbinned multivariate observables tailored to constrain large-x PDFs at the LHC. Our method could also be applied in the context of a joint extraction of PDFs and EFT coefficients [101][102][103][104][105], required to disentangle QCD effects from BSM ones in kinematic regions where they potentially overlap, such as large Bjorken-x.

B The unbinned Asimov data set
The Asimov data set was first introduced in [70] as an efficient method of obtaining the distribution of the profile likelihood ratio $q_c$, Eq. (2.8), under the alternative hypothesis $c' \neq c$ without having to resort to computationally expensive MC simulations. Since then, it has become part of the standard analysis toolkit at the LHC to determine e.g. expected exclusion limits. The analysis of [70] considers the Asimov data set applied to binned observables; here, by adopting a simple EFT-like toy model, we present its generalisation to a continuous Asimov data set. This makes it possible to determine expected exclusion limits in a sampling-free way by extending the results of [70] to unbinned observables. While we do not make use of this unbinned Asimov data set in the results presented in this work, this derivation can play an important role in the general discussion of unbinned multivariate measurements for EFT studies.
Let us consider a kinematic variable $x$ (e.g. the invariant mass) restricted to $x \in X = [1, 5]$ and distributed according to a probability distribution $f_\sigma(x, c)$. The EFT toy model is chosen such that the effect of the EFT on the kinematic variable is an enhancement in the tail of the distribution $f_\sigma(x, c)$ with respect to the SM, as is common in many EFT scenarios. The number of expected events $\nu(c)$ is determined from the fiducial cross section $\sigma(X, c)$ and the luminosity $L$, which we set to $L = 1000$ unless otherwise specified; we work with dimensionless parameters throughout this toy example. In Fig. B.1a we show the distribution $f_\sigma(x, c)$ at $c = 0$ and $c = 1$, corresponding to the SM and the EFT hypotheses respectively. One can see a clear energy-growing effect in the tail of the EFT distribution. The histograms have been constructed using accept/reject sampling and contain 1M samples from each hypothesis. These will serve as our parent data sets, from which we will take subsequent samples in the following. Now, taking the EFT at $c = 1$ as our null hypothesis, the goal is to compute the expected significance at which we can either accept or reject the null assuming data generated under the SM (the alternative hypothesis). We adopt the PLR from Eq. (2.8) as our test statistic, whose distributions under the SM and the EFT we denote by $\mathrm{pdf}(q_c|0)$ and $\mathrm{pdf}(q_c|c)$ respectively. One can imagine repeatedly drawing toy datasets from the SM parent data set, computing $q_c$, and evaluating its $p$-value under the null. The median $p$-value then serves as a measure of expected sensitivity. Moreover, since the $p$-value is monotonic in $q_c$, it is equivalent to report the $p$-value associated with the median of $q_c$ under the SM, denoted $\mathrm{Med}[q_c|0]$. From the above discussion it becomes clear that one not only needs $\mathrm{pdf}(q_c|c)$, which follows a $\chi^2$ distribution as discussed in Sec. 2.1, but also $\mathrm{pdf}(q_c|c')$ with $c' \neq c$.
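The accept/reject construction of the parent data sets can be sketched as follows; the density below is a hypothetical stand-in with an EFT-like tail enhancement on $x \in [1, 5]$, not the exact toy model used in this appendix:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unnormalized toy density with an EFT-like tail enhancement;
# the functional form is illustrative only.
def f_unnorm(x, c):
    return np.exp(-x) * (1.0 + c**2 * 1e-2 * x**4)

def sample(c, n, x_lo=1.0, x_hi=5.0):
    """Accept/reject sampling of f_unnorm on [x_lo, x_hi]."""
    grid = np.linspace(x_lo, x_hi, 1001)
    f_max = f_unnorm(grid, c).max()          # envelope for the rejection step
    out = []
    while len(out) < n:
        x = rng.uniform(x_lo, x_hi, size=n)
        u = rng.uniform(0.0, f_max, size=n)
        out.extend(x[u < f_unnorm(x, c)])    # keep points under the curve
    return np.array(out[:n])

sm_events = sample(c=0.0, n=10_000)   # SM hypothesis
eft_events = sample(c=1.0, n=10_000)  # EFT hypothesis, harder tail
print(sm_events.mean(), eft_events.mean())
```

The EFT sample has a larger mean of $x$, reflecting the energy-growing tail that drives the sensitivity of the toy model.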
At this point we employ Wald's theorem [106], which states that for a data set of size $N$ generated under $f_\sigma(x, c')$, the PLR $q_c$ can be approximated by $q_c \simeq q_c^{\rm wald} = (c - \hat{c})^2/\sigma^2$, where $\hat{c}$ follows a normal distribution with mean $c'$ and variance $\sigma^2$, i.e. $\hat{c} \sim \mathcal{N}(c', \sigma)$. Not only does the approximation provided by Wald's theorem rely on a large sample size $N$ [106], but it also improves as $c \to \hat{c} \simeq c'$, as can be understood by Taylor expanding Eq. (2.8) about $\hat{c}$ in powers of $\Delta \equiv (c - \hat{c})$. Recalling that $\hat{c} \sim \mathcal{N}(c', \sigma)$, we note that $q_c^{\rm wald}$ corresponds to the square of a Gaussian random variable with unit variance and non-zero mean. This is known to follow a non-central $\chi^2$ distribution [70] with non-centrality parameter $\Lambda = (c - c')^2/\sigma^2$. The problem of obtaining $\mathrm{pdf}(q_c|c')$ with $c' \neq c$ is thus reduced to determining the non-centrality parameter $\Lambda$ to be used in Eq. (B.10). One can construct this using MC toys: for example, in Fig. B.1b we show the distribution of $\hat{c}$ under the SM, obtained using 10K toy data sets containing $n_{\rm ev}$ samples each, where $n_{\rm ev}$ is drawn from $\mathrm{Pois}(\nu(0))$ each time a new experiment is run. The distribution in Fig. B.1b gives $\sigma = 0.19$, resulting in $\Lambda_{\rm MC} = 27.12$. However, although constructing $\mathrm{pdf}(\hat{c}|c')$ using MC toys gives us the information needed to extract $\Lambda$, it is computationally highly inefficient, especially if one needs to perform scans along different $c'$, e.g. to find expected exclusion limits. This is where the Asimov data set enters as an efficient tool to obtain $\Lambda$ [70]. The Asimov data set is defined such that one recovers the true value of $c$ during maximum likelihood estimation, i.e. $\hat{c} = c$. In the case of a Poissonian likelihood $L_{\rm pois}$, the maximum likelihood condition of Eq. (B.12) implies $\hat{c} = c$ only if $n_i = \nu_i(c)$.
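The relation between $\hat{c}$, $q_c^{\rm wald}$, and the non-centrality parameter can be checked numerically; the numbers below mirror those quoted in the text ($c = 1$, $c' = 0$, $\sigma = 0.19$), but the sketch itself is illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Wald approximation: for chat ~ N(c', sigma), q_c = ((c - chat)/sigma)^2
# follows a noncentral chi2 with one degree of freedom and noncentrality
# Lambda = ((c - c')/sigma)^2.
c, c_prime, sigma = 1.0, 0.0, 0.19
Lam = ((c - c_prime) / sigma) ** 2   # ~27.7, compare Lambda_MC = 27.12 in the text

chat = rng.normal(c_prime, sigma, size=200_000)
q_wald = ((c - chat) / sigma) ** 2

# For Lambda >> 1 the median of q_wald sits close to Lambda itself, since
# q_c is a monotone map of chat in the relevant region, so Med[q_c|c'] ~ Lambda.
print(np.median(q_wald), Lam)
```

The small difference between $\Lambda = 27.7$ and the quoted $\Lambda_{\rm MC} = 27.12$ reflects the finite precision of the fitted $\sigma$ in the MC-toy estimate.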
Therefore, we define the Asimov data set under the hypothesis $c$ such that the maximum likelihood estimation reproduces $c$ exactly. The resulting comparison indicates that the distribution of $q_c$ is well described by the non-central $\chi^2$ evaluated with non-centrality parameter $\Lambda_{\rm A,unbinned}$. We notice a disagreement between the distributions of $q_c^{\rm exact}$ and $q_c^{\rm wald}$, meaning that neglecting the higher-order corrections in Wald's theorem leads to sizable differences. On the right (b) it can be seen that this difference becomes smaller as $c \to c'$, i.e. $\Delta \to 0$. This is the case both at $c = 1$ and $c = 0.5$, and provides a validation of the expression for $\Lambda_{\rm A,unbinned}$ determined in Eq. (B.15). This agreement suggests that Eq. (B.15) includes some higher-order contributions that are otherwise omitted if one adopts Eq. (B.11) instead. This was also conjectured in [70].