Benchmarking simplified template cross sections in $WH$ production

Simplified template cross sections define a framework for the measurement and dissemination of kinematic information in Higgs measurements. We benchmark the currently proposed setup in an analysis of dimension-6 effective field theory operators for $WH$ production. Calculating the Fisher information allows us to quantify the sensitivity of this framework to new physics and study its dependence on phase space. New machine-learning techniques let us compare the simplified template cross section framework to the full, high-dimensional kinematic information. We show that the way in which we truncate the effective theory has a sizable impact on the definition of the optimal simplified template cross sections.


Introduction
Since the discovery of the Higgs boson at the LHC in 2012, the experimental and theoretical communities have focused intense scrutiny on the question of whether the observed particle has exactly the properties predicted by the Standard Model (SM) [1]. The interpretation of the vast number of Higgs measurements requires a theoretical framework, which should be as flexible as possible. From a theoretical perspective and given that there are no non-SM particles observed at the electroweak scale, potential deviations from the SM are best analyzed in an effective field theory (EFT) approach [2]. Such a framework allows us to systematically include both rate information and kinematic distributions in the analysis [3,4]. In this study, we assume that the Higgs boson is in an SU (2) L doublet and work in the so-called Standard Model Effective Field Theory (SMEFT) In this case deviations from the SM are parameterized in terms of a series of SU (3) c × SU (2) L × U (1) Y invariant operators and possible new physics effects are contained in the Wilson coefficients of these operators. The expansion involves a new high energy scale Λ, and we truncate the expansion with dimension-6 operators, whose effects are generically suppressed relative to SM predictions by factors of v 2 /Λ 2 or E 2 /Λ 2 .
To interpret Higgs measurements in a combined experimental and theory analysis we need to a) develop an efficient strategy to extract the maximum information from Higgs events, and b) publish this information in an efficient form. Many analyses focus on total rate measurements, but the available global fits have shown that kinematic information on new Lorentz structures provides the most powerful limits on new physics, for instance, exploiting the boosted regime of W H production [3,[5][6][7][8][9]. For the first aspect this means that total or fiducial rate measurements alone are not useful. As for the second question, the experimental collaborations could choose to publish full likelihoods [10][11][12], which would ensure the most accurate modeling of systematic uncertainties and their correlations. However, despite some recent progress this strategy has not been widely adapted. Lacking the full likelihoods, the publication of measured event counts or cross sections in all bins of a histogram including uncertainties and their correlations is desirable [13]. In reality, global fitting projects [7,[14][15][16][17][18] often rely on extracting the necessary information from plots in publications and backwards-engineering as much of the analyses as possible.
An ad-hoc solution to these issues is the use of simplified template cross sections (STXS) [19,20] which propose to measure and publish cross sections in slices of simple phase-space parameters. They have been proposed for all major Higgs production channels and are continuously developed alongside the corresponding analysis opportunities. Given the established SMEFT approach we can benchmark any proposed method of extracting and publishing kinematic event information. A particularly well-suited approach for benchmarking is based on information geometry. Its central object, the Fisher information, encodes the maximal precision with which continuous parameters of a Lagrangian can be measured, and therefore allows us to directly compare the power of single LHC observables or STXS to the power of multivariate analysis strategies [21,22]. In our case, these parameters are the Wilson coefficients of the dimension-6 SMEFT operators. Until now, a major limiting factor in this approach was that detector effects and invisible particles like neutrinos could not accurately be taken into account. This problem was recently solved with a technique that combines matrix element information and machine learning [23][24][25], automated in the MadMiner tool [26].
In this paper we benchmark the W H production process, which is well known for its leading impact in extracting SMEFT coefficients [9,27]. In Sec. 2, we first review the SMEFT framework and define the Wilson coefficients included in our study. The basics of the Fisher information approach and the MadMiner machinery [26] are described in Sec. 4. Physics results on the relationships between effective operators and kinematic regions are presented in Sec. 5, starting with the simple kinematic distributions of information. Motivated by the differences between effects arising at linear and quadratic order in the Wilson coefficients, we propose an improved STXS definition for W H production in Sec. 5.3 and benchmark the reach in EFT coefficients in Sec. 5.4. Many technical details as well as a discussion of systematic uncertainties are given in the appendices.

Effective operators
New physics beyond the Standard Model (SM) can be parameterized in terms of the effective Lagrangian The dimension-d operators O d k form a complete basis of SU (3) c ×SU (2) L ×U (1) Y invariant operators containing only SM fields [14][15][16]. This defines the effective field theory to be the SMEFT. All non-SM physics effects are contained in the Wilson coefficients C d k . We further assume that the operators conserve C, P , baryon and lepton numbers, and neglect all flavor effects [28]. With these assumptions there are 59 dimension-6 operators. We work in the context of the Warsaw basis [29], although one of the strengths of our approach is that it is straightforward to go from one basis to another. Several groups have performed global fits to determine the restrictions on SMEFT coefficients from LEP and LHC data [3,[5][6][7][8][9], with various assumptions to reduce the number of operators. In the future, such fits will need to become increasingly sophisticated as the amount of data increases. Our study is a step in the direction of optimizing the flow of information into these analyses.
For the hard partonic process we start with the experimental inputs in terms of the measured values of M W , M Z , and G F . At tree level in the SMEFT, these parameters are shifted from their SM values, which introduces a dependence on C HD , C HW B , C , and C H [2,30]. The effective couplings of the fermions to the Z and W gauge bosons are also shifted from their SM values. None of these effects change the momentum structures of the vertices and so they are of limited interest to anomalous coupling fits to the W H process at the LHC with its large QCD backgrounds. In this study, we also neglect dipole operators that do not interfere with the SM prediction (C dW , C uW , C Hu , C dH , C eW ) and also C Hud which contributes only to the CKM matrix. Finally, C HW B , C , C H are strongly limited by global fits to the Warsaw basis coefficients [7,31,32], and so we neglect the contributions of these operators. The dependence on C dH vanishes for m b = 0 and we also set this coefficient to zero.
All of these assumptions lead to a restricted set of operators that are expected to have the dominant effects on the W H process, namelỹ The first line of Eq. 2.3 gives the combination of two operators that affect the W H amplitude in exactly the same way, namely through a finite Higgs wave function renormalization. We describe them with one Wilson coefficient, assuming the normalization Λ = 1 TeV unless otherwise noted. Obviously, the degeneracy between O H and O HD will be broken by other Higgs and gauge observables contributing to a global analysis, so we can neglect this effect in this analysis. In principle, dimension-6 operators can also appear in the W and H decays. With our assumptions, the decays only receive SM contributions, but more generally we also know that the dominant momentumdependent corrections are completely dominated by the Higgs production processes [21].
As mentioned above, the effects of dimension-6 operators can either scale like v 2 /Λ 2 or like E 2 /Λ 2 . While in general all three operators in Eq.(2.3) allow for the second kind of scaling, it turns out that for single Higgs productionÕ HD changes only the wave function of the Higgs field and therefore the Higgs couplings to all other particles. In contrast, O HW changes the momentum structure of the W W H vertex [33]. Even more interesting is the existence of a qq W H 4-point vertex proportional to C (3) Hq , which avoids the s-channel suppression of the Standard Model qq → W → W H diagram and is momentum-enhanced at high energy [9]. To be more specific, we can compute the helicity amplitudes for the on-shell process ud → W +,λ H, where λ is the W -helicity [34]. The leading powers of the ratio M W / √ s contributing to the amplitude squared for W H production in the Standard Model and including the three operators are: This structure obviously causes problems with a universal definition of sensitive phase-space regions even in terms of the relatively simple observable m W H ≡ √ s, since the operators contribute differently in different phase-space regions.

Signal and backgrounds
We simulate the Higgs production process at 13 TeV with MadGraph5 aMC@NLO [35] at tree level using the PDF4LHC15 PDF set [36] (lhaid=90900) and the default dynamical scale choice of MadGraph, namely, the transverse mass of the 2 → 2 system formed from a k T clustering [37]. We include in the matrix elements all diagrams with an intermediate W and H, though non-resonant and t-channel contributions are generally unimportant. The electroweak contributions from diagrams with a Z or γ * are neglected, as they are largely removed by the analysis cuts.
The higher-dimensional operators are implemented at tree level in the Warsaw basis [2,29] using the SMEFTsim package [38] with the {M W , M Z , G F } input scheme and assuming U (3) 5 invariance in the flavor sector. Because we want to compare the linear dimension-6 contributions with the dimension-6 squared terms, we generate all amplitudes to dimension 6 and do not truncate the cross section.
The leading QCD backgrounds to our process are pp → W bb, pp → tb (tb), and pp → tt .
We also simulate them with MadGraph5 aMC@NLO [35] at tree level. At generator level, we apply cuts that mimic a typical experimental analysis [39,40]. We require all leptons to have p T, > 10 GeV, / E T > 25 GeV, and tagged b-jets to have p T,b > 35 GeV. The leptons and b-jets are required to lie within |η ,b | < 2.5, and be well separated, ∆R bb, b > 0.4. We also require a Higgs mass window 80 GeV< m bb < 160 GeV, which has no effect on the signal, but allows for efficient sampling of the backgrounds. To reduce the semi-leptonic tt background, we require additional light jets to lie outside the central region, demanding p T,j < 30 GeV and ∆R bj, j > 0.4. We ignore the fully leptonic and fully hadronic tt backgrounds since they lead to different final states.
Prior to analysis, the parton-level events are processed to approximate the most important detector effects. The energies of the b-jets are smeared with a Gaussian transfer function chosen to approximate the h → bb mass resolution from Ref. [40]. We also smear the transverse components of the missing transverse energy according to the most recent ATLAS performance [41]. We assume a flat b-tagging probability of 70%, and neglect charm and light-flavor mis-tagging probabilities. For simplicity, we neglect any reconstruction inefficiencies for electrons and muons. Further details on the treatment of detector effects and a comparison with alternative methods is presented in App. A.

Statistical analysis
The task of this paper is to quantify the information in the W H → νbb process and to identify observables and phase-space regions that allow us to probe new physics effects parameterized in terms of the Wilson coefficients given in Eq. (2.5). While our reference process is relatively simple, we know from Sec. 2 that the three relevant operators contribute very differently to the event kinematics. To describe their effects we need to quantize the sensitivity of the W H phase space to a set of continuous model parameters. We start by discussing the challenges of such an analysis in a high-dimensional observable space, show how optimal observables can be constructed for this process, and review the Fisher information as an appropriate and convenient object to encode our analysis sensitivity.

Score as the optimal observables
In a typical LHC measurement we link n observed events {x}, each of which is characterized by a phase space vector x i , to a vector of model parameters θ = (C HD , C HW , C Hq ). The central quantity that describes their relation is the likelihood function, the probability density of the observed data as a function of the model parameters. It is given by [42] p full (x|θ) = Pois(n|L σ(θ)) i p(x i |θ) . (4.1) The first term with Pois(n|λ) = λ n e −λ /n! describes the probability of observing n events given an integrated luminosity L and predicted cross sections σ(θ). The remaining factors describe the kinematic information in each event x i and are equal to the fully differential cross section normalized to one: The Poisson term is relatively straightforward to compute. Calculating the event-wise kinematic likelihoods p(x|θ) explicitly, however, requires inverting the chain of Monte-Carlo tools used to simulate LHC events and integrating over an extremely high-dimensional space, which is impossible in practice [23,24].
The most common solution is to restrict the data x to one or a few hand-picked kinematic variables such as invariant masses, transverse momenta, or angular correlations, which integrates out some kinematic information and typically reduces the power of the analysis [21,22]. The STXS also follow this approach. We compare them to an alternative method, which defines statistically optimal observables for all directions in parameter space. We start by assuming that the values of the Wilson coefficients θ i are small, that is, we consider the parameter space region close to the SM. In this case one can explicitly construct the most powerful observables, which turn out to be [23] The vector t(x) is the gradient in model space, described by the Wilson coefficients θ i . Each component t i (x) is a function of phase space x, in other words, the set of basic kinematic observables like reconstructed energies, momenta, angles, and so on. The high-dimensional observable vector x can be compressed to these functions in each event without losing any statistical power: the t i (x) are the sufficient statistics. Furthermore, confidence limits on the Wilson coefficients can be calculated from histograms of the components t i in the same manner as with histograms of one or two familiar kinematic variables. Such an analysis of the t i (x) is guaranteed to lead to optimal statistical limits in the neighborhood of the SM.
In particle physics, the t(x) are known as Optimal Observables and have been calculated in a parton-level approximation for many processes [43][44][45], an approach closely related to the Matrix Element Method [46]. In the statistics community t(x) is originally known as the score [47]; we follow this older nomenclature.
Calculating the score is not straightforward: it is defined through the likelihood function p(x|θ), which is intractable when we use a realistic simulation of shower and detector effects. Recently, however, a new technique was developed that solves this problem with a combination of matrix-element information and machine learning: the Sally algorithm introduced in Refs. [23][24][25] allows us to train a neural network to estimate t(x) as a function of the observables x. * This way the score defines optimal observables not only at the parton level, but including the effects of invisible and undetected particles, parton showers, and detector response.
We use the implementation of the Sally algorithm in MadMiner 0.5 [26] to extract matrix-element information from our Monte-Carlo simulations, train neural networks to estimate the score, and calculate the expected limits on the Wilson coefficients. In addition to the four-momentum components, we choose to include a number of higher-level observables to aid in the training of the network, like the p T and η of each particle, the ∆R, ∆φ, invariant mass, transverse mass, and transverse momentum of each particle pair, altogether 48 observables for our W H process. We simulate 4 · 10 6 signal and 16 · 10 6 background events (after all cuts) and train fully connected neural networks with three hidden layers of 100 units and tanh activation functions. The Sally loss function [24] is minimized with the Adam optimizer [49] over 50 epochs with a learning rate that decays exponentially from 10 −3 to 10 −5 and a batch size of 128. We train ensembles of five net-works, using the ensemble mean as prediction and the ensemble variance as a diagnostic tool to check the consistency of the predictions.

Fisher information
Benchmarking an analysis strategy such as the STXS requires a meaningful, but convenient metric. We propose to use the Fisher information, which we review in this section. Consider the expected significance with which a parameter point θ can be excluded in a measurement. It is essentially [50,51] given by the expected log likelihood ratio whereθ is the maximum-likelihood estimator of the Wilson coefficients. The log likelihood ratio is a generalization of the χ 2 test statistic to non-Gaussian distributions. If the best fit point is close to the SM and the tested theory parameters θ ∼ C/Λ 2 are small, we can expand the log likelihood ratio, where E[·] now denotes the expectation value assuming events distributed according to the SM. The leading term in the Taylor expansion defines the Fisher information I ij . This matrix has n × n entries, where n is the number of theory parameters (plus nuisance parameters when discussing systematic uncertainties; see Appendix C). It quantifies the expected sensitivity of a measurement, where larger components correspond to more precise measurements, and has several useful properties: • According to the Cramèr-Rao bound, the covariance matrix of any unbiased estimator, or the precision of a measurement, is bounded by the inverse Fisher information: So the larger an eigenvalue of the (by definition diagnolizable) Fisher information matrix, the more precisely we can measure the direction in parameter space given by the corresponding eigenvector. The diagonal elements correspond to the sensitivity to a particular theory parameter when keeping all other parameters fixed.
• The Fisher information is independent of the parameterization of the observables.
• It transforms covariantly under parameter transformations, so it can easily converted between model space bases.
• It is additive for different processes and phase-space regions: the full sensitivity of multiple analysis is given by the sum of the Fisher information matrices of several channels or phase-space regions. This also allows us to study the distribution of information over phase space.
• It has a geometric interpretation as a metric in parameter space, defining distances that quantify the expected reach of a measurement. These distances directly correspond to expected confidence levels [26]. For example, the 95% CL contours for two free parameters corresponds to a distance d = 2.447.
• Finally, the Fisher information formalism allows us to easily include systematic uncertainties, which are modeled with nuisance parameters and profiled over in the statistical analysis [21].
For a more detailed discussion of this formalism we refer the reader to Refs. [21,26].
If we insert Eq. (4.1) into the definition of the Fisher information in Eq. (4.5) we find where N is the number of events in the Monte-Carlo sample. This means that after we train a neural network to estimate the score t(x) as discussed in the previous section, we can calculate the full Fisher information, corresponding to the maximal reach based on all observables including correlations. As mentioned above, the Fisher Information also easily allows to include systematic uncertainties, which are discussed in App. C.
In addition, we can calculate the information in the distribution of one or two kinematic variables with histograms [21,22]. We will use this to calculate the sensitivity of the STXS. Finally, as a cross-check we can calculate the parton-level information directly from event weights, using the technique introduced in Ref. [21]. This information neglects shower and detector effects and assumes that we can exactly reconstruct the parton-level final state, including flavor information and the full four-momenta of invisible particles. While this is not a realistic scenario, we will use this method to verify the machine-learning results in App. B.

Beyond the leading Fisher information
By definition, the Fisher information on dimension-six Wilson coefficients measures their leading effects, i. e. it is linearized in powers of 1/Λ 2 . Similarly, the score vector t(x) defined in Eq. (4.3) is only statistically optimal in the parameter region where the Wilson coefficients are small. Further away from the SM, the O(1/Λ 4 ) terms become important and eventually dominate. The same happens if the interference between dimension-6 amplitudes and the SM at order O(1/Λ 2 ) vanishes [52]. In this situation, the Fisher information approximation is no longer accurate, and while an analysis based on the score will still lead to correct confidence limits, they might no longer represent the best possible limits. One way to discuss this case is to use machine-learning techniques to estimate the full likelihood or likelihood ratio function to all orders in 1/Λ 2 [23][24][25]53]. In a related way, using the geometric interpretation of the Fisher information, one could calculate distances along geodesics that capture these higher-order effects as well [21].
Here we follow a different approach. In the limit where the linear term vanishes or is very small, we can still discuss the Fisher information and define optimal observables by assuming that the effects from one operator squared ( ∼ θ 2 i ), dominate over both interference terms ( ∼ θ i ) and cross terms ( ∼ θ i θ j ). In this approximation we can write the likelihood in terms of the squared Wilson coefficients Θ i ≡ θ 2 i as new parameters. We note that while the Θ i are perfectly well defined statistical quantities, their physics interpretation requires the additional condition Θ i ≥ 0. In a global EFT fit this leads to a series of interesting effects [54] which will have no impact on our discussion in Sec. 5.2.
To analyze the impact of the contributions at quadratic order, O(1/Λ 4 ), we again define the score as a function of Θ according to Eq. (4.3) and the Fisher information as a function of Θ according to Eq. (4.8). In practice, this means we replace the derivative with respect to θ i by a derivative with respect to Θ i , which is proportional to the second derivative with respect to θ i . We will label confidence limits based on the score in terms of the Θ as Harry † .

New physics in kinematics
The physics question we are going to tackle in this study is how the effects of the three operators in Eq. (2.3) are distributed over phase space and how we can define LHC observables to best constrain them. In particular, we need to determine if the established simplified template cross sections serve this purpose, or if they can be improved.
The Fisher information is an ideal tool to analyze questions on where in phase space we can search for new physics signals and how we can extract this information from kinematic analyses [21,22]. The first question we discuss is the impact of detector effects, which can hide information that in principle exists at the parton level but is not accessible in a realistic measurement [23,24]. As described in Sec. 4 we can use machine learning to calculate the actually observable information at the detector level. We will illustrate this aspect in Sec. 5.1. Next, in Sec. 5.2 we will discuss how the sensitivity to different dimension-6 operators is distributed over phase space, both linearized in the Wilson coefficients and including the Wilson coefficients squared. In Sec. 5.3 we will study the information in two-dimensional and multivariate distributions. We analyze how much of the available information is captured by simplified template cross sections and discuss their proposed form. Finally, we calculate expected exclusion limits for this channel based on the use of STXS as well as a multivariate analysis [19,20].

Information after detector effects
Fisher information calculations for fully reconstructable signals and their irreducible backgrounds, also including leading detector effects, have been studied in the past [21,22]. Using MadMiner we can also study signatures which cannot be fully reconstructed. Specifically in the W H production process, we study the information based on the smeared missing † We leave it to the reader to imagine a carefully chosen acronym.  . We include full parton-level information without detector effects (dotted black), detector effects, but retaining the full neutrino 4-momentum (dashed orange), and finally retaining only the transverse neutrino momentum (solid blue). We also show the information using only the signal rate in grey.
transverse energy rather than the full neutrino momentum. In this section we analyze the loss of information due to these detector effects, comparing three sets of observables: first, keeping the full information on all initial and final state particles at parton level, second, using all final state particles after detector smearing, and third, keeping only the observable missing transverse momentum in place of the neutrino momentum, including all detector-level smearing.
To gain some intuition about the impact of detector effects on the information content of the hard process we analyze two of the three operators given in Eq. (2.5) at a time, setting the third Wilson coefficient to zero. We also limit ourselves to linearized differential cross sections in each of the Wilson coefficients at this point, the effects of the squared terms will be discussed in detail later.
In Fig. 1 we show how the full information about the dimension-6 signal at the parton level is reduced when we move from parton-level truth to realistic observables step by step. The dotted black line corresponds to retaining the full parton-level information without detector effects that is used for the evaluation of the matrix elements. This includes unobservable degrees of freedom such as the initial-state parton flavors, particle helicities and the un-smeared momenta. Removing this information and introducing a realistic smearing of the particle momenta washes out possible structures, for example asymmetries in the rapidity distributions, as can be seen when going from the black to the orange line. In this step we still assume that we can measure the full neutrino four-momentum. Finally, we remove the unobserved longitudinal neutrino momentum and neutrino energy, which changes the orange line into the blue line. While the longitudinal neutrino momentum can in principle be reconstructed if we assume an on-shell W boson, this reconstruction is spoiled by the detector smearing.
In the top panels of Fig. 1, we neglect the QCD backgrounds and consider only W H production with linear dimension-six contributions. The loss of information has the strongest impact on the correlated measurements ofC HD -C HW , because the phase-space effects of both operators are very SM-like. In this situation we rely on the best possible understanding of the full final state. In theC HD -C Hq planes, we know that the phase-space effects are more dramatic, so the additional information from the third neutrino direction is less relevant. We also show the information from the total rate measurement, which is not only much poorer, but it also shows a perfect flat direction for each pair of operators.
In the lower panel of Fig. 1 we show the same change in available information at detector level, now in the presence of the QCD backgrounds. In this situation, a complete understanding of the final state neutrinos is desirable for understanding small changes in the full phase space distribution due to BSM effects, and hence the missing information affects all three operators.

Distribution of information
From Sec. 2 we know that the three dimension-6 operators we are considering have distinctly different effects on the W H production process. If we want to exploit the momentum enhancement of C signal is the partonic energy or equivalently [55] p T,W ≈ p T,H or m W H .
This simple description as a 2 → 2 process is broken by two aspects, the 2 → 3 structure of the W bb continuum background and angular correlations reflecting the W polarization.
In the left panel of Fig. 2, we show the p T,W distribution for the W H signal, together with the information distribution. We start with the normalized distribution of the diagonal element of the Fisher information introduced in Sec. 4.2, by definition linearized in the Wilson coefficient, as solid lines. These curves correspond to the diagonal elements of the Fisher Information matrix computed using only events that lie within the bin boundaries for a chosen kinematic variable. These diagonal elements are inversely proportional to the estimated limit on the theory parameter, and the full information (and optimal limit) can be obtained by adding up the information in each bin. The distributions thus demonstrate where in phase space the constraining power on each Wilson coefficient arises, and where the experimental analyses should be focused.
We see that the information on all three operators reflects their respective Lorentz structures. The information onÕ HD mimics the distribution of the signal, and the information in O HW is energy-suppressed because it couples to the transverse W -modes. In contrast, O Hq is visible at large momentum transfer because of the 4-point vertex.
In the same figure we also show the information distribution as dotted lines for the dimension-6 squared terms only for each of the three operators. In this case the interference with the Standard Model does not act as a projector onto the W -polarization structure of the Standard Model. In these plots we treat the squared Wilson coefficients as unconstrained parameters; physically, they are restricted to be non-negative, but this does not change our conclusions. What we see clearly is that the sensitive phase-space region for O HW now peaks around 400 GeV, well outside the dominant phase-space region of the SM signal. Also the peak in the information on O Hq moves from around 400 GeV to 600 GeV, dominated by the squared 4-point vertex.
The right panel of Fig. 2 shows the distribution and information distribution over the total transverse mass of the final state, where E bb T = |p bb T | 2 + m 2 bb and p bb T is the vector sum of the transverse momenta of the lepton and the b-jet, with similar conclusions.
In Fig. 3, we demonstrate how including the continuum QCD background changes the picture. First, we see that regions of low p T or low m T,tot are swamped by the QCD backgrounds, so they do not lend themselves to extracting momentum-enhanced contributions. ForÕ HD there is no momentum enhancement, and the information in the two distributions behaves exactly as before. Similarly, the information distribution on O HW in the linear case is primarily in the low bins, where the event rate is larger, but is largest for p T,W 300 or m T,tot 700 GeV if we consider the dimension-6 squared term. In contrast, O Hq gains only very little information from the low-p T,W or low-m T,tot regime. We note that for operators only affecting the total rate, the distribution of information over p T,W is completely consistent with the distribution of the statistical significance of the V H production over the QCD background, as studied in Ref. [56].
Here we pause to comment on the validity of the EFT expansion. While for each operator, only the combination C i /Λ 2 is observable, matching with a perturbative theory at higher scales requires C i 1, and thus more precise measurements are intrinsically probing potentially higher scales. In order for the constraints to be self-consistent, it is important that we do not try to set constraints on the EFT parameters using data that is outside the range of validity of the EFT expansion. This could be done rigorously by implementing a cut demanding that some kinematic variable tracing the momentum transfer not exceed some multiple of Λ, but fortunately, in Fig. 3, we see that all of the Information arises from events with p T,W 1 TeV, so such a cut is unnecessary for our analysis here.
Finally, in Fig. 4 we consider two more kinematic distributions. First, the opening angle of the charged lepton and the leading bottom becomes ∆R b 1 ≈ π for boosted 2 → 2 processes, like the part of W H phase space sensitive to O Hq . Second, the reconstructed rapidity of the bb system also becomes more central for boosted 2 → 2 processes. The opening angle is useful for discriminating against the tt and tb backgrounds, where we do not expect the two b-jets to recoil against the lepton, but they do not otherwise appear too promising to gain additional information into one of the three dimension-6 operators, or into the relation between linearized and quadratic contributions.
Altogether, we have learned that when looking at 1D kinematic distributions those which scale like momentum, like p T or m T,tot , are best-suited to separate the effects of the different operators. Including QCD backgrounds makes the high-momentum regime even more attractive. The only problem is that the relevant region, for instance in p T,W , depends not only on the operator, but also on whether we linearize the effective field theory for the cross section or keep the O(1/Λ 4 ) squared term. Furthermore, the balance between linear and quadratic contributions to a combined analysis will change with the LHC luminosity and the kinematic focus of the analysis.

Information in STXS
Testing one effective operator at a time violates the basic assumptions of effective theories, namely that we expect to see effects from all operators allowed by the underlying theory. We now examine how the kinematic variables introduced in the previous section can be used to set constraints on the full multi-dimensional parameter space.
Simplified template cross sections (STXS) have been proposed to provide experimental results on kinematic distributions for global analyses [19,20]. For the W H production process they include rate measurements binned in p T,W . At stage 1, the three bins are defined by p T,W = 0−150 GeV, p T,W = 150−250 GeV, and p T,W > 250 GeV. For increased statistics, stage 1.1 proposes five bins, now split at p T,W = 75, 150, 250, 400 GeV, as shown in Fig. 7. Results using this approach have recently been presented by ATLAS [57] using the 0-jet, single lepton data sample, to obtain limits on anomalous couplings, assuming only one non-zero Wilson coefficient at a time. Unfortunately, they do not consider the most interesting 4-point vertex from O Hq , for which boosted V H production provides the strongest constraints [9].
We evaluate these proposals by calculating the Fisher information in the STXS bins. From Fig. 3, it is clear that the stage 1.1 STXS version will not capture the full information on O We compare the results to the full information in the high-dimensional phase space using the Sally method discussed in Sec. 4

.1. In addition, we train a neural network with the
Sally method on just p T,W as input variable, which lets us calculate the information in the full p T,W distribution in the infinite-bin limit; see App. D for more details.
In Fig. 5, we show the expected limits on pairs of Wilson coefficients from the p T,W distribution. First, we see the almost flat direction in theÕ HD − O HW plane, reflecting the fact that both operators are constrained at low transverse momenta, as long as we only consider linearized predictions. This flat direction will be broken by other observables in a proper global analysis [9], so it is of less interest for our purposes. The situation changes once we consider the correlation of either of these two operators with O Hq , because now the two directions are tested by distinctly different p T,W regimes. The only requirement in this case is that we include enough p T,W bins to distinguish the two regimes, as is provided by the stage 1.1 setup. Indeed, this framework collects essentially all information on our set of three operators affecting the p T,W distribution.
In the second row of Fig. 5 we switch from the linear terms in the Wilson coefficient to the squared terms alone, as defined in Section 4.3. This allows us to analyze the relative sensitivity of the different analysis strategies and binnings on the squared terms which, as we saw in Figs. 3 and 4, have very different kinematic behaviors compared to the linear terms. Showing the information on the squared terms as contours also allows us to study the correlations between different squared parameters, which, as we will see, are often different than the correlations between linear terms. In the left panel we see hardly any effect of the different binnings forÕ HD , but the flat direction has vanished and the reach in O HW is increased dramatically by including the high p T,W bins. Because this sensitivity to O HW comes from high momentum transfer, we start to observe a strong anti-correlation with O Hq . To break it, we need to resolve the high-p T,W range, and for this purpose the additional bin in Eq. (5.3) is crucial. Indeed, we see that it improves the situation considerably.
In the lower six panels of Fig. 5 we show the same kind of information, but now profiled over the respective third operator. The results are generally much weaker, and the flat directions are more significant. As mentioned above, this is not a big concern, since our W H analysis has to be embedded in a global analysis. An interesting aspect is that with three operators actually contributing to the analysis the gap between the few-bins STXS approach and the full information from the entire p T,W distribution widens. Some details on the scaling with larger number of bins and on the number of bins needed to saturate the information in a kinematic distribution are given in App. D. Altogether, it becomes even more important that we extract as much information for instance on O After having established that the currently used STXS for W H production need to be supplemented by another high-energy bin, we can test how much of the available information of the full phase space is captured in the binned p T,W distribution. The reasoning behind picking an additional observable scaling like momentum is that the signal process  is dominantly 2 → 2 scattering, described by two kinematic observables, and the scattering angle or rapidity are not found to be particularly useful, as shown in Fig. 4. On the other hand, if the signal from dimension-6 operators populates phase space far away from the SM we will become sensitive to the background kinematics, and the simple picture of a 2 → 2 process can completely change [56,58].
In the top row of Fig. 6 we show the limits from the Sally approach trained on linearized effects in the Wilson coefficients (the same as in Fig. 1) as a blue line. The dotted pink line comes from limiting the information of the Sally training to the observables p T,W and m T,tot , which can be understood as the constraints computed from the infinitebin limit of a 2-dimensional histogram. It is not clear a priori which second observable complements p T,W best, but we found that m T,tot as defined in Eq. (5.2) works well. In the linearized approach, we confirm that compared to the full phase space we still lose a significant amount of information on the three operators. The next question is how much of the information from the two-dimensional contribution we can capture in a reasonable number of STXS bins. as illustrated in Fig. 7. Because m T,tot and p T,W are by no means uncorrelated, this captures the effects of the 2 → 3 background phase space and adds an implicit sub-division in p T,W . In the top row of Fig. 6, we see that for the linearized case our new proposed STXS definition performs extremely well, but the STXS stage 1.1 also does fine. In the second row, we show the same effects, but now for the squared new physics terms only and not including the imposed positivity condition. The blue curve is now generated consistently based on the squared terms, using the Harry method. As expected, the improved 2-dimensional STXS outperform the stage 1.1 results when we test O Hq , as a better understanding of the information in high p T,W or m T,tot bins is effective. The lower panels of Fig. 7 show the same effect, but profiled over the full range of the respective third Wilson coefficient. The large effect related to O Hq now propagates through the entire set of measurements. Moreover, the limited number of bins in the STXS stage 1.1 become an issue.

Limits with consistent squared terms
Finally, we go beyond the linearized Fisher-information approach and calculate expected exclusion limits on the dimension-6 Wilson coefficients with the squared terms included in a physically consistent way. Generally, power counting in 1/Λ 2 suggests that the linear effects should be dominant. However, in some directions of parameter space the interference contribution to the analyzed distributions is small, the analysis is only sensitive further away from the SM, and the squared terms are the dominant effects of new physics. We thus expect the squared terms to break the flat directions of the linearized results.
This is exactly what we find in Fig. 8, where we show the expected limits for a rate-only analysis (gray), the improved STXS defined in the previous section (green), and a multi-  variate analysis of the full phase-space information based on the Sally technique described in Sec. 4.1 (blue). In the plane spanned byC HD and C HW (left panel) we can clearly see that the squared contributions remove the approximate flat direction in the improved STXS limits. More generally, the full exclusion limits based on the Sally method differ from those obtained from the linearized Fisher information, indicating the importance of squared new physics effects in this region of parameter space. In these parameter-space regions we expect that even stronger limits can be constructed with techniques that estimate the full likelihood function to all orders in the Wilson coefficients [23][24][25][26]. Closer to the SM, we find that the full limits are well approximated by the Fisher information, confirming that the linear operator effects dominate there. Generally, these full limit calculations confirm the main result of the last section: the simplified template cross sections, in particular our improved version, are substantially more informative than a simple rate measurement, but fall short of capturing all of the kinematic information in the high-dimensional final state.

Improved STXS
There are numerous global studies that compare the sensitivity to various coefficients at LEP and at the LHC [7-9, 17, 31], as well as studies examining the reach of the V H processes separately [27,59,60]. The global Warsaw basis fit of Ref. [7] finds single parameter fits which are roughly consistent with our improved STXS projections of Fig. 8. Our results should be interpreted as demonstrating the potential improvements in precision fits that could be obtained by improving the binning or by adding information from additional kinematic variables.  Figure 8. Expected exclusion limits at 95% CL based on a rate-only analysis (grey), from the improved STXS (green), and from a multivariate analysis of the full phase-space information with the Sally algorithm. We show (linearized) limits based on the Fisher information (dashed) and full exclusion limits that take into account the effects of linearized and squared new physics effects (solid). In each panel, we set the operator not shown to zero.

Conclusions
We have, for the first time, performed a comprehensive benchmarking of simplified template cross sections (STXS) in the W H channel with a leptonic W decay. The Fisher information allowed us to quantify the reach of different analysis strategies and to identify which phasespace regions are sensitive to the three dimension-6 operatorsÕ HD , O HW , and O Hq . We compared the STXS to a machine-learning-based analysis of the full, high-dimensional final state, using the Sally technique of Refs. [23][24][25] implemented in MadMiner [26] to calculate the statistically optimal observables and the maximal new physics reach of an analysis.
We found that p T,W is the most promising kinematic observable, but that the STXS stage 1.1 criteria benefit from being supplemented by an additional high-energy bin. The ability to distinguish different operator signatures is further enhanced when we include a second observable in the definition of the STXS, m T,tot being one workable option. We showed that for the W H process such a two-dimensional approach is promising, even though it cannot obtain all of the kinematic information in the process that can be unearthed with machine-learning-based techniques. Our results suggest, however, that the current STXS results can be significantly improved by the two-dimensional binning. More generally, this study presents a blueprint for a systematic benchmarking of STXS or any other method of publishing results needed for a global Higgs analysis.
The Our analysis code is available online at Ref. [73].

A Detector effects
The leading detector effect relevant for this analysis is the smearing of the di-jet invariant mass peak. The distribution has been carefully simulated by CMS [74] (see Fig. 9) and we aim to reproduce it for our analysis. In the MadMiner framework, this smearing can be simulated in three different ways: i) In the simplest approach, we explicitly smear the parton level b-quark energies after event generation. This is parameterized by a gaussian transfer function with width σ E /E = 0.1. ii) Alternatively, we can include the smearing already in the event generation process by modifying the Higgs propagator [21,22,58]. To reproduce the m bb distribution obtained by CMS, the Breit-Wigner propagator is simply replaced by the square-root of the Gaussian with mean m H , and width σ = 15 GeV. As a result, the joint scores already include the smearing and can therefore be used as estimator for the score, making this approach a useful tool for validation (see Appendix B). iii) Finally, it is also possible to simulate parton shower, hadronization and detector response using Pythia and Delphes.
In the left panel of Fig. 9 we compare the three approaches to the di-jet mass distribution obtained by CMS [74] (solid black). We can see that both the parton level energy smearing (dashed red) and propagator modifcation (dashed blue) approach can reproduce the CMS mass spectrum, while a fast detector simulation using Pythia 8 [67] and Delphes 3 [61] with default settings (dotted green) systematically underestimates the di-jet mass.
Another important detector effect is the smearing of the missing transverse energy / E T . In this work, we aim to reproduce the most recent ATLAS performance [41] and explicitly smear the transverse components of the missing energy using a Gaussian transfer function with width σ / E T = 12.5 GeV. An additional smearing at higher energies is induced through the smearing of the b-quark energies. As shown in the right panel of Fig. 9  procedure reproduces the experimental results well. In contrast, a fast detector simulation using Pythia 8 and Delphes 3 once again fails to accurately reproduce the experimental performance without tuning the simulation parameters, as indicated by the green dotted line.

B Fisher information in the presence of backgrounds
A key feature of the machine learning approach in MadMiner is that one can reliably estimate the score in the presence of backgrounds. As explained in Ref. [26], this also works when signal and background samples are generated separately from each other, implicitly inducing an additional discrete latent variable which labels each training event as either signal or background. This latent variable is then integrated out in the machine learning step alongside with all other unobservable latent variables, such as the longitudinal neutrino momenta, initial state quark flavours or additional unobserved particles in reducible backgrounds (such as the jets in top pair production). Reference [26] contains examples to validate the performance of MadMiner's machine learning approach in the case of either a single observable or a single process. In the following, we will consider another simplified scenario in which we can validate MadMiner's reach estimate in the presence of both a high-dimensional observable space and irreducible backgrounds.
As mentioned in Sec. 4.2, for a fixed initial and final state at parton level, we can calculate the Fisher information directly from the event weights and hence cross check the machine learning based results. In particular, we consider the process ud → µ + ν bb, with contributions coming from both the W H signal and the irreducible W bb background, and Hq . We first generate a set of signal events {x k }, described by event weights ∆σ S,k (θ). In contrast to the results in Section 3, where the smearing of the m bb peak was done at analysis level, here we generate the signal events using the propagator modification method described in Appendix A. We then use the MadGraph reweighting tool [75] to obtain the background weights ∆σ B,k for the same set of events. As shown in Ref. [21], the Fisher information at parton level for a set of events can then be computed as A comparison of the Fisher information obtained using machine learning results with the Fisher information directly obtained from the parton level event weights is shown in Fig. 10. The inner ellipses show the reach in the absence of backgrounds (∆σ k = ∆σ S,k ), while the outer ellipses take into account both signal and background (∆σ k = ∆σ S,k + ∆σ B,k ). In both cases, the Fisher information obtained at parton level (dotted curves) and using a Sally estimator trained on the same sample (solid curves) are indistinguishable, indicating excellent machine learning performance. This is then compared to the standard way of estimating the Fisher information, using a Sally estimator trained with separately generated signal and background samples, which is indicated by the green dot-dashed ellipse. This result agrees remarkably well with the other results, confirming that the treatment of backgrounds in the MadMiner algorithm works correctly.  Hq in the Fisher information approximation obtained with the Sally estimator evaluated using the full set of physical variables (blue), the Stage 1.1 simplified template cross sections (red), and measurements of the rate (grey), withC HD fixed to zero. The solid lines show the projected limits obtained in the absence of systematic uncertainties, while the dashed lines show the reach contours after marginalizing over the additional nuisance parameters.

C Systematics
An important question that must be addressed when comparing the expected limits from different sets of kinematic data from experiments is the effect of systematic uncertainties on the expected limits. While a treatment of all experimental systematics is beyond the scope of this study, an important step in this direction is to understand the effects of theory uncertainties, namely the dependence on the choice of renormalization and factorization scales as well as the PDFs. These uncertainties are fully integrated in the MadMiner framework [26] and parameterized through nuisance parameters, which are then marginalized over to set limits.
In the following we only consider the theory uncertainties on the W H signal. Since uncertainties on the backgrounds are typically mitigated through the use of sideband analyses, rather than through direct simulations, including the scale and PDF uncertainties on the backgrounds would dramatically overestimate the actual systematic uncertainty in a realistic measurement.
In the left panel of Fig. 11 we show the normalized distribution of the W boson's transverse momentum including scale and PDF uncertainties for several benchmark parameter points. In particular we vary the renormalization and factorization scale by a factor of two and use the PDF4LHC15 nlo 30 set to estimate PDF uncertainties. We see that the relative uncertainty is small in the background-dominated low-p T,W regime, and increases to O(5%) at large transverse momentum. In W H production, the PDF shape uncertainties dominate over the scale uncertainties, given that the scale dependence of the quark PDF is small.
In the right panel of Fig. 11 we show how including systematics changes the expected limits. While the solid contours show the reach neglecting systematic uncertainties, the dashed lines account for the systematic uncertainties by profiling over the nuisance parameters. We can see that the presence of systematic uncertainties mainly reduce the sensitivity in the rate-sensitive direction, rotating the orientation of the ellipses. The overall effect is sizable, indicating the importance of understanding these systematic uncertainties. In Fig. 11, we see that at p T,W ∼ 300 − 400 GeV , the energy scaling of C HW and C HQ is roughly the same, due to the fact that we are not yet in the high p T,W regime where the longitudinal modes dominate. In this energy range, the transverse modes, that have the same scaling with √ s/M W , are clearly important.

D Binning distributions
In view of an appropriate definition of simplified template cross sections, it is interesting to ask how finely we need to bin a kinematic distribution to extract as much of the available information as possible. This is illustrated in Fig. 12, where we show the Fisher information corresponding to a p T,W histogram for varying number of bins with equal size in the range [0, 800 GeV]. The three panels correspond to the diagonal element of the Fisher information forC HD (left), C HW (center) and C QH (right). This is compared to the information obtained using a Sally score estimator trained with only p T,W as input observable (solid black) and the information in the STXS with 3 bins (stars), 5 bins (crosses) and 3 bins (triangles). The shaded regions indicated the uncertainties of the Fisher information, where the uncertainty for the Sally method was estimated using an ensemble of estimators with varying architecture. We can see that the information in histograms approaches the information in the distribution in the limit of large number of bins, showing that the Sally approach gives the correct large-n bin limit.
For all three operators, a histogram with an O(10) number of bins will essentially extract the full information of the p T,W distribution. The STXS with 3 bins already collects almost all information onC HD and C HW , while for C (3) QH a significant improvement is obtained when increasing the number of bins to 6.