The top quark legacy of the LHC Run II for PDF and SMEFT analyses

We assess the impact of top quark production at the LHC on global analyses of parton distributions (PDFs) and of Wilson coefficients in the SMEFT, both separately and in the framework of a joint interpretation. We consider the broadest top quark dataset to date, containing all available measurements based on the full Run II luminosity. First, we determine the constraints that this dataset provides on the large-x gluon PDF and study its consistency with other gluon-sensitive measurements. Second, we carry out a SMEFT interpretation of the same dataset using state-of-the-art SM and EFT theory calculations, resulting in bounds on 25 Wilson coefficients modifying top quark interactions. Subsequently, we integrate the two analyses within the SIMUnet approach to realise a simultaneous determination of the PDFs and the EFT coefficients, and identify regions in the parameter space where their interplay is most phenomenologically relevant. We also demonstrate how to separate possible BSM signals from QCD effects in the interpretation of top quark measurements at the LHC.


Introduction
The top quark is one of the most remarkable particles within the Standard Model (SM). Being the heaviest elementary particle known to date, with a mass around 185 times that of the proton, and the only fermion with an O(1) Yukawa coupling to the Higgs boson, the top quark has long been suspected to play a privileged role in potential new physics extensions beyond the Standard Model (BSM). For instance, radiative corrections involving top quarks are responsible for the so-called hierarchy problem of the SM, and the value of its mass m t determines whether the vacuum state of our Universe is stable, metastable, or unstable [1][2][3]. For these reasons, since its discovery at the Tevatron in 1995 [4,5] the properties of the top quark have been scrutinised with utmost attention, and a large number of BSM searches involving top quarks in the final state have been carried out. The focus on the top quark has further intensified since the start of operations at the LHC, which has realised an unprecedented top factory, having produced more than 200 million top quark pairs so far.
In addition to this excellent potential for BSM studies, top quark production at hadron colliders also provides unique information on a variety of SM parameters such as the strong coupling constant α s (m t ) [6,7], the CKM matrix element V tb [8], and the top quark mass m t [9,10], among several others. Furthermore, top quark production at the LHC constrains the parton distribution functions (PDFs) of the proton [11,12].

The structure of this paper is as follows. To begin with, in Sect. 2 we describe the data inputs and the theory calculations (both in the SM and in the SMEFT) used in our study, focusing on top quark sector measurements. The SIMUnet methodology deployed for the simultaneous extraction of PDFs and EFT coefficients, including its application to the fixed-PDF and SM-PDF analyses, is reviewed in Sect. 3. Subsequently, in Sect. 4 we present the results of the SM-PDF fits, and in particular we quantify the impact on the large-x gluon of recent high-statistics Run II measurements. In Sect. 5 we consider the fixed-PDF analyses and present the most extensive SMEFT interpretation of top quark data from the LHC to date, including comparisons with previous results in the literature. The main results of this paper are presented in Sect. 6, namely the simultaneous determinations of the PDFs and EFT coefficients and the comparison of these with both the fixed-PDF and SM-PDF cases. We summarise our results and outline some possible future developments in Sect. 7.
Technical details of the analysis are collected in the appendices. App. A provides usage recommendations for interpretations of top quark measurements sensitive both to PDFs and SMEFT coefficients. App. B collects the theory settings for the SMEFT calculations, mainly concerning input schemes and operator definitions. App. C carries out a benchmark comparison between SIMUnet (in the fixed-PDF case) and the public SMEFiT code, demonstrating the agreement between the two frameworks at the linear level. The fit quality to the datasets considered in the analysis is presented in App. D, and representative data-theory comparisons are given. Finally, in App. E we discuss the difficulties in extending the simultaneous analysis of PDFs and EFT coefficients to the case where terms quadratic in the EFT Wilson coefficients are dominant.

Experimental data and theory calculations
We begin by describing the experimental data and theoretical predictions, both in the SM and in the SMEFT, used as input for the present analysis. We start in Sect. 2.1 by describing the datasets that we consider, with emphasis on the top quark production measurements. Then in Sect. 2.2 we use a modified version of the selection criteria defined in [40] to determine a maximally consistent dataset of top quark data to be used in the subsequent PDF and SMEFT interpretations. Finally, in Sect. 2.3 we describe the calculation settings of the SM and SMEFT cross-sections for top quark processes, pointing the reader to the appendices for the technical details of their implementation.

Experimental data
With the exception of the top quark measurements, the dataset used in this work for fitting the PDFs, both in the SM-PDF and SMEFT-PDF cases, overlaps with that of the NNPDF4.0 determination presented in Ref. [40]. In particular, the no-top variant of the NNPDF4.0 dataset consists of 4535 data points corresponding to a wide variety of processes in deep-inelastic lepton-proton scattering [43][44][45][46][47][48][49][50][51] and in hadronic proton-proton collisions; see [40] for more details.
Concerning the LHC top quark measurements considered in the present analysis, they partially overlap, but significantly extend, the top datasets included in global PDF fits such as NNPDF4.0 [40] as well as in SMEFT analyses of the top quark sector [42,91]. Here we discuss in turn the different types of measurements to be included: inclusive tt cross sections and differential distributions; tt production asymmetries; the W -helicity fractions; associated top pair production with vector bosons and heavy quarks, including ttZ, ttW, ttγ, tttt, and ttbb; t- and s-channel single top production; and associated single top and vector boson production.
Choice of kinematic distribution. Many of these measurements, in particular those targeting top quark pair production, are available differentially in several kinematic variables, both as absolute distributions and as distributions normalised to the fiducial cross-section. We must therefore decide which of the available kinematic distributions associated to a given measurement should be included in the fit, and whether it is more advantageous to consider absolute or normalised distributions.
Regarding the former, we note that correlations between different kinematic distributions are in general not available, and hence only one distribution at a time can be included without double-counting (one exception is the ATLAS tt lepton+jets measurement at √ s = 8 TeV [92], where the full correlation matrix is provided). Therefore, wherever possible we include the top-pair invariant mass m tt distributions, with the rationale that these have enhanced sensitivity to SMEFT operators via energy-growing effects; they also provide direct information on the large-x PDFs. Otherwise, we consider the top or top-pair rapidity distributions, y t and y tt respectively, which also provide the sought-for information on the large-x PDFs; furthermore, they benefit from moderate higher-order QCD and electroweak corrections [14].
Regarding the choice of absolute versus normalised distributions, we elect to use normalised distributions, together with the corresponding fiducial cross-sections, throughout. Normalised distributions are typically more precise than their absolute counterparts, since experimental and theoretical errors partially cancel out upon normalising. In addition, normalisation does not affect the PDF and EFT sensitivity of the measurement, provided the fiducial cross-section measurements used for normalising are also accounted for. From the implementation point of view, since in a normalised measurement one bin is linearly dependent on the others, we choose to exclude the bin with the lowest m tt value (the production threshold), so as to retain the sensitivity arising from the high-energy tails.
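The bookkeeping described above can be sketched in a few lines. The bin contents and widths below are purely hypothetical; the point is only to show how a normalised distribution is built from an absolute one and why the threshold bin can be dropped:

```python
import numpy as np

# Hypothetical absolute differential cross-section in m_tt bins (pb/GeV),
# together with the bin widths (GeV); the lowest bin sits at the production threshold.
sigma_abs = np.array([4.0, 6.0, 3.0, 1.5, 0.5])
widths = np.array([100.0, 100.0, 150.0, 250.0, 400.0])

# Fiducial cross-section obtained by integrating over all bins.
sigma_fid = np.sum(sigma_abs * widths)

# Normalised distribution: (1/sigma) dsigma/dm_tt.
sigma_norm = sigma_abs / sigma_fid

# The normalised bins satisfy sum_i sigma_norm_i * width_i = 1, so one bin is
# linearly dependent on the others; dropping the threshold (lowest m_tt) bin
# keeps the sensitivity carried by the high-energy tail.
fitted_bins = sigma_norm[1:]
```

The fiducial cross-section `sigma_fid` is then included in the fit as a separate data point, so that no information is lost by the normalisation.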
Inclusive tt production. A summary of the inclusive tt fiducial cross-sections and differential distributions considered in this work is provided in Table 2.1, where for each dataset we indicate the experiment, the centre-of-mass energy √ s, the final-state channel, the observable(s) used in the fit, the luminosity, and the number of data points n dat, together with the corresponding publication reference. In the last two columns, we indicate the datasets that are included for the first time here in a global PDF fit (specifically, those which are new with respect to NNPDF4.0) and in a SMEFT interpretation (specifically, in comparison with the global fits of [42,91]). The sets marked with brackets have already been included in previous studies, but are implemented here in a different manner (e.g. by changing spectra or normalisation), as indicated in the table; more details are given in each paragraph of this section.
The ATLAS dataset comprises six total cross-section measurements and five normalised differential cross-section measurements. Concerning the latter, at 8 TeV we include three distributions from the dilepton and lepton+jets channels. In the lepton+jets channel, several kinematic distributions are available together with their correlations. Following the dataset selection analysis carried out in [40], we choose to fit the y t and y tt distributions, as done in the NNPDF4.0 baseline. At 13 TeV, we include the normalised cross-sections differential in m tt from the lepton+jets and hadronic channels, with both measurements being considered for the first time here in the context of a PDF analysis.
Moving to CMS, in the inclusive tt category we consider five total cross-section and four normalised differential cross-section measurements. At √ s = 8 TeV we include differential distributions in the lepton+jets and dilepton channels, the latter being doubly differential in y tt and m tt . The double-differential 8 TeV measurement is part of NNPDF4.0, but there the (y t , m tt ) distribution was fitted instead. At 13 TeV, we include the m tt distributions in the dilepton and lepton+jets channels. In the latter case we include the single m tt distribution rather than the double-differential one in (m tt , y tt ), which is also available, since we find that the latter cannot be reproduced by the NNLO SM predictions. We present a dedicated analysis of the double-differential distribution in Sect. 5.3. As mentioned above, we will study the impact of our dataset selection choices by presenting variations of the baseline SM-PDF, fixed-PDF, and SMEFT-PDF analyses in the following sections.
tt asymmetry measurements. The tt production asymmetry at the LHC is defined as

A C = [ N (∆|y| > 0) − N (∆|y| < 0) ] / [ N (∆|y| > 0) + N (∆|y| < 0) ] ,

with N (P ) being the number of events satisfying the kinematical condition P , and ∆|y| = |y t | − |y t̄ | the difference between the absolute values of the top-quark and anti-top-quark rapidities. The asymmetry A C can be measured either integrated over the fiducial phase space or differentially, for example binned in the invariant mass m tt . Measurements of A C are particularly important in constraining certain SMEFT directions, in particular those associated to the two-light-two-heavy operators. However, they are unlikely to have an impact on PDF fitting due to their large experimental uncertainties; nevertheless, with the underlying motivation of a comprehensive SMEFT-PDF interpretation of top quark data, we include the A C measurements in our baseline dataset, and hence study whether or not they also provide relevant PDF information. A summary of the asymmetry measurements included in this work is given in Table 2.2.
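The counting definition of A C above can be made concrete on an event-by-event basis. The rapidity values below are hypothetical toy numbers, chosen only to exercise the formula:

```python
import numpy as np

# Toy rapidities of the top and anti-top quark in six hypothetical events.
y_top = np.array([1.2, -0.3, 2.1, 0.4, -1.5, 0.9])
y_antitop = np.array([0.8, -0.9, 1.0, 0.7, -0.2, 0.1])

# Delta|y| = |y_t| - |y_tbar| per event.
delta_abs_y = np.abs(y_top) - np.abs(y_antitop)

# Charge asymmetry: A_C = (N(D|y|>0) - N(D|y|<0)) / (N(D|y|>0) + N(D|y|<0)).
n_plus = int(np.sum(delta_abs_y > 0))
n_minus = int(np.sum(delta_abs_y < 0))
A_C = (n_plus - n_minus) / (n_plus + n_minus)
```

A differential measurement would simply repeat this counting in bins of m tt.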
W -helicity fractions. The W -helicity fractions F 0 , F L and F R are PDF-independent observables sensitive to SMEFT corrections, for which the dependence of the theory predictions on the Wilson coefficients can be computed analytically. Since these observables are PDF-independent, including them in the joint SMEFT-PDF analysis requires extending the methodology presented in [39] so that the fit can accommodate datasets that either lack, or have negligible, PDF sensitivity and depend only on the EFT coefficients. We describe how this is achieved within the SIMUnet framework in Sect. 3. The W -helicity measurements considered in this work are summarised in Table 2.3.

Associated top quark pair production. The next class of observables that we discuss is associated tt production with a Z- or a W -boson (Table 2.4), a photon γ (Table 2.5), or a heavy quark pair (ttbb or tttt, Table 2.6). While measurements of ttV have previously been considered in SMEFT interpretations, we use them here for the first time in the context of a PDF determination. The rare processes ttγ, ttbb, and tttt exhibit a very weak PDF sensitivity, and hence in the present analysis their theory predictions are obtained using a fixed PDF, in the same manner as the W -helicity fractions of Table 2.3. Concerning the ttZ and ttW data, from both ATLAS and CMS we use four fiducial cross-section measurements at 8 TeV and 13 TeV, and one distribution differential in p Z T at 13 TeV. These measurements are particularly interesting to probe SMEFT coefficients that modify the interactions between the top quark and the electroweak sector.
For top-quark production associated with a photon, we include the fiducial cross-section measurements from ATLAS and CMS at 8 TeV. Also available is a differential distribution at 13 TeV from ATLAS, binned in the photon transverse momentum p γ T [129], but we exclude this from our analysis because of the difficulty in producing SMEFT predictions in the fiducial phase space (in the FitMaker analysis its inclusion is only approximate, and in SMEFiT this distribution is neglected entirely). Finally, we include fiducial measurements of ttbb and tttt production at 13 TeV, considering the data with the highest luminosity for each available final state.
Inclusive single top production. The inclusive single-top production data considered here, summarised in Table 2.7, comprise measurements of single-top production in the t-channel, which have previously been included in PDF fits [16,40], as well as measurements of single-top production in the s-channel, which in the context of PDF studies are implemented here for the first time. For t-channel production, we consider the ATLAS and CMS top and anti-top fiducial cross-sections at √ s = 7, 8, and 13 TeV, as well as normalised y t and y t̄ distributions at 7 and 8 TeV (ATLAS) and at 13 TeV (CMS). For s-channel production, no differential measurements are available, and hence we consider fiducial cross-sections at 8 and 13 TeV from ATLAS and CMS.
Associated single top-quark production with weak bosons. Finally, Table 2.8 lists the measurements of associated single-top production with vector bosons included in our analysis. We consider fiducial cross-sections for tW production at 8 and 13 TeV from ATLAS and CMS in the dilepton and single-lepton final states, as well as the tZj fiducial cross-section at 13 TeV from ATLAS and CMS in the dilepton final state. In addition, kinematical distributions in tZj production from CMS at 13 TeV are considered for the first time here in an EFT fit. For these differential distributions, the measurement is presented binned in either p Z T or p t T ; here, we take the former as default for consistency with the corresponding ttZ analysis.

Dataset selection
The top quark production measurements listed in Tables 2.1-2.8 summarise all the datasets that have been considered for the present analysis. In principle, however, some of these may need to be excluded from the baseline fit to ensure that the baseline dataset is maximally consistent. Following the dataset selection procedure adopted in [40], our baseline dataset excludes datasets that may be either internally inconsistent or inconsistent with other measurements of the same process type. These inconsistencies can be of experimental origin, for instance due to unaccounted-for (or underestimated) systematic errors or numerically unstable correlation models, as well as of theoretical origin, for example whenever a given process is affected by large missing higher-order perturbative uncertainties. Given that the ultimate goal of a global SMEFT analysis such as the present one is to unveil deviations from the SM, one should strive to deploy objective dataset selection criteria that exclude datasets affected by such inconsistencies, which are unrelated to BSM physics.
The first step is to run a global SM-PDF fit including all the datasets summarised in Tables 2.1-2.8 (and additionally a fit with the data summarised therein, but with the CMS measurement of the differential tt cross-section at 13 TeV in the lepton+jets channel replaced by the double-differential measurement), and to monitor in each case the following two statistical estimators:

• The total χ 2 per data point, and the number of standard deviations n σ by which the value of the χ 2 per data point differs from the median of the χ 2 distribution for a perfectly consistent dataset. Here (and in the rest of the paper unless specified otherwise) the χ 2 is the experimental χ 2 per data point, defined as

χ 2 = (1/n dat ) Σ i,j ( D i − T 0 i ) (cov −1 ) ij ( D j − T 0 j ) ,

where D i are the central experimental data points, T 0 i are the theoretical predictions computed with the central PDF replica (the average over the PDF replicas), and the experimental covariance matrix cov is the one defined, for example, in Eq. (3.1) of Ref. [149].
Specifically, we single out for further examination datasets for which n σ ≥ 3 and χ 2 ≥ 2 per data point, for which the poor description of the data is unlikely to be caused by a statistical fluctuation (note that these conditions relax those given in [40]; the looser thresholds give the EFT the opportunity to account for poor fit quality, rather than immediately attributing poor fits to inconsistencies). The question is then to ascertain whether a poor χ 2 can be explained by non-zero EFT coefficients (in which case the dataset should be retained in the fit) or whether instead one can find other explanations, such as the ones mentioned above, that justify removing it from the baseline dataset.
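The two flagging criteria can be illustrated with a toy implementation. The data, theory values, and covariance below are hypothetical, and the n σ estimate uses a Gaussian approximation to the χ 2 distribution (mean n dat, variance 2 n dat) rather than the exact median used in the text:

```python
import numpy as np

def chi2_per_point(data, theory, cov):
    """Experimental chi2 per data point: (D-T)^T cov^{-1} (D-T) / n_dat."""
    r = data - theory
    return float(r @ np.linalg.solve(cov, r)) / len(data)

# Toy dataset with three points (hypothetical numbers).
data = np.array([1.05, 0.98, 1.10])
theory = np.array([1.00, 1.00, 1.00])
cov = np.diag([0.05**2, 0.04**2, 0.06**2])

n_dat = len(data)
chi2 = chi2_per_point(data, theory, cov)

# Number of standard deviations of the total chi2 from its expectation,
# in the Gaussian approximation to the chi2 distribution.
n_sigma = (chi2 * n_dat - n_dat) / np.sqrt(2 * n_dat)

# Flagging condition used in the dataset selection.
flagged = (n_sigma >= 3.0) and (chi2 >= 2.0)
```

For this toy dataset the fit quality is acceptable and the dataset would not be flagged.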
• The metric Z defined in Ref. [150] which quantifies the stability of the χ 2 with respect to potential inaccuracies affecting the modelling of the experimental correlations. The calculation of Z relies exclusively on the experimental covariance matrix and is independent of the theory predictions. A large value of the stability metric Z corresponds to datasets with an unstable covariance matrix, in the sense that small changes in the values of the correlations between data points lead to large increases in the corresponding χ 2 . Here we single out for further inspection datasets with Z ≥ 4.
As also described in [150], it is possible to regularise covariance matrices in a minimal manner in order to assess the impact of these numerical instabilities at the PDF or SMEFT fit level, and to determine how they affect the resulting pre- and post-fit χ 2 values. To quantify whether datasets with large Z distort the fit results in a sizeable manner, one can run fit variants applying this decorrelation procedure such that all datasets exhibit a value of the Z-metric below the threshold. We do not find it necessary to run such fits in this work.
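To convey the flavour of such a regularisation, the sketch below stabilises a nearly singular correlation model by clipping its smallest eigenvalues. This is only an illustration of the general idea, not the specific procedure of Ref. [150], and the correlation values are hypothetical:

```python
import numpy as np

def regularise_correlations(corr, eps=1e-2):
    """Clip small eigenvalues of a correlation matrix to stabilise its inverse.
    Minimal illustration; not the procedure of the reference."""
    w, v = np.linalg.eigh(corr)
    reg = v @ np.diag(np.clip(w, eps, None)) @ v.T
    # Re-normalise the diagonal to unity so the result is still a correlation matrix.
    d = np.sqrt(np.diag(reg))
    return reg / np.outer(d, d)

# Unstable correlation model: two almost fully correlated data points.
corr = np.array([[1.0, 0.9999],
                 [0.9999, 1.0]])
corr_reg = regularise_correlations(corr)

# The regularised matrix is far better conditioned, so small changes in the
# correlations no longer produce large changes in the chi2.
cond_before = np.linalg.cond(corr)
cond_after = np.linalg.cond(corr_reg)
```

An ill-conditioned covariance is exactly the situation in which small inaccuracies in the published correlations translate into large χ 2 shifts, which is what the Z-metric diagnoses.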
In Tables 2.9 and 2.10 we list the outcome of such a global SM-PDF fit, where entries that lie above the corresponding threshold values for χ 2 , n σ , or Z are highlighted in boldface. In the last column, we indicate whether the dataset is flagged. For the flagged datasets, we carry out the following tests to ascertain whether they should be retained in the fit:

• For datasets with n σ > 3 and Z > 4, we run a fit variant in which the covariance matrix is regularised. If, upon regularisation of the covariance matrix, the PDFs are stable and both the χ 2 per data point and |n σ | decrease below the respective thresholds of 2.0 and 3.0, we retain the dataset; otherwise we exclude it.
• For datasets with χ 2 > 2.0 and n σ > 3, we carry out a fit variant in which the dataset is given a very high weight. If in this high-weight fit variant the χ 2 and n σ estimators improve to the point that their values lie below the thresholds, without deteriorating the description of any of the other datasets included, the dataset is kept: the specific measurement is not inconsistent, it simply does not carry enough weight compared to the other datasets. See Ref. [40] for a detailed discussion of the size of the weight as a function of the size of the dataset.
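The high-weight test amounts to modifying the figure of merit minimised in the fit. The sketch below uses an illustrative weighted form of the loss with hypothetical per-dataset χ 2 values; the actual weight choice is discussed in Ref. [40]:

```python
# Weighted figure of merit: sum_k w_k * n_k * chi2_k / sum_k n_k.
# Illustrative form only; hypothetical chi2 and n_dat values.
def total_loss(chi2_per_dataset, n_dat_per_dataset, weights):
    num = sum(w * n * c for w, n, c in
              zip(weights, n_dat_per_dataset, chi2_per_dataset))
    return num / sum(n_dat_per_dataset)

chi2_k = [1.1, 2.5, 0.9]   # chi2 per point of three hypothetical datasets
n_k = [100, 20, 50]        # number of data points in each

baseline = total_loss(chi2_k, n_k, weights=[1, 1, 1])

# Give the poorly described (second) dataset a weight comparable to the total
# dataset size, forcing the minimiser to try to accommodate it.
highweight = total_loss(chi2_k, n_k, weights=[1, sum(n_k) / n_k[1], 1])
```

If the minimum of the high-weight loss still cannot bring the flagged dataset below threshold, or does so only at the cost of the other datasets, the measurement is deemed inconsistent.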
From the analysis of Tables 2.9 and 2.10, one finds that only two datasets, both in the inclusive top quark pair production (lepton+jets final state) category, are flagged as potentially problematic: the ATLAS |y tt | distribution at 8 TeV and the CMS double-differential distribution in m tt and y t at 13 TeV. The first of these was already discussed in the NNPDF4.0 analysis [40], where it was observed that each of the four distributions measured by ATLAS and presented in Ref. [92] behaves somewhat differently upon being given a large weight. The χ 2 of every distribution improves significantly when given large weight. However, while for the top transverse momentum and top-pair invariant mass distributions this improvement is accompanied by a rather significant deterioration of the global fit quality, in the case of the top and top-pair rapidity distributions the global fit quality is very similar and only the description of the jet data deteriorates moderately. The rapidity distributions thus remain largely compatible with the rest of the dataset, and hence are kept.
Also shown in one row of Table 2.10 is the fit-quality information for the CMS double-differential distribution at 13 TeV in the lepton+jets channel, taken from a separate fit in which the CMS single-differential distribution at 13 TeV in the lepton+jets channel is replaced by this dataset. We find that the 2D set is described very poorly, with χ 2 = 6.43 per data point, corresponding to a 22σ deviation from the median of the χ 2 distribution for a perfectly consistent dataset. To investigate this further, we performed a weighted fit; however, we find that the χ 2 improves only moderately (from χ 2 = 6.43 to χ 2 = 4.56), and moreover the χ 2 of the other datasets deteriorates significantly (with the total χ 2 jumping from 1.20 to 1.28). This test indicates that the double-differential distribution is both incompatible with the rest of the data and internally inconsistent within a standard SM-PDF fit. Hence we exclude this dataset from our baseline and include instead the single-differential distribution in m tt , which is presented in the same publication [106] and is perfectly described in the baseline fit. To check whether the incompatibility observed for the double-differential distribution can be cured by the inclusion of SMEFT corrections, we will present a dedicated analysis in Sect. 5.3.

Theoretical predictions
In this section we describe the calculation settings adopted for the SM and SMEFT cross-sections used in the present analysis.
SM cross-sections. Theoretical predictions for SM cross-sections are evaluated at NNLO in perturbative QCD whenever available, and at NLO otherwise. Predictions accurate to NLO QCD are obtained in terms of fast interpolation grids from MadGraph5_aMC@NLO [151,152], interfaced to APPLgrid [153] or FastNLO [154][155][156] together with aMCfast [157] and APFELcomb [158]. Wherever available, NNLO QCD corrections to matrix elements are implemented by multiplying the NLO predictions by bin-by-bin K-factors, see Sect. 2.3 in [159]. The top quark mass is set to m t = 172.5 GeV for all processes considered.
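The bin-by-bin rescaling described above is straightforward; the sketch below uses purely hypothetical NLO predictions and K-factor values:

```python
import numpy as np

# Hypothetical NLO predictions from fast interpolation grids (pb per bin).
sigma_nlo = np.array([12.0, 8.5, 3.2, 0.9])

# Bin-by-bin K-factors: ratio of NNLO to NLO predictions, computed once with a
# fixed PDF set and then applied multiplicatively during the fit.
k_factors = np.array([1.08, 1.05, 1.11, 1.15])

sigma_nnlo = sigma_nlo * k_factors
```

Because the K-factors are precomputed with a fixed PDF, the fast-grid convolution retains its full PDF dependence while the NNLO correction enters only as an overall bin-wise rescaling.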
In the case of inclusive tt cross-sections and charge asymmetries, a dynamical scale choice of µ R = µ F = H T /4 is adopted, where H T denotes the sum of the transverse masses of the top and anti-top quarks, following the recommendations of Ref. [18]. This scale choice minimises the ratio of the fixed-order NNLO predictions to the NNLO+NNLL ones, allowing us to neglect theory uncertainties associated to missing higher orders beyond NNLO. To obtain the corresponding NNLO K-factors, we use the public HighTEA software [160], an event database for distributing and analysing the results of fixed-order NNLO calculations for LHC processes. The NNLO PDF set used in the computation of these K-factors is either NNPDF3.1 or NNPDF4.0, depending on whether or not a given dataset was already included in the NNPDF4.0 global fit. For associated production of tt with a W or Z boson, dedicated fast NLO grids have been generated. Factorisation and renormalisation scales are fixed to µ F = µ R = m t + m V /2, where m V = m W , m Z is the mass of the associated weak boson, as appropriate. This scale choice follows the recommendation of Ref. [161] and minimises the ratio of the resummed prediction over the fixed-order NLO one. We supplement the predictions for the total cross-sections for associated W - and Z-production at 13 TeV with NLO+NNLL QCD K-factors taken from Table 1 of [161]. On the other hand, the ttγ, tttt and ttbb data are implemented as PDF-independent observables, and the corresponding theory predictions are taken directly from the relevant experimental papers in each case.
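The dynamical scale used for tt production can be written out explicitly; the transverse-momentum values below are hypothetical:

```python
import numpy as np

M_T = 172.5  # top quark mass in GeV

def dynamical_scale(pt_top, pt_antitop):
    """mu_R = mu_F = H_T / 4, with H_T the sum of the top and anti-top
    transverse masses m_T = sqrt(m_t^2 + p_T^2)."""
    h_t = np.sqrt(M_T**2 + pt_top**2) + np.sqrt(M_T**2 + pt_antitop**2)
    return h_t / 4.0

# At production threshold (both transverse momenta vanishing) the scale
# reduces to m_t / 2.
mu_threshold = dynamical_scale(pt_top=0.0, pt_antitop=0.0)
```

Away from threshold the scale grows with the event kinematics, which is what makes the choice suitable for distributions extending into the high-m tt tails.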
The evaluation of theoretical predictions for single top production follows [16]. Fast NLO interpolation grids are generated for both s-and t-channel single top-quark and top-antiquark datasets in the 5-flavour scheme, with fixed factorisation and renormalisation scales set to m t . Furthermore, for the t-channel production we include the NNLO QCD corrections to both total and differential cross sections [23]. When the top decay is calculated, it is done in the narrow-width approximation, under which the QCD corrections to the top-(anti)quark production and the decay are factorisable and the full QCD corrections are approximated by the vertex corrections.
SMEFT cross-sections. SMEFT corrections to SM processes are computed both at LO and at NLO in QCD, and both at the linear and the quadratic level in the EFT expansion. The flavour assumptions follow the LHC TOP WG prescription of [26], which was also used in the recent SMEFiT analysis [42]: under the corresponding flavour symmetry group, we single out operators that contain top quarks (the right-handed t and the SU (2) doublet Q). This also means that one works in a five-flavour scheme in which the only massive fermion in the theory is the top quark. As far as the electroweak input scheme is concerned, we work in the m W -scheme, meaning that the four electroweak inputs are {m W , G F , m h , m Z }.
At dimension six, SMEFT operators modify the SM Lagrangian as

L SMEFT = L SM + Σ n ( c n / Λ 2 ) O n ,

where Λ is the UV cut-off energy scale, {O n } are dimension-six operators, and {c n } are the corresponding Wilson coefficients. The 25 operators considered in this study are listed in Table B.1 in the Warsaw basis [162]. In this work we neglect renormalisation group effects on the Wilson coefficients [163]. For hadronic data, i.e. for proton-proton collisions, which are the only data affected by the SMEFT in this study, the linear effect of the n-th SMEFT operator on a theoretical prediction can be quantified by

T (n) = Σ i,j L NNLO ij ⊗ σ (n) ij,SMEFT ,

where i, j are parton indices, L NNLO ij is the NNLO partonic luminosity, given by the convolution of the corresponding pair of PDFs, and σ (n) ij,SMEFT is the partonic cross-section associated to the interference between O n and the SM amplitude A SM when setting c n = 1. This value of c n is only used to initialise the potential contributions of the SMEFT operator; the actual values of the Wilson coefficients are determined by the fit. Quadratic effects arising from the interference between the n-th and m-th SMEFT operators can be evaluated in an analogous manner, in terms of partonic cross-sections σ (n,m) ij,SMEFT evaluated with c n = c m = 1. The computation of the SMEFT contributions is performed numerically with the FeynRules [164] model SMEFTatNLO [165], which allows one to include NLO QCD corrections to the observables. The obtained cross-sections are then combined into so-called BSM factors by taking their ratio to the respective SM cross-sections, so that the final theory prediction takes the form

T = T SM ( 1 + Σ n c n R (n) + Σ n,m c nm R (n,m) ) , with c nm = c n c m .

This expression is at the centre of the SIMUnet methodology, which we discuss in Sect. 3.
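The combination of SM predictions with precomputed BSM factors can be sketched as follows. All numerical values (T SM, the linear factors R (n), and the quadratic factors R (n,m)) are hypothetical placeholders for the quantities produced from the SMEFTatNLO cross-sections:

```python
import numpy as np

T_SM = 10.0  # hypothetical SM prediction for one observable (pb)

# Hypothetical BSM factors: ratios of the SMEFT interference (linear) and
# squared (quadratic) contributions to the SM cross-section at c = 1.
R_lin = np.array([0.05, -0.02])            # R^(n), one entry per operator
R_quad = np.array([[0.010, 0.003],
                   [0.003, 0.004]])        # R^(n,m), symmetric in (n, m)

def smeft_prediction(c, quadratic=True):
    """T = T_SM * (1 + sum_n c_n R^(n) [+ sum_{n,m} c_n c_m R^(n,m)])."""
    ratio = 1.0 + c @ R_lin
    if quadratic:
        # Includes both diagonal and off-diagonal EFT-EFT interferences.
        ratio += c @ R_quad @ c
    return T_SM * ratio

c = np.array([1.0, 2.0])
T_lin = smeft_prediction(c, quadratic=False)
T_quad = smeft_prediction(c, quadratic=True)
```

Since the R factors are precomputed once, evaluating the SMEFT prediction during the fit reduces to this cheap polynomial in the Wilson coefficients.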

Fitting methodology
In this work, the joint determination of the PDFs and the EFT coefficients is carried out using the SIMUnet methodology, first presented in [39] and substantially extended here. The core idea of SIMUnet is to incorporate the Wilson coefficients into the optimisation problem that underlies the PDF determination, by accounting explicitly for the dependence on these coefficients of the theoretical predictions used to fit the PDFs. Specifically, the neural-network model used in the SM-PDF fits of NNPDF4.0 is augmented with an additional layer, which encodes the dependence of the theory predictions entering the fit on the Wilson coefficients.
In this section we provide an overview of the SIMUnet methodology, highlighting the new features that have been implemented for the present study.

SIMUnet overview
The SIMUnet [39] methodology extends the NNPDF4.0 framework [40,41] to account for the EFT dependence (or, in principle, any parametric dependence) of the theory cross-sections entering the PDF determination. This is achieved by adding an extra layer to the NNPDF4.0 neural network which encapsulates the dependence of the theory predictions on the EFT coefficients, including the latter as free parameters in the general optimisation procedure. This results in a simultaneous fit of the PDFs and the EFT coefficients to the input data. As in the NNPDF methodology, uncertainty estimation makes use of the Monte Carlo replica method, which yields an uncertainty estimate on both the PDF and the EFT parameters. We discuss the limitations of this method in App. E.
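The Monte Carlo replica method can be illustrated on a toy one-parameter fit: pseudodata replicas are generated by fluctuating the data within its covariance, each replica is fitted, and the spread of the fitted values provides the uncertainty. The data values and covariance below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data": three measurements of the same quantity with a known covariance.
data = np.array([1.02, 0.97, 1.05])
cov = np.diag([0.05, 0.04, 0.06]) ** 2

def best_fit(d, cov):
    """Weighted-average fit of a single parameter to the (pseudo)data."""
    w = 1.0 / np.diag(cov)
    return float(np.sum(w * d) / np.sum(w))

# Monte Carlo replica method: fluctuate the data within its covariance,
# refit each replica, and take the spread of the fits as the uncertainty.
L = np.linalg.cholesky(cov)
replicas = [data + L @ rng.standard_normal(3) for _ in range(500)]
fits = np.array([best_fit(r, cov) for r in replicas])

central, uncertainty = fits.mean(), fits.std()
```

In SIMUnet the "parameter" is the full set of PDF network weights and Wilson coefficients, but the logic of propagating data uncertainties through an ensemble of fits is the same.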
The SM theoretical observables are encoded using interpolation grids, known as FK-tables [158,166,167], which encode the contribution of both the DGLAP evolution and the hard-scattering matrix elements and interface it with the initial-scale PDFs in a fast and efficient way.
The simultaneous fit is represented as a neural network using the TensorFlow [168] and Keras [169] libraries. The architecture is schematically represented in Fig. 3.1. Trainable weights are represented by solid arrows, and non-trainable weights by dashed arrows. Through a forward pass across the network, the inputs (x-Bjorken and its logarithm) proceed through hidden layers to output the eight fitted PDFs at the initial parametrisation scale Q 0 . For each of the experimental observables entering the fit, these PDFs are then combined into a partonic luminosity L (0) at Q 0 , which is convolved with the precomputed FK-tables Σ to obtain the SM theoretical prediction T SM . Subsequently, the effects of the N EFT coefficients c = (c 1 , . . . , c N ), associated to the operator basis considered, are accounted for by means of an extra layer, resulting in the final prediction for the observable T entering the SMEFT-PDF fit. The SIMUnet code allows for both linear and quadratic dependence on the EFT coefficients. In linear EFT fits, the last layer consists of N trainable weights to account for each Wilson coefficient. In quadratic EFT fits, in addition to the N trainable weights, a set of N (N + 1)/2 non-trainable parameters, which are functions of the trainable weights, is included to account for all diagonal and non-diagonal contributions of the EFT-EFT interference to the cross-sections. The results obtained with the quadratic functionality of SIMUnet are, however, not displayed in this work, for the reasons explained in App. E. The PDF parameters θ and the EFT coefficients c entering the evaluation of the SMEFT observable in Fig. 3.1 are then determined simultaneously from the minimisation of the fit figure of merit (also known as loss function).
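The forward pass described above can be sketched in numpy with small, hypothetical array shapes: a stand-in for the network PDFs is contracted with a random stand-in FK-table to give T SM, and a final linear EFT layer rescales the prediction. This is a structural sketch of the computation, not the SIMUnet implementation itself:

```python
import numpy as np

rng = np.random.default_rng(1)

n_x, n_pdf, n_bins, n_ops = 50, 8, 4, 3

# Stand-in for the neural-network output: eight PDFs on an x-grid at the
# parametrisation scale Q0 (in SIMUnet these come from the hidden layers).
pdfs = np.abs(rng.standard_normal((n_pdf, n_x)))

# Stand-in FK-table: encodes DGLAP evolution and hard matrix elements, and
# contracts two PDFs into the SM prediction for each bin (hadronic observable).
fk = rng.standard_normal((n_bins, n_pdf, n_x, n_pdf, n_x)) * 1e-3

T_SM = np.einsum('bixjy,ix,jy->b', fk, pdfs, pdfs)

# Final (EFT) layer: linear dependence on the Wilson coefficients through
# precomputed per-bin BSM factors K of shape (n_bins, n_ops).
K = rng.standard_normal((n_bins, n_ops)) * 0.05
c = np.zeros(n_ops)            # setting c = 0 recovers the SM-PDF prediction

T = T_SM * (1.0 + K @ c)
```

In the actual fit the PDF weights and the coefficients c are trained together; freezing one set or the other yields the fixed-PDF and SM-PDF special cases discussed next.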
Figure 3.1: Schematic representation of the SIMUnet architecture for a general observable. Trainable weights are represented by solid arrows, and non-trainable weights by dashed arrows. Through a forward pass across the network, the inputs (x-Bjorken and its logarithm, in green) proceed through 2 hidden layers (in blue) to output the PDFs f 1 , · · · , f 8 (in red) at the initial parametrisation scale Q 0 . For each of the experimental observables entering the fit, these PDFs are combined into a partonic luminosity L (0) at Q 0 , which is then convolved with precomputed FK-tables Σ to obtain the SM theoretical prediction T SM . Subsequently, the effects of the EFT coefficients c i are accounted for by means of an extra layer. In linear EFT fits this layer simplifies to just N trainable weights to account for each coefficient, and in quadratic EFT fits a set of N (N + 1)/2 non-trainable weights has to be added to account for the EFT-EFT interference. The forward-pass of this layer results in the final prediction for the observable T entering the SMEFT-PDF fit. By setting the weights in the EFT layer to zero, one recovers the SM-PDF case. By freezing the PDF-related weights in the network architecture, one can carry out a fixed-PDF EFT determination or include in the joint SMEFT-PDF fit observables whose PDF dependence can be neglected.

The SIMUnet architecture can be minimally modified to deal with the fixed-PDF case, in which only the EFT coefficients are treated as free parameters in the optimisation process. This can be achieved by
freezing the PDF-related weights in the network architecture to the values obtained in some previous fit, for example a SM-PDF determination based on NNPDF4.0 . In this manner, SIMUnet can also be used to carry out traditional EFT fits where the PDF dependence of the theory predictions is neglected. Furthermore, for PDF-independent observables, computing an FK-table Σ is not required and the SM cross-section T SM can be evaluated separately and stored to be used in the fit.
As illustrated in Fig. 3.1, within the SIMUnet framework a single neural network encapsulates both the PDF and the EFT dependence of physical observables, with the corresponding parameters being simultaneously constrained from the experimental data included in the fit. Specifically, denoting by T i (θ, c) the prediction of the neural network for the i-th data point, the figure of merit to be minimised reads

χ 2 (θ, c) = (1/N dat ) Σ_{i,j} ( T i (θ, c) − D i ) (cov t0 −1 ) ij ( T j (θ, c) − D j ) ,   (3.2)

where D i denote the experimental central values and the covariance matrix in Eq. (3.2) is the t 0 covariance matrix, which is constructed from all sources of statistical and systematic uncertainties made available by the experiments, with correlated multiplicative uncertainties treated via the 't0' prescription [170] to avoid the fitting bias associated with multiplicative uncertainties. Once Eq. (3.2) is minimised for each replica, subject to the usual cross-validation stopping, one ends up with a sample of best-fit values for both the EFT coefficients and the PDF parameters,

{ θ (k) , c (k) } ,   k = 1, . . . , N rep ,

from which one can evaluate statistical properties such as averages, variances, higher moments, or confidence level intervals. For example, the preferred value of the EFT coefficients can be taken as the mean over the replica sample, though one could also define the preferred value as the median or mode of the distribution. Note that, in this methodology, the Monte Carlo replica method automatically propagates the PDF uncertainty to the distribution of the best-fit values of the EFT coefficients. Hence the variance of the EFT coefficients reflects not only the experimental uncertainty of the data included in the fit, but also the functional uncertainty associated with the PDFs. As we discuss below, the current implementation of the SIMUnet methodology also allows performing fixed-PDF fits, where only the Wilson coefficients are optimised.
This is done by freezing the weights of the PDF part of the neural network, taken from some previous fit, during the minimisation of the loss function (3.2). In this limit, SIMUnet reduces to a fixed-PDF EFT fit such as the MCfit variant of SMEFiT [171]. Likewise, by setting the EFT coefficients to zero, one recovers the same PDF weights θ (k) as in NNPDF4.0, or those of the SM-PDF fit being used as baseline in the analysis. An important caveat here is that, while in the SIMUnet methodology the PDF uncertainty is propagated to the posterior distribution of the EFT coefficients via the Monte Carlo replica method, in the MCfit variant of the SMEFiT methodology the fit of the EFT coefficients only considers the central PDF member (which in the NNPDF4.0 case corresponds to the average of the PDF replicas) for all N rep replicas, and the PDF uncertainty is propagated to the EFT coefficients by means of an additional covariance matrix (used both in the fit of the EFT coefficients and in the generation of the Monte Carlo replicas of the experimental data) that is added to the t 0 covariance matrix. Namely,

cov exp+th = cov t0 + cov th ,

where cov th includes the PDF contribution [30,42], computed as

(cov th ) ij = ⟨ ( T i (r) − ⟨T i ⟩ ) ( T j (r) − ⟨T j ⟩ ) ⟩ rep ,

in which the average is taken over PDF replicas. The two ways of propagating PDF uncertainties to the distribution of the EFT coefficients are equivalent assuming that PDF uncertainties are Gaussian and uncorrelated. SIMUnet adopts the same optimisation settings as the NNPDF4.0 analysis for the PDF-dependent part of the network, and adjusts only those hyperparameters associated with the EFT-dependent layer. Within the joint SMEFT-PDF fit, several of the fit settings, such as the prior ranges for the EFT parameters and the learning rates, are refined iteratively until convergence is achieved. In doing so, we also iterate the t 0 covariance matrix and the preprocessing exponents, as customary in the NNPDF procedure.
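The covariance-based propagation of PDF uncertainties can be sketched as follows, with hypothetical replica predictions standing in for the actual theory calculations:

```python
import numpy as np

rng = np.random.default_rng(1)
n_rep, n_dat = 100, 5

# Hypothetical theory predictions T_i^(r), evaluated on each PDF replica r.
T_rep = rng.normal(loc=10.0, scale=0.5, size=(n_rep, n_dat))

# PDF contribution to the theory covariance matrix:
# (cov_th)_ij = < (T_i - <T_i>) (T_j - <T_j>) >, averaged over PDF replicas.
delta = T_rep - T_rep.mean(axis=0)
cov_th = delta.T @ delta / n_rep

# Total covariance used in the fixed-PDF EFT fit: experimental (t0) plus PDF part.
cov_exp = np.diag(np.full(n_dat, 0.3**2))   # toy uncorrelated experimental errors
cov_tot = cov_exp + cov_th
```

By construction cov_th is symmetric and positive semi-definite, as required for a valid covariance matrix.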
In the fixed-PDF EFT fit, the user can decide both the ranges and the prior distributions to be used in the initial sampling of EFT coefficients as determined e.g. from a previous fit or from one-parameter scans.

New features
We now discuss some of the new features that have been implemented in SIMUnet, in comparison with [39], which are motivated by the needs of the SMEFT-PDF fits to LHC top quark data presented in this work. We consider in turn the following new features: the implementation of the quadratic contributions to the EFT cross-sections in the joint fits; fitting observables whose PDF dependence is negligible or non-existent; initialising the PDF weights of the neural network with the results of a previous fit; and finally, the improved initialisation of the EFT coefficients.
Quadratic EFT contributions. The version of SIMUnet used in [39] for the SMEFT-PDF fits of high-mass Drell-Yan data allowed the inclusion of quadratic contributions to the EFT cross-sections only in the approximation in which the cross-terms proportional to c i c j with i ≠ j in Eq. (2.9) are neglected. In the current implementation, SIMUnet can instead account for the full quadratic contributions to the EFT cross-sections, including the non-diagonal cross-terms. This feature can be especially important for the interpretation of top quark measurements at the LHC, given that for many observables the quadratic corrections, including the cross-terms relating different operators, can be sizeable, especially in the high-energy region.
The implementation explicitly accounts for the cross terms as non-trainable parameters which depend on the Wilson coefficients and can therefore be differentiated with respect to them during the training procedure.
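As an illustration, a toy implementation of the quadratic prediction, in which the N (N + 1)/2 cross terms are computed from the trainable coefficients rather than fitted as independent parameters (all numerical inputs are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n_eft, n_dat = 3, 4

T_sm = rng.uniform(1.0, 2.0, size=n_dat)             # toy SM cross-sections
r_lin = rng.normal(size=(n_eft, n_dat))              # linear EFT corrections
r_quad = rng.normal(size=(n_eft, n_eft, n_dat))      # quadratic corrections, incl. cross terms
r_quad = 0.5 * (r_quad + r_quad.transpose(1, 0, 2))  # symmetrise in the pair (i, j)

def smeft_prediction(c):
    """T = T_SM * (1 + sum_i c_i r_i + sum_{ij} c_i c_j r_ij).
    The cross terms c_i c_j are functions of the trainable c,
    not independent fit parameters."""
    lin = c @ r_lin
    quad = np.einsum("i,j,ijk->k", c, c, r_quad)
    return T_sm * (1.0 + lin + quad)

c = np.array([0.2, -0.1, 0.05])
T = smeft_prediction(c)
```

Because the quadratic piece is a differentiable function of c, gradient-based training sees the full O(1/Λ⁴) dependence without enlarging the trainable parameter set.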
PDF-independent observables. In the original version of SIMUnet, only physical observables with explicit dependence on both the PDFs (via the FK-tables interface) and the EFT coefficients could be included in a simultaneous fit. We have now extended the SIMUnet framework to describe observables that are independent of the PDF parameters θ, namely the weights and thresholds of the network depicted in Fig. 3.1 that output the SM partonic luminosity L (0) . For these PDF-independent observables, the SM predictions T SM are evaluated separately and stored in theory tables which can be used to evaluate the SMEFT cross-sections after applying the rescaling of Eq. (2.9); hence, these observables only depend on the Wilson coefficients c n .
In the current analysis, the four-heavy cross-sections σ tot (ttbb) and σ tot (tttt), the W -helicity measurements, and the associated top quark production cross-sections tZ, tW and ttγ are treated as PDF-independent observables, since for those cross-sections the PDF dependence can be neglected in comparison with other sources of theoretical and experimental uncertainty. The possibility to include PDF-independent observables makes SIMUnet a global SMEFT analysis tool on the same footing as SMEFiT [30,171], FitMaker [91], HepFit [172], EFTfitter [173], and Sfitter [28], among others. This is demonstrated in App. C, where it is shown that the results of a linear fixed-PDF SMEFT analysis performed with SIMUnet coincide with those obtained with SMEFiT [171] once the same experimental data and theory calculations are used. Moreover, this new feature will allow us to include in future analyses any non-hadronic observables, such as electroweak precision observables (EWPO) [174].
Fixed-PDF weight initialisation. Within the current SIMUnet implementation, one can also choose to initialise the PDF-dependent weights of the network in Fig. 3.1 using the results of a previous Monte Carlo fit of PDFs, for example an existing SM-PDF analysis obtained with the NNPDF4.0 methodology. The weights of the latter are written to file and then read by SIMUnet for the network initialisation.
This feature has a two-fold application. First, instead of initialising the network weights at random in a simultaneous SMEFT-PDF fit, one can set them to the results of a previous SM-PDF fit, thus speeding up the convergence of the simultaneous fit, with the rationale that EFT corrections are expected to represent a perturbation of the SM predictions. Second, we can use this feature to compute EFT observables in the fixed-PDF case described above using the FK-table convolution with this previous PDF set as input, as opposed to having to rely on an independent calculation of the SM cross-section. Therefore, this PDF weight-initialisation feature helps realise SIMUnet both as a fixed-PDF EFT analysis framework and as a means to assess the stability of the joint SMEFT-PDF fits upon a different choice of the initial state of the network in the minimisation.
Improved initialisation of the EFT coefficients. In the original implementation of SIMUnet it was only possible to initialise the EFT coefficients at specific values, selected beforehand by the user. In this work, we have developed more flexible initialisation schemes for the Wilson coefficients, in the sense that they can now be sampled from a prior probability distribution defined by the user. Specifically, each Wilson coefficient c i can be sampled from either a uniform U[a i , b i ] or a normal N (µ i , σ i ) distribution. The ranges (a i , b i ) of the uniform distribution U and the mean and standard deviation (µ i , σ i ) of the Gaussian distribution N are user-defined parameters, which can be assigned independently to each Wilson coefficient that enters the fit. This feature enhances the effectiveness and flexibility of the minimisation procedure by starting from the regions of the parameter space with the best sensitivity to the corresponding Wilson coefficient.
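A minimal sketch of such per-coefficient prior sampling; the coefficient names and prior parameters below are hypothetical, not the settings used in the fits:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical per-coefficient priors: each Wilson coefficient is initialised
# either uniformly on [a_i, b_i] or normally with mean mu_i and width sigma_i.
priors = {
    "ctG":  ("uniform", -1.0, 1.0),    # U[a, b]
    "ctZ":  ("normal",   0.0, 2.0),    # N(mu, sigma)
    "cQq3": ("uniform", -0.5, 0.5),
}

def sample_initial_coefficients(priors, rng):
    """Draw one starting value per Wilson coefficient from its prior."""
    init = {}
    for name, (kind, p1, p2) in priors.items():
        if kind == "uniform":
            init[name] = rng.uniform(p1, p2)
        else:
            init[name] = rng.normal(p1, p2)
    return init

c0 = sample_initial_coefficients(priors, rng)   # one draw per fit replica
```

In a replica fit, one such draw would be made per Monte Carlo replica, so that the ensemble of starting points covers the prior region.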
Another option related to the improved initialisation of the EFT coefficients is the possibility to adjust the overall normalisation of each coefficient by means of a user-defined scale factor. The motivation for introducing such a coefficient-dependent scale factor is to end up with (rescaled) EFT coefficients entering the fit which all have a similar expected range of variation. This is advantageous, since the resulting gradients entering the SGD algorithm will all be of the same order, and hence a single learning rate, appropriate for the fit at hand, can be used.

Impact of the top quark Run II dataset on the SM-PDFs
Here we present the results of a global SM-PDF determination which accounts for the constraints of the most extensive top quark dataset considered to date in such analyses, described in Sect. 2. The fitting methodology closely follows the settings of the NNPDF4.0 study [40]; see also [149] for a rebuttal of some critical arguments. This dataset includes not only the most up-to-date measurements of top quark pair production from Run II, but also all available single top production cross-sections and distributions and, for the first time, processes not previously considered in PDF studies, such as the A C asymmetry in tt production and ttV and tV associated production (with V = Z, W ).
We begin by summarising the methodological settings used for these fits in Sect. 4.1. Then in Sect. 4.2 we assess the impact of adding to a no-top baseline PDF fit various subsets of the top quark data considered in this study. In particular, we assess the impact of including updated Run II tt and single-top measurements in comparison with the subset used in the NNPDF4.0 analysis, see the second-to-last column of Tables 2.1-2.7. Furthermore, we quantify for the first time the impact of associated vector boson and single-top (tV ) as well as associated vector boson and top-quark pair production (ttV ) data in a PDF fit. Finally in Sect. 4.3 we combine these results and present a NNPDF4.0 fit variant including all data described in Sect. 2, which is compared to both the NNPDF4.0-notop baseline and to the original NNPDF4.0 set.

Fit settings
An overview of the SM-PDF fits that are discussed in this section is presented in Table 4.1. First of all, we produce a baseline fit, denoted by NNPDF4.0-notop , which is based on the same dataset as NNPDF4.0 but with all top quark measurements excluded. Then we produce fit variants A to G, which quantify the impact of including in this baseline various subsets of top data (indicated by a check mark in the table). Finally, fit variant H is the main result of this section, namely the fit in which the full set of top quark measurements described in Sect. 2 is added to the no-top baseline.
In these fits, methodological settings such as network architecture, learning rates, and other hyperparameters are kept the same as in NNPDF4.0 , unless otherwise specified. One difference is the training fraction f tr defining the training/validation split used for the cross-validation stopping criterion. In NNPDF4.0 , we used f tr = 0.75 for all datasets. Here instead we adopt f tr = 0.75 only for the no-top datasets and f tr = 1.0 instead for the top datasets. The rationale of this choice is to ensure that the fixed-PDF SMEFT analysis, where overfitting is not possible [42], exploits all the information contained in the top quark data considered in this study, and then for consistency to maintain the same settings in both the SM-PDF fits (this section) and in the joint SMEFT-PDF fits (to be discussed in Sect. 6). Nevertheless, we have verified that the resulting SM-PDF fits are statistically equivalent to the fits obtained by setting the training fraction to be 0.75 for all data, including for the top quark observables.
Fits A-G in Table 4.1 consist of N rep = 100 Monte Carlo replicas after post-fit selection criteria, while the NNPDF4.0-notop baseline fit and fit H consist of N rep = 1000 replicas each. As customary, all fits presented in this section are iterated with respect to the t 0 PDF set and the preprocessing exponents.

Impact of individual top quark datasets
First we assess the impact of specific subsets of LHC top quark data when added to the NNPDF4.0-notop baseline, fits A-G in Table 4.1. In the next section we discuss the outcome of fit H, which contains the full top quark dataset considered in this work. These findings imply that the effect on the SM-PDFs of the new Run II top data is consistent with, and strengthens, that of the data already part of NNPDF4.0, and suggest a possible tension between top quark data and other measurements sensitive to the large-x gluon in the global PDF fit, in particular inclusive jet and di-jet production. The reduction of the gluon PDF uncertainties from the new measurements can be as large as about 20% at x ≈ 0.4. Differences are much reduced for the quark PDFs, and restricted to a 5% to 10% uncertainty reduction in the region around x ∼ 0.2, with central values essentially unchanged.
To disentangle which of the processes considered dominates the observed effects on the gluon and the light quark PDFs, Fig. 4.2 compares the relative PDF uncertainty on the gluon and on the d/u quark ratio in the NNPDF4.0-notop baseline fit at Q = m t = 172.5 GeV with the results from fits A, B, C, and D. As indicated in Table 4.1, these fit variants include the following top datasets: inclusive tt (A), inclusive tt + A C (B), single top (C), and their sum (D). The inclusion of the top charge asymmetry A C data does not have any impact on the PDFs; indeed fits A and B are statistically equivalent. This is not surprising, given that in Eq. (2.1) the dependence on PDFs cancels out. Concerning the inclusion of single top data (fit C), it does not affect the gluon PDF, but it leads to a moderate reduction of the PDF uncertainties on the light quark PDFs in the intermediate-x region, x ≈ [0.01, 0.1], as shown in the right panel displaying the relative uncertainty reduction for the d/u ratio. This observation agrees with what was pointed out in a previous study [16], and the impact of the LHC single-top measurements is now more marked, as expected, since the number of data points considered here is larger. We conclude that the inclusive tt measurements dominate the impact on the large-x gluon observed in Fig. 4.1, with single top data moderately helping to constrain the light quark PDFs in the intermediate-x region.
We now consider the effect of including data that had not previously entered any PDF fit, namely tt or single-top production in association with a weak vector boson. Although the current data exhibit large experimental and theoretical uncertainties, it is interesting to study whether they impact PDF fits at all, in view of the increased precision expected from future measurements; in particular, it is useful to know which parton flavours are most affected. To this end, we consider fits E, F, and G of Table 4.1, which include the ttV (E) and tV (F) data as well as their sum (G). The pull of ttV is very small, while the pull of the tV measurements is in general small, but consistent with that of the inclusive tt measurements, namely preferring a depletion of the large-x gluon. This result indicates that ttV and tV data may help constrain the PDFs once both future experimental data and theoretical predictions become more precise, although the corresponding inclusive measurements are still expected to provide the dominant constraints.

Combined effect of the full top quark dataset
The main result of this section is displayed in Fig. 4.4, which compares the NNPDF4.0 and the NNPDF4.0-notop fits with variant H in Table 4.1, namely with the fit where the full set of top quark measurements considered in this analysis has been added to the no-top baseline. As in the case of Fig. 4.1, we show the large-x gluon normalised to the central value of NNPDF4.0-notop and the associated 1σ PDF uncertainties (all normalised to the central value of the baseline). The results of fit H are similar to those of fit D, although slightly more marked. This is expected, since as shown above the associated production datasets ttV and tV carry little weight in the fit.
From Fig. 4.4 one observes that the gluon PDF of fit H deviates markedly from the NNPDF4.0-notop baseline in the data region x ∈ [0.1, 0.5]. The shift in the gluon PDF can reach the 2σ level, and in particular the two PDF uncertainty bands do not overlap in the region x ∈ [0.2, 0.35]. As before, we observe that the inclusion of the latest Run II top quark measurements enhances the effect of the top data already included in NNPDF4.0, by further depleting the gluon in the large-x region and by reducing its uncertainty by up to 25%. Hence, one finds again that the new Run II top quark production measurements exert a strong pull on the large-x gluon, qualitatively consistent with, but stronger than, the pulls associated with the datasets already included in NNPDF4.0.
To assess the phenomenological impact of our analysis at the level of LHC processes, Figs. 4.5-4.6 compare the quark-gluon and gluon-gluon partonic luminosities obtained with NNPDF4.0, NNPDF4.0-notop, and fit H. The quark-quark luminosities are essentially unchanged, as expected given the negligible changes in the quark PDFs observed in fit H, and hence are not discussed further. From Figs. 4.5-4.6 one observes that, both for the quark-gluon and the gluon-gluon luminosity, the impact of the LHC top quark data is concentrated in the region above m X ≳ 1 TeV. As already reported for the gluon PDF, also for the luminosities the net effect of the new LHC Run II top quark data is to further reduce the luminosities for invariant masses in the TeV range, with a qualitatively similar but stronger pull as compared to that obtained in NNPDF4.0. While NNPDF4.0 and its no-top variant agree at the 1σ level in the full kinematical range considered, this is not true for fit H, whose error bands do not overlap with those of NNPDF4.0-notop for invariant masses m X between 2 and 4 TeV. On the other hand, NNPDF4.0 and fit H are fully consistent across the full m X range, and hence we conclude that predictions for LHC observables based on NNPDF4.0 will not be significantly affected by the inclusion of the latest LHC top quark data considered in this work.
Finally, the fit quality of fit H is essentially stable, and in fact slightly better, relative to the NNPDF4.0-notop baseline. The experimental χ 2 per data point on their respective datasets is 1.156 for the no-top baseline, whilst for fit H it is reduced to 1.144. A complete summary of the χ 2 information for all of the fits in this section is given in App. D. It is interesting to observe that all new top data included in fit H are already well described by the NNPDF4.0 set, although the χ 2 per data point clearly improves (from 1.139 to 1.102) once all data are included in the fit. This confirms the overall consistency of the analysis.

Impact of the top quark Run II dataset on the SMEFT

We now quantify the impact of the LHC Run II legacy measurements, described in Sect. 2, on the top quark sector of the SMEFT. As compared to previous investigations of SMEFT operators modifying top-quark interactions [26-33,42,91], the current analysis considers a wider dataset, in particular extended to various measurements based on the full LHC Run II luminosity. In the last column of Tables 2.1-2.8 we indicate which of the datasets included here are considered for the first time within a SMEFT interpretation. Here we assess the constraints that the available LHC top quark measurements provide on the SMEFT parameter space, and in particular study the impact of the new measurements as compared to those used in [42,91]. In this section we restrict ourselves to fixed-PDF EFT fits, where the input PDFs used in the calculation of the SM cross-sections are kept fixed. Subsequently, in Sect. 6, we generalise to the case in which the PDFs are extracted simultaneously with the EFT coefficients.
The structure of this section is as follows. We begin in Sect. 5.1 by describing the methodologies used to constrain the EFT parameter space both at linear and quadratic order in the EFT expansion. We also present results for the Fisher information matrix, which indicates which datasets provide constraints on which operators. In Sect. 5.2, we proceed to give the results of the fixed-PDF EFT fits at both linear and quadratic order, highlighting the impact of the new Run II top quark data by comparison with previous global SMEFT analyses. In Sect. 5.3, we assess the impact of replacing the CMS 13 TeV differential measurement of tt in the +jets channel, binned with respect to the invariant top quark pair mass, by the corresponding double-differential measurement binned with respect to both the invariant top quark pair mass and the top quark pair rapidity. In the dataset selection performed in Sect. 2.2 we rejected the double-differential distribution due to its poor χ 2 -statistic in the SM, which could not be improved by a weighted fit of PDFs; in the present section, it is interesting to see whether the SMEFT can help account for the poor fit of this dataset. Finally, in Sect. 5.4 we evaluate the correlation between PDFs and EFT coefficients to identify the kinematic regions and EFT operators which are potentially sensitive to the SMEFT-PDF interplay to be studied in Sect. 6.

Fit settings
Throughout this section, we will allow only the SMEFT coefficients to vary in the fit, keeping the PDFs fixed to the SM-PDF baseline obtained in the NNPDF4.0-notop fit discussed in Sect. 4; with this choice, one removes the overlap between the datasets entering the PDF fit and the EFT coefficient determination. Our analysis is sensitive to the N = 25 Wilson coefficients defined in Table B.1, except at the linear level, where the four-heavy coefficients c^8_Qt, c^1_QQ, c^8_QQ, c^1_Qt and c^1_tt (which are constrained only by tttt and ttbb data) exhibit three flat directions [42]. In order to tackle this, we remove the five four-heavy coefficients from the linear fit.* Hence, in our linear fit we have N = 20 independent coefficients constrained from the data, whereas in the quadratic fit we fit all N = 25 independent coefficients.
The linear EFT fits presented in this section are performed with the SIMUnet methodology in the fixed-PDF option, as described in Sect. 3; we have verified that the SIMUnet methodology reproduces the posterior distributions provided by SMEFiT (using either the NS (Nested Sampling) or MCfit options) for a common choice of inputs, as explicitly demonstrated in App. C. However, in the case of the quadratic EFT fits we are unable to use the SIMUnet methodology, due to a failure of the Monte Carlo sampling method utilised in the SIMUnet and SMEFiT codes; this is discussed in App. E, and a dedicated study of the problem will be the subject of future work. For this reason, quadratic EFT fits in this section are carried out with the public SMEFiT code in its NS mode [171]. To carry out these fits, the full dataset listed in Tables 2.1-2.8, together with the corresponding SM and EFT theory calculations described in Sect. 2.3, has been converted to the SMEFiT data format (this conversion was also used for the benchmarking of App. C).
Fisher information. The sensitivity of the various processes entering the fit to the EFT operators can be evaluated by means of the Fisher information matrix F ij , which quantifies the information carried by a given dataset on the EFT coefficients c i [175]. In a linear EFT setting, the Fisher information is given by

F ij = Σ_{k,l} L k (i) (cov exp −1 ) kl L l (j) ,

where the k-th entry, L k (i) , of the vector L (i) is the linear contribution multiplying c i in the SMEFT theory prediction for the k-th data point, and cov exp is the experimental covariance matrix. In particular, the Fisher information is an N × N matrix, where N is the number of EFT coefficients, and it depends on the dataset. An important property of the Fisher information is that it is related to the covariance matrix C ij of the maximum likelihood estimators by the Cramér-Rao bound,

C ij ≥ (F −1 ) ij ,

indicating that larger values of F ij translate into tighter bounds on the EFT coefficients. Before displaying the results of the fixed-PDF SMEFT analysis in Sect. 5.2, we use the Fisher information to assess the relative impact of each sector of top quark data on the EFT parameter space; this is done in the linear analysis, including O(1/Λ 2 ) SMEFT corrections. In the quadratic case, once O(1/Λ 4 ) SMEFT corrections are included, the dependence of F ij on the Wilson coefficients makes the interpretation more difficult. We quantify the relative constraining power of a dataset D on the coefficient c i by the diagonal entry F ii (D), normalised such that the sum over all datasets adds up to 100%. Since F ii (D) corresponds to the constraining power of the dataset D in a one-parameter fit of the Wilson coefficient c i , this definition only quantifies how much a dataset impacts one-parameter fits of single Wilson coefficients in turn; however, this gives a general qualitative picture of some of the expected behaviour in the global fit too. We display the results of evaluating the relative constraining power of each top quark data sector on each of the parameters in Fig. 5.1, quoting the results in percent (%). As expected, tt total cross sections constitute the dominant source of constraints on the coefficient c tG .

* In principle one could instead rotate to the principal component analysis (PCA) basis and constrain the two non-flat directions in the four-heavy subspace, but even so, the obtained constraints remain much looser in comparison with those obtained in the quadratic EFT fit [42]. In our fits, we also keep the tttt and ttbb datasets after removing the five four-heavy coefficients; we have verified that including or excluding these sets has no significant impact on the remaining coefficients.
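The construction of the Fisher information and of the relative constraining power can be sketched as follows for a toy linear EFT problem; the subset split and all numerical inputs are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
n_eft, n_dat = 4, 12

# L[k, i]: linear SMEFT contribution multiplying c_i in the k-th data point.
L = rng.normal(size=(n_dat, n_eft))
cov_exp = np.diag(rng.uniform(0.1, 0.5, size=n_dat) ** 2)  # toy diagonal errors

# Fisher information: F_ij = sum_{k,l} L_k^(i) (cov_exp^-1)_{kl} L_l^(j).
F = L.T @ np.linalg.inv(cov_exp) @ L

# Relative constraining power of two (hypothetical) data subsets on each c_i,
# from the diagonal entries F_ii evaluated per subset, normalised to 100%.
subsets = {"ttbar": slice(0, 8), "single-top": slice(8, 12)}
F_diag = {name: np.einsum("ki,kl,li->i",
                          L[s], np.linalg.inv(cov_exp[s, s]), L[s])
          for name, s in subsets.items()}
total = sum(F_diag.values())
rel = {name: 100.0 * d / total for name, d in F_diag.items()}
```

For a diagonal experimental covariance, the per-subset diagonal entries add up to the full-dataset F_ii, so the percentages sum to 100 for each coefficient by construction.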
Each of the four-fermion operators receives important constraints from the differential tt distributions and charge asymmetry measurements. Note that this impact is magnified when we go beyond individual fits, in which case the measurements of charge asymmetries help to break flat directions amongst the four-fermion operators [28,91]. The coefficient c^{1,3}_Qq is the exception, as it is instead expected to be well constrained by single top production. We note that the measurements of W helicities help to constrain the coefficient c tW , while ttZ measurements provide the dominant source of constraints on c

Fixed-PDF EFT fit results
In this section, we present the results of the linear and quadratic fixed-PDF fits with the settings described in Sect. 5.1.

Table 5.1: The values of the χ 2 per data point for the fixed-PDF EFT fits presented in this section, both for individual groups of processes and for the total dataset. Here the χ 2 is the χ 2 exp+th defined by using the theory covariance matrix of Eq. (3.7). In each case we indicate the number of data points, the χ 2 exp+th obtained using the baseline SM calculations, and the results of both the linear and quadratic EFT fits.
Fit quality. We begin by discussing the fit quality of the global SMEFT determination, quantifying the change in the data-theory agreement relative to the SM in both the linear and quadratic SMEFT scenarios. Table 5.1 provides the values of the χ 2 per data point in the SM and in the SMEFT, at both linear and quadratic order in the EFT expansion, for each of the processes entering the fit. Here, in order to ease the comparison of our results with those of SMEFiT and FitMaker, we quote the χ 2 per data point computed by using the covariance matrix defined in Eq. (3.7), which includes both the experimental uncertainty and the PDF uncertainty. The corresponding values obtained using the experimental χ 2 definition of Eq. (2.3), along with a fine-grained description of the fit quality, are given in App. D.
We observe that in many sectors the linear EFT fit improves the fit quality compared to the SM; notably, the χ 2 exp+th per data point for inclusive tt improves substantially, from 1.71 to 1.11. When quadratic corrections are also considered, the fit quality is usually poorer than in the linear fit; for example, in inclusive tt the χ 2 exp+th per data point deteriorates from 1.11 to 1.69. This is not unexpected, however, since the flexibility of the quadratic fit is limited by the fact that for sufficiently large values of the Wilson coefficients the EFT can only make positive corrections. † It is also useful to calculate the goodness of fit, quantified by the χ 2 per degree of freedom, χ 2 /n dof = χ 2 /(n dat − n param ), which additionally accounts for the complexity of the model used in each fit. In our case, we find χ 2 exp+th /n dof = 1.25 in the SM, and χ 2 exp+th /n dof = 0.95 and 1.33 in the linear and quadratic EFT scenarios respectively. Hence, while the EFT at quadratic order does not provide a better fit than the SM, the linear EFT fit, which neglects quadratic corrections, leads to a significant improvement in the overall fit quality.
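As a simple illustration of this bookkeeping (with made-up χ² totals and parameter counts, not the values entering the table above):

```python
def chi2_per_dof(chi2_total, n_dat, n_param):
    """Goodness of fit: chi^2 divided by n_dof = n_dat - n_param,
    penalising models with more fitted parameters."""
    return chi2_total / (n_dat - n_param)

# Illustrative numbers only: same dataset, three models of increasing complexity.
sm = chi2_per_dof(chi2_total=210.0, n_dat=175, n_param=0)          # SM: no EFT parameters
linear = chi2_per_dof(chi2_total=150.0, n_dat=175, n_param=20)     # linear EFT: 20 coefficients
quadratic = chi2_per_dof(chi2_total=200.0, n_dat=175, n_param=25)  # quadratic EFT: 25 coefficients
```

The point of the n_param subtraction is that a lower raw χ² only counts as a genuinely better fit once the extra fitted parameters are accounted for.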
Constraints on the EFT parameter space. Next, we present the constraints on the EFT parameter space. In Fig. 5.2, we display the 95% CL constraints on the 20 Wilson coefficients entering the linear fit. Two sets of constraints are shown: in green, the intervals obtained from a fit to the 175 data points introduced in Sect. 2, and in orange, the intervals obtained from a fit to the 150 data points of the older top quark dataset used in the global analysis of Ref. [42]. Note that at the linear EFT level, the results obtained with SIMUnet coincide with those provided by SMEFiT [42] for the same dataset, as demonstrated in App. C, and that the constraints on selected coefficients are rescaled by the factors shown, for display purposes. This comparison allows us to quantify the information gained from the latest Run II datasets relative to those available to previous analyses. The same comparison, this time at quadratic order in the EFT expansion, is shown in Fig. 5.3 (in this plot we display constraints on all 25 coefficients, including the 4-heavy coefficients c 8 Qt , c 1 QQ , c 8 QQ , c 1 Qt and c 1 tt ). We first note that Figs. 5.2 and 5.3 both demonstrate good agreement between the fits using the old and new datasets, and consistency of the new SMEFT bounds with the SM. At the linear level, the most noticeable improvement concerns c tG : its 95% C.L. bounds decrease from [−0.13, 0.41] to [−0.18, 0.17], thanks to the increased amount of information in the input dataset, coming in particular from the tt data. This results in both a tightening of the constraints by about 35% and a shift in the best-fit point. For many of the other coefficients, the bounds are either stable (e.g. c tZ ), or exhibit a shift in central value but no significant tightening (e.g. c ϕt undergoes a shift of −14.3, but its 95% C.L. bounds tighten by only 1%, while c (−) ϕQ undergoes a shift of −7.33, with its bounds tightening by only 2%). Finally, we note that some coefficients instead exhibit a broadening of the constraints with the new dataset relative to the old one (for example, some of the four-fermion operators). The increase in the size of the constraints could point to some inconsistency within the new inclusive tt dataset; however, given that these bounds are very large in either case, at the edges of the intervals we are likely to approach a region where the EFT expansion is no longer valid, hence no definite conclusions may be drawn.
At quadratic order in the EFT expansion, however, the impact of the latest Run II dataset is clear; we see a marked improvement in many of the SMEFT constraints. As shown in Fig. 5.3, the bounds on all 14 four-fermion operators become noticeably smaller as a result of the increased precision in the tt sector. The constraint on c tZ is improved by the inclusion of measurements of the ttγ total cross sections, resulting in a tightening of 24%. The addition of the p γ T spectrum [129] would yield an even stronger constraint, as seen in [91]; we will make use of this observable in future work, once unfolded measurements become available. Contrary to the linear fit, where we singled out c ϕt and c (−) ϕQ , any shifts observed here may be attributed simply to the ambiguity associated with the EFT validity in that particular region of the parameter space.
Correlations. Figure 5.4 shows the correlations between the Wilson coefficients evaluated in this analysis at linear and quadratic order in the EFT expansion, shown in the left and right panels respectively. In the linear fit, we first note a number of large correlations amongst the octet four-fermion operators, which enter tt production together. The singlet four-fermion operators are similarly correlated among themselves, although their correlations are comparatively suppressed. The coefficients c ϕt and c ϕQ also remain correlated. In the quadratic fit, the 4-heavy operators are included as well, and we find large anti-correlations between c 1 QQ and c 8 QQ , indicating that they are poorly distinguished by the tttt and ttbb processes. Finally, note that we obtain subtle non-zero correlations between the octet and singlet four-fermion operators constructed from the same fields, for example between c 8 Qd and c 1 Qd . This results from the fact that tt measurements provide the dominant source of constraints on these coefficients and are very sensitive to quadratic corrections, and at this order the contributions from these operators differ only by a numerical factor.

Study of the CMS 1D vs 2D distribution
In the dataset selection discussed in Sect. 2.2, the double-differential distribution measurement performed by CMS at √ s = 13 TeV in the ℓ+jets channel [106] - binned with respect to the top quark pair invariant mass m tt and the top quark pair rapidity y tt - is found to be poorly described by the SM theory, with an experimental χ 2 that is 22σ away from the median of the χ 2 distribution of a perfectly consistent dataset of n dat = 34 points. Even when increasing the weight of this dataset in a weighted fit (see Sect. 2.2 for a more detailed explanation), the χ 2 exp per data point improves only moderately, to 4.56, while the χ 2 -statistic of the other datasets deteriorates significantly, hinting both at an internal incompatibility of the CMS 2D distribution and at an incompatibility with the rest of the data. For this reason, the dataset was excluded from our analysis and replaced by the single-differential distribution in m tt presented in the same publication [106]. The latter is well described by the SM theoretical predictions. In this section we present a dedicated analysis to assess whether the SMEFT corrections can improve the theoretical description of this dataset. In particular, we compare a fixed-PDF fit including the 13 TeV CMS double-differential (m tt , y tt ) distribution (CMS 2D) to the default one including the 13 TeV single-differential m tt distribution (CMS 1D).
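The quoted distance in units of σ between an observed χ 2 and the median of the χ 2 distribution for n dat points can be estimated as follows. This sketch uses the Wilson-Hilferty approximation for the median and the standard deviation √(2n) of the χ 2 distribution; it is an illustration of the statistical test, not necessarily the exact procedure used in the analysis.

```python
import math

def chi2_nsigma(chi2_obs, n):
    """Distance of chi2_obs from the median of a chi2 distribution with n
    degrees of freedom, in units of its standard deviation sqrt(2n)."""
    median = n * (1.0 - 2.0 / (9.0 * n)) ** 3  # Wilson-Hilferty approximation
    return (chi2_obs - median) / math.sqrt(2.0 * n)

# e.g. the weighted-fit value chi2/n_dat = 4.56 with n_dat = 34 still lies
# far in the tail of the expected chi2 distribution
print(round(chi2_nsigma(4.56 * 34, 34), 1))  # -> 14.8
```

Even after the weighting, the CMS 2D dataset thus remains many standard deviations away from statistical consistency, which motivates its replacement by the 1D distribution.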
First of all, it is interesting to notice that the inclusion of quadratic SMEFT corrections does not improve much the quality of the fit to the CMS 2D distribution, with χ 2 exp+th /n dat decreasing only mildly from its SM value of about 2. In Table 5.2 we compare the fit quality of the two (fixed-PDF) SMEFT fits, CMS 1D and CMS 2D, both for individual groups of processes and for the total dataset. We observe that the fit quality deteriorates both in the SM and in the quadratic SMEFT fit once the CMS 2D distribution is fitted. This deterioration is mostly driven by the tt and ttW sectors. At the level of the linear SMEFT fit, however, the fit quality deteriorates only moderately, driven mostly by a slight deterioration in the inclusive tt sector and in the tttt and ttbb one.
At the level of the Wilson coefficients, the two quadratic SMEFT fits yield similar 95% C.L. bounds on the EFT coefficients. This is somewhat expected, given that the fit quality of the CMS 2D data does not improve once quadratic SMEFT corrections are included, while the fit quality of the other datasets remains essentially unchanged. On the other hand, the bounds obtained from a linear SMEFT fit change more significantly depending on whether the CMS 1D or the CMS 2D distribution is used in the fit. In particular, the central value of the c tG 95% C.L. bounds is shifted upwards by 1σ, as are the bounds on c 1 Qd and c 1 ut , while the bounds on c 1 Qu and c 1 qt shift downwards, as expected from the EFT coefficient correlations displayed in Fig. 5.4.
As a result of our analysis, we observe that the bounds on the EFT coefficients in the linear case depend quite significantly on the input dataset, and careful consideration must be given to the overall dataset compatibility and to the outcome of the fit once quadratic corrections are included. In particular, the current analysis shows that the incompatibility of the 13 TeV CMS double-differential distribution cannot be entirely attributed to new physics effects parametrised by the SMEFT expansion; rather, there are internal experimental or theoretical incompatibilities that affect the results.

(Table 5.2: same as Table 5.1, now comparing the values of the χ 2 exp+th per data point for the fixed-PDF EFT fits presented in this section, the default one including the 13 TeV single-differential m tt distribution (CMS 1D) and the one including the 13 TeV CMS double-differential (m tt , y tt ) distribution (CMS 2D), both for individual groups of processes and for the total dataset.)

Correlations between PDFs and EFT coefficients
We conclude this section by discussing the correlations observed between the PDFs and the Wilson coefficients. The PDF-EFT correlation coefficient for a Wilson coefficient c and a PDF f (x, Q) at given x and Q is defined as

ρ ( c, f (x, Q) ) = ( ⟨ c (k) f (k) (x, Q) ⟩ k − ⟨ c (k) ⟩ k ⟨ f (k) (x, Q) ⟩ k ) / ( σ c σ f ) ,

where c (k) is the best-fit value of the Wilson coefficient for the k-th replica, f (k) is the k-th PDF replica computed at the given x and Q, ⟨ · ⟩ k represents the average over all replicas, and σ c and σ f denote the corresponding standard deviations over replicas. We compute the correlation between a SM PDF and the Wilson coefficients, both of which have been separately determined from the total dataset including all new top quark data. By doing so we hope to shed light on which Wilson coefficients, and which PDF flavours and kinematical regions, are strongly impacted by the top quark data and therefore exhibit a potential for interplay in a simultaneous EFT-PDF determination. The EFT corrections are restricted to linear order in the EFT expansion. Fig. 5.6 displays a selection of the largest correlations. We observe that the gluon PDF in the large-x region is significantly correlated with the Wilson coefficients c tG and c (8) ut . Relatively large correlations are also observed between c tG and the total singlet Σ, while the total valence V and non-singlet triplet T 3 PDFs show no relevant correlations with the selected coefficients. This is not surprising, given the impact of top quark pair production total cross sections and differential distributions in constraining these PDFs and Wilson coefficients. Whilst these correlations are computed from a determination of the SMEFT in which the PDFs are fixed to SM PDFs, the emergence of non-zero correlations provides an indication of the potential for interplay between the PDFs and the SMEFT coefficients; this interplay will be investigated in a simultaneous determination in the following section.

(Fig. 5.6: results for the gluon, the total singlet Σ, total valence V , and non-singlet triplet T 3 PDFs, for representative EFT coefficients, namely c tG and c (8) ut .)
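The correlation coefficient defined above is straightforward to evaluate over the Monte Carlo replicas; a minimal sketch, with illustrative array names:

```python
import numpy as np

def pdf_eft_correlation(c_reps, f_reps):
    """Pearson correlation over replicas between a Wilson coefficient (one
    best-fit value per replica) and a PDF evaluated at fixed (x, Q)."""
    c = np.asarray(c_reps, dtype=float)
    f = np.asarray(f_reps, dtype=float)
    num = (c * f).mean() - c.mean() * f.mean()
    return num / (c.std() * f.std())

# toy replicas: a perfectly anti-correlated pair gives rho = -1
print(pdf_eft_correlation([0.1, 0.2, 0.3], [3.0, 2.0, 1.0]))  # -> -1.0 (up to rounding)
```

In practice c_reps would hold the best-fit Wilson coefficient of each replica and f_reps the corresponding PDF replica values at the chosen (x, Q) point.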

SMEFT-PDFs from top quark data
In this section we present the main results of this work, namely the simultaneous determination of the proton PDFs and the SMEFT Wilson coefficients from the LHC Run II top quark data described in Sect. 2, following the SIMUnet methodology summarised in Sect. 3. This determination of the SMEFT-PDFs from top quark data is carried out at the linear, O(1/Λ 2 ), level in the EFT expansion. We do not perform simultaneous fits at the quadratic level due to shortcomings of the Monte Carlo replica method, on which SIMUnet is based; this is discussed in detail in App. E, and will also be the subject of future work.
PDFs from a joint SMEFT-PDF fit. We begin by discussing the PDFs obtained through a joint fit of PDFs and Wilson coefficients from the complete LHC top quark dataset considered in this work. Simultaneously extracting the PDFs and the EFT coefficients from top quark data has a marked impact on the former, as compared to a SM-PDF baseline, but, as we shall see, has much less impact on the latter, as compared to the results of the corresponding fixed-PDF EFT analyses. Fig. 6.1 displays a comparison between the gluon and quark singlet PDFs, as well as of their relative 1σ PDF uncertainties, for the NNPDF4.0-notop baseline, the SM-PDFs of fit H in Table 4.1 which include the full top quark dataset, and their SMEFT-PDF counterparts based on the same dataset. PDFs are compared in the large-x region for Q = m t = 172.5 GeV. In the left panel they are normalised to the central value of the NNPDF4.0-notop fit.
While differences are negligible for the quark singlet PDF, both in terms of central values and uncertainties, they are more marked for the gluon PDF. Two main effects are observed therein. First, the central value of the SMEFT-PDF gluon moves upwards as compared to the SM-PDF fit based on the same dataset, ending up halfway between the latter and the NNPDF4.0-notop fit. Second, uncertainties increase for the SMEFT-PDF determination as compared to the SM-PDFs extracted from the same data, becoming close to the uncertainties of the NNPDF4.0-notop fit except for x ≳ 0.5. In both cases, differences are restricted to the large-x region with x ≳ 0.1, where the impact of the dominant top quark pair production measurements is the largest.
The results of Fig. 6.1 for the gluon PDF therefore indicate that within a simultaneous extraction of the PDFs and the EFT coefficients, the impact of the top quark data on the PDFs is diluted, with the constraints it provides partially 'reabsorbed' by the Wilson coefficients. This said, there remains a pull of the top quark data as compared to the no-top baseline fit which is qualitatively consistent with the pull obtained in a SM-PDF determination based on the same dataset, albeit of reduced magnitude. Interestingly, as we show below, while the SMEFT-PDF gluon is significantly modified in the joint fit as compared to a SM-PDF reference, much smaller differences are observed at the level of the bounds on the EFT parameters themselves. Consistently with the results obtained at the PDF level, at the level of partonic luminosities one finds that the three luminosities are almost identical for m X ≤ 1 TeV, while at higher invariant masses the SMEFT-PDF predictions are bracketed between the NNPDF4.0-notop fit from above and the SM-PDF which includes all top data (fit H in Table 4.1) from below. Hence, the net effect of simultaneously fitting the PDFs and the EFT coefficients is a dilution of the constraints provided by the top quark data on the former, which translates into larger PDF uncertainties (which end up rather similar to those of NNPDF4.0-notop) and an increase in the large-m X luminosity, e.g. of 5% in the gg case for m X ≈ 3 TeV, as compared to the SM-PDF luminosity.
This dilution arises because of an improved description of the top quark data included in the fit, especially in the high-m tt bins. In the SM-PDF case this can only be achieved by suppressing the large-x gluon, while in a SMEFT-PDF analysis it can also be achieved by shifting the EFT coefficients away from their SM values. In other words, as compared to the NNPDF4.0-notop baseline, the gluon PDF is suppressed at large x by up to 10% when fitting the top quark data, and this pull is reduced by approximately a factor of two in the joint SMEFT-PDF determination due to the coherent effect of the linear EFT cross-sections.
EFT coefficients from a joint SMEFT-PDF fit. As opposed to the marked effect of the SMEFT-PDF interplay found for the large-x gluon, its impact is more moderate at the level of the bounds on the EFT coefficients, and is restricted to mild shifts in the central values and a slight broadening of the uncertainties. This is illustrated by Fig. 6.3, showing the posterior distributions for the Wilson coefficient c tG associated to the chromomagnetic operator in the joint SMEFT-PDF determination, compared with the corresponding results from the fixed-PDF EFT analysis whose settings are described in Sect. 5. The comparison is presented both for the fits which consider only top-quark pair production data and those based on the whole top quark dataset considered in this work. The leading effect of the chromomagnetic operator O tG is to modify the total rates of tt production without altering the shape of the differential distributions, and hence it plays an important role in a simultaneous SMEFT-PDF determination based on top quark data.
For fits based only on inclusive tt data, as shown in the left panel of Fig. 6.3, the two posterior distributions are similar; the distribution based on the SMEFT-PDF analysis is broader by approximately 10% as compared to the fixed-PDF EFT fit to the same measurements. This slight broadening is washed out in the fit to the full top quark dataset, as shown in the right panel of Fig. 6.3. In both cases, the determination of c tG is consistent with the SM at the 95% CL, and the best-fit values of the coefficient are the same in the SMEFT-PDF and fixed-PDF EFT analyses. Hence, in the specific case of the chromomagnetic operator, the interplay between PDF and EFT fits is rather moderate and restricted to a broadening of at most 10% in the 95% CL bounds. Similar comparisons have been carried out for other EFT coefficients, as well as in the context of fits to subsets of the data and/or of the coefficients. We find that in general the impact of the SMEFT-PDF interplay translates into a broadening of the uncertainties in the EFT coefficients, which reaches at most 30%, while the best-fit values remain stable ‡ . All in all, within the global fit based on the best available theory predictions, results for the EFT coefficients turn out to be very similar in the fixed-PDF EFT and SMEFT-PDF fits. This indicates that, provided a broad enough dataset and the most up-to-date theory calculations are used, the PDF dependence of the cross-sections entering an EFT interpretation of the LHC data is currently subdominant and can be neglected (this is not the case for the PDFs, see Fig. 6.2). Nevertheless, this statement applies only to the dataset currently available, and it is likely that the SMEFT-PDF interplay will become more significant in the future once HL-LHC measurements become available [176,177], as demonstrated in the case of high-mass Drell-Yan [35].
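The 95% CL bounds and the broadenings quoted above can be extracted directly from the replica posteriors; a schematic example with synthetic Gaussian samples (the widths and sample sizes are illustrative only, not the actual posteriors):

```python
import numpy as np

def ci95(samples):
    """95% CL interval from the 2.5th and 97.5th percentiles of the replicas."""
    lo, hi = np.percentile(samples, [2.5, 97.5])
    return lo, hi

def broadening(samples_simu, samples_fixed):
    """Relative widening of the 95% CL interval in the joint fit with
    respect to the fixed-PDF fit."""
    lo_s, hi_s = ci95(samples_simu)
    lo_f, hi_f = ci95(samples_fixed)
    return (hi_s - lo_s) / (hi_f - lo_f) - 1.0

rng = np.random.default_rng(0)
fixed = rng.normal(0.0, 1.0, 50_000)  # toy fixed-PDF posterior for some coefficient
joint = rng.normal(0.0, 1.1, 50_000)  # toy joint-fit posterior, 10% wider
print(f"{broadening(joint, fixed):+.0%}")  # close to +10%
```

The same two functions applied coefficient by coefficient reproduce the kind of comparison summarised in Fig. 6.4.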
The moderate impact of the SMEFT-PDF interplay on the Wilson coefficients for the full top quark dataset considered in this work is summarised in Fig. 6.4, which compares the 95% CL intervals of the 20 fitted Wilson coefficients relevant for the linear EFT fit obtained from the outcome of the joint SMEFT-PDF determination and the fixed-PDF EFT analysis. The latter is based on SM and EFT calculations performed with NNPDF4.0-notop as input; see also the description of the settings in Sect. 5. The dashed horizontal line indicates the SM prediction, and some coefficients are multiplied by the indicated prefactor to facilitate the visualisation of the results. Fig. 6.4 demonstrates that, other than slight broadenings and shifts in the central values, the results of the two analyses coincide.
Correlations. Fig. 6.5 displays the correlation coefficients [166] between the SMEFT-PDFs and the Wilson coefficients, evaluated at Q = 172.5 GeV as a function of x. Each panel displays the correlations of a given coefficient c k with the gluon and with the total singlet Σ, total valence V , and non-singlet triplet T 3 PDFs, for four representative EFT coefficients, namely c 8 td , c 8 tu , c tG , and c tW . The largest correlations within the EFT coefficients considered in this work are associated with the gluon PDF and four-fermion operators such as c 8 td and c 8 tu in the large-x region, peaking at x ≈ 0.3. Correlations for other values of x and for the quark PDFs are negligible for all operators entering the analysis. We note that future data with an enhanced coverage of the high-m tt region in top quark pair production might alter this picture, given that for m tt ≳ 3 TeV the qq luminosity starts to become more relevant and eventually dominates over the gg contribution. We have also compared the residuals of the Wilson coefficients, defined as the ratio c * n /σ n , where c * n and σ n are the median value and the standard deviation of the Wilson coefficient c n respectively, with n = 1, . . . , N , where N is the number of operators. The outcome of the joint SMEFT-PDF determination is compared with that of a fixed-PDF EFT analysis where we use as input for the theory calculations the SMEFT-PDFs obtained in the joint fit, rather than the NNPDF4.0-notop set. That is, in both cases the information provided by the top quark data on the PDFs and Wilson coefficients is accounted for, but in one case the cross-correlations between them are neglected, whereas in the other they are retained. The residuals are similar in the two cases; they are slightly larger (in absolute value) in the fixed-PDF case, in which the correlations between the SMEFT-PDFs and the EFT coefficients are ignored.
This analysis further emphasises that, for the currently available top quark data, the cross-talk between PDFs and EFT degrees of freedom does not significantly modify the posterior distributions in the space spanned by the Wilson coefficients.
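The residual comparison described above reduces to a median-over-spread ratio per coefficient; a minimal sketch, under the assumption that the residual is defined as c * n /σ n , i.e. the pull of each coefficient away from its SM value of zero:

```python
import numpy as np

def residuals(posterior_samples):
    """Residual (pull from the SM value c_n = 0) for each Wilson coefficient:
    median of the replica distribution divided by its standard deviation.
    posterior_samples has shape (n_rep, n_coeff)."""
    samples = np.asarray(posterior_samples, dtype=float)
    return np.median(samples, axis=0) / samples.std(axis=0)

# toy posterior: two coefficients, one centred on the SM, one shifted away from it
toy = np.array([[0.0, 1.0],
                [0.1, 2.0],
                [-0.1, 3.0]])
print(residuals(toy))  # first residual is 0, second is positive
```

Comparing these residuals between the joint and fixed-PDF determinations is what quantifies the (small) effect of the neglected cross-correlations.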
In summary, on the one hand we find that, from the point of view of a PDF determination, SM-PDFs and SMEFT-PDFs extracted from top quark data differ by an amount comparable to their respective uncertainties in the case of the large-x gluon. On the other hand, at the level of Wilson coefficients the results are unchanged irrespective of the PDF set used as input for the theory calculations; that is, bounds based on NNPDF4.0-notop or the SMEFT-PDFs are almost the same. Hence, while EFT interpretations of top quark data can safely ignore the PDF dependence, at least for the settings adopted in this work, a global PDF fit could be significantly distorted if BSM physics were present in the high-energy top quark distributions. We use these findings in App. A to provide recommendations for interpretations of LHC top quark data in the context of either PDF or EFT analyses.

Conclusions and outlook
In this work we have carried out a comprehensive theoretical interpretation of top quark data from the LHC, including all available measurements based on the full integrated luminosity of Run II, in terms of the proton PDFs and of Wilson coefficients in the SMEFT. We have integrated a global determination of PDFs with the fit of dimension-six Wilson coefficients modifying top quark interactions into a simultaneous determination of the SMEFT-PDFs and the EFT coefficients. The main outcome of our analysis is the assessment of the conditions under which the usual assumptions of SM-PDF and fixed-PDF EFT analyses are valid, and the identification of when the SMEFT-PDF interplay cannot be neglected without introducing a sizeable bias in the results. Furthermore, we have provided strategies to disentangle eventual BSM signals from QCD effects in the interpretation of these top quark measurements. As a by-product of this determination of the SMEFT-PDFs, we have also presented state-of-the-art SM-PDF and fixed-PDF EFT interpretations of top quark data based on the broadest LHC experimental dataset to date, therefore fully exploiting the information contained in the legacy Run II top quark measurements. From the SM-PDF study, we have quantified the impact of the recent Run II top quark production data on the large-x gluon, and assessed their compatibility with the information provided by the datasets already included in previous global fits such as NNPDF4.0. From the fixed-PDF SMEFT analysis, we have benchmarked the SIMUnet performance by reproducing the SMEFiT results, determined the improved constraints provided by the full-luminosity Run II data, and compared our findings with the results presented by other groups. All in all, we demonstrate the unparalleled sensitivity of Run II top quark data both to probe the proton structure and to search for signatures of new physics arising as anomalous top quark properties and interactions.
The results presented in this work could be extended in a number of directions. One of the most pressing matters is the provision of a simultaneous determination of PDFs with EFT coefficients entering at quadratic order; the difficulties associated with including quadratic EFT corrections in our current SIMUnet methodology are described in App. E. This will require significant modifications to the Monte Carlo replica method used by SIMUnet, and will be the subject of future work.
On a different note, top quark measurements at the LHC have also been used to determine the strong coupling α s (m Z ) as well as the top quark mass m t , both independently and simultaneously with the PDFs [7]. It would hence be interesting to extend the SIMUnet framework to account for the determination of other SM parameters from top quark data in addition to the PDFs, as sketched in the original SIMUnet publication [39], and to study their interplay with eventual new physics effects. Additionally, it would be interesting to extend future versions of the SIMUnet methodology to account for potential renormalisation group effects on the Wilson coefficients [163,[178][179][180]]. One may also want to assess the SMEFT-PDF interplay for other types of processes beyond those considered so far, and in particular study inclusive jet and di-jet production, which are instrumental for PDF fits in the same region constrained by the top data [181] and also provide information on a large number of poorly-known SMEFT operators. In the longer term, even processes that are never used for PDF fits, such as Higgs or gauge boson pair production, may reach a precision that makes them competitive, in which case the only option to account for this information is a simultaneous determination of PDFs and EFT coefficients.
Another possible development of the SIMUnet methodology would be to establish the interplay between PDFs and model parameters such as masses and couplings in specific UV-complete BSM scenarios, by using the SMEFT as a matching bridge [182]. It would also be interesting to extend the joint SMEFT-PDF determination as implemented in SIMUnet to novel types of observables with enhanced or even optimal sensitivity to either the PDFs or the EFT coefficients, such as the ML-assisted unbinned multivariate observables introduced in Ref. [183]. Finally, following along the lines of the HL-LHC projections for high-mass W and Z production presented in [35], one can study the projected SMEFT-PDF synergies at future colliders: certainly at the HL-LHC, but also at the Electron Ion Collider [184], sensitive to directions in the SMEFT parameter space not covered by LHC data, and at the Forward Physics Facility [185,186], where proton and nuclear structure can be probed while also obtaining information on anomalous neutrino interactions parametrised by extensions of the SMEFT.
All results presented in this paper, as well as a broader selection of results, including fits performed with various theory settings, on subsets of datasets and/or operators are available at the webpage: https://www.pbsp.org.uk/research/topproject.
Readers are encouraged to contact the authors, should they need any specific SM-PDF or SMEFT-PDF fits that can be obtained with our methodology.
To enable the LHC experimental collaborations to carry out studies similar to those presented in this paper with their own data, we plan to release the open-source SIMUnet framework, together with documentation and user-friendly examples, in a future publication, in a way that can be seamlessly integrated with the NNPDF fitting code [41]. The availability of SIMUnet as open-source code would allow the LHC collaborations to study the SMEFT-PDF interplay in their own internal analyses, and in particular to modify the input datasets and EFT operator basis to match their own settings. In the meantime, we provide in App. A usage recommendations concerning the joint interpretation of LHC measurements in terms of both PDFs and EFT coefficients, based on the findings of this work.
As the LHC approaches its high-luminosity era, it becomes more and more important to develop novel analysis frameworks that make possible the full exploitation of the information contained in the legacy LHC measurements. The results presented in this work contribute to this program by demonstrating how to achieve a consistent simultaneous determination of the PDFs and EFT parameters from the same top quark dataset, bypassing the need for the assumptions required for the SM-PDF and fixed-PDF EFT interpretations.

A Recommendations to quantify SMEFT-PDF interplay
The strategy presented here, based on carrying out a simultaneous determination of the SMEFT-PDFs and the Wilson coefficients, makes possible a quantitative assessment of the interplay between the PDF and SMEFT sensitivity when interpreting a given set of LHC measurements. As discussed in Sect. 7, the SIMUnet framework will be made public only at a later stage. In the meantime, researchers interested in quantifying this interplay in their own analysis can use available open-source platforms like xFitter, as done by the CMS collaboration in their QCD and EFT analyses of double-differential inclusive jet cross sections at 13 TeV [38]. In several cases, however, a joint analysis may not be necessary, and one can estimate the possible role of the SMEFT-PDF interplay by means of a simplified approach. The discussion is presented here for the case of top quark measurements, but it can be straightforwardly extended to any other class of measurements.
• The most conservative strategy is to introduce a hard boundary between those processes used for PDF fits and those entering SMEFT interpretations. For instance, for a SMEFT analysis of LHC top quark measurements one can use PDF fits without top data, such as the NNPDF4.0-notop variant. We recall that with the NNPDF open-source code [41] one can produce fit variants with arbitrary input datasets.
• A somewhat less conservative assumption would be to introduce a hard boundary, not at the process level, but rather at the kinematic level. In the case of inclusive top quark pair production data, one can introduce a threshold on the top quark pair invariant mass m tt such that data points below the threshold are used for the PDF fit and above it for the EFT interpretation, thus benefiting from the increased sensitivity to BSM effects in the high-energy tails of the LHC distributions. Again, using public PDF fitting codes one can produce fit variants with tailored kinematical cuts to separate the "PDF" region from the "EFT" region. One drawback of this approach is that there is never a clear-cut separation between these regions; depending on the EFT operators considered, the high-energy region may not be the dominant one.
• As demonstrated in this work, many measurements of relevance for EFT interpretations have limited sensitivity to PDFs and vice-versa. In such cases, the SMEFT-PDF interplay can be safely neglected. One can determine under which settings this condition is satisfied by adding the dataset under consideration either to a global PDF fit (using for example the NNPDF fitting code [41]) or to a global SMEFT fit (using the SMEFiT or FitMaker frameworks). If the results of either the PDF or EFT fits are unchanged, one can safely neglect the SMEFT-PDF interplay in this case.
• If instead the above analysis shows that a given dataset provides non-trivial information on both the PDFs and on the EFT Wilson coefficients, the only options for a consistent theoretical interpretation are either introducing a hard boundary in the analysis (either at the process level or at the kinematic level) or to carry out the joint SMEFT-PDF interpretation of the full dataset using SIMUnet or other available tools.
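The kinematic-boundary option described above can be implemented as a simple partition of the data by an invariant-mass threshold. A schematic sketch follows; the dataset format and the 1.5 TeV threshold are purely illustrative and would need to be tuned to the operators under study.

```python
def split_by_mtt(datapoints, mtt_threshold=1500.0):
    """Partition data points into a 'PDF region' (below the m_tt threshold,
    in GeV) and an 'EFT region' (above it)."""
    pdf_region = [p for p in datapoints if p["mtt"] < mtt_threshold]
    eft_region = [p for p in datapoints if p["mtt"] >= mtt_threshold]
    return pdf_region, eft_region

# toy bins of a differential m_tt distribution
bins = [{"mtt": 400.0}, {"mtt": 900.0}, {"mtt": 1600.0}, {"mtt": 2500.0}]
pdf_region, eft_region = split_by_mtt(bins)
print(len(pdf_region), len(eft_region))  # -> 2 2
```

The low-mass subset would then enter the PDF fit and the high-mass subset the EFT interpretation, with the caveat noted above that the separation is never clear-cut.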
Furthermore, it should be emphasised that, as opposed to the PDF case, the sensitivity of a given measurement to the EFT coefficients depends in general on the choice of EFT operators considered. Hence, the above considerations assume a specific choice of operators and could be different if this choice were varied.

B EFT operator basis

Table B.1 lists the dimension-six SMEFT operators relevant for this analysis, together with the corresponding degrees of freedom (DoF) entering the fit. In terms of the flavour assumptions, we follow the LHC top quark working group recommendations [26], which were also adopted in [42,91]. The flavour symmetry group is given by U(2) q × U(2) u × U(2) d , i.e. one singles out operators that contain top quarks (right-handed t and SU(2) doublet Q). This means that we work in a 5-flavour scheme, where the only massive fermion in the theory is the top quark. The upper part of Table B.1 defines the relevant two-fermion operators modifying the interactions of the third-generation quarks. We also indicate the notation used for the associated Wilson coefficients; those in brackets are not degrees of freedom entering the fit, and instead the two additional DoF defined in the middle table are used. The bottom table defines the four-fermion DoF entering the fit, expressed in terms of the corresponding four-fermion Wilson coefficients associated with dimension-six SMEFT operators in the Warsaw basis.
EFT cross-sections at linear and quadratic order in the EFT expansion are computed for the datasets considered in this analysis and for the DoF defined in Table B.1, both at LO and NLO in perturbative QCD. Furthermore, we adopt the m_W electroweak input scheme, in which the four EW input parameters are {m_W, G_F, m_h, m_Z}. In particular, the electric charge e becomes a dependent parameter and is shifted by the effects of higher-dimensional operators.
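The generic structure of such EFT predictions, a SM cross-section corrected by linear and quadratic terms in the Wilson coefficients, can be sketched as follows. All numerical values (cross-section and K-factors) are illustrative placeholders, not the calculations produced for this work:

```python
import numpy as np

# Schematic EFT prediction at linear and quadratic order,
#   sigma(c) = sigma_SM * (1 + sum_i c_i k_i + sum_ij c_i c_j k_ij),
# with made-up K-factors for illustration only.

def eft_prediction(sigma_sm, c, k_lin, k_quad, order="quadratic"):
    """Return the EFT cross-section for Wilson coefficients c, truncating
    the expansion at the requested order."""
    c = np.asarray(c, dtype=float)
    sigma = sigma_sm * (1.0 + k_lin @ c)          # O(Lambda^-2) terms
    if order == "quadratic":
        sigma += sigma_sm * (c @ k_quad @ c)      # O(Lambda^-4) terms
    return sigma

sigma_sm = 830.0                       # illustrative cross-section [pb]
k_lin = np.array([0.02, -0.01])        # illustrative linear K-factors
k_quad = np.array([[0.004, 0.0],       # illustrative quadratic K-factors
                   [0.0, 0.003]])
sigma_lin = eft_prediction(sigma_sm, [1.0, 2.0], k_lin, k_quad, order="linear")
```

Truncating at `order="linear"` corresponds to the O(Λ⁻²) fits used for the joint SMEFT-PDF determination in this work.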

C Benchmarking with SMEFiT
Here we benchmark the performance of SIMUnet when operating as a fixed-PDF EFT fitter, by means of a tuned comparison with the SMEFiT framework in which identical theory and experimental inputs are used in both cases. SMEFiT currently provides two strategies to determine posterior distributions in the EFT parameter space. The first, known as MCfit, is based on the Monte Carlo replica method followed by parameter optimisation (similar to the approach used in SIMUnet). The second is based on Nested Sampling (NS), a purely Bayesian approach to parameter inference. At the linear EFT level, SMEFiT results based on MCfit and NS are identical [42]. At the quadratic level [171], however, the MCfit approach is affected by the potential pitfalls described in App. E, which also prevent the use of SIMUnet for joint SMEFT-PDF fits in the presence of quadratic EFT corrections. We demonstrate now that, for a linear EFT determination and a common choice of inputs, the results obtained with the two frameworks display excellent agreement. Both fits are based on the same dataset discussed in this work as well as on the same SM and EFT theory calculations (and hence also on the same operator basis). To ensure identical inputs, a parser has been written that automatically converts data and theory files from the SIMUnet to the SMEFiT standard format. The SIMUnet results displayed here coincide with those of the fixed-PDF EFT analysis shown in Fig. 6.4. Satisfactory agreement between the two fitting frameworks is obtained, and similar benchmarks have been successfully performed for other variants of the fixed-PDF EFT fits presented in this work. This benchmark study ensures that the optimisation algorithm adopted by SIMUnet is suitable both for PDF determinations (since it is based on the NNPDF4.0 settings) and for the determination of the EFT coefficients.
The optimisation settings that provide the best performance of the SM-PDF and fixed-PDF EFT analyses are then combined for the simultaneous SMEFT-PDF extraction, where all weights of the SIMUnet network are allowed to be constrained by the data.

D Fit quality
Here we summarise the χ² values (computed using the experimental definition, Eq. (2.3)) for the key PDF and EFT analyses presented in this work. Tables D.1, D.2 and D.3 display the values of the χ² per data point for representative PDF and EFT fits, where for each dataset we also indicate the number of data points n_dat. Specifically, we consider: (i) three SM-PDF fits: NNPDF4.0-notop, NNPDF4.0, and Fit H in Table 4.1 (based on our full top quark dataset); (ii) two fixed-PDF EFT fits, one using NNPDF4.0-notop as input and the other using Fit H; (iii) the outcome of the simultaneous SMEFT-PDF determination. We show sequentially the non-top datasets, the tt inclusive and associated production datasets, and the single-top inclusive and associated production datasets. We also provide the total χ² for separate groups of processes, including the full non-top and top datasets, as well as for their total sum.
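The experimental χ² per data point used throughout these tables can be sketched as follows; the data, theory and covariance values are toy inputs standing in for the actual experimental tables, not the NNPDF implementation itself:

```python
import numpy as np

# Minimal sketch of the chi2 per data point with an experimental covariance,
#   chi2/n_dat = (d - t)^T C^{-1} (d - t) / n_dat,
# using toy (uncorrelated) uncertainties for illustration.

def chi2_per_point(data, theory, cov):
    """Experimental chi2 divided by the number of data points."""
    r = np.asarray(data, float) - np.asarray(theory, float)
    return float(r @ np.linalg.solve(cov, r)) / len(r)

data   = np.array([1.05, 0.98, 1.10])             # toy measurements
theory = np.array([1.00, 1.00, 1.00])             # toy predictions
cov    = np.diag([0.05**2, 0.04**2, 0.06**2])     # toy covariance matrix
chi2 = chi2_per_point(data, theory, cov)
```

In the realistic case the covariance matrix includes the full correlated systematic uncertainties published by the experiments.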
In Tables D.2 and D.3, entries in italics indicate datasets that do not enter the corresponding SM-PDF fit; for these datasets, we evaluate the associated χ² values a posteriori using the resulting PDFs. For instance, all top quark data are removed from the NNPDF4.0-notop fit, and for some top quark observables, such as four-heavy-quark production, the PDF dependence is neglected. Furthermore, we note that the χ² values for the NNPDF4.0-notop fit and for the fixed-PDF EFT fit based on NNPDF4.0-notop are identical in Table D.1, since the latter uses only top quark data as input. For the same reason, the entries in the columns for the SM-PDF Fit H and for the fixed-PDF EFT fit based on Fit H coincide.
Several observations can be drawn from the results presented in Tables D.1, D.2 and D.3. First of all, we note that the description of the non-top data improves in the simultaneous SMEFT-PDF analysis, with χ² = 1.137 compared to χ² = 1.146 in the analogous SM-PDF analysis (Fit H). Given that we have n_dat = 4535 non-top data points, this improvement corresponds to about 40 units in the absolute χ²; it cannot be traced back to a specific measurement or group of processes. In turn, this improvement is also reflected in the full dataset, whose χ² is lowest (1.126 for n_dat = 4710 points) in the SMEFT-PDF analysis. This lower χ², as compared to the SM-PDF baseline fits, is not surprising given that (as discussed in Sect. 5) the fit now has more degrees of freedom, and in particular the inclusion of linear EFT corrections is instrumental in improving the χ² of the top quark data as compared to the SM-PDF fits. We also find that the fit quality to the top quark datasets is similar in the joint SMEFT-PDF analysis and in the fixed-PDF EFT fits with either NNPDF4.0-notop or Fit H as input PDFs, with χ² = 0.848, 0.868 and 0.839 respectively for n_dat = 175. The small differences between the three cases (less than five units in absolute χ²) are consistent with the reported independence of the EFT interpretation of top quark data with respect to the choice of input PDF.
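The quoted absolute improvement follows directly from the per-point values, as a quick bookkeeping check shows:

```python
# Consistency check of the chi2 bookkeeping quoted in the text: the shift in
# chi2 per data point between the SM-PDF (Fit H) and SMEFT-PDF fits, times
# the number of non-top points, gives the absolute chi2 improvement.
n_dat = 4535
delta_chi2 = n_dat * (1.146 - 1.137)   # ~ 40 units in the absolute chi2
```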
Finally, we note that the only class of top quark production processes for which the χ² values differ markedly among the SM-PDF fits is inclusive top quark pair production. Indeed, the χ² values for the complete top quark dataset are 1.370, 1.139, and 1.102 in the SM-PDF fits without top data (NNPDF4.0-notop), with NNPDF4.0, and with the full top quark dataset (Fit H), respectively; this improvement arises mostly from the tt group, whose χ² values are 1.700, 1.328, and 1.271 for the same three fits. This observation is consistent with the findings of Sect. 4, namely that the dominant PDF sensitivity in Fit H arises from inclusive tt production. We also note that the χ² for the top quark data is similar between NNPDF4.0 and Fit H, again in agreement with the fact that the additional top quark measurements considered here, as compared to those in NNPDF4.0, have a consistent pull on the PDFs, in particular on the large-x gluon.

E Pitfalls of the Monte-Carlo replica method for quadratic EFT fits
This analysis was originally prepared with the intention of performing a fully simultaneous SMEFT-PDF fit using NLO QCD theory, including quadratic, O(Λ⁻⁴), contributions from the SMEFT. To this end, the capabilities of the SIMUnet framework were extended, as discussed in Sect. 3.2, and all SMEFT K-factors needed for the quadratic predictions were produced. However, whilst benchmarking our code, we noticed significant disagreement between quadratic SMEFT-only fits produced with the SIMUnet methodology and those obtained with the SMEFiT Nested Sampling option; on the other hand, we found perfect agreement between our SIMUnet quadratic SMEFT-only fits and the SMEFiT MCfit option.
This disagreement can be traced back to a deficiency of the Monte-Carlo sampling method used to propagate experimental uncertainties to the SMEFT coefficients, which currently prevents us from applying the SIMUnet framework to joint SMEFT-PDF fits with quadratic EFT calculations. In this Appendix, we describe our current understanding of these limitations in the context of a toy model and give a more realistic example from this work; further study of the topic is deferred to a future publication.

E.1 A toy model for quadratic EFT fits
In the following, we consider a toy scenario involving a single data point d and only one Wilson coefficient c (all PDF dependence is ignored). We suppose that the observed experimental data point d is a random variable drawn from a normal distribution centred on the underlying quadratic theory prediction, with experimental variance σ²:

d ∼ N(t(c), σ²) .   (E.1)

We assume that the theory prediction is quadratic in the SMEFT Wilson coefficient c, taking the form:

t(c) = t_SM + c t_lin + c² t_quad ,   (E.2)

where we set Λ = 1 TeV for convenience. Recall that t_quad > 0, since it corresponds to a squared amplitude. Given the observed data d, we would like to construct interval estimates for the parameter c (usually confidence intervals in a frequentist setting, or credible intervals in a Bayesian setting). Here, we describe the analytical construction of two interval estimates: first using the Bayesian method, and second using the Monte-Carlo replica method.
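The toy setup of Eqs. (E.1)-(E.2) can be written down directly; the numerical values of t_SM, t_lin, t_quad and σ below are illustrative choices, not taken from the fits in this work:

```python
import numpy as np

# Toy model of App. E: one data point d ~ N(t(c), sigma^2) with a theory
# prediction quadratic in the Wilson coefficient c. All parameter values
# are illustrative.

T_SM, T_LIN, T_QUAD, SIGMA = 1.0, 0.1, 0.5, 0.3   # t_quad > 0 (squared amplitude)

def theory(c):
    """Quadratic theory prediction t(c) = t_SM + c t_lin + c^2 t_quad."""
    return T_SM + c * T_LIN + c**2 * T_QUAD

T_MIN = T_SM - T_LIN**2 / (4 * T_QUAD)   # minimum attainable prediction
rng = np.random.default_rng(0)
d = rng.normal(theory(0.4), SIGMA)       # one pseudo-observation, c_true = 0.4
```

Note that, since t_quad > 0, the prediction t(c) is bounded from below by T_MIN, a fact that will be central to the discussion below.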
Bayesian method. In the Bayesian approach, c is treated as a random variable with its own distribution. By Bayes' theorem, we can write the probability distribution of c given the observed data d, up to a proportionality constant (given by 1/P(d), where P(d) is the Bayesian evidence), as:

P(c|d) ∝ P(d|c) P(c) ,   (E.3)

where P(c|d) is the posterior distribution of c given the observed data d, and P(c) is the prior distribution of c, namely our initial 'best guess' of the distribution of c before the observation of the data takes place. This prior is often taken to be uniform in SMEFT fits, and we shall assume so here. Given a value of c, the distribution P(d|c) of the data d is Gaussian, as specified in Eq. (E.1). In particular, we deduce that the posterior distribution of c obeys the proportionality relation:

P(c|d) ∝ exp( −(d − t(c))² / (2σ²) ) .   (E.4)

This posterior distribution can be used to place interval estimates on the parameter c, for example by constructing highest-density intervals. For a 100α% credible interval, we determine the constant p(α) satisfying:

∫ dc P(c|d) Θ( P(c|d) − p(α) ) = α ,   (E.5)

and take as credible interval the region in which P(c|d) ≥ p(α).

Monte-Carlo replica method. In the Monte-Carlo replica method, one generates pseudodata replicas d(i) by fluctuating the central measurement according to the experimental uncertainty, d(i) ∼ N(d, σ²), and for each replica determines the coefficient c(i) that minimises the figure of merit:

c(i) = arg min_c (d(i) − t(c))² / σ² .   (E.6)

Whenever the pseudodata can be reproduced exactly, i.e. for d(i) ≥ t_min, the minimum lies at one of the two roots of t(c) = d(i):

c(i) = ( −t_lin ± √( t_lin² − 4 t_quad (t_SM − d(i)) ) ) / (2 t_quad) ,   (E.7)

whereas for d(i) < t_min the best fit is attained at the vertex of the parabola, c(i) = −t_lin/(2 t_quad), where the theory prediction takes its minimum value:

t_min = t_SM − t_lin² / (4 t_quad) .   (E.8)

Now, given that d(i) is a random variable drawn from the normal distribution N(d, σ²), one can infer the corresponding distribution of the random variable c(i), which is a function of the pseudodata d(i). For a real random variable X with probability density P_X(x), a function f : R → R of the random variable has the distribution:

P_f(X)(y) = ∫ dx P_X(x) δ( y − f(x) ) .   (E.9)

In our case, c(i) is a multi-valued function of d(i), given the two square roots in Eq. (E.7), but the formula is easily generalised to this case.
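The Bayesian construction above can be evaluated numerically on a grid; the toy parameter values and the observed value d below are illustrative:

```python
import numpy as np

# Bayesian posterior of the toy model, P(c|d) ∝ exp(-(d - t(c))^2 / (2 sigma^2))
# with a uniform prior, together with a grid-based highest-density interval.
# All parameter values are illustrative.

T_SM, T_LIN, T_QUAD, SIGMA, D = 1.0, 0.1, 0.5, 0.3, 1.2

c_grid = np.linspace(-3.0, 3.0, 6001)
dc = c_grid[1] - c_grid[0]

t = T_SM + c_grid * T_LIN + c_grid**2 * T_QUAD
post = np.exp(-((D - t) ** 2) / (2.0 * SIGMA**2))
post /= post.sum() * dc                      # normalise on the grid

def hdi(grid, p, step, alpha=0.68):
    """Highest-density interval: lower the threshold p(alpha) until the
    enclosed probability mass reaches alpha, cf. Eq. (E.5)."""
    order = np.argsort(p)[::-1]              # grid points by decreasing density
    mass = np.cumsum(p[order]) * step
    sel = order[: np.searchsorted(mass, alpha) + 1]
    return grid[sel].min(), grid[sel].max()

lo, hi = hdi(c_grid, post, dc)
```

With a quadratic theory and d above t_min, the posterior is bimodal in c (the two roots of t(c) = d), so the highest-density region in general covers two disjoint intervals; the sketch above simply reports its extremes.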
Recalling that the pseudodata replicas are generated according to a Gaussian distribution centred on the measurement d with variance σ², we find that the probability density of the Wilson coefficient replica c(i) is given (up to a proportionality constant) by:

P_c(i)(c) ∝ ∫ dd(i) exp( −(d(i) − d)² / (2σ²) ) [ δ( c + t_lin/(2 t_quad) ) Θ( t_min − d(i) ) + Θ( d(i) − t_min ) Σ_± δ( c − c_±(d(i)) ) ] ,   (E.10)

where c_±(d(i)) denote the two solutions in Eq. (E.7). Simplifying the delta functions in the second and third terms, we find:

P_c(i)(c) ∝ δ( c + t_lin/(2 t_quad) ) ∫_{−∞}^{t_min} dd(i) exp( −(d(i) − d)² / (2σ²) ) + |2 t_quad c + t_lin| exp( −(t(c) − d)² / (2σ²) ) .   (E.11)

This result is different from the posterior distribution obtained with the Bayesian method in Eq. (E.4). Notable features of the distribution P_c(i)(c) are: (i) it has a Dirac-delta peak at c = −t_lin/(2 t_quad); (ii) elsewhere, it is given by the Bayesian posterior rescaled by a c-dependent prefactor. Therefore, the Monte-Carlo replica method will not in general reproduce the Bayesian posterior. One can note, however, that the Bayesian posterior is recovered in an appropriate limit. In particular, suppose that the quadratic EFT cross-section is subdominant compared to the linear term, t_lin ≫ t_quad; in this case t_min → −∞, so that the first term in Eq. (E.11) vanishes and the prefactor of the second term can be approximated as 2 c t_quad + t_lin ≈ t_lin. Thus the Bayesian posterior of Eq. (E.4) is indeed recovered, and we see that the two methods are formally identical for a linear EFT analysis. Furthermore, it is possible to show analytically that, for multiple SMEFT parameters and multiple correlated data points, the two distributions agree exactly if only linear theory is used.
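The collapse of replicas onto the vertex c = −t_lin/(2 t_quad), which generates the delta-function peak in Eq. (E.11), can be exhibited numerically with the same illustrative toy parameters:

```python
import numpy as np

# Monte-Carlo replica method on the toy model: fluctuate pseudodata
# d_i ~ N(d, sigma^2) and minimise (d_i - t(c))^2 replica by replica.
# Replicas with d_i below the minimum attainable theory t_min all land
# exactly on the vertex c = -t_lin/(2 t_quad), producing a finite mass
# (the delta-function peak of Eq. (E.11)). Parameter values are illustrative.

T_SM, T_LIN, T_QUAD, SIGMA, D = 1.0, 0.1, 0.5, 0.3, 1.2
T_MIN = T_SM - T_LIN**2 / (4 * T_QUAD)        # minimum of t(c), Eq. (E.8)
C_VERTEX = -T_LIN / (2 * T_QUAD)              # argmin of t(c)

rng = np.random.default_rng(42)
d_rep = rng.normal(D, SIGMA, size=100_000)    # pseudodata replicas

def best_fit(d_i):
    """Best-fit coefficient for one replica."""
    if d_i < T_MIN:                            # theory cannot reach the pseudodata
        return C_VERTEX
    root = np.sqrt(T_LIN**2 - 4 * T_QUAD * (T_SM - d_i))
    sign = 1.0 if rng.random() < 0.5 else -1.0 # pick one of the two exact roots
    return (-T_LIN + sign * root) / (2 * T_QUAD)

c_rep = np.array([best_fit(x) for x in d_rep])
peak_fraction = np.mean(c_rep == C_VERTEX)     # finite mass at the vertex
```

For these parameter choices a sizeable fraction of replicas sits exactly at the vertex, whereas the Bayesian posterior of Eq. (E.4) assigns no such point mass.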
This calculation demonstrates that, for quadratic EFT fits, the Monte-Carlo replica method will not in general reproduce the Bayesian posteriors that would be obtained from, say, a nested sampling approach; agreement holds only when the quadratic EFT corrections are sufficiently subdominant compared to the linear ones. For this reason, in this work we restrict the SMEFT-PDF fits based on SIMUnet (which relies on the Monte-Carlo replica method) to linear EFT calculations; we defer further investigation of the Monte-Carlo replica method, and of how it might be modified for use in SIMUnet, to future work.
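The linear-EFT equivalence of the two methods can also be checked numerically: for t_quad = 0 each replica has the closed-form solution c(i) = (d(i) − t_SM)/t_lin, so the replica distribution is exactly Gaussian with the same mean and width as the Bayesian posterior with a uniform prior. Toy values below are illustrative:

```python
import numpy as np

# Linear-EFT limit of the toy model: t(c) = t_SM + c t_lin, so each replica
# solves t(c) = d_i exactly and the replica distribution is Gaussian with
# mean (d - t_SM)/t_lin and width sigma/|t_lin|, i.e. the Bayesian posterior.
# Parameter values are illustrative.

T_SM, T_LIN, SIGMA, D = 1.0, 0.1, 0.3, 1.2

rng = np.random.default_rng(1)
d_rep = rng.normal(D, SIGMA, size=200_000)    # pseudodata replicas
c_rep = (d_rep - T_SM) / T_LIN                # closed-form best fit per replica

mean_mc, std_mc = c_rep.mean(), c_rep.std()
mean_bayes, std_bayes = (D - T_SM) / T_LIN, SIGMA / abs(T_LIN)
```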

E.2 Application to one-parameter fits
As demonstrated above, the Monte-Carlo replica method leads to posterior distributions differing from their Bayesian counterparts whenever quadratic EFT corrections dominate over linear ones. Here, we show the numerical impact of these differences in a model case, namely a one-parameter fit of the coefficient c_dt^8 to the m_tt distribution in inclusive top quark pair production measured by CMS. Fig. E.1 compares the experimental data from this measurement with the corresponding SM theory calculations at NNLO, using the NNPDF4.0-notop PDF set as input. We observe that the SM theory predictions overshoot the data, especially in the high-m_tt region, where energy-growing effects enhance the EFT corrections. Given that the pseudodata replicas d(i) are fluctuated around the central value d, a configuration in which the SM overshoots the data potentially enhances the contribution of the upper solution in Eq. (E.7), leading to the Dirac-delta peak in the posterior of Eq. (E.11).
For the c_dt^8 coefficient, the quadratic EFT corrections dominate over the linear ones, and hence the net effect of a non-zero coefficient is typically an upwards shift of the theory prediction. Indeed, we have verified that for this coefficient the largest negative correction one can obtain is of order ∼2%. For the last m_tt bin, the minimum of the theory cross-section t_min in Eq. (E.8) is obtained for c_dt^8 ≈ −0.2, while for the second-to-last bin t_min is instead attained at c_dt^8 ≈ −0.3. The combination of these two features (a dominant quadratic EFT term, and a SM prediction overshooting the data) suggests that the posterior of the Monte-Carlo replica method will be enhanced for c_dt^8 ∈ (−0.3, −0.2) as compared to the Bayesian posterior. In Fig. E.2 we compare the posterior distributions for a one-parameter fit of the four-fermion coefficient c_dt^8, with the sole experimental input being the CMS m_tt distribution displayed in Fig. E.1. Results are obtained with SMEFiT, comparing the outcome of Nested Sampling (in blue) with that of MCfit (in red) for linear and quadratic EFT fits. The agreement found in the linear fit is lost in its quadratic counterpart, with the main difference being a sharp peak in the region c_dt^8 ∈ (−0.3, −0.2), where the contribution from the delta function in Eq. (E.11) is most marked.
The scenario displayed in Fig. E.2 is chosen to exhibit the maximum effect, being based on a single coefficient with large quadratic EFT corrections and on a dataset where the SM overshoots the data in the m_tt region where EFT effects are largest. Within a global fit, these differences are partially washed out: indeed, the Bayesian and MCfit posterior distributions agree well for the majority of fitted coefficients in the quadratic SMEFiT analysis, as shown in [171]. Nevertheless, Fig. E.2 highlights that, at least in its current implementation, the Monte-Carlo replica method is affected by pitfalls that prevent its straightforward application to global EFT interpretations of experimental data including quadratic corrections.