1 Introduction

It is now an accepted fact that frontier high-energy physics at colliders requires percent-level accuracy both in theory and experiment [1]. On the theoretical side, the two main obstacles to achieving this are missing higher order corrections in perturbative computations [2], and uncertainties in parton distribution functions (PDFs) [3, 4]. The main aim of this paper is to show how percent-level accuracy might be achieved for PDFs.

The most recent set of PDFs determined by NNPDF, NNPDF3.1 [5], was the first to extensively include LHC data, and was able to reach 3–5% precision in the PDF uncertainties. It was based on the NNPDF3.x fitting methodology, the first to be validated by means of closure tests, thereby ensuring that this precision was matched by a comparable accuracy.

The NNPDF4.0 PDF set presented here is a major step forward in three significant aspects: (i) the systematic inclusion of an extensive set of LHC Run I data at 7 and 8 TeV and, for the first time, of LHC Run II data at \(\sqrt{s}=13\) TeV and of several new processes not considered before for PDF determinations; (ii) the deployment of state-of-the-art machine learning algorithms which result in a methodology that is considerably faster and leads to more precise PDFs; (iii) the validation of these PDF uncertainties both in the data and in the extrapolation regions using closure and future tests.

All in all, the main accomplishment of this new PDF set is to go one step further in achieving the main goal that motivated the NNPDF methodology in the first place [6], namely, to reduce sources of bias in PDF determination. The use of a wider dataset reduces sources of bias that might be related to the dominance of a particular process. The use of a machine-learned methodology reduces sources of bias related to methodological choices, which are now mostly made through an automated procedure. Finally, the extensive set of validation tools explicitly checks the absence of bias: in fact, “future tests”, to be discussed below, can expose the historical bias that was present in previous PDF determinations.

The NNPDF4.0 global analysis includes 44 new datasets in comparison with NNPDF3.1. These involve a number of new LHC measurements of processes already present in NNPDF3.1, but also data from several new processes, whose impact on PDFs has been the object of dedicated studies. Specifically, these are direct photon production (studied in Ref. [7]), single-top production (studied in Ref. [8]), dijets (studied in Ref. [9]), W+jet (studied in Ref. [10]), and deep-inelastic jet production. A significant consequence of this extension of the dataset is that now the PDFs are largely controlled by LHC data: unlike in the past, a DIS-only PDF determination leads to much larger uncertainties and visibly different results.

NNPDF4.0 is the first PDF determination based on a methodology that is selected automatically rather than through manual iterations and human experience. All aspects of the neural network PDF parametrization and optimization (such as neural net architecture, learning rates or minimization algorithm) are selected through a hyperparameter optimization procedure [11], an automated scan of the space of models that selects the optimal methodology. A quality control method is used in order to make sure that the optimization does not produce a methodology that leads to overfitted PDFs. This is done through K-folding [6], checking iteratively the effectiveness of any given methodology on sets of data excluded in turn from the fit. All this is made possible by a speedup of the NNPDF fitting code, which is now able to fit an individual replica about twenty times faster, thanks mostly to the use of stochastic gradient descent methods provided by the TensorFlow library, rather than through the genetic algorithm minimization used previously, along with various technical improvements to be discussed below [11,12,13].
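The idea of excluding sets of data in turn can be illustrated with a minimal sketch. The round-robin assignment of datasets to folds and the function name below are assumptions made purely for illustration; the procedure actually used to build and score the folds is the one described in Sect. 3.

```python
def k_fold_partitions(datasets, k):
    """Yield (training, held_out) splits, holding out each fold in turn.

    datasets: list of dataset labels (hypothetical names).
    k:        number of folds; round-robin assignment is an assumption.
    """
    if not 1 < k <= len(datasets):
        raise ValueError("k must satisfy 1 < k <= number of datasets")
    # Assign datasets to folds in round-robin order.
    folds = [datasets[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        # Everything that is not in the held-out fold is used for fitting.
        training = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield training, held_out


if __name__ == "__main__":
    for train, test in k_fold_partitions(["DIS", "DY", "jets", "top"], 2):
        print("fit on", train, "-> test on", test)
```

A candidate methodology would then be scored on each held-out fold, so that a setup which only describes the fitted data well is penalized.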

The widening of the dataset (with fixed methodology), and especially the methodological improvements (with fixed dataset) lead to a reduction of PDF uncertainties, so their combination brings us close to percent precision. This demands a careful validation of these uncertainties, which is achieved by means of two classes of tests.

The first is closure tests, already introduced in NNPDF3.0 [14], which here are considerably extended and systematized, thanks to the much greater fitting efficiency. These consist of fitting PDFs to pseudo-data generated assuming a certain underlying true PDF, and comparing the result of the fit to the known true PDF by means of suitable statistical estimators. The closure test verifies that PDF uncertainties are faithful, specifically in comparison to the data used to fit them. The second is future tests [15]: these compare predictions obtained from PDFs fitted to a subset of the data, covering a small kinematic region compared to the full dataset, with data outside that subset. For example, PDFs are fitted to a pre-HERA dataset, and the result is compared to LHC data. The future test verifies that PDF uncertainties are faithful when extrapolated outside the region of the data used to fit them.
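The logic of a closure test can be illustrated with a toy example, which is not the NNPDF implementation. Pseudo-data are generated from a known "true" value, a fit is performed, and one checks how often the truth falls inside the fitted one-sigma band. The constant model, the Gaussian noise, and all function names are assumptions chosen for simplicity.

```python
import math
import random


def fit_constant(data, sigma):
    """Least-squares fit of a constant to data with equal uncertainties.

    Returns the best-fit value and its one-sigma uncertainty.
    """
    n = len(data)
    return sum(data) / n, sigma / math.sqrt(n)


def closure_coverage(true_value, sigma, n_points, n_trials, seed=0):
    """Fraction of pseudo-data fits whose one-sigma band contains the truth."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Pseudo-data: known truth plus Gaussian noise of known size.
        pseudo = [rng.gauss(true_value, sigma) for _ in range(n_points)]
        best, err = fit_constant(pseudo, sigma)
        if abs(best - true_value) <= err:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print(closure_coverage(true_value=1.0, sigma=0.1, n_points=20, n_trials=5000))
```

For faithful Gaussian uncertainties the returned fraction should be close to 0.68; a significantly smaller value would signal underestimated uncertainties.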

As a further test of methodological reliability, we study the robustness of results upon methodological variations, and in particular we show that PDFs are stable upon changes of the parametrization basis (i.e. the particular linear combination of PDFs that is parametrized by neural nets), thereby confirming that results are parametrization-independent.

Fig. 1 The NNPDF4.0 NNLO PDFs at \(Q=3.2\) GeV (left) and \(Q=10^2\) GeV (right)

NNPDF4.0 PDFs also include a number of improvements at all stages of the PDF determination procedure. The most relevant ones are the following:

  • While the main PDF determination is performed with NNLO QCD (with further sets provided at NLO and LO), NLO electroweak (EW) and mixed QCD-EW processes are implemented for all LHC processes using recent dedicated tools [16] and assessed both for phenomenology and in the determination of the input dataset to be used for PDF fitting.

  • Whenever heavy nuclear or deuteron targets are involved, nuclear effects are accounted for as theoretical uncertainties using the methodology of Refs. [17,18,19], and the results of the nNNPDF2.0 nuclear PDF determination [20].

  • Strict positivity of \(\overline{\mathrm{MS}}\) PDFs is implemented following the results of Ref. [21].

  • Finiteness of non-singlet baryon number, i.e., integrability of all non-singlet PDF first moments, is enforced. This specifically implies finiteness of the Gottfried sum [22] \(U-D\) and of the strangeness sum \(U+D-2 S\), where U, D and S denote respectively the first moment of the sum of quark and antiquark PDFs for up, down and strange quarks.

  • The selection of a consistent dataset is based on an objective two-stage procedure. Potentially problematic datasets are identified on the basis of either poor compatibility with the global dataset, or indications of instability of their experimental covariance matrix. These datasets are then subjected in turn to a dedicated fit in which the failed dataset is given a large weight, and then accepted or rejected depending on the outcome.
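A fit in which one dataset is given a large weight can be thought of as minimizing a figure of merit in which the \(\chi^2\) contribution of that dataset is multiplied by a weight. The sketch below illustrates only this idea: the dataset labels, the numerical weight, and the function name are hypothetical, and the actual weight and acceptance criteria are those specified in Sect. 4.

```python
def weighted_chi2(chi2_by_dataset, weighted_name, weight):
    """Total figure of merit with one dataset up-weighted.

    chi2_by_dataset: mapping dataset name -> chi2 contribution.
    weighted_name:   dataset under scrutiny (hypothetical label).
    weight:          factor multiplying that dataset's chi2 (assumed > 1).
    """
    if weighted_name not in chi2_by_dataset:
        raise KeyError(weighted_name)
    total = 0.0
    for name, chi2 in chi2_by_dataset.items():
        # Only the dataset under scrutiny is rescaled.
        total += weight * chi2 if name == weighted_name else chi2
    return total


if __name__ == "__main__":
    chi2s = {"global_rest": 10.0, "suspect_set": 4.0}
    print(weighted_chi2(chi2s, "suspect_set", 5.0))
```

If the up-weighted dataset can be described well without degrading the rest of the global dataset, this points to the dataset being consistent with the others.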

The main missing features of the current PDF determination, which are left for future work, are the inclusion of theory uncertainties (specifically missing higher order corrections), which could be done using the methods of Refs. [23, 24], and the full inclusion of EW and mixed QCD-EW corrections directly at the fitting level, which will be possible using the tools of Ref. [16].

The NNPDF4.0 PDF set is released at LO, NLO and NNLO QCD, for a variety of values of \(\alpha _s\). The default PDF sets are provided in the FONLL variable-flavor number scheme [25] with maximum number of flavors \(n_f=5\), and an independently parametrized charm PDF. PDF sets with different maximum number of flavors and with a perturbatively generated charm PDF are also made available, along with PDF sets determined using reduced datasets, which may be useful for specific applications. The main sets are delivered in the following formats: a Monte Carlo representation with 1000 replicas; a Hessian set with 50 eigenvectors obtained from the Monte Carlo set via the MC2Hessian algorithm [26, 27]; and a compressed set of 100 Monte Carlo replicas, obtained from the original 1000 through the Compressor algorithm [28] as implemented in the new Python code of Ref. [29]. The final NNPDF4.0 NNLO PDFs are shown in Fig. 1 both at a low (\(Q=3.2\) GeV) and a high (\(Q=100\) GeV) scale.
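With a Monte Carlo representation, the central value and uncertainty of any PDF-dependent quantity are commonly estimated as the mean and standard deviation over replicas. The following minimal sketch assumes that the quantity has already been evaluated on each replica; the function name and the input list are illustrative and are not part of any released interface.

```python
import math


def replica_mean_and_std(replica_values):
    """Central value and one-sigma uncertainty from Monte Carlo replicas."""
    n = len(replica_values)
    if n < 2:
        raise ValueError("need at least two replicas")
    mean = sum(replica_values) / n
    # Unbiased sample variance over the replica ensemble.
    var = sum((v - mean) ** 2 for v in replica_values) / (n - 1)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    # Hypothetical values of an observable on three replicas.
    print(replica_mean_and_std([1.0, 2.0, 3.0]))
```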

More importantly, the full NNPDF software framework is released as an open source package [30]. This includes the full dataset; the methodology hyperoptimization; the PDF parametrization and optimization; the computation of physical processes; the set of validation tools; and the suite of visualization tools. The code and the corresponding documentation are discussed in a companion paper [31].

The structure of this paper is the following. First, in Sect. 2 we present the input experimental data and the associated theoretical calculations that will be used in our analysis, with emphasis on the new datasets added in comparison to NNPDF3.1. Then in Sect. 3 we discuss the fitting methodology, in particular the parametrization of PDFs in terms of neural networks, their training, and the algorithmic determination of their hyperparameters. The procedure adopted to select the NNPDF4.0 baseline dataset is described in Sect. 4. The main result of this work, the NNPDF4.0 determination of parton distributions, is presented in Sect. 5, where we also compare with previous NNPDF releases and with other PDF sets. The closure test and future test used to validate the methodology are described in Sect. 6.

Subsequently, we assess the dependence of our PDFs on the dataset in Sect. 7, where we study the impact of new data in comparison with NNPDF3.1, and verify the impact of individual processes by studying PDF determinations in which data corresponding to individual classes of processes are removed in turn. Also, we present PDFs determined by adding specific datasets, such as the EMC charm structure function, the NOMAD neutrino dimuon structure functions, and the HERA DIS jet data. Then in Sect. 8 we assess the dependence of PDFs on the methodology and verify the robustness of our results, by comparing with PDFs obtained using the previous NNPDF3.1 methodology and by studying the impact of new positivity and integrability constraints, checking the independence of results from the choice of PDF parametrization, discussing the impact of independently parametrizing the charm PDF, and studying the role of nuclear corrections. We finally present a first assessment of the implications of NNPDF4.0 for LHC phenomenology in Sect. 9, by computing PDF luminosities, fiducial cross-sections, and differential distributions for representative processes. In Sect. 10 we list the NNPDF4.0 grid files that are made available through the LHAPDF interface [32] and provide a summary and outlook.

A brief overview of the NNPDF fitting code is presented in App. A, while a more extensive description is provided by the companion publication [31]. In App. B we compare the NNPDF4.0 dataset to that adopted in other PDF determinations.

2 Experimental and theoretical input

We present the NNPDF4.0 dataset in detail. After a general overview, we examine each of the processes for which new measurements are considered in NNPDF4.0, we present the details of the measurements, and, for each dataset, we describe how the corresponding theoretical predictions are obtained. In NNPDF4.0, theoretical predictions for data taken on nuclear targets are supplemented by nuclear corrections, which are specifically discussed in a dedicated section. Experimental statistical and systematic uncertainties are treated as in previous NNPDF determinations: see in particular Sect. 2.4.2 of Ref. [14] for a detailed discussion.

The global dataset presented in this section is the basis for the final NNPDF4.0 dataset, which will be selected from it by applying criteria based on testing for dataset consistency and compatibility, and for perturbative stability upon the inclusion of electroweak corrections. The selection of the final dataset will be discussed in Sect. 4 below.

2.1 Overview of the NNPDF4.0 dataset

The NNPDF4.0 dataset includes essentially all the data already included in NNPDF3.1, the only exceptions being a few datasets that are replaced by a more recent final version, and single-inclusive jet datasets which are now partly replaced by dijet data, as we discuss below. All the new datasets that were not included in NNPDF3.1 are more extensively discussed in Sect. 2.2. For all those already included in NNPDF3.1 we refer to Sect. 2 of Ref. [5] for a detailed discussion. Nevertheless we give a summary below.

The NNPDF3.1 dataset included data for lepton-nucleon, neutrino-nucleus, proton-nucleus and proton-(anti)proton scattering processes. The bulk of it consisted of deep inelastic scattering (DIS) measurements: these included fixed-target neutral current (NC) structure function data from NMC [33, 34], SLAC [35] and BCDMS [36], fixed-target inclusive and dimuon charged current (CC) cross-section data from CHORUS [37] and NuTeV [38, 39], and collider NC and CC cross-section data from the HERA legacy combination [40]. The combined H1 and ZEUS measurement of the charm cross-section [41] and the separate H1 [42] and ZEUS [43] measurements of the bottom cross-section were also included, both to be replaced by more recent data as we discuss below. The charm structure function measured by the EMC experiment [44] was also studied in a variant fit, in which its constraining power on the intrinsic component of the charm PDF was explicitly assessed, and the same will be done here.

In addition to the DIS measurements, the NNPDF3.1 dataset included fixed-target DY data from the Fermilab E605 [45] and E866 [46, 47] experiments, inclusive gauge boson production [48,49,50,51] and single-inclusive jet production [52] cross-section data from the Tevatron. A sizable amount of LHC data were also included, specifically: inclusive gauge boson production data from ATLAS [53,54,55,56], CMS [57,58,59,60] and LHCb [61,62,63,64]; Z-boson transverse momentum production data from ATLAS [65] and CMS [66]; and top pair production total and differential cross-section data from ATLAS [67,68,69] and CMS [70,71,72]. Single-inclusive jet production data from ATLAS [73,74,75] and CMS [76, 77] were also included. These will be partly replaced by dijet data as we discuss below. For the determination of NLO PDFs, W production measurements in association with a charm jet from CMS [78] were also included. Most of these LHC measurements were performed at \(\sqrt{s}=7\) TeV [53,54,55,56,57,58,59, 61,62,63, 67, 70, 73, 75, 76, 78]; two single-inclusive jet measurements were performed at \(\sqrt{s}=2.76\) TeV [74, 77]; two gauge boson production measurements [60, 64], the Z-boson transverse momentum measurements [65, 66] and some top pair production measurements [67, 69, 70, 72] were performed at \(\sqrt{s}=8\) TeV; and two top pair total cross-section measurements [68, 71] were performed at \(\sqrt{s}=13\) TeV.

The NNPDF4.0 dataset builds upon NNPDF3.1 by adding various new datasets to it: on the one hand, a variety of new LHC measurements for processes already present in NNPDF3.1; on the other hand, data corresponding to new processes. New datasets for existing LHC processes are added for electroweak boson production, both inclusive and in association with charm, single-inclusive jet production, and top pair production. The new processes are gauge boson with jets, single top production, inclusive isolated photon production, and dijet production.

For inclusive electroweak boson production we consider: at \(\sqrt{s}=7\) TeV, the ATLAS W and Z distributions [54] in the central and forward rapidity regions (only the subset corresponding to the central region was included in NNPDF3.1); at \(\sqrt{s}=8\) TeV, the ATLAS Z double- and triple-differential distributions [79, 80], the ATLAS W differential distribution [81] and the LHCb W differential distribution [82]; at \(\sqrt{s}=13\) TeV, the ATLAS W and Z total cross-section [83] and the LHCb Z differential distributions [84]. For electroweak gauge boson production with charm, we consider the ATLAS [85] and CMS [86] differential distributions at \(\sqrt{s}=7\) TeV and \(\sqrt{s}=8\) TeV, respectively. Given that the corresponding NNLO QCD corrections are not available in a format suitable for inclusion in a fit [87], these two datasets are included only in the determination of NLO PDFs.

For single-inclusive jet production we consider the ATLAS [88] and CMS [89] double differential cross-sections at \(\sqrt{s}=8\) TeV. For top pair production we consider: at \(\sqrt{s}=5.02\) TeV, the CMS total cross-section [90]; at \(\sqrt{s}=8\) TeV, the ATLAS differential distributions [91] and the CMS double differential distributions [92], both of which are measured in the dilepton final state; at \(\sqrt{s}=13\) TeV, the CMS differential distributions measured in the lepton+jets [93] and in the dilepton [94] final states. For W-boson production with jets we consider the ATLAS differential distributions at \(\sqrt{s}=8\) TeV [95]. For single top production, we consider only measurements in the t-channel, specifically: at \(\sqrt{s}=7\) TeV, the ATLAS top to antitop total cross-section ratio, with the corresponding differential distributions [96] and the CMS combined top and antitop total cross-sections [97]; at \(\sqrt{s}=8\) TeV, the ATLAS [98] and CMS [99] top to antitop total cross-section ratios and the ATLAS differential distributions [98]; at \(\sqrt{s}=13\) TeV the ATLAS [100] and CMS [101] top to antitop cross-section ratios. For inclusive isolated photon production we consider the ATLAS differential cross-sections at \(\sqrt{s}=8\) TeV [102] and at \(\sqrt{s}=13\) TeV [103]. For dijet production we consider, at \(\sqrt{s}=7\) TeV, the ATLAS [88] and CMS [76] double differential distributions and, at \(\sqrt{s}=8\) TeV, the CMS triple differential distributions [89].

Additional LHC measurements at \(\sqrt{s}=13\) TeV for processes relevant to PDF determination are in principle available: specifically, the ATLAS [104] and CMS [105] Z transverse momentum distributions; the CMS W+jets distributions [106]; the ATLAS [107] and CMS [108] single-inclusive jet distributions; and the ATLAS [109] and LHCb [110] top pair distributions. We do not include these measurements because either they are first analyses based on a still reduced luminosity sample, or because they do not come with complete information on experimental uncertainties, or because NNLO QCD corrections are not yet available.

The non-LHC dataset is also expanded in NNPDF4.0. For DIS, we now also consider the dimuon to inclusive cross-section ratio measured by the NOMAD experiment [111], though only in a variant determination, see Sect. 7.3.4. We also consider a selection of differential cross-sections for single-inclusive and dijet production in DIS measured by ZEUS [112,113,114] and H1-HeraII [115, 116], again only in a variant determination that will be discussed in Sect. 7.3.5. For fixed-target DY, we include the recent measurement for the proton-deuteron to proton-proton differential cross-section ratio performed by the E906/SeaQuest experiment [117].

The theoretical treatment of the data already included in NNPDF3.1 is the same in all respects as in that analysis, to which we refer for details. The general NNPDF3.1 settings will in fact be adopted throughout, with specific aspects relevant for the new data to be discussed in Sect. 2.2 below. Fast interpolation grids, accurate to NLO in perturbative QCD, are produced in the APFELgrid format [118]; APFEL [119] and various fixed-order Monte Carlo event generators [120,121,122,123,124,125,126] (possibly interfaced to APPLgrid [127] or FastNLO [128,129,130] with MCgrid [131, 132] or aMCfast [133]) are utilized for the computation of DIS and non-DIS observables, respectively. The charm PDF is parametrized by default and the FONLL general-mass variable flavor number scheme [25, 134, 135] is utilized to compute DIS structure functions.

Except for DIS and for DIS jets, for which we also make use of NNLO fast interpolation grids, NNLO QCD corrections to matrix elements are implemented by multiplying the NLO predictions by a K-factor. This is defined as the bin-by-bin ratio of the NNLO to NLO prediction computed with a pre-defined NNLO PDF set (see Sect. 2.3 in [14] for details). For all of the fixed-target DY data and for all of the new LHC datasets considered in NNPDF4.0, this PDF set is NNPDF3.1_nnlo_as_0118 [5]; for the Tevatron and LHC datasets already included in NNPDF3.1, we used the same PDF sets specified in Sect. 2.1 of [5]. For these datasets the PDF dependence of the K-factors is generally smaller than all the other relevant uncertainties, as explicitly shown in [5]. We have verified this by recomputing the K-factors for all of the inclusive gauge boson production measurements, for both fixed-target and collider experiments, and for all of the top-quark pair production measurements with the baseline NNPDF4.0 set, and then repeating the NNLO PDF determination. The ensuing PDFs turn out to be statistically equivalent to the NNPDF4.0 baseline. The values of all physical parameters are the same as in NNPDF3.1.
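The K-factor prescription described above amounts to a simple bin-by-bin operation, sketched here under the assumption that the NLO and NNLO predictions are available as plain lists with identical binning. The function names and the numbers in the usage example are illustrative only.

```python
def k_factors(nnlo, nlo):
    """Bin-by-bin ratio of NNLO to NLO predictions.

    Both inputs are assumed to be computed with the same pre-defined
    NNLO PDF set and to share the same binning.
    """
    if len(nnlo) != len(nlo):
        raise ValueError("NNLO and NLO predictions must have the same binning")
    return [n2 / n1 for n2, n1 in zip(nnlo, nlo)]


def apply_k_factors(nlo_prediction, kfac):
    """Approximate NNLO prediction: NLO prediction times the K-factor."""
    if len(nlo_prediction) != len(kfac):
        raise ValueError("prediction and K-factors must have the same binning")
    return [p * k for p, k in zip(nlo_prediction, kfac)]


if __name__ == "__main__":
    # Hypothetical cross-sections in two bins (arbitrary units).
    k = k_factors(nnlo=[11.0, 21.0], nlo=[10.0, 20.0])
    # During a fit the NLO prediction changes with the PDFs; K stays fixed.
    print(apply_k_factors([9.5, 19.0], k))
```

Because the K-factors are computed once with a fixed PDF set, the approximation is adequate only insofar as their PDF dependence is small, which is the point checked in the paragraph above.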

The NNPDF4.0 dataset is thus a superset of NNPDF3.1 with the following exceptions. First, in the NNPDF4.0 baseline the single-inclusive jet data are replaced by their dijet counterparts (though the single-inclusive jet data will be considered in a variant NNPDF4.0 determination, see Sect. 7.3.3 below). Furthermore, a number of small alterations are made to the original set of NNPDF3.1 data, or to their theoretical treatment, as we now discuss.

In terms of data, the total cross-section results from Ref. [68] are no longer used, as they are replaced by the more recent measurement [136] based on the full Run II luminosity, to be discussed in Sect. 2.2.6 below. For the differential distributions measured by ATLAS at \(\sqrt{s}=8\) TeV in the lepton+jets final state [69] only one distribution out of the four available was included in NNPDF3.1 while all of them are included in NNPDF4.0, because the correlations between distributions have become available meanwhile. The single-inclusive jet measurements from ATLAS [74] and CMS [77] at \(\sqrt{s}=2.76\) TeV and from ATLAS [73] at \(\sqrt{s}=7\) TeV are no longer included in NNPDF4.0 because NNLO QCD corrections, which are provided with the optimal scale choice of Ref. [137], are not available for these measurements. For the same reason the CDF single-inclusive jet data [52] are also not included. These datasets were already removed in intermediate updates of the NNPDF3.1 determination [8, 10] or in subsequent studies [19, 23, 24, 138].

In terms of theoretical treatment the changes are the following. For DIS we correct a bug in the APFEL computation of the NLO CC structure functions, which mostly affects the large-x region; and we re-analyze the NuTeV dimuon cross-section data by including the NNLO charm-quark massive corrections [139, 140], as explained in [10], and by updating the value of the branching ratio of charmed hadrons into muons to the PDG value [141], as explained in [18]. For fixed-target DY, we include the NNLO QCD corrections for the E866 measurement [47] of the proton-deuteron to proton-proton cross-section ratio: these corrections had been inadvertently overlooked in NNPDF3.1. For gauge boson production at the Tevatron, we correct a small bug affecting the CDF Z rapidity distribution [48], whereby the last two bins had not been merged consistently with the updated measurement. For jets, we update the theoretical treatment of the single-inclusive jet measurements at \(\sqrt{s}=7\) TeV [75, 76], in that NLO and NNLO theoretical predictions are now computed with factorization and renormalization scales equal to the optimal scale choice advocated in Ref. [137], namely, the scalar sum of the transverse momenta of all partons in the event, see Ref. [9].

Table 1 The DIS datasets analyzed in the NNPDF4.0 PDF determination. For each of them we indicate the name of the dataset used throughout this paper, the corresponding reference, the number of data points in the NLO/NNLO fits before (and after) kinematic cuts (see Sect. 4), the kinematic coverage in the relevant variables after cuts, and the codes used to compute the corresponding predictions. Datasets not previously considered in NNPDF3.1 are indicated with an asterisk. Datasets not included in the baseline determination are indicated in square brackets. The Q coverage indicated for NOMAD is to be interpreted as an integration range (see text)
Table 2 Same as Table 1 for DIS jet data
Table 3 Same as Table 1 for fixed-target DY data
Table 4 Same as Table 1 for collider (Tevatron, top, and LHC, bottom) inclusive gauge boson production data
Table 5 Same as Table 1 for other LHC processes. From top to bottom we list: W-boson production in association with a jet of charm or of light quarks; Z-boson transverse momentum production; total and differential top pair production; single-inclusive and dijet production; inclusive isolated photon production; and single top t-channel total and differential production

To assess the impact of these changes in dataset and theoretical treatment, we will consider a variant of NNPDF3.1 in which all of these changes, but not the replacement of single-inclusive jets with dijets, are taken into account. This determination will be referred to as NNPDF3.1-like henceforth. It will be used to carry out various methodological tests in Sects. 3 and 6. The NNPDF3.1-like determination contains 4092 data points for a NNLO fit.

The data included in NNPDF4.0 are summarized in Tables 1, 2, 3, 4 and 5, respectively for DIS, DIS jets, fixed-target DY, collider inclusive gauge boson production and other LHC processes. For each process we indicate the name of the dataset used throughout this paper, the corresponding reference, the number of data points in the NLO/NNLO fits before (and after) kinematic cuts (see Sect. 4), the kinematic coverage in the relevant variables after cuts, and the codes used to compute the corresponding predictions. Datasets not previously considered in NNPDF3.1 are indicated with an asterisk. Datasets not included in the baseline determination are indicated in brackets.

The total number of data points included in the default PDF determination is 4426 at NLO and 4618 at NNLO, to be compared to 4295 at NLO and 4285 at NNLO in NNPDF3.1 and to 4092 (at NNLO) in NNPDF3.1-like fits presented here. A comparison between the datasets considered in NNPDF4.0 and the datasets included in NNPDF3.1 and in other recent PDF determinations, namely ABMP16 [142], CT18 [143] and MSHT20 [144], is presented in App. B, see Tables 33, 34, 35, 36, 37 and 38.

The kinematic coverage in the \((x,Q^2)\) plane of the NNPDF4.0 dataset entering the default NNLO fit is displayed in Fig. 2. For hadronic data, kinematic variables are determined using LO kinematics. Whenever an observable is integrated over rapidity, the center of the integration range is used to compute the values of x. The data points corresponding to datasets that are new in NNPDF4.0 are indicated with a black edge.
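As an illustration of what LO kinematics means in practice, the sketch below assumes the standard leading-order relation \(x_{1,2}=(M/\sqrt{s})\,e^{\pm y}\) for a final state of invariant mass M and rapidity y, as appropriate for Drell-Yan-like processes. The precise mapping adopted for each process in the figure is not spelled out in this section, so the function names and the example values are assumptions.

```python
import math


def lo_momentum_fractions(mass, rapidity, sqrt_s):
    """LO parton momentum fractions (x1, x2) for a Drell-Yan-like final state.

    mass and sqrt_s must be given in the same units (e.g. GeV).
    """
    tau = mass / sqrt_s
    return tau * math.exp(rapidity), tau * math.exp(-rapidity)


def lo_fractions_integrated(mass, y_min, y_max, sqrt_s):
    """Same, for an observable integrated over rapidity.

    The center of the integration range is used, as stated in the text.
    """
    y_center = 0.5 * (y_min + y_max)
    return lo_momentum_fractions(mass, y_center, sqrt_s)


if __name__ == "__main__":
    # Hypothetical example: a 91.2 GeV lepton pair at y = 2 and 13 TeV.
    print(lo_momentum_fractions(91.2, 2.0, 13000.0))
```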

The complete information on experimental uncertainties, including the breakdown into different sources of systematic uncertainties and their correlations, is taken into account whenever available from the corresponding publications or from the HEPData repository [150]. No decorrelation models are used, except when explicitly recommended by the collaboration. This is the case of the single-inclusive jet cross-section measurement performed by ATLAS at \(\sqrt{s}=8\) TeV [88]. Decorrelation models [9, 151,152,153,154] were studied for the ATLAS jet measurements at \(\sqrt{s}=7\) TeV [75] and for the ATLAS top pair measurements at \(\sqrt{s}=8\) TeV [69]. However these are not considered in our default determination, but only in variant fits, see Sect. 8.7.

2.2 New data in NNPDF4.0

We now discuss in detail the new datasets considered in NNPDF4.0. These are indicated with an asterisk in Tables 1, 2, 3, 4 and 5. The data are presented by process, with the processes already considered in NNPDF3.1 addressed first.

Fig. 2 The kinematic coverage of the NNPDF4.0 dataset in the \((x,Q^2)\) plane

2.2.1 Deep-inelastic scattering

We include the combined H1 and ZEUS measurements of reduced electron-proton NC DIS cross-sections for the production of open charm and bottom quarks [145]. These measurements extend the previous combination of open charm production cross-sections [41] and supersede the separate H1 [42] and ZEUS [43] datasets for the structure function \(F_2^b\) that were included in NNPDF3.1. As for the other DIS measurements included in the NNPDF4.0 dataset, they are analyzed in the FONLL scheme [25, 134, 135] within fixed order perturbative accuracy (i.e. not including resummation).

We also consider the measurements of the ratio \(\mathcal {R}_{\mu \mu }\) of dimuon to inclusive neutrino-nucleus CC DIS cross-sections performed by the NOMAD experiment [111]. These measurements are presented alternatively as a function of the neutrino beam energy \(E_\nu \), of the momentum fraction x, or of the final state invariant mass W. Because experimental correlations are not provided among the three distributions, they cannot be included in the fit at the same time. We therefore select only one of them, namely the measurement as a function of the neutrino beam energy, the only variable among the three that is directly measured by the experiment. This choice is based on the previous study [10], carried out in the context of a variant of the NNPDF3.1 determination, in which it was shown that the three distributions have a similar impact in the fit.

The treatment of this dataset in NNPDF4.0 closely follows Ref. [10]. Specifically we incorporate the recently computed NNLO charm-quark massive corrections [139, 140] by means of a K-factor (see Sect. 2.2.2 in [10]). The NOMAD data are not included in our default determination; however, we assess their impact on the NNLO PDFs by means of Bayesian reweighting [155, 156]. This choice is dictated by the fact that the observable is integrated over Q and x (see e.g. Eq. (2.1) in Ref. [10]), which complicates the generation of fast interpolation tables in the APFELgrid format.
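Bayesian reweighting of a Monte Carlo replica set can be sketched as follows. The weight formula \(w_k \propto (\chi^2_k)^{(n-1)/2} e^{-\chi^2_k/2}\), with n the number of new data points, and the Shannon-entropy estimate of the effective number of replicas are assumed here to be those of the reweighting method cited above; this should be checked against Refs. [155, 156]. Function names and input values are illustrative.

```python
import math


def reweight(chi2s, n_data):
    """Replica weights proportional to chi2^((n-1)/2) * exp(-chi2/2).

    chi2s:  chi2 of each replica to the new data (assumed > 0).
    n_data: number of new data points.
    Weights are normalized so that they sum to the number of replicas.
    """
    # Work with logarithms to avoid overflow or underflow for large chi2.
    logs = [0.5 * (n_data - 1) * math.log(c) - 0.5 * c for c in chi2s]
    shift = max(logs)
    raw = [math.exp(value - shift) for value in logs]
    norm = len(chi2s) / sum(raw)
    return [r * norm for r in raw]


def effective_replicas(weights):
    """Shannon-entropy estimate of the number of effective replicas."""
    n = len(weights)
    entropy = sum(w * math.log(n / w) for w in weights if w > 0.0)
    return math.exp(entropy / n)


if __name__ == "__main__":
    # Hypothetical chi2 values of four replicas to ten new data points.
    w = reweight([8.0, 10.0, 15.0, 40.0], n_data=10)
    print(w, effective_replicas(w))
```

Reweighted averages are then computed as weighted means over replicas, and a small effective number of replicas indicates that the new data are very constraining, or poorly compatible with the prior set, so that a full fit would be preferable.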

2.2.2 Jet production in deep-inelastic scattering

We consider a selection of DIS single-inclusive jet (1j) and dijet production (2j) cross-sections measured by ZEUS [112,113,114] in the high-Q (HQ) region and by H1-HeraII [115, 116] in the HQ and low-Q (LQ) regions. Specifically we consider cross-sections double differential in \(Q^2\) and in the transverse momentum of the jet or of the jet pair, listed in Table 2. Experimental correlations between single-inclusive jet and dijet measurements, which are available only for H1, are taken into account. These allow us to include single-inclusive jet and dijet datasets simultaneously. Additional available measurements, in particular from H1-HeraI [157, 158], are left for future studies. Likewise, variants of the H1-HeraII measurements [115, 116], in which cross-sections are normalized to the inclusive NC cross-section integrated over the width of each \(Q^2\) bin, are not yet considered. These normalized cross-sections might benefit from cancellations of systematic uncertainties and uncertainty correlation with HERA inclusive DIS measurements.

Theoretical predictions for the ZEUS and H1-HeraII datasets are obtained using fast interpolation grids precomputed with NNLOjet. These incorporate the recently determined NNLO QCD corrections [159]. Multiplicative hadronization correction factors, as provided in the experimental analyses, are included throughout. Because this theoretical input has become available only very recently, the ZEUS and H1-HeraII datasets are not included in our default determination, but only in a variant NNLO set by means of Bayesian reweighting, see Sect. 7.3.5.

2.2.3 Fixed-target Drell–Yan production

We consider the new measurement recently performed by the SeaQuest experiment at Fermilab [117] for Drell–Yan production of muon pairs. Like the previous NuSea measurement [47], which was included in the NNPDF3.1 dataset, the SeaQuest experiment measures the ratio of the scattering cross-section of a proton beam off a deuterium target to the cross-section off a proton target. The measurement is double differential in the partonic momentum fractions of the struck partons. The SeaQuest data extend the NuSea data to larger values of x, \(0.15\lesssim x \lesssim 0.40\), with the aim of constraining the antiquark asymmetry in this region [47]. Theoretical predictions are computed by taking into account acceptance corrections, according to Eq. (10) in Ref. [117]. Fast interpolation tables accurate to NLO are generated with APFEL; these are then supplemented with an NNLO K-factor computed with a version of Vrap [160] that we modified to account for the isoscalarity of the deuteron target. Nuclear effects are taken into account by means of the procedure discussed in Ref. [19] and further summarized in Sect. 2.3.

2.2.4 Inclusive collider electroweak gauge boson production

The new datasets we consider for inclusive W and Z boson production and decay are from the ATLAS and LHCb experiments.

We include the ATLAS measurements of the W and Z differential cross-section at \(\sqrt{s}=7\) TeV [54] in the central and forward rapidity regions. As mentioned above, these data were already included in NNPDF3.1, but only the subset corresponding to the central region. The measurements cover, respectively, the pseudo-rapidity range \(|\eta _\ell |<2.5\) (for W bosons) and the rapidity range of the lepton pair \(|y_{\ell \ell }|<3.6\) (for the Z boson). In the latter case, the invariant mass of the lepton pair is \(46\le m_{\ell \ell }\le 150\) GeV. The measurements correspond to an integrated luminosity of 4.6 \(\hbox {fb}^{-1}\). We consider the combination of measurements in the electron and muon decays.

We consider the ATLAS measurements of the double and triple differential DY lepton pair production cross-section at \(\sqrt{s}=8\) TeV [79, 80]. The differential variables are the invariant mass and rapidity of the lepton pair, \(m_{\ell \ell }\) and \(y_{\ell \ell }\), and, in addition to these for the latter case, the cosine of the Collins-Soper angle \(\cos \theta ^*\). The measurements cover two separate invariant mass ranges, respectively \(116\le m_{\ell \ell }\le 1500\) GeV and \(46\le m_{\ell \ell }\le 200\) GeV, in the same central rapidity range \(|y_{\ell \ell }|<2.4\). The same data sample corresponding to an integrated luminosity of 20.2 \(\hbox {fb}^{-1}\) is used in the two cases, which therefore overlap in the interval \(116\le m_{\ell \ell }\le 200\) GeV. The two analyses are consistent in this region; however, because the one in [79] is optimized for high invariant masses, we remove the overlapping bins from the dataset in [80]. In both cases we consider the measurements in which the electron and muon decay channels have been combined; for the triple differential distribution, we consider the measurement integrated over \(\cos \theta ^*\) in order to reduce sensitivity to the value of the Weinberg angle \(\sin ^2\theta _W\).

We include the ATLAS measurement of the W production cross-section and decay at \(\sqrt{s}=8\) TeV [81]. The data are differential in the pseudo-rapidity of the decay muon \(\eta _\mu \), which is accessed in the central pseudo-rapidity range \(|\eta _\mu |<2.4\) by analyzing a data sample corresponding to an integrated luminosity of 20.2 \(\hbox {fb}^{-1}\). As for the companion ATLAS measurement at \(\sqrt{s}=7\) TeV [54], we consider the separate \(W^+\) and \(W^-\) differential distributions rather than their asymmetry.

We consider the ATLAS measurement of the total W and Z cross-section and decay into leptons at \(\sqrt{s}=13\) TeV [83]. The measurement corresponds to an integrated luminosity of 81 \(\hbox {pb}^{-1}\).

We include the LHCb measurement of the W cross-section at \(\sqrt{s}=8\) TeV [82]. The data are differential in the pseudo-rapidity of the decay electron \(\eta _e\), which is accessed in the forward range \(2.00<|\eta _e|<4.25\). The data sample corresponds to an integrated luminosity of 2 \(\hbox {fb}^{-1}\). In this case, we cannot consider the separate \(W^+\) and \(W^-\) differential distributions, because we find that the correlated experimental uncertainties lead to a covariance matrix that is not positive definite. Therefore, in this case we make use of the asymmetry measurement, which is not affected by this problem since most of the correlations cancel out.
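The positive-definiteness requirement mentioned above can be checked numerically. The following is a minimal illustrative sketch, not the procedure actually used in the analysis: it attempts a Cholesky factorization of a symmetric matrix, which succeeds only if the matrix is positive definite. The function name and the two toy matrices are assumptions introduced purely for illustration.

```python
import math


def is_positive_definite(matrix):
    """Return True if a symmetric matrix admits a Cholesky factorization.

    The matrix is assumed to be symmetric; only its lower triangle is read.
    """
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    # A non-positive pivot signals a non-positive eigenvalue.
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


# Toy 2x2 examples (hypothetical numbers, not experimental covariances).
well_behaved = [[2.0, 1.0], [1.0, 2.0]]      # eigenvalues 1 and 3
ill_behaved = [[1.0, 2.0], [2.0, 1.0]]       # eigenvalues -1 and 3

print(is_positive_definite(well_behaved))
print(is_positive_definite(ill_behaved))
```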

Finally, we include the LHCb measurement of the Z cross-section at \(\sqrt{s}=13\) TeV [84]. The data are differential in the Z boson rapidity \(y_Z\) [84], with \(2.00<|y_Z|<4.50\), and cover the Z-peak lepton pair invariant mass range \(60\le m_{\ell \ell }\le 120\) GeV. The data sample corresponds to an integrated luminosity of 294 \(\hbox {pb}^{-1}\). We include separately the datasets in the dimuon and dielectron decay channels.

These datasets, specifically from ATLAS, are particularly precise, with systematic uncertainties of the order of a percent or less and even smaller statistical uncertainties. They are dominated by the luminosity uncertainty, which is of the order of 1.9-2.1% (1.2-3.9%) for ATLAS (LHCb) at \(\sqrt{s}=8\) TeV and at \(\sqrt{s}=13\) TeV, respectively.

Theoretical predictions are computed at NLO with MCFM (v6.8) [120,121,122] and are benchmarked against those obtained with mg5_aMC (v3.1) [124, 125]. The NNLO K-factor is computed with FEWZ (v3.1) [161,162,163] for all the datasets except those of [80, 81], for which DYNNLO [164, 165] is used instead. We benchmarked these calculations against MCFM (v9.0) [166], and found the relative difference between different computations to be negligible in comparison to the data uncertainties. The renormalization and factorization scales are set equal to the mass of the gauge boson, for total cross-sections and for cross-sections differential in rapidity or pseudorapidity variables, or to the central value of the corresponding invariant mass bin, for cross-sections that are also differential in the invariant mass of the lepton pair.
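The K-factor approach used here and for several other processes can be illustrated schematically. The sketch below assumes that the K-factor is defined bin by bin as the ratio of the NNLO to the NLO prediction, both computed with one fixed reference PDF set, and that it multiplies the NLO prediction obtained from a fast interpolation table. The function names and all numerical values are hypothetical.

```python
def k_factors(nnlo_reference, nlo_reference):
    """Bin-by-bin ratio of NNLO to NLO predictions for a fixed reference PDF."""
    return [nnlo / nlo for nnlo, nlo in zip(nnlo_reference, nlo_reference)]


def apply_k_factors(nlo_from_table, kfac):
    """Rescale NLO predictions (e.g. from an interpolation table) to approximate NNLO."""
    return [pred * k for pred, k in zip(nlo_from_table, kfac)]


# Hypothetical cross-sections in three bins (arbitrary units).
nlo_reference = [100.0, 80.0, 50.0]
nnlo_reference = [103.0, 81.6, 50.5]
kfac = k_factors(nnlo_reference, nlo_reference)

# NLO predictions for a different PDF, as they would be read off a fast table.
nlo_from_table = [98.0, 79.0, 51.0]
nnlo_approx = apply_k_factors(nlo_from_table, kfac)
print(kfac, nnlo_approx)
```

The approximation rests on the K-factor depending only weakly on the PDF set used to compute it.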

2.2.5 Gauge boson production with additional jets

On top of inclusive gauge boson production, we consider more exclusive measurements in which a W boson is produced in association with \(N_\mathrm{jets}\) jets of light quarks, or with a single jet of charm quarks.

Specifically, we include the ATLAS data for W production with \(N_\mathrm{jets}\ge 1\) [95] at \(\sqrt{s}=8\) TeV. The measurement corresponds to an integrated luminosity of 20.2 \(\hbox {fb}^{-1}\). We select the distribution differential in the transverse momentum of the W boson, \(p_T^W\), which covers the range \(0\le p_T^W\le 800\) GeV. Theoretical predictions are determined as in the ATLAS study of [167]: at NLO, fast interpolation grids are generated with MCFM; at NNLO, QCD corrections are implemented by means of K-factors determined with the \(N_\mathrm{jetti}\) program [168, 169]. The factorization and renormalization scales are set equal to the mass of the W boson.

We further include the ATLAS [85] and CMS [86] data for production of W with a charm jet, at \(\sqrt{s}=7\) TeV and \(\sqrt{s}=13\) TeV, respectively. The two measurements correspond to integrated luminosities of 4.6 \(\hbox {fb}^{-1}\) and 35.7 \(\hbox {fb}^{-1}\). In both cases, we utilize the cross-sections differential in the pseudo-rapidity of the decay lepton \(\eta _\ell \), which is accessed in the range \(|\eta _\ell |<2.5\) for ATLAS and \(|\eta _\ell |<2.4\) for CMS. In the case of ATLAS, separate distributions for the production of positively and negatively charged bosons are provided; in the case of CMS, only the distribution for the sum of the two is available. Theoretical predictions are computed at NLO with MCFM; NNLO QCD corrections have been computed very recently [87], although in a format that does not allow for their ready implementation. These datasets are therefore not included in the determination of NNLO PDFs. The factorization and renormalization scales are set equal to the mass of the W boson.

All the measurements discussed in this section have been included in a PDF determination, in a specific study based on NNPDF3.1 [10].

2.2.6 Top pair production

We consider several new datasets for top pair production at the LHC. At \(\sqrt{s}=8\) TeV, we include the ATLAS normalized differential cross-section [91] and the CMS normalized double differential cross-section [92], both of which are measured in the dilepton channel. Companion measurements in the lepton+jets channel [69, 72] were already part of NNPDF3.1. These measurements correspond respectively to an integrated luminosity of 20.2 \(\hbox {fb}^{-1}\) and 19.7 \(\hbox {fb}^{-1}\). At \(\sqrt{s}=13\) TeV, we include the ATLAS total cross-section [136] and the CMS absolute differential distributions in the lepton+jets channel [93] and in the dilepton channel [94]. The ATLAS measurement is based on the full Run II sample, corresponding to an integrated luminosity of 139 \(\hbox {fb}^{-1}\) and replaces the corresponding measurement, determined from a partial luminosity [68], included in NNPDF3.1; the CMS measurements are for an integrated luminosity of 35.8 \(\hbox {fb}^{-1}\).

Various differential distributions are available for each of these measurements. Because correlations between different distributions are not available, only one distribution at a time can be included. Rapidity distributions are generally affected by small higher order corrections [170], hence we chose the rapidity of the top quark, when available, as our preferred observable, and otherwise, the rapidity of the top pair. Specifically, we select the distributions differential in the rapidity of the top pair in the case of [91], the double-differential distribution in the rapidity of the top quark and the invariant mass of the top pair in the case of [92] and in the rapidity of the top quark in the case of [93, 94]. We have explicitly verified that the choice of any other distributions does not alter the results. The kinematic coverage of the distributions that we included is shown in Table 5.

Theoretical predictions are computed at NLO with mg5_aMC (v2.6.6) [125]; NNLO QCD corrections are determined from publicly available FastNLO tables [171, 172] for differential distributions and from top++ [173] for the total cross-section. The renormalization and factorization scales are set as in NNPDF3.1, see Sect. 2.7 in [5] for details.

2.2.7 Single-inclusive and dijet production

In NNPDF4.0, following the study of Ref. [9], we consider both single-inclusive jets (as in previous NNPDF determinations) and dijets, that have several desirable theoretical features [137].

For single-inclusive jet production, we include the ATLAS [88] and CMS [89] measurements at \(\sqrt{s}=8\) TeV. They correspond to integrated luminosities of 20.2 \(\hbox {fb}^{-1}\) and 19.7 \(\hbox {fb}^{-1}\), respectively. In both cases the measurements are provided for the cross-section differential in the transverse momentum, \(p_T^\mathrm{jet}\), and in the rapidity, \(y^\mathrm{jet}\), of the jet. The data cover the range \(70~\mathrm{GeV}\le p_T^\mathrm{jet}\le 2.5\) TeV and \(|y^\mathrm{jet}|\le 3.0\). Theoretical predictions are computed at NLO with NLOJet++ (v4.1.3) [126] and benchmarked against the independent computation presented in [174]. NNLO QCD corrections are incorporated by means of the K-factor computed in the same publication. The factorization and renormalization scales are set equal to the optimal scale choice recommended in Ref. [137], namely, the scalar sum of the transverse momenta of all partons in the event.

For dijet production we consider the ATLAS [148] and CMS [76] measurements at \(\sqrt{s}=7\) TeV and the CMS measurement [149] at \(\sqrt{s}=8\) TeV. They correspond to integrated luminosities of 4.5 \(\hbox {fb}^{-1}\) (at 7 TeV) and of 19.7 \(\hbox {fb}^{-1}\) (at 8 TeV). For ATLAS, the cross-section is double differential in the dijet invariant mass \(m_{jj}\) and in the absolute difference of the rapidities of the two jets \(y^*\). The corresponding ranges are \(260~\mathrm{GeV}\le m_{jj}\le 4.27\) TeV and \(0.0\le y^* \le 3.0\). For CMS, the cross-section is double differential in \(m_{jj}\) and in the maximum absolute rapidity of the two jets \(|y_\mathrm{max}|\) (at 7 TeV) and triple differential in the average transverse momentum of the jet pair \(p_{T,\mathrm{avg}}\), the dijet boost \(y_b\), and \(y^*\) (at 8 TeV). The corresponding ranges are \(133~\mathrm{GeV}\le p_{T,\mathrm{avg}}\le 1.78\) TeV and \(0.0\le y_b,y^*\le 3.0\). As in the case of single-inclusive jets, theoretical predictions are computed at NLO with NLOJet++ and are benchmarked against the independent computation of Ref. [174]. This computation is also used to determine the NNLO QCD corrections, implemented as K-factors. The renormalization and factorization scales used in the computation are set to the invariant mass of the dijet system, again following the recommendation of Ref. [137].

Single-inclusive jet and dijet observables cannot be simultaneously included because full knowledge of the experimental correlations between them is not available. The selection of the optimal set of jet observables will be performed and discussed in Sect. 4, in the context of the final dataset selection.

2.2.8 Inclusive isolated-photon production

Isolated photon production was not included in previous NNPDF releases and is included in NNPDF4.0 for the first time. We specifically consider the ATLAS measurements at \(\sqrt{s}=8\) TeV [102] and \(\sqrt{s}=13\) TeV [175]. They correspond to integrated luminosities of 20.2 \(\hbox {fb}^{-1}\) and 3.2 \(\hbox {fb}^{-1}\), respectively. The measurements are provided for the cross-section differential in the photon transverse energy \(E_T^\gamma \) in different bins of the photon pseudorapidity \(\eta _\gamma \). The accessed ranges are, in both cases, \(E_T^\gamma <1500\) GeV and \(|\eta _\gamma |<2.37\). Theoretical predictions are computed at NLO with MCFM and benchmarked against the independent computation presented in [176]. The smooth cone isolation criterion [177] is adopted accordingly, with the parameter values determined in [178]. NNLO QCD corrections are incorporated by means of the K-factors computed in [176]; K-factors are also used to incorporate corrections due to resummation of the electroweak Sudakov logarithms at leading-logarithmic accuracy, according to the procedure presented in [179, 180]. The factorization and renormalization scales are set equal to the central value of \(E_T^\gamma \) for each bin. The impact of the measurements presented above on a PDF determination was studied in [7] in the context of a variant of the NNPDF3.1 fit. These data were found to be generally well described, except in the most forward rapidity region, and to have a mild impact on the gluon PDF at intermediate values of x.

2.2.9 Single top production

Another process included for the first time in an NNPDF release is t-channel single top production. We consider the ATLAS [96, 98, 100] and CMS [97, 99, 101] measurements at \(\sqrt{s}=7\), 8 and 13 TeV that correspond, for ATLAS (CMS), to integrated luminosities of 4.59, 20.2 and 3.2 \(\hbox {fb}^{-1}\) (2.73, 19.7 and 2.2 \(\hbox {fb}^{-1}\)), respectively. In the case of ATLAS, we consider the ratio of the top to antitop inclusive cross-sections at 7 and 13 TeV and the distributions differential in the top or antitop rapidity \(y_{t,\bar{t}}\) at 7 and 8 TeV normalized to the corresponding total cross-section. The rapidity ranges are \(|y_{t,\bar{t}}|<3.0\) and \(|y_{t,\bar{t}}|<2.2\) at \(\sqrt{s}=7\) and 8 TeV, respectively. In the case of CMS, we consider the sum of the top and antitop inclusive cross-sections at 7 TeV and the ratio of the top to antitop inclusive cross-sections at 8 and 13 TeV. Theoretical predictions are computed in the five-flavor scheme. At NLO the calculation is performed with mg5_aMC (v2.6.6) [125]; NNLO corrections are incorporated by means of the K-factors determined in [181, 182]. The renormalization and factorization scales are set equal to the top mass.

The measurements presented above were extensively studied in the context of a variant of the NNPDF3.1 fit in [8]. The choice of observables included for PDF determinations is based on the results of that reference. In particular, distributions differential in the transverse momentum of the top quark or antiquark are also provided by the experimental collaborations. However, their inclusion would result in a double counting, given that experimental correlations across uncertainties for different distributions are not provided. In [8] these measurements were found to have a mild impact on the up and down PDFs at \(x\gtrsim 0.1\).

Single top t-channel production is in principle also sensitive to the theoretical details of the matching schemes and, in the five-flavor scheme, to the bottom PDF. Here we determine the bottom PDF using perturbative matching conditions, but it could in principle be parametrized independently, like the charm PDF. However, while this may become relevant in the future, it does not seem necessary at present given the precision and kinematic coverage of the existing data.

2.3 Treatment of nuclear corrections

The NNPDF4.0 dataset, like its predecessors, includes a significant amount of data involving deuterium or heavy nuclear targets, both for deep inelastic and hadronic processes. These are summarized in Table 6, where we also report the corresponding reference, the number of data points in the NLO and NNLO baseline fits, and the species of the nuclear target. Overall, 1416 and 1417 data points come from nuclear measurements in the NLO and NNLO fits respectively, which amount to about 30% of the full dataset. All of these datasets but SeaQuest [117] were already included in the previous NNPDF3.1 determination [5].

Table 6 The nuclear datasets in NNPDF4.0 involving deuterium targets (left) or heavier nuclear targets (right) and corresponding targets; \(N_\mathrm{dat}\) denotes the number of data points included in the NLO/NNLO fits. Note that the EMC \(F_2^c\) dataset is not included in the default NNPDF4.0 PDF set

The inclusion of nuclear data in a fit of proton PDFs requires accounting for changes in the PDFs induced by the nuclear medium. The impact of such changes was studied by us in [14, 183] and found to be subdominant in comparison to the PDF uncertainty at that time. Specifically, it was shown (see Sect. 4.11 in [5]) that, upon removing data with nuclear targets from the dataset, the precision of up, down and strange quark and anti-quark PDFs deteriorated by an amount larger than the size of the effect of the nuclear corrections estimated on the basis of models. Nuclear corrections were consequently not included in the NNPDF3.1 determination.

In NNPDF4.0 we revisit this state of affairs, motivated by the significant reduction of the PDF uncertainty in comparison to NNPDF3.1, which suggests that nuclear effects can no longer be neglected. We now account for nuclear effects by viewing them as a theoretical uncertainty. The way this is determined and included follows the methodology developed in [18, 19], to which we refer for details. The basic idea is to determine the uncertainty from the difference between the values of observables computed with the proton and nuclear PDFs, with each different determination of the nuclear PDF taken as an independent nuisance parameter. This can then be used to compute a theoretical covariance matrix, that can be added to the experimental covariance matrix.
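The construction of the theoretical covariance matrix can be sketched as follows. This is a schematic illustration only; the full prescription is given in Refs. [18, 19]. It assumes that each nuclear PDF determination yields one vector of shifts between nuclear and proton predictions, and that the covariance matrix is the average over determinations of the outer product of these shifts. The normalization by the number of determinations, the function names and all numbers are assumptions made for this sketch.

```python
def theory_covariance(proton_pred, nuclear_preds):
    """Average outer product of the shifts (nuclear minus proton prediction).

    proton_pred:   predictions for each data point with the proton PDF
    nuclear_preds: one list of predictions per nuclear PDF determination
    """
    n_det = len(nuclear_preds)
    n_dat = len(proton_pred)
    shifts = [
        [nuclear[i] - proton_pred[i] for i in range(n_dat)]
        for nuclear in nuclear_preds
    ]
    return [
        [sum(s[i] * s[j] for s in shifts) / n_det for j in range(n_dat)]
        for i in range(n_dat)
    ]


def add_matrices(first, second):
    """Element-wise sum, e.g. of experimental and theoretical covariance matrices."""
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(first, second)]


# Hypothetical predictions for two data points and two nuclear determinations.
proton_pred = [1.0, 2.0]
nuclear_preds = [[1.1, 2.0], [0.9, 2.2]]
experimental_cov = [[0.04, 0.0], [0.0, 0.09]]

theory_cov = theory_covariance(proton_pred, nuclear_preds)
total_cov = add_matrices(experimental_cov, theory_cov)
print(theory_cov, total_cov)
```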

In order to apply this methodology an underlying set of nuclear PDFs is needed for the computation of the shifts. Heavy nuclear and deuteron corrections are treated separately because of the substantial difference in the observed size and expected origin of the nuclear effects. For heavier nuclei (Fe, Cu and Pb targets) we will use the nNNPDF2.0 nuclear PDFs [20]. For deuterium, we use the self-consistent procedure described in [19], whereby the proton and deuteron PDFs are determined simultaneously, each including the uncertainties on the other. This procedure thus requires in turn the use of a PDF determination without deuterium corrections in order to initiate the self-consistent iteration. Here we will apply it by starting with the NNPDF4.0 determination itself. The deuterium PDF determined by this procedure will be described in Sect. 8.6 below.

While nuclear effects will be included as an extra uncertainty in the default NNPDF4.0 determination, we will also discuss for comparison PDFs obtained by neglecting nuclear effects altogether, or by using the nuclear corrections computed as discussed above as a correction to the data and not just as an additional uncertainty, again following the methodology of Refs. [18, 19]. These alternative treatments of nuclear effects will be compared and discussed in Sect. 8.6 below and provide the motivation for including nuclear uncertainties without a correction in the default PDF determination.

3 Fitting methodology

As discussed in the introduction, NNPDF4.0 is the first PDF set to be based on a methodology fully selected through a machine learning algorithm. This means that, whereas the basic structure of the NNPDF4.0 methodology is the same as in previous NNPDF releases, specifically the use of a Monte Carlo representation of PDF uncertainties and correlations, and the use of neural networks as basic interpolating functions [5, 14], all the details of the implementation, such as the choice of neural network architecture and the minimization algorithm, are now selected through an automated hyperoptimization procedure. This is possible thanks to an extensive rewriting and reorganization of the NNPDF framework. Furthermore, some theory constraints built into the PDF parametrization are implemented for the first time in NNPDF4.0. Also for the first time we consider PDF determinations performed with different choices of parametrization basis.

In Sect. 3.1 we start by discussing the PDF parametrization and choice of basis and the way they implement theoretical constraints. In Sect. 3.2 we then present the new NNPDF fitting framework, which is the basis of the hyperoptimization procedure. The hyperoptimization in turn is discussed in Sect. 3.3, along with its output, which defines the baseline NNPDF4.0 methodology. We conclude in Sect. 3.4 with quantitative benchmarks assessing both the efficiency and speed of this final methodology compared to the methodology used for NNPDF3.1.

3.1 PDF parametrization and theoretical constraints

We now turn to the general structure of the PDF parametrization, and the theory constraints that are imposed upon it: specifically sum rules, positivity and integrability.

3.1.1 Parametrization bases

A PDF analysis requires a choice of basis, namely a set of linearly independent PDF flavor combinations that are parametrized at the input evolution scale \(Q_0\). In the NNPDF approach, this corresponds to choosing which are the PDF combinations whose value is the output of a neural network. Optimal results should in principle be independent of this specific choice of basis. Previous NNPDF releases adopted the so-called evolution basis, in which the basis PDFs are chosen as the singlet quark \(\Sigma \) and gluon g that mix upon QCD evolution, and valence \(V_i\) and nonsinglet sea \(T_i\) combinations that are eigenstates of evolution, namely

$$\begin{aligned} \Sigma&= u+\bar{u} + d+\bar{d} + s+\bar{s} + 2c \, , \nonumber \\ T_3&= \left( u+\bar{u}\right) - \left( d+\bar{d} \right) \, , \nonumber \\ T_8&= \left( u+\bar{u} + d+\bar{d} \right) - 2\left( s+\bar{s} \right) \, , \nonumber \\ V&= \left( u-\bar{u}\right) + \left( d-\bar{d}\right) + \left( s-\bar{s}\right) \, ,\nonumber \\ V_3&= \left( u-\bar{u}\right) - \left( d-\bar{d} \right) \, , \nonumber \\ V_8&= \left( u-\bar{u} + d-\bar{d} \right) - 2\left( s-\bar{s} \right) \, . \end{aligned}$$

In NNPDF3.1, this set of linearly independent flavor combinations was supplemented by an independently parametrized total charm PDF \(c+\bar{c}\), with the charm asymmetry \(c-\bar{c}\) assumed to vanish at scale \(Q_0\). Here we will instead supplement the basis Eq. (3.1) with a further nonsinglet combination, namely

$$\begin{aligned} T_{15} = \left( u+\bar{u} + d+\bar{d} + s+\bar{s} \right) - 3\left( c+\bar{c}\right) \end{aligned}$$

still assuming \(c-\bar{c}=0\) at the parametrization scale. At NNLO a small charm asymmetry is then generated by perturbative evolution. The union of Eqs. (3.1, 3.2) will be referred to as the evolution basis henceforth.

We will also consider PDF determinations carried out in the flavor basis, in which the PDFs that are parametrized are

$$\begin{aligned} \tilde{f}_{k} =\{ u,\,\bar{u},\,d,\bar{d},\,s,\,\bar{s},\, c,\, g\}, \end{aligned}$$

related to their evolution basis counterparts

$$\begin{aligned} {f}_{k}=\{V,\, V_3,\, V_8,\, T_3,\, T_8,\, T_{15},\, \Sigma ,\, g\}, \end{aligned}$$

by means of Eqs. (3.1, 3.2).
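As an illustration, the linear map from the flavor basis to the evolution basis defined by Eqs. (3.1, 3.2) can be written out explicitly. The sketch below evaluates it at a single point in x, with \(c=\bar{c}\) as assumed at the parametrization scale; the function name and the input values are hypothetical.

```python
def flavor_to_evolution(u, ubar, d, dbar, s, sbar, c, g):
    """Map flavor-basis PDF values at one x to the evolution basis.

    Implements Eqs. (3.1, 3.2) with c = cbar, so that c + cbar = 2c.
    """
    u_plus, d_plus, s_plus = u + ubar, d + dbar, s + sbar
    u_minus, d_minus, s_minus = u - ubar, d - dbar, s - sbar
    c_plus = 2.0 * c
    return {
        "V": u_minus + d_minus + s_minus,
        "V3": u_minus - d_minus,
        "V8": u_minus + d_minus - 2.0 * s_minus,
        "T3": u_plus - d_plus,
        "T8": u_plus + d_plus - 2.0 * s_plus,
        "T15": u_plus + d_plus + s_plus - 3.0 * c_plus,
        "Sigma": u_plus + d_plus + s_plus + c_plus,
        "g": g,
    }


# Hypothetical PDF values at some fixed x and Q0 (not fit results).
evolution = flavor_to_evolution(
    u=0.5, ubar=0.1, d=0.3, dbar=0.15, s=0.05, sbar=0.05, c=0.01, g=1.2
)
print(evolution)
```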

The evolution and flavor bases each have advantages and disadvantages.

For instance, if one chooses a factorization scheme in which PDFs are non-negative [21], positivity is easier to implement in the flavor basis. On the other hand, the integrability of the valence distributions \(V,V_3,V_8\), as required by the valence sum rules, is simpler in the evolution basis.

In this work, we take the evolution basis as our standard choice; however, we will explicitly check basis independence by verifying that equivalent results are obtained in the data region if the flavor basis is adopted instead, see Sect. 8.4 below.

The output of the neural network is supplemented by a preprocessing factor and by normalization constants. The relation between the PDFs and the neural network output is

$$\begin{aligned}&xf_k\left( x,Q_0; {\varvec{\theta }} \right) = A_k\,x^{1-\alpha _k}(1-x)^{\beta _k}\mathrm{NN}_k(x; {\varvec{\theta }}), \nonumber \\&\quad k=1,\ldots ,8\,, \end{aligned}$$

where k runs over the elements of the PDF basis, \(\mathrm{NN}_k(x;{\varvec{\theta }})\) is the k-th output of a neural network, and \({\varvec{\theta }}\) collectively indicates the full set of neural network parameters. The preprocessing function has the purpose of speeding up the training of the neural net. In order to make sure that it does not bias the result, the exponents \(\alpha _k\) and \(\beta _k\) are varied in a range that is determined iteratively in a self-consistent manner as described in [14], supplemented by a further integrability constraint, to be discussed in Sect. 3.1.4. The independence of the results from the choice of preprocessing ranges has been recently validated in Ref. [184], where it is shown that the results obtained here can also be obtained by a suitable rescaling of the neural network input that avoids preprocessing altogether. The normalization constants \(A_k\) are constrained by the valence and momentum sum rules, also to be discussed below, in Sect. 3.1.2.
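The structure of Eq. (3.5) can be made concrete with a short sketch. Here the neural network output is replaced by a hypothetical smooth stand-in function, since the actual architecture is fixed only by the hyperoptimization; the function names and parameter values are likewise assumptions made for illustration.

```python
import math


def toy_network(x):
    """Hypothetical stand-in for one neural network output NN_k(x)."""
    return 1.0 + 0.5 * math.tanh(math.log(x))


def xpdf(x, norm, alpha, beta, network=toy_network):
    """Evaluate x f_k(x, Q0) = A_k x^(1 - alpha_k) (1 - x)^beta_k NN_k(x)."""
    return norm * x ** (1.0 - alpha) * (1.0 - x) ** beta * network(x)


# With a constant network the preprocessing alone sets the shape.
flat = lambda x: 1.0
print(xpdf(0.5, norm=2.0, alpha=1.0, beta=3.0, network=flat))

# Small-x power law: the ratio between two x values is (x1 / x2)^(1 - alpha).
ratio = (xpdf(1e-4, 1.0, 0.5, 0.0, network=flat)
         / xpdf(1e-2, 1.0, 0.5, 0.0, network=flat))
print(ratio)
```

Setting alpha to 1 removes the small-x factor altogether, leaving the small-x behavior entirely to the network.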

When using the flavor basis, the small-x preprocessing is removed from Eq. (3.5), i.e. \(\alpha _k=1\) for all k. This is because standard Regge theory arguments (see e.g. [185]) imply that the singlet and nonsinglet have a different small x behavior, and in particular the nonsinglet has a finite first moment, while the singlet first moment diverges. This means that the small-x behavior of flavor-basis PDFs is the linear combination of a leading singlet small-x growth and a subleading nonsinglet power behavior characterized by a different exponent. Hence, factoring out a common preprocessing exponent is not advantageous in this case.

3.1.2 Sum rules

Irrespective of the choice of fitting basis, PDFs should satisfy both the momentum sum rule

$$\begin{aligned} \int _0^1 dx\,x\left( g\left( x, Q\right) + \Sigma \left( x, Q\right) \right) = 1 \, , \end{aligned}$$

and the three valence sum rules,

$$\begin{aligned} \int _0^1 dx\,\left( u(x,Q)-\bar{u}(x,Q)\right)= & {} 2 \, , \nonumber \\ \int _0^1 dx\,\left( d(x,Q)-\bar{d}(x,Q)\right)= & {} 1 \, , \nonumber \\ \int _0^1 dx\,\left( s(x,Q)-\bar{s}(x,Q)\right)= & {} 0 \, , \end{aligned}$$

for all values of Q. Provided these sum rules are imposed at the initial parametrization scale, \(Q_0\), perturbative QCD ensures that they will hold for any other value \(Q\ne Q_0\). When transformed to the evolution basis, Eq. (3.8), the valence sum rules read

$$\begin{aligned} \int _0^1 dx\, V\left( x, Q\right)= & {} \int _0^1 dx\, V_8\left( x, Q\right) = 3\,, \nonumber \\ \int _0^1 dx\, V_3\left( x, Q\right)= & {} 1\,. \end{aligned}$$

We then have four equations that fix four of the normalization constants \(A_k\), namely \(A_V\), \(A_{V_8}\), \(A_{V_3}\) and \(A_g\).

In the present analysis we always impose the sum rules in the evolution basis. This means that when performing a fit in the flavor basis, we express the evolution basis PDFs \(f_k\) Eq. (3.4) in terms of the flavor basis PDFs \(\tilde{f}_{k}\) Eq. (3.3) through a transformation matrix \(R_{kk'}\):

$$\begin{aligned} xf_k\left( x,Q_0; {\varvec{\theta }}\right) = A_k \sum _{k'} R_{kk'} \,x\tilde{f}_{k'}\left( x,Q_0; {\varvec{\theta }}\right) , \end{aligned}$$

and then impose Eqs. (3.6, 3.8).

The integrals in Eqs. (3.6, 3.8) are evaluated between \(x_\mathrm{min}=10^{-9}\) and \(x_\mathrm{max}=1\) each time the neural network parameters \({\varvec{\theta }}\) are modified by the minimization algorithm, using an adaptive strategy that achieves a relative precision of \(\mathcal {O}\left( 10^{-5}\right) \) across the whole range of x.
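The way in which the sum rules fix the four normalization constants can be sketched numerically. The example below is illustrative only: it replaces the adaptive integration strategy with a fixed trapezoidal rule in \(\ln x\), and it uses hypothetical analytic shapes in place of the unnormalized neural network PDFs. All function names and toy shapes are assumptions.

```python
import math

X_MIN, X_MAX = 1e-9, 1.0


def integrate(func, n_points=4000):
    """Trapezoidal rule in t = ln(x): int f(x) dx = int f(e^t) e^t dt."""
    t_min, t_max = math.log(X_MIN), math.log(X_MAX)
    step = (t_max - t_min) / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        x = min(math.exp(t_min + i * step), X_MAX)  # guard against rounding above 1
        weight = 0.5 if i in (0, n_points - 1) else 1.0
        total += weight * func(x) * x
    return total * step


# Hypothetical unnormalized shapes (valence PDFs f(x); momentum densities x f(x)).
def v_shape(x):
    return x ** -0.5 * (1.0 - x) ** 3


def v3_shape(x):
    return x ** -0.5 * (1.0 - x) ** 4


def v8_shape(x):
    return x ** -0.4 * (1.0 - x) ** 3


def x_sigma(x):
    return 3.0 * x ** 0.2 * (1.0 - x) ** 3


def x_gluon(x):
    return x ** -0.1 * (1.0 - x) ** 5


def normalizations():
    """Solve the three valence sum rules and the momentum sum rule for A_k."""
    return {
        "A_V": 3.0 / integrate(v_shape),
        "A_V8": 3.0 / integrate(v8_shape),
        "A_V3": 1.0 / integrate(v3_shape),
        "A_g": (1.0 - integrate(x_sigma)) / integrate(x_gluon),
    }


norms = normalizations()
print(norms)
```

In this sketch the singlet normalization is left untouched, so the gluon normalization absorbs whatever momentum fraction the singlet does not carry.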

3.1.3 Positivity of PDFs and physical observables

Hadron-level cross-sections are non-negative quantities, because they are probability distributions. However, PDFs beyond LO are not probabilities, and thus they may be negative. The reason is that, beyond LO, PDFs include a collinear subtraction which is necessary in order for the partonic cross-sections to be finite. Whether they remain positive or not then depends on the form of the subtraction, i.e. on the factorization scheme. Consequently, in previous NNPDF determinations, in order to exclude unphysical PDFs, we imposed positivity of a number of cross-sections, by means of Lagrange multipliers which penalize PDF configurations leading to negative physical observables. Specifically, we imposed positivity of the \(F_2^u\), \(F_2^d\), \(F_2^s\), and \(F_{L}\) structure functions and of the flavor-diagonal Drell–Yan rapidity distributions \(\sigma _{\mathrm{DY},u\bar{u}}\), \(\sigma _{\mathrm{DY},d\bar{d}}\), \(\sigma _{\mathrm{DY},s\bar{s}}\). However, since this set of positivity observables is not exhaustive, in some extreme kinematic regions physical observables (e.g. very high-mass \(W'\) production) could still become negative within uncertainties.

It was recently shown in Ref. [21] that PDFs for individual quark flavors and the gluon in the \(\overline{\mathrm{MS}}\) factorization scheme are non-negative. We thus now also impose this positivity condition along with the constraint of positivity of physical cross-sections discussed above. Indeed, note that the positivity of \(\overline{\mathrm{MS}}\) PDFs is neither necessary nor sufficient in order to ensure cross-section positivity [21]: they are independent (though of course related) constraints that limit the space of acceptable PDFs.

We impose positivity of the gluon and of the up, down and strange quark and antiquark PDFs. The charm PDF is also positive in the \(n_f=3\) scheme, but it need not be positive in the \(n_f=4\) scheme because perturbative matching conditions neglect the quark mass and this generally spoils positivity for a massive quark PDF [21]. We do, however, add a positivity constraint for the charm structure function \(F_2^c\), similar to the ones for other structure functions of individual flavors. Note that this constraint was not included in NNPDF3.1, though it was included in a more recent study based on NNPDF3.1 dataset and methodology [10], where it was found to have a significant impact on the strange PDF.

In the same manner as for the cross-sections, PDF positivity is implemented by means of Lagrange multipliers. Specifically, for each flavor basis PDF \(\tilde{f}_{k}\) Eq. (3.3), one adds a contribution to the total cost function used for the neural network training given by

$$\begin{aligned} \chi ^2_\mathrm{tot} \rightarrow \chi ^2_\mathrm{tot}+\sum _{k=1}^8 \Lambda _k \,\sum _{i=1}^{n_i} \,\text {Elu}_{\alpha }\left( -\tilde{f}_k\left( x_i,Q^2\right) \right) \,, \end{aligned}$$

with \(Q^2 = 5\, \text {GeV}^2\) and with the \(n_i\) values \(x_i\) given by 10 points logarithmically spaced between \(5\cdot 10^{-7}\) and \(10^{-1}\) and 10 points linearly spaced between 0.1 and 0.9. The Elu function is given by

$$\begin{aligned} \text {Elu}_{\alpha }\left( t\right) = {\left\{ \begin{array}{ll} t &amp; \text {if}\,\, t>0 \\ \alpha \left( e^t-1\right) &amp; \text {if}\,\, t<0 \end{array}\right. }\,, \end{aligned}$$

with the parameter \(\alpha =10^{-7}\). Eq. (3.10) indicates that negative PDFs receive a penalty which is proportional both to the corresponding Lagrange multipliers \(\Lambda _k\) and to the absolute magnitude of the PDF itself, and therefore these configurations will be strongly disfavored during the minimization. The Lagrange multiplier increases exponentially during the minimization, with a maximum value \(\Lambda _k^\mathrm{max}\) attained when the maximum training length is reached. We choose \(\Lambda _k^\mathrm{max}=10^{10} \) for the three Drell–Yan observables, and \(\Lambda _k^\mathrm{max}=10^6 \) for all the other positivity observables. These values are chosen in such a way that the constraint is enforced with sufficient accuracy in all cases. The starting values of the Lagrange multipliers and the maximum training length instead are determined as part of the hyperoptimization procedure described in Sect. 3.3 below.
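
As an illustration of Eqs. (3.10) and (3.11), the penalty for a single flavor can be sketched in a few lines of Python. This is a minimal sketch and not the NNPDF implementation: the function names, the toy PDF shapes, and the treatment of the grid endpoints (both included, so that \(x=0.1\) appears twice) are assumptions made here for illustration.

```python
import math

ALPHA = 1e-7  # Elu parameter quoted in the text


def elu(t, alpha=ALPHA):
    """Elu_alpha(t): linear for t > 0, alpha * (exp(t) - 1) otherwise."""
    return t if t > 0 else alpha * math.expm1(t)


def positivity_grid():
    """10 log-spaced points in [5e-7, 1e-1] plus 10 linear points in [0.1, 0.9].

    Including both endpoints of each range is an assumption of this sketch.
    """
    lo, hi = math.log(5e-7), math.log(1e-1)
    log_pts = [math.exp(lo + (hi - lo) * i / 9) for i in range(10)]
    lin_pts = [0.1 + 0.8 * i / 9 for i in range(10)]
    return log_pts + lin_pts


def positivity_penalty(pdf, lagrange_multiplier, grid=None):
    """Penalty Lambda_k * sum_i Elu(-f_k(x_i)) for a single flavor k."""
    grid = positivity_grid() if grid is None else grid
    return lagrange_multiplier * sum(elu(-pdf(x)) for x in grid)


# Hypothetical toy PDFs, for illustration only.
def positive_pdf(x):
    return (1.0 - x) ** 3


def negative_pdf(x):
    return -0.01 * (1.0 - x) ** 3


print(positivity_penalty(positive_pdf, 1.0))  # slightly negative, bounded by 20 * alpha
print(positivity_penalty(negative_pdf, 1.0))  # positive, grows with |f|
```

For a positive PDF the argument of the Elu function is negative, so each grid point contributes at most \(\alpha \) in absolute value; for a negative PDF the contribution is linear in \(|\tilde{f}_k|\), which is what drives the minimizer away from such configurations.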

When performing fits in the evolution basis, this PDF positivity constraint is applied after performing the inverse transformation to Eq. (3.9) in order to express the flavor basis PDFs \(\tilde{f}_{k}\) Eq. (3.3) in terms of their evolution basis counterparts \(f_{k}\).

3.1.4 PDF integrability

The small-x behavior of the PDFs is constrained by integrability requirements. First, the gluon and singlet PDFs must satisfy the momentum sum rule, Eq. (3.6), which implies that

$$\begin{aligned} \lim _{x\rightarrow 0} \, x^2f_k(x,Q)= 0 \, ,\quad \forall ~Q \, ,\qquad f_k=g,\,\Sigma \, , \end{aligned}$$

while the valence sum rules, Eq. (3.8), constrain the small-x behavior of the valence distributions,

$$\begin{aligned} \lim _{x\rightarrow 0}\, xf_k(x,Q)= 0 \, ,\quad \forall ~Q \, ,\qquad f_k=V,\,V_3\,,V_8 \, . \end{aligned}$$

Furthermore, as mentioned, standard Regge theory arguments suggest that the first moments of the non-singlet combinations \(T_3\) and \(T_8\) are also finite, so for instance the Gottfried sum (which is proportional to the first moment of \(T_3\)) is finite. This implies that also for these two combinations one has

$$\begin{aligned} \lim _{x\rightarrow 0}\, xf_k(x,Q)= 0 \, ,\quad \forall ~Q \, ,\qquad f_k=T_3,\,T_8 \, . \end{aligned}$$

To ensure that these integrability requirements are satisfied, first of all we constrain the range of the small-x preprocessing exponents \(\alpha _i\) Eq. (3.5). We supplement the iterative determination of the exponents described in Ref. [14] with the constraints \(\alpha _k <2\) for the singlet and gluon and \(\alpha _k <1\) for the nonsinglet combinations \(xV,\,xV_3,\, xV_8,\, xT_3\) and \(xT_8\). Indeed if the preprocessing exponent were to violate these bounds, the neural net \(\mathrm{NN}(x; {\varvec{\theta }})\) in Eq. (3.5) would have to compensate this behavior in order for integrability to hold. Preprocessing would then be slowing the minimization rather than speeding it up. Note that, in the flavor basis, the small-x preprocessing exponents are absent, so this requirement only applies to the evolution basis.

Table 7 Summary of the main differences between the NNPDF3.1 and the NNPDF4.0 code

We observe that while Eq. (3.12) always turns out to be satisfied automatically when fitting to the experimental data, the additional constraints Eqs. (3.13) and (3.14) can sometimes be violated by the fit, and thus must be imposed. This is also achieved through Lagrange multipliers. We include in the total cost function additional contributions of the form

$$\begin{aligned} \chi ^2_\mathrm{tot} \rightarrow \chi ^2_\mathrm{tot}+ \sum _k \Lambda _k^\mathrm{(int)} \sum _{i=1}^{n_i}\,\left[ xf_k\left( x_\mathrm{int}^{(i)},Q^2_i\right) \right] ^2\,, \end{aligned}$$

where \(f_k= T_3, T_8\) in the evolution basis while \(f_k=V,V_3,V_8,T_3,T_8\) in the flavor basis. The points \(\{ x_\mathrm{int}^{(i)}\}\) are a set of values in the small-x region, \(Q^2_i\) is a suitable reference scale, and, as in the case of positivity, the Lagrange multipliers \(\Lambda _k^{(\mathrm{int})}\) grow exponentially during the minimization, with a maximum value \(\Lambda _k^{(\mathrm{int})}=100\) attained at maximum training length. We choose \(Q_i^2=5\) GeV\(^2\) and in the evolution basis \(n_i=1\) and \(x_\mathrm{int}^{(1)} = 10^{-9}\), while in the flavor basis \(n_i=3\) and \(x_\mathrm{int}^{(i)}=10^{-9},\,10^{-8},\,10^{-7}\). As for the positivity multiplier, the starting values of the Lagrange multipliers (as well as the maximum training length) are hyperoptimization parameters.

Finally, we introduce a post-selection criterion, in order to discard replicas that fail to satisfy the integrability condition and retain a large value at small x despite the Lagrange multiplier. It turns out that imposing

$$\begin{aligned} \sum _{i=1}^{n_{i}} \left| x_\mathrm{int}^{(i)} f_k\left( x_\mathrm{int}^{(i)}\right) \right| <{{1}\over {2}} \, , \qquad f_k=V,V_3,V_8,T_3,T_8 \, , \end{aligned}$$

is enough to preserve integrability for all replicas. This is due to the fact that the function xf(x) at its maximum is of order one, so the condition Eq. (3.16) ensures that at small x it is decreasing. When determining PDF replicas, we have explicitly checked a posteriori that the numerical computation of the first moment yields a finite result for all PDF replicas.
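
The post-selection criterion of Eq. (3.16) amounts to a simple threshold test, sketched below for a single distribution. The function name and the toy small-x shapes are hypothetical, and the default evaluation points are taken to be the three flavor-basis values of \(x_\mathrm{int}^{(i)}\) quoted above, which is an assumption of this sketch.

```python
def passes_integrability(xf, x_points=(1e-9, 1e-8, 1e-7), threshold=0.5):
    """Return True if sum_i |x_i f(x_i)| < threshold.

    `xf` is a callable returning x * f(x) for one of V, V3, V8, T3, T8.
    """
    return sum(abs(xf(x)) for x in x_points) < threshold


# Hypothetical small-x behaviors, for illustration only.
def vanishing(x):
    return x ** 0.5   # x f(x) -> 0 as x -> 0: integrable


def growing(x):
    return x ** -0.1  # x f(x) grows as x -> 0: first moment diverges


print(passes_integrability(vanishing))  # True
print(passes_integrability(growing))    # False
```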

3.2 Fitting framework

The machine learning approach to PDF determination that we will discuss shortly has been made possible by a complete restructuring of the NNPDF fitting framework. Further motivations for this are the need to deal with a particularly large dataset, and the goal of releasing the NNPDF code as open source, which imposes stringent requirements of quality and accessibility. The code was written in the Python programming language and has been documented and tested thoroughly. The original developments of our new fitting framework were presented in Ref. [11]. The main differences between the NNPDF3.1 and NNPDF4.0 codes are summarized in Table 7.

3.2.1 General structure

A schematic representation of the NNPDF4.0 fitting framework is displayed in Fig. 3. The fit requires three main inputs, which are managed by the NNPDF framework as discussed in Ref. [31]: first, theoretical calculations of physical processes, which are encoded in precomputed tables (FK-tables, see below) possibly supplemented by QCD and EW K-factors. Second, experimental data provided in a common format, including fully correlated uncertainties encoded in a covariance matrix (possibly also including theoretical uncertainties). Third, hyperparameter settings that determine the particular fitting methodology adopted, determined through a hyperoptimization procedure as discussed below. The neural network optimization algorithm, with settings determined by the hyperparameters, finds the best fit of predictions to data by minimizing a figure of merit whose computation is illustrated in Fig. 4. Following a post-fit selection, where outliers with insufficient quality are discarded, the final PDFs are stored in LHAPDF grid format so that they are readily available for use.

Fig. 3 Diagrammatic representation of the NNPDF fitting framework. The blue box contains the minimization of the \(\chi ^2\) figure of merit, whose computation is illustrated in Fig. 4

Fig. 4 Diagrammatic representation of the calculation of the \(\chi ^{2}\) in the NNPDF fitting framework as a function of the values of \(\{x_n^{(k)}\}\) for the different datasets. Each block indicates an independent component

3.2.2 Evaluation of cross-sections and cost function

Figure 4 illustrates the structure of the part of the NNPDF4.0 fitting code that evaluates the physical observables in terms of the input PDFs and then computes the associated figure of merit to be used for the fitting. This is at the core of the minimization procedure, indicated by a blue box in Fig. 3. Starting from a matrix of momentum fraction x values, \(\{x_n^{(k)}\}\), the code first evaluates the neural network and the preprocessing factors to construct unnormalized PDFs which are then normalized according to Eqs. (3.6, 3.8) in order to produce the PDFs at the input scale,

$$\begin{aligned} f_{jn}^{(k)} \equiv f_{j}\left( x_{n}^{(k)},Q_0\right) \,, \end{aligned}$$

where j, n, and k label the PDF flavor, the experimental dataset, and the node in the corresponding x-grid respectively. These PDFs are those listed in Eqs. (3.3) and (3.4) in the evolution and flavor bases respectively, and are related to the neural network output by Eq. (3.5).

The input scale PDFs are convoluted with partonic scattering cross-sections (including perturbative QCD evolution); these are encoded in precomputed grids called FK-tables (see Refs. [118, 189]) resulting in the corresponding physical observables \(\{\mathcal {O}_n\}\). Observables are split into a training and a validation set and cost functions \(\chi ^2_\mathrm{tr}\) and \(\chi ^2_\mathrm{val}\) are computed for each set. The \(\chi ^2\) is defined as in previous NNPDF determinations, and in particular it uses the \(t_0\) method [190] for the computation of multiplicative uncertainties.
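
The contraction of input-scale PDFs with an FK-table and the subsequent evaluation of a \(\chi ^2\) can be sketched as follows. This is only a schematic illustration under several assumptions: the observable is taken to be linear in the PDFs (as for deep-inelastic scattering; hadronic observables are bilinear and are omitted here), all data points share one x-grid, the FK-table for each data point is stored as a flavor-by-node matrix, and the \(t_0\) prescription is not implemented. All names and numbers are invented for the example.

```python
def linear_observable(fk_table, pdf_grid):
    """O = sum over flavors j and x-nodes k of FK[j][k] * f[j][k]."""
    return sum(
        fk_jk * f_jk
        for fk_row, f_row in zip(fk_table, pdf_grid)
        for fk_jk, f_jk in zip(fk_row, f_row)
    )


def chi2(data, theory, inv_cov):
    """chi2 = (D - T)^T C^{-1} (D - T) for a given inverse covariance matrix."""
    diff = [d - t for d, t in zip(data, theory)]
    n = len(diff)
    return sum(diff[a] * inv_cov[a][b] * diff[b] for a in range(n) for b in range(n))


# Toy inputs: 2 flavors on a 3-node x-grid, 2 data points (all numbers invented).
pdf_grid = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
fk_tables = [
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],  # FK-table for data point 0
    [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0]],  # FK-table for data point 1
]
theory = [linear_observable(fk, pdf_grid) for fk in fk_tables]
data = [2.5, 5.0]
inv_cov = [[4.0, 0.0], [0.0, 1.0]]

print(theory)                       # [2.0, 5.0]
print(chi2(data, theory, inv_cov))  # 1.0
```

Because the whole chain is a sequence of sums and products, it is differentiable with respect to the PDF values, which is what allows a gradient-based optimizer to be used.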

Note that each block in Fig. 4 is fully independent, so that its settings can be modified or the whole block can be replaced as required. This characterizes the modular structure of the code. For instance, the block “Neural Net” implements by default the neural network which after hyperoptimization has the architecture displayed in Fig. 11, but it could be replaced by any other parametrization, even by a quantum circuit [191] based on the QIBO library [192]. Similarly, the \(\chi ^2\) with \(t_0\) uncertainties could be replaced by any other cost function.

3.2.3 Optimization strategy

Previous NNPDF determinations used stochastic algorithms for the training of neural networks, and in particular in NNPDF3.1 nodal genetic algorithms were used. Stochastic minimization algorithms are less prone to end up trapped in local minima, but are generally less efficient than deterministic minimization techniques, such as backpropagation combined with stochastic gradient descent (SGD). In the approach adopted here [11], the optimizer is just another modular component of the code, to be chosen through a hyperoptimization as we discuss shortly. The algorithms that we consider are SGD algorithms implemented in the TensorFlow [193] package. Restricting to gradient descent algorithms ensures greater efficiency, while the use of hyperoptimization guards against the risk of missing the true minimum or overfitting. The TensorFlow library provides automatic differentiation capabilities, which enable the use of arbitrarily complex network architectures without having to provide analytical expressions for their gradients. However, the whole convolution between input PDFs and FK-tables, indicated in Fig. 4 between brackets, needs to be provided to the optimization library in order to use gradient-based algorithms. The specific SGD optimizer and its settings are determined via the hyperoptimization procedure described in Sect. 3.3. In comparison to the genetic algorithms used in previous NNPDF releases, the hyperoptimized SGD-based optimizers improve both replica stability and computational efficiency, as we demonstrate in Sect. 3.4 below.

3.2.4 Stopping criterion and post-fit selection

As in previous NNPDF releases, a cross-validation method is used in order to avoid overfitting, which could lead the neural networks to learn noise (such as statistical fluctuations) in the data, rather than the underlying law. This is done through the patience algorithm shown diagrammatically in Fig. 5. This algorithm is based on the look-back cross-validation stopping method [14], whereby the optimal length of the fit is determined by the absolute minimum of \(\chi ^2_\mathrm{val}\) evaluated over a sufficiently large number of iterations of the minimizer. Specifically, the stopping algorithm keeps track of the training step with the lowest \(\chi ^2_\mathrm{val}\), and as soon as this value does not improve for a given number of steps (set equal to a percentage of the maximum number of training epochs), the fit is finalized.
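
The logic of the patience algorithm described above can be sketched as follows. This is a minimal sketch that operates on a pre-recorded list of validation \(\chi ^2\) values rather than on a live training loop; the function name, its signature, and the toy history are assumptions made for illustration, and the additional positivity requirement discussed below is not included.

```python
def patience_stopping(chi2_val_history, patience_steps):
    """Return (best_step, stop_step) for a sequence of validation chi2 values.

    The fit stops as soon as the best validation chi2 has not improved
    for `patience_steps` consecutive steps, or when the history ends.
    """
    best_step = 0
    for step, value in enumerate(chi2_val_history):
        if value < chi2_val_history[best_step]:
            best_step = step  # new look-back point
        elif step - best_step >= patience_steps:
            return best_step, step  # patience exhausted
    return best_step, len(chi2_val_history) - 1  # maximum length reached


# Patience expressed as a fraction of the maximum number of epochs (toy numbers).
max_epochs = 30
patience = int(0.1 * max_epochs)  # 3 steps

history = [5.0, 4.0, 3.0, 3.2, 3.1, 3.5, 2.0]
print(patience_stopping(history, patience))  # (2, 5)
print(patience_stopping(history, 10))        # (6, 6)
```

With a short patience the fit stops at step 5 and the parameters from step 2 are retained, so the later improvement at step 6 is never seen; with a long patience the full history is scanned, which corresponds to the previous behavior of monitoring \(\chi ^2_\mathrm{val}\) up to the maximum number of iterations.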

There are three main differences between the stopping criterion used in NNPDF4.0 and that of its predecessor used for NNPDF3.1. First, the patience parameter is hyperoptimized, while previously it was set to be infinity, i.e., the values of \(\chi ^2_\mathrm{val}\) were monitored until the maximum number of iterations was reached. Second, the percentage of data that enters the training set has been increased to 75% for all datasets. This is motivated by the observation that the current dataset is so wide that even with just 25% validation overlearning does not occur in practice. In fact, even with the previous NNPDF3.0 dataset it was observed in the framework of closure testing in Ref. [14] that larger training fractions lead to essentially equivalent results. The faithfulness of results found with this training fraction will be confirmed by closure test studies in Sect. 6 below. Third, the stopping algorithm now also tracks the positivity requirement so that a fit cannot stop if the positivity condition is not satisfied. Instead in NNPDF3.1 replicas which were not fulfilling positivity could be generated and had to be discarded a posteriori. This is now done by verifying that the penalty term of Eq. (3.10) remains below the threshold value \(10^{-6}\) (numerically zero).

Once the optimal stopping point for a given fit has been identified, the same post-fit quality checks that were imposed in NNPDF3.1 are still enforced. Specifically, we remove replicas with too large \(\chi ^2\) values or with too large arc-lengths: in both cases, defined as replicas outside the \(4\sigma \) interval of their distribution. The post-fit selection algorithm also removes replicas that do not satisfy either the positivity or the integrability conditions. Imposing positivity and integrability constraints through post-fit selection has the consequence of making the fit results independent of the way the constraints are imposed: for instance, a looser constraint will simply have the effect of increasing the number of replicas that are discarded.
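
A schematic version of the \(4\sigma \) outlier cut is given below. It is a sketch under stated assumptions: the cut is applied one-sidedly to a single quantity (the \(\chi ^2\) or the arc-length), the population standard deviation is used, and the selection is applied once rather than iterated. The function name and the toy values are hypothetical.

```python
import statistics


def select_replicas(values, n_sigma=4.0):
    """Return indices of replicas whose value is not above mean + n_sigma * std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [i for i, v in enumerate(values) if v <= mean + n_sigma * std]


# Toy chi2 values: 30 well-behaved replicas and one clear outlier.
chi2_values = [1.0] * 30 + [100.0]
kept = select_replicas(chi2_values)
print(len(kept))  # 30
```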

It is interesting to note that while previously on average around 30% of the fitted replicas were discarded upon applying these criteria, in NNPDF4.0 this fraction has been reduced to around 1%. This improvement is largely the result of the improved handling of these constraints during the fit as well as of the higher stability of the new SGD-based optimization strategy, which results in smoother PDFs with fewer outliers.

3.3 Hyperparameter optimization

Hyperoptimization is at the heart of the construction of the NNPDF4.0 methodology. In brief, hyperoptimization selects the methodology, just like gradient descent selects the values of weights and thresholds of the neural net. The k-folding method, to be discussed below, ensures that a proper fitting (i.e. not over- or under-fitting methodology) is arrived at, just like cross-validation achieves the same goal for neural network training.

Indeed, the optimization procedure (neural network training) described in Sect. 3.2 requires as input a number of methodological choices, such as the neural network architecture, the training rate, and the specific SGD variant to be used. We can view these choices as the set of hyperparameters that defines a specific fitting strategy. While in many ML studies (including previous NNPDF determinations) these hyperparameters are determined by trial and error, here we implement an automated algorithmic procedure to scan the space of hyperparameters and determine the optimal configuration according to a figure of merit.

In this work, the implementation of the hyperparameter scan is based on the hyperopt library [194], which uses a Bayesian optimization algorithm [195] to identify the best configuration.

Fig. 5 Flowchart describing the patience algorithm used in NNPDF4.0 to determine the optimal length of the fit based on the look-back cross-validation stopping method

Fig. 6 Graphical representation of the hyperoptimization loss function L corresponding to a subset of the hyperparameters in a scan based on 1500 configurations

In order to visualize a typical output of a hyperparameter scan, we show in Fig. 6 the result of a scan based on 1500 independent configurations. We display the hyperoptimization loss function L (figure of merit), to be defined below, for a representative subset of hyperparameters: the depth of the network, the algorithm for the initialization of the network weights, the learning rate and the SGD optimizer variant. The smaller the value of the loss function L, the better this specific point is in the hyperparameter space. The full list of hyperparameters is given in Table 9. Note that here we only display the outcome of hyperparameter configurations that satisfy the post-fit selection cuts. The shape of the reconstructed probability distributions provides an indication of the stability of the results, with a wider distribution corresponding to a higher stability with respect to this specific hyperparameter.

In the specific case of the number of hidden layers of the network, one observes that the hyperoptimization algorithm identifies that it cannot further improve the figure of merit with one single layer, and accordingly it tests more configurations with two and three layers. The hyperparameter configurations corresponding to two and three layers appear to be equivalent in terms of the loss L, with a slightly better stability towards lower values in the two-layer case. No clear preference for a specific SGD variant is observed.

3.3.1 Figure of merit and stability

The complex interplay between hyperparameters indicates that a judicious choice of the figure of merit L is crucial for the success of the hyperoptimization procedure. This figure of merit must relate to the quality of the fit: a possible choice would be setting the hyperoptimization loss to the validation \(\chi ^2\), that is, \(L=\chi ^{2}_\text {val}\). However, this quantity is already used in the stopping algorithm (Fig. 5) and hence using it may lead to hyperparameter configurations prone to overfitting [11] (“Goodhart’s law”, see Ref. [196]). Rather, we define the loss L through a k-fold cross-validation method [197].

Fig. 7 Diagrammatic representation of the k-fold algorithm used for the hyperparameter optimization

Fig. 8 Comparison between the gluon (left) and antidown (right) PDFs at \(Q=1.65\) GeV found by using methodologies in which hyperparameters are selected based on the “average” loss function Eq. (3.18) (green) or the “max” loss function Eq. (3.20) (orange)

Table 8 The four folds in which the NNPDF4.0 dataset is divided for the k-fold hyperoptimization procedure represented in Fig. 7

A diagrammatic representation of the k-fold algorithm used for the hyperparameter optimization is displayed in Fig. 7. The hyperopt library generates a large number of hyperparameter configurations, and each of these is then used to produce fits to subsets of the experimental data. Specifically, for each point in the hyperparameter space we run \(n_\text {fold}\) fits to the central experimental data, where \(n_\text {fold}\) is the number of sets (folds) in which the data are being divided. We run a single fit to central data, rather than the standard set of around 100 replicas, because we prefer to scan over a very large number of hyperparameters, and fitting many replicas in each case would be computationally too intensive. In each of these \(n_\text {fold}\) fits, the k-th fold is left out; the remaining folds are combined in a dataset which is then separated into training and validation in the usual way, such that the patience stopping of Fig. 5 can be tested.

The loss figure of merit L is then defined as the average over folds of the \(\chi ^2\) for the k-th fold evaluated with the PDFs obtained in the k-th fit, in which this specific fold was left out, dubbed \(\chi ^2_k\) as illustrated in Fig. 7; that is

$$\begin{aligned} L = {{1}\over {n_\text {fold}}} \displaystyle \sum ^{n_\text {fold}}_{k=1} \chi _{k}^2 \, . \end{aligned}$$

We use the \(n_\mathrm{fold}=4\) folds defined in Table 8. These are chosen in such a way that each fold is representative of the global dataset, both in terms of process type and kinematic coverage. The optimal hyperparameter set \({\varvec{ \hat{\theta }} }\) is then selected as the one that produces the lowest average loss computed using Eq. (3.18),

$$\begin{aligned} \varvec{\hat{\theta }} = \underset{\varvec{\theta } \in {\varvec{\Theta }}}{\text {arg min}}\left( {{1}\over {n_\text {fold}}} \displaystyle \sum ^{n_\text {fold}}_{k=1} \chi _{k}^2({\varvec{\theta }}) \right) . \end{aligned}$$

We note that other choices of the loss function would be possible, such as

$$\begin{aligned} L = \mathrm{max}\left( \chi _{1}^2, \chi _{2}^2, \chi _{3}^2,\ldots , \chi _{n_\mathrm{fold}}^2 \right) , \end{aligned}$$

namely, the maximum value of \(\chi _{k}^2\) evaluated over the \(n_\mathrm{fold}\) folds. We checked that results obtained with either choice are completely equivalent. In Fig. 8 we compare PDFs obtained by methodologies found by hyperoptimizing either with the “average” loss function of Eq. (3.18), or the “max” loss function of Eq. (3.20). The final hyperparameter values found in either case are provided in Table 9. It is clear that these final setups are quite different, yet the PDFs found with either methodology are indistinguishable. The fact that different choices for the hyperopt loss function L result in rather different hyperparameter configurations that still produce indistinguishable PDFs demonstrates the stability of our methodology with respect to variations of the hyperoptimization procedure.
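
The two loss definitions, Eqs. (3.18) and (3.20), and the selection of Eq. (3.19) can be illustrated with a short Python sketch. The configuration names and the per-fold \(\chi ^2_k\) values below are invented for the example; in an actual scan each entry would come from the \(n_\text {fold}\) fits described above.

```python
def average_loss(chi2_folds):
    """'Average' loss: mean of the left-out-fold chi2 values."""
    return sum(chi2_folds) / len(chi2_folds)


def max_loss(chi2_folds):
    """'Max' loss: worst left-out-fold chi2."""
    return max(chi2_folds)


def best_configuration(results, loss):
    """Arg min of the chosen loss over the scanned configurations."""
    return min(results, key=lambda name: loss(results[name]))


# Invented chi2_k values for three hypothetical configurations and four folds.
results = {
    "config_a": [1.2, 1.3, 1.1, 1.4],
    "config_b": [1.0, 1.0, 1.0, 1.9],
    "config_c": [1.5, 1.5, 1.5, 1.5],
}

print(best_configuration(results, average_loss))  # config_b
print(best_configuration(results, max_loss))      # config_a
```

In this toy example the two losses select different configurations, since the "max" loss penalizes a single badly described fold more strongly than the "average" loss does.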

3.3.2 Hyperparameter correlation

An important motivation for the automated hyperparameter optimization procedure is the fact that the best value for a single hyperparameter cannot be determined independently of all the others, since there is a high degree of correlation between them. For instance, each variant of the SGD optimizer will have a different optimal value of the learning rate. We illustrate this interdependence with a specific hyperparameter, the clipnorm parameter of TensorFlow optimizers, for which a wrong choice can lead to significant overfitting even when all other hyperparameters are optimized. This parameter specifies the value at which to clip the norm of the gradient during a gradient descent step. That is, if the norm of the gradient at a given epoch is larger than the value of the clipnorm parameter, it will be rescaled such that the norm of the gradient used to update the neural network parameters has the clipnorm value.
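
The rescaling performed by gradient-norm clipping can be written explicitly as below. This is a plain-Python sketch of the operation described in the text, not the TensorFlow implementation; treating the gradient as a single flat vector is a simplifying assumption.

```python
import math


def clip_by_norm(gradient, clipnorm):
    """Rescale `gradient` so that its Euclidean norm does not exceed `clipnorm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clipnorm:
        return list(gradient)  # small gradients are left untouched
    scale = clipnorm / norm
    return [g * scale for g in gradient]  # direction kept, norm set to clipnorm


print(clip_by_norm([0.3, 0.4], 1.0))  # unchanged: the norm is 0.5
print(clip_by_norm([3.0, 4.0], 1.0))  # rescaled from norm 5 to norm 1
```

Only the length of the update step is affected, not its direction, which is why the optimal value of clipnorm is tied to the learning rate and to the optimizer variant.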

The choice of clipnorm will affect the results of the optimization algorithm: if it is too small it can prevent convergence, while if it is too large the training will be unstable, often leading to overfitting. In Fig. 9 we compare the strange PDF \(xs(x,Q)\) at \(Q=1.7\) GeV in the large-x region for two variants of the NNPDF4.0 fit. In the first one, all the hyperparameters listed in Table 9 enter the hyperopt procedure, while in the second clipnorm is excluded and fixed by hand to an arbitrary value. While the two resulting hyperparameter configurations lead to similar values of the optimization figure of merit, the PDFs obtained in the latter case display undesirable overfitting behavior. This comparison illustrates the importance of including all relevant hyperparameters in the automated optimization.

Fig. 9 Comparison between the results for the strange PDF at large x in two fits, one with all hyperparameters optimized and another where the clipnorm one is not hyperoptimized

Table 9 The baseline hyperparameter configuration (left) selected using the k-folds hyperoptimization procedure with hyperoptimization loss Eq. (3.19) and used to perform the NNPDF4.0 fits in the evolution basis. We also show a configuration selected using the alternative hyperoptimization loss Eq. (3.20) (center) and the hyperparameter configuration employed to perform fits in the flavor basis, Eq. (3.3) (right)

3.3.3 Baseline hyperparameters for NNPDF4.0

We have performed a k-folding hyperoptimization, as described above, and we have determined the best values of the hyperparameters that will be used for the NNPDF4.0 determination. These are listed in Table 9. The hyperparameters include the network architecture, the type of activation function, the Glorot-type [198] initializer, the optimizer, the values of the learning rate and of clipnorm, the maximum number of iterations and the stopping patience, and the initial values of the Lagrange multipliers for the PDF positivity and integrability constraints. The ranges of the hyperparameters that are sampled by the hyperoptimization algorithm are chosen empirically: we start out conservatively with very wide ranges, and once we are confident that the optimal value of a given hyperparameter falls within a sub-domain of this (conservative) range, we adjust the sampled domain accordingly to limit the runtime and computational resources of the hyperparameter scan.

In Table 9 we show the optimal hyperparameters for our default methodology, based on the evolution basis and the hyperoptimization loss defined in Eq. (3.19), as well as the hyperparameter values obtained with the different choice of loss function Eq. (3.20), or with the same loss function but in the flavor basis. As mentioned, both a different choice of loss function (see Fig. 8) and a different choice of basis (see Sect. 8.4 below) lead to equivalent results, but the corresponding hyperparameter values can be quite different. For instance, the optimal architecture for fits based on the alternative loss function Eq. (3.20) has more than twice the number of neurons in the hidden layers compared to the baseline settings.

We now specifically discuss the hyperoptimization and its results for our default choice. Concerning the network architecture, until NNPDF3.1, each PDF was parametrized with an individual neural network. While the number of independently parametrized PDFs was gradually increased, this remained unchanged since NNPDF1.0 [199]. Now the hyperoptimization scan is run with a single network which outputs the value of all PDFs. So while in all NNPDF fits up to and including NNPDF3.1 \(\mathrm{NN}_k(x; {\varvec{\theta }})\) in Eq. (3.5) denotes the k-th neural network, in NNPDF4.0 it indicates the activation state of the k-th neuron in the last layer of the neural net. The architecture used in all previous NNPDF releases, namely 2-5-3-1 with sigmoid activation functions and a last linear layer, is depicted in Fig. 10. The architecture selected by the hyperoptimization is 2-25-20-8 with hyperbolic tangent activation functions except for the final linear layer, and it is shown in Fig. 11.

The NNPDF4.0 architecture has 763 free parameters, to be compared to a total of 296 parameters for the NNPDF3.1 neural nets. We emphasize however that a larger network does not necessarily imply better performance, and that for a given dataset there exists a lower bound to the number of required free network parameters but probably not an upper one. Given comparable performance, smaller networks are preferred in order to reduce the computational costs.
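
The parameter counts quoted above follow directly from the layer sizes, counting one weight per connection and one bias (threshold) per non-input neuron. The short sketch below reproduces them; the factor of eight for NNPDF3.1 is the number of separate 2-5-3-1 networks implied by the quoted total, and the function name is ours.

```python
def count_parameters(layers):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))


nnpdf40 = count_parameters([2, 25, 20, 8])   # single network with 8 outputs
nnpdf31 = 8 * count_parameters([2, 5, 3, 1])  # 8 independent networks

print(nnpdf40)  # 763
print(nnpdf31)  # 296
```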

Fig. 10 The neural network architecture adopted in all previous NNPDF determinations up to NNPDF3.1. Each independent PDF combination is parametrized by a separate neural network, all sharing a common architecture

Fig. 11 The neural network architecture adopted for NNPDF4.0. A single network is used, whose eight output values are the PDFs in the evolution basis (red box) or the flavor basis (blue box). The architecture displayed corresponds to the optimal choice in the evolution basis; the optimal architecture in the flavor basis is different, as indicated in Table 9

The differences between the optimizer variants are quite subtle. While all optimizers exhibit a reasonable performance, it is also found that after hyperoptimization Nadam results in lower absolute losses L than the other optimizers, while also appearing to be more stable. This further illustrates the benefits of hyperoptimization. Indeed, separately, the stability and general performance of all optimizers is quite similar, as can be seen in Fig. 6. This is something one might have also found by trial and error. However, a Nadam configuration that outperforms the other optimizers can be found thanks to the simultaneous sampling of different hyperparameters. This is something that cannot be concluded based on visual inspection of Fig. 6 and that would have been very difficult to establish by trial and error. It is supported by the fact that the top of the ranking of setups with the smallest losses is dominated by setups that use the Nadam optimizer.

3.3.4 Hyperoptimization stability

The main goal of the hyperoptimization procedure is to identify the best optimization settings for the current problem of determining the PDFs. This raises the question of deciding in which cases a new hyperoptimization would be required. Our current understanding encompasses changes to the experimental data, the theoretical description, and methodological choices (such as the choice of PDF basis).

We have checked that the procedure is quite stable upon reasonably small changes of the dataset. For instance, the appraisal and selection of the final dataset, see Sect. 4 below, did not require any new hyperoptimization. In fact, the datasets included in Table 8 do not correspond exactly to the datasets included in the final dataset, since the final appraisal of the data to be included was performed after the methodology was set. Furthermore, when removing datasets the given methodology remains viable, though in principle there might be a computationally more efficient one giving the same results for the small datasets. This will be seen explicitly in the context of “future tests” in Sect. 6.2 below. Of course in principle the only way of being absolutely certain whether a new hyperoptimization is needed or not is to actually perform it.

On the other hand, a substantial change in methodology or dataset generally needs a new hyperoptimization. This is illustrated by the fact (see Table 9) that the optimal settings for fitting in the flavor basis differ substantially from those of the evolution basis. Likewise, the addition of a large number of new datasets affecting kinematic regions or PDF combinations for which currently there is little or no information might have an impact on the fit sufficient to warrant a new run of the hyperoptimization procedure.

The open source NNPDF4.0 fitting framework released with this paper includes all necessary tools to carry out an automatic scan of hyperparameters, which means it can be readily used in situations which are very different from the specific scenario considered in this work, be it in terms of the experimental data available or the theoretical framework being considered.

3.4 Performance and quality benchmarks

The new NNPDF fitting framework features a significantly improved computational performance compared to previous NNPDF codes. This improvement is mostly driven by the availability of the gradient-based optimizers provided by the TensorFlow library, combined with the dedicated hyperparameter optimization and other technical improvements in key parts of the code. Furthermore, the new fitting framework is able to take advantage of Graphics Processing Units (GPUs), which, when available, can further improve speed (although currently setting the same training and validation split for all replicas is needed for optimal performance).

Table 10 The average fitting time per replica, speed up factor (as compared to the NNPDF3.1 performance), and the RAM requirements in global PDF fits based on the NNPDF3.1 and NNPDF4.0 frameworks for the same input dataset. In the NNPDF4.0 case, we compare the performance obtained on CPUs with that on GPUs

To quantify the performance of the new fitting code, in Table 10 we show the average fitting time per replica in PDF fits based on the NNPDF3.1 and NNPDF4.0 fitting frameworks. The same global input dataset is used in both cases, in order to ensure a consistent comparison. In the case of NNPDF4.0, we compare the performance of running the code either on CPUs or on GPUs. These benchmark tests have been carried out on an Intel(R) Core(TM) i7-4770 at 3.40GHz CPU and on an NVIDIA Titan V GPU.

The comparisons in Table 10 show that, while in NNPDF3.1 the typical fitting time per Monte Carlo replica was around 15 hours, in NNPDF4.0 this has been reduced on average by a factor of 24 (down to around 40 minutes) when running on CPUs, and by a factor of 140 (down to 7 minutes) when running on GPUs. This implies that, in the same time that it takes to run 100 replicas of NNPDF3.1, one can now run 2400 replicas of NNPDF4.0 or, alternatively, 24 variations (with different datasets or theory settings) of the same 100 NNPDF4.0 replicas. The enhanced performance of NNPDF4.0 is essential for the implementation of the hyperoptimization program: one can only explore thousands of different hyperparameter configurations if the fits are fast enough. Furthermore, we note that this significant increase in speed greatly facilitates several physics applications, from the \(\alpha _s(m_Z)\) determination [138] to the simultaneous fits of PDFs and EFT Wilson coefficients [200, 201], which rely on producing a sufficiently large sample of replicas.
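The arithmetic behind these figures can be checked with a short sketch. It uses only the rounded numbers quoted in the text (15 hours per NNPDF3.1 replica and the two speed-up factors); the function and variable names are illustrative and are not part of the NNPDF code.

```python
# Rounded figures quoted in the text; all names here are illustrative.
NNPDF31_HOURS_PER_REPLICA = 15.0
SPEEDUP = {"cpu": 24, "gpu": 140}


def minutes_per_replica(speedup: float) -> float:
    """Average NNPDF4.0 fitting time per replica, in minutes."""
    return NNPDF31_HOURS_PER_REPLICA * 60.0 / speedup


def replicas_in_same_time(n_old_replicas: int, speedup: float) -> int:
    """NNPDF4.0 replicas that fit in the time taken by n_old_replicas NNPDF3.1 replicas."""
    return int(n_old_replicas * speedup)


for device, factor in SPEEDUP.items():
    print(f"{device}: {minutes_per_replica(factor):.1f} min per replica")
print(replicas_in_same_time(100, SPEEDUP["cpu"]), "replicas on CPU")
```

Within rounding, the results agree with the "around 40 minutes" and "7 minutes" quoted above, and with the 2400 replicas obtained on CPUs.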

From Table 10 one can also observe that this increase in speed comes at the cost of a greater RAM consumption, by around a factor of four. These demanding requirements arise because the code needs to hold in memory not only the FK-tables (as was already the case in NNPDF3.1) but also the \(\chi ^2\) gradients used for the minimization, which were not stored before. While this increase in memory may appear limiting, we note that the FK-tables and the functional form of the gradient can be shared between Monte Carlo replicas running simultaneously on the same processor. This makes it possible to run a large number of replicas in parallel on a GPU, and is the main reason for the reduction of the average fit time per replica reported in Table 10.

In addition to the improved computational performance, the new framework underlying the NNPDF4.0 fits exhibits other benefits that positively impact the actual outcome of the global fit. To illustrate these, Fig. 12 compares the distribution over replicas of the training lengths, defined as the optimal stopping point of each replica, between fits based on the NNPDF3.1 and NNPDF4.0 methodologies for a common dataset. While the number of iterations of the two different optimization algorithms is not directly comparable, it is interesting to note that the rightmost bin of the distribution is populated by the replicas whose stopping point is determined by the maximum number of iterations, rather than by satisfying the look-back cross-validation stopping condition. These are thus replicas for which full convergence has not been reached. The fact that replica training does stop through cross-validation is what guarantees that the \(\chi ^2\) minimization is sufficiently accurate to actually determine the optimal fit.

From this comparison one finds that in NNPDF3.1, based on nodal genetic algorithms, around half of the replicas stop at the maximum number of generations, while for the SGD-based NNPDF4.0 fit this fraction is much smaller, around 15%. This observation implies that while in NNPDF3.1 many replicas might stop before proper training has been achieved, and may be affected by underlearning, this issue is much less severe in NNPDF4.0. Indeed, now 85% of the replicas stop when the optimal stopping point has been identified by the look-back cross-validation algorithm. One can therefore expect a reduction in the PDF uncertainties thanks to the new methodology, given that the fraction of replicas with potential underlearning is markedly reduced, leading to overall smoother and more similar replicas. We will study in more detail in Sect. 8 the impact at the PDF level of the new methodology.

Fig. 12 Distribution of training lengths, defined by the optimal stopping point of each replica, in fits to a common global dataset based on the NNPDF3.1 (left panel) and NNPDF4.0 (right panel) methodologies

Similar considerations can be drawn from Fig. 13, which compares scatter plots with the values of \(\chi ^2_\mathrm{tr}\) and \(\chi ^2_\mathrm{val}\) for the \(N_\mathrm{rep}=100\) replicas between fits based on the NNPDF3.1 and NNPDF4.0 methodologies and the same global dataset. In these plots, the red square indicates the position of the mean value over the replicas, and a dashed line with unit slope is added in order to facilitate visualization. Note that \(\chi ^2_\mathrm{val}\) is expected to be (on average) somewhat higher than \(\chi ^2_\mathrm{tr}\) given that validation data are not used for the optimization.

Fig. 13 Comparison of the values of the training and validation \(\chi ^2\) for each replica between the NNPDF3.1 and NNPDF4.0 methodologies, when fitting a common dataset. The red square indicates the mean value over the replicas

From this comparison, one can see that the spread in the values of \(\chi ^2_\mathrm{tr}\) and \(\chi ^2_\mathrm{val}\) is reduced when going from NNPDF3.1 to NNPDF4.0. Furthermore, in the latter case there are no outliers, while this is not the case in the NNPDF3.1-like fits. Also, for NNPDF4.0 around one quarter of the replicas have \(\chi ^2_\mathrm{val}<\chi ^2_\mathrm{tr} \), which is another indicator of proper training and stopping. This fraction is smaller in NNPDF3.1, again possibly signaling underlearning in some replicas.

All in all, the results presented here indicate that the methodological improvements introduced in NNPDF4.0 not only lead to a significant improvement in terms of computational performance, but also to a more robust procedure where proper training is achieved for the majority of neural network replicas.

4 Determination of the baseline dataset

We discuss the selection criteria that we adopt to construct the NNPDF4.0 baseline dataset from the datasets described in Sect. 2. This baseline dataset will be used in all of the fits presented in the sequel. In previous PDF determinations, ad-hoc dataset selection criteria have often been applied. Here we strive to use objective criteria, not only for imposing kinematic cuts (which is standard), but also in order to select an optimal dataset for PDF determination out of the global dataset. We explain, in turn, our choice of kinematic cuts, our procedure to determine whether a measurement is to be included in the baseline dataset or not, and our selection of jet datasets, which deserve a separate treatment due to the need to choose the optimal observable.

4.1 Kinematic cuts

As in previous NNPDF analyses, kinematic cuts are imposed to ensure that we include only the data for which reliable predictions can be computed with fixed-order, pure QCD theory. In NNPDF3.1, see specifically Sect. 2 in [5], all the data points for which NNLO QCD corrections exceeded the corresponding experimental uncertainties were removed from the NLO fit. Likewise, all the data points for which electroweak (EW) corrections exceeded experimental uncertainties were removed from the NLO and NNLO fits. Additional cuts were also imposed on individual datasets on the basis of specific considerations. In the NNPDF4.0 analysis, kinematic cuts are determined on the ground of similar guiding principles, which we systematize as follows.

For the NLO fit, we discard data points that are subject to excessively large corrections: specifically, we compute, for each data point, the ratio of the absolute difference between the NNLO and NLO predictions to the experimental uncertainty. If this quantity is smaller than a given threshold value, the data point is retained in the NLO fit, otherwise it is discarded. We examined two alternative values of the threshold, 1 and 2 respectively. We concluded that a value of 1 is unnecessarily aggressive, as it leads to discarding an excessive number of data points from the NLO fit, while a value of 2 ensures that a reasonable number of data points are retained in the fit with reasonable theoretical accuracy. We therefore use 2 as our default threshold value. On the other hand, we do not include in the NNLO fits the data points for which NNLO theory is not available. This is the case for the \(W+c\) production measurements listed in Table 5. In this case, the full NNLO corrections to the dominant CKM-diagonal contribution have been recently computed in Ref. [87]. However, the computation of Ref. [87] uses the flavor \(k_{\perp }\) algorithm, which is not used in the experimental measurement, thus the NNLO corrections cannot be implemented yet in a PDF fit.

The results of Ref. [16] allow for a more refined analysis of cuts motivated by electroweak effects than what was possible in NNPDF3.1. We can now evaluate EW and mixed QCD+EW corrections in a systematic and consistent way for all hadronic processes included in a PDF fit, by taking advantage of the recent automation of these computations in mg5_aMC [124], and using fast-interpolation grids with matching accuracy in the electroweak and strong couplings produced using PineAPPL [16]. We use the NNPDF3.1QED set [202] for the photon PDF [16]. We then exclude from the NLO and NNLO fits all data points for which the difference between the pure NLO QCD calculation and the full NLO QCD+EW computation (which includes the mixed corrections) exceeds the size of the experimental uncertainty. This strategy will also be used to investigate phenomenological implications of the NNPDF4.0 PDF sets in Sect. 9.
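The two per-point criteria described above (the NNLO/NLO cut applied in the NLO fit and the electroweak cut applied in the NLO and NNLO fits) can be sketched as follows. This is a minimal illustration rather than the implementation used in the NNPDF code: the function names and the numerical inputs are hypothetical, and predictions and uncertainties are assumed to be given in the same units.

```python
def passes_nlo_cut(nlo: float, nnlo: float, sigma_exp: float,
                   threshold: float = 2.0) -> bool:
    """Retain a point in the NLO fit if |NNLO - NLO| / sigma_exp is below the threshold."""
    return abs(nnlo - nlo) / sigma_exp < threshold


def passes_ew_cut(nlo_qcd: float, nlo_qcd_ew: float, sigma_exp: float) -> bool:
    """Retain a point if the EW (including mixed) correction does not exceed sigma_exp."""
    return abs(nlo_qcd_ew - nlo_qcd) <= sigma_exp


# Hypothetical data point: predictions and experimental uncertainty in the same units.
print(passes_nlo_cut(nlo=10.0, nnlo=10.6, sigma_exp=0.4))        # ratio 1.5: retained
print(passes_nlo_cut(nlo=10.0, nnlo=11.0, sigma_exp=0.4))        # ratio 2.5: discarded
print(passes_ew_cut(nlo_qcd=10.0, nlo_qcd_ew=9.5, sigma_exp=0.4))  # shift 0.5: discarded
```

With the default threshold of 2, the first point is retained and the second is discarded; with a threshold of 1 both would be discarded, which illustrates why the stricter value removes more data.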

Table 11 The set of kinematic cuts applied to the datasets considered in the NNPDF4.0 PDF determination for the NLO and NNLO fits. The kinematic cuts used in the LO fit are the same as in the NLO fit. Only the data points that satisfy the constraints listed in the table are retained. The cut on the HERA I+II \(\sigma _\mathrm{NC}^{c}\) dataset at NNLO is applied, in addition to the other cuts for DIS measurements, only when the charm PDF is independently parametrized

Additional kinematic cuts are implemented for specific datasets, as summarized in Table 11. For datasets already included in NNPDF3.1, these are the same as in that analysis, see Sect. 2 in [5]. For new datasets, these follow from similar considerations. We summarize here the motivations. For DIS measurements the cuts remove the low-energy (\(Q^2\)) region, where perturbative QCD becomes unreliable, and the low invariant mass (\(W^2\)) region, where higher-twist corrections may be non-negligible. We impose a stricter \(Q^2\) cut on the HERA I+II \(\sigma _\mathrm{NC}^{c}\) dataset in the NNLO fit if the charm PDF is fitted in order to minimize the possible impact of missing NNLO terms related to initial-state charm (see Sect. 2.2 in [5]). For fixed-target DY measurements (specifically for E866 and E605 \(\sigma ^p\)) the cuts remove the data points that are too close to the production threshold, as discussed in Ref. [5], based on the study of Ref. [203]. To this purpose, we define \(\tau =m_{\ell \ell }^2/s\) and \(y_\mathrm{max}=-{{1}\over {2}}\ln \tau \), where \(m_{\ell \ell }\) is the invariant mass of the dilepton pair and \(\sqrt{s}\) is the center-of-mass energy of the collision. For collider inclusive gauge boson production, we impose a cut on the D0 W electron and muon asymmetry at NNLO because of the difficulty in obtaining a sufficiently precise theoretical prediction when the measured asymmetry becomes too close to zero; we exclude the lowest lepton rapidity bins of all of the LHCb measurements from the NNLO fit because, due to the rapidity cut on the leptons (\(y_\ell >2\)), in the last bin the phase space for both leptons to pass the cut is very small, thus leading to numerical instabilities in the computation of the NNLO K-factor; and we remove the large invariant mass bins from the ATLAS low-mass DY 2D 8 TeV measurement in order to avoid overlap with the corresponding high-mass measurement. For Z \(p_T\) production we follow Ref. [204] and remove the largest rapidity bins from the CMS Z \(p_T\) 8 TeV measurement because of an apparent incompatibility with the corresponding ATLAS measurement, while fully retaining the latter.
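The fixed-target DY variables defined above are simple to evaluate. The sketch below computes \(\tau\) and \(y_\mathrm{max}\) for a hypothetical dilepton mass and collision energy; the function names and the numbers are illustrative, and the actual cut values are those listed in Table 11.

```python
import math


def tau(m_ll: float, sqrt_s: float) -> float:
    """tau = m_ll^2 / s, with m_ll and sqrt_s in the same units (e.g. GeV)."""
    return m_ll**2 / sqrt_s**2


def y_max(m_ll: float, sqrt_s: float) -> float:
    """y_max = -(1/2) ln(tau)."""
    return -0.5 * math.log(tau(m_ll, sqrt_s))


# Hypothetical kinematics: m_ll = 10 GeV at sqrt(s) = 40 GeV.
print(tau(10.0, 40.0))    # 0.0625
print(y_max(10.0, 40.0))  # equals ln(sqrt_s / m_ll) = ln(4)
```

Since \(y_\mathrm{max}=\ln (\sqrt{s}/m_{\ell \ell })\), points with \(\tau\) close to one have a very small allowed rapidity range, which is the threshold region that the cuts are designed to remove.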

All the remaining cuts displayed in Table 11 are imposed to remove data points for which \(p_T\) resummation effects (typically in the low transverse momentum tail of the various distributions) or electroweak corrections (typically in the large transverse momentum or invariant mass tails of the various distributions) may become large. Finally, on top of the cuts listed in Table 11 we also apply at NLO a “similarity cut”: namely, if a data point is excluded at NNLO by one of the cuts in Table 11, then it is also excluded at NLO, because the NLO to NNLO difference is unreliable and the point is thus potentially subject to large NNLO corrections.

Kinematic cuts in the LO fit are taken to be the same as in the NLO fit.

4.2 Baseline dataset

The datasets described in Sect. 2 and the kinematic cuts described in Sect. 4.1 above define an extended dataset out of which we determine a maximally consistent baseline dataset. This baseline dataset is determined through a new weighted-fit procedure that we introduce here. In this procedure, first we flag datasets that are problematic either in terms of fit quality, or because of the stability properties of their covariance matrix. This is done by comparing for each measurement respectively the value of the \(\chi ^2\) or the value of a stability indicator to a suitable threshold value. Measurements for which thresholds are exceeded are then subject to a dedicated weighted fit. The measurement is then retained or discarded based on the results of this weighted fit.

Below we will first discuss the issue of stability of covariance matrices and describe the stability indicator that we will use. We will then perform an appraisal of the full dataset of Sect. 2 based on our indicators and criteria. We will next present the weighted fit method, and finally apply it to our dataset and perform the final dataset selection based on it.

4.2.1 Stability of experimental covariance matrices

Given the high precision of modern collider experiments, in particular HERA and the LHC, many datasets are now limited by systematic, rather than statistical, uncertainties. In these situations, the \(\chi ^2\) of a given dataset often becomes extremely sensitive to small differences in the correlation model assumed for the experimental systematic errors. This implies that small inaccuracies in the estimate of the experimental correlated systematic uncertainties can potentially induce spurious disagreements between theory predictions and experimental data. Such spurious disagreements can complicate the interpretation of the quality of a PDF fit. A poor \(\chi ^2\) may be caused solely by an instability of the experimental covariance matrix upon its inversion, rather than by a genuine tension with the rest of the data in the fit, or by an inaccuracy in the theory.

In order to quantify the stability of the \(\chi ^2\) with respect to potential inaccuracies affecting the experimental covariance matrices, a new metric was derived in Ref. [205]. This metric has the key property of being independent of any theory predictions, and thus of the rest of the data in the fit, as it relies exclusively on the experimental covariance matrix as input. This property ensures it is independent of the actual fit quality (the value of the \(\chi ^2\)). The metric is derived by studying the stability of the \(\chi ^2\) given ideally matching theory predictions, that is, when these are sampled from the same multi-Gaussian distribution as the experimental data.

Given the often limited information available on the details of some experimental systematic errors, this metric has to rely on some assumptions. The first one is that diagonal uncertainties are accurately known, and that potential instabilities are entirely explained by an imperfect knowledge of the correlations. The second is that the source of inaccuracies can be traced back to a \(\mathcal {O}(1)\) number of specific entries in the correlation matrix. An example of the latter assumption would be an inaccuracy in the estimate of the correlation between two data bins in opposite kinematic regions.

Under these assumptions, one can decompose [205] the experimental covariance matrix C as

$$\begin{aligned} C = DRD \, , \end{aligned}$$

where D is a diagonal matrix whose entries are the square roots of the diagonal entries in the covariance matrix, i.e. the standard deviations, and R is the correlation matrix. If the smallest eigenvalue of the correlation matrix R is \(\lambda _0\), then the stability of the \(\chi ^2\) with respect to the inaccuracies of the experimental correlation model will be quantified by the condition number

$$\begin{aligned} Z =\lambda _0^{-{{1}\over {2}}} \, . \end{aligned}$$

The value of \((\sqrt{2}Z)^{-1}\) can be related to an estimate of the required precision at which correlations need to be determined in order to ensure that they affect the \(\chi ^2\) statistic by less than one standard deviation, that is, by less than \(\sigma _{\chi ^2}=\sqrt{2/N_\mathrm{dat}}\) when normalized by the number of data points.

For example, a value of \(Z=5\) of the metric indicates that correlations must be estimated with an absolute uncertainty of less than 0.14. This means that if the correlation between two bins is estimated to be 1.0 while its real value is instead 0.86, one can expect that the \(\chi ^2\) may deviate significantly from unity (by more than \(\sigma _{\chi ^2}\)) even if the experimental data and theory calculations are perfectly consistent.
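The definitions in Eqs. (4.1) and (4.2) can be illustrated numerically. The sketch below assumes NumPy is available and uses a hypothetical two-bin covariance matrix whose correlation coefficient is chosen so that \(Z=5\), reproducing the example above; the function names are illustrative and are not taken from the NNPDF code or from Ref. [205].

```python
import numpy as np


def stability_metric(cov: np.ndarray) -> float:
    """Z = lambda_0^(-1/2), with lambda_0 the smallest eigenvalue of R in C = D R D."""
    std = np.sqrt(np.diag(cov))             # diagonal entries of D
    corr = cov / np.outer(std, std)         # R = D^-1 C D^-1
    lambda_0 = np.linalg.eigvalsh(corr)[0]  # eigvalsh returns eigenvalues in ascending order
    return float(lambda_0 ** -0.5)


def required_correlation_precision(z: float) -> float:
    """(sqrt(2) Z)^-1: precision needed on the correlations for a given Z."""
    return float(1.0 / (np.sqrt(2.0) * z))


# Hypothetical two-bin measurement: standard deviations 0.1 and 0.3, correlation 0.96.
std = np.array([0.1, 0.3])
corr = np.array([[1.0, 0.96], [0.96, 1.0]])
cov = np.outer(std, std) * corr

z = stability_metric(cov)
print(round(z, 3))                                  # 5.0
print(round(required_correlation_precision(z), 3))  # 0.141
```

For a 2x2 correlation matrix with off-diagonal entry \(\rho\) the eigenvalues are \(1\pm \rho\), so \(\rho =0.96\) gives \(\lambda _0=0.04\) and \(Z=5\). The result is independent of the standard deviations, as expected for a metric built from the correlation matrix alone.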

Therefore, by identifying the datasets in the global fit that have a relatively large value of the stability metric Z, one can single out those with a potentially unstable correlation matrix. If in addition these datasets display a poor fit quality, further investigation is required since a high value of the \(\chi ^2\) does not necessarily indicate a genuine tension in the data or a limitation of the theory calculations, but rather it could arise from the instability of the experimental covariance matrix.

In the remainder of this section, we will use the stability metric Z as a diagnostic tool to flag datasets that deserve further investigation. A regularization procedure in order to correct a covariance matrix with large Z can also be constructed [205]. Such a regularization procedure is not implemented in the default NNPDF4.0 fit, rather it will be implemented in Sect. 8.7 in order to assess the possible impact on the PDFs of regularizing the covariance matrix for those datasets characterized by large Z values.

4.2.2 Appraisal and selection criteria

We perform an appraisal of the full dataset discussed in Sect. 2 with the goal of determining its internal consistency. Specific measurements could be inconsistent with the rest of the dataset due to a variety of reasons of theoretical or experimental origin, such as for example large missing higher order QCD or electroweak corrections, missing systematic uncertainties, or underestimated experimental uncertainties. Our goal is not to attempt to have a full understanding of the nature of the inconsistencies, but rather, to single out and exclude from the baseline inconsistent data based on objective criteria. These data can then be studied separately through dedicated fits.

We start by performing an NNLO fit in which the full dataset is used. This fit adopts the theory settings discussed in Sect. 2, it implements the kinematic cuts of Sect. 4.1, and it is based on the methodology described in Sect. 3. For jet observables, it is impossible to include simultaneously dijets and single-inclusive jets because experimental correlations between them are not available. In this baseline fit, as well as in our default analysis, we choose to include dijets (and not single-inclusive jets) at 7 TeV and single-inclusive jets (and not dijets) at 8 TeV. The motivation for this choice will be presented in a separate analysis in Sect. 4.3.

We then consider, for each measurement, the following indicators and apply the following selection criteria:

  • The total \(\chi ^2\) per data point. We single out all the datasets for which \(\chi ^2> 1.5\). An excess from the expected unit value of the \(\chi ^2\) could arise from dataset inconsistencies, within the dataset or between the dataset and the rest of the extended dataset, from inaccuracies of the theoretical computations, from large statistical fluctuations (especially for datasets with a small number of data points) or from instabilities of the experimental covariance matrix.

  • The number of standard deviations \(n_\sigma \) by which the value of the \(\chi ^2\) per data point differs from the expected unit value,

    $$\begin{aligned} n_\sigma \equiv {{\chi ^2-1}\over {\sigma _{\chi ^2}}}={{\chi ^2-1}\over {\sqrt{2/N_\mathrm{dat}}}}. \end{aligned}$$

    We single out all the datasets for which \(|n_\sigma |> 2\). In these cases, an anomalously large \(\chi ^2\) is statistically significant and might not be explained by a statistical fluctuation.

  • The stability metric Z defined in Eq. (4.2). We single out the datasets with \(Z> 4\). This choice is based on the regularization studies performed in [205], which find that, when the correlation model is minimally altered such that it fulfills \(Z=4\), the induced changes in the resulting covariance matrix are very likely within the precision to which it was determined. The observed differences between the regularized and unregularized covariance matrices are \(5\%\) for the standard deviations and below 0.05 (in absolute units) for the correlation coefficients.

The first estimator flags all situations in which the significance of the discrepancy does not depend on the number of data points, such as for instance a missing higher order correction that affects all data points. The latter two instead are sensitive to cases in which there might be issues related to systematic uncertainties and their correlation, whose significance depends on the number of data points.

Table 12 The DIS datasets in the NNPDF4.0 fit to the extended dataset. For each dataset we show the number of data points, the \(\chi ^2\) per data point, the corresponding number of standard deviations \(n_\sigma \) and the stability metric Z, and the value of the weight \(\omega \) used in the definition of the weighted fit \(\chi ^2\) in Eq. (4.4). In the last column, we also indicate whether this dataset is retained in the NNPDF4.0 baseline dataset
Table 13 Same as Table 12 for fixed-target DY data
Table 14 Same as Table 12 for collider (Tevatron, top, and LHC, bottom) inclusive gauge boson production data
Table 15 Same as Table 12 for other LHC processes (listed in Table 5)

The number of data points \(N_\mathrm{dat}\) and the values of the three estimators outlined above are collected, for each measurement, in Tables 12, 13, 14 and 15. We flag the datasets that have either both \(\chi ^2>1.5\) and \(|n_\sigma |>2\), or both \(|n_\sigma |> 2\) and \(Z>4\). These datasets will be investigated through the weighted fit method presented in Sect. 4.2.3 below. The only exception is the ATLAS isolated photon production measurement at 8 TeV, which is discarded given that it is superseded by the companion measurement at 13 TeV. We do not flag datasets with \(\chi ^2>1.5\) but with \(|n_\sigma |<2\), nor the datasets with \(Z>4\) but with \(|n_\sigma |<2\). In the first case the large value of the \(\chi ^2\) is consistent with a statistical fluctuation. In the second case, despite its unstable covariance matrix, the dataset can nevertheless be fitted with acceptable fit quality. Datasets characterized by large Z values will be further investigated in Sect. 8.7 below, where their impact on the PDFs will be reassessed by means of a suitable regularization procedure that reduces their Z value.
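The flagging rule can be written compactly as \(|n_\sigma |>2\) combined with either \(\chi ^2>1.5\) or \(Z>4\). The sketch below implements Eq. (4.3) and this rule with the thresholds quoted in the text; the function names and the example inputs are hypothetical and do not correspond to any entry of Tables 12, 13, 14 and 15.

```python
import math


def n_sigma(chi2_per_point: float, n_dat: int) -> float:
    """Deviation of chi2 per data point from unity, in units of sqrt(2 / N_dat)."""
    return (chi2_per_point - 1.0) / math.sqrt(2.0 / n_dat)


def is_flagged(chi2_per_point: float, n_dat: int, z: float,
               chi2_max: float = 1.5, n_sigma_max: float = 2.0,
               z_max: float = 4.0) -> bool:
    """Flag if |n_sigma| > 2 and, in addition, either chi2 > 1.5 or Z > 4."""
    significant = abs(n_sigma(chi2_per_point, n_dat)) > n_sigma_max
    return significant and (chi2_per_point > chi2_max or z > z_max)


# Hypothetical datasets: (chi2 per point, N_dat, Z).
print(is_flagged(1.6, 10, 2.0))   # False: large chi2 but compatible with a fluctuation
print(is_flagged(1.6, 100, 2.0))  # True: chi2 > 1.5 and n_sigma above 2
print(is_flagged(1.3, 200, 6.0))  # True: n_sigma above 2 and Z > 4
print(is_flagged(1.3, 200, 2.0))  # False: n_sigma above 2 but chi2 and Z below threshold
```

The first example shows why \(n_\sigma \) is needed alongside the \(\chi ^2\): with only 10 data points, \(\chi ^2=1.6\) corresponds to \(n_\sigma \approx 1.3\), whereas the same \(\chi ^2\) with 100 data points gives \(n_\sigma \approx 4.2\).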

The datasets that are flagged according to these criteria are singled out in Tables 12, 13, 14 and 15 by the presence of a weight in the penultimate column. These are: NMC and BCDMS proton structure functions; combined HERA charm structure function; D0 W electron asymmetry; 7 TeV ATLAS WZ central rapidity; 8 TeV ATLAS W rapidity; 7 TeV LHCb W; 8 TeV LHCb electron asymmetry; 8 TeV ATLAS lepton+jets top-pair; and 7 TeV ATLAS and CMS dijet.

These datasets are hence potentially inconsistent, and they are assessed using the weighted fit method as discussed below. All other datasets listed in Tables 12, 13, 14 and 15 are deemed to be consistent and thus included in the NNPDF4.0 baseline.

4.2.3 The weighted fit method

The weighted fit method is based on the idea that in order to determine whether a specific measurement is inconsistent with the global dataset one should produce a PDF determination that provides the best agreement to this dataset. One may then check whether this best agreement does or does not lead to the deterioration of the agreement with one or more of the other data included in the global dataset. This idea was recently used in Ref. [206] as a means of studying the determination of standard model parameters, such as the strong coupling \(\alpha _s(m_Z)\), from a global PDF fit. Related methods were previously discussed in Ref. [207].

The way the idea is implemented is by performing a weighted fit, in which the selected dataset is given a weight that is large enough for it to carry about the same weight as the rest of the global dataset. To this goal, the figure of merit optimized in the fit is modified as

$$\begin{aligned} \chi ^2&= {{1}\over {N_\mathrm{dat}}}\sum _{i=1}^{n_\mathrm{exp}}N_\mathrm{dat}^{(i)}\chi ^2_i \qquad \longrightarrow \\ \chi ^2&= {{1}\over {N_\mathrm{dat}-N_\mathrm{dat}^{(j)}}}\sum _{i\ne j}^{n_\mathrm{exp}}N_\mathrm{dat}^{(i)}\chi ^2_i + \omega ^{(j)}\chi ^2_j \,, \end{aligned}$$

where \(N_\mathrm{dat}^{(i)}\) is the number of data points in the dataset i and \(\chi ^2_i\) is the contribution to the total \(\chi ^2\) from the given dataset. The value of \(\omega ^{(j)}\) is then chosen as

$$\begin{aligned} \omega ^{(j)}=N_\mathrm{dat}/N_\mathrm{dat}^{(j)}. \end{aligned}$$
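As an illustration of Eqs. (4.4) and (4.5), the following sketch evaluates the unweighted and weighted figures of merit for a toy set of two datasets. The dataset labels, the numbers of data points and the \(\chi ^2\) values are hypothetical, and the function names are not taken from the NNPDF code.

```python
def unweighted_chi2(n_dat: dict, chi2: dict) -> float:
    """Total chi2 per data point: sum_i N_i chi2_i / N_dat."""
    n_tot = sum(n_dat.values())
    return sum(n_dat[i] * chi2[i] for i in n_dat) / n_tot


def weight(n_dat: dict, j: str) -> float:
    """omega^(j) = N_dat / N_dat^(j), as in Eq. (4.5)."""
    return sum(n_dat.values()) / n_dat[j]


def weighted_chi2(n_dat: dict, chi2: dict, j: str) -> float:
    """Weighted figure of merit of Eq. (4.4), with dataset j singled out."""
    n_tot = sum(n_dat.values())
    rest = sum(n_dat[i] * chi2[i] for i in n_dat if i != j) / (n_tot - n_dat[j])
    return rest + weight(n_dat, j) * chi2[j]


# Hypothetical example: a large dataset "A" and a small dataset "B".
n_dat = {"A": 900, "B": 100}
chi2 = {"A": 1.1, "B": 2.0}  # chi2 per data point of each dataset

print(round(unweighted_chi2(n_dat, chi2), 6))     # 1.19
print(round(weight(n_dat, "B"), 6))               # 10.0
print(round(weighted_chi2(n_dat, chi2, "B"), 6))  # 21.1
```

In the unweighted case dataset B contributes 0.2 out of 1.19, while in the weighted case it contributes 20 out of 21.1, so the minimization is strongly driven towards the best possible description of B.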

The penultimate column of Tables 12, 13, 14 and 15 lists the values of \(\omega ^{(j)}\) for the datasets that we have singled out according to the criteria discussed above. We have explicitly checked that the choice of the precise value of \(\omega ^{(j)}\) does not change the general conclusions, by repeating several weighted fits with two more choices of \(\omega ^{(j)}\), namely, twice or half the default value defined by Eq. (4.5).

The possible outcomes of a weighted fit, and the corresponding conclusions on dataset compatibility, are the following:

  • The value of \(\chi ^2_j\) does not improve significantly while the \(\chi ^2_i\) of the rest of the datasets remain essentially unaffected. In this case we conclude that the dataset j exhibits internal inconsistencies that however do not distort the global fit. We keep dataset j in the baseline.

  • The value of \(\chi ^2_j\) does not improve significantly and the \(\chi ^2_i\) of several other datasets, including those belonging to the same process type as dataset j, worsen significantly. In this case we conclude that the internal inconsistencies of the given dataset distort the global fit. We remove dataset j from the baseline.

  • The value of \(\chi ^2_j\) improves significantly and the \(\chi ^2_i\) of the rest of the dataset is unchanged within statistical fluctuations. In this case we conclude that the dataset j was not fitted properly because it carries a small weight in the fit. We keep dataset j in the baseline.

  • The value of \(\chi ^2_j\) improves significantly but the \(\chi ^2_i\) of several other datasets, including those belonging to the same process type as dataset j, worsen significantly. In this case we conclude that the given dataset is inconsistent with the global dataset. We remove dataset j from the baseline.

The appraisal, to be presented in Sect. 4.2.4 below, must be done on a case-by-case basis, as there are several factors, rather than a single figure of merit, that determine whether or not the fit quality to other datasets worsens significantly, such as, for instance, whether the \(\chi ^2\) that worsens corresponds to data from the same process type or sensitive to the same PDF, whether there are known issues related to missing higher order or resummation corrections, etc. In all cases which are not clear-cut, we keep the dataset under consideration.
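The four possible outcomes listed above can be summarized as a two-by-two decision table. The sketch below encodes it under the assumption that the two inputs (whether \(\chi ^2_j\) improves significantly, and whether the \(\chi ^2_i\) of other datasets worsen significantly) have already been judged case by case as just described; the names `Verdict` and `weighted_fit_verdict` are hypothetical.

```python
from typing import NamedTuple


class Verdict(NamedTuple):
    conclusion: str
    keep_in_baseline: bool


def weighted_fit_verdict(chi2_j_improves: bool, others_worsen: bool) -> Verdict:
    """Map the outcome of a weighted fit for dataset j onto the four cases in the text."""
    if not chi2_j_improves and not others_worsen:
        return Verdict("internal inconsistency that does not distort the global fit", True)
    if not chi2_j_improves and others_worsen:
        return Verdict("internal inconsistency that distorts the global fit", False)
    if chi2_j_improves and not others_worsen:
        return Verdict("dataset carried a small weight in the unweighted fit", True)
    return Verdict("dataset inconsistent with the global dataset", False)


print(weighted_fit_verdict(chi2_j_improves=True, others_worsen=False))
print(weighted_fit_verdict(chi2_j_improves=True, others_worsen=True))
```

In this simplified form a dataset is retained exactly when the other datasets do not worsen significantly. The cases that are not clear-cut, which are kept in the baseline, are not captured by the two boolean inputs.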

4.2.4 Appraisal and selection

Table 16 reports the values of the \(\chi ^2\) obtained in the weighted fits for both the weighted dataset and for the rest of the datasets in the fit, grouped by process. In the latter, the \(\chi ^2\) includes the contribution coming from the weighted dataset (if the weighted dataset belongs to the process), but with \(\omega ^{(i)}=1\) in Eq. (4.4). For ease of reference, we also reproduce (in parentheses) the values of the \(\chi ^2\) in the unweighted fit originally used to assess each dataset, as given in Tables 12, 13, 14 and 15.

Table 16 The \(\chi ^2\) obtained in the unweighted (first row) and weighted fits (rest of the table) to the extended dataset. In each of the weighted fits the dataset indicated in the first column receives the weight reported in Tables 12, 13, 14 and 15. For each fit, the second column reports the \(\chi ^2\) of the weighted dataset in the weighted fit. The value in the unweighted fit (same as in Tables 12, 13, 14 and 15) is also given for reference in parentheses. The other columns display the \(\chi ^2\) of subsets of datasets, grouped by process, in the weighted fits. These values include the contribution from the weighted dataset but with \(\omega ^{(i)}=1\) in Eq. (4.4)

Based on Table 16, we reach the following conclusions, which are also summarized in the last column of Tables 12, 13, 14 and 15.

  • NMC \(\sigma ^{NC,p}\). The \(\chi ^2\) of this dataset improves from 1.53 to 1.28. The \(\chi ^2\) of the other datasets and the total \(\chi ^2\) fluctuate only marginally. These results are consistent with those reported in [208,209,210] and confirm that this dataset is internally inconsistent. Because such an inconsistency does not alter the global fit significantly, we keep this dataset in the baseline.

  • BCDMS \(F_2^p\). The \(\chi ^2\) of this dataset improves from 1.42 to 1.05. The total \(\chi ^2\) worsens, however this worsening is moderate and it does not seem to come from any specific process. These results confirm a mild inconsistency of this dataset with the rest of the datasets in the fit, which however does not appear to be significant enough to justify its removal from the fit. We thus keep this dataset in the baseline.

  • HERA I+II \(\sigma _\mathrm{NC}^c\). The \(\chi ^2\) of this dataset improves from 2.03 to 1.37, but the agreement with all the other HERA data, driven by the inclusive structure function measurements, deteriorates, with a \(\chi ^2\) increase from 1.20 to 1.45. The \(\chi ^2\) of all of the other datasets fluctuates only marginally. We therefore conclude that this dataset is in tension with the small-x HERA inclusive structure function, as also observed in the CT18 and MSHT20 analyses [143, 144]. This tension will possibly be alleviated once small-x resummation effects are accounted for [211], though only a resummed PDF determination could tell whether this is the case or not. Nevertheless the PDFs in the global fit remain unchanged if the dataset is removed. Furthermore, this dataset is required in order to stabilize the charm PDF, especially in a DIS-only fit, as we will discuss in Sect. 7. For these reasons we keep the measurement in the baseline.

  • E866 \(\sigma ^p\) (NuSea). The \(\chi ^2\) of this dataset improves from 1.59 to 0.90. The \(\chi ^2\) of inclusive gauge boson production deteriorates somewhat, from 1.48 to 1.65. A possible reason for this is the lack of large-x resummation in the treatment of the theoretical predictions for this dataset [203]. Mild inconsistency of this experiment with NMC was argued in Ref. [212]. Nevertheless, the fit quality of this dataset in the original unweighted fit is only marginally above our selection criteria, and the deterioration of the global \(\chi ^2\) is also marginal. We keep it in the baseline.

  • D0 W electron asymmetry. The \(\chi ^2\) of this dataset improves from 3.54 to 1.94, a value that remains sub-optimal. The \(\chi ^2\) of all of the other datasets, in particular of those belonging to the same process (including the D0 W muon asymmetry), deteriorates very significantly. The dataset is surely inconsistent, though perhaps the inconsistency can be traced to a single data point. We discard the dataset from the baseline.

  • ATLAS WZ 7 TeV (\(\mathcal {L}=4.6\) \(\hbox {fb}^{-1}\)) (central rapidity range). The \(\chi ^2\) of this dataset improves from 1.86 to 1.23 while the overall \(\chi ^2\) of collider gauge boson production data deteriorates slightly, from 1.48 to 1.60. However, this deterioration is very moderate, and furthermore, as we will show in Sect. 8, a small amount of regularization of experimental correlations significantly improves the description of the dataset while leaving the PDFs unchanged. There is thus no evidence that this dataset is inconsistent, and we keep it in the baseline.

  • LHCb \(Z\rightarrow ee\) 7 TeV. The \(\chi ^2\) of this dataset improves from 2.32 to 0.77. At the same time the \(\chi ^2\) of all collider gauge boson production data deteriorates slightly from 1.48 to 1.65. Given the moderate amount of deterioration it is unclear that this dataset is inconsistent and we keep it in the baseline.

  • ATLAS W 8 TeV. The \(\chi ^2\) of this dataset improves from 3.50 to 1.11 but the description of the other datasets, except top pair production, deteriorates quite significantly. As in the case of the companion measurement at 7 TeV, given the large value of Z, we will investigate in Sect. 8 whether the description of this experiment could be improved by regularization of its covariance matrix. However, in unregularized form it is inconsistent and we discard the measurement from the baseline.

  • LHCb \(W\rightarrow e\) 8 TeV. The \(\chi ^2\) of this dataset improves from 2.61 to 0.19, while the \(\chi ^2\) for all of the inclusive gauge boson production measurements (including other LHCb data) deteriorates significantly from 1.48 to 1.79. We discard the dataset from the baseline.

  • ATLAS \(t\bar{t}\) \(\ell \)+jets 8 TeV. Here we have four different observables, which behave somewhat differently upon being given large weight. The \(\chi ^2\) of any of these distributions significantly improves when given large weight. For the top transverse momentum and top pair invariant mass distributions this improvement is accompanied by a rather significant deterioration of the global fit quality, in which the agreement with all other datasets is spoiled to a greater or lesser extent. In the case of the top and top pair rapidity distributions the global fit quality is very similar and only the description of jets deteriorates moderately. This is consistent with the results of previous studies by NNPDF [154, 170], suggesting that the rapidity distributions, despite being described less well than in NNPDF3.1 [5], remain largely compatible with the rest of the dataset. It is also consistent with previous studies concluding that the simultaneous description of all of the ATLAS 8 TeV top distributions is problematic, possibly also because of ill-defined correlations within individual distributions and between different distributions [152, 154], and indeed other recent PDF determinations [143, 144] include only a pair out of the four distributions (though their choice of pair differs from our own). We thus keep the two rapidity distributions (\(y_t\) and \(y_{t\bar{t}}\)) and discard the transverse momentum and invariant mass distributions from the baseline.

  • ATLAS and CMS dijet 7 TeV. The \(\chi ^2\) of these datasets improves from 2.16 to 1.84 and from 1.85 to 1.34, respectively, while the global fit quality is very similar and only the description of the top pair data deteriorates moderately. We accordingly keep these two datasets in the baseline. The reason why the improvement of the \(\chi ^2\) is moderate is likely related to the large value of the stability metric Z, rather than to internal inconsistencies. Also in this case we will investigate the effect of regularizing the covariance matrix in Sect. 8, where we will show that upon regularization the \(\chi ^2\) becomes close to unity but the PDFs are essentially unaffected.

Fig. 14

The gluon (left) and antidown (right) PDFs at \(Q=1.65\) GeV at large x, for the unweighted fit and the weighted fits in which the ATLAS WZ 7 TeV (\(\mathcal {L}=4.6\,\hbox {fb}^{-1}\)) (central) and the ATLAS \(t\bar{t}\) \(\ell \hbox {+jets}\) 8 TeV datasets are assigned large weight

Inspection of the PDFs resulting from the weighted fits can provide additional guidance in assessing consistency. This information is used to support, dataset by dataset, the conclusions summarized above. As an example we display the gluon and antidown PDFs in Fig. 14. The PDFs are shown at the input scale \(Q_0=\) 1.65 GeV as a function of x in linear scale for the unweighted fit and for two weighted fits, specifically those in which the ATLAS WZ 7 TeV (\(\mathcal {L}=4.6\) \(\hbox {fb}^{-1}\)) (central) and the ATLAS \(t\bar{t}\) \(\ell \)+jets 8 TeV datasets are assigned large weight. It is clear that for the ATLAS \(t\bar{t}\) \(\ell \hbox {+jets}\) 8 TeV (\(1/\sigma d\sigma /dp_T^t\)) data, which are considered inconsistent based on the \(\chi ^2\) analysis, the PDFs in the weighted fit display a significant inflation of PDF uncertainties and an unnatural distortion of the overall PDF shape, including an unphysical valence-like structure of the antidown PDF. Conversely, for the ATLAS WZ 7 TeV (\(\mathcal {L}=4.6\) \(\hbox {fb}^{-1}\)) (central) data, which are considered consistent, the PDFs in the weighted fit have the same shape as the default and only moderately inflated uncertainties. A systematic analysis for all of the weighted fits shows that the behavior of the best fit PDFs confirms the conclusion of the \(\chi ^2\) analysis.

4.3 Choice of jet datasets

As discussed in Sect. 2.2.7, in NNPDF4.0 we consider both single-inclusive jet and dijet production datasets. However the two observables cannot be included simultaneously in the fit because full knowledge of experimental correlations is not available. This also means that we cannot assess their inclusion in the dataset based on weighted fits.

We therefore select the optimal set of jet observables by repeating the analysis carried out in [9]. Specifically, we start from a fit based on the baseline dataset identified above from which we remove all jet measurements. We then compare it to a series of NNLO fits that include, one at a time, the single-inclusive jet or dijet datasets discussed in Sect. 2.2.7, with the theory settings discussed there. The decorrelation model recommended in [88] is used in the case of the ATLAS 8 TeV single-inclusive jet measurement, while systematic uncertainties are decorrelated across rapidity bins in the case of the ATLAS 7 TeV single-inclusive jet measurement.

In Table 17 we report the values of the \(\chi ^2\) for all of these fits. Values are shown for all the data grouped by process type and for all single-inclusive jet and dijet data, for both those that are and those that are not included in each fit. The values corresponding to the datasets that are not included in each fit are indicated in square brackets. In Fig. 15 we compare the gluon PDF from all the fits, separately for those that include single-inclusive jet or dijet data, at a scale \(Q=100\) GeV. The gluon PDF is normalized to the fit that does not include any jet data. We have explicitly checked that all other PDFs are unaffected by the inclusion of jet data.

Table 17 The \(\chi ^2\) for an NNPDF4.0 variant in which all jet data are excluded, and a series of fits that add to this variant each of the jet measurements of Sect. 2.2.7 one at a time. Results are shown for all datasets, aggregated by process type. For jet data, results are shown both for the sets included in each fit and also for those not included, which are denoted by being enclosed in square brackets. Combined results for all of the jet production data (including data that are and that are not fitted) are also shown. The number of data points in each dataset is also reported
Fig. 15

The gluon PDF, at \(Q=100\) GeV, for some of the fits of Table 17: the baseline variant with no jets, and the fits with each of the single-inclusive jet data (left) or each of the dijet data (right). Results are shown normalized to the central value of the no jets variant

Inspection of Table 17 and of Fig. 15 leads to the following conclusions.

  • All of the 7 TeV data have a rather moderate impact and the global fit quality is essentially unchanged in comparison to the baseline. There is a moderate pull on the large-x gluon, consistent between ATLAS and CMS and between single-inclusive jets and dijets, and also consistent with the baseline within uncertainties.

  • The 8 TeV single-inclusive jet data have a moderate pull on the large-x gluon, consistent between ATLAS and CMS, and consistent within uncertainties with the baseline. This pull is in qualitative agreement with but slightly stronger than that of the 7 TeV jet data. The fit quality to all the other data in the global fit is essentially unchanged.

  • The only available 8 TeV dijet measurement, from CMS, has a strong pull on the gluon, leading to a result which deviates by about two sigma from the baseline, though the pull is perhaps similar in shape to that of the single-inclusive 8 TeV jet data. The global fit quality deteriorates, but the deterioration is not due to hadron collider data that are sensitive to the gluon, like top and Z \(p_T\), whose description actually improves, but rather to DIS and DY data.
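The statement that a fit deviates from the baseline by a given number of sigma can be made concrete with a simple pull estimator. The following is a minimal sketch, assuming that each fit is available as an array of Monte Carlo replicas evaluated on a common x grid, and that the deviation is measured in units of the one-sigma uncertainty of the baseline; the function name and array layout are illustrative and are not taken from the NNPDF code.

```python
import numpy as np


def pull_in_sigma(variant_replicas, baseline_replicas):
    """Shift of the variant central value in units of the baseline uncertainty.

    Both inputs have shape (n_replicas, n_x): one row per Monte Carlo
    replica, one column per point of a common x grid (assumed layout).
    """
    variant = np.asarray(variant_replicas, dtype=float)
    baseline = np.asarray(baseline_replicas, dtype=float)
    # Central values are taken as replica averages.
    variant_central = variant.mean(axis=0)
    baseline_central = baseline.mean(axis=0)
    # One-sigma uncertainty of the baseline (sample standard deviation).
    baseline_sigma = baseline.std(axis=0, ddof=1)
    return (variant_central - baseline_central) / baseline_sigma
```

A value of order two at a given x would correspond to the kind of deviation described for the CMS 8 TeV dijet fit, under the assumptions stated above.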

In general, the 8 TeV ATLAS and CMS single-inclusive jet measurements and the 7 TeV ATLAS and CMS dijet measurements have a very similar effect on the gluon PDF for \(x\lesssim 0.2\); dijet datasets seem to suppress the gluon PDF at slightly more moderate values of x than their single-inclusive jet counterparts. This does not seem to affect the description of the rest of the datasets included in the fits.

However, whereas all jet data are broadly consistent with each other, the CMS 8 TeV dijet data are somewhat problematic, as they lead to a gluon that is in disagreement with the baseline in the region around \(x\sim 0.3\) and to a visible deterioration in global fit quality. This measurement is peculiar in that it is the only one which is associated with a triple-differential distribution, it leads to the largest reduction of PDF uncertainty, and it is possibly the one that carries most of the experimental information among all of the jet measurements. The fact that no corresponding ATLAS measurement is available, and that the global \(\chi ^2\) deteriorates noticeably in comparison to all of the other fits, leads us to conclude that it is more conservative to include the companion single-inclusive jet data in the baseline. For 8 TeV data we thus include in the baseline the single-inclusive jet measurements.

Given the fact that dijet data are preferred on theoretical grounds [9, 137, 213] we include the 7 TeV dijet measurements in the baseline. We will investigate the effect of replacing the 7 TeV ATLAS and CMS dijet measurements with their single-inclusive jet counterparts in Sect. 7.3.3.

5 The NNPDF4.0 parton set

We now present the main result of this work: the NNPDF4.0 parton set. We first discuss fit quality, then present the PDFs, and finally show a comparison of the quality of the fit to a selection of fitted data for a variety of different fits. The NNPDF4.0 PDFs presented here are determined from the baseline dataset of Sect. 4 with the methodology of Sect. 3. We use \(\alpha _s(m_Z)=0.118\) at all perturbative orders. All PDF sets are Monte Carlo ensembles of 100 replicas, except in the case of the NNLO NNPDF4.0 baseline, which is a set of 1000 replicas. Additional comparisons, beyond those reported in this section, can be obtained by the reader using the open source NNPDF software framework described in [31], and summarized in Appendix A. For all PDF determinations presented below a last iteration has been performed, in which both the range of the preprocessing exponents (see Sect. 3.1.1) and the \(t_0\) covariance matrix (recall Sect. 3.2) have been recomputed, and it has been checked explicitly that the results for PDFs are unchanged: this ensures that iterative procedures have achieved convergence.

5.1 Fit quality

Table 18 presents an overview of the fit quality for the LO, NLO and NNLO NNPDF4.0 baseline fits. As in previous NNPDF releases, \(\chi ^2\) values are obtained using the published experimental covariance matrix; this is thus not the figure of merit that is minimized in the fit, which is the \(\chi ^2\) computed using the \(t_0\) covariance matrix (see Ref. [14], specifically Table 9, for a discussion of this issue). The \(\chi ^2\) values that were reported for NNLO PDFs in the NNPDF3.1 analysis of Ref. [5] are also given for comparison.
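The distinction between the two figures of merit can be illustrated with a short sketch. It assumes the usual decomposition of uncertainties into uncorrelated statistical errors, additive systematics, and multiplicative (relative) systematics, and it assumes that the \(t_0\) covariance matrix differs from the experimental one only in that multiplicative uncertainties are rescaled by a fixed set of theory predictions rather than by the measured values. The function names and the array layout are hypothetical and do not correspond to the NNPDF code.

```python
import numpy as np


def covariance(stat, add_sys, mult_sys_rel, scale):
    """Build a covariance matrix from uncertainty components.

    stat:         (n,) uncorrelated statistical uncertainties
    add_sys:      (n, k_add) additive systematics, absolute
    mult_sys_rel: (n, k_mult) multiplicative systematics, relative
    scale:        (n,) values multiplying the relative systematics:
                  the data for the experimental matrix, fixed theory
                  predictions for the t0 matrix (assumed convention)
    """
    stat = np.asarray(stat, dtype=float)
    add_sys = np.asarray(add_sys, dtype=float)
    scale = np.asarray(scale, dtype=float)
    mult_abs = np.asarray(mult_sys_rel, dtype=float) * scale[:, None]
    return np.diag(stat**2) + add_sys @ add_sys.T + mult_abs @ mult_abs.T


def chi2_per_point(data, theory, cov):
    """chi2 per data point: (d - t)^T C^{-1} (d - t) / n."""
    residual = np.asarray(data, dtype=float) - np.asarray(theory, dtype=float)
    # Solve the linear system instead of inverting the matrix explicitly.
    return float(residual @ np.linalg.solve(cov, residual)) / residual.size
```

Calling `covariance(..., scale=data)` gives the analogue of the experimental matrix used for the quoted values, while `covariance(..., scale=t0_predictions)` gives the analogue of the matrix entering the minimized figure of merit.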

Datasets are grouped by process type: fixed-target DIS, NC and CC; collider DIS, NC and CC; fixed-target DY; inclusive gauge boson production, separately for the Tevatron and the LHC; LHC gauge boson production with additional jets (including Z \(p_T\) and \(W\hbox {+jets}\)); LHC single-inclusive jet and dijet production (for NNPDF3.1 this also includes Tevatron single-inclusive jet production); LHC top pair production; LHC direct photon production; and LHC single top production. The number of data points included in each fit is indicated in parentheses, and \(\chi ^2\) values are provided only for fitted data. A detailed assessment of the compatibility of the NNPDF3.1 PDFs with the full NNPDF4.0 dataset will be presented in Sect. 6.2 below. A graphical representation of the NLO and NNLO values of Table 18 is provided in Fig. 16.

Table 18 Overview of \(\chi ^2\) value by process type for the LO, NLO, and NNLO NNPDF4.0 baseline fits; NNLO NNPDF3.1 is also shown for comparison
Table 19 Values of the \(\chi ^2\) for each individual experiment included in the NNPDF4.0 PDF determination at LO, NLO, and NNLO; NNPDF3.1 NNLO is also shown for comparison. A dash denotes that the dataset was not included in the specific determination
Fig. 16

Graphical representation of the results of Table 18, comparing the \(\chi ^2\) of the NNPDF4.0 NLO and NNLO baseline fits

First, one can observe how fit quality markedly improves with perturbative order: the \(\chi ^2\) decreases from 3.35 at LO to 1.24 at NLO and 1.16 at NNLO. The significant improvement in fit quality from NLO to NNLO was already reported in NNPDF3.1 (see specifically Sect. 3.2 in [5]) and it is chiefly due to the large number of high-precision LHC data, for which the \(\chi ^2\) improves most: specifically gauge boson and top pair production. Fit quality is generally good: specifically, both the value of \(\chi ^2\) and the value of \(n_\sigma \) of Eq. (4.3) corresponding to the global fit are similar to those of the other recent global PDF determinations, CT18 [143] and MSHT20 [144], despite the fact that this PDF determination includes a larger number of datapoints and of different processes. Of course, comparison of \(\chi ^2\) values between different PDF sets should be taken with care, given differences in dataset and theory settings: the recent PDF4LHC study [214, 215] has shown that fit quality in NNPDF3.1 is similar to that of CT18 and MSHT20. The largest \(\chi ^2\) value (\(\chi ^2=1.37\)) is found for LHC inclusive gauge boson production, which has by far the highest precision. The opposite extreme is single top datasets, which have relatively low precision and a very low \(\chi ^2\) value.
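To judge whether a given \(\chi ^2\) per data point is compatible with a statistical fluctuation, its deviation from unity can be expressed in units of the expected standard deviation. The sketch below assumes that \(n_\sigma \) is defined as \((\chi ^2-1)/\sqrt{2/n_\mathrm{dat}}\), where \(\sqrt{2/n_\mathrm{dat}}\) is the standard deviation of a \(\chi ^2\) distribution per degree of freedom with \(n_\mathrm{dat}\) degrees of freedom; Eq. (4.3) remains the authoritative definition, and the function name is illustrative.

```python
import math


def n_sigma(chi2_per_point, n_dat):
    """Deviation of chi2 per data point from one, in units of sqrt(2/n_dat).

    Assumed form of the estimator; see Eq. (4.3) for the definition
    actually used in the analysis.
    """
    if n_dat <= 0:
        raise ValueError("n_dat must be positive")
    return (chi2_per_point - 1.0) / math.sqrt(2.0 / n_dat)
```

With this form, the same \(\chi ^2\) per data point corresponds to a larger \(n_\sigma \) for a dataset with more points, which is why large and small datasets cannot be compared on \(\chi ^2\) alone.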

The quality of the NNLO NNPDF4.0 fit is comparable to that of its NNPDF3.1 counterpart. This is especially remarkable in view of the substantial extension of the dataset from NNPDF3.1 to NNPDF4.0. A comparative analysis of the impact of different data and an assessment of the role played by the methodology will be respectively presented in Sect. 7 and Sect. 8 below. Specifically, we will see that a NNLO fit to the NNPDF3.1-like dataset (see Sect. 7.1.1 below) leads to \(\chi ^2=1.145\) if NNPDF4.0 methodology is used, while the significantly worse value \(\chi ^2=1.186\) is found using NNPDF3.1 methodology.

In Tables 19, 20, 21 and 22 we provide the details of the \(\chi ^2\) value for each dataset included in each PDF determination. We make the following observations.

Table 20 Same as Table 19 for fixed-target DY datasets
Table 21 Same as Table 19 for inclusive gauge boson production datasets
Table 22 Same as Table 19 for all other LHC datasets
  • The impact of NNLO QCD corrections is apparent for several of the LHC datasets, in particular for Z \(p_T\) and top pair production, whose \(\chi ^2\) improves significantly when moving from NLO to NNLO.

  • Fit quality at NNLO is good and uniform across different datasets, with variations compatible with statistical fluctuations.

  • A good description of the inclusive gauge boson production data is achieved, irrespective of the kinematic region probed by specific datasets, despite their extremely high precision.

  • Measurements with poor fit quality are those already singled out in Sect. 4 that have been retained for the reasons explained there: specifically the combined HERA charm cross section, the D0 muon asymmetry, the LHC \(W,Z\rightarrow \mu \) 7 TeV rapidity distributions and the ATLAS top pair 8 TeV rapidity distributions in the lepton+jet final state and 7 TeV total cross-section. For some of these, fit quality is somewhat worse in NNPDF4.0 than NNPDF3.1, due to the larger number of competing datasets included in the NNPDF4.0 determination. We have checked explicitly that if we exclude in turn experiments with the worst fit quality, and we combine the ensuing replicas into a single set, we obtain results that are compatible within statistical fluctuations with those of the default global fit.

5.2 Parton distributions

We now examine the baseline NNPDF4.0 parton distributions. We first show the full set of PDFs, compared to their NNPDF3.1 predecessors. We then discuss sources of theoretical uncertainties: the dependence on the perturbative order and on the value of the strong coupling. We finally compare the NNLO NNPDF4.0 baseline PDFs to CT18 [143] and MSHT20 [144]. A further comparison with these PDF sets in terms of phenomenology, i.e. specifically for parton luminosities and theoretical predictions for LHC observables, will be presented in Sect. 9.

5.2.1 Comparison to NNPDF3.1

The full sets of NNLO NNPDF4.0 and NNPDF3.1 PDFs are shown in Fig. 17, and the associated relative one-sigma uncertainties are displayed in Fig. 18. Specifically, we show the up, antiup, down, antidown, strange, antistrange, charm and gluon PDFs as a function of x at \(Q=100\) GeV. Results are normalized to the NNPDF4.0 central value.

Fig. 17

The full set of NNLO NNPDF4.0 PDFs: the up, antiup, down, antidown, strange, antistrange, charm and gluon PDFs at \(Q=100\) GeV, compared to NNPDF3.1. Results are normalized to the central NNPDF4.0 value. Solid and dashed bands correspond to 68% c. l. and one-sigma uncertainties, respectively

Fig. 18

Same as Fig. 17 but for one-sigma relative uncertainties

There is remarkable consistency between the new NNPDF4.0 PDF set and the previous NNPDF3.1 analysis. The only noticeable differences appear in the strange and antistrange PDFs and in the gluon. As we shall show in Sect. 7.1, in the former case this is mainly due to the inclusion of NNLO corrections in the treatment of the NuTeV data (see Sect. 2.1): indeed, this same effect was already observed in a recent dedicated study of strangeness [10]. In the latter case, the difference, i.e. the suppression of the gluon around \(x\sim 0.1\), is mainly due to the extra physical constraints provided by additional single-inclusive jet, dijet and top pair measurements included in NNPDF4.0, see also the discussion of Sect. 7.

The precision of the PDFs in the NNPDF4.0 set increases significantly in comparison to NNPDF3.1. Depending on the kinematic region and on the parton, the reduction of the PDF relative uncertainty ranges from 30% to more than 50%. The relative uncertainty of almost all of the NNPDF4.0 PDFs is of the order of 1–2% in the region probed by experimental data. In Sects. 7 and 8 we will disentangle how much of this reduction is due to the improved fitting methodology and how much to the extended dataset.
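The central values, one-sigma uncertainties and 68% c.l. bands discussed here are all statistics of the Monte Carlo replica ensemble. A minimal sketch of how they might be computed is given below, assuming that the replicas are stored as an array with one row per replica and one column per x point, and that the 68% c.l. band is taken as the central interval between the 16th and 84th percentiles; both the layout and the percentile convention are assumptions rather than a description of the NNPDF code.

```python
import numpy as np


def replica_summary(replicas):
    """Summary statistics of a Monte Carlo replica ensemble.

    replicas: array of shape (n_replicas, n_x), a PDF evaluated at fixed Q.
    """
    r = np.asarray(replicas, dtype=float)
    central = r.mean(axis=0)              # central value: replica average
    sigma = r.std(axis=0, ddof=1)         # one-sigma uncertainty
    # Central 68% interval (assumed convention: 16th to 84th percentile).
    low, high = np.percentile(r, [16.0, 84.0], axis=0)
    return {
        "central": central,
        "sigma": sigma,
        "relative_sigma": sigma / np.abs(central),
        "cl68_low": low,
        "cl68_high": high,
    }
```

For a Gaussian ensemble the two bands coincide, so a visible difference between them signals non-Gaussian behavior of the replica distribution.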

5.2.2 Dependence on the perturbative order and on the strong coupling

In Fig. 19 the up, antiup, charm and gluon NNPDF4.0 PDFs are compared for the three perturbative orders, LO, NLO and NNLO, as a function of x at \(Q=100\) GeV. Results are normalized to the central value of the NNLO set. As expected, a large shift is observed from LO to NLO due to the large NLO corrections, as is also clear from the poor quality of the LO fit seen in Tables 18, 19, 20, 21 and 22. This is consistent with previous NNPDF studies.

Fig. 19

Comparison between the LO, NLO and NNLO NNPDF4.0 PDFs. The up, antiup, charm and gluon are shown at \(Q=100\) GeV. All results are normalized to the central value of the NNLO set. Solid and dashed bands correspond respectively to 68% c. l. and one-sigma uncertainties

However, the difference between NLO and NNLO PDFs is also noticeable. While the NLO and NNLO PDFs are very compatible within uncertainties for the up quark, in the case of the charm quark PDF at intermediate values of x and in the case of the gluon PDF at large values of x the shift in central value is comparable to or even somewhat larger than the uncertainty band. This means that at NLO the missing higher order uncertainty is no longer negligible in comparison to the PDF uncertainty, unlike in previous PDF determinations, including NNPDF3.1 (see Fig. 3.12 in [5]), where NLO and NNLO PDFs generally agreed within their larger errors. Interestingly, the shift in central value in the NLO PDFs observed in Refs. [23, 24] when missing higher order corrections are added during the fit seems to be of the same size and sign as the shift between NLO and NNLO results seen in Fig. 19. This suggests that the inclusion of the missing higher order uncertainty along the lines of Refs. [23, 24] would be highly desirable also at NNLO.

An important source of theory uncertainty that is routinely included is that related to the variation of \(\alpha _s\). The default value of the strong coupling adopted for NNPDF4.0 at all perturbative orders is \(\alpha _s(m_Z)=0.118\), in agreement with the latest PDG value of \(\alpha _s(m_Z)=0.1179 \pm 0.0010\) [141]. In order to properly include correlated PDF+\(\alpha _s\) uncertainties [216] in the computation of LHC observables, we also provide sets corresponding to different values of \(\alpha _s\). Specifically, we provide PDFs obtained with \(\alpha _s(m_Z)=0.116,\, 0.117,\,0.1175,\, 0.1185,\,0.119,\,0.120\). They are shown in Fig. 20, along with the baseline, normalized to the central value of the latter. Only the change in central value is shown: relative PDF uncertainties are essentially unchanged when \(\alpha _s\) is varied. Note that the change in central value as \(\alpha _s\) is varied by one-sigma is smaller or much smaller than the PDF uncertainty. Of course, the gluon displays the strongest dependence on \(\alpha _s\), and it decreases at small x and increases at large x as the value of \(\alpha _s\) is increased.
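A commonly used way of combining the two sources of uncertainty for an observable is sketched below. It assumes the prescription in which the \(\alpha _s\) uncertainty is half the difference between predictions obtained with the central PDFs at an upper and a lower value of \(\alpha _s(m_Z)\) (for instance 0.119 and 0.117, corresponding to a variation of 0.001 around 0.118), which is then added in quadrature to the PDF uncertainty. The choice of variation and the function name are assumptions made for illustration; Ref. [216] should be consulted for the recommended procedure.

```python
import math


def pdf_alphas_uncertainty(delta_pdf, obs_alphas_low, obs_alphas_high):
    """Combine PDF and alpha_s uncertainties in quadrature.

    delta_pdf:        PDF uncertainty on the observable at the central alpha_s
    obs_alphas_low:   observable computed with the lower alpha_s PDF set
    obs_alphas_high:  observable computed with the upper alpha_s PDF set
    Returns (combined uncertainty, alpha_s uncertainty).
    """
    # Symmetrized alpha_s uncertainty; the sign records the direction of change.
    delta_alphas = 0.5 * (obs_alphas_high - obs_alphas_low)
    combined = math.hypot(delta_pdf, delta_alphas)
    return combined, delta_alphas
```

Both predictions must be computed with the same value of \(\alpha _s\) in the hard cross-section as in the PDF set, otherwise the correlation between the two is lost.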

Fig. 20

Same as Fig. 17, now comparing PDFs obtained using different values of \(\alpha _s(m_Z)=0.116,\, 0.117,\,0.1175,\,0.118,\, 0.1185,\,0.119,\,0.120\), normalized to the \(\alpha _s(m_Z)=0.118\) baseline, with only the central value shown for other sets

In Table 23 we show the value of the \(\chi ^2\) per data point obtained in the NNLO fit corresponding to each value of \(\alpha _s\). Whereas a full determination of \(\alpha _s\) should be done [206] by using the correlated replica method of Ref. [138], and also including theory uncertainties, these values suggest that the best-fit value of \(\alpha _s\) within the NNPDF4.0 framework is consistent with the NNPDF3.1-based determination of Ref. [206] and with the current PDG value.
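A rough indication of the preferred value of \(\alpha _s\) can be read off a \(\chi ^2\) profile by fitting a parabola to it. The sketch below does only this: it is not the correlated replica method of Ref. [138], it ignores theory uncertainties, and it returns no uncertainty estimate. The function name is hypothetical, and the inputs are meant to be the \(\alpha _s(m_Z)\) values and the corresponding \(\chi ^2\) values of a table such as Table 23.

```python
import numpy as np


def parabolic_minimum(alphas_values, chi2_values):
    """Location of the minimum of a parabola fitted to a chi2 profile."""
    x = np.asarray(alphas_values, dtype=float)
    y = np.asarray(chi2_values, dtype=float)
    # Shift and rescale the abscissa to keep the polynomial fit well conditioned.
    x0 = x.mean()
    scale = x.max() - x.min()
    u = (x - x0) / scale
    a, b, _ = np.polyfit(u, y, 2)
    if a <= 0.0:
        raise ValueError("fitted parabola has no minimum")
    # Vertex of a*u^2 + b*u + c, mapped back to the original variable.
    return x0 - scale * b / (2.0 * a)
```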

Table 23 Values of the total \(\chi ^2\) per data point for the NNLO global fit with different values of \(\alpha _s(m_Z)\)

As already discussed in Ref. [5], the remaining parametric uncertainties, related to the values of the quark masses, are expected to be very small, since the dependence on the charm mass is almost entirely removed by parametrizing the charm PDF, and the dependence on the bottom quark mass is very small (except on the b-PDF itself and processes specifically sensitive to it).

5.2.3 Comparison to other PDF sets

The NNPDF4.0 NNLO PDFs are compared to other recent global sets, namely CT18 [143] and MSHT20 [144], in Fig. 21. Note that there are substantial differences in the underlying dataset: the CT18 dataset is very close to that of NNPDF3.1 while the MSHT20 dataset is somewhere in between NNPDF3.1 and NNPDF4.0 (see Appendix B for a detailed comparison). All results are shown at \(Q=100\) GeV, normalized to the central NNPDF4.0 value. Relative uncertainties are compared in Fig. 22. Note that while for NNPDF4.0 there are eight independently parametrized PDFs, for CT18 the strange and antistrange are not independently parametrized, and for both CT18 and MSHT20 charm is not independently parametrized.

Fig. 21

Comparison between the NNPDF4.0, CT18 and the MSHT20 NNLO PDF sets. The up, antiup, down, antidown, strange, antistrange, charm and gluon PDFs are shown at \(Q=100\) GeV, normalized to the central NNPDF4.0 value. For NNPDF4.0, solid and dashed bands correspond respectively to 68% c. l. and one-sigma uncertainties

Fig. 22

Same as Fig. 21 but for one-sigma relative uncertainties

The three parton sets are overall in fair agreement within their respective uncertainties, though some differences in shape are observed. Interestingly, these follow the pattern already observed in [5] when comparing NNPDF3.1 [5] to CT14 [217] and MMHT2014 [218] (see in particular Fig. 12 in Ref. [5]). The up and down PDFs are in good agreement, in particular the NNPDF4.0 result is always within the envelope of the CT18 and MSHT20 uncertainties. More marked differences are observed for the antiup and antidown PDFs: note, however, that the CT18 and MSHT20 PDF sets do not include the E906/SeaQuest and the LHCb 13 TeV measurements, which provide additional constraints on sea quark flavor separation at mid- and large-x values, as discussed in Sect. 7 (see Ref. [212] for a discussion of the SeaQuest data in the CT18 framework). The NNPDF4.0 strange and antistrange PDFs agree very well with MSHT20: in both these PDF sets, strangeness is enhanced in comparison to CT18. As suggested in [10, 144], this is likely due to the fact that the ATLAS WZ 7 TeV data are not included in the default CT18 fit (though they are included in the CT18A variant set), and that NNLO massive corrections to the neutrino DIS dimuon cross-sections are also not accounted for.

The NNPDF4.0 charm PDF is suppressed at intermediate values of x in comparison to CT18 and MSHT20, as a consequence of the fact that charm in CT18 and MSHT20 is determined by perturbative matching conditions and is not independently parametrized. The gluon is in fair agreement in the region of \(x\lesssim 0.03\) which is relevant for Higgs production though the NNPDF result is at the upper edge of the MSHT20 and CT18 uncertainty; this was already the case when comparing NNPDF3.1 to CT14 and MMHT2014. At larger values of x, the NNPDF4.0 gluon is suppressed in comparison to CT18 and MSHT20. This behavior is likely due to the LHC top pair and jet data that are included in NNPDF4.0 but not in the other sets.

Concerning the associated PDF uncertainties, NNPDF is generally more precise, while CT18 has generally larger uncertainties. This is consistent with the observation that CT18 is based on a somewhat smaller dataset than NNPDF4.0, with MSHT20 being in between, see Appendix B for more details.

5.3 Comparison to experimental data

In Fig. 23 we present for illustrative purposes a comparison between a selection of data included in the NNPDF4.0 baseline fits and the corresponding NLO and NNLO best-fit results, with the main goal of providing a visual assessment of the fit quality and of the relative size of the data and PDF uncertainties. The data shown are selected as representative of the global dataset; specifically we show results for the following data: the lowest Q bin of the combined HERA charm cross-section [145]; the SeaQuest (DYE906) differential cross section [117]; the central rapidity bin of the ATLAS 7 TeV \(W^+\) rapidity distribution [54]; the highest dilepton invariant mass bin for ATLAS 8 TeV high-mass DY [79]; the \(0.5 \le |y| \le 1.0\) dijet rapidity bin for the CMS 7 TeV dijets [76]; the lowest \(p_T^Z\) bin of the CMS 8 TeV Z \(p_T\) distribution [66]; the ATLAS 8 TeV normalized single top rapidity distribution [98]; and the top rapidity distribution for CMS 13 TeV top pairs in the lepton+jets final state [93]. All results are normalized to the central experimental value. Data error bars correspond to the sum in quadrature of all uncertainties. Correlated systematic uncertainties are large or even dominant in several cases, therefore the plots displayed in Fig. 23 should be viewed as a qualitative indication, while a quantitative assessment is provided by the \(\chi ^2\) values of Tables 19, 20, 21 and 22. A full set of comparisons of the NNLO PDF to all the data included in the fit is linked from the NNPDF website https://nnpdf.mi.infn.it/nnpdf4-0/ and can be found in [219].

Fig. 23

Comparison between data points and NLO and NNLO best-fit results for a selection of fitted data points (see text). Results are generally shown as ratios to the central experimental value, with one-sigma experimental and PDF uncertainties. The experimental uncertainty is the sum in quadrature of all statistical and systematic uncertainties

It is clear that NNLO corrections are significant in many cases as already noticed: specifically for combined HERA charm, SeaQuest, the CMS 7 TeV dijets, CMS 8 TeV Z \(p_T\) and the CMS 13 TeV top pairs. In all these cases, the quality of the best fit visibly improves at NNLO. PDF uncertainties are generally smaller than data uncertainties. This is in part due to the fact that experimental uncertainties are correlated while the diagonal uncertainty is shown in the plots, but also to the fact that PDFs are simultaneously constrained by several datasets. Indeed, PDF uncertainties become comparable to data uncertainties when the data shown are the only ones to constrain the relevant PDFs: an example is the SeaQuest data at very large \(x_2\) (momentum fraction of the struck parton), which is essentially the only dataset that constrains the \(\bar{d}/\bar{u}\) ratio in this region.
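The quantities displayed in a data/theory ratio plot of this kind can be sketched as follows. The sketch assumes that the experimental covariance matrix is available, so that the sum in quadrature of all uncertainties for each point is the square root of the corresponding diagonal entry, and that predictions and their PDF uncertainties are given point by point. The function name and the dictionary keys are illustrative only.

```python
import numpy as np


def ratios_to_data(data, cov, theory, theory_sigma):
    """Data and theory normalized to the central experimental value.

    data:         (n,) central experimental values
    cov:          (n, n) experimental covariance matrix
    theory:       (n,) best-fit predictions
    theory_sigma: (n,) one-sigma PDF uncertainties on the predictions
    """
    data = np.asarray(data, dtype=float)
    # The diagonal error discards all correlations between data points.
    diag_err = np.sqrt(np.diag(np.asarray(cov, dtype=float)))
    return {
        "data_ratio": np.ones_like(data),
        "data_err": diag_err / np.abs(data),
        "theory_ratio": np.asarray(theory, dtype=float) / data,
        "theory_err": np.asarray(theory_sigma, dtype=float) / np.abs(data),
    }
```

Because the off-diagonal entries of the covariance matrix never enter, error bars built this way can look much larger than the effective constraint the data place on the fit, which is why the \(\chi ^2\) remains the quantitative measure.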

6 Validation of the methodology

We perform here a detailed validation of the NNPDF4.0 fitting methodology, with the main goal of verifying that the resulting PDF uncertainties have been faithfully estimated. A validation technique through closure tests was introduced by us in Ref. [14], in order to validate the NNPDF3.x methodology. This technique checks for the faithfulness of PDF uncertainties in the region in which PDFs are constrained by the data. We will apply it systematically to NNPDF4.0 in Sect. 6.1.1: thanks to the greater computational efficiency of the NNPDF4.0 methodology (see Sect. 3.4) we can now perform much more extensive and systematic tests than was previously possible. Furthermore, we can now also test for faithfulness of uncertainties in the extrapolation region, i.e. where PDFs are not directly constrained by data, by means of future tests, introduced recently in Ref. [15]. Future tests of the NNPDF4.0 methodology will be presented in Sect. 6.2. This extensive validation, both in the data and the extrapolation regions, is especially desirable given the small, percent-level PDF uncertainties that NNPDF4.0 achieves.

6.1 Closure testing NNPDF4.0

The closure testing methodology was introduced for global PDF fits in Ref. [14], following a suggestion in Ref. [220] and previous studies in Ref. [221]. Here we follow the original approach of Ref. [14] and supplement it with a wider variety of estimators and more systematic studies. First, we review the closure testing methodology and describe the settings adopted for the closure tests of NNPDF4.0. Then we introduce the statistical estimators used to validate the outcome of these tests, including the definition of some new estimators. Finally, we present a detailed closure test analysis of the NNPDF4.0 methodology, based on the statistical estimators introduced previously. A discussion of the limitations of the closure testing methodology is also given in conclusion. A more detailed theoretical discussion of the statistical underpinnings of the closure testing methodology that we adopt can be found in Ref. [222].

6.1.1 The closure test setup

The basic idea of closure testing is to perform a PDF determination based on artificial data that have been generated with perfect statistical properties from a known underlying law. Comparing results to the known truth then allows one to check for statistical consistency.

Specifically, assume that we have \(N_\mathrm{dat}\) experimental measurements, normally distributed around the true values \(\varvec{f}\) with covariance matrix \(C\). The central values of the experimental data \(\varvec{z}\) will then be given in terms of their true values as

$$\begin{aligned} z_{i} = f_i+ \eta _i\, , \quad i=1\,\ldots ,N_\mathrm{dat}\, , \end{aligned}$$

where the vector of shifts \(\varvec{\eta }\) is drawn from a multi-Gaussian distribution with covariance \(C\), \(\mathcal {N}(\varvec{0}, C)\). Within the Monte Carlo replica method for error propagation adopted in this work, the pseudodata which are used as actual input for the PDF fit, \(\varvec{y}^{(k)}\), are generated by adding a further layer of fluctuations,

$$\begin{aligned} y^{(k)}_{i} = f_i+ \eta _i+ \epsilon ^{(k)}_{i}\, , \quad i=1\,\ldots ,N_\mathrm{dat}\, , \quad k=1\,\ldots ,n_\mathrm{rep}\, , \end{aligned}$$

where the index \(k\) indicates that each Monte Carlo replica is generated by drawing an independent noise vector \(\varvec{\epsilon }\) from the same multi-Gaussian distribution \(\mathcal {N}(\varvec{0}, C)\). In the NNPDF approach, for each Monte Carlo replica k defined in Eq. (6.2) a neural network such as that displayed in Fig. 11 is trained from the minimization of a figure of merit, see also the discussion in Sect. 3. This means that the neural network parameters are chosen by optimizing

$$\begin{aligned} E^{(k)} = {{1}\over {N_\mathrm{dat}}} \sum _{ij} (g^{(k)}_i - y^{(k)}_i) C^{-1}_{ij} (g^{(k)}_j - y^{(k)}_j)\,, \end{aligned}$$

where we denote by \(\varvec{g}^{(k)}\) the predictions for the experimental data obtained from the neural network model fitted to the k-th replica.

In a fit to actual experimental data we have access to the measured central values \(\varvec{z}\) and to the covariance matrix \(C\) as estimated by the experimentalists. In a closure test we instead use a given set of PDFs and associated theoretical calculation as input for the central values. Hence, the starting point of the closure test is a known proxy of the true underlying observable values, \(\varvec{f}\). Subsequently, a proxy for the experimental central values is generated following Eq. (6.1). A closure test thus amounts to applying to closure test data the NNPDF methodology as it would be used in a fit to actual experimental data.
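The two layers of fluctuations in Eqs. (6.1) and (6.2), and the figure of merit minimized for each replica, can be illustrated with a short NumPy sketch. The true values `f`, the covariance matrix `C`, and the array sizes below are hypothetical toy inputs chosen only for illustration; they are not taken from the NNPDF4.0 dataset, and no neural network is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dat, n_rep = 5, 1000

# Hypothetical "true" observable values f and experimental covariance C.
f = np.linspace(1.0, 2.0, n_dat)
A = rng.normal(size=(n_dat, n_dat))
C = A @ A.T / n_dat + 0.05 * np.eye(n_dat)   # symmetric, positive definite
C_inv = np.linalg.inv(C)

# Level-one data, Eq. (6.1): z = f + eta, with eta ~ N(0, C).
eta = rng.multivariate_normal(np.zeros(n_dat), C)
z = f + eta

# Level-two data, Eq. (6.2): y^(k) = z + eps^(k), one independent eps per replica.
eps = rng.multivariate_normal(np.zeros(n_dat), C, size=n_rep)
y = z + eps                                   # shape (n_rep, n_dat)

def figure_of_merit(g, y_k, C_inv):
    """E^(k) as defined above: chi-squared per data point between g and y_k."""
    r = g - y_k
    return r @ C_inv @ r / r.size

# Predictions equal to the level-one data give E^(k) close to one on average,
# because each replica fluctuates around z with covariance C.
mean_E = np.mean([figure_of_merit(z, y_k, C_inv) for y_k in y])
print(f"mean E over replicas for g = z: {mean_E:.3f}")
```

In an actual closure test the predictions `g` would come from the fitted model for each replica rather than being set by hand.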

6.1.2 Statistical estimators

A successful closure test must be such that the resulting PDF fit yields a faithful statistical description of the known underlying law. In order to assess quantitatively the degree of success of the NNPDF4.0 closure tests presented here, we have extended and systematized the set of estimators introduced in previous studies [14]. Here we provide a summary of the estimators and their justification; for more detailed derivations and arguments showing how they fit into a Bayesian approach to inverse problems we refer the reader to [222].

Bias, variance, and noise in closure tests

We define an error function as the expectation value across PDF replicas, denoted as \(\mathbf {E}_{\epsilon }\left[ \cdot \right] \), of the \(\chi ^2\) evaluated between the data predictions obtained from the k-th PDF replica, \(\varvec{g}^{(k)}\), and the corresponding experimental central values, \(\varvec{z}\),

$$\begin{aligned} \mathbf {E}_{\epsilon }\left[ {\chi ^2}^{(k)} \right] \equiv {{1}\over {N_\mathrm{dat}}} \mathbf {E}_{\epsilon }\left[ \sum _{ij} (g^{(k)}_i - z_i) C^{-1}_{ij} (g^{(k)}_j - z_j) \right] \, . \end{aligned}$$

It is easy to check [222] that this expression can be decomposed as

$$\begin{aligned} \mathbf {E}_{\epsilon }\left[ {\chi ^2}^{(k)} \right]= & {} \mathrm{noise} + \mathrm{bias} + \mathrm{variance} - \mathrm{cross\,term} \nonumber \\= & {} \mathrm{noise} + \mathrm{variance} + \Delta _{\chi ^2}, \nonumber \\= & {} \chi ^2 + \mathrm{variance} , \end{aligned}$$

where each of the quantities on the right-hand side is defined as follows.

First of all, the noise is defined as

$$\begin{aligned} \mathrm {noise} = {{1}\over {N_\mathrm{dat}}} \sum _{ij} \left( f_i - z_i \right) C^{-1}_{ij} \left( f_j - z_j \right) \end{aligned}$$

and represents the fluctuations of the experimental data \(\varvec{z}\) around the true value \(\varvec{f}\). Eq. (6.6) is clearly independent of the model adopted, being an intrinsic property of the experimental measurements. Note that by construction the noise will tend to one in the limit of large \(N_\mathrm{dat}\).

The bias is defined as the difference between the central value of the model replica predictions, \(\mathbf {E}_{\epsilon }\left[ g \right] \), and the true observable values \(\varvec{f}\), in units of the experimental covariance matrix, i.e.

$$\begin{aligned} \mathrm{bias}= {{1}\over {N_\mathrm{dat}}} \sum _{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - f\right) _iC^{-1}_{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - f\right) _j\, . \end{aligned}$$

The bias measures the deviation between the result of the fit and the underlying law. In general, it is desirable for a PDF fit to exhibit a smaller bias because that indicates that the fit results are closer to the truth. However, consistency of a PDF fit does not depend on the size of the bias, but rather, on whether the size of the bias is correctly reproduced by the PDF uncertainty, as we discuss below.

Finally, the variance term describes the fluctuations of the model replica predictions around their mean value again in units of the experimental covariance matrix,

$$\begin{aligned}&\mathrm{variance}= \nonumber \\&{{1}\over {N_\mathrm{dat}}} \mathbf {E}_{\epsilon }\left[ \sum _{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - g^{(k)}\right) _{i} C^{-1}_{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - g^{(k)}\right) _{j} \right] ,\nonumber \\ \end{aligned}$$

which can be interpreted as the projection of the PDF uncertainty to the space of experimental data. We note that this variance as defined in Eq. (6.8) actually corresponds to the square of the estimator \(\phi \) introduced in [14]. For a discussion of the cross term in Eq. (6.5) we refer to [222].
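The first and third lines of Eq. (6.5) can be recovered from the definitions above with a short calculation. What follows is only a sketch: the explicit expression for the cross term is inferred here by matching terms, and its precise definition and sign convention should be checked against [222]. Writing \(\varvec{g}^{(k)} - \varvec{z} = \left( \varvec{g}^{(k)} - \mathbf {E}_{\epsilon }\left[ g \right] \right) + \left( \mathbf {E}_{\epsilon }\left[ g \right] - \varvec{z}\right) \) in Eq. (6.4), the term linear in \(\varvec{g}^{(k)} - \mathbf {E}_{\epsilon }\left[ g \right] \) averages to zero over replicas, so that

$$\begin{aligned} \mathbf {E}_{\epsilon }\left[ {\chi ^2}^{(k)} \right] = \mathrm{variance} + \chi ^2\left[ \mathbf {E}_{\epsilon }\left[ \varvec{g} \right] ,\varvec{z}\right] , \end{aligned}$$

which is the third line of Eq. (6.5). Splitting further \(\mathbf {E}_{\epsilon }\left[ g \right] - \varvec{z} = \left( \mathbf {E}_{\epsilon }\left[ g \right] - \varvec{f}\right) - \left( \varvec{z} - \varvec{f}\right) \) and using the symmetry of \(C^{-1}\) gives

$$\begin{aligned} \chi ^2\left[ \mathbf {E}_{\epsilon }\left[ \varvec{g} \right] ,\varvec{z}\right] = \mathrm{bias} + \mathrm{noise} - {{2}\over {N_\mathrm{dat}}} \sum _{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - f\right) _i C^{-1}_{ij} \left( z - f\right) _j\, , \end{aligned}$$

so that, under this reading, the last term plays the role of the cross term in the first line of Eq. (6.5).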

Since the variance can be determined purely from the model predictions and the experimental covariance matrix, it can also be calculated for fits to real experimental data. This is in contrast to the noise Eq. (6.6) and bias Eq. (6.7), which depend on the true law \(\varvec{f}\) and hence can only be evaluated in closure tests. It is also important to note here that both variance and bias can be computed without using any knowledge of statistical fluctuations that enter closure tests.

One can observe that the second line of the decomposition of the error function in Eq. (6.5) is expressed as the sum of the noise, the variance, and of \(\Delta _{\chi ^2}\). This last quantity was introduced in [14] and is defined as the difference between the \(\chi ^2\) evaluated from comparing the expectation value of the model predictions \(\mathbf {E}_{\epsilon }\left[ g \right] \) and the level one data \(\varvec{z}\), that is \(\chi ^2\left[ \mathbf {E}_{\epsilon }\left[ \varvec{g} \right] ,\varvec{z}\right] \), and the \(\chi ^2\) evaluated between the underlying observable values \(\varvec{f}\) and the same level one data, that is \(\chi ^2 \left[ \varvec{f},\varvec{z} \right] \). We note that the latter coincides with the noise in Eq. (6.6). Here we slightly redefine \(\Delta _{\chi ^2}\) as compared to [14] by normalizing by the number of data points, such that

$$\begin{aligned} \begin{aligned} \Delta _{\chi ^2}&\equiv \chi ^2\left[ \mathbf {E}_{\epsilon }\left[ \varvec{g} \right] ,\varvec{z}\right] - \chi ^2 \left[ \varvec{f},\varvec{z} \right] \\&= \chi ^2 - \mathrm{noise}\, . \end{aligned} \end{aligned}$$

With this definition, constant values of \(\Delta _{\chi ^2}\) define elliptical contours in data space centered on the pseudodata Eq. (6.1).

The value of \(\Delta _{\chi ^2}\) can be interpreted as a qualitative measure of over- or under-fitting, when it is evaluated on data included in the fit. In particular, \(\Delta _{\chi ^2} = 0\) defines a contour which is centered on the fitted level one data and passes through the underlying observables. If \(\Delta _{\chi ^2} < 0\) then the expectation value of the model predictions fit the level one data better than the underlying observables: this then suggests an overfitting of the shift \(\varvec{\eta }\). Similarly, \(\Delta _{\chi ^2} > 0\) indicates underfitting of the level one data. As discussed in Ref. [222] however, the replica distribution can be perfectly sampled from the posterior distribution in model space and \(\Delta _{\chi ^2}\) can still be negative. The overall shift of the PDF predictions is thus not an issue as long as the uncertainties account for it. The bottom line is that finding values of \(\Delta _{\chi ^2} \le 0\) in the closure test remains acceptable provided their magnitude is sufficiently small, which would indicate some combination of a smaller correlation with the level one data and a smaller bias. Assuming that in such a case one finds that the PDF uncertainties are faithful, this result can be interpreted as passing the closure test.
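The estimators of Eqs. (6.6) to (6.9) are simple quadratic forms and can be sketched in a few lines of NumPy. In the example below the replica predictions `g` are a hypothetical stand-in for the output of a fit (points scattered between the truth and the level-one data), and the covariance matrix is a toy diagonal matrix; only the estimator definitions follow the text. The final assertion checks numerically that the second line of Eq. (6.5) holds when expectation values are replaced by sample means over replicas.

```python
import numpy as np

def chi2(a, b, C_inv):
    """Chi-squared per data point between two vectors of predictions or data."""
    r = a - b
    return r @ C_inv @ r / r.size

def closure_estimators(g, f, z, C_inv):
    """g: replica predictions (n_rep, n_dat); f: true values; z: level-one data."""
    g_mean = g.mean(axis=0)                                       # E_eps[g]
    noise = chi2(f, z, C_inv)                                     # Eq. (6.6)
    bias = chi2(g_mean, f, C_inv)                                 # Eq. (6.7)
    variance = np.mean([chi2(g_k, g_mean, C_inv) for g_k in g])   # Eq. (6.8)
    delta_chi2 = chi2(g_mean, z, C_inv) - noise                   # Eq. (6.9)
    mean_chi2 = np.mean([chi2(g_k, z, C_inv) for g_k in g])       # Eq. (6.4)
    return dict(noise=noise, bias=bias, variance=variance,
                delta_chi2=delta_chi2, mean_chi2=mean_chi2)

# Toy inputs (all hypothetical): 6 data points, 50 replicas.
rng = np.random.default_rng(2)
n_dat, n_rep = 6, 50
C = np.diag(rng.uniform(0.5, 1.5, n_dat))
C_inv = np.linalg.inv(C)
f = np.ones(n_dat)
z = f + rng.multivariate_normal(np.zeros(n_dat), C)
# Stand-in for fitted predictions: replicas scattered around a point between f and z.
g = 0.5 * (f + z) + 0.3 * rng.normal(size=(n_rep, n_dat))

est = closure_estimators(g, f, z, C_inv)
# Second line of Eq. (6.5): exact up to floating-point rounding for sample means.
assert np.isclose(est["mean_chi2"],
                  est["noise"] + est["variance"] + est["delta_chi2"])
print({name: round(float(value), 4) for name, value in est.items()})
```

Note that only the variance can be evaluated without knowledge of `f`, which is why it is the one estimator also available in fits to real data.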

In summary, the closure tests provide us with indicators that allow us to assess whether PDF uncertainties are faithful, and furthermore how close the fit is to the truth, i.e. whether the final result is an optimal fit, or an over- or under-fit. This provides a criterion for comparing methodologies: given two methodologies that both produce a faithful result, an over- or under-fitted methodology is disfavored in comparison to one that leads to a proper fit. We now turn to our main indicator for faithfulness, the bias-to-variance ratio.

The bias-to-variance ratio for closure tests

In the context of a closure test fit, the experimental central values (or level one data) defined in Eq. (6.1) are viewed as stochastic variables. When one performs fits to experimental data, \(\varvec{z}\) is fixed at the published central value, which will be to some extent shifted from the true observable value due to the experimental uncertainties. However, in closure fits we are free to generate several instances of the shift \(\varvec{\eta }\), and we use this feature to design our estimators: these instances would correspond to “runs of the universe” in the real world.

Considering the data which are included in the fit, the bias Eq. (6.7) is potentially driven by two methodology-related features which we are aiming to validate with the closure test. The first mechanism is broadly described as under-fitting, and covers inflexibility of the model or inability of the optimization algorithm to sufficiently minimize the cost function. The second mechanism would be over-fitting of the level one shift, which means that the central value of the observables is systematically shifted towards the level one data by an amount that is not properly accounted for by the PDF uncertainties, which are thus underestimated. Note that in order for the testing of these effects to be nontrivial it is necessary to select the underlying truth as sufficiently flexible and in a model-independent way.

Due to its dependence on the shift vector, \(\varvec{\eta }\), \(\Delta _{\chi ^2}\) is a stochastic variable. In order to characterize the regime our model is in, we need to understand its probability distribution, rather than computing a single instance of it. For this purpose, we run multiple closure fits, each time with different shifts; we then reconstruct the distribution, and determine the expectation value of \(\Delta _{\chi ^2}\) across fits. It is worth noting that, compared to previous NNPDF studies, a study using multiple full replica closure fits has only been made possible by the computational speed-up from the deployment of state-of-the-art machine learning algorithms detailed in Sect. 3. Results for the distribution of the \(\Delta _{\chi ^2}\) estimator over fits are presented in Sect. 6.1.4.

The main question to be addressed by the closure test is whether the uncertainty of the PDFs, represented by an ensemble of PDF replicas, is a faithful propagation of the data uncertainty into the space of PDFs. In the context of running multiple closure fits this question can be answered either by looking at the PDFs directly (as was done in Ref. [14]), or by looking at predictions for physical observables obtained using these PDFs. The latter choice offers the distinct advantage that the space of physical observables always has a finite dimension, equal to the number of data points for which predictions are computed. In order for the test to be nontrivial, we choose to evaluate the estimators on data which were not included in the fit, so that we are assessing whether uncertainties are faithful on new observables.

From a Bayesian perspective, the PDF replicas obtained from a fit to a given set of data can be treated as a sample from the prior model distribution for data which was not used in that fit, similarly to the concept of Bayesian reweighting [155, 156]. For the present study, we will perform fits on a subset of the full NNPDF4.0 dataset and then calculate the estimators discussed below on some test data which were not included in each fit.

In order to evaluate the faithfulness of the PDF uncertainties, one can first take the expectation of the bias across fits with different shifts in Eq. (6.1), namely

$$\begin{aligned} \mathbf {E}_{\eta }\left[ \mathrm{bias} \right]= & {} {{1}\over {N_\mathrm{dat}}} \mathbf {E}_{\eta }\left[ \sum _{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - f\right) _iC^{-1}_{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - f\right) _j \right] \nonumber \\= & {} {{1}\over {N_\mathrm{dat}}} \mathrm {tr}\left( \Sigma ^\mathrm{bias}C^{-1}_{}\right) , \end{aligned}$$

where the subscript \(\eta \) in \(\mathbf {E}_{\eta }\left[ \cdot \right] \) indicates that we are averaging over fits with different level-one shifts \(\varvec{\eta }\). In Eq. (6.10) we introduced \(\Sigma ^\mathrm{bias}\), the covariance matrix of the difference between the central value of the predictions and the true observable values estimated from the sample of fits,

$$\begin{aligned} \Sigma ^\mathrm{bias}\equiv \mathbf {E}_{\eta }\left[ \left( \mathbf {E}_{\epsilon }\left[ g \right] - f\right) \left( \mathbf {E}_{\epsilon }\left[ g \right] - f\right) ^T \right] \, . \end{aligned}$$

The expectation of the bias across fits is then the expected distance between the central predictions and the true values in units of the covariance matrix averaged across all data. If the fluctuations over fits reproduce the experimental covariance C exactly, then the estimator defined in Eq. (6.10) should be equal to one.

Similarly, we can take the expectation value of the variance across fits with different shifts Eq. (6.1),

$$\begin{aligned}&\mathbf {E}_{\eta }\left[ \mathrm{variance} \right] \nonumber \\&= {{1}\over {N_\mathrm{dat}}} \mathbf {E}_{\eta }\left[ \mathbf {E}_{\epsilon }\left[ \sum _{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - g^{(k)}\right) _{i} C^{-1}_{ij} \left( \mathbf {E}_{\epsilon }\left[ g \right] - g^{(k)}\right) _{j} \right] \right] \nonumber \\&= {{1}\over {N_\mathrm{dat}}} \mathbf {E}_{\eta }\left[ \mathrm {tr}\left( \Sigma ^\mathrm{var}C^{-1}_{}\right) \right] , \end{aligned}$$

where, in analogy to Eqs. (6.10) and (6.11), we have introduced \(\Sigma ^\mathrm{var}\), the covariance of the fitted model predictions about their central value,

$$\begin{aligned} \Sigma ^\mathrm{var}\equiv \mathbf {E}_{\epsilon }\left[ \left( \mathbf {E}_{\epsilon }\left[ g \right] - g^{(k)}\right) \left( \mathbf {E}_{\epsilon }\left[ g \right] - g^{(k)}\right) ^T \right] \, . \end{aligned}$$

Since it is independent of the shift \(\varvec{\eta }\), \(\Sigma ^\mathrm{var}\) is expected to be constant across fits. However, in practice we prefer to take the expectation value across fits, since there are sure to be fluctuations in the variance due to the finite number of replicas in each fit.

We can then interpret the expectation of the variance across fits, Eq. (6.12), to be the uncertainty of the predictions propagated from PDFs when averaged across all data in units of the experimental covariance matrix. If the uncertainty associated to the PDF replicas is faithful, the bias-to-variance ratio (averaged over fits) is

$$\begin{aligned} {{\mathbf {E}_{\eta }\left[ \mathrm{bias} \right] }\over {\mathbf {E}_{\eta }\left[ \mathrm{variance} \right] }} = 1\, , \end{aligned}$$

i.e. the average distance between the central prediction from the replicas and the true value is of the same order as the variance across replicas. We note that both bias and variance are squared quantities and so in practice we shall instead consider the square root of the ratio,

$$\begin{aligned} \mathcal {R}_{bv}\equiv \sqrt{{{\mathbf {E}_{\eta }\left[ \mathrm{bias} \right] }\over {\mathbf {E}_{\eta }\left[ \mathrm{variance} \right] }}}. \end{aligned}$$

The bias-to-variance ratio Eq. (6.15) is somewhat coarse: it checks that the mean-square difference between central predictions and underlying law is the same as the mean-square difference between replica predictions and their central values. The value of \(\mathcal {R}_{bv}\) is a measure of how much the uncertainty has been over- or under-estimated, e.g., the uncertainty for a given fit is, on average, over- or under-estimated by a factor of \(1/\mathcal {R}_{bv}\).
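The logic of the multi-fit bias-to-variance test can be illustrated on a toy problem where every fit is solved analytically. In the sketch below a linear model fitted by generalized least squares stands in for the neural network; this substitution, the polynomial design matrices, the diagonal covariances, and all sizes are assumptions made only for illustration. For such a linear model the replica spread reproduces the spread of central values over fits, so \(\mathcal {R}_{bv}\) evaluated on unseen test points should come out close to one.

```python
import numpy as np

rng = np.random.default_rng(1)
n_fit, n_rep, n_tr, n_te, n_par = 400, 100, 30, 20, 3

# Hypothetical linear "theory": predictions are A @ theta.
x_tr = np.linspace(0.0, 1.0, n_tr)
x_te = np.linspace(0.05, 0.95, n_te)
A_tr = np.vander(x_tr, n_par, increasing=True)    # columns 1, x, x^2
A_te = np.vander(x_te, n_par, increasing=True)
theta_true = np.array([1.0, -0.5, 2.0])
f_tr, f_te = A_tr @ theta_true, A_te @ theta_true

# Diagonal covariances for fitting and testing data, chosen only for brevity.
sig_tr = np.full(n_tr, 0.1)
sig_te = np.full(n_te, 0.1)
C_te_inv = np.diag(1.0 / sig_te**2)

# Generalized least-squares solution operator: theta_hat = H @ y.
W = np.diag(1.0 / sig_tr**2)
H = np.linalg.solve(A_tr.T @ W @ A_tr, A_tr.T @ W)

# Level-one shifts (one per fit) and level-two noise (one per replica).
eta = rng.normal(size=(n_fit, 1, n_tr)) * sig_tr
eps = rng.normal(size=(n_fit, n_rep, n_tr)) * sig_tr
y = f_tr + eta + eps                               # (n_fit, n_rep, n_tr)

g = (y @ H.T) @ A_te.T                             # test predictions (n_fit, n_rep, n_te)
g_mean = g.mean(axis=1)                            # E_eps[g] for each fit

def quad(d):
    """d^T C^-1 d / N_dat, evaluated over the last axis of d."""
    return np.einsum("...i,ij,...j->...", d, C_te_inv, d) / n_te

bias = quad(g_mean - f_te)                              # Eq. (6.7), one value per fit
variance = quad(g - g_mean[:, None, :]).mean(axis=1)    # Eq. (6.8), one value per fit
R_bv = np.sqrt(bias.mean() / variance.mean())           # Eq. (6.15)
print(f"R_bv = {R_bv:.3f}")
```

Rescaling the replica noise `eps` by a factor smaller than one would mimic underestimated uncertainties and push `R_bv` above one.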

This measure can be made more fine-grained in two different ways. First, one can evaluate Eq. (6.15) separately for specific subsets or groups of processes, in addition to the total dataset: this then effectively tests faithfulness for different PDFs or different kinematic regions, namely, those to which the specific chosen processes are most sensitive. Second, one can view the bias and variance as measures of one-sigma deviations, and extend them to generic quantile statistics measures, as we now discuss.

Quantile statistics in PDF and data space

In order to demonstrate that the PDF uncertainties were faithfully estimated, in the NNPDF3.0 closure test studies estimators \(\xi _{1\sigma }\), \(\xi _{2\sigma }\), etc. were defined, which provide the fraction of fits for which the input PDF falls within one-sigma, two-sigma, etc. intervals of the central PDF, averaged over PDF flavors and values of x, where the standard deviation is estimated as usual from the ensemble of PDF replicas. Specifically, the definition of these estimators was the following:

$$\begin{aligned}&\xi _{n\sigma }^\mathrm{(pdf)} = {{1}\over {n_\mathrm{flav}}}{{1}\over {n_x}}{{1}\over {n_\mathrm{fit}}} \sum _{i=1}^{n_\mathrm{flav}}\sum _{j=1}^{n_x} \sum _{l=1}^{n_\mathrm{fit}} \nonumber \\&\quad I_{\left[ -n\sigma ^{i(l)}(x_j), n\sigma ^{i(l)}(x_j)\right] } \left( \mathbf {E}_{\epsilon }\left[ q^{i(l)}(x_j) \right] - q_\mathrm{in}^i(x_j) \right) ,\nonumber \\ \end{aligned}$$

where \(I_A(x)\) denotes the indicator function of the interval A: it is only non-zero, and equal to one, if its argument lies in the interval A, while it vanishes for all other values of its argument. Here \(q_\mathrm{in}^i\) indicates the true value of the i-th flavor PDF used to generate the pseudodata and \(q^{i(l)}\) the corresponding fitted PDF from the l-th fit, and where both PDFs are evaluated at the input parametrization scale \(Q_0\). The average is carried out over the \(n_\mathrm{flav}\) non-zero flavors at \(Q_0\) over a grid \(\{ x_j\}\) with \(n_x\) nodes. Finally, \(\sigma ^{i(l)}(x_j)\) is the standard deviation of the replicas of the l-th fit for flavor i estimated at \(x_j\) from the fitted replica distribution.

The estimators defined in Eq. (6.16) can be evaluated in closure test fits, which reproduce the methodology of an actual fit and are thus where the replica distribution should give faithful uncertainties. For a successful closure test one should find that \(\xi _{1\sigma }\simeq 0.68\) if the PDF uncertainties are correctly estimated. An important caveat here is that one relies on the assumption that both the PDF replicas and the expectation values of the PDFs across fits are distributed normally. This assumption holds by construction for the closure test data Eqs. (6.1, 6.2), so for PDFs it likely only holds in the region where the PDFs are constrained by the normally distributed data. The measure Eq. (6.16) is thus only significant if computed for well constrained PDFs \(q^{i(l)}(x_j)\): it can then be defined by choosing a suitable sampling of PDFs in the relevant region.
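The PDF-space estimator of Eq. (6.16) is an average of indicator functions, which makes it straightforward to sketch with array operations. In the example below the array layout `(n_fit, n_rep, n_flav, n_x)`, the function name `xi_n_sigma`, and the synthetic Gaussian ensemble are all illustrative assumptions; in particular, the points are generated independently, whereas real PDF values at neighboring x are correlated.

```python
import numpy as np

def xi_n_sigma(q, q_in, n=1):
    """
    Fraction of (flavor, x, fit) triples for which the central fitted PDF
    lies within n replica standard deviations of the input PDF, Eq. (6.16).

    q    : fitted PDF replicas, shape (n_fit, n_rep, n_flav, n_x)
    q_in : input (true) PDFs,   shape (n_flav, n_x)
    """
    central = q.mean(axis=1)           # E_eps[q], shape (n_fit, n_flav, n_x)
    sigma = q.std(axis=1, ddof=1)      # replica standard deviation per fit
    inside = np.abs(central - q_in) <= n * sigma   # indicator function
    return inside.mean()

# Synthetic, faithful-by-construction ensemble (hypothetical numbers).
rng = np.random.default_rng(3)
n_fit, n_rep, n_flav, n_x = 200, 40, 6, 4
q_in = rng.uniform(0.5, 1.5, size=(n_flav, n_x))
s = 0.1                                # common one-sigma uncertainty
centers = q_in + s * rng.normal(size=(n_fit, 1, n_flav, n_x))
q = centers + s * rng.normal(size=(n_fit, n_rep, n_flav, n_x))

xi_1 = xi_n_sigma(q, q_in, n=1)
xi_2 = xi_n_sigma(q, q_in, n=2)
print(f"xi_1sigma = {xi_1:.3f}, xi_2sigma = {xi_2:.3f}")
```

Because the central values here scatter around the truth with the same width as the replicas, the one-sigma fraction should come out near 0.68 up to finite-sample effects.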

One can also define an analogous estimator, now in the space of experimental data as opposed to the PDF-space definition of Eq. (6.16), as follows

$$\begin{aligned} \xi ^{(\mathrm{data})}_{n\sigma } = {{1}\over {N_\mathrm{dat}}} {{1}\over {n_\mathrm{fit}}} \sum _{i=1}^{N_\mathrm{dat}} \sum _{l=1}^{n_\mathrm{fit}} I_{[-n\sigma _i^{(l)}, n\sigma _i^{(l)}]} \left( \mathbf {E}_{\epsilon }\left[ g_i \right] ^{(l)} - f_i \right) , \end{aligned}$$

where \(\sigma ^{(l)}_i\) is the standard deviation (PDF uncertainty) of the theory predictions for the i-th observable estimated from the \(n_\mathrm{rep}\) replicas of the l-th fit. Here, if the test is performed by computing the estimator for data not used for PDF fitting, in order to make sure that the Gaussianity assumption holds one must choose testing data which are sensitive to PDF combinations and kinematic regions that are well constrained by the fitting data.

This \(\xi ^{(\mathrm{data})}_{n\sigma }\) estimator provides the desired generalization to quantile statistics of the bias-to-variance ratio \(\mathcal {R}_{bv}\). To see this, note first that we can calculate \(\xi ^{(\mathrm{data})}_{n\sigma }\) in different bases and that, unlike \(\chi ^2\) or other quantities with bilinear forms, \(\xi ^{(\mathrm{data})}_{n\sigma }\) is not basis independent. Then, in order to compare \(\xi ^{(\mathrm{data})}_{n\sigma }\) to \(\mathcal {R}_{bv}\), we compute \(\xi ^{(\mathrm{data})}_{1\sigma }\) in the basis which diagonalizes the experimental covariance matrix. The sum across data points then becomes the sum across eigenvectors of the experimental covariance matrix.

In this basis, one can then evaluate [222] Eq. (6.17) by means of the approximation

$$\begin{aligned} \xi _{n\sigma }^{(\mathrm{data})} \approx \mathrm{erf}\left( {{n \mathcal {R}_{bv}}\over {\sqrt{2}}}\right) , \end{aligned}$$

which is the standard result of integrating a Gaussian over some finite symmetric interval, assuming that the ratio of uncertainties is approximately constant across all eigenvectors of the experimental covariance matrix. Clearly, if the distribution of central predictions about the underlying law matches the distribution of the replica predictions around the central predictions (\(\mathcal {R}_{bv}\simeq 1\)), then the expected value of \(\xi _{1\sigma }^{(\mathrm{data})}\) is 0.68. This shows that the bias-to-variance ratio tests for accuracy of quantile statistics, just like the estimator Eq. (6.17), and its counterpart in PDF space Eq. (6.16).
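The data-space estimator of Eq. (6.17), evaluated in the basis that diagonalizes the experimental covariance matrix, can be sketched as follows. The function name `xi_data`, the toy matrices `C` and `Sigma`, and the synthetic ensemble are assumptions for illustration. The ensemble is constructed to be faithful, so only the case \(\mathcal {R}_{bv}\simeq 1\) is probed here, for which the expected one-sigma fraction is \(\mathrm{erf}(1/\sqrt{2})\simeq 0.68\).

```python
import math
import numpy as np

def xi_data(g, f, C, n=1):
    """
    Quantile estimator of Eq. (6.17) in the eigenbasis of C.

    g : predictions, shape (n_fit, n_rep, n_dat)
    f : true observable values, shape (n_dat,)
    C : experimental covariance matrix, shape (n_dat, n_dat)
    """
    _, V = np.linalg.eigh(C)           # columns of V are eigenvectors of C
    g_rot = g @ V                      # components along each eigenvector
    f_rot = f @ V
    central = g_rot.mean(axis=1)       # E_eps[g] per fit, rotated
    sigma = g_rot.std(axis=1, ddof=1)  # replica standard deviation, rotated
    return (np.abs(central - f_rot) <= n * sigma).mean()

# Synthetic faithful ensemble (hypothetical): central values scatter around f
# with the same covariance Sigma that the replicas have around each central value.
rng = np.random.default_rng(4)
n_fit, n_rep, n_dat = 1000, 100, 10
B = rng.normal(size=(n_dat, n_dat))
C = B @ B.T / n_dat + 0.1 * np.eye(n_dat)          # toy experimental covariance
D = rng.normal(size=(n_dat, n_dat))
Sigma = 0.2 * (D @ D.T / n_dat + 0.1 * np.eye(n_dat))   # toy PDF-induced covariance
f = np.ones(n_dat)
zero = np.zeros(n_dat)
centers = f + rng.multivariate_normal(zero, Sigma, size=(n_fit, 1))
g = centers + rng.multivariate_normal(zero, Sigma, size=(n_fit, n_rep))

xi_1 = xi_data(g, f, C, n=1)
expected = math.erf(1.0 / math.sqrt(2.0))          # value for R_bv = 1
print(f"xi_1sigma(data) = {xi_1:.3f}, erf(1/sqrt(2)) = {expected:.3f}")
```

The rotation matters only when the ratio of uncertainties differs between eigenvectors; for the faithful ensemble used here any orthogonal basis gives a compatible result.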

Once again, note that the computation of the estimators Eqs. (6.16, 6.17) requires running multiple replica closure fits based on different underlying data Eq. (6.1). This, as mentioned, is only possible now, thanks to the much greater computational efficiency of the current methodology. Indeed, in Ref. [14] the estimator Eq. (6.16) was only evaluated approximately, based on a single closure test run and a suitable approximation. We have in fact now verified a posteriori that the approximation of Ref. [14] is reasonably accurate, but only now is it possible to compute the estimator exactly.

6.1.3 Closure test settings

Fig. 24

The replica (solid green line) chosen as the true underlying PDF \(\varvec{f}\) for the closure test: the gluon (left) and quark singlet (right) are displayed. The NNPDF4.0 central value and 68% confidence interval (same as in Fig. 17) are also shown for reference

We have performed a closure test taking as the input PDF set, used to produce the true observable values \(\varvec{f}\), a specific replica randomly selected out of the \(N_\mathrm{rep}\) replicas of the NNPDF4.0 NNLO global determination. The reason for this choice is that, on the one hand, it automatically satisfies known theoretical constraints, such as the sum rules of Sect. 3.1.2. On the other hand, thanks to it being randomly selected out of a replica sample, it satisfies the criteria of flexibility and model-independence of Sect. 6.1.2. In particular, individual replicas generally have more structure than the final central PDF, so by choosing a replica, rather than the central fit from either NNPDF or any other PDF set, we are making the closure test somewhat more stringent. The specific replica that we chose is shown in Fig. 24 (gluon and quark singlet), together with the NNPDF4.0 central value and uncertainty.

We have produced \(n_\mathrm{fit}=25\) sets of data Eq. (6.1), each of which has been used to produce a fit with \(n_\mathrm{rep}=40\) replicas. Results are then bootstrapped [223, 224] in order to improve stability. We have checked that, upon increasing the number of replicas or the number of fits, the results are unchanged within the bootstrap uncertainty. The fits are produced using the NNPDF3.1-like dataset discussed in Sect. 2.1.
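A bootstrap over both fits and replicas can be sketched as below. This is a generic two-level resampling scheme written under our own assumptions; it is not claimed to reproduce the procedure of Refs. [223, 224] or the actual NNPDF implementation. The function names, the number of bootstrap samples, and the synthetic ensemble are hypothetical, while the values \(n_\mathrm{fit}=25\) and 40 replicas per fit follow the text.

```python
import numpy as np

def ratio_bv(g, f, C_inv):
    """R_bv of Eq. (6.15) from predictions g of shape (n_fit, n_rep, n_dat)."""
    def quad(d):
        return np.einsum("...i,ij,...j->...", d, C_inv, d) / f.size
    g_mean = g.mean(axis=1)
    bias = quad(g_mean - f).mean()                  # average bias over fits
    variance = quad(g - g_mean[:, None, :]).mean()  # average variance over fits
    return np.sqrt(bias / variance)

def bootstrap_ratio_bv(g, f, C_inv, n_boot=200, seed=0):
    """Mean and standard deviation of R_bv over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n_fit, n_rep, _ = g.shape
    samples = np.empty(n_boot)
    for b in range(n_boot):
        fits = rng.integers(n_fit, size=n_fit)            # resample fits
        reps = rng.integers(n_rep, size=(n_fit, n_rep))   # resample replicas per fit
        g_b = g[fits[:, None], reps, :]                   # (n_fit, n_rep, n_dat)
        samples[b] = ratio_bv(g_b, f, C_inv)
    return samples.mean(), samples.std(ddof=1)

# Synthetic faithful ensemble with the sizes quoted in the text.
rng = np.random.default_rng(5)
n_fit, n_rep, n_dat = 25, 40, 8
C = np.diag(rng.uniform(0.5, 1.5, n_dat))      # toy experimental covariance
C_inv = np.linalg.inv(C)
Sigma = 0.5 * C                                # toy PDF-induced covariance
f = np.zeros(n_dat)
zero = np.zeros(n_dat)
centers = f + rng.multivariate_normal(zero, Sigma, size=(n_fit, 1))
g = centers + rng.multivariate_normal(zero, Sigma, size=(n_fit, n_rep))

r_mean, r_err = bootstrap_ratio_bv(g, f, C_inv)
print(f"R_bv = {r_mean:.2f} +/- {r_err:.2f}")
```

The spread `r_err` gives an estimate of how much the ratio would fluctuate with a different finite set of fits and replicas.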

Data space estimators, such as the bias-to-variance ratio \(\mathcal {R}_{bv}\), are evaluated on all data that enter the NNPDF4.0 determination and that were not already used for fitting. An advantage of this choice is that the kinematic coverages of the fitting dataset and the testing dataset are then reasonably similar, thus ensuring Gaussianity, as discussed above.

In PDF space, we perform tests for PDFs in the evolution basis at the PDF parametrization scale and over a grid of x points, chosen for the gluon and singlet as logarithmically spaced for \(10^{-3}< x < 0.1\) and linearly spaced for \(0.1< x <0.5\), and for nonsinglet quark distributions V, \(V_3\), \(T_3\), and \(T_8\) as purely linearly spaced for \(0.1< x <0.5\). We do not consider the \(V_8\) and \(T_{15}\) nonsinglet combinations, which are too noisy at the initial scale. Furthermore, we evaluate \(\xi _{1\sigma }\) in Eq. (6.16) with \(n_x=4\) to reduce the correlations between points, and we also rotate into the basis which diagonalizes the covariance estimated on the PDF replicas as an extra precaution.

6.1.4 Validation of the NNPDF4.0 methodology

We now turn to the validation of the NNPDF4.0 methodology. First of all, we evaluate the expectation value of \(\Delta _{\chi ^2}\), Eq. (6.9), over the \(n_\mathrm{fit}\) fits that constitute the NNPDF4.0 closure tests and present in Table 24 the results separated into groups of datasets. As mentioned, the input dataset is NNPDF3.1-like. One can observe how \(\mathbf {E}_{\eta }\left[ \Delta _{\chi ^2} \right] < 0\) for all datasets considered, indicating the absence of under-fitting. Furthermore, its small absolute magnitude, typically at the per-mille level or at most being a couple of percent, corresponds to a negligible amount of overfitting, and it is thus consistent with proper learning.

Table 24 The expectation value of \(\Delta _{\chi ^2}\), Eq. (6.9), evaluated over the \(n_\mathrm{fit}\) fits that constitute the NNPDF4.0 closure test. Results are presented separated into different processes

We now turn to the bias-to-variance ratio \(\mathcal {R}_{bv}\), Eq. (6.15), which is shown in Table 25, evaluated for testing datasets that were not used as input to the closure test fits, with results divided by groups of processes. The fitting set used to evaluate Table 24 and the testing set shown here together add up to the complete NNPDF4.0 baseline dataset. The last column indicates the uncertainty on \(\mathcal {R}_{bv}\), determined as its standard deviation over a bootstrap sample of both fits and replicas.

For the total testing set, it is found that \(\mathcal {R}_{bv}\simeq 1\) within the bootstrap error, demonstrating the faithfulness of the estimated PDF uncertainties.

Table 25 The bias-to-variance ratio \(\mathcal {R}_{bv}\), Eq. (6.15), divided by groups of processes and evaluated for the testing datasets that were not used as input to the NNPDF4.0 closure test fits. The last column indicates the uncertainty associated to \(\mathcal {R}_{bv}\), determined as its standard deviation over a bootstrap sample of both fits and replicas

In order to gain some more understanding of the results from Table 25, it is instructive to plot the full distributions of both the total bias, Eq. (6.7), and of the total variance, Eq. (6.8), over the \(n_\mathrm{fit}\) fits constituting the NNPDF4.0 closure tests. From these two distributions, displayed in Fig. 25, one can observe that not only are their means consistent, but also that they exhibit a similar shape. The only difference is that the distribution over the variances is somewhat broader, with a small tail towards large values of the estimator. Since each of the \(n_\mathrm{fit}\) fits has 40 replicas, one expects better statistics in the distribution over variances as compared to that over biases, which is why the tail of the former is better sampled. Furthermore, we checked that the results in Table 25 are stable within the bootstrap uncertainty upon removing selected fits and replicas, and hence we are confident that the results are not subject to finite size effects.

Fig. 25

The normalized distributions of the total bias, Eq. (6.7), and of the total variance, Eq. (6.8), over the \(n_\mathrm{fit}\) fits constituting the NNPDF4.0 closure tests. The square root of the ratio of the means of these two distributions defines \(\mathcal {R}_{bv}\), the bias-to-variance ratio

Table 26 The one-sigma quantile estimator in the space of experimental data, \(\xi ^{(\mathrm{data})}_{1\sigma }\), Eq. (6.17), evaluated for the same testing dataset as used for Table 25, together with the corresponding bootstrap error. For each group of processes, we also display the value of \(\mathrm{erf}(\mathcal {R}_{bv}/\sqrt{2})\) evaluated using the corresponding bias-to-variance ratio

The fact that the bias-to-variance ratio satisfies \(\mathcal {R}_{bv}\simeq 1\) both for the total testing dataset and at the level of groups of processes indicates that the PDF uncertainties in the NNPDF4.0 methodology are being faithfully estimated. Further confirmation of this property can be obtained by evaluating the quantile estimators in both PDF and data space, respectively defined in Eqs. (6.16, 6.17). First of all, Table 26 displays the one-sigma quantile estimator in the space of experimental data, \(\xi ^{(\mathrm{data})}_{1\sigma }\), evaluated for the same testing dataset as that used for Table 25, together with the corresponding bootstrap error. In addition, we also indicate the value of \(\mathrm{erf}(\mathcal {R}_{bv}/\sqrt{2})\) evaluated using the corresponding bias-to-variance ratio. As indicated by Eq. (6.18), for a successful closure test one expects that these two quantities coincide, that is, \(\xi ^{(\mathrm{data})}_{1\sigma }\simeq \mathrm{erf}(\mathcal {R}_{bv}/\sqrt{2})\).

It is clear that \(\xi ^{(\mathrm{data})}_{1\sigma }\) and \(\mathrm{erf}(\mathcal {R}_{bv}/\sqrt{2})\) agree well with each other within the bootstrap error, which provides a non-trivial consistency test. Furthermore, \(\xi ^{(\mathrm{data})}_{1\sigma } = 0.68\) for the total dataset as expected, with reasonable fluctuations between different process types. The observed deviations between the two indicators may be explained by quantile statistics being more robust to outliers, or by the fact that the value of \(\mathrm{erf}(\mathcal {R}_{bv}/\sqrt{2})\) can be dominated by a few eigenvectors of the experimental covariance matrix.

In order to provide a graphical representation of the information contained in Table 26, it is instructive to evaluate the difference between the mean value (over replicas) of the theory predictions and the corresponding truth observable values normalized by the PDF uncertainties, that is

$$\begin{aligned} \delta _i^{(l)} \equiv \frac{\mathbf {E}_{\epsilon }\left[ g_i \right] ^{(l)} - f_i}{\sigma _i^{(l)}} \, ,\qquad i=1,\ldots ,N_\mathrm{dat} \, ,\qquad l=1,\ldots ,n_\mathrm{fit} \, . \end{aligned}$$

The normalized distribution of these relative differences \(\delta _i^{(l)}\) is displayed in the left panel of Fig. 26 together with a univariate zero-mean Gaussian for reference. The fraction of the histogram entries which fall inside the 1-sigma confidence interval of the scaled Gaussian is then equal to the value of the total \(\xi ^{(\mathrm{data})}_{1\sigma }\) displayed in Table 26.
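The construction of \(\delta _i^{(l)}\) and of the one-sigma fraction can be sketched as follows. This is only an illustrative sketch: the array layout, the function names, and the use of the plain per-point standard deviation over replicas as \(\sigma _i^{(l)}\) are assumptions made here, not a description of the actual NNPDF code.

```python
import numpy as np


def normalized_differences(predictions, truth):
    """Return delta with shape (n_fit, n_dat).

    predictions: array of shape (n_fit, n_rep, n_dat), the theory
        predictions of each replica in each closure-test fit (assumed layout).
    truth: array of shape (n_dat,), the true observable values f_i.
    """
    predictions = np.asarray(predictions, dtype=float)
    central = predictions.mean(axis=1)       # mean over replicas, per fit
    sigma = predictions.std(axis=1, ddof=1)  # PDF uncertainty, per fit
    return (central - np.asarray(truth, dtype=float)) / sigma


def one_sigma_fraction(delta):
    """Fraction of entries with |delta| < 1, to be compared with 0.68."""
    return float(np.mean(np.abs(delta) < 1.0))
```

A histogram of the flattened `delta` array then plays the role of the left panel of Fig. 26, and `one_sigma_fraction(delta)` is the fraction of entries inside the one-sigma interval.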

From Fig. 26 it is apparent that the central values of the model predictions for physical observables fluctuate around the true values by an amount which is consistent with the expectations of the associated PDF uncertainties. Indeed, there is excellent agreement between the distribution of \(\delta _i^{(l)}\) and that of the reference Gaussian, consistent with the value of \(\xi ^{(\mathrm{data})}_{1\sigma }=0.68\) reported in Table 26.

Fig. 26

The normalized distribution of relative differences \(\delta _i^{(l)}\), Eq. (6.19), in data space (left) or \(\widetilde{\delta }_{i,j}^{(l)}\), Eq. (6.20), in PDF space (right). In both cases, a univariate zero-mean Gaussian distribution is plotted for reference

We now compute the quantile estimator in PDF space, defined in Eq. (6.16). This estimator, \(\xi _{n\sigma }^\mathrm{(pdf)}\), was already introduced as part of the original study in [14]. However, as mentioned, at that time it could only be evaluated approximately, since performing multiple closure test fits was computationally infeasible. The values of \(\xi _{1\sigma }^\mathrm{(pdf)}\) are presented in Table 27, along with their bootstrap error. In general, there is reasonable agreement within bootstrap errors between the computed value of \(\xi _{1\sigma }^\mathrm{(pdf)}\) and the expected value of 0.68. However, in comparison to the corresponding estimator in data space, larger fluctuations are observed, specifically for the singlet PDF \(\Sigma \), and the average value \(\xi _{1\sigma }^\mathrm{(pdf)}=0.71\pm 0.02\) is somewhat overestimated. It should be noted that the PDF-space estimator is somewhat less stable and accurate than that in data space, due to the need to pick a grid of points that corresponds to the measured region, and also because of the very high correlation between PDFs at neighboring points, which may lead to an unstable covariance matrix. The fact that the average \(\xi _{1\sigma }^\mathrm{(pdf)}\) is slightly larger than 0.68 suggests that, in any case, PDF uncertainties are conservatively estimated.

Table 27 The values of the quantile estimator in PDF space, \(\xi _{1\sigma }^\mathrm{(pdf)}\) Eq. (6.16), separated into the contributions from individual flavor combinations together with the corresponding bootstrap uncertainty

Finally, in Fig. 26 the histogram of relative differences is also shown using a PDF space definition:

$$\begin{aligned}&\widetilde{\delta }_{i,j}^{(l)}\equiv {{\left( \mathbf {E}_{\epsilon }\left[ q^{i(l)}(x_j) \right] - q_\mathrm{in}^i(x_j) \right) }\over {\sigma ^{i(l)}(x_j)}}\,,\nonumber \\&\quad i=1,\ldots ,n_\mathrm{flav} \, ,\quad j=1,\ldots ,n_{x} \, ,\quad l=1,\ldots ,n_\mathrm{fit} , \nonumber \\\end{aligned}$$

We see that, although in this case too there is excellent agreement with the expected univariate Gaussian behavior, the results are indeed rather noisier than in data space.

6.1.5 Extent and limitations of closure testing

The closure tests presented in this section are entirely successful, thereby validating the NNPDF4.0 methodology in the data region. However, it is important to understand what the closure tests do and do not verify.

The closure test, at least in its present incarnation, makes two assumptions. The first is that the underlying distribution of the experimental data is known exactly. Specifically, if the data are Gaussian, it is assumed that their distribution is unbiased and that the covariance that characterizes this multi-Gaussian distribution is fully known. In realistic situations this can of course only be approximately true, even in the best case, since the experimental covariance matrix is itself an observable which is extracted from the data, and thus it is characterized by an uncertainty on the uncertainty. Furthermore, some sources of systematic uncertainty are based on theoretical estimates and are thus subject to theoretical uncertainties which are difficult to estimate. Finally, in the worst case it may happen that some data or the associated uncertainties are simply incorrect: this would correspond to a biased distribution (wrong central value) or to an incorrect uncertainty or correlation (wrong covariance matrix).

The second assumption is that the data are obtained from the PDF using a known underlying physical law. In realistic situations this is surely not the case, since theoretical predictions are computed at finite perturbative accuracy, and thus data predictions are affected by an uncertainty corresponding at the very least to missing higher order perturbative corrections, and generally also to other possible corrections such as nuclear effects, electroweak corrections, heavy quark mass effects, limited knowledge of standard model parameters, and so on.

Therefore, the closure test presented here checks for faithfulness of the component of the PDF uncertainty which is induced by the data uncertainty, assuming the latter is perfectly known. It does not check for other sources of uncertainty, such as theory uncertainties: these would have to be added separately. A methodology for doing so was discussed and applied to missing higher order perturbative uncertainties in Refs. [23, 24], but it has not yet been implemented in a global NNLO PDF determination. Also, the closure test does not account for possible “data inconsistencies”, i.e., incorrectly estimated experimental values and uncertainties. This motivates the need to select a maximally consistent dataset, as we have done in Sect. 4, that guarantees that no major inconsistencies are present in the baseline dataset. However, remaining small inconsistencies might still lead to a certain amount of uncertainty underestimation, whose exact assessment will require performing closure tests with artificial inconsistent data.

6.2 Future testing NNPDF4.0

The closure tests presented in Sect. 6.1 allow for an assessment of the faithfulness of PDF uncertainties in the region covered by available experimental data. However, they are ill-suited for an assessment of the behavior of PDFs and their uncertainties in the extrapolation regions where little or no experimental constraints are available, for a variety of reasons, the most obvious of which is that the multi-Gaussian assumption is likely to fail outside the data region.

Hence, closure tests have limited applicability to study the generalization power of the resulting PDF fit to new, unexplored kinematic regions. A more suitable strategy to assess this generalization power is provided by the so-called “future tests” proposed in [15]. The main idea underlying future tests is that what we ask for in an extrapolation region is that PDF uncertainties correctly reflect the lack of information. Whereas in principle uncertainties are infinite in the absence of information, in practice PDF uncertainties not too far from the data region are constrained by requirements of continuity and smoothness. Whereas in the absence of direct information we cannot expect to achieve a full and detailed knowledge of the covariance between any two PDFs (or indeed any two predicted data points), we do wish for PDF uncertainties to reproduce the possible deviation of the best-fit PDFs, and of physical predictions obtained using them, from the true value that would be obtained if the PDF were known, say, as accurately as it is known in the data region.

The future test verifies explicitly whether this is the case: a “future test PDF” is determined from a restricted subset of the full dataset that only covers a limited region. This future test PDF is then used to predict all the rest of the dataset. The restricted datasets can be thought of as representative of the limited knowledge that was available at some point in the past (hence the name “future test”), but this is of course just a manner of speaking, as any partitioning of the dataset into restricted and full may be considered. Because future tests of NNPDF3.1 were never performed, here we will present future tests of both the NNPDF3.1 and NNPDF4.0 methodologies. This allows us to simultaneously validate the new methodology, and also put it in context.

6.2.1 Future testing methodology

Following the discussion in [15] we test the NNPDF3.1 and NNPDF4.0 methodologies by choosing as input specific subsets of the complete NNPDF4.0 baseline dataset, and determining corresponding PDF sets from them. The predictions obtained using these PDFs are then compared to the data not included in their fit, in order to assess whether the uncertainty in the prediction correctly accounts for the correspondingly missing information.

This is done by evaluating the \(\chi ^2\) to the datasets not used in the fit with PDF uncertainties also included along with the data uncertainties in the \(\chi ^2\) definition. Indeed, we expect in general that the \(\chi ^2\) evaluated including data uncertainties only should be larger than one, as soon as the deviation of the PDF from its true value is larger than the experimental uncertainty, which is bound to happen for sufficiently accurate data in an extrapolation region. However, if the deviation from the true value is correctly reproduced by the PDF uncertainty, the \(\chi ^2\) should then become again close to one once the PDF uncertainty is included. Note that the test is only nontrivial if the \(\chi ^2\) value before inclusion of the PDF uncertainty is significantly larger than one: otherwise, the data are not precise enough to test for faithfulness of the PDF uncertainty.

Specifically, the \(\chi ^2\) with PDF uncertainties included is computed using the covariance matrix

$$\begin{aligned} \mathrm{cov}_{ij}^\mathrm{(tot)} = \mathrm{cov}_{ij}^\mathrm{(exp)} + \mathrm{cov}_{ij}^\mathrm{(pdf)} \, , \end{aligned}$$

where \(\mathrm{cov}_{ij}^\mathrm{(exp)}\) is the usual experimental covariance matrix, while the covariance matrix \( \mathrm{cov}_{ij}^\mathrm{(pdf)}\) corresponding to PDF uncertainties can be determined as

$$\begin{aligned} \mathrm{cov}_{ij}^\mathrm{(pdf)} = \left\langle \mathcal {F}_i\mathcal {F}_j \right\rangle _\mathrm{rep} - \left\langle \mathcal {F}_i \right\rangle _\mathrm{rep}\left\langle \mathcal {F}_j \right\rangle _\mathrm{rep} \, , \end{aligned}$$

where \(\mathcal {F}_i^{(k)}\) is the i-th physical prediction found using the k-th replica of a given PDF set, and the average is performed over replicas. Simply combining the two covariance matrices according to Eq. (6.21) is justified when the corresponding sources of uncertainty are uncorrelated [24]. This is clearly the case since the experimental uncertainty on data which are not fitted is completely independent of the PDF uncertainty, as the latter is driven by the uncertainty on the fitted data.
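The construction of Eqs. (6.21, 6.22) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the array layout and the function names are hypothetical, the central prediction is taken to be the replica average, and the replica covariance uses the plain \(1/N_\mathrm{rep}\) normalization that follows from Eq. (6.22) as written.

```python
import numpy as np


def pdf_covariance(predictions):
    """cov_ij^(pdf) = <F_i F_j>_rep - <F_i>_rep <F_j>_rep.

    predictions: array of shape (n_rep, n_dat), one row per replica
        (assumed layout).
    """
    F = np.asarray(predictions, dtype=float)
    mean = F.mean(axis=0)
    return F.T @ F / F.shape[0] - np.outer(mean, mean)


def chi2_per_point(data, predictions, cov_exp, include_pdf=True):
    """chi2 per data point with cov^(exp), optionally adding cov^(pdf)."""
    F = np.asarray(predictions, dtype=float)
    cov = np.asarray(cov_exp, dtype=float)
    if include_pdf:
        # Sum of covariances, valid for uncorrelated sources of uncertainty.
        cov = cov + pdf_covariance(F)
    diff = np.asarray(data, dtype=float) - F.mean(axis=0)
    # Solve cov @ y = diff rather than forming the inverse explicitly.
    return float(diff @ np.linalg.solve(cov, diff) / diff.size)
```

Calling `chi2_per_point` with `include_pdf=False` and then with `include_pdf=True` on the same out-of-sample data gives the pair of values compared in a future test.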

6.2.2 Future testing datasets

We choose three subsets of the full NNPDF4.0 dataset, inspired by the chronological order in which actual measurements became available. They correspond approximately to a “pre-HERA” and a “pre-LHC” dataset, and to the NNPDF3.1-like dataset that was used as fitting dataset in the closure tests of Sect. 6.

They are defined as follows:

  • Pre-HERA. Only fixed-target DIS structure function data and fixed-target Drell–Yan cross-section data are included.

  • Pre-LHC. This is a superset of the pre-HERA dataset, which is extended to also include HERA collider inclusive and charm structure function data, and Tevatron W and Z production data.

  • NNPDF3.1-like. This is the dataset defined in Ref. [10] and used as fitting dataset in the closure tests presented in Sect. 6.

Fig. 27

Scatter plots comparing various future test data subsets to the full NNPDF4.0 dataset of Fig. 2. Left: comparison of the pre-HERA, pre-LHC and NNPDF4.0 datasets. Note that each dataset is a superset of the previous one, so all the pre-HERA data are included in the pre-LHC set, and all data are included in the NNPDF4.0 set. Right: the data points which are included in NNPDF4.0 but not in the NNPDF3.1-like dataset, grouped by process type

It is important to draw a distinction between the NNPDF3.1-like dataset and the other two subsets: while going from pre-HERA to pre-LHC to NNPDF4.0 consecutively adds data in new kinematic regions, going from NNPDF3.1-like to NNPDF4.0 instead adds more data points in regions for which data already exist. We can therefore think of the transition from NNPDF3.1 to NNPDF4.0 as an interpolation rather than an extrapolation. This is reflected in the scatter plots in Fig. 27, where the differences between the first two subsets and NNPDF4.0 are shown on the left, and the difference between the NNPDF3.1-like subset and NNPDF4.0 is shown on the right.

More specifically, first (left) we compare the pre-HERA, pre-LHC and full NNPDF4.0 datasets. Note that the pre-LHC dataset contains the points marked with an orange triangle as well as the pre-HERA points, and the NNPDF4.0 dataset contains all points: the three datasets are each a superset of the previous one. It is also clear that each dataset probes an increasingly wide kinematic region. Specifically, pre-HERA data are restricted to and GeV, while pre-LHC data cover the complete range of x but only GeV. Furthermore, each dataset provides an increasingly wide handle on specific PDFs or PDF combinations: for instance, pre-HERA data provide little handle on quark flavor decomposition, and pre-LHC data provide essentially no handle on the large-x gluon. The pre-HERA and pre-LHC datasets thus allow us to test for far-extrapolation (pre-HERA) and near-extrapolation (pre-LHC).

Then (right) we show all the data that are included in NNPDF4.0 but not in the NNPDF3.1-like dataset, classified by process type. In this case the kinematic region covered by the new datasets included in NNPDF4.0 essentially overlaps with that of the NNPDF3.1-like dataset, though with a lower density. Hence, in this case it is interpolation, rather than extrapolation, that is being tested.

6.2.3 Future test results

We now present results of future testing both the NNPDF4.0 and NNPDF3.1 methodologies. We first discuss the case of far and near extrapolation, namely, the pre-HERA and pre-LHC datasets. The \(\chi ^2\) values for all data, divided by process type, are collected in Table 28. For each methodology we have determined three PDF sets from the three datasets, and we show \(\chi ^2\) values obtained using each of them. The process-type classification is made in such a way that the data for any given process type are either all included or all excluded in the determination of each of the three PDF sets. When a process type is not included in the PDF determination, both the \(\chi ^2\) value without PDF uncertainty (in italic) and the \(\chi ^2\) value with PDF uncertainty (in boldface) are shown. All other \(\chi ^2\) values correspond to fitted data. We also tabulate \(\chi ^2\) values for the full set of data which in each case is not fitted, denoted as out-of-sample data.

Table 28 Values of the \(\chi ^2\) per datapoint for the total dataset and for specific process types obtained for NNPDF4.0 and for the pre-HERA and pre-LHC future test PDFs, determined using NNPDF3.1 or NNPDF4.0 methodology. All the data in each process type are either fully included or fully excluded from each PDF determination. Values in regular font correspond to fitted datasets, evaluated with the experimental covariance matrix. Values in bold or italics correspond to data that are not fitted. The value in italic is evaluated with the experimental covariance matrix, while the value in bold also includes PDF uncertainties, Eqs. (6.21, 6.22). Values of \(\chi ^2\) for the full set of data that are not fitted (denoted as total out-of-sample) are also given in each case

First, we note that the total \(\chi ^2\) for out-of-sample data is very large (of order twenty) for pre-HERA PDFs while it is moderately large (of order three) for pre-LHC PDFs. This shows that the test is nontrivial in both cases, and it indeed tests for far-extrapolation for pre-HERA and near-extrapolation for pre-LHC. A similar pattern is observed for all process types: HERA, which probes the small-x gluon; top and jets, which probe the large-x gluon; and Drell–Yan, which probes quark flavor separation.

When the PDF uncertainties are introduced, all \(\chi ^2\) values become of order one, thereby showing that the future test is successful. This is especially remarkable given that in some cases (such as HERA data or collider Drell–Yan data for the pre-HERA PDFs) the \(\chi ^2\) is reduced by almost a factor of 30. This means that the PDF uncertainty accounts for a deviation between data and theory which is over five times bigger than the data uncertainty.
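The link between a factor of about 30 in \(\chi ^2\) and a factor of about five in the deviation can be made explicit with a one-dimensional heuristic. This is only an illustrative simplification that ignores correlations between data points, not the computation used for Table 28; the symbols \(\Delta \), \(\sigma _\mathrm{exp}\) and \(\sigma _\mathrm{pdf}\) are introduced here for a single data point with data-theory shift \(\Delta \), experimental uncertainty \(\sigma _\mathrm{exp}\) and PDF uncertainty \(\sigma _\mathrm{pdf}\):

$$\begin{aligned} \chi ^2_\mathrm{exp} = \frac{\Delta ^2}{\sigma _\mathrm{exp}^2} \, ,\qquad \chi ^2_\mathrm{tot} = \frac{\Delta ^2}{\sigma _\mathrm{exp}^2+\sigma _\mathrm{pdf}^2} \, . \end{aligned}$$

If \(\chi ^2_\mathrm{tot}\simeq 1\) while \(\chi ^2_\mathrm{exp}\simeq 30\), then \(|\Delta |/\sigma _\mathrm{exp}\simeq \sqrt{30}\simeq 5.5\) and \(\sigma _\mathrm{pdf}/\sigma _\mathrm{exp}\simeq \sqrt{29}\simeq 5.4\), so that in this simplified picture the PDF uncertainty is itself more than five times the experimental one.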

Finally, comparing the two methodologies it is clear that both are equally successful in satisfying the future tests. However, with NNPDF3.1 methodology \(\chi ^2\) values computed without PDF uncertainty for out-of-sample data are rather larger than with NNPDF4.0 methodology. This means that while both methodologies lead to faithful uncertainties in the extrapolation region, NNPDF4.0 has smaller extrapolation uncertainties, i.e., it provides a more efficient generalization of the fitted data.

We then turn to fits based on the NNPDF3.1-like dataset. In this case, each process type is represented both in the fitted and extrapolation dataset, hence in Table 29 we only show \(\chi ^2\) values for the total fitted and out-of-sample datasets. In this case, the out-of-sample \(\chi ^2\) is smaller than two, and smaller than 1.5 for NNPDF4.0 methodology, consistent with the fact that the out-of-sample data are now in an interpolation region. Also in this case, upon inclusion of the PDF uncertainty all \(\chi ^2\) values become of order one, and close to the \(\chi ^2\) value for the fitted dataset.

Table 29 Same as Table 28, now for the NNPDF3.1-like future test

We conclude from this analysis that the future test is fully successful for both methodologies, and that for the same datasets near- and far-extrapolation and interpolation uncertainties are smaller with NNPDF4.0 methodology as compared to its NNPDF3.1 counterpart.

By construction, the performance of future tests should always be assessed at the level of \(\chi ^2\). However, for the sake of visualization, we also provide some comparisons, both at the PDF level and at the data level, between future-test PDFs and PDFs determined from the global NNPDF4.0 baseline dataset. In Figs. 28 and 29 we compare future test pre-HERA and pre-LHC PDFs at the parametrization scale to those determined using the full dataset, using respectively the NNPDF3.1 and NNPDF4.0 fitting methodologies. The inflation of PDF uncertainties when a particular x range for a given PDF changes from data to extrapolation between different sets is apparent. The smaller extrapolation uncertainty found using NNPDF4.0 methodology in comparison to the NNPDF3.1 methodology is also visible. Finally, it is clear that there is good overall compatibility of all PDFs when comparing the data region of one set to the extrapolation region of a different set, in agreement with the \(\chi ^2\) values of Table 28. A possible exception is the gluon from the pre-HERA future test which, while compatible with the global result when using NNPDF3.1 methodology, disagrees with it at the two-sigma level or even more when using NNPDF4.0 methodology in the \(x\lesssim 0.002\) region. This might be due to the poor internal consistency of the BCDMS and NMC data already noted in Sect. 4.2.4: if so, this would indicate that the NNPDF4.0 methodology is sensitive enough to pick this up, while the NNPDF3.1 methodology is not.

Finally, in Fig. 30 we compare predictions obtained using the pre-HERA, pre-LHC, and global (NNPDF4.0) PDFs to a representative selection of data included in the global fit but not in the pre-HERA fit. Specifically, we consider the HERA NC structure functions at \(\sqrt{s}=920\) GeV in the \(Q=1.871\) GeV bin; the dimuon rapidity distributions in forward \(Z\rightarrow \mu \mu \) production at LHCb; the top quark rapidity distributions in the ATLAS \(t\bar{t}\) lepton+jet measurement at 8 TeV; and the dilepton rapidity distribution for \(M_{\ell \ell }=25\) GeV in the CMS double-differential Drell–Yan measurement at 7 TeV. Of these, only the HERA structure function data are included in the pre-LHC fit, though of course not in the pre-HERA fit, while all other data are predictions for both future-test PDF sets. All results displayed in these comparisons have been obtained using NNPDF4.0 methodology. A historical curiosity here is the observation that the rise of the \(F_2\) structure function at HERA, which came as a surprise (see e.g. Refs. [225, 226]), is correctly reproduced by the pre-HERA fit based on its onset in pre-HERA data. Note however that this should not be taken as a prediction: both methodologies that we are testing here have been developed based on later datasets, and thus do encode to some extent some of the information contained in the later data.

The very large difference between fitted and extrapolation PDF uncertainty is apparent, and so is the hierarchy between near-extrapolation uncertainties (pre-LHC) and far-extrapolation uncertainties (pre-HERA), e.g. for the top pair production data. The good compatibility between data and predictions including PDF uncertainties is also clear, again confirming the success of the future test as summarised in Table 28.

Fig. 28

Some pre-HERA and pre-LHC PDFs compared to PDFs based on the full NNPDF4.0 dataset, in all cases obtained using the NNPDF3.1 fitting methodology. The up (top left), antidown (top right), strange (bottom left) and gluon (bottom right) are shown at the input parametrization scale of \(Q=1.65\) GeV

Fig. 29

Same as Fig. 28, but now showing PDFs determined using the NNPDF4.0 methodology

Fig. 30

Comparison of the theoretical predictions including PDF uncertainties from the pre-HERA and pre-LHC PDF sets based on NNPDF4.0 methodology and those of the global fit to four representative measurements: the HERA NC structure functions, dimuon rapidity distributions in Z production at LHCb, top rapidity distributions in ATLAS \(t\bar{t}\) production, and the dilepton rapidity distribution for CMS double-differential Drell–Yan (see text for details). Note that only the HERA structure function data enter the pre-LHC fit (but not the pre-HERA fit), and none of the remaining data enter either the pre-HERA or the pre-LHC fit. The uncertainty in the data corresponds to the total diagonal experimental error

7 Dataset dependence of the NNPDF4.0 parton set

Having established the reliability of the NNPDF4.0 determination, we now study in detail the impact on the PDFs of the data (in this section) and of the methodology (in the next section). This also provides us with further a posteriori checks of the stability and reliability of our results.

In this section, we first assess the global impact of the change in dataset when going from NNPDF3.1 to NNPDF4.0, and then we present variants of the baseline NNPDF4.0 fit, in which the impact of specific datasets is studied by removing them from the baseline. Finally, we assess the impact of datasets that have not been included in the NNPDF4.0 baseline, with the main aim of checking the stability of our results. Whereas the analysis presented in this section gives some indication of the pull of some data on individual PDFs, a full assessment of the impact of various data on the PDFs would require the use of correlation tools such as those presented in Ref. [27], as well as systematic studies of PDFs based on partial datasets, such as those presented in Sect. 2.3 of Ref. [227] in the context of the NNPDF2.3 determination.

Unless otherwise stated, all the fits presented in this section utilize the methodology discussed in Sect. 3 and correspond to Monte Carlo ensembles of 100 replicas.

7.1 Impact of the updated dataset

As explained in Sect. 2.1, the NNPDF4.0 dataset differs from NNPDF3.1 not only because of the addition of a significant amount of measurements not included in NNPDF3.1, but also because of changes in the treatment of some of the data already included in NNPDF3.1, mostly related to updates in the data and in the corresponding theory calculations. These changes are incorporated in a dataset called NNPDF3.1-like in Sect. 2.1 and used throughout this paper whenever comparisons to NNPDF3.1 are required, e.g. in the code benchmarks of Sect. 3.4 or in the future tests of Sect. 6.2. However, in Sect. 5.2.1 (specifically in Fig. 17) we compared NNPDF4.0 to the published NNPDF3.1 set. We must therefore start this discussion of dataset dependence with an assessment of the differences between the published NNPDF3.1 fit and its update based on this NNPDF3.1-like dataset.

7.1.1 The NNPDF3.1-like dataset and PDFs

The impact of the alterations made to the NNPDF3.1 dataset is studied by comparing the original NNPDF3.1 determination [5] to a PDF fit based on the same NNPDF3.1 methodology but using the NNPDF3.1-like dataset discussed in Sect. 2.1. The corresponding PDFs are compared in Fig. 31.

Fig. 31

The up, antiup, down, antidown, strange, antistrange, charm and gluon PDFs from NNPDF3.1 and from a fit based on the same NNPDF3.1 methodology but on the NNPDF3.1-like dataset defined in Sect. 2.1. Results are displayed at \(Q=100\) GeV, normalized to the NNPDF3.1 central value. Solid and dashed bands correspond to 68% and one-sigma uncertainties, respectively

Fig. 32