1 Introduction

With the restart of the Large Hadron Collider (LHC), particle physics is entering a new era, in which measurements are anticipated to attain an unprecedented, percent-level statistical precision [1]. These measurements are utilised to improve the determination of Standard Model (SM) parameters [2], to constrain Parton Distribution Functions (PDFs) [3], to evaluate backgrounds for missing energy searches [4], or more generally to constrain higher-dimensional operators in the SM Lagrangian [5], and eventually to stress-test the properties of the Higgs boson [6].

In all of these cases, measurements are contrasted with theoretical predictions by means of statistical inference: a model is chosen and compared or optimised to the data through maximum likelihood estimation. The goal is to reject a test hypothesis or to obtain a confidence interval of model parameters. Because experimental uncertainties in LHC measurements are commonly assumed to be Gaussian, the figure of merit utilised for the statistical test or to optimise the model is the \(\chi ^2\) statistic, which is monotonic in the likelihood of sampling the experimental data given the theory.

The robustness of the \(\chi ^2\) as a figure of merit relies on the accuracy of theoretical expectations and of experimental uncertainties. In this paper we assume that the \(\chi ^2\) is not spoiled by inaccuracies in theoretical expectations (an assumption that may not hold at present, but that will become increasingly realistic in the future [7]) and focus only on inaccuracies of experimental uncertainties.

A proper estimation of uncertainties in LHC measurements is indeed becoming increasingly delicate. The large event samples collected during Run I and Run II have made statistical uncertainties generally smaller than systematic uncertainties; the upcoming Run III will make the former even smaller. In contrast to statistical uncertainties, systematic uncertainties (which are not related to event counts but, e.g., to limitations of the detector or to assumptions made in its modelling) are more difficult to estimate, because custom procedures and subjective choices are involved in their quantification [8]. Furthermore, systematic uncertainties are usually correlated across different kinematic bins, both within the same measurement and across different measurements. Determining these correlations is even more difficult, and often it is not even attempted. In this case, simple assumptions, such as taking systematic uncertainties to be fully correlated or fully uncorrelated, may misrepresent the truth. More elaborate guesswork can be performed in order to devise ad-hoc correlation models, which however have no generality and can be time-consuming.

Because the uncertainties on LHC measurements are being increasingly dominated by systematic uncertainties, any analysis that utilises them is implicitly dependent on the choices made in their characterisation. While this dependence is generally unavoidable, some care must be taken to prevent it from hampering the use of the data in precision physics analyses based on statistical inference.

The aim of this paper is to formulate and address this problem rigorously. We first demonstrate how inaccuracies in the estimation of systematic uncertainty correlations, even if small, can lead to instabilities in the experimental covariance matrix, and how these can ultimately undermine the reliability of the \(\chi ^2\). We then devise a regularisation procedure whereby these instabilities are removed with minimal information on their source, and without loss of generality. The idea is to define a bound on the singular values of the correlated part of the matrix of uncertainties, and to clip them to a suitably chosen value that alters only the small subset of directions associated with the instability. We finally apply this procedure to a particular problem relevant to LHC precision physics that utilises statistical inference: PDF determination. Although we orient our discussion towards this problem, our regularisation procedure is completely general, and can be applied to any statistical analysis that involves the evaluation of the \(\chi ^2\).

The structure of the paper is as follows. In Sect. 2 we introduce the matrix of uncertainties and formulate a stability criterion for it. In Sect. 3 we derive our regularisation procedure and demonstrate how it works with a toy model. In Sect. 4 we apply the procedure to PDF determination using the recently released NNPDF4.0 parton sets [9]. We summarise our results in Sect. 5. The paper is completed by two appendices: Appendix A is a glossary of some useful definitions used throughout the paper; Appendix B contains the proof of Eq. (14). Our regularisation procedure is made publicly available as part of the NNPDF software [10].

2 Matrix of uncertainties and its stability

In this section we formulate the problem of the reliability of the \(\chi ^2\) when instabilities, even small ones, appear in the covariance matrix that enters its computation. We first introduce the matrix of uncertainties and write the \(\chi ^2\) in terms of it. We then derive an upper bound on the instability of the matrix of uncertainties that ensures the stability of the \(\chi ^2\) with minimal information.

2.1 Matrix of uncertainties and \(\chi ^2\)

The format of LHC measurements, as often made public through the HepData repository [11], consists of a central value and of a set of uncertainties for each of the data points that form the measurement itself. The set of uncertainties usually encompasses a total statistical uncertainty and a set of independent systematic uncertainties. The latter are typically correlated across data points, by an amount that may or may not be specified.

Let us consider an experimental measurement made of \(N_{\mathrm{dat}}\) data points, each of which has \(N_{\mathrm{err}}\) independent uncertainties. We call \(\mathbf {d}\) the vector of experimental mean values, \(\mathbf {d}=\{D_i\}\), and A the \(N_{\mathrm{dat}}\times N_{\mathrm{err}}\) matrix of uncertainties, \(A=\{A_{ij}\}\), with \(i=1, \dots , N_{\mathrm{dat}}\), and \(j=1, \dots , N_{\mathrm{err}}\). Assuming that all uncertainties are Gaussian and that they are combined additively, the experimental measurement defines a multi-Gaussian distribution with mean \(\mathbf {d}\), given by the experimental central values, and covariance matrix C, given by the product of the matrix of uncertainties and its transpose, \(C=AA^t\).

Depending on the information provided with a given experimental measurement, each element of the matrix of uncertainties can be obtained from knowledge of C, for example by taking its Cholesky decomposition, or from direct knowledge of the experimental uncertainties. In the latter case, if \({\mathcal {O}}_i\) is the physical observable corresponding to the data point \(D_i\), and \(\{u_j\}\) is the set of independent variables which contribute to the experimental uncertainty and on which the observable depends (each described by a Gaussian distribution with central value \(u_j^0\) and uncertainty \(s_j\)), any element of the matrix of uncertainties reads

$$\begin{aligned} A_{ij} = \left. \frac{\partial {\mathcal {O}}_i}{\partial u_j} \right| _{\mathbf {u}=\mathbf {u}^0} s_j. \end{aligned}$$
(1)

If a given source of uncertainty \(u_l\) affects a single data point k, then \(\partial {\mathcal {O}}_i/\partial u_l=0\) for \(i\ne k\), and it corresponds to a column of A with a single non-zero entry \(A_{kl}\). For instance, this is the case for statistical uncertainties that originate from bin-by-bin event counts. These uncertainties, together with similarly fully uncorrelated systematic uncertainties, can therefore be encoded in an \(N_{\mathrm{dat}}\times N_{\mathrm{dat}}\) diagonal sub-matrix of A. We assume that such uncertainties are always present in a measurement; we will therefore henceforth consider that \(N_{\mathrm{err}} \ge N_{\mathrm{dat}}\), and that both A and C are full rank.

The inverse of the covariance matrix C is \(C^{-1}=A^{+t}A^+\), where \(A^+\) is the right inverse of A (see Appendix A). Denoting with \(\mathbf {t}=\{T_i\}\), \(i=1,\dots ,N_{\mathrm{dat}}\), the vector of theoretical predictions corresponding to the data mean values \(\mathbf {d}\), the \(\chi ^2\) can be written as

$$\begin{aligned} \chi ^2 = (\mathbf {d} - \mathbf {t})^t C^{-1} (\mathbf {d} - \mathbf {t}) = \left\| A^+(\mathbf {d}-\mathbf {t})\right\| ^2. \end{aligned}$$
(2)

In this equation we have explicitly factorised the two contributions that determine the value of the \(\chi ^2\): the difference between the mean experimental central values and the theoretical expectation values, encoded in \(\mathbf {d}-\mathbf {t}\); and the experimental uncertainties, encoded in the error matrix A.
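The equivalence between the two forms of Eq. (2) is easy to check numerically. The following sketch (in Python with NumPy; the function names are ours and purely illustrative) evaluates the \(\chi ^2\) both from the covariance matrix and from the right inverse of the matrix of uncertainties:

```python
import numpy as np

def chi2_from_covariance(d, t, C):
    """First form of Eq. (2): (d - t)^T C^{-1} (d - t), via a linear solve."""
    r = d - t
    return float(r @ np.linalg.solve(C, r))

def chi2_from_uncertainties(d, t, A):
    """Second form of Eq. (2): ||A^+ (d - t)||^2, with A^+ = A^T (A A^T)^{-1}
    the right inverse of the N_dat x N_err matrix of uncertainties A."""
    A_plus = A.T @ np.linalg.inv(A @ A.T)
    return float(np.sum((A_plus @ (d - t)) ** 2))

# Illustrative check on random inputs: a diagonal (uncorrelated) block plus
# two correlated systematic uncertainties
rng = np.random.default_rng(0)
A = np.hstack([np.diag(rng.uniform(0.5, 1.0, size=4)), rng.normal(size=(4, 2))])
d, t = rng.normal(size=4), rng.normal(size=4)
assert np.isclose(chi2_from_covariance(d, t, A @ A.T), chi2_from_uncertainties(d, t, A))
```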

Concerning the \((\mathbf {d} - \mathbf {t})\) term in Eq. (2), we assume perfect knowledge of theoretical expectations. This means that the vector of differences \(\mathbf {d}-\mathbf {t}\) is a realisation of a random variable which follows a multivariate Gaussian distribution with mean zero and covariance matrix C. The corresponding probability density can be given in terms of the matrix of uncertainties A and of a vector of \(N_{\mathrm{err}}\) independent standard Gaussian random variables, \(\mathbf {n}=\left\{ n_j \right\} \), \(j=1,\dots ,N_{\mathrm{err}}\),

$$\begin{aligned} \mathbf {d}-\mathbf {t} = A\mathbf {n}, \qquad \mathbf {n}\sim {\mathcal {N}}(\mathbf {0},I). \end{aligned}$$
(3)

Concerning the matrix of uncertainties A in Eq. (2), we consider two different cases. The first case corresponds to assuming that A has been estimated accurately. Substituting Eq. (3) in Eq. (2), we obtain that the expected value of the \(\chi ^2\) over samples of \(\mathbf {d}-\mathbf {t}\) is

$$\begin{aligned} \langle \chi ^2\rangle = \langle \left\| A^+A\mathbf {n}\right\| ^2\rangle . \end{aligned}$$
(4)

Using the fact that, for independent standard Gaussian variables, \(\langle n_j n_l\rangle = \delta _{jl}\), we find that \(\langle \chi ^2\rangle \) is given in terms of the Frobenius norm (see Appendix A) of \(A^+A\):

$$\begin{aligned} \langle \chi ^2\rangle = \sum _{j,l}^{N_{\mathrm{err}}} (A^+A)_{j,l}^2 = \left\| A^+A\right\| _F^2 = N_{\mathrm{dat}}, \end{aligned}$$
(5)

where the last equality follows from the singular value decomposition of \(A^+\), see Appendix A.
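The result of Eq. (5) can be verified with a small Monte Carlo experiment, as in the following sketch (the numbers entering A are arbitrary and only serve the illustration); the same code, with the right inverse of a mis-estimated matrix \({\bar{A}}\) in place of \(A^+\), gives Eq. (6):

```python
import numpy as np

rng = np.random.default_rng(1)
ndat, nerr, nrep = 4, 6, 100_000

# An arbitrary, full-rank matrix of uncertainties (illustrative numbers only)
A = np.hstack([0.1 * np.eye(ndat), rng.normal(size=(ndat, nerr - ndat))])
A_plus = A.T @ np.linalg.inv(A @ A.T)            # right inverse of A

# Sample d - t = A n with n ~ N(0, I), Eq. (3), and evaluate the chi^2, Eq. (4)
n = rng.standard_normal(size=(nrep, nerr))
chi2 = np.sum((n @ (A_plus @ A).T) ** 2, axis=1)

print(chi2.mean())                                # ~ N_dat, Eq. (5)
print(np.linalg.norm(A_plus @ A, "fro") ** 2)     # = N_dat exactly
print(chi2.std())                                 # ~ sqrt(2 N_dat) ~ 2.83, cf. Sect. 2.2
```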

The second case corresponds to assuming that there are inaccuracies in the estimation of uncertainties, which do not need to be large. We denote by \({\bar{A}}\) the matrix of uncertainties that contains such inaccuracies; it differs from the true matrix A, which remains unknown. The covariance matrix used to compute the \(\chi ^2\) is now \({\bar{C}}={\bar{A}}{\bar{A}}^t\). However, because we still assume perfect knowledge of theoretical expectations, Eq. (3) continues to hold. Therefore, in analogy with Eqs. (4)–(5), the expectation value of the \(\chi ^2\) reads

$$\begin{aligned} \langle {\bar{\chi }}^2\rangle = \left\| {\bar{A}}^+ A\right\| _F^2. \end{aligned}$$
(6)

A comparison between Eqs. (6) and (5) allows one to formulate a stability criterion for the expectation value of the \(\chi ^2\) and for the matrix of uncertainties A upon substituting A with \({\bar{A}}\), as we explain next.

2.2 Stability criterion

We state that the matrix of uncertainties A is stable upon the replacement \(A \rightarrow {\bar{A}}\) in the computation of the \(\chi ^2\) when the difference in its expected value, \(\Delta \chi ^2\), is smaller than statistical fluctuations of the \(\chi ^2\) statistic itself, as measured by the standard deviation of the corresponding \(\chi ^2\) distribution, which is equal to \(\sqrt{2N_{\mathrm{dat}}}\). We can therefore write a stability criterion for the expectation value of the \(\chi ^2\) as:

$$\begin{aligned} \Delta \chi ^2 = \left| \langle {\bar{\chi }}^2\rangle - \langle \chi ^2\rangle \right| < \sqrt{2N_{\mathrm{dat}}}. \end{aligned}$$
(7)

Substituting Eqs. (5) and (6) in Eq. (7), we can equivalently write

$$\begin{aligned} \left\| {\bar{A}}^+A\right\| _F^2 - N_{\mathrm{dat}} < \sqrt{2N_{\mathrm{dat}}}. \end{aligned}$$
(8)

We now seek an upper bound on the inaccuracies of the matrix \({\bar{A}}\) that guarantees the stability criterion on the \(\chi ^2\), Eq. (8), using minimal information. To this purpose, we write the matrix of uncertainties A, which we do not know, as a perturbation of the matrix \({\bar{A}}\), which we are given,

$$\begin{aligned} A = {\bar{A}} + \delta F, \end{aligned}$$
(9)

where F is a matrix of perturbations and \(\delta \) is a scalar parameter controlling the size of the fluctuation. We assume that \(\delta \) is sufficiently small that we can expand linearly around \(\delta =0\). Substituting Eq. (9) into Eq. (6), we find

$$\begin{aligned} \langle {\bar{\chi }}^2\rangle = \left\| {\bar{A}}^+({\bar{A}} + \delta F) \right\| _F^2. \end{aligned}$$
(10)

Using the triangle inequality, we can derive the upper bound

$$\begin{aligned} \langle {\bar{\chi }}^2\rangle \le \left( \left\| {\bar{A}}^+{\bar{A}}\right\| _F + \delta \left\| {\bar{A}}^+F\right\| _F \right) ^2. \end{aligned}$$
(11)

Then expanding the square, and using the fact that \({\bar{A}}\) is full rank since it corresponds to the covariance matrix obtained in the experimental analysis, we obtain

$$\begin{aligned} \langle {\bar{\chi }}^2\rangle \le N_{\mathrm{dat}} + 2 \delta \sqrt{N_{\mathrm{dat}}} \left\| {\bar{A}}^+F\right\| _F + {\mathcal {O}}(\delta ^2). \end{aligned}$$
(12)

Now, neglecting the quadratic terms in \(\delta \), we arrive at

$$\begin{aligned} \Delta \chi ^2 \le 2 \delta \sqrt{N_{\mathrm{dat}}} \left\| {\bar{A}}^+F\right\| _F. \end{aligned}$$
(13)

We apply the following inequality

$$\begin{aligned} \left\| XY\right\| _F \le \min \left( \left\| X\right\| _F\left\| Y\right\| _2, \left\| X\right\| _2\left\| Y\right\| _F\right) , \end{aligned}$$
(14)

which holds for arbitrary matrices X and Y of compatible shape (see Appendix B for a proof), to Eq. (13) and find

$$\begin{aligned} \Delta \chi ^2 \le 2 \delta \sqrt{N_{\mathrm{dat}}} \left\| {\bar{A}}^+\right\| _2\left\| F\right\| _F, \end{aligned}$$
(15)

where \(\left\| {\bar{A}}^+\right\| _2\) denotes the Euclidean (or \(L^2\)) norm of \({\bar{A}}^+\) (see Appendix A). Choosing \(\left\| {\bar{A}}^+\right\| _2\left\| F\right\| _F\) instead of \(\left\| {\bar{A}}^+\right\| _F\left\| F\right\| _2\) as the bound in Eq. (15) results in a tighter constraint, because, in practice, instabilities occur when \({\bar{A}}^+\) has large singular values.

Finally, combining Eq. (15) with Eq. (7), we conclude that the condition

$$\begin{aligned} \left\| {\bar{A}}^+\right\| _2\left\| F\right\| _F \le \frac{1}{\sqrt{2}\delta } \end{aligned}$$
(16)

is sufficient to ensure that the expectation value of the \(\chi ^2\) does not overestimate its true value by more than its statistical fluctuation. The advantage of Eq. (16) with respect to Eq. (8) is to provide a stability criterion that does not depend on the unknown matrix of uncertainties A, but only on the known matrix \({\bar{A}}\) and on the Frobenius norm of the matrix of fluctuations F. This dependence can be easily modelled as we explain in the next section.
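As a quick numerical illustration of the logic of Eqs. (9)–(15), the sketch below (with arbitrary illustrative matrices) compares \(\Delta \chi ^2\) to the bound of Eq. (15) for a small value of \(\delta \); the comparison holds up to the neglected \({\mathcal {O}}(\delta ^2)\) terms:

```python
import numpy as np

rng = np.random.default_rng(3)
ndat, nerr, delta = 4, 6, 0.05

A_bar = np.hstack([0.2 * np.eye(ndat), rng.normal(size=(ndat, nerr - ndat))])
F = rng.normal(size=(ndat, nerr))                 # matrix of perturbations
A = A_bar + delta * F                             # Eq. (9)

A_bar_plus = A_bar.T @ np.linalg.inv(A_bar @ A_bar.T)
delta_chi2 = np.linalg.norm(A_bar_plus @ A, "fro") ** 2 - ndat   # Eq. (6) minus Eq. (5)
bound = 2 * delta * np.sqrt(ndat) * np.linalg.norm(A_bar_plus, 2) * np.linalg.norm(F, "fro")
print(delta_chi2, bound)                          # bound of Eq. (15), up to O(delta^2)
```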

3 Regularising the matrix of uncertainties

In this section we devise a procedure to regularise the matrix of uncertainties in such a way that the \(\chi ^2\) becomes insensitive to inaccuracies in the estimation of the experimental uncertainties. We then demonstrate the effectiveness of the procedure in a toy model that is representative of realistic LHC measurements.

3.1 Regularisation procedure

Our aim is to obtain a regularised matrix of uncertainties \(A_{\mathrm{reg}}\) which, for a given model of instabilities, fulfills the following criteria: i) \(A_{\mathrm{reg}}\) is more stable than \({\bar{A}}\); ii) \(A_{\mathrm{reg}}\) is compatible with \({\bar{A}}\) within the precision with which the latter is determined; and iii) the uncertainty estimated by \(A_{\mathrm{reg}}\) never decreases in comparison to that estimated by \({\bar{A}}\). To this purpose, we first need to characterise the inaccuracies in the matrix \({\bar{A}}\), by means of a simplified model that builds upon the stability criterion, Eq. (16). We note that sometimes such a characterisation comes as part of the measurement itself, generally as the result of a dedicated analysis. In these cases, this characterisation has to be preferred to the model discussed below.

The model of inaccuracies that we devise ought to be minimal, general, and realistic. Minimal, because it should alter the matrix of uncertainties \({\bar{A}}\) as little as possible; general, because it should be applied to any data set with no further information; and realistic, because it should capture the most likely sources of inaccuracy. These features lead us to making two assumptions.

The first assumption is that the correlations of experimental uncertainties across data points are determined much less precisely than the uncertainties for each data point, which we presume to be exact. This assumption is known to hold in practice, since the determination of certain correlations (such as those for two-point uncertainties, defined as the difference between estimates obtained with two different Monte Carlo generators) requires a certain amount of guesswork. This fact is occasionally reflected in different correlation models being presented with the measurement. Therefore we write

$$\begin{aligned} A = D A_\text {corr} , \end{aligned}$$
(17)

where D is the \(N_{\mathrm{dat}} \times N_{\mathrm{dat}}\) diagonal matrix of standard deviations for each data point

$$\begin{aligned} D_{ii} = \sqrt{\sum _j^{N_{\mathrm{err}}} A_{ij}^2}. \end{aligned}$$
(18)

We then assume that the covariance matrix provided by or built from the experiment has the true standard deviations, but correlations (encoded in \({\bar{A}}_\text {corr}\) below) may be different from the truth. Analogously to Eq. (9) we can therefore write

$$\begin{aligned} A = D({\bar{A}}_\text {corr} + \delta F_\text {corr}). \end{aligned}$$
(19)

Note that \(A_\text {corr}A_\text {corr}^t\) is the covariance matrix of the reduced differences \((d_i - t_i)/D_{ii}\), hence the analysis carried out in Sect. 2 can be repeated verbatim for these variables. Analogously to Eq. (8), we can write

$$\begin{aligned} \Delta \chi ^2 = \left\| {\bar{A}}^+_\text {corr}A_\text {corr}\right\| _F^2 - N_{\mathrm{dat}}, \end{aligned}$$
(20)

and finally arrive at a stability criterion, similar to Eq. (16), under the assumption that D is well determined,

$$\begin{aligned} \left\| {\bar{A}}_\text {corr}^+\right\| _2\left\| F_\text {corr}\right\| _F \le \frac{1}{\sqrt{2}\delta }. \end{aligned}$$
(21)

The second assumption is that \(\left\| F_\text {corr}\right\| _F\) is independent of the number of data points or correlated experimental uncertainties in the measurement. The model then implies that the prevalent source of inaccuracy in the correlation matrix is concentrated on a subset of data points and originates from a small number of correlated experimental uncertainties (for example the correlation of some two-point systematic uncertainties between the most extreme kinematic bins). While this assumption is a simplification, we find that the model is effective, as we will discuss in the context of both a toy model (see Sect. 3.2) and of a realistic case (see Sect. 4.2). If instead the source of inaccuracy in the correlation matrix arose from a number of systematic uncertainties that increased, e.g., with the number of data points \(N_{\mathrm{dat}}\), the regularisation procedure described below would over-regularise small data sets and under-regularise large ones when simultaneously applied to a collection of measurements.

Since \(F_\text {corr}\) is a matrix of dimensionless coefficients (both the units and the magnitude of the data uncertainties are absorbed in D), we can simply set its norm to a constant, e.g. \(\left\| F_\text {corr}\right\| _F = 1/\sqrt{2}\). Therefore, with the assumptions we have made, the model of uncertainties required to implement the stability criterion Eq. (7) contains a single dimensionless parameter, \(\delta \), and the stability condition is

$$\begin{aligned} \left\| {\bar{A}}_\text {corr}^+\right\| _2 \le \frac{1}{\delta } . \end{aligned}$$
(22)

The free parameter \(\delta \) characterises the precision of the correlation matrix. Its optimal value depends on the features of the measurement, and clearly cannot be obtained from the matrix itself. In the case of PDF determination, we will obtain it by studying the dependence of global fits on it, as we will discuss in Sect. 4.

The stability condition Eq. (22), together with the requirements presented at the beginning of this section, lays out a regularisation procedure. Specifically, Eq. (22) implies that the largest singular values of \({\bar{A}}^+_\text {corr}\) must be bounded from above by \(\delta ^{-1}\), and conversely that the smallest singular values of \({\bar{A}}_\text {corr}\) must be bounded from below by \(\delta \). The requirement that the regularised matrix gives the same description as the original one in the directions that do not contribute to instability implies that the singular vectors with singular values greater than \(\delta \) are left unchanged. Following Eq. (17), we write \({\bar{A}}\) in terms of the singular value decomposition of \({\bar{A}}_\text {corr}\), \({\bar{A}}_\text {corr} = USV^t\),

$$\begin{aligned} {\bar{A}} = DUSV^t , \end{aligned}$$
(23)

and we can then define the regularised matrix \(A_{\mathrm{reg}}\) as

$$\begin{aligned} {\bar{A}}_{\mathrm{reg}} = DUS_\text {reg}V^t , \end{aligned}$$
(24)

where \(S_{\mathrm{reg}}\) is the matrix of singular values whose non-zero entries are

$$\begin{aligned} S_{\text {reg}(ii)}= {\left\{ \begin{array}{ll} \delta &{} s_{i}<\delta \\ s_{i} &{} \text {otherwise} \end{array}\right. } . \end{aligned}$$
(25)

Note that, besides the formulation of the regularisation procedure laid out above, the stability condition, Eq. (22), can be inverted to quickly assess the stability of experimental uncertainties in a given measurement. We define the condition number Z as the norm of \({\bar{A}}_\text {corr}^+\), which equals the inverse of the smallest singular value of \({\bar{A}}_\text {corr}\):

$$\begin{aligned} Z = \left\| {\bar{A}}_\text {corr}^+\right\| _2 = \left( \min _i s_i \right) ^{-1}. \end{aligned}$$
(26)

It follows from Eqs. (22) and (25) that, if \(Z>\delta ^{-1}\), then it is likely that the precision with which correlations are determined is insufficient to ensure that they will not alter the expectation value of the \(\chi ^2\). This is demonstrated below with a toy model.
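In practice, the whole regularisation amounts to a few lines of linear algebra. The sketch below is a minimal illustration of Eqs. (17)–(26) for a generic covariance matrix; it works with the eigendecomposition of the correlation matrix, whose eigenvalues are the squared singular values of \({\bar{A}}_\text {corr}\). The function name and interface are ours; the official implementation is the one distributed with the NNPDF software [10].

```python
import numpy as np

def regularize_uncertainties(covmat, delta_inv):
    """Regularise an experimental covariance matrix following Eqs. (17)-(25).

    Eigenvalues of the correlation matrix (the squared singular values of
    Abar_corr) below delta**2 are raised to delta**2, with delta = 1/delta_inv.
    Returns the regularised covariance matrix and the condition number Z, Eq. (26).
    """
    sqrt_diag = np.sqrt(np.diag(covmat))              # the matrix D, Eq. (18)
    corr = covmat / np.outer(sqrt_diag, sqrt_diag)    # Abar_corr Abar_corr^T
    eigvals, eigvecs = np.linalg.eigh(corr)
    Z = 1.0 / np.sqrt(eigvals.min())                  # condition number, Eq. (26)
    delta = 1.0 / delta_inv
    clipped = np.clip(eigvals, delta ** 2, None)      # Eq. (25), in terms of s_i^2
    corr_reg = eigvecs @ np.diag(clipped) @ eigvecs.T
    # The diagonal of corr_reg can only grow, so the regularised variances
    # never decrease (criterion iii at the beginning of this section).
    return corr_reg * np.outer(sqrt_diag, sqrt_diag), Z
```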

Fig. 1

The expectation value of the \(\chi ^2\), as a function of the variable x, in the toy model, for two values of the parameter \(\epsilon \): \(\epsilon =0.1\) (left) and \(\epsilon =0.25\) (right). We show: the true expectation value of Eq. (30), \(\langle \chi ^2_{\mathrm{true}}\rangle \) (with one and six standard deviations for reference), given the matrix of uncertainties A, Eq. (27); the expectation value of Eq. (29), \(\langle {{\bar{\chi }}}^2\rangle \), given the inaccurate matrix of uncertainties \({\bar{A}}\), Eq. (28); and the expectation value of Eq. (36), \(\langle {{\bar{\chi }}}^2_{\mathrm{reg}}\rangle \), given the matrix of uncertainties \({\bar{A}}_{\mathrm{reg}}\), Eq. (34), obtained after applying the regularisation procedure with \(\delta =1\)

We note that the regularisation procedure can be applied without modification to joint matrices constructed from multiple measurements, when assuming the same value of \(\delta \) for each of them. For example, if the measurements are independent, and the joint matrix is block diagonal, with each block corresponding to the covariance matrix of one measurement, the effect of the regularisation on the joint matrix is the same as applying it independently to each of the individual matrices, while the condition number Z of the joint matrix is the maximum across the measurements. Systematic uncertainties that are shared between measurements (hence making the joint matrix not completely block diagonal) also require no change in the procedure.
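This behaviour can be checked explicitly by reusing the regularize_uncertainties function sketched above on randomly generated, purely illustrative covariance matrices:

```python
import numpy as np
from scipy.linalg import block_diag

def random_covmat(ndat, seed):
    """A random, full-rank covariance matrix (illustrative only)."""
    a = np.random.default_rng(seed).normal(size=(ndat, 2 * ndat))
    return a @ a.T

C1, C2 = random_covmat(3, seed=1), random_covmat(5, seed=2)
_, Z1 = regularize_uncertainties(C1, delta_inv=4)
_, Z2 = regularize_uncertainties(C2, delta_inv=4)
_, Z_joint = regularize_uncertainties(block_diag(C1, C2), delta_inv=4)
assert np.isclose(Z_joint, max(Z1, Z2))
```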

Finally, we remark that Eq. (25) should work for any value of \(\delta \), even if it has been derived by neglecting terms of \({\mathcal {O}}(\delta ^2)\) in Eq. (12). This neglect, however, may make the interpretation of \(\delta ^{-1}\) as a measure of the precision with which correlations need to be known to ensure the stability of the \(\chi ^2\) looser for small values of \(\delta ^{-1}\).

3.2 Toy model

We now apply the regularisation procedure devised in Sect. 3.1 to a toy model which is representative of a realistic LHC data set. This exercise will show how inaccuracies in the degree of correlation of uncertainties can undermine the reliability of the \(\chi ^2\) as a figure of merit.

The model consists of a data set made of four experimental data points, with a small uncorrelated statistical uncertainty of size \(\epsilon \), equal for each data point, and one correlated systematic uncertainty of size 1, affecting only the first three data points. The fourth point also has a systematic uncertainty, whose correlation with the other points is, however, not precisely known. We parametrise this lack of knowledge in terms of the variable x, and write the systematic uncertainty on the fourth point as a fluctuation, by an amount x, with respect to the other systematic uncertainty of size 1. We assume that the total variance is known. Note that this is consistent with the assumptions made in Sect. 3.1: correlations can fluctuate (in a way that, in the model, is parametrised by x), while variances remain fixed.

The matrix of uncertainties describing this toy model is

$$\begin{aligned} A(x) = \left( \begin{matrix} \epsilon &{}\quad 0 &{} \quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0\\ 0 &{}\quad \epsilon &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad \epsilon &{}\quad 0 &{}\quad 1 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad \epsilon &{}\quad 1 - x &{}\quad \sqrt{1 - \left( 1 - x\right) ^{2}}\\ \end{matrix} \right) . \end{aligned}$$
(27)

By fixing the variance due to the systematics to 1 we let the parameter \(\epsilon \ll 1\) control the relative size of the uncorrelated to correlated uncertainties. The parameter x can take values in the interval [0, 2]: \(x=0\) corresponds to the case in which the systematic uncertainty on the fourth data point is fully correlated with that of the other data points; \(x=1\) corresponds to the case of full decorrelation; and \(x=2\) corresponds to the case of full anti-correlation.

We now consider the situation in which the correlation is (inaccurately) estimated to be maximal, that is \(x={\bar{x}}=0\). This inaccuracy is encoded in the matrix of uncertainties \({\bar{A}} = A({\bar{x}})\):

$$\begin{aligned} {\bar{A}} = \left( \begin{matrix} \epsilon &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{} \quad 0\\ 0 &{} \quad \epsilon &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad \epsilon &{} \quad 0 &{}\quad 1 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad \epsilon &{}\quad 1 &{}\quad 0\\ \end{matrix} \right) . \end{aligned}$$
(28)

According to Eq. (6), the expectation value of the \(\chi ^2\) given \({{\bar{A}}}\) is

$$\begin{aligned} \langle {\bar{\chi }}^2\rangle (x) = \left\| {\bar{A}}^+ A(x)\right\| _F^2 = 4 + \frac{6x}{\epsilon ^2(\epsilon ^2+4)}, \end{aligned}$$
(29)

which has to be compared with the true expectation value given A(x), see Eq. (5):

$$\begin{aligned} \langle \chi ^2_{\mathrm{true}}\rangle (x) = \left\| A^+ A(x)\right\| _F^2 = N_{\mathrm{dat}} = 4. \end{aligned}$$
(30)

The situation is depicted in Fig. 1, where the curves obtained with either Eq. (29) or Eq. (30) are contrasted as a function of the true (unknown) variable x. We consider two illustrative values of the model parameter \(\epsilon \), 0.1 and 0.25, that correspond to the situation in which the uncorrelated statistical uncertainty is equal, respectively, to 10% or 25% of the correlated systematic uncertainty. These values reflect the relative ratio of uncorrelated to correlated uncertainties in realistic current and future LHC measurements. As is apparent from Fig. 1, the incorrect estimation of \({\bar{x}}\) leads to a large deviation of the expectation value of the \(\chi ^2\) from its true value. The smaller the value of \(\epsilon \), the larger the deviation. For example, for a value of \(\epsilon \) equal to 0.25, it is sufficient that the true value of x is 0.12 instead of zero to run afoul of the stability criterion of Eq. (7). For \(\epsilon =0.1\), the true value of x can be as small as 0.02 to encounter a similar instability.
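These numbers are straightforward to reproduce. The sketch below (the helper names are ours) builds the toy matrix of Eq. (27) and evaluates Eqs. (29)–(30) for \(\epsilon =0.25\) and a true value \(x=0.12\):

```python
import numpy as np

def toy_A(x, eps):
    """The toy matrix of uncertainties of Eq. (27)."""
    A = np.hstack([eps * np.eye(4), np.ones((4, 1)), np.zeros((4, 1))])
    A[3, 4] = 1.0 - x
    A[3, 5] = np.sqrt(1.0 - (1.0 - x) ** 2)
    return A

def expected_chi2(A_assumed, A_true):
    """||A_assumed^+ A_true||_F^2, cf. Eqs. (5)-(6)."""
    A_plus = A_assumed.T @ np.linalg.inv(A_assumed @ A_assumed.T)
    return np.linalg.norm(A_plus @ A_true, "fro") ** 2

eps, x_true = 0.25, 0.12
A_bar = toy_A(0.0, eps)                            # assumed full correlation, Eq. (28)
print(expected_chi2(A_bar, toy_A(x_true, eps)))    # ~6.8, Eq. (29): already ~1 sigma away
print(expected_chi2(toy_A(x_true, eps), toy_A(x_true, eps)))   # = N_dat = 4, Eq. (30)
print(np.sqrt(2 * 4))                              # one standard deviation, ~2.83
```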

We now apply the regularisation procedure devised in Sect. 3.1. We first write the matrix \({\bar{A}}\), Eq. (28), in terms of the matrices D and \({\bar{A}}_{\mathrm{corr}}\), as per Eq. (17), which read

$$\begin{aligned} D = \sqrt{1+\epsilon ^2}\,I_{4\times 4} \qquad \text {and} \qquad {\bar{A}}_{\mathrm{corr}} = \frac{1}{\sqrt{1+\epsilon ^2}}\,{\bar{A}} . \end{aligned}$$
(31)

The matrix of singular values for \({\bar{A}}_{\mathrm{corr}}\) is

$$\begin{aligned} S = \frac{1}{\sqrt{1+\epsilon ^2}}\left( \begin{matrix} \epsilon &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{} \quad \epsilon &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad \epsilon &{}\quad 0 &{} \quad 0 &{} \quad 0\\ 0 &{} \quad 0 &{} \quad 0 &{}\quad \sqrt{4+\epsilon ^2} &{}\quad 0 &{}\quad 0\\ \end{matrix} \right) . \end{aligned}$$
(32)

We denote the first three singular values as \(s_{1,2,3}=\epsilon /\sqrt{1+\epsilon ^2}\) and the fourth one as \(s_{4}=\sqrt{4+\epsilon ^2}/\sqrt{1+\epsilon ^2}\), and note that \(0<s_{1,2,3}<s_4<2\) for any value of \(\epsilon >0\). We then apply the regularisation prescription given by Eqs. (23)–(25), by choosing \(s_{1,2,3}<\delta ^{-1}<s_4\). The regularised matrix of singular values therefore reads

$$\begin{aligned} S_{\mathrm{reg}} = \left( \begin{matrix} \delta ^{-1} &{}\quad 0 &{}\quad 0 &{} \quad 0 &{} \quad 0 &{}\quad 0\\ 0 &{} \quad \delta ^{-1} &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{} \quad \delta ^{-1} &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad \frac{\sqrt{4+\epsilon ^2}}{\sqrt{1+\epsilon ^2}} &{}\quad 0 &{}\quad 0\\ \end{matrix} \right) , \end{aligned}$$
(33)

and the regularised matrix of uncertainties

$$\begin{aligned} {\bar{A}}_{\mathrm{reg}} = \left( \begin{matrix} a &{}\quad b &{}\quad b &{}\quad b &{}\quad 1 &{}\quad 0\\ b &{}\quad a &{}\quad b &{}\quad b &{}\quad 1 &{}\quad 0\\ b &{}\quad b &{}\quad a &{}\quad b &{}\quad 1 &{}\quad 0\\ b &{}\quad b &{}\quad b &{}\quad a &{}\quad 1 &{}\quad 0\\ \end{matrix} \right) , \end{aligned}$$
(34)

where

$$\begin{aligned} a= & {} \frac{1}{4}\left( \epsilon + 3\delta ^{-1}\,\sqrt{1+\epsilon ^2}\right) \qquad \text {and} \nonumber \\ b= & {} \frac{1}{4}\left( \epsilon - \delta ^{-1}\,\sqrt{1+\epsilon ^2}\right) . \end{aligned}$$
(35)

The expected value of the \(\chi ^2\) is finally

$$\begin{aligned} \langle {\bar{\chi }}^2_{\mathrm{reg}}\rangle (x)= & {} \left\| {\bar{A}}^+_{\mathrm{reg}} A(x)\right\| _F^2 = 1 + \frac{3\,\delta ^2\epsilon ^2}{1 + \epsilon ^2} \nonumber \\&+ \frac{3x}{2}\left( \frac{\delta ^2}{1 + \epsilon ^2} - \frac{1}{4+\epsilon ^2}\right) . \end{aligned}$$
(36)

The expression in Eq. (36) is compared to those in Eqs. (29)–(30) in Fig. 1 for the value \(\delta =1\). We note that this value fulfils the requirement \(s_{1,2,3}<\delta ^{-1}<s_4\) for any value of the parameter \(\epsilon \). As is apparent from Fig. 1, the regularisation procedure successfully achieves the goal for which it was devised: the expectation value of the regularised \(\chi ^2\), \(\langle {\bar{\chi }}^2_{\mathrm{reg}}\rangle \), does not differ from the true expectation value, \(\langle \chi ^2_{\mathrm{true}}\rangle \), by more than one standard deviation of the \(\chi ^2\) distribution for any value of x.

The optimal value of \(\delta \) should be determined on a case-by-case basis depending on the precision with which x is known. This is the topic that we will investigate in the next section in the context of PDF determination.

We now turn our attention to the situation in which further assumptions can be made on the uncertainties in the determination of the correlation structure, for example when additional information is available from the experimental analysis. In that case it might be advisable to study the effects on stability of various modelling choices, and the corresponding regularisation, in a more refined way than the one described in Sect. 3.1, where we strove for generality. We simulate this situation by assuming a specific prior for the value of the x parameter. We choose that prior to be a beta distribution with support in \(x \in [0, 1]\) and such that \(x=0\) is the mode. Specifically,

$$\begin{aligned} x \sim {\text {Beta}}(1, 5) \, , \end{aligned}$$
(37)

which corresponds to the probability density

$$\begin{aligned} f_x(\xi ) = 5(1 - \xi )^4 \, . \end{aligned}$$
(38)

Our discussion implies that even though \(x=0\) is the most likely value, analyses using it are subject to instabilities. We can quantify this by computing the error in the \(\chi ^2\) incurred by assuming a particular value x of the correlation parameter, so that the covariance matrix is built from \({\bar{A}}=A(x)\), averaged over the prior distribution of the true value \(\xi \):

$$\begin{aligned} \langle \Delta \chi ^2\rangle (x) = \int _0^1 \left| \left\| {\bar{A}}^+(x) A(\xi )\right\| _F^2 - N\right| \, f_x(\xi )\,\text {d}\xi \, . \end{aligned}$$
(39)
Fig. 2

Stability of the toy model with additional assumptions on the value of the correlation. The green curve shows the deviation in \(\chi ^2\) averaged over the assumed prior of the x parameter, Eq. (39). The black dashed horizontal line marks the limit from the stability criterion Eq. (7). The orange dashed vertical line, at the intersection, marks the most likely value of x that fulfills the stability criterion, Eq. (40)

We represent \(\langle \Delta \chi ^2\rangle (x)\) in Fig. 2, where we have set \(\epsilon =0.1\). The comparison with the limit imposed by the stability criterion Eq. (7), also displayed in Fig. 2, shows that presenting the covariance matrix with a value of the correlation parameter too close to its most likely value under the prior yields large instabilities that would hamper the subsequent analysis. Selecting the most likely value that satisfies the stability criterion

$$\begin{aligned} x^* = \mathop {\mathrm {argmax}}\limits _{\xi :\, \left\langle \Delta \chi ^2\right\rangle (\xi ) \le \sqrt{2N}} f_x(\xi ) \end{aligned}$$
(40)

may be a way to decide the value of the correlation with which to present the covariance matrix. This would correspond to \(x \approx 0.04\) under the settings presented here. Note that this small correction is consistent with the assumed knowledge of x, Eq. (37), but it would notably increase the accuracy of \(\chi ^2\) computations using the covariance matrix.
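Under the assumptions above, Eqs. (39)–(40) can be evaluated numerically as in the following sketch, which reuses the toy_A and expected_chi2 helpers sketched earlier in this section; the quadrature and the grid over the assumed value of x are illustrative choices:

```python
import numpy as np
from scipy import integrate, stats

eps, ndat = 0.1, 4
prior = stats.beta(1, 5)     # Eq. (37); its density is f_x(xi) = 5 (1 - xi)^4, Eq. (38)

def avg_delta_chi2(x_assumed):
    """Eq. (39): error on the chi^2 incurred by building the covariance matrix
    with correlation parameter x_assumed, averaged over the prior for the true value."""
    A_bar = toy_A(x_assumed, eps)
    integrand = lambda xi: abs(expected_chi2(A_bar, toy_A(xi, eps)) - ndat) * prior.pdf(xi)
    value, _ = integrate.quad(integrand, 0.0, 1.0)
    return value

# Eq. (40): most likely value of x compatible with the stability criterion
grid = np.linspace(0.0, 1.0, 501)
allowed = [x for x in grid if avg_delta_chi2(x) <= np.sqrt(2 * ndat)]
x_star = max(allowed, key=prior.pdf)     # the prior decreases monotonically in x
print(x_star)
```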

The obvious disadvantage of this analysis is the difficulty of obtaining estimates of the covariance matrix parameters such as Eq. (37). These are unattainable outside the experimental collaborations responsible for the analysis, and presumably challenging within them. However, such an analysis may be useful to assess and refine correlation models internally. The regularisation procedure presented in Sect. 3 and applied to the toy model in Eq. (36) is intended for the more common situation in which such detailed information is missing. We demonstrate its usage for the problem of PDF determination next.

4 Determining PDFs with a regularised data set

In this section, we apply the regularisation procedure devised in Sect. 3 to a data set utilised for PDF determination. This is a particular problem relevant to LHC precision physics that relies on the \(\chi ^2\) as a figure of merit. We first discuss how the regularisation procedure can be applied to characterise the data set that enters a given PDF determination. We then show how PDFs change if the nominal data set is replaced by a suitably regularised one, and study their dependence on the regularisation parameter \(\delta \). We finally investigate how the regularisation procedure performs in comparison to the correlation models provided with the measurements in the few cases in which these are available. All of our investigations are performed in the framework of the recent NNPDF4.0 PDF determination [9].

4.1 Characterising the data set

The NNPDF4.0 data set is the widest data set used for PDF determination to date. It consists of legacy fixed-target and collider deep-inelastic scattering and fixed-target Drell–Yan measurements, and of a wide range of measurements for various production processes in proton–proton collisions at the LHC. These include both Run I and Run II measurements and make up about 30% of the NNPDF4.0 data set. Experimental uncertainties are typically of the order of a few percent, the largest part of which is made up of correlated systematic uncertainties. A detailed description of the NNPDF4.0 data set is provided in Sect. 2 of [9].

Here we take a closer look at the LHC measurements that are part of the NNPDF4.0 data set, and in particular scrutinise the matrix of uncertainties of each measurement that contains more than one data point. The goal is to identify the measurements for which an inaccurate estimation of experimental correlations may significantly affect their \(\chi ^2\). To this purpose, for each measurement, we compute the condition number Z, Eq. (26), apply the regularisation procedure delineated in Sect. 3 for different values of the parameter \(\delta \), and evaluate how much the regularised covariance matrix differs from the nominal one. This information is collected in Table 1, where we indicate, for each LHC measurement included in the NNPDF4.0 data set, its reference and the condition number Z; we also indicate the maximum relative difference of the variances \(\Delta \sigma _r\) and the maximum absolute difference of the correlations \(|\Delta \rho |\) computed between the nominal data set and the data set regularised with \(\delta ^{-1}=1,2,3,4,5,7\). Blank spaces indicate that \(\Delta \sigma _r=|\Delta \rho |=0\), that is, the regularisation procedure does not alter the nominal covariance matrix. We make two remarks.

Table 1 The LHC measurements included in the NNPDF4.0 data set [9]. For each measurement we indicate its reference, the condition number Z of the corresponding experimental covariance matrix, Eq. (26), and the maximum relative difference of the variances \(\Delta \sigma _r\) (in percent) and the maximum absolute difference of the correlations \(|\Delta \rho |\) computed between the nominal data set and the data set obtained by applying the regularisation procedure delineated in Sect. 3 for \(\delta ^{-1}=1,2,3,4,5,7\). Blank spaces indicate that \(\Delta \sigma _r=|\Delta \rho |=0\), that is, the regularisation procedure does not alter the nominal covariance matrix. For ATLAS WZ 7 TeV, CC and CF stand, respectively, for central-central and central-forward rapidity selections. We omit the data sets with a single data point
Table 2 The number of data points, \(N_{\mathrm{dat}}\), and the \(\chi ^2\) per data point, \(\chi ^2/N_{\mathrm{dat}}\), for the NNPDF4.0 NNLO baseline fit and for each of the fits performed with the regularisation procedure delineated in Sect. 3 for \(\delta ^{-1}=1,2,3,4,5,7\)

First, one can single out the data sets for which an inaccurate estimation of experimental correlations may be of concern in a PDF fit. These are the data sets with the largest values of the condition number Z. If these data sets turn out to also have an unsatisfactory \(\chi ^2\) in the fit, then additional investigations are needed to establish whether this is due solely to inaccurate experimental correlations, solely to inaccurate theoretical predictions, or to a combination of both. Conversely, if a data set has a low condition number Z but a large value of the \(\chi ^2\), the large value of the \(\chi ^2\) is likely due to genuine inconsistencies between the data set and theory predictions. These considerations may help determine the optimal data set utilised as input to PDF determination, as done for the NNPDF4.0 parton set (see in particular Sect. 4.2 in [9]).

Second, one can determine the optimal value of the regularisation parameter \(\delta \) in such a way that variances and correlations are not modified too much by the regularisation procedure in comparison to their nominal values. In this respect, inspection of Table 1 reveals that regularising the NNPDF4.0 data set with \(\delta ^{-1}=1\) or \(\delta ^{-1}=2\) is too aggressive, in that it leads to an increase of variances by an amount between 10% and 90%, and to a variation of correlations between 0.1 and 0.5, depending on the data set. These figures are reduced, respectively, below 10% and 0.1 for \(\delta ^{-1}=3\), and even further, to a few percent and below 0.05, for \(\delta ^{-1}=4\) and \(\delta ^{-1}=5\). Higher values of \(\delta ^{-1}\) alter the nominal data set only minimally. As expected, the data sets associated with the highest condition number Z are those that are generally most affected by the regularisation procedure, in that they display the largest variation of variances and correlations; they also remain sensitive to the regularisation procedure even if a modest amount of regularisation (that is, a high value of \(\delta ^{-1}\)) is applied. In the next section we shall see how these variations affect a fit of PDFs.

Among all of the LHC data sets collected in Table 1, we single out the two measurements that are associated with large values of Z and of the \(\chi ^2\) (see Table 2) at the same time: ATLAS WZ 7 TeV CC [15] and ATLAS dijets R=0.6 7 TeV [24]. They are representative of extreme cases in which small inaccuracies in the determination of experimental correlations can have a large impact on the computation of the \(\chi ^2\). Indeed these data sets have been the subject of much scrutiny [9, 15, 44, 45]. A value of Z of order 10 means that correlations must be estimated with an absolute uncertainty of roughly less than 0.1 in order to ensure that they make the \(\chi ^2\) fluctuate by less than one standard deviation. If the correlation between two bins is estimated to be 1.0 while its real value is instead 0.9, one can expect the \(\chi ^2\) per data point to deviate significantly (by more than one standard deviation) from unity, even if there is good consistency between experimental central values and theoretical expectations.
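The order of magnitude of this effect can be illustrated with a minimal two-bin example; the uncertainties below (1% uncorrelated, 10% correlated) are purely illustrative and are chosen only so that the condition number is of order 10, as for the two measurements above:

```python
import numpy as np

stat, syst = 0.01, 0.10      # 1% uncorrelated, 10% correlated uncertainty (illustrative)

def covmat(rho):
    """Covariance of two data points sharing one systematic with correlation rho."""
    return np.diag([stat ** 2, stat ** 2]) + syst ** 2 * np.array([[1.0, rho], [rho, 1.0]])

# Condition number of the assumed (rho = 1) matrix, Eq. (26): Z ~ 10
corr = covmat(1.0) / (stat ** 2 + syst ** 2)
print(1.0 / np.sqrt(np.linalg.eigvalsh(corr).min()))

# Expected chi^2 when rho = 1 is assumed but the true correlation is 0.9,
# i.e. tr(Cbar^{-1} C_true), equivalent to Eq. (6): ~12 instead of N_dat = 2
print(np.trace(np.linalg.solve(covmat(1.0), covmat(0.9))))
print(np.sqrt(2 * 2))        # one standard deviation of the chi^2 distribution
```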

Note that other data sets may have a large value of Z, e.g. CMS Z \(p_T\) 8 TeV [32], but not an anomalously large \(\chi ^2\) (see Table 2). While our decorrelation procedure will also affect these data sets, as seen in Table 1, we do not consider them in the following discussion.

In Fig. 3 we show how the regularisation procedure described in Sect. 3.1 affects the covariance and correlation matrices of the two data sets singled out above. Specifically, we show the relative difference of the covariance matrix \(\Delta \sigma _r\) and the difference of the correlation matrix \(\Delta \rho \) for each of their elements, computed between the nominal data set and the data set regularised with \(\delta ^{-1}=4\). For ATLAS WZ 7 TeV CC, we indicate the bins, differential in the rapidity of the lepton, \(\eta \), corresponding to \(W^+\), \(W^-\) and Z production (the latter in three kinematic regions); for ATLAS dijets R=0.6 7 TeV, we indicate the bins, differential in the invariant mass of the dijet, \(m_{12}\), corresponding to the six measured intervals of the absolute rapidity difference of the two leading jets, \(|y^*|\): \(0.0\le |y^*|\le 0.5\); \(0.5\le |y^*|\le 1.0\); \(1.0\le |y^*|\le 1.5\); \(1.5\le |y^*|\le 2.0\); \(2.0\le |y^*|\le 2.5\); and \(2.5\le |y^*|\le 3.0\). As already noted, differences are small and do not exceed \(5\%\) for variances and 0.05 for correlations. These variations seem very reasonable to us; their effect, as well as that induced by the larger (smaller) variations corresponding to more (less) aggressive regularisation, will be investigated next.
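For reference, the quantities reported in Table 1 can be obtained from any experimental covariance matrix along the lines of the following sketch, which reuses the regularize_uncertainties function sketched in Sect. 3.1; note that we compute \(\Delta \sigma _r\) from the standard deviations, which is our reading of the convention used in Table 1:

```python
import numpy as np

def regularization_impact(covmat, delta_inv):
    """Condition number Z, maximum relative change of the uncertainties and
    maximum absolute change of the correlations induced by the regularisation."""
    covmat_reg, Z = regularize_uncertainties(covmat, delta_inv)
    sig, sig_reg = np.sqrt(np.diag(covmat)), np.sqrt(np.diag(covmat_reg))
    corr = covmat / np.outer(sig, sig)
    corr_reg = covmat_reg / np.outer(sig_reg, sig_reg)
    delta_sigma_r = np.max((sig_reg - sig) / sig)   # regularised uncertainties never decrease
    delta_rho = np.max(np.abs(corr_reg - corr))
    return Z, delta_sigma_r, delta_rho
```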

4.2 Fitting PDFs

We now study the sensitivity of PDF determination to the regularisation procedure. To this purpose, we perform a series of fits, all based on the experimental, theoretical, and methodological input that enters the default next-to-next-to-leading order (NNLO) NNPDF4.0 parton set (see [9] for details), in which we regularise the data set. Specifically, we perform six fits in each of which we consider a different amount of regularisation, namely \(\delta ^{-1}=1,2,3,4,5,7\). All the fits are made of \(N_{\mathrm{rep}}=100\) Monte Carlo replicas. Note that these fits are different from those presented in Sect. 8.7 of [9]: here the regularisation procedure is applied to the NNPDF4.0 data set as a whole (and indeed to the total covariance matrix), while there it was applied only to a specific measurement (that was part of the NNPDF4.0 data set or not) at a time.

In Table 2 we display the value of the \(\chi ^2\) per data point, \(\chi ^2/N_{\mathrm{dat}}\), for each of these fits, and compare it to that of the NNLO NNPDF4.0 default fit. Deep-inelastic scattering, fixed-target Drell–Yan, and Tevatron Drell–Yan measurements, which are mostly unaffected by the regularisation procedure, are all aggregated; ATLAS, CMS and LHCb measurements are instead displayed individually. The total values (for each experiment and for the total data set) are also shown, as well as the corresponding number of data points.

Fig. 3

The relative difference of the covariance matrix \(\Delta \sigma _r\) (top) and the difference of the correlation matrix \(\Delta \rho \) (bottom) for each of their elements, computed between the nominal data set and the data set regularised with \(\delta ^{-1}=4\). We show results for the two measurements in the NNPDF4.0 data set that have the largest value of Z, see Table 1: ATLAS WZ 7 TeV CC [15] (left) and ATLAS dijets R=0.6 7 TeV [24] (right). For ATLAS WZ 7 TeV CC, we indicate the bins, differential in the rapidity of the lepton, \(\eta \), corresponding to \(W^+\), \(W^-\) and Z production (the latter in three kinematic regions); for ATLAS dijets R=0.6 7 TeV we indicate the bins, differential in the invariant mass of the dijet, \(m_{12}\), corresponding to the six measured intervals of the absolute rapidity difference of the two leading jets, \(|y^*|\), see text for details

Fig. 4

The PDFs obtained by fitting the NNPDF4.0 data set after regularisation with different values of the parameter \(\delta ^{-1}=1,3,4,5,7\). From top to bottom, left to right, we show the up, anti-up, down, anti-down, strange, anti-strange, charm and gluon PDFs at a scale \(Q=100\) GeV. PDFs are compared to the NNPDF4.0 baseline parton set, and normalised to its central value. For \(\delta ^{-1}=1,3,5,7\) we display only the central value. Otherwise uncertainties correspond to 68% confidence levels. All PDF fits are accurate to NNLO

In Fig. 4 we then display the resulting PDFs, specifically the up, anti-up, down, anti-down, strange, anti-strange, charm and gluon PDFs at a scale \(Q=100\) GeV. PDFs are compared to the NNPDF4.0 NNLO baseline parton set, and are normalised to its central value. For \(\delta ^{-1}=1,3,5,7\) we display only the central value. Otherwise uncertainties correspond to 68% confidence levels.

A joint inspection of Table 2 and of Fig. 4 reveals some interesting features. We first observe that, as expected, the regularisation procedure has a significant effect on the \(\chi ^2\). A general decrease of its value is observed in comparison to NNPDF4.0, by an amount that grows as the amount of regularisation increases (that is, as the value of \(\delta ^{-1}\) decreases). For the largest value \(\delta ^{-1}=7\), no statistically significant differences are seen with respect to NNPDF4.0, neither in the value of the \(\chi ^2\) per data point nor in the PDFs. Conversely, for the smallest values, \(\delta ^{-1}=1\) and \(\delta ^{-1}=2\), the total \(\chi ^2\) per data point drops from 1.16 to 0.58 and 0.97, respectively. These variations correspond to a \(28\sigma \) and a \(9\sigma \) fluctuation in units of the \(\chi ^2\) standard deviation, which obviously denote an excessive regularisation of the NNPDF4.0 data set. As noted at the end of Sect. 3, such an excessive regularisation may also arise from neglecting terms of \({\mathcal {O}}(\delta ^2)\) in Eq. (12).

The PDFs obtained in the fit with \(\delta ^{-1}=1\) (and similarly in the fit with \(\delta ^{-1}=2\), which is not displayed in Fig. 4) are indeed consistently distorted in comparison to NNPDF4.0. The central value of the former fluctuates, in units of the NNPDF4.0 PDF uncertainty around the central value of the latter, by about one standard deviation for the up, anti-up, down and anti-down PDFs, and by slightly more for the strange, anti-strange, charm and gluon PDFs. In this respect, it is worth noting that the strange and gluon PDFs are sensitive, respectively, to the ATLAS WZ 7 TeV CC [15] and ATLAS dijets R=0.6 7 TeV [24] data sets: these have some of the largest values of Z and display the largest reduction of \(\chi ^2\) upon regularisation.

The extreme cases corresponding to \(\delta ^{-1}=1,2,7\) are therefore to be interpreted as a validation of the regularisation procedure, which behaves as expected. The fits corresponding to \(\delta ^{-1}=3\), \(\delta ^{-1}=4\) and \(\delta ^{-1}=5\) are instead more interesting. Variations of the \(\chi ^2\) with respect to NNPDF4.0 correspond, respectively, to a \(3.8\sigma \), \(2.4\sigma \) and \(1.4\sigma \) fluctuation in units of the \(\chi ^2\) standard deviation. Interestingly, the difference between the expected \(\chi ^2/N_{\mathrm{dat}}=1\) and the \(\chi ^2\) obtained in the fits corresponding to \(\delta ^{-1}=3,4,5\) amounts, respectively, to \(3.3\sigma \), \(5.3\sigma \) and \(6.2\sigma \) in units of the \(\chi ^2\) standard deviation. This is a significant reduction in comparison to the \(7.7\sigma \) of the default NNPDF4.0 determination.

Such an improvement in the \(\chi ^2\) statistic is accompanied by remarkably limited PDF variations if one compares the fits with \(\delta ^{-1}=3,4,5\) with NNPDF4.0. Central values fluctuate by a small fraction of the NNPDF4.0 PDF uncertainty, except for the gluon PDF, which varies by up to half of the NNPDF4.0 uncertainty around \(x\sim 0.3\); PDF uncertainties are almost unaffected. Remarkably, all these variations are much smaller than those due to variations of the data set itself (see Sect. 7 in [9]).

The fact that PDFs do not vary significantly in the fits to the regularised data set with \(\delta ^{-1}=3,4,5\) is further displayed in Fig. 5, where we show a data–theory comparison for some selected bins of the ATLAS WZ 7 TeV CC [15] and dijets R=0.6 7 TeV [24] measurements. Specifically, we show the \(W^+\) and \(W^-\) subsets, as a function of the absolute value of the lepton rapidity \(\eta \), for the former, and two bins in the absolute rapidity difference between the two leading jets \(|y^*|\) as a function of the di-jet invariant mass \(m_{12}\), for the latter. Theoretical predictions are obtained with the NNPDF4.0 baseline parton set and with the PDFs obtained by fitting the NNPDF4.0 data set regularised with \(\delta ^{-1}=1,3,4,5,7\). They are all accurate to NNLO in the strong coupling, both in the PDFs and in the matrix elements. Results are shown as ratios to the experimental central value, with one-sigma experimental and PDF uncertainties. The experimental uncertainty is the sum in quadrature of the statistical and of all systematic uncertainties.

Fig. 5

Data–theory comparison for the \(W^\pm \) subset of the ATLAS WZ 7 TeV CC measurement [15], as a function of the absolute lepton rapidity \(\eta \) (top), and for two bins in the absolute rapidity difference between the two leading jets \(|y^*|\) of the ATLAS dijets R=0.6 7 TeV measurement [24], as a function of the di-jet invariant mass \(m_{12}\). Theoretical predictions are obtained with the NNPDF4.0 baseline parton set and with the PDFs obtained by fitting the NNPDF4.0 data set regularised with \(\delta ^{-1}=1,3,4,5,7\). They are all accurate to NNLO in the strong coupling, both in the PDFs and in the matrix elements. Results are shown as ratios to the experimental central value, with one-sigma PDF and experimental uncertainties. The latter is the sum in quadrature of the statistical and of all systematic uncertainties

As noted in Sect. 4.1, the data sets displayed in Fig. 5 are those with large values of Z and \(\chi ^2\), and for which the regularisation procedure introduces some of the largest differences in the variances and correlations of the data, see Table 1 and Fig. 3. In spite of this, only small differences are observed between predictions obtained with NNPDF4.0 and any of the regularised fits with \(\delta ^{-1}=1,3,4,5,7\); slightly larger fluctuations are observed in the fit with a large amount of regularisation (\(\delta ^{-1}=1\)), albeit only for the data points at central rapidity, for ATLAS WZ 7 TeV CC, or at large invariant mass, for ATLAS dijets \(\hbox {R}=0.6\) 7 TeV.

We therefore conclude that the PDFs obtained from any of the regularised fits with \(\delta ^{-1}=3,4,5\) represent the same underlying truth as the NNPDF4.0 parton set. They however lead to a \(\chi ^2\) that is better than the NNPDF4.0 one by up to \(4\sigma \), in units of the \(\chi ^2\) standard deviation, and that is only about \(3\sigma \) away from the expectation of unit \(\chi ^2\) (instead of about \(8\sigma \)). In other words, the nominal \(\chi ^2\) determined in [9] is likely to be spuriously inflated by inaccuracies in the estimation of the experimental correlations in the LHC data. Further discrimination among the equally good values \(\delta ^{-1}=3,4,5\) can be made on the basis of how large the changes to the covariance matrices are in relation to the precision with which they are estimated. Since this precision is unknown, such a choice entails a degree of subjectivity. We deem that the values of \(\Delta \sigma _r < 5\%\) and \(|\Delta \rho | < 0.05\) implied by \(\delta ^{-1}=4\) make it safe to assume that the resulting regularised covariance matrices are compatible with the original ones within the precision with which they were determined, while ensuring stability against possibly larger inaccuracies in the correlations. Therefore, the fit with \(\delta ^{-1}=4\) will be used as reference in the remainder of this paper.

4.3 Correlating and decorrelating experimental uncertainties with more information

As we have mentioned in Sect. 3.1, the correlation models provided with the measurements have to be preferred to our regularisation procedure whenever these are available, and if they result in a stable covariance matrix. For example, the correlation model recommended in [23] for the analysis of the ATLAS jets R=0.6 8 TeV measurement is used by default in the NNPDF4.0 determination [9] and in all the fits presented in Sect. 4.2. It is therefore not surprising that the regularisation procedure has almost no impact on the \(\chi ^2\) of this specific data set.

Correlation models, which follow from a careful experimental analysis of all of the sources of systematic uncertainties and of their correlations, are however not always available. Sometimes they become available only long after the measurement is published, and sometimes a clear recommendation for their usage is not provided. In order to remedy this lack of information, some guesswork is carried out to identify the systematic uncertainties whose nominal correlations are likely to be too strong. For instance, two such studies [46, 47] targeted, respectively, the ATLAS 7 TeV single-inclusive jet measurement [48] and the 8 TeV top-pair lepton+jet measurement [21]. They were performed in the framework of the MMHT2014 global analysis [49] by inspecting the nuisance parameters associated with each systematic uncertainty in the \(\chi ^2\). Similar studies [50, 51], targeting the same measurements and based on complete decorrelation of certain systematic uncertainties, were also carried out in the framework of the NNPDF3.1 global analysis [52]. Sometimes these analyses have been used to inform and/or validate the experimental correlation models. In this respect, our regularisation procedure can be utilised in the same spirit, with the advantage that it is more general and requires less information than the aforementioned analyses.

Here we investigate how the regularisation procedure performs in comparison to the correlation models provided with the measurement in the few cases in which these are available. We consider two cases. The first case concerns the ATLAS dijets R=0.6 7 TeV measurement [24], for which STRONG and WEAK (de-)correlation models are provided on top of the nominal correlation model used in NNPDF4.0 and in all the fits of Sect. 4.2. None of these models is clearly recommended in [24], which is why they have not been previously considered. The second case concerns three ATLAS 8 TeV measurements, namely the \(W^\pm \)+jet [19], the \(t{\bar{t}}~\ell \)+jets [21], and the single-inclusive jets R=0.6 [23] measurements. Details on how to correlate or decorrelate systematic uncertainties between bins within and across these measurements have been provided only very recently [53], which is why these correlations have not been previously considered either. We will henceforth refer to this correlation model with the label ATLAS.

We then perform four fits based on these correlation models, all of them otherwise using the experimental, theoretical, and methodological input that enters the default NNPDF4.0 parton set. The first two fits are performed using, respectively, the STRONG and WEAK correlation models for the ATLAS dijets \(\hbox {R}=0.6\) 7 TeV measurement; experimental correlations for all of the other data sets are as in NNPDF4.0. The third fit is performed using the ATLAS correlation model for all the ATLAS 8 TeV measurements concerned. This correlation model was not completely utilised in NNPDF4.0 (in particular as concerns correlations between pairs of points belonging to different data sets), nor does it enter the two aforementioned fits. The fourth fit is performed by combining the WEAK and ATLAS correlation models at the same time.

In Table 3 we display the value of the \(\chi ^2\) per data point, \(\chi ^2/N_{\mathrm{dat}}\), for each of these fits, and compare it to that of the NNLO NNPDF4.0 default fit and to that of the fit obtained by regularising the NNPDF4.0 data set with \(\delta ^{-1}=4\). For conciseness, we aggregate the data sets into the following classes: deep-inelastic scattering, fixed-target Drell–Yan, Tevatron Drell–Yan, ATLAS, CMS, and LHCb. For ATLAS, we also indicate the individual \(\chi ^2\) of the data sets affected by the correlation models. The corresponding number of data points, \(N_{\mathrm{dat}}\), is also indicated.

Table 3 The number of data points, \(N_{\mathrm{dat}}\), and the \(\chi ^2\) per data point, \(\chi ^2/N_{\mathrm{dat}}\), for the NNPDF4.0 NNLO baseline fit, for each of the fits performed with a different correlation model (see text for details), and for the fit to the NNPDF4.0 data set regularised with \(\delta ^{-1}=4\)

In Fig. 6 we show the resulting PDFs, specifically the anti-up, anti-down, charm and gluon PDFs at a scale \(Q=100\) GeV. The PDFs are compared to the NNPDF4.0 NNLO baseline parton set, and to the PDFs obtained by regularising the NNPDF4.0 data set with \(\delta ^{-1}=4\). All the curves are normalised to the NNPDF4.0 central value. For the NNPDF4.0 NNLO baseline, the uncertainty band corresponds to the 68% confidence interval; for all other PDFs, we show only the central value.

Fig. 6 The PDFs obtained by fitting the NNPDF4.0 data set with correlation models provided by the experiment for a subset of measurements (see text for details). From top to bottom, left to right, we show the anti-up, anti-down, charm and gluon PDFs at a scale \(Q=100\) GeV. PDFs are compared to the NNPDF4.0 baseline parton set, and normalised to its central value. Also shown are the PDFs obtained in a fit to the NNPDF4.0 data set regularised with \(\delta ^{-1}=4\). For the NNPDF4.0 baseline, the uncertainty corresponds to the 68% confidence interval; for all other PDFs, we show only the central value. All PDFs are accurate to NNLO

A joint inspection of Table 3 and Fig. 6 reveals two features. First, the fit quality, as quantified by the value of the \(\chi ^2\) per data point, does not change upon variation of the available correlation models, either for the data sets affected by the model or for the other data sets. This behaviour contrasts with the larger variations seen upon refitting a regularised data set, even when the amount of regularisation is fairly limited, see Table 2. Second, the shifts of the PDF central values induced by a given correlation model, however modest they turn out to be, are generally very close to the shifts induced by the regularisation of the data set (specifically with \(\delta ^{-1}=4\)). This is apparent for the WEAK correlation model, whose only effect is to partially decorrelate certain uncertainties in one of the data sets with the largest condition number, ATLAS dijets R=0.6 7 TeV [24]. In this respect, this correlation model is relatively close to what the regularisation procedure achieves. For the other correlation models, qualitatively similar shifts are also seen, although with some quantitative differences.

The fact that our regularisation procedure captures the same qualitative shifts of the PDF central values as the experimental correlation models, when either is used to determine the PDFs, is suggestive. Whether this is a coincidental feature, limited to the correlation models considered here, or a more general one, could only be investigated if additional correlation models became available for testing. In general, it is reasonable to expect that a correlation model that alters the original covariance matrix as little as possible, while improving the stability of the \(\chi ^2\), leads to results similar to those obtained with the regularisation procedure. That being said, the shifts in the central values remain so small that it would be hard to draw any conclusions based on their statistical significance.

On the other hand, the fact that using the correlation models does not lead to a better \(\chi ^2\) is a consequence of the condition number of the corresponding experimental covariance matrix being almost unaltered, as we have explicitly checked. We therefore conclude that the available correlation models are not sufficient to yield a stable experimental covariance matrix and \(\chi ^2\). In light of these considerations, we find that our regularisation procedure may serve as a useful diagnostic tool to inform and validate correlation models, not only at the level of PDF fits, but also at the level of the corresponding experimental analyses.

5 Conclusions

In this paper we have shown how an (even slightly) inaccurate determination of the bin-by-bin correlations in the uncertainties of experimental measurements may make the \(\chi ^2\) statistic fluctuate substantially, by more than one standard deviation. This problem is particularly relevant when dealing with high-precision measurements, in which the largest fraction of the uncertainty is correlated. This is the case for current and future LHC measurements that are routinely confronted with theoretical predictions by means of statistical inference. Because the \(\chi ^2\) is utilised as a figure of merit in these analyses, instabilities in its computation can make the interpretation of the results unreliable.

We have formulated the problem rigorously, by deriving a stability criterion for the acceptable fluctuations of the uncertainties on data correlations. The criterion ensures that the expectation value of the \(\chi ^2\) does not overestimate its true value by an amount larger than its statistical fluctuation. To this aim, the criterion defines a bound on the singular values of the correlated part of the matrix of uncertainties. Building upon this criterion, we have then devised a regularisation procedure, whereby instabilities in the correlations of experimental uncertainties are removed with minimal information and without loss of generality. The idea is to clip the singular values of the correlated part of the matrix of uncertainties to a constant \(\delta \), whenever they fall below this value, while leaving the remaining singular values, and all singular vectors, unchanged. This way, directions that do not contribute to the instability are not affected and the alteration of the original matrix is minimal.
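As an illustration of the clipping step, the following minimal sketch (in Python; not the implementation distributed with the NNPDF software) assumes that the correlated uncertainties are collected in an \(N_{\mathrm{dat}}\times N_{\mathrm{sys}}\) matrix, with each entry normalised to the total uncertainty of the corresponding data point, and that the covariance matrix is rebuilt from the uncorrelated diagonal plus the outer product of the regularised correlated block; both the normalisation and the variable names are our own assumptions for the purpose of the example.

import numpy as np

def clip_singular_values(A, delta):
    # A: N_dat x N_sys block of correlated uncertainties, normalised to the
    # total uncertainty of each data point (assumption, see lead-in).
    # Singular values smaller than delta are raised to delta; the remaining
    # singular values and all singular vectors are left unchanged.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s, delta)) @ Vt

def regularised_covariance(sigma_tot, f_uncorr, A, delta):
    # sigma_tot: total uncertainty of each data point.
    # f_uncorr:  uncorrelated uncertainty of each point, normalised to sigma_tot.
    # Rebuild the covariance as D (diag(f_uncorr^2) + A_reg A_reg^T) D.
    A_reg = clip_singular_values(A, delta)
    rho_reg = np.diag(np.asarray(f_uncorr) ** 2) + A_reg @ A_reg.T
    D = np.diag(np.asarray(sigma_tot))
    return D @ rho_reg @ D

For \(\delta ^{-1}=4\), one would call regularised_covariance(sigma_tot, f_uncorr, A, delta=0.25): singular values already above 0.25 are untouched, so the bulk of the correlation structure is preserved, while the directions responsible for the instability are lifted to the bound.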

The key assumptions underlying the regularisation procedure are that the correlations of experimental uncertainties across data points are determined much less precisely than the uncertainties on each data point, and that the prevalent source of inaccuracy on the correlations is concentrated on a subset of data points and originates from a small number of correlated uncertainties. The regularisation procedure leads to a covariance matrix that is more stable than the original one when used to compute the \(\chi ^2\), that is compatible with it within the precision with which it is determined, and that does not lead to a reduction of the total uncertainty.
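These properties can be monitored numerically. The sketch below is our own illustration (not part of any published code) of how one might compare an original and a regularised covariance matrix through the quantities used in the text: the maximum relative change of the per-point uncertainties, the maximum absolute change of the correlations, and the condition number of the correlation matrix before and after regularisation.

import numpy as np

def covariance_diagnostics(C_orig, C_reg):
    # Split a covariance matrix into per-point uncertainties and correlations.
    def split(C):
        sigma = np.sqrt(np.diag(C))
        rho = C / np.outer(sigma, sigma)
        return sigma, rho

    s0, r0 = split(C_orig)
    s1, r1 = split(C_reg)
    return {
        "max_rel_change_sigma": float(np.max(np.abs(s1 - s0) / s0)),
        "max_abs_change_rho": float(np.max(np.abs(r1 - r0))),
        "condition_number_before": float(np.linalg.cond(r0)),
        "condition_number_after": float(np.linalg.cond(r1)),
        "any_uncertainty_reduced": bool(np.any(s1 < s0)),
    }

For a regularisation such as the one adopted in Sect. 4 (\(\delta ^{-1}=4\)), one would expect the first two entries to stay below 0.05, the condition number to decrease, and the last flag, which signals a reduction of some per-point uncertainty, to be False.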

We have demonstrated how the regularisation procedure works in a toy model, and in a particular problem relevant to LHC precision physics that relies on the evaluation of the \(\chi ^2\) as a figure of merit: PDF determination. Specifically, we have considered the NNPDF4.0 determination [9], which is based on the widest data set to date. We have shown how the regularisation procedure can be utilised as a diagnostic tool to characterise the data set, in particular to single out those measurements for which an inaccurate estimation of experimental correlations may significantly affect the \(\chi ^2\). We have also studied how the PDFs change if the nominal data set is replaced by a suitably regularised data set, and how these changes depend on the regularisation parameter \(\delta \). To this purpose, we have repeated the NNPDF4.0 baseline fit, now utilising a data set regularised with \(\delta ^{-1}=1,2,3,4,5,7\).

We have found that the \(\chi ^2\) of some LHC data sets can indeed be significantly affected by inaccuracies in the determination of the correlations of their uncertainties. These inaccuracies can be reasonably regularised by choosing \(\delta ^{-1}=4\), a value that modifies uncertainties and correlations by less than \(5\%\) and 0.05, respectively, consistently with the precision at which they are typically known. We have demonstrated that, by regularising the NNPDF4.0 data set with \(\delta ^{-1}=4\), the global \(\chi ^2\) becomes smaller than that of the baseline NNPDF4.0 determination by about \(2.4\sigma \), and is therefore only \(5.3\sigma \) away from the unity expectation (instead of \(7.7\sigma \) in the baseline NNPDF4.0 determination). At the same time, the PDFs remain unaltered. These results highlight the fact that the nominal \(\chi ^2\) determined in [9] is likely to be spuriously inflated by inaccuracies in the estimation of the experimental correlations in the LHC data.

Finally, we have studied how the regularisation procedure performs in comparison to the correlation models provided with the measurements, in the few cases in which these are available. We have found that our regularisation procedure captures the same qualitative shifts of the PDF central values as the experimental correlation models, when either is used to determine the PDFs. Whether this is a coincidental feature, limited to the correlation models considered here, or a more general one, could only be investigated if additional correlation models became available for testing. On the other hand, using the correlation models does not lead to a better \(\chi ^2\), nor to a decrease in the condition number of the corresponding experimental covariance matrix. We therefore conclude that the available correlation models are not sufficient to yield a stable \(\chi ^2\). In light of these considerations, we find that our regularisation procedure may serve as a useful diagnostic tool to inform and validate correlation models, not only at the level of PDF fits, but also at the level of the corresponding experimental analyses.

Our regularisation procedure is made publicly available as part of the NNPDF software [10]. The PDF sets discussed in this paper are available, in the LHAPDF format [54], from the NNPDF web page [55].