1 Introduction

Parton distribution functions (PDFs) play a crucial role in the prediction of Standard Model observables at the Large Hadron Collider (LHC) and beyond [1, 2]. PDF uncertainties are already a limiting factor in these predictions [3]. This means that a precise and accurate knowledge of PDFs is essential for full exploitation of the physics at the LHC. Besides the familiar experimental uncertainties in the data that are input to a global PDF fit, it is becoming increasingly necessary to also consider theoretical uncertainties: missing higher order uncertainties (MHOUs) due to the truncation of perturbative expansions in the hard cross-sections and parton evolution, uncertainties due to the use of nuclear targets, uncertainties due to missing higher twist, uncertainties due to external parameters such as quark masses, showering and hadronization uncertainties in final state Monte Carlo simulations, and so on. In the past, the technique used to estimate the effect of theoretical uncertainties was to compare the PDFs produced with and without various theoretical corrections, rather than to incorporate the uncertainties into the fit itself. However, the limitations of such a procedure are clear.

Recently a new approach to estimating the impact of theoretical uncertainties on global PDF fits has been developed, through the construction of a ‘theory covariance matrix’ [4], analogous to the experimental covariance matrix used in global PDF fits. By adding the theory covariance matrix to the experimental covariance matrix as an additional source of uncertainty, theoretical uncertainties can be incorporated directly in the fit, where they impact not only the overall level of PDF uncertainty, but also the relative weight of different datasets. This novel approach has so far been applied to uncertainties associated with nuclear effects [5, 6], and to the estimation of MHOUs by scale variation [7, 8, 9] (though other methods [10, 11, 12] of estimating MHOUs are under development). Factorization scale variations estimate the MHOUs in parton evolution (correlated across all observables), and renormalization scale variations estimate the MHOUs in fixed order calculations of process-dependent hard cross-sections (correlated only across a given class of processes). The first global PDFs including MHOUs estimated in this way were presented in Ref. [9].

When making predictions for hadronic observables there are again two sources of MHOU: uncertainties in the PDF evolution, which can also be estimated by factorization scale variation, and uncertainties in the hard cross-section, again estimated by renormalization scale variation. These arise on top of the MHOUs incorporated in the determination of the PDFs themselves, manifested as a part of the PDF uncertainty. So there will be correlations between the MHOUs in the PDFs and the additional MHOUs in the predictions. The PDFs themselves are determined from a wealth of data from a wide range of processes. When making a prediction for one of these processes (for example in a new kinematic regime), there will inevitably be correlations between the renormalization scale variation in the PDF determination and that in the prediction. Even if the process is a new one, for example Higgs production, correlations due to factorization scale variation will still be present: all processes dependent on PDFs have an MHOU due to the need to evolve the PDFs to the scale of the process.

The potential importance of these correlations can be exposed by considering a simple situation in which a PDF is determined from data at a given scale on a single observable (such as a nonsinglet structure function), and used to make predictions of another observable [13]. In this special case the PDF can be eliminated altogether, and the predicted observable determined directly from the measured observable. Including the MHOU twice (once in the calculation of the measured observable, then again in making the predicted observable), whilst ignoring their possible correlation, would amount to “double counting”, and thus an overestimate of the uncertainty. Of course in a global PDF fit, involving many different processes, this particular formulation is no longer applicable. However the fact remains that there will still be some residual correlation between the MHOU in the PDF determination and the MHOU in the prediction, and to neglect it may lead to an overestimation of the MHOU.

In Ref. [9], it was seen that in a realistic NLO global fit, the effect of MHOU on the PDF uncertainty is quite small, the main consequence being a rebalancing of the impact of different datasets depending on their relative MHOU. The MHOU in the PDF is consequently rather smaller than the MHOU in the prediction, and if they are combined in quadrature, the effect of missing correlations will likely also be small. It was further argued that combining the PDF uncertainties and theoretical uncertainties in quadrature is conservative, since it can only lead to an overestimate of uncertainties, and thus better than neglecting MHOUs altogether. Nevertheless, since the correlation between the uncertainties has not been computed explicitly, there remains the intriguing possibility that in some circumstances including the correlation may yield more precise, and perhaps even more accurate, predictions.

In this paper we show explicitly how to propagate theoretical uncertainties in the determination of global PDFs into the predictions made using these PDFs, taking account of all correlations between theoretical uncertainties. As this is a complicated problem, we proceed step by step. In Sect. 2 we show how a single source of theoretical uncertainty can be reformulated in terms of a nuisance parameter, which holds the key to the propagation of uncertainties. We explore the interplay between the theoretical uncertainties and the experimental uncertainties in the data in two idealised contexts: the first in which there are no fitted parameters, the second in which there are as many parameters as data, so the experimental data can be fitted exactly. This allows us to identify three distinct effects of the correlation of theoretical uncertainties. Then in Sect. 3 we consider a more realistic, but still simplified, situation where we fit the data using a single parameter, still in a theory with a single source of theoretical uncertainty, and find that all three of these correlation effects are still present, and can be computed. In Sect. 4 these results are extended to multiple theoretical uncertainties, in a multiparameter fit, and then less trivially to a PDF fit where the PDFs are continuous functions, with a functional uncertainty. Finally in Sect. 5 we present numerical results, made in the context of the NNPDF3.1 NLO global fit with MHOU presented in Ref. [9], and make predictions with MHOU for repetitions of the experiments included in the fit (so-called ‘autopredictions’), and for genuine predictions (top and Higgs production). We are able to compute all correlations explicitly, calculating corrections to central values and changes in PDF and theoretical uncertainties for both autopredictions and genuine predictions, and confirm that the correlated predictions can be both more accurate and more precise than the conservative prescription. A summary is provided in Sect. 6.

2 Predictions with correlated theoretical uncertainties

We showed in Refs. [4, 5] that when we fit N experimental data points \(D_i\) to theoretical predictions \(T_i\), \(i = 1,\ldots ,N\), then the uncertainties in the theoretical predictions can be incorporated into the fit simply by adding a theoretical covariance matrix \(S_{ij}\) to the experimental covariance matrix \(C_{ij}\). The only assumptions made in deriving this result are that all uncertainties, both experimental and theoretical, can be treated as Gaussian, and that the theoretical uncertainties are independent of the experimental data. We used this result in Refs. [5, 6] to include nuclear uncertainties in a PDF fit, and in Refs. [8, 9] to incorporate missing higher order uncertainties in global PDF fits.

The result of Sec. 2 of Ref. [5] may be summarised in terms of the Bayesian probability

$$\begin{aligned} P(T|D)\propto \exp \left( -{{1}\over {2}}(T-D)^T(C+S)^{-1}(T-D)\right) \end{aligned}$$
(2.1)

where for simplicity we adopt a matrix notation: C and S are real symmetric matrices, with C strictly positive definite (thus invertible), and S positive semi-definite. In practice theoretical uncertainties are highly correlated, so S may (and generally will) have zero eigenvalues. We determine T from D by maximizing P(T|D): this is equivalent to minimizing

$$\begin{aligned} \chi ^2 = (T-D)^T(C+S)^{-1}(T-D) \end{aligned}$$
(2.2)

with respect to free parameters characterizing the theoretical prediction.

In this section we will also assume that there is only a single source of fully correlated theoretical uncertainty, so that the theory covariance matrix can be written as

$$\begin{aligned} S = \beta \beta ^T \end{aligned}$$
(2.3)

for some real nonzero vector \(\beta \) (i.e. that \(S_{ij} = \beta _i\beta _j\)). Then all eigenvalues of S except one are zero. We will develop a nuisance parameter formalism to propagate this theoretical uncertainty (Sect. 2.1), and then show how to apply it in two extreme cases: firstly to the situation in which the theory contains no free parameters, and thus where there is no fitting (Sect. 2.2), and then to the situation where there are as many free parameters as data points, so that we can achieve a perfect fit (Sect. 2.3).
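As a numerical illustration of Eqs. (2.2) and (2.3), the following minimal sketch builds a rank-one theory covariance matrix and evaluates the \(\chi ^2\) with and without it. All inputs (the number of points, the diagonal C, the vector \(\beta \), the data D and the predictions T) are invented toy values, assumed here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                    # number of data points (toy value)

# Toy inputs, all hypothetical: a diagonal experimental covariance C,
# a single fully correlated theory shift beta, data D and predictions T.
C = np.diag(rng.uniform(0.5, 1.5, N))    # strictly positive definite
beta = rng.normal(size=N)                # real nonzero vector
S = np.outer(beta, beta)                 # Eq. (2.3): S = beta beta^T
D = rng.normal(size=N)
T = D + rng.normal(scale=0.5, size=N)

def chi2(T, D, cov):
    """Quadratic form (T-D)^T cov^{-1} (T-D), evaluated with a linear solve."""
    r = T - D
    return float(r @ np.linalg.solve(cov, r))

rank_S = np.linalg.matrix_rank(S)        # only one nonzero eigenvalue
chi2_exp = chi2(T, D, C)                 # experimental covariance only
chi2_tot = chi2(T, D, C + S)             # Eq. (2.2), theory covariance added

print(rank_S, chi2_exp, chi2_tot)
```

Although S itself is singular, the sum C+S remains invertible because C is strictly positive definite, which is why the linear solve in the sketch is well defined.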

2.1 Nuisance parameters

It will be useful in what follows to introduce a nuisance parameter \(\lambda \) for the correlated theoretical uncertainty. Following the notation in Ref. [14], we model the theoretical uncertainty as a fully correlated shift in the theoretical prediction: \(T\rightarrow T+\lambda \beta \). The nuisance parameter \(\lambda \) then gives the size of the shift. For a given shift, and with Gaussian experimental uncertainties, we have

$$\begin{aligned}&P(T|D\lambda )\nonumber \\&\quad \propto \exp \left( -{{1}\over {2}}(T+\lambda \beta -D)^TC^{-1}(T+\lambda \beta -D)\right) . \end{aligned}$$
(2.4)

Now using Bayes’ Theorem,

$$\begin{aligned} P(T|D\lambda )P(\lambda |D) = P(\lambda |TD)P(T|D). \end{aligned}$$
(2.5)

To determine P(T|D) we need to first fix the prior distribution of \(\lambda \). Since this is a theoretical uncertainty, it is reasonable to assume that the prior is independent of the data, thus that \(P(\lambda |D)=P(\lambda )\). Then, marginalizing Eq. (2.5) over \(\lambda \),

$$\begin{aligned} P(T|D) = \int \! d\lambda \, P(T|D\lambda )P(\lambda ). \end{aligned}$$
(2.6)

Taking \(P(\lambda )\) as a Gaussian, centred on zero (so the theoretical uncertainty is unbiased), with unit width (fixing the overall normalization of the prior theoretical uncertainty), we choose

$$\begin{aligned} P(\lambda ) \propto \exp \left( -{{1}\over {2}}\lambda ^2\right) . \end{aligned}$$
(2.7)

The integration over the nuisance parameter is now Gaussian,

$$\begin{aligned}&P(T|D) \propto \int d\lambda \,\nonumber \\&\quad \times \exp \left( -{{1}\over {2}}[(T+\lambda \beta -D)^TC^{-1}(T+\lambda \beta -D)+\lambda ^2]\right) \, , \end{aligned}$$
(2.8)

and can be performed in the usual way by completing the square:

$$\begin{aligned}&(T+\lambda \beta -D)^TC^{-1}(T+\lambda \beta -D)+\lambda ^2 \nonumber \\&\quad = Z^{-1}\left( \lambda +Z\beta ^TC^{-1}(T-D)\right) ^2\nonumber \\&\qquad + (T-D)^TC^{-1}(T-D) - Z(\beta ^TC^{-1}(T-D))^2,\nonumber \\ \end{aligned}$$
(2.9)

where we have defined

$$\begin{aligned} Z = (1+\beta ^TC^{-1}\beta )^{-1} = 1-\beta ^T(C+S)^{-1}\beta . \end{aligned}$$
(2.10)

The second expression was obtained by noting that

$$\begin{aligned} (1+\beta ^T C^{-1}\beta )(1-\beta ^T(C+S)^{-1}\beta )=1 \end{aligned}$$
(2.11)

using Eq. (2.3).

Since

$$\begin{aligned} (\beta ^TC^{-1}(T-D))^2 = (T-D)^TC^{-1}\beta \beta ^TC^{-1}(T-D), \end{aligned}$$
(2.12)

we can combine the two terms on the last line of Eq. (2.9) to give

$$\begin{aligned}&(T-D)^T(C^{-1}-ZC^{-1}\beta \beta ^TC^{-1})(T-D) \nonumber \\&\quad = (T-D)^T(C+\beta \beta ^T)^{-1}(T-D), \end{aligned}$$
(2.13)

since

$$\begin{aligned}&(C+\beta \beta ^T)(C^{-1}-ZC^{-1}\beta \beta ^TC^{-1}) =1 + \beta \beta ^TC^{-1}- Z\beta \beta ^TC^{-1} \nonumber \\&\quad - Z\beta \beta ^TC^{-1}\beta \beta ^TC^{-1} = 1, \end{aligned}$$
(2.14)

by substituting (from Eq. (2.10)) \(\beta ^TC^{-1}\beta = Z^{-1}-1\) in the last term. Using the definition Eq. (2.3), we recognise Eq. (2.13) as the \(\chi ^2\), Eq. (2.2). Defining

$$\begin{aligned} \overline{\lambda }(T,D) = Z\beta ^TC^{-1}(D-T)=\beta ^T(C+S)^{-1}(D-T),\nonumber \\ \end{aligned}$$
(2.15)

we can thus write Eq. (2.8) as

$$\begin{aligned} P(T|D)\propto & {} \int d\lambda \exp \left( -{{1}\over {2}}Z^{-1}(\lambda -\overline{\lambda })^2 - {{1}\over {2}}\chi ^2\right) \nonumber \\\propto & {} \exp \left( -{{1}\over {2}}\chi ^2\right) , \end{aligned}$$
(2.16)

since the integration over \(\lambda \) yields a factor \((2\pi Z)^{1/2}\) which is independent of T and D.

The nuisance parameter formalism thus reproduces our original result Eq. (2.1). However, it is useful because it also allows us to determine the posterior distribution of the nuisance parameter: inverting Eq. (2.5), we find

$$\begin{aligned} P(\lambda |TD)\propto \exp \left( -{{1}\over {2}}Z^{-1}(\lambda -\overline{\lambda }(T,D))^2\right) , \end{aligned}$$
(2.17)

so once we use the information on T and D, the prior distribution Eq. (2.7) is modified. In particular the peak of the distribution is shifted away from zero to \(\overline{\lambda }\), while the variance is now given by Z (so the width is \(\sqrt{Z}\)). It is easy to see from the definition Eq. (2.10) that

$$\begin{aligned} 0< Z < 1, \end{aligned}$$
(2.18)

so the width of the theoretical uncertainty is generally reduced by the addition of new information.
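The algebraic identities used in this derivation are easy to check numerically. The sketch below, with randomly generated toy inputs (the matrix C, the vector \(\beta \), and the vectors D and T are assumptions made only for this check), verifies the two forms of Z in Eq. (2.10), the two forms of \(\overline{\lambda }\) in Eq. (2.15), the matrix identity of Eq. (2.13), and the bound Eq. (2.18).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4                                     # toy number of data points

# Hypothetical inputs: a dense positive-definite C and a random shift beta.
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)
beta = rng.normal(size=N)
S = np.outer(beta, beta)                  # Eq. (2.3)
D = rng.normal(size=N)
T = rng.normal(size=N)

Cinv = np.linalg.inv(C)
CSinv = np.linalg.inv(C + S)

# Two equivalent expressions for Z, Eq. (2.10).
Z_a = 1.0 / (1.0 + beta @ Cinv @ beta)
Z_b = 1.0 - beta @ CSinv @ beta

# Two equivalent expressions for the posterior mean of lambda, Eq. (2.15).
lam_a = Z_a * (beta @ Cinv @ (D - T))
lam_b = beta @ CSinv @ (D - T)

# Matrix identity behind Eq. (2.13): C^{-1} - Z C^{-1} S C^{-1} = (C+S)^{-1}.
lhs = Cinv - Z_a * (Cinv @ S @ Cinv)

print(Z_a, Z_b, lam_a, lam_b)
```

The matrix identity checked in the last step is an instance of the Sherman-Morrison formula for a rank-one update of an invertible matrix.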

2.2 Predictions and autopredictions without fits

In order to explore the implications of the formalism described in the previous section we first consider a ‘pure’ theory \(T=T_0\), one where there are no unknown parameters to be fitted, but which nevertheless has a theoretical uncertainty. Despite the fact that this theory cannot be fitted to the data, so in general \(T_0\ne D\), the data can still inform the theoretical predictions.

Computing expectation values of functions of \(\lambda \) using the probability distribution \(P(\lambda |T_0 D)\),

$$\begin{aligned} \mathrm{E}[\lambda ] = \mathcal{N}_\lambda \int d\lambda \;\lambda \; P(\lambda |T_0 D) = \overline{\lambda }(T_0,D), \end{aligned}$$
(2.19)

the normalization \(\mathcal{N}_\lambda \) being chosen such that \(\mathrm{E}[1] =1\), while

$$\begin{aligned} \mathrm{Var}[\lambda ] \equiv \mathrm{E}[(\lambda -\mathrm{E}[\lambda ])^2] = Z. \end{aligned}$$
(2.20)

Now in the nuisance parameter formalism, the theoretical predictions are

$$\begin{aligned} T(\lambda )=T_0 +\lambda \beta . \end{aligned}$$
(2.21)

If we made these predictions before comparing to any data, we would use the prior distribution Eq. (2.7) for \(\lambda \). Since this is a unit Gaussian centred on zero, we would then find that \(\mathrm{E}[T(\lambda )] = T_0\), while \(\mathrm{Cov}[T(\lambda )] = \beta \beta ^T=S\) as expected.

However, if we first compare T to the data D, then we can instead compute expectation values using \(P(\lambda |TD)\). We then find (using Eqs. (2.19, 2.15))

$$\begin{aligned} \mathrm{E}[T(\lambda )]= & {} T_0+\overline{\lambda }(T_0,D)\beta \nonumber \\= & {} T_0 + \beta \beta ^T(C+S)^{-1}(D-T_0), \end{aligned}$$
(2.22)

while (using Eq. (2.20))

$$\begin{aligned} \mathrm{Cov}[T(\lambda )]\equiv & {} \mathrm{E}[(T(\lambda ) -\mathrm{E}[T(\lambda )])(T(\lambda ) -\mathrm{E}[T(\lambda )])^T]\nonumber \\= & {} \mathrm{Var}[\lambda ]\beta \beta ^T =ZS. \end{aligned}$$
(2.23)

One can consider this as an ‘autoprediction’: first the theory is compared to the data, and then using this information one makes new theoretical predictions for precise repetitions of the same experiments. The original theoretical predictions \(T_0\) are then shifted by an amount

$$\begin{aligned} \delta T = -S(C+S)^{-1}(T_0-D) \end{aligned}$$
(2.24)

and their theoretical uncertainties are reduced by a factor of \(\sqrt{Z}\), since the data add new information: the covariance matrix of the autopredictions is

$$\begin{aligned} ZS = S - S(C+S)^{-1}S = C(C+S)^{-1}S = S(C+S)^{-1}C. \end{aligned}$$
(2.25)

This is a simple example of Bayesian learning: the theory ‘learns’ from the data, within the constraints imposed by the prior theoretical uncertainty.
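The shift Eq. (2.24) and the three equivalent forms of the autoprediction covariance in Eq. (2.25) can be checked with a few lines of NumPy. This is a minimal sketch under assumed toy inputs: the matrix C, the vector \(\beta \), the pure theory \(T_0\) and the data D are all randomly generated and carry no physical meaning.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4                                      # toy number of data points

# Hypothetical inputs for a 'pure' theory T0 with one correlated uncertainty.
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)                # positive-definite experimental covariance
beta = rng.normal(size=N)
S = np.outer(beta, beta)
T0 = rng.normal(size=N)
D = T0 + rng.normal(size=N)

CSinv = np.linalg.inv(C + S)
Z = 1.0 - beta @ CSinv @ beta              # Eq. (2.10)
lam_bar = beta @ CSinv @ (D - T0)          # Eq. (2.15) evaluated at T = T0

delta_T = -S @ CSinv @ (T0 - D)            # shift of the autopredictions, Eq. (2.24)
cov_auto = Z * S                           # covariance of the autopredictions, Eq. (2.23)

# The three alternative forms quoted in Eq. (2.25).
form_1 = S - S @ CSinv @ S
form_2 = C @ CSinv @ S
form_3 = S @ CSinv @ C

print(delta_T, Z)
```

Since the shift equals \(\overline{\lambda }\beta \), it always points along \(\beta \): the data can only move the predictions in the direction allowed by the prior theoretical uncertainty.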

It is interesting to compare the experimental \(\chi ^2\) of the original predictions

$$\begin{aligned} \chi ^2_{\mathrm{exp}} = (T_0-D)^TC^{-1}(T_0-D) \end{aligned}$$
(2.26)

to that obtained with the autopredictions

$$\begin{aligned} \chi ^2_{\mathrm{auto}}= & {} (T_0+\delta T-D)^T C^{-1}(T_0+\delta T-D)\nonumber \\= & {} (T_0-D)^T (C+S)^{-1}C(C+S)^{-1}(T_0-D), \nonumber \\ \end{aligned}$$
(2.27)

since

$$\begin{aligned} T_0+\delta T-D = C(C+S)^{-1}(T_0-D). \end{aligned}$$
(2.28)

It is easy to see that \(\chi ^2_{\mathrm{auto}}\le \chi ^2_{\mathrm{exp}}\) since C is positive definite (i.e. \(C>0\)) and S positive semi-definite (i.e. \(S\ge 0\)), so \(2S+SC^{-1}S\ge 0\), whence \((C+S)C^{-1}(C+S)\ge C\), and \((C+S)^{-1}C(C+S)^{-1}\le C^{-1}\). So the data-induced shifts are always such as to improve the quality of the fit to the data, by exploiting the theoretical uncertainty.
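The inequality \(\chi ^2_{\mathrm{auto}}\le \chi ^2_{\mathrm{exp}}\) can also be probed numerically. The sketch below draws a number of random toy configurations (the sizes, seeds and distributions are arbitrary assumptions) and records the largest value of \(\chi ^2_{\mathrm{auto}}-\chi ^2_{\mathrm{exp}}\) encountered, using Eq. (2.28) for the shifted residual.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_trials = 6, 200                       # toy sizes, chosen arbitrarily

excess = []                                # chi2_auto - chi2_exp for each trial
for _ in range(n_trials):
    A = rng.normal(size=(N, N))
    C = A @ A.T + 0.1 * np.eye(N)          # positive definite
    beta = rng.normal(size=N)
    S = np.outer(beta, beta)               # positive semi-definite, rank one
    T0 = rng.normal(size=N)
    D = rng.normal(size=N)

    r = T0 - D
    r_auto = C @ np.linalg.solve(C + S, r)          # Eq. (2.28)
    chi2_exp = r @ np.linalg.solve(C, r)            # Eq. (2.26)
    chi2_auto = r_auto @ np.linalg.solve(C, r_auto) # Eq. (2.27)
    excess.append((chi2_auto - chi2_exp) / chi2_exp)

max_excess = max(excess)                   # should never be positive
print(max_excess)
```

A random scan of this kind does not prove the inequality, which follows from the matrix argument above, but it is a quick guard against sign or transposition mistakes in an implementation.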

To make this more explicit, consider a simple model in which the experimental uncertainties are uncorrelated, and the same for each data point: then we can write

$$\begin{aligned} C = \sigma ^2 1,\qquad S = s^2 e e^T, \end{aligned}$$
(2.29)

with e a unit vector, \(e^Te=1\), and \(\beta = s e\) so \(S=\beta \beta ^T\). Thus \(\sigma \) is the experimental uncertainty on each data point, and \(s/\sqrt{N}\) is the size of the correlated theoretical uncertainty. Then it is easy to see that

$$\begin{aligned} (C+S)^{-1} = {{1}\over {\sigma ^2}}\left( 1-{{s^2}\over {\sigma ^2+s^2}}e e^T\right) , \end{aligned}$$
(2.30)

and (using Eq. (2.10))

$$\begin{aligned} Z = (1+s^2/\sigma ^2)^{-1}. \end{aligned}$$
(2.31)

So the reduction in theoretical uncertainties depends on the ratio \(s^2/\sigma ^2\): when \(s^2\ll \sigma ^2\) the influence of the data on the theoretical uncertainty is very small, while when \(s^2\gg \sigma ^2\) the size of the theoretical uncertainty is reduced from s to \(\sigma \), since in the limit \(s^2/\sigma ^2\rightarrow \infty \)

$$\begin{aligned} ZS = {{\sigma ^2 s^2}\over {\sigma ^2+s^2}}ee^T \rightarrow \sigma ^2 ee^T. \end{aligned}$$
(2.32)

Note that when the theoretical uncertainties are comparable to the experimental, \(s^2/N\sim \sigma ^2\), and \(Z \sim 1/(N+1)\): if there are a large number of independent data points, the reduction of the theoretical uncertainty can be very substantial.

In this model the shift in the theoretical predictions is (using Eq. (2.24))

$$\begin{aligned} \delta T = - {{s^2}\over {\sigma ^2+s^2}} (e^T(T_0-D)) e, \end{aligned}$$
(2.33)

as expected in the direction e of the theoretical uncertainty. When \(s^2/\sigma ^2\rightarrow \infty \), \(e^T(T_0+\delta T)\rightarrow e^TD\), so in this direction the autopredictions coincide precisely with the data. The experimental \(\chi ^2\) for the autopredictions is (using Eq. (2.27))

$$\begin{aligned} \chi ^2_{\mathrm{auto}} = (T_0-D)^T {{1}\over {\sigma ^2}}\Big (1 - {{s^2(s^2+2\sigma ^2)}\over {(s^2+\sigma ^2)^2}}ee^T\Big )(T_0-D),\nonumber \\ \end{aligned}$$
(2.34)

so the \(N-1\) contributions to the \(\chi ^2\) orthogonal to e are unchanged, while the contribution along e is reduced by a factor \(Z^2\). The autopredictions will in general have a \(\chi ^2\) of size \(N-1\), rather than the N of the original predictions, as naively expected since the nuisance parameter is effectively fitted.
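The closed-form results of this simple model, Eqs. (2.30) to (2.34), can be verified directly. In the sketch below the values of N, \(\sigma \) and s are arbitrary toy choices, and the unit vector e is taken to have equal components, which is only one possible choice assumed for illustration.

```python
import numpy as np

N, sigma, s = 10, 0.5, 2.0                 # toy values, not from any fit
e = np.ones(N) / np.sqrt(N)                # one possible unit vector, e^T e = 1
C = sigma**2 * np.eye(N)                   # Eq. (2.29)
S = s**2 * np.outer(e, e)

rng = np.random.default_rng(4)
T0 = rng.normal(size=N)
D = rng.normal(size=N)
r = T0 - D

# Closed forms, Eqs. (2.30), (2.31) and (2.33).
a = s**2 / (sigma**2 + s**2)
CSinv_closed = (np.eye(N) - a * np.outer(e, e)) / sigma**2
Z = 1.0 / (1.0 + s**2 / sigma**2)
delta_T = -a * (e @ r) * e
r_auto = r + delta_T                       # residual of the autopredictions

# chi2 of the autopredictions: direct evaluation versus Eq. (2.34).
chi2_auto_direct = r_auto @ r_auto / sigma**2
M = (np.eye(N) - s**2 * (s**2 + 2 * sigma**2) / (s**2 + sigma**2)**2
     * np.outer(e, e)) / sigma**2
chi2_auto_closed = r @ M @ r

# Split the chi2 into the part along e and the part orthogonal to e.
along_before = (e @ r)**2 / sigma**2
along_after = (e @ r_auto)**2 / sigma**2
perp_before = r @ r / sigma**2 - along_before
perp_after = chi2_auto_direct - along_after

print(Z, along_after / along_before)
```

With these toy values \(s^2/\sigma ^2=16\), so the contribution to the \(\chi ^2\) along e is suppressed by \(Z^2=1/289\), while the orthogonal part is untouched.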

Of course we can also consider genuine predictions \(\widetilde{T}_I\), \(I=1,\ldots ,\widetilde{N}\), which again have no free parameters, but have a theoretical uncertainty which is correlated to that of the observables \(T_i\), \(i=1,\ldots , N\) for which we have the data \(D_i\). The theoretical predictions including the theoretical uncertainty may be written

$$\begin{aligned} \widetilde{T}(\widetilde{\lambda }) = \widetilde{T}+ \widetilde{\lambda }\widetilde{\beta }, \end{aligned}$$
(2.35)

where the vector \(\widetilde{\beta }_I\) gives the size and direction of the theoretical uncertainty in \(\widetilde{T}_I\). Now if the nuisance parameters \(\widetilde{\lambda }\) are independent of the parameters \(\lambda \), the theoretical uncertainties in \(\widetilde{T}\) are uncorrelated with those in T, and the theory covariance matrix for the prediction \(\widetilde{T}\) is given by

$$\begin{aligned} \widetilde{S}= \widetilde{\beta }\widetilde{\beta }^T. \end{aligned}$$
(2.36)

However if they are correlated, \(\widetilde{\lambda }=\lambda \), and then taking advantage of the data D for T, we can compute expectation values for the predictions using \(P(\lambda |TD)\). We then find that

$$\begin{aligned} \mathrm{E}[\widetilde{T}(\lambda )] = \widetilde{T}+\overline{\lambda }(T,D)\widetilde{\beta }= \widetilde{T}+ \widetilde{\beta }\beta ^T(C+S)^{-1}(D-T_0), \end{aligned}$$
(2.37)

so the predictions are shifted by

$$\begin{aligned} \delta \widetilde{T}= -{\widehat{S}}(C+S)^{-1}(T_0-D), \end{aligned}$$
(2.38)

where

$$\begin{aligned} {\widehat{S}}= \widetilde{\beta }\beta ^T, \end{aligned}$$
(2.39)

is the cross-covariance matrix between observables T for which we have data D, and the predictions \(\widetilde{T}\). Likewise

$$\begin{aligned} \mathrm{Cov}[\widetilde{T}(\lambda )]\equiv & {} \mathrm{E}[(\widetilde{T}(\lambda ) -\mathrm{E}[\widetilde{T}(\lambda )])(\widetilde{T}(\lambda ) -\mathrm{E}[\widetilde{T}(\lambda )])^T] \nonumber \\= & {} \mathrm{Var}[\lambda ]\widetilde{\beta }\widetilde{\beta }^T = Z \widetilde{S}, \end{aligned}$$
(2.40)

so the covariance matrix of the predictions is reduced by the same factor Z as that of the autopredictions, while the size of the shift is proportional to the cross-covariance between the theoretical uncertainties. Thus the data D can lead to more precise (and, if the data are correct, also more accurate) predictions for observables that are not yet measured, through the correlation of theoretical uncertainties.
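The propagation to genuine predictions in Eqs. (2.37) to (2.40) can be sketched as follows. The dimensions and all vectors are invented toy inputs, and the names `beta_new`, `T_new` and `S_hat` are hypothetical labels for \(\widetilde{\beta }\), \(\widetilde{T}\) and \({\widehat{S}}\) introduced only for this example; the fully correlated case \(\widetilde{\lambda }=\lambda \) is assumed.

```python
import numpy as np

rng = np.random.default_rng(5)
N, N_new = 5, 3                            # fitted points and new observables (toy)

# Hypothetical inputs for the observables with data ...
C = np.diag(rng.uniform(0.5, 1.5, N))
beta = rng.normal(size=N)
S = np.outer(beta, beta)
T0 = rng.normal(size=N)
D = T0 + rng.normal(size=N)

# ... and for the new observables, sharing the same nuisance parameter.
beta_new = rng.normal(size=N_new)          # tilde-beta
T_new = rng.normal(size=N_new)             # tilde-T, unshifted predictions
S_new = np.outer(beta_new, beta_new)       # Eq. (2.36)
S_hat = np.outer(beta_new, beta)           # cross-covariance, Eq. (2.39)

CSinv = np.linalg.inv(C + S)
Z = 1.0 - beta @ CSinv @ beta              # Eq. (2.10)
lam_bar = beta @ CSinv @ (D - T0)          # Eq. (2.15)

delta_T_new = -S_hat @ CSinv @ (T0 - D)    # shift of the predictions, Eq. (2.38)
prediction = T_new + delta_T_new           # shifted central values, Eq. (2.37)
cov_new = Z * S_new                        # reduced covariance, Eq. (2.40)

print(prediction, np.sqrt(np.diag(cov_new)))
```

Only the cross-covariance \({\widehat{S}}\) carries information from the data D to the new observables: if `S_hat` were set to zero the shift would vanish and the prior covariance \(\widetilde{S}\) would have to be used instead.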

Note that if, having made the predictions \(\widetilde{T}+\delta \widetilde{T}\), an experimentalist made measurements to produce independent data \(\widetilde{D}\), with experimental covariance matrix \(\widetilde{C}\), and we wanted to combine the datasets \(\{D,\widetilde{D}\}\) into a single dataset, then the combined \((N+\widetilde{N})\times (N+\widetilde{N})\) theoretical covariance matrix for \(\{T,\widetilde{T}\}\) to be used to compare to the data would be

$$\begin{aligned} \left( \begin{array}{cc} S&{}\quad {\widehat{S}}^T\\ {\widehat{S}}&{}\quad \widetilde{S}\end{array}\right) = \left( \begin{array}{cc} \beta \beta ^T&{}\quad \beta \widetilde{\beta }^T\\ \widetilde{\beta }\beta ^T&{}\quad \widetilde{\beta }\widetilde{\beta }^T\end{array}\right) . \end{aligned}$$
(2.41)

While we might hope that the shifted predictions \(\widetilde{T}+\delta \widetilde{T}\) give a better \(\chi ^2\) to the new data \(\widetilde{D}\), this is no longer guaranteed, since the shifts are driven by the old data D, and it is possible that \(\widetilde{D}\) are inconsistent with them.

2.3 Autopredictions in perfect fits

In the previous section we considered the situation of a theory T which was not fitted to the data D. Now consider what is in some sense the opposite situation: a ‘perfect’ fit, where the theoretical predictions T have sufficient flexibility to fit the data D exactly. For a perfect fit, P(T|D) is always maximized when \(T=D\), and thus (Eq. (2.2)) \(\chi ^2=0\). We compute expectation values of functions of T using the probability distribution P(T|D) in Eq. (2.1), thus

$$\begin{aligned} \mathrm{E}[T] = \mathcal{N}_T \int dT\; T\; P(T|D) = D, \end{aligned}$$
(2.42)

the normalization \(\mathcal{N}_T\) being chosen such that \(\mathrm{E}[1] =1\), while

$$\begin{aligned} \mathrm{Cov}[T] \equiv \mathrm{E}[(T -\mathrm{E}[T])(T -\mathrm{E}[T])^T] = C+S. \end{aligned}$$
(2.43)

These are the expectation value and covariance matrix of the variables T fitted to the data D.

To compute the autopredictions, we need to consider \(T(\lambda )=T +\lambda \beta \) (Eq. (2.21)), and this is more subtle since, as we saw in the previous section, expectation values of \(\lambda \) involve T. We thus need to generalise the definition Eq. (2.19) of expectation values of functions of \(\lambda \) using the probability distribution \(P(\lambda |TD)\), to also include the subsequent integration over T: thus we define

$$\begin{aligned} \mathrm{E}[f(T,\lambda )] \equiv \mathcal{N}_T \int dT\; \Big (\mathcal{N}_\lambda \int d\lambda \;f(T,\lambda )\; P(\lambda |TD)\Big )P(T|D),\nonumber \\ \end{aligned}$$
(2.44)

for any function \(f(T,\lambda )\) of \(\lambda \) and T. There are several things to note about this procedure:

  • We always perform the integration over \(\lambda \), weighted with the probability distribution \(P(\lambda |TD)\), before we perform the integration over T using P(T|D): this is because while \(P(\lambda |TD)\) depends on T, P(T|D) does not depend on \(\lambda \), because it has been marginalised as in Eq. (2.16).

  • The data D are always held fixed throughout: both \(P(\lambda |TD)\) and P(T|D) are conditional on D.

  • For functions \(f(T,\lambda )\) which only depend on T, the integration over \(\lambda \) is trivial and we recover for example the results Eqs. (2.42, 2.43).

  • For the pure theory discussed in the previous section, the theory T was held fixed, so the T integration was trivial and we recover the results Eqs. (2.19, 2.20).

Thus in the case of a perfect fit, the expectation value of the nuisance parameter is

$$\begin{aligned} \mathrm{E}[\lambda ] = \mathrm{E}[\overline{\lambda }(T,D)]=\beta ^T(C+S)^{-1}\mathrm{E}[D-T] = 0,\nonumber \\ \end{aligned}$$
(2.45)

using Eq. (2.15) and then finally Eq. (2.42). The calculation of the variance requires some care, to ensure that the average over \(\lambda \) is separated out from the average over T. This is most easily accomplished by adding and subtracting \(\overline{\lambda }(T,D)\), so

$$\begin{aligned} \mathrm{Var}[\lambda ]= & {} \mathrm{E}[(\lambda -\mathrm{E}[\lambda ])^2]= \mathrm{E}[(\lambda -\overline{\lambda }(T,D)+ \overline{\lambda }(T,D))^2]\nonumber \\= & {} \mathrm{E}[(\lambda -\overline{\lambda }(T,D))^2]+ \mathrm{E}[(\overline{\lambda }(T,D))^2]\nonumber \\= & {} Z + \beta ^T(C+S)^{-1}\mathrm{E}[(T-D)(T-D)^T](C+S)^{-1}\beta \nonumber \\= & {} Z + \beta ^T(C+S)^{-1}\mathrm{Cov}[T] (C+S)^{-1}\beta \end{aligned}$$
(2.46)
$$\begin{aligned}= & {} 1 - \beta ^T(C+S)^{-1}\beta + \beta ^T(C+S)^{-1}\beta = 1, \end{aligned}$$
(2.47)

where in the second line we note that the cross term vanishes, since at fixed T the average of \(\lambda -\overline{\lambda }(T,D)\) over \(\lambda \) is zero, while in the last line we used Eq. (2.10) for Z, and Eq. (2.43) to simplify the second term. We thus find that in a perfect fit, the probability distribution of the nuisance parameter after fitting the data is the same as the prior distribution Eq. (2.7): we learn nothing from the data about the theoretical uncertainty because all the information in the data is absorbed in the fitted parameters. The calculation of the variance is particularly instructive: the reduction by the factor Z found in the pure theory, Eq. (2.20), is now precisely cancelled by the fluctuation of \(\overline{\lambda }(T,D)\) due to the covariance Eq. (2.43) of T.

Turning to the autopredictions Eq. (2.21), we have

$$\begin{aligned} \mathrm{E}[T(\lambda )] = \mathrm{E}[T+\lambda \beta ]=D, \end{aligned}$$
(2.48)

using Eqs. (2.42, 2.45): as expected in a perfect fit, the autopredictions simply return the original data. Their covariance is more interesting, as we now have to take account of the correlation between fitted theory T and the nuisance parameter \(\lambda \): proceeding as in Eq. (2.47)

$$\begin{aligned} \mathrm{Cov}[T(\lambda )]= & {} \mathrm{E}[(T(\lambda ) -\mathrm{E}[T(\lambda )])(T(\lambda ) -\mathrm{E}[T(\lambda )])^T]\nonumber \\= & {} \mathrm{E}[(T -D +\lambda \beta )(T-D+\lambda \beta )^T]\nonumber \\= & {} \mathrm{E}[(T-D)(T-D)^T] + \mathrm{E}[\lambda \beta (T-D)^T] \nonumber \\&+ \mathrm{E}[(T-D)\lambda \beta ^T]+ \mathrm{E}[\lambda ^2]\beta \beta ^T. \end{aligned}$$
(2.49)

Now the first term is just \(\mathrm{Cov}[T]\), and the last is just \(\mathrm{Var}[\lambda ]S\), while the cross-terms can be evaluated using Eq. (2.15):

$$\begin{aligned} \mathrm{E}[\lambda \beta (T-D)^T]= & {} \mathrm{E}[ \beta \overline{\lambda }(T,D)(T-D)^T]\nonumber \\= & {} -S(C+S)^{-1}\mathrm{E}[(T-D)(T-D)^T] \nonumber \\= & {} -S(C+S)^{-1}\mathrm{Cov}[T]. \end{aligned}$$
(2.50)

We thus find that

$$\begin{aligned} \mathrm{Cov}[T(\lambda )]= & {} \mathrm{Cov}[T]-S(C+S)^{-1}\mathrm{Cov}[T]\nonumber \\&-\mathrm{Cov}[T](C+S)^{-1}S+\mathrm{Var}[\lambda ]S \end{aligned}$$
(2.51)
$$\begin{aligned}= & {} (C+S) - S -S + S = C, \end{aligned}$$
(2.52)

where in the last line we used Eq. (2.43) for \(\mathrm{Cov}[T]\), and Eq. (2.47) for \(\mathrm{Var}[\lambda ]\). Thus the covariance of the autopredictions in a perfect fit is simply the covariance of the data. The way in which this arises is that the theory covariance arising from the fit (the first term in Eq. (2.51)) and the theory covariance arising in the autoprediction (the last term in Eq. (2.51)) are each cancelled by the cross-covariance between fitting and prediction, just as was argued in Ref. [13]. When we have a perfect fit, there is really no distinction between the autoprediction and the data, and the theory uncertainty thus becomes irrelevant. So in a sense this model is pure phenomenology: the only nontrivial information is in the data. Indeed in this model as it stands there is no possibility of making genuine predictions, since a prediction of the form Eq. (2.35) is useless unless we can determine \(\widetilde{T}\), presumably as functions of T, and for this we need a genuine theory.
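The nested expectation value Eq. (2.44) lends itself to a simple Monte Carlo check of the perfect-fit results Eqs. (2.45), (2.47), (2.48) and (2.52). The sketch below is only an illustration under assumed toy inputs (C, \(\beta \), D and the sample size are invented): it first samples T from P(T|D), then samples \(\lambda \) from \(P(\lambda |TD)\) for each T, and finally forms the autopredictions \(T+\lambda \beta \).

```python
import numpy as np

rng = np.random.default_rng(6)
N, n_samples = 3, 400_000                  # toy dimension and sample size

# Hypothetical inputs.
C = np.diag([0.5, 1.0, 1.5])
beta = np.array([1.0, -0.5, 0.8])
S = np.outer(beta, beta)
D = np.array([1.0, 2.0, 3.0])

CSinv = np.linalg.inv(C + S)
Z = 1.0 - beta @ CSinv @ beta              # Eq. (2.10)

# Outer integration: T distributed according to P(T|D), Eqs. (2.42, 2.43).
T = rng.multivariate_normal(D, C + S, size=n_samples)   # shape (n_samples, N)

# Inner integration: lambda ~ P(lambda|T,D), a Gaussian with mean
# lambda_bar(T, D) and variance Z, Eq. (2.17), sampled once per T.
lam_bar = (D - T) @ CSinv @ beta
lam = lam_bar + np.sqrt(Z) * rng.standard_normal(n_samples)

T_auto = T + lam[:, None] * beta           # autopredictions, Eq. (2.21)

mean_lam = lam.mean()                      # expected to be close to 0
var_lam = lam.var()                        # expected to be close to 1
mean_auto = T_auto.mean(axis=0)            # expected to be close to D
cov_auto = np.cov(T_auto, rowvar=False)    # expected to be close to C

print(mean_lam, var_lam)
print(cov_auto)
```

The order of sampling mirrors the order of integration in Eq. (2.44): the nuisance parameter is always drawn conditionally on T, which is what generates the cross terms that cancel the theory covariance.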

3 Correlated theory uncertainties in one parameter fits

In the previous section, we considered two simple but unrealistic models: the first in which the theory T is fixed, with no free parameters to be fitted to the data (pure theory), and the second in which the theory T is so flexible that we could achieve a perfect fit, \(T=D\) (pure phenomenology). These exercises were useful, in that they gave us some practice in the use of nuisance parameters to propagate theoretical uncertainties. However we now need to consider the more realistic situation in which the theory has parameters that can be constrained by data, but is still sufficiently restrictive that it can be considered a theory. The fit to the data is then not perfect, but the theory is sufficiently constraining that it can be used to predict new observables, \({\widetilde{T}}\), for which we as yet have no data. We will find that the interesting features of the pure theory and pure phenomenology models (the shifts, the reduction in uncertainties due to Bayesian learning, and the correlations between theory uncertainties in the fitting and theory uncertainties in predictions) are also found in these more realistic theories.

In this section we consider a theory with only one fitted parameter: in Sect. 3.1 we explain how the fitting is performed using replicas, in Sect. 3.2 we consider autopredictions in such a theory, and in Sect. 3.3 we consider general predictions. Generalization to many fitting parameters will be considered in the following section.

3.1 Fitting a theory with a single parameter

We can model this situation by considering theoretical predictions \(T(\theta )\) which depend on a single parameter \(\theta \), so that \(\chi ^2(\theta )\) is minimized for some choice of this parameter, \(\theta =\theta _0\), with some variance \(\mathrm{Var}[\theta ]\). Other observables \(\widetilde{T}(\theta )\) are then predicted to be \(\widetilde{T}(\theta _0)\), with an associated uncertainty proportional to \(\mathrm{Var}[\theta ]\). We assume as before that uncertainties are Gaussian, which means that we can linearize \(T(\theta )\) about \(T(\theta _0)\equiv T_0\):

$$\begin{aligned} T(\theta ) = T_0 + (\theta -\theta _0)\dot{T}_0. \end{aligned}$$
(3.1)

This model has the advantage that while it captures the essence of the fitting problem, it is sufficiently simple that we can solve it exactly.

In order to determine the uncertainty on \(\theta \), we will need to propagate the experimental uncertainties in the data D and the theoretical uncertainties in the predictions \(T(\theta )\) into \(\theta \). This can be done most easily by generating \(N_{\mathrm{rep}}\) pseudodata replicas \(D^{(r)}\) distributed according to a Gaussian distribution centred on the actual data D, with covariance \(C+S\): defining the replica average

$$\begin{aligned} \langle F(D^{(r)})\rangle = \lim _{N_{\mathrm{rep}}\rightarrow \infty }{{1}\over {N_{\mathrm{rep}}}}\sum _{r=1}^{N_{\mathrm{rep}}}F(D^{(r)}) \end{aligned}$$
(3.2)

for any function F of the replicas, the replicas are chosen such that

$$\begin{aligned} \langle D^{(r)}\rangle \equiv D, \qquad \langle (D^{(r)}-D)(D^{(r)}-D)^T\rangle = C+S. \end{aligned}$$
(3.3)

A parameter replica \(\theta ^{(r)}\) is then fitted to each pseudodata replica \(D^{(r)}\) by maximizing \(P(T(\theta )|D^{(r)})\) as given by Eq. (2.1), and thus by minimizing

$$\begin{aligned} \chi _r^2[\theta ] = (T(\theta )-D^{(r)})^T(C+S)^{-1}(T(\theta )-D^{(r)}), \end{aligned}$$
(3.4)

with respect to \(\theta \), replica by replica. Using Eq. (3.1), minimization of the quadratic gives

$$\begin{aligned} \theta ^{(r)} - \theta _0 = {{\dot{T}_0^T(C+S)^{-1}(D^{(r)}-T_0)}\over {\dot{T}_0^T(C+S)^{-1}\dot{T}_0}}. \end{aligned}$$
(3.5)

Using the replica averages Eq. (3.3), and choosing \(\theta _0 = \langle \theta ^{(r)}\rangle \), we find for consistency

$$\begin{aligned} \dot{T}_0^T(C+S)^{-1}(D-T_0)=0, \end{aligned}$$
(3.6)

and thus we can rewrite Eq. (3.5) as

$$\begin{aligned} \theta ^{(r)} - \theta _0 = {{\dot{T}_0^T(C+S)^{-1}(D^{(r)}-D)}\over {\dot{T}_0^T(C+S)^{-1}\dot{T}_0}}. \end{aligned}$$
(3.7)

Then since C and S are symmetric matrices,

$$\begin{aligned} \mathrm{Var}[\theta ]= & {} \langle (\theta ^{(r)}-\theta _0)^2\rangle \nonumber \\= & {} {{\dot{T}_0^T(C+S)^{-1}\langle (D^{(r)}-D)(D^{(r)}-D)^T\rangle (C+S)^{-1}\dot{T}_0}\over {(\dot{T}_0^T(C+S)^{-1}\dot{T}_0)^2}}\nonumber \\= & {} (\dot{T}_0^T(C+S)^{-1}\dot{T}_0)^{-1}. \end{aligned}$$
(3.8)

Note the way the double reciprocation in this expression works: data points with a relatively large dependence on \(\theta \) (i.e. large \(\dot{T}_0\)) contribute more than those with small dependence, whereas directions with large uncertainty (i.e. projections of \(C+S\)) contribute less than those with small uncertainty.
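The replica procedure of Eqs. (3.3)-(3.8) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the dimension `N`, the random matrices and the names `C`, `beta`, `T_dot` are illustrative assumptions, and we take \(S=\beta \beta ^T\) for a single nuisance parameter. It generates pseudodata replicas with covariance \(C+S\), evaluates Eq. (3.7) replica by replica, and compares the sample variance of \(\theta ^{(r)}\) with Eq. (3.8).

```python
import numpy as np

rng = np.random.default_rng(0)

N = 5                                   # number of data points (illustrative)
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)             # toy experimental covariance, positive definite
beta = rng.normal(size=N)
S = np.outer(beta, beta)                # toy theory covariance, one nuisance parameter
W = np.linalg.inv(C + S)

T_dot = rng.normal(size=N)              # derivative of T with respect to theta at theta_0
D = rng.normal(size=N)                  # central data

# Pseudodata replicas distributed as in Eq. (3.3)
n_rep = 200_000
D_rep = rng.multivariate_normal(D, C + S, size=n_rep)

# Eq. (3.7): fluctuation of the fitted parameter, replica by replica
denom = T_dot @ W @ T_dot
dtheta = (D_rep - D) @ W @ T_dot / denom

var_mc = dtheta.var()                   # replica estimate of Var[theta]
var_exact = 1.0 / denom                 # Eq. (3.8)
```

With a finite number of replicas the two numbers agree only up to statistical fluctuations of relative size of order \(\sqrt{2/N_{\mathrm{rep}}}\).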

Now that we understand the uncertainty of the fitted parameter \(\theta \), we can use it to predict the uncertainties of \(T(\theta )\):

$$\begin{aligned} E[T] \equiv \langle T(\theta ^{(r)})\rangle = T(\theta _0) = T_0, \end{aligned}$$
(3.9)

so, writing \(T^{(r)} = T(\theta ^{(r)})\),

$$\begin{aligned} X\equiv \mathrm{Cov}[T(\theta )]= & {} \langle (T^{(r)}-T_0)(T^{(r)}-T_0)^T\rangle \end{aligned}$$
(3.10)
$$\begin{aligned}= & {} \dot{T}_0\langle (\theta ^{(r)}-\theta _0)^2\rangle \dot{T}_0^T\nonumber \\= & {} \dot{T}_0(\dot{T}_0^T(C+S)^{-1}\dot{T}_0)^{-1}\dot{T}_0^T \end{aligned}$$
(3.11)
$$\begin{aligned}= & {} n(n^T(C+S)^{-1}n)^{-1}n^T, \end{aligned}$$
(3.12)

where n is a unit vector in the direction of \(\dot{T}_0\), \(n^Tn=1\). This shows that X depends only on n, and not on \(|\dot{T}_0|\).

The singular matrix X will play an important role in what follows: it is the covariance matrix of T due to the experimental and theoretical uncertainties in the fitting of the parameter \(\theta \) – the ‘fitting uncertainty’. When the fitted parameter minimizes the \(\chi ^2\), and is thus given by Eq. (3.7), X satisfies the projective relation

$$\begin{aligned} X = X(C+S)^{-1}X. \end{aligned}$$
(3.13)

Using Eq. (3.7) in Eq. (3.1), we see that \(X(C+S)^{-1}\) projects the data replicas onto the theory replicas:

$$\begin{aligned} T^{(r)}-T_0 = X(C+S)^{-1}(D^{(r)}-D). \end{aligned}$$
(3.14)

Because this relation is projective, some information is lost whenever we perform the fit: Eq. (3.14) cannot be inverted to obtain data replicas from theory replicas. This is an inevitable consequence of describing N data (assuming \(N>1\)) with only a single parameter \(\theta \).
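The properties of X stated above are easy to verify numerically. The sketch below uses assumed toy inputs (random `C`, `beta`, `T_dot`; the helper name `fitting_covariance` is ours) and builds X from Eq. (3.11), so that the projective relation Eq. (3.13), the rank of X, and its independence of \(|\dot{T}_0|\) can all be checked directly.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)             # toy experimental covariance
beta = rng.normal(size=N)
S = np.outer(beta, beta)                # toy theory covariance
W = np.linalg.inv(C + S)

def fitting_covariance(T_dot, W):
    """X of Eq. (3.11) for a single fitted parameter."""
    return np.outer(T_dot, T_dot) / (T_dot @ W @ T_dot)

T_dot = rng.normal(size=N)
X = fitting_covariance(T_dot, W)
P = X @ W                               # maps data fluctuations to theory fluctuations, Eq. (3.14)
```

Since `P @ P` equals `P` while X has rank one, `P` has no inverse, which is the loss of information referred to above.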

To make X more explicit, consider the simple model Eq. (2.29) for C and S. Then using Eq. (2.30),

$$\begin{aligned} n^T(C+S)^{-1}n= {{\sigma ^2+s^2\sin ^2\phi }\over {\sigma ^2(\sigma ^2+s^2)}}, \end{aligned}$$
(3.15)

where \(\cos \phi = n^Te\), and thus if we project X onto n (projections orthogonal to n give zero)

$$\begin{aligned} n^TXn = {{\sigma ^2(\sigma ^2+s^2)}\over {(\sigma ^2+s^2\sin ^2\phi )}}. \end{aligned}$$
(3.16)

The contribution of the theory uncertainty s thus depends on how well aligned e is to the direction n of the parameter dependence: if \(\phi =0\) we have complete alignment, and the variance of T in this direction is \(\sigma ^2+s^2\) as expected, while if \(\phi ={{\pi }\over {2}}\) we have orthogonality, and the variance of T is \(\sigma ^2\) – the theory uncertainty is then irrelevant to the fitting.
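As a cross-check of Eq. (3.16), the projection \(n^TXn\) can be evaluated numerically in a two-dimensional example. This sketch assumes that the model Eq. (2.29) has the form \(C=\sigma ^2 1\), \(S=s^2ee^T\) with e a unit vector, which is our reading and is consistent with Eq. (3.15); the function names are illustrative.

```python
import numpy as np

def n_X_n(sigma, s, phi):
    """n^T X n evaluated numerically in a two-dimensional toy model."""
    e = np.array([1.0, 0.0])                       # direction of the theory uncertainty
    n = np.array([np.cos(phi), np.sin(phi)])       # direction of dT/dtheta, so n.e = cos(phi)
    C = sigma**2 * np.eye(2)                       # assumed form of C
    S = s**2 * np.outer(e, e)                      # assumed form of S
    W = np.linalg.inv(C + S)
    X = np.outer(n, n) / (n @ W @ n)               # Eq. (3.12)
    return n @ X @ n

def n_X_n_closed(sigma, s, phi):
    """Right-hand side of Eq. (3.16)."""
    return sigma**2 * (sigma**2 + s**2) / (sigma**2 + s**2 * np.sin(phi)**2)
```

Evaluating both functions at \(\phi =0\) and \(\phi =\pi /2\) reproduces the two limits \(\sigma ^2+s^2\) and \(\sigma ^2\) discussed above.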

3.2 Autopredictions in single parameter fits

We can now consider the evaluation of the mean and covariance of the ‘autopredictions’

$$\begin{aligned} T(\theta ,\lambda )=T(\theta )+\lambda \beta \end{aligned}$$
(3.17)

in this one parameter model. Just as in Sect. 2.3, we do this by first computing expectation values over \(\lambda \), using \(P(\lambda |TD)\), which depend on T, and then evaluate the expectation values over T, according to the probability distribution P(T|D), now performed by averaging over theory replicas \(T^{(r)} = T(\theta ^{(r)})\). It is important to note that both these averages are performed holding the data D fixed, as both probabilities are conditional on the data: the data replicas \(D^{(r)}\) employed in Sect. 3.1 are only a device to generate the theory replicas \(T^{(r)}\), and are not to be averaged over when determining expectation values. Accordingly, Eq. (2.44) now becomes

$$\begin{aligned} \mathrm{E}[f(T,\lambda )] =\Big \langle \Big (\mathcal{N}_\lambda \int d\lambda \;f(T^{(r)},\lambda )\; P(\lambda |T^{(r)}D)\Big )\Big \rangle ,\nonumber \\ \end{aligned}$$
(3.18)

where the angled brackets denote the replica average Eq. (3.2).

Following the same steps as in the perfect fit in Sect. 2.3, but now using the theory replicas \(T^{(r)} = T(\theta ^{(r)})\) determined in the one parameter fit Sect. 3.1, we find

$$\begin{aligned} E[\lambda ] \equiv \langle {\overline{\lambda }}(T(\theta ^{(r)}),D)\rangle = \beta ^T(C+S)^{-1}(D-T_0)\equiv {\overline{\lambda }}_0,\nonumber \\ \end{aligned}$$
(3.19)

Unlike in the perfect fit Eq. (2.45), but just as in the pure theory Eq. (2.19), the nuisance parameters can now have nonzero expectation values, since the one parameter fit no longer fits the data exactly. These in turn give nontrivial shifts in the theoretical predictions:

$$\begin{aligned} E[T(\theta ,\lambda )]= & {} \langle T^{(r)}+{\overline{\lambda }}(T^{(r)},D)\beta \rangle \nonumber \\= & {} T_0+{\overline{\lambda }}_0\beta = T_0+\beta \beta ^T(C+S)^{-1}(D-T_0).\nonumber \\ \end{aligned}$$
(3.20)

So again the data give us information, inducing shifts in the autopredictions:

$$\begin{aligned} \delta T = -S(C+S)^{-1}(T_0-D). \end{aligned}$$
(3.21)

Note however that since (from Eq. (3.6)) \(n^T(C+S)^{-1}(T_0-D)=0\), these shifts will only be nonzero when n and e (the data and the theory) point in different directions: when they are parallel (\(\phi =0\)), the theoretical uncertainty is simply absorbed by the fit, just as it was in the perfect fit in Sect. 2.3. We can use the same argument as in Sect. 2.2, Eq. (2.27), to show that the shifts will always improve the fit to the experimental data.
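The behaviour of the shifts can be illustrated with a short numerical sketch. We assume a linear model \(T(\theta )=t_{\mathrm{ref}}+\theta \dot{T}_0\), where the reference prediction `t_ref` and all numerical inputs are hypothetical, and \(S=\beta \beta ^T\). The function fits \(\theta _0\) by solving Eq. (3.6), then evaluates Eq. (3.19) and Eq. (3.21); `chi2_exp` computes the \(\chi ^2\) to the data with the experimental covariance only.

```python
import numpy as np

def fit_and_shift(D, t_ref, T_dot, C, beta):
    """Best fit T_0 of the linear model T(theta) = t_ref + theta * T_dot,
    the nuisance parameter expectation, Eq. (3.19), and the shift, Eq. (3.21)."""
    S = np.outer(beta, beta)
    W = np.linalg.inv(C + S)
    theta0 = T_dot @ W @ (D - t_ref) / (T_dot @ W @ T_dot)   # solves Eq. (3.6)
    T0 = t_ref + theta0 * T_dot
    lam0 = beta @ W @ (D - T0)                               # Eq. (3.19)
    delta_T = lam0 * beta                                    # Eq. (3.21)
    return T0, lam0, delta_T

def chi2_exp(T, D, C):
    """chi^2 to the data computed with the experimental covariance only."""
    r = T - D
    return r @ np.linalg.solve(C, r)

rng = np.random.default_rng(2)
N = 6
C = np.diag(rng.uniform(0.5, 2.0, size=N))
T_dot = rng.normal(size=N)
t_ref = rng.normal(size=N)
D = rng.normal(size=N)

# Theory uncertainty in a generic direction: nonzero shift
T0, lam0, dT = fit_and_shift(D, t_ref, T_dot, C, beta=rng.normal(size=N))
# Theory uncertainty aligned with the parameter dependence: the shift vanishes
T0_al, lam0_al, dT_al = fit_and_shift(D, t_ref, T_dot, C, beta=0.8 * T_dot)
```

In the aligned case the shift is zero to machine precision, while in the generic case `chi2_exp(T0 + dT, D, C)` does not exceed `chi2_exp(T0, D, C)`.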

For the uncertainties, consider first the variance of \(\lambda \): following the same argument that led to Eq. (2.47) in Sect. 2.3, we now find

$$\begin{aligned} \mathrm{Var}[\lambda ]= & {} \mathrm{E}[(\lambda -\mathrm{E}[\lambda ])^2]= \mathrm{E}[(\lambda -\overline{\lambda }(T,D)\nonumber \\&+ \overline{\lambda }(T,D)-{\overline{\lambda }}_0)^2]\nonumber \\= & {} \mathrm{E}[(\lambda -\overline{\lambda }(T,D))^2]+ \langle (\overline{\lambda }(T^{(r)},D)-{\overline{\lambda }}_0)^2\rangle \nonumber \\= & {} Z + \beta ^T(C+S)^{-1}\langle (T^{(r)}-T_0)(T^{(r)}-T_0)^T\rangle \nonumber \\&\times (C+S)^{-1}\beta \nonumber \\= & {} 1 - \beta ^T(C+S)^{-1}\beta \nonumber \\&+ \beta ^T(C+S)^{-1}X(C+S)^{-1}\beta \equiv {\overline{Z}}. \end{aligned}$$
(3.22)

where in the second line we turned the expectation value over T into a replica average, and in the last line we used Eq. (2.10) for Z, and Eq. (3.10) for \(\mathrm{Cov}[T]\). We thus find that in the more restrictive environment of the one parameter fit, the last two terms no longer cancel: the information in the data can no longer be entirely absorbed in the single fitted parameter, and so it can still inform the nuisance parameter.

It is easy to see that \({\overline{Z}}\ge Z\) because \((C+S)^{-1}X(C+S)^{-1}\) is positive semi-definite, while \({\overline{Z}}\le 1\) since \(X(C+S)^{-1}\) is projective, Eq. (3.13), so its eigenvalues are either zero or one. So in place of Eq. (2.18) we now have

$$\begin{aligned} 0<Z\le {\overline{Z}}\le 1. \end{aligned}$$
(3.23)

The information on theoretical uncertainties extracted from the data is thus less in the one parameter fit than it was in the pure theory of Sect. 2.2, due to the extra uncertainty arising in the fit itself, but unlike in the perfect fit Sect. 2.3, the data will still constrain the theoretical uncertainties provided the parameter and theoretical uncertainty act in different directions.

In the model Eq. (2.29) for C and S, and using Eq. (3.16) for X, Eq. (3.22) becomes

$$\begin{aligned} {\overline{Z}}= {{\sigma ^2}\over {\sigma ^2+s^2\sin ^2\phi }}. \end{aligned}$$
(3.24)

Comparing with the corresponding expression for Z, Eq. (2.32), we see that indeed \({\overline{Z}}=1\) when \(\phi =0\), thus when \(n=e\) and the parameter variation and theoretical uncertainty are aligned, while \({\overline{Z}}=Z\) only if \(\phi =\pi /2\), so when n and e are orthogonal, the data have the greatest influence on the uncertainty.
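The bounds Eq. (3.23) and the closed form Eq. (3.24) can be checked as follows. This is a sketch under stated assumptions: \(S=\beta \beta ^T\), Z is read off from the last line of Eq. (3.22) as \(1-\beta ^T(C+S)^{-1}\beta \), and the toy example again assumes \(C=\sigma ^2 1\), \(S=s^2ee^T\) for the model Eq. (2.29). All function and variable names are illustrative.

```python
import numpy as np

def z_and_zbar(C, beta, T_dot):
    """Z and Z-bar as they appear in the last line of Eq. (3.22)."""
    S = np.outer(beta, beta)
    W = np.linalg.inv(C + S)
    X = np.outer(T_dot, T_dot) / (T_dot @ W @ T_dot)   # Eq. (3.11)
    Z = 1.0 - beta @ W @ beta
    Zbar = Z + beta @ W @ X @ W @ beta
    return Z, Zbar

def zbar_toy(sigma, s, phi):
    """Right-hand side of Eq. (3.24)."""
    return sigma**2 / (sigma**2 + s**2 * np.sin(phi)**2)

# Generic example: random positive definite C and random directions
rng = np.random.default_rng(8)
A = rng.normal(size=(5, 5))
Z_gen, Zbar_gen = z_and_zbar(A @ A.T + 5 * np.eye(5),
                             rng.normal(size=5), rng.normal(size=5))

# Toy example (assumed form C = sigma^2 1, S = s^2 e e^T), with n.e = cos(phi)
sigma, s, phi = 1.3, 0.7, 0.4
e = np.array([1.0, 0.0])
n = np.array([np.cos(phi), np.sin(phi)])
Z_toy, Zbar_toy = z_and_zbar(sigma**2 * np.eye(2), s * e, n)
```

Setting `n = e` in the toy example gives \({\overline{Z}}=1\), so that the data then carry no information on the nuisance parameter.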

For the covariance of the autopredictions Eq. (3.17) we also have to take account of the correlation between the fitted theory and the nuisance parameter: proceeding as in Eq. (2.49)

$$\begin{aligned} \mathrm{Cov}[T(\theta ,\lambda )]= & {} \mathrm{E}[(T(\theta ,\lambda ) -\mathrm{E}[T(\theta ,\lambda )])(T(\theta ,\lambda ) \nonumber \\&-\mathrm{E}[T(\theta ,\lambda )])^T]\nonumber \\= & {} \mathrm{E}[(T -T_0 +(\lambda -{\overline{\lambda }}_0)\beta )\nonumber \\&(T-T_0+(\lambda -{\overline{\lambda }}_0)\beta )^T]\nonumber \\= & {} \langle (T^{(r)}-T_0)(T^{(r)}-T_0)^T\rangle \nonumber \\&+ \mathrm{E}[(\lambda -{\overline{\lambda }}_0) \beta (T-T_0)^T] \nonumber \\&+ \mathrm{E}[(T-T_0)(\lambda -{\overline{\lambda }}_0) \beta ^T]\nonumber \\&+ \mathrm{E}[(\lambda -{\overline{\lambda }}_0)^2]\beta \beta ^T. \end{aligned}$$
(3.25)

Then again the first term is \(\mathrm{Cov}[T]=X\), Eq. (3.10), and the last is \(\mathrm{Var}[\lambda ]S\), Eq. (3.22), while the cross-terms can be evaluated using Eq. (2.15):

$$\begin{aligned} \mathrm{E}[(\lambda -{\overline{\lambda }}_0) \beta (T-T_0)^T]= & {} \langle \beta (\overline{\lambda }(T^{(r)},D)\nonumber \\&-\overline{\lambda }(T_0,D))(T^{(r)}-T_0)^T\rangle \nonumber \\= & {} -S(C+S)^{-1}\langle (T^{(r)}-T_0)\nonumber \\&(T^{(r)}-T_0)^T\rangle \nonumber \\= & {} -S(C+S)^{-1}\mathrm{Cov}[T]. \end{aligned}$$
(3.26)

We thus find that the result is simply

$$\begin{aligned} \mathrm{Cov}[T(\lambda )] = X-S(C+S)^{-1}X-X(C+S)^{-1}S+{\overline{Z}}S. \nonumber \\ \end{aligned}$$
(3.27)

The meaning of the four terms is easy to understand: the first is the ‘fitting uncertainty’ (which includes contributions from both experimental and theoretical uncertainties), the last the ‘theory uncertainty’, reduced through exposure to the data, and the middle two terms are due to the correlations between the two sources of theoretical uncertainty. We can simplify it by using Eq. (3.22) to show that

$$\begin{aligned} {\overline{Z}}S = S(C+S)^{-1}X(C+S)^{-1}S + ZS, \end{aligned}$$
(3.28)

and then some straightforward algebra to write

$$\begin{aligned}&X-S(C+S)^{-1}X-X(C+S)^{-1}S\nonumber \\&\qquad +S(C+S)^{-1}X(C+S)^{-1}S \nonumber \\&\quad = C(C+S)^{-1}X(C+S)^{-1}C. \end{aligned}$$
(3.29)

We thus find finally

$$\begin{aligned} \mathrm{Cov}[T(\lambda )] = C(C+S)^{-1}X(C+S)^{-1}C + ZS. \end{aligned}$$
(3.30)

Note that if we write \(X=C+S\) (as in the perfect fit model in Sect. 2.3), this result reduces to C, as it should. The cancellations noted in Eq. (2.52) between the cross-terms and the covariances of T and \(\lambda \) are no longer exact in Eq. (3.30), because \(\mathrm{Cov}[T]\) is no longer \(C+S\), but rather the smaller matrix X (which is in a sense \(C+S\) restricted to the space of variation of the parameter \(\theta \) as in Eq. (3.10)). Thus the result is no longer the experimental covariance matrix C, but rather the sum in quadrature of the ‘fitting uncertainty’, X, and the ‘theory uncertainty’ S, each reduced to some extent by the correlation of the theoretical uncertainties in fit and prediction.
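The algebra leading from Eq. (3.27) to Eq. (3.30) can be confirmed numerically. The sketch below uses assumed random inputs with \(S=\beta \beta ^T\) and \(Z=1-\beta ^T(C+S)^{-1}\beta \); it evaluates both forms of the covariance, and also the perfect-fit limit in which X is replaced by \(C+S\).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)             # toy experimental covariance
beta = rng.normal(size=N)
S = np.outer(beta, beta)                # toy theory covariance
W = np.linalg.inv(C + S)
T_dot = rng.normal(size=N)

X = np.outer(T_dot, T_dot) / (T_dot @ W @ T_dot)       # Eq. (3.11)
Z = 1.0 - beta @ W @ beta
Zbar = Z + beta @ W @ X @ W @ beta                     # Eq. (3.22)

cov_327 = X - S @ W @ X - X @ W @ S + Zbar * S         # Eq. (3.27)
cov_330 = C @ W @ X @ W @ C + Z * S                    # Eq. (3.30)

# Perfect-fit limit: replacing X by C+S in Eq. (3.30) gives back C
cov_perfect = C @ W @ (C + S) @ W @ C + Z * S
```

The two expressions `cov_327` and `cov_330` agree to machine precision, and `cov_perfect` coincides with `C`.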

In the model Eq. (2.29) for C and S, and using Eq. (3.16) for X, we now find

$$\begin{aligned} \mathrm{Cov}[T(\lambda )]= & {} {{\sigma ^2(\sigma ^2+s^2)}\over {(\sigma ^2+s^2\sin ^2\phi )}} \Bigg (nn^T - {{s^2}\over {\sigma ^2+s^2}} \nonumber \\&\times \cos \phi (en^T+ne^T) + {{s^2}\over {\sigma ^2+s^2}}ee^T\Bigg ) .\nonumber \\ \end{aligned}$$
(3.31)

The first term is just X, the off-diagonal term in the middle is the correlation term, and the last is \({\overline{Z}}S\). If \(\phi =0\), so \(n=e\), the three terms combine to give simply \(\sigma ^2nn^T\): in the direction of the fitted parameter, the uncertainty in the autoprediction is the experimental uncertainty, just as in the perfect fit described in Sect. 2.3. On the other hand, if \(\phi =\pi /2\), so n and e are orthogonal, the correlation term disappears, and the result reduces to \(X+ZS\): we add the uncertainties in quadrature, since they are in orthogonal directions, and the theoretical uncertainty is reduced just as in the pure theory described in Sect. 2.2. The one parameter fit thus interpolates smoothly between these two extremes. Note that we can write Eq. (3.31) as

$$\begin{aligned} \mathrm{Cov}[T(\lambda )]= & {} {{\sigma ^2(\sigma ^2+s^2)}\over {(\sigma ^2+s^2\sin ^2\phi )}} \Bigg (n - {{s^2\cos \phi }\over {\sigma ^2+s^2}} e\Bigg )\nonumber \\&\times \Bigg (n^T - {{s^2\cos \phi }\over {\sigma ^2+s^2}}e^T\Bigg ) \nonumber \\&+ {{s^2\sigma ^2}\over {\sigma ^2+s^2}}ee^T . \end{aligned}$$
(3.32)

Here the last term is just ZS, while the first is X, but with the vector n given an additional component in the direction e due to the theory correlation: besides changing its direction, this reduces the size of the fitting uncertainty by a factor \(\sqrt{\sin ^2\phi + Z^2\cos ^2\phi }\). However it is easy to see that the size of the correlated fitting uncertainty is still larger than it would be if the theory uncertainty had not been included in the fit.
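The equivalence of Eq. (3.32) with the general result Eq. (3.30), and the two limiting cases, can be checked in a two-dimensional example. As before this is only a sketch: it assumes \(C=\sigma ^2 1\) and \(S=s^2ee^T\) for the model Eq. (2.29), and the function name is ours.

```python
import numpy as np

def cov_autoprediction_toy(sigma, s, phi):
    """Eq. (3.30) evaluated numerically, and Eq. (3.32) in closed form."""
    e = np.array([1.0, 0.0])                       # direction of the theory uncertainty
    n = np.array([np.cos(phi), np.sin(phi)])       # direction of the parameter dependence
    C = sigma**2 * np.eye(2)                       # assumed form of C
    S = s**2 * np.outer(e, e)                      # assumed form of S
    W = np.linalg.inv(C + S)
    X = np.outer(n, n) / (n @ W @ n)               # Eq. (3.12)
    Z = sigma**2 / (sigma**2 + s**2)               # 1 - beta^T (C+S)^{-1} beta with beta = s e
    numeric = C @ W @ X @ W @ C + Z * S            # Eq. (3.30)
    x = sigma**2 * (sigma**2 + s**2) / (sigma**2 + s**2 * np.sin(phi)**2)
    v = n - s**2 * np.cos(phi) / (sigma**2 + s**2) * e
    closed = x * np.outer(v, v) + Z * S            # Eq. (3.32)
    return numeric, closed
```

For \(\phi =0\) the function returns \(\sigma ^2ee^T\), and for \(\phi =\pi /2\) it returns \(X+ZS\) with the two contributions in orthogonal directions.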

3.3 Correlated predictions in one parameter fits

We now consider predictions \(\widetilde{T}_I(\theta )\), \(I=1,\ldots ,\widetilde{N}\), which depend on the same parameter \(\theta \) as the fitted predictions \(T_i(\theta )\), \(i=1,\ldots , N\). There are two distinct sources of uncertainty in \(\widetilde{T}_I(\theta )\): uncertainties in the determination of \(\theta \) due to the experimental uncertainties in the data \(D_i\) and theoretical uncertainties in the theory \(T_i(\theta )\) used in its determination; and theoretical uncertainties in the predictions \(\widetilde{T}_I(\theta )\) themselves.

The first uncertainty is expressed through Eq. (3.7), which gives the variance Eq. (3.8). In analogy with Eq. (3.1) the linearised dependence of the predictions \(\widetilde{T}(\theta )\) may be written

$$\begin{aligned} \widetilde{T}(\theta ) = \widetilde{T}_0 + (\theta -\theta _0)\dot{\tilde{T}}_0, \end{aligned}$$
(3.33)

with \(\widetilde{T}_0=\widetilde{T}(\theta _0)\). Then the covariance of \(\widetilde{T}_I(\theta )\) due to the uncertainty in the parameter \(\theta \) is derived just as in Eqs. (3.10-3.12) for the autopredictions: writing \(\widetilde{T}^{(r)}\equiv \widetilde{T}(\theta ^{(r)})\)

$$\begin{aligned} \widetilde{X}\equiv \mathrm{Cov}[\widetilde{T}(\theta )]= & {} \langle (\widetilde{T}^{(r)}-\widetilde{T}_0)(\widetilde{T}^{(r)}-\widetilde{T}_0)^T\rangle \end{aligned}$$
(3.34)
$$\begin{aligned}= & {} \dot{\tilde{T}}_0\langle (\theta ^{(r)}-\theta _0)^2\rangle \dot{\tilde{T}}_0^T \nonumber \\= & {} \dot{\tilde{T}}_0(\dot{T}_0^T(C+S)^{-1}\dot{T}_0)^{-1}\dot{\tilde{T}}_0^T. \end{aligned}$$
(3.35)

The second uncertainty – the theoretical uncertainty in the predictions \(\widetilde{T}_I(\theta )\) – may again be either correlated or uncorrelated with the theoretical uncertainty in \(T_i(\theta )\). Consider first the simpler situation when it is uncorrelated: this might be the case if, for example, the observable \(\widetilde{T}(\theta )\) was a different type of observable to the \(T(\theta )\) used to determine \(\theta \). Then introducing a nuisance parameter \(\widetilde{\lambda }\), Gaussian distributed about zero with unit variance, and uncorrelated with \(\lambda \), we can write (as in the pure theory model Eq. (2.35))

$$\begin{aligned} \widetilde{T}(\theta ,\widetilde{\lambda }) = \widetilde{T}(\theta ) + \widetilde{\lambda }\widetilde{\beta }, \end{aligned}$$
(3.36)

where the vector \(\widetilde{\beta }_I\) gives the size of the theoretical uncertainties in \(\widetilde{T}_I(\theta )\). Since \(\theta \) and \(\widetilde{\lambda }\) are uncorrelated, we then have

$$\begin{aligned} E[\widetilde{T}(\theta ,\widetilde{\lambda })]= & {} \widetilde{T}(\theta _0), \end{aligned}$$
(3.37)
$$\begin{aligned} \mathrm{Cov}[\widetilde{T}(\theta ,\widetilde{\lambda })]= & {} \mathrm{Cov}[\widetilde{T}(\theta )]+\mathrm{Var}[\widetilde{\lambda }]\widetilde{\beta }\widetilde{\beta }^T = \widetilde{X}+ \widetilde{S}, \end{aligned}$$
(3.38)

where \(\widetilde{S}= \widetilde{\beta }\widetilde{\beta }^T\) is the theory covariance matrix for the prediction \(\widetilde{T}(\theta )\) (compare Eq. (2.36) in the pure theory). Thus when the theoretical uncertainty is uncorrelated we simply add it in quadrature to the uncertainty due to that in the parameter \(\theta \) derived from the fit.
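The addition in quadrature of Eq. (3.38) can be illustrated with replicas. In this sketch the dimensions and random inputs are illustrative assumptions, as are the names `Tt_dot` (for \(\dot{\tilde{T}}_0\)) and `beta_t` (for \(\widetilde{\beta }\)). The parameter fluctuations follow Eq. (3.7), the nuisance parameter \(\widetilde{\lambda }\) is drawn independently with unit variance, and the sample covariance of the predictions is compared with \(\widetilde{X}+\widetilde{S}\).

```python
import numpy as np

rng = np.random.default_rng(4)
N, N_t = 4, 3                           # numbers of fitted data and of predictions
C = np.diag(rng.uniform(0.5, 1.5, size=N))
beta = rng.normal(size=N)
S = np.outer(beta, beta)
W = np.linalg.inv(C + S)
T_dot = rng.normal(size=N)              # parameter dependence of the fitted observables
Tt_dot = rng.normal(size=N_t)           # parameter dependence of the predictions
beta_t = rng.normal(size=N_t)           # theory uncertainty of the predictions

n_rep = 200_000
dD = rng.multivariate_normal(np.zeros(N), C + S, size=n_rep)   # D^(r) - D
dtheta = dD @ W @ T_dot / (T_dot @ W @ T_dot)                  # Eq. (3.7)
lam_t = rng.normal(size=n_rep)                                 # independent nuisance parameter
dT_t = np.outer(dtheta, Tt_dot) + np.outer(lam_t, beta_t)      # fluctuations of Eq. (3.36)

cov_mc = dT_t.T @ dT_t / n_rep
cov_exact = (np.outer(Tt_dot, Tt_dot) / (T_dot @ W @ T_dot)
             + np.outer(beta_t, beta_t))                       # Eq. (3.38)
```

The agreement between `cov_mc` and `cov_exact` is again limited only by the finite number of replicas.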

Now consider the more interesting case in which the theoretical uncertainty in \(\widetilde{T}_I(\theta )\) is fully correlated to that in the \(T_i(\theta )\) used in the fit to determine \(\theta \): then \(\widetilde{\lambda }=\lambda \), which has already been determined in the fit to have nonzero expectation value and variance Eqs. (3.19, 3.22). Then writing \(\widetilde{T}(\theta ,\lambda ) = \widetilde{T}(\theta )+\lambda \widetilde{\beta }\),

$$\begin{aligned} E[\widetilde{T}(\theta ,\lambda )] = \widetilde{T}_0 + {\overline{\lambda }}(T_0,D)\widetilde{\beta }, \end{aligned}$$
(3.39)

so the correlation induces a similar shift in the predictions to that in the autopredictions Eq. (3.21): using Eq. (2.15)

$$\begin{aligned} \delta \widetilde{T}(\theta _0) = \widetilde{\beta }\beta ^T(C+S)^{-1}(D-T_0) = -{\widehat{S}}(C+S)^{-1}(T_0-D). \end{aligned}$$
(3.40)

where \({\widehat{S}}= \widetilde{\beta }\beta ^T\), Eq. (2.39), is the matrix of cross-correlations between the observables \(T(\theta )\) used in the fit and the predictions \(\widetilde{T}(\theta )\).

Likewise using the same arguments as were used to derive \(\mathrm{Cov}[T(\theta ,\lambda )]\), Eq. (3.27)

$$\begin{aligned} \mathrm{Cov}[\widetilde{T}(\theta ,\lambda )]= & {} \mathrm{E}[(\widetilde{T}(\theta ,\lambda ) -\mathrm{E}[\widetilde{T}(\theta ,\lambda )])(\widetilde{T}(\theta ,\lambda ) \nonumber \\&-\mathrm{E}[\widetilde{T}(\theta ,\lambda )])^T]\nonumber \\= & {} \langle (\widetilde{T}^{(r)}-\widetilde{T}_0)(\widetilde{T}^{(r)}-\widetilde{T}_0)^T\rangle \nonumber \\&+ \mathrm{E}[(\lambda -{\overline{\lambda }}_0) \widetilde{\beta }(\widetilde{T}-\widetilde{T}_0)^T] \nonumber \\&+ \mathrm{E}[(\widetilde{T}-\widetilde{T}_0)(\lambda -{\overline{\lambda }}_0) \widetilde{\beta }^T]\nonumber \\&+ \mathrm{E}[(\lambda -{\overline{\lambda }}_0)^2]\widetilde{\beta }\widetilde{\beta }^T. \end{aligned}$$
(3.41)

Then again the first term is \(\mathrm{Cov}[\widetilde{T}]=\widetilde{X}\), Eq. (3.34), and the last is \(\mathrm{Var}[\lambda ]\widetilde{S}\), Eqs. (3.22, 2.36), while the cross-terms can be evaluated using Eq. (2.15):

$$\begin{aligned} \mathrm{E}[(\lambda -{\overline{\lambda }}_0) \widetilde{\beta }(\widetilde{T}-\widetilde{T}_0)^T]= & {} \langle \widetilde{\beta }(\overline{\lambda }(T^{(r)},D)\nonumber \\&-\overline{\lambda }(T_0,D))(\widetilde{T}^{(r)}-\widetilde{T}_0)^T\rangle \nonumber \\= & {} -{\widehat{S}}(C+S)^{-1}\langle (T^{(r)}-T_0)\nonumber \\&(\widetilde{T}^{(r)}-\widetilde{T}_0)^T\rangle \nonumber \\= & {} -{\widehat{S}}(C+S)^{-1}{\widehat{X}}^T, \end{aligned}$$
(3.42)

where \({\widehat{S}}\) is given by Eq. (2.39), and in analogy to Eq. (3.10) and Eq. (3.34) we define the cross-covariance between \(\widetilde{T}\) and T

$$\begin{aligned} {\widehat{X}}\equiv & {} \langle (\widetilde{T}^{(r)}-\widetilde{T}_0)(T^{(r)}-T_0)^T\rangle \end{aligned}$$
(3.43)
$$\begin{aligned}= & {} \dot{\tilde{T}}_0\langle (\theta ^{(r)}-\theta _0)^2\rangle \dot{T}_0^T\nonumber \\= & {} \dot{\tilde{T}}_0(\dot{T}_0^T(C+S)^{-1}\dot{T}_0)^{-1}\dot{T}_0^T. \end{aligned}$$
(3.44)

We thus find that

$$\begin{aligned} \mathrm{Cov}[\widetilde{T}(\theta ,\lambda )]= & {} \widetilde{X}-{\widehat{S}}(C+S)^{-1}{\widehat{X}}^T\nonumber \\&-{\widehat{X}}(C+S)^{-1}{\widehat{S}}^T+{\overline{Z}}\widetilde{S}. \end{aligned}$$
(3.45)

Using Eq. (3.22), we can write the last term as

$$\begin{aligned} {\overline{Z}}\widetilde{S}= & {} Z\widetilde{S}+ {\widehat{S}}(C+S)^{-1}X(C+S)^{-1}{\widehat{S}}^T, \end{aligned}$$
(3.46)
$$\begin{aligned} Z\widetilde{S}= & {} \widetilde{S}- {\widehat{S}}(C+S)^{-1}{\widehat{S}}^T. \end{aligned}$$
(3.47)

Note that since the coefficients Z and \({\overline{Z}}\) are the same as for the autopredictions, and thus satisfy the bounds Eq. (3.23), we must have (for positive definite C and positive semi-definite S, i.e. \(C>0\), \(S\ge 0\))

$$\begin{aligned} 0\le {\widehat{S}}(C+S)^{-1}X(C+S)^{-1}{\widehat{S}}^T\le {\widehat{S}}(C+S)^{-1}{\widehat{S}}^T \le \widetilde{S},\nonumber \\ \end{aligned}$$
(3.48)

so in particular the subtraction (the last term in Eq. (3.47)) can never be so large that it makes the entire covariance matrix negative.

In summary, comparing Eqs. (3.40, 3.45) with Eqs. (3.37, 3.38), we see that including the correlations between the theoretical uncertainties in the fit and the prediction results in three effects: a shift in the central value of the prediction, a reduction in the theoretical uncertainty, and a reduction in the fitting uncertainty due to the correlations. Performing the fit gives us information (from the data) about the theory, which results in more precise, and hopefully more accurate, predictions.
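The correlated shift and covariance can be collected in a single routine. The following sketch rests on the same assumptions as the text (one fitted parameter, one nuisance parameter, \(S=\beta \beta ^T\)); the function name and arguments are ours, and in the example `T0` is a random stand-in for the best-fit prediction. When the predictions coincide with the fitted observables, the routine must reproduce the autoprediction results Eq. (3.21) and Eq. (3.30).

```python
import numpy as np

def correlated_prediction(D, T0, T_dot, Tt_dot, C, beta, beta_t):
    """Shift, Eq. (3.40), and covariance, Eq. (3.45), of predictions whose theory
    uncertainty is fully correlated with that of the fitted observables."""
    S = np.outer(beta, beta)
    W = np.linalg.inv(C + S)
    var_theta = 1.0 / (T_dot @ W @ T_dot)             # Eq. (3.8)
    X = var_theta * np.outer(T_dot, T_dot)            # Eq. (3.11)
    X_t = var_theta * np.outer(Tt_dot, Tt_dot)        # Eq. (3.35)
    X_h = var_theta * np.outer(Tt_dot, T_dot)         # Eq. (3.44)
    S_t = np.outer(beta_t, beta_t)                    # theory covariance of the predictions
    S_h = np.outer(beta_t, beta)                      # cross-correlation matrix
    Zbar = 1.0 - beta @ W @ beta + beta @ W @ X @ W @ beta       # Eq. (3.22)
    shift = S_h @ W @ (D - T0)                        # Eq. (3.40)
    cov = X_t - S_h @ W @ X_h.T - X_h @ W @ S_h.T + Zbar * S_t   # Eq. (3.45)
    return shift, cov

rng = np.random.default_rng(9)
N, N_t = 5, 3
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)
beta, T_dot = rng.normal(size=N), rng.normal(size=N)
beta_t, Tt_dot = rng.normal(size=N_t), rng.normal(size=N_t)
D, T0 = rng.normal(size=N), rng.normal(size=N)

shift, cov = correlated_prediction(D, T0, T_dot, Tt_dot, C, beta, beta_t)
# Autoprediction limit: predicting the fitted observables themselves
shift_auto, cov_auto = correlated_prediction(D, T0, T_dot, T_dot, C, beta, beta)
```

The resulting covariance matrix is symmetric and has no negative eigenvalues, in line with the bounds Eq. (3.48).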

4 Correlated MHOU in PDF fits

We now repeat the above analysis, but instead of the toy model we consider the more realistic situation in which the theoretical expressions \(T_i[f]\) depend on PDFs f, determined in a global fit to N data \(D_i\), with experimental covariance matrix \(C_{ij}\), and then used to make \(\widetilde{N}\) predictions \(\widetilde{T}_I[f]\). There are then many sources of theoretical uncertainty in the relation between the theoretical calculations and the PDFs: here we consider the most generic, the missing higher order uncertainty (MHOU), computed using scale variations according to one of the prescriptions set out in Refs. [8, 9]. The theory covariance matrices \(S_{ij}\) and \(\widetilde{S}_{IJ}\) associated with the MHOU will then have many non-zero eigenvalues, and thus there will be n nuisance parameters \(\lambda _\alpha \), \(\alpha =1,\ldots , n\) to take into account. There is in principle no limit on n, though in practice \(n\ll N\). As in the toy model, the fitting of the PDFs to the data will determine the mean and covariance of the nuisance parameters, which will then translate into systematic shifts and changes in the uncertainties of the theoretical predictions.

4.1 Expectation and covariance of multiple nuisance parameters

The nuisance parameters \(\lambda _\alpha \) correspond to shifts in the theoretical predictions: \(T_i[f]\rightarrow T_i[f] + \lambda _\alpha \beta _{i,\alpha }[f]\), where we adopt the summation convention for the index \(\alpha \). The shift vectors \(\beta _{i,\alpha }\) are not necessarily orthogonal to each other. We again assume Gaussian uncertainties, so that in place of Eq. (2.4) we now have

$$\begin{aligned} P(T|D\lambda )\propto & {} \exp \bigg (-{{1}\over {2}}(T[f]+\lambda _\alpha \beta _\alpha -D)^TC^{-1}\nonumber \\&\quad \times (T[f]+\lambda _\beta \beta _\beta -D)\bigg ), \end{aligned}$$
(4.1)

and assume that each nuisance parameter has a prior which is Gaussian distributed with unit variance, centred on zero, the distributions being independent both of each other and of the data, so that

$$\begin{aligned} P(\lambda |D)=P(\lambda ) \propto \exp \bigg (-{{1}\over {2}}\lambda _\alpha \lambda _\alpha \bigg ). \end{aligned}$$
(4.2)

We now marginalize over \(\lambda _\alpha \), as in Eq. (2.6)

$$\begin{aligned} P(T|D)\propto & {} \int d^n\lambda \, \exp \left( -{{1}\over {2}}[(T[f]+\lambda _\alpha \beta _\alpha -D)^TC^{-1}\nonumber \right. \\&\left. \times (T[f]+\lambda _\beta \beta _\beta -D)+\delta _{\alpha \beta }\lambda _\alpha \lambda _\beta ]\right) \, , \end{aligned}$$
(4.3)

by first completing the square: the details are messy, but the result is very similar to Eq. (2.16), namely

$$\begin{aligned} P(T|D)\propto & {} \int d^n\lambda \, \exp \left( -{{1}\over {2}}(\lambda _\alpha -\overline{\lambda }_\alpha ) Z_{\alpha \beta }^{-1}(\lambda _\beta -\overline{\lambda }_\beta ) \nonumber \right. \\&\left. - {{1}\over {2}}\chi ^2\right) \propto \exp \left( -{{1}\over {2}}\chi ^2\right) , \end{aligned}$$
(4.4)

where now

$$\begin{aligned} Z_{\alpha \beta } = (\delta _{\alpha \beta }+\beta _\alpha ^TC^{-1}\beta _\beta )^{-1}, \end{aligned}$$
(4.5)

the inverse on the right hand side being the matrix inverse with respect to the indices \(\alpha \) and \(\beta \),

$$\begin{aligned} \overline{\lambda }_\alpha (T,D) = Z_{\alpha \beta }\beta _\beta ^TC^{-1}(D-T), \end{aligned}$$
(4.6)

and \(\chi ^2\) is once again given by Eq. (2.2), but now with, in place of Eq. (2.3),

$$\begin{aligned} S = \beta _\alpha \beta ^T_\alpha , \end{aligned}$$
(4.7)

as expected. The Gaussian integration in Eq. (4.4) is now trivial, taking us back again to Eq. (2.1) up to a factor \((2\pi )^{n/2}(\mathrm{det}Z)^{1/2}\), which we can ignore as it does not depend on T or D, while Bayes’ Theorem Eq. (2.5) gives us the posterior distribution of the nuisance parameters:

$$\begin{aligned} P(\lambda |TD)\propto \exp \bigg (-{{1}\over {2}}(\lambda _\alpha -\overline{\lambda }_\alpha ) Z_{\alpha \beta }^{-1}(\lambda _\beta -\overline{\lambda }_\beta )\bigg ), \end{aligned}$$
(4.8)

whence we see that

$$\begin{aligned} E[\lambda _\alpha ] =\overline{\lambda }_\alpha ,\qquad E[(\lambda _\alpha -\overline{\lambda }_\alpha )(\lambda _\beta -\overline{\lambda }_\beta )] = Z_{\alpha \beta }. \end{aligned}$$
(4.9)

It is easy to see from the definition Eq. (4.5) that if \(e_\alpha \) is a unit eigenvector of \(Z_{\alpha \beta }\), and \(\beta =e_\alpha \beta _\alpha \), then the corresponding eigenvalue of \(Z_{\alpha \beta }\) is \(z = (1+\beta ^TC^{-1}\beta )^{-1}\), so \(0<z<1\), and (in analogy to the bounds Eq. (2.18)) \(Z_{\alpha \beta }\) is positive definite (thus invertible) and \(\delta _{\alpha \beta }-Z_{\alpha \beta }\) is also positive definite (because the eigenvalues z are all less than one). We can summarise this by writing, in place of Eq. (2.18)

$$\begin{aligned} 0< Z_{\alpha \beta } < \delta _{\alpha \beta }. \end{aligned}$$
(4.10)

We can express \(Z_{\alpha \beta }\) in terms of the inverse of \(C+S\):

$$\begin{aligned} Z_{\alpha \beta } = \delta _{\alpha \beta }-\beta _\alpha ^T(C+S)^{-1}\beta _\beta , \end{aligned}$$
(4.11)

since

$$\begin{aligned}&(\delta _{\alpha \gamma }+\beta _\alpha ^TC^{-1}\beta _\gamma )(\delta _{\gamma \beta }-\beta _\gamma ^T(C+S)^{-1}\beta _\beta )\nonumber \\&\quad =\delta _{\alpha \beta } + \beta _\alpha ^T(C^{-1}-(C+S)^{-1}-C^{-1}S(C+S)^{-1})\beta _\beta \nonumber \\&\quad = \delta _{\alpha \beta }. \end{aligned}$$
(4.12)

Combining Eq. (4.11) with Eq. (4.6), we then have

$$\begin{aligned} \overline{\lambda }_\alpha = \beta _\alpha ^T(C+S)^{-1}(D-T), \end{aligned}$$
(4.13)

since \((1-(C+S)^{-1}S)C^{-1} = (C+S)^{-1}\).
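The equivalence of Eq. (4.5) with Eq. (4.11), and of Eq. (4.6) with Eq. (4.13), is an instance of the Woodbury matrix identity and can be verified numerically. In this sketch the shift vectors \(\beta _\alpha \) are stored as the columns of an \(N\times n\) array `B`; the dimensions and random inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N, n_nuis = 6, 3
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)             # toy experimental covariance
B = rng.normal(size=(N, n_nuis))        # column alpha holds the shift vector beta_alpha
S = B @ B.T                             # Eq. (4.7)
C_inv = np.linalg.inv(C)
W = np.linalg.inv(C + S)

Z_45 = np.linalg.inv(np.eye(n_nuis) + B.T @ C_inv @ B)    # Eq. (4.5)
Z_411 = np.eye(n_nuis) - B.T @ W @ B                      # Eq. (4.11)

T = rng.normal(size=N)
D = rng.normal(size=N)
lam_46 = Z_45 @ B.T @ C_inv @ (D - T)                     # Eq. (4.6)
lam_413 = B.T @ W @ (D - T)                               # Eq. (4.13)

z = np.linalg.eigvalsh(Z_45)                              # eigenvalues entering Eq. (4.10)
```

For linearly independent shift vectors all the eigenvalues `z` lie strictly between zero and one, as required by Eq. (4.10).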

4.2 Fitting the PDFs

We now proceed to apply the above results in the context of a PDF fit incorporating MHOU [8, 9]. We consider first a fixed parametrization: the PDFs \(f(\theta )\) will then depend on m parameters \(\theta _p\), \(p = 1,\ldots ,m\), with \(m<N\) so the data D are sufficient to determine all the parameters through minimization of the \(\chi ^2\) Eq. (2.2).

We can then follow the same procedure as in Sect. 3, the only difference being that now we fit the m parameters \(\theta _p\) rather than just the single parameter \(\theta \). Writing the theoretical predictions \(T[f(\theta )]\equiv T(\theta )\), the linearization relation Eq. (3.1) becomes

$$\begin{aligned} T(\theta ) = T_0 + (\theta _p-\theta _p^0)T_p, \end{aligned}$$
(4.14)

where \(f(\theta ^0)\equiv f_0\) is the PDF that minimizes the \(\chi ^2\), \(T_0\equiv T(\theta ^0)\), \(T_p\equiv \partial T(\theta ^0)/\partial \theta _p^0\), and we use the summation convention for indices p. Using the data replicas Eqs. (3.2, 3.3), and minimizing Eq. (3.4) with respect to \(\theta _p\) rather than \(\theta \), we find in place of Eq. (3.7) that the fluctuations of the PDF replica parameters are given by

$$\begin{aligned} \theta ^{(r)}_p - \theta ^0_p = (T_p^T(C+S)^{-1}T_q)^{-1}T_q^T(C+S)^{-1}(D^{(r)}-D), \end{aligned}$$
(4.15)

where the matrix inverse in the first factor on the right hand side is with respect to the pq indices. It follows that instead of Eq. (3.8) we now have the covariance matrix

$$\begin{aligned} \mathrm{Cov}_{pq}[\theta ]= & {} \langle (\theta ^{(r)}_p-\theta _p^0)(\theta ^{(r)}_q-\theta _q^0)\rangle \nonumber \\= & {} (T_p^T(C+S)^{-1}T_q)^{-1}, \end{aligned}$$
(4.16)

while the expression Eq. (3.12) for the covariance of the predictions T[f] becomes, on writing \(T_p = |T_p|n_p\), where \(n_p\) are unit vectors (which are however not necessarily orthogonal)

$$\begin{aligned} X = n_p(n_p^T(C+S)^{-1}n_q)^{-1}n_q^T. \end{aligned}$$
(4.17)

so the projective relation Eq. (3.13) still holds, and \(X(C+S)^{-1}\) projects data replicas onto theory replicas as in Eq. (3.14).
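The multi-parameter generalization can be sketched numerically as follows. We store the derivatives \(T_p\) as the columns of an \(N\times m\) array `J`, which is our notation, and use assumed random inputs. The sketch builds the parameter covariance Eq. (4.16) and the matrix X, shows that normalizing the columns to unit vectors \(n_p\) as in Eq. (4.17) leaves X unchanged, and maps one pseudodata fluctuation into parameter and theory fluctuations.

```python
import numpy as np

rng = np.random.default_rng(6)
N, m = 7, 3
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)             # toy experimental covariance
B = rng.normal(size=(N, 2))
S = B @ B.T                             # toy theory covariance
W = np.linalg.inv(C + S)

J = rng.normal(size=(N, m))             # column p holds T_p = dT/dtheta_p at theta^0
cov_theta = np.linalg.inv(J.T @ W @ J)  # Eq. (4.16)
X = J @ cov_theta @ J.T                 # covariance of T induced by the fit

# Rescaling the columns to unit vectors n_p leaves X unchanged, Eq. (4.17)
Nmat = J / np.linalg.norm(J, axis=0)
X_unit = Nmat @ np.linalg.inv(Nmat.T @ W @ Nmat) @ Nmat.T

# One pseudodata fluctuation mapped to parameter and theory fluctuations
dD = rng.multivariate_normal(np.zeros(N), C + S)
dtheta = cov_theta @ J.T @ W @ dD       # Eq. (4.15)
dT = J @ dtheta                         # Eq. (4.14)
```

The matrix X now has rank m rather than one, but it still satisfies \(X=X(C+S)^{-1}X\), and `dT` coincides with `X @ W @ dD`.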

It is now easy to see that the results for the autopredictions in Sect. 3.2 continue to hold, and in particular that, since the central values of the nuisance parameters \(\overline{\lambda }_\alpha \) are given by

$$\begin{aligned} \mathrm{E}[\lambda _\alpha ] = -\beta _\alpha ^T(C+S)^{-1}(\langle T^{(r)}\rangle -D), \end{aligned}$$
(4.18)

the shifts Eq. (3.21) are now

$$\begin{aligned} \delta T[f]= & {} \beta _\alpha \beta _\alpha ^T(C+S)^{-1}(D-T[f_0]) \nonumber \\= & {} -S(C+S)^{-1}(T[f_0]-D). \end{aligned}$$
(4.19)

These shifts will improve the \(\chi ^2\) to the experimental data, in just the same way as in Sect. 2.2.

Likewise for the uncertainties: Eq. (3.22) for the variance of the nuisance parameter becomes an equation for the covariance matrix of the nuisance parameters in the context of the PDF fit,

$$\begin{aligned} \mathrm{Cov}_{\alpha \beta }[\lambda ]\equiv & {} \mathrm{E}[(\lambda _\alpha - \mathrm{E}[\lambda _\alpha ])(\lambda _\beta - \mathrm{E}[\lambda _\beta ])]\nonumber \\= & {} \delta _{\alpha \beta } -\beta _\alpha ^T(C+S)^{-1}\beta _\beta +\beta _\alpha ^T(C+S)^{-1}X(C+S)^{-1}\beta _\beta \nonumber \\\equiv & {} {\overline{Z}}_{\alpha \beta }, \end{aligned}$$
(4.20)

using the projective relation Eq. (3.13). Again, both \({\overline{Z}}_{\alpha \beta }-Z_{\alpha \beta }\) and \(\delta _{\alpha \beta }-{\overline{Z}}_{\alpha \beta }\) are positive semi-definite, so

$$\begin{aligned} 0 < Z_{\alpha \beta } \le {\overline{Z}}_{\alpha \beta } \le \delta _{\alpha \beta }. \end{aligned}$$
(4.21)
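The ordering in Eq. (4.21) can be verified for a toy fixed parametrization. This is a sketch under stated assumptions: the inputs are random placeholders, `min_eig` is a hypothetical helper, \(Z_{\alpha \beta }\) is taken to be \(\delta _{\alpha \beta }-\beta _\alpha ^T(C+S)^{-1}\beta _\beta \), and the X-dependent term enters \({\overline{Z}}\) with the plus sign required by Eqs. (4.21) and (4.28).

```python
import numpy as np

rng = np.random.default_rng(2)
N, m, n_nuis = 8, 3, 2             # toy sizes

A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)        # experimental covariance
beta = rng.normal(size=(N, n_nuis))
S = beta @ beta.T                  # theory covariance
T = rng.normal(size=(N, m))        # derivative vectors T_p as columns

W = np.linalg.inv(C + S)
X = T @ np.linalg.inv(T.T @ W @ T) @ T.T      # projective X

Z = np.eye(n_nuis) - beta.T @ W @ beta        # single-replica covariance
Zbar = Z + beta.T @ W @ X @ W @ beta          # covariance after replica average

def min_eig(M):
    """Smallest eigenvalue of the symmetric part of M."""
    return np.linalg.eigvalsh((M + M.T) / 2).min()

print(min_eig(Z), min_eig(Zbar - Z), min_eig(np.eye(n_nuis) - Zbar))
```

All three printed numbers are non-negative up to rounding, the first strictly so. The upper bound relies on X being projective, which is why it can fail once X is instead estimated from neural-network replicas.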

For the covariance matrix of the theoretical autopredictions \(T(f,\lambda )\equiv T[f]+\lambda _\alpha \beta _\alpha \), Eqs. (3.27, 3.30) then become

$$\begin{aligned} {\mathrm{Cov}}[T(f,\lambda )]= & {} X - S(C+S)^{-1}X-X(C+S)^{-1}S\nonumber \\&+\beta _\alpha {\overline{Z}}_{\alpha \beta }\beta _\beta ^T \end{aligned}$$
(4.22)
$$\begin{aligned}= & {} C(C+S)^{-1}X(C+S)^{-1}C \nonumber \\&+ S - S(C+S)^{-1}S, \end{aligned}$$
(4.23)

the second expression being identical to the one we found in Sect. 3.2.
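The equality of the two forms is purely algebraic once \(S=\beta _\alpha \beta _\alpha ^T\), and it holds for any symmetric X. A minimal numerical check with toy random matrices (all inputs are illustrative assumptions) is:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_nuis = 7, 3

A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)            # experimental covariance
beta = rng.normal(size=(N, n_nuis))
S = beta @ beta.T                      # theory covariance
G = rng.normal(size=(N, N))
X = G @ G.T                            # any symmetric positive semi-definite X

W = np.linalg.inv(C + S)
Z = np.eye(n_nuis) - beta.T @ W @ beta
Zbar = Z + beta.T @ W @ X @ W @ beta

# Form with the nuisance-parameter covariance kept explicit, Eq. (4.22)
P_nuisance = X - S @ W @ X - X @ W @ S + beta @ Zbar @ beta.T

# Form with the nuisance parameters eliminated, Eq. (4.23)
CW = C @ W
P_eliminated = CW @ X @ CW.T + S - S @ W @ S

print(np.max(np.abs(P_nuisance - P_eliminated)))
```

The printed difference is at the level of floating-point rounding.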

The same holds true of course for correlated predictions: the shifts Eq. (3.40) are now

$$\begin{aligned} \delta \widetilde{T}[f]= & {} \widetilde{\beta }_\alpha \beta _\alpha ^T(C+S)^{-1}(D-T[f_0]) \nonumber \\= & {} {\widehat{S}}(C+S)^{-1}(D-T[f_0]), \end{aligned}$$
(4.24)

where \({\widehat{S}}= \widetilde{\beta }_\alpha \beta _\alpha ^T\), while if \(\widetilde{T}(f,\lambda )=\widetilde{T}[f]+\lambda _\alpha \widetilde{\beta }_\alpha \), Eq. (3.45) becomes

$$\begin{aligned} {\mathrm{Cov}}[\widetilde{T}(f,\lambda )]= & {} \widetilde{X}- {\widehat{S}}(C+S)^{-1}{\widehat{X}}^T-{\widehat{X}}(C+S)^{-1}{\widehat{S}}^T\nonumber \\&+ \widetilde{\beta }_\alpha {\overline{Z}}_{\alpha \beta }\widetilde{\beta }_\beta ^T , \end{aligned}$$
(4.25)

where \(\widetilde{S}= \widetilde{\beta }_\alpha \widetilde{\beta }_\alpha ^T\), and

$$\begin{aligned} \widetilde{X}= & {} \widetilde{T}_p(T_p^T(C+S)^{-1}T_q)^{-1}\widetilde{T}_q^T, \end{aligned}$$
(4.26)
$$\begin{aligned} {\widehat{X}}= & {} \widetilde{T}_p(T_p^T(C+S)^{-1}T_q)^{-1}T_q^T. \end{aligned}$$
(4.27)

Using Eq. (4.20), we can write the last term as

$$\begin{aligned} \widetilde{\beta }_\alpha {\overline{Z}}_{\alpha \beta }\widetilde{\beta }_\beta ^T= & {} \widetilde{\beta }_\alpha Z_{\alpha \beta }\widetilde{\beta }_\beta ^T + {\widehat{S}}(C+S)^{-1}X(C+S)^{-1}{\widehat{S}}^T,\nonumber \\ \end{aligned}$$
(4.28)
$$\begin{aligned} \widetilde{\beta }_\alpha Z_{\alpha \beta }\widetilde{\beta }_\beta ^T= & {} \widetilde{S}- {\widehat{S}}(C+S)^{-1}{\widehat{S}}^T. \end{aligned}$$
(4.29)

The final result thus again has exactly the same form as that found in Sect. 3.3: once the nuisance parameters are all eliminated, the only changes are in the expressions for the covariances X, \(\widetilde{X}\) and \({\widehat{X}}\), generalizing the previous one parameter expressions to many parameters.

4.3 Fitting NNPDFs

PDFs in an NNPDF fit are parametrized by a neural network, with a very large number of parameters. The fitting procedure differs from that using a fixed parametrization, since we want to avoid fitting noise. In practice this is achieved using a cross-validation procedure. It follows that when we fit to each data replica \(D^{(r)}\), the neural net parameters, and thus the PDF replicas \(f^{(r)}\), are not precisely determined through exact minimization of the \(\chi ^2\), but rather include some random noise, which is responsible for the ‘functional uncertainty’ inherent in the fit [15]. It is not easy to describe this analytically: all we can say is that while all the general results in Sect. 4.1 remain valid, relations such as Eq. (4.15) for the fitted parameters, and thus the subsequent results Eqs. (4.16, 4.17), no longer hold. However we can still use Eq. (4.13) to compute the expectation and covariance of the nuisance parameters, and obtain the same results Eqs. (4.18, 4.20), provided we define \(T^{(r)}\equiv T[f^{(r)}]\), \(T^{(0)}\equiv \langle T^{(r)}\rangle \), and

$$\begin{aligned} X\equiv \mathrm{Cov}[T[f]] = \langle (T^{(r)} - T^{(0)}) (T^{(r)} - T^{(0)})^T\rangle \end{aligned}$$
(4.30)

as averages over the PDF replicas. This matrix gives the PDF uncertainties (and correlations) for the observables T[f], which include both the experimental uncertainties in the data and the theoretical uncertainties in extracting the PDFs from the data.
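In practice Eq. (4.30) is a sample covariance over replicas. A minimal sketch, assuming the predictions are stored as an array with one row per replica (the function name `replica_covariance` and the toy numbers are hypothetical):

```python
import numpy as np

def replica_covariance(T_rep):
    """Return (T0, X) from an array of predictions of shape (n_rep, N).

    T0 is the replica average and X the covariance of Eq. (4.30),
    normalized by n_rep (a plain ensemble average).
    """
    T0 = T_rep.mean(axis=0)
    dT = T_rep - T0
    X = dT.T @ dT / T_rep.shape[0]
    return T0, X

# Toy ensemble: 100 replicas of predictions for 5 data points
rng = np.random.default_rng(4)
T_rep = 1.0 + 0.1 * rng.normal(size=(100, 5))
T0, X = replica_covariance(T_rep)
print(np.sqrt(np.diag(X)) / T0)   # per-point relative PDF uncertainty
```

With this normalization the result coincides with `np.cov(T_rep, rowvar=False, bias=True)`; dividing by `n_rep - 1` instead would be an equally defensible choice for large replica samples.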

Note that in an NNPDF fit X no longer satisfies the projective relation Eq. (3.13), and indeed \(X(C+S)^{-1}\) no longer projects data replicas directly onto theory replicas as in Eq. (3.14). We can confirm this for a given set of PDF replicas by computing the cross-covariance matrix

$$\begin{aligned} Y \equiv \mathrm{Cov}[T,D]= \langle (T^{(r)} - T^{(0)}) (D^{(r)} - D)^T\rangle . \end{aligned}$$
(4.31)

For a fixed parametrization, we can use Eq. (3.14) and Eq. (3.3) to show that \(Y=X=Y^T\). However it is easy to check by explicit computation that in an NNPDF fit Y is generally considerably smaller than X: the fluctuations of the theory replicas are not very well correlated to the fluctuations of the data replicas due to the functional uncertainty. So although many of the eigenvalues of \(X(C+S)^{-1}\) will still be zero (because \(m<N\)), the nonzero eigenvalues will differ from one, and many will be somewhat larger than one due to the functional uncertainty. This means that while Eq. (4.10) still holds, the upper bound on \({\overline{Z}}_{\alpha \beta }\), Eq. (4.21), does not: the covariance of the nuisance parameters can be larger than the prior when the functional uncertainty is large.
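The contrast between the two cases can be illustrated with a toy ensemble. The sketch below is not an NNPDF fit: it assumes a small set of data-replica fluctuations constructed so that their sample covariance is exactly \(C+S\), models a fixed parametrization by projecting them with \(X(C+S)^{-1}\), and mimics functional uncertainty by adding extra fluctuations that are uncorrelated with the data replicas. All names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
N, m = 6, 2

A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)
B = rng.normal(size=(N, 2))
S = B @ B.T
T = rng.normal(size=(N, m))

W = np.linalg.inv(C + S)
X_exact = T @ np.linalg.inv(T.T @ W @ T) @ T.T

# Toy ensemble of 2N fluctuations D^(r) - D with sample covariance exactly C+S:
# plus and minus the scaled columns of the Cholesky factor of C+S.
L = np.linalg.cholesky(C + S)
dD = np.sqrt(N) * np.vstack([L.T, -L.T])      # shape (2N, N), zero mean

def cov(a, b):
    """Sample cross-covariance of two replica arrays (rows are replicas)."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return a.T @ b / a.shape[0]

# Fixed parametrization: theory fluctuations are the projected data fluctuations
dT_fixed = dD @ (X_exact @ W).T
X_fixed, Y_fixed = cov(dT_fixed, dT_fixed), cov(dT_fixed, dD)

# Toy stand-in for functional uncertainty (an assumption): extra fluctuations,
# identical within each +/- pair, hence uncorrelated with dD by construction
extra = 0.5 * rng.normal(size=(N, N))
dT_noisy = dT_fixed + np.vstack([extra, extra])
X_noisy, Y_noisy = cov(dT_noisy, dT_noisy), cov(dT_noisy, dD)

print(np.trace(X_fixed), np.trace(Y_fixed))   # equal
print(np.trace(X_noisy), np.trace(Y_noisy))   # X grows, Y does not
```

In this toy model the extra fluctuations inflate X while leaving Y unchanged, which is the qualitative pattern described above; how large the effect is in a real fit can only be established by computing Y from the actual replicas.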

Note that the fact that X is not invertible is not in any sense a technical limitation: the mapping of a global dataset into a set of PDFs cannot in principle be invertible (except possibly in certain special cases, such as data from a single process taken at a single scale [13]), since it is impossible to recover the data solely from the PDFs. This is in part because the PDFs are only functions of x, while the data also depend on a scale: when we determine PDFs, all the data are effectively projected onto a common scale. But it is also because PDFs are by definition universal, i.e. process independent, so given a set of PDFs it is impossible in principle to say even which processes were used to determine them.

We can now derive general results for the expectation and covariance of autopredictions from the three matrices C, S and X, following the procedure set out in Sect. 3.2. The shifts in the autopredictions are given by a similar expression to Eq. (4.19),

$$\begin{aligned} \delta T[f] = -S(C+S)^{-1}(T^{(0)}-D), \end{aligned}$$
(4.32)

and will reduce the experimental \(\chi ^2\) as explained already in Sect. 2.2. The covariance matrix is still given by Eq. (4.23):

$$\begin{aligned} P\equiv & {} {\mathrm{Cov}}[T(f,\lambda )] = C(C+S)^{-1}X(C+S)^{-1}C\nonumber \\&+ (S - S(C+S)^{-1}S). \end{aligned}$$
(4.33)

If the theory uncertainty S is much smaller than the experimental uncertainty C, P approaches the result

$$\begin{aligned} P_{\mathrm{con}}= X+S; \end{aligned}$$
(4.34)

the fitting uncertainty and theoretical uncertainty can be combined in quadrature. So when the experimental uncertainties dominate there is almost complete decorrelation of the theoretical uncertainties, and the ‘conservative’ prescription recommended in Ref. [9] is a useful approximation.
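The approach to the conservative limit can be seen by scaling down a toy theory covariance. In this sketch the function name `autoprediction_cov` and all matrices are illustrative assumptions; it evaluates Eq. (4.33) and compares with \(X+S\) for decreasing S.

```python
import numpy as np

def autoprediction_cov(C, S, X):
    """Covariance of autopredictions, Eq. (4.33)."""
    W = np.linalg.inv(C + S)
    CW = C @ W
    correlated_pdf = CW @ X @ CW.T
    correlated_theory = S - S @ W @ S
    return correlated_pdf + correlated_theory

rng = np.random.default_rng(7)
N = 6
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)      # toy experimental covariance
B = rng.normal(size=(N, 2))
S0 = B @ B.T                     # toy theory covariance before rescaling
G = rng.normal(size=(N, 3))
X = G @ G.T                      # toy PDF covariance

rel_diffs = []
for eps in (1.0, 1e-2, 1e-4):
    S = eps * S0
    P = autoprediction_cov(C, S, X)
    P_con = X + S
    rel_diffs.append(np.linalg.norm(P - P_con) / np.linalg.norm(P_con))

print(rel_diffs)
```

The relative difference between P and \(P_{\mathrm{con}}\) falls roughly in proportion to the size of S relative to C.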

Returning to the more general correlated result Eq. (4.33), both contributions to P (which we may call the correlated PDF uncertainty and the correlated theory uncertainty) are also positive semi-definite, and combine in quadrature to give the total uncertainty. Moreover the correlated theory uncertainty is bounded above by the corresponding uncorrelated theory uncertainty:

$$\begin{aligned} 0\le S - S(C+S)^{-1}S = C(C+S)^{-1}S \le S. \end{aligned}$$
(4.35)
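The steps behind Eq. (4.35) are short enough to spell out. As a worked sketch, using only that C is positive definite and S positive semi-definite (and writing \(S^{1/2}\) for the positive semi-definite square root of S, a notation introduced here for convenience):

$$\begin{aligned} S - S(C+S)^{-1}S &= \left[ (C+S) - S\right] (C+S)^{-1}S = C(C+S)^{-1}S, \\ S - \left[ S - S(C+S)^{-1}S\right] &= S(C+S)^{-1}S \ge 0, \\ S - S(C+S)^{-1}S &= S^{1/2}\left[ 1 - S^{1/2}(C+S)^{-1}S^{1/2}\right] S^{1/2} \ge 0, \end{aligned}$$

where the last inequality follows because \(S\le C+S\) implies \(S^{1/2}(C+S)^{-1}S^{1/2}\le 1\). The first line gives the equality, the second the upper bound, and the third the lower bound.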

It is tempting to think that the correlated PDF uncertainty will likewise be bounded above by the uncorrelated PDF uncertainty X: since C is positive definite and S positive semi-definite, \(C\le C+S\), so \(C(C+S)^{-1}\le 1\), and \(C(C+S)^{-1}X(C+S)^{-1}C \le X\). This argument is wrong, however, and the correlated PDF uncertainty can sometimes exceed the uncorrelated. Writing

$$\begin{aligned} C(C+S)^{-1}X(C+S)^{-1}C= & {} X - S(C+S)^{-1}X- X(C+S)^{-1}S\nonumber \\&+S(C+S)^{-1}X(C+S)^{-1}S, \end{aligned}$$
(4.36)

in some circumstances the sum of the last three terms can be positive. For this reason it seems impossible to prove in general that \(P\le P_{\mathrm{con}}\), though in all practical applications we have tested so far this seems to be the case.
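A two-point toy example makes the failure of the naive bound concrete. The matrices below are chosen by hand purely for illustration (they are not taken from any fit): the first point has no PDF uncertainty of its own, but acquires one through its theory correlation with the second.

```python
import numpy as np

# Hand-picked toy matrices (assumptions for illustration only)
C = np.eye(2)                          # uncorrelated unit experimental errors
S = 0.5 * np.ones((2, 2))              # fully correlated theory uncertainty
X = np.diag([0.0, 1.0])                # PDF uncertainty on the second point only

W = np.linalg.inv(C + S)
CW = C @ W

correlated_pdf = CW @ X @ CW.T         # C (C+S)^{-1} X (C+S)^{-1} C
excess = correlated_pdf - X            # sum of the last three terms of Eq. (4.36)

print(correlated_pdf)
print(np.linalg.eigvalsh(excess))
```

Here the correlated PDF variance of the first point is 0.0625 although \(X_{11}=0\), and the difference from X has one positive eigenvalue, so the inequality \(C(C+S)^{-1}X(C+S)^{-1}C\le X\) does not hold as a matrix statement.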

For genuine predictions, with theory uncertainties correlated to those in the fit, shifts are given by Eq. (4.24),

$$\begin{aligned} \delta \widetilde{T}[f] = -{\widehat{S}}(C+S)^{-1}(T^{(0)}-D), \end{aligned}$$
(4.37)

while Eq. (4.25) is most usefully written in the form

$$\begin{aligned} \widetilde{P}\equiv & {} {\mathrm{Cov}}[\widetilde{T}(f,\lambda )]\nonumber \\= & {} \widetilde{X}- {\widehat{X}}(C+S)^{-1}{\widehat{S}}^T - {\widehat{S}}(C+S)^{-1}{\widehat{X}}^T\nonumber \\&+{\widehat{S}}(C+S)^{-1}X(C+S)^{-1}{\widehat{S}}^T\nonumber \\&+ (\widetilde{S}- {\widehat{S}}(C+S)^{-1}{\widehat{S}}^T) , \end{aligned}$$
(4.38)

where besides the matrix X, Eq. (4.30), we must now also evaluate

$$\begin{aligned} \widetilde{X}\equiv & {} \mathrm{Cov}[\widetilde{T}[f,\lambda ]] = \langle (\widetilde{T}^{(r)} -\widetilde{T}^{(0)}) (\widetilde{T}^{(r)} - \widetilde{T}^{(0)})^T\rangle , \end{aligned}$$
(4.39)
$$\begin{aligned} {\widehat{X}}\equiv & {} \mathrm{Cov}[\widetilde{T}[f,\lambda ],T[f,\lambda ]] = \langle (\widetilde{T}^{(r)} -\widetilde{T}^{(0)}) (T^{(r)} - T^{(0)})^T\rangle . \end{aligned}$$
(4.40)
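Equations (4.37)–(4.40) translate directly into a few matrix operations on the replica arrays. The sketch below assumes that the shift vectors for the fitted data and for the new predictions are available as columns of two arrays; the function name `prediction_shift_and_cov`, the argument layout and the toy inputs are all hypothetical.

```python
import numpy as np

def prediction_shift_and_cov(C, beta, beta_new, T_rep, T_new_rep, D):
    """Shift (Eq. 4.37) and covariance (Eq. 4.38) of correlated predictions.

    C         : (N, N) experimental covariance of the fitted data
    beta      : (N, k) shift vectors for the fitted data
    beta_new  : (M, k) shift vectors for the new predictions
    T_rep     : (n_rep, N) replica predictions for the fitted data
    T_new_rep : (n_rep, M) replica predictions for the new observables
    D         : (N,) fitted data
    """
    S = beta @ beta.T
    S_hat = beta_new @ beta.T                  # cross-covariance, shape (M, N)
    S_new = beta_new @ beta_new.T
    W = np.linalg.inv(C + S)

    n_rep = T_rep.shape[0]
    T0 = T_rep.mean(axis=0)
    dT = T_rep - T0
    dT_new = T_new_rep - T_new_rep.mean(axis=0)
    X = dT.T @ dT / n_rep                      # Eq. (4.30)
    X_new = dT_new.T @ dT_new / n_rep          # Eq. (4.39)
    X_hat = dT_new.T @ dT / n_rep              # Eq. (4.40)

    SW = S_hat @ W
    shift = -SW @ (T0 - D)
    cov = (X_new - X_hat @ SW.T - SW @ X_hat.T + SW @ X @ SW.T
           + (S_new - SW @ S_hat.T))
    return shift, cov

# Consistency check on toy inputs: predicting the fitted data themselves
rng = np.random.default_rng(6)
N, k, n_rep = 5, 2, 50
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)
beta = rng.normal(size=(N, k))
T_rep = 1.0 + 0.1 * rng.normal(size=(n_rep, N))
D = 1.0 + 0.1 * rng.normal(size=N)
shift, P_tilde = prediction_shift_and_cov(C, beta, beta, T_rep, T_rep, D)
```

When the new observables coincide with the fitted data, as in the toy call above, the result reduces to the autoprediction expressions Eqs. (4.32, 4.33).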

Again the covariance Eq. (4.38) separates into the sum in quadrature of a correlated PDF uncertainty (the first four terms) and a correlated theory uncertainty (the final bracketed term). When the cross-covariance \({\widehat{S}}\) is very small, we obtain the conservative result

$$\begin{aligned} \widetilde{P}_{\mathrm{con}} = \widetilde{X}+\widetilde{S}, \end{aligned}$$
(4.41)

proposed in Ref. [9]. This will typically be the case for predictions of new processes where the dominant MHOU is in the hard cross-section. However for processes already included in the fit, the situation is more complex, since \(\widetilde{S}\) and \({\widehat{S}}\) may be large even if S is small.

Note that since the inclusion of MHOU in the PDF determination leads to only a small increase in the uncertainties of the PDFs [9], the conservative results Eq. (4.34, 4.41) give uncertainties very close to the conventional prescription, in which the PDF uncertainty (without MHOU) is combined in quadrature with the MHOU in the prediction. This will be particularly true when the MHOU in the prediction is larger than the PDF uncertainty.

Table 1 Classification of datasets into process types

5 Numerical results

In Sect. 4.3 we saw that in a realistic global PDF fit we can still use the same analytic expressions Eqs. (4.32, 4.33, 4.37, 4.38) for the shifts and reduction in uncertainties induced by the correlations between the theoretical uncertainties in fit and prediction as we would use in a fit with a fixed parametrization (Sect. 4.2). This is despite the fact that PDFs are smooth functions, which cannot be determined uniquely from a finite set of discrete data, but necessarily have an additional ‘functional uncertainty’, so the PDF parameters are not fixed uniquely by the fit. All that is necessary is to evaluate the matrices X, \(\widetilde{X}\) and \({\widehat{X}}\), Eqs. (4.30, 4.39, 4.40), as ensemble averages over the PDF replicas determined in the fit. In this section we will compute these matrices in a realistic global PDF fit with theory uncertainties, and use them to evaluate autopredictions and genuine predictions including the effect of the correlated theoretical uncertainties.

We will carry out these studies in the context of the NNPDF3.1 NLO global fit with MHOU presented in Ref. [9]. This in turn employed the same experimental data and theory calculations as in NNPDF3.1 [16], with two minor differences: the value of the lower kinematic cut was increased from \(Q_{\mathrm{min}}^2=2.69\) \(\hbox {GeV}^2\) to 13.96 \(\hbox {GeV}^2\), and the HERA \(F_2^b\), fixed-target Drell–Yan cross-sections, and some LHC inclusive jet data were removed, for technical reasons. This left a total of \(N_{\mathrm{dat}}=2819\) data points. The complete list of data included in the fit may be found in Tab. 6.3 of Ref. [9]. These data were divided into five classes, depending on the type of process involved, as summarized in Table 1. The MHOU covariance matrix \(S_{ij}\) was constructed using renormalization and factorization scale variations, by a factor of two either side. The factorization scale variations (estimating the MHOU in the NLO parton evolution) are correlated across all processes, but the renormalization scale variations (estimating the MHOU in the NLO hard cross-sections peculiar to each process), while correlated within data belonging to the same process, are uncorrelated between different processes. These variations were then combined to give \(S_{ij}\) using a 9pt scheme, as explained in Ref. [9]. The matrices \(C_{ij}\) and \(S_{ij}\) computed in Ref. [9] are reproduced in Fig. 1 as heat maps.

Fig. 1 The experimental covariance matrix, \(C_{ij}\), normalized to the theoretical predictions \(T^{(0)}_i\) (left), and the corresponding theory covariance matrix for MHOU, \(S_{ij}\) (right). The datasets are arranged in the order given in Fig. 7 below: so SLAC data are in the top left corner, and LHC top data in the lower right corner

5.1 Covariance of PDF uncertainties X

We begin by computing the covariance matrix \(X_{ij}\), Eq. (4.30), shown in Fig. 2 as a heat map alongside the corresponding correlation matrix. It can be seen that the off-diagonal elements of \(X_{ij}\) are almost as large as the diagonal elements: this is confirmed by examination of the correlation matrix. This is because theoretical predictions are often very strongly correlated, not only for nearby bins within the same experiment, but also for different processes at nearby scales, due primarily to the smoothness of the underlying PDFs, both in x and in \(Q^2\), but also due to the highly correlated theoretical uncertainties included in the fit.

Fig. 2 The covariance matrix of PDF uncertainties, \(X_{ij}\), normalized to the theoretical predictions \(T^{(0)}_i\) (left), and the corresponding correlation matrix \(X_{ij}/\sqrt{X_{ii}X_{jj}}\) (right). The datasets are arranged in the order given in Fig. 7 below: so SLAC data are in the top left corner, and LHC top data in the lower right corner

Fig. 3 The square root of the diagonal elements of the matrices X (in orange), C (in green) and S (in purple) normalized to the theoretical predictions \(T^{(0)}_i\), with those for C and S the same as in Ref. [9]. The datasets are arranged in the order given in Fig. 7 below

We compare the PDF uncertainties to the experimental and theoretical uncertainties by looking at the per-point uncertainty (Fig. 3). Recall (Eq. (3.3)) that \(C+S\) is the covariance of the data replicas to which the PDF replicas are fitted. It can be seen that at NLO the relative size of the experimental uncertainties \(C_{ii}\) and the theoretical uncertainties \(S_{ii}\) varies considerably between different datasets: for the fixed target DIS data \(S_{ii}\) is generally below \(C_{ii}\), except at large x, whereas for the HERA NC data \(S_{ii}\) is much less than \(C_{ii}\) at large x, but the other way around at small x where the theoretical uncertainty dominates. For CHORUS, the experimental uncertainty also dominates, while for most DY datasets the theoretical uncertainty dominates. In contrast the PDF uncertainties \(X_{ii}\) are generally less than either \(C_{ii}\) or \(S_{ii}\), because the combination of data within given datasets and information from other datasets in the fit conspire to reduce the uncertainty. This is especially evident for DIS CC, DY and JETS. However for some data sets, particularly cross-section ratios with very small theory uncertainty (such as NMC d/p, the asymmetries, and the differential top data), \(X_{ii}\) lies above \(S_{ii}\), though still below \(C_{ii}\).

Fig. 4 The 28 positive eigenvalues \(s^\alpha \) of the theory uncertainty matrix \(S_{ij}\) (above), shown in descending order, and 28 nuisance parameters \(\lambda _\alpha \) corresponding to the 28 eigenvectors \(\beta _\alpha \) (below), as given by Eq. (4.18). The uncertainties in the nuisance parameters are shown in total (square roots of the diagonal entries of Eq. (4.20)), and broken down into the contribution from scale uncertainties alone (square roots of the diagonal entries of Eq. (4.11)) and from PDF uncertainties (square roots of the diagonal entries of the last term in Eq. (4.20)). The yellow bands highlight the region between \(\pm 1\)

Fig. 5 Nuisance parameters \(\lambda \) for directions in the space of scale variations corresponding to up/down changes in factorization scale, and in renormalization scale for the five types of processes in the determination of the 9-pt theory covariance matrix for MHOU. The uncertainties in the nuisance parameters are shown in total, and broken down into the contribution from scale uncertainties alone and from PDF uncertainties, just as in Fig. 4. The yellow bands highlight the region between \(\pm 1\)

5.2 Nuisance parameters

Having computed X, we next calculate the nuisance parameters \(\lambda _\alpha \) of the theory covariance matrix \(S_{ij}\) for the MHOU in the NNPDF3.1 NLO global fit. The posterior distributions of these nuisance parameters give us information on which directions of MHOU are constrained by the fit. We showed in Ref. [9] that when there are five different processes, there are 28 nonzero eigenvalues. Thus we have 28 nuisance parameters, in one-to-one correspondence with the 28 eigenvectors with nonzero eigenvalues. The eigenvalues are shown in descending order in Fig. 4, with their nuisance parameters below them. The expectation values of the nuisance parameters are computed using Eq. (4.18), and their uncertainties using Eq. (4.20). The nuisance parameters are normalized so that their prior is a unit Gaussian centred on zero, as in Eq. (4.2). It can be seen that after fitting, the uncertainty in the nuisance parameters associated with the largest nine or so eigenvalues has been substantially reduced from one, indicating that exposure to the data has reduced the MHOUs. For those corresponding to the smaller eigenvalues there is very little reduction, showing that the data do not much constrain these directions in the space of MHOU. The central values for the three largest eigenvalue nuisance parameters remain close to zero within uncertainties, showing that the prior choices (mainly overall normalizations) were reasonable, while the next three or four show significant deviations from zero: for these the data seem to carry significant information about the MHOU. For the remaining (smaller) eigenvalues the central values of nuisance parameters are all consistent with zero, and clearly for the very small ones the data have no effect at all, the posterior distributions being the same as the prior. This shows that only the eigenvectors corresponding to the larger eigenvalues are actually relevant for the PDF determination: the remainder correspond to such small changes in theoretical uncertainty that the fit ignores them.
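The procedure just described can be sketched in a few lines. The code below is a hedged illustration on toy inputs, not the analysis code used for Fig. 4: the function name `nuisance_posteriors` is hypothetical, the shift vectors are assumed to be \(\beta _\alpha =\sqrt{s^\alpha }\,e_\alpha \) with \(e_\alpha \) the unit eigenvectors of S (so that \(S=\beta _\alpha \beta _\alpha ^T\)), and the single-replica covariance is assumed to be \(\delta _{\alpha \beta }-\beta _\alpha ^T(C+S)^{-1}\beta _\beta \).

```python
import numpy as np

def nuisance_posteriors(C, S, X, T0, D, tol=1e-10):
    """Posterior central values and uncertainties of the nuisance parameters.

    Returns the nonzero eigenvalues of S (descending), the central values
    of Eq. (4.18), and the scale-only, PDF-only and total uncertainties.
    """
    s, e = np.linalg.eigh(S)
    keep = s > tol * s.max()                    # drop numerically zero modes
    s, e = s[keep][::-1], e[:, keep][:, ::-1]   # descending order
    beta = e * np.sqrt(s)                       # columns beta_alpha

    W = np.linalg.inv(C + S)
    lam = -beta.T @ W @ (T0 - D)                # central values
    Z = np.eye(len(s)) - beta.T @ W @ beta      # single-replica covariance
    pdf_term = beta.T @ W @ X @ W @ beta        # contribution of PDF replicas
    Zbar = Z + pdf_term                         # total covariance

    sig_scale = np.sqrt(np.diag(Z))
    sig_pdf = np.sqrt(np.clip(np.diag(pdf_term), 0.0, None))
    sig_total = np.sqrt(np.diag(Zbar))
    return s, beta, lam, sig_scale, sig_pdf, sig_total

# Toy inputs (assumptions): rank-3 theory covariance on 8 data points
rng = np.random.default_rng(8)
N = 8
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)
B = rng.normal(size=(N, 3))
S = B @ B.T
G = rng.normal(size=(N, 2))
X = 0.1 * G @ G.T
D = rng.normal(size=N)
T0 = D + 0.5 * rng.normal(size=N)

s, beta, lam, sig_scale, sig_pdf, sig_total = nuisance_posteriors(C, S, X, T0, D)
print(s, lam, sig_total)
```

The scale-only and PDF-only uncertainties add in quadrature to the total, which is the decomposition shown in the lower panel of Fig. 4.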

We can understand these features better by separating out the two contributions to the total uncertainty in the nuisance parameters: that due purely to the impact of the fit of a given replica on the MHOU (given by Eq. (4.11)), and that due to the additional PDF uncertainty when the fits are averaged over all PDF replicas, given by the last term in Eq. (4.20). Both are also shown in Fig. 4. We see that when fitting a single replica, the uncertainties in the nuisance parameters corresponding to the larger eigenvalues are indeed very substantially reduced: the MHOU along the eigenvectors corresponding to these larger eigenvalues is learnt in the fit to the data, just as we saw in the simple models in Sect. 2.2. Again, very little information is retained about the smaller eigenvalues. The uncertainty contributed by averaging over the PDF replicas is also small for the largest eigenvalue nuisance parameters, but becomes the dominant contribution after the first three. For the smallest it is very small again: for these the data have no effect.

We can learn a little more about which MHOUs are learnt most by choosing different directions for the shift vectors \(\beta _\alpha \) in Eq. (4.1) than the eigenvectors of \(S_{ij}\). Specifically, we can choose the \(\beta _\alpha \) to correspond to factorization scale variations (up or down), or renormalization scale variations (up or down, but now separately for each process). The results are shown in Fig. 5, where we again show the total uncertainty, scale uncertainty, and PDF uncertainty. The central values fluctuate about zero, but all remain within the band \(\pm 1\), showing that the effect of fitting the experimental data on the nuisance parameters is rather mild: this is reassuring, as it confirms the choice of central scales used to make the predictions, and the choice of the range of the scale variations (implicit in the choice of the prior for the nuisance parameters, Eq. (4.2)). We also see that the uncertainties in the nuisance parameters corresponding to factorization scale variations (estimating the MHOUs in parton evolution), are reduced the most when fitting to any given replica, as we would expect since MHOUs in parton evolution are common to all data included in the fit. A little is also learnt about the renormalization scales for DIS. However, the PDF uncertainties partially wash out these effects. This suggests that the significant shifts in the nuisance parameters seen in Fig. 4 are due to global tensions between different processes, rather than problems with the choice of scales for particular processes.

Already we see that information from the data in the fit significantly updates the priors for the nuisance parameter distribution. From this it is likely that there will be an effect at the level of autopredictions, which is the subject of the next section.

5.3 Autopredictions

We now present results for the ‘autopredictions’. As we explained already in Sect. 2.2, these are the theoretical predictions we make for all the datasets included in the PDF fit, including theoretical uncertainties, after the fitting of the PDFs (with these same theoretical uncertainties). They can thus be thought of as ‘postdictions’, or predictions for the results of experiments run in exactly the same way, with the same equipment, as the original experiments, but taking account of the original global dataset. They thus form an ideal theoretical laboratory for testing the extent of the decorrelation between the theoretical uncertainties in the PDF fit, and those in the (auto)predictions.

Fig. 6 The shifts \(\delta T_i\), Eq. (4.32) (in blue) compared to the differences between theory and data, \(D_i-T^{(0)}_i\) (in green), both normalized to \(T^{(0)}_i\)

Fig. 7 The experimental \(\chi ^2\) for each data set, comparing the original result of the NLO fit with no theory uncertainties to the fit with theory uncertainties, and then including the correlated shift in the autopredictions

We begin by computing the shifts \(\delta T_i\), Eq. (4.32), in the autopredictions, due to the correlation in theoretical uncertainty: these are shown in Fig. 6, normalized to the original theoretical prediction \(T^{(0)}_i\). We also show for comparison the differences \(D_i - T^{(0)}_i \). It can be seen from the plot that these shifts are generally much smaller than the difference between data and theory, particularly for DIS NC and DY. However for some datasets (in particular CHORUS and inclusive jets), there seems to be an overall shift in central value of the same order as the difference between experiment and theory. It is nevertheless difficult to draw any further conclusions from these observations, since the shifts are strongly correlated within datasets.

Table 2 The experimental \(\chi ^2\) per data point for each process, comparing the original result of the NLO fit with no theory uncertainties to the fit with theory uncertainties, and then including the shift in the autopredictions
Fig. 8 The autoprediction covariance matrix \(P_{ij}\) Eq. (4.33), normalized to the theoretical predictions \(T^{(0)}_i\) (left), and the corresponding correlation matrix \(P_{ij}/\sqrt{P_{ii}P_{jj}}\) (right)

Fig. 9 The percentage uncertainties of the autopredictions \(\sqrt{P_{ii}}\) Eq. (4.33) (blue) compared to the PDF uncertainty \(\sqrt{X_{ii}}\) (orange), and the conservative result, \(\sqrt{P^{\mathrm{con}}_{ii}}\) Eq. (4.34) (cyan), all normalised to the theoretical predictions \(T^{(0)}_i\)

To see whether the shifts actually improve the autopredictions, in Fig. 7 we show the experimental \(\chi ^2\) for the original data, computed using the autopredictions for the fit with no theory uncertainties, those when the theory uncertainties are included in the PDF fit, and then when the autoprediction includes the shift. Needless to say all three results are generally very close, and including the theory uncertainties in the fit has mixed results, some predictions getting better, but at the expense of others getting worse, since the main effect of the theory uncertainties is to rebalance the datasets in the fit [9]. Nevertheless when the correlated shifts are included, the fit to most datasets improves, sometimes quite substantially, just as anticipated in the very simple ‘pure theory’ model in Sect. 2.2, and confirmed for the more realistic models in Sect. 3.2 and Sect. 4.2. The numbers broken down by process are shown in Table 2. When including theory uncertainties the total \(\chi ^2\) increases just a little, from 1.17 to 1.19, but when the correlated shift is added to the theoretical predictions, we see a significant improvement to 1.10. This improvement is seen across all the processes.

Although these autopredictions are in some sense artificial – in practice experiments are never repeated using exactly the same equipment and settings – the implications of this exercise for the learning of theoretical uncertainties are nevertheless rather general. This is because in a global fit of the size of that performed here, with 2819 data points from 35 datasets, involving five different processes, removing any one of the smaller datasets has very little impact on the PDFs, and removing any dataset has the effect of increasing PDF uncertainties, while theoretical uncertainties for the remaining data remain unchanged. Consequently if we were to perform the PDF fit without a given dataset, and repeat the analysis, so that the autoprediction becomes a genuine prediction (or more properly ‘postdiction’), the result for this genuine prediction would be very close to the autoprediction. So we expect the shifts in central values that we see in the autopredictions to also give improvements in such predictions: the shifts should improve the accuracy of the predictions.

Fig. 10 The contributions to the diagonal elements of the correlated theory uncertainty normalised to diagonal elements of S: \((S-S(C+S)^{-1}S)_{ii}/S_{ii}\) (pink), and \((S-S(C+S)^{-1}S+S(C+S)^{-1}X(C+S)^{-1}S )_{ii}/S_{ii}\) (black)

Fig. 11 The contributions to the diagonal elements of the correlated PDF uncertainty normalised to diagonal elements of X: \((X-S(C+S)^{-1}X-X(C+S)^{-1}S)_{ii}/X_{ii}\) (lilac), and \((C(C+S)^{-1}X(C+S)^{-1}C )_{ii}/X_{ii}\) (green); see Eq. (4.36)

To see whether we can also increase the precision, we consider the uncertainties in the autopredictions. In Fig. 8 we show the full covariance matrix \(P_{ij}\), Eq. (4.33), again normalized to the theoretical predictions, and the corresponding correlation matrix. The matrix \(P_{ij}\) is the sum of the PDF uncertainty (derived from data uncertainties and theoretical uncertainties combined) and the theoretical uncertainty in the autoprediction, each reduced in size to account for the learning of the theoretical uncertainties and the correlation between the two sources of theoretical uncertainty. As might be expected there are very large correlations in the autopredictions within datasets, particularly for nearby kinematic points, but there are also smaller correlations, and anticorrelations, between datasets. They are due not only to the correlations of experimental uncertainties within datasets, but also to the use of a common set of smooth underlying PDFs, and the correlations of the theory uncertainties. The correlations within each process are generally larger than those between processes. This suggests that the combined effect of the correlations due to the use of a common factorization scale and the correlations induced by the smoothness of the PDFs is small compared to the correlations from the renormalization scale.

In Fig. 9 we show the percentage uncertainties of the autopredictions: \(\sqrt{P_{ii}}\), compared to the purely PDF uncertainties \(\sqrt{X_{ii}}\) to aid comparison. The correlated autoprediction uncertainties are generally of similar size to the PDF uncertainties; they are rather larger for some of the DY datasets and JETS, but are actually smaller for most of the DIS NC data (most remarkably for the HERA data at small x), and some DY data. So the full autopredictions are not only more accurate: they are also more precise. This increase in precision must be taken with a pinch of salt, since it depends to some extent on the assumptions made in modelling the prior MHOU [9], in particular the choice of independent scales, and the scheme through which they are combined into the theory covariance matrix S. In particular the aggressive reduction in the small x uncertainties for the HERA NC autopredictions seen in Fig. 9 may be due to the adoption of the same factorization scale for both singlet and nonsinglet, which overconstrains the singlet evolution at small x [13]. We leave the relaxation of these kinds of assumptions for future work.

As expected all of the autoprediction uncertainties are smaller than would be obtained by the standard prescription of adding PDF uncertainties and theory uncertainties in quadrature [9], which ignores both learning and correlation. Indeed the conservative approach overestimates the correlated uncertainty for almost all datapoints, typically by a factor of two or more, particularly those for which the theoretical uncertainty is larger than the PDF uncertainty. The only data for which the conservative prescription works well are ratio data (for example the NMC d/p data), for which theoretical uncertainties are very small.

To understand better how these changes in uncertainty arise, we show in Fig. 10 the contributions to the diagonal elements of the correlated theory uncertainty (the second term in Eq. (4.33)), normalised to the uncorrelated elements \(S_{ii}\). The ‘learning’ of the theoretical uncertainty, given by the contribution \(-S(C+S)^{-1}S\) (note that \(S-S(C+S)^{-1}S\) is equal in the one parameter example described in Sect. 3.2 to ZS) is very significant, reducing the prior uncertainty S almost to zero for NC DIS and DY (where there is considerable data), and by an order of magnitude for DIS CC, JETS and TOP. Probably more flexibility is required in the modelling of the prior for this uncertainty. However the PDF fluctuations \(S(C+S)^{-1}X(C+S)^{-1}S\) (note that \(S-S(C+S)^{-1}S+S(C+S)^{-1}X(C+S)^{-1}S\) is equal in the one parameter example described in Sect. 3.2 to \({\overline{Z}}S\)) undo much of this learning, though for all data points this effect is insufficient to take the ratio to S above one.

A similar breakdown of the contributions to the diagonal elements of the correlated PDF uncertainty (the first term in Eq. (4.33), which can be expanded as in Eq. (4.36)), normalised to the uncorrelated elements \(X_{ii}\), is shown in Fig. 11. The correlation terms \(-S(C+S)^{-1}X-X(C+S)^{-1}S\) are indeed very large, as anticipated in Ref. [13], in particular for data with relatively large theoretical uncertainty (such as HERA NC at small x, or JETS), sufficient there to overwhelm X and give a negative result. However they can also be positive for some data (such as JETS). In any event, the addition of the PDF fluctuation term \(S(C+S)^{-1}X(C+S)^{-1}S\) (remember the decomposition Eq. (4.36) of the total correlated PDF uncertainty) is always sufficient to restore positivity of the correlated PDF uncertainty, and can in some situations (where S is large, in particular for JETS, but also some DIS CC and DY data) take the total correlated PDF uncertainty above the uncorrelated result X. Thus the correlations, while generally reducing uncertainties, can in some circumstances increase them, in contrast to learning which always reduces them.
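The restoration of positivity can be checked numerically. If the three terms quoted above make up the full correlated PDF uncertainty, their sum factorises as \((1-S(C+S)^{-1})\,X\,(1-(C+S)^{-1}S)\), which is positive semi-definite for symmetric S and C. The sketch below verifies this on random toy matrices; the inputs and helper names are assumptions for illustration only.

```python
import numpy as np

def pdf_uncertainty_terms(C, S, X):
    """Correlation and fluctuation terms of the correlated PDF uncertainty."""
    M = np.linalg.inv(C + S)                  # (C+S)^{-1}
    correlation = -S @ M @ X - X @ M @ S      # may be negative or positive
    fluctuation = S @ M @ X @ M @ S           # never negative on the diagonal
    return correlation, fluctuation

def toy_covariance(rng, n, scale):
    """Random symmetric positive-definite matrix (toy input only)."""
    A = rng.normal(size=(n, n))
    return scale * (A @ A.T + n * np.eye(n))

rng = np.random.default_rng(1)
n = 6
C = toy_covariance(rng, n, 0.1)
S = toy_covariance(rng, n, 1.0)
X = toy_covariance(rng, n, 0.2)

correlation, fluctuation = pdf_uncertainty_terms(C, S, X)
total = X + correlation + fluctuation

# Equivalent factorised form, manifestly positive semi-definite.
A = np.eye(n) - S @ np.linalg.inv(C + S)
factorised = A @ X @ A.T
```

The diagonal of `X + correlation` alone may be negative when S is large, but the diagonal of `total` cannot be.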

For autopredictions we expect high levels of learning and correlation because we are making predictions for exact repeats of experiments already in the fit. As we noted for the shifts, however, removing a smaller dataset from the fit has little effect on the PDFs, so we might expect similar effects for genuine predictions of processes already included in the fit, particularly if they are in a similar kinematic region.

Fig. 12

The upper two panels show predictions for \(t{\bar{t}}\) unnormalized rapidity distribution data taken at 13 TeV by CMS, the dilepton rapidity distribution [17] (left) and the lepton+jets distribution [18] (right). The five results shown are: the NLO fit with no MHOUs, PDF error only; the combined PDF and MHOU fit, ignoring correlations (thus \(\sqrt{P_{II}^{\mathrm{con}}}\)); the correlated result including the shift, and uncertainty computed using the simplified result (thus \(\sqrt{P_{II}^{\mathrm{sim}}}\)); the result with the same shift, but with the correlations included exactly (thus \(\sqrt{P_{II}}\)); and the NNLO result with no MHOU. In the middle panels the same is shown, but normalized to the uncorrelated result. In the lower panels we show the fractional reduction in the PDF uncertainty and the theory uncertainty due to the inclusion of the correlations.

5.4 Predictions for top

We now consider genuine predictions, for experiments not used in the PDF fit. These are of two kinds: those for datasets obtained through processes already contained in the fit, and those for completely new processes. For the former we consider the \(\mathrm{t}\bar{\mathrm{t}}\) production rapidity distributions (dilepton and lepton+jets) measured by CMS at 13 TeV [17, 18]. We chose these datasets for two reasons: firstly, the MHOU is large compared to the experimental uncertainty; secondly, the fit in Ref. [9] contains the total \(\mathrm{t}\bar{\mathrm{t}}\) cross-sections at 7, 8 and 13 TeV, and the normalized rapidity distributions at 8 TeV, all from both ATLAS and CMS. Both these factors mean that we expect to see significant effects due to correlations between the theoretical uncertainties of the data in the PDFs and the theoretical uncertainties of the 13 TeV rapidity distributions.

To make genuine predictions, including all correlations, we need to first calculate the covariance matrix for the MHOU of the predictions, \(\widetilde{S}_{IJ}\), and its cross-covariance with the MHOU of the theoretical predictions for the data used in the PDF fit, \({\widehat{S}}_{Ij}\) (the indices IJ running over the predicted data points, while ij run over the data included in the PDF fit): these are in fact the same as if we were planning to include data for the new process in a PDF fit, since the complete covariance matrix for the MHOU would then be of the form Eq. (2.41). Similarly we need to compute, using the fitted PDF replicas, the covariance matrix \(\widetilde{X}_{IJ}\) of the PDF uncertainty for the new predictions, and the cross-covariance \({\widehat{X}}_{Ij}\) with the PDF uncertainty of the observables used in the fit. All of these matrices are required for an exact calculation of the correlated shifts in the theoretical predictions \(\delta \widetilde{T}_I\), Eq. (4.37), and the covariance matrix \(\widetilde{P}_{IJ}\) of their combined PDF and correlated theoretical uncertainties, Eq. (4.38).
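A sketch of how the PDF covariance blocks might be obtained from replicas follows. It assumes the replica predictions are stored as arrays with one row per replica; the array names, the toy random inputs and the use of NumPy's default \(1/(N_{\mathrm{rep}}-1)\) normalisation are assumptions, not the conventions of the fitting code.

```python
import numpy as np

def pdf_covariance_blocks(T_fit, T_new):
    """Split the replica covariance into fit, cross and prediction blocks.

    T_fit: (Nrep, n) replica predictions for the n fitted data points (index j).
    T_new: (Nrep, N) replica predictions for the N new data points (index I).
    """
    n = T_fit.shape[1]
    full = np.cov(np.hstack([T_fit, T_new]), rowvar=False)
    X = full[:n, :n]          # X_ij
    X_hat = full[n:, :n]      # cross-covariance, indices (I, j)
    X_tilde = full[n:, n:]    # covariance of the new predictions, (I, J)
    return X, X_hat, X_tilde

# Toy replicas: random numbers standing in for real predictions.
rng = np.random.default_rng(2)
T_fit = rng.normal(size=(100, 4))
T_new = 0.5 * T_fit[:, :3] + rng.normal(scale=0.1, size=(100, 3))
X, X_hat, X_tilde = pdf_covariance_blocks(T_fit, T_new)
```

The theory blocks \(\widetilde{S}_{IJ}\) and \({\widehat{S}}_{Ij}\) have the same block layout, but they are built from scale-varied predictions according to the chosen prescription rather than from replicas.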

Predictions for the CMS 13 TeV \(\mathrm{t}\bar{\mathrm{t}}\) production rapidity distributions were computed using the same tool chain as in Ref. [16] for the 8 TeV distributions: NLO theoretical predictions were generated with Sherpa [19], in a format compliant with APPLgrid [20], using the MCgrid code [21] and the Rivet [22] analysis package, with OpenLoops [23] for the NLO matrix elements. Renormalization and factorization scales have been chosen based on the recommendation of Ref. [24] as \(H_T/4\).

The results of these calculations are shown in Fig. 12. The prior theoretical uncertainty in the original prediction is around 10%, considerably greater than the PDF uncertainty, as expected since the hard cross-sections are only computed at NLO. The correlated shift is sizeable, around 5%, and almost fully correlated across all the rapidity distributions. This is because these are unnormalized distributions, and thus have an overall theoretical normalization uncertainty which is strongly correlated to the measurements of the total \(\mathrm{t}\bar{\mathrm{t}}\) cross-sections at 7, 8 and 13 TeV by ATLAS and CMS included in the PDF fit. This is confirmed by breaking down the contributions to the shift Eq. (4.24) from the various data points included in the fit: the results are shown in Table 3. All of the shift comes from the six total cross-section measurements, while the 8 TeV normalized rapidity distributions push it down again by around 25%. The remaining data make almost no contribution. Nevertheless, the shift is still rather less than the theoretical uncertainty in the original prediction, as expected from the shifts in the nuisance parameters for the renormalization scale variation for top processes shown in Fig. 5.
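A breakdown of this kind is possible because the shift is linear in the fitted data. The sketch below assumes the shift takes the form \(\delta \widetilde{T}_I={\widehat{S}}_{Ij}\,[(C+S)^{-1}]_{jk}\,(D_k-T_k)\) and attributes contributions by splitting the sum over j. Both the formula as written here and the choice of index to split on are assumptions of this example, and all numbers are toy values modelling a fully correlated 10% normalisation uncertainty.

```python
import numpy as np

def shift_breakdown(S_hat, C, S, D, T, groups):
    """Total shift and its contribution from each group of fitted points."""
    w = np.linalg.solve(C + S, D - T)            # (C+S)^{-1} (D - T)
    total = S_hat @ w
    parts = {name: S_hat[:, idx] @ w[idx] for name, idx in groups.items()}
    return total, parts

# Toy fit: three 'total cross-section' points and two other points.
T = np.array([100.0, 120.0, 150.0, 1.0, 0.8])    # NLO theory for fitted data
D = 1.05 * T                                     # data sit 5% above theory
C = np.diag((0.03 * T) ** 2)                     # 3% uncorrelated exp. errors
s = 0.10 * T                                     # 10% normalisation MHOU
S = np.outer(s, s)

T_new = np.array([10.0, 8.0, 5.0])               # predictions to be shifted
S_hat = np.outer(0.10 * T_new, s)                # cross-covariance (I, j)

groups = {"total": np.array([0, 1, 2]), "other": np.array([3, 4])}
total, parts = shift_breakdown(S_hat, C, S, D, T, groups)

# Fractional contributions, averaged over the predicted points.
fractions = {name: np.mean(part / total) for name, part in parts.items()}
```

By construction the group contributions sum to the total shift, so the averaged fractions sum to one.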

Table 3 The fractional contributions of different data sets included in the fit to the shifts in the top rapidity distributions, averaged over all 21 data points

We can compare the shift due to correlations to that from going from NLO to NNLO. We thus show in Fig. 12 the results of a complete NNLO calculation (without theory uncertainties): the NNLO corrections also increase predictions by 5-8%. It is very interesting that the shift, driven by the data for the \(\mathrm{t}\bar{\mathrm{t}}\) total cross-sections at 7, 8 and 13 TeV, largely accounts for the NNLO correction: the data know that the NLO theoretical predictions are on the low side, and this information is carried over into the prediction for the 13 TeV rapidity distributions. Indeed, we compare the experimental \(\chi ^2\) for these data in the various calculations in Table 4: while the theory uncertainties increase the \(\chi ^2\) (due presumably to the other top data being deweighted in the fit), the shift gives a significant improvement both for dileptons and lepton+jets, comparable to that obtained with the complete NNLO corrections. So the shifts provide a new method for using experimental data to make improved theoretical predictions through the learning of theoretical uncertainties. The method should be particularly effective when there is substantial data on the process to be predicted already included in the PDF fit, as is the case here.
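The comparison of \(\chi ^2\) values can be illustrated as follows, taking the experimental \(\chi ^2\) per data point to be \((D-T)^T C^{-1}(D-T)/N\) with C the experimental covariance matrix only. This definition, the function name and the toy numbers are assumptions for the purposes of the example; they are not the values of Table 4.

```python
import numpy as np

def chi2_per_point(D, T, C):
    """Experimental chi-squared per data point for data D and theory T."""
    r = D - T
    return r @ np.linalg.solve(C, r) / len(D)

# Toy two-point example (illustrative numbers only).
D = np.array([1.0, 2.0])                  # data
T = np.array([0.9, 1.8])                  # unshifted prediction
C = np.diag([0.01, 0.04])                 # experimental covariance
delta_T = np.array([0.05, 0.10])          # hypothetical correlated shift

chi2_unshifted = chi2_per_point(D, T, C)
chi2_shifted = chi2_per_point(D, T + delta_T, C)
```

Here the shift moves the prediction halfway towards the data, so the \(\chi ^2\) falls by a factor of four.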

The middle panels of Fig. 12 show the same points as the top panels, but as a ratio to the conservative result, making the uncertainties more visible. Comparing the uncertainties, the difference between the uncorrelated and correlated uncertainty is striking; the correlated uncertainties are much smaller than the uncorrelated. The correlated uncertainties are however still larger than the purely PDF uncertainties, but the very large theoretical uncertainty has been substantially reduced. So not only are the correlated predictions more accurate, they are also more precise. Despite this significant shrinking of uncertainties, the correlated predictions are still compatible with the NNLO result, thanks to the shift in central values. While the conservative prescription is also compatible with the NNLO result, it is immediately clear from the plot that it is inferior to the correlated prediction.

The breakdown of the reduction in uncertainties due to the correlations is shown in the lower panels of Fig. 12. The correlated theory uncertainty is substantially reduced (uniformly across the rapidities), due to the learning of the normalization from the data already included in the fit, while the correlated PDF uncertainty is reduced rather less: as much as a factor of two when the differential cross-section is small, but hardly at all when it is large. So here the dominant effect is clearly the learning of the theoretical uncertainty in the overall normalization.

The theoretical uncertainties in the theoretical predictions are strongly correlated amongst themselves, and between the two rapidity distributions: in Fig. 13 we show the correlation matrices for the PDF uncertainties, \(\widetilde{X}_{IJ}\), and that of the correlated prediction \(\widetilde{P}_{IJ}\). We see that while the predictions are more than 50% correlated across the range of rapidities by the PDF, when the correlated theoretical uncertainties are also included all points are rather more correlated, to more than 70%. The pattern of correlations reflects the symmetry in the dilepton distribution and asymmetry in the lepton+jet distribution: the least correlated points are those with the greatest rapidity separation.
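The correlation matrices are obtained from the covariance matrices by the usual normalisation, for example \(\widetilde{P}_{IJ}/\sqrt{\widetilde{P}_{II}\widetilde{P}_{JJ}}\). A minimal sketch, with a hypothetical function name and a toy input:

```python
import numpy as np

def correlation_matrix(cov):
    """Normalise a covariance matrix to unit diagonal."""
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)

# Toy 2x2 covariance (illustrative numbers only).
cov = np.array([[4.0, 2.0], [2.0, 9.0]])
corr = correlation_matrix(cov)
```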

Table 4 The experimental \(\chi ^2\) per data point for the CMS 13 TeV top dilepton and lepton+jet rapidity distributions, comparing the original result of the NLO fit with no theory uncertainties to the fit with theory uncertainties, and then including the correlated shift in the autopredictions. Also shown for comparison is the result in a NNLO fit with no theory uncertainties, only PDF uncertainties
Fig. 13

The left hand plot shows the correlation matrix \(\widetilde{X}_{IJ}/\sqrt{\widetilde{X}_{II}\widetilde{X}_{JJ}}\) of the contribution of the PDF uncertainties to the predictions for the 13 TeV rapidity distributions by CMS: the right hand plot shows the correlation matrix \(\widetilde{P}_{IJ}/\sqrt{\widetilde{P}_{II}\widetilde{P}_{JJ}}\) of the total uncertainties including the correlated theoretical uncertainties. Note the expanded scales on the heat maps, different in each plot

5.5 Predictions for Higgs

As an example of a process not included in the PDF fit we consider the prediction for the total cross-section for Higgs production in gluon fusion at 14 TeV. The cross-section was computed using ggHiggs [25,26,27]. Renormalization and factorization scales are set to \(m_h/2\), and the computation is performed using rescaled effective theory.

Our results are shown in Fig. 14. The PDFs are still the NLO PDFs with MHOU from Ref. [9], but the Higgs total cross-sections are computed at NLO, NNLO and N3LO. At NLO the MHOU in the Higgs cross-section, estimated by varying the renormalization scale, completely dominates all other uncertainties (the PDF uncertainty, the theoretical uncertainty in the PDFs, and the scale uncertainty in the PDF evolution, estimated by factorization scale variation), so the effect of correlations in the MHOU is completely negligible. However when the cross-section is computed at NNLO, the renormalization scale uncertainty is rather smaller, and at N3LO it becomes more comparable to the other sources of uncertainty. The shift due to the correlation between MHOUs is always small compared to the overall uncertainty, and becomes smaller still as the perturbative order of the cross-section is increased, as expected. This is because, unlike in the top predictions, data for this process are not included in the fit, so the renormalization scale uncertainty is completely uncorrelated. It is interesting to note however that the small shift due to the correlation in factorization scale takes the NNLO prediction very close to the result computed with NNLO PDFs (though the coincidence is surely accidental).

Unlike for the top predictions, the effect of the correlation on the size of the overall uncertainty is small. Again, this is because the PDF contains far less information about Higgs production than it does about top production. We saw in Fig. 8 that the information propagated through to the correlated uncertainties was primarily through the renormalization scales, and that correlation due to the factorization scale was a lot weaker. Higgs production, being a new process, is therefore only impacted through the weak factorization scale correlations, and so the reduction in uncertainties is small. Overall, the uncorrelated conservative prescription [9] is comparable in this case to the fully correlated one. Note that if we were to perform these studies with NNLO (or indeed N3LO) PDFs which include MHOU, the MHOU in the PDF determination would have presumably been rather smaller than at NLO, and thus the effect of correlations in the MHOU, in particular the shift, but also the effect on the size of the uncertainty would be even smaller than the small corrections we see here.

From these examples of autopredictions, and genuine predictions for top and Higgs, we have seen that the extent of the shift and correlation can vary quite significantly, depending on the type of prediction being made and what information is already contained in the PDFs. The conservative prescription recommended in Ref. [9] is not always optimal, as the full inclusion of correlations can quite substantially reduce uncertainties, as we saw both for the autopredictions and top predictions. However, when predicting a new process for which the PDF contains little information about correlated theoretical uncertainties, unsurprisingly the impact of correlations is small and the conservative prescription is quite sufficient.

Fig. 14

Predictions for the Higgs total cross-section at 14 TeV, made using a variety of approximations. All results use NLO PDFs, while the Higgs total cross-section is computed at NLO (left panel), NNLO (centre panel) and N3LO (right panel). In each panel, we then have, from left to right: MHOU included only in the PDF determination in the 9pt scheme; the same but with the factorization scale uncertainty (MHOU in PDF evolution) included in quadrature; the same but with instead the renormalization scale uncertainty (MHOU in the Higgs cross-section); the total PDF uncertainty and 9pt MHOU combined in quadrature, as recommended in Ref. [9]; the total PDF plus 9pt MHOU, but now including also the shift and the correlation between theoretical uncertainties. In the centre panel we also show the NNLO prediction with NNLO PDFs (but no theoretical uncertainties), as a dashed line

6 Summary

In this paper we studied in detail the correlation between theoretical uncertainties in the calculations used in the determination of PDFs in a global fit, as formulated in Ref. [9], and the theoretical uncertainties in the predictions made using these PDFs. We began by recasting the theoretical uncertainties using nuisance parameters, determined replica by replica, which carry all the information about the effect of the experimental data on the theoretical uncertainties. Using increasingly realistic models of the fitting procedure, we produced analytical formulae for computing fully correlated predictions. In the process we identified three distinct but related effects, each of which has a significant impact on the final theoretical predictions:

  • Shifts in central values. These are an effect of Bayesian learning: just as we can use experimental data to determine PDFs, so we can also use it to identify theoretical corrections that improve the agreement between data and theory, while remaining within theoretical uncertainties. The correlations between theoretical uncertainties in the fit and those in predictions then lead to more accurate predictions. This effect was first identified in Sect. 2.2.

  • Learning of theoretical uncertainties. A second consequence of Bayesian learning is a reduction in theoretical uncertainty, due to the information provided by the data in the PDF fit, which through correlation of theoretical uncertainties can lead to a corresponding reduction in the theoretical uncertainties in predictions. This effect was also identified in Sect. 2.2, and is complementary to the shift in central values.

  • Correlation in theoretical uncertainties. The third effect is that the correlation between the theoretical uncertainties in the fit and the theoretical uncertainties in the predictions leads to a change in the PDF uncertainties in the prediction, even in situations where there is no shift, thus avoiding any ‘double counting’ of the theoretical uncertainty. The existence of this effect was first noted in Ref. [13], and identified as an effect distinct from Bayesian learning in Sect. 2.3.

While these three effects were first identified in the simple models of Sect. 2, we showed that they are all present in the one parameter fits of Sect. 3 and the more realistic fits with multiple parameters in Sect. 4. Using the NNPDF3.1 NLO global fits with MHOU [9], we demonstrated in Sect. 5 that the shifts can give sensible estimates of NNLO corrections, and thereby reduce the \(\chi ^2\) to the experimental data. We also showed that while the uncertainty in NLO predictions is still a sum in quadrature of the theoretical uncertainty and the PDF uncertainty (which also includes a theory uncertainty), this sum can be significantly reduced, depending on the relative size of the theoretical and experimental uncertainties. Consequently the ‘conservative’ prescription of Ref. [9], where the theory uncertainty in the prediction is combined in quadrature with the PDF uncertainty, is indeed conservative. We expect these conclusions to also hold in global PDF fits with fixed parametrization and tolerance [28, 29], if these were to include MHOU in the PDF fit.

The degree of correlation is highly dependent on the type of prediction being made. For the autopredictions (predictions for new measurements of the same data points as those included in the fit), Sect. 5.3, where there is maximal correspondence between the data in the fit and the predictions being made, the correlation is very high, leading to shifts that improve the quality of the fit to the data, together with a significant reduction in uncertainties, in some cases down to a small fraction of the uncorrelated values. For genuine predictions for new measurements of processes already included in the PDF fit, such as the new measurements of differential top production discussed in Sect. 5.4, we observe that the shift takes the correlated NLO predictions very close to the NNLO prediction, with a significant reduction in uncertainties: the prediction is both more accurate and more precise. For Higgs production, discussed in Sect. 5.5, a process not included in the PDF fit, the level of correlation is much smaller, since the dominant uncertainty (the MHOU in the hard cross-section) is uncorrelated with the MHOU of the fitted processes. In this case the shift is well within uncertainties, and the reduction in uncertainty very modest, so here the use of the conservative prescription of Ref. [9] is entirely appropriate. We expect this to be true of predictions for any new process.

Thus our main conclusion is that when using PDFs which include MHOUs, taking account of the correlations between the MHOU included in the determination of the PDFs and the MHOU in the prediction can result in a significant improvement in both accuracy and precision. This is especially true in the case where the predicted process is among those included in the fit. However the correlated predictions must be treated with care, since their reliability relies to some extent on the generality of the prior estimation of the MHOU: if unjustified assumptions are made in the choice of prior, the uncertainty estimates in the correlated predictions may be too aggressive. For these reasons the conservative prescription, as an upper bound on the overall uncertainty, may sometimes be preferable, especially for predictions of new processes.

In order to calculate fully correlated predictions and uncertainties, one requires besides the PDF replicas some additional information: the cross-correlations between the theoretical uncertainties in the prediction and those in the theoretical calculations used to determine the PDFs, \({\widehat{S}}_{Ij}\); and the cross-correlations between the PDF uncertainties in the prediction and all the calculations included in the fit, \({\widehat{X}}_{Ij}\). In the future, it may be possible to present this information in separate NNPDF deliverables to facilitate the calculation of the correlation effects.

Although we presented our numerical study of correlations in the context of MHOUs, we would expect similar results for other kinds of theoretical uncertainty, such as nuclear uncertainties, higher twist uncertainties, or indeed parametric uncertainties: once the theory covariance matrix has been computed, the linear algebra has no concern for the type of theoretical uncertainty it contains. This suggests a new technique for determining external parameters in PDF fits, such as quark masses or electroweak parameters, taking full account of all correlations with the PDFs and theoretical uncertainties. We hope to explore this possibility in the near future.