Statistical issues in the parton distribution analysis of the Tevatron jet data

We analyse a tension between the D0 and CDF inclusive jet data and the perturbative QCD calculations, which are based on the ABKM09 and ABM11 parton distribution functions (PDFs) within the nuisance parameter framework. Particular attention is paid on the uncertainties in the nuisance parameters due to the data fluctuations and the PDF errors. We show that with account of these uncertainties the nuisance parameters do not demonstrate a statistically significant excess. A statistical bias of the estimator based on the nuisance parameters is also discussed.


Introduction
Since the first observation of jet production at Tevatron this process is considered as a valuable source of information about the gluon distribution at large x. Indeed, the gluon distribution directly enters into the jet production cross section in contrast to the deepinelastic-scattering (DIS) process, which provides only an indirect constraint on the gluon distribution, through the QCD evolution. The Tevatron jet production data [1,2] are used in the global fits of parton distribution functions (PDFs) to improve accuracy of the gluon distribution, particularly at large x. At this end proper statistical treatment of the data is required since uncertainties in the data of refs. [1,2] are dominated by the correlated systematics and the simplest χ 2 estimator is inapplicable. In this case one should ideally use the χ 2 estimator including the covariance matrix, which encodes the error correlations. However, for the sake of implementation simplicity an alternative form of estimator is often employed [3]. This form is based on the so-called "nuisance" parameters, which describe a possible shift of the data due to systematic uncertainties. The nuisance parameters entering the estimator are fitted to the data simultaneously with other parameters describing the PDF shape. As a result, the number of fitted parameters dramatically grows. This difficulty is circumvented because the nuisance parameters enter into the estimator of ref. [3] linearly therefore the χ 2 value can be minimized with respect to the nuisance parameters analytically. As an added feature, the approach based on the nuisance parameters allows for the visualization of any tension between the data and the fitted model since it shows how large a shift of the data provides the best agreement with the model. Moreover, in the same way the best values of the nuisance parameters can be estimated for any given data set, which is not included in the PDF fit, in order to check for potential problems with accommodation of the new data into the fit.

JHEP02(2014)041
The ABKM09 PDFs [5] and their refined version, ABM11 PDFs [8], were extracted to next-to-next-to-leading-order (NNLO) in perturbative QCD from a combination of the world inclusive DIS data supplemented by the fixed-target data for the Drell-Yan process and dimuon production in the neutrino-nucleon collision. The Tevatron jet data were also included into a variant of the ABKM09 fit [5,6] and good agreement with other data used in the fit has been achieved. The analysis of ref. [6] is focused on the impact of the Tevatron data on the Higgs cross section estimate, cf. also [9], and statistical aspects in this analysis have not been detailed. In the present paper we fill this gap by giving a detailed calculation of the nuisance parameters for the ABKM09 and ABM11 PDFs with and without Tevatron jet data included. The paper is organized as follows. In section 2 we give a brief outline of the formalism used in analysis of the correlated data. Section 3 contains a description of the systematic uncertainties in the Tevatron jet data and the corresponding nuisance parameters in comparison with ones obtained with other PDF sets. Particular attention is payed on the nuisance parameters for the luminosity uncertainty and on the impact of this source of uncertainty on the fit results following suggestions of ref. [7]. Section 4 contains a conclusion.

Basics of the correlated data analysis
In case measurements are subject to correlated systematic uncertainties the experimental data {y i } can be represented as follows, where f i is the mathematical expectation of the measurement i depending on the vector of model parameters Θ, σ i is its uncorrelated uncertainty, 1 s rel k,i (s k,i ) are the relative (absolute) correlated uncertainties, which stem from N syst independent sources, i.e., s k,i = y i s rel k,i , and the index i runs over all experimental data points. The independent random variables µ i and λ k describe the uncorrelated and correlated fluctuations in the data, respectively. By definition, the uncorrelated fluctuations are independent for each data point. In contrast, the correlated fluctuation due to each source k are common for all data points. Routinely they are related to systematic effects in the luminosity, calibration, corrections, etc. For cross section measurements these factors are applied to the data multiplicatively therefore the systematic errors are commonly multiplicative, while the additive errors may appear due to an uncertainty in the background subtraction, which is usually relatively small for the experimental data used to constraint PDFs. With account of the data correlations the χ 2 -estimator reads

JHEP02(2014)041
The error matrix E ij is the inverse of the positive definite covariance matrix C ij . For the model of eq. (2.1) with the multiplicative errors where δ ij is the Kronecker symbol. For the additive correlated uncertainties the statistical model is simplified as follows with the covariance matrix which should be inserted into eq. (2.2) to obtain the χ 2 -estimator, similarly to the case of multiplicative errors. It should be emphasized, that the chosen model for the uncertainties does rely on certain assumptions, namely that the uncertainties are small and subject to normal distributions. Neither of the two assumptions, which provide the basis of the mathematical formalism for the statistical estimators discussed in the sequel, may be valid, though. If violated for a given experimental measurement, however, the corresponding set of data cannot be used for precision predictions, e.g., of the proton structure, as advertised in the context of the inclusive jet cross sections.
Correlation of the additive errors is often taken into account employing the following form of χ 2 [3] The form of eq. (2.6) allows for shifts of the data by the correlated uncertainty scaled with the values of the parameters η k . The latter are fitted simultaneously with the theoretical model parameters Θ and in this way describe the data shifts, which provide the best description of the fitted model. The advantage of the estimator in eq. (2.6) is essentially its technical simplicity since the vector of η k , which provides the minimum of eq. (2.6) can be found analytically as a product of two matrices defining the nuisance parameters r k as random variables where

JHEP02(2014)041
and The value of the estimator in eq. (2.6) at η k = r k reads Since the inverse of the additive covariance matrix eq. (2.5) is the value of χ 2 min coincides with the one of eq. (2.2) for the statistical model of data with additive systematic errors. This demonstrates the mathematical equivalence of the estimators based on the covariance matrix and the nuisance parameters (cf. ref. [4]). However, note that for the case of multiplicative errors the nuisance parameter approach is not so straightforward since the minimum of the corresponding χ 2 -estimator cannot be found as a solution of linear equation system.
The nuisance parameters r k of eq. (2.7) have the average equal to zero and the variances, which read is the covariance matrix for the vectors B l,l of eq. (2.9). 2 Through f i ( Θ) entering eq. (2.9) the nuisance parameters depend on the fitted parameters Θ. For the data sets, which are not included into the fit, the best-fit nuisance parameters are generally bigger by magnitude than the ones obtained from a fit, which includes those data sets, due to better tuning of Θ to the data in the latter case. In the following section we analyze this trend for the different Tevatron jet data with respect to the ABKM09 [5] and ABM11 [8] fits considering two cases: before and after these data are included into the fit.

Comparison of the ABKM09 and ABM11 fits with Tevatron jet data
The Tevatron experiments CDF and D0 have accumulated big samples of events with hard jets in the final state and have performed elaborated analyses of these samples with different jet definition algorithms, cf. [10] for a recent review. For brevity we consider in the following only two Tevatron inclusive jet data sets [1,2] obtained by the D0 and CDF collaborations, respectively, which nonetheless give a representative illustration of the issues discussed in the paper. Both data sets were collected in Run II and each corresponds to an integral luminosity of about 1fb −1 .
As mentioned in section 1, we will consider two scenarios for the comparison of PDF sets with the Tevatron jet data. In the first, we study how the nominal ABKM09 [5] and ABM11 [8] fits, i.e., those corresponding to the publicly available tables in the LHAPDF library [11,12], accommodate the Tevatron jet data. In the second variant, analogous to preceding studies in [6,8], we analyze the changes after these data are included into the fit.

Analysis of D0 results
The D0 analysis of ref. [1] is based on the midpoint cone algorithm for the jet definition. The D0 data cover the range of −2.4 ÷ 2.4 in the jet rapidity and 50 ÷ 600 GeV in the transverse momentum of jet. The published correlated systematic uncertainties in the D0 data are due to the luminosity and 23 additional sources, including the jet energy calibration, resolution, etc. In the present analysis we consider all these sources taking the average in the case of asymmetric errors. 3 The distribution of the nuisance parameters r of eq. (2.7), which correspond to these 23 sources of systematics, calculated for the NNLO ABKM09 PDFs are given in figure 1(a). The jet production cross sections are obtained with the FastNLO tool [13] and include the NLO corrections [14,15] and the threshold resummation corrections of ref. [16]. The D0 nuisance parameters spread in the range from -1.  Figure 3. Values of the nuisance parameters r for the D0 data [1] (left) and the CDF ones [2] (right) with the uncertainties due to data fluctuation (inner bars) and the total uncertainties including the ones due to PDFs (outer bars) versus the nuisance parameter number n. The luminosity nuisance parameters correspond to n = 6 and 17 for D0 and CDF, respectively.
one. The maximal absolute value of r corresponds to the systematic uncertainty in the luminosity. This reflects the fact that the D0 data systematically overshoot the ABKM09 predictions, cf. refs. [6,8]. However, with account of the errors in the nuisance parameters due to fluctuations in the data and due to the PDF uncertainties the statistical significance of the spread in the nuisance parameters reduces. To check in details the uncertainty in the D0 luminosity nuisance parameter due to the data fluctuation we calculate it for 200 pseudo-data sets generated with eq. (2.4) and the data errors of ref. [1] taking a normal Gaussian distribution for the random variables µ and λ. The distribution of the luminosity nuisance parameter obtained for these data sets is displayed in figure 2. It is comparable to the Gaussian distribution with the best fit value of eq. (2.7) and variance of eq. (2.12), which are r norm = 4.1 and V (r norm ) = 0.85, respectively. This is somewhat different from the central value of r norm = 3.35 obtained for the D0 data [1] with the ABKM09 PDFs in ref. [7]. Meanwhile in both cases the analysis is based on the formalism of section 2, therefore the difference should be ascribed to the details of the implementation. Note in this context that our code for the nuisance parameter computation was crosschecked using numerical minimization of eq. (2.6) (cf. figure 2). The error in the nuisance parameters due to PDFs is estimated in our analysis as a combination of their variation with the change in the PDFs between the central value and each of the 25 PDF sets describing the ABKM09 PDF uncertainties. For the D0 luminosity nuisance parameter this gives an additional uncertainty of ∆ PDF (r norm ) = 0.95. A combination of V (r norm ) and ∆ PDF (r norm ) in quadrature gives the total uncertainty ∆ tot (r norm ) = 1.3. This says, the D0 luminosity nuisance parameter deviates from zero by 3 standard deviations. Meanwhile other D0 nuisance parameters are consistent with zero within uncertainties, cf. figure 3(a). Therefore, the excess in the luminosity nuisance parameter is in fact equivalent to the fluctuation of one of 23 random numbers by 3σ. In practice such excess does not lead to the hypothesis rejection, cf. significance of the NuTeV anomaly and the muon magnetic moment measurements in the context of the Standard Model validation [18].

JHEP02(2014)041
Furthermore, it does not prevent easy accommodation of the D0 data into our fit. In the variant of the ABKM09 analysis with those data included the nuisance parameters are in general much smaller due to better tuning of the PDFs to the data and the value of luminosity nuisance parameter is 1.5 only that is consistent with zero within the error, cf. figure 1(b). Note, the distribution of nuisance parameters for this variant of the ABKM09 fit is very much comparable to the ones for the MSTW08 and NN21 PDFs, also tuned to the Tevatron jet data, cf. figures 1(c,d). Similarly to the case of MSTW and NNPDF analyses the D0 jet data mostly affect the large-x gluon distribution, while other PDFs are modified in much less extent.
To make an explicit check of the impact of the D0 luminosity uncertainty on the extracted PDFs we perform one more variant of the ABKM09 fit, with the luminosity uncertainty in the D0 data dropped. Note that the luminosity error is 6.1% only, much smaller than other systematic uncertainties, therefore it does not dominate in the data errors and dropping this error does not lead to any essential deterioration of the D0 data description.  figure 1 for the CDF data on inclusive jet production [2]. The curves superimposed display a normal Gaussian distribution normalized on the total number of the nuisance parameters. from these two variants of the fit generally does not exceed its uncertainty, cf. figure 4, and for other PDFs it is even smaller. These results are in line with the above conclusion about the statistical significance of the excess in the luminosity nuisance parameter. Furthermore, keeping in mind that this check gives an upper margin of the luminosity uncertainty impact, we conclude that it is well within 1σ. These findings show that accounting for the uncertainties in the nuisance parameters is substantial for the correct interpretation of the observed excess in the luminosity parameters. However, the analysis of ref. [7] lacks the uncertainty estimation therefore it does not provide the essential ingredient for the critical benchmarking of the PDFs with the Tevatron data.

Analysis of CDF results
The CDF data on the inclusive jet cross sections [2] were obtained with the k T algorithm for the jet definition and cover the range of  Figure 7. The same as in figure 6 for the D0 [1] (left) and CDF [2] (right) data on the inclusive jet production cross section.

JHEP02(2014)041
nuisance parameters calculated with the NNLO ABKM09 PDFs is displayed in figure 5(a). In general, it is in agreement with the normal Gaussian one with the only essential excess observed for the luminosity nuisance parameter, which reaches the value of r norm = 5.4, to be compared to 5.05 obtained in ref. [7] for the CDF data with the ABKM09 PDFs. This is bigger than the D0 luminosity nuisance parameter. However, due to bigger uncertainties in the CDF data the errors in this parameter are also bigger as compared to the D0 case. The variance of the CDF luminosity nuisance parameter is V (r norm ) = 0.93 (to be compared to 0.85 for D0) and the uncertainty due to the PDFs is ∆ PDF (r norm ) = 1.43 (to be compared to 0.95 for D0). The CDF error due PDFs is evidently enhanced due to the particular trend of the data with respect to the predictions based on the ABKM09 fit. In the D0 case the offset of data does not depend on the jet energy, while the CDF jet energy dependence is systematically tilted as compared to the predictions, cf. figures 1,2 in ref. [6]. With account of these errors the CDF luminosity nuisance parameter deviates from zero by 3 standard deviations, however, other nuisance parameters are consistent with zero within uncertainties, cf. figure 3(b), therefore in total the statistical significance of the excess in the luminosity nuisance parameter is comparable to one for the D0 jet data. The distribution of the CDF nuisance parameters in the variant of the ABKM09 fit, which includes the CDF data, is in agreement with the normal Gaussian one, cf. figure 5(a). In contrast to our findings, the distribution of the nuisance parameter for the CDF data with the ABKM09 PDFs obtained with the MSTW fitting tools demonstrate clear deviation from the Gaussian one (cf. figure 17a in ref. [7]).
Similarly to the D0 case the CDF data mostly affect gluon distribution and the change in χ 2 due to dropping the CDF luminosity uncertainty in the variant of the ABKM09 fit, which includes the CDF data, is marginal, i.e. less than 1 for 76 data points. The change in the PDFs due to dropping the luminosity uncertainty is within 1σ, cf. figure 4, that supports the conclusion about statistical insignificance of the excess in the CDF and D0 nuisance luminosity parameter found in ref. [7] for the ABKM09 PDFs.

On correlations of uncertainties in CDF/D0 results
Another concern about the conclusion of ref. [7] is related to the relevance of a rigorous statistical treatment of the systematic uncertainties in the Tevatron jet data. Commonly, the different sources of systematics are assumed to be independent, cf. eqs. (2.1), (2.4). This also was assumed in the present study and in ref. [7]. We have checked this hypothesis for the Tevatron jet data plotting the cosine of angles between the systematic uncertainty vectors s k,i , which are defined as The vectors of systematic uncertainties s k,i are not positive definite. Furthermore, the shifts corresponding to different sources of systematic uncertainties are apriori positive and negative in different kinematic regions. Therefore the distribution of cos(φ kk ) must naively peak at cos(φ) = 0 and be symmetric with respect to this peak for the case of independent JHEP02(2014)041 sources of the systematic uncertainties. In particular, such a picture is observed for the HERA data on the inclusive deep-inelastic-scattering (DIS) structure functions, cf. figure 6. However, this is not the case for the D0 and CDF data, cf. figure 7. For both CDF and D0 data the distributions peak at cos(φ) = 1 and are quite asymmetric, particularly in the case of CDF. This asymmetry signals about a one-sided deviation of different systematic uncertainties and the peak indicates a strong collinearity of many systematic uncertainty vectors. In case these systematic errors really stem from one of a few sources only, the PDF fits based on the Tevatron jet data should be revisited. Note that the vectors s k,i corresponding to the luminosity uncertainty are collinear to many other systematic error vectors for these data. Possibly, this also explains the big error in the luminosity parameter since the corresponding nuisance parameters are mixed due to this collinearity.

Comparison of different PDF sets
The distributions of the D0 and CDF nuisance parameters for the variants of NNLO ABM11 fit [8], which include the Tevatron jet data in a similar way to ref. [6], are in agreement with ones for the ABKM09 fit, cf. figure 8. In turn, both ABKM09 and ABM11 nuisance parameter distributions are similar to the ones obtained with the MSTW08 [19] and NN21 [20] PDFs, which are also tuned to the Tevatron jet data, cf. figures 1(c,d) and 5(c,d). The remaining differences can be explained by the specific data selection in the fits and the fitted model peculiarities, like e.g. heavy-quark treatment, high-twist contributions, and others, cf. ref. [8]. It can also appear due to different statistical estimators used in the PDF fits.
In particular, the ABKM09 and ABM11 fits are based on the covariance matrix estimator of eq. (2.2), while in the MSTW08 fit the one of eq. (2.6) is employed. As we have pointed out in section 2, in the first case the systematic errors are considered as multiplicative and in the second case as additive. The nature of the errors in Tevatron jet data is not fully specified in the original papers, however, the dominating source of systematic uncertainty, jet energy scale, is known to be multiplicative [21]. Note, that an additive treatment of the multiplicative errors leads to a statistical bias in the fitted parameters (cf. refs. [22][23][24] and references therein for a discussion). Therefore it may have an impact on the nuisance parameter values which depend on the fitted PDF parameters as well. In the NNPDF fit [20,25] the luminosity errors are treated in a special way, which allows to minimize the bias. However, the covariance matrix of eq. (2.5) is still used to take into account other correlated systematic errors (cf. eq. (1) in ref. [25]). Since for the Tevatron jet data the latter dominate, the bias appears also in the NNPDF fit (cf. ref. [21] for a detailed study of the bias.). In summary, the statistical significance of the excess in the luminosity nuisance parameters found in ref. [7] for the D0 / CDF jet data and ABKM09 PDFs is quantified at the level of 3σ and we observe only marginal impact of the luminosity uncertainty on the variant of the ABKM09 fit, which includes those data. This does not support the judgment of ref. [7] that "systematic shifts r norm = 3 ∼ 5 for some PDF sets" are "completely unreasonable" arrived at without considering the nuisance parameter uncertainties. Besides, the nuisance parameter approach [7] is based on a statistical estimator giving bias in the case of multiplicative errors, such as the luminosity one or the jet energy scale uncertainty,  Figure 8. The same as figure 1 for the D0 data [1] (left) and the CDF data [2] (right) and the variants of NNLO ABM11 fit [8] including the D0 and CDF jet data, respectively.
which dominates in the systematics of the Tevatron jet data. In contrast, the estimator used in the ABKM09 fit is asymptotically unbiased in this case [23].

Conclusion
We have analyzed a tension between the D0 and CDF inclusive jet data and the perturbative QCD calculations, which are based on the NNLO ABKM09 and ABM11 PDFs with account of the NLO and NNLO threshold resummation corrections to the parton cross sections. The nuisance parameters employed to quantify the tension are calculated for each source of systematic uncertainty in the data minimizing the χ 2 -estimator, which allows for shifts of the data by the value of systematic error scaled with the corresponding nuisance parameter. For some sources, in particular for the luminosity uncertainty, the nuisance parameter values are relatively big. However, the analysis of the impact of data fluctuations and the PDF errors shows that the uncertainties of the nuisance parameters themselves are sizable as well. For the D0 and the CDF data sets the nuisance parameters for the luminosity uncertainty take the values r D0 norm = 4.1 ± 1.3 and r CDF norm = 5.4 ± 1.7, respectively. This happens due to many systematic uncertainty vectors including the luminosity ones being collinear and, as a result, the corresponding nuisance parameters are mixed. In view of those uncertainties the excesses in the luminosity nuisance parameters are of marginal statistical significance only with respect to the entire PDF fit. This conclusion is explicitly checked by considering the variants of ABKM09 fit, which include the Tevatron jet data without any luminosity uncertainty taken into account. The results of these fits are quite similar to the ones including the luminosity uncertainties. At the same time, despite the fact that no serious statistical issues arise in the variants of the ABKM09 fit including the Tevatron jet data [6], the latter are finally not yet used in the ABM11 fit [8] in view of yet JHEP02(2014)041 lacking complete NNLO corrections, which may have an impact both on determination of the strong coupling constant and on the parton distribution functions.