Statistical Models with Uncertain Error Parameters

In a statistical analysis in Particle Physics, nuisance parameters can be introduced to take into account various types of systematic uncertainties. The best estimate of such a parameter is often modeled as a Gaussian distributed variable with a given standard deviation (the corresponding"systematic error"). Although the assigned systematic errors are usually treated as constants, in general they are themselves uncertain. A type of model is presented where the uncertainty in the assigned systematic errors is taken into account. Estimates of the systematic variances are modeled as gamma distributed random variables. The resulting confidence intervals show interesting and useful properties. For example, when averaging measurements to estimate their mean, the size of the confidence interval increases for decreasing goodness-of-fit, and averages have reduced sensitivity to outliers. The basic properties of the model are presented and several examples relevant for Particle Physics are explored.


Introduction
Data analysis in Particle Physics is based on observation of a set of numbers that can be represented by a (vector) random variable, here denoted as y. The probability of y (or probability density for continuous variables) can in general be written P (y|µ, θ), where µ represents parameters of interest and θ are nuisance parameters needed for the correctness of the model but not of interest to the analyst.
The goal of the analysis is to carry out inference related to the parameters of interest. A procedure for doing this in the framework of frequentist statistics using the profile likelihood function is described in Sec. 2. This involves using control measurements with given standard deviations to provide information on the nuisance parameters. Here we will usually take the term "systematic error" to mean the standard deviation of of a control measurement itself; in other contexts it often refers to the corresponding uncertainty in an estimate of a parameter of interest.
Often the values assigned to the systematic errors are themselves uncertain. This can be incorporated into the model by treating their values as adjustable parameters and their estimates as random variables. A model is proposed in which the estimates of systematic variances are treated as following a gamma distribution, whose mean and width are set by the analyst to reflect the desired nominal error and its relative uncertainty.
The confidence intervals that result from this type of model are found to have interesting and useful properties. For example, when averaging measurements to estimate their mean, the size of the confidence interval increases with decreasing goodness-of-fit, and averages have reduced sensitivity to outliers. The basic properties of the model are presented and several types of examples relevant for Particle Physics are explored.
The approach followed here is that of frequentist statistics, as this is widely used in Particle Physics. Models with elements similar to the one proposed have been discussed in the statistics literature, e.g., Refs. [1,2]. Analogous Bayesian procedures have been been investigated in Particle Physics [3,4,5] and found to produce results with qualitatively similar properties.
After reviewing parameter inference using the profile likelihood with known systematic errors in Sec. 2, the model with uncertain errors is presented in Sec. 3 and its use in determining confidence intervals is discussed in Sec. 4. In this paper two areas where such a model can be applied are explored: a single Gaussian distribution measurement in Sec. 5 and the method of least squares in Sec. 6. The issue of correlated systematic errors is discussed in Sec. 7 and conclusions are given in Sec. 8.

Parameter inference using the profile likelihood and the case of known systematic errors
Inference about a model's parameters is based on the likelihood function L(µ, θ) = P (y|µ, θ). More specifically one can construct a frequentist test of values of the parameters of interest µ by using the profile likelihood ratio (see, e.g., Ref. [6]), λ(µ) = L(µ,θ) L(μ,θ) . (1) Here in the denominator,μ andθ represent the maximumlikelihood (ML) estimators of µ and θ, andθ are the profiled values of θ, i.e., the values of θ that maximize the likelihood for a given value of µ.
Often the nuisance parameters are introduced to account for a systematic uncertainty in the model. Their presence eliminates the systematic error but, because of correlations between the estimators of the parameters, they decrease the sensitivity of the analysis to the parameters of interest. To counteract this unwanted effect, one often includes into the set of observed quantities additional measurements that provide information on the nuisance parameters.
The log-likelihood in Eq. (3) represents one of the most widely used methods for taking account of systematic uncertainties in Particle Physics analyses. First nuisance parameters are introduced into the model to eliminate the systematic error, and then these parameters are constrained by means of control measurements. The quadratic constraint terms in Eqs. (3) correspond to the case where the estimate u i of the parameter θ i is modeled as a Gaussian distributed variable of known standard deviation σ ui .
In some problems one may have parameters η i that are intrinsically positive with estimates t i modeled as following a log-normal distribution. The Gaussian model covers this case as well by defining θ i = ln η i and u i = ln t i , so that u i is the corresponding Gaussian distributed estimator for θ i .
Often the estimates u i are the outcome of real control measurements, and so the standard deviations σ ui are related to the corresponding sample size. The control measurement itself could, however, involve a number of uncertainties or arbitrary model choices, and as a result the values of the σ ui may themselves be uncertain.
Gaussian modelling of the u i can be used even if the measurement exists only in an idealized sense. For example, the parameter θ i could represent a not-yet computed coefficient in a perturbation series, and u i is one's best guess of its value (e.g., zero). In this case one may try to estimate an appropriate σ ui by means of some recipe, e.g., by varying some of some aspects of the approximation technique used to arrive at u i . For example, in the case of prediction based on perturbation theory one may try varying the renormalization scale in some reasonable range. In such a case the estimate of σ ui results from fairly arbitrary choices, and values that may differ by 50% or even a factor of two might not be unreasonable.

Gamma model for estimated variances
One can extend the model expressed by Eq. (2) to account for the uncertainty in the systematic errors by treating the σ ui as adjustable parameters. The best estimates s i for the σ ui are regarded as measurements to be included in the likelihood model. The width of the distribution of the s i is set by the analyst to reflect the appropriate uncertainty in the σ ui .
The characterization of the "error on the error" is described in Sec. 3.1. In Sec. 3.2 the full mathematical model is defined and the corresponding likelihood profiled over the σ ui is derived. This is shown in Sec. 3.3 to be equivalent to a model in which the estimates u i follow a Student's t distribution.

The relative error on the error
In the model proposed here it is convenient to regard the variances σ 2 u as the parameters, and to take values v i = s 2 i as their estimates. There is a special case in which the estimated variances v i will follow a chi-squared distribution, namely, when v i is the sample variance of n independent observations of u i , i.e., where u i,j is the jth observation of u i and u i = 1 n n j=1 u i,j . If the u i,j are Gaussian distributed with standard deviations σ ui , then one finds (see, e.g., Ref. [7]) that the statistic (n−1)v i /σ 2 ui follows a chi-squared distribution for n−1 degrees of freedom. Furthermore, the chi-squared distribution for n degrees of freedom is a special case of the gamma distribution, for parameter values α = n/2 and β = 1/2. The mean and variance are related to the parameters α and β by E[v] = α/β and V [v] = α/β 2 . Therefore if (n − 1)v i /σ 2 ui follows a chi-square distribution with n − 1 degrees of freedom, then v i follows a gamma distribution with In general the analyst will not base the estimate v i on n observations of u i but rather on different types of information, such as related control measurements or approximate theoretical predictions. The analyst must then set the width of the distribution of v i to reflect the appropriate level of uncertainty in the estimate of σ 2 ui . For v i = s 2 i , using error propagation gives to first order To characterize the width of the gamma distribution we define From Eq. (8) one sees that to first approximation r i ≈ σ si /E[s i ] and thus we can think of these factors as representing the relative uncertainty in the estimate of the systematic error. A more accurate relation between r i as defined here and the relative error on the error σ si /E[s i ] is given in Appendix A.
Using the expectation value of the gamma distribution E[v i ] = α i /β i and its variance V [v i ] = α i /β 2 i , we can relate the values r i supplied by the analyst and the σ ui to α i and β i by Figure 1(a) shows the gamma distribution for σ u = 1 and several values of r and 1(b) shows the corresponding distribution of s = √ v. More details on the distribution of s and its properties are given in App. A. The assumption of a gamma distribution is not unique but represents nevertheless a reasonable and flexible expression of uncertainty in the σ ui . Moreover it will be shown that by using the gamma distribution one finds a very simple procedure for incorporating uncertain systematic errors into the model.
Using Eq. (6) to connect the relative uncertainty r i to the effective number of measurements n gives n = 1 + 1/2r 2 i . A relevant special case is n = 2, sometimes called the problem of "two-point systematics", where one has two estimates u i,1 and u i,2 of a parameter θ i . This givesθ It will be assumed in this paper that the analyst is able to assign meaningful values for the "error-on-the-error" parameters r i . The procedure for doing this will involve elicitation of expert knowledge from those who assigned the systematic errors and will in general vary depending on the experiment. One may want to regard a subset of the measurements as having a certain common r which could be fitted from the data, but we do not investigate this possibility further here.

Likelihood for the gamma model
By treating the estimated variances v = (v 1 , . . . , v N ) as independent gamma distributed random variables, the full likelihood function becomes By using Eqs. (10) and (11) to relate the parameters α i and β i to σ 2 ui and r i one finds, up to additive terms that are independent of the parameters, the log-likelihood ln L(µ, θ, σ 2 u ) = ln P (y|µ, θ) By setting the derivatives of ln L with respect to the σ 2 ui to zero for fixed θ and µ one finds the profiled values Using these for the σ 2 ui gives the profile likelihood with respect to the systematic variances, but which still depends on θ as well as the parameters of interest µ. After some manipulation it can be written up to constant terms as ln L ′ (µ, θ) = ln L(µ, θ, σ 2 u ) = ln P (y|µ, θ)

Derivation of profile likelihood from Student's t distribution
An equivalent derivation of the profile likelihood (18) can be obtained by first defining As u i follows a Gaussian with mean θ i and standard deviation σ ui , and v i follows a gamma distribution with mean σ 2 ui and standard deviation σ vi = 2r 2 i σ 2 ui , one can show (see, e.g., Ref. [7]) that z i follows a Student's t distribution, with a number of degrees of freedom By constructing the likelihood L(µ, θ) as the product of P (y|µ, θ) and Student's t distributions, one obtains the same log-likelihood as given by ln L ′ from Eq. (18). That is, the same model results if one replaces the estimates v i by constants σ 2 ui , but still takes the z i to follow a Student's distribution, with u i = θ i + σ ui z i . Thus in the following we can drop the prime in the profile log-likelihood (18) and regard this equivalently as the log-likelihood resulting from a model where the control measurements are distributed according to a Student's t.

Estimators and confidence regions from profile likelihood
The ML estimators are found by maximizing the full ln L(µ, θ, σ 2 u ) with respect to all of the parameters, which is equivalent to maximizing the profile likelihood with respect to µ and θ. In this way the statistical uncertainties due to both the estimated biases u i as well as their estimated variances v i are incorporated into the variances of the estimators for the parameters of interestμ.
Consider for example the case of a single parameter of interest µ. Having found the estimatorμ, one could quantify its statistical precision by using the standard deviation σμ. The covariance matrix for all of the estimated parameters can to first approximation be found from the inverse of the matrix of second derivatives of ln L (see, e.g., Refs. [8,9]). From this we extract the variance of the estimator of the parameter of interest µ, i.e., V [μ] = σ 2 µ . The presence of the nuisance parameters in the model will in general inflate σμ, which reflects the corresponding systematic uncertainties.
But σμ is by construction a property of the model and not of a particular data set. One may want, however, to report a measure of uncertainty along with the estimateμ that reflects the extent to which the data values are consistent with the hypothesized model, and therefore σμ is not suitable for this purpose. We will show below, however, that a confidence region can be constructed that has this desired property.
In general to find a confidence region (or for a single parameter a confidence interval) one tests all values of µ with a test of size α for some fixed probability α. Those values of µ that are not rejected by the test constitute a confidence region with confidence level 1−α. To determine the critical region of the test of a given µ one can use a test statistic based on the profile likelihood ratio The critical region of a test of µ corresponds to the region of data space having probability content α with maximal t µ . Equivalently, the p-value of a hypothesized point in parameter space µ is where t µ,obs is the observed value of t µ and F is the cumulative distribution of t µ . That is, we define the region of data space even less compatible with the hypothesis than what was observed to correspond to t µ > t µ,obs . The boundary of the confidence region corresponds to the values of µ where p µ = α. Solving Eq. (24) for the test statistic gives where here t µ refers to the value observed, and F −1 is the quantile of t µ . The statistic t µ is also defined in terms of the likelihood through Eqs. (1) and (23), and by using p µ = α one finds that the boundary of the confidence region is given by To find the p-values and thus determine the boundary of the confidence region one needs the distribution f (t µ |µ, θ, σ 2 u ). According to Wilks' theorem [10], for M parameters of interest µ = (µ 1 , . . . , µ M ) the statistic t µ should follow a chi-squared distribution for M degrees of freedom in the asymptotic limit, which here corresponds to the case where the distributions of all ML estimators are Gaussian. To the extent that this approximation holds we may identify the quantile If it is further assumed that the log-likelihood can be well approximated by a quadratic function about its maximum, then one finds asymptotically (see, e.g., Ref. [11]) that where V ij = cov[μ i ,μ j ] is the covariance matrix for the parameters of interest. This equation says that the confidence region is a hyper-ellipsoid of fixed size centred aboutμ . For example, for a single parameter µ one finds that the endpoints give the central confidence interval with confidence level 1 − α. For a probability content corresponding to plus or minus one standard deviation about the centre of a Gaussian, i.e., 1 − α = 68.3%, one has F −1 (1 − α) = 1, which gives the well-known result that the interval of plus or minus one standard deviation about the estimate is asymptotically a 68.3% CL central confidence interval.
The relations (27) and (28) depend, however, on a quadratic approximation of the log-likelihood. In the model where the σ u are treated as adjustable, the profile log-likelihood is given by Eq. (18), which contains terms that are logarithmic in (u i − θ i ) 2 , and not just the quadratic terms that appear in Eq. (3). As a result the relation (27) is only a good approximation in the limit of small r i , which is not always valid in the present problem.
We can nevertheless use Eq. (26) assuming a chisquared distribution for t µ as a first approximation for confidence regions. We will see in the examples below that these have interesting properties that already capture the most important features of the model. If higher accuracy is required then Monte Carlo methods can be used to determine the distribution of t µ . Alternatively we can modify the statistic so that its distribution is closer to the asymptotic form; this is explored further in Sec. 5.1.

Single-measurement model
To investigate the the asymptotic properties of the profile likelihood ratio it is useful to examine a simple model with a single measured value y following a Gaussian with mean µ and standard deviation σ. The parameter of interest is µ and we treat the variance σ 2 as a nuisance parameter, which is constrained by an independent gammadistributed estimate v. Thus the likelihood is given by As before we set the parameters α and β of the gamma distribution so that E[v] = σ 2 and so that from Eq. (9) the standard deviation of v is σ v = 2rσ 2 , where r characterizes the relative error on the error. This gives The goal is to construct a confidence interval for µ by using the profile likelihood ratio The log-likelihood is where C represents constants that do not depend on µ or σ 2 . From this we find the required estimatorŝ With these ingredients we find the following simple expression for the statistic t µ = −2 ln λ(µ), According to Wilks' theorem [10], the distribution f (t µ |µ) should, in the large-sample limit, be chi-squared for one degree of freedom. The large-sample limit corresponds to the situation where estimators for the parameters become Gaussian, which in this case means r ≪ 1.
The behaviour of the distribution of t µ for nonzero r is illustrated in Fig. 2, which shows the distributions from data generated according to a Gaussian of mean µ = 0, standard deviation σ = 1 and values of r = 0.01, 0.2, 0.4 and 0.6. The case of r = 0.01 approximates the situation where the relative uncertainty on σ is negligibly small. One can see that greater values of r lead to an increasing departure of the distribution from the asymptotic form.
Depending on the size of the test being carried out or equivalently the confidence level of the interval, one may find that the asymptotic approximation is inadequate. In such a case one may wish to use the Monte Carlo simulation to determine the distribution of the test statistic. Alternatively one can modify the statistic so that its distribution is better approximated by the asymptotic form, as described in the following section.

Bartlett correction for profile likelihood-ratio statistic
The likelihood-ratio statistic can be modified so as to follow more closely a chi-square distribution using a type of correction due to Bartlett [12,13,14]. This method has received some limited notice in Particle Physics [15] but has not been widely used in that field. The basic idea is to determine the mean value E[t µ ] of the original statistic. In the asymptotic limit, this should be equal to the number of degrees of freedom n d of the chi-square distribution, which in this example is n d = 1. One then defines a modified statistic so that by construction E[t ′ µ ] = n d . It was shown by Lawley [16] that the modified statistic approaches the reference chi-squared distribution with a difference of order n −3/2 , where here the effective sample size n is related to the parameter r by n = 1 + 1/2r 2 (cf. Eqs. (6) and (10)).
One could in principle find the expectation value E[t µ ] by the Monte Carlo method. But for the method to be convenient to use one would like to determine the Bartlett correction without resorting to simulation. By expanding the expectation value as a Taylor series in r one finds where the coefficient of the r 4 term is found numerically to be c ≈ 2 with an accuracy of around 10%. Dividing t µ /n d (here with n d = 1) from Eq. (23) by E[t µ ] to obtain the Bartlett-corrected statistic therefore gives In more complex problems one may not have a simple expression for the expectation value needed in the Bartlett correction and calculation by Monte Carlo may be necessary.
Distributions of t ′ µ are shown in Fig. 3 along with Monte Carlo distributions. As can be seen by comparing the uncorrected distributions from Figs. 2 to those in Fig. 3, the Bartlett correction is clearly very effective, as is needed when the parameter r is large.

Confidence intervals for the single-measurement model
In the simple model explored in this section one can use the measured values of y and v to construct a confidence interval for the parameter of interest µ. The probability that the interval includes the true value of µ (the coverage probability) can then be studied as a function of the relative error on the error r. What emerges is that the interval based on the chi-squared distribution of t µ has a coverage probability substantially less than the nominal confidence level, but that this can be greatly improved by use of the Bartlett-corrected interval.
To derive exact confidence intervals for µ we can use the fact that follows a Student's t distribution for ν = 1/2r 2 degrees of freedom (see, e.g., Ref. [7]). From the distribution of z one can find the corresponding pdf of but in fact this is not directly needed. Rather we can use the pdf of z to find confidence intervals for µ from the fact that a critical region defined by t µ > t c is equivalent to the corresponding region of z given by z < −z c and z > z c where the boundaries of the critical regions in the two variables are related by Eq. (43). Equivalently one can say that the p-value of a hypothesized value of µ is the probability, assuming µ, to find z further from zero than what was observed, i.e., where F (z; ν) is the cumulative Student's t distribution for ν = 1/2r 2 degrees of freedom. The boundaries of the confidence interval at confidence level CL = 1 − α (here α refers to the size of the statistical test, not the parameter α in the gamma distribution) are found by setting p µ = α and solving for µ, which gives the upper and lower limits Here z α/2 is the α/2 upper quantile of the Student's t distribution, i.e., the value of z obs needed in Eq. (44) to have p µ = α.
If one were to assume that the statistic t µ follows the asymptotic chi-squared distribution, then z α/2 is replaced by Here Q α = F −1 (1−α) is obtained from the quantile of the chi-squared distribution for one degree of freedom. And if the Bartlett-corrected statistic t ′ µ is used to construct the interval, then the Q α in Eq. (46) is replaced by Q α E[t µ ], where E[t µ ] = 1 + 3r 2 + 2r 4 is the expectation value of t µ from Eq. (40). The half-width of the interval measured in units of the estimated standard deviation √ v, i.e., z α/2 or z a , are shown in Fig. 4(a) as a function of the r parameter.
The probability P c for the confidence interval to cover the true value of µ is by construction equal to 1 − α for the exact confidence interval. For the interval based on the asymptotic distribution of the test statistic this is where F χ 2 1 is the cumulative chi-squared distribution for one degree of freedom and z a is given by Eq. (46), with Q α replaced by Q α E[t µ ] for the Bartlett-corrected case.
The interval half-widths and coverage probabilities based on t µ and t ′ µ are shown in Figs. 4. As can be seen, the interval based on the Bartlett-corrected statistic is very close to the exact one, and its coverage is close to the nominal 1 − α for relevant values of r.

Least-squares fitting and averaging measurements
An important application of the model described in Sec. 3 is the least-squares fit of a curve, or as a special case of this, the average of a set of measurements. Suppose the data consist of N independent Gaussian distributed values y i , with mean and variance Here the nuisance parameters θ i represent a potential bias or offset. The function ϕ(x i ; µ) plus the bias θ i gives the mean of y i as a function of a control variable x, and it depends on a set of M parameters of interest µ = (µ 1 , . . . , µ M ). That is, the probability P (y|ϕ, θ) in Eq. (2) becomes As before suppose the nuisance parameters θ i are constrained by N corresponding independent Gaussian measurements u i , with mean and variance Often the best estimates of a potential bias θ i will be u i = 0 for the actual measurement, but formally the u i are treated as random variables that would fluctuate upon repetition of the experiment. Therefore the full loglikelihood or equivalently −2 ln L(µ, θ) is up to an additive constant given by (53) That is, if we consider the σ ui as known, then maximumlikelihood estimators are obtained by the minimum of the sum of squares (53) which is the usual formulation of the method of least squares. The next step will be to treat the σ ui as adjustable parameters but before doing this is it interesting to note that by profiling over the nuisance parameters θ i , one finds the profile likelihood That is, the same result is obtained by using the usual method of least squares with statistical and systematic errors added in quadrature. This procedure gives the Best Linear Unbiased Estimator (BLUE), which is widely used in Particle Physics, particularly for the problem of averaging a set of measurements as described in Refs. [17,18,19,20].
Returning to the full dependence on µ and θ and following the model of Sec. 3 we now regard the systematic variances σ 2 ui as free parameters for which we have independent gamma distributed estimates v i , with parameters α i and β i set by σ 2 ui and r i according to Eqs. (10) and (11). The log-likelihood profiled over the σ 2 ui is (cf. Eq. (18)), To find the required estimators we need to solve the system of equations ∂ ln L ′ ∂θ i = 0 , i = 1, . . . , N.
Equation (57) results in where here ϕ i = ϕ(x i ; µ). Simultaneously solving all M + N equations for µ and the θ gives their ML estimators. Solving for the θ i for fixed µ, i.e., fixed ϕ i , gives the profiled valuesθ i . Equations (58) are cubic in θ i and so can be solved in closed form giving either one or three real roots. In the case of three roots, the one is chosen that maximizes ln L ′ . Using the profile log-likelihood from Eq. (55) one can use, for example, the test statistic t µ defined in Eq. (23) to find confidence regions for µ following the general procedure outlined in Sec. 4. Examples of this will be shown in Sec. 6.2.

Goodness of fit
In the usual method of least squares, the minimized sum of squares χ 2 min = χ 2 (μ) based on Eq. (54) is often used to quantify the goodness-of-fit. Because it is constructed as a sum of squares of Gaussian distributed quantities, one can show (see, e.g., Ref. [11]) that its sampling distribution is chi-squared for N −M degrees of freedom, and the p-value of the hypothesis that the true model lies somewhere in the parameter space of µ is thus When using the gamma error model presented above, the quantity −2 ln L ′ (µ, θ) is no longer a simple sum of squares. Nevertheless one can construct the statistic that will play the same role as the minimized χ 2 (µ) by considering the model in which the means ϕ(x i , µ), which depend on the M parameters of interest µ, are replaced by a vector of N independent mean values, one for each of the measurements: ϕ = (ϕ 1 , . . . , ϕ N ). By requiring that the ϕ i are given by ϕ(x i , µ) one imposes N − M constraints and restricts the more general hypothesis to an M -dimensional subspace. One can then construct the likelihood ratio statistic where the numerator contains the M fitted parameters of interestμ, and in the denominator one fits all N of the ϕ i . When fitting separate values of ϕ i and θ i for each measurement (the "saturated model"), one can see from inspection that the maximized value of ln L ′ (ϕ, θ) is zero, and therefore the statistic q becomes According to Wilks' theorem [10], in the limit where the estimatorsμ andθ are Gaussian distributed, q will follow a chi-squared pdf for N − M degrees of freedom. The statistic q thus plays the same role as the minimized sum of squares χ 2 min in the usual method of least squares. In the case of Eq. (61), however, the chi-squared approximation is not exact. One can see this from the fact that the v i are gamma rather than Gaussian distributed; the Gaussian approximation holds only in the limit where the r i are sufficiently small.
If all r i → 0, i.e., there is no uncertainty in the reported systematic errors, then the statistic q reduces to the minimized sum of squares from the method of least squares or BLUE, namely, One can check in an example that the sampling distribution of q follows a chi-squared distribution by generating measured values y i , u i , and s i according to the model described in Sec. 2 using the following parameter values: ϕ i = µ = 10, σ yi = 1, σ ui = 1 for all i = 1, . . . , N . That is, the measurements are assumed to have the same mean µ and the goal is to fit this parameter. The resulting distributions of q are shown in Figs. 5(a) and (b) for N = 2 and N = 5 using r i = 0.2 for all measurements. Overlayed on the histograms is the chi-squared pdf for N − 1 degrees of freedom. Although the agreement is reasonably good there is still a noticeable departure from the asymptotic distribution in the tails. The same set of curves is shown in Figs. 5(c) and (d) for r i = 0.4, for which one sees an even greater discrepancy between the true (i.e., simulated) and asymptotic distributions.
One might need a p-value with an accuracy such that assumption of the asymptotic distribution of q is not adequate. In such a case one can use Monte Carlo to determine the correct sampling distribution of q. Alternatively, following the procedure of Sec. 5.1 one can define a Bartlett-corrected statistic q ′ as so that by construction E[q ′ ] = N − M (in the example above for a single fitted parameter M = 1). Distributions of q ′ corresponding to Fig. 5 are shown in Fig. 6, where the mean value E[q] was itself found from Monte Carlo simulation. While one sees that the distributions of q ′ are in better agreement with the Monte Carlo, visible discrepancies remain. And since here simulation was required to determine the Bartlett correction, one could use it as well to find the p-value directly. The Bartlett correction is nevertheless useful in such a situation because the number of simulated values of q required to estimate accurately E[q] may be much less than what one needs to find the upper tail area for a very high observed value of the test statistic.

Averaging measurements
An important special case of a least-squares fit is the average of N measured quantities, i.e., the fit function ϕ(x; µ) = µ is in effect a horizontal line and the control variable x does not enter. The expectation values of the measurements are thus where the parameter of interest µ represents the desired mean value and as before θ i are the bias parameters. As there is one parameter of interest, the statistic q follows asymptotically a chi-squared distribution for N degrees of freedom, although as we have seen above this approximation breaks down as the r i increase.
As an example, consider the average of two quantities, nominally reported as y i ± σ yi ± s i for i = 1, 2, in which the σ yi represent the statistical errors and s i are the estimated systematic errors. Suppose here these are σ yi = 1 and s i = 1 for both measurements, and that the analyst reports values r i representing the relative accuracy of the estimates of the systematic errors, which in this example we will take to be equal to a common value r. Furthermore suppose that the observed values of y 1 and y 2 are 10 + δ and 10 − δ, respectively, and we will allow δ to vary. For the errors chosen in this example, the value of δ corresponds to the significance of the discrepancy between y 1 and y 2 in standard deviations under assumption of r = 0. Using the input values described above, the mean µ, bias parameters θ i , and systematic errors σ ui are adjusted to maximize the log-likelihood from Eq. (16). Figures 7 show the half-width of the 68.3% confidence interval for µ as a function of the parameter r for different levels of δ. This interval corresponds to the standard deviation σμ when the r i are all small, where the problem is the same as in least squares or BLUE.
In Fig. 7(a), the interval is based on Eq. (26), i.e., it is determined by the point where the profile log-likelihood drops by a fixed amount from its maximum (in Particle Physics often referred to as the "minos" interval [21]). In Fig. 7(b), the interval is found by solving for the value of µ where its p-value is p µ = α, and here α = 1 − 0.683 = 0.317. The p-value depends, however, on the assumed values of the nuisance parameters. Here we use the values of θ i and σ 2 ui profiled at the value of µ tested. This technique is often called "profile construction" in Particle Physics [22], where it is widely used, and elsewhere called "hybrid resampling" [23,24]. The resulting confidence interval will have the correct coverage probability of 1 − α if the nuisance parameters are equal to their profiled values; elsewhere the interval could underor over-cover. Although the intervals from profile construction differ somewhat from those found directly on the loglikelihood, they have the same qualitative behaviour.
From Fig. 7 one can extract several interesting features. First, if r is small, that is, the systematic errors σ ui are very close to their estimated values s i , then the interval's half-length is very close to the standard deviation of the estimator, σμ = 1, regardless of the level of discrepancy between the two measured values.
Further, the effect of larger values of r is seen to depend very much on the level of discrepancy between the measured values. If y 1 or y 2 are very close (e.g., δ = 0 or 1), then the length of the confidence interval can even be reduced relative to the case of r = 0. If the measurements are in agreement at a level that is better than expected, given the reported statistical and systematic uncertainties, then one finds that the likelihood is maximized for values of the systematic errors σ ui that are smaller than the initially estimated s i . And as a consequence, the confidence interval for µ shrinks.
Finally, one can see that if the data are increasingly inconsistent, e.g., in Fig. 7 for δ ≥ 4, then the effect of allowing higher r is to increase the length of the interval. This is also a natural consequence of the assumed model, whereby an observed level of heterogeneity greater than what was initially estimated results in maximizing the likelihood for larger values of σ ui and consequently an increased confidence interval size.
The coverage properties of the intervals for the average of two measurements example are investigated by generating data values y i for i = 1, 2 according to a Gaussian with a common mean µ (here 10) and the standard deviations both σ yi = 1, and the u i are generated according to a Gaussian distributed with mean of θ i = 0 and standard deviation σ ui = 1. The values v i are gamma distributed with parameters α i and β i given by Eqs. (10) and (11) so as to correspond σ ui = 1 and for different values of the parameters r i , taken here to be the same for both measurements. Figure 8 shows the coverage probability for the interval with nominal confidence level 68.3% based on the log-likelihood (the minos interval) and also using profile construction (hybrid resampling), as a function of the r parameter. As seen in the figure, the coverage probability approximates the nominal value reasonably well out to r = 0.5, where one finds P cov = 0.631 and 0.667 for minos and profile construction respectively; at r = 1, the corresponding values are 0.564 and 0.617 (the Monte Carlo statistical errors for all values is around 0.005). Thus reasonable agreement is found with both methods but one should be aware that the coverage probability may depart from the nominal value for large values of r.

Sensitivity to outliers
One of the important properties of the error model used in this paper is that curves fitted to data become less sensitive to points that depart significantly from the fitted curve (outliers) as the r parameter of the measurements is increased. This is a well-known feature of error models based on the Student's t distribution (see, e.g., Ref. [2]).
The reduced sensitivity to outliers is illustrated in Fig. 9 for the case of averaging five measurements of the same quantity (i.e., the fit of a horizontal line). All measured values are assigned statistical and systematic errors of 1.0, and in Figs. 9(a) and (c) they are all fairly close to the central value of 10. In Figs. 9 (b) and (d) the middle point is at 20. In the top two plots, the r parameter for all measurements is taken to be r = 0.01, which is very close to what would be obtained with an ordinary least-squares fit. In (a) the average is 10; in (b) the outlier causes the fitted mean to move to 12.00. In both cases the half-width of the confidence interval is 0.63.
In the lower two plots, (c) and (d), all of the points are assigned r = 0.2, i.e., a 20% relative error on the systematic error. In the case with no outlier, (c), the estimated mean stays at 10.00, and the half-width of the confidence interval only increases a small amount to 0.65. With the outlier in (d), the fitted mean is 10.75 with an interval half-width of 0.78. That is, the amount by which the outlier pulls the estimated mean away from the value preferred by the other points (10.00) is substantially less than with r = 0.01, (fitted mean 12.00). Furthermore, the lower compatibility between the measurements results in a confidence interval that is larger than without the outlier (half-width 0.78 rather than 0.65). For small r, however, the interval size is independent of the goodness of fit. Both the increase in the size of the confidence interval and the decrease in sensitivity to the outlier represent important improvements in the inference.
One might think that the error on the error could be adequately taken into account by using the ordinary least- squares approach, i.e., the log-likelihood of Eq. (53), and simply making the replacement σ ui → σ ui (1 + r i ). In the example shown above with all r i = 0.2, however, the result isμ = 10.00 ± 0.70 without the outlier (middle data point at 10) andμ = 12.00 ± 0.70 if the middle point is moved to 20. So by inflating the systematic errors but still using least squares, one increases the size of the confidence interval by an amount that does not depend on the goodness of fit and the sensitivity to outliers is not improved.

Treatment of correlated uncertainties
A model may contain two nuisance parameters θ i and θ j that are better represented by a single one θ i/j , so that the number of independent parameters is reduced by one. This can happen, for example, if θ i and θ j represent systematic biases of measured values y i and y j , respectively, but the origin of the bias is the same for both.
Having such a common source of bias is often referred to as a correlated systematic uncertainty. This would be the natural term to use in a Bayesian approach, where one associates a probability with the parameters and thus can talk about their correlation. In the present use of a likelihood, however, the term "correlated" is not fully appropriate, because in the frequentist approach one does not associate a probability (or a correlation) with the parameters themselves, but rather only with measurements such as the y i , u i or v i . The control measurements u i could in principle be correlated but this is often not the case.
In general nuisance parameters may enter the model in particular combinations and one may want to replace only certain terms within these combinations by common parameters. For example, for an individual measurement y i there can be more than one source of potential bias, e.g., for M sources one may write θ i as where θ ij represents the bias from source j of the measurement y i . By identifying those components that are common amongst some subset of the measurements one obtains biases that are in part common, which may be referred to loosely as "partially correlated systematic uncertainties". Correlated uncertainties in the Bayesian approach would correspond to having a joint prior pdf for the nuisance parameters θ in which the components are not independent. It could, for example, be a multivariate Gaussian with covariance matrix U ij = cov[θ i , θ j ], which would would report as the "systematic covariance matrix". Often such an object is used in BLUE averaging in which one replaces the covariance matrix of the measurements V = cov[y i , y j ] (the statistical covariance) by the sum V + U . This is the generalization of summing statistical and systematic errors in quadrature for the case of correlations.
The frequentist procedure based on the likelihood can in principle be generalized to cope with a set of control measurements for which there is a nondiagonal covariance matrix U ij = cov[u i , u j ] = ρ ij σ ui σ uj , but one would need to include not only the uncertainties in the variances σ ui but also in the corresponding correlation coefficients ρ ij . This requires not only a set of control measurements v = (v 1 , . . . , v N ) to constrain the systematic variances, but also a further set to constrain their correlations. This approach, not pursued further here, would require some care to ensure that the adjustable parameters in the covariance matrix are confined to a region where it is well defined (positive semi-definite).
The method proposed here supposes independent control measurements whose variances are treated as free parameters. Partial correlations in the sense described above are included in that one may be able to identify components of the nuisance parameters that are common and thus identified as being the same parameter.

Discussion and conclusions
The statistical model proposed here can be applied in a wide variety of analyses where the standard deviations of Gaussian measurements are deemed to have a given relative uncertainty, reflected by the parameters r i defined in Eq. (9). The quadratic constraint terms connecting control measurements to their corresponding nuisance parameters that appear in the log-likelihood are replaced by logarithmic terms (cf. Eqs. (3) and (18)). The resulting model is equivalent to taking a Student's t distribution for the control measurements, with the number of degrees of freedom given by ν = 1/2r 2 .
It is not uncommon for systematic errors, especially those related to theoretical uncertainties, to be uncertain themselves to several tens of percent. The model presented here allows such uncertainties to be taken into account and it has been shown that this has interesting and useful consequences for the resulting inference. Confidence intervals are generally increased in size, and the increase acquires a dependence on the goodness of fit. Averages and fitted curves become less sensitive to outliers.
If the relative errors on the systematic uncertainties are large enough (r greater than around 0.2 in the examples studied), then the sampling distribution of likelihoodratio test statistics starts to depart from the asymptotic chi-squared form. Thus one cannot in general apply asymptotic results for p-values and confidence intervals without taking some care to ensure their validity. In some cases Bartlett-corrected statistics can be used; alternatively one may need to determine the relevant distributions by Monte Carlo simulation.
The point of view taken here has been that the analyst must determine reasonable values for the relative uncertainties in the systematic errors. Even in cases where these errors have perhaps in a first analysis been treated as constants, it could be useful to try inserting reasonable values for r to see what effect this has on the sensitivity of the analysis to parameters of interest.
An alternative mentioned here as a possibility would be to fit a common relative uncertainty to all systematic errors (a global r), e.g., when averaging a set of numbers for which no r values have been reported. This is analogous to the scale-factor procedure used by the Particle Data Group [9] or the method of DerSimonian and Laird [27] widely used in meta-analysis. Note, however, that in arriving at the log-likelihood (18), a number of terms dependent on the r i were dropped, as they were considered fixed constants. If the r i are adjustable parameters then these terms, given in App. B, must be retained in the loglikelihood.
A Exact relation between the r parameter and the relative error on the error The parameter r was defined in Eq. (9) as where we drop the subscript i as we are focusing on a single measurement.
Here v, the estimate of a variance σ 2 u , is assumed to follow a gamma distribution with expectation value E[v] = σ 2 u . The physicist is more likely to work with the estimated standard deviation rather than the variance, i.e., with s = √ v. From error propagation we have that the standard deviation of s is If we approximate the E[s] ≈ (E[v]) 1/2 = σ u , then the relative uncertainty on the standard deviation is Equation (68), based on linear error propagation, holds to the extent that the nonlinearity of v = s 2 is not large over the range v = E[v] ± σ v . For sufficiently large r, however, this assumption will break down and one can no longer interpret r as a relative error on the error. Starting from a gamma distribution (5) with parameters α and β for the distribution of v, the pdf of s = √ v is given by g(s|α, β) = dv ds f (v(s)|α, β) = 2β α Γ (α) s 2α−1 e −βs 2 , (69) where α = 1/4r 2 and β = α/σ 2 u . This is a special case of the Nakagami distribution [25,26], which has mean and variance (71) The exact relative uncertainty in the standard deviation is which is shown in Fig. 10. For example, r = 1 gives r s = 1.09. Thus for relevant values of r one can safely approximate r s ≈ r.
By rearranging terms the profile likelihood can be written (cf. Eq. (18)) ln L ′ (µ, θ) = ln P (y|µ, θ) (75) does not depend on any of the adjustable parameters of the problem and thus can be dropped. If, however, one were to treat the r i as free parameters then C, or at least those terms depending on the r i , must be retained.