1 Introduction

In experimental sciences such as Particle Physics one collects data, here denoted by \(\varvec{y}\), and seeks to make inferences about a hypothesis H that defines the probability distribution for the data, \(P(\varvec{y}|H)\). Often \(P(\varvec{y}|H)\) is indexed by a set of parameters of interest \(\varvec{\mu }\) and by a set of nuisance parameters \(\varvec{\theta }\), thus \(P(\varvec{y}|H)=P(\varvec{y}|\varvec{\mu }, \varvec{\theta })\). The parameters of interest are the main objective of the analysis, whereas nuisance parameters are often introduced to account for systematic uncertainties in the model.

We focus here on frequentist tests of the hypothesized parameters that use a test statistic derived from the likelihood function \(L(\varvec{\mu }, \varvec{\theta })=P(\varvec{y}|\varvec{\mu }, \varvec{\theta })\). These tests lead to confidence intervals or regions for the parameters of interest as well as p values that quantify goodness of fit. To find these results, one requires the sampling distribution of test statistics that are obtained from the likelihood function and are described in greater detail below. For appropriately defined test statistics, the corresponding distributions can often be found using asymptotic results based on theorems due to Wilks [1] and Wald [2] (see, e.g., [3, 4]). The asymptotic distributions are valid in specific limits, which usually correspond to having a large data sample, whose size we will denote generically by n.

In this paper we are interested specifically in the case where n is not sufficiently large for the asymptotic distributions of the relevant test statistics to represent a good approximation. In such problems one could use Monte Carlo methods to obtain the distributions, but this involves additional time-consuming computation. Instead, one can modify the test statistic using higher-order asymptotic methods, specifically the \(r^*\) statistic of Barndorff-Nielsen [5] and the Bartlett correction [6], as described, e.g., in [7, 8]. With these methods, the distribution of the modified statistic becomes closer to the asymptotic form, allowing one to find confidence intervals and p values without use of Monte Carlo.

In this paper we consider applications of higher-order asymptotic methods to the Gamma Variance Model (GVM), which was proposed in Ref. [9]. In the GVM, measured values are modeled as following Gaussian distributions with a mean that depends on the parameters of the problem, and with variances \(\sigma ^2\) whose values are themselves not certain. The variances as well are thus taken as adjustable parameters, and the values one would assign to them are treated as measurements that follow a gamma distribution with parameters \(\alpha \) and \(\beta \) (see Sect. 4 below). These parameters are assigned so that the gamma distribution’s relative width reflects the desired uncertainty on \(\sigma ^2\). This is quantified using the quantity \(\varepsilon = 1/(2\sqrt{\alpha })\), which to first approximation is the relative uncertainty on the estimate of the standard deviation \(\sigma \), informally referred to as the “error on the error”.

In the Gamma Variance Model there is a correspondence between the error-on-error parameters \(\varepsilon \) and an effective sample size n of

$$\begin{aligned} n = 1 + \frac{1}{2 \varepsilon ^2} . \end{aligned}$$
(1)

That is, the large-sample limit corresponds to the case where \(\varepsilon \rightarrow 0\) and thus the values of \(\sigma \) are accurately estimated. For many analyses, however, the assigned values of standard deviations for individual measurements may easily be uncertain at the level of several tens of percent or more. In this case the effective sample size is low and thus the asymptotic distributions of likelihood-based test statistics are not necessarily valid. The goal of this paper is to apply higher-order asymptotics to this model and thus achieve more accurate confidence levels and p values.

In Sect. 2 we briefly review the basic techniques for finding confidence intervals and p values in a general likelihood-based analysis and Sect. 3 describes how these techniques can be improved using higher-order asymptotics. In Sect. 4 we recall the important properties of the Gamma Variance Model, and then we explore use of higher-order asymptotic corrections to three specific realizations of the model: in Sect. 5 we apply corrections to a simple example of the GVM based on a single measurement, in Sect. 6 to an average of measurements and in Sect. 7 to an average of measured values that includes control measurements to constrain nuisance parameters. A summary and conclusions are given in Sect. 8.

2 Parameter inference using the profile likelihood ratio

In this section we review the basic technology used to find confidence intervals and p values from test statistics derived from the likelihood ratio and the likelihood root by using the first-order asymptotic distributions based on Wilks’ theorem. Further details on these methods as applied in Particle Physics analyses can be found, e.g., in Ref. [3].

In statistical data analysis, the central object needed to carry out inference related to the parameters of interest \(\varvec{\mu }\) using measured data \(\varvec{y}\) is the likelihood function: \(L(\varvec{\mu }, \varvec{\theta }) = P(\varvec{y}|\varvec{\mu }, \varvec{\theta })\). Suppose there are M parameters of interest \(\varvec{\mu } = (\mu _1, \ldots , \mu _M)\) and N nuisance parameters \(\varvec{\theta } = (\theta _1, \ldots , \theta _N)\), which are introduced to account for systematic uncertainties. In frequentist statistics, a test of hypothesized parameter values can be carried out by defining a test statistic based on the (profile) likelihood ratio

$$\begin{aligned} w_{\varvec{\mu }} = -2\log \frac{L(\varvec{\mu }, \hat{\hat{\varvec{\theta }}}(\varvec{\mu }))}{L(\hat{\varvec{\mu }}, \hat{\varvec{\theta }})} = 2\left[ \ell (\hat{\varvec{\mu }}, \hat{\varvec{\theta }}) - \ell (\varvec{\mu }, \hat{\hat{\varvec{\theta }}}(\varvec{\mu })) \right] . \end{aligned}$$
(2)

Here \(\hat{\varvec{\mu }}\) and \(\hat{\varvec{\theta }}\) are the Maximum Likelihood Estimators (MLEs) for the parameters of interest and the nuisance parameters, respectively, and \(\hat{\hat{\varvec{\theta }}}(\varvec{\mu })\) are the profiled (or constrained) estimators of the nuisance parameters, given by the values of \(\varvec{\theta }\) that maximize the likelihood for a fixed value of \(\varvec{\mu }\) (here and below we use \(\ell \) to denote the log-likelihood). The likelihood ratio is used to test the compatibility of a value of \(\varvec{\mu }\) with the experimental data, with greater \(w_{\varvec{\mu }}\) corresponding to increasing incompatibility.

The likelihood ratio can be used to derive a confidence region for the parameters of interest \(\varvec{\mu }\) (or a confidence interval if there is just one parameter of interest) by computing the p value for a hypothesized value of \(\varvec{\mu }\),

$$\begin{aligned} p_{\varvec{\mu }} = \int _{w_{\varvec{\mu }, \text {obs}}}^{\infty }f(w_{\varvec{\mu }}|\varvec{\mu }, \varvec{\theta })dw_{\varvec{\mu }}= 1 - F[w_{\varvec{\mu }, \text {obs}}]. \end{aligned}$$
(3)

Here \(f(w_{\varvec{\mu }}|\varvec{\mu }, \varvec{\theta })\) is the probability density function of \(w_{\varvec{\mu }}\) under the hypothesis that \(\varvec{\mu }\) and \( \varvec{\theta }\) are the true parameters, \(w_{\varvec{\mu }, \text {obs}}\) is the observed value of the likelihood ratio, and F is the cumulative distribution of \(w_{\varvec{\mu }}\). The boundary of the confidence region for \(\varvec{\mu }\), with confidence level \(1-\alpha \), is found from the p value of Eq. (3) by solving \(p_{\varvec{\mu }} = \alpha \). This gives a region in parameter space that satisfies

$$\begin{aligned} \text {Prob}(\varvec{\mu } \in \text {confidence region})\ge 1-\alpha . \end{aligned}$$
(4)
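As a concrete illustration, the asymptotic evaluation of Eqs. (3) and (4) takes only a few lines. The sketch below (Python with scipy, the language used for all code sketches in this paper; the function names are ours, not from Ref. [3]) computes the p value from an observed value of the likelihood ratio and tests membership in the confidence region.

```python
from scipy.stats import chi2

def p_value_w(w_obs: float, M: int) -> float:
    """Asymptotic p value of Eq. (3): chi-square survival function with
    M degrees of freedom in place of the generally unknown f(w|mu, theta)."""
    return chi2.sf(w_obs, df=M)

def in_confidence_region(w_obs: float, M: int, alpha: float = 0.317) -> bool:
    """A tested mu lies inside the 1-alpha confidence region iff p_mu > alpha."""
    return p_value_w(w_obs, M) > alpha
```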

When the model has a single parameter of interest \(\mu \), it is possible to define another test statistic called the likelihood root as the square root of the likelihood ratio, multiplied by the sign of \(\hat{\mu }-\mu \):

$$\begin{aligned} r_\mu = \text {sign}(\hat{\mu }-\mu )\sqrt{w_\mu }. \end{aligned}$$
(5)

In contrast to the likelihood ratio, the likelihood root can be defined only when there is a single parameter of interest. However, while the likelihood ratio gives a two-sided test, the likelihood root allows for one-sided tests as well. The statistic \(r_\mu \) can be used to compute p values in the same way as for the likelihood ratio, using its density function \(f(r_\mu |\mu , \varvec{\theta })\).

In many realistic applications, finding the probability density functions \(f(w_{\varvec{\mu }}|\varvec{\mu }, \varvec{\theta })\) or \(f(r_\mu |\mu , \varvec{\theta })\) is a major challenge since they are usually not known in closed form. Monte Carlo simulations are often used to compute them, but this can be very time-consuming for complex models with many measurements.

It is possible, however, to avoid the numerical computation of \(f(w_{\varvec{\mu }})\) and \(f(r_\mu )\) in the asymptotic limit, in which all the MLEs of the model are Gaussian distributed. This limit is typically reached when the experimental sample size n approaches infinity, i.e., in the so-called large sample limit, where the MLEs have a Gaussian distribution with an error term of order \(\mathcal {O}(n^{-1/2})\). If this condition holds, \(w_{\varvec{\mu }}\) follows a chi-square distribution with M degrees of freedom (\(\chi _M^2\)),

$$\begin{aligned} w_{\varvec{\mu }} \sim \chi ^2_{M} + \mathcal {O}(n^{-1}), \end{aligned}$$
(6)

where M is the number of parameters of interest. In contrast, in the asymptotic limit \(r_\mu \) follows a normal distribution with mean 0 and standard deviation 1:

$$\begin{aligned} r_\mu \sim \mathcal {N}(0,1) + \mathcal {O}\left( n^{-1/2}\right) . \end{aligned}$$
(7)

As already noted, n generally represents the sample size of the experiment, but it can also be another parameter of the likelihood that controls the convergence of the likelihood to the asymptotic limit. It is important to note that the asymptotic distribution of \(r_\mu \) exhibits a larger error term compared to that of \(w_\mu \). This behavior reflects a trade-off for the flexibility of performing one-sided tests.

In the asymptotic limit, one can show that the profile log-likelihood is approximated by

$$\begin{aligned} \ell _p(\varvec{\mu }) \simeq \ell (\hat{\varvec{\mu }}, \hat{\varvec{\theta }}) - \frac{1}{2}\left( \hat{\varvec{\mu }}-\varvec{\mu }\right) ^T V^{-1} \left( \hat{\varvec{\mu }}-\varvec{\mu }\right) , \end{aligned}$$
(8)

where \(V_{ij}=\text {cov}[\hat{\mu }_i,\hat{\mu }_j]\). This can be found using the observed information matrix \(j_{ij}(\hat{\varvec{\psi }})\),

$$\begin{aligned} j_{ij}(\varvec{\hat{\psi }})=-\frac{\partial ^2\ell }{\partial \psi _i\partial \psi _j}\Bigr |_{\varvec{\hat{\psi }}}, \end{aligned}$$
(9)

where \(\varvec{\psi } = (\varvec{\mu }, \varvec{\theta })\) represents all of the parameters. The inverse of the matrix j gives the covariance of all the estimators, \(U_{ij} = \text {cov}[\hat{\psi }_i, \hat{\psi }_j] = (j^{-1}(\hat{\varvec{\psi }}))_{ij}\), from which the submatrix \(V_{ij}=\text {cov}[\hat{\mu }_i,\hat{\mu }_j]\) can be extracted. Under these approximations, the likelihood ratio is given by

$$\begin{aligned} w_{\varvec{\mu }}= (\hat{\varvec{\mu }}-\varvec{\mu })^T V^{-1}(\hat{\varvec{\mu }}-\varvec{\mu }). \end{aligned}$$
(10)

Deviations from the quadratic approximation of the profile likelihood, and hence from the asymptotic distributions of the statistics derived from it, are expected when the conditions of the asymptotic limit are not satisfied.
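The ingredients of Eqs. (8)–(10) are straightforward to obtain numerically. The following sketch is illustrative only: it assumes a user-supplied log-likelihood `ell` taking the full parameter vector (parameters of interest first) and computes the observed information of Eq. (9) by central finite differences, from which V and the quadratic form of Eq. (10) follow.

```python
import numpy as np

def observed_information(ell, psi_hat, h=1e-5):
    """Observed information j_ab = -d^2 ell / dpsi_a dpsi_b at the MLE, Eq. (9),
    computed with central finite differences; psi_hat is a numpy array."""
    k = len(psi_hat)
    j = np.empty((k, k))
    for a in range(k):
        for b in range(k):
            da = h * np.eye(k)[a]
            db = h * np.eye(k)[b]
            j[a, b] = -(ell(psi_hat + da + db) - ell(psi_hat + da - db)
                        - ell(psi_hat - da + db) + ell(psi_hat - da - db)) / (4 * h**2)
    return j

def quadratic_w(mu, mu_hat, V):
    """Quadratic approximation of the likelihood ratio, Eq. (10)."""
    d = np.atleast_1d(mu_hat) - np.atleast_1d(mu)
    return float(d @ np.linalg.inv(V) @ d)

# V = cov[mu_i, mu_j] is the MxM block of the inverse information:
# V = np.linalg.inv(observed_information(ell, psi_hat))[:M, :M]
```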

3 Higher-order asymptotic corrections

When the MLEs of the model parameters are not Gaussian distributed, which usually happens when the experimental sample size is small, the distributions of the statistics \(w_{\varvec{\mu }}\) and \(r_{\mu }\) deviate from their asymptotic forms. In such instances, there are two potential strategies to derive test statistics with known distributions: one approach is to refine the approximation of the test statistics’ distributions, and the other is to modify the test statistics themselves such that their distributions are more accurately approximated by the asymptotic formulae, even for small sample sizes. An example of the former is given in Ref. [10]; in this paper, we focus on the latter approach.

Specifically, this section explores two potential solutions, the \(r^*\) approximation [5, 11,12,13] and the Bartlett correction [6, 16]. The aim is to derive corrections for \(w_\mu \) and \(r_\mu \) such that the distributions of the refined statistics, denoted by \(w^*_\mu \) and \(r^*_\mu \), can be more precisely approximated by the asymptotic distributions outlined earlier, with error terms of order \(\mathcal {O}(n^{-3/2})\) or smaller. Moreover, these enhanced statistics are constructed such that \(P(S^*> S_\textrm{obs}^*) = P(S > S_\textrm{obs}) + \mathcal {O}(n^{-1/2})\), where S represents one of the original statistics and \(S^*\) corresponds to its improved, higher-order counterpart. This condition ensures that p values computed using the improved statistics are equivalent to those computed with the original ones, up to error terms of order \(n^{-1/2}\) or smaller.

3.1 The \(r^*\) approximation

The asymptotic distributions of the likelihood root and the likelihood ratio are derived from the assumption that the MLEs are Gaussian in the asymptotic limit. But this assumption is only valid up to error terms of order \(\mathcal {O}(n^{-1/2})\). A major development in likelihood-based inference has been to improve the approximation of the distributions of the MLEs. For models with a single parameter \(\mu \), the Barndorff-Nielsen \(p^*\) approximation [5, 11,12,13] is the basic higher-order approximation to the distribution of \(\hat{\mu }\):

$$\begin{aligned} f(\hat{\mu }) \simeq p^*(\hat{\mu }) \equiv c\,|j(\hat{\mu })|^{1/2}\,e^{-w_\mu /2}. \end{aligned}$$
(11)

Here \(w_{\mu }\) is the likelihood ratio as defined in Eq. (2), c is a normalization constant equal to \(1/\sqrt{2\pi }(1+\mathcal {O}(n^{-1}))\), and \(j = -\frac{\partial ^2 \ell }{\partial \mu ^2}\) represents the observed information. Notably, the error term on the \(p^*\) approximation is of order \(n^{-3/2}\). This represents a significant improvement when compared to the \(\mathcal {O}(n^{-1/2})\) error term associated with the first-order Gaussian approximation.

When the \(p^*\) approximation is expanded to \(\mathcal {O}(n^{-1/2})\) the Gaussian density of \(\hat{\mu }\) is recovered. At this order, the likelihood ratio \(w_{\mu }\) can be approximated using Eq. (10) as

$$\begin{aligned} w_\mu = (\hat{\mu }-\mu )^2|j(\hat{\mu })| + \mathcal {O}_p\left( n^{-1/2}\right) , \end{aligned}$$
(12)

where \(\mathcal {O}_p\) denotes convergence in probability. In this way the \(p^{*}\) approximation reduces to a Gaussian with mean \(\mu \) and variance \(|j(\hat{\mu })|^{-1}\):

$$\begin{aligned} f(\hat{\mu })=\frac{1}{\sqrt{2\pi }}|j(\hat{\mu })|^{1/2}e^{-\frac{(\hat{\mu }-\mu )^2}{2|j(\hat{\mu })|^{-1}}}+\mathcal {O}(n^{-1/2}), \end{aligned}$$
(13)

where j is a constant in \(\mu \) at order \(n^{-1/2}\).

The \(p^*\) approximation provides a way to modify the statistic \(r_\mu \) to reduce the error on its asymptotic distribution. In particular, through an integration of Eq. (11) (see [12]), it can be proved that the modified statistic

$$\begin{aligned} r_\mu ^*= r_\mu + \frac{1}{r_\mu }\log \frac{q_\mu }{r_\mu }\, \end{aligned}$$
(14)

follows a standard normal distribution with an error term of order \(n^{-3/2}\). Equation (14) for \(r_\mu ^*\) involves a correction term \(q_\mu \), whose exact definition is given later in this section. As the model approaches the asymptotic limit, this correction term converges towards \(r_\mu \), thereby leading \(r^*_\mu \) to approach \(r_\mu \). Conversely, when the model diverges from the asymptotic limit, this correction term modifies \(r_\mu \) so that the asymptotic distribution of \(r_\mu ^*\) has a reduced error term:

$$\begin{aligned} r_\mu ^*\sim \mathcal {N}(0,1) + \mathcal {O}\left( n^{-3/2}\right) . \end{aligned}$$
(15)

That is, the error term in the asymptotic distribution of \(r_\mu ^*\) falls off faster than that of the likelihood root (see Eq. (7)) by a factor of \(n^{-1}\). In addition, by squaring \(r_\mu ^*\) one obtains the statistic \((r_\mu ^{*})^2\), which is asymptotically distributed as chi-squared with one degree of freedom. This can be interpreted as a higher-order correction to the likelihood ratio statistic \(w_\mu \).

An intuitive interpretation of the \(r_\mu ^*\) statistic can be obtained, despite its non-trivial derivation (see, e.g., [7]). One can show that the new \(r_\mu ^*\) statistic is related to \(r_\mu \) by

$$\begin{aligned} r_\mu ^*= \frac{r_\mu - \text {E}[r_\mu ]}{\text {V}[r_\mu ]^{1/2}} + \mathcal {O}_p\left( n^{-3/2}\right) , \end{aligned}$$
(16)

where \(\text {E}[r_\mu ]\) and \(\text {V}[r_\mu ]\) are the expectation value and variance of \(r_\mu \). This equation says that, up to errors of order \(n^{-3/2}\), \(r_\mu ^*\) represents the standardized version of \(r_\mu \). Furthermore, this equation provides a method to approximately compute \(r_\mu ^*\) using MC to estimate \(\text {E}[r_\mu ]\) and \(\text {V}[r_\mu ]\) as an alternative to the analytical computation of \(r^*_\mu \).
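For example, a Monte Carlo implementation of Eq. (16) might look as follows; this is a sketch in which `sample_data` and `likelihood_root` are hypothetical user-supplied functions, not taken from Ref. [7].

```python
import numpy as np

def r_star_mc(r_obs, mu, theta, sample_data, likelihood_root,
              n_toys=10000, seed=1):
    """Approximate r*_mu via Eq. (16): standardize the observed likelihood
    root with MC estimates of E[r_mu] and V[r_mu] under (mu, theta)."""
    rng = np.random.default_rng(seed)
    r_toys = np.array([likelihood_root(sample_data(mu, theta, rng), mu)
                       for _ in range(n_toys)])
    return (r_obs - r_toys.mean()) / r_toys.std(ddof=1)
```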

The analytical computation of \(r^*_\mu \) requires the correction term \(q_{\mu }\), which can be found in different ways depending on the characteristics of the statistical model. Details can be found in [7]; here we summarize the main results. For models with one parameter of interest and no nuisance parameters, \(q_\mu \) can be found as

$$\begin{aligned} q_\mu = \left( \frac{\partial \ell }{\partial \hat{\mu }}\Bigr |_{\hat{\mu }}-\frac{\partial \ell }{\partial \hat{\mu }}\Bigr |_{\mu }\right) j(\hat{\mu })^{-1/2}, \end{aligned}$$
(17)

where the first derivative is evaluated at \(\mu =\hat{\mu }\) and the second at the value of \(\mu \) being tested. For statistical models that include nuisance parameters, if we assume that the full parameter space can be written as \(\varvec{\psi }=(\mu , \varvec{\theta })\), where \(\mu \) is the parameter of interest and \(\varvec{\theta }\) is a vector of N nuisance parameters, the correction term \(q_\mu \) can be found from

$$\begin{aligned} q_{\mu } = \frac{\text {det}\left[ \ell _{\hat{\varvec{\psi }}}(\hat{\mu }, \hat{\varvec{\theta }})-\ell _{\hat{\varvec{\psi }}}(\mu , \hat{\hat{\varvec{\theta }}}),\quad \ell _{\varvec{\theta }\hat{\varvec{\psi }}}(\mu , \hat{\hat{\varvec{\theta }}})\right] }{\text {det}\left[ \ell _{\varvec{\psi }\hat{\varvec{\psi }}}(\hat{\mu }, \hat{\varvec{\theta }})\right] } \left( \frac{\text {det}\left[ j_{\varvec{\psi }\varvec{\psi }}\left( \hat{\mu }, \hat{\varvec{\theta }}\right) \right] }{\text {det}\left[ j_{\varvec{\theta }\varvec{\theta }}\left( \mu , \hat{\hat{\varvec{\theta }}}\right) \right] }\right) ^{1/2}, \end{aligned}$$
(18)

where j is the information matrix

$$\begin{aligned} j_{\varvec{\psi }\varvec{\psi }}(\varvec{\psi }) = -\frac{\partial ^2\ell (\varvec{\psi })}{\partial \varvec{\psi }\partial \varvec{\psi }^T}. \end{aligned}$$
(19)

The subscripts on \(\ell \) indicate derivatives; for example, \(\ell _{\varvec{\psi }}\) is the gradient of \(\ell \) with respect to the parameters \(\varvec{\psi }\). The numerator and denominator of Eq. (18) contain determinants of \((N+1)\times (N+1)\) matrices, where \(N+1\) is the dimension of the full parameter space. The matrix in the numerator of the first factor has the \((N+1)\)-dimensional vector \(\ell _{\hat{\varvec{\psi }}}(\hat{\mu }, \hat{\varvec{\theta }})-\ell _{\hat{\varvec{\psi }}}(\mu , \hat{\hat{\varvec{\theta }}})\) as its first column, while the remaining \((N+1)\times N\) block is given by \(\ell _{\varvec{\theta }\hat{\varvec{\psi }}}(\mu , \hat{\hat{\varvec{\theta }}})\), i.e., the matrix of second derivatives of \(\ell \), with the nuisance parameters \(\varvec{\theta }\) as the first index and the full parameter vector \(\varvec{\psi }\) as the second.

Equations (17) and (18) contain derivatives of the likelihood with respect to the MLEs; therefore they require the observed data \(\varvec{y}\), or equivalently the likelihood, to be expressed as an explicit function of the MLEs, which is not always possible. In such cases, one can replace \(p^*\) by an alternative quantity \(p_\textrm{TEM}\), where TEM stands for Tangent Exponential Model (see [14]). This expression for the density function of \(\hat{\mu }\) exploits a local approximation of the likelihood with an exponential family model. The canonical parameters \(\phi \) of the approximating model are defined by

$$\begin{aligned} \varvec{\phi }^T (\varvec{\psi }, \varvec{y}_\textrm{obs}) = \sum _{i=1}^{n} \frac{\partial \ell }{\partial y_i}\Bigr |_{\varvec{y}_\textrm{obs}} V_i , \end{aligned}$$
(20)

and they can be used to derive an alternative expression for the correction term \(q_\mu \). Here \(\varvec{y}_\textrm{obs}\) is the n-dimensional vector of the observed data and \(V_i\) denotes the i-th row of V, an \(n \times (N+1)\) matrix defined as

$$\begin{aligned} V = - \left( \frac{\partial \varvec{z}}{\partial \varvec{y}^T}\right) ^{-1}\left( \frac{\partial \varvec{z}}{\partial \varvec{\psi }^T}\right) \Bigr |_{\hat{\varvec{\psi }}_\textrm{obs}}. \end{aligned}$$
(21)

In the last expression, \(\varvec{z} = (z_1(y_1), \ldots , z_n(y_n))\) is a vector of pivotal quantities. Pivotal quantities are transformations of the data that have a fixed distribution under the model, i.e., their distribution does not depend on the parameters of the model. Such a vector always exists in the form of the cumulative distributions \(F(y_i)\), which are uniformly distributed in [0, 1] for continuous data. But alternative choices are often available, e.g., for a Gaussian distributed random variable y with mean \(\mu \) and standard deviation \(\sigma \) one can define the pivotal quantity \(z = (y - \mu )/\sigma \), whose distribution is a standard normal for any \(\mu \) and \(\sigma \).

Using the canonical parameters \(\varvec{\phi }\), alternative equations for the correction term \(q_\mu \) can be derived. For models with one parameter of interest and no nuisance parameters the formula for \(q_\mu \) becomes

$$\begin{aligned} q_\mu = \left[ \phi (\hat{\mu })-\phi (\mu )\right] j(\hat{\mu })^{1/2}\left| \frac{\partial \phi }{\partial \mu }(\hat{\mu })\right| ^{-1}, \end{aligned}$$
(22)

whereas, for models that include nuisance parameters, \(q_\mu \) is given by

$$\begin{aligned} q_{\mu } = \frac{\text {det}\left[ \varvec{\phi }(\hat{\mu }, \hat{\varvec{\theta }})-\varvec{\phi }(\mu , \hat{\hat{\varvec{\theta }}}),\quad \varvec{\phi }_{\varvec{\theta }}(\mu , \hat{\hat{\varvec{\theta }}})\right] }{\text {det}\left[ \varvec{\phi }_{\varvec{\psi }}(\hat{\mu }, \hat{\varvec{\theta }})\right] }\left( \frac{\text {det}\left[ j_{\varvec{\psi }\varvec{\psi }}(\hat{\mu }, \hat{\varvec{\theta }})\right] }{\text {det}\left[ j_{\varvec{\theta }\varvec{\theta }}\left( \mu , \hat{\hat{\varvec{\theta }}}\right) \right] }\right) ^{1/2}. \end{aligned}$$
(23)

The definition of V given by Eq. (21) applies to the case where \(y_i\) are continuous variables. If instead they are discrete, V can be obtained from

$$\begin{aligned} V = \frac{d\,\text {E}\left[ \varvec{y}|\varvec{\psi }\right] }{d\varvec{\psi }^T}\Bigr |_{{\varvec{\psi }}={\hat{\varvec{\psi }}}}. \end{aligned}$$
(24)

In addition, in the definition of the canonical parameters \(\varvec{\phi }\) given by Eq. (20) the derivatives \(\partial \ell / \partial y_i\) are replaced by \(\partial \log (f_i)/\partial y_i\), where \(f_i(y_i)\) is the probability distribution of \(y_i\). More details on how to compute \(r^*_\mu \) for applications involving discrete data can be found in [15].

3.2 The Bartlett correction

A different approach to higher-order asymptotics due to Bartlett [6] involves a scaling of the likelihood ratio statistic, rather than a correction to the distributions of the MLEs. Bartlett’s argument is as follows. For a model with M parameters of interest \(\varvec{\mu }\), the likelihood ratio asymptotically follows a chi-square distribution with M degrees of freedom (\(\chi ^2_M\)), whose expectation value is M; in general, the expectation value can be written as

$$\begin{aligned} \text {E}[w_{\varvec{\mu }}] = M + b , \end{aligned}$$
(25)

where b is the correction to the asymptotic expectation value. The modified statistic

$$\begin{aligned} w_{\varvec{\mu }}^*= w_{\varvec{\mu }}\, \frac{M}{\text {E}[w_{\varvec{\mu }}]} \equiv \frac{w_{\varvec{\mu }}}{1+b/M} \end{aligned}$$
(26)

follows a distribution closer to the asymptotic \(\chi _M^2\). The quantity \(b = \text {E}[w_{\varvec{\mu }}] - M\) characterizes the size of the Bartlett correction.

In many realistic applications, b cannot be computed exactly. In such scenarios, two possible approaches exist. The first is to estimate \(\text {E}[w_{\varvec{\mu }}]\) using MC methods, while the second is to approximate it perturbatively using a result provided by Lawley.

Lawley [16] developed a general method to compute the expectation value up to \(\mathcal {O}(n^{-2})\), proving that all the cumulants of \(w_{\varvec{\mu }}^*\) match the cumulants of a \(\chi _M^2\) distribution up to this order. Specifically, Lawley’s formula is based on a quartic expansion of both the likelihood ratio and the score equation, \(\frac{\partial \ell }{\partial \varvec{\mu }}(\hat{\varvec{\mu }})=0\), in powers of \(\hat{\mu }_i-\mu _i\) (see, e.g., [8]), where i is an index running over the parameter space of the model. The two expansions can be combined to obtain an approximation of the expectation value

$$\begin{aligned} \text {E}[w_{\varvec{\mu }}] = 2\text {E}\left[ \ell (\hat{\varvec{\mu }})-\ell (\varvec{\mu })\right] = M + \epsilon _M + \mathcal {O}(n^{-2}), \end{aligned}$$
(27)

where \(\epsilon _M\) here represents the Bartlett correction factor b computed to order \(n^{-1}\) using the Lawley method. The correction term \(\epsilon _M\) has a complicated structure involving derivatives of the likelihood up to the fourth order and their expectation values. Nevertheless, for many applications, it is possible to compute it analytically. Specifically, for a model with M parameters of interest and without any nuisance parameters (the case with nuisance parameters will be considered later in this section) \(\epsilon _M\) can be written as

$$\begin{aligned} \epsilon _M = \sum _{rstu} \lambda _{rstu} - \sum _{rstuvw} \lambda _{rstuvw}, \end{aligned}$$
(28)

where the indices r, s, t, u, v, and w label the M parameters of the model (see, e.g., Ref. [8]). The two terms inside the sums are

$$\begin{aligned} \lambda _{rstu}&= k^{rs}k^{tu}\left( \frac{1}{4}k_{rstu}-k_{rst}^{(u)}+k_{rs}^{(tu)}\right) ,\\ \lambda _{rstuvw}&= k^{rs}k^{tu}k^{vw} \left( \frac{1}{6}k_{rtv}k_{suw}+\frac{1}{4}k_{rtu}k_{svw}-k_{rtv}k_{sw}^{(u)}-k_{rtu}k_{sw}^{(v)}+k_{rt}^{(v)}k_{sw}^{(u)}+k_{rt}^{(u)}k_{sw}^{(v)} \right) . \end{aligned}$$
(29)

The terms inside the above definitions can be computed as

$$\begin{aligned} \begin{aligned}&k_{rs} = E\left[ \frac{\partial ^2 \ell }{\partial \mu _r\partial \mu _s}\right] ,\\&k_{rst} = E\left[ \frac{\partial ^3\ell }{\partial \mu _r\partial \mu _s\partial \mu _t}\right] ,\\&k_{rstu} = E\left[ \frac{\partial ^4\ell }{\partial \mu _r\partial \mu _s\partial \mu _t\partial \mu _u}\right] , \end{aligned} \end{aligned}$$
(30)

and

$$\begin{aligned} \begin{aligned}&k_{rs}^{(t)} =\frac{\partial k_{rs}}{\partial \mu _t},\\&k_{rs}^{(tu)} =\frac{\partial ^2 k_{rs}}{\partial \mu _t\partial \mu _u},\\&k_{rst}^{(u)} =\frac{\partial k_{rst}}{\partial \mu _u}, \end{aligned} \end{aligned}$$
(31)

where the matrices with upper indices are the inverses of the corresponding matrices with lower indices. The general expression for the Bartlett correction is quite involved, but its computation is not conceptually complicated, as it only requires derivatives of the log-likelihood and their expectation values.

Often the parameters can be split into two subsets: parameters of interest \(\varvec{\mu } = (\mu _1,...,\, \mu _M)\), and nuisance parameters \(\varvec{\theta } = (\theta _1, ..., \, \theta _{N})\), and one is typically interested in testing specific parameter values in \(\varvec{\mu }\) space. In such scenarios, the Lawley formula to compute the expectation of \(w_{\varvec{\mu }}\) is given by

$$\begin{aligned} \text {E}[w_{\varvec{\mu }}]&= 2\text {E}[\ell (\hat{\varvec{\mu }},\hat{\varvec{\theta }})-\ell (\varvec{\mu },\hat{\hat{\varvec{\theta }}})] = 2\text {E}[\ell (\hat{\varvec{\mu }},\hat{\varvec{\theta }})-\ell (\varvec{\mu },\varvec{\theta })] - 2\text {E}[\ell (\varvec{\mu },\hat{\hat{\varvec{\theta }}})-\ell (\varvec{\mu },\varvec{\theta })]\\&= M + N +\epsilon _{N+M} - N - \epsilon _{N} + \mathcal {O}(n^{-2}) = M + \epsilon _{N+M} - \epsilon _{N} + \mathcal {O}(n^{-2}). \end{aligned}$$
(32)

The notation \(\epsilon _{N+M}\) indicates that the summation in Eq. (28) is performed over all the indices labeling the full parameter space, whereas the notation \(\epsilon _{N}\) indicates that the summation is only performed over indices of the nuisance parameters. However, a more efficient way to compute Eq. (32) is to directly calculate the difference \(\epsilon _{N+M} - \epsilon _{N}\) by summing the terms in Eq. (29) over all permutations of the indices that contain at least one parameter of interest. It is worth noting that, for composite hypotheses, the expectation value of the likelihood ratio depends on the nuisance parameters, as the distribution of \(w_{\varvec{\mu }}\) still has a dependence on them. Therefore, to evaluate the expectation value, and thus Eq. (32), one should use \(\varvec{\theta }= \hat{\hat{\varvec{\theta }}}({\varvec{\mu }})\), the constrained MLEs of the nuisance parameters.

The Lawley formula (32) is a valuable tool in situations where it is not feasible to compute the exact expectation value of \(w_{\varvec{\mu }}\) analytically, a common scenario in realistic applications. An alternative approach is to estimate the expectation value of \(w_{\varvec{\mu }}\) numerically using MC methods. Specifically, \(\text {E}[w_{\varvec{\mu }}]\) can be estimated by generating data with the parameters of interest \(\varvec{\mu }\) set to the value being tested and the nuisance parameters \(\varvec{\theta }\) set to their profiled values \(\hat{\hat{\varvec{\theta }}}(\varvec{\mu })\) (a parametric bootstrap estimate).
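A minimal sketch of this bootstrap estimate of the Bartlett correction (again with hypothetical user-supplied `sample_data` and `w_statistic` functions) is:

```python
import numpy as np
from scipy.stats import chi2

def bartlett_correct_mc(w_obs, mu, theta_prof, sample_data, w_statistic,
                        M=1, n_toys=5000, seed=1):
    """Bartlett correction of Eq. (26), with E[w_mu] estimated by a
    parametric bootstrap at the profiled nuisance parameters."""
    rng = np.random.default_rng(seed)
    w_mean = np.mean([w_statistic(sample_data(mu, theta_prof, rng), mu)
                      for _ in range(n_toys)])
    w_star = w_obs * M / w_mean
    return w_star, chi2.sf(w_star, df=M)   # corrected statistic and p value
```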

4 Overview of the Gamma Variance Model

Having outlined in the preceding section the general formalism for higher-order asymptotic corrections, we now demonstrate their use with the Gamma Variance Model (GVM) introduced in Ref. [9]. The GVM extends the general likelihood often used in particle physics analyses,

$$\begin{aligned} L(\varvec{\mu }, \varvec{\theta }) = P(\varvec{y}|\varvec{\mu }, \varvec{\theta })\times P(\varvec{u}| \varvec{\theta }) = P(\varvec{y}|\varvec{\mu }, \varvec{\theta }) \times \prod \limits _{i=1}^{N}\frac{1}{\sqrt{2\pi \sigma _{u_i}^2}}\ \exp \left[ -\frac{(u_i-\theta _i)^2}{2\sigma _{u_i}^2} \right] , \end{aligned}$$
(33)

where \(P(\varvec{y}|\varvec{\mu }, \varvec{\theta })\) denotes the probability density function of the data \(\varvec{y}\), which depends on M parameters of interest \(\varvec{\mu } = (\mu _1, \ldots , \mu _M)\) and N nuisance parameters \(\varvec{\theta } = (\theta _1, \ldots , \theta _N)\). To provide information on the nuisance parameters one includes N control measurements \(\varvec{u} = (u_1, \ldots , u_N)\), here assumed to be direct estimates of the nuisance parameters \(\varvec{\theta }\) that are independent and Gaussian distributed. We suppose these are unbiased (i.e., \(\text {E}[\varvec{u}] = \varvec{\theta }\)), and their standard deviations \(\varvec{\sigma _u}= (\sigma _{u_1},\ldots ,\sigma _{u_N})\) are often referred to as systematic errors. The inclusion of nuisance parameters enlarges the model’s parameter space, thereby enabling a better approximation of the truth, albeit at the cost of reduced sensitivity to the parameters of interest. Very often, the values of the standard deviations \(\varvec{\sigma _u}\) are assigned by the experimenter and treated as fixed.

The GVM extends this model to address the important situation where the systematic errors are themselves uncertain by regarding the \(\sigma _{u_i}^2\) as adjustable rather than known parameters. The values that one would have assigned to them before are now treated as independent gamma-distributed estimates \(v_i\), i.e.,

$$\begin{aligned} v_i \sim \frac{\beta _i^{\alpha _i}}{\Gamma (\alpha _i)}v_i^{\alpha _i-1}e^{-\beta _i v_i}. \end{aligned}$$
(34)

Here the parameters of the gamma distribution \(\alpha _i\) and \(\beta _i\) are defined such that the expected value is \(\text {E}[v_i] = \alpha _i / \beta _i\) and the variance is \(\sigma _{v_i}^2 = \alpha _i / \beta _i^2\). These are chosen such that \(v_i\) is an unbiased estimator for \(\sigma _{u_i}^2\) (i.e., \(\text {E}[v_i] = \sigma _{u_i}^2\)) and the width of the gamma distribution is adjusted to reflect the appropriate level of uncertainty by defining

$$\begin{aligned} \varepsilon _i \equiv \frac{1}{2}\frac{\sigma _{v_i}}{\text {E}[v_i]}= \frac{1}{2}\frac{\sigma _{v_i}}{\sigma ^2_{u_i}}, \end{aligned}$$
(35)

which is a fixed parameter of the model. Using error propagation, Eq. (35) becomes

$$\begin{aligned} \varepsilon _i \simeq \frac{\sigma _{s_i}}{\text {E}[s_i]}, \end{aligned}$$
(36)

where \(s_i = \sqrt{v_i}\). The quantity \(\varepsilon _i\) is thus the relative uncertainty on the assigned systematic error (also called a coefficient of variation), which we refer to informally as the relative error-on-error parameter. Including the \(v_i\) as measurements into the likelihood gives

$$\begin{aligned} L(\varvec{\mu }, \varvec{\theta }) = P(\varvec{y}|\varvec{\mu }, \varvec{\theta }) \times \prod \limits _{i=1}^{N}\frac{1}{\sqrt{2\pi \sigma _{u_i}^2}}\,e^{-\frac{(u_i-\theta _i)^2}{2\sigma _{u_i}^2}}\, \frac{\beta _i^{\alpha _i}}{\Gamma (\alpha _i)}v_i^{\alpha _i-1}e^{-\beta _i v_i}. \end{aligned}$$
(37)

Although treating the \(\sigma ^2_{u_i}\) as adjustable in the GVM in effect doubles the number of nuisance parameters in comparison to the model where the \(\sigma _{u_i}^2\) are known, one can profile over them in closed form. After some manipulation (see Ref. [9]), the profile log-likelihood is found to be

$$\begin{aligned} \ell (\varvec{\mu },\varvec{\theta },\varvec{\widehat{\widehat{\sigma ^2_{u}}}}) = \ell _p(\varvec{\mu },\varvec{\theta })=\log {P(\varvec{y}|\varvec{\mu }, \varvec{\theta })} -\frac{1}{2}\sum _{i=1}^N\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log {\left[ 1+2\varepsilon _i^2\frac{\left( u_i-\theta _i\right) ^2}{v_i}\right] }. \end{aligned}$$
(38)
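Equation (38) is simple to implement directly; a sketch (with `log_p_y` a user-supplied function returning \(\log P(\varvec{y}|\varvec{\mu }, \varvec{\theta })\); all names are illustrative):

```python
import numpy as np

def gvm_profile_loglike(mu, theta, u, v, eps, log_p_y):
    """Profile log-likelihood of Eq. (38): Gaussian control measurements u,
    gamma-distributed variance estimates v with error-on-error parameters
    eps; the sigma_u^2 have been profiled out in closed form."""
    theta, u, v, eps = map(np.asarray, (theta, u, v, eps))
    constraint = 0.5 * np.sum(
        (1.0 + 0.5 / eps**2)
        * np.log1p(2.0 * eps**2 * (u - theta)**2 / v))
    return log_p_y(mu, theta) - constraint
```

The use of `log1p` keeps the logarithmic constraint terms numerically stable when the \(\varepsilon _i\) are small.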

As discussed in Ref. [9], the Gamma Variance Model leads to interesting and useful consequences for inference about the parameters of interest \(\varvec{\mu }\). In particular, the size of the confidence region for \(\varvec{\mu }\) becomes coupled to the goodness of fit, with increasing incompatibility of the input data leading to larger regions. Furthermore, the point estimate for \(\varvec{\mu }\) shows a decreased sensitivity to outliers in the data. It is therefore of particular interest to apply the GVM in cases where the input values are in tension either among themselves or with the predictions of a hypothesis of interest. For example, the tension between measured and predicted values of the anomalous muon magnetic moment was explored in Ref. [17]. The GVM represents a purely frequentist approach to this type of problem. Bayesian methods have been found to yield qualitatively similar results, e.g., in Refs. [18,19,20,21].

A practical difficulty with the Gamma Variance Model arises in connection with the use of asymptotic formulae to obtain p values and confidence regions when the \(\varepsilon \) parameters exceed a value of around 0.2. As discussed in Ref. [9], there is a correspondence between the parameters \(\varepsilon _i\) and an effective sample size, \(n_i\), which can be found by considering a sample of \(n_i\) independent observations of \(u_i\) and using their sample variance as an estimate of \(\sigma _{u_i}^2\). This estimator is found to be gamma distributed with an error-on-error parameter \(\varepsilon _i\) related to the sample size by

$$\begin{aligned} n_i = 1 + \frac{1}{2 \varepsilon _i^2} . \end{aligned}$$
(39)

Thus when \(\varepsilon _i\) becomes large, \(n_i\) drops to order unity and the large-sample criterion required for use of asymptotic distributions no longer holds. Values of \(\varepsilon \) are expected to be roughly 0.2–0.5 or even larger in many applications, which could make it far more difficult to compute p values and confidence regions.

The breakdown of the asymptotic formulae for large \(\varepsilon _i\) can be understood intuitively by expanding the logarithmic term of Eq. (38) in powers of \(\varepsilon _i\):

$$\begin{aligned} \left( 1+\frac{1}{2\varepsilon _i^2}\right) \log {\left[ 1+2\varepsilon _i^2\frac{\left( u_i-\theta _i\right) ^2}{v_i}\right] } = \left( 1+2\varepsilon _i^2\right) \frac{\left( u_i-\theta _i\right) ^2}{v_i}-\varepsilon _i^2\frac{\left( u_i-\theta _i\right) ^4}{v_i^2}+\mathcal {O}_p\left( \varepsilon _i^4\right) . \end{aligned}$$
(40)

Thus as \(\varepsilon _i\) approaches zero, the logarithmic constraint reduces to a quadratic one, associated with a Gaussian constraint for the nuisance parameter, leading to the asymptotic distributions for the statistics \(w_{\mu }\) and \(r_{\mu }\) discussed above. However, for large \(\varepsilon _i\), the Gamma Variance Model deviates from the quadratic approximation by an error term that begins at \(\mathcal {O}_p(\varepsilon _i^2)\), as can be seen in Eq. (40).

Consequently, when \(\varepsilon _i\) is not equal to zero, the asymptotic formulae used to obtain p values and confidence regions are not guaranteed to represent valid approximations. Furthermore, the interval of convergence of the logarithm in Eq. (38) is

$$\begin{aligned} 2\varepsilon _i^2\frac{\left( u_i-\theta _i\right) ^2}{v_i}<1, \end{aligned}$$
(41)

and thus the asymptotic formulae are not expected to give accurate approximations if the above condition is not satisfied.

In principle, this difficulty can be overcome by using Monte Carlo calculations, but this can entail substantial additional work and computing time. It is therefore valuable to have a method of finding p values and confidence regions without MC, and thus the primary goal of this paper is to investigate the use of higher-order asymptotics with the GVM to obtain results that remain accurate even for large \(\varepsilon _i\).

Fig. 1 Distributions of \(w_{\mu }\) (blue) for different values of the parameter \(\varepsilon \) compared with the asymptotic \(\chi ^2\) distribution (black)

5 Single-measurement model

In order to investigate the asymptotic properties of a statistical model with uncertain error parameters, it is convenient to use the simple model introduced in Ref. [9]. Here for completeness we reproduce several results shown in that paper using the Bartlett correction and extend them in Sect. 5.1 using the \(r^*\) approximation.

The single-measurement model describes a single Gaussian distributed measurement y with mean \(\mu \) and standard deviation \(\sigma \). We take \(\mu \) to be the parameter of interest and \(\sigma ^2\) to be a nuisance parameter constrained by an independent gamma-distributed estimate v. Therefore, the likelihood is

$$\begin{aligned} L\left( \mu , \sigma ^2\right) =\frac{1}{\sqrt{2\pi \sigma ^2}}e^{-(y-\mu )^2/2\sigma ^2}\,\frac{\beta ^{\alpha }}{\Gamma (\alpha )}v^{\alpha -1}e^{-\beta v}, \end{aligned}$$
(42)

where \(\alpha =1/(4\varepsilon ^2)\) and \(\beta =1/(4\varepsilon ^2\sigma ^2)\), and \(\varepsilon \) is the relative error on the standard deviation \(\sigma \). The log-likelihood of the model is given by

$$\begin{aligned} \ell \left( \mu , \sigma ^2\right) =-\frac{1}{2}\frac{(y-\mu )^2}{\sigma ^2}-\left( \frac{1}{2}+\frac{1}{4\varepsilon ^2}\right) \log {\sigma ^2}-\frac{v}{4\varepsilon ^2\sigma ^2}. \end{aligned}$$
(43)

The goal is to compute the likelihood ratio \(w_{\mu }\) (see Eq. (2)) to study its asymptotic properties and to apply to it the higher-order corrections defined in Sect. 3. This requires the estimators

$$\begin{aligned} \begin{aligned}&\hat{\mu }=y,\\&\widehat{\sigma ^2}=\frac{v}{1+2\varepsilon ^2},\\&\widehat{\widehat{\sigma ^2}}=\frac{v+2\varepsilon ^2(y-\mu )^2}{1+2\varepsilon ^2}. \end{aligned} \end{aligned}$$
(44)

With the help of the above expressions it is easy to derive the likelihood ratio \(w_{\mu }\),

$$\begin{aligned} w_\mu =\left( 1+\frac{1}{2\varepsilon ^2}\right) \log {\left[ 1+2\varepsilon ^2\frac{(y-\mu )^2}{v}\right] }, \end{aligned}$$
(45)

which, in the limit \(\varepsilon \rightarrow 0\), becomes

$$\begin{aligned} w_\mu =\frac{(y-\mu )^2}{v}+\mathcal {O}_p(\varepsilon ^2). \end{aligned}$$
(46)

In this limit, the likelihood ratio can be approximated by a quadratic expression, as expected in the asymptotic limit. As seen previously, the parameter \(\varepsilon \) is related to an effective sample size, as it measures the extent to which the model deviates from the asymptotic limit. In particular, it is expected that the distribution of \(w_{\mu }\) should deviate from its asymptotic \(\chi _1^2\) distribution by an error term of order \(\mathcal {O}(\varepsilon ^2)\).

Figure 1 shows the distributions of the likelihood ratio statistic with data generated according to Eq. (42) setting \(\mu =0\), \(\sigma =1\) and \(\varepsilon =0.01,\,0.2,\,0.4,\,0.6\). As found in Ref. [9], the distribution deviates from the asymptotic \(\chi _1^2\) form as the \(\varepsilon \) parameter increases. The simple dependence of the single measurement model on the parameter \(\varepsilon \) makes it an ideal candidate for studying the effectiveness of higher-order asymptotic methods in improving asymptotic formulae.
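A study along the lines of Fig. 1 can be reproduced with a short toy Monte Carlo; a sketch under the settings quoted above (\(\mu =0\), \(\sigma =1\)):

```python
import numpy as np
from scipy.stats import chi2

def w_single(y, v, mu, eps):
    """Likelihood ratio of Eq. (45) for the single-measurement model."""
    return (1.0 + 0.5 / eps**2) * np.log1p(2.0 * eps**2 * (y - mu)**2 / v)

def w_toys(eps, mu=0.0, sigma=1.0, n_toys=100_000, seed=1):
    """Toy values of w_mu under the model of Eq. (42)."""
    rng = np.random.default_rng(seed)
    y = rng.normal(mu, sigma, n_toys)
    # v ~ Gamma(shape=alpha, scale=1/beta), alpha = 1/(4 eps^2), beta = alpha/sigma^2
    alpha = 1.0 / (4.0 * eps**2)
    v = rng.gamma(alpha, sigma**2 / alpha, n_toys)
    return w_single(y, v, mu, eps)

# e.g. compare the tail fraction with the asymptotic 10%:
# print(np.mean(w_toys(0.4) > chi2.ppf(0.90, df=1)))
```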

Fig. 2 Distributions of \((r_\mu ^{*})^2\) (green) and \(w_\mu ^*\) computed with the Lawley formula (orange) for different values of the parameter \(\varepsilon \) compared with the \(\chi ^2\) asymptotic distribution (black)

5.1 Higher-order asymptotics for the single-measurement model

As one can see in Fig. 1, the likelihood ratio exhibits noticeable deviations from its asymptotic \(\chi _1^2\) distribution even for moderate values of \(\varepsilon \). It is therefore important to investigate whether higher-order statistics, namely \(r^*_\mu \) and \(w^*_\mu \) as defined in Eqs. (14) and (26), can be better approximated by their asymptotic distributions, particularly for larger values of \(\varepsilon \).

The asymptotic distribution of \(r^*\) is a standard normal and it has an associated error term of \(\mathcal {O}(n^{-3/2})\). For the single-measurement model, \(\varepsilon \) determines the effective sample size (\(n=1+1/(2\varepsilon ^2)\)), and thus the error term is expected to be of order \(\mathcal {O}(\varepsilon ^3)\) or smaller. In order to compute \(r_\mu ^*\) one needs \(q_\mu \) as defined in Eq. (18). The dependence of the likelihood of the single-measurement model on the data can be explicitly re-expressed in terms of the MLEs defined in Eq. (44):

$$\begin{aligned} \ell (\hat{\mu }, \widehat{\sigma ^2}|\mu ,\sigma ^2) = -\frac{1}{2}\frac{(\hat{\mu }-\mu )^2}{\sigma ^2}- \left( \frac{1}{2}+\frac{1}{4\varepsilon ^2} \right) \log {\sigma ^2} -\frac{\widehat{\sigma ^2}(1+2\varepsilon ^2)}{4\varepsilon ^2\sigma ^2}. \end{aligned}$$
(47)

Therefore, it is possible to use Eq. (18) to compute \(q_\mu \), for which one finds

$$\begin{aligned} q_\mu = \frac{\sqrt{(1+2\varepsilon ^2) v}}{v+2\varepsilon ^2(y-\mu )^2}(y-\mu ). \end{aligned}$$
(48)

Since the asymptotic distribution of \(r_\mu ^*\) is a standard normal, the asymptotic distribution of \(r_\mu ^{*2}\) is a \(\chi _1^2\) distribution, and therefore it can be seen as a higher-order correction to the likelihood ratio.
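Putting together Eqs. (14), (45) and (48) gives a compact recipe for \(r^*_\mu \); a sketch (note the removable singularity at \(r_\mu =0\), which requires care in numerical work near \(\mu =\hat{\mu }\)):

```python
import numpy as np

def r_star_single(y, v, mu, eps):
    """r*_mu of Eq. (14) for the single-measurement model,
    with w_mu from Eq. (45) and q_mu from Eq. (48)."""
    w = (1.0 + 0.5 / eps**2) * np.log1p(2.0 * eps**2 * (y - mu)**2 / v)
    r = np.sign(y - mu) * np.sqrt(w)                       # Eq. (5)
    q = np.sqrt((1.0 + 2.0 * eps**2) * v) * (y - mu) \
        / (v + 2.0 * eps**2 * (y - mu)**2)                 # Eq. (48)
    return r + np.log(q / r) / r                           # Eq. (14)
```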

The second higher-order statistic we want to study is the Bartlett-corrected likelihood ratio,

$$\begin{aligned} w_\mu ^*= w_\mu \, \frac{M}{\text {E}[w_{\mu }]} \equiv \frac{w_\mu }{1+b/M}, \end{aligned}$$
(49)

where \(\text {E}[w_\mu ] = M+b\) is the quantity one must find to obtain the Bartlett correction. The Bartlett-corrected likelihood ratio \(w_\mu ^*\) is expected to be \(\chi _1^2\) distributed in the asymptotic limit. The expectation value \(\text {E}[w_\mu ]\) can be estimated using the Lawley formula (32), which yields

$$\begin{aligned} \text {E}[w_\mu ] = 1+ 3\varepsilon ^2+\mathcal {O}\left( \varepsilon ^4\right) . \end{aligned}$$
(50)

The asymptotic distribution of \(w_\mu ^*\) will have an error term of \(\mathcal {O}(n^{-2})\), or equivalently \(\mathcal {O}(\varepsilon ^4)\) for the single-measurement model. All of the higher-order statistics described above, namely \(r_\mu ^{*2}\) and \(w_\mu ^*\), follow a \(\chi ^2\) distribution in the asymptotic limit. The expectation value in Eq. (50) matches, at \(\mathcal {O}(\varepsilon ^2)\), the result found in [9] using a Taylor expansion of the integral for its computation.

In Fig. 2, we show the distributions of these two statistics for data generated according to Eq. (42) with \(\mu =0\), \(\sigma =1\), and \(\varepsilon \) values of 0.01, 0.2, 0.4, and 0.6. The distributions of the two statistics are much better approximated by \(\chi ^2_1\) compared to the original likelihood ratio \(w_\mu \), indicating that higher-order statistics provide significant improvements in this example.

5.2 Confidence intervals for the single-measurement model

The likelihood ratio is a commonly used tool for deriving confidence regions, typically obtained by finding the p value of \(\varvec{\mu }\) and then solving the equation \(p_{\varvec{\mu }}=\alpha \), where \(1-\alpha \) represents the desired confidence level. In the case of the single-measurement model, which involves only one parameter of interest \(\mu \), our goal is to construct a confidence interval for it as described in Sect. 2. To obtain the p value, the distribution of \(w_\mu \) must be determined. As seen in Fig. 1, the distribution of \(w_{\mu }\) departs from its asymptotic form for large values of \(\varepsilon \), and p values derived from a \(\chi ^2_1\) distribution are thus not accurate. To address this, we can use higher-order statistics such as \(r_\mu ^*\) and \(w_\mu ^*\) to compute the p values, i.e.,

$$\begin{aligned} p_{\mu } = \int _{w^*_{ \text {obs}}}^{\infty }f_{\chi _1^2}(w^*) \, dw^*= 1 - F_{\chi _1^2}[w^*_{ \text {obs}}], \end{aligned}$$
(51)

or

$$\begin{aligned} p_{\mu } = \int _{r^{*2}_{ \text {obs}}}^{\infty }f_{\chi _1^2}\left( r^{*2}\right) \, dr^{*2}= 1 - F_{\chi _1^2}\left[ r^{*2}_{ \text {obs}}\right] . \end{aligned}$$
(52)

To illustrate this we find the confidence interval for \(\mu \) as a function of the parameter \(\varepsilon \) under the assumption that the observed values of y and v are 0 and 1, respectively. Figure 3 shows a comparison of the confidence intervals obtained using the likelihood ratio \(w_\mu \) and the higher-order statistics \(r_\mu ^*\) and \(w_\mu ^*\). In addition, the confidence interval is computed by calculating the p value exactly, as described in [9]. The plot in Fig. 3 shows that the use of higher-order statistics significantly improves the accuracy of the confidence interval. Figure 3 confirms the findings in [9], but also shows that \(r^*_\mu \) provides estimates of the confidence interval as accurate as those computed with \(w^*_\mu \), proving to be a valid alternative to the Bartlett correction.
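For this one-parameter example, an interval endpoint can be obtained by a one-dimensional root search on the p value; a sketch using \((r^{*}_\mu )^2\) and reusing `r_star_single` from the sketch above (observed \(y=0\), \(v=1\), so the interval is symmetric about \(\hat{\mu }=0\)):

```python
from scipy.optimize import brentq
from scipy.stats import chi2

def half_length_r_star(eps, y=0.0, v=1.0, cl=0.683):
    """Half-length of the confidence interval for mu, solving p_mu = 1 - cl
    with the p value from (r*_mu)^2 and its asymptotic chi2_1 distribution."""
    def excess(mu):
        rs = r_star_single(y, v, mu, eps)   # from the sketch above
        return chi2.sf(rs**2, df=1) - (1.0 - cl)
    # endpoint bracketed between mu_hat = y and a generous upper bound
    return brentq(excess, y + 1e-6, y + 10.0) - y
```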

Fig. 3 Half-length of 1-\(\sigma \) confidence intervals (\(68.3\%\) confidence level) for \(\mu \) (Eq. (42)) as a function of \(\varepsilon \), computed using \(w_\mu \) (blue), \(r^*_\mu \) (green) and \(w^*_\mu \) calculated using the Lawley formula (orange). The black curve represents the exact half-length of the confidence interval

6 Simple-average model

The single-measurement model can be extended to an average of N measurements \(\varvec{y}=(y_1,...,\,y_N)\), which are assumed to follow a Gaussian distribution with a common mean \(\mu \) and variances \(\varvec{\sigma ^2}=(\sigma _1^2,...,\,\sigma _N^2)\). Here the variances are assumed to be uncertain with independent estimates \(\varvec{v}=(v_1,...,\,v_N)\) that are gamma distributed with error-on-error parameters (cf. Eq. (35)) of \(\varvec{\varepsilon }=(\varepsilon _1,...,\,\varepsilon _N)\). The likelihood of the model is thus

$$\begin{aligned} L\left( \mu , \varvec{\sigma ^2}\right) = \prod _{i=1}^{N}\frac{1}{\sqrt{2\pi \sigma _i^2}}e^{-(y_i-\mu )^2/2\sigma _i^2} \times \prod _{i=1}^{N}\frac{\beta _i^{\alpha _i}}{\Gamma (\alpha _i)}v_i^{\alpha _i-1}e^{-\beta _i v_i}, \end{aligned}$$
(53)

where \(\alpha _i=1/(4\varepsilon _i^2)\) and \(\beta _i=1/(4 \varepsilon _i^2\sigma _i^2)\). Equivalently, the log-likelihood of the model is

$$\begin{aligned} \ell \left( \mu , \varvec{\sigma ^2}\right) = -\frac{1}{2}\sum _{i=1}^N \left[ \frac{(y_i-\mu )^2}{\sigma _i^2} +\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log {\sigma _i^2}+\frac{v_i}{2\varepsilon _i^2\sigma _i^2} \right] . \end{aligned}$$
(54)

In contrast to the full Gamma Variance Model described in Sect. 4, this model does not include nuisance parameters \(\theta _i\) or their estimates, but rather treats the variances \(\sigma _i^2\) of the primary measurements \(y_i\) as uncertain. It can easily be generalized to a curve-fitting problem in which the expectation value of each measurement \(y_i\) is a function of the parameters of interest \(\varvec{\mu }\) and a control variable \(x_i\), i.e., \(\text {E}[y_i]=f(x_i; \varvec{\mu })\).

The log-likelihood of Eq. (54) profiled over the \(\varvec{\sigma ^2}\) is given by

$$\begin{aligned} \ell _p(\mu )=-\frac{1}{2}\sum _{i=1}^N\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log \left[ 1+2\varepsilon _i^2\frac{\left( y_i-\mu \right) ^2}{v_i}\right] , \end{aligned}$$
(55)

which has been computed using the profiled value of \(\sigma _i^2\):

$$\begin{aligned} \widehat{\widehat{\sigma _i^2}}=\frac{v_i+2\varepsilon _i^2(y_i-\mu )^2}{1+2\varepsilon _i^2}. \end{aligned}$$
(56)

As in the example of the single-measurement model from Sect. 5, we compute the likelihood ratio \(w_\mu \), and the statistics \(r^*_\mu \) and \(w^*_\mu \). These require the MLE \(\hat{\mu }\), which in general must be found numerically. As discussed in Sect. 4, the distribution of the likelihood ratio \(w_\mu \) is expected to deviate from its asymptotic \(\chi ^2_1\) form, and we investigate whether the higher-order statistics \(r_\mu ^*\) and \(w_\mu ^*\) can improve the precision of the inference on \(\mu \).
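A numerical sketch of these ingredients (the MLE \(\hat{\mu }\) found by minimizing the negative of Eq. (55); names are ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll_profile(mu, y, v, eps):
    """Negative of the profile log-likelihood of Eq. (55)."""
    y, v, eps = map(np.asarray, (y, v, eps))
    return 0.5 * np.sum((1.0 + 0.5 / eps**2)
                        * np.log1p(2.0 * eps**2 * (y - mu)**2 / v))

def w_average(mu, y, v, eps):
    """Likelihood ratio w_mu for the simple-average model; for this model
    mu_hat always lies between min(y) and max(y)."""
    fit = minimize_scalar(nll_profile, bounds=(np.min(y), np.max(y)),
                          args=(y, v, eps), method="bounded")
    return 2.0 * (nll_profile(mu, y, v, eps) - fit.fun)
```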

Because the log-likelihood of Eq. (54) cannot be written explicitly as a function of the MLEs, but is known in terms of the data values \(y_i\), the correction term \(q_{\mu }\), needed to compute \(r_\mu ^*\), must be found using Eq. (23). Additionally, to compute \(q_\mu \), as discussed in Sect. 3.1, a vector of pivotal quantities \(\varvec{z}=(z_{y_1},\ldots ,\,z_{y_N},\,z_{v_1},\ldots ,\,z_{v_N})\) must be defined and used in Eqs. (20) and (21). We choose these to be

$$\begin{aligned} \begin{aligned}&z_{y_{i}} = \frac{(y_{i}-\mu )^2}{\sigma _{i}^2}\sim \chi ^2_1,\\&z_{v_{i}} = \frac{v_i}{\sigma _{i}^2}\sim \text {Gamma}\left( \alpha _i,\, 1/(4\varepsilon _i^2)\right) , \end{aligned} \end{aligned}$$
(57)

The Bartlett-corrected likelihood ratio \(w_\mu ^*\) can be calculated using the Lawley formula (32). The result can be expanded at order \(\varepsilon _i^2\) as

$$\begin{aligned} \text {E}[w_\mu ] = 1+ \frac{4}{\sum _{i=1}^N 1/v_i}\sum _{i=1}^N\frac{\varepsilon _i^2}{v_i} - \frac{1}{\left( \sum _{i=1}^N 1/v_i\right) ^2}\sum _{i=1}^N\frac{\varepsilon _i^2}{v_i^2}+\sum _{i=1}^N\mathcal {O}(\varepsilon _i^4), \end{aligned}$$
(58)

which in the limit \(\varepsilon _i \rightarrow 0\) gives 1 as expected. Using this result one can thus find the corrected statistic \(w_{\mu }^{*} = w_{\mu } / \text {E}[w_{\mu }]\).
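The resulting correction factor is inexpensive to evaluate; a sketch of Eq. (58), with the corrected statistic obtained by dividing \(w_\mu \) (e.g. from `w_average` above) by it:

```python
import numpy as np

def bartlett_factor(v, eps):
    """E[w_mu] from the O(eps_i^2) expansion of the Lawley formula, Eq. (58)."""
    v, eps = map(np.asarray, (v, eps))
    s = np.sum(1.0 / v)
    return 1.0 + (4.0 / s) * np.sum(eps**2 / v) - np.sum(eps**2 / v**2) / s**2

# w_star = w_average(mu, y, v, eps) / bartlett_factor(v, eps)
```

For \(N=1\) this reduces to \(1+3\varepsilon ^2\), recovering Eq. (50).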

6.1 Confidence intervals for the parameter of interest

In this section, we compute the \(68.3\%\) confidence interval for the parameter \(\mu \) of Eq. (54) to benchmark the higher-order statistics computed in the last section. Specifically, we consider the simple case of averaging two measurements, \(y_1\) and \(y_2\), with observed values of \(+\delta \) and \(-\delta \), respectively. The estimates of the variances \(v_1\) and \(v_2\) are set to 1. We consider \(\delta = 0.5\), which represents the case where \(y_1\) and \(y_2\) are reasonably consistent, and \(\delta = 1.5\), corresponding to a substantial tension between the two measurements. Both measurements are assigned equal error-on-error parameters, \(\varepsilon _1=\varepsilon _2=\varepsilon \), and we present results as a function of \(\varepsilon \). The intervals are found using the likelihood ratio \(w_\mu \) and the higher-order statistics \(r_\mu ^*\) and \(w_\mu ^*\).

In addition, the confidence interval is found by estimating the p value of the likelihood ratio using MC. This is done by generating data from their exact distribution for a fixed value of \(\mu \), with the nuisance parameters \(\sigma _i^2\) set to their profiled values. In Particle Physics, this technique is commonly known as the profile construction [22] or hybrid resampling [23, 24] method (in statistics it is often referred to as a parametric bootstrap). This technique is expected to provide the most accurate determination of the intervals that is computationally feasible.
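A sketch of the profile construction for this model, with toys drawn at the tested \(\mu \) and the \(\sigma _i^2\) fixed to the profiled values of Eq. (56) (`w_average` is from the sketch in the previous section):

```python
import numpy as np

def p_value_profile_construction(mu, y, v, eps, n_toys=10_000, seed=1):
    """p value of w_mu from the profile construction (parametric bootstrap)."""
    rng = np.random.default_rng(seed)
    y, v, eps = map(np.asarray, (y, v, eps))
    w_obs = w_average(mu, y, v, eps)
    sig2 = (v + 2.0 * eps**2 * (y - mu)**2) / (1.0 + 2.0 * eps**2)  # Eq. (56)
    alpha = 1.0 / (4.0 * eps**2)                                    # gamma shapes
    n_exceed = 0
    for _ in range(n_toys):
        y_t = rng.normal(mu, np.sqrt(sig2))   # toy primary measurements
        v_t = rng.gamma(alpha, sig2 / alpha)  # toy variance estimates, scale=1/beta
        n_exceed += w_average(mu, y_t, v_t, eps) > w_obs
    return n_exceed / n_toys
```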

Fig. 4 Half-length of 1-\(\sigma \) confidence intervals for parameter \(\mu \) (Eq. (54)) as a function of \(\varepsilon \) for \(\delta =0.5\) (top) and \(\delta =1.5\) (bottom), computed using \(w_\mu \) (blue), \(r^*_\mu \) (green) and \(w^*_\mu \) calculated using the Lawley formula (orange). The black dots represent our most precise estimate of the interval, computed using the profile construction. The confidence interval obtained with \(r^*_\mu \) becomes numerically unstable for large values of \(\varepsilon \) when there is significant tension in the dataset (bottom)

The uncorrected interval based on the likelihood ratio \(w_{\mu }\) is found to undershoot the result from the profile construction in both cases, with the discrepancy increasing with \(\varepsilon \). Conversely, the intervals obtained using the \(r_\mu ^*\) and \(w_\mu ^*\) statistics are found to be in good agreement with that from the profile construction when the averaged data are mutually compatible (top panel of Fig. 4). However, for larger values of \(\varepsilon \), \(r_\mu ^*\) breaks down as the tension in the observed data grows (see the lower panel of Fig. 4), leading to numerical instability. This occurs because the correction term \(q_\mu \) in the definition of \(r_\mu ^*\) (see Eq. (14)) no longer effectively corrects the likelihood root, as the likelihood becomes strongly non-Gaussian. Consequently, deviations from the asymptotic limit can no longer be adequately addressed by a correction term.

A conservative approach to determine the applicability of \(r_\mu ^*\) in improving the likelihood ratio predictions is to verify whether the arguments of the logarithmic terms of Eq. (55) are within their radius of convergence. Specifically, for the endpoints of the confidence interval, one should check whether the inequality

$$\begin{aligned} 2\varepsilon _i^2\frac{\left( y_i-\mu \right) ^2}{v_i}<1 \end{aligned}$$
(59)

is satisfied for every measurement \(y_i\). In the example above, this condition implies that one should not trust the accuracy of the result from \(r_\mu ^*\) if \(\varepsilon \ge 0.5\) for \(\delta =0.5\) and \(\varepsilon \ge 0.3\) for \(\delta =1.5\).
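This check is easy to automate; a sketch:

```python
import numpy as np

def r_star_reliable(mu, y, v, eps):
    """True if the convergence condition of Eq. (59) holds at the tested mu
    (e.g. at the endpoints of the confidence interval) for every measurement."""
    y, v, eps = map(np.asarray, (y, v, eps))
    return bool(np.all(2.0 * eps**2 * (y - mu)**2 / v < 1.0))
```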

6.2 Goodness-of-fit

A confidence region for the parameters of interest does not, on its own, provide a measure of how well the selected model describes the observed data. This can be quantified using a goodness-of-fit statistic for which we take

$$\begin{aligned} q = -2\log \frac{L\left( \hat{\mu },\hat{\varvec{\sigma }}^2\right) }{L_\textrm{s}\left( \hat{\varvec{\varphi }},\hat{\varvec{\sigma }}^2\right) }, \end{aligned}$$
(60)

where \(L_\textrm{s}\) represents the likelihood of the saturated model. This is obtained by replacing the expectation values \(\text {E}[y_i]=\mu \) with a set of independent parameters \(\varvec{\varphi }=(\varphi _1,\ldots ,\varphi _N)\), such that \(\text {E}[y_i]=\varphi _i\). Since there is an adjustable \(\varphi _i\) for each measurement \(y_i\), one finds \(\hat{\varphi }_i = y_i\) and \(\hat{\sigma }^2_i = v_i/(1+2\varepsilon _i^2)\) (cf. Eq. (44)), and the goodness-of-fit statistic reduces to

$$\begin{aligned} q = \sum _{i=1}^N\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log \left[ 1+2\varepsilon _i^2\frac{\left( y_i-\hat{\mu }\right) ^2}{v_i}\right] . \end{aligned}$$
(61)

Expanding the above expression in powers of \(\varepsilon _i^2\) and taking the limit \(\varepsilon _i^2 \rightarrow 0\), one finds

$$\begin{aligned} q =\sum _{i=1}^N\frac{\left( y_i-\hat{\mu }\right) ^2}{v_i}+\mathcal {O}_p(\varepsilon _i^2). \end{aligned}$$
(62)

In this limit, q reduces to a sum of squares of Gaussian-distributed quantities, and thus its distribution follows a chi-square distribution with \(N-1\) degrees of freedom. However, for large values of the \(\varepsilon _i\) parameters, deviations from the \(\chi ^2_{N-1}\) asymptotic distribution are expected.
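As a quick illustration, the sketch below evaluates q from Eq. (61) and its naive asymptotic p value; the \(\chi ^2_{N-1}\) reference is only trustworthy for small \(\varepsilon _i\), as just noted, and the function name is ours.

```python
import numpy as np
from scipy import stats

def gof_q(y, v, eps, mu_hat):
    # Goodness-of-fit statistic of Eq. (61)
    y, v, eps = map(np.asarray, (y, v, eps))
    return np.sum((1 + 1/(2*eps**2))*np.log(1 + 2*eps**2*(y - mu_hat)**2/v))

# Two incompatible measurements; mu_hat = 0 by symmetry
q = gof_q(y=[-3.0, 3.0], v=[1.0, 1.0], eps=[0.2, 0.2], mu_hat=0.0)
p_asymptotic = stats.chi2.sf(q, df=2 - 1)   # chi-square with N-1 dof
```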

To correct the goodness-of-fit statistic using higher-order asymptotics, q needs to be defined as a likelihood ratio. This can be done by constructing the saturated model such that the simple-average model is nested within it. A possible choice is to define the saturated model as

$$\begin{aligned} \ell _\textrm{s}(\varvec{\alpha }, \mu , \varvec{\sigma ^2}) = \log L_\textrm{s}(\varvec{\alpha }, \mu , \varvec{\sigma ^2}) = -\frac{1}{2}\sum _{i=1}^N\left[ \frac{(y_i-\alpha _i-\mu )^2}{\sigma _i^2}+\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log {\sigma _i^2}+\frac{v_i}{2\varepsilon _i^2\sigma _i^2}\right] , \end{aligned}$$
(63)

where we fix \(\alpha _N=-\sum _{i=1}^{N-1}\alpha _i\) so that \(\sum _{i=1}^{N}\alpha _i=0\). Given this definition, the simple-average model is recovered by fixing all the \(\alpha _i\) to zero, hence \(\ell =\ell _\textrm{s}(\varvec{\alpha }=\varvec{0}, \mu , \varvec{\sigma ^2})\). Therefore, the goodness-of-fit statistic can be written as a likelihood ratio of the saturated model,

$$\begin{aligned} q = -2\log \frac{L_\textrm{s}(\varvec{\alpha }=\varvec{0},\hat{\hat{\mu }},\hat{\hat{\varvec{\sigma }}}^2)}{L_\textrm{s}(\hat{\varvec{\alpha }},\hat{\mu },\hat{\varvec{\sigma }}^2)}, \end{aligned}$$
(64)

and its Bartlett correction can be computed using the Lawley formula. This is done by treating the \(\alpha _i\) as parameters of interest and \(\mu \) and the \(\sigma _i^2\) as nuisance parameters. The result is given by:

$$\begin{aligned} \text {E}[q] =&\, N - 1 - 8\sum _{i=1}^N\sum _{j=1}^{N-1}k^{\mu \alpha _j}\,C_{ij}\frac{r^2_i}{v_i} \\&- 4\sum _{i=1}^N\sum _{j,k=1}^{N-1}k^{\alpha _j\alpha _k}\,C_{ij}C_{ik}\frac{r^2_i}{v_i} \\&- 4\sum _{i=1}^N\sum _{j,k=1}^{N-1}k^{\mu \alpha _j}k^{\mu \alpha _k}\,C_{ij}C_{ik}\frac{r^2_i}{v^2_i} \\&- 2\sum _{i=1}^N\sum _{j,k,p=1}^{N-1}k^{\mu \alpha _j}k^{\alpha _k\alpha _p}\,C_{ij}C_{ik}C_{ip}\frac{r^2_i}{v^2_i} \\&- \sum _{i=1}^N\sum _{j,k,p,q=1}^{N-1}k^{\alpha _j\alpha _k}k^{\alpha _p\alpha _q}\,C_{ij}C_{ik}C_{ip}C_{iq}\frac{r^2_i}{v^2_i}. \end{aligned}$$
(65)

The terms \(k^{\alpha _i\alpha _j}\) and \(k^{\mu \alpha _i}\) refer to the components of the inverse of the expectation value of the Hessian matrix of the likelihood (see Eq. (30)) and are computed for \(\sigma ^2_{y_i}=v_i\). The matrix C is an \(N\times (N-1)\) matrix whose only non-zero entries are:

$$\begin{aligned} \begin{aligned}&C_{ij} = 1 \quad \forall \,i = j,\\&C_{Nj} = -1 \quad \forall \,j . \end{aligned} \end{aligned}$$
(66)
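For completeness, C can be built in a couple of lines (a minimal sketch with an illustrative name); it implements the constraint \(\alpha _N=-\sum _{i=1}^{N-1}\alpha _i\), since \(\varvec{\alpha } = C(\alpha _1,...,\alpha _{N-1})\) automatically sums to zero.

```python
import numpy as np

def contrast_matrix(N):
    # N x (N-1) matrix of Eq. (66): C_ii = 1 for i < N, C_Nj = -1
    C = np.zeros((N, N - 1))
    C[:N - 1, :] = np.eye(N - 1)
    C[N - 1, :] = -1.0
    return C
```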

The \(r^*\) statistic cannot be applied to goodness-of-fit unless \(N=2\), as it can only be computed for models with one parameter of interest.

To measure how well the model describes the observed data, one can compute the p value of the goodness-of-fit statistic,

$$\begin{aligned} p = \int _{q_{\text {obs}}}^{\infty } f(q) \, dq = 1 - F(q_\textrm{obs}), \end{aligned}$$
(67)

where f denotes the probability density of q and F its cumulative distribution. In general, small p values indicate poor agreement between the model and the data. In Particle Physics, p values are typically converted to a related quantity Z called the significance, defined (for a two-sided test) as

$$\begin{aligned} Z=\Phi ^{-1}(1-p/2), \end{aligned}$$
(68)

where \(\Phi ^{-1}\) is the inverse cumulative distribution of a standard normal. The p value is thus equated to the probability for a Gaussian variable to fluctuate Z standard deviations or more away from the mean (i.e., a probability of p/2 in each direction).
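This conversion is conveniently done with the standard-normal functions of any statistics library; a short sketch:

```python
from scipy import stats

def significance(p):
    # Two-sided conversion of Eq. (68): Z = Phi^{-1}(1 - p/2)
    return stats.norm.isf(p/2)

def pvalue_from_Z(Z):
    # Inverse mapping: p = 2 (1 - Phi(Z))
    return 2*stats.norm.sf(Z)

print(significance(0.05))    # approximately 1.96
print(pvalue_from_Z(4.0))    # approximately 6.3e-5
```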

To illustrate this method, we find the p value and corresponding significance Z for an average of two incompatible measurements, namely \(y_1=-3\) and \(y_2=3\), with estimated variances \(v_1=v_2=1\). We set \(\varepsilon _1=\varepsilon _2=\varepsilon \) and show the results as a function of \(\varepsilon \). The significance is estimated using both the goodness-of-fit statistic q and its Bartlett-corrected version \(q^*\). Figure 5 shows the significance as a function of the parameter \(\varepsilon \). The Bartlett correction substantially improves the estimate of the significance, which almost perfectly overlaps with the MC estimates. This is a crucial result, as estimating goodness-of-fit p values for incompatible data can require a large number of pseudo-experiments; for example, roughly \(10^5\) simulated experiments are needed to accurately capture a \(4\sigma \) effect.

The Bartlett correction can also be found using MC to estimate \(\text {E}[w_{\mu }]\), which takes less time than computing the full distribution. The result is very close to what is found from the Lawley formula, and the latter has the advantage of avoiding MC entirely.

Fig. 5 Significance of discrepancy for \(\delta =3\) as defined by Eq. (68), plotted as a function of \(\varepsilon \). Higher significance values indicate greater tension in the data. The significance was computed using the goodness-of-fit statistic (blue) defined by Eq. (60) and its Bartlett-corrected counterpart calculated using the Lawley formula (orange). The black dots represent the significance computed by generating the distribution of q with MC. The plot illustrates that the Bartlett correction enables an accurate approximation of the numerical predictions

7 Averages using the full Gamma Variance Model

The simple-average model of the previous section assumes that the individual measurements are unbiased estimators of the parameter of interest, i.e., \(\text {E}[y_i] = \mu \), but that the standard deviations of the \(y_i\) are uncertain. In practice, the \(\sigma _{y_i}\) are often well estimated because they correspond to the statistical uncertainty in \(y_i\), and thus they are directly related to a sample size or a number of counts. It often happens, however, that \(y_i\) may have a potential bias which must be constrained with a control measurement, as described in the full Gamma Variance Model of Sect. 4. In this section we apply higher-order asymptotic corrections to this case.

More precisely, N measurements are assumed to be independent and Gaussian distributed with means \(\text {E}[y_i]=\mu +\theta _i\) and known variances (the “statistical errors”) \(\text {V}[y_i]=\sigma _{y_i}^2\). Here, the nuisance parameters \(\theta _i\) represent potential biases to the means of the \(y_i\). As described in Sect. 4, their values are estimated with independent Gaussian distributed control measurements \(u_i\), whose variances \(\sigma _{u_i}^2\) (the “systematic errors”) are treated as adjustable parameters. The \(\sigma _{u_i}^2\) are estimated by measurements \(v_i\), whose gamma distributions are characterized by the error-on-error parameters \(\varepsilon _i\). The log-likelihood of the model is

$$\begin{aligned} \begin{aligned} \ell \left( \mu , \varvec{\theta }, \varvec{\sigma _{u}^2}\right) =&-\frac{1}{2}\sum _{i=1}^N \left[ \frac{(y_i-\mu -\theta _i)^2}{\sigma _{y_{i}}^2}+\frac{(u_i-\theta _i)^2}{\sigma _{u_{i}}^2}\right. \\&\left. +\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log {\sigma _{u_{i}}^2}+\frac{v_i}{2\varepsilon _i^2\sigma _{u_{i}}^2} \right] , \end{aligned} \end{aligned}$$
(69)

and the profiled log-likelihood \(\ell _p\) can be computed using

$$\begin{aligned} \widehat{\widehat{\sigma _{u_i}^2}}=\frac{v_i+2\varepsilon _i^2(u_i-\theta _i)^2}{1+2\varepsilon _i^2}, \end{aligned}$$
(70)

leading to

$$\begin{aligned} \ell _p(\mu , \varvec{\theta }) = -\frac{1}{2}\sum _{i=1}^N \left[ \frac{(y_i-\mu -\theta _i)^2}{\sigma _{y_{i}}^2}+\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log \left( 1+2\varepsilon _i^2\frac{(u_i-\theta _i)^2}{v_i} \right) \right] . \end{aligned}$$
(71)

The MLEs \(\hat{\mu }\) and \(\hat{\theta }_i\) can be found numerically or by solving a system of cubic equations (see Ref. [9]).
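For the numerical route, a minimal sketch follows, assuming (as an illustration) a general-purpose optimizer rather than the cubic-equation solution of Ref. [9]; the function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def ell_p_gvm(x, y, u, v, sy2, eps):
    # Profiled log-likelihood of Eq. (71); x = (mu, theta_1, ..., theta_N)
    mu, theta = x[0], x[1:]
    return -0.5*np.sum((y - mu - theta)**2/sy2
                       + (1 + 1/(2*eps**2))*np.log(1 + 2*eps**2*(u - theta)**2/v))

def fit_gvm(y, u, v, sy2, eps):
    # Numerical MLEs (mu_hat, theta_hat_1, ..., theta_hat_N)
    x0 = np.concatenate(([np.mean(y)], u))   # start at mu = mean(y), theta_i = u_i
    res = minimize(lambda x: -ell_p_gvm(x, y, u, v, sy2, eps),
                   x0, method="Nelder-Mead")
    return res.x
```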

As before, we compute the likelihood ratio \(w_\mu \) and the higher-order statistics \(r^*_\mu \) and \(w^*_\mu \). To compute \(r_\mu ^*\), one needs to calculate \(q_\mu \) as defined in Eq. (23). This requires a vector of pivotal quantities \(\varvec{z}=(z_{y_1},...,\,z_{y_N},\,z_{u_1},...,\,z_{u_N},\,z_{v_1},...,\,z_{v_N})\), which can be defined as

$$\begin{aligned} \begin{aligned}&z_{y_{i}} = \frac{(y_{i}-\mu -\theta _i)^2}{\sigma _{y_i}^2}\sim \chi _1^2,\\&z_{u_{i}} = \frac{(u_{i}-\theta _i)^2}{\sigma _{u_i}^2}\sim \chi _1^2,\\&z_{v_{i}} = \frac{v_i}{\sigma _{u_i}^2}\sim \chi _1^2. \end{aligned} \end{aligned}$$
(72)

The Bartlett-corrected likelihood ratio \(w_\mu ^*= w_\mu / \text {E}[w_\mu ]\) can be estimated numerically using the Lawley formula defined by Eq. (32), which predicts the expectation value of \(w_\mu \) to be

$$\begin{aligned} \text {E}[w_\mu ]= 1+\sum _i^N\mathcal {O}(\varepsilon _i^4). \end{aligned}$$
(73)

Therefore, the Bartlett correction factor is equal to unity up to \(\sum _i\mathcal {O}(\varepsilon _i^4)\), indicating that any deviations of the likelihood ratio’s density function from its asymptotic distribution are also expected to be \(\sum _i\mathcal {O}(\varepsilon _i^4)\).

To further improve the accuracy of the Lawley formula, one can compute the Bartlett correction numerically. Specifically, one can approximate the expectation value of \(w_\mu \) by generating data with all the model parameters set to their maximum likelihood estimates, treating the resulting estimate as a constant independent of the tested value of \(\mu \):

$$\begin{aligned} \text {E}[w_\mu ]\simeq \text {E}[w_{\hat{\mu }}]. \end{aligned}$$
(74)

This approximation has been found to yield highly accurate results, as shown below, and moreover it significantly speeds up the computation of confidence intervals: rather than generating a new set of data for every tested value of \(\mu \), as in the profile construction, the data need to be generated only once.
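The following sketch estimates \(\text {E}[w_{\hat{\mu }}]\) in this way, reusing ell_p_gvm and fit_gvm from the sketch above; the gamma parameters \(\alpha _i=1/4\varepsilon _i^2\) are again an assumed convention, and per the definition above the Bartlett-corrected statistic is then \(w^*_\mu = w_\mu /\text {E}[w_\mu ]\).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=2)

def w_mu_gvm(mu0, y, u, v, sy2, eps):
    # Profile likelihood ratio for a tested value mu0
    xhat = fit_gvm(y, u, v, sy2, eps)
    res = minimize(lambda th: -ell_p_gvm(np.concatenate(([mu0], th)),
                                         y, u, v, sy2, eps),
                   xhat[1:], method="Nelder-Mead")
    return 2*(ell_p_gvm(xhat, y, u, v, sy2, eps) + res.fun)

def bartlett_mc(y, u, v, sy2, eps, ntoys=1000):
    # MC estimate of E[w_mu] with data generated at the MLEs (Eq. (74))
    xhat = fit_gvm(y, u, v, sy2, eps)
    muh, thh = xhat[0], xhat[1:]
    su2 = (v + 2*eps**2*(u - thh)**2)/(1 + 2*eps**2)   # sigma_u^2 at MLE, Eq. (70)
    alpha = 1/(4*eps**2)
    ws = []
    for _ in range(ntoys):
        y_t = rng.normal(muh + thh, np.sqrt(sy2))
        u_t = rng.normal(thh, np.sqrt(su2))
        v_t = rng.gamma(alpha, su2/alpha)              # E[v] = sigma_u^2
        ws.append(w_mu_gvm(muh, y_t, u_t, v_t, sy2, eps))
    return np.mean(ws)   # Bartlett factor: w*_mu = w_mu / E[w_mu]
```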

In certain scenarios, an analyst may wish to conduct inference on one or more nuisance parameters \(\theta _i\), for example to produce a ranking plot of the systematic uncertainties or to obtain the correlation matrix of their estimators. In such cases, the nuisance parameters must be treated as parameters of interest. According to the Lawley formula, the expected value of the likelihood ratio is then

$$\begin{aligned} \text {E}[w_{\mu ,\varvec{\theta }}] = 1+M-4\sum _i^M k^{\theta _i\theta _i}\frac{\varepsilon _i^2}{v_i}-\sum _i^M\left( k^{\theta _i\theta _i}\right) ^2\frac{\varepsilon _i^2}{v_i^2}+\sum _i^N\mathcal {O}(\varepsilon _i^4), \end{aligned}$$
(75)

where M is the number of nuisance parameters that have been promoted to parameters of interest. The term \(k^{\theta _i\theta _i}\) refers to the \(\theta _i\) component of the inverse of the expectation value of the Hessian matrix of the likelihood, as defined by the first term of Eq. (30), and is evaluated at \(\sigma ^2_{u_i}=v_i\).

7.1 Confidence regions

Fig. 6 Half-length of 1-\(\sigma \) confidence intervals for the parameter \(\mu \) (Eq. (69)) as a function of \(\varepsilon \) for \(\delta =0.5\) (top) and \(\delta =1.5\) (bottom), computed using \(w_\mu \) (blue), \(r^*_\mu \) (green) and \(w^*_\mu \) calculated with MC (orange). The black dots represent our most precise estimate of the interval, computed using the profile construction. The confidence interval obtained with \(r^*_\mu \) breaks down for large values of \(\varepsilon \) when there is significant tension in the dataset (bottom)

As in the previous examples, we compute confidence intervals for the parameter of interest \(\mu \) using the likelihood ratio and higher-order statistics, assuming their density functions are given by the asymptotic distributions. Specifically, consider an example similar to that of Sect. 6, namely the mean of two measurements, \(y_1=-\delta \) and \(y_2=+\delta \), here with associated statistical errors \(\sigma _1=\sigma _2=1/\sqrt{2}\) and \(\delta =0.5\) or 1.5. Additionally, we assume that the control measurements \(u_1\) and \(u_2\) have observed values of 0 and that the estimates of the systematic errors are \(1/\sqrt{2}\), or equivalently, that the estimated variances are \(v_1=v_2=1/2\). Both measurements are assumed to have equal error-on-error parameters, \(\varepsilon _1=\varepsilon _2=\varepsilon \), and we examine the results as a function of \(\varepsilon \).

Figure 6 shows the confidence interval for the parameter \(\mu \) found using the likelihood ratio \(w_\mu \), as well as from the higher-order statistics \(w_\mu ^*\) and \(r_\mu ^*\). The resulting confidence intervals are compared to what is found using the profile construction method, which is taken as the best available estimate of such intervals. Among the three methods, \(w_\mu ^*\) is the most accurate, almost perfectly overlapping with the numerical predictions obtained using the profile construction method. Moreover, \(w_\mu ^*\) is significantly faster to compute as it only requires data generation for \(\mu =\hat{\mu }\). In contrast, the profile construction method entails generating a new set of data for every tested value of \(\mu \).

The \(r_\mu ^*\) statistic, on the other hand, provides accurate predictions for internally consistent data (see the top plot of Fig. 6). However, for larger discrepancies between the measurements (bottom plot of Fig. 6), it is reliable only for small values of \(\varepsilon \). In this example, it starts deviating from the numerical prediction for \(\varepsilon \) above 0.3 and breaks down for \(\varepsilon \) exceeding 0.6, resulting in numerical instabilities. To assess the usability of \(r_\mu ^*\), one can check whether the logarithmic terms in the log-likelihood associated with the control measurements \(\varvec{u}\) fulfill the perturbative condition given by Eq. (41) for the endpoints of the confidence interval. For \(\delta =1.5\), this condition limits the applicability of \(r_\mu ^*\) to \(\varepsilon \simeq 0.3\), whereas, for \(\delta =0.5\), the threshold is higher, above 0.6.

Figure 7 illustrates the 2D confidence regions in the \((\mu ,\theta _1)\) plane. These confidence regions were computed by fixing \(\theta _2\) to its profiled value while treating \(\theta _1\) as a parameter of interest. A similar exercise could have been conducted for \(\mu \) and \(\theta _2\), or \(\theta _1\) and \(\theta _2\). The results shown in Fig. 7 were derived using the same measured data as in the previous example, with the error-on-error parameters \(\varepsilon _1\) and \(\varepsilon _2\) fixed at 0.5. The confidence regions were computed using the likelihood ratio \(w_{\mu ,\varvec{\theta }}\) and the Bartlett-corrected likelihood ratio \(w_{\mu ,\varvec{\theta }}^*\) obtained via Eq. (75), and were then compared with the confidence regions estimated using the profile construction technique. The results indicate that the Bartlett correction significantly improves the accuracy of the regions relative to the uncorrected likelihood ratio.
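A sketch of the corresponding region test, reusing the full-GVM functions from the sketches above: the scaling \(w^* = 2w/\text {E}[w]\), which restores the asymptotic mean of 2 for two parameters of interest, is our assumed generalization of the one-parameter Bartlett factor.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

def w_mu_theta1(mu, th1, y, u, v, sy2, eps):
    # Likelihood ratio with (mu, theta_1) as parameters of interest (N = 2),
    # profiling over theta_2
    res = minimize_scalar(lambda th2: -ell_p_gvm(np.array([mu, th1, th2]),
                                                 y, u, v, sy2, eps))
    lhat = ell_p_gvm(fit_gvm(y, u, v, sy2, eps), y, u, v, sy2, eps)
    return 2*(lhat + res.fun)

def in_region(mu, th1, E_w, data, cl=0.683):
    # Bartlett-scaled ratio compared to the chi-square quantile for 2 dof;
    # E_w from Eq. (75) (or from MC)
    return w_mu_theta1(mu, th1, *data)*2/E_w <= stats.chi2.ppf(cl, df=2)
```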

Fig. 7 \(68.3\%\) confidence regions in the \((\mu , \theta _1)\) plane for \(\delta =0.5\) (top) and \(\delta =1.5\) (bottom), computed using the likelihood ratio (blue) and the Bartlett correction calculated with the Lawley formula (orange). The black dots represent the most precise estimate of the region, computed using the profile construction. The error-on-error parameters \(\varepsilon _1\) and \(\varepsilon _2\) are fixed to 0.5

7.2 Goodness of fit

The goodness-of-fit statistic for the Gamma Variance Model can be defined using the same approach as used for the simple-average model in Sect. 6.2, leading to

$$\begin{aligned} q = -2\log L(\hat{\mu },\hat{\varvec{\theta }},\widehat{\widehat{\varvec{\sigma _u^2}}}) = \sum _{i=1}^N \left[ \frac{(y_i-\hat{\mu }-\hat{\theta }_i)^2}{\sigma _{y_{i}}^2}+\left( 1+\frac{1}{2\varepsilon _i^2}\right) \log \left( 1+2\varepsilon _i^2\frac{(u_i-\hat{\theta }_i)^2}{v_i} \right) \right] . \end{aligned}$$
(76)

In this case, however, constructing a saturated model is not useful because the Lawley formula gives \(b=0\) at order \(\varepsilon _i^2\). Nonetheless, the Bartlett correction can still be computed using MC. This method allows for a significant computational improvement over generating the exact distribution of q using pseudo-experiments to estimate its p value. The latter approach would require roughly on the order of \(10^5\) simulated experiments to accurately capture a \(4\sigma \) effect, while the expectation value of q can be estimated with a precision of several percent using only \(\mathcal {O}(10^3)\) pseudo-experiments.
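A sketch of this numerical Bartlett correction for the goodness of fit, reusing ell_p_gvm and fit_gvm from Sect. 7 (note that q of Eq. (76) is simply \(-2\ell _p\) evaluated at the MLE); the \(\chi ^2_{N-1}\) reference distribution is the \(\varepsilon \rightarrow 0\) limit, and the scaling \(q(N-1)/\text {E}[q]\) is our assumed multi-dof Bartlett factor.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)

def gof_bartlett(y, u, v, sy2, eps, ntoys=1000):
    # Bartlett-corrected goodness-of-fit p value for the full GVM
    N = len(y)
    xhat = fit_gvm(y, u, v, sy2, eps)
    q_obs = -2*ell_p_gvm(xhat, y, u, v, sy2, eps)      # q of Eq. (76)
    muh, thh = xhat[0], xhat[1:]
    su2 = (v + 2*eps**2*(u - thh)**2)/(1 + 2*eps**2)   # Eq. (70) at the MLE
    alpha = 1/(4*eps**2)
    qs = []
    for _ in range(ntoys):
        y_t = rng.normal(muh + thh, np.sqrt(sy2))
        u_t = rng.normal(thh, np.sqrt(su2))
        v_t = rng.gamma(alpha, su2/alpha)
        x_t = fit_gvm(y_t, u_t, v_t, sy2, eps)
        qs.append(-2*ell_p_gvm(x_t, y_t, u_t, v_t, sy2, eps))
    q_star = q_obs*(N - 1)/np.mean(qs)                 # scale so E[q*] = N - 1
    return stats.chi2.sf(q_star, df=N - 1)
```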

Fig. 8 Significance of discrepancy as defined by Eq. (68), plotted as a function of \(\varepsilon \). Higher significance values indicate greater tension in the data. The significance was computed using the goodness-of-fit statistic (blue) defined by Eq. (60) and its Bartlett-corrected counterpart calculated numerically (orange). The black dots represent the significance computed by generating the distribution of q with MC. The plot illustrates that the Bartlett correction enables an accurate approximation of the numerical predictions

To illustrate these techniques, we compute the significance corresponding to the p value for an average of two incompatible measurements using Eq. (69). The observed values of \(y_1\) and \(y_2\) are taken to be \(-3\) and 3, whereas the control measurements \(u_1\) and \(u_2\) are set to 0. The statistical uncertainties \(\sigma _1\) and \(\sigma _2\) are set to \(1/\sqrt{2}\), as are the estimates of the systematic errors (equivalent to setting \(v_1=v_2=1/2\)). Figure 8 compares, as a function of \(\varepsilon \), the significance computed using the goodness-of-fit statistic q and the Bartlett-corrected \(q^*\) with the significance obtained by generating the distribution of q with MC. Consistent with our earlier findings, the Bartlett correction is found to yield highly accurate predictions.
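In code, the setup of this example reads as follows (a sketch reusing gof_bartlett and significance from the sketches above; the grid of \(\varepsilon \) values is arbitrary):

```python
import numpy as np

y, u = np.array([-3.0, 3.0]), np.zeros(2)
v, sy2 = np.full(2, 0.5), np.full(2, 0.5)
for e in (0.1, 0.3, 0.5):
    eps = np.full(2, e)
    p = gof_bartlett(y, u, v, sy2, eps)
    print(f"eps = {e:.1f}:  Z = {significance(p):.2f}")
```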

8 Conclusions

We have demonstrated the efficacy of higher-order asymptotics in the computation of confidence intervals and p values within the framework of the Gamma Variance Model, a specialized statistical model designed to address uncertainties in parameters that themselves represent uncertainties. The methods studied in this paper hold particular relevance when the GVM’s fixed parameters \(\varepsilon \), indicative of the relative uncertainties in estimates of standard deviations for Gaussian-distributed measurements, are not negligible. In such scenarios, standard asymptotic methods are unable to provide accurate confidence intervals or p values.

Our investigation specifically focused on the \(r^*\) statistic and the Bartlett correction, both of which are higher-order asymptotic techniques that offer adjustments to the first-order (profile) likelihood ratio and likelihood root test statistics. These adjustments enable the test statistics to be more accurately approximated by their asymptotic distributions, even when the \(\varepsilon \) parameter is large.

Both the Barndorff-Nielsen \(r^*\) statistic and the Bartlett corrected likelihood ratio demonstrated their value as tools to enhance the accuracy and reliability of confidence interval and p value calculations using Gamma Variance Models. However, it should be noted that the \(r^*\) statistic exhibited instabilities in the presence of internally incompatible data for large values of \(\varepsilon \). Additionally, while \(r^*\) can be computed analytically for all the examples examined in this paper, the expressions become complicated for models associated with realistic applications, such as the simple-average and full GVM models.

Conversely, the Bartlett correction, calculated using the Lawley formula (32), offers a more compact route: the formula gives the expectation value of the likelihood ratio, from which the Bartlett correction factor is obtained. Specifically, see Eq. (58) for the simple-average model (Sect. 6) and Eqs. (73) and (75) for combinations using the full GVM (Sect. 7.1). In addition, when necessary, the Bartlett correction can be estimated numerically in a straightforward way, as shown in Sect. 7.

The Bartlett correction also proved to be an effective technique for improving the goodness-of-fit statistic. This was true for cases where the statistic could be computed analytically using the Lawley formula, such as the simple-average model (see Eq. (65) in Sect. 6.2), as well as for cases where it was estimated using Monte Carlo methods, as in the full Gamma Variance Model (see Sect. 7.2). The application of the Bartlett correction in the latter scenario significantly reduced the number of pseudo-experiments required for accurately estimating the significance of rare effects.

Overall, for improving both the applicability and precision of the GVM, the Bartlett correction has proven to be the more reliable and versatile of the two methods.

Furthermore, these findings highlight the potential of higher-order asymptotics to refine inference on the parameters of interest in various contexts, not only the GVM. Higher-order asymptotics are valuable tools when the MLEs of statistical models do not follow Gaussian distributions or, equivalently, when log-likelihoods are not well approximated by quadratic expressions. For Gamma Variance Models this occurs when \(\varepsilon \) is large; more generally, such deviations are typically associated with small experimental sample sizes. In Particle Physics, it is not uncommon to search for new signal processes by counting collision events with very specific characteristics, such that the expected number of background events may be of order unity. With sample sizes of this order, the asymptotic distributions are not expected to be accurate, and higher-order asymptotic formulae should prove valuable (see, e.g., Refs. [7, 25]).

The introduction of higher-order asymptotic corrections removes a potential stumbling block for use of the Gamma Variance Model. As many estimates of systematic uncertainties may themselves be uncertain at the level of 20–50% or more, one would not expect asymptotic confidence intervals or p values to be accurate. By using higher-order corrections, accurate results can be achieved with minimal or no Monte Carlo simulation, greatly simplifying use of the model.