# Modeling theoretical uncertainties in phenomenological analyses for particle physics


## Abstract

The determination of the fundamental parameters of the Standard Model (and its extensions) is often limited by the presence of statistical and theoretical uncertainties. We present several models for the latter uncertainties (random, nuisance, external) in the frequentist framework, and we derive the corresponding *p* values. In the case of the nuisance approach where theoretical uncertainties are modeled as biases, we highlight the important, but arbitrary, issue of the range of variation chosen for the bias parameters. We introduce the concept of adaptive *p* value, which is obtained by adjusting the range of variation for the bias according to the significance considered, and which allows us to tackle metrology and exclusion tests with a single and well-defined unified tool, which exhibits interesting frequentist properties. We discuss how the determination of fundamental parameters is impacted by the model chosen for theoretical uncertainties, illustrating several issues with examples from quark flavor physics.

## 1 Introduction

In particle physics, an important part of the data analysis is devoted to the interpretation of the data with respect to the Standard Model (SM) or some of its extensions, with the aim of comparing different alternative models or determining the fundamental parameters of a given underlying theory [1, 2, 3]. In this activity, the role played by uncertainties is essential, since they constitute the limit for the accurate determination of these parameters, and they can prevent one from reaching a definite conclusion when comparing several alternative models. In some cases, these uncertainties are of statistical origin: they are related to the intrinsic variability of the phenomena observed, they decrease as the sample size increases and they can be modeled using random variables. A large part of the experimental uncertainties belong to this first category. However, another kind of uncertainty occurs when one wants to describe inherent limitations of the analysis process, for instance, uncertainties in the calibration or limits of the models used in the analysis. These uncertainties are very often encountered in theoretical computations, for instance when assessing the size of higher orders in perturbation theory or the validity of extrapolation formulas. Such uncertainties are often called “systematics”, but they should be distinguished from less dangerous sources of systematic uncertainties, usually of experimental origin, that roughly scale with the size of the statistical sample and may be reasonably modeled by random variables [4]. In the following we will thus call them “theoretical” uncertainties: by construction, they lack both an unambiguous definition (leading to various recipes to determine these uncertainties) and a clear interpretation (beyond the fact that they are not of statistical origin). It is thus a complicated issue to incorporate their effect properly, even in simple situations often encountered in particle physics [5, 6, 7].^{1}

The relative importance of statistical and theoretical uncertainties might be different depending on the problem considered, and on the progress made both by experimentalists and theorists. For instance, statistical uncertainties are the main issue in the analysis of electroweak precision observables [11, 12]. On the other hand, in the field of quark flavor physics, theoretical uncertainties play a very important role. Thanks to the *B*-factories and LHCb, many hadronic processes have been very accurately measured [13, 14], which can provide stringent constraints on the Cabibbo–Kobayashi–Maskawa matrix (in the Standard Model) [15, 16, 17], and on the scale and structure of New Physics (in SM extensions) [18, 19, 20, 21]. However, the translation between hadronic processes and quark-level transitions requires information on hadronization from the strong interaction, encoded in decay constants, form factors, bag parameters... The latter are determined through lattice QCD simulations. The remarkable progress in computing power and in algorithms over the last 20 years has led to a decrease of statistical uncertainties and a dominance of purely theoretical uncertainties (chiral and heavy-quark extrapolations, scale chosen to set the lattice spacing, finite-volume effects, continuum limit...). As an illustration, the determination of the Wolfenstein parameters of the CKM matrix involves many constraints which are now limited by theoretical uncertainties (neutral-meson mixing, leptonic and semileptonic decays, ...) [22].

The purpose of this note is to discuss theoretical uncertainties in more detail in the context of particle physics phenomenology, comparing different models not only from a statistical point of view, but also in relation with the problems encountered in phenomenological analyses where they play a significant role. In Sect. 2, we summarize fundamental notions of statistics used in particle physics, in particular *p* values and test statistics. In Sect. 3, we list properties that we seek in a good approach for theoretical uncertainties. In Sect. 4, we propose several approaches and in Sect. 5, we compare their properties in the most simple one-dimensional case. In Sect. 6, we consider multi-dimensional cases (propagation of theoretical uncertainties, average of several measurements, fits and pulls), which we illustrate using flavor physics examples related to the determination of the CKM matrix in Sect. 7, before concluding. An appendix is devoted to several issues connected with the treatment of correlations.

## 2 Statistics concepts for particle physics

We start by briefly recalling frequentist concepts used in particle physics, highlighting the role played by *p* values in hypothesis testing and how they can be used to define confidence intervals.

### 2.1 *p* values

#### 2.1.1 Data fitting and data reduction

As a concrete example, consider the measurement of the time-dependent CP asymmetries in \(B^0(t)\rightarrow J/\psi K_S\) decays, based on a sample \(\{t_i\}\) of tagged *B*-meson events, where this sample is theoretically known to follow a PDF *f*. The PDF is parameterized in terms of a few physics parameters, among which we assume the ones of interest are the direct and mixing-induced CP asymmetries *C* and *S*. The functional form of this PDF is dictated on very general grounds by CPT invariance and the formalism of two-state mixing (see, e.g., [26]), and is independent of the particular underlying phenomenological model (e.g., the Standard Model of particle physics). In practice, however, detector effects have to be modeled by additional parameters that modify the shape of the PDF. We denote by \(\theta \) the set of parameters \(\theta =(C,S,\ldots )\) that are needed to specify the PDF completely. The likelihood for the sample \(\{t_i\}\) is defined by \( \mathcal L_{\{t_i\}}(\theta ) = \prod _{i=1}^n f(t_i;\theta ). \)

The latter choice of expressing the experimental likelihood in terms of model-dependent parameters such as \(\beta \) has, however, one technical drawback: the full statistical analysis has to be performed for each model one wants to investigate, e.g., the Standard Model, the Minimal Supersymmetric Standard Model, GUT models, ... In addition, building a statistical analysis directly on the initial likelihood requires one to deal with a very large parameter space, depending on the parameters in \(\theta \) that are needed to describe the detector response. One common solution to these technical difficulties is a two-step approach. In the first step, the data are *reduced* to a set of model- and detector-independent^{2} random variables that contains the same information as the original likelihood (to a good approximation): in our example the likelihood-based estimators \(\hat{C}\) and \(\hat{S}\) of the parameters *C* and *S* can play the role of such variables (estimators are functions of the data and thus are random variables). In a second step, one can work in a particular model, e.g., in the Standard Model, to use \(\hat{C}\) and \(\hat{S}\) as inputs to a statistical analysis of the parameter \(\beta \). This two-step procedure gives the same result as if the analysis were done in a single step through the expression of the original likelihood in terms of \(\beta \). This technique is usually chosen if the PDF *g* of the estimators \(\hat{C}\) and \(\hat{S}\) can be parameterized in a simple way: for example, if the sample size is sufficiently large, then the PDF can often be modeled by a multivariate normal distribution, where the covariance matrix is approximately independent of the mean vector.
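The two-step reduction described above can be sketched numerically. The toy below (a minimal illustration, not the analysis of the text: a plain Gaussian sample stands in for the event sample, and the sample mean plays the role of the likelihood-based estimator) checks that the estimator is itself a random variable, approximately Gaussian, centered on the true value and with a width shrinking as \(1/\sqrt{n}\):

```python
import random
import statistics

random.seed(2)

MU_TRUE, SIGMA, N_EVENTS, N_REPLICAS = 0.7, 1.0, 400, 3000

# Step 1: "reduce" each simulated event sample to a single estimator
# (for a Gaussian model the maximum-likelihood estimator of the mean
# is the sample mean).
estimates = []
for _ in range(N_REPLICAS):
    sample = [random.gauss(MU_TRUE, SIGMA) for _ in range(N_EVENTS)]
    estimates.append(statistics.fmean(sample))

# Step 2: the estimator behaves as a Gaussian random variable centered
# on the true value, with uncertainty sigma / sqrt(n) = 0.05 here.
mean_hat = statistics.fmean(estimates)
sd_hat = statistics.stdev(estimates)
print(mean_hat, sd_hat)  # ~0.7 and ~0.05
```

In the second step of the text, this reduced variable (central value and width) would be the input of the model-dependent statistical analysis.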

Let us now extend the above discussion to a more general case. A sample of random events is \(\{E_i,i=1\ldots n\}\), where each event corresponds to a set of directly measurable quantities (particle energies and momenta, interaction vertices, decay times...). The distribution of these events is described by a PDF, the functional form *f* of which is supposed to be known. In addition to the event value *E*, the PDF value depends on some fixed parameters \({\theta }\), hence the notation \(f(E;\theta )\). The likelihood for the sample \(\{E_i\}\) is defined by \( \mathcal L_{\{E_i\}}(\theta ) = \prod _{i=1}^n f(E_i;\theta ). \) We want to interpret the event observation in a given phenomenological scenario that predicts at least some of the parameters \(\theta \) describing the PDF in terms of a set of more fundamental parameters \(\chi \).

To this aim we first reduce the event observation to a set of model- and detector-independent random variables *X* together with a PDF \(g(X;\chi )\), in such a way that the information that one can get on \(\chi \) from *g* is equivalent to the information one can get from *f*, once \(\theta \) is expressed in terms of \(\chi \) consistently with the phenomenological model of interest. Technically, it amounts to identifying a minimal set of variables *x* depending on \(\theta \) that are independent of both the experimental context and the phenomenological model. One performs an analysis on the sample of events \({E_i}\) to derive estimators \(\hat{x}\) for *x*. The distribution of these estimators can be described in terms of a PDF that is written in the \(\chi \) parametrization as \(g(X;\chi )\), where we have replaced \(\hat{x}\) by the notation *X*, to stress that in the following *X* will be considered as a new random variable, setting aside how it has been constructed from the original data \(\{E_i\}\). Obviously, in our previous example for \(B^0(t)\)\(\rightarrow J/\psi K_S\), \(\{t_i\}\) correspond to \(\{E_i\}\), *C* and *S* to *x*, and \(\beta \) to \(\chi \).

#### 2.1.2 Model fitting

From now on, we assume that the data have been reduced to a single observable *x*, with associated random variable *X*, and an associated PDF \(g(X;\chi )\) depending on purely theoretical parameters \(\chi \). With a slight abuse of notation we include in the symbol *g* not only the functional form, but also all the needed parameters that are kept fixed and independent of \(\chi \). In particular, for a one-dimensional Gaussian PDF we have

$$\begin{aligned} g(X;\chi ) = \frac{1}{\sqrt{2\pi }\,\sigma }\exp \left[ -\frac{(X-x(\chi ))^2}{2\sigma ^2}\right] , \end{aligned}$$

where *X* is a potential value of the observable *x* and \(x(\chi )\) corresponds to the theoretical prediction of *x* given \(\chi \). This PDF is obtained from the outcome of an experimental analysis yielding both a central value \(X_0\) and an uncertainty \(\sigma \), where \(\sigma \) is assumed to be independent of the realization \(X_0\) of the observable *x* and is thus included in the definition of *g*.

Our goal is to interpret the measured value \(X_0\) in terms of the fundamental parameters \(\chi \) through the prediction \(x(\chi )\) of the observable *x*. One very general way to perform this task is *hypothesis testing*, where one wants to quantify how much the data are compatible with the null hypothesis that the true value of \(\chi \), \(\chi _t\), is equal to some fixed value \(\chi \):

$$\begin{aligned} {\mathcal H}_\chi : \chi _t=\chi . \end{aligned}$$(3)

To quantify the compatibility of the data *X* under the null hypothesis \({\mathcal H}_\chi \), one defines a *test statistic* \(T(X;\chi )\), that is, a scalar function of the data *X* that measures whether the data are in favor of the null hypothesis or not. We indicate the dependence of *T* on \(\chi \) explicitly, i.e., the dependence on the null hypothesis \({\mathcal H}_\chi \). The test statistic is generally a nonnegative function chosen in such a way that large values indicate that the data present evidence against the null hypothesis. By comparing the actual data value \(t=T(X_0;\chi )\) with the sampling distribution of \(T=T(X;\chi )\) under the null hypothesis, one is able to quantify the degree of agreement of the data with the null hypothesis.

This comparison is usually quantified by the *p* value. One calculates the probability to obtain a value for the test statistic at least as large as the one that was actually observed, assuming that the null hypothesis is true. This tail probability is used to define the *p* value of the test for this particular observation:

$$\begin{aligned} p(X_0;\chi ) = \int _{T(X_0;\chi )}^{\infty } \mathrm{d}T'\, h(T'|\chi ), \end{aligned}$$(4)

where the distribution *h* of the test statistic is obtained from the \(\mathrm{PDF}\) *g* of the data as

$$\begin{aligned} h(T'|\chi ) = \int \mathrm{d}X\, g(X;\chi )\, \delta \big (T'-T(X;\chi )\big ). \end{aligned}$$(5)

(If the distribution of *T* contains discrete components, *h* can be defined in the sense of distributions, by identifying its action on a test function with the convolution of the r.h.s. of (5) with the same test function.) A small value of the *p* value means that \(T(X_0;\chi )\) belongs to the “large” region, and thus provides evidence against the null hypothesis. This is illustrated for a simple example in Figs. 1 and 2.

Note that the *p* value in Eq. (4) is defined as a function of \(X_0\) and, as such, is a random variable.

Through the simple change of variable \(\frac{\mathrm{d}p}{\mathrm{d}T}\frac{\mathrm{d}\mathcal P}{\mathrm{d}p}=\frac{\mathrm{d}\mathcal P}{\mathrm{d}T}\), one obtains that *the null distribution* (that is, the distribution when the null hypothesis is true) *of a p value is uniform*, i.e., the distribution of values of the *p* value is flat between 0 and 1. This uniformity is a fundamental property of *p* values that is at the core of their various interpretations (hypothesis comparison, determination of confidence intervals...) [1, 2].
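This uniformity is easy to verify numerically. The sketch below (a minimal illustration, not code from the text, assuming a Gaussian observable and a quadratic test statistic whose tail probability is known in closed form) samples X under the null hypothesis and checks that the resulting p values are flat on [0, 1]:

```python
import math
import random

random.seed(1)

MU, SIGMA, N = 0.0, 1.0, 50000

def p_value(x0, mu=MU, sigma=SIGMA):
    # Tail probability of T = ((X - mu)/sigma)^2 for Gaussian X:
    # p = 1 - Erf(|x0 - mu| / (sqrt(2) sigma)).
    return 1.0 - math.erf(abs(x0 - mu) / (math.sqrt(2.0) * sigma))

p_values = [p_value(random.gauss(MU, SIGMA)) for _ in range(N)]

# Under the null hypothesis the p value is uniform on [0, 1]:
mean_p = sum(p_values) / N
frac_below = sum(p <= 0.32 for p in p_values) / N
print(mean_p, frac_below)  # ~0.5 and ~0.32
```

The fraction of p values below any threshold α reproduces α itself, which is exactly the coverage property used in the rest of the section.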

In the frequentist approach, one wants to design a procedure to decide whether to accept or reject the null hypothesis \({\mathcal H}_\chi \), by avoiding as much as possible either incorrectly rejecting the null hypothesis (Type-I error) or incorrectly accepting it (Type-II error). The standard frequentist procedure consists in selecting a Type-I error \(\alpha \) and determining a region of sample space that has the probability \(\alpha \) of containing the data under the null hypothesis. If the data fall in this critical region, the hypothesis is rejected. This must be performed before data are known (in contrast to other interpretations, e.g., Fisher’s approach of significance testing [1]). In the simplest case, the critical region is defined by a condition of the form \(T\ge t_\alpha \), where \(t_\alpha \) is a function of \(\alpha \) only, which can be rephrased in terms of the *p* value as \(p\le \alpha \). The interest of the frequentist approach depends therefore on the ability to design *p* values assessing the rate of Type-I error correctly (its understatement is clearly not desirable, but its overstatement often yields a reduction in the ability to determine the truth of an alternative hypothesis), as well as avoiding too large a Type-II error rate.

A major difficulty arises when the hypothesis to be tested is *composite*. In the case of numerical hypotheses like (3), one gets compositeness when one is only interested in a subset \(\mu \) of the parameters \(\chi \). The remaining parameters are called *nuisance parameters*^{3} and will be denoted by \(\nu \), thus \(\chi =(\mu ,\nu )\). In this case the hypothesis \({\mathcal H}_\mu : \mu _t=\mu \) is composite, because determining the distribution of the observables requires the knowledge of the true value \(\nu _t\) in addition to \(\mu \). In this situation, one has to devise a procedure to infer a “*p* value” for \({\mathcal H}_\mu \) out of *p* values built for the simple hypotheses where both \(\mu \) and \(\nu \) are fixed. Therefore, in contrast to a simple hypothesis, a composite hypothesis does not allow one to compute the distribution of the data.^{4}

There is no guarantee that the distribution of such a *p* value for \({\mathcal H}_\mu \) is uniform, and one may get different situations: one would like to build an exact *p* value (exact coverage) or, if this is not possible, a (reasonably) conservative one (overcoverage). Such *p* values will be called “valid” *p* values. In the case of composite hypotheses, the conservative or liberal nature of a *p* value may depend not only on \(\alpha \), but also on the structure of the problem and of the procedure used to construct the *p* value, and it has to be checked explicitly [1, 2].

Once *p* values are defined, one can build confidence intervals out of them by using the correspondence between acceptance regions of tests and confidence sets. Indeed, if we have an exact *p* value, and the critical region \(C_\alpha (X)\) is defined as the region where \(p(X;\mu )<\alpha \), the complement of this region turns out to be a confidence set of level \(1-\alpha \), i.e., \(P[\mu \notin C_\alpha (X)]= 1-\alpha \). This justifies the common practice of plotting the *p* value as a function of \(\mu \), and reading the 68 or 95% CL intervals by looking at the ranges where the *p* value curve is above 0.32 or 0.05. This is illustrated for a simple example in Figs. 2 and 3. Once again, this discussion is affected by issues of compositeness and nuisance parameters, as well as the requirement of checking the coverage of the *p* value used to define these confidence intervals: an overcovering *p* value will yield too large confidence intervals, which will indeed prove conservative.
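Reading a confidence interval off the p value curve can be sketched as follows (a minimal numerical illustration with an assumed Gaussian measurement, X0 = 1.2 and σ = 0.4; the 1σ threshold is 1 − Erf(1/√2) ≈ 0.3173):

```python
import math

X0, SIGMA = 1.2, 0.4  # assumed measurement (illustrative values)
ALPHA_1SIGMA = 1.0 - math.erf(1.0 / math.sqrt(2.0))  # ~0.3173

def p_value(mu):
    # p value of the hypothesis mu_t = mu for X ~ N(mu, SIGMA)
    return 1.0 - math.erf(abs(X0 - mu) / (math.sqrt(2.0) * SIGMA))

# scan mu and keep the region where the p value stays above alpha
grid = [i * 1e-3 for i in range(-1000, 4001)]
accepted = [mu for mu in grid if p_value(mu) >= ALPHA_1SIGMA]
lo, hi = min(accepted), max(accepted)
print(lo, hi)  # ~0.8 and ~1.6, i.e. X0 +/- sigma
```

As expected, the region where p ≥ 0.3173 is exactly the familiar X0 ± 1σ interval; scanning with the 0.05 threshold instead would return X0 ± 1.96σ.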

A few words about the notation and the vocabulary are in order at this stage. A *p* value necessarily refers to a null hypothesis, and when the null hypothesis is purely numerical such as (3) we can consider the *p* value as a mathematical function of the fundamental parameter \(\mu \). This of course does not imply that \(\mu \) is a random variable (in frequentist statistics, it is always a fixed, but unknown, number). When the *p* value as a function of \(\mu \) can be described in a simple way by a few parameters, we will often use the notation \(\mu =\mu _0\pm \sigma _\mu \). In this case, one can easily build the *p* value and derive any desired confidence interval. Even though this notation is similar to the measurement of an observable, we stress that this does not mean that the fundamental parameter \(\mu \) is a random variable, and it should not be seen as the definition of a PDF. In line with this discussion, we will call *uncertainties* the parameters like \(\sigma \) that can be given a frequentist meaning, e.g., they can be used to define the PDF of a random variable. On the other hand, we will call *errors* the intermediate quantities such as \(\sigma _\mu \) that can be used to describe the *p* value of a fundamental parameter, but cannot be given a statistical meaning for this parameter.

### 2.2 Likelihood-ratio test statistic

A classic choice is the maximum likelihood ratio (MLR) test statistic,^{5}

$$\begin{aligned} T(X;\mu ) = -2\ln \frac{\max _{\nu }{\mathcal L}_X(\mu ,\nu )}{\max _{\mu ',\nu '}{\mathcal L}_X(\mu ',\nu ')}. \end{aligned}$$(11)

Even though *T* is constructed not to depend on the nuisance parameters \(\nu \) explicitly, its distribution Eq. (5) a priori depends on them (through the PDF *g*). Even though the Neyman–Pearson lemma does not apply here, there is empirical evidence that this test is powerful, and in some cases it exhibits good asymptotic properties (easy computation and distribution independent of nuisance parameters) [1, 2].

For the problems considered here, the MLR choice features alluring properties, and in the following we will use test statistics that are derived from this choice. First, if \(g(X;\chi _t)\) is a multi-dimensional Gaussian function, then the quantity \(-2\ln {\mathcal L}_X(\chi _t)\) is the sum of the squares of standard normal random variables, i.e., is distributed as a \(\chi ^2\) with a number of degrees of freedom (\(N_\mathrm{dof}\)) that is given by \(\mathrm{dim}(X)\). Secondly, for linear models, in which the observables *X* depend linearly on the parameters \(\chi _t\), the MLR Eq. (11) is again a sum of squares of standard normal random variables, and is distributed as a \(\chi ^2\) with \(N_\mathrm{dof}=\mathrm{dim}(\mu )\). Wilks’ theorem [28] states that this property can be extended to non-Gaussian cases in the asymptotic limit: under regularity conditions and when the sample size tends to infinity, the distribution of Eq. (11) will converge to the same \(\chi ^2\) distribution depending only on the number of parameters tested.
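This χ² behavior can be checked numerically. In the toy below (a minimal sketch, not from the text: a Gaussian model with known σ and no nuisance parameters), T = −2 ln[L(μ0)/L(μ̂)] reduces to n(x̄ − μ0)²/σ², which is distributed as a χ² with one degree of freedom under the null hypothesis:

```python
import random

random.seed(3)

MU0, SIGMA, N_EVENTS, N_REPLICAS = 0.0, 1.0, 20, 20000

def mlr(sample):
    # For a Gaussian likelihood with known sigma, the MLE of mu is the
    # sample mean, and -2 ln[L(mu0)/L(mu_hat)] = n (mean - mu0)^2 / sigma^2.
    mean = sum(sample) / len(sample)
    return len(sample) * (mean - MU0) ** 2 / SIGMA ** 2

ts = [mlr([random.gauss(MU0, SIGMA) for _ in range(N_EVENTS)])
      for _ in range(N_REPLICAS)]

# Compare tail fractions with the chi2(1 dof) expectation:
# P(T > 1) = 0.3173 and P(T > 3.84) = 0.05.
frac1 = sum(t > 1.0 for t in ts) / N_REPLICAS
frac384 = sum(t > 3.84 for t in ts) / N_REPLICAS
print(frac1, frac384)
```

For this linear Gaussian model the χ² law is exact at any sample size; Wilks' theorem is what extends it asymptotically to non-Gaussian likelihoods.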


## 3 Comparing approaches to theoretical uncertainties

**Table 1** Summary table of various approaches to theoretical uncertainties considered in the text

| Approach | Random-\(\delta \) | Nuisance-\(\delta \) | External-\(\delta \) |
|---|---|---|---|
| Hypothesis | Random var. \(\mathrm{PDF}_\varDelta (\delta )\) | Composite hyp. \({\mathcal H}_\mu :\mu _t=\mu \) | Family of simple hyp. \({\mathcal H}^{(\delta )}_\mu :\mu _t=\mu +\delta \) |
| Test | Likelihood ratio | Quadratic | Quadratic |
| Constraint on \(\delta \) | – | \(\varOmega \) | \(\varOmega \) |
| Associativity | Yes if normal \(\mathrm{PDF}\) | Yes if \(\varOmega \) hyperball | Yes if \(\varOmega \) hyperball |
| Splitting of errors | Yes if normal \(\mathrm{PDF}\) | Yes for all \(\varOmega \) | Yes for all \(\varOmega \) |
| Stationarity | Yes | Yes | Yes |
| Simple asympt. lim. | Yes if normal \(\mathrm{PDF}\) | Yes | Yes |
| Simple \(\sigma \rightarrow 0\) limit | Depends on \(\mathrm{PDF}\) | \(\varOmega \) | \(\varOmega \) |
| Particular cases | Naive Gaussian | Fixed/adaptive nuis. | Scan |
| If we take | Normal PDF | Fixed/adaptive \(\varOmega \) | Sup over fixed \(\varOmega \) |

By definition, a theoretical uncertainty cannot be rigorously *modeled* (except in the somewhat academic case where a bound on the difference between the exact value and the approximately computed one can be proven). The choice of a model for theoretical uncertainties involves not only the study of its mathematical properties and its physical implications in specific cases, but also some personal taste. One can indeed imagine several ways of modeling/treating theoretical uncertainties:

- one can (contrary to what has just been said) treat the theoretical uncertainty on the same footing as a statistical uncertainty; in this case, in order to follow a meaningful frequentist procedure, one has to assume that one lives in a world where the repeated calculation of a given quantity leads to a distribution of values around the exact one, with some variability that can be modeled as a PDF (“random-\(\delta \) approach”);
- one can consider that theoretical uncertainties can be modeled as external parameters, and perform a purely statistical analysis for each point in the theoretical uncertainty parameter space; this leads to an infinite collection of *p* values that will have to be combined in some arbitrary way, following a model-averaging procedure (“external-\(\delta \) approach”);
- one can take the theoretical uncertainties as fixed asymptotic biases,^{7} treating them as nuisance parameters that have to be varied in a reasonable region (“nuisance-\(\delta \) approach”).

Whatever the model adopted, one would like the corresponding approach to be:

- as general as possible, i.e., applicable to as many “kinds” of theoretical uncertainties as possible (lattice uncertainties, scale uncertainties) and as many types of physical models as possible;
- leading to meaningful confidence intervals in reasonable limit cases: obviously, in the absence of theoretical uncertainties, one must recover the standard result; one may also consider the type of constraint obtained in the absence of statistical uncertainties;
- exhibiting good coverage properties, as coverage benchmarks the quality of the statistical approach: the comparison of different models provides interesting information but does not shed light on their respective coverage;
- associated with a statistically meaningful goodness-of-fit;
- featuring reasonable asymptotic properties (large samples);
- yielding the errors as a function of the estimates easily (error propagation), in particular by disentangling the impact of theoretical and statistical contributions;
- leading to a reasonable procedure to average independent estimates – if possible, it should be equivalent for any analysis to include the independent estimates separately or the average alone (associativity). In addition, one may wonder whether the averaging procedure should be conservative or aggressive (i.e., whether the average of similar theoretical uncertainties should have a smaller uncertainty or not), and whether the procedure should be stationary (i.e., whether the uncertainty of an average should be independent of the central values);
- leading to reasonable results in the case of averages of inconsistent measurements.

We summarize some of the points mentioned above in Table 1. As will be seen, however, it proves challenging to fulfill all these criteria at the same time, and we will have to make compromises along the way.

## 4 Illustration of the approaches in the one-dimensional case

### 4.1 Situation of the problem

We consider the simplest case: the measurement \(X_0\) of a single observable *x*, affected by a statistical uncertainty \(\sigma \) and a theoretical uncertainty \(\varDelta \) modeled through a fixed but unknown bias \(\delta \).^{8} In the absence of theoretical uncertainty, one can compute the *p* value easily from Eq. (4); in the opposite limit \(\sigma \rightarrow 0\), the measurement *X* reduces to \(\mu +\delta \). The challenge is to extract some information on \(\mu \), given the fact that the value of \(\delta \) remains unknown.

Whatever the approach, the analysis follows the same steps:

1. Take a model corresponding to the interpretation of \(\delta \): *random variable, external parameter, fixed bias as a nuisance parameter...*
2. Choose a test statistic \(T(X;\mu )\) that is consistent with the model and that discriminates the null hypothesis: *Rfit, quadratic, other...*
3. Compute, consistently with the model, the *p* value, which is in general a function of \(\mu \) and \(\delta \).
4. Eliminate the dependence with respect to \(\delta \) by some well-defined procedure.
5. Exploit the resulting *p* value (coverage, confidence intervals, goodness-of-fit).

### 4.2 The random-\(\delta \) approach

In the random-\(\delta \) approach, \(\delta \) would be related to the variability of theoretical computations, which one can model with some PDF for \(\delta \), such as \({\mathcal N}_{(0,\varDelta )}\) (normal) or \({\mathcal U}_{(-\varDelta ,+\varDelta )}\) (uniform). The natural candidate for the test statistic \(T(X;\mu )\) is the MLR built from the PDF. One considers a model where \(X=s+\delta \) is the sum of two random variables, *s* being distributed as a Gaussian of mean \(\mu \) and width \(\sigma \), and \(\delta \) as an additional random variable with a distribution depending on \(\varDelta \).

If \(\delta \) is taken to be normally distributed, \(\delta \sim {\mathcal N}_{(0,\varDelta )}\), the PDF of *X* is then the convolution of two Gaussian PDFs, leading to \(X\sim {\mathcal N}_{(\mu ,\sqrt{\sigma ^2+\varDelta ^2})}\) and to the *p* value that would be obtained when the two uncertainties are added in quadrature, \(p(X_0;\mu ) = 1-\mathrm{Erf}\big (|X_0-\mu |/\sqrt{2(\sigma ^2+\varDelta ^2)}\big )\). However, the variability of a theoretical computation is not statistical in nature,^{9} and there is no strong argument that would help to choose the associated PDF (for instance, \(\delta \) could be a variable uniformly distributed over \([-\varDelta ,\varDelta ]\)). Moreover, for a general PDF, the *p* value has no simple analytic formula and it must be computed numerically from Eq. (4). In the following, we will only consider the case of a Gaussian PDF when we discuss the random-\(\delta \) approach.
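In the Gaussian case the random-δ prescription can be checked numerically (a minimal sketch with assumed values σ = 0.5 and Δ = 1.2): the convolution yields a Gaussian of width √(σ² + Δ²), and the corresponding p value is the naive-Gaussian one with the two uncertainties added in quadrature.

```python
import math
import random

random.seed(4)

MU, SIGMA, DELTA_TH, N = 0.0, 0.5, 1.2, 60000

# X = s + delta with s ~ N(mu, sigma) and delta ~ N(0, Delta)
xs = [random.gauss(MU, SIGMA) + random.gauss(0.0, DELTA_TH) for _ in range(N)]

mean_x = sum(xs) / N
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (N - 1))
print(sd_x)  # ~ sqrt(sigma^2 + Delta^2) = 1.3

def p_nG(x0, mu):
    # naive-Gaussian p value: uncertainties added in quadrature
    s = math.sqrt(SIGMA ** 2 + DELTA_TH ** 2)
    return 1.0 - math.erf(abs(x0 - mu) / (math.sqrt(2.0) * s))

# a one-combined-sigma deviation gives the usual p ~ 0.3173
print(p_nG(1.3, 0.0))
```

With a uniform PDF for δ instead, the convolution has no such closed form and the p value would have to be obtained by numerical integration of Eq. (4).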

### 4.3 The nuisance-\(\delta \) approach

In the nuisance approach, \(\delta \) is not interpreted as a random variable but as a fixed parameter, so that in the limit of an infinite sample size, the estimator does not converge to the true value \(\mu _t\), but to \(\mu _t+\delta \). The distinction between statistical and theoretical uncertainties is thus related to their effect as the sample size increases: statistical uncertainties decrease, while theoretical uncertainties remain of the same size (see Refs. [29, 30, 31] for other illustrations in the context of particle physics). One works with the null hypothesis \(\mathcal {H}_{\mu }: \mu _t=\mu \), and one has then to determine which test statistic is to be built.

We assume that the PDF of *X* is normal, with mean \(\mu +\delta \) and variance \(\sigma ^2\). The choice of the test statistic then shapes the *p* values and the resulting statistical outcomes. Indeed, with this PDF for the nuisance-\(\delta \) approach and a quadratic test statistic, *T* is distributed as a rescaled, non-central \(\chi ^2\) distribution with a non-centrality parameter \((\delta /\sigma )^2\) (this non-centrality parameter illustrates that the test statistic is centered around \(\mu \) whereas the distribution of *X* is centered around \(\mu +\delta \)). \(\delta \) is then a genuine asymptotic bias, implying inconsistency: in the limit of an infinite sample size, the estimator constructed from *T* is \(\mu \), whereas the true value is \(\mu +\delta \). Using the previous expressions, one can easily compute the cumulative distribution function \(\mathrm {CDF}_\delta \) of this test statistic, which depends explicitly on \(\delta \) (as stressed in Sect. 2.2, even though *T* is built to be independent of nuisance parameters, its PDF depends on them a priori).

Since the *p* value computed from \(\mathrm {CDF}_\delta \) depends on the unknown bias, to obtain a *p* value one can take the supremum value for \(\delta \) over some interval \(\varOmega \). This yields a *p* value for \(\mu \), from which one can infer confidence intervals for \(\mu \). This space cannot be the whole space (as one would get \(p=1\) trivially for all values of \(\mu \)), but there is no natural candidate (i.e., coming from the derivation of the test statistic). More specifically, should the interval \(\varOmega \) be kept fixed, or should it be rescaled when investigating confidence intervals at different levels (e.g., 68 vs. 95%)?

- If one wants to keep it fixed, \(\varOmega _r=r[-\varDelta ,\varDelta ]\):

  $$\begin{aligned} p_\mathrm{fixed\ \varOmega _r}=\mathrm{Max}_{\delta \in \varOmega _r}[1- \mathrm {CDF}_\delta (\mu )]. \end{aligned}$$(26)

  One may wonder what the best choice is for *r*, as the *p* value gets very large if one works with the reasonable \(r=3\), while the choice \(r=1\) may appear as non-conservative. We will call this treatment the *fixed r-nuisance* approach.
- One can then wonder whether one would like to let \(\varOmega \) depend on the value considered for *p*. In other words, if we are looking at a \(k\,\sigma \) range, we could consider the equivalent range for \(\delta \). This would correspond to

  $$\begin{aligned} p_\mathrm{adapt\ \varOmega }=\mathrm{Max}_{\delta \in \varOmega _{k_\sigma ( p)}}[1- \mathrm {CDF}_\delta (\mu )], \end{aligned}$$(27)

  where \(k_\sigma (p )\) is the “number of sigma” corresponding to *p*,

  $$\begin{aligned} k_\sigma (p )^2=\mathrm{Prob}^{-1}(p,N_\mathrm{dof}=1), \end{aligned}$$(28)

  where the function Prob has been defined in Eq. (12). We will call this treatment the *adaptive* nuisance approach. The correct interpretation of this *p* value is: *p* is a valid *p* value if the true (unknown) value of \(\delta /\varDelta \) belongs to the “would-be” \(1-p\) confidence interval around 0. This is not a standard coverage criterion: we will use the terms *adaptive coverage* and *adaptively valid p value* to name this new concept. Note that Eqs. (27), (28) constitute a non-algebraic implicit equation that has to be solved by numerical means.
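The two prescriptions can be sketched numerically as follows (a minimal illustration, assuming for concreteness the quadratic statistic T = (X − μ)²/σ², for which the non-central χ² tail 1 − CDF_δ is expressible with the normal CDF; the implicit adaptive equation, Eqs. (27)–(28), is solved here by damped fixed-point iteration):

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal

def one_minus_cdf(t, d):
    # 1 - CDF_delta(t) for T = ((X - mu)/sigma)^2 with X ~ N(mu + delta, sigma):
    # non-central chi2 tail (1 dof), d = delta/sigma
    s = t ** 0.5
    return (1.0 - PHI.cdf(s - d)) + PHI.cdf(-s - d)

def p_fixed(x0, mu, sigma, Delta, r, n_grid=400):
    # Eq. (26): supremum of the tail probability over delta in r*[-Delta, Delta]
    t = ((x0 - mu) / sigma) ** 2
    deltas = [r * Delta * (2.0 * i / n_grid - 1.0) for i in range(n_grid + 1)]
    return max(one_minus_cdf(t, d / sigma) for d in deltas)

def p_adaptive(x0, mu, sigma, Delta, iters=200):
    # Eqs. (27)-(28): the range r = k_sigma(p) depends on p itself;
    # solve the implicit equation by damped fixed-point iteration
    p = 0.5
    for _ in range(iters):
        k = PHI.inv_cdf(1.0 - p / 2.0)  # "number of sigma" for p
        p = 0.5 * (p + p_fixed(x0, mu, sigma, Delta, k))
    return p

# with no theoretical uncertainty both reduce to the Gaussian p value
print(p_fixed(1.0, 0.0, 1.0, 0.0, 3.0))   # ~0.3173
print(p_adaptive(1.0, 0.0, 1.0, 0.0))     # ~0.3173
```

Enlarging the range can only raise the supremum, so the fixed 3-nuisance p value is always at least as large (as conservative) as the fixed 1-nuisance one, which the grid search reproduces.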

The range of variation of the bias is thus adjusted to the significance level considered, rescaling the *p* values accordingly. In this sense, the adaptive approach provides a unified approach to deal with two different issues of importance, namely the metrology of parameters (at 1 or 2\(\sigma \)) and exclusion tests (at 3 or 5\(\sigma \)).

### 4.4 The external-\(\delta \) approach

In the external-\(\delta \) approach, \(\delta \) is interpreted as a fixed external parameter: for each value of \(\delta \), one builds a *p* value that explicitly depends on \(\delta \). If one takes \(X\sim \mathcal {N}_{(\mu +\delta ,\sigma )}\) and *T* quadratic [either \((X-\mu -\delta )^2/\sigma ^2\) or \((X-\mu -\delta )^2/(\sigma ^2+\varDelta ^2)\)], the corresponding *p* value is the standard Gaussian one, \(p(X_0;\mu ,\delta )=1-\mathrm{Erf}\big (|X_0-\mu -\delta |/(\sqrt{2}\,\sigma )\big )\).^{10} One is then left with an infinite family of *p* values instead of a single one related to the aimed constraint on \(\mu \).

A first possibility is to take the supremum of these *p* values over \(\delta \in \varOmega _r\), which we will call the *fixed r-external* approach. This is equivalent to the Rfit ansatz used by CKMfitter [15, 16] in the one-dimensional case (but not in higher dimensions), which was proposed to treat theoretical uncertainties in a different way from statistical uncertainties, treating all values within \([-\varDelta ,\varDelta ]\) on an equal footing. We recall that the Rfit ansatz was obtained starting from a well-shaped test statistic, with a flat bottom whose width is given by the theoretical uncertainty and parabolic walls set by the statistical uncertainty.
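In one dimension the Rfit ansatz can be sketched as follows (a minimal illustration with assumed σ = Δ = 1: the well-shaped statistic is flat for |X0 − μ| ≤ Δ, parabolic beyond, and is interpreted as a χ² with one degree of freedom):

```python
import math

SIGMA, DELTA_TH = 1.0, 1.0  # assumed statistical and theoretical uncertainties

def t_rfit(x0, mu):
    # well-shaped test statistic: flat bottom of half-width Delta,
    # parabolic walls set by the statistical uncertainty
    dev = abs(x0 - mu)
    return 0.0 if dev <= DELTA_TH else ((dev - DELTA_TH) / SIGMA) ** 2

def p_rfit(x0, mu):
    # T treated as a chi2 with one degree of freedom
    return 1.0 - math.erf(math.sqrt(t_rfit(x0, mu) / 2.0))

print(p_rfit(1.0, 0.5))  # inside the plateau: p = 1
print(p_rfit(2.0, 0.0))  # one sigma beyond the plateau: p ~ 0.3173
```

The resulting kσ confidence interval is simply the statistical interval widened by the plateau, μ ∈ [X0 − Δ − kσ, X0 + Δ + kσ].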

Another particular case is the Scan approach, in which one assumes that *T* follows a \(\chi ^2\) law with the corresponding number of degrees of freedom *N*, including both parameters of interest and nuisance parameters.^{11} The \(1-\alpha \) confidence region is then determined by varying the nuisance parameters in given intervals (typically \(\varOmega _1\)), but accepting only points where \(T\le T_c\), where \(T_c\) is a critical value such that \(P(T\ge T_c;N|H_0)\ge \alpha \) (generally taken as \(\alpha =0.05\)). This latter condition acts as a test of compatibility between a given choice of nuisance parameters and the data.
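The Scan prescription can be sketched numerically (a minimal one-dimensional illustration with assumed values X0 = 0, σ = 1, Δ = 1: a value of μ is retained if some δ in Ω₁ keeps T = (X0 − μ − δ)²/σ² below the critical value; with N = 2 here, the χ² critical value at α = 0.05 is T_c = −2 ln 0.05 ≈ 5.99):

```python
import math

X0, SIGMA, DELTA_TH = 0.0, 1.0, 1.0  # assumed measurement and uncertainties
ALPHA = 0.05
T_C = -2.0 * math.log(ALPHA)         # chi2 critical value for N = 2 dof

def t_stat(mu, delta):
    return ((X0 - mu - delta) / SIGMA) ** 2

# accept mu if some delta in Omega_1 = [-Delta, +Delta] passes the T <= T_c cut
deltas = [DELTA_TH * (i / 100.0 - 1.0) for i in range(201)]
grid = [i * 1e-3 for i in range(-5000, 5001)]
accepted = [mu for mu in grid if any(t_stat(mu, d) <= T_C for d in deltas)]
lo, hi = min(accepted), max(accepted)
print(lo, hi)  # ~ +/- (Delta + sqrt(T_c) * sigma) ~ +/- 3.45
```

As in the Rfit case, the theoretical range simply widens the interval additively; what changes is the critical value used for the statistical part.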

## 5 Comparison of the methods in the one-dimensional case

We compare the following approaches:

- the random-\(\delta \) approach with a Gaussian random variable, or naive Gaussian (nG), see Sect. 4.2;
- the nuisance-\(\delta \) approach with quadratic statistic and fixed range, or fixed nuisance, see Sect. 4.3;
- the nuisance-\(\delta \) approach with quadratic statistic and adaptive range, or adaptive nuisance, see Sect. 4.3;
- the external-\(\delta \) approach with quadratic statistic and fixed range, equivalent to the Rfit approach in one dimension, see Sect. 4.4.

### 5.1 *p* values and confidence intervals

We show the *p* values obtained from the various methods discussed above in Fig. 4, where we compare the nG, Rfit, fixed nuisance, and adaptive nuisance approaches. From these *p* values, we can infer confidence intervals at a given significance level and a given value of \(\varDelta /\sigma \), and determine the length of the (symmetric) confidence interval (see Table 2). We notice the following points:

- By construction, nG always provides the same errors whatever the relative proportion of theoretical and statistical uncertainties, and all the approaches provide the same answer in the limit of no theoretical uncertainty, \(\varDelta =0\).
- By construction, for a given \(n\sigma \) confidence level, the interval provided by the adaptive nuisance approach is identical to the one obtained using the fixed nuisance approach with a \([-n,n]\) interval. This explains why the adaptive nuisance approach yields identical results to the fixed 1-nuisance approach at 1\(\sigma \) (and similarly for the fixed 3-nuisance approach at 3\(\sigma \)). The corresponding curves cannot be distinguished on the upper and central panels of Fig. 5.
- The adaptive nuisance approach is numerically quite close to the nG method; the maximum difference occurs for \(\varDelta /\sigma =1\) (up to 40% larger error size for 5\(\sigma \) intervals).
- The *p* value from the fixed-nuisance approach has a very wide plateau if one works with the ‘reasonable’ range \([-3\varDelta ,+3\varDelta ]\), while the choice of \([-\varDelta ,+\varDelta ]\) might be considered as nonconservative.
- The 1-external and fixed 1-nuisance approaches are close to each other and less conservative than the adaptive approach, which is expected, but also less conservative than nG for confidence intervals at 3 or 5\(\sigma \) when theory uncertainties dominate.
- When theoretical uncertainties dominate (\(\varDelta /\sigma \) large), all approaches apart from the adaptive nuisance approach provide 3 and 5\(\sigma \) errors smaller than nG.

Table 2 Comparison of the size of one-dimensional confidence intervals at \(1,3,5\sigma \) for various methods and various values of \(\varDelta /\sigma \)

| | nG | 1-nuisance | Adaptive nuisance | 1-external |
|---|---|---|---|---|
| \(\varDelta /\sigma =0.3\) | | | | |
| \(1\sigma \) | 1.0 | 1.0 | 1.0 | 1.2 |
| \(3\sigma \) | 3.0 | 3.0 | 3.5 | 3.2 |
| \(5\sigma \) | 5.0 | 5.0 | 6.1 | 5.1 |
| \(\varDelta /\sigma =1\) | | | | |
| \(1\sigma \) | 1.0 | 1.1 | 1.1 | 1.4 |
| \(3\sigma \) | 3.0 | 2.7 | 4.1 | 2.8 |
| \(5\sigma \) | 5.0 | 4.1 | 7.0 | 4.2 |
| \(\varDelta /\sigma =3\) | | | | |
| \(1\sigma \) | 1.0 | 1.1 | 1.1 | 1.3 |
| \(3\sigma \) | 3.0 | 1.8 | 3.7 | 1.9 |
| \(5\sigma \) | 5.0 | 2.5 | 6.3 | 2.5 |
| \(\varDelta /\sigma =10\) | | | | |
| \(1\sigma \) | 1.0 | 1.0 | 1.0 | 1.1 |
| \(3\sigma \) | 3.0 | 1.3 | 3.3 | 1.3 |
| \(5\sigma \) | 5.0 | 1.5 | 5.5 | 1.5 |
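Entries of the fixed 1-nuisance column can be reproduced with a short numerical sketch (ours, under the assumptions of Sect. 4.3): the *p* value is the supremum over \(|\delta |\le \varDelta \) of the exact Gaussian *p* value (reached at the boundary), and the interval half-width solves \(p(d)=2(1-\varPhi (n))\) with the normalization \(\sigma ^2+\varDelta ^2=1\).

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_fixed_1_nuisance(d, sigma, Delta):
    """Fixed 1-nuisance p value for an observed distance d = |x - mu|:
    supremum over |delta| <= Delta, reached at delta = Delta."""
    return 2.0 - phi((d - Delta) / sigma) - phi((d + Delta) / sigma)

def interval_half_width(n_sig, sigma, Delta):
    """Half-width d of the n_sig-sigma interval: solve
    p_fixed_1_nuisance(d) = 2*(1 - phi(n_sig)) by bisection (p decreases with d)."""
    target = 2.0 * (1.0 - phi(n_sig))
    lo, hi = 0.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if p_fixed_1_nuisance(mid, sigma, Delta) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For \(\varDelta /\sigma =3\) (\(\sigma ^2=0.1\), \(\varDelta ^2=0.9\)) this yields half-widths of about 1.8 at 3\(\sigma \) and 2.5 at 5\(\sigma \), matching the table.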

### 5.2 Significance thresholds

We determine which *p* value corresponds to \(1,3,5 \sigma \) (in significance scale) in a given method, and compute the corresponding *p* values for the other methods. The results are gathered in Tables 3 and 4. Qualitatively, the comparison of significances can be seen from Fig. 4: if the size of the error is fixed, the different approaches quote different significances for this same error.

Table 3 Comparison of 1D \(1,3,5\sigma \) significance thresholds for \(\varDelta /\sigma =1\). For instance, the first line should read: if nG finds a *p* value corresponding to 1\(\sigma \), then the corresponding values for the three other methods are 0.9/1.0/0.4\(\sigma \). \(\infty \) means that the corresponding *p* value was numerically zero (corresponding to more than 8\(\sigma \))

| | nG | 1-nuisance | Adaptive nuisance | 1-external |
|---|---|---|---|---|
| 1\(\sigma \) signif. threshold | | | | |
| nG | 1 | 0.9 | 1.0 | 0.4 |
| 1-nuisance | 1.1 | 1 | 1.0 | 0.5 |
| Adaptive nuisance | 1.1 | 1.0 | 1 | 0.5 |
| 1-external | 1.4 | 1.4 | 1.2 | 1 |
| 3\(\sigma \) signif. threshold | | | | |
| nG | 3 | 3.4 | 2.3 | 3.2 |
| 1-nuisance | 2.7 | 3 | 2.0 | 2.8 |
| Adaptive nuisance | 4.1 | 4.9 | 3 | 4.8 |
| 1-external | 2.8 | 3.2 | 2.1 | 3 |
| 5\(\sigma \) signif. threshold | | | | |
| nG | 5 | 6.2 | 3.6 | 6.1 |
| 1-nuisance | 4.1 | 5 | 3.0 | 4.9 |
| Adaptive nuisance | 7.0 | \(\infty \) | 5 | \(\infty \) |
| 1-external | 4.2 | 5.1 | 3.1 | 5 |

Table 4 Comparison of 1D \(1,3,5\sigma \) significance thresholds for \(\varDelta /\sigma =3\). Same comments as in the previous table

| | nG | 1-nuisance | Adaptive nuisance | 1-external |
|---|---|---|---|---|
| 1\(\sigma \) signif. threshold | | | | |
| nG | 1 | 0.8 | 0.9 | 0.2 |
| 1-nuisance | 1.1 | 1 | 1.0 | 0.5 |
| Adaptive nuisance | 1.1 | 1.0 | 1 | 0.5 |
| 1-external | 1.3 | 1.4 | 1.1 | 1 |
| 3\(\sigma \) signif. threshold | | | | |
| nG | 3 | 6.6 | 2.4 | 6.5 |
| 1-nuisance | 1.8 | 3 | 1.5 | 2.8 |
| Adaptive nuisance | 3.7 | \(\infty \) | 3 | \(\infty \) |
| 1-external | 1.9 | 3.2 | 1.6 | 3 |
| 5\(\sigma \) signif. threshold | | | | |
| nG | 5 | \(\infty \) | 4.0 | \(\infty \) |
| 1-nuisance | 2.5 | 5 | 2.0 | 4.9 |
| Adaptive nuisance | 6.3 | \(\infty \) | 5 | \(\infty \) |
| 1-external | 2.5 | 5.1 | 2.1 | 5 |

In agreement with the previous discussion, we see that the fixed 1-nuisance and 1-external approaches yield similar results for 3 and 5\(\sigma \), independently of the relative size of statistical and theoretical effects. Moreover, they are quicker to claim a tension than nG, the most conservative method in this respect being the adaptive nuisance approach.

As an illustration, we consider the well-known discrepancy for the anomalous magnetic moment of the muon and compute the corresponding *p* value (under the hypothesis that the true value of \(a_\mu ^\mathrm{SM}-a_\mu ^\mathrm{exp}\) is \(\mu =0\)).^{12} The nG method yields 3.6\(\sigma \), the 1-external approach 3.8\(\sigma \), the 1-nuisance approach 4.0\(\sigma \), and the adaptive nuisance approach 2.7\(\sigma \). The overall pattern is similar to what can be seen from the above tables, with a significance of the discrepancy that depends on the model used for theoretical uncertainties.

### 5.3 Coverage properties

As indicated in Sect. 2.1.2, *p* values are interesting objects if they cover exactly or slightly overcover in the domain where they should be used corresponding to a given significance; see Eqs. (7)–(9). If coverage can be ensured for a simple hypothesis [1, 2], this property is far from trivial and should be checked explicitly in the case of composite hypotheses, where compositeness comes from nuisance parameters that can be related to theoretical uncertainties, or other parameters of the problem.

To study these properties, we compute the distribution of the *p* value at the true value of \(\mu \). The shape of the distribution of *p* values indicates over-, exact, or undercoverage. More specifically, one can determine \(P(p\ge 1-\alpha )\) for a CL of \(\alpha \): if it is larger (smaller) than \(\alpha \), the method overcovers (undercovers) for this particular CL, i.e., it is conservative (liberal). We emphasize that this property is *a priori* dependent on the chosen CL.

Table 5 Coverage properties of the various methods at 68.27, 95.45 and 99.73% CL, for different true values of \(\delta /\varDelta \) contained in, at the border of, or outside the fixed volume \(\varOmega \), and for various relative sizes of statistical and theoretical uncertainties \(\varDelta /\sigma \)

| | 68.27% CL | 95.45% CL | 99.73% CL | 68.27% CL | 95.45% CL | 99.73% CL |
|---|---|---|---|---|---|---|
| | \(\varDelta /\sigma =1\), \(\delta /\varDelta =1\) | | | \(\varDelta /\sigma =1\), \(\delta /\varDelta =0\) | | |
| nG | 65.2% | 96.6% | 99.9% | 84.1% | 99.5% | 100.0% |
| 1-nuisance | 68.2% | 95.4% | 99.7% | 86.5% | 99.3% | 100.0% |
| Adaptive nuisance | 68.3% | 99.6% | 100.0% | 86.4% | 100.0% | 100.0% |
| 1-external | 83.9% | 97.8% | 99.9% | 95.4% | 99.7% | 100.0% |
| 1-ext. (excl. \(p\equiv 1\)) | 69.2% | 95.7% | 99.8% | 85.5% | 99.1% | 100.0% |
| | \(\varDelta /\sigma =1\), \(\delta /\varDelta =3\) | | | \(\varDelta /\sigma =3\), \(\delta /\varDelta =0\) | | |
| nG | 5.76% | 43.2% | 89.1% | 99.8% | 100.0% | 100.0% |
| 1-nuisance | 6.60% | 38.0% | 78.4% | 100.0% | 100.0% | 100.0% |
| Adaptive nuisance | 6.53% | 75.4% | 99.8% | 99.9% | 100.0% | 100.0% |
| 1-external | 16.0% | 50.3% | 84.2% | 100.0% | 100.0% | 100.0% |
| 1-ext. (excl. \(p\equiv 1\)) | 14.0% | 49.1% | 83.8% | 98.5% | 100.0% | 100.0% |
| | \(\varDelta /\sigma =3\), \(\delta /\varDelta =3\) | | | \(\varDelta /\sigma =3\), \(\delta /\varDelta =1\) | | |
| nG | 0.00% | 0.35% | 68.7% | 56.3% | 100.0% | 100.0% |
| 1-nuisance | 0.00% | 0.00% | 0.07% | 68.1% | 95.5% | 99.7% |
| Adaptive nuisance | 0.00% | 9.60% | 99.8% | 68.2% | 100.0% | 100.0% |
| 1-external | 0.00% | 0.00% | 0.13% | 84.1% | 97.7% | 99.9% |
| 1-ext. (excl. \(p\equiv 1\)) | 0.00% | 0.00% | 0.13% | 68.2% | 95.4% | 99.7% |

In order to compare the different situations, we take \(\sigma ^2+\varDelta ^2=1\) for all methods, and compute for each method the coverage fraction (the fraction of times the confidence interval includes the true value of the parameter being extracted) for various confidence levels and for various values of \(\varDelta /\sigma \). Note that the coverage depends also on the true value of \(\delta /\varDelta \) (the normalized bias). The results are gathered in Table 5 and Fig. 6. We also indicate the distribution of *p* values obtained for the different methods.
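This coverage computation can be sketched with a simple toy Monte Carlo (our own illustration, with our function names). For the nG method with \(\sigma ^2+\varDelta ^2=1\), toys are drawn around the biased mean \(\mu +\delta \) and the fraction of toys whose nG interval contains the true \(\mu \) is recorded:

```python
import math
import random

def p_nG(d):
    """Naive-Gaussian two-sided p value with sigma^2 + Delta^2 = 1."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(d / math.sqrt(2.0))))

def coverage_nG(sigma, delta_true, cl, n_toys=200_000, seed=1):
    """Fraction of toys X ~ N(mu + delta_true, sigma), with mu = 0, for which
    the nG interval at confidence level cl contains the true mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_toys):
        x = rng.gauss(delta_true, sigma)
        if p_nG(abs(x)) >= 1.0 - cl:   # true value inside the cl interval
            hits += 1
    return hits / n_toys
```

For \(\varDelta /\sigma =1\), \(\delta /\varDelta =1\) (i.e., \(\sigma =\delta =1/\sqrt{2}\)) at 68.27% CL, this reproduces the \(\simeq 65.2\%\) undercoverage quoted for nG in Table 5.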

One notices in particular that the 1-external approach has a cluster of values for \(p=1\), which is expected due to the presence of a plateau in the *p* value. This behavior makes the interpretation of the coverage more difficult, and as a comparison, we also include the results when we consider the same distribution with the \(p=1\) values removed. Indeed one could imagine a situation where reasonable coverage values could only be due to the \(p=1\) clustering, while other values of *p* would systematically undercover: such a behavior would either yield no constraints or too liberal constraints on the parameters depending on the data.

- If \(\varOmega \) is fixed and does not contain the true value of \(\delta /\varDelta \) (“unfortunate” case), both the external-\(\delta \) and nuisance-\(\delta \) approaches lead to undercoverage; the size of the effect depends on the distance of \(\delta /\varDelta \) to \(\varOmega \). This is also the case for nG.
- If \(\varOmega \) is fixed and contains the true value of \(\delta /\varDelta \) (“fortunate” case), both the external-\(\delta \) and the nuisance-\(\delta \) approaches overcover. This is also the case for nG.
- If \(\varOmega \) is adaptive, for a fixed true value of \(\delta \), a *p* value becomes valid if it is sufficiently small so that the corresponding interval contains \(\delta \). Therefore, for the adaptive nuisance-\(\delta \) approach, there is always a maximum value of CL above which all *p* values are conservative; this maximum value is given by \(1-\mathrm {Erf}[\delta /(\sqrt{2}\varDelta )]\).

These properties follow from the fact that the supremum is taken over a *p* value that has exact coverage under the individual simple hypotheses where \(\delta \) is fixed. Therefore, as long as the true value \(\delta \) lies within the range over which one takes the supremum, this procedure yields a conservative envelope. This explains the overcoverage/undercoverage properties of the external-\(\delta \) and nuisance-\(\delta \) approaches given above.

### 5.4 Conclusions of the uni-dimensional case

It should be stressed that, by construction, all methods are conservative if the true value of the \(\delta \) parameter satisfies the assumption made for the computation of the *p* value. Coverage properties are therefore not the only criterion to investigate in order to assess the methods: in particular, one has to study the robustness of the *p* value when the assumption on the true value of \(\delta \) does not hold. The adaptive approach provides a means to deal with a priori unexpected true values of \(\delta \), provided one is interested in a small enough *p* value, that is, a large enough significance. Other considerations (size of confidence intervals, significance thresholds) suggest that the adaptive approach provides an interesting and fairly conservative framework to deal with theoretical uncertainties. We now consider the different approaches in the more general multi-dimensional case, putting emphasis on the adaptive nuisance-\(\delta \) approach and the quadratic test statistic.

## 6 Generalization to multi-dimensional cases

Up to now, we have only discussed the simplest example of a single measurement *X* linearly related to a single model parameter \(\mu \). Obviously, the general case is multi-dimensional: we deal with several observables, depending on several underlying parameters, possibly in a non-linear way, with several measurements involving different sources of theoretical uncertainty. Typical situations correspond to averaging different measurements of the same quantity, and performing fits to extract confidence regions for fundamental parameters from the measurement of observables. In this section we will discuss the case of an arbitrary number of observables in a linear model with an arbitrary number of parameters, where we are particularly interested in a one-dimensional or two-dimensional subset of these parameters.

### 6.1 General formulas

In this notation, \(X=(X_i,\ i=1,\ldots ,n)\) is the *n*-vector of measurements, \(x=(x_i,\ i=1,\ldots ,n)\) is the *n*-vector of model predictions for the \(X_i\), which depends on \(\chi =(\chi _j,\ j=1,\ldots , n_\chi )\), the \(n_\chi \)-vector of model parameters, \(\tilde{\delta }\) is the *m*-vector of (dimensionless) theoretical biases, \(W_s\) is the (possibly non-diagonal) \(n\times n\) inverse of the statistical covariance matrix \(C_s\), \(\widetilde{W}_t\) is the inverse of the (possibly non-diagonal) \(m\times m\) theoretical correlation matrix \(\widetilde{C}_t\), and \(\varDelta \) is the \(n\times m\) matrix of theoretical uncertainties \(\varDelta _{i\alpha }\), so that the reduced biases \(\tilde{\delta }_\alpha \) have a range of variation within \([-1,1]\) (this explains the notation with tildes for the reduced quantities rescaled to be dimensionless).

After minimization over the biases, *T* can be recast into the canonical form.

The properties of *T* are further discussed in Appendix C. In particular, one can reduce the test statistic to the case \(m=n\) with a diagonal \(\varDelta \) matrix without losing information. In the case where both correlation/covariance matrices are regular, Eq. (36) boils down to \(\bar{W}=[C_s+C_t]^{-1}\) with \(C_t =\varDelta \widetilde{C}_t \varDelta ^T\). This structure is reminiscent of the discussion of theoretical uncertainties as biases and the corresponding weights given in Ref. [29], but it extends it to the case where correlations yield singular matrices.

We assume in the following that the model is *linear*, i.e., that the predictions \(x_i\) depend linearly on the parameters \(\chi _j\):

Following the one-dimensional examples in the previous sections, we always assume that the measurements \(X_i\) have Gaussian distributions for the statistical part. We will consider two main cases of interest in our field: averaging measurements and determining confidence intervals for several parameters.

### 6.2 Averaging measurements

We start by considering the averages of several measurements of a single quantity, each with both statistical and theoretical uncertainties, with possible correlations. We will focus mainly on the nuisance-\(\delta \) approach, starting with two measurements before moving to other possibilities.

#### 6.2.1 Averaging two measurements and the choice of a hypervolume

A first usual issue consists in the case of two uncorrelated measurements \(X_1\pm \sigma _1\pm \varDelta _1\) and \(X_2\pm \sigma _2\pm \varDelta _2\) that we want to combine. The procedure is well defined in the case of purely statistical uncertainties, but it depends obviously on the way theoretical uncertainties are treated. As discussed in Sect. 3, associativity is a particularly appealing property for such a problem as it allows one to replace a series of measurements by its average without loss of information.

One option is to maximize the *p* value over a rectangle \({\mathcal C}\) (called the “hypercube case” in the following, in reference to its multi-dimensional generalization), so that \(\delta _\mu \) varies within \([-\varDelta _\mu ,\varDelta _\mu ]\), with

Each choice of volume provides an average with different properties. As discussed earlier, associativity is a very desirable property: one can average different observations of the same quantity prior to the full fit, since it gives the same result as keeping all individual inputs. The hyperball choice indeed fulfills associativity. On the other hand, the hypercube case does not: the combination of the inputs 1 and 2 yields the following test statistic: \((w_1+w_2)(\mu -\hat{\mu })^2\), whereas the resulting combination \(\hat{\mu }\pm \sigma _\mu \pm \varDelta _\mu \) has the statistic \((\mu -\hat{\mu })^2/(\sigma _\mu ^2+\varDelta _\mu ^2)\). The two statistics are proportional and hence lead to the same *p* value, but they are not equivalent when added to other terms in a larger combination.

A comment is also in order concerning the size of the uncertainties for the average. In the case of the hypercube, the resulting linear addition scheme is the only one where the average of different determinations of the same quantity cannot lead to a weighted theoretical uncertainty that is smaller than the smallest uncertainty among all determinations.^{13} In the case of the hyperball, it may occur that the average of different determinations of the same quantity yields a weighted theoretical uncertainty smaller than the smallest uncertainty among all determinations.
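The two combination schemes can be made concrete in the uncorrelated case. The sketch below is our own illustration and assumes inverse-variance weights \(w_i=1/(\sigma _i^2+\varDelta _i^2)\); maximizing the combined bias over the hyperball gives the quadratic combination, over the hypercube the linear one:

```python
def average(measurements):
    """Combine measurements (x_i, sigma_i, Delta_i), assumed uncorrelated,
    with weights w_i = 1/(sigma_i^2 + Delta_i^2). Returns the average, its
    statistical error, and the theoretical error in the hyperball scheme
    (quadratic combination) and in the hypercube scheme (linear combination)."""
    w = [1.0 / (s * s + d * d) for (_, s, d) in measurements]
    W = sum(w)
    mu_hat = sum(wi * x for wi, (x, _, _) in zip(w, measurements)) / W
    sigma_mu = sum((wi * s) ** 2 for wi, (_, s, _) in zip(w, measurements)) ** 0.5 / W
    d_ball = sum((wi * d) ** 2 for wi, (_, _, d) in zip(w, measurements)) ** 0.5 / W
    d_cube = sum(wi * d for wi, (_, _, d) in zip(w, measurements)) / W
    return mu_hat, sigma_mu, d_ball, d_cube
```

For two determinations of equal precision, say \(0\pm 1\pm 1\) and \(2\pm 1\pm 1\), the hypercube scheme returns \(\varDelta _\mu =1\) (never below the smallest individual \(\varDelta _i\)), while the hyperball scheme returns \(\varDelta _\mu =1/\sqrt{2}\), smaller than either input.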

Whatever the choice of the volume, a very important and alluring property of our approach is the clean separation between the statistical and theoretical contribution to the uncertainty on the parameter of interest. This is actually a general property that directly follows from the choice of a quadratic statistic, and in the linear case it allows one to perform global fits while keeping a clear distinction between various sources of uncertainty.

#### 6.2.2 Averaging *n* measurements with biases in a hyperball

We will now consider here the problem of averaging *n*, possibly correlated, determinations of the same quantity, each individual determination coming with both a Gaussian statistical uncertainty, and a number of different sources of theoretical uncertainty. We focus first on the nuisance-\(\delta \) approach, as it is possible to provide closed analytic expressions in this case. We will first discuss the variation of the biases over a hyperball, before discussing other approaches, which will be illustrated and compared with examples from flavor physics in Sect. 7.

Here *U* is the *n*-vector \((1,\ldots ,1)\). After minimization over the \(\tilde{\delta }_\alpha \), *T* can be recast into the canonical form, with *P* a lower triangular matrix with positive diagonal entries. This yields the expression for the bias.

Statistical uncertainties are assumed here to be strictly Gaussian and hence symmetric (see Appendix C for more details of the asymmetric case). In contrast, in the nuisance approach, a theoretical uncertainty that is modeled by a bias parameter \(\delta \) may be asymmetric: that is, the region in which \(\delta \) is varied may depend on the sign of \(\delta \), e.g., \(\delta \in [-\varDelta _-,+\varDelta _+]\) in one dimension with the fixed hypercube approach (\(\varDelta _\pm \ge 0\)). In order to keep the stationarity property that follows from the quadratic statistic, we take the conservative choice \(\varDelta =\mathrm{Max}(\varDelta _+,\varDelta _-)\) in the definition Eq. (34). Let us emphasize that this symmetrization of the test statistic is independent of the range in which \(\delta \) is varied: if theoretical uncertainties are asymmetric, one computes Eqs. (46)–(48) to express the asymmetric combined uncertainties \(\varDelta _{\mu ,\pm }\) in terms of the \(\varDelta _{i\alpha ,\pm }\).

#### 6.2.3 Averages with other approaches

In the presence of theoretical correlations, one could consider a linear transformation *P* of the biases (for instance, the Cholesky decomposition of \(C_t\), but the discussion is more general), so that \((P^{-1}\tilde{\delta })_\beta \) are uncorrelated biases varied within a hypercube. This would lead to \(\tilde{\delta }\) varied within a deformed hypercube, which corresponds to cutting the hypercube by a set of \((\tilde{\delta }_i,\tilde{\delta }_j)\) hyperplanes. It can take a rather complicated convex polygonal shape that is not symmetric along the diagonal in the \((\tilde{\delta }_i,\tilde{\delta }_j)\) plane, leading to the unpleasant feature that the order in which the measurements are considered in the average matters for defining the range of variation of the biases (an illustration is given in Appendix B).

^{14}

As indicated before, this discussion applies to any linear transformation *P* and is not limited to the Cholesky decomposition. We have not been able to find other procedures that would avoid these difficulties while paralleling the hypercube case. In the following, we will thus use Eq. (49) even in the presence of theoretical correlations: the latter will be taken into account in the definition of *T* through \(\bar{W}\), but not in the definition of the range of variations used to compute the error \(\varDelta \). We also notice that the problems we encounter stem from contradictory expectations concerning the hypercube approach: in Sect. 6.2.1, the hypercube corresponds to values of \(\delta _1\) and \(\delta _2\) left free to vary without any relation among them (contrary to the hyperball case). The hypercube is thus designed to avoid such correlations from the start and cannot accommodate them easily.

In the case of the external-\(\delta \) approach, the scan method leads to the same discussion as for the nuisance case, provided that one uses the following statistic: \(T = (X-\mu -\delta )^2/(\sigma ^2+\varDelta ^2)\). This choice differs from Ref. [32] by the normalization (\(\sigma ^2+\varDelta ^2\) rather than \(\sigma ^2\)), chosen in order to take into account the importance of both uncertainties when combining measurements (damping measurements that are imprecise in one way or the other). As indicated in Sect. 4.4, the difference in normalization of the test statistic does not affect the determination of the *p* value in the uni-dimensional case, but it has an impact once several determinations are combined. The choice above corresponds to the usual one when \(\varDelta \) is of statistical nature, and it gives a reasonable balance when two or more inputs, all coming with both statistical and theoretical uncertainties, are combined.

A similar discussion holds for the random-\(\delta \) approach. However, if the combined errors \(\sigma _\mu \) and \(\varDelta _\mu \) are the same between the nuisance-\(\delta \) (with hyperball), the random-\(\delta \) and the external-\(\delta \) (with hyperball) approaches, we emphasize that the *p* value for \(\mu \) built from these errors is different and yields different uncertainties for a given confidence level for each approach, as discussed in Sect. 4.

#### 6.2.4 Other approaches in the literature

There are other approaches available in the literature, often starting from the random-\(\delta \) approach (i.e., modeling all uncertainties as random variables).

The Heavy Flavor Averaging Group [36] choose to perform the average including correlations. In the absence of knowledge of the correlation coefficient between the uncertainties of two measurements (typically coming from the same method), they tune the correlation coefficient so that the resulting uncertainty is maximal (which is not \(\rho =1\) in the case where the correlated uncertainties have different sizes and are combined assuming a statistical origin; see Appendix A.2). This choice is certainly the most conservative one when there is no knowledge concerning correlations.
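The fact that the most conservative correlation is not \(\rho =1\) can be checked numerically. The sketch below is ours and uses the standard formula for the variance of the optimally weighted average of two correlated measurements; it may differ in detail from the actual averaging procedure of Ref. [36]:

```python
def blue_variance(s1, s2, rho):
    """Variance of the optimally weighted average of two measurements with
    uncertainties s1, s2 and correlation coefficient rho."""
    return (s1**2 * s2**2 * (1.0 - rho**2)
            / (s1**2 + s2**2 - 2.0 * rho * s1 * s2))

def most_conservative_rho(s1, s2, steps=20000):
    """Scan rho in (-1, 1) and return the value maximizing the combined
    uncertainty, together with that maximal uncertainty."""
    best_rho, best_var = 0.0, 0.0
    for i in range(steps + 1):
        rho = -0.999 + 1.998 * i / steps
        v = blue_variance(s1, s2, rho)
        if v > best_var:
            best_rho, best_var = rho, v
    return best_rho, best_var ** 0.5
```

For \(\sigma _1=1\) and \(\sigma _2=2\), the combined uncertainty is maximal at \(\rho =\sigma _1/\sigma _2=0.5\), where it equals the smaller of the two uncertainties, while it vanishes as \(\rho \rightarrow 1\).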

The Flavor Lattice Averaging Group [37] follows the proposal in Ref. [38]: they build a covariance matrix where correlated sources of uncertainties are included with 100% correlation, and they perform the average by choosing weights \(w_i\) that are not optimal but are well defined even in the presence of \(\rho =\pm 1\) correlation coefficients. As discussed in Appendix A.2, our approach to singular covariance matrices is similar but more general and guarantees that we recover the weights advocated in Ref. [38] for averages of fully correlated measurements.

Finally, the PDG approach [34] combines all uncertainties in a single covariance matrix. In the case of inconsistent measurements, one may then obtain an average with an uncertainty that may be interpreted as ‘too small’ (notice, however, that the weighted uncertainty does not increase with the incompatibility of the measurements). This problem occurs quite often in particle physics and cannot be solved by purely statistical considerations (even in the absence of theoretical uncertainties). If the model is assumed to be correct, one may invoke an underestimation of the uncertainties. A (commonly used) recipe in the pure statistical case has been adopted by the Particle Data Group: it consists in computing a factor \(S=\sqrt{\chi ^2/(N_\mathrm{dof}-1)}\) and rescaling all uncertainties by this factor. A drawback of this approach is the lack of associativity: the inconsistency is either removed or kept as it is, depending on whether the average is performed before any further analysis or inside a global fit. Furthermore, since the ultimate goal of statistical analyses is indeed to exclude the null hypothesis (e.g., the Standard Model), it looks counter-intuitive to first wash out possible discrepancies by an *ad hoc* procedure. Therefore we refrain from defining an *S* factor in the presence of theoretical uncertainties, and we leave the discussion of discrepancies between independent determinations of the same quantity to a case-by-case analysis, based on physical (and not statistical) grounds.
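The scale-factor recipe for purely statistical uncertainties can be sketched as follows (our own minimal illustration of the \(S\)-factor formula quoted above):

```python
def pdg_average(xs, sigmas):
    """Inverse-variance weighted average with the PDG scale factor
    S = sqrt(chi2 / (N - 1)): if S > 1, the uncertainty on the average
    is inflated by S. Returns (mean, scaled uncertainty, S)."""
    w = [1.0 / s**2 for s in sigmas]
    W = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, xs)) / W
    sigma = W ** -0.5
    chi2 = sum(wi * (xi - mean) ** 2 for wi, xi in zip(w, xs))
    S = (chi2 / (len(xs) - 1)) ** 0.5
    return mean, sigma * max(S, 1.0), S
```

For three marginally compatible inputs, e.g. \(10.0\pm 0.5\), \(10.0\pm 0.5\), \(12.0\pm 0.5\), one finds \(S\simeq 2.3\) and the naive uncertainty \(0.29\) is inflated to about \(0.67\).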

In the case of the Rfit approach adopted by the CKMfitter group [15, 16], a specific recipe was chosen to avoid underestimating combined uncertainties in the case of marginally compatible values. The idea is to first combine the statistical uncertainties by combining the likelihoods restricted to their statistical part, and then to assign to this combination the smallest of the individual theoretical uncertainties. This is justified by the following two points: the present state of the art is assumed not to allow one to reach a better theoretical accuracy than the best of all estimates, and this best estimate should not be penalized by less precise methods. In contrast with the plain (or naive) Rfit approach for averages (consisting in just combining Rfit likelihoods without further treatment), this method of combining uncertainties was called educated Rfit and is used by the CKMfitter group for averages [17, 19, 22]. Let us note finally that the calculation of pull values, discussed in Sect. 6.3, is a crucial step for assessing the size of discrepancies.

### 6.3 Global fit

#### 6.3.1 Estimators and errors

Another prominent example of a multi-dimensional problem is the extraction of a constraint on a particular parameter of the model from the measured observables. If the model is linear, Eq. (38), the discussion follows closely that of Sect. 6.2.2. In the case where there is a single parameter of interest \(\mu \), we do not write the calculations explicitly and refer to Sect. 7 for numerical examples.

The *p* value for \(\mu \) follows exactly the discussion for uni-dimensional measurements.

^{15}

The constraint on a given parameter (i.e., the *p* value for \(\mu =\chi _q\)) can readily be obtained from

### 6.4 Goodness-of-fit

The measurements *X* are distributed following a multivariate normal distribution, with central value \(a \chi +b+\varDelta \tilde{\delta }\) and covariance matrix \(C_s\). The CDF \(H_{\tilde{\delta }}(t)\) for \(T_\mathrm{min}\) at fixed \(\tilde{\delta }\) can thus be rephrased in the following way: considering a vector *Y* distributed according to a multivariate normal distribution of covariance \(C_s\) centered around 0, \(H_{\tilde{\delta }}(t)\) is the probability \(P[(Y-a \chi -\varDelta \tilde{\delta })^T M (Y-a \chi -\varDelta \tilde{\delta })\le t]\).

with *L* lower triangular (using the Cholesky decomposition of \(C_s\)), \(\alpha \) diagonal, and *K* orthogonal (so that the diagonal elements of \(\alpha \) are the (positive) eigenvalues of \(L^TML\) and thus of \(MC_s\)). Let us note that \(\alpha \) depends only on \(C_s\) and \( C_t\), whereas the dependence on the true values of \(\chi \) and \(\tilde{\delta }\) is only present in \(\beta \). The problem is then equivalent to considering a vector *Z* distributed according to a multivariate normal distribution with identity covariance centered around 0, and computing \(P[(Z-\beta )^T \alpha (Z-\beta )\le t]\). This is the CDF of a linear combination of the form \(\sum _i \alpha _i X_i^2\) of variables following non-central \(\chi ^2\) distributions.
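This CDF has no simple closed form in general, but it is straightforward to estimate by Monte Carlo. The following sketch (ours, with our function names) samples the quadratic form \((Z-\beta )^T \alpha (Z-\beta )\) directly:

```python
import random

def cdf_quadratic_form(alpha, beta, t, n_samples=200_000, seed=7):
    """Monte Carlo estimate of P[(Z - beta)^T diag(alpha) (Z - beta) <= t]
    for Z ~ N(0, 1)^n, i.e. the CDF of a weighted sum of non-central
    chi^2 variables with non-centralities beta_i^2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        q = sum(a * (rng.gauss(0.0, 1.0) - b) ** 2
                for a, b in zip(alpha, beta))
        if q <= t:
            hits += 1
    return hits / n_samples
```

As a sanity check, for \(\alpha =(1)\) and \(\beta =(0)\) the quadratic form is a central \(\chi ^2_1\) variable, so the CDF at \(t=1\) is close to 0.6827.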

This allows one to compute the goodness-of-fit *p* value as

### 6.5 Pull parameters

In addition to the general indication given by goodness-of-fit indicators, it is useful to determine the agreement between individual measurements and the model. One way of quantifying this agreement consists in determining the pull of each quantity. Indeed, the agreement between the indirect fit prediction and the direct determination of some observable *X* is measured by its pull, which can be determined by considering the difference of minimum values of the test statistic including or not the observables [22]. In the absence of non-Gaussian effects or correlations, the pulls are random variables of vanishing mean and unit variance.

For a given measurement \(X_m\), one introduces a *pull parameter* \(p_{X_m}\) in the test statistic \(T(X_0;\chi ,p_{X_m})\). The *p* value for the null hypothesis \(p_{X_m} = 0\) is by definition the pull for \(X_m\). It can be understood as a comparison of the best-fit value of the test statistic reached letting \(p_{X_m}\) free (corresponding to a global fit without the measurement \(X_m\)) with the case setting \(p_{X_m}=0\) (corresponding to a global fit including the measurement \(X_m\)).
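This difference-of-minima construction can be illustrated in the simplest Gaussian case, two uncorrelated measurements of the same \(\mu \) with no theoretical uncertainty (our own sketch, with our function names):

```python
def pull_from_delta_T(x1, s1, x2, s2):
    """Pull of the measurement x2 +- s2 against x1 +- s1 (both measuring mu),
    from the difference of minimum test statistics with the pull parameter
    fixed to zero versus left free. For a quadratic statistic this reduces
    to the familiar (x1 - x2)/sqrt(s1^2 + s2^2)."""
    # T(mu) = (x1 - mu)^2/s1^2 + (x2 - mu)^2/s2^2, minimized over mu:
    w1, w2 = 1.0 / s1**2, 1.0 / s2**2
    mu_hat = (w1 * x1 + w2 * x2) / (w1 + w2)
    T_with = w1 * (x1 - mu_hat) ** 2 + w2 * (x2 - mu_hat) ** 2
    # letting the pull parameter float absorbs x2 entirely: min T = 0 here
    T_without = 0.0
    return (T_with - T_without) ** 0.5
```

For example, \(0\pm 1\) against \(3\pm 1\) gives a pull of \(3/\sqrt{2}\simeq 2.1\), as expected from the direct formula.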

The pull parameter is then treated like any other parameter in the minimization of *T*, leading to the same expression for *T* as in Eq. (50), but with \(\bar{W}\) replaced by the matrix

This construction can be generalized to *N* parameters, introducing *N* distinct pull parameters and determining the *p* value for the null hypothesis where all pull parameters vanish simultaneously.

A similar discussion holds for the average of *n* measurements, introducing the modified test statistic compared to Eq. (44):

### 6.6 Conclusions of the multi-dimensional case

We have discussed several situations where a multi-dimensional approach is needed in phenomenological analyses. In addition to the issues already encountered in one dimension, a further arbitrary choice must be made in the multi-dimensional case for the nuisance and external approaches, concerning the shape of the volume in which the biases are varied: two simple cases are given by the hypercube and the hyperball, corresponding, respectively, to the well-known linear and quadratic combinations of uncertainties. We have then discussed how to average two (or several) measurements, emphasizing the case of the nuisance approach. We have finally illustrated how a fit can be performed in order to determine confidence regions. Beyond the metrology of the model, we can also determine the agreement between model and experiments thanks to the pull parameters associated with each observable.

The uni-dimensional case (stationarity of the quadratic test statistic under minimization, coverage properties) has led us to prefer the adaptive nuisance approach, even though the fixed nuisance approach could also be considered. In the multi-dimensional case, the hyperball in conjunction with the quadratic test statistic allows us to keep associativity when performing averages, so that it is rigorously equivalent from the statistical point of view to keep several measurements of a given observable or to average them in a single value. We have also been able to discuss theoretical correlations using the hyperball case at two different stages: including the correlations among observables in the domain of variations of the biases when computing the errors \(\varDelta \), and providing a meaningful definition for the theoretical correlation among parameters of the fit. We have not found a way to keep these properties in the case of the hypercube. Moreover, choosing the hypercube may favor best-fit configurations where all the biases are at the border of their allowed regions, whereas the hyperball prevents such ‘fine-tuned’ solutions from occurring.
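The two combination rules can be stated compactly: maximizing the linear form \(\sum _i \varDelta _i t_i\) over the unit hypercube \(|t_i|\le 1\) gives the linear sum of the \(\varDelta _i\), while maximizing it over the unit hyperball \(\sum _i t_i^2\le 1\) gives their quadratic sum (Cauchy–Schwarz). A minimal numerical sketch with hypothetical error components:

```python
import numpy as np

deltas = np.array([0.003, 0.007, 0.003, 0.008, 0.005])  # hypothetical theory errors

# sup of sum(deltas * t) over the hypercube |t_i| <= 1: linear combination
hypercube = np.abs(deltas).sum()
# sup of sum(deltas * t) over the hyperball sum(t_i^2) <= 1: quadratic combination
hyperball = np.sqrt((deltas**2).sum())

# brute-force check: random points on the unit sphere never exceed the bound
rng = np.random.default_rng(0)
t = rng.normal(size=(10000, deltas.size))
t /= np.linalg.norm(t, axis=1, keepdims=True)
assert (t @ deltas).max() <= hyperball + 1e-12
```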

In the following, we will for comparison focus on two nuisance approaches, the fixed 1-hypercube and the adaptive hyperball, with a preference for the latter. The other combinations would yield either far too conservative (adaptive hypercube) or too liberal (fixed 1-hyperball) ranges of variation for the biases.

## 7 CKM-related examples

We will now illustrate the differences between the various approaches using several examples from quark flavor physics. These examples are intended for illustrative purposes only, and we refer the reader to other works [15, 16, 22, 35] for a more thorough discussion of the physics and the inputs involved. From the previous discussion, we could consider a large set of approaches for theoretical uncertainties.

We will restrict ourselves to a few cases compared to the previous sections. First, we will consider educated Rfit (Rfit with a specific treatment of uncertainties for averages), as used in the CKMfitter analyses and described in Sect. 6.2.4; the naive Rfit approach will be shown only for the sake of comparison and should not be understood as an appropriate model. We will also consider two nuisance approaches, namely the adaptive hyperball and the 1-hypercube cases. Our examples are chosen in the context of CKM fits and correspond approximately to the situation at the time of the Summer 2014 conferences. However, for pedagogical purposes, we have intentionally simplified some of the inputs compared to actual phenomenological analyses performed in flavor physics [35].

### 7.1 Averaging theory-dominated measurements

We first consider the average of the lattice determinations of the kaon bag parameter \(B_K\) collected in Table 6 (top),^{16} and we neglect all correlations. We stress that this is done only for purposes of illustration, and that an extended list of lattice QCD results with asymmetric uncertainties and correlations will be taken into account in forthcoming phenomenological applications [35].

Top: lattice determinations of the kaon bag parameter \(B_K^{\bar{\mathrm{MS}}}(2\mathrm{GeV})\). Middle: averages according to the various methods, and corresponding confidence intervals for various significances. Bottom: pulls associated to each measurement for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form \(p\pm \sigma \pm \varDelta \) (significance of the pull)

References | \(N_f\) | Mean | Stat | Theo
---|---|---|---|---
ETMC10 [41] | 2 | 0.532 | ±0.019 | \(\pm 0.003\pm 0.007\pm 0.003\pm 0.008\pm 0.005\)
LVdW11 [42] | 2 + 1 | 0.5572 | ±0.0028 | \(\pm 0.0045\pm 0.0033\pm 0.0039\pm 0.0006\pm 0.0134\)
BMW11 [43] | 2 + 1 | 0.5644 | ±0.0059 | \(\pm 0.0022\pm 0.0008\pm 0.0006\pm 0.0006\pm 0.0002\pm 0.0056\)
RBC-UKQCD12 [44] | 2 + 1 | 0.554 | ±0.008 | \(\pm 0.007 \pm 0.003\pm 0.012\)
SWME14 [45] | 2 + 1 | 0.5388 | ±0.0034 | \(\pm 0.0237\pm 0.0048\pm 0.0005\pm 0.0108\pm 0.0022\pm 0.0016\pm 0.0005\)

Method | Average | 1\(\sigma \) CI | 2\(\sigma \) CI | 3\(\sigma \) CI | 5\(\sigma \) CI
---|---|---|---|---|---
nG | \(0.5577 \pm 0.0063 \pm 0\) | \(0.5577 \pm 0.0063\) | \(0.5577 \pm 0.0126\) | \(0.5577 \pm 0.0189\) | \(0.5577 \pm 0.0315\)
Naive Rfit | \(0.5562\pm 0.0120 \pm 0.0018\) | \(0.5562 \pm 0.0138\) | \(0.5562 \pm 0.0258\) | \(0.5562 \pm 0.0379\) | \(0.5562 \pm 0.0619\)
Educ Rfit | \(0.5562 \pm 0.0020 \pm 0.0100\) | \(0.5562 \pm 0.0120\) | \(0.5562 \pm 0.0139\) | \(0.5562 \pm 0.0159\) | \(0.5562 \pm 0.0198\)
1-hypercube | \(0.5577\pm 0.0038 \pm 0.0176\) | \(0.5577 \pm 0.0193\) | \(0.5577\pm 0.0240\) | \(0.5577 \pm 0.0281\) | \(0.5577 \pm 0.0360\)
Adapt hyperball | \(0.5577\pm 0.0038 \pm 0.0050\) | \(0.5577 \pm 0.0068\) | \(0.5577 \pm 0.0165\) | \(0.5577 \pm 0.0257\) | \(0.5577 \pm 0.0436\)

Pull | nG | (e)Rfit | 1-hypercube | Adaptive hyperball
---|---|---|---|---
ETMC10 | \(-1.22\pm 1.04\pm 0\ (1.2\sigma )\) | \((0.0 \sigma )\) | \(-1.22\pm 0.85\pm 1.88\ (0.3\sigma )\) | \(-1.22\pm 0.85\pm 0.60\ (1.1\sigma )\)
LVdW11 | \(-0.04\pm 1.10\pm 0\ (0.0\sigma )\) | \((0.0 \sigma )\) | \(-0.04\pm 0.35\pm 2.71\ (0.0\sigma )\) | \(-0.04\pm 0.35\pm 1.04\ (0.1\sigma )\)
BMW11 | \(\ 1.74\pm 1.49\pm 0\ (1.2\sigma )\) | \((0.0 \sigma )\) | \(\ 1.74\pm 0.86\pm 4.32\ (0.0\sigma )\) | \(\ 1.74\pm 0.86 \pm 1.21 \ (1.0 \sigma )\)
RBC-UKQCD12 | \(-0.27\pm 1.08\pm 0\ (0.2\sigma )\) | \((0.0 \sigma )\) | \(-0.27\pm 0.55\pm 2.38\ (0.0\sigma )\) | \(-0.27\pm 0.56\pm 0.93\ (0.4\sigma )\)
SWME14 | \(-0.75\pm 1.03\pm 0\ (0.7\sigma )\) | \((0.0 \sigma )\) | \(-0.75\pm 0.19\pm 2.24\ (0.0\sigma )\) | \(-0.75\pm 0.19\pm 1.01\ (0.7\sigma )\)

The results for each method are given in Table 6 (middle). The first column corresponds to the outcome of the averaging procedure. In all the approaches considered, we can split statistical and theoretical uncertainties. In the case of naive Rfit, one combines the measurements by adding the well statistics corresponding to each measurement: the resulting test statistic *T* is a well with a flat bottom, the width of which can be interpreted as the theoretical uncertainty, whereas the width at \(T_{\min }+1\) determines the statistical uncertainty.^{17} The case of educated Rfit was described in Sect. 6.2.4. The confidence intervals are obtained from the *p* value determined from the “average” column.
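As an illustration of how the nG row can be reproduced, a minimal sketch that combines each determination's statistical and theoretical uncertainties in quadrature and performs the standard inverse-variance weighted average (no correlations, symmetrized errors):

```python
import numpy as np

# (mean, stat, [theo components]) for B_K from Table 6 (top)
inputs = [
    (0.532,  0.019,  [0.003, 0.007, 0.003, 0.008, 0.005]),
    (0.5572, 0.0028, [0.0045, 0.0033, 0.0039, 0.0006, 0.0134]),
    (0.5644, 0.0059, [0.0022, 0.0008, 0.0006, 0.0006, 0.0002, 0.0056]),
    (0.554,  0.008,  [0.007, 0.003, 0.012]),
    (0.5388, 0.0034, [0.0237, 0.0048, 0.0005, 0.0108, 0.0022, 0.0016, 0.0005]),
]
x = np.array([m for m, _, _ in inputs])
# total uncertainty: stat and all theo components combined in quadrature
sig = np.array([np.hypot(s, np.linalg.norm(t)) for _, s, t in inputs])
w = 1.0 / sig**2
mean, err = np.sum(w * x) / np.sum(w), np.sum(w) ** -0.5
# mean ≈ 0.5577, err ≈ 0.0063, reproducing the nG row of Table 6 (middle)
```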

We compute the pulls in the same way in both cases, interpreting the difference of \(T_\mathrm{min}\) with and without the observables as a random variable distributed according to a \(\chi ^2\) law with \(N_\mathrm{dof}=1\). The propagation of uncertainties for the quadratic statistic was detailed in Sects. 6.2.1 and 6.2.2 where the separate extraction of statistical and theoretical uncertainties was described. The tables are obtained by plugging the average into the one-dimensional *p* value associated with the method, and reading from the *p* value the corresponding confidence interval at the chosen significance. The associated pulls are given in Table 6 (bottom).
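Reading a confidence interval off a one-dimensional *p* value can be sketched as a simple scan: keep the region where the *p* value stays above the threshold for the chosen significance. With a purely Gaussian *p* value the \(k\sigma \) interval must come out as \(\mu \pm k\sigma \), which provides a check of the scan (function names and grid are illustrative):

```python
import math

def confidence_interval(pvalue, k, lo, hi, n=200001):
    """Keep the scanned region where p(x) exceeds the k-sigma threshold."""
    threshold = math.erfc(k / math.sqrt(2.0))   # two-sided Gaussian threshold
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    kept = [xv for xv in xs if pvalue(xv) >= threshold]
    return min(kept), max(kept)

mu, sigma = 0.5577, 0.0063
p_gauss = lambda xv: math.erfc(abs(xv - mu) / (sigma * math.sqrt(2.0)))
low, high = confidence_interval(p_gauss, 2.0, 0.50, 0.62)
# low, high ≈ mu - 2*sigma, mu + 2*sigma
```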

Top: lattice determinations of the \(D_s\)-meson decay constant \(f_{D_s}\) (in MeV). Middle: averages according to the various methods, and corresponding confidence intervals for various significances. Bottom: pull associated to each measurement for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form \(p\pm \sigma \pm \varDelta \) (significance of the pull)

References | \(N_f\) | Mean | Stat | Theo
---|---|---|---|---
ETMC09 [46] | 2 | 244 | ±3 | \(\pm 2 \pm 7\)
HPQCD10 [47] | 2 + 1 | 248.0 | ±1.4 | \(\pm 0.4 \pm 1.4 \pm 1.0 \pm 0.8 \pm 0.3 \pm 0.3 \pm 0.3\)
FNAL-MILC11 [48] | 2 + 1 | 260.1 | ±8.9 | \(\pm 2.2 \pm 1.6\pm 1.0\pm 1.4 \pm 2.8 \pm 2.0 \pm 3.4 \pm 1.8\)
FNAL-MILC14 [49] | 2 + 1 + 1 | 248.8 | ± 0.3 | \(\pm 1.2 \pm 0.2\pm 0.1 \pm 0.4\)
ETMC14 [50] | 2 + 1 + 1 | 247.2 | ±3.9 | \(\pm 0.7 \pm 1.2 \pm 0.3\)

Method | Average | 1\(\sigma \) CI | 2\(\sigma \) CI | 3\(\sigma \) CI | 5\(\sigma \) CI
---|---|---|---|---|---
nG | \(248.5\pm 1.1 \pm 0\) | \(248.5 \pm 1.1\) | \(248.5 \pm 2.2\) | \(248.5 \pm 3.3\) | \(248.5 \pm 5.5\)
Naive Rfit | \(248.1\pm 0.9 \pm 1.3\) | \(248.1\pm 2.2\) | \(248.1 \pm 3.1\) | \(248.1 \pm 4.1\) | \(248.1\pm 5.9\)
Educ Rfit | \(248.1\pm 0.3 \pm 1.9\) | \(248.1 \pm 2.2\) | \(248.1 \pm 2.5\) | \(248.1 \pm 2.8\) | \(248.1 \pm 3.4\)
1-hypercube | \(248.5 \pm 0.5 \pm 2.7\) | \(248.5 \pm 3.0\) | \(248.5 \pm 3.5\) | \(248.5 \pm 4.0\) | \(248.5 \pm 5.0\)
Adapt hyperball | \(248.5\pm 0.5 \pm 1.0\) | \(248.5 \pm 1.2\) | \(248.5 \pm 2.8\) | \(248.5 \pm 4.3\) | \(248.5 \pm 7.2\)

Pull | nG | (e)Rfit | 1-hypercube | Adaptive hyperball
---|---|---|---|---
ETMC09 | \(-0.59\pm 1.01\pm 0\ (0.6\sigma )\) | \((0.0 \sigma )\) | \(-0.59\pm 0.39\pm 1.47\ (0.0\sigma )\) | \(-0.59\pm 0.39\pm 0.93\ (0.6\sigma )\)
HPQCD10 | \(-0.28\pm 1.12\pm 0\ (0.3\sigma )\) | \((0.0 \sigma )\) | \(-0.28\pm 0.60\pm 2.77 \ (0.0\sigma )\) | \(-0.28\pm 0.60\pm 0.95\ (0.4\sigma )\)
FNAL-MILC11 | \(1.08\pm 1.00\pm 0\ (1.1\sigma )\) | \((0.0 \sigma )\) | \(1.08\pm 0.82\pm 1.74 \ (0.3\sigma )\) | \(1.08\pm 0.83\pm 0.57\ (1.0\sigma )\)
FNAL-MILC14 | \(0.63\pm 1.82\pm 0\ (0.3\sigma )\) | \((0.0 \sigma )\) | \(0.63\pm 1.05\pm 4.97 \ (0.0\sigma )\) | \(0.63\pm 1.05\pm 1.48 \ (0.5\sigma )\)
ETMC14 | \(-0.35\pm 1.04\pm 0\ (0.3\sigma )\) | \((0.0 \sigma )\) | \(-0.35\pm 0.94\pm 1.20\ (0.2\sigma )\) | \(-0.35\pm 0.94\pm 0.43 \ (0.4\sigma )\)

For both quantities \(B_K\) and \(f_{D_s}\) at large confidence level (3\(\sigma \) and above), the most conservative method is the adaptive hyperball nuisance approach, whereas the one leading to the smallest uncertainties is the educated Rfit approach. Below 3\(\sigma \), the 1-hypercube approach is more conservative than the adaptive hyperball nuisance approach, and it becomes less conservative above that threshold. The most important differences are observed at large CL/significance. The statistical uncertainty obtained in the nG approach is by construction identical to the combination in quadrature of the statistical and theoretical uncertainties obtained in the adaptive hyperball approach. However, one can notice that the confidence intervals for high significances in the two approaches are different, with nG being less conservative. The overall very good agreement of lattice determinations means vanishing pulls for Rfit methods (since all the wells have a common bottom with a vanishing \(T_\mathrm{min}\)). For the other methods, the pull parameter has statistical and theoretical errors of similar size in the adaptive hyperball case, whereas theoretical errors tend to dominate in the 1-hypercube method. This yields smaller pulls in the latter approach.

A last illustration, which does not come solely from lattice simulations, is provided by the determination of the strong coupling constant \(\alpha _S(M_Z)\). The subject is covered extensively by recent reviews [34, 51], and we stress that we do not claim to provide an accurate alternative average to these reviews, which would require a careful assessment of the various determinations and their correlations. As a purely illustrative example, we will focus on the average of determinations from \(e^+e^-\) annihilation under a set of simplistic hypotheses for the separation between statistical and theoretical uncertainties. In order to allow for a closer comparison with Refs. [34, 62], we try to assess correlations this time. We assume that the theoretical uncertainties for the same set of observables (j&s, 3j, T), but from different experiments, are 100% correlated, and that the statistical uncertainties for determinations from similar experimental data are 100% correlated (BS-T, DW-T, AFHMS-T).^{18}

We perform the averages in the different cases considered, see Table 8 (middle); they are represented graphically in Fig. 8 (a similar plot at \(3\sigma \) is given in Fig. 13 in Appendix D). We notice that the various approaches yield results with central values similar to the nG case. The pulls for individual quantities are mostly around 1\(\sigma \), and they are smaller in the adaptive hyperball approach compared to the nG one, showing better consistency. Refs. [34, 62] take a different approach, “range averaging”, which amounts to considering the spread of the central values for the various determinations, leading to \(\alpha _S(M_Z)=0.1174 \pm 0.0051\) for the determination from \(e^+e^-\) annihilation data considered here [62]. This approach is motivated in Ref. [34] by the complicated pattern of correlations and the limited compatibility between some of the inputs; moreover, it does not take into account the fact that the different determinations have different accuracies according to the uncertainties quoted. The approach in Refs. [34, 62] conservatively accounts for the possibility that some uncertainties are underestimated. On the contrary, our averages given in Table 8 and Fig. 8 assume that all the inputs should be taken into account and averaged according to the uncertainties given in the original articles. The difference in the underlying hypotheses explains the large difference observed between our results and the ones in Refs. [34, 62]. Note, however, that our numerics directly follow from the use of the different averaging methods, and lack the necessary critical assessment of the individual determinations of \(\alpha _S(m_Z)\) performed in Refs. [34, 62].
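A weighted average with correlations, as assumed here for the theory errors, can be sketched with the standard best-linear-unbiased-estimate weights built from the inverse covariance matrix; the numbers below are hypothetical and only illustrate the construction:

```python
import numpy as np

def blue_average(x, cov):
    """Best linear unbiased estimate: weights from the inverse covariance."""
    vinv = np.linalg.inv(cov)
    weights = vinv.sum(axis=1) / vinv.sum()
    return weights @ x, np.sqrt(1.0 / vinv.sum())

# two hypothetical determinations with 100% correlated theory errors
x = np.array([0.1172, 0.1165])
stat = np.array([0.0013, 0.0022])
theo = np.array([0.0017, 0.0017])
cov = np.diag(stat**2) + np.outer(theo, theo)  # fully correlated theory part
mean, err = blue_average(x, cov)
```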

### 7.2 Averaging incompatible or barely compatible measurements

Another important issue occurs when one wants to combine barely compatible measurements. This is for instance the case for \(|V_{ub}|\) and \(|V_{cb}|\) from semileptonic decays, where inclusive and exclusive determinations are not in very good agreement. The list of determinations used for illustrative purposes and the results for each method are given in Tables 9 and 10, together with the corresponding graphical comparisons in Fig. 9 (a similar plot at \(3\sigma \) is given in Fig. 14 in Appendix D). Our inputs are slightly different from Ref. [36] for several reasons. The inclusive determination of \(|V_{ub}|\) corresponds to the BLNP approach [64], and we consider the theoretical uncertainties from shape functions (leading and subleading), weak annihilation, and heavy-quark expansion uncertainties on matching and \(m_b\). We use only branching fractions measured for \(B\rightarrow \pi \ell \nu \) and average the unquenched lattice calculations quoted in Ref. [36]. For \(|V_{cb}|\) exclusive we also split the various sources of theoretical uncertainties coming from the determination of the form factors. We assume that there are no correlations among all these uncertainties.

The lack of compatibility between the two types of determination means in particular that the naive Rfit combined likelihood has no flat bottom, and thus no theoretical uncertainty. This behavior was one of the reasons to propose the educated Rfit approach, in which the theoretical uncertainty of the combination cannot be smaller than the smallest of the individual theoretical uncertainties.
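The naive Rfit behavior can be made concrete with a small sketch: each measurement contributes a "well" statistic that vanishes within its theoretical range and grows quadratically outside. For two barely compatible inputs (hypothetical numbers loosely inspired by the \(|V_{ub}|\) example, theory components summed linearly into the well half-width), the non-overlapping wells force the combined minimum to a single point, leaving no flat bottom:

```python
import numpy as np

def rfit_well(xv, mu, sigma, delta):
    """Naive Rfit statistic: zero inside the theory well, Gaussian rise outside."""
    d = np.maximum(np.abs(xv - mu) - delta, 0.0)
    return (d / sigma) ** 2

xv = np.linspace(3.0, 5.0, 2001)
# two barely compatible determinations (hypothetical numbers)
T = rfit_well(xv, 3.28, 0.15, 0.26) + rfit_well(xv, 4.36, 0.18, 0.20)
flat = xv[T <= T.min() + 1e-9]
# wells [3.02, 3.54] and [4.16, 4.56] do not overlap: the minimum is
# attained at a single grid point with T_min > 0, i.e. no flat bottom
```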

The same pattern of conservative and aggressive approaches can be observed, with a fairly good agreement at the 3\(\sigma \) level (apart from the naive Rfit approach, already discussed). At 5\(\sigma \), the adaptive hyperball proves again rather conservative, even though the theoretical errors of the averages are smaller than in the 1-hypercube nuisance and educated Rfit approaches. The analysis of the pulls yields similar conclusions, with discrepancies at the 2\(\sigma \) level for \(|V_{ub}|\) and between 2 and 3\(\sigma \) for \(|V_{cb}|\). Once again, the theoretical errors for the pull parameters are larger in the 1-hypercube approach than in the adaptive hyperball case. Let us also notice that in both cases there are only two quantities to combine, so that the two pull parameters are by construction opposite to each other up to an irrelevant scaling factor, leading to the same pull for both quantities.

Top: determinations of \(\alpha _S(M_Z)\) using \( e^+ e^- \) annihilation, taken from Ref. [34]. Middle: averages for \(\alpha _S(M_Z)\) from \(e^+e^-\) annihilation according to the various methods, and corresponding confidence intervals for various significances. Bottom: pull associated to each measurement for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form \(p\pm \sigma \pm \varDelta \)

References | Mean | Stat (\( \times 10^{-3} \)) | Theo (\( \times 10^{-3} \))
---|---|---|---
ALEPH-j & s [52] | 0.1224 | \( \pm \)0.9 \( \pm \) 0.9 \( \pm \)1.2 | \( \pm \)3.5
OPAL-j & s [53] | 0.1189 | \( \pm \)0.8 \( \pm \) 1.6 \( \pm \)1.0 | \( \pm \)3.6
JADE-j & s [54] | 0.1172 | \( \pm \)0.6 \( \pm \) 2.0 \( \pm \)3.5 | \( \pm \)3.0
Dissertori-3j [55] | 0.1175 | \( \pm \)2.0 | \( \pm \)1.5
JADE-3j [56] | 0.1199 | \( \pm \)1.0 \( \pm \) 2.1 \( \pm \) 5.4 | \( \pm \)0.7
BS-T [57] | 0.1172 | \( \pm \)1.0 \( \pm \) 0.8 | \( \pm \) 1.2 \( \pm \)1.2
DW-T [58] | 0.1165 | \( \pm \)2.2 | \( \pm \) 1.7
AFHMS-T [59] | 0.1135 | \( \pm \)0.2 | \( \pm \) 0.5 \( \pm \)0.9
GLM-T [60] | 0.1134 | \( \pm \) 2.5 | \( \pm \) 0.6
HKMS-C [61] | 0.1123 | \( \pm \)0.2 | \( \pm \)0.7 \( \pm \)1.4

Method | Average | 1\(\sigma \) CI | 2\(\sigma \) CI | 3\(\sigma \) CI | 5\(\sigma \) CI
---|---|---|---|---|---
nG | \(0.1143 \pm 0.0010 \pm 0\) | \(0.1143 \pm 0.0010\) | \(0.1143 \pm 0.0020\) | \(0.1143 \pm 0.0030\) | \(0.1143 \pm 0.0050\)
Naive Rfit | \(0.1145\pm 0.0002 \pm 0\) | \(0.1145 \pm 0.0002\) | \(0.1145 \pm 0.0004\) | \(0.1145 \pm 0.0006\) | \(0.1145 \pm 0.0011\)
Educ Rfit | \(0.1145\pm 0.0001 \pm 0.0006\) | \(0.1145 \pm 0.0007\) | \(0.1145 \pm 0.0009\) | \(0.1145 \pm 0.0010\) | \(0.1145 \pm 0.0013\)
1-hypercube | \(0.1143\pm 0.0005 \pm 0.0018\) | \(0.1143 \pm 0.0020\) | \(0.1143 \pm 0.0026\) | \(0.1143 \pm 0.0031\) | \(0.1143 \pm 0.0041\)
Adapt hyperball | \(0.1143\pm 0.0005 \pm 0.0009\) | \(0.1143 \pm 0.0011\) | \(0.1143 \pm 0.0026\) | \(0.1143 \pm 0.0039\) | \(0.1143 \pm 0.0067\)

Pull | nG | (e)Rfit | 1-hypercube | Adaptive hyperball
---|---|---|---|---
ALEPH-j & s | \(1.30 \pm 0.69 \pm 0\ (1.9\sigma )\) | (2.5\(\sigma \)) | \(1.30 \pm 0.26 \pm 0.91\ (1.8\sigma )\) | \(1.30 \pm 0.26 \pm 0.63\ (1.6\sigma )\)
OPAL-j & s | \(0.93 \pm 0.69 \pm 0\ (1.3\sigma )\) | (0.4\(\sigma \)) | \(0.93 \pm 0.29 \pm 0.89\ (0.7\sigma )\) | \(0.93 \pm 0.29 \pm 0.63\ (1.2\sigma )\)
JADE-j & s | \(0.76 \pm 0.79 \pm 0\ (0.9\sigma )\) | (0.0\(\sigma \)) | \(0.76 \pm 0.55 \pm 0.84\ (0.6\sigma )\) | \(0.76 \pm 0.55 \pm 0.57\ (0.9\sigma )\)
Dissertori-3j | \(1.13 \pm 0.77 \pm 0\ (1.4\sigma )\) | (0.9\(\sigma \)) | \(1.13 \pm 0.58 \pm 0.95\ (0.9\sigma )\) | \(1.13 \pm 0.58 \pm 0.51\ (1.3\sigma )\)
JADE-3j | \(1.10 \pm 1.00 \pm 0\ (1.1\sigma )\) | (0.8\(\sigma \)) | \(1.10 \pm 0.98 \pm 0.46\ (1.0\sigma )\) | \(1.10 \pm 0.98 \pm 0.22\ (1.1\sigma )\)
BS-T | \(0.36 \pm 0.92 \pm 0\ (0.4\sigma )\) | (0.2\(\sigma \)) | \(0.36 \pm 0.88 \pm 0.41\ (0.4\sigma )\) | \(0.36 \pm 0.88 \pm 0.26\ (0.4\sigma )\)
DW-T | \(0.15 \pm 0.97 \pm 0\ (0.2\sigma )\) | (0.1\(\sigma \)) | \(0.15 \pm 0.96 \pm 0.18\ (0.2\sigma )\) | \(0.15 \pm 0.96 \pm 0.10\ (0.2\sigma )\)
AFHMS-T | \(-0.24 \pm 0.78 \pm 0\ (0.3\sigma )\) | (0.0\(\sigma \)) | \(-0.24 \pm 0.57 \pm 1.00\ (0.1\sigma )\) | \(-0.24 \pm 0.57 \pm 0.53\ (0.4\sigma )\)
GLM-T | \(-0.29 \pm 0.95 \pm 0\ (0.3\sigma )\) | (0.2\(\sigma \)) | \(-0.28 \pm 0.88 \pm 0.73\ (0.2\sigma )\) | \(-0.28 \pm 0.88 \pm 0.36\ (0.3\sigma )\)
HKMS-C | \(-2.27 \pm 1.35 \pm 0\ (1.7\sigma )\) | (1.4\(\sigma \)) | \(-2.27 \pm 0.72 \pm 2.27\ (0.7\sigma )\) | \(-2.27 \pm 0.72 \pm 1.14\ (1.4\sigma )\)

### 7.3 Averaging quantities dominated by different types of uncertainties

In order to illustrate the role played by statistical and theoretical uncertainties, we consider the question of averaging quantities dominated by one or the other. This happens for instance when one wants to compare a theoretically clean determination with other determinations potentially affected by large theoretical uncertainties. This situation occurs in flavor physics for instance when one compares the extraction of \(\sin (2\beta )\) from time-dependent asymmetries in \(b\rightarrow c\bar{c}s\) and \(b\rightarrow q\bar{q}s\) decays (let us recall that, for the CKM global fit, only the charmonium input is used for \(\sin (2\beta )\)). The former have a very small penguin pollution, which we will neglect, whereas the latter are significantly affected by such a pollution. The corresponding estimates of \(\sin (2\beta )\) have large theoretical uncertainties, and for illustration we use the computation done in Ref. [63].
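The correction applied to the penguin modes is a simple subtraction, \(\sin (2\beta ) = \sin (2\beta _\mathrm{eff}) - \varDelta S\), with the statistical error inherited from the measurement and the theoretical error from the penguin-pollution estimate. A quick check on two of the \(b\rightarrow q\bar{q}s\) modes:

```python
# mode: (sin2beta_eff, stat, DeltaS, theo error on DeltaS)
rows = {
    "pi0 KS":  (0.57,  0.17,  0.085,  0.065),
    "rho0 KS": (0.525, 0.195, -0.135, 0.155),
}
# corrected value keeps stat from the measurement, theo from DeltaS
corrected = {m: (round(v - dS, 3), s, t) for m, (v, s, dS, t) in rows.items()}
# corrected["pi0 KS"] -> (0.485, 0.17, 0.065); corrected["rho0 KS"] -> (0.66, 0.195, 0.155)
```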

Top: determinations of \(|V_{ub}|\cdot 10^3\) from semileptonic decays. Middle: averages according to the various methods, and corresponding confidence intervals for various significances. Bottom: pulls associated to each determination for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form \(p\pm \sigma \pm \varDelta \) (significance of the pull)

Reference | Mean | Stat | Theo
---|---|---|---
Exclusive | | |
CKMfitter Summer 14 | 3.28 | ±0.15 | ± 0.26
Inclusive | | |
CKMfitter Summer 14 | 4.359 | ±0.180 | \(\pm 0.013 \pm 0.027 \pm 0.037\pm 0.161 \pm 0.200\)

Method | Average | 1\(\sigma \) CI | 2\(\sigma \) CI | 3\(\sigma \) CI | 5\(\sigma \) CI
---|---|---|---|---|---
nG | \(3.79\pm 0.22 \pm 0\) | \( 3.79 \pm 0.22\) | \( 3.79 \pm 0.44\) | \( 3.79 \pm 0.65\) | \( 3.79 \pm 1.1\)
Naive Rfit | \(3.70\pm 0.12 \pm 0\) | \(3.70 \pm 0.12\) | \(3.70 \pm 0.23\) | \(3.70 \pm 0.35\) | \(3.70 \pm 0.58\)
Educ Rfit | \(3.70\pm 0.11 \pm 0.26\) | \(3.70 \pm 0.38\) | \(3.70 \pm 0.49\) | \(3.70 \pm 0.61\) | \(3.70 \pm 0.84\)
1-hypercube | \( 3.79\pm 0.12 \pm 0.34\) | \( 3.79 \pm 0.40\) | \( 3.79 \pm 0.54\) | \( 3.79 \pm 0.67\) | \( 3.79 \pm 0.91\)
Adapt hyperball | \( 3.79\pm 0.12 \pm 0.18\) | \( 3.79 \pm 0.24\) | \( 3.79 \pm 0.57\) | \( 3.79 \pm 0.88\) | \( 3.79 \pm 1.49\)

Pull | nG | (e)Rfit | 1-hypercube | Adaptive hyperball
---|---|---|---|---
Exclusive | \(-3.60\pm 1.46\pm 0 \ (2.5\sigma )\) | \((1.6 \sigma )\) | \(-3.60\pm 0.78\pm 2.31 \ (1.9\sigma )\) | \(-3.60\pm 0.78\pm 1.23 \ (1.9\sigma )\)
Inclusive | \(3.40\pm 1.38\pm 0 \ (2.5\sigma )\) | \((1.6 \sigma )\) | \(3.40\pm 0.74\pm 2.20 \ (1.9\sigma )\) | \(3.40\pm 0.74\pm 1.16 \ (1.9\sigma )\)

### 7.4 Global fits

In order to illustrate the impact of the treatment of theoretical uncertainties, we consider a global fit including mainly observables that come with a theoretical uncertainty. The list of observables is given in Table 12. Their values are motivated by the CKMfitter inputs used in Summer 2014, but they are used only for purposes of illustration.^{19} We consider two fits: Scenario A involves only constraints dominated by theoretical uncertainties, whereas Scenario B also includes constraints from the angles (statistically dominated).

As far as the CKM matrix elements are concerned, the Standard Model is linear, but it is not linear in all the other fundamental parameters of the Standard Model. For the illustrative purposes of this note, the first step thus consists in determining the minimum of the full (non-linear) \(\chi ^2\) and linearizing the Standard Model formulas for the various observables around this minimum (we choose the inputs of Scenario B to determine this point): this defines an exactly linear model, which at this stage should not be used for realistic phenomenology but is useful for the comparison of the methods presented here. One can use the results presented in the previous section in order to determine the *p* value as a function of each of the parameters of interest. In the case of the nuisance-\(\delta \) approach, we can describe this *p* value using the same parameters as before, namely a central value, a statistical error and a theoretical error.
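The linearization step can be sketched numerically: evaluate the observables at the best-fit point and build the Jacobian by central differences, so that \(O(\theta )\approx O(\theta _0)+J\,(\theta -\theta _0)\). The toy model below is purely hypothetical and stands in for the full CKM formulas:

```python
import numpy as np

def linearize(model, theta0, eps=1e-6):
    """Central-difference Jacobian of the observables around theta0."""
    theta0 = np.asarray(theta0, dtype=float)
    f0 = np.asarray(model(theta0), dtype=float)
    jac = np.empty((f0.size, theta0.size))
    for j in range(theta0.size):
        step = np.zeros_like(theta0)
        step[j] = eps
        jac[:, j] = (np.asarray(model(theta0 + step), dtype=float)
                     - np.asarray(model(theta0 - step), dtype=float)) / (2.0 * eps)
    return f0, jac

# toy non-linear 'observables' of two parameters (hypothetical stand-in)
def toy_model(th):
    return [th[0]**2 + th[1]**2, th[1] / th[0]]

f0, J = linearize(toy_model, [0.15, 0.35])
```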

The corresponding numerical results and *p* values for the CKM parameters are gathered in the accompanying tables and figures. As before, we observe that the methods give similar results at the 2–3\(\sigma \) level, although the adaptive hyperball method tends to be more conservative than the others.

Top: determinations of \(|V_{cb}|\cdot 10^3\) from semileptonic decays. Middle: averages according to the various methods, and corresponding confidence intervals for various significances. Bottom: pulls associated to each determination for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form \(p\pm \sigma \pm \varDelta \) (significance of the pull)

Reference | Mean | Stat | Theo
---|---|---|---
Exclusive | | |
CKMfitter Summer 14 | 38.99 | \(\pm 0.49\) | \(\pm 0.04\pm 0.21\pm 0.13\pm 0.39\pm 0.17\pm 0.04\pm 0.19\)
Inclusive | | |
CKMfitter Summer 14 | 42.42 | \(\pm 0.44\) | \(\pm 0.74\)

Method | Average | 1\(\sigma \) CI | 2\(\sigma \) CI | 3\(\sigma \) CI | 5\(\sigma \) CI
---|---|---|---|---|---
nG | \(40.41\pm 0.55 \pm 0\) | \(40.41\pm 0.55\) | \(40.41 \pm 1.11\) | \(40.41 \pm 1.66\) | \(40.41 \pm 2.77\)
Naive Rfit | \(41.00\pm 0.33 \pm 0\) | \(41.00\pm 0.32\) | \(41.00 \pm 0.65\) | \(41.00 \pm 0.98\) | \(41.00\pm 1.64\)
Educ Rfit | \(41.00\pm 0.33 \pm 0.74\) | \(41.00 \pm 1.07\) | \(41.00 \pm 1.39\) | \(41.00 \pm 1.72\) | \(41.00 \pm 2.38\)
1-hypercube | \(40.41\pm 0.34 \pm 0.99\) | \(40.41 \pm 1.15\) | \(40.41 \pm 1.57\) | \(40.41 \pm 1.94\) | \(40.41\pm 2.65\)
Adapt hyperball | \(40.41\pm 0.34 \pm 0.44\) | \(40.41 \pm 0.60\) | \(40.41 \pm 1.45\) | \(40.41 \pm 2.26\) | \(40.41 \pm 3.84\)

Pull | nG | (e)Rfit | 1-hypercube | Adaptive hyperball
---|---|---|---|---
Exclusive | \(-4.75\pm 1.56\pm 0 \ (3.1\sigma )\) | \((2.3\sigma )\) | \(-4.75\pm 0.91\pm 2.65 \ (2.6\sigma )\) | \(-4.75\pm 0.91\pm 1.26 \ (2.3\sigma )\)
Inclusive | \(3.98\pm 1.30\pm 0 \ (3.1\sigma )\) | \((2.3\sigma )\) | \(3.98\pm 0.77\pm 2.22 \ (2.6\sigma )\) | \(3.98\pm 0.77\pm 0.74 \ (2.3\sigma )\)

## 8 Conclusion

A problem often encountered in particle physics consists in analyzing data within the Standard Model (or some of its extensions) in order to extract information on the fundamental parameters of the model. An essential role is played here by uncertainties, which can be classified in two categories, statistical and theoretical. If the former can be treated in a rigorous manner within a given statistical framework, the latter must be described through models. The problem is particularly acute in flavor physics, as theoretical uncertainties often play a central role in the determination of underlying parameters, such as the four parameters describing the CKM matrix in the Standard Model.

We have presented several approaches to model these theoretical uncertainties: the random-\(\delta \) approach treats them as random variables, the external-\(\delta \) approach yields *p* values to be combined through model averaging, and the nuisance-\(\delta \) approach describes them through fixed biases which have to be varied over a reasonable region. These approaches have to be combined with particular choices for the test statistic used to compute the *p* value. We have illustrated these approaches in the one-dimensional case, recovering the Rfit model used by CKMfitter as a particular case of the external-\(\delta \) approach, and discussing the interesting alternative of a quadratic test statistic.

Top: symmetrized determinations of \(\sin (2\beta _\mathrm{eff})\) from various penguin \(b\rightarrow q\bar{q}s\) modes and from charmonia modes [36], and estimate within QCD factorization of the correction from penguin pollution in the Standard Model (symmetrized range quoted in Table 1 in Ref. [63]). We neglect any penguin pollution in the case of the charmonium extraction of \(\sin (2\beta )\). Middle: averages according to the various methods, and corresponding confidence intervals for various significances. Bottom: pulls associated to each determination for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form \(p\pm \sigma \pm \varDelta \) (significance of the pull)

Mode | \(\sin (2\beta _\mathrm{eff})\) | \(\varDelta S=\sin (2\beta _\mathrm{eff})-\sin (2\beta )\) | \(\sin (2\beta )\)
---|---|---|---
\(\pi ^0 K_S\) | \(0.57\pm 0.17\pm 0\) | \( 0.085 \pm 0\pm 0.065\) | \(0.485\pm 0.17\pm 0.065\)
\(\rho ^0 K_S\) | \(0.525 \pm 0.195\pm 0\) | \( -0.135 \pm 0\pm 0.155\) | \(0.66 \pm 0.195\pm 0.155\)
\(\eta ' K_S\) | \(0.63\pm 0.06\pm 0\) | \( 0.015 \pm 0\pm 0.015\) | \(0.615\pm 0.06\pm 0.015\)
\(\phi K_S\) | \(0.73 \pm 0.12\pm 0\) | \( 0.03 \pm 0\pm 0.02\) | \(0.7 \pm 0.12\pm 0.02\)
\(\omega K_S\) | \(0.71 \pm 0.21\pm 0\) | \( 0.11 \pm 0\pm 0.10\) | \(0.6 \pm 0.21\pm 0.10\)
\((c\bar{c}) K_S\) | \(0.689\pm 0.018\) | 0 | \(0.689\pm 0.018\pm 0\)

Method | Average | 1\(\sigma \) CI | 2\(\sigma \) CI | 3\(\sigma \) CI | 5\(\sigma \) CI
---|---|---|---|---|---
nG | \(0.681\pm 0.017\pm 0\) | \(0.681 \pm 0.017\) | \(0.681 \pm 0.034\) | \(0.681 \pm 0.051\) | \(0.681 \pm 0.085\)
Naive Rfit | \(0.683 \pm 0.017 \pm 0\) | \(0.683 \pm 0.017\) | \(0.683 \pm 0.034\) | \(0.683 \pm 0.051\) | \(0.683 \pm 0.085\)
Educ Rfit | \(0.683\pm 0.017\pm 0\) | \(0.683 \pm 0.017\) | \(0.683 \pm 0.034\) | \(0.683\pm 0.051\) | \(0.683 \pm 0.084\)
1-hypercube | \(0.681\pm 0.017\pm 0.003\) | \(0.681 \pm 0.017\) | \(0.681\pm 0.034\) | \(0.681\pm 0.052\) | \(0.681 \pm 0.086\)
Adapt hyperball | \(0.681\pm 0.017\pm 0.002\) | \(0.681 \pm 0.017\) | \(0.681 \pm 0.034\) | \(0.681\pm 0.052\) | \(0.681\pm 0.090\)

Pull | nG | (e)Rfit | 1-hypercube | Adaptive hyperball
---|---|---|---|---
\(\pi ^0 K_S\) | \(-1.09\pm 1.00\pm 0 \ (1.1\sigma )\) | \((0.8 \sigma )\) | \(-1.09\pm 0.94\pm 0.37 \ (1.1\sigma )\) | \(-1.09\pm 0.94\pm 0.36\ (1.1\sigma )\)
\(\rho ^0 K_S\) | \(-0.09\pm 1.00\pm 0 \ (0.1\sigma )\) | \((0.0 \sigma )\) | \(-0.09\pm 0.79\pm 0.63 \ (0.1\sigma )\) | \(-0.09\pm 0.79\pm 0.62\ (0.1\sigma )\)
\(\eta ' K_S\) | \(-1.16\pm 1.04\pm 0 \ (1.1\sigma )\) | \((0.9 \sigma )\) | \(-1.16\pm 1.01\pm 0.28 \ (1.1\sigma )\) | \(-1.16\pm 1.01\pm 0.24 \ (1.1\sigma )\)
\(\phi K_S\) | \(0.16\pm 1.01\pm 0 \ (0.1\sigma )\) | \((0.0 \sigma )\) | \(0.16\pm 1.00\pm 0.19 \ (0.2\sigma )\) | \(0.16\pm 1.00\pm 0.17\ (0.2\sigma )\)
\(\omega K_S\) | \(-0.35\pm 1.00\pm 0 \ (0.3\sigma )\) | \((0.0 \sigma )\) | \(-0.35\pm 0.91\pm 0.44 \ (0.3\sigma )\) | \(-0.35\pm 0.91\pm 0.43\ (0.4\sigma )\)
\((c\bar{c}) K_S\) | \(3.79\pm 2.97\pm 0 \ (1.3\sigma )\) | \((1.1 \sigma )\) | \(3.79\pm 2.87\pm 1.63 \ (1.1\sigma )\) | \(3.79\pm 2.87\pm 0.78\ (1.2\sigma )\)

Inputs for the theory-dominated CKM fits, inspired by the data available in Summer 2014. Scenario A is restricted to the upper part of the table, whereas Scenario B includes all inputs

Observable | Input
---|---
\(|V_{ud}|\) | \(0.97425\pm 0\pm 0.00022\)
\(|V_{ub}|\) | \((3.70\pm 0.12\pm 0.26)\times 10^{-3}\)
\(|V_{cb}|\) | \((41.00\pm 0.33\pm 0.74)\times 10^{-3}\)
\(\varDelta m_{d}\) | \((0.510\pm 0.003)\) ps\(^{-1}\)
\(\varDelta m_{s}\) | \((17.757\pm 0.021)\) ps\(^{-1}\)
\(B_s/B_d\) | \(1.023\pm 0.013\pm 0.014\)
\(B_s\) | \(1.320\pm 0.017\pm 0.030\)
\(f_{B_s}/f_{B_d}\) | \(1.205\pm 0.004\pm 0.007\)
\(f_{B_s}\) | \(225.6\pm 1.1\pm 5.4\) MeV
\(\eta _B\) | \(0.5510\pm 0\pm 0.0022\)
\(\bar{m}_t\) | \(165.95\pm 0.35 \pm 0.64\) GeV
\(\alpha \) | \((87.8\pm 3.4)^\circ \)
\(\sin (2\beta )\) | \(0.682\pm 0.019\)
\(\gamma \) | \((72.8\pm 6.7)^\circ \)

*p* value by taking the supremum over the bias within a range fixed by the size of the theoretical uncertainty to be modeled (fixed nuisance approach). An alluring alternative consists in adjusting the size of the range to the confidence level chosen: an interval at low confidence level can be obtained by varying the bias parameter over a small range, whereas an interval at high confidence level may require a more conservative (and thus larger) range for the bias parameter. We have designed such a scheme, called the adaptive nuisance approach. It provides a unified statistical approach to deal with the metrology of the parameters (for low-CL ranges) and the exclusion of models (for high-CL ranges).
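As a minimal numerical sketch of the fixed nuisance approach (illustrative helper names, not the authors' implementation): for a single measurement \(x_0\) with statistical uncertainty \(\sigma\) and theoretical uncertainty \(\varDelta\), the *p* value for a candidate true value \(\mu\) is maximized over the bias \(\delta\in[-\varDelta,\varDelta]\), and the supremum is reached at the admissible \(\delta\) closest to \(x_0-\mu\):

```python
from math import erfc, sqrt

def gauss_pvalue(x0, mu, sigma):
    """Two-sided Gaussian p value for the hypothesis 'true value = mu'."""
    return erfc(abs(x0 - mu) / (sigma * sqrt(2.0)))

def fixed_nuisance_pvalue(x0, mu, sigma, delta_th):
    """Fixed nuisance approach: supremum of the Gaussian p value over the
    bias delta in [-delta_th, delta_th]. The supremum is reached at the
    admissible delta closest to x0 - mu, so no scan is needed."""
    delta = max(-delta_th, min(delta_th, x0 - mu))
    return gauss_pvalue(x0, mu + delta, sigma)
```

The adaptive variant would instead rescale the interval to \([-r\varDelta, r\varDelta]\), with \(r\) growing with the significance under consideration.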

Numerical results and *p* values for the CKM parameters in *A* and \(\lambda \) for Scenarios A and B, depending on the method chosen. For each quantity, we provide the error budget, whenever possible, and the plots of the *p* values for Scenarios A (left) and B (right)

Numerical results and *p* values for the CKM parameters in \(\bar{\rho }\) and \(\bar{\eta }\) for Scenarios A and B, depending on the method chosen. For each quantity, we provide the error budget, whenever possible, and the plots of the *p* values for Scenarios A (left) and B (right)

We have determined the *p* values associated with each approach for a measurement involving both statistical and theoretical uncertainties. We have also studied the size of error bars, the significance of deviations and the coverage properties. In general, the most conservative approaches are the naive Gaussian treatment (belonging to the random-\(\delta \) approach) and the adaptive nuisance approach. The latter is better defined and more conservative than the former when statistical and theoretical uncertainties are of similar size. Other approaches (fixed nuisance, external) turn out to be less conservative at large confidence levels.

We have then considered extensions to multi-dimensional cases, focusing on the linear case where the quantity of interest is a linear combination of observables. Due to the presence of several bias parameters, one has to make another choice concerning the shape of the space over which the bias parameters are varied. Two simple examples are the hypercube and the hyperball, leading to a linear or quadratic combination of theoretical uncertainties, respectively. The hypercube is more conservative, as it allows for sets of values of the bias parameters that cannot be reached within the hyperball. On the other hand, the hyperball has the great virtue of associativity, so that one can average different measurements of the same quantity or put all of them in a global fit, without changing its outcome. It also allows us to include theoretical correlations easily, both in the range of variation of biases to determine errors and in the definition of theoretical correlations for the outcome of a fit. We have discussed the average of several measurements using the various approaches, including correlations. We considered in detail the case of 100% correlations leading to a non-invertible covariance matrix. We also discussed global fits and pulls in a linearized context. We have then provided several comparisons between the different approaches using examples from flavor physics: averaging theory-dominated measurements, averaging incompatible measurements, and performing linear fits to a subset of flavor inputs.
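The two combination rules can be made concrete with a short sketch (a sketch with hypothetical helper names, assuming a linear combination \(\sum_i w_i O_i\) of observables): maximizing the bias over the hypercube \(|\delta_i|\le\varDelta_i\) yields a linear sum of the theoretical uncertainties, while the hyperball \(\sum_i(\delta_i/\varDelta_i)^2\le 1\) yields, by the Cauchy-Schwarz inequality, a combination in quadrature:

```python
from math import sqrt

def combine_hypercube(weights, deltas):
    """Maximal bias of sum_i w_i*delta_i over the hypercube |delta_i| <= Delta_i:
    a linear sum of the theoretical uncertainties."""
    return sum(abs(w) * d for w, d in zip(weights, deltas))

def combine_hyperball(weights, deltas):
    """Maximal bias of sum_i w_i*delta_i over the hyperball
    sum_i (delta_i/Delta_i)^2 <= 1: a combination in quadrature
    (Cauchy-Schwarz)."""
    return sqrt(sum((w * d) ** 2 for w, d in zip(weights, deltas)))
```

The hypercube result is always at least as large as the hyperball one, which illustrates why the hypercube is the more conservative choice.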

It is now time to determine which choice seems preferable in our case. Random-\(\delta \) has no strong statistical basis: its only advantage is its simplicity. External-\(\delta \) is closer in spirit to the determination of systematics as performed by experimentalists, but it starts from an inappropriate null hypothesis and tries to combine an infinite set of *p* values into a single *p* value. On the contrary, the nuisance-\(\delta \) approach starts from the beginning with the correct null hypothesis and deals with a single *p* value.

This choice is independent of a second one, namely the range of variation for the parameter \(\delta \). Indeed, when several bias parameters are involved, one may imagine different multi-dimensional spaces for their variations, in particular the hyperball and the hypercube. As said earlier, the hyperball has the interesting property of associativity when performing averages and avoids fine-tuned solutions where all parameters are pushed into a corner of phase space. The hypercube is closer in spirit to the Rfit model (even though the latter is not a bias model), but it cannot avoid fine-tuned situations, and it does not seem well suited to deal with theoretical correlations, since it is designed from the start to avoid such correlations.

A third choice consists in determining whether one wants to keep the volume of variation fixed (fixed approach) or to modify it depending on the desired confidence level (adaptive approach). The adaptive hypercube is in principle the most conservative choice, but in practice it yields overly large errors, whereas the fixed hyperball would yield very small ones. The fixed hypercube is more conservative at low confidence levels (large *p* values), whereas the adaptive hyperball is more conservative at high confidence levels (small *p* values).

This overall discussion leads us to consider the nuisance approach with an adaptive hyperball as a promising tool to deal with flavor-physics problems, which we will investigate in further phenomenological analyses in forthcoming publications [35].

## Footnotes

- 1.
The issue of theoretical uncertainties is naturally not the only question that arises in the context of statistical analyses. The statistical framework used to perform these analyses is also a matter of choice, with two main approaches, frequentist and Bayesian, adopted in different settings and for various problems in and beyond high-energy physics [1, 2, 3, 8, 9, 10]. In this paper, we choose to focus on the frequentist approach to discuss how to model theoretical uncertainties.

- 2.
It may happen that the detector and/or background effects have a sizable impact on the fitted quantities \(\hat{C}\) and \(\hat{S}\); this can be viewed as uncertainties in the modeling of the event PDF *f*. These effects are reported as *systematic uncertainties* and, in particle physics, it is customary to treat them on the same footing as the pure statistical uncertainties. Although we will not try to follow this avenue in the examples discussed here, it would be possible to consider these systematic uncertainties as theoretical uncertainties, to be modeled according to the methods that we describe in the following sections.

- 3.
“Nuisance” does not mean that these parameters are necessarily unphysical, “pollution” parameters. They can be fundamental constants of Nature, and interesting as such.

- 4.
We have defined compositeness for numerical hypotheses, since this is our case of interest in the following. More generally, compositeness also occurs in the case of non-numerical hypotheses such as “The Standard Model is true”, for which it is not possible to compute the distribution of data either. Indeed assuming that the Standard Model is true does not imply anything on the value of its fundamental parameters, and thus one cannot compute the distribution of a given observable under this hypothesis.

- 5.
Strictly speaking, the likelihood is only defined for the actually measured data \(X_0\): \({\mathcal L}_0(\chi )\equiv g(X_0;\chi )\) and thus is only a function of the parameters \(\chi \). Nevertheless it is common practice to use the word “likelihood” for the object \(g(X;\chi )\), considered as a function of both the observables *X* and the parameters \(\chi \).

- 6.
More precisely, the asymptotic limit is reached when the model can be linearized for all values of the data that contribute significantly to the integral (4). It corresponds to the situation where the errors on the parameters derived from computing *p* values are small with respect to the typical parameter scales of the problem.

- 7.
A bias is defined as the difference between the average of the estimator among a large number of experiments with finite sample size and the true value. An estimator is said to be consistent if it converges to the true value when the size of the sample tends to infinity (e.g., maximum likelihood estimators). Consistency implies that the bias vanishes asymptotically, while inconsistency may stem from theoretical uncertainties.

- 8.
We discuss how the method can be adapted for asymmetric uncertainties in Appendix C.

- 9.
- 10.
The choice of the weight in the denominator of the test statistic will be discussed in the multi-dimensional case in Sect. 6.2.3, but it does not impact the result for the *p* value in one dimension, where it plays only the role of an overall normalization that cancels when computing the *p* value.

- 11.
Such a test statistic tends typically to be less sensitive to discrepancies in a global fit than the likelihood ratio. In the presence of quantities having no or little dependence on the scanned parameters, the impact of discrepancies is diluted in the case of the likelihood statistic.

- 12.
In full generality, one should have kept the different sources of theoretical uncertainties separated, as their combination in a single theoretical uncertainty depends on the precise model used for theoretical uncertainties. We consider here the result of Ref. [34] where all theoretical uncertainties are already combined.

- 13.
This is true at least for approaches where theoretical errors are modeled by fixed bias parameters: the combined error on the quantity of interest is a weighted sum as in Eq. (42), and the maximal value of this quantity is guaranteed to be larger than each individual contribution only if the corners of the hypercube are included in the maximization region.

- 14.
This problem does not occur in the hyperball case, where the section of the hyperellipsoid by a hyperplane always yields an ellipse symmetric along the diagonal, with an elongation according to the theoretical correlation between the biases.

- 15.
As discussed in Sect. 4.4, the overall normalization of \(T(X;\mu )\) is irrelevant to derive uni-dimensional *p* values.

- 16.
In this section, we will not deal with asymmetric uncertainties, and for illustrative purpose, we symmetrize all uncertainties, statistical and theoretical, following Eq. (C.39).

- 17.
In general, for naive Rfit, the tails of the resulting test statistic *T* are neither Gaussian nor symmetric. However, our approximation is valid to a good accuracy for our illustrative purposes and the examples discussed in this section.

- 18.
In addition, we have made further choices concerning the separation of statistical and theoretical uncertainties based on the following considerations. Ref. [60] discusses the sources of uncertainties (scales, function parameters, b-quark mass) within a fit leading to uncertainties assumed to be of statistical nature, with a further systematic uncertainty coming from the difference between the two different schemes. The systematic uncertainties in Ref. [57] are assumed to be of statistical nature in the absence of any opposite statement. For the first two classes (j & s and 3j) hadronization is taken into account by Monte Carlo methods, while for the last two classes (T and C) analytic analyses are made: in the former (latter) case, the hadronic uncertainties are treated as statistical (theoretical).

- 19.
In particular, most of the inputs have several sources of theoretical uncertainties, which should be combined together linearly or in quadrature according to the model of theoretical uncertainties chosen. Since we just want to illustrate the difference between the various approaches at the level of the fit, we take as inputs the values obtained in a given framework (Rfit) without recomputing the averages and uncertainties for each approach.

- 20.
The definition of \(C_s^+\) can be extended for an arbitrary matrix *C* in the following way. \(\varSigma \) is defined as the diagonal matrix with entries \(\{\sqrt{|C_{11}|},\ldots ,\sqrt{|C_{NN}|}\}\) (if a diagonal entry is 0, one defines \(\varSigma \) with 1 in the corresponding entry). The matrix \(\varGamma =\varSigma ^{-1}.C.\varSigma ^{-1} \) can be written according to a singular value decomposition \(\varGamma =R.D.S\) with two rotation matrices *R* and *S*. Once the generalized inverse \(D^+\) is defined, the corresponding generalized inverse of *C* is defined as \(C^+=\varSigma ^{-1}.S^T.D^+.R^T.\varSigma ^{-1}\).

- 21.
One could try to symmetrize the problem, but one would lose the connection with the Cholesky decomposition, with the unpleasant feature that all domains of variation would be identical and would thus not take correlations into account.
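For illustration, the generalized-inverse construction described in footnote 20 can be sketched as follows (a sketch assuming NumPy; `generalized_inverse` is a hypothetical helper name, not code from the paper):

```python
import numpy as np

def generalized_inverse(C, tol=1e-12):
    """Generalized inverse C^+ of an arbitrary square matrix C, following
    the construction of footnote 20: rescale by the diagonal, invert the
    non-vanishing singular values, and undo the rescaling."""
    C = np.asarray(C, dtype=float)
    d = np.sqrt(np.abs(np.diag(C)))
    d[d == 0.0] = 1.0                    # vanishing diagonal entries -> 1
    Sigma_inv = np.diag(1.0 / d)
    Gamma = Sigma_inv @ C @ Sigma_inv    # rescaled matrix
    R, s, S = np.linalg.svd(Gamma)       # Gamma = R . diag(s) . S
    D_plus = np.diag([1.0 / x if x > tol else 0.0 for x in s])
    return Sigma_inv @ S.T @ D_plus @ R.T @ Sigma_inv
```

For an invertible *C* this reduces to the ordinary inverse, while in the singular case (e.g., a 100% correlated covariance matrix) it still satisfies \(C\,C^+C = C\).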

## Notes

### Acknowledgements

We would like to thank S. T’Jampens for collaboration at an early stage of this work, as well as all our collaborators from the CKMfitter group for many useful discussions on the statistical issues covered in this article. We would also like to express special thanks to the Mainz Institute for Theoretical Physics (MITP) for its hospitality and support during the workshop “Fundamental parameters from lattice QCD”, where part of this work was presented and discussed. LVS acknowledges financial support from the Labex P2IO (Physique des 2 Infinis et Origines). SDG acknowledges partial support from Contract FPA2014-61478-EXP. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant agreements Nos. 690575, 674896 and 692194.

### References

- 1. F. James, *Statistical Methods in Experimental Physics* (World Scientific, Hackensack, 2006)
- 2. G. Cowan, Statistics for Searches at the LHC. doi:10.1007/978-3-319-05362-2_9. arXiv:1307.2487 [hep-ex]
- 3. M.G. Kendall, A. Stuart, *The Advanced Theory of Statistics* (Griffin, London, 1969)
- 4. P. Sinervo, eConf C **030908**, TUAT004 (2003)
- 5. M. Schmelling, arXiv:hep-ex/0006004
- 6. W.A. Rolke, A.M. Lopez, J. Conrad, Nucl. Instrum. Methods A **551**, 493 (2005). arXiv:physics/0403059
- 7. R.D. Cousins, J.T. Linnemann, J. Tucker, Nucl. Instrum. Methods A **595**, 480 (2008)
- 8. W.M. Bolstad, J.M. Curran, *Introduction to Bayesian Statistics* (Wiley, New York, 2016)
- 9. G. D’Agostini, Rep. Prog. Phys. **66**, 1383 (2003). doi:10.1088/0034-4885/66/9/201. arXiv:physics/0304102
- 10. A.J. Bevan, *Statistical Data Analysis for the Physical Sciences* (Cambridge University Press, Cambridge, 2013)
- 11. S. Schael et al., ALEPH and DELPHI and L3 and OPAL and LEP Electroweak Collaborations, Phys. Rep. **532**, 119 (2013). arXiv:1302.3415 [hep-ex]
- 12. M. Baak et al., Gfitter Group Collaboration, Eur. Phys. J. C **74**, 3046 (2014). arXiv:1407.3792 [hep-ph]
- 13. R. Aaij et al., LHCb Collaboration, Eur. Phys. J. C **73**(4), 2373 (2013). arXiv:1208.3355 [hep-ex]
- 14. A.J. Bevan et al., BaBar and Belle Collaborations, Eur. Phys. J. C **74**, 3026 (2014). arXiv:1406.6311 [hep-ex]
- 15. A. Hocker, H. Lacker, S. Laplace, F. Le Diberder, Eur. Phys. J. C **21**, 225 (2001). arXiv:hep-ph/0104062
- 16. J. Charles et al., CKMfitter Group Collaboration, Eur. Phys. J. C **41**, 1 (2005). arXiv:hep-ph/0406184
- 17. J. Charles et al., Phys. Rev. D **84**, 033005 (2011). arXiv:1106.4041 [hep-ph]
- 18. O. Deschamps, S. Descotes-Genon, S. Monteil, V. Niess, S. T’Jampens, V. Tisserand, Phys. Rev. D **82**, 073012 (2010). arXiv:0907.5135 [hep-ph]
- 19.
- 20. A. Lenz, U. Nierste, J. Charles, S. Descotes-Genon, H. Lacker, S. Monteil, V. Niess, S. T’Jampens, Phys. Rev. D **86**, 033008 (2012). arXiv:1203.0238 [hep-ph]
- 21. J. Charles, S. Descotes-Genon, Z. Ligeti, S. Monteil, M. Papucci, K. Trabelsi, Phys. Rev. D **89**(3), 033016 (2014). arXiv:1309.2293 [hep-ph]
- 22. J. Charles, O. Deschamps, S. Descotes-Genon, H. Lacker, A. Menzel, S. Monteil, V. Niess, J. Ocariz et al., Phys. Rev. D **91**(7), 073007 (2015). arXiv:1501.05013 [hep-ph]
- 23. B. Aubert et al., BaBar Collaboration, Phys. Rev. D **79**, 072009 (2009). doi:10.1103/PhysRevD.79.072009. arXiv:0902.1708 [hep-ex]
- 24. I. Adachi et al., Phys. Rev. Lett. **108**, 171802 (2012). doi:10.1103/PhysRevLett.108.171802. arXiv:1201.4643 [hep-ex]
- 25. R. Aaij et al., LHCb Collaboration, Phys. Rev. Lett. **115**(3), 031601 (2015). doi:10.1103/PhysRevLett.115.031601. arXiv:1503.07089 [hep-ex]
- 26. I.I.Y. Bigi, V.A. Khoze, N.G. Uraltsev, A.I. Sanda, Adv. Ser. Direct. High Energy Phys. **3**, 175 (1989). doi:10.1142/9789814503280_0004
- 27. J. Neyman, E.S. Pearson, Philos. Trans. R. Soc. Lond. A **231**, 289–337 (1933)
- 28. S.S. Wilks, Ann. Math. Stat. **9**(1), 60–62 (1938)
- 29. F.C. Porter, arXiv:0806.0530 [physics.data-an]
- 30. S. Fichet, G. Moreau, Nucl. Phys. B **905**, 391 (2016). arXiv:1509.00472 [hep-ph]
- 31.
- 32. G. Dubois-Felsmann, D.G. Hitlin, F.C. Porter, G. Eigen, arXiv:hep-ph/0308262
- 33. G. Eigen, G. Dubois-Felsmann, D.G. Hitlin, F.C. Porter, Phys. Rev. D **89**(3), 033004 (2014). arXiv:1301.5867 [hep-ex]
- 34. K.A. Olive et al., Particle Data Group Collaboration, Chin. Phys. C **38**, 090001 (2014)
- 35. J. Charles et al., Work in progress
- 36. Y. Amhis et al., Heavy Flavor Averaging Group (HFAG) Collaboration, arXiv:1412.7515 [hep-ex]
- 37. S. Aoki et al., arXiv:1607.00299 [hep-lat]
- 38. M. Schmelling, Phys. Scr. **51**, 676 (1995)
- 39. H. Ruben, Ann. Math. Stat. **33**(2), 542–570 (1962)
- 40. A. Castaño-Martínez, F. López-Blázquez, TEST **14**, 397 (2005)
- 41. M. Constantinou et al., ETM Collaboration, Phys. Rev. D **83**, 014505 (2011). arXiv:1009.5606 [hep-lat]
- 42.
- 43. S. Durr, Z. Fodor, C. Hoelbling, S.D. Katz, S. Krieg, T. Kurth, L. Lellouch, T. Lippert et al., Phys. Lett. B **705**, 477 (2011). arXiv:1106.3230 [hep-lat]
- 44. R. Arthur et al., RBC and UKQCD Collaborations, Phys. Rev. D **87**, 094514 (2013). arXiv:1208.4412 [hep-lat]
- 45. T. Bae et al., SWME Collaboration, arXiv:1402.0048 [hep-lat]
- 46.
- 47. C.T.H. Davies, C. McNeile, E. Follana, G.P. Lepage, H. Na, J. Shigemitsu, Phys. Rev. D **82**, 114504 (2010). arXiv:1008.4018 [hep-lat]
- 48. A. Bazavov et al., Fermilab Lattice and MILC Collaborations, Phys. Rev. D **85**, 114506 (2012). arXiv:1112.3051 [hep-lat]
- 49. A. Bazavov et al., Fermilab Lattice and MILC Collaborations, Phys. Rev. D **90**(7), 074509 (2014). arXiv:1407.3772 [hep-lat]
- 50. N. Carrasco, P. Dimopoulos, R. Frezzotti, P. Lami, V. Lubicz, F. Nazzaro, E. Picca, L. Riggio et al., Phys. Rev. D **91**(5), 054507 (2015). arXiv:1411.7908 [hep-lat]
- 51. D. d’Enterria, P.Z. Skands, arXiv:1512.05194 [hep-ph]
- 52. G. Dissertori, A. Gehrmann-De Ridder, T. Gehrmann, E.W.N. Glover, G. Heinrich, G. Luisoni, H. Stenzel, JHEP **0908**, 036 (2009). arXiv:0906.3436 [hep-ph]
- 53. G. Abbiendi et al., OPAL Collaboration, Eur. Phys. J. C **71**, 1733 (2011). arXiv:1101.1470 [hep-ex]
- 54. S. Bethke et al., JADE Collaboration, Eur. Phys. J. C **64**, 351 (2009). arXiv:0810.1389 [hep-ex]
- 55. G. Dissertori, A. Gehrmann-De Ridder, T. Gehrmann, E.W.N. Glover, G. Heinrich, H. Stenzel, Phys. Rev. Lett. **104**, 072002 (2010). arXiv:0910.4283 [hep-ph]
- 56. J. Schieck et al., JADE Collaboration, Eur. Phys. J. C **73**(3), 2332 (2013). arXiv:1205.3714 [hep-ex]
- 57.
- 58. R.A. Davison, B.R. Webber, Eur. Phys. J. C **59**, 13 (2009). arXiv:0809.3326 [hep-ph]
- 59. R. Abbate, M. Fickinger, A.H. Hoang, V. Mateu, I.W. Stewart, Phys. Rev. D **83**, 074021 (2011). arXiv:1006.3080 [hep-ph]
- 60. T. Gehrmann, G. Luisoni, P.F. Monni, Eur. Phys. J. C **73**(1), 2265 (2013). arXiv:1210.6945 [hep-ph]
- 61. A.H. Hoang, D.W. Kolodrubetz, V. Mateu, I.W. Stewart, Phys. Rev. D **91**(9), 094018 (2015). arXiv:1501.04111 [hep-ph]
- 62. S. Bethke, G. Dissertori, G.P. Salam, EPJ Web Conf. **120**, 07005 (2016). doi:10.1051/epjconf/201612007005
- 63.
- 64. B.O. Lange, M. Neubert, G. Paz, Phys. Rev. D **72**, 073006 (2005). arXiv:hep-ph/0504071

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Funded by SCOAP^{3}