Modelling theoretical uncertainties in phenomenological analyses for particle physics

The determination of the fundamental parameters of the Standard Model (and its extensions) is often limited by the presence of statistical and theoretical uncertainties. We present several models for the latter uncertainties (random, nuisance, external) in the frequentist framework, and we derive the corresponding $p$-values. In the case of the nuisance approach where theoretical uncertainties are modeled as biases, we highlight the important, but arbitrary, issue of the range of variation chosen for the bias parameters. We introduce the concept of adaptive $p$-value, which is obtained by adjusting the range of variation for the bias according to the significance considered, and which allows us to tackle metrology and exclusion tests with a single and well-defined unified tool, which exhibits interesting frequentist properties. We discuss how the determination of fundamental parameters is impacted by the model chosen for theoretical uncertainties, illustrating several issues with examples from quark flavour physics.

In particle physics, an important part of the data analysis is devoted to the interpretation of the data with respect to the Standard Model (SM) or some of its extensions, with the aim of comparing different alternative models or determining the fundamental parameters of a given underlying theory [1][2][3]. In this activity, the role played by uncertainties is essential, since they constitute the limit for the accurate determination of these parameters, and they can prevent from reaching a definite conclusion when comparing several alternative models. In some cases, these uncertainties are from a statistical origin: they are related to the intrinsic variability of the phenomena observed, they decrease as the sample size increases and they can be modeled using random variables. A large part of the experimental uncertainties belong to this first category. However, another kind of uncertainties occurs when one wants to describe inherent limitations of the analysis process, for instance, uncertainties in the calibration or limits of the models used in the analysis. These uncertainties are very often encountered in theoretical computations, for instance when assessing the size of higher orders in perturbation theory or the validity of extrapolation formulae. Such uncertainties are often called "systematics", but they should be distinguished from less dangerous sources of systematic uncertainties, usually of experimental origin, that roughly scale with the size of the statistical sample and may be reasonably modeled by random variables [4]. In the following we will thus call them "theoretical" uncertainties: by construction, they lack both an unambiguous definition (leading to various recipes to determine these uncertainties) and a clear interpretation (beyond the fact that they are not from a statistical origin). It is thus a complicated issue to incorporate their effect properly, even in simple situations often encountered in particle physics [5][6][7] 1 .
The relative importance of statistical and theoretical uncertainties might be different depending on the problem considered, and the progress made both by experimentalists and theorists. For instance, statistical uncertainties are the main issue in the analysis of electroweak precision observables [11,12]. On the other hand, in the field of quark flavour physics, theoretical uncertainties play a very important role. Thanks to the B-factories and LHCb, many hadronic processes have been very accurately measured [13,14], which can provide stringent constraints on the Cabibbo-Kobayashi-Maskawa matrix (in the Standard Model) [15][16][17], and on the scale and structure of New Physics (in SM extensions) [18][19][20][21]. However, the translation between hadronic processes and quark-level transitions requires information on hadronisation from strong interaction, encoded in decay constants, form factors, bag parameters. . . The latter are determined through lattice QCD simulations. The remarkable progress in computing power and in algorithms over the last 20 years has led to a decrease of statistical uncertainties and a dominance of purely theoretical uncertainties (chiral and heavy-quark extrapolations, scale chosen to set the lattice spacing, finite-volume effects, continuum limit. . . ). As an illustration, the determination of the Wolfenstein parameters of the CKM matrix involves many constraints which are now limited by theoretical uncertainties (neutral-meson mixing, leptonic and semileptonic decays. . . ) [22].
The purpose of this note is to discuss theoretical uncertainties in more detail in the context of particle physics phenomenology, comparing different models not only from a statistical point of view, but also in relation with the problems encountered in phenomenological analyses where they play a significant role. In Sec. 1, we summarise fundamental notions of statistics used in particle physics, in particular p-values and test statistics. In Sec. 2, we list properties that we seek in a good approach for theoretical uncertainties. In Sec. 3, we propose several approaches and in Sec. 4, we compare their properties in the most simple one-dimensional case. In Sec. 5, we consider multi-dimensional cases (propagation of theoretical uncertainties, average of several measurements, fits and pulls), which we illustrate using flavour physics examples related to the determination of the CKM matrix in Sec. 6, before concluding. An appendix is devoted to several issues connected with the treatment of correlations.

Statistics concepts for particle physics
We start by briefly recalling frequentist concepts used in particle physics, highlighting the role played by p-values in hypothesis testing and how they can be used to define confidence intervals.

Data fitting and data reduction
First, we would like to illustrate the concepts of data fitting and data reduction in particle physics, starting with a specific example, namely the observation of the time-dependent CP asymmetry in the decay channel B 0 (t) → J/ψK S by the BaBar, Belle and LHCb experiments [23][24][25]. Each experiment collects a sample of observed decay times t i corresponding to the B-meson events, where this sample is theoretically known to follow a PDF f . The PDF is parameterized in terms of a few physics parameters, among which we assume the ones of interest are the direct and mixing-induced C and S CP asymmetries. The functional form of this PDF is dictated on very general grounds by the CPT invariance and the formalism of two-state mixing (see, e.g., [26]), and is independent of the particular underlying phenomenological model (e.g. the Standard Model of particle physics). In practice however, detector effects require to be modelled by additional parameters that modify the shape of the PDF. We denote by θ the set of parameters θ = (C, S, . . .) that are needed to specify the PDF completely. The likelihood for the sample {t i } is defined by and can be used as a test statistic to infer constraints on the parameters θ, and/or construct estimators for them, as will be discussed in more detail below. The combination of different samples/experiments can be done simply by multiplication of the corresponding likelihoods. On the other hand one can choose to work directly in the framework of a specific phenomenological model, by replacing in θ the quantities that are predicted by the model in terms of more fundamental parameters: for example in the Standard Model, and neglecting the "penguin" contributions, one has the famous relations C = 0, S = sin 2β where β is one of the angles of the Unitarity Triangle and can be further expressed in terms of the Cabibbo-Kobayashi-Maskawa couplings. The latter choice of expressing the experimental likelihood in terms of model-dependent parameters such as β has however one technical drawback: the full statistical analysis has to be performed for each model one wants to investigate, e.g., the Standard Model, the Minimal Supersymmetric Standard Model, GUT models. . . In addition, building a statistical analysis directly on the initial likelihood requires one to deal with a very large parameter space, depending on the parameters in θ that are needed to describe the detector response. One common solution to these technical difficulties is a two-step approach. In the first step, the data are reduced to a set of model-and detector-independent 2 random variables that contains the same information as the original likelihood (to a good approximation): in our example the likelihood-based estimatorsĈ andŜ of the parameters C and S can play the role of such variables (estimators are functions of the data and thus are random variables). In a second step, one can work in a particular model, e.g., in the Standard Model, to useĈ andŜ as inputs to a statistical analysis of the parameter β. This two-step procedure gives the same result as if the analysis were done in a single step through the expression of the original likelihood in terms of β. This technique is usually chosen if the PDF g of the estimatorsĈ andŜ can be parameterized in a simple way: for example, if the sample size is sufficiently large, then the PDF can often be modelled by a multivariate normal distribution, where the covariance matrix is approximately independent of the mean vector.
Let us now extend the above discussion to a more general case. A sample of random events is {E i , i = 1 . . . n}, where each event corresponds to a set of directly measurable quantities (particle energies and momenta, interaction vertices, decay times. . . ). The distribution of these events is described by a PDF, the functional form f of which is supposed to be known. In addition to the event value E, the PDF value depends on some fixed parameters θ, hence the notation f (E; θ). The likelihood for the sample {E i } is defined by . We want to interpret the event observation in a given phenomenological scenario that predicts at least some of the parameters θ describing the PDF in terms of a set of more fundamental parameters χ.
To this aim we first reduce the event observation to a set of model-and detector-independent random variables X together with a PDF g(X; χ), in such a way that the information that one can get on χ from g is equivalent to the information one can get from f , once θ is expressed in terms of χ consistently with the phenomenological model of interest. Technically, it amounts to identifying a minimal set of variables x depending on θ that are independent of both the experimental context and the phenomenological model. One performs an analysis on the sample of events E i to derive estimatorsx for x. The distribution of these estimators can be described in terms of a PDF that is written in the χ parametrization as g(X; χ), where we have replacedx by the notation X, to stress that in the following X will be considered as a new random variable, setting aside how it has been constructed from the original data {E i }. Obviously, in our previous example for B 0 (t) → J/ψK S , {t i } correspond to {E i }, C and S to x, and β to χ.

Model fitting
From now on we work with one or more observable(s) x, with associated random variable X, and an associated PDF g(X; χ) depending on purely theoretical parameters χ. With a slight abuse of notation we include in the symbol g not only the functional form, but also all the needed parameters that are kept fixed and independent of χ. In particular for a one-dimensional Gaussian PDF we have where X is a potential value of the observable x and x(χ) corresponds to the theoretical prediction of x given χ. This PDF is obtained from the outcome of an experimental analysis yielding both a central value X 0 and an uncertainty σ, where σ is assumed to be independent of the realisation X 0 of the observable x and is thus included in the definition of g.  Figure 1: Illustration in the simple case where X is predicted as x(µ) = µ. Under the hypothesis µ t = µ, and having measured X = 0±1, one can determine the associated p-value p(0; µ) by examining the distribution of the quadratic test statistic T (X; µ) = (X − µ) 2 assuming X is distributed as a Gaussian random variable with central value 0 and width 1. The blue dashed line corresponds to the value of T associated with the hypothesis µ = −1.4, with a p-value obtained by considering the gray area. The red dotted line corresponds to the hypothesis µ = 2.5.
Our aim is to derive constraints on the parameters χ, from the measurement X 0 ± σ of the observable x. One very general way to perform this task is hypothesis testing, where one wants to quantify how much the data are compatible with the null hypothesis that the true value of χ, χ t , is equal to some fixed value χ: In order to interpret the observed data X 0 measured in a given experiment in light of the distribution of the observables X under the null hypothesis H χ , one defines a test statistic T (X; χ), that is a scalar function of the data X that measures whether the data are in favour or not of the null hypothesis.
We indicated the dependence of T on χ explicitly, i.e., the dependence on the null hypothesis H χ . The test statistic is generally a definite positive function chosen in a way that large values indicate that the data present evidence against the null hypothesis. By comparing the actual data value t = T (X 0 ; χ) with the sampling distribution of T = T (X; χ) under the null hypothesis, one is able to quantify the degree of agreement of the data with the null hypothesis.
Mathematically it amounts to defining a p-value. One calculates the probability to obtain a value for the test statistic at least as large as the one that was actually observed, assuming that the null hypothesis is true. This tail probability is used to define the p-value of the test for this particular observation where the PDF h of the test statistic is obtained from the PDF g of the data as which can be obtained easily from comparing the convolution of dT dX h(T ) = g(X) with a test function of T with the convolution of the r.h.s. of (5) with the same test function. A small value of the p-value means that T (X 0 ; χ) belongs to the "large" region, and thus provides evidence against the null hypothesis. This is illustrated for a simple example in Figs. 1 and 2. From its definition, one sees that 1 − p(X 0 ; χ) is nothing else but the cumulative distribution function of the PDF h where θ is the Heaviside function. This expression corresponds to the probability for the test statistic to be smaller than a given value T (X 0 ; χ). The p-value in Eq. (4) is defined as a function of X 0 and as such, is a random variable. Through the simple change of variable dp dT dP dp = dP dT , one obtains that the null distribution (that is, the distribution when the null hypothesis is true) of a p-value is uniform, i.e., the distribution of values of the p-value is flat between 0 and 1. This uniformity is a fundamental property of pvalues that is at the core of their various interpretations (hypothesis comparison, determination of confidence intervals. . . ) [1,2].
In the frequentist approach, one wants to design a procedure to decide whether to accept or reject the null hypothesis H χ , by avoiding as much as possible either incorrectly rejecting the null hypothesis (Type-I error) or incorrectly accepting it (Type-II error). The standard frequentist procedure consists in selecting a Type-I error α and determining a region of sample space that has the probability α of containing the data under the null hypothesis. If the data fall in this critical region, the hypothesis is rejected. This must be performed before data are known (in contrast to other interpretations, e.g, Fischer's approach of significance testing [1]). In the simplest case, the critical region is defined by a condition of the form T ≥ t α , where t α is a function of α only, which can be rephrased in terms of p-value as p ≤ α. The interest of the frequentist approach depends therefore on the ability to design p-values assessing the rate of Type-I error correctly (its understatement is clearly not desirable, but its overstatement yields often a reduction in the ability to determine the truth of an alternative hypothesis), as well as avoiding too large a Type-II error rate.
A major difficulty arises when the hypothesis to be tested is composite. In the case of numerical hypotheses like (3), one gets compositeness when one is only interested in a subset µ of the parameters χ. The remaining parameters are called nuisance parameters 3 and will be denoted by ν, thus χ = (µ, ν). In this case the hypothesis H µ : µ t = µ is composite, because determining the distribution of the observables requires the knowledge of the true value ν t in addition to µ. In this situation, one has to devise a procedure to infer a "p-value" for H µ out of p-values built for the simple hypotheses where both µ and ν are fixed. Therefore, in contrast to a simple hypothesis, a composite hypothesis does not allow one to compute the distribution of the data 4 .
At this stage, it is not necessarily guaranteed that the distribution of the p-value for H µ is uniform, and one may get different situations: which may depend on the value of α considered. Naturally, one would like to design as much as possible an exact p-value (exact coverage), or if this is not possible, a (reasonably) conservative one (overcoverage). Such p-values will be called "valid" p-values. In the case of composite hypotheses, the conservative or liberal nature of a p-value may depend not only on α, but also on the structure of the problem and of the procedure used to construct the p-value, and it has to be checked explicitly [1,2]. Once p-values are defined, one can build confidence intervals out of them by using the correspondence between acceptance regions of tests and confidence sets. Indeed, if we have an exact p-value, and the critical region C α (X) is defined as the region where p(X; µ) < α, the complement of this region turns out to be a confidence set of level 1 − α, i.e., P [µ / ∈ C α (X)] = 1 − α. This justifies the general use of plotting the p-value as a function of µ, and reading the 68% or 95% CL intervals by looking at the ranges where the p-value curve is above 0.32 or 0.05. This is illustrated for a simple example in Figs. 2 and 3. Once again, this discussion is affected by issues of compositeness and nuisance parameters, as well as the requirement of checking the coverage of the p-value used to define these confidence intervals: an overcovering p-value will yield too large confidence intervals, which will prove indeed conservative.
A few words about the notation and the vocabulary are in order at this stage. A p-value necessarily refers to a null hypothesis, and when the null hypothesis is purely numerical such as (3) we can consider the p-value as a mathematical function of the fundamental parameter µ. This of course does not imply that µ is a random variable (in frequentist statistics, it is always a fixed, but unknown, number). When the p-value as a function of µ can be described in a simple way by a few parameters, we will often use the notation µ = µ 0 ±σ µ . In this case, one can easily build the p-value and derive any desired confidence interval. Even though this notation is similar to the measurement of an observable, we stress that this does not mean that the fundamental parameter µ is a random variable, and it should not be seen as the definition of a PDF. In line with this discussion, we will call uncertainties the parameters like σ that can be given a frequentist meaning, e.g., they can be used to define the PDF of a random variable. On the other hand, we will call errors the intermediate quantities such as σ µ that can be used to describe the p-value of a fundamental parameter, but cannot be given a statistical meaning for this parameter.  Figure 3: A α-CL interval built from a p-value with exact coverage has a probability of α of containing the true value. This is illustrated in the simple case of a quantity X which has a true value µ t = 0 but is measured with an uncertainty σ = 1. Each time a measurement is performed, it will yield a different value for X 0 and thus a different p-value curve as a function of the hypothesis tested µ t = µ. From each measurement, a 68% CL interval can be determined by considering the part of the curve above the line p = 0.32, but this interval may or may not contain the true value µ t = 0.
The curves corresponding to the first case (second case) are indicated with 6 green solid lines (4 blue dotted lines). Asymptotically, if the p-value has exact coverage, 68% of these confidence intervals will contain the true value.

Likelihood-ratio test statistic
Here we consider test statistics that are constructed from the logarithm of the likelihood 5 More precisely, one uses tests based on the likelihood ratio in many instances. Its use is justified by the Neyman-Pearson lemma [1,2,27] showing that this test has appealing features in a binary model with only two alternatives for χ t , corresponding to the two simple hypotheses H χ 1 and H χ 2 . Indeed one can introduce the likelihood ratio L X (χ 1 )/L X (χ 2 ), define the critical region where this likelihood ratio is smaller than a given α, and decide that one rejects H χ 1 whenever the observation falls in this critical region. This test is the most powerful test that can be built [1,2], in the sense that among all the tests with a given Type-I error α (probability of rejecting H χ 1 when H χ 1 is true), the likelihood ratio test has the smallest Type-II error (probability of accepting H χ 1 when H χ 2 is true). These two conditions are the two main criteria to determine the performance of a test.
In the case of a composite hypothesis, there is no such clear-cut approach to choose the most powerful test. The Maximum Likelihood Ratio (MLR) is inspired by the Neynman-Pearson lemma, comparing the most plausible configuration under H µ with the most plausible one in general: Let us emphasise that even though T is constructed not to depend on the nuisance parameters ν explicitly, its distribution Eq. (5) a priori depends on them (through the PDF g). Even though the Neyman-Pearson lemma does not apply here, there is empirical evidence that this test is powerful, and in some cases it exhibits good asymptotic properties (easy computation and distribution independent of nuisance parameters) [1,2]. For the problems considered here, the MLR choice features alluring properties, and in the following we will use test statistics that are derived from this choice. First, if g(X; χ t ) is a multidimensional Gaussian function, then the quantity −2 ln L X (χ t ) is the sum of the squares of standard normal random variables, i.e., is distributed as a χ 2 with a number of degrees of freedom (N dof ) that is given by dim(X). Secondly, for linear models, in which the observables X depend linearly on the parameters χ t , the MLR Eq. (11) is again a sum of standard normal random variables, and is distributed as a χ 2 with N dof = dimension(µ). Wilks' theorem [28] states that this property can be extended to non-Gaussian cases in the asymptotic limit: under regularity conditions and when the sample size tends to infinity, the distribution of Eq. (11) will converge to the same χ 2 distribution depending only on the number of parameters tested.
The great virtue of the χ 2 -distribution is that it only depends on the number of degrees of freedom, which means in particular that the null-distribution of Eq. (11) is independent of the nuisance parameters ν, whenever the conditions of the Wilks' theorem apply. Furthermore the integral (4) can be computed straightforwardly in terms of complete and incomplete Γ functions: In practice the models we want to analyse, such as the Standard Model, predict non linear relations between the observables and the parameters. In this case one has to check whether Wilks' theorem applies, by considering whether the theoretical equations can be approximately linearized 6 .

Comparing approaches to theoretical uncertainties
We have argued before that an appealing test statistic is provided by the likelihood ratio Eq. (11) due to its properties in limit cases (linearised theory, asymptotic limit). These properties rely on the fact that the likelihood ratio can be built as a function of random variables described by measurements involving only statistical uncertainties. However, in flavour physics (as in many other fields in particle physics), there are not only statistical but also theoretical uncertainties. Indeed, as already indicated in the introduction, these phenomenological analyses combine experimental information and theoretical estimates. In the case of flavour physics, the latter come mainly from QCD-based calculations, which are dominated by theoretical uncertainties.
Unfortunately, the very notion of theoretical uncertainty is ill-defined as "anything that is not due to the intrinsic variability of data". Theoretical uncertainties (model uncertainty) are thus of a different nature with respect to statistical uncertainties (stochastic uncertainty, i.e. variability in the data), but they can only be modelled (except in the somewhat academic case where a bound on the difference between the exact value and the approximately computed one can be proven). The choice of a model for theoretical uncertainties involves not only the study of its mathematical properties and its physical implications in specific cases, but also some personal taste. One can indeed imagine several ways of modelling/treating theoretical uncertainties: • one can (contrarily to what has just been said) treat the theoretical uncertainty on the same footing as a statistical uncertainty; in this case, in order to follow a meaningful frequentist procedure, one has to assume that one lives in a world where the repeated calculation of a given quantity leads to a distribution of values around the exact one, with some variability that can be modelled as a PDF ("random-δ approach"), • one can consider that theoretical uncertainties can be modelled as external parameters, and perform a purely statistical analysis for each point in the theoretical uncertainty parameter space; this leads to an infinite collection of p-values that will have to be combined in some arbitrary way, following a model averaging procedure ("external-δ approach"), • one can take the theoretical uncertainties as fixed asymptotic biases 7 , treating them as nuisance parameters that have to be varied in a reasonable region ("nuisance-δ approach").
There are some desirable properties for a convincing treatment of theoretical uncertainties: • as general as possible, i.e., apply to as many "kinds" of theoretical uncertainties as possible (lattice uncertainties, scale uncertainties) and as many types of physical models as possible, • leading to meaningful confidence intervals, in reasonable limit cases: obviously, in the absence of theoretical uncertainties, one must recover the standard result; one may also consider the type of constraint obtained in the absence of statistical uncertainties, • exhibiting good coverage properties, as it benchmarks the quality of the statistical approach: the comparison of different models provides interesting information but does not shed light on their respective coverage, • associated with a statistically meaningful goodness-of-fit, • featuring reasonable asymptotic properties (large samples), • yielding the errors as a function of the estimates easily (error propagation), in particular by disentangling the impact of theoretical and statistical contributions, • leading to a reasonable procedure to average independent estimates -if possible, it should be equivalent for any analysis to include the independent estimates separately or the average alone (associativity). In addition, one may wonder whether the averaging procedure should be conservative or aggressive (i.e., the average of similar theoretical uncertainties should have a smaller uncertainty or not), and if the procedure should be stationary (the uncertainty of an average should be independent of the central values or not), • leading to reasonable results in the case of averages of inconsistent measurements.
Finally a technical requirement is the computing power needed to calculate the best fit point and confidence intervals for a large parameter space with a large number of constraints. Even though it should not be the sole argument in favour of a model, it should be kept in mind (a very complicated model for theoretical uncertainties would not be particularly interesting if it yields very close results to a much simpler one). We summarize some of the points mentioned above in Tab. 1. As it will be seen, it will however prove challenging to fulfill all these criteria at the same time, and we will have to make compromises along the way.
3 Illustration of the approaches in the one-dimensional case 3

.1 Situation of the problem
We will now discuss the three different approaches and some of their properties in the simplest case, i.e. with a single measurement (for an experimental quantity) or a single theoretical determination (for a theoretical quantity). Following a fairly conventional abuse of language, we will always refer to this piece of information as a "measurement" even though some modelling may be involved in its extraction through data reduction, as discussed in Sec. 1. The main, yet not alone, aim is to model/interpret/exploit a measurement like 8 X = X 0 ± σ (exp) ± ∆(th) (13) to extract information on the value of the associated fundamental parameter µ. Without theoretical uncertainty (∆ = 0), one would use this measurement to build a PDF PDF no th (X; µ) = N (µ,σ) (X) (14) yielding the MLR test statistic and one can build a p-value easily from Eq. (4) In the presence of a theoretical uncertainty ∆, the situation is more complicated, as there is no clear definition of what ∆ corresponds to. A possible first step is to introduce a theoretical uncertainty parameter δ that describes the shift of the approximate theoretical computation from the exact value, and that is taken to vary in a region that is defined by the value of ∆. This leads to the PDF PDF(X; µ) = N (µ+δ,σ) (X) (17) in such a way that in the limit of an infinite sample size (σ → 0), the measured value of X reduces to µ + δ. The challenge is to extract some information on µ, given the fact that the value of δ remains unknown.
The steps (to be spelt out below) to achieve this goal are: • Take a model corresponding to the interpretation of δ: random variable, external parameter, fixed bias as a nuisance parameter. . .
• Choose a test statistic T (X; µ) that is consistent with the model and that discriminates the null hypothesis: Rfit, quadratic, other . . .
• Compute, consistently with the model, the p-value that is in general a function of µ and δ • Eliminate the dependence with respect to δ by some well-defined procedure • Exploit the resulting p-value (coverage, confidence intervals, goodness-of-fit) Since we focus on Gaussian experimental uncertainties (the generalization to other shapes is formally straightforward but may be technically more complicated), for all approaches that we discuss in this note we take the following PDF PDF(X; µ) = N (µ+δ,σ) (X) (18) where, in the limit of an infinite sample size (σ → 0), µ can be interpreted as the exact value of the parameter of interest, and µ + δ the approximately theoretically computed one. The interpretation of δ will differ depending on the approach considered, which we will discuss now.

The random-δ approach
In the random-δ approach, δ would be related to the variability of theoretical computations, that one can model with some PDF for δ, such as N (0,∆) (normal) or U (−∆,+∆) (uniform). The natural candidate for the test statistic T (X; µ) is the MLR built from the PDF. One considers a model where X = s + δ is the sum of two random variables, s being distributed as a Gaussian of mean µ and width σ, and δ as an additional random variable with a distribution depending on ∆.
One may often consider for δ a variable normally distributed with a mean zero and a width ∆ (denoted naive Gaussian or "nG" in the following, corresponding to the most common procedure in the literature of particle physics phenomenology). The resulting PDF for X is then the convolution of two Gaussian PDFs, leading to PDF nG (X; µ) = N (µ, √ σ 2 +∆ 2 ) (X) (19) to which corresponds the usual quadratic test statistic (obtained from MLR) recovering the p value that would be obtained when the two uncertainties are added in quadrature We should stress that considering δ as a random variable corresponds to a rather strange frequentist world 9 , and there is no strong argument that would help to choose the associated PDF (for instance, δ could be a variable uniformly distributed over [−∆, ∆]). However for a general PDF, the p-value has no simple analytic formula and it must be computed numerically from Eq. (4). In the following, we will only consider the case of a Gaussian PDF when we discuss the random-δ approach.

The nuisance-δ approach
In the nuisance approach, δ is not interpreted as a random variable but as a fixed parameter so that in the limit of an infinite sample size, the estimator does not converge to the true value µ t , but to µ t + δ. The distinction between statistical and theoretical uncertainties is thus related to their effect as the sample size increases, statistical uncertainties decreasing while theoretical uncertainties remaining of the same size (see Refs. [29][30][31] for other illustrations in the context of particle physics). One works with the null hypothesis H µ : µ t = µ, and one has then to determine which test statistic is to be built.
In the frequentist approach, the choice of the test statistic is arbitrary as long as it models the null hypothesis correctly, i.e., the smaller the value of the test statistic, the better the agreement of the data with the hypothesis. A particularly simple possibility consists in the quadratic statistic already introduced earlier: where the minimum is not taken over a fixed range, but on the whole space. The great virtue of the quadratic shape is that in linear models it remains quadratic after minimization over any subset of parameters, in contrast with alternative, non-quadratic, test statistics. The PDF for X is normal, with mean µ + δ and variance σ 2 Although we choose test statistics for the random-δ and nuisance-δ of the same form, Eqs. (20) and (22), the different PDFs Eqs. (19) and (23) imply very different constructions for the p-values and the resulting statistical outcomes. Indeed, with this PDF for the nuisance-δ approach, T is distributed as a rescaled, non-central χ 2 distribution with a non-centrality parameter (δ/σ) 2 (this non-centrality parameter illustrates that the test statistic is centered around µ whereas the distribution of X is centered around µ + δ). δ is then a genuine asymptotic bias, implying inconsistency: in the limit of an infinite sample size, the estimator constructed from T is µ, whereas the true value is µ + δ.
Using the previous expressions, one can easily compute the cumulative distribution function of this test statistic which depends explicitly on δ but not on ∆ (as indicated before, even if T is built to be independent of nuisance parameters, its PDF depends on them a priori ).
To infer the p-value one can take the supremum value for δ over some interval Ω The interpretation is the following: if the (unknown) true value of δ belongs to Ω, then p Ω is a valid p-value for µ, from which one can infer confidence intervals for µ. This space cannot be the whole space (as one would get p = 1 trivially for all values of µ), but there is no natural candidate (i.e., coming from the derivation of the test statistic). More specifically, should the interval Ω be kept fixed or should it be rescaled when investigating confidence intervals at different levels (e.g. 68% vs 95%)?
• If one wants to keep it fixed, Ω r = r[−∆, ∆]: One may wonder what is the best choice for r, as the p-value gets very large if one works with the reasonable r = 3, while the choice r = 1 may appear as non-conservative. We will call this treatment the fixed r-nuisance approach.
• One can then wonder whether one would like to let Ω depend on the value considered for p.
In other words, if we are looking at a k σ range, we could consider the equivalent range for δ. This would correspond to where k σ (p) is the "number of sigma" corresponding to p where the function Prob has been defined in Eq. (12). We will call this treatment the adaptive nuisance approach. The correct interpretation of this p-value is: p is a valid p-value if the true (unknown) value of δ/∆ belongs to the "would be" 1 − p confidence interval around 0. This is not a standard coverage criterion: one can use adaptive coverage, and adaptively valid p-value, to name this new concept. Note that Eqs. (27)-(28) constitute a non-algebraic implicit equation, that has to be solved by numerical means.
Let us emphasise that the fixed interval is very close to the original 'Rfit' method of the CKMfitter group [15,16] in spirit, but not numerically, as will be shown below by an explicit comparison. In contrast the adaptive choice is more aggressive in the region of δ close to zero, but allows this parameter to take large values, provided one is interested in computing small p-values accordingly. In this sense, the adaptive approach provides a unified approach to deal with two different issues of importance, namely the metrology of parameters (at 1 or 2 σ) and exclusion tests (at 3 or 5 σ).

The external-δ approach
In this approach, the parameter δ is also considered as a fixed parameter. The idea behind this approach is very simple, and it is close to what experimentalists often do to estimate systematic effects: in a first step one considers that δ is a fixed constant, and one performs a standard, purely statistical analysis that leads to a p-value that explicitly depends on δ.
Note that this procedure actually corresponds to the simple null hypothesis H (δ) µ : µ t = µ + δ instead of H µ : µ t = µ, hence one gets an infinite collection of p-values instead of a single one related to the aimed constraint on µ.
Since δ is unknown one has to define a procedure to average all the p δ (µ) obtained. The simplest possibility is to take the envelope (i.e., the maximum) of p δ (µ) for δ in a definite interval (e.g. [−∆, +∆]), leading to: By analogy with the previous case, we will call this treatment the fixed r-external approach for δ ∈ Ω r . This is equivalent to the Rfit ansatz used by CKMfitter [15,16] in the one-dimensional case (but not in higher dimensions), proposed to treat theoretical uncertainties in a different way from statistical uncertainties, treating all values within [−∆, ∆] on an equal footing. We recall that the Rfit ansatz was obtained starting from a well test statistic, with a flat bottom with a width given by the theoretical error and parabolic walls given by statististical uncertainty. A related method, called the Scan method, has been developed in the context of flavour physics [32,33]. It is however slightly different from the case discussed here. First, the test statistic chosen is not the same, since the Scan method uses the likelihood rather than the likelihood ratio, i.e. it relies on the test statistic T = −2 log L(µ, ν) which is interpreted assuming that T follows a χ 2 -law with the corresponding number of degrees of freedom N , including both parameters of interest and nuisance parameters 11 . Then the 1 − α confidence region is then determined by varying nuisance parameters in given intervals (typically Ω 1 ), but accepting only points where T ≤ T c , where T c is a critical value so that P (T ≥ T c ; N |H 0 ) ≥ α (generally taken as α = 0.05). This latter condition acts as a test of compatibility between a given choice of nuisance parameters and the data.

Comparison of the methods in the one-dimensional case
In the following, we will discuss properties of the different approaches in the case of one dimension. More specifically, we will consider: • the random-δ approach with a Gaussian random variable, or naive Gaussian (nG), see Sec. 3.2, • the nuisance-δ approach with quadratic statistic and fixed range, or fixed nuisance, see Sec. 3.3, • the nuisance-δ approach with quadratic statistic and adaptive range, or adaptive nuisance, see Sec. 3.3, • the external-δ approach with quadratic statistic and fixed range, equivalent to the Rfit approach in one dimension, see Sec. 3.4.
Note that we will not consider other (non quadratic) statistics. Finally, we consider varying ∆/σ as an indication of the relative size of the experimental and theoretical uncertainties.

p-values and confidence intervals
We can follow the discussion of the previous section and plot the results for the p-values obtained from the various methods discussed above in Fig. 4, where we compare nG, Rfit, fixed nuisance and adaptive nuisance approaches. From these p-values, we can infer confidence intervals at a given significance level and a given value of ∆/σ, and determine the length of the (symmetric) confidence interval (see Tab. 2). We notice the following points: • by construction, nG always provides the same errors whatever the relative proportion of theoretical and statistical uncertainties, and all the approaches provide the same answer in the limit of no theoretical uncertainty ∆ = 0.
• by construction, for a given nσ confidence level, the interval provided by the adaptive nuisance approach is identical to the one obtained using the fixed nuisance approach with a [−n, n] interval. This explains why the adaptive nuisance approach yields identical results to the fixed 1-nuisance approach at 1 σ (and similarly for the fixed 3-nuisance approach at 3 σ). The corresponding curves cannot be distinguished on the upper and central panels of Fig. 5.
• the adaptive nuisance approach is numerically quite close to the nG method; the maximum difference occurs for ∆/σ = 1 (up to 40% larger error size for 5 σ intervals).
• the p-value from the fixed-nuisance approach has a very wide plateau if one works with the 'reasonable' range [−3∆, +3∆], while the choice of [−∆, +∆] might be considered as non conservative.
• the 1-external and fixed 1-nuisance approaches are close to each other and less conservative than the adaptive approach, which is expected, but also than nG, for confidence intervals at 3 or 5 σ when theory uncertainties dominate.
• when dominated by theoretical uncertainties (∆/σ large), all approaches provide 3 and 5 σ errors smaller than the nG approach, apart from the adaptive nuisance approach.

Significance thresholds
Another way of comparing methods consists in taking the value of µ for which the p-value corresponds to 1, 3, 5 σ (in significance scale) in a given method, and compute the corresponding p-values for the other methods. The results are gathered in Tabs. 3 and 4. Qualitatively, the comparison of significances can be seen from Fig. 4: if the size of the error is fixed, the different approaches quote different significances for this same error.
In agreement with the previous discussion, we see that fixed 1-nuisance and 1-external yield similar results for 3 and 5 σ, independently of the relative size of statistical and theoretical effects. Moreover, they are prompter to claim a tension than nG, the most conservative method in this respect being the adaptive nuisance approach.
As a physical illustration of this problem, we can consider the current situation for the anomalous magnetic moment of the muon, namely the difference between the experimental measurement and the theoretical computation in the Standard Model [34]:    is µ = 0). 12 The nG method yields 3.6 σ, the 1-external approach 3.8 σ, the 1-nuisance approach 4.0 σ, and the adaptive nuisance approach 2.7 σ. The overall pattern is similar to what can be seen from the above tables, with a significance of the discrepancy which depends on the model used for theoretical uncertainties.

Coverage properties
As indicated in Sec. 1.1.2, p-values are interesting objects if they cover exactly or slightly overcover in the domain where they should be used corresponding to a given significance, see Eqs. (7)- (9). If coverage can be ensured for a simple hypothesis [1,2], this property is far from trivial and should be checked explicitly in the case of composite hypotheses, where compositeness comes from nuisance parameters that can be related to theoretical uncertainties, or other parameters of the problem.
For all methods we study coverage properties in the standard way: one first fixes the true values of the parameters µ and δ (which are not assumed to be random variables), from which one generates a large sample of toy experiments X i . Then for each toy experiment one computes the p-value at the true value of µ. The shape of the distribution of p-values indicates over, exact or under coverage. More specifically, one can determine P (p ≥ 1 − α) for a CL of α: if it is larger (smaller) than α, the method overcovers (undercovers) for this particular CL, i.e. it is conservative (liberal). We emphasise that this property is a priori dependent on the chosen CL.
In order to compare the different situations, we take σ 2 + ∆ 2 = 1 for all methods, and compute for each method the coverage fraction (the number of times the confidence level interval includes the true value of the parameter being extracted) for various confidence levels and for various values of ∆/σ. Note that the coverage depends also on the true value of δ/∆ (the normalized bias   Figure 6: Distribution of p-value (for a fixed total number of events) for different true values δ/∆ and various relative sizes of statistical and theoretical uncertainties ∆/σ. The following approaches are shown: nG (dotted, red), Rfit or 1-external (dashed, black), fixed 1-nuisance (dotted-dashed, blue), adaptive nuisance (solid, green). Since the 1-external approach produces clusters of p = 1 p-values, the coverage values excluding these clusters are also shown, as well as the distribution of p-values (dotted-dotted-dashed, grey). Note that the behaviour of the 1-external p-value around p = 1 is smoothened by the graphical representation.   Fig. 6. We also indicate the distribution of p values obtained for the different methods.
One notices in particular that the 1-external approach has a cluster of values for p = 1, which is expected due to the presence of a plateau in the p-value. This behaviour makes the interpretation of the coverage more difficult, and as a comparison, we also include the results when we consider the same distribution with the p = 1 values removed. Indeed one could imagine a situation where reasonable coverage values could only be due to the p = 1 clustering, while other values of p would systematically undercover: such a behaviour would either yield no constraints or too liberal constraints on the parameters depending on the data.
The results are the following: • if Ω is fixed and does not contain the true value of δ/∆ ("unfortunate" case), both external-δ and nuisance-δ approaches lead to undercoverage; the size of the effect depends on the distance of δ/∆ with respect to Ω. This is also the case for nG.
• if Ω is fixed and contains the true value of δ/∆ ("fortunate" case), both the external-δ and nuisance-δ approaches overcover. This is also the case for nG.
• if Ω is adaptive, for a fixed true value of δ, a p-value becomes valid if it is sufficiently small so that the corresponding interval contains δ. Therefore, for the adaptive nuisance-δ approach, there is always a maximum value of CL above which all p-values are conservative; this maximum value is given by To interpret the pattern of coverage seen above in the external and nuisance approaches, note that one starts with a p-value that has exact coverage under the individual simple hypotheses when δ is fixed. Therefore, as long as the true value δ lies within the range over which one takes the supre-mum, this procedure yields a conservative envelope. This explains the overcoverage/undercoverage properties for the external-δ and nuisance-δ approaches given above.

Conclusions of the uni-dimensional case
It should be stressed that, by construction, all methods are conservative if the true value of the δ parameter satisfy the assumption that has been made for the computation of the p-value. Therefore coverage properties are not the only criterion to investigate in this situation in order to assess the methods: in particular one has to study the robustness of the p-value when the assumption set on the true value of δ is not true. The adaptive approach provides a means to deal with a priori unexpected true values of δ, provided one is interested in a small enough p-value, that is, a large enough significance effect. Other considerations (size of confidence intervals, significance thresholds) suggest that the adaptive approach provides an interesting and fairly conservative framework to deal with theoretical uncertainties. We are going to consider the different approaches in the more general multi-dimensional case, putting emphasis on the adaptive nuisance-δ approach and the quadratic test statistic.

Generalization to multi-dimensional cases
Up to here we only have discussed the simplest example of a single measurement X linearly related to a single model parameter µ. Obviously the general case is multi-dimensional, where we deal with several observables, depending on several underlying parameters, possibly in a non-linear way, with several measurements involving different sources of theoretical uncertainty. Typical situations correspond to averaging different measurements of the same quantity, and performing fits to extract confidence regions for fundamental parameters from the measurement of observables. In this section we will discuss the case of an arbitrary number of observables in a linear model with an arbitrary number of parameters, where we are particularly interested in a one-dimensional or two-dimensional subset of these parameters.

General formulae
We start by defining the following quadratic test statistic where X = (X i , i = 1, . . . , n) is the n-vector of measurements, x = (x i , i = 1, . . . , n) is the n-vector of model predictions for the X i that depends on χ = (χ j , j = 1, . . . , n χ ), the n χ -vector of model parameters,δ is the m-vector of (dimensionless) theoretical biases, W s is the (possibly non diagonal) n × n inverse of the statistical covariance matrix C s , W t is the inverse of the (possibly non diagonal) m × m theoretical correlation matrix C t , ∆ is the n × m-matrix of theoretical uncertainties ∆ iα , so that the reduced biasesδ α have a range of variation within [−1, 1] (this explains the notation with tildes for the reduced quantities rescaled to be dimensionless). After minimization over theδ α , T can be recast into the canonical form with 22 The definition ofW involves the inverse of matrices that can be singular. This may occur in particular in cases where the statistical uncertainties are negligible and some of the theoretical uncertainties are assumed to be 100% correlated. This requires us to define a generalised inverse, including singular cases, which is described in detail in App. A and corresponds to a variation of the approach presented in Ref. [5]. Ambiguities and simplifications that can occur in the definition of T are further discussed in App. C. In particular, one can reduce the test statistic to the case m = n with a diagonal ∆ matrix without losing information. In the case where both correlation/covariance matrices are regular, Eq.
This structure is reminiscent of the discussion of theoretical uncertainties as biases and the corresponding weights given in Ref. [29], but it extends it to the case where correlations yield singular matrices. We will focus here on the case where the model is linear, i.e., the predictions x i depend linearly on the parameters χ j : where a ik and b i are constants. We leave the phenomenologically important non-linear case and its approximate linearisation for a dedicated discussion in a separate paper [35]. Following the one-dimensional examples in the previous sections, we always assume that the measurements X i have Gaussian distributions for the statistical part. We will consider two main cases of interest in our field: averaging measurements and determining confidence intervals for several parameters.

Averaging measurements
We start by considering the averages of several measurements of a single quantity, each with both statistical and theoretical uncertainties, with possible correlations. We will focus mainly on the nuisance-δ approach, starting with two measurements before moving to other possibilities.

Averaging two measurements and the choice of a hypervolume
A first usual issue consists in the case of two uncorrelated measurements X 1 ±σ 1 ±∆ 1 and X 2 ±σ 2 ±∆ 2 that we want to combine. The procedure is well defined in the case of purely statistical uncertainties, but it depends obviously on the way theoretical uncertainties are treated. As discussed in Sec. 2, associativity is a particulary appealing property for such a problem as it allows one to replace a series of measurements by its average without loss of information.
Averaging two measurements amounts to combining them in the test statistic. The nuisance-δ approach, together with the quadratic statistic Eq. (34), in the absence of correlations yields: µ is a linear combination of Gaussian random variables, and is thus distributed according to a Gaussian p.d.f, with mean µ + δ µ and variance σ 2 Therefore, T − T min is distributed as a rescaled uni-dimensional non-central χ 2 distribution with non-centrality parameter (δ µ /σ µ ) 2 .
σ µ corresponds to the statistical part of the error on µ. δ 1 and δ 2 remain unknown by construction, and the combined theory error can only be obtained once a region of variation is chosen for the δ's (as a generalisation of the [-1,1] interval in the one-dimension case). If one maximises the p-value over a rectangle C (called "hypercube case" in the following, in reference to its multidimensional generalisation), δ µ varies in ∆ µ , with recovering the proposal in Ref. [29] for the treatment of systematic uncertainties. In this case, δ 1 and δ 2 are allowed to be varied separately, without introducing any relation in their values, and can assume both extremal values. On the other hand, if one performs the maximisation over a disk (referred to as the "hyperball case" for the same reasons as above) one has the range In this case, the values of δ 1 and δ 2 are somehow related, since they cannot both reach extremal values simultaneously. Each choice of volume provides an average with different properties. As discussed earlier, associativity is a very desirable property: one can average different observations of the same quantity prior to the full fit, since it gives the same result as keeping all individual inputs. The hyperball choice indeed fulfills associativity. On the other hand, the hypercube case does not: the combination of the inputs 1 and 2 yields the following test statistic: (w 1 + w 2 )(µ −μ) 2 , whereas the resulting combinationμ ± σ µ ± ∆ µ has the statistic (µ −μ) 2 /(σ 2 µ + ∆ 2 µ ). The two statistics are proportional and hence lead to the same p-value, but they are not equivalent when added to other terms in a larger combination.
A comment is also in order concerning the size of the uncertainties for the average. In the case of the hypercube, the resulting linear addition scheme is the only one where the average of different determinations of the same quantity cannot lead to a weighted theoretical uncertainty that is smaller than the smallest uncertainty among all determinations 13 . In the case of the hyperball, it may occur that the average of different determinations of the same quantity yields a weighted theoretical uncertainty smaller than the smallest uncertainty among all determinations.
Whatever the choice of the volume, a very important and alluring property of our approach is the clean separation between the statistical and theoretical contribution to the uncertainty on the parameter of interest. This is actually a general property that directly follows from the choice of a quadratic statistic, and in the linear case it allows one to perform global fits while keeping a clear distinction between various sources of uncertainty.

Averaging n measurements with biases in a hyperball
We will now consider here the problem of averaging n, possibly correlated, determinations of the same quantity, each individual determination coming with both a Gaussian statistical uncertainty, and a number of different sources of theoretical uncertainty. We focus first on the nuisance−δ approach, as it is possible to provide closed analytic expressions in this case. We will first discuss the variation of the biases over a hyperball, before discussing other approaches, which will be illustrated and compared with examples from flavour physics in Sec. 6.
We use the test statistic Eq. (34) for µ, with x(χ) simply replaced by µU , where U is the n-vector (1, . . . , 1). After minimization over theδ α , T can be recast into the canonical form The minimization of Eq. (44) over µ leads to an estimatorμ of the average in terms of the measurements X iμ that allows one to compute the statistical uncertainty σ µ in the following way The theoretical bias is given by δ µ = i,α w i ∆ iαδα . We would like to varyδ α in ranges required to infer the theoretical uncerainty, identifying the combination of biases that is uncorrelated. This is a well known problem of statistics, and it can be easily achieved in a linear manner by noticing that the relevant combination is ∆ TC t ∆, cf. Eq. (36), and by introducing the Cholesky decomposition for the theoretical correlation matrix C t = P · P T , with P a lower triangular matrix with diagonal positive entries. This yields the expression for the bias where (P −1δ ) β are uncorrelated biases. If the latter biases are varied over a hyperball, the biasesδ are varied over a hyperellipsoid elongated along the directions corresponding to strong correlations (see App. B for illustrations) and one gets Known (linear) statistical correlations between two measurements are straightforward to implement, by using the full covariance matrix in the test statistic Eq. (46). On the other hand, in the physical problems considered here (involving hadronic inputs from lattice QCD simulations), it often happens that two a priori independent calculations of the same quantity are statistically correlated, because they use the same (completely or partially) ensemble of gauge configurations. The correlation is not perfect of course, since usually different non linear actions are used to perform the computation. However the accurate calculation of the full covariance matrix is difficult, and in many cases it is not available in the literature. For definiteness, we will assume that if two lattice calculations are statistically correlated, then the (linear) correlation coefficient is one. In such a case the covariance matrix is singular, and its inverse W s is ill-defined, as well as all quantities that are defined above in terms of W s . A similar question arises for fully correlated theoretical uncertainties (coming from the same method), leading to ambiguities in the definition of W t . Details on these issues are given in Apps. A and B.
Statistical uncertainties are assumed here to be strictly Gaussian and hence symmetric (see App. D for more detail on the asymmetric case). In contrast, in the nuisance approach, a theoretical uncertainty that is modelled by a bias parameter δ may be asymmetric: that is, the region in which δ is varied may depend on the sign of δ, e.g., δ ∈ [−∆ − , +∆ + ] in one dimension with the fixed hypercube approach (∆ ± ≥ 0). In order to keep the stationarity property that follows from the quadratic statistic, we take the conservative choice ∆ = Max(∆ + , ∆ − ) in the definition Eq. (34). Let us emphasise that this symmetrisation of the test statistic is independent of the range in which δ is varied: if theoretical uncertainties are asymmetric, one computes Eqs. (46)- (48) to express the asymmetric combined uncertainties ∆ µ,± in terms of the ∆ iα,± .

Averages with other approaches
In Sec. 5.2.1, we indicated that other domains can be chosen in principle in order to perform the averages of measurements, for instance a hypercube rather than a hyperball. If we do not try to take into account theoretical correlations in the range of variation, it is quite easy to determine the result for ∆ reminiscent of the formulae derived in Ref. [29]. However, we encountered severe difficulties when trying to include theoretical correlations in the discussion. Similarly to the hyperball case, it would be interesting to consider a linear transformation P of the biases (for instance, the Cholesky decomposition of C t , but the discussion is more general), so that (P −1δ ) β are uncorrelated biases varied within a hypercube. This would lead toδ varied within a deformed hypercube, which corresponds to cutting the hypercube by a set of (δ i ,δ j ) hyperplanes. It can take a rather complicated convex polygonal shape that is not symmetric along the diagonal in the (δ i ,δ j ) plane, leading to the unpleasant feature that the order in which the measurements are considered in the average matters to define the range of variation of the biases (an illustration is given in App. B) 14 .
As indicated before, this discussion occurs for any linear transformation P and is not limited to the Cholesky decomposition. We have not been able to find other procedures that would avoid these difficulties while paralleling the hypercube case. In the following, we will thus use Eq. (49) even in the presence of theoretical correlations: therefore, the latter will be taken into account in the definition of T throughW , but not in the definition of the range of variations to compute the error ∆. We also notice that the problems that we encounter are somehow due to contradicting expectations concerning the hypercube approach. In Sec. 5.2.1, the hypercube corresponds to values of δ 1 and δ 2 left free to vary without relation among them (contrary to the hyperball case). It seems therefore difficult to introduce correlations in this case which was designed to avoid them initially. Our failure to introduce correlations in this case might be related to the fact that the hypercube is somehow designed to avoid such correlations from the start and cannot accomodate them easily.
In the case of the external-δ approach, the scan method leads to the same discussion as for the nuisance case, provided that one uses the following statistic: T = (X − µ − δ) 2 /(σ 2 + ∆ 2 ). This choice is different from Ref. [32] by the normalisation (σ 2 +∆ 2 rather than σ 2 ) in order to take into account of the importance of both uncertainties when combining measurements (damping measurements which are unprecise in one way or the other). As indicated in Sec. 3.4, the difference of normalisation of the test statistic does not affect the determination of the p-value in the uni-dimensional case, but it has an impact once several determinations are combined. The choice above corresponds to the usual one when ∆ is of statistical nature. It gives a reasonable balance when two or more inputs are combined, that all come with both statistical and theoretical uncertainties.
A similar discussion holds for the random-δ approach. However, if the combined errors σ µ and ∆ µ are the same between the nuisance-δ (with hyperball), the random-δ and the external-δ (with hyperball) approaches, we emphasise that the p-value for µ built from these errors is different and yields different uncertainties for a given confidence level for each approach, as discussed in Sec. 3.

Other approaches in the literature
There are other approaches available in the literature, often starting from the random-δ approach (i.e., modeling all uncertainties as random variables).
The Heavy Flavour Averaging Group [36] choose to perform the average including correlations. In the absence of knowledge on the correlation coefficient between uncertainties of two measurements (typically coming from the same method), they tune the correlation coefficient so that the resulting uncertainty is maximal (which is not ρ = 1 in the case where the correlated uncertainties have a different size and are combined assuming a statistical origin, see App. A.2). This choice is certainly the most conservative one when there is no knowledge concerning correlations.
The Flavour Lattice Averaging Group [37] follows the proposal in Ref. [38]: they build a covariance matrix where correlated sources of uncertainties are included with 100% correlation, and they perform the average by choosing weights w i that are not optimal but are well defined even in the presence of ρ = ±1 correlation coefficients. As discussed in App. A.2, our approach to singular covariance matrices is similar but more general and guarantees that we recover the weights advocated in Ref. [38] for averages of fully correlated measurements.
Finally, the PDG approach [34] combines all uncertainties in a single covariance matrix. In the case of inconsistent measurements, one may then obtain an average with an uncertainty that may be interpreted as 'too small' (notice however that the weighted uncertainty does not increase with the incompatibility of the measurements). This problem occurs quite often in particle physics and cannot be solved by purely statistical considerations (even in the absence of theoretical uncertainties). If the model is assumed to be correct, one may invoke an underestimation of the uncertainties. A (commonly used) recipe in the pure statistical case has been adopted by the Particle Data Group, which consists in computing a factor S = χ 2 /(N dof − 1) and rescaling all uncertainties by this factor. A drawback of this approach is the lack of associativity: the inconsistency is either removed or kept as it is, depending on whether the average is performed before any further analysis, or inside a global fit. Furthermore since the ultimate goal of statistical analyses is indeed to exclude the null hypothesis (e.g. the Standard Model), it looks counter-intuitive to first wash out possible discrepancies by an ad hoc procedure. Therefore we refrain to define a S factor in presence of theoretical uncertainties, and leave the discussion of discrepancies between independent determinations of the same quantity to a case-by-case basis, based on physical (and not statistical) grounds.
In the case of the Rfit approach adopted by the CKMfitter group [15,16], a specific recipe was chosen to avoid underestimating combined uncertainties in the case of marginally compatible values. The idea is first combine the statistical uncertainties by combining the likelihoods restricted to their statistical part, then assign to this combination the smallest of the individual theoretical uncertainties. This is justified by the following two points: the present state of the art is assumed not to allow one to reach a better theoretical accuracy than the best of all estimates, and this best estimate should not be penalized by less precise methods. In contrast with the plain (or naive) Rfit approach for averages (consisting in just combining Rfit likelihoods without further treatment), this method of combining uncertainties was called educated Rfit and is used by the CKMfitter group for averages [17,19,22]. Let us note finally that the calculation of pull values, discussed in Sec. 5.3, is a crucial step for assessing the size of discrepancies.

Estimators and errors
Another prominent example of multi-dimensional problem is the extraction of a constraint on a particular parameter of the model from the measured observables. If the model is linear, Eq. (38), the discussion follows closely that of Sec. 5.2.2. In the case where there is a single parameter of interest µ, we do not write explicitly the calculations and refer to Sec. 6 for numerical examples.
We start from the test statistic Eq. (34) in the linear case defined in Eq. (38), reducing the number of theoretical biases to the case m = n as indicated in App. C. Following the same discussion as in Sec. 5.2.2, we can minimise with respect toδ α , leading to the canonical form 27 The minimum of this function is found at the pointχ k where ∂T ∂χ q χ=χ = 0,χ = (a TW a) −1 · (a TW (X − b)) (51) so that we haveχ The minimumχ q is thus linearly related to the measured observables X i and their statistical properties are closely related. The test statistic for a particular parameter µ = χ q will lead to T (X; µ) = (µ −χ q ) 2 × (a TW a) qq , so that the discussion of the p-value for µ follows exactly the discussion for uni-dimensional measurements 15 . For instance, if the observables X i have central values X i0 and variances σ 2 X i , the central value and the variance forχ q (corresponding also to the central value and statistical uncertainty for the p-value for µ = χ q ), can readily be obtained from Similarly to what was presented in the previous section, the theoretical uncertainty on µ = χ q is obtained in the hyperball case as It remains to determine how to define the theoretical correlation in this framework, denoted κ qr corresponding to the actual parameters of interest. This can be seen as trying to infer a scalar product on the vectors [w (q) ∆P ] i from the knowledge of a norm, here L 2 . We will thus define the theoretical correlation in the following way In Sec. 5.2.2 we encountered difficulties in extending the discussion to the hypercube case. We can define errors varying the biases without correlations in the definition of the hypercube but we could not determine a way of defining this hypercube taking into account theoretical correlations. Moreover, there is no obvious way to extend the definition of theoretical correlation for the hypercube in a similar way to Eq. (56), as there is no scalar product associated to the L 1 -norm. We will thus not quote theoretical correlations for the hypercube case.

Goodness-of-fit
We would like also to compute the distribution of T min in presence of biases and extract a goodnessof-fit value. Coming back to the initial problem, we see that T min can be written as where X are distributed following a multivariate normal distribution, with central value aχ + b + ∆δ and correlation matrix C s . The CDF Hδ(t) for T min at fixedδ can thus be rephrased in the following way: considering a vector Y distributed according to a multivariate normal distribution of covariance C s centred around 0, We are able to reexpress this problem as a linear combination of non-central χ 2 distributions. Indeed, we can define with L lower triangular (using the Cholesky decomposition), α is diagonal and K orthogonal (so that α are the (positive) eigenvalues of L T M L and thus of M C s ). Let us note that α does depend only on C s and C t , whereas the dependence on the true value of χ andδ is only present in β. The problem is then equivalent to considering a vector Z distributed according to a multivariate normal distribution of covariance identity centred around 0, and computing P [ . This is the CDF of a linear combination of the form i α i X 2 i corresponding to non-central χ 2 distributions. In the case where α is proportional to identity, the CDF can be expressed in terms of the generalised Marcum Q-function with the non-centrality parameter λ = i β 2 i . In the general case, the answer can be found in various articles, for instance in Ref. [39], as a linear combination of infinitely many (central or non-central) χ 2 distribution functions, and in Ref. [40], where an expansion in terms of Laguerre polynomials is provided for a fast numerical evaluation. We can thus infer the corresponding p value as whereδ has to be varied in a hyperball or a hypercube depending on the volume chosen, and χ q are replaced by their estimated values µ χq .

Pull parameters
In addition to the general indication given by goodness-of-fit indicators, it is useful to determine the agreement between individual measurements and the model. One way of quantifying this agreement consists in determining the pull of each quantity. Indeed, the agreement between the indirect fit prediction and the direct determination of some observable X is measured by its pull, which can be determined by considering the difference of minimum values of the test statistic including or not the observables [22]. In the absence of non-Gaussian effects or correlations, the pulls are random variables of vanishing mean and unit variance. The pull of an observable X m can be conveniently computed by introducing an additional pull parameter p Xm in the test statistic T (X 0 ; χ, p Xm ) The pull parameter p Xm is a dimensionless fit parameter for which one can compute confidence intervals, or errors and uncertainties. Its best-fit value is a random variable that measures the distance of the indirect prediction (determined by the global fit) from the direct measurement, in units of σ.
The p-value for the null hypothesis p Xm = 0 is by definition the pull for X i . It can be understood as a comparison of the best-fit value of the test statistic reached letting p Xa free (corresponding to a global fit without the measurement X m ) with the case setting p Xm = 0 (corresponding to a global fit including the measurement X m ).
As far as the test statistic is concerned, the pull parameter can be treated on the same footing as the parameters χ, and it can be determined in the same way as in the previous section, first solving the minimisation condition ∂T /∂p Xm = 0, and plugging the result for p Xm into T , leading to the same expression for T as in Eq. (50), but withW replaced by the matrix which can be solved as before forχ, leading to the expression forp Xm If the statistical method allows one to separate the statistical and theoretical contributions to the error on p X i , one can report the values for the errors ∆ p X i and σ p X i in addition to the pull itself: this gives an indication of how independent from theoretical uncertainties the underlying tested hypothesis is. One can also extend this notion for N parameters, introducing N distinct pull parameters and determining the p-value for the null hypothesis where all pull parameters vanish simultaneously.
As an illustration in a simple case, one can compute the pulls associated with the average of n measurements, introducing the modified test statistic compared to Eq. (44):

Conclusions of the multi-dimensional case
We have discussed several situations where a multi-dimensional approach is needed in phenomenology analysis. In addition to the issues already encountered in one dimension, a further arbitrary choice must be performed in the multi-dimensional case for nuisance and external approaches concerning the shape of the volume in which the biases are varied: two simple cases are given by the hypercube and the hyperball, corresponding respectively to the well-known linear and quadratic combination of uncertainties. We have then discussed how to average two (or several) measurements, emphasising the case of the nuisance approach. We have finally illustrated how a fit could be performed in order to determine confidence regions. Beyond the metrology of the model, we can also determine the agreement between model and experiments thanks to the pull parameters associated with each observable.
The uni-dimensional case (stationarity of the quadratic test statistic under minimisation, coverage properties) has led us to prefer the adaptive nuisance approach, even though the fixed nuisance approach could also be considered. In the multidimensional case, the hyperball in conjunction with the quadratic test statistic allows us to keep associativity when performing averages, so that it is  Bottom: Pulls associated to each measurement for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form p ± σ ± ∆ (significance of the pull).
rigourously equivalent from the statistical point of view to keep several measurements of a given observable or to average them in a single value. We have also been able to discuss theoretical correlations using the hyperball case at two different stages: including the correlations among observables in the domain of variations of the biases when computing the errors ∆, and providing a meaningful definition for the theoretical correlation among parameters of the fit. We have not found a way to keep these properties in the case of the hypercube. Moreover, choosing the hypercube may favour best-fit configurations where all the biases are at the border of their allowed regions, whereas the hyperball prevents such 'fine-tuned' solutions from occurring.
For comparison, in the following we will focus on two nuisance approaches: fixed 1-hypercube and adaptive hyperball with a preference for the latter. The other combinations would yield far too conservative (adaptive hypercube) or too liberal (fixed 1-hyperball) ranges of variations for the biases.

CKM-related examples
We will now consider the differences between the various approaches considered using several examples from quark flavour physics. These examples will be only for illustrative purposes, and we refer the reader to other works [15,16,22,35] for a more thorough discussion of the physics and the inputs involved. From the previous discussion, we could consider a large set of approaches for theoretical uncertainties.
We will restrict to a few cases compared to the previous sections. First, we will consider educated Rfit (Rfit with specific treatment of uncertainties for averages), as used by the CKMfitter analyses and described in Sec. 5.2.4, while the naive Rfit approach will only be shown for the sake of comparison and is not understood as an appropriate model. We will also consider two nuisance approaches,

Averaging theory-dominated measurements
We start by illustrating the case of measurements dominated by theoretical uncertainties, which is the case for the lattice determinations. We take the case of B K , which is needed to discuss KK mixing, and has been the subject of important debates concerning its agreement (or not) with the rest of the global fit. We have selected a particular list of lattice determinations given in Tab. 6 (top). For each measurement, we have kept the various theoretical uncertainties separate, since their combination (linear or quadratic) depends on the method used. For purposes of illustration, we perform an average over measurements performed with different lattice gauge actions, we symmetrise the results having asymmetric uncertainties 16 and we neglect all correlations. We stress that this is done only for purposes of illustration, and that an extended list of lattice QCD results with asymmetric uncertainties and correlations will be taken into account in forthcoming phenomenological applications [35].
The results for each method are given in Tab. 6 (middle). The first column corresponds to the outcome of the averaging procedure. In all the approaches considered, we can split statistical and theoretical uncertainties. In the case of naive Rfit, one combines the measurements by adding the well statistic corresponding to each measurement: the resulting test statistic T is a well with a bottom, the width of which can be interpreted as a theoretical uncertainty, whereas the width at T min + 1  determines the statistical uncertainty 17 . The case of educated Rfit was described in Sec. 5.2.4. The confidence intervals are obtained from the p-value determined from the "average" column.
We compute the pulls in the same way in both cases, interpreting the difference of T min with and without the observables as a random variable distributed according to a χ 2 law with N dof = 1. The propagation of uncertainties for the quadratic statistic was detailed in Secs. 5.2.1 and 5.2.2 where the separate extraction of statistical and theoretical uncertainties was described. The tables are obtained by plugging the average into the 1-dimensional p-value associated with the method, and reading from the p-value the corresponding confidence interval at the chosen significance. The associated pulls are given in Tab. 6 (bottom).
We present the same analysis in the case of the D s -meson decay constant f Ds in Tab. 7 (with the same caveat concerning the selected inputs, asymmetries and correlations), while graphical comparisons of the different averages in both cases can be seen at 1σ in Fig. 7 (a similar plot at 3σ is given in Fig. 12 in App. E).
For both quantities B K and f Ds at large confidence level (3 σ and above), the most conservative method is the adaptive hyperball nuisance approach, whereas the one leading to the smallest uncertainties is the educated Rfit approach. Below 3 σ, the 1-hypercube approach is more conservative than the adaptive hyperball nuisance approach, and it becomes less conservative above that threshold. The most important differences are observed at large CL/significance. The statistical uncertainty obtained in the nG approach is by construction identical to the combination in quadrature of the statistical and theoretical uncertainties obtained in the adaptive hyperball approach. However, one can notice that the confidence intervals for high significances in the two approaches are different, with nG being less conservative. The overall very good agreement of lattice determinations means [52] 0.1224 ± 0.9 ± 0.9 ± 1.2 ± 3.5 OPAL-j & s [53] 0.1189 ± 0.8 ± 1.6 ± 1.0 ± 3.6 JADE-j & s [54] 0.1172 ± 0.6 ± 2.0 ± 3.5 ± 3.0 Dissertori-3j [55] Table 8: Top: Determinations of α S (M Z ) using e + e − annihilation, taken from Ref. [34]. Middle: Averages for α S (M Z ) from e + e − annihilation according to the various methods, and corresponding confidence intervals for various significances. Bottom: Pull associated to each measurement for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form p ± σ ± ∆.
vanishing pulls for Rfit methods (since all the wells have a common bottom with a vanishing T min ). For the other methods, the pull parameter has statistical and theoretical errors of similar size in the adaptive hyperball case, whereas theoretical errors tend to dominate in the 1-hypercube method. This yields smaller pulls in the latter approach. A last illustration, which does not come solely from lattice simulations, is provided by the determination the strong coupling constant α S (M Z ). The subject is covered extensively by recent reviews [34,51], and we stress that we do not claim to provide an accurate alternative average to these reviews which requires a careful assessment of the various determinations and their correlations. As a purely illustrative example, we will focus on the average of determinations from e + e − annihilation under a set of simplistic hypotheses for the separation between statistical and theoretical uncertainties. In order to allow for a closer comparison with Refs. [34,62], we try to assess correlations this time. We assume that theoretical uncertainties for the same set of observables (j&s, 3j, T ), but from different experiments, are 100% correlated, and the statistical uncertainties for determinations from similar experimental data are 100% correlated (BS-T, DW-T, AFHMS-T) 18 We perform the average in the different cases considered, see Tab. 8 (middle), which are represented graphically in Fig. 8 (a similar plot at 3σ is given in Fig. 13 in App. E). We notice that the various approaches yield results with similar central values to the nG case. The pulls for individual quantities are mostly around 1 σ, and they are smaller in the adaptive hyperball approach compared to the nG one, showing better consistency. Refs. [34,62] take a different approach, "range averaging", which amounts to considering the spread of the central values for the various determinations, leading to α S (M Z ) = 0.1174±0.0051 for the determination from e + e − annihilation data considered here [62]. This approach is motivated in Ref. [34] by the complicated pattern of correlations and the limited compatibility between some of the inputs and, more importantly, it does not take into account that the different determinations have different accuracies according to the uncertainties quoted. The approach in Refs. [34,62] conservatively accounts for the possibility that some uncertainties are underestimated. On the contrary, our averages given in Tab. 8 and Fig. 8 assume that all the inputs should be taken into account and averaged according to the uncertainties given in the original articles. The difference in the underlying hypotheses for the averages explain the large difference observed between our results and the ones in Refs. [34,62]. Note however that our numerics directly follow from the use of the different averaging methods, and lack the necessary critical assessment of the individual determinations of α S (m Z ) performed in Refs. [34,62]. on the following considerations. Ref. [60] discusses the sources of uncertainties (scales, function parameters, b-quark mass) within a fit leading to uncertainties assumed to be of statistical nature, with a further systematic uncertainty coming from the difference between the two different schemes. The systematic uncertainties in Ref. [57] are assumed to be of statistical nature in the absence of any opposite statement. For the first two classes (j & s and 3j) hadronisation is taken into account by Monte Carlo methods, while for the last two classes (T and C) analytic analyses are made: in the former (latter) case, the hadronic uncertainties are treated as statistical (theoretical).  Table 9: Top: Determinations of |V ub | · 10 3 from semileptonic decays. Middle: Averages according to the various methods, and corresponding confidence intervals for various significances. Bottom: Pulls associated to each determination for each method. For Rfit methods, we quote only the significance of the pull, whereas other methods yield the pull parameter as well as the pull itself under the form p ± σ ± ∆ (significance of the pull).

Averaging incompatible or barely compatible measurements
Another important issue occurs when one wants to combine barely compatible measurements. This is for instance the case for |V ub | and |V cb | from semileptonic decays, where inclusive and exclusive determinations are not in very good agreement. The list of determinations used for illustrative purposes and the results for each method are given in Tabs. 9 and 10, together with the corresponding graphical comparisons in Fig. 9 (a similar plot at 3σ is given in Fig. 14 in App. E). Our inputs are slightly different from Ref. [36] for several reasons. The inclusive determination of |V ub | corresponds to the BLNP approach [64], and we consider the theoretical uncertainties from shape functions (leading and subleading), weak annihilation, and heavy-quark expansion uncertainties on matching and m b . We use only branching fractions measured for B → π ν and average the unquenched lattice calculations quoted in Ref. [36]. For |V cb | exclusive we also split the various sources of theoretical uncertainties coming from the determination of the form factors. We assume that there are no correlations among all these uncertainties. The lack of compatibility between the two types of determination means in particular that the naive Rfit combined likelihood has not flat bottom, and thus no theoretical uncertainty. This behaviour was one of the reasons to propose the educated Rfit approach, where the theoretical uncertainty of the combination cannot be smaller than any of the individual measurements.
The same pattern of conservative and aggressive approaches can be observed, with a fairly good agreement at 3 σ level (apart from the naive Rfit approach, already discussed). At 5 σ, the adaptive hyperball proves again rather conservative, even though the theoretical error of the averages are smaller than the 1-hypercube nuisance and the educated Rfit approaches. The analysis of the pulls yields similar conclusions, with discrepancies at the 2 σ for |V ub | and between 2 and 3 σ for |V cb |. Once again, theoretical errors for the pull parameters are larger in the 1-hypercube approach than in the adaptive hyperball case. Let us also notice that in both cases, there are only two quantities to combine, so that the two pull parameters are by construction opposite to each other up to an irrelevant scaling factor, leading to the same pull for both quantities.

Averaging quantities dominated by different types of uncertainties
In order to illustrate the role played by statistical and theoretical uncertainties, we consider the question of averaging quantities dominated by one or the other. This happens for instance when one wants to compare a theoretically clean determination with other determination potentially affected by large theoretical uncertainties. This situation occurs in flavour physics for instance when one compares the extraction of sin(2β) from time-dependent asymmetries in b → ccs and b → qqs decays (let us recall that for the CKM global fit, only charmonium input is used for sin(2β)). The first have a very small penguin pollution, which we will neglect, whereas the latter is significantly affected by such a pollution. The corresponding estimates of sin(2β) have large theoretical uncertainties, and for illustration we use the computation done in Ref. [63]. The results are collected in Tab. 11, which were computed neglecting all possible correlations between the different extractions. One can see that the resulting theoretical uncertainty from the combination of the various inputs remains small, so that most of the approaches yield a very similar result for the confidence intervals. The corresponding pulls show a global consistency concerning the observables that deviate by 1σ.  Fig. 7 for the legend.

Global fits
In order to illustrate the impact of the treatment of theoretical uncertainties, we consider a global fit including mainly observables that come with a theoretical uncertainty. The list of observables is given in Tab. 14. Their values are motivated by the CKMfitter inputs used in Summer 2014, but they are used only for purposes of illustration 19 . We consider two fits: Scenario A involves only constraints dominated by theoretical uncertainties whereas Scenario B includes also constraints from the angles (statistically dominated). As far as the CKM matrix elements are concerned the Standard Model is linear but it is not linear in all the other fundamental parameters of the Standard Model. For the illustrative purposes of this note, the first step thus consists in determining the minimum of the full (non-linear) χ 2 , and to linearise the Standard Model formulae for the various observables around this minimum (we choose the inputs of scenario B to determine this point): this define an exactly linear model, which at this stage should not be used for realistic phenomenology but is useful for the comparison of the methods presented here. One can use the results presented in the previous section in order to determine the p-value as a function of each of the parameters of interest. In the case of the nuisance-δ approach, we can describe this p-value using the same parameters as before, namely a central value, a statistical error and a theoretical error.
We provide the results for the 4 CKM parameters in both scenarios in Tabs. 12 and 13 (using the same linearised theory described above). We also indicate the profiles of the p-values. As before, we observe that the methods give similar results at the 2-3 σ level, although the adaptive hyperball method tends to be more conservative than the others.

Conclusion
A problem often encountered in particle physics consists in analysing data within the Standard Model (or some of its extensions) in order to extract information on the fundamental parameters of the model. An essential role is played here by uncertainties, which can be classified in two categories, statistical and theoretical. If the former can be treated in a rigorous manner within a given statistical framework, the latter must be described through models. The problem is particularly acute in flavour physics, as theoretical uncertainties often play a central role in the determination of underlying parameters, such as the four parameters describing the CKM matrix in the Standard Model.
This article aims at describing and comparing several approaches that can be implemented in a frequentist framework. After recalling some elements of frequentist analysis, we have discussed three different approaches for theoretical uncertainties: the random-δ approach treats theoretical uncertainties as random variables, the external-δ approach considers them as external parameters leading to an infinity of p-values to be combined through model averaging, the nuisance-δ describes them through fixed biases which have to be varied over a reasonable region. These approaches have to be combined with particular choices for the test statistic used to compute the p-value. We have illustrated these approaches in the one-dimensional case, recovering the Rfit model used by CKMfitter as a particular case of the external-δ approach, and discussing the interesting alternative of a quadratic test statistic.
In the case of the nuisance-δ approach, one has to decide over which range the bias parameter should be varied. It is possible to compute the p-value by taking the supremum of the bias over a fixed range fixed by the size of the theoretical uncertainty to be modeled (fixed nuisance approach). An alluring alternative consists in adjusting the size of the range to the confidence level chosen: the range for a low confidence level can be obtained by varying the bias parameter in a small range, whereas a range for a high confidence level could require a more conservative (and thus larger) range for the bias parameter. We have designed such a scheme, called adaptive nuisance approach. It provides a unified statistical approach to deal with the metrology of the parameters (for low CL ranges) and the exclusion of models (for high CL ranges).
We have determined the p-values associated with each approach for a measurement involving both statistical and theoretical uncertainties. We have also studied the size of error bars, the significance of deviations and the coverage properties. In general, the most conservative approaches correspond to a naive Gaussian treatment (belonging to the random-δ approach) and the adaptive nuisance approach. The latter is better defined and more conservative than the former in the case where statistical and theoretical approaches are of similar size. Other approaches (fixed nuisance, external) turn out less conservative at large confidence level.
We have then considered extensions to multi-dimensional cases, focusing on the linear case where the quantity of interest is a linear combination of observables. Due to the presence of several bias parameters, one has to make another choice concerning the shape of the space over which the bias parameters are varied. Two simple examples are the hypercube and the hyperball, leading to a linear or quadratic combination of theoretical uncertainties respectively. The hypercube is more conservative, as it allows for sets of values of the bias parameters that cannot be reached within the hyperball. On the other hand, the hyperball has the great virtue of associativity, so that one can average different measurements of the same quantity or put all of them in a global fit, without changing its outcome. It also allows us to include theoretical correlations easily, both in the range of variation of biases to determine errors and in the definition of theoretical correlations for the outcome of a fit. We have discussed the average of several measurements using the various approaches, including correlations. We considered in detail the case of 100% correlations leading to a noninvertible covariance matrix. We also discussed global fits and pulls in a linearised context. We have then provided several comparisons between the different approaches using examples from flavour physics: averaging theory-dominated measurements, averaging incompatible measurements linear fits to a subset of flavour inputs.
It is now time to determine which choice seems preferable in our case. Random-δ has no strong statistical basis: its only advantage consists in its simplicity. External-δ is closer in spirit to the determination of systematics as performed by experimentalists, but it starts with an inappropriate null hypothesis and tries to combine an infinite set of p-values in a single p-value. On the contrary, the nuisance-δ approach starts from the beginning with the correct null hypothesis and deals with a single p-value.
This choice is independent from another choice, i.e., the range of variation for the parameter δ. Indeed, when several bias parameters are involved, one may imagine different multidimensional spaces for their variations, in particular the hyperball and the hypercube. As said earlier, the hyperball has the interesting property of associativity when performing averages and avoids fine-tuned solutions where all parameters are pushed in a corner of phase space. The hypercube is closer in spirit to the Rfit model (even though the latter is not a bias model), but it cannot avoid fine-tuned situations and it does not seem well suited to deal with theoretical correlations, since it is designed from the start to avoid such correlations.
A third choice consists in determining whether one wants to keep the volume of variation fixed (fixed approach), or to modify it depending on the desired confidence level (adaptive approach). Adaptive hypercube is in principle the most conservative choice but in practice, it gives too large errors, whereas fixed hyperball would give very small errors. Fixed hypercube is more conservative at low confidence levels (large p-values), whereas adaptive hyperball is more conservative at large confidence levels (small p-values).
This overall discussion leads us to consider the nuisance approach with adaptive hyperball as a promising approach to deal with flavour physics problems, which we will investigate in more phenomenological analyses in forthcoming publications [35].
We have to choose a generalised inverse C + s . We cannot rely on arguments based on the case where C s is invertible (for instance taking a correlation 0 < ρ < 1, followed by the limit ρ → 1) since this limit is singular. We can start by constraining the structure of C + s due to the particular structure of C s . We have where Σ is a diagonal matrix with uncertainties as entries {σ 1 , . . . σ n }, Γ the correlation matrix with entries between -1 and 1 (and diagonal entries equal to 1), R is an orthogonal matrix, and D is a diagonal matrix with entries in decreasing order The entries of D are positive since C s is assumed to be positive, with A generalised inverse for C s can be expressed in terms of a generalised inverse for D, if we define Indeed a generalised inverse for C s obeys C s C + s C s = C s , which is equivalent to the condition where d is the m × m diagonal matrix with entries d i , A is an m × (n − m) arbitrary matrix and B is an (n − m) × (n − m) arbitrary matrix. A and B can only depend on d 1 , . . . d m , and each choice of A and B correspond to an admissible generalised inverse. Under these conditions, we find for the weights and the variance

A.2 Choice of a generalised inverse
The most common generalised inverse is the Moore-Penrose pseudoinverse, obtained by adding three other conditions on C + s on top of the definition of a generalised inverse. The condition C + s C s C + s = C + s (reflexive generalised inverse) would translate as D + .D.D + = D + leading to the condition B = A T .d.A in Eq. (73), whereas the two other conditions for the Moore-Penrose inverse of C + s do not translate easily on D + . Unfortunately, we will see in explicit examples that this pseudoinverse gives more weight to measurements with a poor accuracy, and is thus not appropriate in our case.
An alluring alternative to obey Eq. (73) consists in considering A = 0 and B = λ × 1 (n−m)×(n−m) proportional to the identity, with λ a real number to be fixed. In this case, the weights read Let us assume that σ a becomes much smaller than the other σ i , the weights are dominated by Since the first (normalisation) factor is the same for all the inputs, the dominant weight will be w a , under the condition that which is a condition fulfilled for 0 < λ ≤ 1/d 1 . We see that the family of generalised inverses thus defined 20 has the following properties • they can be computed in a very simple way • for 0 < λ ≤ 1/d 1 , if a determination is much more precise than the others, it will dominate the average For λ = 1/d 1 , we call C + s the λ-inverse of C s . For λ = 0, we recover the Moore-Penrose pseudoinverse for D, and call this generalised inverse the 0-inverse of C s . As said earlier, one could also consider the possibility of taking the Moore-Penrose pseudoinverse of C s directly. We will illustrate these three possibilities with a few simple examples.

A.3.1 Two measurements
In the case of two uncorrelated measurements, there is no problem with inversion, and we get for all methods For partially correlated measurements (|ρ| < 1), the same inversion can be performed, leading to 20 The definition of C + s can be extended for an arbitrary matrix C in the following way. Σ is defined as the diagonal matrix with entries { |C11|, . . . |CNN |} (if a diagonal entry is 0, one defines Σ with 1 in the corresponding entry). The matrix Γ = Σ −1 .C.Σ −1 can be written according to a singular value decomposition Γ = R.D.S with two rotation matrices R and S. Once the generalised inverse D + is defined, the corresponding generalised inverse of C is defined as C + = Σ −1 .S T .D + .R T .Σ −1 . and the expression for the uncertainty In each case, we indicate the limit where σ 1 becomes much smaller than σ 2 with the ∼ symbol, i.e., one measurement is much more accurate than the other. A comment is in order with respect to the HFAG approach at this stage. As noticed in Ref. [36], the maximal uncertainty is min(σ 2 1 , σ 2 2 ) and corresponds to the correlation coefficient ρ = min(σ 1 /σ 2 , σ 2 /σ 1 ) (it is not ρ = 1).
In the case of two fully correlated measurements, we have where we indicated the limit when σ 1 → 0. The 0-inverse yields and the Moore-Penrose pseudoinverse yields

A.3.2 n fully correlated measurements
We have a correlation matrixC s with unit entries everywhere. This yields d 1 = n, d i>1 = 0. The λ-inverse yields where we indicated the limit when σ 1 → 0. The 0-inverse yields The Moore-Penrose pseudoinverse yields We can actually show that in this situation, the choice of the λ-inverse is optimal in the family of generalised inverses defined in App. A.2. Indeed, there is only one non-vanishing eigenvalue d 1 = n, leading to which is minimal for the maximal value λ = 1/d 1 , corresponding to the λ-inverse.
Lattice Averaging Group [37], and it does not run into the danger of underestimating the resulting uncertainty as discussed by the Heavy Flavour Averaging Group [36]. For these reasons, we choose the λ-inverse to compute both the inverse statistical covariance matrix and the inverse theoretical correlation matrix when these matrices are singular (the regular case being trivial).
B Varying the biases in the presence of theoretical correlations B.1 Range of variations for the biases Another issue consists in implementing correlations for the biases describing theoretical uncertainties. Some differences occur compared to statistical uncertainties, since different models are used in both cases (random variables versus biases). As described in Sec. 5.2.2, once the weights w i are determined, the theoretical uncertainty is given by δ µ = i w i ∆ iαδα , which requires one to determine the range of variation for the normalised biasesδ α . We want to describe their variation starting from variations of uncorrelated variables. This can be achieved through a linear transformation by introducing the Cholesky decomposition for the theoretical correlation matrix C t = P · P T with P a lower triangular matrix with diagonal positive entries. We obtain the expression for the theoretical uncertainty where (P −1δ ) j are uncorrelated biases varied in a hyperball, leading to There is an ambiguity in the definition of P when C t is only semi-definite positive (which occurs when C t is singular due to 100% correlations, and exhibits not only positive but also vanishing eigenvalues). We define then P by computing P ( ) for the shifted matrix C t + × 1 m×m and defining P = lim →0 + P ( ). This limit is not singular, and it allows one to define the limit of two measurements fully correlated theoretically as a smooth limit of the general case with a partial correlation. One should emphasise that in the case of a singular correlation matrix C t for theoretical uncertainties, we may have to treat this singularity at two different stages: first when we build the test statistic involvingW (depending on the structure of the statistical and theoretical correlation matrices), second when we consider the domain of variation for the parametersδ. We stress that we used different procedures in both cases (λ-inverse forW , Cholesky decomposition forδ), which involves some arbitrariness, but reproduces desirable properties for the combined uncertainties and domains of variation of the biases in this singular limit.
In the case of a hypercube, we may want to follow the same procedure and define The question mark indicates that this definition is only tentative, and will not actually be used. Indeed as discussed in Sec. 5.2.3 and illustrated in the following sections, this definition has the rather unpleasant feature that the ranges of variations depend on the order of the inputs used, and we have not been able to identify an alternative choice for the range of variations that would avoid this problem, which does not occur in the hyperball case. These difficulties could be somehow expected from the properties of the hypercube case discussed in Sec. 5.2.1. Indeed, in the case of two measurements, the hypercube corresponds to values of δ 1 and δ 2 left free to vary without relation among them (contrary to the hyperball case). It seems therefore difficult to introduce correlations in this case which was designed to avoid them initially. Our failure to introduce correlations in this case might be related to the fact that the hypercube is somehow designed to avoid such correlations from the start and cannot accommodate them easily.
We thus propose the alternative definition, ignoring theoretical correlations to determine the range of variations for the biases
In the case of a hypercube with correlations,δ 1 ,δ 2 are varied in a parallelogram with two sides parallel to theδ 2 axis, whereas they are varied in a tilted ellipse in the hyperball case, as can be seen in Fig. 10. In both cases, the limiting case where ρ → ±1 corresponds toδ 1 andδ 2 varied along a diagonal line, meeting our expectations for fully correlated theoretical uncertainties. We see that this treatment yields a symmetric domain forδ 1 andδ 2 in the hyperball case, but not in the hypercube case, which means that the two uncertainties are not treated in a symmetric way 21 . As indicated before, Eq. (96) corresponds to the hypercube with ρ = 0, i.e., a square domain forδ 1 andδ 2 .
One can easily extend the same procedure to a larger number of correlated theoretical uncertainties. As indicated above, the hyperball with correlations yields domains of variations which are symmetric for any pair (δ k ,δ l ) whereas the hypercube with correlations does not. This means that the range of variation chosen for the biases will depend on the order of the inputs: a mere reshuffling of the inputs will yield different ranges of variations for the biases and (in general) different outcomes for averages and fits. In addition, we should emphasise that a total correlation (C t ) k,l = 0 between two biases does not have the same impact for the domain of variation in the (δ k ,δ l ) plane in both approaches: in the hyperball case, one obtains an undeformed disk, whereas the hypercube case yields a complicated convex polytope depending on the other elements of the correlation matrix (see Fig. 11 in the case of three biases) (a symmetrisation of the Cholesky decomposition in the form P + P T or a different choice of linear transformation would yield similar results).
These features lead us to neglect correlations in the hypercube range of variations, whereas we keep them when considering the hyperball case. We thus discard Eq. (95) and consider only Eqs. (94) and (96) in our analyses. In Eq. (34), one may be uncertain about the case where a theoretical uncertainty is fully correlated between two observables. Let us imagine that we have two quantities X 1 = X 10 ± σ 1 ± ∆ 1 and X 2 = X 20 ± σ 2 ± ∆ 2 with the two theoretical uncertainties being fully correlated. We can imagine describing the theoretical uncertainties either via m = 2 parameters fully correlated through C t : or as m = 1 parameter intervening in the two quantities via ∆ II : We can see in the above discussion that the only relevant combination of ∆ and C t is actually ∆P , whether in the definition ofW that involves ∆ C t ∆ T = (∆P )(∆P ) T , or in the discussion of the theoretical uncertainty ∆ µ . We have leading to the same ∆ C t ∆ T and showing that only one uncorrelated bias parameter is needed in both cases, even though we started from a different number of bias parameters. The discussion can be extended to an arbitrary number of fully correlated theoretical uncertainties. Obviously, for partial correlations, only C t can be used with an unchanged number of bias parameters.

C.2 Reducing the problem to one bias parameter per observable
We can define a reduced version of the problem Eq. (34), with only n bias parameters rather than m. We have to determine an equivalent problem whereW t and ∆ are n × n matrices, and ∆ is diagonal. From what was discussed before, we see that we will obtain the same result for the weights w (q) , the variances and the correlations, if we ensure that ∆P = ∆ P . This can be achieved by defining ∆ and the correlation matrix C t using C t is positive semi-definite, which means that ∆ C t ∆ T will also be. The diagonal elements of a positive semi-definite matrix are positive, and therefore, one can define ∆ so that C t has 1 as a diagonal. It could occur that ∆ C t ∆ T has 0 on the diagonal for some k th entry. But since ∆ C t ∆ T is positive semi-definite, one can prove that the corresponding row and column then vanish, meaning that the corresponding bias parameter does not actually occur in the reduced problem. In such a case, one can define ∆ k = 0 and C t vanishing on the k th row and column, and C t,kk = 1 (this is the case for instance if there is no theoretical uncertainty for some of the observables).
Moreover, one can check that C is indeed a correlation matrix by defining the scalar product (x, y) = x T ∆ C t ∆ T y. We can apply the Cauchy-Schwartz inequality to the basis vectors u (i) defined so that u (i) j = δ ij (i.e., only one non-vanishing component): (u (i) , u (j) ) 2 ≤ (u (i) , u (i) )(u (j) , u (j) ) so that | C t,ij | ≤ 1 and C t,ii = 1, with the appropriate structure of a correlation matrix. Finally, the Cholesky decomposition of C t corresponds to P = (∆ ) −1 ∆P . Therefore, the determination of the theoretical uncertainties for ∆ µ remains indeed the same with the new set of biases.
We have thus reduced the problem of n measurements and m theoretical biases to the case with n measurements, each of them having with a single bias parameter, with correlations among the biases. Without loss of generality we can consider that ∆ is diagonal and m = n.

D Asymmetric uncertainties
In this article, statistical uncertainties are assumed to be strictly Gaussian and hence symmetric. In practice, if asymmetric uncertainties are quoted, we symmetrise in the following manner This is also the case for the theoretical uncertainties in the random-δ approach.
In contrast, it is perfectly possible to have asymmetric theoretical uncertainties in the nuisanceδ or external-δ approaches described above. A theoretical uncertainty that is modeled by a bias parameter δ may be asymmetric: that is, the region in which δ is varied may depends on the sign of δ, e.g. δ ∈ [−∆ − , +∆ + ] in one dimension (∆ ± ≥ 0).
In the case of a quadratic test statistic, we want to keep the stationarity property stemming from the symmetric quadratic shape, by using a test statistic Eq. (22) with (∆ + + ∆ − )/2 or Max(∆ + , ∆ − ) in the definition, the second possibility being more conservative and our preferred choice in the following. As indicated in Sec. 5.2.2, this is independent of the range of variation Ω chosen, which will be kept asymmetric, e.g., [−∆ − , ∆ + ] in the fixed nuisance approach.
In the case of the Rfit approach [15,16], we can use the fact that the well test statistic has a shape that is independent of the central value chosen, as long as the position of the flat bottom remains unchanged. One can thus shift the central value by an arbitrary quantity if one remains at the bottom of the well. It is thus completely equivalent to take asymmetric theoretical ranges or to take symmetric theoretical ranges following Eq. (105) where σ ± is replaced by ∆ ± .

E 3-σ intervals for CKM-related examples
We collect here the intervals at 3 σ for the various approaches applied to the CKM examples discussed in Sec. 6. Figs. 12, 13 and 14 are the 3-σ equivalents of Figs. 7, 8 and 9 showing 1 σ intervals. The comparison between the two series of plot shows how the intervals evolve with the confidence level. In particular, the adaptive hyperball approach appears more (less) conservative than the 1-hypercube approach at high (low) significance. This change of hierarchy explains why we choose a different convention to plot the 1 σ (dashed horizontal line) and 3 σ (vertical lines in the middle of the solid intervals) intervals for the 1-hypercube approach in Figs  The black range gives the statistical error. For each individual input, the solid yellow range indicates the 3 σ interval according to the adaptive hyperball approach, whereas the interval corresponding to the 1-fixed hypercube approach is given by the vertical lines in the middle of the solid yellow intervals. For average according to the different approaches, the black range corresponds again to the 3 σ statistical error, whereas the yellow range corresponds to the 3 σ interval following the corresponding approach. The comparison between black and yellow ranges illustrates the relative importance of statistical and theoretical errors. Finally, for illustrative purposes, the vertical purple line gives the arithmetic average of the inputs (same weight for all central values).