1 Introduction

Computing physical observables exactly in a generic quantum field theory (QFT) is still an open challenge. The closest approximation to such a computation is achieved through numerical simulations of the theory on a discretized space-time (lattice). This approach, however, cannot be used universally. For instance, for some observables (e.g. those involving large momenta) the required computing power would be out of reach. A complementary approach, applicable when the coupling of the theory is sufficiently small, is the so-called perturbative approach, or perturbation theory. Namely, physical observables are expressed as a power series in the coupling. The coefficients of this power series can be computed systematically, e.g. using Feynman diagram techniques, but the complexity of the calculation grows considerably with the perturbative order. Therefore, in practice, only a limited number of coefficients is attainable for a given observable.

This poses an important question: how far is the approximate perturbative result from the unknown exact one? The answer to this question can be cast as an uncertainty, usually called the theory uncertainty from missing higher orders. Since the exact result is unknown, one can only quantify this uncertainty in terms of a probability distribution. How to determine such a distribution is the subject of this paper.

The most standard and widespread approach to estimating this uncertainty is the so-called scale variation method. This approach is based on the observation that physical observables do not depend on the unphysical scales (such as the renormalization scale) appearing in QFTs. However, this independence is strictly valid only for the exact result. Any approximate result computed in perturbation theory will in fact depend on unphysical scales, the dependence being formally of higher order. The idea is thus to estimate the theory uncertainty by varying the scale, as the result will accordingly change by an amount that is formally of the same order as what this uncertainty aims to quantify. While this idea is certainly valid and powerful, the canonical method used to exploit it has various caveats. Indeed, the canonical recipe consists of varying the unphysical scale \(\mu \) by a factor of two about a central value \(\mu _0\) of choice, and then using the maximal variation of the observable with respect to the value at the central scale as a measure of the uncertainty. This is depicted in Fig. 1. In formulae, the canonical way of writing a result based on a perturbative computation of a physical observable \(\Sigma \) isFootnote 1

$$\begin{aligned} \Sigma \; \overset{\begin{array}{c} {\mathrm{canonical}}\\ {\mathrm{scale}}\\ {\mathrm{variation}} \end{array}}{\equiv }\; \Sigma _{\mathrm{pert}}(\mu _0) \pm \max _{\mu _0/2<\mu <2\mu _0}\Big |\Sigma _{\mathrm{pert}}(\mu )-\Sigma _{\mathrm{pert}}(\mu _0)\Big |, \end{aligned}$$
(1.1)

where \(\Sigma _{\mathrm{pert}}(\mu )\) is the scale dependent perturbative result, \(\Sigma _{\mathrm{pert}}(\mu _0)\) represents the “central value” of the prediction, and the scale variation is appended as a theory “error”. The caveats of this approach are apparent:

  • it is largely arbitrary, in the choice of the central scale \(\mu _0\) and of the interval of variation (factor of two);

  • in the vicinity of stationary points the uncertainty can become accidentally small;

  • the uncertainty has no probabilistic interpretation.

The latter point can be overcome by assigning an interpretation (for instance, the “error” could be interpreted as the standard deviation of a Gaussian distribution), but any choice would be totally arbitrary. In addition to all this, it is well known that the scale variation uncertainty often underestimates the size of higher order contributions.Footnote 2
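For illustration, the recipe of Eq. (1.1) amounts to a few lines of code. The following minimal sketch assumes a callable sigma_pert (a placeholder, not part of any library) returning the truncated perturbative prediction at a given scale:

```python
import numpy as np

def canonical_scale_variation(sigma_pert, mu0, n_points=101):
    """Canonical scale variation, Eq. (1.1): vary the scale in
    [mu0/2, 2*mu0] and take the maximal deviation from the value at
    the central scale.  `sigma_pert` is any callable returning the
    truncated perturbative prediction at a given scale."""
    central = sigma_pert(mu0)
    # scan the scale logarithmically between mu0/2 and 2*mu0
    mus = mu0 * np.exp(np.linspace(np.log(0.5), np.log(2.0), n_points))
    error = max(abs(sigma_pert(mu) - central) for mu in mus)
    return central, error
```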

Fig. 1 Schematic representation of the canonical method to estimate theory uncertainty, namely the scale variation approach. Since the left and right variations of the scale generally lead to different sizes of variation of the perturbative result (shown by the two double-headed arrows), one may either keep the uncertainty asymmetric, or select the largest variation and symmetrize it

In 2011, Cacciari and Houdeau [3] proposed a completely different approach to estimate the uncertainty from missing higher orders. In their groundbreaking work they constructed a probabilistic model to define this uncertainty, based on some assumptions on the progression of the perturbative expansion. Roughly speaking, the model assumes that the coefficients of the power expansion in the coupling are bounded by an unknown (hidden) parameter. The knowledge of the first few orders of the perturbative expansion allows one to perform (Bayesian) inference on the hidden parameter, whose improved knowledge can in turn be used to make inference on the unknown subsequent perturbative coefficients. Once the likelihood of the perturbative coefficients given the hidden parameter and the prior distribution of the hidden parameter are defined, the model produces probability distributions for the unknown coefficients, and thus for the physical observable itself.

This approach to theory uncertainties is unquestionably superior and more elegant than canonical scale variation, mainly because it is probabilistically founded. However, it also has limitations. For instance, while it performs well for QCD observables at \(e^+e^-\) colliders [3], it leads to less reliable results in the case of proton-proton collider observables [4, 5], which are usually affected by larger perturbative corrections. Perhaps for this reason, together with the simplicity of the scale variation method and the deep-rooted habit of using it, the Cacciari–Houdeau (CH) approach is not very popular in the high-energy physics community, with only a few applications [5,6,7,8,9,10,11,12,13,14,15,16,17,18], most of them in the context of effective theories. Moreover, the CH approach does not deal with the unphysical scale dependence of the result – the CH prediction is computed for a given choice of the scale, so the final probability distribution is de facto scale dependent.

Fig. 2 The general structure of the inference in any model considered in this work

In this paper, we use a Bayesian approach to build new probabilistic models, similar to the CH model, that overcome the limitations of the previous approach. The inference structure of any model that we will consider is depicted in full generality in Fig. 2. Starting from the known orders, under some model assumptions one can make inference on the (hidden) parameters characterizing the model, to be used in turn to infer the probability distribution of the unknown higher orders. In formulae, we have, schematically,

$$\begin{aligned}&P(\text {unknown orders}|\text {known orders}) \nonumber \\&\quad = \int d \text {pars}\; P(\text {unknown orders}|\text {pars}) P(\text {pars}|\text {known orders})\nonumber \\ \end{aligned}$$
(1.2)

where P(A|B) is the conditional probability distribution of A given B. Eq. (1.2) is given in terms of the posterior distribution of the hidden parameters

$$\begin{aligned} P(\text {pars}|\text {known orders}) \propto P(\text {known orders}|\text {pars}) P_0(\text {pars}) \end{aligned}$$
(1.3)

which depends on the prior distribution \(P_0(\text {pars})\) of the hidden parameters and on the model assumptions through the likelihood \(P(\text {orders}|\text {pars})\), which also appears explicitly in Eq. (1.2). These two model-dependent ingredients are sufficient for Bayesian inference to work, and can be used to eventually construct a probability distribution for the observable, which contains all the information on the uncertainty from missing higher orders.
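To make the structure of Eqs. (1.2) and (1.3) concrete, the following minimal sketch performs the two steps numerically for a single hidden parameter on a grid; the Gaussian likelihood and log-flat prior in the usage lines are toy assumptions for illustration only, not one of the models introduced later:

```python
import numpy as np
from scipy.stats import norm

def predictive_density(known, likelihood, prior, par_grid, x_grid):
    """Numerical version of Eqs. (1.2)-(1.3) for one hidden parameter:
    posterior by Bayes theorem on a grid, then marginalization to get
    the density of the next unknown order."""
    # posterior of the hidden parameter, Eq. (1.3), up to normalization
    post = np.array([prior(p) * np.prod([likelihood(c, p) for c in known])
                     for p in par_grid])
    post /= np.trapz(post, par_grid)
    # predictive density of the unknown order, Eq. (1.2)
    lik = np.array([[likelihood(x, p) for p in par_grid] for x in x_grid])
    return np.trapz(lik * post, par_grid, axis=1)

# toy usage: Gaussian likelihood with unknown width, log-flat prior
known = [1.0, 0.4, 0.15]                      # illustrative known orders
pars = np.linspace(0.05, 10.0, 400)           # grid for the hidden parameter
xs = np.linspace(-1.5, 1.5, 301)              # grid for the next order
dens = predictive_density(known, lambda c, p: norm.pdf(c, scale=p),
                          lambda p: 1.0 / p, pars, xs)
```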

In this work, we propose two main models: one is an improved version of the CH model that efficiently describes perturbative expansions with large perturbative corrections; the other is a model inspired by the scale variation method, but constructed in such a way as to be reliable and probabilistically sound. Both methods outperform the current approaches in terms of reliability. They are also sufficiently general to be used for perturbative expansions that are not necessarily fixed-order expansions in powers of the coupling, but can be, for instance, resummed expansions or other generalized expansions. Moreover, the first model can also be applied beyond quantum field theory, for instance to quantum mechanics or, in general, to any expansion that behaves perturbatively. We also explore variants of the methods and combinations of them.

These methods are still applied to the perturbative expansion at a given fixed value of the unphysical scale. An important and innovative development proposed in this work is a way to “remove” the scale dependence of the result. This is achieved by dealing with the scale dependence within the probabilistic framework, and leads to a result that is to a large extent scale independent. As a byproduct of this procedure, the central value of the prediction (identified with the mean of the probability distribution for the observable) does not necessarily correspond to the canonical perturbative result at a given “central” scale. Thanks to this feature our method improves not only the reliability of the uncertainty, but also that of the central prediction for the observable.

To facilitate the adoption of the results of this work, a computer code named THunc is publicly released. The code is very easy to use: the user provides the perturbative expansion, and the code outputs the probability distribution of the result, together with a number of statistical estimators (mean, mode, median, standard deviation, degree-of-belief intervals). It is also fast and efficient, and flexible, as it allows the user to define customized models.

The structure of this paper is the following. In Sect. 2 we give some preliminary information on perturbative expansions and their scale dependence, we provide some basic concepts on the probabilistic definition of the theory uncertainty as well as a brief recap of the CH method, and we define a working example to be used in the subsequent sections. In Sect. 3 we first define some notation and describe general features common to all the models we will later consider. We then move on to presenting our two main models (at fixed scale) in Sects. 4 and 5. We propose a way to construct scale-independent results and uncertainties in Sect. 6. We then validate our methods in Sect. 7, where we also consider some realistic applications. In Sect. 8, we discuss the issue of defining correlations between theory uncertainties. After concluding in Sect. 9, we collect details on numerical implementations in Appendix A and propose a number of variants and possible improvements in Appendix B.

2 Preliminaries

2.1 Basic concepts and assumptions on perturbative expansions

We consider a (renormalizable) quantum field theory (QFT) depending on a single coupling \(\alpha \). We will often refer to practical examples in quantum chromodynamics (QCD), since its coupling is not very small and thus perturbation theory produces somewhat large high-order corrections, which is the case where reliably estimating theory uncertainties is both important and challenging. We focus on a generic physical observable \(\Sigma \). The theory predicts a unique well-defined value for this observable, which we call \(\Sigma _{\mathrm{true}}\). This is the value we aim to obtain. However, we usually cannot compute it exactly, and we thus use a perturbative approach to approximate it.

According to the perturbative hypothesis, namely that the dimensionless coupling \(\alpha \) of the theory is sufficiently small, it is possible to compute the observable \(\Sigma \) as a power series in the coupling itself. We then write

$$\begin{aligned} \Sigma _{\mathrm{true}}\simeq \sum _{k=0}^n c_k \alpha ^k, \end{aligned}$$
(2.1)

where \(c_k\) are the coefficients of the perturbative expansion, and n is some order at which we stop the expansion. Eq. (2.1) is not an equality. One may be tempted to think that if \(n\rightarrow \infty \), then it would become an equality. This limit, however, does not exist, as perturbative expansions are divergent [19, 20]. The divergence of the series is related to the fact that \(\Sigma _{\mathrm{true}}\) is a non-analytic function of the coupling \(\alpha \) in \(\alpha =0\). One may try to treat the divergent series using e.g. the Borel summation method. For some known (to all orders) divergent contributions, such as those due to renormalons (see e.g. Ref. [21]), one can obtain a finite result through Borel summation.Footnote 3 However, it is not guaranteed that the Borel-sum of the series captures the full result: there may be intrinsically non-perturbative contributions that cannot be reconstructed from the perturbative expansion.Footnote 4 Moreover, in order to use the Borel summation method, the series should be known to all orders, or at least its asymptotic behaviour should be known.Footnote 5 However, this is usually not the case, so in practice the only information we have about an observable is its (truncated) perturbative expansion, Eq. (2.1).

Fig. 3 The asymptotic expansion of the function in Eq. (2.2) truncated at various orders from 0 to 15 (orange dots), normalized to the value of the function itself (blue line). The lower panel shows the absolute difference between the truncated and exact results. The left plot corresponds to \(\alpha =0.13\), the right plot to \(\alpha =0.2\)

The fact that the series is divergent and that summation methods cannot be used may suggest that the perturbative result is useless. However, this is in contrast with the well-known fact that perturbation theory works, namely it predicts results in decent (or even good) agreement with data, at least when the coupling is sufficiently small (for instance, in QED it works very well). The explanation of this fact relies on the assumption that perturbative series are asymptotic expansions of the exact result. To our knowledge, there is no general proof of this statement, but it seems very reasonable and we take it as valid. The asymptotic nature of the perturbative expansion implies that up to some order \(k_{\mathrm{asympt}}\) adding terms to the expansion improves the accuracy of the prediction, but beyond \(k_{\mathrm{asympt}}\) the divergent contributions to the series dominate and the sum explodes. A visual example of this fact is shown in Fig. 3, where the following non-analytic function of \(\alpha \) is compared with its asymptotic expansion at small \(\alpha \):

$$\begin{aligned} \frac{1}{\alpha }\exp \left( \frac{1}{\alpha }\right) \Gamma \left( 0,\frac{1}{\alpha }\right) \overset{\mathrm{asympt}}{=} \sum _{k=0}^\infty (-1)^k k! \alpha ^k. \end{aligned}$$
(2.2)

From the figure one sees that for a sufficiently small value of \(\alpha =0.13\) (left plot) the first few (approximately 7) orders give a good approximation of the result, showing an apparently converging behaviour. However, adding extra orders deteriorates the prediction, because the factorial growth of the series wins over the power suppression. The best prediction one can make using the asymptotic expansion is thus obtained by truncating it at \(k_{\mathrm{asympt}}\sim 7\). This prediction, however, has an irreducible uncertainty due to the truncation itself,Footnote 6 of the size of the last term included in the truncated expansion. The right plot, obtained with a larger coupling \(\alpha =0.2\), shows that this irreducible uncertainty grows with the value of the coupling, while the value of \(k_{\mathrm{asympt}}\) decreases accordingly.Footnote 7
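The behaviour shown in Fig. 3 is easy to reproduce, using the fact that \(\Gamma (0,x)\) equals the exponential integral \(E_1(x)\), available as scipy.special.exp1; the following self-contained sketch locates the optimal truncation order for the two values of \(\alpha \) used in the figure:

```python
import numpy as np
from scipy.special import exp1, factorial

def exact(alpha):
    """Left-hand side of Eq. (2.2); Gamma(0, x) = E1(x) = scipy's exp1."""
    return np.exp(1.0 / alpha) * exp1(1.0 / alpha) / alpha

def truncated(alpha, n):
    """Right-hand side of Eq. (2.2) truncated at order n."""
    k = np.arange(n + 1)
    return np.sum((-1.0) ** k * factorial(k) * alpha ** k)

for alpha in (0.13, 0.2):
    # the truncation error is minimal around k_asympt ~ 1/alpha
    errs = [abs(truncated(alpha, n) - exact(alpha)) for n in range(16)]
    print(f"alpha = {alpha}: best truncation order = {int(np.argmin(errs))}")
```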

From these considerations we can conclude that a physical observable can be expressed as the sum

$$\begin{aligned} \Sigma _{\mathrm{true}}= \sum _{k=0}^{k_{\mathrm{asympt}}} c_k \alpha ^k + \Delta _{\mathrm{asympt}} + \Delta _\text {non-pert}, \end{aligned}$$
(2.3)

where the first term is the perturbative expansion truncated at \(k_{\mathrm{asympt}}\), which is the best prediction we can make using perturbation theory, while \(\Delta _{\mathrm{asympt}}\) represents the irreducible difference between the truncated expansion and its all-order sum. The \(\Delta _\text {non-pert}\) term, instead, represents possible intrinsically non-perturbative contributions that cannot be captured by perturbation theory. The general expectation on the size of these contributions is

$$\begin{aligned} \left| \Delta _{\mathrm{asympt}} \right| \sim \left| \Delta _\text {non-pert} \right| \ll \left| \sum _{k=0}^{k_{\mathrm{asympt}}} c_k \alpha ^k \right| , \end{aligned}$$
(2.4)

at least when the coupling is sufficiently small. This expectation is again not proven, but rather a consequence of the success of perturbation theory, together with the fact that both \(\Delta \) terms are known to be exponentially suppressed by \(\exp (-a/\alpha )\), \(a>0\), in some established cases (namely, when \(\Delta _\text {non-pert}\) contains instanton contributions and when \(\Delta _{\mathrm{asympt}}\) is dominated by factorially divergent contributions like renormalons [21]).

Unfortunately, since we do not generally know the asymptotic behaviour of the expansion, we cannot know a priori the value of \(k_{\mathrm{asympt}}\), and we thus cannot truncate the expansion at the optimal value. This is not a real issue, as in most cases we know just a rather small number of orders, typically two or three, corresponding to next-to-leading order (NLO) and next-to-next-to-leading order (NNLO) computations. Only in very few cases do we know physical quantities at \(\hbox {N}^3\hbox {LO}\) (four terms in the expansion) or beyond. We thus expect (and assume) the number n of known orders to be smaller than \(k_{\mathrm{asympt}}\).Footnote 8 Therefore, we can rewrite Eq. (2.3) as

$$\begin{aligned} \Sigma _{\mathrm{true}}= \sum _{k=0}^{n} c_k \alpha ^k + \Delta _{\mathrm{MHO}}^{(n)} + \Delta _{\mathrm{asympt}} + \Delta _\text {non-pert}, \end{aligned}$$
(2.5)

having defined the contribution from missing higher orders

$$\begin{aligned} \Delta _{\mathrm{MHO}}^{(n)} = \sum _{k=n+1}^{k_{\mathrm{asympt}}} c_k \alpha ^k, \end{aligned}$$
(2.6)

where n is the highest known perturbative order. Assuming that n is sufficiently smaller than \(k_{\mathrm{asympt}}\), using similar considerations to those that led to Eq. (2.4) we can conclude that in most cases the contribution from missing higher orders is larger than the asymptotic and non-perturbative contributions,

$$\begin{aligned} \left| \Delta _{\mathrm{MHO}}^{(n)} \right| \gg \left| \Delta _{\mathrm{asympt}} \right| \sim \left| \Delta _\text {non-pert} \right| . \end{aligned}$$
(2.7)

This implies that our knowledge of the observable \(\Sigma \) is determined by the perturbative expansion truncated at order n with an uncertainty that is dominated by the missing higher order term \(\Delta _{\mathrm{MHO}}^{(n)}\). Quantifying this term, or better determining its probability distribution, is the main task of the rest of this paper.

2.2 Constructing a probability distribution for a physical observable

Defining the theoretical uncertainty from missing higher orders in a probabilistic way may sound impossible or completely arbitrary to many. This would certainly be true in the context of the so-called frequentist approach to probability, in which the definition of a probability requires the existence of a repeatable event; this is typically the case for an experiment, but clearly not for a theoretical prediction. However, the frequentist approach to probability is not the only one – actually, the frequentist formulation is mathematically inconsistent (see e.g. [28]), and thus certainly not the best one.

The only mathematically correct formulation of a probability theory is the so-called Bayesian approach, where the probability is defined as the “degree of belief” of an event, which is then intrinsically subjective. Initially, when no information is available, the probability of an event is given by a prior distribution, which encodes our subjective and arbitrary prejudices. Acquiring information on the event changes the degree of belief through statistical inference (Bayes theorem). Therefore, any probability will depend on subjective assumptions through the prior distribution, but adding more and more information updates the probability making it less and less dependent on the prior.

In the case of repeatable events, one can acquire information on the process by repeating them (and, in the limit of a large number of repetitions, one recovers the frequentist result). However, repetition is not the only way of acquiring information, and thus one can use the Bayesian approach also in cases (like ours) where the event is not repeatable. Here “event” means something that can happen in different ways with different likelihoods, which we want to describe through a probability distribution. In our case, the event is “the observable takes the value \(\Sigma \)”, and its probability distribution will be a function of \(\Sigma \) ranging over all possible values. The information on this event that we want to use is the perturbative expansion of the observable. How this will be used in practice depends on the model and will be discussed at length in the rest of this paper. Thanks to this information, we can then use probabilistic inference to improve the knowledge on the observable, namely to update the distribution of \(\Sigma \).

The goal of this work is thus the construction of a probability distribution of the observable \(\Sigma \) given the perturbative expansion up to order n, namely

$$\begin{aligned} P(\Sigma |c_0,\ldots ,c_n, H), \end{aligned}$$
(2.8)

where we have indicated with H any assumption (hypothesis), including both prior distributions and the model we want to use (we will come back to this point later). P(A|B) indicates the probability distribution of A given the information B. In our case, the information is given by the first \(n+1\) coefficients \(c_0,\ldots ,c_n\), and H. This distribution contains all the information we desire about our knowledge of the observable. For instance, we can compute the best estimate of the observable \(\Sigma \) as its expectation value according to this distribution,

$$\begin{aligned} \langle \Sigma \rangle = \int d\Sigma \; \Sigma \; P(\Sigma |c_0,\ldots ,c_n, H), \end{aligned}$$
(2.9)

and its uncertainty either as the standard deviation or using degree-of-belief (DoB) intervals. Most importantly, the probability distribution can be used directly in physical analyses, when comparing theory predictions with data.
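For illustration, given a density tabulated on a grid, these estimators can be extracted as in the following sketch; defining the DoB interval as the shortest interval containing the requested probability is our choice for this example:

```python
import numpy as np

def summarize(sigma, pdf, dob=0.68):
    """Mean, standard deviation and shortest degree-of-belief interval of
    a density `pdf` tabulated on the grid `sigma`."""
    pdf = pdf / np.trapz(pdf, sigma)                     # normalize
    mean = np.trapz(sigma * pdf, sigma)
    std = np.sqrt(np.trapz((sigma - mean) ** 2 * pdf, sigma))
    # cumulative distribution via the trapezoidal rule
    cdf = np.concatenate(([0.0],
        np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(sigma))))
    best = (sigma[0], sigma[-1])
    for i in range(len(sigma)):          # shortest interval with mass `dob`
        j = np.searchsorted(cdf, cdf[i] + dob)
        if j < len(sigma) and sigma[j] - sigma[i] < best[1] - best[0]:
            best = (sigma[i], sigma[j])
    return mean, std, best
```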

In the limit of “infinite information”, namely when we know the exact result \(\Sigma _{\mathrm{true}}\), the probability Eq. (2.8) should become

$$\begin{aligned} P(\Sigma |\Sigma _{\mathrm{true}}) = \delta (\Sigma -\Sigma _{\mathrm{true}}), \end{aligned}$$
(2.10)

which represents the certainty (not probability) that \(\Sigma =\Sigma _{\mathrm{true}}\). In this limit any a priori assumption H does not matter. Eq. (2.10) cannot be seen as the all-order limit of Eq. (2.8), due to the divergent nature of the series and to the non-perturbative contributions discussed in Sect. 2.1. However, it suggests that when adding information (i.e. when increasing the number n of known orders, up to \(k_{\mathrm{asympt}}\)) the probability distribution should become narrower and more localised. We shall consider this behaviour as a property that a good model for theory uncertainty must satisfy.

Note that knowing the probability distribution for the missing higher order term \(\Delta _{\mathrm{MHO}}^{(n)}\) is practically the same as knowing the distribution for \(\Sigma \). Indeed, in the limit Eq. (2.7) where we neglect the asymptotic and non-perturbative contributions, the distributions for \(\Delta _{\mathrm{MHO}}^{(n)}\) and \(\Sigma \) are the same up to a trivial shift given by the perturbative result (Eq. 2.5). In the following, we will always deal directly with the distribution of \(\Sigma \) (Eq. 2.8), but we will compute it by estimating the missing higher orders \(\Delta _{\mathrm{MHO}}^{(n)}\).

2.3 The role of unphysical scales

A general feature of renormalizable QFTs is the appearance of an unphysical scale \(\mu \) as a consequence of the regularization procedure needed to deal with ultraviolet divergences. This is known as the renormalization scale. Physical observables do not depend on it, as this scale is an artefact of the scheme adopted to renormalize the theory. However, in practical computations using perturbation theory, a scale dependence is present at each order, in such a way that it is compensated order by order. Any finite-order truncation of the perturbative series will thus have a residual scale dependence, which is formally of higher order. As discussed in the introduction, this observation is at the core of the canonical scale variation method to estimate theory uncertainties.

Because of the renormalization scale dependence, a perturbative computation does not determine a unique series, but rather a family of series parametrized by the renormalization scale \(\mu \). We shall thus rewrite Eq. (2.5) as

$$\begin{aligned} \Sigma _{\mathrm{true}}&= \sum _{k=0}^{n} c_k(\mu ) \alpha ^k(\mu ) \nonumber \\&\quad + \Delta _{\mathrm{MHO}}^{(n)}(\mu ) + \Delta _{\mathrm{asympt}}(\mu ) + \Delta _\text {non-pert}(\mu ), \end{aligned}$$
(2.11)

where the left-hand side, the exact result, is scale independent:

$$\begin{aligned} \frac{d}{d\mu } \Sigma _{\mathrm{true}}= 0. \end{aligned}$$
(2.12)

Therefore, whenever we want to estimate the value of an observable using perturbation theory, we need to face the fact that the perturbative result is scale dependent, and so are the missing higher orders.

An immediate consequence of this fact is that also the probability distribution Eq. (2.8) will unavoidably depend on the choice of scale \(\mu \). We can express this by writing the probability Eq. (2.8) as

$$\begin{aligned} P(\Sigma |c_0(\mu ),\ldots ,c_n(\mu ), H), \end{aligned}$$
(2.13)

where we have emphasised that each coefficient depends on the scale \(\mu \),Footnote 9 or equivalently as

$$\begin{aligned} P(\Sigma |c_0,\ldots ,c_n, \mu , H), \end{aligned}$$
(2.14)

where the coefficients \(c_k\) are intended as functions, to be computed at the value \(\mu \) passed as an extra parameter. This dependence on the scale is clearly undesired, because this probability distribution, which depends on \(\mu \), is for the true observable, which is independent of \(\mu \). In the limit of infinite knowledge (Eq. 2.10), the distribution should tend to \(\delta (\Sigma -\Sigma _{\mathrm{true}})\) irrespective of the value of \(\mu \). This implies that, as the order increases, the probability distributions at different values of \(\mu \) should become more and more similar. While this feature is certainly nice, having an infinite number of different results for the same object is obviously not ideal. Rather, one would like to obtain a probability distribution for \(\Sigma \) that does not depend on the choice of scale. This can be achieved in two ways: either having a criterion for selecting an “optimal value” of the scale, or combining in some way the results at different scales.

The first way is obviously simpler, provided such a criterion exists. In the literature there are various approaches that aim at selecting an optimal scale, e.g. the Brodsky-Lepage-Mackenzie (BLM) method [29,30,31], the principle of minimal sensitivity (PMS) [32, 33], the principle of maximal conformality (PMC) [34,35,36,37] and the recent principle of observable effective matching (POEM) [38]. The PMC is probably the most widespread approach. It provides a way to select, order by order, an optimal scale that removes non-conformal \(\beta \)-function contributions from the perturbative expansion. This is believed to remove the renormalons from the perturbative expansion, thereby leading to a possibly convergent series (or at least to a less divergent one). For our purposes, this approach could provide a way to select, among the infinitely many probability distributions for \(\Sigma \), a specific one.Footnote 10 Note, however, that the PMC fixes the scale at each known order except the last one, which is free and thus arbitrary. This implies that a residual scale dependence is present also in the PMC approach, even though it is claimed to be much milder than the canonical scale dependence. Moreover, it has been pointed out that a proper study of all the ambiguities in the approach leads to larger uncertainties, comparable to the canonical scale uncertainty [40,41,42,43]. We conclude that while the PMC approach is certainly interesting, it cannot provide the full solution to our problem.

The second way to obtain a scale-independent probability distribution for \(\Sigma \) is what we pursue in this work. The treatment of the renormalization scale is addressed within the methodology for computing the probability distribution Eq. (2.8). The actual procedure to combine the probability distributions at different scales, which represents one of the most innovative proposals of this work, will be presented in Sect. 6.

Before moving further, another aspect of scale dependence must be discussed. So far, we have described scale dependence as an obstacle to obtaining a unique probability distribution for the observable. In fact, scale dependence can also be considered a tool. This relies on the fact, already discussed in the introduction, that the \(\mu \)-dependence of the finite-order truncation of the perturbative series is of higher order, namely

$$\begin{aligned} \mu \frac{d}{d\mu } \sum _{k=0}^{n} c_k(\mu ) \alpha ^k(\mu ) = {\mathcal {O}}\left( \alpha ^{n+1}\right) = {\mathcal {O}}\left( \Delta _{\mathrm{MHO}}^{(n)}\right) . \end{aligned}$$
(2.15)

This fact provides additional information on the expansion, which can be very useful, as in most cases of interest in particle physics the available information is very limited (typically \(n=2\) or 3). In practice, let us assume for simplicity that the \(\mu \) dependence at a given order k can be translated into a single number, \(r_k\). It can for instance be the canonical scale variation uncertainty, or the slope of the cross section as a function of \(\mu \), or something similar (we will provide a precise definition later in Sect. 3.2). We can then generalize the probability distribution as

$$\begin{aligned} P(\Sigma |c_0(\mu ), r_0(\mu ),\ldots ,c_n(\mu ), r_n(\mu ), H), \end{aligned}$$
(2.16)

where we have made explicit that also the \(r_k\) numbers generally depend on the choice of \(\mu \) about which the scale dependence is computed. In other words, also in this case we get a family of distributions depending on the value of \(\mu \) at which the perturbative expansion is computed; however, this time we also include in each member of the family some information on the scale dependence. Since these parameters double the available information,Footnote 11 they are clearly very valuable.

Note that the parameters \(r_k\) represent the kind of information used in the construction of the canonical scale uncertainty. More precisely, the canonical scale uncertainty is based only on the last one, \(r_n\) (assuming \(r_n\) is defined according to Eq. (1.1)), and it does not provide a probabilistic interpretation. In Sect. 5 we will instead make use of the scale variation information \(r_k\) in a fully fledged probabilistic model, thereby providing a method that is, in some sense, a more reliable and statistically sound version of canonical scale variation.

We finally stress that the renormalization scale is not the only scale appearing in perturbative computations. For instance, in QCD processes involving hadrons in the initial or final states, the factorization of collinear singularities introduces a (perturbative) dependence on another unphysical scale, the so-called factorization scale. Also, in effective field theories widely used in collider phenomenology (e.g. heavy quark effective theory or soft-collinear effective theory), other unphysical scales may appear. The way to deal with these scales strictly depends on the scale itself. For instance, the dependence on the factorization scale cancels between the perturbatively computable coefficients and the non-perturbative parton distribution functions (PDFs). Therefore, if we wish to obtain a factorization-scale-independent probability distribution for a physical observable, we may try to extend the procedure that we propose in Sect. 6 to this scale, with proper caveats due to the fact that PDFs are non-perturbative objects. Instead, if we wish to include our definition of theory uncertainties in the fits used to determine PDFs from data, the situation is completely different, as the PDFs are not physical observables and are thus scheme and scale dependent. Addressing this issue is beyond the scope of this paper and is left to future work. Here, we only focus on the renormalization scale dependence, which is universal.

2.4 The Cacciari–Houdeau method

Before moving to our new proposals, we now present the Cacciari–Houdeau (CH) method for estimating theory uncertainties [3]. Let us forget about the scale dependence (which is not dealt with in the original paper) and consider the perturbative expansion

$$\begin{aligned} \Sigma _{\mathrm{pert}}= \sum _{k=0}^\infty c_k \alpha ^k. \end{aligned}$$
(2.17)

We know that this series is divergent, but for the moment we ignore this fact. The basic assumption made in the CH model is that all the coefficients \(c_k\) are bounded in absolute value by a common number \({\bar{c}}\), namely

$$\begin{aligned} \left| c_k \right| \le {\bar{c}}\qquad \forall k. \end{aligned}$$
(2.18)

The coefficient \({\bar{c}}\) is a parameter of the model, and specifically a hidden parameter, which will disappear (through marginalization) from the final results. Moreover, all the \(c_k\) are assumed to be independent of each other, except for the common bound, which implies that

$$\begin{aligned} P(c_k,c_j|{\bar{c}}) = P(c_k|{\bar{c}}) P(c_j|{\bar{c}})\qquad \forall k,j, \quad k\ne j. \end{aligned}$$
(2.19)

The conditional probability \(P(c_k|{\bar{c}})\), that we shall call the likelihood, encodes in a probabilistic way the assumption Eq. (2.18). In the CH approach it is given by

$$\begin{aligned} P(c_k|{\bar{c}}) = \frac{1}{2{\bar{c}}} \theta ({\bar{c}}-\left| c_k \right| ), \end{aligned}$$
(2.20)

namely the condition Eq. (2.18) must be strictly satisfied (the probability that the condition is violated is zero), and within the allowed range all values are equally likely (flat distribution). Finally, they provide a prior distribution for the hidden parameter,

$$\begin{aligned} P_0({\bar{c}}) \propto \frac{1}{{\bar{c}}}\theta ({\bar{c}}), \end{aligned}$$
(2.21)

which corresponds to a flat distribution in the logarithm of \({\bar{c}}\), encoding the idea that the order of magnitude of \({\bar{c}}\) is a priori unknown. Note that this prior distribution is not normalizable, and thus it requires a regularization procedure to be used.

The ingredients above are sufficient to define the model, and using standard Bayesian inference they allow one to compute the sought probability distribution for the observable. In practice, since the starting point is the perturbative expansion Eq. (2.17), to obtain a probability distribution for the full sum it is sufficient to have a probability distribution for the unknown \(c_k\) coefficients given the knowledge of the first \(n+1\) coefficients \(c_0,\ldots ,c_n\). The key ingredient is thus \(P(c_k|c_0,\ldots ,c_n)\), with \(k>n\), which can be computed as

$$\begin{aligned} P(c_k|c_0,\ldots ,c_n)&= \frac{P(c_k,c_0,\ldots ,c_n)}{P(c_0,\ldots ,c_n)} \qquad (k>n) \nonumber \\&= \frac{\int d{\bar{c}}\, P(c_k,c_0,\ldots ,c_n,{\bar{c}})}{\int d{\bar{c}}\, P(c_0,\ldots ,c_n,{\bar{c}})} \nonumber \\&= \frac{\int d{\bar{c}}\, P(c_k,c_0,\ldots ,c_n|{\bar{c}})P_0({\bar{c}})}{\int d{\bar{c}}\, P(c_0,\ldots ,c_n|{\bar{c}})P_0({\bar{c}})} \nonumber \\&= \frac{\int d{\bar{c}}\, P(c_k|{\bar{c}}) P(c_0|{\bar{c}}) \cdots P(c_n|{\bar{c}}) P_0({\bar{c}})}{\int d{\bar{c}}\, P(c_0|{\bar{c}}) \cdots P(c_n|{\bar{c}}) P_0({\bar{c}})} \end{aligned}$$
(2.22)

where we have used the relation between the joint probability and the conditional probability \(P(A,B)=P(A|B)P(B)\) in the first step, introduced the hidden parameter in the second step, used again the definition of the conditional probability in the third step and finally used Eq. (2.19). The last line is written in terms of known functions (the likelihood and the prior), and can thus be easily computed. This result can be easily generalized to the joint probability of more than one unknown coefficient,

$$\begin{aligned}&P(c_{k_1},\ldots ,c_{k_m}|c_0,\ldots ,c_n) \nonumber \\&\quad = \frac{\int d{\bar{c}}\, P(c_{k_1}|{\bar{c}})\cdots P(c_{k_m}|{\bar{c}}) P(c_0|{\bar{c}}) \cdots P(c_n|{\bar{c}}) P_0({\bar{c}})}{\int d{\bar{c}}\, P(c_0|{\bar{c}}) \cdots P(c_n|{\bar{c}}) P_0({\bar{c}})},\nonumber \\&\qquad k_1,\ldots ,k_m>n. \end{aligned}$$
(2.23)

At this point one can also compute, at least formally, the probability distribution for the full sum, which is given by

$$\begin{aligned}&P(\Sigma _{\mathrm{pert}}|c_0,\ldots ,c_n) \nonumber \\&\quad = \int dc_{n+1}dc_{n+2}\cdots \, P(c_{n+1},c_{n+2},\ldots |c_0,\ldots ,c_n) \nonumber \\&\qquad \times \delta \left( \Sigma _{\mathrm{pert}}-\sum _{k=0}^\infty c_k\alpha ^k\right) . \end{aligned}$$
(2.24)

Since this is an infinite-dimensional integration, it is impossible to perform it numerically and too hard to compute it analytically. Therefore, one can approximate the full sum with a truncated sum at some finite order \(n+j\) to get

$$\begin{aligned}&P(\Sigma _{\mathrm{pert}}|c_0,\ldots ,c_n) \simeq \int dc_{n+1}\cdots dc_{n+j}\nonumber \\&\quad \, P(c_{n+1},\ldots ,c_{n+j}|c_0,\ldots ,c_n) \nonumber \\&\quad \delta \left( \Sigma _{\mathrm{pert}}-\sum _{k=0}^{n+j} c_k\alpha ^k\right) , \end{aligned}$$
(2.25)

which can be easily handled, at least numerically. The easiest approximation is obtained with \(j=1\), where only the first missing higher order is used to approximate the distribution, and it leads to a simple analytical expression [3].
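For reference, performing the \({\bar{c}}\) integrals of Eq. (2.22) with the likelihood Eq. (2.20) and the prior Eq. (2.21) gives a density that is flat for \(|c_{n+1}|\) below the largest known \(|c_k|\), with power-law tails beyond. The sketch below implements this expression as we derive it here (see Ref. [3] for the original formula), together with the \(j=1\) approximation of Eq. (2.25):

```python
import numpy as np

def ch_next_coeff_density(c_next, known):
    """P(c_{n+1}|c_0,...,c_n) from Eq. (2.22) with likelihood (2.20) and
    prior (2.21): flat within the maximal known |c_k|, power-law beyond."""
    nc = len(known)                          # number of known coefficients
    cbar = np.max(np.abs(known))             # max |c_k| among known orders
    x = np.abs(np.asarray(c_next, dtype=float))
    tail = np.where(x <= cbar, 1.0, (cbar / np.maximum(x, cbar)) ** (nc + 1))
    return nc / (nc + 1.0) / (2.0 * cbar) * tail

def ch_observable_density(sigma, known, alpha):
    """j = 1 version of Eq. (2.25): density of the observable obtained by
    mapping Sigma to c_{n+1} = (Sigma - Sigma_n)/alpha^{n+1}."""
    sigma_n = sum(c * alpha ** k for k, c in enumerate(known))
    jac = alpha ** len(known)                # alpha^{n+1}, the Jacobian
    return ch_next_coeff_density((np.asarray(sigma) - sigma_n) / jac,
                                 known) / jac
```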

The CH approach is a breakthrough in the context of estimating the theory uncertainty from missing higher orders, as it provides for the first time a probabilistic way to determine the uncertainty of an observable computed in perturbation theory. Note that this approach only considers the behaviour of the expansion, without using any information from the scale dependence. This is exactly the opposite of the canonical scale variation method, which is based on the scale dependence and does not use any information on the behaviour of the expansion.

Despite the nice properties of the CH approach, there are some caveats that need to be considered. The most obvious one is the assumption Eq. (2.18), which implies that the perturbative expansion is bounded by a convergent (geometric) series,

$$\begin{aligned} \left| \Sigma _{\mathrm{pert}} \right| = \left| \sum _{k=0}^\infty c_k \alpha ^k \right| \le \sum _{k=0}^\infty \left| c_k \right| \alpha ^k \le \sum _{k=0}^\infty {\bar{c}} \alpha ^k = \frac{{\bar{c}}}{1-\alpha }, \end{aligned}$$
(2.26)

where we have assumed \(\alpha <1\) (which is consistent with the perturbative hypothesis). This is in contrast with the known fact that perturbative expansions are divergent. In a subsequent paper, Ref. [4], the CH approach has been modified to account for the divergence of the series, by modifying the condition Eq. (2.18) into

$$\begin{aligned} \left| c_k \right| \le {\bar{b}} k!\qquad \forall k, \end{aligned}$$
(2.27)

with \({\bar{b}}\) being the new hidden parameter, with the same prior as \({\bar{c}}\). This condition on the coefficients is much less stringent and is compatible with the assumption that the divergence of the series is dominated by a factorial growth, such as that due to renormalons. However, with this choice it is no longer possible to use Eq. (2.24) to compute the full sum, as the latter does not exist. Therefore only the approximation Eq. (2.25) can be considered, and typically with a low value of j, otherwise the probability distribution becomes broad due to the factorial growth. In Ref. [4] only \(j=1\) is considered.

The second issue is related to the fact that Eq. (2.18) does not account for a possible power growth of the coefficients. In other words, each term of the perturbative expansion is assumed to have a power scaling given just by \(\alpha \). This limitation was stressed already in the first CH paper [3], where they propose to solve it by rescaling \(\alpha \),

$$\begin{aligned} \sum _{k=0}^\infty c_k \alpha ^k = \sum _{k=0}^\infty c_k' \left( \frac{\alpha }{\eta }\right) ^k \end{aligned}$$
(2.28)

to obtain new coefficients \(c_k'\) that satisfy Eq. (2.18), or Eq. (2.27) in the factorially divergent hypothesis of Ref. [4]. The trouble is how to find such a rescaling factor \(\eta \). In Ref. [4], a global survey over a quite large number of observables is proposed to determine an optimal value of \(\eta \). In this survey the uncertainty computed at the next-to-last known order is compared with the actual (known) next order, to quantify how reliable the uncertainty is for each given value of the rescaling factor. Apart from the details (for which we refer the Reader to Ref. [4]), we stress that this approach assumes that the rescaling factor is the same for all the observables.Footnote 12 However, this is hardly the case, as different processes and observables are characterized by different dominant perturbative corrections.Footnote 13 Therefore, it is more appropriate to assume that the rescaling factor \(\eta \) is process and observable dependent. A different way to obtain it has been proposed in Ref. [9], where a fitting procedure is suggested to find an optimal value of \(\eta \) such that the known \(c_k\) are all of the same order. This method is observable dependent and uses only the information on the perturbative expansion to obtain the optimal rescaling. However, a fitting procedure to determine the rescaling factor clashes with the probabilistic nature of the rest of the procedure.

Finally, the CH approach and its modified versions do not deal with scale dependence. The CH machinery is applied to the perturbative expansion at a given value of the scale, and if one changes the scale the result changes accordingly. The difference between the final probability distributions at different values of the scale can be sizeable, see e.g. Ref. [5].

2.5 A working example: Higgs production at the LHC

In Sect. 7 we will consider various examples of perturbative expansions, and apply our methods to each of them. Nevertheless, in order to be clearer when discussing our new proposals, we think it is instructive to have a working example to immediately visualize how the various methods work.

For this purpose, the observable we choose is the inclusive cross section for Higgs production in gluon fusion at the LHC. This process has a number of advantages:

  • it is known up to \(\hbox {N}^3\hbox {LO}\) [44,45,46,47] (four orders in the perturbative expansion) in the so-called large top mass effective theory (this is a rarity in QCD processes at LHC, most of which are only known to NLO or NNLO);

  • it is characterized by large perturbative corrections;

  • canonical scale variation underestimates the impact of (large) higher orders;

  • its factorization scale dependence is very mild, so the whole scale dependence is basically fully captured by its renormalization scale dependence;

  • because the process starts at \({\mathcal {O}}(\alpha _s^2)\), the LO is scale dependent;

  • it is a real process and not a toy example.

On top of these reasons, the process is interesting also from a phenomenological point of view (see e.g. Ref. [48]), and indeed it has been the subject of several investigations of theory uncertainties from missing higher orders [4, 5, 8, 46].

Fig. 4 The Higgs production cross section in gluon fusion at the LHC with \(\sqrt{s}=13\) TeV and \(m_{H}=125\) GeV, for \(\mu _{\scriptscriptstyle \mathrm F}=m_{H}/2\), as a function of the renormalization scale \(\mu \), at the four known perturbative orders

Specifically, we consider the LHC at \(\sqrt{s}=13\) TeV and set the Higgs mass to \(m_{H}=125\) GeV. We fix the factorization scale to \(\mu _{\scriptscriptstyle \mathrm F}=m_{H}/2\) (which is a standard choice [48]), even though changing this value has a negligible effect on the cross section, in particular at high orders. The “raw” result of the computation, namely the cross section as a function of the renormalization scale \(\mu \), is plotted in Fig. 4. Note that this process depends on a single hard scale, the Higgs mass \(m_{H}\). Therefore, it is natural to choose \(\mu \) of the order of \(m_{H}\), in order to avoid the presence of large unresummed logarithms of \(\mu /m_{H}\) in the perturbative coefficients. Nevertheless, we believe it is instructive to visualize the scale dependence over a wide range of scales: the plot covers almost four decades in \(\mu /m_{H}\).

We see that for large values of the scale, where the QCD coupling is smaller, the expansion is characterized by all-positive contributions and progresses very slowly, with large perturbative corrections. Conversely, at small scales, where the strong coupling blows up, the expansion is highly unstable. In a “central” region, where \(\mu \sim m_{H}\), the expansion behaves in a reasonably perturbative way, even though the perturbative corrections are rather large and it is not at all clear what the true cross section could possibly be. Note also the presence of a stationary point at NNLO and of two stationary points at \(\hbox {N}^3\)LO, which could corrupt an estimate of the uncertainty based on canonical scale variation.

We stress that the full plot of Fig. 4 can be constructed from just the sequence of partial sums of the observable at the various orders at a given scale, and the knowledge of the value of the coupling at that scale. In this example, we have at \(\mu =m_{H}\) (and \(\mu _{\scriptscriptstyle \mathrm F}=m_{H}/2\))Footnote 14

$$\begin{aligned} \alpha _s(m_{H})=0.1126, \qquad \Sigma _{\mathrm{pert}}(m_{H}) = \left\{ 13.0,\, 30.7,\, 41.8,\, 46.3 \right\} \,\text {pb}, \end{aligned}$$
(2.29)

where the values in curly brackets correspond to the cross section at LO, NLO, NNLO and \(\hbox {N}^3\hbox {LO}\) respectively. The way to reconstruct the cross section at any scale from these ingredients is discussed in Sect. A.1.
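As a rough illustration of the idea (the actual procedure of Appendix A.1 uses higher-loop running), the following sketch re-expands the coefficients at a new scale assuming one-loop running only; the offset \(k_0=2\) and \(n_f=5\) in the usage lines match the Higgs case:

```python
import math

def evolve_coeffs(c0, alpha0, mu0, mu, k0=2, nf=5):
    """One-loop sketch of the scale evolution: re-expand alpha(mu0) in
    powers of alpha(mu) and collect coefficients up to the known order.
    `c0` are the coefficients of alpha^{k+k0} at the scale mu0."""
    b0 = (33 - 2 * nf) / (12 * math.pi)       # one-loop QCD beta coefficient
    L = 2.0 * math.log(mu / mu0)              # log(mu^2 / mu0^2)
    alpha = alpha0 / (1.0 + b0 * alpha0 * L)  # one-loop running coupling
    # alpha(mu0)^p = alpha^p * sum_j binom(p+j-1, j) * (b0*L*alpha)^j
    c = [sum(c0[k] * math.comb(k0 + m - 1, m - k) * (b0 * L) ** (m - k)
             for k in range(m + 1)) for m in range(len(c0))]
    return alpha, c

# usage with the inputs of Eq. (2.29): convert partial sums to coefficients
sums, a0 = [13.0, 30.7, 41.8, 46.3], 0.1126   # pb, at mu0 = mH
c0 = [(s - p) / a0 ** (k + 2)
      for k, (p, s) in enumerate(zip([0.0] + sums, sums))]
alpha, c = evolve_coeffs(c0, a0, mu0=125.0, mu=250.0)
print(sum(ck * alpha ** (k + 2) for k, ck in enumerate(c)))  # sigma at 2*mH
```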

3 Model-independent features of the inference approach at fixed scale

In this section we start introducing our notation and present some general features of the construction of the models. For the time being we consider only the models at a fixed scale. How to obtain scale-independent probability distributions will be discussed in Sect. 6.

3.1 Basic notations

Let us denote by \(\Sigma _n(\mu )\) the partial sum of the perturbative series up to order n, as a function of the scale \(\mu \). If we are considering a standard perturbative expansion in powers of \(\alpha \), this is given by

$$\begin{aligned} \Sigma _n(\mu ) = \sum _{k=0}^n c_k(\mu ) \alpha ^{k+k_0}(\mu ), \end{aligned}$$
(3.1)

where we have explicitly introduced an offset \(k_0\) for observables starting at \({\mathcal {O}}\left( \alpha ^{k_0}\right) \). This notational change with respect to the previous section ensures that the first order, \(k=0\), namely the leading order (LO), is non-zero. Note that the information carried by the coefficients \(c_k\) is fully contained in the sequence of partial sums

$$\begin{aligned} \Sigma _0, \Sigma _1, \Sigma _2, \ldots \end{aligned}$$
(3.2)

once the values of \(\alpha \) and \(k_0\) are specified. From now on, we shall consider the partial sums as the basic objects, and forget about the coefficients \(c_k\) and even about \(\alpha \). In this way, the “perturbative expansion” is more general, as it no longer needs to be a strict expansion in powers of \(\alpha \). For instance, it can be a logarithmic-ordered expansion of a resummed result, or a (non-)linear transformation of the perturbative expansion. In what follows we simply assume that \(\Sigma _n\) represents the partial sum of a non-specified expansion that behaves perturbatively, defined such that the “LO” \(\Sigma _0\) is non-zero.

For a number of reasons that will become clear later, it is convenient to introduce a normalized version of the expansion, where the LO is factored out,

$$\begin{aligned} \Sigma _n(\mu ) = \Sigma _0(\mu )\sum _{k=0}^n \delta _k(\mu ), \end{aligned}$$
(3.3)

where we have defined the dimensionless coefficientsFootnote 15

$$\begin{aligned} \delta _k(\mu ) \equiv \frac{\Sigma _k(\mu )-\Sigma _{k-1}(\mu )}{\Sigma _0(\mu )}. \end{aligned}$$
(3.4)

According to this definition, \(\delta _0=1\) always, so we can also write

$$\begin{aligned} \Sigma _n(\mu ) = \Sigma _0(\mu )\left( 1+\sum _{k=1}^n \delta _k(\mu ) \right) . \end{aligned}$$
(3.5)

The coefficients \(\delta _k\) contain the information on the perturbative orders. The fact that \(\delta _0=1\) sets a common size for all perturbative expansions (useful when defining the model), and also tells us that the LO does not contain any useful information on the behaviour of the expansion and thus on the uncertainty due to missing higher orders (which is obvious, as from the LO alone one cannot know how large the perturbative corrections will be). This is true even when the LO is scale dependent, because without another order to compare with one cannot know how to reliably translate the scale dependence into information on the size of missing higher orders. Only when at least two orders, namely \(\delta _0\) and \(\delta _1\), are known can one start making inference. The only role of the LO is to set the dimension and the size of the observable through the prefactor \(\Sigma _0(\mu )\).
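For instance, applying Eq. (3.4) to the Higgs partial sums of Eq. (2.29) gives:

```python
# delta_k of Eq. (3.4) from the partial sums of Eq. (2.29), at mu = mH
sums = [13.0, 30.7, 41.8, 46.3]                      # pb: LO ... N3LO
deltas = [1.0] + [(sums[k] - sums[k - 1]) / sums[0]  # delta_0 = 1 by definition
                  for k in range(1, len(sums))]
print([round(d, 2) for d in deltas])                 # [1.0, 1.36, 0.85, 0.35]
```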

3.2 Definition of the scale-dependence numbers

Because the expansion is scale dependent, as discussed in Sect. 2.3 one can construct numbers \(r_k(\mu )\) that encode this dependence at each order. The importance of these numbers stems from the observation that the scale dependence of a physical observable is formally of higher order, Eq. (2.15), namelyFootnote 16

$$\begin{aligned} r_k(\mu )={\mathcal {O}}\left( \alpha ^{k+1}\right) . \end{aligned}$$
(3.6)

The actual construction of these numbers must be such that this equation is satisfied. The simplest measure of the scale dependence is the slope of \(\Sigma _k(\mu )\) as a function of \(\mu \), or more precisely its derivative with respect to \(\log \mu \) (the logarithm is natural because the dependence on the scale is logarithmic). In order to directly compare these numbers with the \(\delta _k\) coefficients, it is convenient to normalise the derivative to the observable itself to make them dimensionless. The two most natural ways to do this are

$$\begin{aligned} \frac{1}{\Sigma _0(\mu )}\,\mu \frac{d}{d\mu }\Sigma _k(\mu ) \qquad \text {or}\qquad \frac{1}{\Sigma _k(\mu )}\,\mu \frac{d}{d\mu }\Sigma _k(\mu ), \end{aligned}$$
(3.7)

where in the first case we have normalized to the LO \(\Sigma _0\), while in the second case we have normalized to the observable at the same order at which the scale dependence is computed. If the observable does not change much between orders, the two options are equivalent. However, in the presence of large perturbative corrections there can be a substantial difference between the two. Neither of them is better in an absolute sense. However, we argue that the second option has a nicer perturbative behaviour, with higher order \(r_k\) being typically smaller than lower order ones, as one would expect from Eq. (3.6).

Fig. 5 The absolute value of the normalized slopes defined in Eq. (3.7), shown in the left and right plots respectively, for the Higgs production process

To illustrate this, we consider the example of Higgs production introduced in Sect. 2.5, and show in Fig. 5 the absolute value of the two options in Eq. (3.7). From the left plot, corresponding to the left option in Eq. (3.7), we see that at NLO (green curve) the slope is always larger than the LO one (black curve), and at the next orders there is a large variability without a precise hierarchy. Conversely, in the right plot, corresponding to the right option in Eq. (3.7), the higher order curves are smaller than the lower order ones over a wide range of scales, which is the expected perturbative behaviour. So the definition of the \(r_k\) coefficients that we propose is morally given by

$$\begin{aligned} r_k(\mu ) \simeq \left| \frac{1}{\Sigma _k(\mu )}\,\mu \frac{d}{d\mu }\Sigma _k(\mu ) \right| = \left| \mu \frac{d}{d\mu }\log \Sigma _k(\mu ) \right| , \end{aligned}$$
(3.8)

where the absolute value is introduced to keep only the information on the size of the dependence, not on its sign. Note that at small scales, where the perturbative expansion is unstable (Fig. 4), the expected hierarchy is violated. This is also the case in the proximity of stationary points. To avoid problems arising from stationary points, we propose to define the \(r_k\) numbers in a slightly different way. Namely, we replace the derivative with a finite difference, and look for the largest finite difference in a range around the point \(\mu \). In formulae, we haveFootnote 17

$$\begin{aligned} r_k(\mu ) \equiv \frac{1}{\left| \Sigma _k(\mu ) \right| }\,\max _{\mu /f\le \nu \le f\mu }\left| \frac{\Sigma _k(\nu ) - \Sigma _k(\mu )}{\log (\nu /\mu )} \right| , \end{aligned}$$
(3.9)

that selects the largest finite difference in the range of scales between \(\mu /f\) and \(f\mu \), where \(f>1\) is a factor to be fixed. This definition is more robust than the one based on the derivative.

Fig. 6 The coefficients \(r_k(\mu )\) (Eq. 3.9), for \(f=2\) (left) and \(f=4\) (right), for the Higgs production process

We show this in Fig. 6, for \(f=2\) (left plot) and \(f=4\) (right plot). Increasing f reduces the sensitivity to stationary points, and indeed in the right plot the expected hierarchy is preserved over a wide range of scales. At low scales, the value of \(r_k\) at NNLO and \(\hbox {N}^3\hbox {LO}\) blows up, because the finite difference probes the low-scale region where the perturbative result is unstable (Fig. 4). For larger values of f, the blow-up region obviously moves to larger scales. Therefore, the robustness gained by increasing f has to be balanced against the loss of stability. We believe that \(f=4\) can be considered a reasonable compromise.Footnote 18
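A minimal implementation of Eq. (3.9) is sketched below, with \(f=4\) as default; the callable sigma_k, returning the order-k partial sum at a given scale, is assumed to be provided (e.g. via the reconstruction of Appendix A.1):

```python
import numpy as np

def r_k(sigma_k, mu, f=4.0, n_scan=200):
    """Scale-dependence number of Eq. (3.9): largest finite difference of
    Sigma_k over nu in [mu/f, f*mu], normalized to |Sigma_k(mu)|."""
    center = sigma_k(mu)
    nus = mu * np.exp(np.linspace(-np.log(f), np.log(f), n_scan))
    nus = nus[np.abs(np.log(nus / mu)) > 1e-6]   # exclude nu = mu itself
    return max(abs((sigma_k(nu) - center) / np.log(nu / mu))
               for nu in nus) / abs(center)
```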

We stress that for an efficient computation of the \(r_k(\mu )\) coefficients a fast evaluation of the scale dependence of the observable is needed. Since the renormalization scale dependence is universal and governed by the \(\beta \)-function of the theory, it can be constructed automatically from the knowledge of the observables at the various orders at a single value of the scale. The details are reported in Appendix A.1.

We finally note that if the LO is scale independent, then \(r_0=0\). Given that we aim at using the \(r_k\) numbers to estimate the higher orders, a zero value would be inappropriate. More precisely, \(r_0=0\) means that the LO cannot be used to probe higher orders through scale variation. We can fix this by assuming that, when the LO is scale independent, we can only say that the NLO corrections may be of the order of the LO itself. Given the definition of the \(r_k\) as normalized quantities, this corresponds to assuming \(r_0={\mathcal {O}}(1)\). For definiteness, we arbitrarily set \(r_0(\mu )=1/2\) in these cases. We shall see later that this assumption is rather conservative.

3.3 General features of the models

Before discussing the actual models that we propose, we want to give some general features that are common to all of them. Let us recall that the goal of this work is to construct a probability distribution for \(\Sigma \) given the first known \(n+1\) orders \(\Sigma _0,\Sigma _1,\ldots ,\Sigma _n\), or equivalently \(\Sigma _0,\delta _1,\ldots ,\delta _n\) (using our new notation). This was given in Eq. (2.8), or, when working at fixed scale, in Eq. (2.14).

According to our assumption Eq. (2.7) that the uncertainty of a theoretical prediction based on perturbation theory is dominated by the missing higher orders, we shall compute the distribution for \(\Sigma \) through these missing higher orders, ignoring the other contributions from the asymptotic expansion truncation and the non-perturbative part. In other words, we shall approximate the observable \(\Sigma \) as

$$\begin{aligned} \Sigma \simeq \Sigma _{k}(\mu ), \qquad k\le k_{\mathrm{asympt}}, \end{aligned}$$
(3.10)

where the best approximation is obtained using \(k=k_{\mathrm{asympt}}\), namely including all the missing higher orders up to the point at which the asymptotic expansion starts to grow. Since we typically do not know a priori the value of \(k_{\mathrm{asympt}}\), the best we can do is to include a few extra orders beyond the known ones, without exaggerating, so as not to risk going beyond \(k_{\mathrm{asympt}}\).

The inference on the observable \(\Sigma \) can be obtained in terms of the more fundamental inference of the higher orders from the known ones. This was already done in Sect. 2.4, and we repeat that derivation here in more generality and with our new notation. Assuming we know the coefficients of the expansion up to order n and we approximate the observable with its expansion up to order \(n+j\) with \(j>0\), \(n+j\le k_{\mathrm{asympt}}\), the probability distribution is given by (suppressing for ease of notation the implicit dependence on the hypothesis H)

$$\begin{aligned}&P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\nonumber \\&\quad = \int d\delta _{n+j}\cdots d\delta _{n+1}\, P(\Sigma |\delta _{n+j},\ldots ,\delta _1,\Sigma _0,\mu )\nonumber \\&\qquad \times P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\nonumber \\&\quad \simeq \int d\delta _{n+j}\cdots d\delta _{n+1}\, \delta \left( \Sigma -\Sigma _{n+j}(\mu )\right) \nonumber \\&\qquad \times P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu ), \end{aligned}$$
(3.11)

where we have used in the last step the \((n+j)\)-th order approximation

$$\begin{aligned} P(\Sigma |\delta _{n+j},\ldots ,\delta _1,\Sigma _0,\mu ) \simeq \delta \left( \Sigma -\Sigma _{n+j}(\mu )\right) , \end{aligned}$$
(3.12)

which is a direct consequence of Eq. (3.10), and the definition of \(\Sigma _n(\mu )\), Eq. (3.5).Footnote 19 If \(\Sigma \) is a positive definite observable, like a cross section, one could also impose a positivity constraint through a factor \(\theta (\Sigma )\), which however requires the computation of a normalization factor as in general the distribution is no longer normalized. The result Eq. (3.11) is written in terms of \(P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\), which represents the probability of the higher orders given the known ones, and depends on the model under consideration. Note that we have also included \(\Sigma _0\) among the known information, even though it is just a prefactor: indeed \(\Sigma _0\) is required to compute the \(r_k\) numbers introduced in Sect. 3.2 that are needed in models that use information on the scale dependence.

The delta function in Eq. (3.11) can be used to perform the integral over one of the higher orders, say \(\delta _{n+j}\), which gives

$$\begin{aligned}&P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\nonumber \\&\quad \simeq \frac{1}{\Sigma _0(\mu )} \int d\delta _{n+j-1}\cdots d\delta _{n+1} \nonumber \\&\qquad \times P\left( \delta _{n+j}=\frac{\Sigma -\Sigma _{n+j-1}(\mu )}{\Sigma _0(\mu )}, \delta _{n+j-1},\right. \nonumber \\&\qquad \quad \left. \qquad \ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu \right) . \end{aligned}$$
(3.13)

The other integrations are more complicated, and should be performed numerically (unless the model is particularly simple). The simplest result is obtained when considering \(j=1\), namely when approximating the observable using just the first unknown order,

$$\begin{aligned}&P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\nonumber \\&\quad \overset{j=1}{\simeq }\frac{1}{\Sigma _0(\mu )} P\left( \delta _{n+1}=\frac{\Sigma -\Sigma _{n}(\mu )}{\Sigma _0(\mu )}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu \right) . \end{aligned}$$
(3.14)

In practice, since the order \(n+j\) that best approximates the observable is not known a priori, a convenient approach to choosing j properly is the following. We start by considering the simplest approximation, \(j=1\) (Eq. 3.14). Then, we include the next order, namely we use \(j=2\). If the distribution changes visibly, we further increase j by one, and so on until the distribution changes only mildly, up to a tolerance decided by the user.
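A minimal sketch of this stopping rule; `dist_for_j` is a hypothetical callable returning the density of Eq. (3.13) for a given j, evaluated on a fixed grid of \(\Sigma \) values, and the tolerance is the user's choice:

```python
import numpy as np

def choose_j(dist_for_j, tol=0.05, j_max=6):
    """Increase j until the density of Eq. (3.13) changes only mildly.

    dist_for_j(j) is assumed to return P(Sigma|...) with j missing higher
    orders included, sampled on a fixed grid of Sigma values."""
    prev = np.asarray(dist_for_j(1))
    for j in range(2, j_max + 1):
        cur = np.asarray(dist_for_j(j))
        if np.max(np.abs(cur - prev)) <= tol * np.max(np.abs(prev)):
            return j - 1  # adding one more order changed the density only mildly
        prev = cur
    return j_max
```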

We now turn our attention to the probability \(P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\). As we said, this is model dependent. However, we can further write it in terms of more fundamental probabilities, using the relation

$$\begin{aligned}&P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\nonumber \\&\quad =\frac{P(\delta _{n+j},\ldots ,\delta _{n+1},\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )}{P(\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )}. \end{aligned}$$
(3.15)

The numerator and the denominator are the same object, simply with a different number of \(\delta _k\) terms. In the numerator only some of them are known while the others are unknown, but mathematically this does not make any difference. The joint distribution \(P(\delta _m,\ldots ,\delta _1,\Sigma _0|\mu )\) at fixed scale is the basic object of the model that is needed to make inference on the higher orders. In all the models we will consider, we will assume that there is a number of hidden parameters characterizing the model. Denoting with \(\vec p\) the vector of such parameters, the joint distribution can be written as

$$\begin{aligned}&P(\delta _m,\ldots ,\delta _1,\Sigma _0|\mu ) = \int d\vec p\; P(\delta _m,\ldots ,\delta _1,\Sigma _0,\vec p|\mu ) \nonumber \\&\quad = \int d\vec p\; P(\delta _m,\ldots ,\delta _1,\Sigma _0|\vec p,\mu ) P_0(\vec p|\mu ), \end{aligned}$$
(3.16)

where in the second line we have written explicitly the prior of these parameters, as they are hidden and thus can never be known exactly.

Fig. 7

The most general Bayesian network of models of inference of missing higher orders (left), and a more specific one, using explicitly the scale variation numbers \(r_k\), that covers all the cases considered in this work (except the one of Sect. B.2)

To compute the joint distribution, it is sometimes useful to visualize the relation between the different objects using a Bayesian network. The most general network for any model of theory uncertainties is rather simple, and it is depicted in Fig. 7 (left) showing only the first four orders (the generalization to more than four orders is obvious). The various orders depend in general on all previous orders (but not on future ones, otherwise the model cannot be predictive), and on the hidden parameters \(\vec p\). Since \(\Sigma _0\) is the first order and it only sets the size of the observable, it does not depend on anything. In addition to this general structure, we want to consider more explicitly the role of the scale dependence numbers \(r_k\). Since these are functions of the \(\delta _k\) coefficients and \(\Sigma _0\), there is no need to specify them in the network. However, it may be useful to introduce them explicitly in order to better appreciate their role. Therefore, in the same Fig. 7 (right) we also show a network depending explicitly on these \(r_k\). The dashed arrows represent deterministic links, namely analytic relations rather than probabilistic ones, and mean that the \(r_k\) numbers are computable analytically from the various orders. Note that this network is not as general as the previous one. Indeed, we have now made the assumption that each \(\delta _k\) depends only on the previous \(\delta _{k-1}\) and \(r_{k-1}\) (and on \(\vec p\)), but not on all the previous orders. This simplification is not necessary, but it will be adopted in all our models (except the one of Sect. B.2).

4 Model 1: geometric behaviour model

We now present our first model, which uses only information on the behaviour of the expansion. As such, this model is very general, and not restricted to a QFT application.

4.1 The hypothesis of the model

The first model that we consider is a generalization of the Cacciari–Houdeau model introduced in Sect. 2.4. The main difference is that our model accounts for a possible power growth of the coefficients of the expansion within the probabilistic approach. In the CH model each term of the expansion is bounded by

$$\begin{aligned} \left| c_k\alpha ^k \right| \le {\bar{c}} \alpha ^k\qquad \forall k, \end{aligned}$$
(4.1)

which is Eq. (2.18) in which we have emphasised that the power behaviour of the full \({\mathcal {O}}(\alpha ^k)\) term is entirely described by \(\alpha ^k\). As we have discussed in Sect. 2.4, this hypothesis is hardly satisfied, and a variant of the CH method in which the expansion parameter \(\alpha \) is rescaled by a factor \(\eta \) is advisable. So far, this has never been done within the context of the probabilistic model.

Now, in our new notation (Eq. 3.5), we have lost information on the coupling \(\alpha \), as all the information is contained in the \(\delta _k\) coefficients, Eq. (3.4). This was done on purpose and is to be considered an advantage, as the expansion Eq. (3.5) is more general than a strict expansion in powers of \(\alpha \). If we want to translate the CH condition Eq. (4.1) into the new language, we are forced to reintroduce an expansion parameter. Since from our point of view this is a new parameter (as we have lost information about \(\alpha \)), it is natural to consider it as a parameter of the model, rather than an external one. As such, it is not fixed to be \(\alpha \) or a fraction of it. The condition that we consider is thus

$$\begin{aligned} \left| \delta _k(\mu ) \right| \le c a^k\qquad \forall k< k_{\mathrm{asympt}}, \end{aligned}$$
(4.2)

where both c and a are positive hidden parameters of the model. This condition implies that the expansion is bounded by a geometric expansion, and we thus call this model a geometric behaviour model. We have specified that this bound can only be valid for orders k smaller than \(k_{\mathrm{asympt}}\), otherwise it is certainly violated. This is anyway the only region we are interested in, according to the discussion in Sect. 2.1.

Equation (4.2), though very similar to the CH condition Eq. (4.1), differs from it in a number of very important aspects, which we now list.

  • The fact that a is a parameter not only makes the model more general than CH, but also allows one to find, through inference, the most appropriate values (in a probabilistic sense) of the expansion parameter a compatible with the behaviour of the expansion. In other words, a can be interpreted as the rescaled expansion parameter \(\alpha /\eta \) introduced in Sect. 2.4, but with the rescaling factor \(\eta \) being determined through inference from the perturbative expansion itself, as opposed to the approaches of Refs. [4, 9].

  • The parameter c is dimensionless, as opposed to \({\bar{c}}\) which has the dimension of the observable. This is useful as we can legitimately use a universal prior for c without knowing anything about the observable.Footnote 20

  • Since the condition Eq. (4.2) is limited to \(k< k_{\mathrm{asympt}}\), the fact that the perturbative expansion is typically factorially divergent does not imply that the geometric bound is unacceptable. Of course one cannot say that the entire series is bounded by a geometric series, but a small portion of it may well be. Therefore, the condition Eq. (4.2) limited to the first few orders can be considered as perfectly acceptable.Footnote 21

The CH assumption Eq. (2.18) can be recovered from this new approach by fixing \(a=\alpha \) (or \(a=\alpha /\eta \) in the rescaled variant) and rewriting \(c={\bar{c}}/\Sigma _0(\mu )\). Note that the hidden parameters c, a depend on the scale \(\mu \). This has not been written explicitly, because in a statistical language this information is expressed by saying that c and a are correlated with \(\mu \).

In order to construct a probabilistic model to estimate theory uncertainties, we need to translate the condition Eq. (4.2) into a likelihood function. We assume the simple conditional probability

$$\begin{aligned} P(\delta _k| c, a, \mu ) = \frac{1}{2ca^k}\theta \left( ca^k - \left| \delta _k(\mu ) \right| \right) , \qquad k<k_{\mathrm{asympt}}, \end{aligned}$$
(4.3)

which is the straightforward extension of the CH choice, Eq. (2.20). We have considered the idea of allowing violations of the bound, by adding tails to the likelihood. However, having two hidden parameters already makes the model much more flexible than the original CH model, so allowing a violation of the bound would not lead to any substantial improvement in the model stability.

In analogy with the CH model, we assume that all \(\delta _k\) are independent of each other at fixed c, a and \(\mu \), namely

$$\begin{aligned}&P(\delta _k,\delta _j| c, a, \mu ) \nonumber \\&\quad = P(\delta _k| c, a, \mu ) P(\delta _j| c, a, \mu ) \qquad \forall k,j, \quad k\ne j, \end{aligned}$$
(4.4)

which generalizes to any set of \(\delta _k\) coefficients. These conditions, together with the prior distributions for the hidden parameters that we are going to discuss, are sufficient to fully define the model. The Bayesian network of this model is a simplified version of the general one introduced in Sect. 3.3, and is depicted in Fig. 8.

Fig. 8

Bayesian network for the geometric behaviour model. Each \(\delta _k\) depends only on the hidden parameters (and \(\mu \)) through the likelihood Eq. (4.3)

4.2 The choice of priors

The two hidden parameters c, a need a prior distribution. We assume that a priori the two variables are uncorrelated,

$$\begin{aligned} P_0(c,a|\mu ) = P_0(c|\mu ) P_0(a|\mu ), \end{aligned}$$
(4.5)

where we have emphasised that in principle the prior can depend on the scale \(\mu \), considered to be externally given for the moment (this will change in Sect. 6).

Before proposing a functional form for the prior, let us comment on the first step of the inference. The information from the LO is encoded in \(\Sigma _0\), which appears as a prefactor in our normalized expansion (Eq. 3.5). Since the likelihood Eq. (4.3) looks at the \(\delta _k\) and not at \(\Sigma _0\), the knowledge of the LO does not change our prior: \(P(c,a|\Sigma _0,\mu )=P_0(c,a|\mu )\). While this is strictly speaking correct according to our notation, it misses a conceptual point. Indeed, there exists a \(\delta _k\) also for the LO, namely \(\delta _0\). The fact that \(\delta _0=1\) makes it a trivial variable, in the sense that it carries no information, which is the reason why it does not appear in our functions. However, from a mathematical viewpoint, it would play a role if we assume that the likelihood Eq. (4.3) is also valid for the LO. Indeed, at order zero, the likelihood becomes

$$\begin{aligned} P(\delta _0|c,a,\mu ) = \frac{1}{2c}\theta \left( c-\left| \delta _0 \right| \right) = \frac{1}{2c}\theta \left( c-1\right) . \end{aligned}$$
(4.6)

This equation implies that the distribution for a is unmodified by the knowledge of the LO, while the distribution for c changes as

$$\begin{aligned} P(c|\delta _0,\mu )&\propto \int da\,P(\delta _0|c,a,\mu ) P_{0}(c,a|\mu )\nonumber \\&\propto \frac{P_{0}(c|\mu )}{c}\, \theta (c-1). \end{aligned}$$
(4.7)

This result shows that the requirement that the likelihood Eq. (4.3) applies also at LO implies the constraint \(c\ge 1\). Since \(\delta _0\) is not explicitly part of our parameters, we will not perform the inference in Eq. (4.7) in our model. In other words, our prior Eq. (4.5) is to be considered as the posterior after the (trivial and universal) knowledge of \(\delta _0=1\). We will keep track of this result by constructing the prior such that the condition \(c\ge 1\) is satisfied.

At this point we are free to choose the functional form of our prior. Note that it is convenient to use simple functional forms, such that analytic computations can be performed. Let us start with the prior for c. Since we do not have any a priori knowledge on the expected size of c, only a monotonic prior is acceptable. We find it reasonable to assume a power law function

$$\begin{aligned} P_0(c|\mu ) =\frac{\epsilon }{c^{1+\epsilon }} \theta (c-1), \qquad \epsilon >0, \end{aligned}$$
(4.8)

where we have included the \(\theta (c-1)\) for the reason explained above. Note that we do not include any dependence on the scale, namely for any value of \(\mu \) we use the same prior distribution. The parameter \(\epsilon \) is an arbitrary parameter, and can be chosen at will (we will discuss our favourite choices later in Sect. 4.4). We note that for \(\epsilon =1\) we obtain the form \(\theta (c-1)/c^2\), which results from a “pre-prior” (without knowledge of \(\delta _0\)) proportional to \(\theta (c)/c\), as is obvious from Eq. (4.7). This is the prior used for \({\bar{c}}\) in the CH model, Eq. (2.21). Since \(\theta (c)/c\) is an improper distribution, a regularization procedure is needed in CH to perform practical computations. Instead, when including the trivial information \(\delta _0=1\) within the model, the prior is a proper distribution, and the computations are simplified. This is one advantage of using from the start the universal information \(\delta _0=1\), which is in turn a consequence of using the normalized version of the expansion (Eq. 3.5).

We stress that the choice \(\theta (c)/c\) for the “pre-prior” corresponds to a flat (and thus non-informative) distribution for the variable \(\log c\), justified in the CH work by the argument that the order of magnitude of the hidden parameter is unknown. The other natural non-informative “pre-prior” is given simply by \(\theta (c)\), namely a flat distribution in the hidden parameter c, which is again improper. This choice corresponds to the value \(\epsilon =0\) in Eq. (4.8). With \(\epsilon =0\) also the prior Eq. (4.8) is improper, and indeed in that equation we have assumed \(\epsilon \) to be strictly greater than zero. In fact, computations can be easily performed also in the \(\epsilon \rightarrow 0\) limit with just a little care. We find however that this complication is not necessary: if we wish to mimic the effect of a flat “pre-prior” in c, we can just use a very small positive value for \(\epsilon \). This will indeed be our favourite choice, see Sect. 4.4. Variations of this parameter will be explored in Sect. 4.6.

As far as the parameter a is concerned, we have to make an initial choice about the expected behaviour of the expansion. Indeed, the geometric bound is convergent (namely, at finite order, decreasing with the order) only for \(a<1\). In principle, we could allow \(a\ge 1\), which would describe a (power) divergent behaviour of the expansion.Footnote 22 However, allowing \(a\ge 1\) is in contrast with the asymptotic nature of the expansion that we are assuming, see Sect. 2.1. Therefore, we choose to limit our interest to the region \(a<1\). We thus propose the functional form

$$\begin{aligned} P_0(a|\mu ) = (1+\omega )(1-a)^\omega \theta (a)\theta (1-a), \qquad \omega \ge 0. \end{aligned}$$
(4.9)

Once again, we assume that this prior is independent of the scale \(\mu \). For \(\omega =0\), we obtain a flat distribution in the allowed region \(0\le a\le 1\) (in this case, the extreme value \(a=1\) is included), while for \(\omega >0\) we suppress the region \(a\rightarrow 1\) to favour small values of a. The actual value of \(\omega \) that we recommend will be discussed in Sect. 4.4, and its variations will be considered in Sect. 4.6.
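Both priors have simple closed-form CDFs and can thus be sampled by inversion, which may be useful for numerical implementations. A minimal sketch, with our own naming; the parameter values are just the defaults advocated in Sect. 4.4:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_c(eps, size):
    """Inverse-CDF sampling of Eq. (4.8): P0(c) = eps/c^(1+eps) on c >= 1.
    The CDF is F(c) = 1 - c^(-eps), hence c = (1-u)^(-1/eps)."""
    u = rng.uniform(size=size)
    return (1.0 - u) ** (-1.0 / eps)

def sample_a(omega, size):
    """Inverse-CDF sampling of Eq. (4.9): P0(a) = (1+omega)(1-a)^omega on [0,1].
    The CDF is F(a) = 1 - (1-a)^(1+omega), hence a = 1 - (1-u)^(1/(1+omega))."""
    u = rng.uniform(size=size)
    return 1.0 - (1.0 - u) ** (1.0 / (1.0 + omega))

c_samples = sample_c(0.1, 10000)  # heavy-tailed: the mean is infinite for eps <= 1
a_samples = sample_a(1.0, 10000)  # linearly decreasing density on [0, 1]
```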

4.3 Inference on the unknown higher orders

We now have all the ingredients to perform the inference in this model. The basic probability that we need is the conditional probability of the unknown higher orders given the first n known non-trivial orders \(\delta _1,\ldots ,\delta _n\). Note that, because of our assumption Eq. (4.2), only a limited number of higher orders, up to order \(k_{\mathrm{asympt}}\), can be predicted within this model. Therefore, the most generic probability distribution we need to consider, according to Eq. (3.15), is

$$\begin{aligned}&P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\mu )\nonumber \\&\quad =\frac{P(\delta _{n+j},\ldots ,\delta _{n+1},\delta _n,\ldots ,\delta _1|\mu )}{P(\delta _n,\ldots ,\delta _1|\mu )}, \ j\ge 1, \ n+j<k_{\mathrm{asympt}}, \end{aligned}$$
(4.10)

namely the probability of the unknown orders \(n+1,\ldots ,n+j\) given the known orders \(1,\ldots ,n\). In contrast with Eq. (3.15), we have removed here the explicit dependence on \(\Sigma _0\), as it does not appear in the likelihood and thus it does not play any role in the inference procedure.

The numerator and the denominator are the same object, therefore we can focus on the distribution (at fixed scale) for a generic number m of consecutive coefficients, which is given by

$$\begin{aligned}&P(\delta _m,\ldots ,\delta _1|\mu ) =\int dc\int da\, P(\delta _m,\ldots ,\delta _1|c,a,\mu ) P_0(c,a|\mu ) \nonumber \\&\quad =\int dc\int da\, P(\delta _m|c,a,\mu )\cdots P(\delta _1|c,a,\mu ) P_0(c,a|\mu )\nonumber \\&\quad = \frac{1}{2^m} \int \frac{da}{a^{\frac{m(m+1)}{2}}}\, P_0(a|\mu ) \nonumber \\&\qquad \times \int \frac{dc}{c^m}\, P_0(c|\mu )\, \theta \left( c-\max \left[ \frac{\left| \delta _1(\mu ) \right| }{a},\ldots , \frac{\left| \delta _m(\mu ) \right| }{a^m}\right] \right) \end{aligned}$$
(4.11)

which corresponds to Eq. (3.16) specialized to our case. In the first line we have introduced the hidden parameters, in the second line we have used the independence of the coefficients, Eq. (4.4), and in the third line we have explicitly written the likelihood Eq. (4.3) and used the prior independence of the hidden parameters Eq. (4.5). The integral in Eq. (4.11) can be computed analytically for our choice of priors Eq. (4.8) and Eq. (4.9). The inner integral is given by

$$\begin{aligned}&\int \frac{dc}{c^m}\, P_0(c|\mu )\, \theta \left( c-\max \left[ \frac{\left| \delta _1(\mu ) \right| }{a},\ldots , \frac{\left| \delta _m(\mu ) \right| }{a^m}\right] \right) \nonumber \\&\quad = \frac{\epsilon }{m+\epsilon }\max \left[ 1,\frac{\left| \delta _1(\mu ) \right| }{a},\ldots , \frac{\left| \delta _m(\mu ) \right| }{a^m}\right] ^{-m-\epsilon }. \end{aligned}$$
(4.12)

Depending on the value of a, the max function selects a different term with a different a dependence. In order to compute the a integral analytically, it is therefore convenient to partition the integration region \(0\le a<\infty \) into a finite number of intervals, in each of which the max function returns one of its arguments. Since the arguments of the max function contain powers of a that grow with k, the intervals are ordered with k. More precisely, smaller values of a select larger powers k, and vice versa. We can thus introduce consecutive decreasing numbers \(a_k\), representing the boundaries of these consecutive intervals, defined such that

$$\begin{aligned} a_{k+1}< a< a_k \quad \Leftrightarrow \quad \max \left[ 1,\frac{\left| \delta _1(\mu ) \right| }{a},\ldots ,\frac{\left| \delta _m(\mu ) \right| }{a^m}\right] = \frac{\left| \delta _k(\mu ) \right| }{a^k} \end{aligned}$$
(4.13)

and assuming \(a_0\equiv \infty \) and \(a_{m+1}\equiv 0\). An algorithm for extracting the various \(a_k\)’s from the knowledge of the \(\delta _k\)’s is described in Appendix A.2. The a integral is then given by

$$\begin{aligned}&P(\delta _m,\ldots ,\delta _1|\mu ) = \frac{1}{2^m} \int \frac{da}{a^{\frac{m(m+1)}{2}}}\,P_0(a|\mu )\frac{\epsilon }{m+\epsilon }\nonumber \\&\quad \max \left[ 1,\frac{\left| \delta _1(\mu ) \right| }{a},\ldots , \frac{\left| \delta _m(\mu ) \right| }{a^m}\right] ^{-m-\epsilon } \nonumber \\&\quad =\frac{1}{2^m} \sum _{k=0}^{m} \int _{a_{k+1}}^{a_k} \frac{da}{a^{\frac{m(m+1)}{2}}}\, P_0(a|\mu ) \frac{\epsilon }{m+\epsilon } \left( \frac{\left| \delta _k(\mu ) \right| }{a^k}\right) ^{-m-\epsilon } \nonumber \\&\quad =\frac{\epsilon (1+\omega )}{2^m(m+\epsilon )}\sum _{k=0}^{m} \frac{1}{\left| \delta _k(\mu ) \right| ^{m+\epsilon }} \int _{\min (1,a_{k+1})}^{\min (1,a_k)} da\, a^{(m+\epsilon )k-\frac{m(m+1)}{2}} (1-a)^\omega , \end{aligned}$$
(4.14)

where in the last line we have used the explicit form of the prior for a (Eq. 4.9), which further restricts the integration region to \(a\le 1\). The general result of this integral can be written in terms of the incomplete Beta function. However, a simpler form is obtained if \(\omega \) is an integer. Indeed, in this case

$$\begin{aligned}&P(\delta _m,\ldots ,\delta _1|\mu ) =\frac{\epsilon (1+\omega )}{2^m(m+\epsilon )}\sum _{k=0}^{m}\frac{1}{\left| \delta _k(\mu ) \right| ^{m+\epsilon }}\nonumber \\&\qquad \times \sum _{j=0}^\omega (-1)^j\left( {\begin{array}{c}\omega \\ j\end{array}}\right) \int _{\min (1,a_{k+1})}^{\min (1,a_k)} da\, a^{(m+\epsilon )k-\frac{m(m+1)}{2}+j} \nonumber \\&\quad =\frac{\epsilon (1+\omega )}{2^m(m+\epsilon )}\sum _{k=0}^{m} \frac{1}{\left| \delta _k(\mu ) \right| ^{m+\epsilon }} \sum _{j=0}^\omega (-1)^j\left( {\begin{array}{c}\omega \\ j\end{array}}\right) \times \nonumber \\&\qquad \times {\left\{ \begin{array}{ll} \log \frac{\min (1,a_k)}{\min (1,a_{k+1})} \qquad \qquad \qquad \qquad \qquad \text {if }(m+\epsilon )k-\frac{m(m+1)}{2}+j+1=0\\ \frac{\min (1,a_k)^{(m+\epsilon )k-\frac{m(m+1)}{2}+j+1}-\min (1,a_{k+1})^{(m+\epsilon )k-\frac{m(m+1)}{2}+j+1}}{(m+\epsilon )k-\frac{m(m+1)}{2}+j+1}\qquad \text {elsewhere}. \end{array}\right. } \end{aligned}$$
(4.15)

The advantage of having such a simple analytic form is that the numerical evaluation is very fast. However, nothing prevents one from making more complicated choices for the prior distributions, at the price that numerical integration will typically slow down the computation of the distribution. A numerical evaluation that sidesteps the partitioning of Eq. (4.13) altogether is sketched below.
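Indeed, Eq. (4.14) can also be evaluated by a direct quadrature over a (after the analytic c integral, Eq. (4.12)), which is a useful cross-check of Eq. (4.15). The following is a minimal sketch in this spirit, not the implementation used in this work; the function names and the toy input values are ours, and EPS, OMEGA denote the prior parameters \(\epsilon \) and \(\omega \):

```python
import numpy as np
from scipy.integrate import quad

EPS, OMEGA = 0.1, 1.0  # the default prior parameters of Sect. 4.4

def joint(deltas):
    """P(delta_m,...,delta_1|mu), Eq. (4.14), by quadrature over a.
    deltas = [delta_1(mu), ..., delta_m(mu)] at the chosen scale."""
    m = len(deltas)
    logd = np.log(np.abs(deltas))
    def integrand(a):
        if a <= 0.0 or a >= 1.0:
            return 0.0
        # log of max[1, |d_1|/a, ..., |d_m|/a^m], as in Eq. (4.12)
        logmax = max(0.0, *(logd[k - 1] - k * np.log(a) for k in range(1, m + 1)))
        logval = (OMEGA * np.log1p(-a) - 0.5 * m * (m + 1) * np.log(a)
                  - (m + EPS) * logmax)
        return np.exp(logval)  # vanishes at both endpoints, so no overflow
    val, _ = quad(integrand, 0.0, 1.0, limit=200)
    return EPS * (1.0 + OMEGA) / (2**m * (m + EPS)) * val

def next_delta_density(deltas, delta_next):
    """P(delta_{n+1}|delta_n,...,delta_1,mu): the ratio Eq. (4.10) with j=1.
    The density of Sigma then follows from Eq. (3.14) by a change of variable."""
    return joint(list(deltas) + [delta_next]) / joint(deltas)

# toy example with three known normalized corrections
print(next_delta_density([0.3, 0.1, 0.04], 0.02))
```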

Equation (4.15) can be used directly in Eq. (4.10) to obtain the probability distribution of the unknown higher orders. Following the derivation of Sect. 3.3 we can then construct the distribution for the observable \(\Sigma \) (Eq. 3.13). A useful property of the result is that the tails of such distributions are dominated by the first missing higher order (this is a consequence of the hierarchy in the arguments of the max function, Eq. (4.13)). Therefore, the asymptotic behaviour of the distribution is given by

$$\begin{aligned} P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu ) \sim \frac{1}{\left| \Sigma -\Sigma _n \right| ^{n+1+\epsilon }}. \end{aligned}$$
(4.16)

4.4 The posterior of the hidden parameters

Even though the hidden parameters of the model are never part of the final distribution, it is instructive to understand how their distribution changes with the knowledge of the first few orders. The posterior distribution of c and a can be easily computed as

$$\begin{aligned} P(c,a|\delta _n,\ldots ,\delta _1,\mu )&= \frac{P(\delta _n,\ldots ,\delta _1,c,a|\mu )}{P(\delta _n,\ldots ,\delta _1|\mu )} \nonumber \\&= \frac{P(\delta _n,\ldots ,\delta _1|c,a,\mu )P_0(c,a|\mu )}{P(\delta _n,\ldots ,\delta _1|\mu )} \nonumber \\&= \frac{P(\delta _n|c,a,\mu )\cdots P(\delta _1|c,a,\mu )P_0(c,a|\mu )}{P(\delta _n,\ldots ,\delta _1|\mu )}, \end{aligned}$$
(4.17)

where the denominator is given in Eq. (4.15), and also corresponds to the integral over c and a of the numerator. It is clear that even though c and a were uncorrelated in our prior, correlations arise after inference takes place. Using the explicit form of our likelihood (Eq. 4.3), the posterior becomes

$$\begin{aligned}&P(c,a|\delta _n,\ldots ,\delta _1,\mu ) =\frac{P_0(c,a|\mu )}{P(\delta _n,\ldots ,\delta _1|\mu )}\,\nonumber \\&\qquad \times \frac{1}{a^{\frac{n(n+1)}{2}}}\, \frac{1}{c^n}\, \theta \left( c-\max \left[ \frac{\left| \delta _1 \right| }{a},\ldots , \frac{\left| \delta _n \right| }{a^n}\right] \right) \nonumber \\&\quad =\frac{\epsilon (1+\omega )\theta (a)\theta (1-a)}{P(\delta _n,\ldots ,\delta _1|\mu )}\, \frac{(1-a)^\omega }{a^{\frac{n(n+1)}{2}}}\, \nonumber \\&\qquad \times \frac{1}{c^{n+1+\epsilon }}\, \theta \left( c-\max \left[ 1,\frac{\left| \delta _1 \right| }{a},\ldots , \frac{\left| \delta _n \right| }{a^n}\right] \right) , \end{aligned}$$
(4.18)

where in the second line we have also used the explicit form of our prior (Eqs. 4.8 and 4.9). The theta function cuts out the region of small c and small a, with a boundary given by a sequence of contours identified by \(ca^k=\left| \delta _k \right| \) in the region \(a_{k+1}<a<a_k\) selected by the max function, see Eq. (4.13). On the other hand, the growing negative powers of both c and a tend to favour small values of c and a, thus values close to this boundary. Since the power of a grows quadratically with n while that of c only linearly, inference tends to favour smaller a at the price of having somewhat larger c. A visual example of this behaviour is given in Fig. 9.
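For visualizations like Fig. 9, the unnormalized posterior Eq. (4.18) can be evaluated directly on a grid of (c, a) values; a minimal sketch, with our own naming and the default prior parameters of Sect. 4.4:

```python
import numpy as np

EPS, OMEGA = 0.1, 1.0  # as in the sketch above

def posterior_ca(c, a, deltas):
    """Unnormalized posterior of the hidden parameters, Eq. (4.18),
    for deltas = [delta_1(mu), ..., delta_n(mu)]."""
    n = len(deltas)
    if not (0.0 < a < 1.0) or c <= 0.0:
        return 0.0
    # theta function: c must exceed max[1, |d_1|/a, ..., |d_n|/a^n]
    bound = max(1.0, *(abs(deltas[k - 1]) / a**k for k in range(1, n + 1)))
    if c < bound:
        return 0.0
    return (1.0 - a)**OMEGA * a**(-0.5 * n * (n + 1)) * c**(-(n + 1.0 + EPS))
```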

Fig. 9

The plots show the probability distribution of the parameters c and a. The first plot (upper left) is the prior, the second (upper right) is the posterior after the knowledge of \(\delta _1\), the third (bottom left) adds the knowledge of \(\delta _2\) and the last (bottom right) the one of \(\delta _3\). The observable under consideration is the inclusive Higgs cross section, Sect. 2.5, for fixed scale \(\mu =m_{H}/2\). Each line represents the contour \(ca^k=\left| \delta _k \right| \), for increasing values of k from 1 to 3, to clarify the role of the theta function. Note that the c axis is shown in log scale, but the probability distribution is for c and not \(\log c\). For this plot, we used \(\omega =1\) and \(\epsilon =0.1\) for the prior parameters

This preference is a nice outcome of the model: inference favours small values of the parameter a that lead to a better behaviour of the expansion, with smaller higher orders. The prediction of the observable will then be more precise, namely subject to a smaller uncertainty. Of course one also (and more importantly) wants the prediction to be accurate, namely with a reliable uncertainty that does not underestimate the missing higher orders. Judging whether the outcome of the model is reliable is not immediate, and requires explicit examples to verify it. We will come back to this point later, in Sects. 4.5 and 7.

Note that all the considerations so far are independent of the prior. The prior only changes the “starting point” of the inference procedure. With sufficiently many known orders, the choice of the prior would not matter. However, since the number of known orders is typically limited, choosing it wisely is important. Since we like the “direction” selected by the inference procedure (it is better to have larger c and smaller a than the opposite), it seems convenient to choose a prior distribution that already favours the same region of parameter space. This is achieved in our Eqs. (4.8) and (4.9) by choosing a small value for \(\epsilon \) and a large value for \(\omega \). Note however that a large \(\omega \) suppresses the region \(a\sim 1\), while in the inference there is no such suppression: the small-a region is simply enhanced by a negative power of a. Therefore, using a large value of \(\omega \) may introduce a significant bias. A good compromise, which we advocate as the best choice, is \(\omega =1\). In this way there is a preference for smaller a with only a mild suppression for \(a\sim 1\). On the other hand, for \(\epsilon \) we can choose an arbitrarily small (positive) value, for instance \(\epsilon =0.1\) or \(\epsilon =0.01\). The difference between the two choices is relevant only at very low orders: in Fig. 9 only the first plot (the prior) would change visibly. We thus use \(\epsilon =0.1\) in the rest of this work, with the exception of Sect. 4.6 where we will consider variations of the prior parameters.

4.5 Representative results

Before moving further, we now present some representative results of this method. We use our working example of Higgs production to examine the distribution for the cross section. We fix the renormalization scale \(\mu =m_{H}/2\), which is the most widely used choice for this process. Of course the result of this model depends on the scale choice, and in addition the model knows nothing about the scale dependence. The inputs for this model are thus just four numbers, the values of the cross section at the chosen scale at LO, NLO, NNLO and \(\hbox {N}^3\)LO. How to deal with the scale dependence in this model will be discussed in Sect. 6.

Fig. 10

Probability distributions of the Higgs cross section with different states of knowledge

In Fig. 10 we plot the distribution for the observable \(\Sigma \) (the Higgs cross section), Eq. (3.13), using only the first missing higher order (solid lines) or the first two missing higher orders (dashed lines). The four colours correspond to different states of knowledge: LO (black), NLO (green), NNLO (blue) and \(\hbox {N}^3\)LO (red).

We immediately notice that the solid and dashed curves are basically identical. This implies that, for any given state of knowledge, the uncertainty in this model is dominated by the first missing higher order. This is a consequence of having \(a<1\), which implies that higher and higher orders are smaller and smaller in this model. Moreover, given that the distribution of c and a favours small values of a (Fig. 9), the impact of the next higher orders is significantly smaller than that of the first missing higher order. The near-identity of the curves also follows from the fact that the tails of these distributions behave as a negative power (Eq. 4.16), and are thus very “long”: the effect of the next higher orders of “enlarging” the distribution is almost invisible, as the tails already cover large deviations of the observable. We thus conclude that for this model it is sufficient to consider only the first missing higher order, so we can directly use Eq. (3.14) for all practical applications. This makes the implementation of this model very fast, as no numerical integration is needed.

Let us now comment on the predictions of this model. Of course, every distribution is centered on the cross section at the known order, and it is symmetric, because in our assumption Eq. (4.2) the sign of the missing higher order is treated agnostically (we will consider a possible way of taking the sign into account in Sect. B.5). We consider the four states of knowledge in turn.

  • When only the LO is known, the shape of the curves is fully determined by the prior (no inference has taken place yet), so the black curve is not particularly useful. Indeed, for our choice of prior, the distribution is very broad, so it is basically not predictive. Note also that the distribution is barely normalizable, because it asymptotically behaves as \(\left| \Sigma -\Sigma _0 \right| ^{-1-\epsilon }\), Eq. (4.16), with \(\epsilon \) small. Another consequence of this behaviour is that the variance of this distribution is infinite. Also, the distribution assigns a high probability (\(\sim 40\%\)) to unphysical negative values of the cross section. These can be avoided by imposing a positivity constraint on \(\Sigma \), but it is instructive to see how much information is needed to obtain a distribution sufficiently narrow to have a negligible probability of negative cross section.

  • The knowledge of the NLO allows us to perform the first step of the inference, so the green curve provides the first non-trivial prediction. However, it is still an “immature” prediction, because only one piece of information has been used. Indeed, this distribution is still very broad, with tails asymptotically behaving as \(\left| \Sigma -\Sigma _0 \right| ^{-2-\epsilon }\), and a cusp that favours a single point but not really a region. For \(0<\epsilon <1\), as we advocate, this distribution also has infinite variance. Also in this case, there is a significant probability (\(\sim 4\%\)) of negative cross section. Therefore, the NLO alone does not provide sufficient information in this model to make a precise prediction.

  • Once the NNLO is known, the situation changes. On top of having tails that decrease more rapidly, the blue curve clearly identifies a more probable region where the distribution has some sort of bump. The prediction is still rather uncertain, but at least the procedure seems to converge. The probability of negative cross section is reduced to less than \(0.3\%\).

  • As a confirmation of this, we see that the knowledge of the \(\hbox {N}^3\)LO improves the situation even more. Now the red distribution is well localized, with a clear bump, which is also well compatible with the prediction at the previous order. The uncertainty is clearly reduced, and the probability of negative cross section becomes negligible (less than \(0.01\%\)).

We conclude that this method works well, but requires a sufficient number of known orders to be predictive. Two orders (NLO) is the absolute minimum, but three orders (NNLO) are probably necessary to achieve a decent precision. Beyond NNLO (four or more orders) it should work very well. Note also that the distributions do not look gaussian, and therefore they should not be approximated with a gaussian distribution in applications.

Fig. 11

Summary of the distributions of Fig. 10 at the various orders given in terms of mean, standard deviation and degree of belief intervals. The canonical scale variation uncertainty bands are also shown. The left plot is for the standard scale \(\mu =m_{H}/2\), while the right plot is for the “natural” scale \(\mu =m_{H}\)

To further appreciate the results of the approach, we compute from the distributions a number of quantifiers, namely the mean (which equals the mode and the median, given the symmetry of the distributions), the standard deviation, and degree of belief (DoB) intervals. We summarize them in Fig. 11. In this summary, we consider both the scale choice used previously, namely \(\mu =m_{H}/2\), and another one, namely \(\mu =m_{H}\), to emphasise the scale dependence of this approach at this level. A number of comments are in order.

  • The uncertainty is reduced visibly and substantially when adding information from perturbative orders. All the uncertainty quantifiers (standard deviation, 68% and 95% DoB intervals) shrink as the knowledge increases.

  • Because of the heavy tails, the standard deviation is quite large: infinite in the first two cases (knowledge of LO and NLO), and larger than the 68% DoB interval in the other two cases. For the very same reason, the 95% DoB interval is always much larger than the 68% DoB interval.

  • While the uncertainty shrinks as information is added, the results remain well compatible with the previous orders. All 95% DoB intervals are contained in those of the previous orders. The same is true for the standard deviation. The 68% DoB intervals, instead, are contained in the corresponding interval of the previous order only in some cases, but this is perfectly acceptable, as there is a large probability (32%) that the true result lies outside that interval.

We conclude once again that the knowledge of the NLO in this method is not sufficient to achieve a decent precision, while when at least the NNLO is known the method works very well. Note that the reliability is manifest, as the uncertainties always cover nicely the next orders. This is achieved at a price, namely having large uncertainties with a poor state of knowledge. But this is perfectly meaningful, because with too little information on the perturbative expansion it is impossible to predict the value of the observable with precision.Footnote 23 We note that, despite the large uncertainties at low orders, once the \(\hbox {N}^3\)LO is known the prediction is very precise, at least for some “good” scale choices (we will come back to this point in Sect. 6).

The same plot also shows the conventional scale variation uncertainty for comparison, both in its asymmetric version (in black) and in its symmetrized version (in grey). We stress once again that these conventional uncertainty bands have no probabilistic interpretation; they just represent the “error” usually assigned to perturbative results. Since the probability distributions are symmetric, their mean coincides with the “central value” of the conventional scale variation approach. There is little to say about these “error bars”, given the absence of a probabilistic meaning. We observe that, accidentally, the width of these “error bars” is similar in size to the 68% DoB intervals, with the exception of that of the LO.
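As a practical aside, the DoB intervals shown in Fig. 11 can be extracted from any numerical representation of the distributions. A minimal sketch of a smallest (highest-density) interval for a unimodal density sampled on a uniform grid; the function name and interface are ours:

```python
import numpy as np

def dob_interval(grid, pdf, p=0.68):
    """Smallest interval containing probability p for a unimodal density.

    grid: uniformly spaced values of the observable (numpy array);
    pdf: the (possibly unnormalized) density on that grid (numpy array)."""
    weights = pdf / pdf.sum()      # normalize on the grid
    order = np.argsort(pdf)[::-1]  # include highest-density points first
    mass, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        mass += weights[i]
        if mass >= p:
            break
    return grid[min(chosen)], grid[max(chosen)]
```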

4.6 Dependence on the prior

Before concluding the section we present a study of the impact of the choice of the prior on the final probability distribution for the observable. To do this, we consider variations of the prior parameters (Sect. 4.2) and see how the final distributions change. For definiteness, starting from the default choice of parameters \(\epsilon =0.1\) and \(\omega =1\), we vary one of the two keeping the other at its default value. For the prior for c, we consider \(\epsilon =0.01\) and \(\epsilon =1\), while for a we take \(\omega =0\) and \(\omega =2\). The results of these variations are shown in Fig. 12, both at the level of the distributions for the cross section, and in terms of the statistical estimators (considering in this case only the most accurate results, with the knowledge of the \(\hbox {N}^3\hbox {LO}\)).

Fig. 12

Dependence of the probability distribution of the observable at various orders on the parameters of the prior (left plot), and statistical estimators for the distribution at \(\hbox {N}^3\hbox {LO}\) for the same choices of parameters (right plot)

At the distribution level, we notice that the shape of the distribution is not particularly affected by the variation of parameters, as all curves look qualitatively the same. However, they vary quantitatively: for instance, the “width” and the “height” of each distribution change with the prior parameters. It has to be noted that, as expected, these differences are more marked at low orders (less information, thus more dependence on the prior), and become less and less relevant as the order increases (more information).

To better appreciate the impact of the different priors, the right plot of Fig. 12 shows the statistical estimators (standard deviation and DoB intervals) for the distribution at \(\hbox {N}^3\hbox {LO}\) obtained with each choice of the prior parameters. We see that the differences are somewhat marked. In particular, the largest uncertainties are obtained with \(\omega =0\), which is obvious as it allows larger values of the expansion parameter a with larger probability (see the discussion in Sect. 4.4). Similarly, larger \(\epsilon \) also leads to a larger uncertainty. This is due to the fact that, if smaller values of c are favoured, the constraint from the theta function in Eq. (4.18) forces a to be larger (see also Fig. 9). In agreement with these observations, the result with \(\omega =2\) has a smaller uncertainty due to the stronger suppression of the \(a\rightarrow 1\) region, and the result with \(\epsilon =0.01\) is also more precise, even though here the difference is very mild due to the tiny difference in the parameter value. Note that in all cases the largest difference is in the 95% DoB interval, which is dominated by the tails of the distributions, the part most affected by the choice of prior. The 68% DoB interval, dominated by the peak region, is more stable.

We conclude that there certainly is a non-negligible bias due to the choice of the prior. Nevertheless, we believe that our suggested choice of parameters represents a good compromise between accuracy and precision. One can obtain more conservative results with larger \(\epsilon \) and/or smaller \(\omega \), but if the suggested values already provide accurate results (as we shall verify with other examples in Sect. 7) varying them does not seem convenient. Conversely, one could obtain more precise results with a more aggressive prior for a, namely with larger \(\omega \), but this may lead to a loss of accuracy, which is to be avoided.

5 Model 2: a new approach using scale variation information

We now move to our second model, which uses information on the scale dependence of the perturbative expansion. This model is specific to QFT applications.

5.1 The hypothesis of the model

We now consider another model that, rather than looking at the behaviour of the perturbative expansion, uses only the information coming from scale variation to infer the size of the missing higher orders, similarly to what is done in the canonical scale variation approach. The foundation of this method is the fact, already stressed several times, that the scale dependence of a perturbative expansion truncated at order n is formally of order \(n+1\), namely of the same order as the missing higher orders, Eq. (2.15). In Sect. 3.2 we have introduced the estimators \(r_k(\mu )\), Eq. (3.9), to quantify the scale dependence of the perturbative expansion at order k, which are thus objects of order \(k+1\) (Eq. 3.6). We shall thus expect

$$\begin{aligned} \left| \delta _k(\mu ) \right| \sim r_{k-1}(\mu ), \end{aligned}$$
(5.1)

namely, the term of the perturbative expansion at order k should be similar in size to the scale variation estimator of the previous order \(k-1\). From this vague statement we can now propose the hypothesis of our model, which is

$$\begin{aligned} \left| \delta _k(\mu ) \right| \le \lambda r_{k-1}(\mu ) \qquad k<k_{\mathrm{asympt}}, \end{aligned}$$
(5.2)

where \(\lambda \) is a hidden parameter of the model, again dependent on (correlated with) \(\mu \). This condition does not tell us anything about the behaviour of the expansion, but only relies on the goodness of the scale variation numbers \(r_k\) as estimators of the next order, up to a factor \(\lambda \).

Note that if the scale variation estimator \(r_k\) is accidentally small, \(\lambda \) is forced to be large to accommodate the condition Eq. (5.2). This is definitely undesirable, as a large value of \(\lambda \) due to an accident of the scale dependence would enlarge the uncertainty, making the model less predictive. This would be the case if \(r_k\) were constructed as the derivative of the observable: close to stationary points, the derivative is accidentally small, see Fig. 5 in comparison with Fig. 4. Our definition of \(r_k\), Eq. (3.9), is indeed designed to avoid such problematic behaviours, as one can see from Fig. 6. Since it is not unusual that the leading order is scale independent, one has to redefine \(r_0\), because in these cases it would be identically zero. As anticipated in Sect. 3.2, in these cases we simply set \(r_0=1/2\), which represents a rather conservative choice (larger values can also be considered to be even more conservative). Note that in these cases the first non-trivial information comes from the NNLO, so the probability distributions obtained from the knowledge of the NLO alone have to be considered biased by the arbitrary choice of \(r_0\).

It is instructive to understand the connection between our hypothesis Eq. (5.2) and the canonical scale variation approach. The relation between the two is very simple in the case in which the scale dependence is linear in \(\log \mu \). In such a case, \(r_k(\mu )\) coincides with the derivative of the observable with respect to \(\log \mu \). The canonical scale variation Eq. (1.1) in this limit would predict the “error” to be \(\log 2\) times the same derivative.Footnote 24 Therefore, if the canonical scale variation “error” is thought of as an absolute limit on the next order, it coincides with Eq. (5.2) with \(\lambda =\log 2\) fixed. This new method can thus be viewed as an improved version of canonical scale variation in which the variation factor is not fixed to be 2, but rather is inferred from the perturbative expansion itself, such that the uncertainty estimate is reliable.

The condition Eq. (5.2) must be translated into a probability distribution for the coefficients \(\delta _k\) given \(\lambda \) and \(r_{k-1}\). We assume that the condition is strictly satisfied, and that within the allowed range all values are equally likely. This leads to the likelihood

$$\begin{aligned}&P(\delta _k| r_{k-1},\lambda ,\mu ) = \frac{1}{2\lambda r_{k-1}(\mu )}\theta \left( \lambda r_{k-1}(\mu ) - \left| \delta _k(\mu ) \right| \right) ,\nonumber \\&\qquad k>0, \quad k<k_{\mathrm{asympt}}, \end{aligned}$$
(5.3)

where we have stressed that this conditional probability makes sense only for \(k>0\), because at LO there is no previous order to be used to compute scale dependence (in other words, \(r_{-1}\) does not exist). This is once again in line with the fact that the LO alone does not bring any information on the behaviour of the expansion. Since the likelihood depends on a single hidden parameter, the resulting model is not very “flexible”, and will for example lead to non-smooth distributions for the observable. To add some flexibility, one may allow violations of the bound Eq. (5.2). This possibility will be explored later in Sect. B.1.

The \(r_{k-1}\) coefficient is assumed to be given information in the likelihood Eq. (5.3). In fact, \(r_{k-1}\) can be computed from the knowledge of all the previous \(\delta _{k-1}(\mu ),\ldots ,\delta _1 (\mu )\) and \(\Sigma _0 (\mu )\). This notation must thus be interpreted as a shorthand for the complete expression

$$\begin{aligned} P(\delta _k| r_{k-1},\lambda ,\mu ) \equiv P(\delta _k| \delta _{k-1},\ldots ,\delta _1,\Sigma _0,\lambda ,\mu ). \end{aligned}$$
(5.4)

Note that, differently from the geometric behaviour model of Sect. 4, here there is an explicit dependence on \(\Sigma _0(\mu )\), which is needed to compute the scale variation estimators \(r_k\). Equation (5.4) implies that the \(\delta _k\) coefficients are not independent (as they were in the geometric behaviour model). The dependence, however, is always only on the previous ones. Thanks to this, the inference procedure is straightforward. The Bayesian network of this model, using explicitly the \(r_k\) parameters, is depicted in Fig. 13.

Fig. 13

Bayesian network for the scale variation model

5.2 The choice of prior

This model depends on a single parameter \(\lambda \). In principle, it can take any positive value. However, Eq. (5.1) suggests that it should be of \({\mathcal {O}}(1)\). For this reason, we find it convenient to choose its prior distribution such that large values are strongly suppressed. We thus suggest an exponential behaviour

$$\begin{aligned} P_0(\lambda |\mu ) = \frac{1}{\Gamma (1+\gamma )}\,\lambda ^\gamma \, e^{-\lambda }\, \theta (\lambda ),\qquad \gamma \ge 0, \end{aligned}$$
(5.5)

where \(\gamma \) is a parameter that changes the shape of the distribution in the region of small \(\lambda \) (Eq. (5.5) is a Gamma distribution with shape parameter \(1+\gamma \) and unit rate). The mode of the distribution is at

$$\begin{aligned} \lambda _{\mathrm{mode}} = \gamma . \end{aligned}$$
(5.6)

Therefore, it makes sense to choose either \(\gamma =1\), so that our expectation \(\lambda ={\mathcal {O}}(1)\) is contained in the prior, or \(\gamma =0\), which represents a more agnostic (less informative) choice. The dependence on \(\gamma \) will be explored in Sects. 5.4 and 5.6. Similarly to the previous model, the prior Eq. (5.5) is chosen to be independent of \(\mu \).

5.3 Inference on the unknown higher orders

The probability distribution of the missing higher orders given the known ones, Eq. (3.15), is

$$\begin{aligned}&P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\nonumber \\&\quad =\frac{P(\delta _{n+j},\ldots ,\delta _{n+1},\delta _n,\ldots ,\delta _1|\Sigma _0,\mu )}{P(\delta _n,\ldots ,\delta _1|\Sigma _0,\mu )},\nonumber \\&\qquad j\ge 1, \quad n+j<k_{\mathrm{asympt}}, \end{aligned}$$
(5.7)

where on the right-hand side we have included \(\Sigma _0(\mu )\) as part of the information, as it is needed only to compute the scale dependence. Both the numerator and the denominator can be directly computed from

$$\begin{aligned}&P(\delta _m,\ldots ,\delta _1|\Sigma _0,\mu ) =\int d\lambda \, P(\delta _m,\ldots ,\delta _1,\lambda |\Sigma _0,\mu ) \nonumber \\&\quad =\int d\lambda \, P(\delta _m|\delta _{m-1},\ldots ,\delta _1,\lambda , \Sigma _0,\mu ) P(\delta _{m-1},\ldots ,\delta _1,\lambda |\Sigma _0,\mu ) \nonumber \\&\quad =\int d\lambda \, P(\delta _m|\delta _{m-1},\ldots ,\delta _1,\lambda , \Sigma _0,\mu ) \cdots P(\delta _1|\lambda , \Sigma _0,\mu )\nonumber \\&\quad P_0(\lambda |\Sigma _0,\mu ) \nonumber \\&\quad \equiv \int d\lambda \, P(\delta _m|r_{m-1},\lambda ,\mu )\cdots P(\delta _1|r_0,\lambda ,\mu ) P_0(\lambda |\mu )\nonumber \\&\quad = \frac{1}{2^m\, r_0(\mu )\cdots r_{m-1}(\mu )} \int \frac{d\lambda }{\lambda ^m}\, P_0(\lambda |\mu )\, \nonumber \\&\qquad \theta \left( \lambda -\max \left[ \frac{\left| \delta _1(\mu ) \right| }{r_0(\mu )},\ldots , \frac{\left| \delta _m(\mu ) \right| }{r_{m-1}(\mu )}\right] \right) , \end{aligned}$$
(5.8)

where in the first step we have introduced the hidden parameter, in the next two steps we have recursively written the probability of each \(\delta _k\) in terms of the previous ones, in the fourth step we have introduced the shorthand notation Eq. (5.4) and removed \(\Sigma _0\) from the prior, which does not depend on it, and finally we have written explicitly the likelihoods Eq. (5.3). Note that the arguments of the max function depend only on the perturbative coefficients and not on the parameter. Therefore, the max function itself is just a number, so we can conveniently define it as

$$\begin{aligned} \lambda _m(\mu ) \equiv \max \left[ \frac{\left| \delta _1(\mu ) \right| }{r_0(\mu )},\ldots , \frac{\left| \delta _m(\mu ) \right| }{r_{m-1}(\mu )}\right] \end{aligned}$$
(5.9)

to simplify the expressions. With our choice of prior, Eq. (5.5), it is easy to compute the integral

$$\begin{aligned}&P(\delta _m,\ldots ,\delta _1|\Sigma _0,\mu )\nonumber \\&= \frac{1}{2^m\, r_0(\mu )\cdots r_{m-1}(\mu )\Gamma (1+\gamma )} \int _{\lambda _m(\mu )}^\infty d\lambda \,\lambda ^{\gamma -m}e^{-\lambda }\nonumber \\&= \frac{\Gamma \left( 1+\gamma -m,\lambda _m(\mu )\right) }{2^m\, r_0(\mu )\cdots r_{m-1}(\mu )\Gamma (1+\gamma )}, \end{aligned}$$
(5.10)

where \(\Gamma (a,b)\) is the incomplete Gamma function. This result is very simple, thanks to the simplicity of the model (just one parameter) and of the choice of likelihood and prior. The distribution Eq. (5.7) has an equally simple form

$$\begin{aligned}&P(\delta _{n+j},\ldots ,\delta _{n+1}|\delta _n,\ldots ,\delta _1,\Sigma _0,\mu ) \nonumber \\&\quad = \frac{\Gamma \left( 1+\gamma -n-j,\lambda _{n+j}(\mu )\right) }{2^j\, r_n(\mu )\cdots r_{n+j-1}(\mu )\Gamma \left( 1+\gamma -n,\lambda _n(\mu )\right) }. \end{aligned}$$
(5.11)

The derivation of the distribution for the observable \(\Sigma \) is then straightforward, according to the procedure of Sect. 3.3. There is, however, an important difference with respect to the geometric behaviour model. To make inference on the first missing higher order \(n+1\), the knowledge of the scale dependence \(r_n\) of the order n is required. This is fine, because the order n is known. However, to make inference on the next higher order, the subsequent coefficient \(r_{n+1}\) is necessary. This is not known. In principle, this is not a real problem, as \(r_{n+1}\) can be computed from \(\delta _{n+1}\).Footnote 25 However, to compute the scale dependence at order m, the m-loop \(\beta \)-function is required. Therefore, the scale dependence can only be computed up to the order at which we know the \(\beta \)-function of the theory. In the case of QCD, the \(\beta \)-function is presently known up to 5 loops [53], which implies that the largest coefficient that can be inferred in this model is \(\delta _7\) for observables that are scale independent at LO and \(\delta _6\) otherwise (see Sect. A.1 for further detail).
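As an illustration, the one-parameter structure makes a numerical implementation of Eqs. (5.9)–(5.11) very compact. The following minimal sketch (our own naming; GAMMA is the prior parameter \(\gamma \) of Eq. (5.5)) evaluates the \(j=1\) density, computing the upper incomplete Gamma function by quadrature so that non-positive first arguments are handled as well:

```python
import numpy as np
from scipy.integrate import quad

GAMMA = 1.0  # the prior shape parameter gamma of Eq. (5.5)

def upper_gamma(s, x):
    """Upper incomplete Gamma function Gamma(s, x), for x > 0 and any real s."""
    val, _ = quad(lambda t: t**(s - 1.0) * np.exp(-t), x, np.inf)
    return val

def lambda_n(deltas, rs):
    """Eq. (5.9): deltas = [delta_1, ..., delta_n], rs = [r_0, ..., r_{n-1}]."""
    return max(abs(d) / r for d, r in zip(deltas, rs))

def next_delta_density(deltas, rs, r_n, delta_next):
    """Eq. (5.11) with j=1: the density of delta_{n+1} given the known orders;
    r_n is the scale variation estimator of the last known order."""
    n = len(deltas)
    lam_n = lambda_n(deltas, rs)
    lam_n1 = max(lam_n, abs(delta_next) / r_n)
    return (upper_gamma(GAMMA - n, lam_n1)
            / (2.0 * r_n * upper_gamma(1.0 + GAMMA - n, lam_n)))
```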

5.4 The posterior of the hidden parameter

We now turn our attention to the posterior distribution of the hidden parameter, that as we have seen in the previous model brings interesting information. The posterior for \(\lambda \) is given by

$$\begin{aligned}&P(\lambda |\delta _n, \ldots , \delta _1, \Sigma _0,\mu )\nonumber \\&=\frac{P(\delta _n,\ldots ,\delta _1,\lambda |\Sigma _0,\mu )}{P(\delta _n,\ldots ,\delta _1|\Sigma _0,\mu )} \nonumber \\&=\frac{P(\delta _n,\ldots ,\delta _1|\lambda )P_0(\lambda )}{P(\delta _n,\ldots ,\delta _1)} \nonumber \\&=\frac{P(\delta _n|r_{n-1},\lambda ,\mu )\cdots P(\delta _1|r_0,\lambda ,\mu ) P_0(\lambda |\mu )}{P(\delta _n,\ldots ,\delta _1)} \nonumber \\&= \frac{1}{\Gamma \left( 1+\gamma -n,\lambda _n(\mu )\right) }\,\lambda ^{\gamma -n}\,e^{-\lambda }\,\theta \left( \lambda -\lambda _n(\mu )\right) , \end{aligned}$$
(5.12)

where in the last line we have used the specific expressions for our choice of likelihood and prior, with the definition of \(\lambda _n\) in Eq. (5.9). This distribution is very simple: it has the same functional form as the prior, but the power of \(\lambda \) is reduced by one unit for each non-trivial known order, and the region of small \(\lambda \) is cut out by the theta function.

Fig. 14

The plots show the probability distribution of the parameter \(\lambda \), for the prior with \(\gamma =1\) (left plot) and \(\gamma =0\) (right plot). The observable under consideration is the inclusive Higgs cross section (Sect. 2.5) for fixed scale \(\mu =m_{H}/2\)

In Fig. 14 we show the posterior distribution for our example of Higgs production. In the left plot we use \(\gamma =1\), and in the right plot \(\gamma =0\). We see that the difference is marked for the prior (black dashed curve), but becomes less and less relevant as information is added. Most importantly, the lower limit on \(\lambda \) imposed by the theta function, which plays an important role, is independent of the parameter \(\gamma \). The figure also shows a striking effect of the inference: this lower limit, imposed by \(\theta (\lambda -\lambda _n(\mu ))\), is fully determined by the first non-trivial order, \(\delta _1\). Indeed, \(\lambda _n\) of Eq. (5.9) coincides with \(\left| \delta _1 \right| /r_0=2.72\) for all values of \(n=1,2,3\) (for \(\mu =m_{H}/2\)), since the next orders give \(\left| \delta _2 \right| /r_1=1.53\) and \(\left| \delta _3 \right| /r_2=0.69\), both smaller than the first (and decreasing). This means that, in this case, the actual next order is smaller than what is estimated by the scale variation parameter \(r_k\) times the smallest allowed value of \(\lambda \). This happens despite our definition of \(r_k\), which is normalized to \(\Sigma _k\) rather than \(\Sigma _0\) precisely with the purpose of giving a smaller number. This fact can mean one of two things:

  • either this behaviour is just an accident of the observable under consideration, or

  • the model itself is based on a wrong (or at least non-optimal) assumption.

We will see later that this behaviour is shared by other observables, which seems to suggest that it is indeed the model assumption that is problematic. However, the actual pattern of \(\left| \delta _k \right| /r_{k-1}\) also depends on the scale \(\mu \) at which the coefficients are computed. For instance, for Higgs production, we have verified that for \(\mu >5m_{H}\) they all become of the same order, in agreement with the model assumption. It is thus difficult to draw a sharp conclusion on the goodness of the model based just on these observations.

It is interesting to note that, since for some scales the \(\left| \delta _k \right| /r_{k-1}\) values are decreasing, it may seem convenient to modify the assumption by adding a parameter \({\tilde{a}}\) behaving like a power, namely

$$\begin{aligned} \left| \delta _k(\mu ) \right| \le \lambda {\tilde{a}}^k r_{k-1}(\mu ) \qquad k<k_{\mathrm{asympt}}. \end{aligned}$$
(5.13)

This model would better describe the behaviour of the known orders, but seems unjustified: where should such a power come from? An alternative interpretation could be that the original assumption Eq. (5.2) is meaningful, but it takes a few orders before the bound is homogeneously satisfied, with the first orders being more “unstable”. In this interpretation, it would make sense to allow a violation of the bound. This option will be explored in Sect. B.1.

We stress that using this model as it is can be regarded as a conservative method: indeed, the allowed values of \(\lambda \) are larger than actually needed, thus predicting a larger uncertainty. So for the moment we keep using it, keeping in mind this conservative interpretation.

5.5 Representative results

We now present some representative results of this method, using Higgs production as an example, as we did in Sect. 4.5. In Fig. 15 we plot the distributions for the observable given by Eq. (3.13), using a single higher order (solid lines) or two higher orders (dashed lines) to approximate the true cross section, with different states of knowledge: LO (black), NLO (green), NNLO (blue), \(\hbox {N}^3\hbox {LO}\) (red).

The first striking difference with respect to the geometric behaviour model is that the distribution obtained using two unknown higher orders to approximate the cross section is very different from that obtained using only the first unknown higher order. According to the discussion after Eq. (3.13), we should then keep adding unknown orders to our approximation until the distribution stops changing visibly. Unfortunately, this will likely never happen, and the distribution will likely get broader and broader, with larger and larger tails. The reason comes from the way the model works. When both \(\delta _{n+1}\) and \(\delta _{n+2}\) are used in the approximation for \(\Sigma \), the model needs \(r_{n+1}\) to obtain the distribution for \(\delta _{n+2}\). But \(r_{n+1}\) depends in turn on \(\delta _{n+1}\), which is not fixed, but varies among all possible values, in principle from minus infinity to plus infinity. The trouble is that the model relies on the assumption that the scale variation numbers \(r_k\) are good estimators of the higher orders, which in turn relies on the fact that they behave in a perturbative way too. For many of the possible values of \(\delta _{n+1}\), however, the corresponding \(r_{n+1}\) will violate this assumption and will be much larger than expected, thus predicting large values of \(\delta _{n+2}\). These large values unavoidably contaminate the distribution for \(\Sigma \), making it broader with larger tails. This pattern can only get worse when using more orders in the approximation of \(\Sigma \).

One possible solution to this problem would be to impose some constraint also on \(r_k\), requiring that they behave in a perturbative way, and use this constraint in the inference for the unknown \(\delta _k\). This approach is certainly interesting and potentially very powerful, but it is much more complicated to implement. We will discuss it in Sect. B.2. For the time being we stick to another possible solution, namely restricting our attention to the uncertainty coming from the first unknown higher order only. This approach is acceptable, provided the result is interpreted for what it is: not a probability for the true observable \(\Sigma \), but just the probability for the observable at the next order, \(\Sigma _{n+1}\).

Fig. 15

Same as Fig. 10, but for the scale variation model

With this limitation in mind, we now comment on the shapes of the distributions in Fig. 15, focussing on those using only the first unknown higher order (solid lines).Footnote 26 With the exception of the first one (black), based only on the knowledge of the LO, they all feature a plateau surrounded by exponentially decreasing tails.Footnote 27 The plateau is a direct consequence of the likelihood being a theta function of the absolute value of the order, and of the model depending on a single hidden parameter \(\lambda \). Indeed, the inference on \(\lambda \) sets a lower limit (Fig. 14), which in turn implies that all values of the next order smaller (in absolute value) than the lower limit of \(\lambda \) times the previous \(r_k\) are equally probable. We also see that the distributions shrink nicely when adding information. Because the tails also die faster and faster as the number of known orders increases, these distributions tend to a uniform distribution. Therefore, one can interpret this model as a sort of provider of an “absolute error”: a region where one is almost certain that the next order will lie, but without any clue on where inside that region.
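The plateau can be made explicit with a short numerical sketch (ours) of the single-step predictive density, Eq. (5.11) with \(j=1\): for \(\left| \delta _{n+1} \right| < \lambda _n r_n\) the density is constant, and the tails set in beyond that point.

```python
# Sketch of P(delta_{n+1} | delta_n, ..., delta_1, Sigma_0, mu):
# Eq. (5.11) with j = 1. The density is flat for |delta_{n+1}| < lambda_n * r_n.
import numpy as np
import mpmath as mp

def next_order_density(delta_next, lam_n, r_n, n, gamma=1.0):
    lam_eff = np.maximum(lam_n, np.abs(delta_next) / r_n)  # lambda_{n+1} of Eq. (5.9)
    num = np.array([float(mp.gammainc(gamma - n, float(x))) for x in lam_eff])
    den = 2 * r_n * float(mp.gammainc(1 + gamma - n, lam_n))
    return num / den
```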

Fig. 16

Same as Fig. 11, but for the scale variation model

In Fig. 16 we show the results using quantifiers of the distributions, for two values of the scale, \(\mu =m_{H}/2\) (left) and \(\mu =m_{H}\) (right). We see that the standard deviation is very similar to the 68% DoB interval, and that the 95% DoB interval is only slightly larger than the 68% DoB interval due to the exponential suppression of the tails, unlike what happened in the geometric behaviour model. Also in this case the bands shrink nicely as the order increases, with a good compatibility with the previous bands (remember however that each band now just represents the uncertainty of the next order, and not of the full result). The only exception is the LO, which has small uncertainties not very compatible with the NLO. Note that in this case the distribution at LO is not fully determined by the prior, but also depends on \(r_0\), the scale variation number of the LO. Therefore the small uncertainty may also be due to the fact that \(r_0\) is not a very good representative of the NLO. In any case, here too one must know at least two orders for the inference to work, so the LO uncertainty is not very relevant.

We also compare these results with the canonical scale variation approach. We observe that the “error bars” of the canonical method at NNLO and \(\hbox {N}^3\hbox {LO}\) are close in size (when they are symmetrized) to the standard deviation (or the 68% DoB interval) of our model. This is purely accidental.

5.6 Dependence on the prior

We finally consider variations of the prior parameter \(\gamma \). On top of our default choice (\(\gamma =1\)), we take a smaller value \(\gamma =0\) (more agnostic choice) and a larger value \(\gamma =2\) (assuming a larger prior value for \(\lambda \) of order 2, see discussion in Sect. 5.2). The results of these variations are shown in Fig. 17, for the distributions of the cross section (left plot) and for the statistical estimators at \(\hbox {N}^3\)LO (right plot).

Fig. 17

Dependence of the probability distribution of the observable at various orders on the parameter of the prior (left plot), and statistical estimators for the distribution at \(\hbox {N}^3\hbox {LO}\) for the same choices of parameter (right plot)

The most striking feature of the first plot is that the width of the plateau is independent of the prior. This is a consequence of the fact that the plateau is determined by the theta function in the likelihood, and is thus a feature that does not depend on the prior. What changes when varying the prior is the functional behaviour of the tails of the distributions, and due to the normalization constraint this also modifies the height of the plateau. The differences are however very mild. The only distribution that changes significantly is the one given the knowledge of the LO only, which we recall is fully determined by the prior and thus does not contain any relevant information.

The stability of the result is also manifest when looking at the summary plot (on the right in the figure). All bands are basically identical. Note that one could obtain a larger difference by changing the functional form of the prior, e.g. replacing the exponential form with a power law. However, even in this case only the tails of the distributions would be affected, thus still leading to small (though perhaps more marked) differences. We conclude that this method, despite its limitations, has the advantage of being essentially independent of the choice of the prior.

6 Dealing with scale dependence

So far we have presented two models that allow one to construct a probability distribution for the missing higher orders of a given perturbative expansion at a fixed scale. We have already seen in the examples that the same method applied to the same observable at different scales produces different results. Hopefully, with sufficiently many orders the dependence on the scale will be mild, but this dependence is nevertheless undesirable. In this section we propose an approach to produce a prediction for the observable and its uncertainty that is scale independent.

6.1 The scale as a model parameter

The idea to remove the scale dependence from the probability distribution is very simple: promote the scale \(\mu \) to a parameter of the model. The corresponding Bayesian network is depicted in Fig. 18. In this way, the scale dependence is easily removed by simply marginalizing over \(\mu \):

$$\begin{aligned}&P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0) = \frac{P(\Sigma ,\delta _n,\ldots ,\delta _1,\Sigma _0)}{P(\delta _n,\ldots ,\delta _1,\Sigma _0)} \nonumber \\&= \frac{\int d\mu \, P(\Sigma ,\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )}{P(\delta _n,\ldots ,\delta _1,\Sigma _0)} \nonumber \\&= \frac{\int d\mu \, P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu ) P(\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )}{P(\delta _n,\ldots ,\delta _1,\Sigma _0)} \nonumber \\&= \int d\mu \, P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\, P(\mu |\delta _n,\ldots ,\delta _1,\Sigma _0). \end{aligned}$$
(6.1)

The equation above gives the probability of the observable \(\Sigma \) given the knowledge of the first orders up to \(\hbox {N}^n\hbox {LO}\). The latter are seen not as the values of the perturbative contributions at a given scale, but in a more abstract sense. This probability is expressed as the integral over \(\mu \) of the same probability at fixed \(\mu \), which was given in Eq. (3.13) and was the object discussed so far in this paper, times the posterior distribution for \(\mu \) given the first \(n+1\) orders, which is in turn given by

$$\begin{aligned}&P(\mu |\delta _n,\ldots ,\delta _1,\Sigma _0) = \frac{P(\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )}{P(\delta _n,\ldots ,\delta _1,\Sigma _0)} \nonumber \\&\quad = \frac{P(\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )P_0(\mu )}{\int d\mu \,P(\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )P_0(\mu )}, \end{aligned}$$
(6.2)

where \(P_0(\mu )\) is the prior distribution for the scale \(\mu \). A number of comments are in order.

Fig. 18

The most general Bayesian network of models of inference of missing higher orders, including explicitly the scale \(\mu \) as a parameter over which one can marginalize

The first and most important comment is related to the interpretation of this procedure. Indeed, the fact that \(\mu \) is a parameter of the model implies that it has some sort of “physical” meaning, since it is possible to make inference on it and construct a posterior distribution (Eq. 6.2). That is, inference selects values of the scale that are “better” than others. This seems to be in direct contradiction with the fact that the renormalization scale is unphysical, and thus any value is equally good. In other words, there is no physical motivation for preferring some values of the scale and disfavouring others.Footnote 28

This apparent contradiction is resolved by noting that the probabilistic inference on \(\mu \) selects values that are “better” according to the model. Indeed, the posterior distribution for \(\mu \) is proportional to \(P(\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )\), that is the model-dependent probability of the sequence of the first n orders at the scale \(\mu \). Inference selects scales for which this probability is higher, namely scales for which the perturbative expansion fits well with the assumptions of the model. Therefore, not only is this procedure perfectly acceptable, but it also does what we would have desired, namely selecting the scales with the best perturbative expansion (according to the assumptions of the model), within a well defined probabilistic framework.

As a second comment, we observe that since the scale has become a parameter of the model, a prior distribution \(P_0(\mu )\) has to be specified. This prior distribution will contain our prejudices about the most appropriate scales, and it thus represents in some sense a “residual scale dependence” of the result. However, we will see that if the prior distribution is sufficiently broad and non-informative, the results depend very little on its precise form and size.

Another comment concerns the fact that the only new ingredient needed to compute the scale-independent distribution is the prior \(P_0(\mu )\). Indeed, the distribution \(P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\) in Eq. (6.1) was the subject of the previous sections, and the posterior distribution for \(\mu \) depends on the prior and on \(P(\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )\), the basic model-dependent object discussed in Sects. 4 and 5, which is also needed in the computation of \(P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\), see Sect. 3.3. Therefore, no additional computations are needed to “remove” the scale dependence: scale-independent results can be obtained with just a simple (numerical) integration.
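Schematically, the numerical integration can be organized as in the following sketch (ours; the two callables stand in for the fixed-scale objects of the previous sections, which are assumed to be already implemented):

```python
# Sketch of Eqs. (6.1)-(6.2): marginalizing the fixed-scale distribution
# over mu with a simple grid integration.
import numpy as np

def marginalize_scale(sigma_grid, mu_grid, p_sigma_given_mu, p_data_given_mu, p0_mu):
    """
    p_sigma_given_mu(s, mu_grid) : P(Sigma | delta_n, ..., delta_1, Sigma_0, mu)
    p_data_given_mu(mu_grid)     : P(delta_n, ..., delta_1, Sigma_0 | mu)
    p0_mu(mu_grid)               : prior P_0(mu)
    """
    w = p_data_given_mu(mu_grid) * p0_mu(mu_grid)
    post_mu = w / np.trapz(w, mu_grid)                       # Eq. (6.2)
    dens = np.array([np.trapz(p_sigma_given_mu(s, mu_grid) * post_mu, mu_grid)
                     for s in sigma_grid])                   # Eq. (6.1)
    return dens, post_mu
```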

We can use Eq. (3.11) to write explicitly the scale-independent distribution Eq. (6.1) in terms of fundamental objects. We find, approximating the observable with the order \(n+j\),

$$\begin{aligned}&P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0) \nonumber \\&\quad = \int d\mu \, P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0,\mu )\, P(\mu |\delta _n,\ldots ,\delta _1,\Sigma _0) \nonumber \\&\quad \simeq \int d\mu \int d\delta _{n+j}\cdots d\delta _{n+1}\, \delta \left( \Sigma -\Sigma _{n+j}(\mu )\right) \nonumber \\&\qquad \times \frac{P(\delta _{n+j},\ldots ,\delta _{n+1}, \delta _n,\ldots ,\delta _1,\Sigma _0|\mu )P_0(\mu )}{P(\delta _n,\ldots ,\delta _1,\Sigma _0)}, \end{aligned}$$
(6.3)

having used Eq. (3.15) and Eq. (6.2). As we did in Sect. 3.3, we can use the delta function to compute one of the \(\delta _k\) integrals. Alternatively, we can consider moments of the distribution, and use the delta function to integrate over \(\Sigma \). For the generic p-th moment we have

$$\begin{aligned}&\langle \Sigma ^p\rangle _{\delta _n,\ldots ,\delta _1,\Sigma _0}\nonumber \\&\quad =\int d\Sigma \; \Sigma ^p\, P(\Sigma |\delta _n,\ldots ,\delta _1,\Sigma _0) \nonumber \\&\quad \simeq \int d\mu \int d\delta _{n+j}\cdots d\delta _{n+1}\, \Sigma _{n+j}^p(\mu )\nonumber \\&\qquad \times \frac{P(\delta _{n+j},\ldots ,\delta _{n+1},\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )P_0(\mu )}{P(\delta _n,\ldots ,\delta _1,\Sigma _0)}, \end{aligned}$$
(6.4)

from which we can compute for instance the mean of the distribution and its variance. It is interesting to notice that when the probability \(P(\delta _m,\ldots ,\delta _1,\Sigma _0|\mu )\) is symmetric under \(\delta _k\rightarrow -\delta _k\), as is the case for the geometric behaviour model, the computation of the mean of the distribution simplifies, as the contributions from unknown higher orders integrate to zero. Therefore, in such a symmetric case, the mean reduces to

$$\begin{aligned} \langle \Sigma \rangle _{\delta _n,\ldots ,\delta _1,\Sigma _0}&= \int d\mu \; \Sigma _n(\mu ) \frac{P(\delta _n,\ldots ,\delta _1,\Sigma _0|\mu )P_0(\mu )}{P(\delta _n,\ldots ,\delta _1,\Sigma _0)}, \end{aligned}$$
(6.5)

which depends on the known \(\hbox {N}^n\hbox {LO}\) result \(\Sigma _n(\mu )\). This is the same result that we would obtain when approximating the observable using only the known orders (\(j=0\)). If we were working at fixed scale, this would just be the number that comes out of the computation. But since this number is scale dependent, it generates a distribution, Eq. (6.3) with \(j=0\), whose mean is given by Eq. (6.5). This is a very interesting consequence of the procedure: even a fixed-order computation at order n cannot be regarded as an “exact” prediction at that order, because it depends on the unphysical scale. The distribution Eq. (6.3) with \(j=0\) represents the best we can say about the \(\hbox {N}^n\hbox {LO}\) result, ignoring the missing higher orders. Because the \(\hbox {N}^n\hbox {LO}\) result is now a distribution, one can also compute its standard deviation. This standard deviation would represent a true “scale uncertainty” on the known finite order, but it would not contain any uncertainty from missing higher orders. This is very different from the usual approach, where the “canonical scale uncertainty” is used as an estimate of the missing higher order uncertainty.
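As a simple sketch (ours) of this \(j=0\) case: given the known \(\Sigma _n(\mu )\) and the \(\mu \) posterior, the mean of Eq. (6.5) and the corresponding “scale uncertainty” are two one-dimensional integrals.

```python
# Sketch of the j = 0 distribution: the fixed-order result Sigma_n(mu),
# viewed through the mu posterior, has a mean (Eq. (6.5)) and a standard
# deviation that quantifies a pure scale uncertainty.
import numpy as np

def fixed_order_scale_uncertainty(mu_grid, sigma_n, post_mu):
    s = sigma_n(mu_grid)                       # known N^nLO as a function of mu
    mean = np.trapz(s * post_mu, mu_grid)      # Eq. (6.5)
    var = np.trapz((s - mean)**2 * post_mu, mu_grid)
    return mean, np.sqrt(var)
```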

We conclude by discussing the form of the prior distribution for \(\mu \). In principle, any value of \(\mu \) is allowed, and each value is equally valid, as they all lead to the same exact result. However, in practice some values have to be avoided. For instance, in theories like QCD the coupling grows at small scales and invalidates the perturbative hypothesis. Therefore, the scale (in QCD) cannot become too low. Similarly, very large scales (in QCD) are not advisable, as they lead to small values of the coupling, slowing down the convergence of the expansion. At the same time, physical processes are characterized by one (or more) physical scale(s), and the various orders \(\delta _k\) will contain logarithms of the ratio of such scale(s) with the unphysical scale \(\mu \). If \(\mu \) is very different from the physical scale(s), the logarithms become large and invalidate the perturbative expansion. Therefore, the most convenient and meaningful approach is to use the value of the physical scale(s) to determine a “central value” \(\mu _0\) of the scale \(\mu \), and then vary it in a range that satisfies the previous (or any other) constraints. Considering that the scale dependence is logarithmic, the easiest option is a flat distribution in the logarithm of the scale, namely

$$\begin{aligned} P_0(\mu ) = \frac{1}{2\log F}\frac{1}{\mu } \theta \left( \log F - \left| \log \frac{\mu }{\mu _0} \right| \right) , \qquad F>1, \end{aligned}$$
(6.6)

where F is a factor that sets the size of the interval of allowed scales, assumed to be symmetric about \(\mu _0\) (which is not a necessary condition). Alternatively, one could consider a distribution peaked at \(\mu _0\), for instance a gaussian, perhaps with hard limits to avoid entering the low- and large-scale regions. However, in the spirit of letting the model select the scale that leads to the best convergence properties, a flat distribution seems more natural. The factor F should be sufficiently large to contain a region where good convergence is achieved. This region may either be selected by eye, looking at the behaviour of the expansion as a function of the scale, or with a step-by-step approach where F is enlarged if the posterior for \(\mu \) turns out to be peaked close to or at the boundary of the allowed region.
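For reference, the prior Eq. (6.6) reads, in code (a trivial sketch, with \(\mu _0\) and F as inputs):

```python
# The log-flat prior of Eq. (6.6), normalized on mu0/F < mu < mu0*F.
import numpy as np

def p0_mu(mu, mu0, F):
    inside = np.abs(np.log(mu / mu0)) < np.log(F)
    return np.where(inside, 1.0 / (2.0 * np.log(F) * mu), 0.0)

# Normalization check: the integral over the support is 1.
mu0, F = 1.0, 2.0
grid = np.linspace(mu0 / F, mu0 * F, 100001)
assert abs(np.trapz(p0_mu(grid, mu0, F), grid) - 1.0) < 1e-4
```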

We stress that, despite its simplicity, this approach to obtaining scale-independent results is very innovative, and provides for the first time a consistent way to deal with scale dependence that is compatible with physical requirements.

6.2 A pre-example: application to the Cacciari–Houdeau model

Let us now start to investigate the implications of this method for obtaining scale-independent results. To begin with, we consider the original Cacciari–Houdeau model, described in Sect. 2.4. We are not really interested in obtaining results within the CH model, since we have proposed a new model that can be seen as an improved version of it. Rather, we want to use it to point out the importance of using the normalized variables \(\delta _k\) for defining the model.

To see this, let us start by computing the posterior distribution for \(\mu \) given the knowledge of the LO. In the CH model, indeed, the LO is treated as a source of information. The posterior is given by (using the notation of Sect. 2.4)

$$\begin{aligned} P(\mu |c_0)&= \frac{P(c_0|\mu )P_0(\mu )}{P(c_0)} \nonumber \\&= \frac{P_0(\mu )}{P(c_0)}\int d{\bar{c}}\, P(c_0|{\bar{c}},\mu )P_0({\bar{c}}) \nonumber \\&\propto P_0(\mu )\int _{\left| c_0(\mu ) \right| }^\infty \frac{d{\bar{c}}}{{\bar{c}}^2} = \frac{P_0(\mu )}{\left| c_0(\mu ) \right| }, \end{aligned}$$
(6.7)

where we have used Eqs. (2.20) and (2.21), and assumed that the hidden parameter \({\bar{c}}\) and the scale \(\mu \) are uncorrelated a priori. This computation shows that the posterior distribution for \(\mu \) given the LO is larger for values of \(\mu \) for which the LO \(c_0(\mu )\) is smaller.Footnote 29 This is not particularly clever: very often the NLO and higher corrections are positive, so a larger value of the LO should be favoured instead, to reduce the impact of higher orders and lead to a better-behaved expansion. This is for instance what happens in the Higgs production process that we have used as an example so far.

The underlying reason for the failure of this procedure is that from the LO only it is impossible to decide which scale should be favoured. However, in this model the situation does not improve when adding the NLO. Indeed the posterior becomes

$$\begin{aligned} P(\mu |c_0,c_1) \propto \frac{P_0(\mu )}{\max \left( \left| c_0(\mu ) \right| , \left| c_1(\mu ) \right| \right) ^2}, \end{aligned}$$
(6.8)

which is likely still mostly determined by \(c_0\): being a perturbative correction, \(\left| c_1 \right| \) will typically be smaller than \(\left| c_0 \right| \), so that it never wins in the max function. Therefore, the problem with using the information from the LO to make inference in the model is not solved by adding orders. To avoid this issue one should really treat the LO as just a prefactor, and start the inference procedure from the NLO, as we do in this work. This is another important motivation for using the normalized quantities \(\delta _k\) in the definition of the model.

6.3 Scale-independent geometric behaviour model

We now move to the geometric behaviour model discussed in Sect. 4. This model is more interesting and we will use the whole machinery introduced in Sect. 6.1 to obtain scale-independent distributions and uncertainties.

To begin with, we want to stress that in this case making inference on the normalized perturbative expansion gives the desired posterior distribution for the scale. We start by noticing that the knowledge of the LO \(\Sigma _0\) does not change the distribution of \(\mu \),

$$\begin{aligned} P(\mu |\Sigma _0) = \frac{P(\Sigma _0|\mu )P_0(\mu )}{P(\Sigma _0)} = P_0(\mu ), \end{aligned}$$
(6.9)

which comes from the fact that \(\Sigma _0\) does not appear in the likelihood of the model (Eq. 4.3), and therefore \(P(\Sigma _0|\mu )=P(\Sigma _0)\). Equivalently, if one considers instead the normalized LO \(\delta _0\), for which a likelihood exists, it cannot change the distribution for \(\mu \), because \(\delta _0=1\) is scale independent. This is due to the fact, stressed already several times, that the LO does not carry any information on the behaviour of the expansion, not even when scale dependence is taken into account. The first non-trivial information comes from the NLO coefficient \(\delta _1(\mu )\), which changes the distribution for \(\mu \). The general computation is somewhat more complicated than in the CH case, so we stick to the simple case in which the prior for a is flat (\(\omega =0\)), so that we get

$$\begin{aligned} P(\mu |\delta _1) \propto P_0(\mu ) \left[ \frac{1}{1+\epsilon } + \log \frac{1}{\left| \delta _1(\mu ) \right| }\right] . \end{aligned}$$
(6.10)

This probability is clearly larger for scales such that \(\left| \delta _1(\mu ) \right| \) is smaller (at this order the dependence is only logarithmic, but at higher orders power-behaved contributions appear). This is the expected behaviour, namely inference selects scales for which perturbative corrections are smaller. Note that for such scales the LO \(\Sigma _0(\mu )\) may be larger, to “compensate”Footnote 30 for the next orders (this is what happens in the Higgs case).

Fig. 19

Posterior distribution for the scale \(\mu \) within the geometric behaviour model with different states of knowledge, for the Higgs production process

Understanding the behaviour of the posterior for \(\mu \) beyond NLO is difficult analytically. Therefore we now switch to numerical results. In Fig. 19 we show the posterior for \(\mu \), Eq. (6.2), for our example of Higgs production, for four different states of knowledge: LO (black), NLO (green), NNLO (blue) and \(\hbox {N}^3\hbox {LO}\) (red). The first curve corresponds to our prior, which has been taken to be flat in \(\log \mu \), Eq. (6.6), ranging from \(m_{H}/10\) to \(2m_{H}\). This choice of range is motivated by the fact that scales between \(m_{H}/4\) and \(m_{H}/2\) lead to better convergence properties of the expansion (see e.g. Ref. [5], and also Fig. 4) and are thus used to define the center of the distribution, while the width is set by requiring that the lowest scale (\(m_{H}/10=12.5\) GeV) is still characterised by a sufficiently small strong coupling. Note that the whole range of scales allowed by this prior spans a factor of 20, which is much larger than the usual range of scales considered in standard computations (typically, once the hard scale of the process is identified, the central scale is chosen within a factor of two or four). This means that in our computation we also consider scales that are usually considered too far from the hard scale of the process.

Fig. 20

Similar to Fig. 10, after removing scale dependence

Fig. 21

Left plot: similar to Fig. 11, after removing scale dependence. As a reference the conventional scale variation “error” is also shown for \(\mu =m_{H}/2\). Right plot: dependence on the prior for \(\mu \) of the most precise result (given knowledge of \(\hbox {N}^3\)LO). The first three results correspond to a larger support \(m_{H}/16<\mu <4m_{H}\), the next three to the default support \(m_{H}/10<\mu <2m_{H}\), and the last three to a smaller support \(m_{H}/8<\mu <m_{H}\). In each block, the first result corresponds to a flat prior, the second to a triangular prior, and the third to a truncated gaussian prior

We observe that, as anticipated by the analytic computation, the knowledge of the NLO leads to a preference for smaller scales, even though the preference is mild because the dependence on the NLO coefficient \(\delta _1\) is logarithmic. The knowledge of the NNLO changes the situation more dramatically, giving a net preference for scales around \(\mu \sim 20\) GeV, and strongly suppressing high scales. The peak corresponds to the scale at which the NNLO correction is zero, which incidentally corresponds to the region where the NNLO scale dependence has a plateau, Fig. 4. At \(\hbox {N}^3\hbox {LO}\), the distribution is still peaked, but with a smoother bump towards somewhat larger scales, approximately \(m_{H}/3\). The reason is that the \(\hbox {N}^3\hbox {LO}\) correction is zero at a higher scale, close to \(m_{H}/2\), so there is a trade-off between the preference of the NNLO correction and that of the \(\hbox {N}^3\hbox {LO}\) correction. The high-scale region is further suppressed.

We now move to the distribution for the observable \(\Sigma \). We use the approximation based on the first unknown higher order, as we have seen that it captures well the full uncertainty. The results are shown in Fig. 20. We immediately observe that these distributions are asymmetric, which is a consequence of the fact that scale dependence is non-trivial. At the first two orders, the distributions are very broad and not predictive. However, we notice that the procedure to remove the scale dependence has the effect of favouring larger values with respect to the fixed-scale results, in particular covering (at NLO) the region favoured by the next orders with a high probability. The distribution at NNLO features a high peak towards large values of the cross section, at \(\Sigma \simeq 51\) pb. This peak is a direct consequence of the peak in the posterior of \(\mu \), Fig. 19, and indeed it corresponds to the value of the cross section at the plateau of the NNLO, Fig. 4. Note however that the distribution is very asymmetric, and the mean of the distribution is to the left of the peak, more in line with the next order. Finally, the distribution at \(\hbox {N}^3\hbox {LO}\) is the narrowest, with a marked peak and a slight asymmetry.

To better understand these results, we show at each order the mean, mode, median, standard deviation and degree of belief intervals in Fig. 21 (left plot). The pattern found is similar to that of the fixed-scale result (Fig. 11). One difference is that the uncertainties here are somewhat larger, especially at low orders. This however implies that the convergence pattern is improved, as for instance this time the 68% DoB band is always contained in that of the previous order (which is mostly due to the fact that the NLO band is increased). The mean and median are rather similar in all cases, while the mode is not always close, due to the asymmetry of the distributions. At \(\hbox {N}^3\hbox {LO}\) the prediction is rather precise, certainly much more precise than at previous orders. The mean at \(\hbox {N}^3\hbox {LO}\) is close to the canonical result at \(\mu =m_{H}/2\).

These results clearly depend on the choice of the prior for the scale \(\mu \). Therefore, we now consider variations of the prior, in order to understand to what extent the results depend on it. We consider two types of variations: of the support (range of allowed values), and of the shape. Previously we used a support \(m_{H}/10<\mu <2m_{H}\); we now consider a larger support \(m_{H}/16<\mu <4m_{H}\) and a smaller one \(m_{H}/8<\mu <m_{H}\). Secondly, on top of the uniform prior used before, we consider a symmetric triangular prior (with the same support), and a gaussian-like prior, obtained by cutting the tails after two standard deviations and adjusting the parameters such that the support is again the same. The results are shown in the right plot of Fig. 21 for the most precise result, based on the knowledge of the \(\hbox {N}^3\)LO. We notice that enlarging the support has the effect of enlarging the uncertainty, and the converse is also true. Also, the flat prior obviously gives the largest uncertainty, while the triangular and gaussian priors give similar results. Overall, we see that the dependence on the prior is rather mild, certainly milder than the dependence of the result on the scale when it is kept fixed. For instance, in this case the mean, mode and median are more or less independent of the prior. We conclude, as anticipated, that the “residual scale dependence” induced by the freedom in choosing the prior for \(\mu \) is very much reduced with respect to the canonical scale dependence of the result. We can thus consider these results as almost scale independent.

6.4 Scale-independent scale variation model

We now apply the approach for removing the scale dependence to the scale variation model. It may seem somewhat contradictory to use the information on the scale dependence twice, once for estimating the higher orders and once for removing the scale dependence. However, there is nothing problematic in practice, because, as in the previous case, the procedure to remove the scale dependence just selects, through inference, scales at which the model performs best, independently of how the model itself works.

To begin with, we show the posterior for \(\mu \) in this model. According to the results of Sect. 5.3, from Eq. (6.2) we get

$$\begin{aligned} P(\mu |\delta _n,\ldots ,\delta _1,\Sigma _0) \propto \frac{P_0(\mu )}{r_0(\mu )\cdots r_{n-1}(\mu )} \int _{\lambda _n(\mu )}^\infty d\lambda \,\lambda ^{\gamma -n}e^{-\lambda }, \end{aligned}$$
(6.11)

where \(\lambda _n(\mu )\) is given in Eq. (5.9). The integral can be computed analytically, but written in this way it is clear that the integral is larger the smaller \(\lambda _n(\mu )\) is. If the numbers \(r_k\) were independent of \(\mu \), the posterior would thus be larger for scales for which the largest \(\left| \delta _k(\mu ) \right| /r_{k-1}(\mu )\) with \(k\in [1,n]\) is small. Namely, it would select the scales for which the perturbative corrections are small, which is indeed what we expect this approach to do. Of course, the numbers \(r_k\) do depend on \(\mu \), so this picture is modified by their presence in the prefactor. However, their dependence is mild (see Fig. 6), and typically milder than that of the \(\delta _k\)’s, so the argument above gives a sufficiently good description of how the marginalization over \(\mu \) works in this model.
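In code, Eq. (6.11) takes the following schematic form (a sketch under the same assumptions as before, with the \(\delta _k(\mu )\) and \(r_k(\mu )\) provided as callables):

```python
# Sketch of the mu posterior in the scale variation model, Eq. (6.11).
import numpy as np
import mpmath as mp

def mu_posterior_sv(mu_grid, deltas_of_mu, rs_of_mu, p0_mu, gamma=1.0):
    post = []
    for mu in mu_grid:
        ds, rs = deltas_of_mu(mu), rs_of_mu(mu)  # [delta_1..delta_n], [r_0..r_{n-1}]
        n = len(ds)
        lam_n = max(abs(d) / r for d, r in zip(ds, rs))
        # the integral in Eq. (6.11) equals Gamma(1+gamma-n, lambda_n)
        post.append(float(mp.gammainc(1 + gamma - n, lam_n)) / np.prod(rs) * p0_mu(mu))
    post = np.array(post)
    return post / np.trapz(post, mu_grid)
```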

Fig. 22

Same as Fig. 19, but for the scale variation model

In order to understand exactly what happens, we now move to the numerical analysis. In Fig. 22 we show the posterior of \(\mu \) for this model, in analogy with the discussion of the previous section. We see that this time, in the range under consideration, both NLO and NNLO tend to favour small scales. At \(\hbox {N}^3\hbox {LO}\), too small scales are disfavoured, and the distribution becomes peaked at \(\mu \sim m_{H}/4\). As in the previous case, high scales are strongly disfavoured.

Fig. 23

Same as Fig. 20, but for the scale variation model

The probability distribution for the cross section is shown in Fig. 23. Again, after removing the scale dependence the distribution becomes asymmetric. Moreover, since at low orders smaller scales are favoured, larger values of the cross section are more probable, in line with the results of the higher orders. The distribution at \(\hbox {N}^3\hbox {LO}\) has a sort of plateau and strongly decreasing tails, giving a well localised result. However, the uncertainty seems rather large, and specifically larger than that of the result at fixed scale, Fig. 15.

Fig. 24

Same as Fig. 21, but for the scale variation model

Finally, in Fig. 24 (left), we show the summary of the results using the usual quantifiers: mean, mode, median, standard deviation, degree of belief intervals. With respect to the fixed-scale results, the uncertainties are larger, and the compatibility between different orders is consequently improved. The mean of the distribution is rather stable from NLO onwards, suggesting that perhaps these uncertainties are overestimated (as we pointed out also when discussing the model at fixed scale, Sect. 5). At \(\hbox {N}^3\hbox {LO}\), because the tails die rapidly, the 95% DoB interval is only slightly larger than the 68% DoB interval, suggesting that the interpretation of this model as a provider of some sort of “absolute error” remains valid also after computing the scale-independent result.

In the same figure, the right plot shows the dependence of the most precise result on the choice of prior. The variations are the same as those introduced in Sect. 6.3. We see that for a wider support of the prior (first three results) the uncertainty increases quite significantly, with the exception of the triangular prior (second result). The explanation is that when pushing the smallest scale down to \(m_{H}/16\), the calculation of the \(r_k\), which probes scales up to a factor of 4 smaller than the current \(\mu \), gets contributions from scales as small as \(m_{H}/64\sim 2\) GeV, where \(\alpha _s\) becomes large. Indeed, from Fig. 6 we see that \(r_2\) and \(r_3\) explode below \(\sim m_{H}/10\), implying that for those scales the uncertainty in this model becomes huge. Therefore, when this region is included in the support for \(\mu \), it contaminates the whole distribution with a very broad contribution, thereby leading to the large uncertainties seen in the figure. The triangular distribution is an exception because it goes to zero at the endpoint, thereby suppressing part of the problematic region. Reducing the range or changing the shape, instead, has a much milder, almost negligible effect. We conclude that in this model it is crucial to keep the region of allowed scales within a sensible range, in order to avoid artificially large contributions. In particular, low scales may be problematic because of the large strong coupling.

7 Validation and applications

In this section we consider a number of examples of applications of our methods. First, we focus on examples with a known sum, which will serve as a validation of the procedure. Some of them are just mathematical series without a physical meaning, others are physics examples. We then move to some applications where the true sum is not known. We focus on observables that are known to a high order, so that the results are more significant.

7.1 Convergent series

We start by considering the simplest case, namely a convergent series. This example may not be too significant as we know that perturbative expansions are divergent, but it represents a good check of the machinery. We consider a quasi-geometric series, namely

$$\begin{aligned} \Sigma = \sum _{k=0}^\infty \alpha _s^k A^k \cos (B k) = \frac{1-A\alpha _s\cos (B)}{1+A^2\alpha _s^2-2A\alpha _s\cos (B)}. \end{aligned}$$
(7.1)

The presence of the parameter B induces some “oscillations” in the perturbative expansion, and in the limit \(B=0\) (or a multiple of \(\pi \)) we recover a geometric series. This series is convergent for \(\left| A\alpha _s \right| <1\), and it is bounded by the geometric series \(\sum _k\left| A\alpha _s \right| ^k\). In the following, we fix \(A=4\) and \(B=2\), to have sufficiently large perturbative corrections and a sufficiently non-trivial pattern. We take \(\alpha _s\) to be \(\alpha _s(m_{Z})=0.118\), so that the sum is \(\Sigma =0.740531\). Note that the effective expansion parameter of this series is \(A\alpha _s\), but we keep the two separate because we also want to induce an artificial scale dependence, to test the scale variation model and the procedure to remove scale dependence. This is obtained by changing the scale of \(\alpha _s\) from \(m_{Z}\) to a generic \(\mu \), compensating for this by adding the proper scale-dependent contributions to the expansion coefficients, according to the procedure described in Sect. A.1.
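As a quick sanity check (ours), the closed form and the partial sums of Eq. (7.1) are reproduced by a few lines of code:

```python
# Partial sums of the toy convergent series Eq. (7.1) vs its closed form.
import numpy as np

A, B, als = 4.0, 2.0, 0.118
exact = (1 - A * als * np.cos(B)) / (1 + (A * als)**2 - 2 * A * als * np.cos(B))
partial = np.cumsum([(A * als)**k * np.cos(B * k) for k in range(30)])
print(exact)        # 0.740531..., the value quoted above
print(partial[:5])  # the partial sums seen by the models up to N^4LO
```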

Fig. 25

Geometric behaviour model applied to the toy convergent series Eq. (7.1). The left plot shows the distributions for the observable, and the right plot shows a summary of the quantifiers of these distributions. The thin solid black lines represent the exact result

Let us start by ignoring the scale dependence, and consider the expansion Eq. (7.1) with \(\alpha _s=\alpha _s(m_{Z})\). In this case, we can only use the geometric behaviour model of Sect. 4. Of course, the series Eq. (7.1) satisfies the assumption of the model, Eq. (4.2), and so the model is expected to work well. We assume to know the first five orders, up to \(\hbox {N}^4\hbox {LO}\). In Fig. 25 (left plot) we show the probability distribution for the next order given the knowledge of the first orders up to \(\hbox {N}^4\hbox {LO}\). We see that by adding information the distributions become narrower, and they move closer to the exact result, represented by the vertical black thin line. In particular, the last two distributions (given the knowledge of the \(\hbox {N}^3\hbox {LO}\) and \(\hbox {N}^4\hbox {LO}\)) are in excellent agreement with the exact result. In the same figure (right plot) we show the quantifiers (mean, standard deviation, degree of belief intervals) of these distributions (ignoring the one using only the knowledge of the LO, which is purely determined by the priors and is thus not significant). We see that all results are well compatible with the exact result, which is always within one standard deviation, and almost always also within the \(68\%\) DoB interval. Increasing the knowledge, the uncertainty clearly shrinks, reaching a very high precision at \(\hbox {N}^4\hbox {LO}\) (and fairly good precision already at \(\hbox {N}^3\hbox {LO}\)). These results show that, at least in this somewhat trivial example, the geometric behaviour model works very well, as expected.

Fig. 26

Left plot: the distributions obtained applying the scale variation model to the toy series Eq. (7.1) assuming \(Q=m_{Z}\) and \(\mu =Q\). Right plot: the scale dependence of the observable. The exact result is shown as a thin solid black line

Let us now assume that the series Eq. (7.1) is a QCD observable, so that we can consider its scale dependence. We can thus use the scale variation model of Sect. 5. The probability distributions obtained with this model are shown in Fig. 26 (left plot). Here we see something strange and undesired, namely the distributions become broader when adding the knowledge of the NNLO and of the \(\hbox {N}^3\hbox {LO}\). Note that all of them are well compatible with the exact result, so the method is accurate, but the uncertainties are large, even at \(\hbox {N}^4\hbox {LO}\), so the method is not precise. This peculiar pattern can be understood by looking at the scale dependence of the (fake) observable (Fig. 26, right plot). We observe that the convergence pattern of the expansion deteriorates quickly for scales \(\mu <Q\equiv m_{Z}\), and it improves visibly at higher scales. Therefore, for this observable high scales have to be favoured.

Fig. 27

Left plot: same as Fig. 26 but for \(\mu =4Q\). Right plot: the posterior distribution for \(\lambda \) at \(\mu =4Q\)

Let us take for example \(\mu =4Q\) and look again at the distributions of the scale variation model. In Fig. 27 (left plot) we see that indeed the distributions have a better pattern, and the distribution with the largest amount of information is rather precise (even though not at the level of the geometric behaviour model). However, the distribution at NLO is still narrower than that of the next order. This undesired behaviour is an artefact of the fact that the LO is scale independent. Indeed, had we literally used the scale independence of the LO, the number \(r_0\) would be zero, forcing \(\lambda \) to be infinite and making the model not predictive. In Sects. 3.2 and 5.1 we have suggested to arbitrarily set \(r_0=1/2\) when the LO is scale independent. This value is sufficiently large to give unimportant restrictions on \(\lambda \), with the drawback that it allows small values of \(\lambda \), leading to an artificially narrow distribution at NLO. This can be appreciated by looking at Fig. 27 (right plot), where we show the posteriors for \(\lambda \). We see that the knowledge of the NLO gives a very mild restriction on \(\lambda \), namely \(\lambda \ge \left| \delta _1 \right| /r_0\), which is small because \(r_0\) is large. Conversely, at the next order, when the first real information on the scale dependence (\(r_1\)) is used, the smallest allowed value of \(\lambda \) becomes much larger. This implies that the posterior \(P(\lambda |\delta _1,\Sigma _0,\mu )\), used in the construction of the green curve of Fig. 27 (left), is inaccurate and thus leads to an inaccurate prediction. We conclude that, when the LO is scale independent, the knowledge of the NNLO is needed in order to obtain the first meaningful prediction from this model. Note also that the subsequent orders do not push the lowest value of \(\lambda \) forward, similarly to what happens in the Higgs case. This suggests again that this model is not well suited for precise predictions, as the assumption itself is not optimal. For a possibly better model, see Sect. B.1.

Fig. 28

Results after applying the procedure of Sect. 6 to remove the scale dependence, using a flat prior in the range \(Q<\mu <50Q\). The upper plots show the distributions (left) and the summary (right) for the geometric behaviour model, while the lower plots are the same for the scale variation model. The thin solid black lines represent the exact result

We finally use the procedure of Sect. 6 to obtain scale-independent results. According to Fig. 26 (right), it is clear that scales larger than Q have to be favoured. We consider a large range, namely \(Q<\mu <50Q\), to be used as the support of our (flat) prior for \(\mu \). The distributions obtained after marginalizing over \(\mu \) are shown in Fig. 28 (left) for both the geometric behaviour model (above) and the scale variation model (below); the summary of the distributions through the usual quantifiers is also shown (right). In these last plots we also show the conventional scale “error” assuming a central scale \(\mu =Q\). For the geometric behaviour model, we see that the convergence pattern is still excellent, and the uncertainty shrinks more rapidly than in the result at fixed scale, providing very precise predictions both at \(\hbox {N}^3\hbox {LO}\) and at \(\hbox {N}^4\hbox {LO}\), perfectly compatible with the exact result. For the scale variation model the situation is slightly improved with respect to the fixed-scale result, but the uncertainties are still rather large, though of course well compatible with the exact result (with the exception of the NLO, for the reasons mentioned above). We conclude that all the methods proposed so far work as expected, and in particular the geometric behaviour model performs extremely well for a series that is compatible with its assumptions.

7.2 Divergent series

We now move to a somewhat opposite case, namely a factorially divergent series. Provided the expansion parameter is sufficiently small, the expansion has the behaviour of an asymptotic series (see Fig. 3). Therefore, it should be a decent representative of actual perturbative expansions, and thus a good validation test for our models. We consider the expansion

$$\begin{aligned} \Sigma = \alpha _s\sum _{k=0}^\infty \alpha _s^k A^k k! = -\frac{1}{A} \exp \left( -\frac{1}{A\alpha _s}\right) \Gamma \left( 0, -\frac{1}{A\alpha _s}\right) , \end{aligned}$$
(7.2)

where the analytic sum (which is the Borel sum, or equivalently the function of which the series is the asymptotic expansion) is strictly speaking valid only for \(A<0\) (or, if A is complex, for values of A that are not real and positive). However, for \(A>0\) the Borel sum can still be computed up to an ambiguous contribution, which is exponentially suppressed in the coupling and which we thus ignore. Note that we have included a factor of \(\alpha _s\) in front, so that the expansion starts at \({\mathcal {O}}(\alpha _s)\). This choice allows us to have a LO that is scale dependent, avoiding the issues discussed in the previous case when using the scale variation model. The series Eq. (7.2) is assumed to be written at \(\mu =Q\equiv m_{Z}\), with \(\alpha _s=\alpha _s(m_{Z})=0.118\), and the expansion at generic \(\mu \) can be obtained with the procedure described in Sect. A.1. In the following, we will consider both cases \(A=-1\) and \(A=1\), dubbed respectively the factorially divergent series with alternating signs and with same signs.
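Before applying the models, it is instructive to locate the optimal truncation order of Eq. (7.2) numerically (a sketch, ours): the terms \(\alpha _s^{k+1} A^k\, k!\) shrink until \(k\sim 1/(\left| A \right| \alpha _s)\) and grow factorially afterwards.

```python
# Optimal truncation of the factorially divergent toy series Eq. (7.2).
import math

A, als = 1.0, 0.118
terms = [als**(k + 1) * A**k * math.factorial(k) for k in range(20)]
k_opt = min(range(20), key=lambda k: abs(terms[k]))
print(k_opt)  # roughly 1/alpha_s: beyond this order the terms start growing
```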

Fig. 29

Geometric behaviour model applied to the toy divergent series Eq. (7.2), for \(A=-1\) (upper plots) and \(A=1\) (lower plots). The left plot shows the distributions for the observable, and the right plot shows a summary of the quantifiers of these distributions. The thin solid black lines represent the exact result

Let us start again by ignoring the scale dependence and thus using only the geometric behaviour model. The results (distributions and summary in terms of quantifiers) are shown in Fig. 29. We observe that in the case of alternating signs (\(A=-1\)) the method works very well, with the distributions shrinking and converging towards the exact result. This is a consequence of the expansion parameter being sufficiently small to give an approximately convergent pattern at low orders, as we hope is the case for physical observables. The prediction is thus accurate, and increasing the order it also becomes very precise. In the case of all same signs (\(A=1\)) the situation is different. The distributions shrink and move towards the exact result, without “reaching” it. In this case, it is clear that the shape of the distribution is inadequate, as the exact result always lies in the tail of the distributions. This behaviour is not surprising, as this series violates the assumptions of the model in a “maximal” way (the series can be considered as the bound of a physical expansion).

Fig. 30

Left plots: the distributions obtained applying the scale variation model to the toy series Eq. (7.2) assuming \(Q=m_{Z}\) and \(\mu =Q\), for alternating signs (above) and same signs (below). Right plots: the scale dependence of the observable in either case. The exact result is shown as a thin solid black line

Let us now include the scale dependence, and consider the scale variation model. In Fig. 30 (left plots) we show the distributions obtained with this model in the two cases, for \(\mu =Q\equiv m_{Z}\). For alternating signs, the accuracy of the model is manifest, with the distributions always covering the exact result (within the plateau region, when present), and getting narrower as the order increases. However, the precision is not at the level of the geometric behaviour model, as usual. When the coefficients of the series all have the same sign, a pattern similar to that of the geometric behaviour model is observed. In the same figure, the right plots show the (artificial) scale dependence of our observable, as we constructed it. Note that the range is different, as in one case (alternating signs) larger scales lead to a better convergence, while in the other case (same signs) lower scales give a better expansion (this peculiar behaviour is just a consequence of the artificial way in which the scale dependence is introduced).

Fig. 31

Scale independent results for alternating signs (upper plots) and same signs (lower plots). The thin solid black lines represent the exact result

Following this observation, we now apply the procedure of Sect. 6 to remove the scale dependence, using flat priors for \(\mu \) in the range \(Q/2<\mu <16Q\) for alternating signs and \(Q/8<\mu <8Q\) for same signs. The results are shown in Fig. 31. For the alternating-signs series (upper plots), removing the scale dependence improves the precision of both the geometric behaviour model and the scale variation model, while keeping excellent accuracy. This provides an important validation of all our methods. For the same-sign series, marginalizing over the scale improves the accuracy a bit, especially for the geometric behaviour model, where the exact result becomes compatible at least within the 95% DoB interval also at high orders. While not perfect in this case, our methods are certainly superior to the conventional scale “error”, also shown in the plots.

Note that the methods can improve if the sign pattern is taken into account. The two expansions considered are characterized by either all-positive corrections or corrections with alternating signs. If the pattern is recognised within the model, then the prediction of the next orders can be performed accordingly. We will investigate this possibility in Sect. B.5.

7.3 Anharmonic oscillator in quantum mechanics

We now consider a physics example, namely a quartic anharmonic oscillator in quantum mechanics. Using the notation of Ref. [54], its Hamiltonian is given by

$$\begin{aligned} H = p^2 +x^2 + \alpha x^4, \end{aligned}$$
(7.3)

where \(\alpha \) is the “coupling”, namely the perturbative expansion parameter. The energy levels of this anharmonic oscillator as a perturbative expansion in \(\alpha \) are known to diverge [55]. The ground-state energy is given by

$$\begin{aligned} E_0(\alpha ) = 1+2\sum _{k=1}^\infty \frac{A_k}{2^k} \alpha ^k, \end{aligned}$$
(7.4)

where the first 75 \(A_k\) coefficients can be found in Ref. [55]. We recall that these coefficients have alternating signs, namely \(A_k = (-1)^{k-1}\left| A_k \right| \). They grow faster than a power, but slower than a factorial. The exact result for this ground-state energy has been computed in Ref. [54] through a Borel–Padé method. We consider the value \(\alpha =0.1\), which is the smallest value reported in that paper, corresponding to the sum \(E_0(0.1)=1.0653\). For this value the series already starts diverging quite soon: the optimal truncation order of the asymptotic expansion is \(k_{\mathrm{asympt}}=6\).

Fig. 32

Geometric behaviour model applied to the ground-state energy of the quartic anharmonic oscillator in quantum mechanics. The left plot shows the distributions for the observable, and the right plot shows a summary of the quantifiers of these distributions. The thin solid black lines represent the exact result

Since this is a quantum-mechanical system, there is no unphysical scale dependence, and therefore we can only use the geometric behaviour model of Sect. 4. The distributions for the ground-state energy given different states of knowledge are shown in Fig. 32 (left). We observe that starting from \(\hbox {N}^3\hbox {LO}\) the distributions are quite narrow, and oscillate around the exact value, because of the alternating signs of the \(A_k\) coefficients. In fact, the prediction does not become very accurate and precise, essentially because we are very close to the asymptotic point \(k_{\mathrm{asympt}}=6\) that sets the limit of validity of the assumptions of the model.

This can also be appreciated by looking at the summary of the distributions through the usual quantifiers in the right plot of the same figure. It may seem surprising that the quality of the uncertainty, though perfectly acceptable (the exact result is always within the 95% DoB interval), is not as good as in the similar case of the factorially divergent series discussed in Sect. 7.2. The reason is that the effective coupling, namely the parameter determining the power growth of the expansion, is larger in this case than in the factorially divergent case. In fact, we are now in a situation where the expansion is barely perturbative, and therefore all our assumptions, including the basic ones of Sect. 2.1, are barely satisfied. Nevertheless, with all these limitations in mind, the method works rather well. Note that taking into account the sign pattern (alternating signs) in the model, as described in Sect. B.5, can help improve the accuracy of the results.

7.4 Higgs production in the threshold limit

We conclude the examples used for the validation of the models with a QCD process. Since, to our knowledge, there are no QCD observables in the perturbative regime for which the exact result is known, we use a trick. Namely, we consider a purely all-order resummed result at a finite logarithmic accuracy, and consider it as if it were the full (exact) result. We expand it in powers of \(\alpha _s\) to obtain the first few perturbative orders, to which our methods are applied.

This procedure obviously misses contributions that are not logarithmically enhanced, which may constitute an important part of the observable. Therefore, it cannot be regarded as a validation of a complete QCD perturbative expansion, but only of a part of it (the logarithmically enhanced contributions), which may in general behave differently from the full result. This limitation can be minimized by considering threshold resummation of an inclusive observable. Indeed, threshold logarithms reproduce a number of structures that are present in the full result (e.g., plus distributions), and they often offer a good approximation to the full result [56]. Additionally, at all orders the threshold contributions diverge factorially [57], as the full expansion does. Therefore, a threshold approximation can be regarded as a good representative of a full QCD perturbative expansion.

The process we consider is once again Higgs production in gluon fusion. Its threshold resummation is known to a high logarithmic accuracy, \(\hbox {N}^3\hbox {LL}\) (see e.g. Ref. [58]), and it is well known to provide a good approximation of the full result order by order (see e.g. Ref. [5]). More precisely, we take the so-called \(\psi \)-\(\hbox {soft}_2\) resummation [58], with the so-called default choice for the constant terms [5]. Using the same setup as Sect. 2.5, at the scale \(\mu =m_{H}\) (and \(\mu _{\scriptscriptstyle \mathrm F}=m_{H}/2\)) the partial sums of the perturbative expansion of the cross section are given byFootnote 31

$$\begin{aligned} \Sigma _{\mathrm{pert}}(m_{H}) = \left\{ 13.0, 28.0, 37.8, 42.4 \right\} \; \text {pb}. \end{aligned}$$
(7.5)

These numbers are very similar to the exact ones, Eq. (2.29): they are slightly smaller, but given the large size of the perturbative corrections, this approximation captures well the behaviour of the expansion. The “exact” result, namely the all-order resummed one, is instead given by

$$\begin{aligned} \Sigma = 45.0\; \text {pb}. \end{aligned}$$
(7.6)

Note that this result is in fact scale dependent, as it is not really exact. The scale dependence is induced by terms beyond \(\hbox {N}^3\hbox {LL}\) and beyond the threshold approximation. In practice, this dependence is rather mild, and certainly much milder than the dependence of the fixed-order result. Therefore, to a first approximation, we simply ignore it and use the value of Eq. (7.6), which is computed at \(\mu =m_{H}/2\) where the convergence is faster, as if it were the exact one.
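As a quick diagnostic of the apparent convergence of Eq. (7.5) (our own check, not part of the probabilistic models), one can compute the relative size of consecutive corrections:

# Relative corrections of the partial sums in Eq. (7.5): a rough
# convergence diagnostic, not part of the probabilistic models.
sigma = [13.0, 28.0, 37.8, 42.4]  # pb, Eq. (7.5)
rel = [(b - a) / a for a, b in zip(sigma, sigma[1:])]
print(rel)  # ~[1.15, 0.35, 0.12]: each correction is roughly 1/3 of the previous

The corrections shrink by a roughly constant factor, which is precisely the kind of behaviour that the geometric behaviour model is designed to capture.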

Fig. 33 Scale dependence of the purely resummed expanded Higgs cross section (settings as in Fig. 4). The thin solid black line represents the “exact” result

Fig. 34 The geometric behaviour model (upper plots) and scale variation model (lower plots) at fixed scale \(\mu =m_{H}/2\) (left plots) and after marginalizing over the scale (right plots) for the resummed Higgs production cross section at \(\hbox {N}^3\)LL. The thin solid black lines represent the all-order resummed result, used as “exact” result in this example

We start by showing the scale dependence of the various orders in this example in Fig. 33. This plot looks very similar to that of Fig. 4, as expected given the quality of the threshold approximation. Therefore, the probability distributions that we get from the models are also very similar, and we do not report them here. We also observe that the “exact” result is rather close to the plateau of the \(\hbox {N}^3\hbox {LO}\) result.

In Fig. 34 we show the summaries of the uncertainties using the usual quantifiers. We see that all results (with the exception of the scale variation model at LO) are well compatible with the “exact” result, especially after marginalizing over the scale. These plots are very similar to those of Figs. 11, 16, 21 and 24, and thus provide a strong validation of those results. We can also see that the canonical scale variation “error” is acceptable at NNLO and \(\hbox {N}^3\hbox {LO}\), but it underestimates the uncertainty at NLO, and strongly so at LO, even at the “optimal” scale \(\mu =m_{H}/2\) used in the plots: the pattern may become much worse at different (e.g. higher) scales.

7.5 \(e^+e^-\) into hadrons at \(\hbox {N}^4\hbox {LO}\)

We now consider some applications of our methods to real physical processes of interest for which the exact result is not known. We start from the classical process of \(e^+e^-\rightarrow \text {hadrons}\), which is known in QCD up to \(\hbox {N}^4\hbox {LO}\) [59]. We define the observable \(\Sigma \) as the ratio of the function R(s) describing the processFootnote 32 to its LO \(R_0(s)\),

$$\begin{aligned} \Sigma = \frac{R(s)}{R_0(s)} = 1+\frac{\alpha _s(s)}{\pi }+1.40923\left( \frac{\alpha _s(s)}{\pi }\right) ^2 -12.805\left( \frac{\alpha _s(s)}{\pi }\right) ^3-80.434 \left( \frac{\alpha _s(s)}{\pi }\right) ^4+{\mathcal {O}}(\alpha _s^5), \end{aligned}$$
(7.7)

where s is the squared center-of-mass energy of the collision, the renormalization scale is \(\mu =\sqrt{s}\), and we have assumed \(n_f=5\) active flavours for the computation of the numerical coefficients of the expansion [59]. We choose the collider energy to be \(\sqrt{s}=10\) GeV, so that the bottom quark is active (\(n_f=5\)) and the scale is sufficiently large to be in the perturbative regime, while remaining sufficiently small to neglect the contribution from a virtual Z boson.
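For illustration, the partial sums of Eq. (7.7) can be evaluated with a short script; the value \(\alpha _s(10\,\mathrm{GeV})\approx 0.18\) used below is an assumption of this sketch, since the precise value depends on the order and setup of the running of the coupling:

import math

# Partial sums of Eq. (7.7). The value alpha_s(10 GeV) ~ 0.18 is an
# assumption of this sketch; the precise value depends on the running setup.
a = 0.18 / math.pi
coeffs = [1.0, 1.0, 1.40923, -12.805, -80.434]
s, partial = 0.0, []
for k, c in enumerate(coeffs):
    s += c * a**k
    partial.append(s)
print(partial)  # the last two corrections are negative and rapidly shrinking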

Fig. 35 Left: scale dependence of the \(e^+e^-\rightarrow \text {hadrons}\) process at \(\sqrt{s}=10\) GeV. Right: posterior distribution for the scale for the geometric behaviour model

The “raw” result for this process, shown as a function of the renormalization scale, is depicted in Fig. 35 (left). With the exception of the LO, which is scale independent (it corresponds to the lower edge in the plot), we observe that, as the order increases, the scale dependence flattens out. At \(\hbox {N}^4\hbox {LO}\), the result is very stable under scale variation. We also observe an apparently convergent pattern at all scales except the smallest ones, where of course the larger value of \(\alpha _s\) makes the perturbative expansion badly behaved.

Fig. 36 Fixed scale and scale independent results for the \(e^+e^-\rightarrow \text {hadrons}\) process, for both the geometric behaviour model (upper plots) and the scale variation model (lower plots)

We now apply our probabilistic methods to this expansion. We consider both the geometric behaviour model and the scale variation model, both at fixed scale \(\mu =\sqrt{s}\) and after marginalizing over the scale, using a flat prior in the range \(\sqrt{s}/3<\mu <10\sqrt{s}\). The distributions and their summary in terms of the usual quantifiers are shown in Fig. 36. Let us start from the geometric behaviour model. At fixed scale, the distributions from NNLO onwards are quite narrow and localized, giving rise to precise predictions. However, at least at NNLO, the prediction is not very accurate, as the next orders tend to favour values towards the tail of the NNLO distribution. Indeed, the 68% DoB region is very small, and the next orders are compatible only within the 95% DoB interval. The fact that the precision of the NNLO distribution is not faithful is also seen from the large standard deviation of the distribution, due to the slowly decreasing tails. We observe that after marginalizing over the scale the pattern improves significantly. The NNLO distribution becomes very asymmetric, covering with higher probability a region of smaller cross section. Indeed, in this case the 68% DoB interval is larger and covers the next orders. We see once again that marginalizing over the scale, on top of removing a bias, leads to more accurate results. A peculiar feature of the scale independent result is the bimodal distribution at \(\hbox {N}^4\hbox {LO}\). This is a consequence of the scale dependence, shown in Fig. 35 (left), where we see that there are two scales for which the \(\hbox {N}^4\hbox {LO}\) correction vanishes. One value, approximately \(\mu \sim 3\sqrt{s}\), leads to a larger cross section, close to the NNLO result, which makes this scale more favourable and produces the higher peak of the distribution. The other value, approximately \(\mu \sim 0.6\sqrt{s}\), leads to a smaller cross section and produces the secondary peak. This value is less favoured because the \(\hbox {N}^3\hbox {LO}\) correction is larger there, but at a slightly higher scale all corrections are rather small, and in particular the NNLO correction also vanishes. Consequently, the posterior distribution for \(\mu \), shown in Fig. 35 (right), is itself bimodal, with its secondary peak coinciding with the peak of the NNLO posterior. Note that in the end the bimodality of the distribution for \(\Sigma \) is not a problem, and indeed the uncertainty shown through the quantifiers is rather small (though slightly asymmetric).

The scale variation model gives, as usual, broader distributions and larger uncertainties, though the tails are exponentially suppressed and therefore the “support” of the distribution is more localized. The accuracy of the results is good, with a mild exception for the NNLO result at fixed scale, which covers the next orders only within the 95% interval. As expected, this improves after marginalizing over the scale, which in turn also gives more peaked distributions at higher orders. The uncertainty estimate of the scale independent results is certainly more conservative than that obtained with the geometric behaviour model, and consequently very reliable but less precise.

7.6 Higgs decay to gg at \(\hbox {N}^4\hbox {LO}\)

Other QCD observables known at \(\hbox {N}^4\hbox {LO}\) are the decay widths of the Higgs boson into a \(b{\bar{b}}\) pair [60, 61] and into gluonsFootnote 33 [61]. The former shares various similarities with the \(e^+e^-\rightarrow \text {hadrons}\) process discussed in Sect. 7.5, and the application of our methods leads to very similar results. Therefore, we do not consider it and focus on the decay into gluons. We define the observable \(\Sigma \) as the decay width

$$\begin{aligned} \Sigma = \Gamma _{H\rightarrow gg}(\mu ), \end{aligned}$$
(7.8)

which, for \(\mu =m_{H}\), leads to the expansion [61]

$$\begin{aligned} \Sigma = \Gamma _0(m_{H})\left[ 1+5.70305\,\alpha _s(m_{H}) +15.5120\,\alpha _s^2(m_{H}) +12.666\,\alpha _s^3(m_{H}) +69.329\,\alpha _s^4(m_{H})+{\mathcal {O}}(\alpha _s^5)\right] , \end{aligned}$$
(7.9)

where the numerical coefficients have been computed with \(n_f=5\), as appropriate, and the LO at the generic scale \(\mu \) is given by

$$\begin{aligned} \Gamma _0(\mu ) = \alpha _s^2(\mu ) \frac{G_F m_{H}^3}{36\sqrt{2}\pi ^3} = \alpha _s^2(\mu ) \times 14.43\; \text {MeV}. \end{aligned}$$
(7.10)

Unlike the previous example (and the \(H\rightarrow b{\bar{b}}\) decay), the LO is now scale dependent. When computing this result at another scale, we must then account for the change at LO, using the procedure described in Sect. A.1.
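As a quick numerical cross-check of the prefactor in Eq. (7.10), the sketch below evaluates \(G_F m_{H}^3/(36\sqrt{2}\pi ^3)\), assuming \(m_H=125\) GeV and the standard value of \(G_F\) (both assumptions of this check, not quoted in the text above):

import math

# Cross-check of the LO prefactor in Eq. (7.10).
# Assumptions of this check: m_H = 125 GeV, G_F = 1.1663787e-5 GeV^-2.
GF, mH = 1.1663787e-5, 125.0
prefactor = GF * mH**3 / (36.0 * math.sqrt(2.0) * math.pi**3)  # in GeV
print(prefactor * 1e3)  # ~14.43 MeV, matching Eq. (7.10)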

Fig. 37 Left: scale dependence of the \(H\rightarrow gg\) decay width. Right: posterior distribution for the scale for the geometric behaviour model

In Fig. 37 (left) we show the “raw” results as a function of \(\mu \). We note that the behaviour of this plot is similar to that of Higgs production in gluon fusion, Fig. 4, due to the fact that the two processes are obviously related, and in particular they both start at \({\mathcal {O}}(\alpha _s^2)\). In this case, one extra order is known, thus providing an interesting and useful validation. Note however that the perturbative corrections are somewhat smaller in this case, leading to a better convergence pattern.

We move immediately to the results of our methods, shown in Fig. 38, in the same format as Fig. 36. The scale \(\mu =m_{H}\) used in the plots is a good one, because the perturbative expansion at this scale is very benign. Therefore, the probability distributions at fixed scale are rather good and converge well, with the next order always contained in the 68% DoB interval. Moreover, the precision at \(\hbox {N}^4\hbox {LO}\) of the two models is comparable.

Fig. 38 Fixed scale and scale independent results for the \(H\rightarrow gg\) decay width, for both the geometric behaviour model (upper plots) and the scale variation model (lower plots)

When we marginalize over the scale, using a flat prior for \(\mu \) in the range \(m_{H}/8<\mu <8m_{H}\), we obtain slightly larger uncertainties, with improved convergence (in particular, the means of the distributions at lower orders are more in line with the higher order results). We observe that the geometric behaviour model again generates multimodal distributions, this time both at \(\hbox {N}^3\hbox {LO}\) and \(\hbox {N}^4\hbox {LO}\). The explanation is still the same, namely that there is more than one region of scales for which the observable converges well, thus producing multimodal posterior distributions for the scale, as shown in Fig. 37 (right). The marginalization over the scale produces a prediction that is slightly smaller than (though compatible with) the one at \(\mu =m_{H}\), an obvious consequence of the fact that for \(\mu \sim m_{H}\) both the \(\hbox {N}^3\hbox {LO}\) and the \(\hbox {N}^4\hbox {LO}\) results have a maximum (see Fig. 37, left).

These results allow us to conclude that the uncertainty estimates obtained with our methods are robust and reliable. Given the similarity of this process to the case of Higgs production, we can extrapolate that the methods are reliable also in that case.

7.7 A resummed logarithmic-ordered expansion

So far we have applied our methods to fixed-order perturbative expansions. However, we have stressed several times that our methods can work also for more general expansions, provided they behave in a perturbative way. In particular, one can consider the results obtained using all-order resummations, and consider the sequence of increasing logarithmic accuracy, namely LL, NLL, NNLL and so on. Such an expansion is supposed to behave perturbatively, with a structure given by (we consider for simplicity the case of a single-logarithmic enhancement)

$$\begin{aligned} \Sigma _{\mathrm{pert}}= \sum _{k=0}^\infty g_k(\alpha _sL) \alpha _s^k, \end{aligned}$$
(7.11)

where the \(g_k\) coefficients are all-order functions of their argument, L being the resummed logarithm. The resummation assumes \(\alpha _sL\sim 1\), so the \(g_k\) functions are formally of \({\mathcal {O}}(1)\), and the explicit power of \(\alpha _s\) determines a fully fledged perturbative expansion.

In this section, we consider threshold resummation applied to the Higgs production process, which is known to a high order, \(\hbox {N}^3\hbox {LL}\) [58]. We have already used this result in Sect. 7.4 to construct a fixed-order perturbative expansion by expanding the resummed result. Here, instead, we take the full resummed result at various logarithmic orders, matched (as appropriate for phenomenology) to the corresponding fixed-order result. Namely, the perturbative expansion corresponds to the sequence LO+LL, NLO+NLL, NNLO+NNLL, \(\hbox {N}^3\hbox {LO}+\hbox {N}^3\hbox {LL}\), whose values at \(\mu =m_{H}\) (and \(\mu _{\scriptscriptstyle \mathrm F}=m_{H}/2\)) are given by

$$\begin{aligned} \Sigma _{\mathrm{pert}}(m_{H}) = \left\{ 15.9, 38.7, 47.1, 48.5 \right\} \; \text {pb} \end{aligned}$$
(7.12)

(we used the same setting as Sect. 7.4, namely the default of Ref. [5]). Since the coefficients of the perturbative expansion are all-order functions of \(\alpha _s\), their scale dependence cannot be reconstructed using the procedure of Sect. A.1. Therefore, to simplify the treatment of this process, we just consider the geometric behaviour model at fixed scale.Footnote 34 To investigate the dependence of the results on the scale, we consider also the expansion at another scale, \(\mu =m_{H}/2\), leading to the sequence

$$\begin{aligned} \Sigma _{\mathrm{pert}}(m_{H}/2) = \left\{ 20.1, 46.2, 50.1, 48.6\right\} \; \text {pb}. \end{aligned}$$
(7.13)

Note that at this scale the value of the coupling is different, \(\alpha _s(m_{H}/2)=0.1252\).

Fig. 39 Probability distributions and their summaries for the Higgs production process at resummed level matched to fixed order, for fixed scale \(\mu =m_{H}\) (upper plots) and \(\mu =m_{H}/2\) (lower plots) using the geometric behaviour model

We show in Fig. 39 the results of the geometric behaviour model applied at each of the scales considered. Because the resummation speeds up the (apparent) perturbative convergence [5], these results are somewhat more precise (smaller uncertainty) than those coming from the fixed-order expansion. Nonetheless, they remain reliable, as one can appreciate from the fact that the 68% DoB interval always covers the next order. These results are also fully compatible with those coming from the fixed-order perturbative expansion. A thorough study of the uncertainties of this (or any other) process is beyond the scope of this work and will be done elsewhere.

8 Treating correlations between different theory predictions

The approach and models presented so far can be used to compute the uncertainty on a given perturbative expansion, namely for the prediction of a given observable. In many physics analyses, however, one needs to consider more than one observable at a time. In such cases, correlations between the uncertainties of the different observables are fundamental.

We stress that by different observables we also mean the same physical observable at different kinematic points. Consider for example a differential observable, like a transverse momentum distribution. The methods presented so far would work independently for any value of the transverse momentum \(p_{\mathrm{t}}\), as each value of \(p_{\mathrm{t}}\) leads to a different perturbative expansion, in principle with very different perturbative behaviours.Footnote 35 It is clear that in such cases there are strong correlations between different values of \(p_{\mathrm{t}}\), approaching 100% for close values of \(p_{\mathrm{t}}\). But there may also be important correlations between \(p_{\mathrm{t}}\) values very far from each other, induced by the knowledge of the full cross section (the integral of the distribution). Similarly, when one considers different distributions for the same physical process, there are likely correlations in the uncertainty from missing higher orders, e.g. due to the fact that all distributions must integrate to the same total cross section.

The situation is somewhat different when considering different processes. Here it is less clear whether the uncertainties should be correlated or not. Note that there is a class of uncertainties, due to the uncertain values of parameters such as couplings or masses, that are clearly (and easily) correlated between different observables of different processes. However, here we are dealing with the uncertainty due to missing higher orders. The presence of correlations in this case implies that the perturbative expansions of the different observables of different processes are somehow related. This is not to be excluded a priori, as indeed perturbative expansions often show some similarities.Footnote 36 However, quantifying such similarities is hard in general and subject to a large degree of arbitrariness.

Finding a general and satisfactory solution to all these cases is complicated (if not impossible) and will not be the subject of this work. In the following, we limit ourselves to discussing some simple cases and proposing some strategies. This section should be considered just as a starting point for future work in this direction.

8.1 Correlations for the same observable at different kinematic points

Let us start our discussion from the case where there certainly are correlations, namely the case of the same observable at different kinematic points. Let us denote by \(\vec v\) and \(\vec w\) two generic configurations of the kinematic variables, and by \(\Sigma (\vec v)\) and \(\Sigma (\vec w)\) the values of the observable at those configurations. The information on the correlation between \(\Sigma (\vec v)\) and \(\Sigma (\vec w)\) is contained in the joint distribution

$$\begin{aligned} P\big (\Sigma (\vec v), \Sigma (\vec w) \,\big |\, \delta _n(\vec v),\ldots ,\delta _1(\vec v),\Sigma _0(\vec v), \delta _n(\vec w),\ldots ,\delta _1(\vec w),\Sigma _0(\vec w)\big ), \end{aligned}$$
(8.1)

where each term of the perturbative expansion is computed either at \(\vec v\) or at \(\vec w\). We recall that, in the limit \(\vec w\rightarrow \vec v\), the distribution must lead to a positive 100% correlation.

Constructing such a joint distribution satisfying the aforementioned requirement is far from trivial. The most promising way that we can imagine consists in considering not the values of the observable at given kinematic points, but the observable as a function of the kinematics. The probability to be computed is that of the function, and the requirement that the correlation of adjacent points tends to one is automatic, provided obvious conditions of continuity and smoothness are included. Also, the condition that the integral of the distribution is the inclusive cross section is easy to implement. Note that, in general, with this approach the probability distribution of a single point would not necessarily coincide with that obtained from the application of the methods introduced in the previous sections. In particular, one could expect to obtain a smaller uncertainty with this approach, given the additional constraints included.

How to realize this approach in practice is again subject to arbitrary choices. A promising option consists in using the proposal of Ref. [62]. The idea is simple. There are classes of higher order contributions whose functional structure is known to all orders from resummation techniques. These do not capture all the possible missing higher orders, but in some kinematic regions they are quite accurate, and they can be constructed such that in the other regions they still represent the full result to some accuracy. In this way, the only missing information for constructing the next orders is the value of some numerical coefficients on which the functional structure of the resummation depends. In other words, rather than using a generic set of functions to model the observable, one restricts the attention to a physically motivated subset parametrized by a small number of real parameters. In this way the inference is to be done on the parameters themselves. Moreover, these parameters are typically power series in the coupling, and therefore one can use straight away the probabilistic methods introduced in this work.Footnote 37 The actual implementation of this approach will be studied elsewhere.

8.2 Correlations among observables of different processes

We now move to discussing the correlations among observables belonging to different physical processes. In this case, it is not clear at all that the uncertainty due to missing higher orders should be correlated: the computations are different, the number of known orders is in general different, and the perturbative series describing the various observables are different. Therefore, at first sight, one is tempted to say that the uncertainties from missing higher orders of observables belonging to different processes are uncorrelated.

This conclusion is likely correct for very different processes. However, some processes may be characterized by perturbative expansions with common features, for instance when the underlying physical mechanism at the origin of some kind of corrections is the same. This is for instance the case when the bulk of perturbative corrections is given by logarithmically enhanced contributions, which in turn originate from a specific structure of particle emissions. Similarly, processes like Z and W production at proton-proton colliders (or equivalently neutral-current and charged-current structure functions in DIS experiments) are very similar, as they share the same structure of the interaction and thus the form of the perturbative corrections. In these cases it makes sense to assume that correlations between the uncertainties from missing higher orders of the given processes are present.

The tough question is how to quantify such correlations. To keep the discussion very general, we may assume that some of the parameters characterizing the probabilistic model are common to two given observables \(\Sigma \) and \(\Sigma '\) of different processes. If this is the case, the joint distribution can be written as

$$\begin{aligned} P(\Sigma ,\Sigma ' \,|\, \delta _n,\ldots ,\delta _1,\Sigma _0, \delta _n',\ldots ,\delta _1',\Sigma _0')&\propto P(\Sigma , \Sigma ', \delta _n,\ldots ,\delta _1,\Sigma _0, \delta _n',\ldots ,\delta _1',\Sigma _0') \\&= \int d\vec p\, P(\Sigma , \Sigma ', \delta _n,\ldots ,\delta _1,\Sigma _0, \delta _n',\ldots ,\delta _1',\Sigma _0',\vec p) \\&= \int d\vec p\, P(\Sigma \,|\, \delta _n,\ldots ,\delta _1,\Sigma _0, \vec p)\, P(\Sigma '\,|\,\delta _n',\ldots ,\delta _1',\Sigma _0',\vec p)\, P(\delta _n,\ldots ,\delta _1,\Sigma _0\,|\,\vec p)\, P(\delta _n',\ldots ,\delta _1',\Sigma _0'\,|\,\vec p)\, P_0(\vec p), \end{aligned}$$
(8.2)

where \(\vec p\) represents the vector of parameters of the model that are shared between the observables. Of course the model may depend on additional parameters, but these are assumed to be independent for the two observables, and therefore do not enter our decomposition. Equation (8.2) depends on the probability of each observable given its respective perturbative expansion and the common parameters \(\vec p\), and similarly on the probability of the perturbative expansions themselves given \(\vec p\). The actual form of these probability distributions depends on what \(\vec p\) is, but once the model is specified and the set of common parameters identified, Eq. (8.2) can be used directly, as it depends only on ingredients introduced in the previous sections.
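To illustrate the mechanism behind Eq. (8.2), the toy Monte Carlo below shows how marginalizing over a shared parameter induces a correlation between two observables that are generated independently given that parameter. The Gaussian conditionals and the flat prior are purely illustrative placeholders, not any of the models of this paper:

import numpy as np

rng = np.random.default_rng(0)
N = 100_000
p = rng.uniform(0.1, 0.5, N)            # shared parameter, flat prior P_0(p)

# Illustrative conditionals: the central value of each observable depends
# on the shared p, while the residual scatter is independent.
sigma1 = rng.normal(loc=p, scale=0.1)
sigma2 = rng.normal(loc=2.0 * p, scale=0.1)

# Pooling the samples marginalizes over p and leaves a nonzero correlation.
print(np.corrcoef(sigma1, sigma2)[0, 1])  # ~0.7 for these choices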

The choice of the common parameters \(\vec p\) depends on the processes under consideration, and it is the trickiest part of the job. For instance, one model-independent parameter that could be used to construct such correlations is the unphysical scale \(\mu \), or better its ratio to the hard scale of the process. In principle, the scale dependence of each process is independent; however, it is in all cases induced by the dependence of the coupling on the scale. For this reason, one can assume that the variations of the observables induced by a variation of the scale are somehow related, and use this to construct the sought correlation. This reasoning may however lead to some problems: for instance, obtaining such correlations requires using the same prior \(P_0(\mu /Q)\), with Q the hard scale of each process, which makes it impossible to adjust this prior to process-dependent conditions.Footnote 38 In agreement with this conclusion, in recent work [63, 64] on the inclusion of theory uncertainties in PDF fits, the dependence on the renormalization scale is not considered as a source of correlation among different classes of processes, while it is considered for similar processes like Z and W production belonging to the same class (Drell–Yan).

An alternative way to construct correlations between different but somehow related processes consists in using again the approach of Ref. [62], as described in Sect. 8.1. Indeed, some of the parameters used in the construction of the resummed results at the basis of that approach are process dependent, but others are more universal and common among different processes. For instance, the cusp anomalous dimension is one of the key ingredients of soft-gluon resummation for any process. Therefore, in Eq. (8.2) one can take \(\vec p\) to be the model parameters describing the perturbative expansion of the common coefficients (such as the cusp anomalous dimension). The probability distribution would be constructed exactly as in Sect. 8.1. We conclude that the approach of Ref. [62], used in conjunction with our methods, is very promising for estimating theory uncertainties from missing higher orders in a way that produces meaningful correlations between different observables, and it deserves further studies.

We conclude by mentioning that the situation becomes more complicated for hadron-initiated processes, due to the presence of parton distribution functions (PDFs). Indeed, PDFs induce two additional problems. One is that they depend on a different unphysical scale, the factorization scale \(\mu _{\scriptscriptstyle \mathrm F}\). If the PDFs were known exactly, this would not be too problematic, as one could extend the approach of Sect. 6 to \(\mu _{\scriptscriptstyle \mathrm F}\) as well. However, PDFs are non-perturbative objects and cannot be computed from the theory. Therefore, they are determined by a fit to data, using a theoretical computation for the perturbative coefficients describing an observable as input. This means that PDFs indirectly depend on the accuracy of the computation of the perturbative coefficients, which is the second problem. Ideally, one would like to include theory uncertainties with proper correlations in the determination of the PDFs themselves, as studied recently in Refs. [63, 64] (using canonical scale variation). The trouble is that the \(\mu _{\scriptscriptstyle \mathrm F}\) dependence is thus entangled with the theory uncertainty, and it is not obvious how to properly deal with it (see e.g. Ref. [65]). This important topic deserves a dedicated study.

9 Conclusions

Having a reliable estimate of the uncertainty on theoretical predictions is of fundamental importance for (particle) physics. A particularly important source of uncertainty is the one coming from the finite perturbative accuracy of theoretical computations. Quantifying the uncertainty due to unknown (or missing) higher orders is crucial, especially for those theories like QCD that are characterized by large perturbative corrections. In these cases the missing higher orders may be sizeable, and underestimating them may lead to wrong conclusions when comparing theoretical predictions with precise data.

The standard practice to estimate the size of the missing higher orders is based on the “canonical scale variation” method, where the unphysical scale(s) is varied by a factor of 2 about an arbitrary central scale. Since the scale dependence is formally subleading, the effect of the variation is formally higher order. The problem with this approach appears when one tries to promote this information to an uncertainty. Not only is the procedure totally arbitrary, with no probabilistic foundation; it also often underestimates the true uncertainty, and it is thus not reliable.

A groundbreaking approach to defining and computing the uncertainty from missing higher orders was introduced by Cacciari and Houdeau in 2011 [3]. This method uses a Bayesian approach to infer the probability distribution for an observable from its perturbative expansion, under some conditions that define the Cacciari–Houdeau (CH) model. Later, the CH model was modified to account for the divergence of the perturbative expansion [4]. While this approach is formally superior to canonical scale variation, it also has limitations. Indeed, in some cases the CH model is not very robust and does not reliably predict the uncertainty from the missing higher orders. This behaviour is due to the assumptions of the model itself, and not to the approach per se.

In this work we have built upon the CH model to construct more general, flexible and reliable models, using the same Bayesian approach. Moreover, we have introduced an innovative way to eliminate an ambiguity of theoretical computations in quantum field theory, namely the dependence on unphysical scale(s). We have validated the models using perturbative expansions with a known sum, and tested their quality on some observables that are known to a high perturbative order.

The models proposed in this work are essentially two, with a number of possible variants discussed in the appendices. One model is a generalization of the CH model, and assumes that the various orders k of the perturbative expansion are bounded by a geometric behaviour \(ca^k\), where both c and a are (hidden) model parameters. The main difference with respect to the CH model is that a, which plays the role of the expansion parameter of the geometric bound, is in the CH approach fixed to be the coupling constant, possibly up to an externally supplied fixed factor. Promoting a to a model parameter allows its most appropriate values to be selected based only on information from the expansion itself, through a probabilistically valid inference procedure. This improvement alone already makes the model much more reliable and robust. In practical applications, it performs very well, providing precise and accurate uncertainties. These are also rather stable under variations of the assumptions, as the bulk of the distributions is rather insensitive to the precise form of the prior, which mostly affects the tails.
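A minimal numerical sketch of this kind of inference, assuming for concreteness flat likelihoods \(|\delta _k|\le c\,a^k\) and flat priors on a grid (the actual likelihood and priors of the model are those of Sect. 4, so this is only a schematic illustration), could look as follows:

import numpy as np

# Schematic geometric-behaviour inference: the known relative corrections
# delta_k are assumed flatly bounded, |delta_k| <= c a^k, with hidden
# parameters a and c. Likelihood and priors are simplified placeholders.
delta = np.array([1.15, 0.35, 0.12])        # illustrative known corrections
ks = np.arange(1, len(delta) + 1)

a_grid = np.linspace(0.01, 1.0, 200)
c_grid = np.linspace(0.01, 5.0, 200)
A, C = np.meshgrid(a_grid, c_grid, indexing="ij")

# Uniform likelihood: density 1/(2 c a^k) if |delta_k| <= c a^k, else zero.
bounds = C[..., None] * A[..., None] ** ks   # shape (na, nc, n)
inside = (np.abs(delta) <= bounds).all(axis=-1)
like = np.where(inside, 1.0 / np.prod(2.0 * bounds, axis=-1), 0.0)
post = like / like.sum()                     # flat prior on (a, c)

# Given (a, c), the next correction is uniform in [-c a^4, c a^4]; a crude
# summary of the prediction is the posterior mean of this bound.
next_bound = C * A ** (len(delta) + 1)
print((post * next_bound).sum())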

The other model uses the information from scale variation, similarly to the canonical scale variation method, but in a probabilistically well defined way. In particular, it assumes that the next order is bounded by a factor \(\lambda \) times a properly defined scale variation coefficient of the current order. The factor \(\lambda \) is a hidden parameter of the model and is inferred from the known orders. Oversimplifying, this model can be interpreted as the canonical scale variation method where the size of the variation, rather than being fixed to a factor of 2, is variable and inferred from the expansion itself. We have noticed that this model predicts somewhat larger uncertainties than the geometric behaviour model, thus providing a more conservative estimate of the theory uncertainty. This uncertainty is very stable under variations of the prior distribution.

There are a number of improvements that can be applied to the models, discussed in the appendices. Two of them are particularly interesting. One consists in imposing an additional condition on the scale dependence of the inferred higher orders: indeed, not only should the various perturbative contributions get smaller and smaller as the order increases, but so should their scale dependence, as it is always formally of the next order and must thus behave perturbatively. Requiring a reduction of the scale dependence provides strong constraints on the missing higher orders, resulting in smaller uncertainties (i.e., higher precision). Another improvement consists in looking for a sign pattern in the known perturbative orders, and inferring the next orders according to the same pattern. This procedure also reduces the uncertainty, producing more precise results. Moreover, different conditions can be combined to provide more stringent models that can further improve the precision.

Which model to choose is up to the user. A “perfect” or “correct” model does not exist, as the only way to be sure about the size of the missing higher orders is to compute them, and any estimate is subject to arbitrary assumptions and biases.Footnote 39 The Bayesian approach highlights the presence of such subjective assumptions, making them very explicit (the model, the priors). Users, based on their beliefs about the expansion, can freely choose the model and priors, declaring them. This said, we now give a personal preference. From the studies presented in this work, it is clear that the geometric behaviour model provides precise results without sacrificing accuracy. Given the simplicity of the model and the fast implementation provided, we suggest its use as the default. Moreover, this is the only model that is applicable beyond quantum field theory, and it is thus more general and universal. Since the plain scale variation model is simple and fast, for QCD observables we also recommend its application as a “double check”, consistently with its conservative interpretation. Should one be interested in providing more aggressive results, the best performance is obtained by combining all the proposed assumptions (Eqs. 4.2, 5.2, and B.4), and keeping track of the sign pattern as discussed in Appendix B.5. Such an approach makes the most of the known orders, and it is thus particularly useful when only a low number of orders is available. However, we suggest its use only for very specific and dedicated studies and not as a general-purpose approach, also due to its slower numerical implementation. In all cases, variations of the priors should be considered to assess the stability of the results upon changes of the subjective arbitrary input.

Independently of the model used, we have proposed a general way to remove the dependence on the renormalization scale from the probability distribution of the observable (the method can be extended to other unphysical scales). Indeed, each model uses as basic ingredients the coefficients of the perturbative expansion, which however depend on the unphysical scale and would thus lead to infinitely many different results depending on the choice of the scale. The solution is actually very simple: promoting the unphysical scale to a parameter of the model. The inference works exactly as in the general case (Eqs. 1.2 and 1.3), with the difference that among the hidden parameters there is also the scale, with its own prior. In this way, after marginalizing over the scale, the dependence disappears. The original arbitrariness in the choice of scale is almost completely removed; only a mild arbitrariness remains in the choice of the prior. It has to be stressed that the inference on the scale acts differently depending on the model, selecting those values of the scale for which the given model performs better. In this way, the most accurate and precise results are obtained for free, without the need to choose wisely a scale corresponding to a “good” perturbative expansion. We also stress that with this procedure the “best” prediction for an observable is given by the mean of its probability distribution, in contrast with standard methods, where it corresponds to the value of the perturbative expansion computed at the highest known order at an arbitrary “central” scale. Therefore, not only does the method provide a reliable uncertainty, but it also gives a better central value for the observable. For scale dependent observables, we always recommend the use of this procedure.
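Schematically, this marginalization amounts to an evidence-weighted combination of fixed-scale posteriors over a grid of scales. In the sketch below both distributions are illustrative placeholders, since their concrete form is model dependent:

import numpy as np

def posterior_fixed_mu(sigma_grid, mu):
    # Placeholder fixed-scale posterior P(Sigma | known orders, mu):
    # a Gaussian whose centre drifts mildly with mu, for illustration only.
    return np.exp(-0.5 * ((sigma_grid - 45.0 - 2.0 * np.log(mu)) / 2.0) ** 2)

def evidence(mu):
    # Placeholder P(known orders | mu): how well the model fits at this scale.
    return np.exp(-0.5 * (np.log(mu) / 0.5) ** 2)

sigma_grid = np.linspace(30.0, 60.0, 600)
mu_grid = np.geomspace(0.25, 4.0, 100)   # mu in units of the hard scale

w = evidence(mu_grid)                    # prior taken flat in log(mu) here
w /= w.sum()

post = sum(wi * posterior_fixed_mu(sigma_grid, m) for wi, m in zip(w, mu_grid))
dsig = sigma_grid[1] - sigma_grid[0]
post /= post.sum() * dsig                # normalized scale-independent P(Sigma)
mean = (sigma_grid * post).sum() * dsig  # the "best" prediction is the mean
print(mean)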

We have finally considered the problem of quantifying the correlations between various theoretical predictions and their uncertainties. We believe that this problem cannot be addressed in full generality; rather, specific solutions need to be put in place depending on the observables and the processes under consideration. Some proposals have been suggested, but a thorough study is deferred to future work.

The results of this paper can be used to compute the uncertainty on any physical observable through a publicly released code named THunc, available at

http://www.roma1.infn.it/bonvini/THunc

The two main models proposed in this work are implemented analytically and lead to a very fast evaluation of the uncertainties. Other models (including those proposed in the appendices or alternatives invented by the user) can be implemented straight away through a custom model feature, at the price of a slower numerical evaluation.