An analysis of Bayesian estimates for missing higher orders in perturbative calculations

With current high precision collider data, the reliable estimation of theoretical uncertainties due to missing higher orders (MHOs) in perturbation theory has become a pressing issue for collider phenomenology. Traditionally, the size of the MHOs is estimated through scale variation, a simple but ad hoc method without probabilistic interpretation. Bayesian approaches provide a compelling alternative to estimate the size of the MHOs, but it is not clear how to interpret the perturbative scales, like the factorisation and renormalisation scales, in a Bayesian framework. Recently, it was proposed that the scales can be incorporated as hidden parameters into a Bayesian model. In this paper, we thoroughly scrutinise Bayesian approaches to MHO estimation and systematically study the performance of different models on an extensive set of high-order calculations. We extend the framework in two significant ways. First, we define a new model that allows for asymmetric probability distributions. Second, we introduce a prescription to incorporate information on perturbative scales without interpreting them as hidden model parameters. We clarify how the two scale prescriptions bias the result towards specific scale choice, and we discuss and compare different Bayesian MHO estimates among themselves and to the traditional scale variation approach. Finally, we provide a practical prescription of how existing perturbative results at the standard scale variation points can be converted to 68%/95% credibility intervals in the Bayesian approach using the new public code MiHO.


Introduction
Perturbative calculations are currently the only way to obtain Standard Model (SM) predictions for collider experiments. The backbone of almost all calculations for hadron collider observables is the QCD factorisation theorem, which relates an observable Σ in protonproton collisions (e.g., a total cross-section or differential distribution) to the corresponding partonic quantity, where f i (x, µ F ) denotes the parton distribution function (PDF) for parton i andΣ ij denotes the observable computed on a collision of the partons i and j. Here µ F is the factorisation scale, Q denotes a characteristic hard scale of the process, and Λ QCD 1 GeV is the scale at which QCD becomes strongly coupled. The partonic coefficientsΣ ij are computable in perturbation theory, where k 0 ≥ 0 is a non-negative integer, α s (µ R ) denotes the renormalised strong coupling constant evaluated at the renormalisation scale µ R , andΣ (k) ij is the (next-to-) k -leading order (N k LO) contribution to the partonic coefficient. 1 One can combine the QCD factorisation theorem in eq. (1.1) with the perturbative expansion of the partonic coefficients in eq. (1.2) to obtain the N n LO approximation for the observable Σ within (fixed-order) perturbation theory Σ Σ n (µ F , µ R ) := n k=0 Σ (k) (µ F , µ R ) . (1.3) Computations are typically performed in the MS renormalisation scheme. Although Σ is formally independent of the choice of the factorisation and renormalisation scales, the truncation of the perturbative series introduces a residual scale dependence into Σ n . Since a perturbative result is always an approximation of the exact value Σ, it is important to have robust methods to assess the quality of the approximation and to quantify the uncertainty attached to it. Three of the main sources of uncertainty are coming from the values for the strong coupling constant and other input parameters (we collectively refer to them as the parametric uncertainty), the PDFs 2 and the truncation of the perturbative series. The parametric and PDFs uncertainties are related to their extraction from 1 The ' '-sign in eq. (1.2) indicates that the perturbative expansion on the right-hand side is in general an asymptotic series with zero radius of convergence. For most phenomenological predictions only very few terms in the perturbative expansion are known (typically not more than the first 3 or 4 terms), and one does not have to worry about the non-convergent nature of the series. 2 Note that the PDF uncertainty can itself be seen as a parametric uncertainty, just like the coupling constants or the masses. Indeed, there is no strong conceptual difference between the PDF and, e.g., αs uncertainties, and often they are treated on the same footing. For our discussion in later sections, we find it convenient to think of these two sources separately. This does, however, not impact any of our results, as our primary focus are MHO uncertainties.
experimental data and the fitting methodology used. Consequently, the PDF and parametric uncertainties are not directly related to the observable Σ. They are computed using observable-independent recipes, cf., e.g., refs. [1][2][3] (though care is needed as there may be correlations between Σ and the data used to extract the PDFs and α s ). The uncertainty resulting from the truncation of the perturbative series quantifies the effect of the missing higher orders (MHOs) in perturbation theory. Its estimation is one of the main topics of this paper. Under the assumption that the PDF, parametric, and MHO uncertainties are uncorrelated (see [4][5][6][7] for discussion of correlation between PDF and MHO uncertainties), they can be added in quadrature to obtain the uncertainty on the N n LO prediction. 3 Currently, there is no reliable way to estimate the size of the MHOs (other than computing the next order in perturbation theory). Since the effect of varying the scales at N n LO is an N n+1 LO effect, one conventionally estimates the size of the MHO terms by studying the variation of Σ n (µ F , µ R ) as the factorisation and renormalisation scales are varied in a certain range around a central scale µ F/R,cent corresponding to a characteristic hard scale of the process. A popular choice is the so-called 7-point variation, where the MHO uncertainty at N n LO is estimated by the interval (1.4) (k F , k R ) take values in the following set: This differs from the 9-point variation where k F , k R ∈ {1/2, 1, 2} by the exclusion of large ratios of the scales, i.e., k F /k R = 4 or 1/4. We stress that the procedure of estimating the MHO uncertainty from scale variation is completely ad hoc. There is no reason other than experience from NLO and NNLO calculations why the obtained uncertainty interval at N n LO should capture the actual size of the N n+1 LO corrections Σ (n+1) or the true value of Σ. Indeed, the higher-order terms probed by varying the scales are universal and predictable from renormalisation group methods. They do not take into account the size of the genuine higher-order corrections, which are process-dependent and independent of the perturbative scales. With advances in precision calculations over the last 5 years, NNLO results are becoming routinely available for many LHC processes, and even N 3 LO corrections to colourless final states have been recently computed. It is, therefore, time to critically assess the MHO uncertainties and to look for alternative prescriptions. A number of attempts have been made in the past to estimate or predict the truncation errors and MHOs, see, e.g., refs . Various prescriptions for reducing or eliminating scale ambiguity have also been discussed broadly in the literature, e.g., refs. [30][31][32][33][34][35][36][37][38][39][40][41][42][43][44][45][46][47].
Here we focus on a promising, widely applicable approach to estimate the MHO contributions using Bayesian inference on the known fixed-order computations. This idea was pioneered by Cacciari and Houdeau (CH) in ref. [48], and developed further in refs. [49][50][51]. Bayesian inference is a powerful method to construct probability distributions in which Bayes' theorem is used to update the probability as new information becomes available iteratively. The procedure can be summarised as follows: one starts from some model and the prior distribution of the model parameters. These inputs are purely subjective. The next step is a construction of a posterior distribution for model parameters based on available data. This is the inference step, where one updates prior belief as new data become available. Finally, one constructs the posterior distribution for the next data point by marginalising over the posterior distribution for the model parameters. The main idea of the CH-method it to use Bayesian inference on the known perturbative orders to model the size of the MHOs. The original CH-model can be applied only after one specifies the choices for the perturbative scales. In this paper, we follow up on the recent work in ref. [51], where the CH-approach was extended by the postulate that unphysical scales should be treated on par with other hidden model parameters. According to the lore of Bayesian inference, one expects that the posterior distribution for the perturbative scales peaks at the 'optimal value' for the scale choice, and the dependence on the scales is eliminated by marginalisation. However, this is only one example of possible prescriptions, as there is no reason, physical or mathematical, to interpret the perturbative scales as hidden model parameters. One of the main objectives of this paper is to thoroughly scrutinise the consequences of this postulate and to explore alternative approaches.
This paper is organised as follows: In section 2 we discuss two prescriptions of how to incorporate scale dependence into a model for Bayesian inference of fixed-order calculations, and we investigate implicit biases inherent to them. In section 3 we recap the mathematical formalism behind Bayesian inference at fixed scale and introduce a model to account for asymmetric probability distributions. In section 4 we study extensively the model performance for representative perturbative calculations without scale dependence or at fixed scale choice. Finally, in section 5 we study the performance of the two prescriptions to incorporate scale information for an extensive set of collider processes known at NNLO and/or N 3 LO. We conclude with a discussion in section 6.
2 Bayesian methods for observables in quantum field theory

Incorporating scale information into Bayesian inference
In this section we present a way to estimate the MHO uncertainty for an observable Σ via Bayesian methods. To be concrete, let us assume that we have computed the first n + 1 orders Σ (k) , 0 ≤ k ≤ n, in the perturbative expansion of Σ. Our goal is to construct a probability distribution P(Σ|Σ n ) for Σ given the sequence of perturbative coefficients Σ n := (Σ (0) , Σ (1) , . . . , Σ (n) ). This distribution can be used to construct credibility intervals (CIs) for Σ, which may serve as the uncertainty estimate. In particular, we define the x% CI as CI (2.1) We will mainly consider x% ≈ 68% and x% ≈ 95% CIs. The original approach of ref. [48], known as the Cacciari-Houdeau (CH) method, allows one to construct a probability P(Σ|Σ n ) in the case where the Σ n are numbers. In perturbation theory, however, the perturbative coefficients Σ n are functions of the perturbative scales, which are arbitrary quantities introduced by perturbation theory without any physical interpretation. Clearly, P(Σ|Σ n ) should be independent of the scale choice, but the value of the N n LO prediction may depend strongly on this choice, and there is no reason to prefer one scale over another (as long as the scale does not differ too much from the hard scale of the process).
Reference [51] introduced a possible way of incorporating perturbative scales into a Bayesian approach to estimating MHOs (see also ref. [50] for an alternative approach). The approach of ref. [51] 4 to include the information on the perturbative scales into a Bayesian model is not the only one. Broadly speaking, we identify two different prescriptions of how one can incorporate information on perturbative scales into the construction of a probability distribution: Prescription 1: Scale-marginalisation (sm). Despite the perturbative scales not being physical, one assigns a Bayesian degree of belief to them. In this way the scales become (hidden) parameters of a Bayesian model. It is then possible to construct P(Σ|Σ n ) through Bayesian inference and marginalisation over the perturbative scales. This is the approach advocated in ref. [51].
Prescription 2: Scale-averaging (sa). Since the perturbative scales are devoid of any physical meaning, they cannot be treated as hidden parameters of a Bayesian model, i.e., they cannot be treated as random variables with a statistical interpretation. Instead, the result of a perturbative calculation through N n LO should be interpreted as a family of independent predictions for the true value Σ parametrised by the scales. The probability distribution P(Σ|Σ n ) is then defined by adding coherently the probabilities computed for each family member.
As we will show later, the two prescriptions of obtaining scale-independent probabilities differ not only in their philosophy, but in practice they can produce different results due to implicit biases towards some scales over others. In the next section we present the mathematical framework to construct the probability distribution P(Σ|Σ n ). In order to keep the notations concise, we focus first on the simplified situation where the perturbative coefficients Σ n depend on a single perturbative scale µ (which we may think of as the renormalisation scale for now).

Prescription 1: scale-marginalisation
We start by reviewing the scale-marginalisation (sm) prescription of ref. [51]. Since µ is a parameter of the model, the probability distribution P(Σ|Σ n ) is obtained by marginalising over µ; P(Σ|Σ n ) = dµ P(Σ, µ|Σ n ) = dµ P(Σ|Σ n , µ) P(µ|Σ n ) . (2.2) Let us interpret the probabilities in the integrand. For the first factor, we apply Bayes' formula to write P(Σ|Σ n , µ) = P(Σ, Σ n |µ) P(Σ n |µ) = P (Σ, Σ n (µ)) P (Σ n (µ)) = P (Σ|Σ n (µ)) . (2. 3) The interpretation of this equation is as follows: Since µ is a parameter of the model, whenever we evaluate a conditional probability for a fixed value of µ, we can replace the perturbative coefficients Σ n (which are functions) by their values Σ n (µ) for fixed µ (which are numbers). In the following we always indicate by P a probability distribution which takes are arguments the perturbative coefficients Σ n (which are functions), while P denotes a probability distribution on perturbative coefficients computed at a fixed scale µ. For now we assume that P is given. In section 3 we discuss a framework how a distribution P can be constructed. The second factor in the integrand of eq. (2.2) is the posterior distribution for µ, and we have where P 0 (µ) is an (arbitrary) prior distribution on the perturbative scale µ. In ref. [51] it was argued that a physically well-motivated choice is a flat distribution in the logarithm of the scale, where F > 1 is a constant and Θ denotes the Heaviside step function; The central scale µ 0 is in principle arbitrary, but assumed to be of the order of the hard scale of the process. Putting everything together, we see that P(Σ|Σ n ) can be expressed in terms of the probability distribution P and the prior P 0 on the scale: Equation (2.7) allows one to relate P sm (Σ|Σ n ) to two other probability distributions (in addition to the prior on µ). The first one is the probability distribution P (Σ|Σ n (µ)) for the (true) value of the observable Σ given the n + 1 first perturbative orders calculated at the scale µ. The second one is the probability distribution P (Σ n (µ)), which is related through Bayes' theorem to P (Σ (n) (µ)|Σ n−1 (µ)). The latter computes the probability distribution for the N n LO corrections Σ (n) (µ) evaluated at the scale µ, given the n first perturbative orders evaluated at that scale. The two distributions P (Σ|Σ n (µ)) and P (Σ (n+1) (µ)|Σ n (µ)) are the answers to two different mathematical questions, and there is a priori no reason why they should be related. However, if perturbation theory is applicable, then the true value of Σ should be well approximated by Σ n+1 (µ) (at least for reasonable choices of µ). We therefore expect that also the distribution P (Σ|Σ n (µ)) is well approximated by P (Σ (n+1) (µ)|Σ n (µ)). Substituting Σ (n+1) (µ) ≈ Σ − Σ n (µ) in the latter, we expect that, if perturbation theory is applicable, then For a more detailed derivation of this approximation in a Bayesian framework we refer to ref. [51]. Using this approximation, we can cast eq. (2.7) into the form: We stress that, unlike indicated by the notation in eq. (2.8), the probability distributions in the left and right-hand sides of eqs. (2.7) and (2.8) are actually distinct: P (Σ|Σ n (µ)) is the conditional distribution for the true value Σ of the observable, while P (Σ − Σ n (µ)|Σ n (µ)) is the distribution for the value of the N n+1 LO correction Σ (n+1) (µ) ≈ Σ−Σ n (µ) computed at the scale µ. We will always assume that available N n LO results sufficiently well approximate Σ and that we are in the perturbative regime where eq. (2.8) holds, so we do not distinguish the notation for the two distributions.

Prescription 2: scale-averaging
In this section we present an alternative way to include scale information into the model which does not require the perturbative scale µ to be promoted to a model parameter with a statistical or physical meaning. We call this method the scale-averaging (sa) prescription. The starting point is to interpret µ as a parameter defining a family of perturbative predictions which approximate Σ at N n LO in perturbation theory. For each member of this family, i.e., for each fixed value µ of the scale, we can compute the probability P (Σ|Σ n (µ)) (as before we assume again that P is given). We then combine this family of predictions into a single distribution P(Σ|Σ n ) as a weighted average of probabilities P(Σ|Σ n ) ≡ P sa (Σ|Σ n ) := dµ w(µ) P (Σ|Σ n (µ)) . (2.10) The form of the weighting function w(µ) is arbitrary, but it must be chosen such that P(Σ|Σ n ) satisfies all the axioms of a normalised probability distribution. In particular, w(µ) must itself be normalised: Since one expects that the perturbative scale µ should be chosen close to the characteristic hard scale of the process in order to ensure good perturbative convergence, we postulate a window function around a fixed scale µ 0 ∼ Q: where F > 1 is a constant. We note that our choice for w(µ) is identical to the prior on the scale in the scale marginalisation in eq. (2.5). The interpretation, however, is different: in the previous section the scale µ is a model parameter with a statistical interpretation that leads to eq. (2.7). In eq. (2.10), the scale has no statistical interpretation, and is simply a parameter defining a family of distributions, which are then combined into a single distribution with weighting function w(µ) according to eq. (2.10). With the weight given in eq. (2.12), this corresponds to a uniform average in the logarithm of the scale over the interval µ 0 /F < µ < F µ 0 . Using the approximation in eq. (2.8) we obtain the final expression (2.13)

Extension to multiple perturbative scales
In the case where the perturbative coefficients Σ n are independent of the unphysical perturbative scale µ, i.e., Σ n (µ) = Σ n (µ 0 ) for some µ 0 , the distributions P reduce to P : To extend the discussion to more scales µ i , 1 ≤ i ≤ s we introduce a multivariate prior P 0 (µ 1 , . . . , µ s ) or weighting function w(µ 1 , . . . , µ s ), and we integrate over each scale µ i separately.
We need to comment on hadronic observables, which always depend on the factorisation scale µ F . In ref. [51] the factorisation scale was explicitly excluded from the marginalisation procedure described in section 2.1.1. Indeed, the factorisation scale also appears in the PDFs. It is not clear what is the correlation and how the PDF uncertainty impacts the probability distribution P(Σ|Σ n ), which is supposed to represent the probability distribution for the true value of the observable and should supersede and incorporate all other uncertainties. Here we aim for something much simpler. We want to use the distribution P(Σ|Σ n ) to construct credibility intervals that serve as estimators for the MHO uncertainty. In this sense, the role of P(Σ|Σ n ) is similar to scale variation as a tool to estimate the size of the MHOs. Like in the case of scale variation, we will assume that the MHO uncertainty is independent of the PDF uncertainty, and the two can be added in quadrature.

Connection to prescriptions for choosing the perturbative scales
In the previous section we have presented two methods to construct a scale-independent probability distribution P(Σ|Σ n ). The two methods differ not only by a different philosophy of how the unphysical perturbative scales are included, but can also lead to different results for the inferred probability distributions. Here we show that under rather general conditions the position of the peak of P(Σ|Σ n ) as a function of Σ will be biased towards two different scales µ FAC and µ PMS : • The point of fastest apparent convergence (FAC) [52,53] of an observable computed at N n LO is the scale µ FAC such that where we used the notation Σ n (µ) := Σ n (µ, µ). Note that the value of µ FAC depends on the order at which it is computed, and there may be more than one solution to eq. (2.15) at a given order in perturbation theory.
• The principle of minimal sensitivity (PMS) [53] aims at selecting a scale µ PMS at N n LO that maximises the stability of the observable under changes of the scale Just like the FAC point, the PMS point depends on the order on which it is computed and may not be unique at a given order in perturbation theory.
Consider now a broad class of Bayesian inference models satisfying the following conditions: 1. We work with the prior in eq. (2.5) or the weighting function in eq. (2.12).
4. The peak is more and more pronounced as n increases, i.e., as more information on the progression of the perturbative series becomes known.
Note that these assumption imply that the peak of P (Σ (n+1) (µ)|Σ n (µ)) is located at Σ (n+1) (µ) = 0, for all µ. The aforementioned assumptions are reasonable, and they are satisfied for some of the models considered in ref. [51], in particular by the geometric model and the scale-variation model of ref. [51] (see also section 3.2). In appendix A we show that any model satisfying the assumptions above will also satisfy the following property for the peak of the distribution P(Σ|Σ n ): • If there is a single µ FAC ∈ [µ 0 /F, µ 0 F ], then P sm (Σ|Σ n ) peaks at Σ = Σ n (µ FAC ), i.e., at the N n LO prediction for the observable evaluated at the point of fasted apparent convergence.
• If there is a single µ PMS ∈ [µ 0 /F, µ 0 F ], then P sa (Σ|Σ n ) peaks at Σ = Σ n (µ PMS ), i.e., at the N n LO prediction for the observable evaluated at the point of minimal sensitivity.
As a consequence, the sensitivity of the probability distributions and CIs on the choice of the scale variation interval (given by F ) depends on the presence (or absence) of FAC and PMS points in the interval for the sm-and sa-prescriptions respectively. We will illustrate this on concrete examples in section 5. Let us conclude this section with an important comment about the sm-prescription. In the sm-prescription, the perturbative scales are treated as model parameters with a statistical interpretation. Bayesian inference will then select the 'best' value for the perturbative scales according to the model, see also the discussion in ref. [51]. At high enough perturbative orders, the dependence on the priors is expected to be mild, so that the conclusions derived from the inference are robust and (to a certain extent) prior-independent. The results of this section indicate that such a Bayesian interpretation of the perturbative scales should be taken with a grain of salt: For symmetric models with the prior in eq. (2.5) (which applies in particular to some of the models of ref. [51]), as soon as the range [µ 0 /F, µ 0 F ] contains an FAC point, the scale selected by the inference is biased: it will automatically select the value of the observable at the FAC point as the most likely value. In other words, such a model will always select the FAC point as its 'best' estimate for the scales, independently of the actual progression of the perturbative series. Note that the previous discussion does not apply to the sa-prescription, because the perturbative scales are not assigned a probabilistic interpretation (in particular, the weight w(µ) in eq. (2.12) is not a probability distribution). In fact we will see examples in section 5 where the CIs for the sa-prescription increase monotonically with F .

Connection to MHO estimates from scale variation
One may wonder about the practicality of the prescriptions in sections 2.1.1 and 2.1.2, because evaluating the integrals of the scale could be rather slow, as it may require to evaluate the perturbative approximation of Σ for many values of the scales µ i . In many examples of higher-order computations for the LHC, the dependence on the factorisation and renormalisation scales is rather smooth. We can then approximate the integrals over scale in eqs. (2.9) and (2.13) by a low-order quadrature rule.
We focus for now on observables depending on a single scale µ, and we comment on multiple scales later. It is particularly convenient to choose the quadrature points to coincide with the ones used in the conventional scale variation, namely µ/µ 0 ∈ {1/2, 1, 2}. 5 For the three-point Gauss-Legendre quadrature rule this can be achieved by using the value of F = 2 √ 5/3 ≈ 2.45 for w(µ) in eq. (2.12). Then the integral in eq. (2.13) can be approximated as where the weights w i for the Gauss-Legendre quadrature are This quadrature achieves high accuracy for smooth functions in the interval − log F < is polynomial in log µ up to degree 5; note that hadronic observables are never of this form, because some logarithms have been resummed to all orders into the PDFs and α s (µ)). Similar formulas can of course be derived for quadrature rules of higher order and multi-dimensional µ integrals. The advantage of the three-point rule in eq. (2.18) is that it allows one to easily convert the available data from the scale-variation approach into an approximation for the Bayesian probability distribution, which is scale-marginalised or scale-averaged over the interval µ 0 /F < µ < µ 0 F with F ≈ 2.45. This approximation is valid as long as the three-point quadrature rule gives a good approximation of the integral. In applications, the dependence on the scale is usually smooth, so that low-order quadrature rules are expected to give reasonable approximations. We will illustrate this on examples in section 5. One could also restrict the integration region to the interval [µ 0 /2, 2µ 0 ], i.e. F = 2, by using a quadrature rule where the nodes coincide with the end-points of the integration, e.g., the Gauss-Lobatto quadrature rule (accurate at cubic order). We compare the log µ/µ 0 ranges of different quadrature rules using the same scale variation points in the left panel of figure 1. The quadrature rule in eq. (2.18) allows us to obtain an approximate bound on the probability that the true value lies in the range obtained from conventional scale variation. We will first discuss the case of an observable depending on a single perturbative scale µ, and generalise the result later. Define the lower and upper values of the scale variation around some central scale µ 0 as Σ min := Σ n (k min µ 0 ) = min k∈{1/2,1,2} Σ n (kµ 0 ) , (2.20) Consider the sa-prescription and a model such that 1. the model is symmetric, 2. P (Σ (n+1) (µ)|Σ(µ)) peaks only at Σ (n+1) (µ) = 0, for all values of µ, 3. the three-point quadrature in eq. (2.18) gives a good approximation of the integral.
If these assumptions are satisfied, then we can show that the probability that Σ ∈ [Σ min , Σ max ] is bounded from above: where w min is the quadrature weight associated to the scale µ = k min µ 0 through eq. (2.18), and similarly for w max . The sign reminds us that the bound is only as accurate as the assumptions above, in particular the validity of eq. (2.18). To understand the origin of this approximate bound, we first note that Using eqs. (2.13) and (2.18), we express Φ(Σ) as a sum of cumulative distributions ϕ(Σ, µ) := Σ −∞ dΣ P (Σ − Σ n (µ)|Σ n (µ)) for a fixed choice of the scale: where the last step follows from the fact that all three terms are positive. An illustration of the three functions P (Σ − Σ n (kµ 0 )|Σ n (kµ 0 )), k ∈ {1/2, 1, 2}, is shown in figure 1 for two scenarios: when the distributions are narrow compared to the scale variation interval [Σ min , Σ max ] (shaded areas) and when the distributions are wider than [Σ min , Σ max ] (dashed lines). Note that we do not know which value of k corresponds to which distribution in figure 1. However, from the assumptions above it follows that only one of them peaks at Σ min , namely the one corresponding to the value of k = k min . Since the distribution is by assumption symmetric around Σ = Σ min , we must have ϕ(Σ min , k min µ 0 ) = 1 2 . Hence, we obtain Using a similar argument, we can conclude Equation (2.21) then follows upon inserting the bounds for Φ(Σ min ) and Φ(Σ max ) into eq. (2.22). We can ask when the approximate bound in eq. (2.21) is saturated. It is easy to see that the bound for the cumulative distribution in eq. (2.24) is saturated if the three distributions P (Σ − Σ n (kµ 0 )|Σ n (kµ 0 )), k ∈ {1/2, 1, 2} are narrow compared to their separation, because in that scenario we have Φ(Σ min ) ≈ 1 2 w min . (see figure 1). A similar argument holds for the bound in eq. (2.25), Φ(Σ max ) ≈ 1 − 1 2 w max . Hence, we conclude that the approximate bounds on the probability for [Σ min , Σ max ] are saturated for observables and models for which the distributions P (Σ − Σ n (kµ 0 )|Σ n (kµ 0 )), k ∈ {1/2, 1, 2} are well separated.
We can repeat exactly the same argument for observables depending on multiple scales (µ 1 , . . . , µ s ) by using the same quadrature rule s times for scale-averaging for each scale µ i in eq. (2.23). Each cumulative distribution ϕ (Σ, µ 1 , . . . , µ s ) in the sum is multiplied by a product of s quadrature weights. It is then straightforward to show that the bound becomes Let us discuss some implications of the approximate bound in eq. (2.26). Under the assumptions spelled out above, eq. (2.26) establishes a connection between the interval obtained by varying the scale up and down by a factor of 2 and the Bayesian probability distribution constructed from the sa-prescription. The value of the approximate bound depends on the quadrature weights used to approximate the integral. For the three quadrature rules illustrated in figure 1, the values of the bound are shown in table 1 for s = 1, 2. Note that different quadratures correspond to different intervals over which the scale-averaging is done. In addition, the accuracy of the approximation in eq. (2.18) also depends on the order of the quadrature rule. Therefore the approximate bounds in table 1 depend on how well the assumptions stated at the beginning of this section are satisfied. In particular, it does not allow one to attach a definite probability that the true value lies in the range given by the scale variation. However, we can show that the (100β s )% credibility interval CI 100βs = [Σ low 100βs , Σ up 100βs ] will contain [Σ min , Σ max ] and the two intervals (approximately) agree whenever the approximate bound is saturated. Indeed, from the definition of the credibility interval in eq. (2.1) it follows that Φ(Σ low 100βs ) = and so the monotonicity of the cumulative distribution implies 29) and the claim follows.
The previous discussion has an important consequence: Assume that we are in a situation with s = 2 scales. Then we conclude from table 1 that the 9-point scale variation interval satisfies (2.30) Clearly, the 7-point interval is also contained in the same interval. In the case where the approximate bound is saturated, we expect that for two scales (µ F , µ R ) the intervals obtained from scale variation are comparable to ≈ 95% credibility intervals. In practise, we often find that observables are much less sensitive to one of the scales, namely, µ F , and we are effectively dealing with s = 1 case, where β 1 ≈ 66-83%. We will illustrate in section 5 that the conventional scale variation intervals are often similar in size to CI 68 using Bayesian methods. The arguments presented in this section show that this is not a coincidence. Nevertheless we reiterate that the derivation of these bounds relies on specific assumptions. In particular, the sm-prescription and models with asymmetric probability distributions do not need comply with these requirements and the bounds can be broken. However, as we will see in the examples in section 5, in many cases the size of the CIs is similar for symmetric and asymmetric models.

Modelling probability distributions at fixed scales
In the previous section we have discussed two ways to define a probability distribution P(Σ|Σ n ) in the context of perturbation theory. In both approaches the computation of the distribution P(Σ|Σ n ) is reduced to an integral over the perturbative scale µ. The integrand involves, on the one hand, the prior P 0 (µ) in eq. (2.5) or the weighting function w(µ) in eq. (2.12), and, on the other hand, the probability P (Σ|Σ n (µ)) evaluated at a fixed value µ of the perturbative scale. The latter is related through eq. (2.8) to the probability distribution P (Σ (n+1) (µ)|Σ n (µ)) for the size of the next perturbative order at the scale µ, given the n + 1 first orders computed at the same scale. Various models for P (Σ (n+1) (µ)|Σ n (µ)) have been presented in the literature, in particular the CH-method of ref. [48] and its generalisations of ref. [51]. All theses examples define symmetric distributions in the sense of eq. (2.17). In section 2.2 we have argued that symmetric distributions always lead to very special distributions P(Σ|Σ n ) after integration over µ, with peaks that are correlated to the FAC and PMS points. This may introduce unwanted bias into the model. The goal of this section is to introduce a model that is not symmetric, which is a necessary condition for distributions not necessarily peaked at the FAC or PMS points. We start by presenting a framework to define different models for P (Σ (n+1) (µ)|Σ n (µ)). This framework contains, in particular, the CH-method of ref. [48] and its generalisations of ref. [51] as special cases. We then use this framework to define a new model that takes into account the signs of the corrections and leads to asymmetric distributions.

The mathematical setup
Consider the set consisting of elements (δ, p), where p ∈ R p is a vector and δ = (δ 0 , δ 1 , . . .) is an infinite sequence of real numbers representing the infinite series ∞ k=0 δ k . We prefer to think of the sequence δ rather than the infinite series, because the order of the δ k 's matters.
We make the following assumptions: 1. The parameters p are hidden, and we do not have access to them. Instead, they are distributed according to a known prior distribution P 0 (p).
2. The sequence δ and the hidden parameters p are not independent. For a given value of p, we know the probability that the first n + 1 elements of δ take a certain value, i.e., for each n ≥ 0, if δ n := (δ 0 , δ 1 , . . . , δ n ), we know the conditional probability P (n) (δ n |p).
3. The conditional probabilities P (n) (δ n |p) for different values of n are related by marginalisation: We refer to a choice of (P 0 , P (0) , P (1) , P (2) , . . .) as a model for sequences (δ, p). We always assume that all probabilities are normalised, and we will often keep the superscript on P (n) implicit and use the simplified notation P (δ n |p) := P (n) (δ n |p). Assume that we are given the first n + 1 terms δ n of δ. The main idea is that, within a given model, we can use Bayesian inference to obtain a conditional probability distribution P (δ n+1 |δ n ) for the next term δ n+1 in the sequence δ, given the first n + 1 terms δ n . The probability distribution P (δ n+1 |δ n ) is given by Bayes' formula: The probability distributions for P (δ n ) and P (δ n+1 ) are themselves obtained from Bayes' formula: We refer to a model as symmetric if it satisfies (cf. eq. (2.17)) A symmetric model always predicts a vanishing median and expectation value for δ n+1 given δ n , and all CIs computed from a symmetric model are symmetric around δ n+1 = 0.

The geometric model
The CH model assumes a universal bound c for all coefficients δ k (µ) = Σ (k) (µ)/α k s (µ) for a fixed value of µ. In ref. [51] a variation of the CH model, called the geometric model, was introduced to take into account the fact that the expansion parameter does not need to be equal to α s . 6 The geometric model considers a sequence of coefficients , whose magnitude is bounded from above by a geometric series. Consequently, the model has two hidden parameters p = (a, c), and the coefficients of the sequence δ(µ) must satisfy the constraint (for simplicity we suppress the dependence on the scale): The constraints for k = 0, 1 imply c > 1 and a > 0. Since we do not make any other assumption about the coefficients, we choose a flat distribution that takes into account eq. (3.6) [51]: (3.7) The model is clearly symmetric. For the prior on the hidden parameters, a and c are assumed independent, P 0 (a, c) = P 0 (a)P 0 (c). The prior on c is given by Here ε > 0 is a constant introduced to obtain a normalisable prior. The prior on a is [51]: where ω > 0 suppresses the prior at a = 1. Note that ε and ω are constants, and they are not considered hidden parameters of the model. For comparisons with the asymmetric model defined in the next section, we will use = 0.1 and ω = 1. The form of eq. (3.7) and the priors are simple enough that the integral in eq. (3.3) can be performed in closed form, see ref. [51]. Let us conclude this section with some comments. First, we note that the geometric model contains the original CH model as a special case, where a = α s (µ), i.e., the prior for a is P 0,CH (a) = δ(a − α s (µ)) , (3.10) so that P CH (δ n |c) = da P geom (δ n |a, c) P 0,CH (a) = P geom (δ n |α s (µ), c) . Second, so far we have chosen the coefficients δ k (µ) to be real numbers. We can assume that this sequence was obtained by evaluating some sequence of functions Σ (k) at some fixed value µ, i.e., δ k (µ) = Σ (k) (µ)/Σ (0) (µ). Reference [51] proposes a variant where the probability distribution takes input from other values of the functions Σ (k) . The scale variation model of ref. [51] considers sequences δ(µ) whose coefficients satisfy a bound of the form: where r k (µ) encodes information on the derivative of Σ k (µ) with respect to µ. The probability distribution for fixed µ is related to P CH via the relation (with α s replaced by unity): . (3.13) In this case the prior for c is chosen in ref. [51] to be a Poisson distribution.

An asymmetric geometric model: the abc-model
The symmetry in the geometric model can be traced back to the fact that the constraint in eq. (3.6) defining the model only depends on the magnitude of the coefficients, and it is insensitive to their signs. Alternative models that lead to asymmetric distributions and include information about coefficient's signs are discussed in appendix B of ref. [51] (see in particular B.2 and B.5). Here, we consider a new model where the upper and lower bounds are different in magnitude. More precisely, we have three hidden parameters p = (a, b, c), and we consider a sequence of coefficients δ k = Σ (k) /Σ (0) such that (again, we suppress the dependence on the scale µ) We refer to this model as the asymmetric geometric model, or the abc-model. The conditional probability distribution takes a form very similar to the one in the geometric model in eq. (3.7): Like in the geometric model, our choice of priors assumes that a is independent of the parameters b and c. However, since b and c jointly define the upper and lower bounds, we do not see any reason to consider them independent. Instead, we use a joint probability for where the factors in the right-hand side are given by: Unlike in the geometric model, we do not constrain a to be positive and only require that |a| < 1. The prior P 0 (b, c) is characterised by three constants , η and ξ. As in eq. (3.8), > 0 is used to make the prior normalisable. The second parameter η > 0 determines the minimal difference between the upper and lower bounds of the series, see eq. (3.14). Finally, ξ > 0 restricts the allowed range of b, with ξ = 0 forcing b to vanish and we recover a symmetric geometric model. We note that the constraints in eq. (3.14) for k = 0 imply c > |b − 1|. Combining this with the prior in eq. (3.17), we obtain c > max{η, 1 1+ξ }. In figure 2 we show the ranges for b and c allowed by the prior in eq. (3.17) and c > |b − 1| for the two cases ξ > 1 and ξ < 1. We can clearly see from figure 2 that if η < 1 1+ξ , then η no longer affects the model inference. Similarly if ξ > η+1 η > 1, i.e., η > 1 ξ−1 , then the inference becomes independent of ξ. Every subsequent term δ k in the series adds the constraint c > δ k a k − b , whose position depends on the values of δ k and a. We conclude this section by stressing that the quantities ω, ξ, and η are not hidden parameters of the model. Rather, they are (positive) constants defining a family of abcmodels. In the next section we will study the model sensitivity to the choice of these constants. However, one has to keep in mind that this only explores the robustness for the specific model defined by eqs. (3.14), (3.15) and (3.17). There is an infinite-dimensional functional space of possible models and priors that one might consider. We believe that the abc-model offers a good compromise between the flexibility of the model (it gives asymmetric probability distribution and can be equally well applied to monotonic and simple alternating series) and the transparency and simplicity of the model.

Sensitivity to the choice of priors
Let us illustrate the abc-model and its sensitivity to the choice of priors on the example of the geometric series δ k = x k , for |x| < 1. Then there exists a = x for which the region of values of (b, c) excluded by eq. (3.15) collapses to c > |1 − b|. We see from figure 2 that η > 0 prevents c from reaching zero. Setting η → 0 and ξ → ∞ turns off any constraints by the prior P 0 (b, c), and the abc-model can describe such a geometric series arbitrary well with a → x, b → 1, c → 0 in eq. (3.15).
We compute the conditional probability for the estimated next term in the series δ est n+1 : For fixed a, the integral over b can be done analytically and expressed in terms of maximum and minimum functions of δ k a k , η and ξ. In appendix B we show how to evaluate the 1  distribution P abc (δ n ) analytically in terms of Gauss' hypergeometric function, allowing for a fast and efficient numerical implementation of the model.
In the top left panel of figure 3 we show the probability distributions for the partial sums S n := n k=0 δ k = n k=0 x k for x = 0.7 and n ≤ 7 using the abc-model with parameter values ( , ω, ξ, η) = (0.1, 1, 2, 0.1) (see the discussion below for the choice of these values). For n = 0, the probability distribution is symmetric and centred around S 0 = 1. For n > 0 the distributions are clearly not symmetric and become more and more peaked as n increases. In the top right panel we show the probability distributions for the scaled deviation from the known correction (S est This allows us to compare different orders without the suppression of the expansion parameter. In this plot the S n+1 value corresponds to zero on the x-axis, while S n corresponds to ±1 (depending on the sign of δ n+1 ). Again, for n = 0 the distribution is centred around the initial value S 0 , but for each subsequent order, the distribution shifts towards the true value. We note that the shape of the distribution does not change significantly beyond n = 3, so that the narrowness of the distributions in the left panel is solely due to the higher power of the expansion parameter. In the simple case of a geometric series, the scaled distributions are controlled by the prior P 0 (b, c). For η = 0.1 and ξ = 2, the minimum of c is at figure 2), which gives a good approximation for the centre position and the (half) width of distributions seen in figure 3. As explained earlier, increasing ξ and reducing η can narrow down the posterior distribution around the exact geometric series value x n+1 .
In the bottom panels of figure 3 we show the 68% and 95% CIs for the abc (blue) and geometric models (red) (we use the same values (ω, ) = (1, 0.1) for both models). On the left we show the CIs for S est n+1 , while on the right we again scale to the known size of the next term in the series. We see that for n > 0 the 68% CIs computed with the abc-model encompass the next value of S n . This is not the case for the geometric model, even though the 68% CIs are larger in magnitude. This is due to the fact that the geometric model is symmetric around the last known order, thereby not accommodating the fact that all the coefficients δ k are positive. In contrast, the CIs of the abc-model are systematically shifted towards the exact values for n > 3, and the 95% CIs even exclude the previous value S n . However, one must remember that these features of the abc-model may not always be desired in practical applications. and can be controlled by the priors (see below).
For completeness, in appendix C we have also included a similar plot showing the alternating series with x = −0.7. Here we just note that abc-model correctly predicts the negative sign of the expansion parameter and remains consistently shifted towards the next value in the geometric series δ n+1 . In all other respects the distributions and the CIs are identical to the ones shown in figure 3.
In figure 4 we study the dependence of P abc (δ n+1 |δ n ) on the choice of the values for ( , ω, η, ξ) that define the priors on (a, b, c) in eq. (3.17). We vary each parameter independently, with the other held fixed at their default values ( , ω, ξ, η) = (0.1, 1, 2, 0.1). We see that, as expected, the dependence on , ω, and η reduces as n increases. The ξ-dependence, however, does not disappear completely with increasing n. This is due to the fact that ξ controls the amount of asymmetry of P abc (δ n+1 |δ n ). For small values of ξ, the distribution approaches a symmetric distribution around the last known value S n , while for large ξ it becomes much more peaked around the next value S n+1 . Empirically, we find that our choice ξ = 2 allows distributions to be shifted towards the next term, but keeping the width of the distribution of the same size as the shift. Next we look at ω, which controls the suppression of the prior for a around |a| = 1 (ω = 0 corresponds to a flat prior). Because here we consider a rather large value of the expansion parameter x = 0.7, increasing ω slightly reduces the expectation value of a and shifts the distribution to smaller values of S est n+1 . Our default choice of ω = 1 biases the hidden model parameter inference towards small |a| values. In the bottom left panel of figure 4 we see that larger values of η shift the posterior distribution towards the previous order. η plays a similar role to ξ in controlling the allowed minimum value of c. We use a small value of η = 0.1, which is in fact superseded by the constraint from ξ (see figure 2). Finally, in contrast to η, for larger values of the model becomes slightly more peaked and shifted towards the next term in the series. We choose a small value = 0.1 to maintain fairly flat distribution in the prior of log c.

Quantities without explicit scale dependence
The goal of this section and the next is to illustrate the concepts and models from the previous sections on various examples of observables in QFT. We start by investigating several quantities whose perturbative expansion does not have an explicit dependence on the perturbative scales, or for which scale dependence is not relevant. This allows us to illustrate the application of the abc-model to genuine perturbative expansions from QFT. The sequence δ n is the sequence of the n + 1 first perturbative coefficients, normalised such that δ 0 = 1 The rescaling introduces a Jacobian into eq. (2.8), which now takes the form We then use this probability distribution to compute CIs for Σ within the abc-model using perturbative input through N n LO. These intervals estimate the size of the missing N n+1 LO terms, and so they serve as measures of the MHO uncertainty at N n LO. To assess the validity of this procedure, we show in each case how the size of the CIs computed at different perturbative orders compares to the actual size of the next order whenever it is available. By abuse of notation, we will use both P abc (Σ est n+1 |δ n ) and P abc (δ est n+1 |δ n ), where the relation to eq. (4.2) has to be inferred from the argument.
Before discussing explicit examples of perturbative expansions in QFT, let us make some general comments. First, we stress that the approximation in eq. (4.2) is only valid if we are in the perturbative regime. It is in particular not valid once we enter the regime where the perturbative series starts to diverge. In that situation the right-hand side of eq. (4.2) may still represent a valid probability distribution for the next perturbative coefficient, but this distribution is no longer a good approximation of the left-hand side, which is the probability distribution for the true value of Σ. We note that for most applications only very few perturbative orders are available, and this will not be a problem. Second, it is possible to be in a situation where the model happens to describe the input data extremely well. The abc-model strongly favours a perfect geometric series. If the available perturbative data closely follow a geometric progression up to a certain order by some numerical coincidence, then the resulting distributions can be strongly peaked. In such a scenario, the 68% or 95% CIs may even exclude the highest available order Σ n . This is a design feature of the abc-model that can be controlled by the priors (see section 3.3).
We stress that regardless of how well the truncated perturbative series is described by a chosen model, there is no guarantee that the actual next order in perturbative QFT or the measured experimental value of Σ will fall into a specific choice of CI. Therefore, the choices of the model, the priors and the CI used to estimate the MHO uncertainty of an N n LO computation should be carefully scrutinised for each new observable separately.
Planar N = 4 SYM. The cusp anomalous dimension in planar N = 4 SYM admits a perturbative expansion: Here the expansion parameter is the 't Hooft coupling λ = g 2 s Nc 8π 2 , where g s is the Yang-Mills coupling and N c is the number of fundamental SU(N c ) colours. Note that N = 4 SYM is a conformal field theory (even away from the planar limit), and so the coupling is not UV renormalised and does not depend on any renormalisation scale. The first few expansion coefficients are shown in    Figure 5. Top left panel: The probability distribution from the abc-model for the cusp anomalous dimension (γ K ) est n+1 in planar N = 4 SYM for λ = 0.06 and for different values of n. Top right panel: The same distributions normalised to the exact N n+1 LO correction. Bottom left panel: the median (plus), 68% CI (errorbox) and 95% CI (errorbar) for the posterior of (γ K ) est n+1 , computed from the abc (blue) and geometric (red) models using information on the previous orders. The exact value of (γ K ) n is shown as black circles. Bottom right panel: CIs scaled to the exact N n+1 LO correction.
In figures 5 and 6 we show the probability distributions obtained from the abc-model for the partial sums with λ = 0.06 and λ = 0.155. The series δ n is normalised according to eq. (4.1). For λ = 0.06 < 1 8 , the perturbative series (which is alternating) is expected to converge. In figure 5 we clearly see a convergent pattern in the left top and bottom panels. The distributions quickly become very peaked around the limiting value with very small 68% and 95% CIs. In the right panels, we show the distributions and CIs normalised to the  exact N n+1 LO correction. This effectively cancels the suppression due to increasing power of the expansion parameter λ. We see that after such a rescaling, the distributions do not change significantly for n > 2 (up to a trivial reflection around the vertical axis, because of the alternating nature of the expansion). We note that in contrast to the case of an exact geometric series (see figure 3), here the posterior distribution for δ est n+1 is centred around the half of the exact value |δ n+1 |. As the series progresses, the scaled distribution reaches a peak for n = 5. After that it becomes steeper and broader. In the bottom right panel of figure 5, we see that the 95% CIs shrink more than the 68% CIs. For comparison we also show the CIs for the geometric model, which are significantly wider, but are no better in capturing the next order than the narrower CIs from the abc-model. In both cases the 95% CIs appear to give a reasonable estimate for the size of δ n+1 , even though the series is clearly not a perfect geometric series.
For λ = 0.155 the series is asymptotic and starts to diverge after the seventh perturbative order, i.e. n > 6. This behaviour is reflected in the probability distributions in the left panel of fig. 6: they become peaked around the limiting value up to the inclusion of the sixth order (the sixth and seventh order corrections are numerically almost identical). After that a plateau develops which becomes broader as we increase the perturbative order. Consequently, the width of the CIs increases steadily after the seventh order (bottom left panel). However, the rescaled distributions (right panels) look essentially identical to the convergent λ = 0.06 case. This is not surprising, because the two series differ only by the value of the expansion parameter, which is scaled away in the right panels. Therefore, like before the 95% CIs give a good estimate of the size of the next order corrections, even in the case of an asymptotic series. We note, however, that for n > 6 we enter the regime where the next order is not expected to give a good approximation of the true value of γ K . Hence, eq. (4.2) is not valid, and so the credibility intervals are not expected to give a good estimate of the uncertainty attached to the perturbative result.
The cusp anomalous dimension in QCD. In QCD, the cusp anomalous dimension is known up to four loops [54][55][56][57][58]72]. It depends on the representation of the Wilson loop and the matter content of the theory. Here we focus on the fundamental representation, and we work with 5 massless flavours. In this case we have QCD obtained from the abc-model. In the top left panel we see that the series is rapidly convergent, and the probability distributions become progressively more peaked. From the rescaled distribution in the top right panel we see that the probability distribution for n = 2 peaks at the exact value of the correction. We do not show scaled results for n = 3, because the exact five-loop result is not known yet. In the bottom left panel we show the 68% and 95% CIs for the abc-model and the geometric model. We see that, unlike in the case of planar N = 4 SYM, the 68% CIs always include the next correction, thus giving a reliable estimate of the MHOs (at least for the first few orders considered here). We emphasise that this is particularly non-trivial at four loop order, where Casimir scaling is violated for the first time, and a new quartic Casimir enters. Estimates of the size of the four-loop corrections based on Padé approximants from the first three orders were within 5% of the four-loop contributions that admit Casimir scaling for n f = 5, but were 57% off from the complete result [75]. The 68% CI obtained from the first three loop orders in the abc-model correctly captures the size of the four-loop result, including the contribution from the quartic Casimir.
From the bottom right plot we see that the median of the posterior distribution for the abc model (marked by a plus) is closer to the exact value than for the geometric model, although the abc-model appears to expect larger higher-order correction, and the distributions are skewed towards larger corrections.  i.e., Σ = Γ QCD cusp − (Γ QCD cusp ) 4 , is where we scaled out C F αs(µ R ) π 5 for visibility. We see that the abc-model at 68% credibility level would expect a positive higher-order correction. In contrast, the geometric model estimates a symmetric interval CI 68 = [−5.4, 5.4] with the same normalisation .

On-shell and MS-scheme quark masses
Our next example is the relation between the heavy quark mass m in the on-shell scheme and the mass m in the MS-scheme. The bare mass m 0 is related to the on-shell and MS quark masses via the renormalisation factors Z m  For heavy quarks it is possible to perturbatively compute these renormalisation factors. They can be used to define the scheme-conversion factor between the two schemes, For heavy quarks z m (µ R ) is currently known up to four loops [76][77][78]. We can express the on-shell mass as a series calculated in perturbative QFT In practice, one does not consider arbitrary µ R but rather work with a self-consistent definition of the m mass m(m). Thus for the quark masses we consider the scale as fixed number rather than as a free parameter. In figure 8 we show the probability distributions and CIs of the on-shell top quark mass (m t ) est n+1 = n+1 k=0 m (k) given the first n + 1 perturbative orders of the scheme-conversion factor z k mt (µ R ) evaluated at µ R = m t . In the top left panel we see that the distributions become more and more peaked as perturbative information becomes available. On the top right we see that the abc-model correctly estimates the positive sign of the corrections, but the estimated corrections are in general smaller than the exact value. In the bottom left panel we see that the CIs shrinks rapidly as n increases. In the bottom right we see that, unlike in the case of the QCD anomalous dimension, the 68% CIs of both the abc and geometric models do not include the next perturbative order, although the 95% CIs are sufficiently wide to include it. We note that the abc-model distributions are asymmetric, with a tail towards positive perturbative corrections showing that, although the abc-model does not peak at the exact value of N n LO contribution, it (correctly) anticipates that larger positive corrections are more likely. This is a result of the fact that the perturbative coefficients z (4.10) It is interesting to compare the size of the CIs obtained from the abc-model to the range five-loop estimates (in GeV) obtained in ref. [79]: We see that the 68% CI from the abc-model are lower than the estimates of ref. [79], but the 95% CI includes all values of ref. [79], except one. For completeness, in appendix C, we include plots for the bottom and charm quarks. The scaled probability distributions and CIs look similar to the ones in the right panels of figure 8. We note, however, that the MHO estimate using Bayesian inference is less reliable for lighter quarks, as in this case corrections become large and the perturbative expansion itself fails. This can be seen especially in the case of the charm quark mass.
In figure 9 we show the dependence of the probability distributions for the (m t ) est n+1 for n = 1, 3 from the abc-model on the parameters (η, ξ, , ω). As already discussed in section 3.3, we see that the dependence on (η, , ω) reduces as more perturbative information becomes available, while ξ controls the amount of asymmetry of the distribution. However, after the fourth perturbative order, the distributions become very narrow and peaked, and the shape is largely independent of the values of (η, ξ, , ω) if ξ ≥ 2.

Electron anomalous magnetic moment and positronium hyperfine-splitting
In this section we illustrate how Bayesian techniques work in the context of two classical results in QED which are known to relatively high order in the coupling, namely the electron anomalous magnetic moment and the positronium hyperfine splitting. QED corrections are typically believed to be rapidly convergent due to the smallness of the QED coupling α = 1/137.035 998 995(85) [80]. Further, these results are evaluated in the on-shell renormalisation scheme, thus they do not depend on any unphysical scale.  Here we focus only on the mass-independent part of the corrections, which are known up to 5 loops [81][82][83][84][85][86][87][88][89][90]. The numerical values for the coefficients C 2k e are collected in table 3.  Table 3. Known coefficients of mass-independent corrections to electron anomalous magnetic moment and to positronium hyperfine splitting. Figure 10 illustrates the abc-model estimates for the probability distributions and the CIs for the partial sums (a e ) est n+1 = n+1 . Due to the very small values of the expansion parameter, the absolute distributions and CIs are hardly visible for n > 1. The distribution for n = 1 is rather broad, resulting in CIs that overestimate the actual size of the higher-order corrections. At n = 2, the 68% CI captures the size of the next correction, while for n = 3 only the 95% CI includes the next order.
Our second QED example is the positronium hyperfine-splitting. This is a non-perturbative quantity and admits a double expansion in α and ln α: (4.13) Only terms with j < i are non-zero with the exception of the coefficient C 00 . This introduces a natural ordering among the coefficients, which are ordered first by i then by j. For the known terms this ordering is also equivalent to ordering the contributions by their size, since ln α ∼ −5. The currently known values [91][92][93][94][95] are shown in table 3. The presence of the logarithmic corrections in the series clearly violates the geometric progression of the series assumed in both the geometric and abc-models. Our Bayesian inference for such series is therefore less reliable as an estimate of the MHOs in comparison to a purely perturbative quantity such as a e . This can be also seen in figure 11: At the lowest two perturbative orders the CIs overestimate the size of the perturbative corrections. As the logarithmically enhanced corrections appear, both geometric and abc models fail to provide reliable estimates of the next-order correction. This is expected, since the models consider only a single expansion parameter a and expect the next-order correction to be suppressed by an additional power of a.
In principle, we can consider even more complicated QED bound-state observables, for example, the bound electron g-factor [96][97][98][99]. In this case, one has a triple expansion in powers of (Zα), α and ln(Zα), where Z is atomic charge. We expect that such series require more attuned models that take into account non-perturbative terms and double or triple expansions. We conclude this section by observing that QED corrections require dedicated studies within Bayesian framework that we leave for future works.

Hadronic observables with explicit scale dependence
In this section we discuss the application of Bayesian techniques to estimate MHOs for a selection of hadronic observables known to NNLO, or even N 3 LO, in the strong coupling constant. In all cases, we compute scale-independent probability distributions and CIs using the scale-marginalisation (sm) and scale-averaging (sa) prescriptions introduced in sections 2.1.1 and 2.1.2. Throughout this section we work with the geometric and abcmodels of section 3. We start by discussing inclusive cross-sections and discuss differential distributions towards the end of this section. In contrast to the examples studied in ref. [51], we take into account the variation of the factorisation scale.

Higgs production in gluon fusion
We start by discussing the inclusive Higgs production cross-section in gluon fusion: The inclusive cross-section is known through NLO in QCD including all finite top-mass effects [100][101][102][103][104][105][106][107][108][109], and through N 3 LO in an effective theory expansion where the top quark is infinitely heavy [110][111][112][113][114][115]. All results in this section were obtained with iHixs 2 [116], neglecting finite top-mass effects, with the PDF4LHC15_nnlo_100 PDF set and the on-shell top quark mass m t = 172.5 GeV. We choose as value for the central scale µ 0 = m H /2 = 62.5 GeV (both for scale-variation results and in the priors in eqs. (2.5) and (2.12)). First, in figure 12 we show the probability distributions and CIs for (σ ggH ) est n for fixed µ R = µ F = µ 0 . We see that for n = 1, 2, the CIs for both the abc and geometric models include the next order at 68% credibility level. We note, however, that for µ 0 = m H /2 the N 3 LO corrections are particularly small, and the CIs of the abc-model overshoot the computed value. We will see below that for other scale choices, the CIs of the abc-model are centred around the N 3 LO values.
Next, in figure 13 we plot the renormalisation scale dependence of (σ ggH ) n at different orders (for this plot only we keep µ F = µ 0 fixed). We see that in the conventional scale variation window, F = 2, the cross-section depends quite strongly on the scale, except for the highest order. The N 3 LO corrections vanish for µ R ≈ 0.8µ 0 , and the lines representing the NNLO and N 3 LO results as a function of the scale cross, i.e., we reach a FAC point. The FAC point at NNLO is at a smaller value, µ R ≈ 0.3µ 0 . Also, the crossing points visible in figure 13 approximately coincide with the local maxima for n = 2, 3, i.e., the FAC and PMS points are surprisingly close to each other.
In figure 13 we further show the CIs in the geometric and abc-models with LO, NLO and NNLO cross-sections (at fixed scale) used as an input. We observe that the CIs for the geometric model are centred around the n = 2 line due to the symmetry of the geometric model, but the abc-model assigns a higher probability to a positive correction. Indeed, for µ R > 2µ 0 the CIs are almost centred around the N 3 LO line. For µ R < 0.8µ 0 , the N 3 LO corrections become negative, while the abc-model still anticipates a positive correction. We note that the CIs for the geometric model shrink dramatically as one approaches the NNLO FAC point at µ R = 0.3µ 0 , i.e., large corrections are deemed unlikely by the geometric model close to that scale. Once the NNLO corrections become negative, the abc-model also becomes centred around the NNLO result, i.e., the abc-model no longer expects a positive correction. The CIs in this region do not shrink as dramatically as for the geometric model, although generally the abc-model leads to narrower CIs than the geometric model.  Figure 12. Top left panel: The probability distributions from the abc-model for the inclusive gluon-fusion Higgs production cross-section (σ ggH ) est n+1 evaluated at µ R = µ F = m H /2 and for different values of n. Top right panel: The same distributions normalised to the exact N n+1 LO correction. Bottom left panel: the median (plus), 68% CI (errorbox) and 95% CI (errorbar) for the posterior of (σ ggH ) est n+1 , computed from the abc (blue) and geometric (red) models using information on the previous orders. The exact values of (σ ggH ) n are shown as black circles. Bottom right panel: CIs scaled to the exact N n+1 LO correction.

Sensitivity to the scale interval and scale prescription
We begin by analysing the dependence of our sm-and sa-prescriptions on the range F that is chosen in eqs. (2.9) and (2.13), i.e., µ 0 /F ≤ µ ≤ F µ 0 . Figure 14 shows the CIs for the values F = 2, 4, 10 using the geometric and abcmodels. The left and right panels summarise the results from the sm-and sa-prescriptions respectively. For n = 2 the sm-prescription exhibits a somewhat smaller dependence on the choice of F , while the CIs for the sa-prescription grow with F . This can be understood from the discussion in section 2.2: for symmetric models like the geometric model, the smprescription tries to adapt to the point where the higher-order corrections are minimised, i.e., the FAC point. Once this point is covered by the range in the marginalisation, a further increase does not have a substantial impact on the uncertainties. From figure 13 we see that this is precisely what happens for the geometric model for n = 2, when F is increased from 2 to 4, and even for F = 10 the CIs change only very little. The discussion of section 2.2 does not apply to the abc-model, which is asymmetric. Recall from figure 13 that the abc-model does not have such dramatic reduction in the size of the CIs when reaching an FAC point.  Therefore it is also less biased to the inclusion of FAC point into the integration range for µ in the sm-prescription. For n = 3 the second FAC point at very small scales becomes relevant only for the largest F values. The sa-prescription, on the other hand, is biased to the regions where the scale dependence is flat, i.e., it is biased towards the PMS point. However, for this particular process we observe the peculiar situation that FAC and PMS points are very close to each other (at least for the orders we considered). Due to the coherent addition of probabilities, for the sa-prescription increasing the range of the µ-integration inevitably will also increase the CIs.
Let us conclude this discussion with an important point. From a Bayesian perspective, abc scale-averaging n = 0 n = 1 n = 2 n = 3 "exact" one expects that the predictions of the model become independent of the priors once enough data have become available. In particular, in the sm-prescription, the perturbative scales are treated as model parameters, and so one expects that at high enough orders the model predictions should only mildly depend on the prior in eq. (2.5), and the choice of F , in agreement with our findings in figure 14. However, it would be premature to conclude that the probability distributions are independent of the prior and F : we have shown in section 2.2 that for the geometric model in the sm-prescription, the preferred scale is the FAC point. The fact that the CIs are insensitive to the choice of F is likely to be related to the fact that the distributions will be highly peaked at the values of the cross-section at the FAC point, which is not necessarily related to the prior independence. Note that the argument of section 2.2 does not apply to the abc-model in the sm-prescription (because the abc-model is not symmetric).

Accuracy of Gauss-Legendre quadrature rule
As already discussed in section 2.3, performing the numerical integral over the scale can become computationally expensive. Quadrature rules, on the other hand, allow one to approximate the full integral using only a small number of input points. In particular, a set of points for a 9-point scale variation may be readily turned into an integral over (µ R , µ F ).
The left panel of figure 15 illustrates the accuracy of this approximation for the probability distributions in the abc-model using the sa-prescription. The distributions for different values of n (in colour) are obtained using the 3-point Gauss-Legendre quadrature for both µ R and µ F . The results obtained by a finely-spaced equidistant numerical integration over the same scale interval with F ≈ 2.45 is overlaid on top (black dashed lines). We see that, except for some features around the peak, the quadrature rules reproduce the "exact" numerical integration very well. In particular, this approximation has very little effect on the CIs, which are shown in the right panel of figure 15, where we show the 68% and 95% CIs for the geometric and abc models using both the sm-and sa-prescriptions. The dashed black lines indicate the results of the finely-spaced numerical integration. We note that after scale integration the probability distributions are no longer symmetric, even if the distributions at fixed scale are. Overall, we observe that the Gauss-Legendre quadrature rule is able to reproduce both the 68% and 95% CIs from the "exact" numerical integration to a very good degree. Therefore, in the subsequent sections we will only show the results obtained using the Gauss-Legendre quadrature.
For Higgs production in gluon fusion both the sm-and sa-prescriptions produce very similar CIs shown in the right panel of figure 15. The CIs are also very similar in size for both the geometric and the abc-models. The results for the latter are, however, systematically shifted upwards due to the sequence of positive corrections at lower orders. For comparison we also display the scale variation intervals using 9-and 7-point rules, which have no probabilistic interpretation. We observe that the scale variation intervals are slightly smaller, but rather close, to the 68% CIs obtained from the Bayesian methods. This illustrates the arguments of section 2.3 in a practical example. In particular, thanks to the slow µ F variation, the scale variation intervals are close to saturating the s = 1 bound in eq. (2.29).
Finally, we conclude with the summary of 68% and 95% CIs and scale-variation intervals for the highest known order n = 3 of (σ ggH ) n : model prescription CI 68  We note that as the N 3 LO cross-section reaches a local maximum near the central scale µ 0 , traditional scale variation produces a one-sided interval. The geometric model has approximately symmetric intervals around the central value, while the abc-model anticipates positive MHO terms. Although all 68% CIs and the scale-variation intervals are numerically close in size, we stress that they have completely different statistical meanings. Scale-variation intervals have no probabilistic interpretation and cannot be associated with a particular level of credibility. In contrast, the Bayesian CIs express a mathematically rigorous degree of belief for the MHOs from the posterior distribution for a given model and scale prescription.

Vector-Boson Fusion Higgs and di-Higgs production
Our next two examples are Higgs production in vector-boson fusion (VBF), and a similar process for double Higgs production. The inclusive VBF cross-section is known up to N 3 LO. We denote VBF cross-sections for single Higgs [117][118][119][120] and double-Higgs production [120- n geo sm abc sm sa sa 9pt 7pt Figure 16. The 68% and 95% CIs for the VBF cross-sections for Higgs and di-Higgs production for the geometric and abc-models using the sa-and sm-prescriptions. The scale variation intervals using 7 and 9 points are shown for comparison.

(5.2)
As before the centre-of-mass energy is √ s = 13 TeV. The central scale is given by the vector boson momentum [123] and we take into account the dependence on both factorisation and renormalisation scales. Computations were performed with the proVBFH code [124].
In the left panel of figure 16 we display the CIs for different models and prescriptions for single Higgs VBF production. For n < 2 the Bayesian approach gives a larger uncertainty (68% CIs) than the traditional scale variation. Because the NLO correction is negative, the abc-model anticipates an alternating series, and consequently the CIs for n = 1 for the abcmodel are positively shifted compared to the NLO result. However, the NNLO corrections are again negative, and for n = 2 all studied models and prescriptions give very similar 68% CIs, although the abc-model has much larger 95% CIs than the geometric model. For n = 3 the 68% CIs shrink even further and become somewhat smaller than the scale variation intervals. For the single Higgs VBF cross-section (σ VBF-H ) n at n = 3 these CIs are: We note that the sm-prescription gives much smaller CIs than the sa-prescription. In fact, the 95% CIs of the scale-marginalised geometric model is smaller and does not contain the scale-variation interval, demonstrating that the bounds discussed in section 2.3 do not necessarily apply to the sm-prescription. In contrast the 95% CIs for the geometric model in sa-prescription contain the scale variation intervals, as expected. In the right panel of figure 16 we display the CIs for different models and prescriptions for di-Higgs VBF production. We observe very good convergence of the cross-section, and correspondingly the CIs from Bayesian inference shrink rapidly. We observe that the saprescription gives larger CIs than the sm-prescription, which is due to the presence of an FAC point in the scale integration interval. We also see a significant difference between the 7-and 9-point scale variation intervals. The 9-point interval for n = 2 is larger than the 68% CI for the geometric model with the sa-prescription, but still within the 95% CI. This is a sign that both renormalisation and factorisation scale dependencies are comparable and the one dimensional (s = 1) bound estimated in section 2.3 is not applicable in this case. For n = 2 and n = 3 the 95% CI for scale-marginalised geometric model is even smaller than 9-point scale variation interval, while sa-prescription 95% still encompasses it, as expected. The CIs for di-Higgs VBF cross-section (σ VBF-HH ) n at n = 3 are: We see that sa-prescription gives generally more conservative estimates for the MHOs than either the sm-prescription or scale variation.

Drell-Yan processes
Drell-Yan-type processes play a crucial role in hadron collider phenomenology. The inclusive cross-section for off-shell photon production with virtuality Q 2 , pp → γ ( * ) , is known though N 3 LO [125][126][127]: N 3 LO corrections are also known for the charged-current Drell-Yan process pp → W ± → ± ν [126,[128][129][130][131][132][133]. We focus here on the lepton charge asymmetry defined as and the n-th order approximation is n geo sm abc sm sa sa 9pt 7pt Figure 17. The 68% and 95% CIs for the neutral-current Drell-Yan cross-section and chargedcurrent lepton-charge asymmetry for the geometric and abc-models using the sa-and smprescriptions. The scale variation intervals using 7 and 9 points are shown for comparison.
In the left panel of figure 17 we show results for the neutral current Drell-Yan crosssection. We see that for n = 2, 3 the sa-prescription produces 68% CIs which are similar in size to the regular scale variation intervals. The CIs from the sm-prescription for n = 2 are much smaller, again due to the presence of an FAC point in the integration region for µ. The N 3 LO corrections are known to be sizeable, and they are not covered by the conventional 7-point scale variation at n = 2 [127]. The CIs for the neutral-current Drell Yan cross-section (σ DY-NC ) n at n = 3 are: model prescription CI 68  We observe that the 68% CIs are similar in size among themselves, and to the scalevariation intervals, but the CIs from the abc-model are slightly shifted upwards in the anticipation of a positive MHO correction.
In the right panel of figure 17 we show results for the lepton charge asymmetry for µ 0 = Q = m W . The perturbative expansion for A W (m 2 W ) is quickly convergent with only a mild scale dependence, because some corrections cancel in the ratio. The perturbative coefficients feature a monotonic increase with the perturbative order, and the abc-model correctly anticipates positive contributions from MHOs. The CIs from the abc-model are slightly smaller than for the geometric model. We do not observe significant differences between the sm-and sa-prescriptions, except for n = 3, where scale-marginalisation gives more aggressive CIs. We note that the traditional 7-point scale variation intervals for n = 0, 1 fail to include the next correction, but for n = 2 they are similar to the 68% CIs obtained from Bayesian inference. We observe that the sm-prescription produces smaller CIs than the sa-prescription, indicating the presence of FAC point, which also pulls the CIs upwards with respect to the N 3 LO result. Scale-averaged CIs are more conservative and more centred around the N 3 LO results for both the geometric and abc-models.

Deep inelastic scattering
The Bayesian procedure of estimating MHOs can be readily applied to differential spectra by considering individual points/bins of the associated observable. 7 In the following, we will investigate how the geometric and abc-models perform for various differential predictions. Results will be further contrasted between the sm-and sa-prescriptions, and compared to the 9-point scale variation intervals. Like for inclusive observables, we will perform scale integrals using the Gauss-Legendre method from section 2.3. This setup allows us to accommodate already existing results based on 7-or 9-point scale variation, without the need for a full re-computation of these predictions. In this section we begin our discussion of differential spectra with the DIS process. We will present results for several benchmark processes at the LHC in the subsequent sections. Predictions up to N 3 LO for the DIS process were obtained from the calculation in ref. [134], where we consider the distribution dσ/dQ 2 with respect to the virtuality Q 2 of the intermediate photon. The results are presented in figure 18. The left panels show the absolute distributions (top) and the standard scale variation envelopes normalised by the NLO prediction evaluated at the central scale (bottom). The four right panels contrast the MHO estimates (68% CIs) obtained using the abc and geometric models with the sa-(top two panels) and sm-prescriptions (bottom two panels). All CIs are normalised to the NLO prediction evaluated at the central scale. 8 Bands of different colours correspond to estimates at different perturbative orders: grey (LO), blue (NLO), red (NNLO) and dark grey (N 3 LO). Overall, we can observe that all models and prescriptions perform well in capturing the progression of the perturbative series with either overlapping or touching uncertainty bands. At NLO the Bayesian approaches give rise to 68% CIs which are slightly larger than the 9-point scale variation interval, but are comparable in size at NNLO and N 3 LO. In the case of the sm-prescription shown in the top two panels on the right in figure 18, the CI bands narrow dramatically at NLO for Q 2 ∼ 400 GeV, as well as at NNLO and N 3 LO at the lower end of the spectrum. This can be traced back to the presence of a FAC points that are strongly emphasised by the sm-prescription. In the case of the geometric model, we see that both scale prescriptions move the median of the NLO 7 Note that here no correlations are considered, i.e., cross-section entries in each bin of the distribution are assumed to be independent. The incorporation of correlations is left for future studies. 8 For clarity, we do not show 68% CIs at LO, nor any of the 95% CIs, which can be very large.  probability distribution towards the direction of the higher-order corrections, which slightly improves the overlap between the CIs at NNLO and N 3 LO. For the abc-model, on the other hand, the central value at NLO moves in the opposite direction, with strongly asymmetric uncertainty bands that extend further towards the LO prediction. This is because after the negative NLO correction, the abc-model anticipates a positive NNLO correction, as would be the case for an alternating series. However, for this process the NNLO correction is again negative. The geometric model, which ignores the signs of the corrections, appears to perform better. We note that at higher orders the abc-model captures the perturbative progression well with only slightly larger CIs than the geometric model.

Di-photon production
We now consider predictions for di-photon production at the LHC at √ s = 8 TeV using the results of ref. [135]. The production of a pair of photons, p p → γγ, provides a clean and well-measured final state that is an important background in Higgs measurements and can be used in the search for New Physics resonances. One striking feature of this process is the apparent slow convergence of the perturbative QCD series, as partially induced by the staggered transverse-momentum cuts on the photons. As a consequence, lower-order predictions are found to be inadequate to describe the measurement, and only starting from NNLO a sensible comparison to the data can be performed. Results for the di-photon invariant-mass distribution dσ/dm γγ are shown in figure 19 following the same pattern as in figure 18. We note that the sizeable higher-order corrections put the respective predictions far outside from the traditional scale variation intervals of the  previous order. Moreover, the scale variation envelope at NNLO is found to be of similar size, or even larger, than at NLO, with no convincing sign of convergence. The Bayesian models, on the other hand, produce 68% CIs that are generally larger, and the CIs do not increase when going to higher orders. With the large positive corrections prohibiting any FAC points, the two scale prescriptions produce CIs that are almost identical. While the CI bands between NLO and NNLO touch in the case of the geometric model, for the abc-model the NNLO CIs are outside of the NLO band for m γγ 200 GeV. This is likely due to large positive corrections encountered at each perturbative order. The abc-model anticipates this pattern to continue at higher orders and shifts the probability distribution towards larger cross-section values. Finally, in view of the noticeable differences for the size and position of the bands at the highest order (NNLO) using different approaches, we can conclude that N 3 LO corrections will be highly relevant for this process to obtain more robust MHO estimates, which the traditional scale variation is likely substantially underestimating.

Gauge boson production in association with a jet
The production of electroweak gauge bosons that recoil against hard QCD emissions is among the most important standard candle processes at the LHC with a wide range of applications that span from detector calibration, precision QCD studies, to New Physics searches. This process is now known up to NNLO for all gauge boson types. Here we focus on W + +jet production using the predictions from ref. [136]. Figure 20 presents the results for the inclusive transverse momentum distribution dσ/dp T,W of the W boson following the same layout as in figure 18. We can observe a  Figure 20. W + +jet production up to NNLO with µ 0 = E T,W as the central scale choice. Left: absolute dσ/dp T,W distributions and 9-point scale variation envelopes normalised to the NLO prediction at µ 0 . Right: 68% CIs for the abc and geometric models using the scale-marginalisation (top) and scale-averaging (bottom) prescriptions (also normalised to NLO).
good perturbative behaviour for this process with NNLO corrections that are well captured by the traditional scale variation at one order lower. The Bayesian approaches give rise to 68% CIs that are wider at NLO, while the size of the bands at NNLO are comparable between all approaches. The sm-and sa-prescriptions give rise to almost identical CIs, similar to the observations made for the di-gamma process. In the case of the abc-model, we can appreciate the flexibility of the asymmetric probability distribution. The positive offset at NLO centres the CIs at this order around the NNLO band. Finally, we note that once NNLO corrections are included, all the different prescriptions of estimating MHOs considered here give almost identical bands indicating that the uncertainties assigned at this order are very robust.

Higgs boson production in association with a jet
The study of the Higgs boson and its properties is among the highest priorities of the LHC programme and the transverse momentum distribution, in particular, plays an important role in this endeavour not only for boosted analyses but also to resolve the details of the production dynamics for this process. We focus here on Higgs boson production in association with a jet at √ s = 13 TeV in the four-lepton decay channel, H → 4 , and use the predictions for the "ATLAS II" setup computed in ref. [137].
The results up to NNLO for the transverse momentum distribution dσ/dp T,H of the Higgs boson are shown in figure 21. We observe large positive corrections at each perturbative order with the exception of the lowest p T bins. In contrast to the other differential distributions considered so far, we notice that the Bayesian models are not giving rise to 0.6 0.8  Figure 21. Higgs+jet production up to NNLO with µ 0 = 1 2 E T,H as the central scale choice. Left: absolute dσ/dp T,H distributions and 9-point scale variation envelopes normalised to the NLO prediction at µ 0 . Right: 68% CIs for the abc and geometric models using the scale-marginalisation (top) and scale-averaging (bottom) prescriptions (also normalised to NLO).
substantially larger uncertainty bands (68% CI) at NLO compared to the traditional scale variation envelope, but they are similar in size. Going to NNLO, the uncertainty estimates are found to be comparable between the different approaches, with the sm-prescription giving somewhat narrower bands. This compatibility between the different approaches again supports the robustness of the uncertainties assigned to these predictions. With the monotonic pattern of positive corrections at each order, the abc-model is again able to adapt to this feature in the intermediate-high p T range, capturing to a large extent the NNLO corrections in the NLO shift. This is more pronounced in the sm-prescription suggesting that the central scale choice of µ 0 = 1 2 E T,H is probing regimes in the scale variation that are close a FAC point.

Inclusive jet production
Jet production is the most prominent process at hadron colliders and an essential tool for the study of QCD. The inclusive transverse momentum distribution dσ/dp T,j is particularly interesting in the context of theory uncertainties, due to its sensitivity on the specific choice of the central scale that further persists at NNLO due to instabilities in the sub-leading jet distribution that impacts the inclusive spectrum [138]. In order to gain insights into this sensitivity and how the different prescriptions for assigning MHO uncertainties compare in these scenarios, we consider two different central scale choices: the jet transverse momentum p T,j and the scalar sum of the transverse momenta of the partonsĤ T = i∈partons p T,i , which are shown in figures 22 and 23, respectively.   Figure 22. Inclusive jet production up to NNLO with µ 0 = p T,j as the central scale choice. Left: absolute dσ/dp T,j distributions and 9-point scale variation envelopes normalised to the NLO prediction at µ 0 . Right: 68% CIs for the abc and geometric models using the scale-marginalisation (top) and scale-averaging (bottom) prescriptions (also normalised to NLO). absolute dσ/dp T,j distributions and 9-point scale variation envelopes normalised to the NLO prediction at µ 0 . Right: 68% CIs for the abc and geometric models using the scale-marginalisation (top) and scale-averaging (bottom) prescriptions (also normalised to NLO).
We begin our discussion with the scale choice µ 0 = p T,j , which makes the NLO corrections artificially small. From the results presented in figure 22, we see that in the low-p T region all approaches give rise to sizeable NNLO CIs. This is most clearly visible for the abc-model, where no reduction in the size of the CI band is observed as we move from NLO to NNLO. Neither of the bands overlap, because the abc-model expects a positive NNLO correction, which is not true for this central scale choice. The geometric model, instead, is able to assign more conservative MHO uncertainties at NLO that allow for some overlap between the last two orders. Overall, we can conclude that the Bayesian models considered here are not able to "un-do" an inappropriate scale choice through either the sm-or sa-prescriptions.
More specifically, no prescription based on a simple re-scaling from a central scale choice will be able to adapt between different dynamical scales that encapsulate a different kinematic dependence. This becomes most apparent in observables that are sensitive to infrared physics, because it is the cancellation between real and virtual corrections that strongly impacts such a sensitivity. For two dynamical scale choices that have a different dependence on the kinematics, the difference in the scales assigned to the real and virtual corrections can vary substantially. Such a feature can never be captured by a global scaling, as done in the sa-and sm-prescriptions considered here, that varies the scales for the real and virtual contributions in the same manner.
UsingĤ T as the central scale choice as advocated in ref. [138], we observe in figure 23 a stable progression of the series throughout with the next order, typically lying within the uncertainty estimate of the previous order. The 68% CIs are slightly larger for the Bayesian approaches at NLO, while at NNLO all uncertainty estimates are mutually compatible. This further supports this scale as an appropriate choice given its robustness with respect to the different prescriptions in assigning theory uncertainties. Finally, it is interesting to note that the abc-model is able to adapt to the sizeable positive corrections in the tail of the distribution, thus giving a noticeable positive shift at NLO that reduces the overall impact of NNLO corrections.

Discussion
The estimation of MHOs is a vital aspect of collider phenomenology. Traditional approaches are based on scale variation, an ad hoc prescription to obtain uncertainty intervals for MHOs without a clear probabilistic underpinning. In this paper, we have analysed Bayesian approaches to estimate the MHO uncertainties from the progression of the perturbative expansion of a physical observable. Building on the recent work in ref. [51], we have focused in particular on the question of how the dependence on the unphysical perturbative scales µ F and µ R can be incorporated into Bayesian inference. Our goal was to scrutinise various Bayesian models and to assess their common features by studying their performance on a wide and representative selection of hadron collider observables known at high orders in perturbation theory, namely NNLO and/or N 3 LO. In figure 24 we show a summary plot of 68% and 95% CIs for MHOs in our Bayesian approach and the traditional scale variation intervals for selected collider processes known up to N 3 LO. In the following we summarise our main findings and conclusions, pointing out directions for future research.
Biases related to the choice of the scale prescription. An important question is how the dependence on perturbative scales can (or should) be included into Bayesian MHO estimation. In ref. [51] it was proposed to treat the perturbative scales on par with other hidden parameters of a Bayesian model, which we refer to as the scale-marginalisation (sm) prescription. In section 2 we have introduced an alternative prescription, called scaleaveraging (sa), which does not require the perturbative scales to be assigned a statistical interpretation. We do not see any compelling physics argument to prefer one prescription over the other. The two prescriptions do not just differ in their philosophy of how the perturbative scales are interpreted, but we have demonstrated that they may lead to substantially different estimates for MHOs. In particular, in section 2.2 we have shown that (under some additional reasonable assumptions for Bayesian models) each of the two prescriptions has intrinsic biases towards specific scales: the sm-prescription tends to favour FAC points as the 'best' value of the perturbative scale, while the sa-prescription is biased towards the values of the observable at the PMS points. These biases are clearly visible in the examples in section 5 (even for models for which the assumptions entering the proofs are not satisfied). Care is thus needed when applying Bayesian techniques to incorporate scale information: the choice of the prescription may bias the predictions in favour of a specific scale choice. It would be interesting to see if one can construct other Bayesian models that incorporate scale information in an unbiased way.
The choice of the Bayesian model and the priors. In order to estimate MHOs in a Bayesian approach, one has to specify a model for the progression of the perturbative series. The geometric model studied in ref. [51] neglected the sign of perturbative corrections, which enforces the posterior distributions for MHOs (at fixed scale) to be symmetric around zero. To allow for asymmetric MHO probability distributions we introduced the abc-model in section 3 (see appendix B of ref. [51] for alternatives). In section 4 we benchmarked the abc-model on a number of perturbative QFT quantities known to high orders, for which scale dependence is not relevant. Our results show that the specific choice of the model and its priors can strongly influence the size and position of MHO CIs, especially when the number of input data is small.
From the physics perspective, there is no preference between Bayesian models. While from an aesthetic point of view, one may prefer simple models, a judicious but ad hoc choice has to be made. It might be therefore desirable to develop new models in the future suited for specific classes of perturbative series, e.g., see our QED examples.
The choice of the central scale and the range of scales considered. The prior on the perturbative scale in eq. (2.5) for the sm-prescription (see ref. [51]) (or the weighting function in eq. (2.12) for the sa-prescription) depends on a choice of the central scale µ 0 and the size F of the interval for the scale integration. Our analysis in section 5 shows that the CIs have a substantial dependence on the choice of µ 0 and F . For example, once the range µ F , µ R ∈ [µ 0 /F, µ 0 F ] is wide enough to include an FAC point, the CIs in the sm-prescription may become relatively F -independent, but mostly because the probability distributions become strongly peaked and biased towards to the FAC point. The sa-prescription, instead, will typically increase with F , thereby rendering the probability distributions F -dependent.
The study of the inclusive jet cross section at NNLO in section 5.8 shows that the convergence of the CIs in the Bayesian approach (at least for the prescriptions and models studied here) is not any less sensitive to the central scale choice as the traditional scalevariation intervals. It would be interesting to see if there are other prescriptions or choices for the prior P 0 (µ) and the weighting function w(µ) that lead to probability distributions and CIs that are truly independent of the choice of the central scale µ 0 and interval size F . The main challenge in this regard arises from the limitations of a naive re-scaling of the central scale by constant factors that, in general, will not able to capture the difference in the kinematic dependence among possible choices for a dynamical scale.
Relation to MHO estimates using scale-variation. Let us point out a very practical connection between the existing theoretical results with the 9-point scale variation and scale prescriptions in the Bayesian approach. The scale dependence is typically mild enough so that a low-order Gauss-Legendre quadrature rule can be used to approximate the scale integrals in the sm-and sa-prescriptions. For the specific choice of integration interval F ≈ 2.45, the traditional scale variation points are placed at the quadrature nodes. Therefore the Bayesian CIs can be efficiently computed from already existing predictions.
Furthermore, for the observables studied in section 5, we have demonstrated that the traditional 7-or 9-point scale variation intervals are often similar in size to the Bayesian 68% and 95% CIs. Similar behaviour has been observed before [48,50,51]. This is not a coincidence, and in section 2.3 we have shown that (under certain assumptions) the 95% CIs in the sa-prescription must include the scale-variation intervals. We stress that scalevariation intervals themselves are devoid of any statistical interpretation.
Finally, we observe that the 68% CIs at different perturbative orders in the Bayesian approach share some qualitative features known from scale-variation. For example, neither the scale variation envelopes nor the 68% CIs overlap for Drell-Yan production at N 3 LO and NNLO or di-photon production at NNLO and NLO (however some models lead to barely overlapping 68% CIs, e.g., see figure 19). For processes where scale-variation intervals indicate good perturbative convergence, Bayesian models often also show a good, or even better, nesting of 68% CIs at different orders. The similarities of the two approaches are not coincidental, as both prescriptions share assumptions on the convergence of the perturbative series, namely that we are at low enough order so the perturbative series exhibits a convergent behaviour. Despite these similarities, scale-variation intervals and Bayesian CIs are very different objects from a statistical point of view. In particular, within the Bayesian approach, one can choose in a quantitative manner how aggressive or conservative estimates of MHO should be for a given process and for a given choice of model.
Conclusions. Based on our comprehensive analysis, we conclude that Bayesian approaches provide solid alternatives to the traditional scale variation as a means to estimate MHOs, but there are several caveats that one needs to keep in mind (and that possibly need to be improved in the future, e.g., by developing novel Bayesian models). The choice of the Bayesian setup, in particular the scale prescription and the priors, may have a substantial impact on the inference. In a certain sense, the Bayesian approach is not any less subjective or ad hoc than customary scale variation. Therefore care is needed not to introduce any biases by the model choice.
A significant advantage of Bayesian techniques over scale variation is that the CIs computed from Bayesian approach have a clear probabilistic interpretation (at least within the context of a given model). This opens the way to apply statistical techniques, e.g., to systematically combine MHO uncertainties with uncertainties from different sources or to include correlations between observables or bins for differential distributions. It would also be interesting to explore how far one can treat the choice of the model itself in a Bayesian way, in order to minimise bias from the model choice, e.g., using ideas from Bayesian Model Averaging (BMA) (see, e.g., ref. [139], and references therein). We leave these studies for future work.
The source code to the computer program MiHO that was used to obtain the results shown in the paper is made publicly available at https://github.com/aykhuss/miho.

A The position of the peak of the distribution
In this appendix we present a (heuristic) proof of the position of the peak of the distribution P(Σ|Σ n ). We work under the assumptions discussed in section 2.2, which we recall here for convenience: 1. The model (P, P 0 ) is symmetric.
2. We work with the prior in eq. (2.5) or the weighting function in eq. (2.12).
4. The peak is more and more pronounced as m increases, i.e., as more information on the progression of the series becomes known.

5.
There is a single FAC or PMS point in the interval [µ 0 /F, µ 0 F ].

A.1 The position of the peak for the scale-marginalisation prescription
Using the prior on the scale in eq. (2.5) and the approximation in eq. (2.8), eq. (2.9) takes the form: P sm (Σ|Σ n ) = L + L − dL P (Σ − Σ n (e L )|Σ n (e L )) P (Σ n (e L )) L + L − dL P (Σ n (e L )) . (A.1) In the previous equation we have changed variables from µ to L = log µ, and we defined L ± = log(µ 0 F ±1 ). In order to determine the position of the peak of P sm (Σ|Σ n ) as a function of Σ, we analyse where the bulk of the integral in the numerator receives its contributions. Bayes' formula implies: P (Σ n (µ)) = P (Σ 0 (µ)) n k=1 P (Σ (k) (µ)|Σ k−1 (µ)) .

(A.3)
Since P is symmetric, P (Σ−Σ n−1 (µ FAC )|Σ n (µ FAC )) reaches its peak when Σ−Σ n−1 (µ FAC ) = 0, i.e., the position of the peak is located at the FAC point at N n LO: We use the same notations and conventions as in the previous section. Using the weighting function in eq. (2.12) and the approximation in eq. (2.8), eq. (2.10) takes the form: P sa (Σ|Σ n ) = 1 2 log F L + L − dL P (Σ − Σ n (e L )|Σ n (e L )) . (A.5) We proceed in a similar way as in the previous section, and we want to understand where the bulk of the integral comes from. The probability in the integrand is peaked by assumption. To understand the impact on the position of the peak of P sa (Σ|Σ n ), let us in first approximation write the probability distribution in the integrand as a δ-function with support at Σ − Σ n (e L ) = 0: . (A.6) By assumption, the δ-function has support in the range [L − , L + ], and so there is L Σ = log µ Σ ∈ [L − , L + ] such that Σ − Σ n (µ Σ ) = 0 (note that µ Σ depends on Σ!). This gives: We see from this argument that P sa (Σ|Σ n ) reaches is maximum when Σ n (µ Σ ) ≈ 0, and so Σ peak n = Σ n (µ Σ ) = Σ n (µ PMS ) is at the PMS point at N n LO.

B Analytic solution of the abc-model
In this appendix we show the analytical solution of the abc model for ξ = 1. The general case ξ = 1 can be solved using similar techniques. Recall that for ξ = 1, after integration over c we have P abc (δ n ) = da db P 0 (a) |a| n(n+1)/2 η 2 n+2 (n + 2 + ) max η, max (1 + ω) η 2 n+3 (n + 2 + ) for η ≤ |b ∨ |, η(∆ (−) − ∆ (+) + 2η) + 2η − −n−1 n+ +1 , for η > |b ∨ | . (B.6) The result for ξ = 1 takes a similar functional form, but one has to consider additional cases, as shown in figure 2. In practice, we perform the integral over a numerical in our implementation of the abc model in the code MiHO. Nonetheless, it is instructive to discuss the analytical solution for a specific choice of ω. We note that the ∆ (±) depend on a and are piecewise power functions of a; ∆ (±) ∼ a −i for some power i. As such, the a integration domain is divided into disjoint intervals 0 < . . . < a k < a k+1 < . . . < 1 on which the maximum function can be evaluated (see ref. [51] for how to find the a k ) and the most complicated integral we encounter has the form with α, β > 0, l ∈ Z, i, j ∈ N. Restricting ourselves to integer ω, the most general integral that we need is of the form I pq (µ, ν; a k , a k+1 , α, β) ≡ a k+1 a k da a µ α a p + β a q ν (B.8) with F related to Gaussian hypergeometric function F (ρ, σ; A, B) ≡ B 1+ρ 1 + ρ 2 F 1 2 + ρ + σ, 1 + ρ; 2 + ρ;