Alternative to the application of PDG scale factors

The Particle Data Group recommends a set of procedures to be applied when discrepant data are to be combined. We introduce an alternative method based on a more general and solid statistical framework, providing a robust way to include possible unknown systematic effects interfering with experimental measurements or their theoretical interpretation. The limit of large data sets and practical cases of interest are discussed in detail.


Introduction
In any field of science, it is often the case that a number of data points or data sets need to be combined in order to achieve a greater overall precision. Now, data naturally fluctuate and it is not uncommon that one or several data points may appear discrepant or outlying with respect to the bulk of the data. This is not necessarily a concern, e.g., if the results of the individual measurements or observations are known to be dominated by the statistical uncertainty, or even in the presence of significant systematic effects, as long as their associated uncertainties can be reliably estimated. On the other hand, if the observed discrepancies are suspiciously large or plentiful, one may worry that some unknown systematic effect or unjustified but hidden assumption might have moved the central value of one or more observations. In that latter case, a more conservative handling of the data and its combination would be called for.
Of course, it is impossible to know independently which of the aforementioned situations-larger than expected random fluctuations, unknown systematic effect(s), or both-one is facing, or which of the individual data (sub)sets could be a e-mail: erler@fisica.unam.mx (corresponding author) b e-mail: rferrohernandez00@gmail.com at fault. As a remedy, the Particle Data Group (PDG) [1] proposed a set of rules according to which the uncertainty of an average is to be enlarged by a scale factor S, while the central values are to remain unchanged by fiat. Assuming Gaussian errors, in a first step the reduced χ 2 is computed as twice the log-likelihood of the minimum divided by N eff , where N eff is the effective number of degrees of freedom given by the number of observations (data points), N , minus the number of independent fit parameters. Thus, for the most common case of a simple average of one parameter, N eff = N − 1: 1. If the reduced χ 2 is smaller than unity, the results are accepted and there is no scaling of errors. 2. If the reduced χ 2 is larger than unity, and the experiments are of comparable precision, then all errors are re-scaled by a common factor S, given by the reduced χ 2 , i.e., S = χ 2 /N eff . 3. If some of the individual errors are much smaller than others, then S is computed from only the most precise experiments. The criterium for these is given with reference to an ad hoc cutoff value.
Given that the rationale for a procedure such as this one, is to err on the conservative side, one immediate objection is that if there is only one data point then no conservative scaling will be applied, even though in this case one is most exposed to a potential problem as there is no control measurement. Another problem is that the set of individual data points is not well-defined. In principle, one may combine certain data subsets first, such as from different data taking periods or different decay channels obtained by the same experimental apparatus, or combine identical channels obtained by different detectors and average these is a second step. Conversely, one could split up the available results into more but less precise individual entries. While this has no impact on ordinary maximum likelihood analyses, it will generally dilute or enlarge the reduced χ 2 value on which the S factors are based upon. In fact, applying PDG scale factors to data points of which some have already undergone the scale factor treatment (typically, by the experimental collaboration) then this kind of iteration does generally change the central value of the combination. Also note that the prescription according to which reduced χ 2 values greater and smaller than unity are being treated differently generates an unnecessary dichotomy.
In this paper we present an alternative which shares some of the features of the PDG recommendation while improving on others. The framework is a hierarchical model within Bayesian parameter inference [2]. The basic idea is that individual data points are not considered independently and identically distributed (iid), but rather independently and similarly distributed, in the sense that the parent distributions are permitted to vary to some extent to allow for unknown effects that may or may not be different from one data point (measurement) to another. Thus, we propose a hierarchical model where each measurement is assumed to determine a different parameter, each considered as having arisen as a random draw from a common parent distribution described in turn in terms of hyper-parameters.
A similar approach is widely used in the biological sciences when estimating treatment effects by combining several studies performed under similar but not identical conditions [3,4], in what is often referred to as meta-analysis [5][6][7]. In these cases the experimental conditions can vary slightly, so that the individual studies may be affected by different unknown biases.
Several authors within the physics community introduced attempts to incorporate the effects of unknown error sources when combining data. For example, Ref. [8] finds results similar to the ones in our work, but within a frequentist approach. Ref. [9] models the probability of underestimating the experimental error by including a different scale factor for each measurement, which is in turn randomly drawn from a prior distribution. Very recently it was shown [10] that it is even possible to test the shape of the prior distribution, and not just to constrain the values of its parameters. We leave this kind of more complete analysis for the future.
In the next section we summarize the formalism of Bayesian hierarchical modeling using the notation of Ref. [2]. The rest of the paper introduces our approach, illustrated by a number of examples and reference cases.

The non-hierarchical model
Suppose that we want to determine a parameter θ from an experimental measurement or observation, and to be specific, that the likelihood for the outcome y of such an experiment Fig. 1 Ordinary averaging. We assume that the y i are random outcomes of measurements of the same parameter θ can be described as a Gaussian with central value θ and standard deviation σ , where, The posterior distribution for the parameter θ can be obtained through Bayes' theorem, where p(θ ) is the prior probability distribution of θ . It is very convenient to assume p(θ ) to be a conjugate prior, which means that the posterior distribution will fall within the same family of functions as the prior. Thus, in our case we adopt the prior, yielding the posterior, where, is the sum of precisions of the prior and the experimental result, while is the precision averaged central value. Clearly, if the experiment has a small error, σ τ , it will dominate θτ . In the limitτ → ∞, the prior is called non-informative. Now, let us include further such experiments with central values y i and total errors σ i , all measuring the same quantity θ , as illustrated in Fig. 1. For simplicity, we assume that the σ i are mutually uncorrelated. The posterior distribution p(θ |y i , σ i ,μ,τ ) is again given by Eq. (5), but now with and Obviously, the uncertainty στ in θ decreases strictly monotonically with the inclusion of more experiments. Nevertheless, if one or several of the experiments was subject to a number of systematic effects that was neither corrected for, nor accounted for in the individual uncertainties σ i , then the experiments are (effectively) not measuring the same quantity, and στ would be underestimated. In other words, each experiment can be viewed as measuring different parameters θ i , which are, however, not entirely independent of each other, since after all, the experiments were supposed to constrain the same θ . We will now review hierarchical Bayesian modeling, and propose it as a systematic method to interpolate between the extreme and rarely realistic cases of all θ i being either equal or else entirely independent of each other.

The hierarchical model
This is achieved by considering each θ i to be the result of a random draw from a parent distribution, where p(μ, τ ) is the hyper-prior distribution for what are now called the hyper-parameters μ and τ . We sketch this model in Fig. 2. Note that Eq. (10) implies the property of exchangeability between the θ i , i.e. symmetry under θ i ↔ θ j . From Bayes' theorem one has, and explicitly in the Gaussian case, Marginalizing over θ i one finds the "master" equation, We will use it to compute the posterior distribution of the hyper-parameters, once a hyper-prior is chosen. For example, assuming a flat prior for μ and τ , we can integrate over μ to find, where, The parameter τ quantifies general differences in the θ i . If τ = 0, the experiments measure the same parameter, i.e., θ i = θ j . For τ → ∞, each one measures a completely independent parameter θ i . From the master equation one can see that the parameter of interest is μ. If τ = 0 the posterior distribution for μ reduces to the ordinary likelihood for parameter estimation given in Eq. (5) withτ → ∞. The full posterior distribution for μ can be obtained integrating Eq. (13) numerically over τ . If there are large unknown systematic effects, then the most likely values of τ will differ from zero, which leads to the important result of increasing the error in μ.

The hyper-prior
We propose a hyper-prior which is μ-independent, i.e., p(μ, τ ) = p(τ ), and that interpolates smoothly between a flat and a sharply peaked τ distribution, This form will prove to be useful due to the simple interpretation of α in terms of the number of degrees of freedom, and the possibility to obtain closed analytical formulas for the posterior distribution of μ. We remark that in Bayesian methods one needs to specify a prior that cannot be determined from first principles. Here we have chosen a prior with a simple analytical form interpolating between a flat prior and τ = 0. Very interestingly, while this prior is only one of many possible choices, it turns out that it coincides with Jeffrey's prior in a certain limit. We will return to this at the end of Sect. 6. It is interesting to study the effect of this kind of prior on the tails of the posterior density of μ. Integrating Eq. (13) over τ produces the posterior density of μ given the data, For large μ, the exponential suppression factor favors large values of τ , so that, and after a change of variables We observe that the usual exponential suppression of μ in the tails has turned into a milder power law suppression which increases with the effective number of degrees of freedom, i.e., in our case the number or measurements, ν ≡ N +α −2.

Experiments with errors of the same size
When all errors are equal, σ i = σ j ≡ σ , we obtain an analytical formula which illustrates how the PDG scale factor re-emerges for large data sets. The master equation reads in Fig. 3 Scale factor versus the square root of the reduced χ 2 . We employed α = 0 this case, or simply, where we defined, which is the usual χ 2 function. Changing variables, we obtain, which is the master formula in this case in terms of the cumulative distribution function F for a χ 2 distribution with ν degrees of freedom. This equation implies an interesting result. Since p(μ|y i ) depends on μ only through χ 2 (μ), we have dp(μ|y i ) dμ = dp(μ|y i ) dχ 2 so that the mode of the distribution is the same as in the usual case, i.e., at the value of μ where χ (μ) 2 = 0. Thus, For σ i = σ j the posterior distributions of the hierarchical and non-hierarchical models peak at the same location.
From Eq. (23), we can also obtain the scale factor, which we define here as the ratio of the sizes of the 68% highest confidence intervals of the hierarchical and non-hierarchical models. In Figs. 3 and 4, we show the scale factor for several values of α and N , from which one can see the similarity to the PDG scale factor for large N . We now turn to the case of a large number of degrees of freedom and the Gaussian approximation.

Large number of degrees of freedom
We rewrite Eq. (23) by another change of variables, so that where we defined χ 2 ν−1 ≡ χ 2 /(ν − 2). Thus, large values of ν suppress the integrand exponentially. Depending on the value r 0 = (χ 2 ν−1 ) −1 where r χ 2 ν−1 − ln r has a minimum, we have two cases: (1) For r 0 > 1 the minimum falls outside the integration limits, and the integral can be approximated by considering values of r near 1, which gives We recognize this is the usual likelihood for parameter inference without scaling. Thus, for σ i ≈ σ j , ν → ∞ and χ 2 ν−1 (μ 0 ) < 1, the hierarchical model implies no scaling of the errors.
(2) For r 0 < 1 the minimum resides inside the integration region, and the integral can be approximated by considering values of r near r 0 . After some algebra, which is proportional to the Student-t distribution for ν − 1 degrees of freedom, and for very large ν it can be further approximated by a Gaussian, This yields another important result, for σ i ≈ σ j , ν → ∞ and χ 2 ν−1 (μ 0 ) > 1, the hierarchical model implies a re-scaling of the overall error by σ → σ χ 2 ν (μ 0 ). It is amusing to note that for large ν we recovered the PDG scale factor prescription. On the other hand, for low values of ν our model implies larger scalings than recommended by the PDG. In the next subsection we approximate the distribution of μ as a Gaussian, so as to obtain an analytical formula for the scale factor in terms of ν and the value of χ 2 .

Gaussian approximation
To do so, we expand the logarithm of the posterior distribution p = p(μ|y i ) in powers of μ around μ 0 , The second term on the right hand side is zero because we are expanding around the maximum. The third term can be compared to the corresponding term of the expansion of a Gaussian distribution, which gives Using Eq. (23) we have, where γ is the incomplete Gamma function, defined by As we mentioned before, the scale factor S Bayes is defined as the ratio of the sizes of the 68% highest confidence intervals of the hierarchical and non-hierarchical models. In the Gaussian approximation we find, where we have used the power series expansion of the incomplete Gamma function, In Fig. 5 we compare the approximate formula with the exact result. As expected, the approximation improves for larger values of ν. We are now ready to discuss the general case of unequal errors, σ i = σ j .

Experiments with unequal precisions
To understand this case, we fix the value of τ in Eq. (13). The distribution of μ is then Gaussian, with total error, and central value, Thus, experiments with smaller errors are more sensitive to τ than less precise ones. Suppose that M of the experiments have an error σ M , and that σ M is much smaller than the error σ of the rest of the experiments. Then, for σ M τ σ the scaling will mainly affect the experiments with small errors. Since we were unable to find an analytical formula for the peak or mean of τ , we proceed with a numerical analysis.
As a first example, we randomly generated eleven fictitious measurement points from a Gaussian with standard deviation σ = 1 centered at the value of 10. The last point is from a Gaussian centered at 10 + 5/ √ 10 with σ M = 1/ √ 10, which is chosen so that its precision is the same as the combined precision of the other ten. The results are shown in Fig. 6. The red point denotes the ordinary weighted average with PDG scaling applied, and is pulled away from the horizontal line as a result of the deviating 11th measurement. The black point, on the other hand, is the average obtained as the result of our Bayesian hierarchical model (here we use α = 10 to specify our prior). It is closer to the bulk of Fig. 7 The measurement points with small error are shown in blue, the usual averages with the PDG scaling in red, and the hierarchical averages in black. The labels at the horizontal axis show by how many σ M the blue points deviate from the gray point. The gray band represents the ordinary weighted averages of the bulk of measurements in Fig. 6 data than to the measurement with the smaller error. This is a reasonable property, since it is less likely that all the measurements in the bulk had a systematic error in the same direction.
In Fig. 7 we show how the two kind of averages change when we move the central value of the 11th measurement (in blue) while leaving the other 10 unchanged. Just for orientation, the gray band represents the ordinary average (nonhierarchical) of the bulk of measurements with the same error. As in Fig. 6, the red points are the usual PDG-scaled averages, while the black points are the hierarchical averages. Clearly, as we approach the bulk the combined error shrinks.

Neutron lifetime
There is an interesting discrepancy between the two types of experiments measuring the lifetime of the neutron. For a state of the art review of both types and more details, see Ref. [11]. The first type are beam experiments [12][13][14], which measure the number of protons or electrons from decays of cold neutrons in a beam passing through a magnetic or electric trap. After the beam has passed the trap, some of the neutrons are deposited in a foil at the end of the beam path. The neutron lifetime is proportional to the rate of neutrons deposited and inversely proportional to the rate of decays detected.
The other type of experiment uses bottles [15][16][17][18][19][20][21] containing ultra-cold neutrons with a kinetic energy of less than 100 neV. Neutrons with such a low kinetic energy can be confined due to the effective Fermi potential between neutrons and atomic nuclei in many materials. Gravitational forces and magnetic fields can also be used to confine the neutrons within the container. The idea is simply to count the num- −0.63 s, than the PDG method τ n = 879.35 ± 0.64 s where S P DG = 1.56. This is due to the bulk of the bottle experiments that prefer lifetimes longer than 880 s. It is important to recall that the tails of the Bayesian hierarchical model do not fall as fast as a Gaussian, so that there is still a non-negligible probability for τ n to be lower.

Relations to other models
While this paper was being written, two interesting papers related to our work appeared. The first one [22] discusses the kaon mass in the context of a skeptical combination of experiments, scaling each experimental error independently but correlated. The second one [23] studies the discrepancy that arises when the PDG scaling is applied to sub-sets of experiments and then to the combination of the sets, vs. (for example) applying it to the whole data at the same time. The conclusion is that the χ 2 /ν prescription used to enlarge the standard deviation does not hold sufficiency. This means that the scaling is not sufficient to properly describe the full probability distribution. Our model would have had the same problem had we used the marginalized (over τ 2 ) distribution of μ. This is because the "correlations" that emerge through τ 2 would be absent. But it is clear from Eq. (13) that if we use the posterior distribution of μ and τ 2 of a subset of experiments as the prior for the remaining subset, then the updated posterior will be the same as combining the whole data set simultaneously.
Another interesting point made in Ref. [23] is the fact that the PDG scaling treats any value of N equally, while for fixed χ 2 /N the p value decreases with N . In other words, since the probability distribution of the reduced χ 2 function peaks around one as the number of degrees of freedom increases, the scaling (given a discrepant value of the reduced χ 2 ) should be larger when more experiments are included in the average. This is not the case for the PDG description, because the scaling only depends on the reduced χ 2 value and not on the number of degrees of freedom. Now, it is clear from Fig. 3 that in the Hierarchical Model with α chosen close to zero this problem would be aggravated, i.e., for any given value of the reduced χ 2 , there is more scaling for low N . However, we can use the freedom to choose a value of α to improve on this issue. First we demand the variance of the τ distribution to be finite, which corresponds to α > 6. In Fig. 9 we show the scaling versus the reduced χ 2 with α = 6 + (where is an infinitesimal) from which one can see that for large values of the reduced χ 2 the scaling reduces as N gets smaller. This is just the desired effect. On the other hand, we still have more scaling for small values of the reduced χ 2 . This is a natural consequence of the fact that for a low number of experiments τ can not be constrained too strongly, which translates into an enlarged error for μ.
One can also consider Jeffrey's prior. E.g., if we specify to the case of uncertainties of equal magnitude, σ i = σ j = σ , In the case of a distribution with several parameters (in our case μ and τ 2 ), Jeffrey's prior is defined as the square root of the determinant of Fisher's information matrix, which in turn is defined as the average (over y i ) of the Hessian of the log-likelihood N y i |μ, τ 2 + σ 2 i . then Jeffrey's prior reduces precisely to Eq. (16) with α = 3. This would lead to a plot very similar to the one shown in Fig. 9.

Conclusions and outlook
We proposed a Bayesian hierarchical model as a strategy to compute averages of several uncorrelated experimental measurements, specifically with the possibility in mind that unaccounted for systematic effects might be present, leading to underestimates of the quoted uncertainties. We should stress that the point is not that (some part of) the systematic error has been underestimated or assessed too aggressively. If this is suspected then a strategy should be developed to increase the systematic error component(s), which would imply-among other things-that statistics limited measurements would not be questioned. Here, we rather addressed the generic situation in which unknown effects or human errors may be present, and which therefore could affect even ostensibly clean determinations.
We have shown that our methodology resembles the recommendation of the Particle Data Group whenever the number of degrees of freedom (data points) is large. Our approach connects smoothly to cases with fewer degrees of freedom, though. Another important advantage is that it makes the underlying assumptions in the averaging process transparent. E.g., a large value of the parameter α appearing in our proposed form of the prior, implies a strong believe that the experiments do not have an unknown systematic error, while a small value corresponds to a more agnostic point of view. Our method can be extended to experiments with correlated errors, but we leave this generalization for the future.
Due to the additive form, σ 2 i + τ 2 , of the denominator in the exponential part of the distribution, our model has the drawback that it tends to penalize experiments with high precision more strongly. This relative issue is already seen in the τ n example, where the most recent beam measurement which has a larger error than most bottle experiments and a higher central value tends to push the combined value up. On the other hand, the natural power suppressed tails of the posterior distribution help to mitigate possible strong shifts in the central value.
We also would like to point out that to apply our method to the PDG, it has to be studied, discussed and compared with other approaches in more detail, to confirm that it can be used within the PDG framework.
In closing, we remark that we also envision an application of this model in the context of new physics searches within the Standard Model Effective Field Theory (SMEFT) framework [24,25], in which thousands of a priori independent operator (Wilson) coefficients need to be determined. Yet, many of these operators are almost certainly generated at some com-mon energy scale, and are consequently not entirely independent. Thus, the idea is to assume that (classes of) the Wilson coefficients are random samples generated at a common ultra-violet energy scale, lending itself to a hierarchical approach. This can be particularly useful when estimating the sensitivity of a hypothetical future experiment to physics beyond the Standard Model. This is another direction for future work.