1 Introduction

The final measurement of the anomalous magnetic moment of the muon at the Brookhaven National Laboratory (BNL) E821 experiment [1] differed from the Standard Model (SM) prediction by \(3.7\sigma \). This discrepancy was replicated by the Fermi National Accelerator Laboratory (FNAL) E989 measurement [2]. The combined measurement from BNL and FNAL,

$$\begin{aligned} a_{\mu } = 116\,592\,061\,(41) \times 10^{-11}, \end{aligned}$$
(1)

deviates from the SM theory prediction by \(4.2\sigma \), motivating the possibility of physics beyond the SM [3] as well as scrutiny of the SM prediction [4].

The SM prediction for \(a_{\mu }\) incorporates contributions from quantum electrodynamics (QED), electroweak interactions (EW), and hadronic effects [5]. While QED and EW contributions can be calculated with high precision using perturbation theory and are well-controlled [6], hadronic contributions are harder to compute and are the largest source of uncertainty in predictions for \(a_{\mu }\). The hadronic contribution can itself be decomposed into the following parts:

$$\begin{aligned} a_{\mu }^{\textrm{had}}&=a_\mu ^\textsc {hvp}+a_{\mu }^{\textrm{LbL}}, \end{aligned}$$
(2)

where \(a_\mu ^\textsc {hvp}\) is the hadronic vacuum polarization (HVP) contribution and \(a_{\mu }^{\textrm{LbL}}\) is the light-by-light scattering contribution. In this work we focus on the leading-order (LO) contribution to HVP. This contribution is particularly challenging to compute and is the dominant source of uncertainty, making up about \(80\%\) of the total. The two most popular computational methods are lattice QCD and estimates from dispersion integrals and cross-section data.

Lattice methods calculate the HVP contribution by discretizing spacetime and performing a weighted integral of current correlation functions over Euclidean time. This requires significant computational resources and currently cannot match the nominal precision achieved by data-driven methods. Data-driven methods use data from, for example, the KLOE [7], BaBar [8,9,10,11], SND [12] and CMD-3 [13, 14] experiments to estimate the R-ratio,

$$\begin{aligned} R(s) \equiv \frac{\sigma ^{0}(e^+e^- \rightarrow \text {hadrons})}{\sigma (e^+e^- \rightarrow \mu ^+\mu ^-)}. \end{aligned}$$
(3)

This is a function of the center-of-mass (CM) energy, \(\sqrt{s}\). From the estimate of the R-ratio, the LO HVP contribution can be computed through the dispersion integral,

$$\begin{aligned} a_\mu ^\textsc {hvp}= \frac{\alpha ^{2}}{3\pi ^{2}} \int \limits _{m_\pi ^2}^\infty \frac{K(s) R(s)}{s} \, \text {d}s \end{aligned}$$
(4)

where K(s) is the QED kernel [15, 16]. Hadronic contributions to the effective electromagnetic coupling constant at the Z boson mass can be computed in a similar way through the dispersion relationship

$$\begin{aligned} \Delta \alpha _\text {had}(M_Z^2) = \frac{\alpha M_Z^{2}}{3\pi } \, {\mathcal {P}} \int \limits _{m_\pi ^2}^\infty \frac{R(s)}{s \left[ M_Z^{2}- s\right] } \, \text {d}s, \end{aligned}$$
(5)

where \({\mathcal {P}}\) represents the principal-value prescription. There is growing tension between results from the two approaches [17, 18]. A recent lattice QCD calculation found [19]

$$\begin{aligned} a_\mu ^\textsc {hvp}= 707.5\, (5.5) \times 10^{-10} \end{aligned}$$
(6)

whereas a conservative combination of data-driven estimates yielded [4]

$$\begin{aligned} a_\mu ^\textsc {hvp}= 693.1 \, (4.0) \times 10^{-10}. \end{aligned}$$
(7)

The lattice result was at least partly corroborated by other recent lattice computations [20,21,22] and, moreover, data-driven estimates using hadronic \(\tau \)-decays are close to lattice results [23]. The most recent measurement of the \(2\pi \) final state at CMD-3 [24] compounded the mystery, as it conflicts with older measurements, including those from CMD-2 [25].

With these issues in mind, we wish to reconsider the statistical methodology for inferring the R-ratio from noisy data. As we shall discuss in Sect. 2, the existing approaches use carefully constructed but ad hoc techniques and closed-source software, and consider uncertainties in a frequentist framework. The data-driven approach, though, is connected to common problems in data-science and statistics: modeling an unknown function (here the R-ratio) and managing the risks of under- and over-fitting. In Sect. 3, we describe how we tackle these issues using Gaussian processes – flexible non-parametric statistical models – and marginalization of the model’s hyperparameters. This allows coherent uncertainty quantification and regularizes the wiggliness of the R-ratio, which helps prevent the model from over-fitting the noisy data. Our algorithm is implemented in our public kingpin package, documented in a separate paper [28]. We focus on modeling choices and on developing a tool for principled modeling of the R-ratio; our estimates supplement existing ones and we don’t attempt to match previous comprehensive estimates in all respects. We don’t anticipate dramatic differences with respect to previous findings; however, careful modeling of the R-ratio is important because an \({\mathcal {O}}(1\%)\) change in the HVP contribution, or a finding that its uncertainty was underestimated, could resolve the tension with the experimental measurements and lattice predictions. We present predictions from our model for \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) in Sect. 4. Finally, we conclude in Sect. 5.

2 Existing data-driven methods

We now briefly review two data-driven methods for calculating \(a_\mu ^\textsc {hvp}\). First, the DHMZ approach [29,30,31], which employs HVPTools, a private software package that combines and integrates cross-section data from \(e^{+}e^{-}\rightarrow \text {hadrons}\). For each experiment, second-order polynomial interpolation is used between adjacent measurements to discretize the results into small bins (of around \(1\,\text {MeV}\)) for later averaging and numerical integration. The HVP contributions are estimated in a frequentist framework. To ensure that uncertainties are propagated consistently, pseudo-experiments are generated and closure tests with known distributions are performed to validate the data combination and integration. If the results from different experiments are locally inconsistent, the uncertainty of the combination is readjusted according to the local \(\chi ^{2}\) value following the well-known PDG approach [32].
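For reference, the sketch below illustrates the textbook PDG error-inflation recipe cited above – an inverse-variance weighted average whose uncertainty is scaled by \(\sqrt{\chi ^2/(N-1)}\) when the local \(\chi ^2\) per degree of freedom exceeds one. It is only an illustration of the idea with our own function name, not the HVPTools implementation.

```python
# A minimal sketch (assumed names, not the HVPTools implementation) of the
# PDG-style recipe: an inverse-variance weighted average whose uncertainty is
# inflated by S = sqrt(chi^2 / (N - 1)) when the local chi^2 per degree of
# freedom exceeds one.
import numpy as np

def pdg_average(values, errors):
    """Weighted average of locally inconsistent measurements with a PDG scale factor."""
    values = np.asarray(values, dtype=float)
    errors = np.asarray(errors, dtype=float)
    w = 1.0 / errors**2
    mean = np.sum(w * values) / np.sum(w)
    err = np.sqrt(1.0 / np.sum(w))
    ndof = len(values) - 1
    chi2 = np.sum(w * (values - mean)**2)
    scale = np.sqrt(chi2 / ndof) if ndof > 0 and chi2 > ndof else 1.0
    return mean, err * scale
```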

The second method is the KNT approach [4, 33, 34], which performs a data-driven compilation of hadronic R-ratio data to calculate the HVP contribution. It first selects the data to be used and then bins the data using a clustering procedure to avoid over-fitting. The clustering procedure determines the optimal binning of data for all channels into a set of clusters based on the available local data density; the optimal clustering criteria are described in Ref. [4]. Since Ref. [33], the KNT compilation has used an iterated \(\chi ^{2}\) fit for the combination, in which the covariance matrix is re-initialized at each iteration to avoid bias. The fit results in the mean R-ratio for each cluster and a full covariance matrix containing all correlated and uncorrelated uncertainties. Combined with trapezoidal integration, these are used to determine channel-by-channel contributions to \(a_\mu ^\textsc {hvp}\).

The DHMZ and KNT approaches are both data-driven methods that estimate \(a_\mu ^\textsc {hvp}\) in a frequentist framework, using privately curated databases of measurements, and in-house custom codes and techniques to avoid over-fitting. The two methods differ not only in their compilation targets – the DHMZ approach combines and integrates cross-section data from \(e^{+}e^{-}\rightarrow \text {hadrons}\), while the KNT approach compiles the hadronic R-ratio – but also in their data handling, including data selection, data combination, and the propagation of uncertainties. Each method has its own strengths and limitations. While the DHMZ and KNT approaches do not differ significantly in the central value of \(a_\mu ^\textsc {hvp}\), there are significant disparities in the resulting uncertainties and in the shapes of the combined spectra.

3 Treed Gaussian process

3.1 Gaussian processes

In our data-driven approach, we model the unknown R-ratio with Gaussian processes (GPs; [35, 36]). A GP generalizes the Gaussian distribution. Roughly speaking, whereas a Gaussian describes the distribution of a scalar and a multivariate Gaussian describes the distribution of a vector, a GP describes the distribution of a function – an infinite collection of variables f(x) indexed by a location x. Any finite subset of these random variables follows a multivariate Gaussian distribution. The degree of correlation between f(x) and \(f(x^\prime )\) governs the smoothness of f(x) and is set by a choice of kernel function, \(k(x, x^\prime )\).

Just as a GP generalizes a Gaussian distribution of scalars or vectors to a distribution of functions, it allows us to generalize inference over unknown scalars or vectors to inference over unknown functions. Suppose we wish to learn an unknown function. Because a GP describes the distribution of a function, it can be used as the prior for the unknown function in a Bayesian setting. This prior distribution can be updated through Bayes’ rule by any noisy measurements or exact calculations of the values of f at particular locations x. In this paper we will update a GP for the R-ratio by the noisy measurements of the R-ratio. We use celerite2 [37] for ordinary GP computations.

The kernel function is usually stationary, that is, depends only on the Euclidean distance between locations,

$$\begin{aligned} k(x, x^\prime ) = k(|x - x^\prime |). \end{aligned}$$
(8)

Once a particular form of stationary kernel has been chosen, a GP can be controlled by three hyperparameters: a constant mean \(\mu \),

$$\begin{aligned} \textsf {E}\mathopen {}\mathclose {\left[f(x)\right]} = \mu , \end{aligned}$$
(9)

and a scale \(\sigma \) and length \(\ell \) that govern the covariance,

$$\begin{aligned} \textsf {Cov}\mathopen {}\mathclose {\left[f(x), f(x^\prime )\right]} = \sigma ^2 k\left( \frac{|x - x^\prime |}{\ell }\right) . \end{aligned}$$
(10)

The scale controls the size of wiggles in the function predicted by the GP. The length determines the length scale over which correlation decays and hence the number of wiggles in an interval. For Gaussian kernels, by Rice’s formula [38] the expected number of wiggles per unit distance scales as \(1 / \ell \). These three hyperparameters can substantially affect how well a GP models an unknown function. In a fully Bayesian framework, the hyperparameters are marginalized. This automatically weights choices of hyperparameter by how well they model the data and alleviates over-fitting. The wiggliness is regularized and the fit needn’t pass through every data point.
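To make the roles of \(\mu \), \(\sigma \) and \(\ell \) concrete, the following minimal sketch conditions a GP with fixed hyperparameters and a Matérn-3/2 kernel on noisy data. The function names are ours and the hyperparameter values are placeholders; in our analysis the hyperparameters are marginalized rather than fixed, with celerite2 handling the ordinary GP computations.

```python
# A minimal sketch (our own function names, NumPy only) of GP regression with a
# constant mean mu, scale sigma and length ell, using a Matern-3/2 kernel. The
# hyperparameter values below are placeholders, not fitted or marginalized.
import numpy as np

def matern32(r, ell):
    """Matern-3/2 correlation as a function of distance r and length ell."""
    z = np.sqrt(3.0) * r / ell
    return (1.0 + z) * np.exp(-z)

def gp_posterior(x, y, yerr, x_pred, mu=1.0, sigma=2.0, ell=0.1):
    """Posterior mean and covariance of f at x_pred given noisy data (x, y, yerr)."""
    K = sigma**2 * matern32(np.abs(x[:, None] - x[None, :]), ell)
    K[np.diag_indices_from(K)] += yerr**2                      # observation noise
    Ks = sigma**2 * matern32(np.abs(x_pred[:, None] - x[None, :]), ell)
    Kss = sigma**2 * matern32(np.abs(x_pred[:, None] - x_pred[None, :]), ell)
    alpha = np.linalg.solve(K, y - mu)
    mean = mu + Ks @ alpha
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov
```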

3.2 Treed Gaussian processes

The GPs described thus far are stationary – they model all regions of input space identically. To allow for non-stationary structure in the R-ratio, we use a treed-GP (TGP; [39, 40]). This is necessary as we know that the R-ratio contains narrow features such as resonances. In a TGP, the input space is partitioned using a binary tree. The predictions in each partition are governed by a different GP with independent hyperparameters. The number and locations of partitions are modeled using the so-called CGM prior [41].
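The sketch below conveys the basic idea of partitioned GP regression: split the input range at fixed edges and fit an independent GP in each partition (here using scikit-learn with a Matérn-3/2 kernel). It is only a caricature with illustrative names – in the actual TGP the number and locations of the edges are unknowns that are marginalized by RJ-MCMC, which also smooths the partition boundaries.

```python
# A caricature of the treed idea (illustrative names, scikit-learn): fit an
# independent Matern-3/2 GP inside each partition defined by fixed edges. The
# actual TGP treats the number and locations of the edges as unknowns and
# marginalizes them with RJ-MCMC.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

def piecewise_gp_predict(x, y, yerr, x_pred, edges):
    """Independent GP fit and prediction inside each partition defined by edges."""
    bounds = np.concatenate(([x.min()], np.asarray(edges), [x.max() + 1e-9]))
    mean, std = np.zeros_like(x_pred), np.zeros_like(x_pred)
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        train = (x >= lo) & (x < hi)
        test = (x_pred >= lo) & (x_pred < hi)
        if not train.any() or not test.any():
            continue
        kernel = ConstantKernel(1.0) * Matern(length_scale=0.1, nu=1.5)
        gp = GaussianProcessRegressor(kernel=kernel, alpha=yerr[train]**2,
                                      normalize_y=True)
        gp.fit(x[train].reshape(-1, 1), y[train])
        mean[test], std[test] = gp.predict(x_pred[test].reshape(-1, 1),
                                           return_std=True)
    return mean, std
```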

Fig. 1 The GP (left) predicts a wiggly fit to the straight-line sections due to the step. The TGP (right) automatically addresses the issue by partitioning the input space

The difference in predictions between a GP and our TGP is illustrated in Fig. 1. In this illustration we consider evenly spaced noisy measurements of a function that contains a step. The GP (left) models the data poorly: to accommodate the sudden step, the covariance between input locations must be weak, which results in a wiggly fit to the straight-line sections. The TGP (right) automatically partitions the input space, allowing it to model the distinct straight-line sections separately. At the jumps before and after the step, the TGP predicts the function with substantial uncertainty. This is satisfying since the data points change dramatically across those regions of input space and do not indicate what might happen inside them.

TGPs build on ideas such as CART [41], treed models generally [42] and partitioning [43], and are similar to piece-wise GPs [44] and a recent proposal in machine learning [45]. Alternative approaches to non-stationarity include non-stationary kernel functions [46], Deep GPs [47, 48] where non-stationarity is modeled through warping, and hierarchical models of GPs [49, 50]. There is valuable discussion and comparison of these approaches in Refs. [51, 52] and this remains an active area. As well as addressing non-stationarity in unknown functions, these approaches address heteroscedastic noise in our measurements.

Our approach is fully Bayesian – we marginalize the GP hyperparameters and tree structure. This decreases the risk of over- or under-fitting the noisy data and smooths the partitions between GPs. We perform marginalization numerically using reversible jump Markov Chain Monte Carlo (RJ-MCMC; [53]; for reviews see Refs. [54,55,56]). This is a generalization of MCMC that works on parameter spaces that don’t have a fixed dimension – this is vital because the number of GPs and thus the total number of hyperparameters isn’t fixed. Navigating the tree structure requires special RJ-MCMC proposals – such as growing, pruning and rotating the tree – that are described in Ref. [39].

3.3 Integration

The idea of modeling integrals through GPs was originally known as Bayes–Hermite quadrature [57], and was later discussed under the names of Bayesian Monte Carlo (BMC; [58]) and Bayesian quadrature or cubature [59, 60]; see Ref. [61] for a review. Suppose we wish to compute an integral of the form,

$$\begin{aligned} I = \int C(x) f(x) \, \text {d} x \end{aligned}$$
(11)

where f(x) is the estimated function and C(x) is a known function. BMC provides an epistemic meaning to errors in quadrature estimates of these integrals, such as

$$\begin{aligned} I \approx \sum _i C(x_i) f(x_i) \Delta x_i, \end{aligned}$$
(12)

because we may make inferences on I through our statistical model for f(x). In cases in which the function C(x) and choice of kernel lead to intractable computations, there is an additional discretization error in BMC inferences as the GP predictions are evaluated on a finitely-spaced grid. This is known as approximate Bayesian cubature [61]. This additional error may be neglected when the integrand is approximately linear between prediction points. We will use a TGP to model an integrand. Although trees have been proposed in BMC [62], they haven’t previously been directly combined with GPs in this way.

3.4 Sequential design

After completing inference of an unknown function with the data at hand, one may wish to know what data to collect next. This problem is known as sequential design or active learning. Broadly speaking, this is a challenging question, and greedy approaches that make optimal choices one step at a time are easier to implement. We thus consider a variant of active learning MacKay (ALM; [63, 64]).

Following the approach in Ref. [65], we consider the location that contributes most to the uncertainty in the \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) integrals to be an optimal location at which to perform more measurements. For an integral of the form of Eq. (11), we compute

$$\begin{aligned} x_\text {ALM} = \mathop {\mathrm {arg\,max}}\limits \nolimits _x \left[ \int C(x) \, \textsf {Cov}\mathopen {}\mathclose {\left[f(x), f(y)\right]} \, C(y) \, \text {d} y \right] . \end{aligned}$$
(13)

See Refs. [40, 66] for further discussion.
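On a discrete prediction grid, Eq. (13) reduces to picking the grid point whose (weighted) row of the covariance matrix contributes most to the variance of the integral. A hedged sketch, with \(c_i = C(x_i)\, w_i\) and illustrative names:

```python
# A hedged sketch of Eq. (13) on a discrete grid (illustrative names): with
# c_i = C(x_i) * w_i, pick the location whose weighted covariance row
# contributes most to the variance of the integral.
import numpy as np

def alm_location(x_grid, c, f_cov):
    """Grid point maximizing c_i * sum_j Cov[f(x_i), f(x_j)] * c_j."""
    contribution = c * (f_cov @ c)
    return x_grid[np.argmax(contribution)]
```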

4 Results

4.1 Data selection

We investigated the public dataset from the Particle Data Group (PDG; [32, 67]; see also Ref. [68] for further details), which primarily comprises data on the inclusive R-ratio, asymmetric statistical errors, and point-to-point systematic errors from electron-positron annihilation to hadrons at different CM energies. Certain CM energies may have multiple point-to-point systematic uncertainties stemming from different sources. We symmetrized errors and combined systematic (\(\tau \)) and statistical (\(\sigma \)) errors in quadrature:

$$\begin{aligned} \tau ^2&= \sum _{i} \tau _i^2 \quad \text {where}\quad \tau _i = \frac{1}{2} (\tau _{\text {up}} + \left| \tau _{\text {down}} \right| ),\end{aligned}$$
(14)
$$\begin{aligned} \sigma _\text {total}&= \sqrt{\sigma ^2 + \tau ^2} \quad \text {where}\quad \sigma = \frac{1}{2} (\sigma _\text {up} + \left| \sigma _\text {down}\right| ). \end{aligned}$$
(15)
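A concrete sketch of Eqs. (14) and (15), with illustrative array names (each row of the systematic arrays is one point-to-point source):

```python
# Sketch of Eqs. (14) and (15): symmetrize the asymmetric statistical and
# systematic errors and combine them in quadrature. Array names are
# illustrative; each row of syst_up/syst_down is one systematic source.
import numpy as np

def combined_error(stat_up, stat_down, syst_up, syst_down):
    sigma = 0.5 * (np.asarray(stat_up) + np.abs(stat_down))   # symmetrized statistical error
    tau_i = 0.5 * (np.asarray(syst_up) + np.abs(syst_down))   # symmetrized per-source systematics
    tau = np.sqrt(np.sum(tau_i**2, axis=0))                   # Eq. (14)
    return np.sqrt(sigma**2 + tau**2)                         # Eq. (15)
```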

We selected 859 data points inside the CM energy interval 0.3–\(1.937\,\text {GeV}\). This interval was selected to facilitate a comparison with Ref. [33]. The maximum \(\sqrt{s} = 1.937 \,\text {GeV}\) was chosen as it is the point at which summing exclusive R-ratio data becomes infeasible and perturbative QCD may be reliable. The minimum \(\sqrt{s}=0.3 \,\text {GeV}\) was the minimum energy in the public PDG dataset.

To model this data set, we utilized a treed Gaussian process (TGP), as described in Sect. 3. Besides selecting the data, we must specify the locations at which we want to predict the R-ratio. In our study, we predicted at every input location and at two uniformly spaced locations between every pair of consecutive input locations.
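A minimal sketch of this prediction grid, assuming the measured CM energies are held in an array (the function name is ours):

```python
# Sketch of the prediction grid described above (function name is ours): every
# measured CM energy plus two equally spaced points between each pair of
# consecutive measurements.
import numpy as np

def prediction_grid(sqrt_s):
    sqrt_s = np.unique(sqrt_s)                              # sorted measurement energies
    gaps = np.diff(sqrt_s)
    extra = [sqrt_s[:-1] + k * gaps / 3.0 for k in (1, 2)]  # two interior points per gap
    return np.unique(np.concatenate([sqrt_s, *extra]))
```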

4.2 Computational methods and modelling choices

We model the R-ratio by a TGP in which the input space is divided into partitions using a binary tree. Each partition in our TGP is governed by a mean, \(\mu \), and a Matérn-3/2 kernel with independent scale, \(\sigma \), and length, \(\ell \), hyperparameters. We use a uniform prior between 0 and 150 for the mean, a uniform prior between 0 and 500 for the scale, and a uniform prior between 0 and \(5\,\text {GeV}\) for the length. These choices were motivated by the maximum measured R-ratio and the CM interval 0.3–\(1.937\,\text {GeV}\) under consideration. Following Ref. [39], the structure of the tree itself is controlled by a CGM prior with hyperparameters \(\alpha = 0.5\) and \(\beta = 2\); see Ref. [41] for an explanation of these parameters. These choices favor smaller and more balanced trees.

We marginalize the tree structure and hyperparameters using RJ-MCMC. To improve computational efficiency, we thin the chains by a factor of four and only compute predictions for the states in the thinned chains. This reduces the computational time but only slightly reduces the effective sample size, as the states in the unthinned chain are strongly correlated. We run RJ-MCMC for 300,000 steps and discard the first 5000 steps as burn-in to minimize bias from the beginning of the chain. For computational efficiency and following a multistart heuristic, we run 10 chains in parallel and combine them.

Fig. 2 Predicted R-ratio from the TGP model. The experimental errors and uncertainty in the TGP predictions are scaled by 5 and the \(\rho \)–\(\omega \) and \(\phi \) resonances are plotted separately for visibility

4.3 Predictions

The predictions from our TGP model for the R-ratio are shown in Fig. 2 as a mean and an error band. The mean predictions pass smoothly through the data without the undue fluctuations near data points that are characteristic of over-fitting. The \(\rho \)–\(\omega \) and \(\phi \) resonances are typically fitted by their own tree partitions with separate hyperparameters. They aren’t forced to be as smooth as the rest of the spectrum and appear well-fitted. Our model predictions are noticeably more uncertain in regions with fewer or noisier measurements. We identify no anomalous features and tentatively conclude that the RJ-MCMC marginalization adequately converged. We ran standard MCMC diagnostics on the mean of the R-ratio using ArviZ [69], finding \(n_\text {eff} \simeq 600\) bulk effective samples and a Gelman–Rubin diagnostic of 1.01 [70]. There are typically five or six partitions, as the two peaks and the three flatter regions are modeled separately, as shown in Fig. 3. In Fig. 4 we show the result when using an ordinary GP. To accommodate the narrow peaks in the measured R-ratio, the GP model permits substantial wiggles between data points, especially where the data points are sparse.
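For concreteness, the convergence diagnostics quoted above can be computed along the following lines, assuming the per-chain RJ-MCMC draws of the mean R-ratio have been stacked into an array of shape (chains, draws); the file name is hypothetical.

```python
# Sketch of the convergence checks quoted above, assuming the per-chain draws
# have been stacked into an array of shape (n_chains, n_draws); the file name
# is hypothetical.
import arviz as az
import numpy as np

draws = np.load("r_ratio_mean_draws.npy")   # hypothetical file, shape (chains, draws)
print(az.ess(draws, method="bulk"))         # bulk effective sample size
print(az.rhat(draws))                       # Gelman-Rubin diagnostic
```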

The TGP model outputs the mean, \(\textsf {E}\mathopen {}\mathclose {\left[R_i\right]}\), and covariance, \(\textsf {Cov}\mathopen {}\mathclose {\left[R_i, R_j\right]}\), of the R-ratio at the prediction locations \(\sqrt{s}_i\). The mean function represents the expected or average output value for a given input value, while the covariance function represents the covariance between predictions at different CM energies. As the RJ-MCMC can be computationally expensive and time-consuming, we saved these results to disk and made them publicly available [71]. We used the mean and covariance predictions for R in combination with the dispersion integrals to predict contributions to \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) from the CM energy interval of 0.3–\(1.937\,\text {GeV}\). As \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) are linear functions of R, we propagate \(\textsf {E}\mathopen {}\mathclose {\left[R_i\right]}\) and \(\textsf {Cov}\mathopen {}\mathclose {\left[R_i, R_j\right]}\) to obtain predictions. In all subsequent integration processes, we employ the trapezoidal rule [72], and in subsequent formulae \(\sqrt{s}\) denotes the locations of our TGP predictions rather than the locations of the measurements.

Fig. 3 Histogram of locations of partition edges in the TGP model. The mean prediction for the R-ratio is shown for reference (blue)

The calculation of \(a_\mu ^\textsc {hvp}\) is based on Eq. (4). However, since the independent variable in this case is the CM energy, a simple change of variables in Eq. (4) is necessary,

$$\begin{aligned} a_\mu ^\textsc {hvp}= \frac{2\alpha ^{2}}{3\pi ^{2}} \int _{m_\pi }^\infty \frac{K(s) R(\sqrt{s})}{\sqrt{s} } \, \text {d}\sqrt{s}. \end{aligned}$$
(16)

Then we calculate the value of \(a_\mu ^\textsc {hvp}\) through numerical quadrature,

$$\begin{aligned} a_\mu ^\textsc {hvp}= \sum _{i} C^\textsc {hvp}_i R_i, \end{aligned}$$
(17)

where we defined

$$\begin{aligned} C^\textsc {hvp}_i \equiv \frac{2\alpha ^{2}}{3\pi ^{2}} \frac{K(\sqrt{s}_i)}{\sqrt{s}_i} w_i, \end{aligned}$$
(18)

where \(w_i\) are the quadrature weights. We use the trapezoid rule such that

$$\begin{aligned} w_i = \frac{1}{2}{\left\{ \begin{array}{ll} \sqrt{s}_{2} - \sqrt{s}_{1} &{} i = 1\\ \sqrt{s}_{N} - \sqrt{s}_{N-1} &{} i = N\\ \sqrt{s}_{i+1} - \sqrt{s}_{i-1} &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(19)

From Eq. (17) and by linearity, the mean can be found through

$$\begin{aligned} \textsf {E}\mathopen {}\mathclose {\left[a_\mu ^\textsc {hvp}\right]} = \sum _{i} C^\textsc {hvp}_i \textsf {E}\mathopen {}\mathclose {\left[R_i\right]}, \end{aligned}$$
(20)

and from the covariance matrix for predictions of the R-ratio, the uncertainty in our prediction of \(a_\mu ^\textsc {hvp}\) can be calculated using

$$\begin{aligned} \textsf {Var}\mathopen {}\mathclose {\left[a_\mu ^\textsc {hvp}\right]}= & {} \textsf {Var}\mathopen {}\mathclose {\left[\sum _{i} C^\textsc {hvp}_i R_i\right]}\nonumber \\= & {} \sum _{i, j=1}^{n} C^\textsc {hvp}_i\, C^\textsc {hvp}_j \,\textsf {Cov}\mathopen {}\mathclose {\left[R_i, R_j\right]}. \end{aligned}$$
(21)
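Putting Eqs. (17)–(21) together, a minimal sketch of the quadrature weights, coefficients and propagated uncertainty is given below. The QED kernel is not reproduced here; `K_vals` stands for K evaluated on the prediction grid, and the function names are ours.

```python
# A sketch of Eqs. (17)-(21) with illustrative names: trapezoidal weights on
# the prediction grid, the coefficients C_i of Eq. (18), and the propagated
# mean and variance of a_mu^HVP. K_vals stands for the QED kernel K evaluated
# on the grid, which we do not reproduce here.
import numpy as np

ALPHA_EM = 1.0 / 137.035999   # fine-structure constant

def trapezoid_weights(sqrt_s):
    """Weights w_i of Eq. (19) for a sorted grid of CM energies."""
    w = np.empty_like(sqrt_s)
    w[0] = sqrt_s[1] - sqrt_s[0]
    w[-1] = sqrt_s[-1] - sqrt_s[-2]
    w[1:-1] = sqrt_s[2:] - sqrt_s[:-2]
    return 0.5 * w

def a_mu_hvp(sqrt_s, R_mean, R_cov, K_vals):
    """Mean and variance of a_mu^HVP from TGP predictions of the R-ratio."""
    w = trapezoid_weights(sqrt_s)
    C = 2.0 * ALPHA_EM**2 / (3.0 * np.pi**2) * K_vals / sqrt_s * w   # Eq. (18)
    return C @ R_mean, C @ R_cov @ C                                 # Eqs. (20) and (21)
```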

We compute \(\Delta \alpha _\text {had}\) similarly using

$$\begin{aligned} \Delta \alpha _\text {had}= \sum _{i} C^\text {had}_i R_i. \end{aligned}$$
(22)

We use Eqs. (20) and (21) though with coefficients,

$$\begin{aligned} C^\text {had}_i = \frac{2\alpha M_Z^{2}}{3\pi } \frac{w_i}{\sqrt{s}_i\left[ M_Z^{2}- s_i\right] }. \end{aligned}$$
(23)

Because our calculation is performed at CM energies from 0.3 to \(1.937\,\text {GeV}\), far below \(M_Z\), the integrand contains no pole in this range and the principal-value prescription does not need to be considered.

Fig. 4 Similar to Fig. 2, though showing results from an ordinary GP

For the sake of comparison and to verify parts of our tool-chain, we calculate \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) naively without utilizing a TGP. We consider a naive model that, at the locations of the measurements of the R-ratio, predicts

$$\begin{aligned} R_i = {\hat{R}}_i \pm \sigma _i \end{aligned}$$
(24)

where \({\hat{R}}_i\) are the central values and \(\sigma _i\) are the errors of the measurements. In this naive model there is no covariance between predictions, that is, \(\textsf {Cov}\mathopen {}\mathclose {\left[R_i, R_j\right]} = 0\) for \(i \ne j\). Equations (20) and (21) apply to this simple case, although it should be noted that here the \(\sqrt{s}_i\) are the locations of the data points, whereas in the TGP they are the chosen prediction locations.

The results of the above calculations are summarized in Table 1. We show the predictions from KNT18 [33] and KNT19 [34] for comparison, which are found by summing data-based exclusive channels in their Tables 2 and 1, respectively, and combining errors in quadrature. We see that the TGP prediction for \(a_\mu ^\textsc {hvp}\) is smaller than the predictions from the naive model, KNT18 [33] and KNT19 [34]. This would worsen the tension between data-driven estimates and both lattice QCD and the experimental measurements. The uncertainties in our TGP predictions are nearly identical to those from the naive model – we explain this similarity in uncertainties in Appendix A – though substantially smaller than those from KNT18 [33] and KNT19 [34]. We don’t anticipate that the smaller TGP uncertainties are a consequence of the TGP model itself; rather, KNT18 [33] and KNT19 [34] are based on a different dataset and treatment of systematics. For example, they include uncertainties from vacuum polarization (VP) effects and final-state radiation (FSR) that we omit. We thus find no clear evidence of mismodelling or that our more careful modeling can shed light on the tension between data-driven estimates, lattice estimates and experiments. It is possible, however, that for a dataset identical to that used in KNT18 [33] and KNT19 [34], the TGP predictions could be greater than theirs – the impact of reducing over-fitting with a TGP could work in the opposite direction on that dataset.

Table 1 Contributions to \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) in the CM energy range of 0.3–\(1.937\,\text {GeV}\) from our TGP model, a naive model, KNT18 [33] and KNT19 [34]

4.4 Sequential design

We may use our TGP result to identify locations that contribute most to the uncertainty in the \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) predictions and where future measurements would be most beneficial. For both \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\), the ALM estimate from Eq. (13) yields

$$\begin{aligned} {\sqrt{s}}_\text {ALM} = 0.788 \,\text {GeV}. \end{aligned}$$
(25)

This lies near noisy measurements after the \(\rho \)–\(\omega \) resonance; see Fig. 2. The uncertainty at this location is also substantial because it sits at a boundary between partitions of the TGP – the behavior of the function changes abruptly there and so is hard to predict.

4.5 Correlation

Lastly, let us consider the relationship between the predictions for \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\). From Eqs. (4) and (5), the dispersion integrals used to calculate \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\) both involve the R-ratio, so we expect the two predictions to be correlated. To quantify the strength of this relationship, we computed their covariance and correlation coefficient,

$$\begin{aligned} \textsf {Cov}\mathopen {}\mathclose {\left[a_\mu ^\textsc {hvp},\Delta \alpha _\text {had}\right]}&= \sum _{i,j=1}^{n} C^\textsc {hvp}_i \, C^\text {had}_j \, \textsf {Cov}\mathopen {}\mathclose {\left[R_i, R_j\right]} \end{aligned}$$
(26)
$$\begin{aligned} \rho \mathopen {}\mathclose {\left[a_\mu ^\textsc {hvp},\Delta \alpha _\text {had}\right]}&= \frac{\textsf {Cov}\mathopen {}\mathclose {\left[a_\mu ^\textsc {hvp},\Delta \alpha _\text {had}\right]}}{\sqrt{\textsf {Var}\mathopen {}\mathclose {\left[a_\mu ^\textsc {hvp}\right]} \textsf {Var}\mathopen {}\mathclose {\left[\Delta \alpha _\text {had}\right]}}} \end{aligned}$$
(27)

where the (co)variances were computed under the TGP as described. As anticipated, we obtained a positive correlation between the two: when \(\Delta \alpha _\text {had}\) increases, \(a_\mu ^\textsc {hvp}\) also increases, and vice versa. The calculated correlation coefficient was \(\rho \simeq 0.8\), indicating a strong correlation between \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\).
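A sketch of Eqs. (26) and (27), reusing the coefficient vectors of Eqs. (18) and (23) and the TGP covariance matrix of the R-ratio (names illustrative):

```python
# Sketch of Eqs. (26) and (27) with illustrative names, reusing the coefficient
# vectors of Eqs. (18) and (23) and the TGP covariance matrix of the R-ratio.
import numpy as np

def integral_correlation(C_hvp, C_had, R_cov):
    cov = C_hvp @ R_cov @ C_had                                             # Eq. (26)
    rho = cov / np.sqrt((C_hvp @ R_cov @ C_hvp) * (C_had @ R_cov @ C_had))  # Eq. (27)
    return cov, rho
```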

5 Discussion and conclusions

The BNL and FNAL measurements of the anomalous magnetic moment of the muon disagree with the Standard Model (SM) prediction by more than \(4\sigma \). This has led to renewed scrutiny of new physics explanations and the SM prediction. With that as motivation, we extracted the hadronic vacuum polarization (HVP) contribution, \(a_\mu ^\textsc {hvp}\), from \(e^+e^-\) cross-section data using a treed Gaussian process (TGP) to model the unknown R-ratio as a function of CM energy. This is a principled and general method from data-science that allows complete uncertainty quantification and automatically balances over- and under-fitting to noisy data.

The challenges in the data-driven approach are common in data-science. A competitive estimate of \(a_\mu ^\textsc {hvp}\), however, requires domain-specific expertise, careful curation of measurements, and careful consideration of systematic errors and their correlations. This should be developed over time in collaboration with domain experts. Thus our work should be seen as preliminary; it serves to explore an alternative statistical methodology based on more general principles and to develop an associated toolchain. We used a dataset available from the PDG, though as noted as early as 2003 in Ref. [68], a more complete, documented and standardized database of measurements would allow further scrutiny of data-driven estimates of HVP.

Our analysis used \(n \approx 1000\) data points. The linear algebra operations in GP computations scale as \({\mathcal {O}}(n^3)\). There are computational approaches and approximations to overcome this scaling (see e.g., Refs. [73,74,75,76,77]); nevertheless, working with more complete datasets could be challenging. On the other hand, splitting data channel by channel could help the situation. For a competitive estimate, we would require careful treatment of correlated systematic uncertainties. The approach started here – carefully building an appropriate statistical model – naturally allows us to model systematic uncertainties, for example through nuisance parameters for scale uncertainties or sophisticated noise models for correlated noise (see e.g., Ref. [78]). The statistical model could also include a hierarchical model of systematic uncertainties accounting for “errors on errors.”

Fig. 5 A horizontal line fitted to noisy data (left) and a naive model that passes through every data point (right). Despite making quite different predictions for y, they make identical predictions for \(\sum y_i\)

The prediction for \(a_\mu ^\textsc {hvp}\) from our TGP model is slightly smaller than existing data-driven estimates. Thus, more principled modeling of the R-ratio in fact increases the tension between the SM prediction and measurements of \(g-2\). On the other hand, because the kernel functions were slowly varying, the TGP model predicted \(a_\mu ^\textsc {hvp}\) with a similar uncertainty to that obtained in the naive approach. This can be understood from the trade-off between variance and covariance in predictions of the R-ratio at different CM energies. Looking forward, by the ALM criterion, the best CM energy for future measurements was \(\sqrt{s} \simeq 0.788\,\text {GeV}\) for both \(a_\mu ^\textsc {hvp}\) and \(\Delta \alpha _\text {had}\), as it lies close to particularly noisy measurements of the R-ratio. In conclusion, we developed a statistical model for the R-ratio, based on general principles and publicly available toolchains. We found no indication that mismodeling the R-ratio could be responsible for the tension with measurements or lattice predictions. We hope, however, that this work serves as a starting point for further scrutiny, principled modeling and development of associated public tools.