Modeling the R -ratio and hadronic contributions to g − 2 with a Treed Gaussian process

The BNL and FNAL measurements of the anomalous magnetic moment of the muon disagree with the Standard Model (SM) prediction by more than 4σ. The hadronic vacuum polarization (HVP) contributions are the dominant source of uncertainty in the SM prediction. There are, however, tensions between different estimates of the HVP contributions, including data-driven estimates based on measurements of the R-ratio. To investigate that tension, we modeled the unknown R-ratio as a function of CM energy with a treed Gaussian process (TGP). This is a principled and general method grounded in data science that allows complete uncertainty quantification and automatically balances over- and under-fitting to noisy data. Our tool yields exploratory results that are similar to previous ones, and we find no indication that the R-ratio was previously mismodeled. Whilst we advance some aspects of modeling the R-ratio and develop new tools for doing so, a competitive estimate of the HVP contributions requires domain-specific expertise and a carefully curated database of measurements (github, https://github.com/qiao688/TGP_for_g-2).


Introduction
The final measurement of the anomalous magnetic moment of the muon at the Brookhaven National Laboratory (BNL) E821 experiment [1] differed from the Standard Model (SM) prediction by 3.7σ. This discrepancy was replicated by the Fermi National Accelerator Laboratory (FNAL) E989 measurement [2]. The combined measurement from BNL and FNAL, a_µ = 116 592 061 (41) × 10⁻¹¹, deviates from the SM theory prediction by 4.2σ, motivating the possibility of physics beyond the SM [3] as well as scrutiny of the SM prediction [4].
The SM prediction for a_µ incorporates contributions from quantum electrodynamics (QED), electroweak interactions (EW), and hadronic effects [5]. While QED and EW contributions can be calculated with high precision using perturbation theory and are well-controlled [6], hadronic contributions are harder to compute and are the largest source of uncertainty in predictions for a_µ. The hadronic contribution can itself be decomposed into the following parts,

a_µ^had = a_µ^HVP + a_µ^LbL,

where a_µ^HVP is the hadronic vacuum polarization (HVP) contribution and a_µ^LbL is the light-by-light scattering contribution. In this work we focus on the leading-order (LO) contribution to HVP. This is particularly challenging and the dominant source of uncertainty, making up about 80% of the total. The two most popular computational methods are lattice QCD and estimates from dispersion integrals and cross-section data.
The data-driven approach is based on the R-ratio, the ratio of the hadronic cross section to the muon-pair cross section in e⁺e⁻ collisions,

R(s) = σ(e⁺e⁻ → hadrons) / σ(e⁺e⁻ → µ⁺µ⁻).

This is a function of the center-of-mass (CM) energy, √s.¹ From the estimate of the R-ratio, the leading-order (LO) HVP contributions can be computed through the dispersion integral,

a_µ^HVP = (α² / 3π²) ∫ ds K(s) R(s) / s,   (4)

where K(s) is the QED kernel [15,16]. Hadronic contributions to the effective electromagnetic coupling constant at the Z boson mass can be computed in a similar way through the dispersion relationship

Δα_had(M_Z²) = −(α M_Z² / 3π) P ∫ ds R(s) / (s (s − M_Z²)),   (5)

where P represents the principal-value prescription. There is growing tension between results from the two approaches [17,18]. A recent lattice QCD calculation found [19]

a_µ^HVP = 707.5 (5.5) × 10⁻¹⁰,   (6)

whereas a conservative combination of data-driven estimates yielded [4]

a_µ^HVP = 693.1 (4.0) × 10⁻¹⁰.
The lattice result was at least partly corroborated by other recent lattice computations [20-22] and, moreover, data-driven estimates using hadronic τ decays are close to lattice results [23]. The most recent measurement of the 2π final state at CMD-3 [24] compounded the mystery, as it conflicts with older measurements, including CMD-2 [25]. With these issues in mind, we wish to reconsider the statistical methodology for inferring the R-ratio from noisy data. As we shall discuss in section 2, the existing approaches use carefully constructed but ad hoc techniques and closed-source software, and consider uncertainties in a frequentist framework. The data-driven approach, though, is connected to common problems in data science and statistics: modeling an unknown function (here the R-ratio) and managing the risks of under- and over-fitting. In section 3, we describe how we tackle these issues using Gaussian processes (flexible non-parametric statistical models) and marginalization of the model's hyperparameters.² This allows coherent uncertainty quantification and regularizes the wiggliness of the R-ratio, which helps prevent the model from over-fitting the noisy data. Our algorithm is implemented in our public kingpin package, documented in a separate paper [28]. We focus on modeling choices and developing a tool for principled modeling of the R-ratio; our estimates are supplementary to existing ones and we don't attempt to match previous comprehensive estimates in all respects. We don't anticipate dramatic differences with respect to previous findings; however, careful modeling of the R-ratio is important because O(1%) changes in the HVP contribution, or finding that the uncertainty was underestimated, could resolve the tension with the experimental measurements and lattice predictions. We present predictions from our model for a_µ^HVP and Δα_had in section 4. Finally, we conclude in section 5.

Existing data-driven methods
We now briefly review two data-driven methods for calculating a_µ^HVP. First, the DHMZ approach [29-31], which employs HVPTools, a private software package that combines and integrates cross-section data from e⁺e⁻ → hadrons. For each experiment, second-order polynomial interpolation is used between adjacent measurements to discretize the results into small bins (of around 1 MeV) for later averaging and numerical integration. The HVP contributions are estimated in a frequentist framework. To ensure that uncertainties are propagated consistently, pseudo-experiments are generated and closure tests with known distributions are performed to validate the data combination and integration. If the results from different experiments are locally inconsistent, the uncertainty of the combination is readjusted according to the local χ² value, following the well-known PDG approach [32].
The second method is the KNT approach [4,33,34], which performs a data-driven compilation of hadronic R-ratio data to calculate the HVP contribution. It first selects the data to be used and then bins the data using a clustering procedure to avoid overfitting. The clustering procedure determines the optimal binning of data for all channels into a set of clusters based on the available local data density. The optimal clustering criteria are given in Ref. [4]. As of Ref. [33], the KNT compilation uses an iterated χ² fit for the actual combination; the covariance matrix is re-initialized at each iteration, which is motivated by avoiding bias. The fit results in the mean R-ratio for each cluster and a full covariance matrix containing all correlated and uncorrelated uncertainties. Combined with trapezoidal integration, these are used to determine channel-by-channel contributions to a_µ^HVP. The DHMZ and KNT approaches are both data-driven methods that estimate a_µ^HVP in a frequentist framework, using privately curated databases of measurements and in-house custom codes and techniques to avoid over-fitting. The two methods differ in their compilation targets (DHMZ combines and integrates cross-section data from e⁺e⁻ → hadrons, whereas KNT compiles the hadronic R-ratio) as well as in their data handling procedures, including data selection, data combination, and the propagation of uncertainties. Each method has its own strengths and limitations. While the DHMZ and KNT approaches do not differ significantly in the central value of a_µ^HVP, there are significant disparities in the resulting uncertainties and the shapes of the combined spectra.

Gaussian processes
In our data-driven approach, we model the unknown R-ratio with Gaussian processes (GPs; [35,36]). A GP generalizes the Gaussian distribution. Roughly speaking, whereas a Gaussian describes the distribution of a scalar and a multivariate Gaussian describes the distribution of a vector, a GP describes the distribution of a function: an infinite collection of random variables f(x) indexed by a location x. Any finite subset of the random variables is jointly distributed as a multivariate Gaussian. The degree of correlation between f(x) and f(x′) governs the smoothness of f(x) and is set by a choice of kernel function, k(x, x′).
Just as a GP generalizes a Gaussian distribution of scalars or vectors to a distribution of functions, it allows us to generalize inference over unknown scalars or vectors to inference over unknown functions. Suppose we wish to learn an unknown function. Because a GP describes the distribution of a function, it can be used as the prior for the unknown function in a Bayesian setting. This prior distribution can be updated through Bayes' rule by any noisy measurements or exact calculations of the values of f at particular locations x. In this paper we update a GP for the R-ratio by the noisy measurements of the R-ratio. We use celerite2 [37] for ordinary GP computations.
The kernel function is usually stationary, that is, it depends only on the Euclidean distance between locations,

k(x, x′) = k(|x − x′|).

Once a particular form of stationary kernel has been chosen, a GP can be controlled by three hyperparameters: a constant mean µ, and a scale σ and length ℓ that govern the covariance,

k(x, x′) = σ² r(|x − x′| / ℓ),

where r is a correlation function with r(0) = 1. The scale controls the size of wiggles in the function predicted by the GP. The length determines the length scale over which correlation decays and hence the number of wiggles in an interval. For Gaussian kernels, by Rice's formula [38] the expected number of wiggles per unit distance scales as 1/ℓ. These three hyperparameters can substantially affect how well a GP models an unknown function. In a fully Bayesian framework, the hyperparameters are marginalized. This automatically weights choices of hyperparameter by how well they model the data and alleviates overfitting. The wiggliness is regularized and the fit needn't pass through every data point.
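As a minimal sketch of the ideas above, the snippet below conditions a GP with a Matérn-3/2 kernel on noisy data and returns the posterior mean and covariance. The hyperparameter values and toy data are illustrative choices, not the paper's settings, and plain numpy is used rather than the celerite2/kingpin pipeline.

```python
import numpy as np

def matern32(x1, x2, sigma=1.0, ell=0.2):
    """Matern-3/2 covariance between location arrays x1 and x2."""
    r = np.abs(x1[:, None] - x2[None, :])
    z = np.sqrt(3.0) * r / ell
    return sigma**2 * (1.0 + z) * np.exp(-z)

def gp_posterior(x_train, y_train, x_pred, noise=0.1, mu=0.0, sigma=1.0, ell=0.2):
    """Posterior mean and covariance of a GP with constant mean mu."""
    K = matern32(x_train, x_train, sigma, ell) + noise**2 * np.eye(len(x_train))
    Ks = matern32(x_pred, x_train, sigma, ell)
    Kss = matern32(x_pred, x_pred, sigma, ell)
    alpha = np.linalg.solve(K, y_train - mu)
    mean = mu + Ks @ alpha                       # posterior mean at x_pred
    cov = Kss - Ks @ np.linalg.solve(K, Ks.T)    # posterior covariance at x_pred
    return mean, cov

# Toy data: noisy samples of a smooth function.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
x_star = np.linspace(0.0, 1.0, 50)
mean, cov = gp_posterior(x, y, x_star)
```

Because the noise term enters the kernel matrix, the posterior mean is regularized and need not pass through every data point, exactly as described above.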

Treed Gaussian processes
The GPs described thus far are stationary: they model all regions of input space identically. To allow for non-stationary structure in the R-ratio, we use a treed GP (TGP; [39,40]). This is necessary as we know that the R-ratio contains narrow features such as resonances. In a TGP, the input space is partitioned using a binary tree. The predictions in each partition are governed by a different GP with independent hyperparameters. The number and locations of partitions are modeled using the so-called CGM prior [41].
The difference in predictions between a GP and our TGP is illustrated in fig. 1. In this illustration we consider evenly spaced noisy measurements of a function that contains a step. The GP (left) models the data poorly: to accommodate the sudden step, the covariance between input locations must be weak, which results in a wiggly fit to the straight-line sections. The TGP (right) automatically partitions the input space, allowing each section to be modeled by its own GP. TGPs build on ideas such as CART [41], treed models generally [42] and partitioning [43], and are similar to piece-wise GPs [44] and a recent proposal in machine learning [45]. Alternative approaches to non-stationarity include non-stationary kernel functions [46], deep GPs [47,48], where non-stationarity is modeled through warping, and hierarchical models of GPs [49,50]. There is valuable discussion and comparison of these approaches in Refs. [51,52] and this remains an active area. As well as addressing non-stationarity in unknown functions, these approaches can address heteroscedastic noise in measurements.
Our approach is fully Bayesian: we marginalize the GP hyperparameters and tree structure. This decreases the risk of over- or under-fitting the noisy data and smooths the partitions between GPs. We perform marginalization numerically using reversible-jump Markov Chain Monte Carlo (RJ-MCMC; [53]; for reviews see Refs. [54-56]). This is a generalization of MCMC that works on parameter spaces that don't have a fixed dimension, which is vital here because the number of GPs, and thus the total number of hyperparameters, isn't fixed. Navigating the tree structure requires special RJ-MCMC proposals, such as growing, pruning and rotating the tree, that are described in Ref. [39].
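To illustrate the core idea behind tree partitioning, the toy below scores a single "grow" split of step-function data by the summed log marginal likelihoods of independent GPs on each side. This is only the flavor of one RJ-MCMC move with fixed hyperparameters, not the full marginalization; all values here are hypothetical.

```python
import numpy as np

def log_marginal(x, y, sigma=1.0, ell=0.3, noise=0.1):
    """Log marginal likelihood of a zero-mean GP with a Matern-3/2 kernel."""
    r = np.abs(x[:, None] - x[None, :])
    z = np.sqrt(3.0) * r / ell
    K = sigma**2 * (1.0 + z) * np.exp(-z) + noise**2 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
            - 0.5 * len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
y = np.where(x < 0.5, 0.0, 2.0) + 0.1 * rng.standard_normal(40)  # step at 0.5

# Score each candidate split by the sum of the two partitions' evidences.
splits = np.linspace(0.1, 0.9, 17)
scores = [log_marginal(x[x < s], y[x < s]) + log_marginal(x[x >= s], y[x >= s])
          for s in splits]
best_split = splits[int(np.argmax(scores))]  # lands near the step
```

A single stationary GP must either over-smooth the step or wiggle elsewhere; splitting at the discontinuity lets each side be fitted smoothly, so the evidence peaks near the true step location.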

Integration
The idea of modeling integrals through GPs was originally known as Bayes-Hermite quadrature [57], and later discussed under the names of Bayesian Monte Carlo (BMC; [58]) and Bayesian quadrature or cubature [59,60]; see Ref. [61] for a review. Suppose we wish to compute an integral of the form

I = ∫ C(x) f(x) dx,   (11)

where f(x) is the estimated function and C(x) is a known function. BMC provides an epistemic meaning to errors in quadrature estimates of these integrals, such as I ≈ Σ_i w_i C(x_i) f(x_i), because we may make inferences on I through our statistical model for f(x). In cases in which the function C(x) and choice of kernel lead to intractable computations, there is an additional discretization error in BMC inferences, as the GP predictions are evaluated on a finitely spaced grid. This is known as approximate Bayesian cubature [61]. This additional error may be neglected when the integrand is approximately linear between prediction points. We will use a TGP to model an integrand. Although trees have been proposed in BMC [62], they haven't previously been directly combined with GPs in this way.
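The propagation in approximate Bayesian cubature can be sketched in a few lines: a posterior mean and covariance on a grid, combined with quadrature weights, give a mean and variance for the integral. The mean and covariance below are stand-ins for illustration, not TGP output.

```python
import numpy as np

# Pretend GP posterior for f on a grid (hypothetical, for illustration only).
x = np.linspace(0.0, 1.0, 101)
mean = np.sin(np.pi * x)                                       # E[f(x_i)]
cov = 0.01 * np.exp(-np.abs(x[:, None] - x[None, :]) / 0.1)    # Cov[f(x_i), f(x_j)]
C = np.ones_like(x)                                            # known weight C(x) = 1

# Trapezoid-rule quadrature weights on the grid.
w = np.zeros_like(x)
w[1:-1] = (x[2:] - x[:-2]) / 2.0
w[0] = (x[1] - x[0]) / 2.0
w[-1] = (x[-1] - x[-2]) / 2.0

# I ~ sum_i w_i C(x_i) f(x_i), so by linearity:
coeff = w * C
I_mean = coeff @ mean          # E[I]
I_var = coeff @ cov @ coeff    # Var[I]
```

With C(x) = 1 and mean sin(πx), E[I] is close to 2/π, and Var[I] inherits the correlations of the posterior covariance rather than naively summing independent errors.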

Sequential design
After completing inference of an unknown function with the data at hand, one may wish to know what data to collect next. This problem is known as sequential design or active learning. Broadly speaking, this is a challenging question, and greedy approaches that make optimal choices one step at a time are easier to implement. We thus consider a variant of Active Learning MacKay (ALM; [63,64]). Following the approach in Ref. [65], we consider the location that contributes most to the uncertainty in the a_µ^HVP and Δα_had integrals to be an optimal location at which to perform more measurements. For an integral of the form eq. (11), we compute

s* = arg max_s C²(s) Var[f(s)].   (13)

See Refs. [40,66] for further discussion.
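A greedy ALM-style step can be sketched as ranking prediction locations by their contribution to the variance of the integral estimate and proposing the top-ranked one. The coefficients and posterior (co)variance below are hypothetical inputs, not the paper's TGP results.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 51)
c = np.full_like(x, 1.0 / 50)  # quadrature weights times the known C(x) (toy)

# Pretend posterior variance with a noisy, uncertain region near x = 0.7.
var = 0.01 + 0.2 * np.exp(-((x - 0.7) / 0.05) ** 2)
cov = (np.sqrt(var[:, None] * var[None, :])
       * np.exp(-np.abs(x[:, None] - x[None, :]) / 0.1))

# i-th term of Var[I] = sum_ij c_i c_j Cov_ij; largest term = best next location.
contrib = c * (cov @ c)
x_next = x[int(np.argmax(contrib))]
```

The proposal lands where the posterior is most uncertain and most strongly weighted in the integral, mirroring the criterion described above.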

Data selection
We investigated the public dataset from the Particle Data Group (PDG; [32,67]; see also Ref. [68] for further details), which primarily comprises data on the inclusive R-ratio, asymmetric statistical errors, and point-to-point systematic errors from electron-positron annihilation to hadrons at different CM energies. Certain CM energies may have multiple point-to-point systematic uncertainties stemming from different sources. We symmetrized errors and combined systematic (τ) and statistical (σ) errors in quadrature,

σ_tot² = σ² + Σ_j τ_j².

We selected 859 data points inside the CM energy interval 0.3-1.937 GeV. This interval was selected to facilitate a comparison with Ref. [33]. The maximum, √s = 1.937 GeV, was chosen as it is the point at which summing exclusive R-ratio data becomes unfeasible and perturbative QCD may be reliable. The minimum, √s = 0.3 GeV, was the minimum energy in the public PDG dataset.
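The error treatment above amounts to a one-liner per data point; a minimal sketch follows, where the simple-average symmetrization is one common choice (the specific numbers are invented for illustration).

```python
import math

def combined_error(stat_up, stat_down, systematics):
    """Symmetrize asymmetric statistical errors and combine in quadrature
    with a list of point-to-point systematic errors."""
    stat = 0.5 * (stat_up + stat_down)  # simple symmetrization (one common choice)
    return math.sqrt(stat**2 + sum(tau**2 for tau in systematics))

# Hypothetical data point: asymmetric statistical errors and two systematics.
err = combined_error(0.04, 0.02, [0.03, 0.01])
```

Multiple systematic sources at the same CM energy simply contribute additional terms under the square root.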
To model this dataset, we utilized a treed Gaussian process (TGP), as described in section 3. Besides selecting the data, we must specify the locations at which we want to predict the R-ratio. In our study, we predicted at every input location and at two uniformly spaced locations between every pair of consecutive input locations.

Computational methods and modelling choices
We model the R-ratio by a TGP in which the input space is divided into partitions using a binary tree. Each partition in our TGP is governed by a mean, µ, and a Matérn-3/2 kernel with independent scale, σ, and length, ℓ, hyperparameters. We use a uniform prior between 0 and 150 for the mean, a uniform prior between 0 and 500 for the scale, and a uniform prior between 0 GeV and 5 GeV for the length. These choices were motivated by the maximum measured R-ratio and the CM interval 0.3-1.937 GeV under consideration. Following Ref. [39], the structure of the tree itself is controlled by a CGM prior with hyperparameters α = 0.5 and β = 2; see Ref. [41] for an explanation of these parameters. These choices favor smaller and more balanced trees.
We marginalize the tree structure and hyperparameters using RJ-MCMC. To improve computational efficiency, we thin the chains by a factor of four and only compute predictions for the states in the thinned chains. This reduces the computational time but only slightly reduces the effective sample size, as the states in the unthinned chain are strongly correlated. We run RJ-MCMC for 300 000 steps and discard 5000 burn-in steps to minimize bias from the beginning of the chain. For computational efficiency, and following a multistart heuristic, we run 10 chains in parallel and combine them.

Predictions
The predictions from our TGP model for the R-ratio are shown in fig. 2 as a mean and an error band. The mean predictions pass smoothly around the data points without any undue fluctuations near the data points that would be characteristic of over-fitting. The ρω and φ resonances are typically fitted by their own tree partitions with separate hyperparameters. They aren't forced to be as smooth as the rest of the spectrum and appear well-fitted. Our model predictions are noticeably more uncertain in regions with fewer or noisier measurements. We identify no anomalous features and tentatively conclude that the RJ-MCMC marginalization adequately converged. We ran standard MCMC diagnostics on the mean of the R-ratio using ArviZ [69], finding n_eff ≃ 600 bulk effective samples and a Gelman-Rubin diagnostic of 1.01 [70]. There are typically five or six partitions, as the two peaks and the three flatter regions are modeled separately, as shown in fig. 3. In fig. 4 we show the result when using an ordinary GP. To accommodate the narrow peaks in the measured R-ratio, the GP model permits substantial wiggles between data points, especially where the data points are sparse.
The TGP model outputs the mean, E[R_i], and covariance, Cov[R_i, R_j], of the R-ratio at the prediction locations s_i. The mean function represents the expected or average output value for a given input value, while the covariance function represents the covariance between predictions at different CM energies. As the RJ-MCMC can be computationally expensive and time-consuming, we saved these results to disk and made them publicly available [71]. We used the mean and covariance predictions for R in combination with the dispersion integrals to predict contributions to a_µ^HVP and Δα_had from the CM energy interval 0.3-1.937 GeV. As a_µ^HVP and Δα_had are linear functions of R, we propagate E[R_i] and Cov[R_i, R_j] to obtain predictions. In all subsequent integration processes, we employ the trapezoidal rule [72], and in subsequent formulae s denotes the locations of our TGP predictions rather than the locations of the measurements.
The calculation of a_µ^HVP is based on eq. (4). However, since the independent variable in this case is the CM energy, √s, a simple deformation of eq. (4) is necessary,

a_µ^HVP = (2α² / 3π²) ∫ d√s K(s) R(s) / √s.

Then we calculate the value of a_µ^HVP through numerical quadrature,

a_µ^HVP ≈ Σ_i C_i R_i,   (17)

where we defined

C_i = (2α² / 3π²) w_i K(s_i) / √s_i,

and w_i are the quadrature weights. We use the trapezoid rule such that

w_i = (√s_{i+1} − √s_{i−1}) / 2

at interior points, with one-sided weights at the endpoints. From eq. (17) and by linearity, the mean can be found through

E[a_µ^HVP] = Σ_i C_i E[R_i],   (20)

and from the covariance matrix for predictions of the R-ratio, the uncertainty in our prediction of a_µ^HVP can be calculated using

Var[a_µ^HVP] = Σ_{ij} C_i C_j Cov[R_i, R_j].   (21)

We compute Δα_had similarly using eqs. (20) and (21), though with coefficients

C_i = (2α M_Z² / 3π) w_i / (√s_i (M_Z² − s_i)).

Because our calculation is performed at CM energies from 0.3-1.937 GeV, far below the Z pole, the principal-value prescription does not need to be considered.
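The propagation through eqs. (17), (20) and (21) can be sketched as follows. The QED kernel is replaced by a placeholder function, and the TGP mean and covariance by constant stand-ins; the grid and prefactors are illustrative, not the paper's actual inputs.

```python
import numpy as np

E = np.linspace(0.3, 1.937, 200)   # CM energies sqrt(s) in GeV (prediction points)
R_mean = np.full_like(E, 2.0)      # pretend E[R_i] from the TGP
R_cov = 0.01 * np.eye(len(E))      # pretend Cov[R_i, R_j]

def K_qed(s):
    """Placeholder for the QED kernel K(s) (NOT the real kernel of Refs. [15,16])."""
    return 1.0 / s

# Trapezoid weights over CM energy.
w = np.zeros_like(E)
w[1:-1] = (E[2:] - E[:-2]) / 2.0
w[0], w[-1] = (E[1] - E[0]) / 2.0, (E[-1] - E[-2]) / 2.0

alpha = 1.0 / 137.035999           # fine-structure constant
prefactor = 2.0 * alpha**2 / (3.0 * np.pi**2)
C = prefactor * w * K_qed(E**2) / E  # coefficients C_i

a_hvp_mean = C @ R_mean            # E[a] = sum_i C_i E[R_i]
a_hvp_var = C @ R_cov @ C          # Var[a] = sum_ij C_i C_j Cov[R_i, R_j]
```

Because the estimate is linear in R, swapping in the real kernel and the saved TGP mean/covariance requires no structural change to this computation.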
For the sake of comparison, and to verify parts of our tool-chain, we calculate a_µ^HVP and Δα_had naively without utilizing a TGP. We consider a naive model that at the locations of the measurements of the R-ratio predicts

E[R_i] = R̂_i and Var[R_i] = σ_i²,

where R̂_i are the central values and σ_i are the errors of the measurements. In this naive model there is no covariance between predictions, that is, Cov[R_i, R_j] = 0 for i ≠ j.
Equations (20) and (21) apply to this simple case, although it should be noted that in this case the s_i are the locations of the data points, whereas in the TGP the s_i are the chosen prediction locations.
The results of the above calculations are summarized in table 1. We show the predictions from KNT18 [33] and KNT19 [34] for comparison, which are found by summing data-based exclusive channels in their tables 2 and 1, respectively, and combining errors in quadrature.³ We see that the TGP prediction for a_µ^HVP is smaller than the predictions from the naive model, KNT18 [33] and KNT19 [34]. This would make the tension between data-driven estimates and lattice QCD and the experimental measurements worse. The uncertainties in our TGP predictions are nearly identical to those from the naive model (we explain this similarity in appendix A), though substantially smaller than those from KNT18 [33] and KNT19 [34]. We don't anticipate that the smaller TGP uncertainties are a consequence of the TGP model itself; rather, KNT18 [33] and KNT19 [34] are based on a different dataset and treatment of systematics. For example, they include uncertainties from vacuum polarization (VP) effects and final-state radiation (FSR) that we omit. We thus find no clear evidence of mismodelling, or that our more careful modeling can shed light on the tension between data-driven estimates, lattice estimates and experiments. It is possible, however, that for a dataset identical to that in KNT18 [33] and KNT19 [34], the TGP predictions could be greater than those of KNT18 [33] and KNT19 [34]: the impact of reducing overfitting with a TGP could work in the opposite direction in that dataset.

Table 1: Predictions for a_µ^HVP and Δα_had in the CM energy range 0.3-1.937 GeV from our TGP model, a naive model, KNT18 [33] and KNT19 [34].

Sequential design
We may use our TGP result to identify locations that contribute most to the uncertainty in the a_µ^HVP and Δα_had predictions and where future measurements would be most beneficial. For both a_µ^HVP and Δα_had, the ALM estimate from eq. (13) yields √s ≃ 0.788 GeV. This lies near noisy measurements after the ρω resonance; see fig. 2. Besides lying close to noisy measurements, the uncertainty at this location is substantial because it is a boundary between partitions of the TGP: the behavior of the function changes abruptly here and so is hard to predict.

Correlation
Lastly, let us consider the relationship between the predictions for a_µ^HVP and Δα_had. From eqs. (4) and (5), we observe that the dispersion integrals used to calculate a_µ^HVP and Δα_had both involve the R-ratio, so we expect the two predictions to be correlated. To quantify this relationship, we computed their covariance and correlation under the TGP,

ρ = Cov[a_µ^HVP, Δα_had] / √(Var[a_µ^HVP] Var[Δα_had]),   (27)

where the (co)variances were computed under the TGP as described. As anticipated, we obtained a positive correlation between the two: when Δα_had increases, the value of a_µ^HVP also increases, and vice versa. The calculated correlation coefficient was ρ ≃ 0.8, which is close to 1, quantifying the strong correlation between a_µ^HVP and Δα_had.
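Since both integrals are linear in R, their covariance follows directly from the R covariance matrix and the two coefficient vectors. The sketch below uses randomly generated stand-ins for these quantities purely to illustrate the computation.

```python
import numpy as np

n = 50
rng = np.random.default_rng(2)
A = rng.standard_normal((n, n))
R_cov = A @ A.T / n + 1e-6 * np.eye(n)  # pretend positive-definite Cov[R_i, R_j]
C_a = np.abs(rng.standard_normal(n))    # pretend coefficients for a_mu^HVP
C_da = np.abs(rng.standard_normal(n))   # pretend coefficients for delta-alpha_had

# Covariance and correlation of the two linear functionals of R.
cov_ada = C_a @ R_cov @ C_da
var_a = C_a @ R_cov @ C_a
var_da = C_da @ R_cov @ C_da
rho = cov_ada / np.sqrt(var_a * var_da)
```

The Cauchy-Schwarz inequality guarantees |ρ| ≤ 1 for any positive-definite covariance matrix, so this construction always yields a valid correlation coefficient.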

Discussion and conclusions
The BNL and FNAL measurements of the anomalous magnetic moment of the muon disagree with the Standard Model (SM) prediction by more than 4σ. This has led to renewed scrutiny of new-physics explanations and of the SM prediction. With that as motivation, we extracted the hadronic vacuum polarization (HVP) contributions, a_µ^HVP, from e⁺e⁻ cross-section data using a treed Gaussian process (TGP) to model the unknown R-ratio as a function of CM energy. This is a principled and general method from data science that allows complete uncertainty quantification and automatically balances over- and under-fitting to noisy data.
The challenges in the data-driven approach are common in data science. A competitive estimate of a_µ^HVP, however, requires domain-specific expertise, careful curation of measurements, and careful consideration of systematic errors and their correlations. This should be developed over time in collaboration with domain experts. Thus our work should be seen as preliminary: it serves to explore an alternative statistical methodology based on more general principles and to develop an associated toolchain. We used a dataset available from the PDG, though, as noted as early as 2003 in Ref. [68], a more complete, documented and standardized database of measurements would allow further scrutiny of data-driven estimates of HVP.
Our analysis used about n ≈ 1000 data points. The linear algebra operations in GP computations scale as O(n³). There are computational approaches and approximations to overcome this scaling (see e.g., Refs. [73-77]); nevertheless, working with more complete datasets could be challenging. On the other hand, splitting data channel by channel could help the situation. For a competitive estimate, we would require careful treatment of correlated systematic uncertainties. The approach started here, carefully building an appropriate statistical model, naturally allows us to model systematic uncertainties, for example through nuisance parameters for scale uncertainties or sophisticated noise models for correlated noise (see e.g., Ref. [78]). The statistical model could include, for example, a hierarchical model of systematic uncertainties accounting for "errors on errors."

The prediction for a_µ^HVP from our TGP model is slightly smaller than existing data-driven estimates. Thus, more principled modeling of the R-ratio in fact increases the tension between the SM prediction and measurements of g − 2. On the other hand, because the kernel functions were slowly varying, the TGP model predicted a_µ^HVP with a similar uncertainty to that obtained in the naive approach. This can be understood from the trade-off between variance and covariance in predictions of the R-ratio at different CM energies. Looking forward, by the ALM criterion, the best CM energy for future measurements was √s ≃ 0.788 GeV for both a_µ^HVP and Δα_had, as it lies close to particularly noisy measurements of the R-ratio. In conclusion, we developed a statistical model for the R-ratio, based on general principles and publicly available toolchains.
We found no indication that mismodeling the R-ratio could be responsible for tension with measurements or lattice predictions.We hope, however, that this work serves as a starting point for further scrutiny, principled modeling and development of associated public tools.

Data and code availability
The TGP algorithm is implemented in our public Python package kingpin [28].The dataset and codes, which use the kingpin package, for this paper are available online; see Ref. [71].

A Similarity between TGP and naive model uncertainties
Here we explain why the uncertainties in the predictions for a_µ^HVP and Δα_had calculated under our TGP and in the naive method are approximately equal, despite quite different predictions for the R-ratio. We anticipated a reduction in error in our TGP, as we applied prior information about correlations between prediction points. There are two effects of introducing correlation: first, as anticipated, a reduction in the variance at any prediction point, that is, Var[R_i] ≪ σ_i²; second, an increase in the correlation between prediction points, Cov[R_i, R_j] > 0. The uncertainty in our TGP predictions, eq. (21), includes terms containing both variance and covariance,

Var[a_µ^HVP] = Σ_i C_i² Var[R_i] + Σ_{i≠j} C_i C_j Cov[R_i, R_j].

In practice we find that the decrease in the variance is almost exactly canceled by the increase in the covariance.
To understand this effect in the simplest way, consider fitting a horizontal line (m = 0) with an unknown intercept (c) through n data points with errors σ, as shown in fig. 5. The least-squares estimate of the intercept is the sample mean, with variance

Var[ĉ] = σ² / n.   (31)

The predictions y_i = ĉ are perfectly correlated, so Cov[y_i, y_j] = σ²/n for every pair, and summing over all n² elements gives

Σ_{ij} Cov[y_i, y_j] = n² × σ²/n = nσ².   (32)

This is identical to the uncertainty from our naive model with no correlations that fits y = ŷ ± σ, because the increased covariance cancels the decreased variance. Although we reduced the uncertainty at any prediction point, that reduction was offset by the covariance between predictions. Let us create an example closer to our TGP model and demonstrate an identical effect. In a GP with fixed hyperparameters for measurements with uniform noise σ, the covariance between prediction points X⋆ can be expressed as

Cov[y(X⋆), y(X⋆)] = K(X⋆, X⋆) − K(X⋆, X)ᵀ (K(X, X) + σ² I)⁻¹ K(X⋆, X),   (34)

where X are the training points, K is the choice of kernel function, and σ² is the noise. This expression is somewhat intractable due to the inverse matrix. Thus we consider X = X⋆ and a simplified covariance,

K_ij = a + b δ_ij.   (35)

With this choice, one can compute Cov[y(X⋆), y(X⋆)] and the sum over its elements analytically. Summing the elements results in

Σ_{ij} Cov[y(X⋆), y(X⋆)]_ij = n (na + b) σ² / (na + b + σ²),

where n represents the size of X⋆. By performing a Taylor expansion about σ = 0, we discover

Σ_{ij} Cov[y(X⋆), y(X⋆)]_ij = nσ² + O(σ⁴).
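The cancellation derived above is easy to check numerically: with the simplified kernel of eq. (35) and X = X⋆ in eq. (34), the summed posterior covariance equals nσ² up to O(σ⁴). The values of n, a, b and σ below are arbitrary illustrative choices.

```python
import numpy as np

n, a, b, sigma = 10, 1.0, 0.5, 0.1
K = a * np.ones((n, n)) + b * np.eye(n)  # simplified kernel K_ij = a + b*delta_ij

# Posterior covariance with X = X_star: K - K (K + sigma^2 I)^{-1} K.
post = K - K @ np.linalg.solve(K + sigma**2 * np.eye(n), K)
total = post.sum()  # sum over all elements

# Closed form derived in the text: n (n a + b) sigma^2 / (n a + b + sigma^2).
closed_form = n * (n * a + b) * sigma**2 / (n * a + b + sigma**2)
```

The sum matches the closed form to machine precision, and both differ from nσ² only at O(σ⁴), confirming that the reduced per-point variance is offset by the induced covariance.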
Although the structure of eq. (35) isn't particularly realistic, our result holds for any a, including 0% and 100% correlation. When C_i ≈ C ≡ const., the uncertainties from the TGP and naive method are both around C√n σ. The decrease in variance and increase in covariance cancel in the TGP uncertainties. For the case in which the prediction locations are denser than the input locations, this result may hold if the TGP predictions don't change substantially between input locations.

Figure 1 :
Figure 1: The GP (left) predicts a wiggly fit to the straight line sections due to the step.The TGP (right) automatically addresses the issue by partitioning the input space.

Figure 2 :
Figure 2: Predicted R-ratio from the TGP model.The experimental errors and uncertainty in the TGP predictions are scaled by 5 and the ρω and φ resonances are plotted separately for visibility.

Figure 3 :
Figure 3: Histogram of locations of partition edges in the TGP model.The mean prediction for the R-ratio is shown for reference (blue).

Figure 4 :
Figure 4: Similar to fig. 2, though showing results from an ordinary GP.

Figure 5 :
Figure 5: A horizontal line fitted to noisy data (left) and a naive model that passes through every data point (right). Despite making quite different predictions for y, they make identical predictions for the sum over the y_i.