Model-Based Geostatistics from a Bayesian Perspective: Investigating Area-to-Point Kriging with Small Data Sets

Area-to-point kriging (ATPK) is a geostatistical method for creating high-resolution raster maps using data of the variable of interest with a much lower resolution. The data set of areal means is often considerably smaller (<50\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$<\,50 $$\end{document} observations) than data sets conventionally dealt with in geostatistical analyses. In contemporary ATPK methods, uncertainty in the variogram parameters is not accounted for in the prediction; this issue can be overcome by applying ATPK in a Bayesian framework. Commonly in Bayesian statistics, posterior distributions of model parameters and posterior predictive distributions are approximated by Markov chain Monte Carlo sampling from the posterior, which can be computationally expensive. Therefore, a partly analytical solution is implemented in this paper, in order to (i) explore the impact of the prior distribution on predictions and prediction variances, (ii) investigate whether certain aspects of uncertainty can be disregarded, simplifying the necessary computations, and (iii) test the impact of various model misspecifications. Several approaches using simulated data, aggregated real-world point data, and a case study on aggregated crop yields in Burkina Faso are compared. The prior distribution is found to have minimal impact on the disaggregated predictions. In most cases with known short-range behaviour, an approach that disregards uncertainty in the variogram distance parameter gives a reasonable assessment of prediction uncertainty. However, some severe effects of model misspecification in terms of overly conservative or optimistic prediction uncertainties are found, highlighting the importance of model choice or integration into ATPK.

found to have minimal impact on the disaggregated predictions. In most cases with

Introduction
An important challenge often encountered in scientific research is spatial prediction using areal-support data, that is, data about the variable of interest that is available as areal means only. Data may be aggregated for privacy protection, administrative, technical or other reasons. Examples include data on the failures in semiconductor chip production (White et al. 2017), cancer mortality (Goovaerts 2006), precipitation (Park 2013), soil properties (Kerry et al. 2012;Orton et al. 2012;Malone et al. 2009;Brus et al. 2014), soil hydraulic properties (Horta et al. 2014) and, the motivating example in this research, crop yields. You et al. (2009) state that, because of limited resources, crop yield information in many sub-Saharan countries is available only sparsely and in aggregated format; however, such information is needed at a finer spatial resolution to support efforts to increase crop productivity [see for example Orton et al. (2018)] and thereby improve both human welfare and ecological sustainability.
Model-based spatial prediction is usually done using algorithms based on point support (Diggle and Ribeiro 2007). In the case of areal-supported input, area-to-point kriging (ATPK) is a popular approach for creating fine-scale raster maps of the variable of interest and the corresponding prediction uncertainty (called spatial disaggregation or downscaling). In ATPK, regression coefficients (in the presence of covariates) and variogram parameters (describing spatial relationships) have to be estimated, for example by a least square estimator for the regression coefficients combined with an iterative variogram fitting deconvolution algorithm ('method of moments') on the regression residuals (Goovaerts 2008). More recent methods such as restricted maximum likelihood (REML) in combination with universal kriging 1 (UK) (both, and from hereon, referring to their application in the ATPK setting) consider the uncertainty in the regression coefficients (Webster and Oliver 2007). However, uncertainty in the variogram model parameters might also be a relevant source of uncertainty (Jansen 1998;Kitanidis 1986;Minasny et al. 2011). Truong et al. (2014) showed that variogram uncertainty can have a substantial impact on ATPK variances. Brus et al. (2018) summarised earlier work by Pardo-Igzquiza and Dowd (2001) showing that uncertainty in the variogram parameters can be quantified by the inverse Fisher matrix of the variogram parameters, but did not integrate this uncertainty in the kriging prediction uncertainty itself.
In the Bayesian statistics paradigm, parameters can be considered stochastic rather than fixed but unknown (Gotway 2005). From a Bayesian perspective, REML in combination with UK considers the regression coefficients as stochastic and subsequently integrates them out-from the likelihood function for estimation of the variogram parameters as well as from the prediction. In this paper, this Bayesian direction is continued by successively integrating out the spatial variance parameter (analytically) and the spatial correlation distance parameter (numerically). By applying analytical solutions whenever possible, Markov chain Monte Carlo (MCMC) sampling from the Bayesian posterior distribution as proposed for example by Minasny et al. (2011) is avoided. MCMC can be computationally expensive and may, when used in a spatial context, be difficult to converge to a posterior distribution due to correlated parameters (Christensen et al. 2006). The extra effort of taking variogram parameter uncertainty into account could be most beneficial in the case of ATPK or area-to-area kriging, because the provided data set of areal means, from which regression coefficients and variogram parameters at high-resolution support must be inferred, can often be limited in size. However, taking variogram parameter uncertainty into account can also be beneficial in the case of point-to-point kriging (Le and Zidek 1992;Berger et al. 2001) and for sampling design Lark 2004, 2007).
It is not uncommon in ATPK studies to have only a small data set of areal means and no available relevant expert knowledge to inform model parameters (for example Brus et al. 2018). Therefore, this research aims to provide some insight into the applicability and behaviour of ATPK methods under these circumstances. To provide this insight, the following questions are answered: (i) What is the impact of different prior distributions-selected to represent a lack of prior knowledge about model parameters-on the quality of the ATPK predictions and prediction uncertainties? (ii) Can some aspects of uncertainty be disregarded, which might allow for computational benefits? (iii) How sensitive are results to misspecifications of the underlying statistical model?
In the following sections, the theoretical framework of model-based geostatistics for areal data, the Bayesian paradigm and the combination of these are briefly introduced. In the simulation part of this paper, REML is compared with more Bayesian approaches to perform and assess ATPK using a simulated spatial signal, including data sets as small as nine observations. Using real-world data, different approaches on selfaggregated remote sensing data are tested, referred to as the synthetic case study. Finally, as the motivating example, millet and sorghum yields, known as areal means only, are downscaled for each of the 45 provinces of Burkina Faso to a fine-scale grid of predicted yields.
with MVN indicating a multivariate normal distribution; X the design matrix containing location-specific covariate values, including a column with ones to represent the regression intercept; β the vector of k regression coefficients (also called trend parameters); σ 2 in the terminology of geostatistics the partial sill variance; and C(φ) the spatial correlation matrix as a parametric function of distance parameter φ.
Among several other possibilities, the exponential covariance function is assumed, where i1 and i2 index two discrete point locations s i1 and s i2 within the Gaussian random field, and h(s i1 , s i2 ) represents the Euclidean distance between those locations, while (. . .) i1,i2 means the element in row i1, column i2 of the indicated covariance matrix. Spatial directional dependence is not considered. For notational convenience, C(φ) is indicated by C from now on.

Area-to-Point Kriging
In the case of ATPK, the observations are m areal means togetherz, with z(s) an unobserved realisation of an infinite number of point values Z in area A j and |A j | the surface area of area A j , turning Eq. (1) intō withZ the stochastic representation of observationsz,X containing covariate values averaged over the corresponding areas, and σ 2C the matrix with average covariances between and within the areas. Note that β, σ 2 and φ are equivalent in Eqs. (4) and (1): the parameters on point support are estimated from the areal data. Note the absence of a nugget effect (often indicated by τ 2 ), which might represent measurement errors and micro-scale variation, in the covariance model. Such a nugget effect cannot be identified based purely on areal-support data. Although Truong et al. (2014) demonstrated the potential of expert prior knowledge to help define a nugget parameter, in the situation investigated here, no such knowledge is available. This issue will return in the discussion.
To be able to predict values z * at n * point locations s * , together with the corresponding prediction variances v * , it is necessary to define (1) matrixC * , the mean correlation between points in the observation areas and the prediction points and also implicitly a function of φ; (2) matrix C * * , the correlation matrix between the prediction points, again a function of φ; (3) the design matrix for the prediction locations X * ; and finally (4) the generalised least squares (GLS) estimator for β as given by The n * + m-dimensional distribution of the predictions and observations together can be written as According to UK theory, the resulting prediction is given bŷ an implicit function of φ, with prediction variance-covariance matrix an implicit function of φ and σ 2 . The diagonal of var[ẑ * UK ] is the vector of prediction error variances, better known as the universal kriging variancev * UK , for every prediction point.
For a mathematical elaboration of the above equations, refer to Wackernagel (2014). The term regression kriging (RK) is used later to refer to simple kriging of the trend residuals, an approach which disregards any uncertainty in the estimated plug-in values of the trend parameters, which results in a different kriging variance. In the following sections, the framework presented above will be extended to a Bayesian one.

Bayesian Statistics
In the Bayesian framework, parameters are considered random variables (McElreath 2016). Formulated in general probability notation where Θ stands for 'parameters of interest' and E for evidence (observations), Bayes' rule states that where the probability distribution p(Θ|E) indicates the posterior belief in the parameters given the evidence, p(Θ) indicates the prior belief in the parameters, independent of the evidence, p(E|Θ) is the probability of the evidence as a function of the parameters-called the likelihood-and p(E) is the probability of the evidence. Note that to assess the likelihood, a correctly defined probability distribution of the modelled process is assumed. Note also that p(E) is often left out when a proportional value for p(Θ|E) is sufficient. Mathematically, p(E) equals p(E, Θ) integrated over its parameters In a Bayesian context, prediction entails formulating a distribution conditional on the evidence E. Inserting the parameters in the equation shows the derivation of the posterior predictive integral where P stands for the predicted values.
In the Bayesian framework, the prior p(Θ) needs to be defined, stating the current state of knowledge-or, inversely formulated, the current state of ignorance-about the parameters. In many cases, a low-informative prior is desired; however, formulating a prior that 'lets the data speak for itself' is not always straightforward (Seaman et al. 2012;Lindley 2004). Also, a 'conjugate' prior is often chosen so that its distribution function matches the likelihood, resulting in a closed-form description of the posterior (Albert 2009).
For various reasons, prior distributions that do not integrate to a finite value (i.e., cannot be normalised to integrate to one) might be considered, termed in the Bayesian framework as 'improper' priors. Improper priors can result in improper (poorly defined) posterior probability densities.
In the remainder of this paper, a posterior probability distribution-proper or notis indicated by f p (. . . | . . .), the likelihood by f l (. . . | . . .), and a prior distributionagain having propriety or not-by f 0 (. . .). When the function type is ambiguous, its interpretation depends on the context and its arguments.

Partly Analytical Bayesian Area-to-Point Algorithm
In this section, a partly analytical and partly numerical algorithm to execute Bayesian ATPK is described, based on integrating out the trend and variance parameters and systematically exploring gridded values in the correlation distance parameter space. This algorithm is developed as an alternative to methods based on sampling from the posterior distribution. Starting from Eq. (4), the likelihood of data generated by the geostatistical model is based on the multivariate normal distribution where |...| indicates the determinant.
Throughout this work, the priors for the trend, variance and correlation distance parameters are given by This prior represents a priori independence between the parameters with an unlimited uniform (and thus improper) prior for the regression coefficient vector; a prior for the variance that is equivalent to an unlimited uniform prior for ln(σ 2 ), again an improper prior; and f 0 (φ), for which different options are considered. It falls under the more general formulation of Berger et al. (2001), who considered appropriate objective (uninformative) priors for the analysis of spatial point-support data. Given the above prior and likelihood function, the joint posterior distribution for the parameters is (up to a constant of proportionality) Based on the above assumptions, Fig. 1 illustrates our partly analytical Bayesian algorithm, Bayesian areal kriging (BAK), to infer the marginal posterior distributions of all parameters and to calculate and summarise predictive distributions. The relevant equations and their derivation are given in Online Resource A; the summary stating the main equations follows in the coming sections.

Marginal Posterior Distance Parameter
Given the joint posterior (Eq. 14), β and σ 2 are analytically integrated out to arrive at the analytical solution for the marginal posterior for φ given by whereβ is defined according to Eq. (5). Numerically, BAK creates a one-dimensional grid covering the parameter space of φ, calculates the marginal posterior for each φ, and normalises the marginal posterior to a distribution that integrates to one within the bounds of the φ grid.

Marginal Posterior Sill
For the marginal posterior of σ 2 , β is analytically integrated out from the joint posterior (Eq. 14) to arrive at As, to the authors' knowledge, there is no analytical way of integrating out φ, BAK creates a two-dimensional grid over the parameter space of φ and σ 2 and calculates the joint posterior for σ 2 and φ (i.e., the integrand) for every grid point; then it performs a trapezoidal integration over φ and normalises to arrive at the marginal distribution for σ 2 .

Marginal Posterior Regression Coefficients
The marginal posteriors of the individual regression coefficients β q , q = 1 . . . k are based on the joint posterior for the vector β The integrand here can be shown to be proportional to a multivariate t distribution (Roth 2013) for β with m − k degrees of freedom, location (vector) parameterβ and scale (matrix) parameter This integrand can be marginalised to a scaled t-distribution for the individual regression coefficients, as an implicit function of φ, and rearranged to give with f p (φ|z) as indicated in Eq. (15) and where defines a t-distribution for β q with degrees of freedom ν = m − k, location parameter β q and scale parameter q the qth element on the diagonal of β . Note that the variance of this t-distribution is q ν/(ν − 2). Similarly to Sect. 3.1.2, BAK creates two-dimensional grids covering the parameter spaces of φ and β q (for all q) and applies the trapezoidal rule to calculate the integral over φ in Eq. (19); finally, it normalises to get the marginal distributions for each individual β q .

Posterior Predictive Distribution
The conditional distribution for the variable of interest, given the data and any particular value of the distance parameter, f (z * |z, φ), is a t-distribution with degrees of freedom ν = m − k, with location parameterẑ * |φ-an implicit function of φ-already given in Eq. (7), and with scale parameter as provided in Eq. A85 in Online Resource A. The variance of this conditional distribution, also a function of φ, is given by (21) Note that Eq. (21) is an increased universal kriging variance [see for comparison Eq. (8)] because the uncertainty in σ 2 is also considered-hence the increment expressed in the first fraction. The second fraction equals the REML estimate for σ 2 given φ.
The posterior predictive distribution is defined as an integral of the above conditional distribution with respect to the posterior distribution of the distance parameter, which is numerically approximated. BAK first creates for each prediction point s * a vector of predictions and a vector of corresponding prediction variances, both as a function of φ. Finally, the algorithm calculates the mean and variance of the posterior predictive distribution (or, more formally, of a finite mixture distribution that approximates this distribution, with weights defined based on f p (φ|z) and the spacings of the φ parameter grid).

Methodological Details
In this work, a number of increasingly Bayesian approaches to ATPK are applied and compared (Table 1). The first three rows of the table represent plug-in approaches for some of the parameters (i.e., the stated parameters are first estimated, by maximising a likelihood or marginal likelihood function, before being plugged into the relevant predictive distribution equation for prediction), while the final row represents the fully Bayesian approach. In the case of maximum likelihood estimation (ML, and not implemented in this work), all parameters (in the geostatistical context: regression coefficients and spatial covariance parameters) are estimated by analytically or numerically maximising the likelihood. This general approach was consolidated by Fisher almost a century ago (Stigler 2007) and applied in geostatistics for example by Kitanidis (1983) and Lark (2000). REML, which has been advocated for several decades in geostatistics, is based on a likelihood function for a set of projected data rather than the original data, and gives conditionally unbiased estimates for the spatial parameters (Webster and Oliver 2007;Lark and Cullis 2004); see also Sect. A2 in Online Resource A. REML represents a form of marginal likelihood (a likelihood function in which some parameters have been marginalised), and has been presented in a Bayesian framework as such (the integral of the likelihood function with respect to the trend parameters, assuming a flat improper prior for these parameters) (Harville 1974). Note that f 0 (β) can be considered an uninformative prior when neglected-this is often valid for centrality parameters but not for other parameters. Underpinning the same approach, UK takes the uncertainty in the trend coefficients into account, making it a logical combination with REML. Within this research, the combined application of REML and UK is indicated by 'REML approach'. The next gradation towards the fully Bayesian approach is maximum likelihood with both trend and variance integrated out, in the context of this paper indicated by the generic term 'maximum marginal likelihood' (MML). Finally, the full Bayesian approach (also referred to as 'Bayesian approach') provides a posterior distribution of all parameters, while in the prediction all parameters are integrated out and the uncertainty of all parameters is taken into account. In the following sections, REML, MML and the Bayesian approach are compared, and for the Bayesian approach different priors for φ (as defined in the following section) are applied. All algorithms (including the central BAK algorithm as presented in Fig. 1) are written in the statistical programming language R, and are available at Steinbuch et al. (2019).

Prior Distributions for φ
For f 0 (φ), three potential forms of prior distribution are compared, intended to represent limited prior knowledge. These are (1) a uniform prior with limited bounds; (2) the reference prior as suggested by Berger et al. (2001) for analysis of point-support data, applied in the context of areal-support data, and explained in Online Resource B; and-in the simulation ensemble-(3) an inverse-gamma distribution. The bounded uniform and the inverse-gamma distributions are proper; the assumed propriety of the reference prior will be discussed later.

Estimation and Prediction with REML and MML
For REML, the approach as described by Brus et al. (2018) was applied. For MML, the posterior mode of φ was calculated using the Bayesian approach with a uniform prior for φ. The predictive distribution was then defined conditionally on this value of φ. Mathematically, this equals integrating out β and σ 2 to arrive at an estimatedφ, which is successively used as a single plug-in value for MML prediction; the mean and variance of the predictive distribution (representing the prediction and prediction variance) are shown in Table 1 and Eqs. (7) and (21).

Estimating Average Covariances
The average correlation matrices,C andC * , can be approximated in different ways. In this research, many discretisation points within each area are defined and the relevant Euclidean distances between those points are calculated, followed by construction of the corresponding correlation matrix, based on the correlation function-such as given in Eq.
(2)-and distance parameter φ. Then, all correlations per area-area combination are averaged to arrive atC, and per area-prediction point combination to arrive atC * .
The discretisation points were on a regular grid in the simulation study, and selected by simple random sampling in the two case studies.

Validation
To quantify the performance of each approach-for the simulation study and synthetic case study, where the original point data were available-the predictions z * and prediction uncertainties v * were assessed in relation to the original signal z. As an indication of the quality of the prediction, the root mean squared error (RMSE) defined as was calculated, where a smaller number indicates more accurate predictions (Oliver and Webster 2014). For comparison, a baseline approach was also included, for which point predictions were defined simply by the areal mean data for the corresponding area. Unbiasedness of predictions was tested using the mean error (ME), The mass preserving property (MPP) of the predictions was checked, which states that, in the case of ATPK, the mean of all predictions in any observed area should equal the observed areal mean (Kyriakidis 2004). This check was summarised by showing the maximum observed difference between areal-average data and the mean of the corresponding predictions.
As an indication of the quality of the prediction uncertainty, a motivating factor for this work, the standardised squared error (StSE) defined as was calculated. This StSE should ideally have a mean of one (Lark 2000). Higher values indicate an underestimation of uncertainty, which is labelled 'optimistic', and lower values indicate an overestimation of uncertainty, labelled 'conservative'.

Simulation Study
The following shows a single one-dimensional simulation where REML and full Bayesian (defined with f 0 (φ) ∼ uniform) are compared for illustration purposes. Following the illustration, an ensemble of many simulations is applied to assess several settings. Online Resource D contains similar results for two-dimensional simulations.

Simulated Data Set
A line of length 300 abstract units (au) was created, and filled with n = 600 equally spaced nodes. Using the exponential covariance function, a spatially correlated signal (with a zero-nugget exponential model; σ 2 sim = 5, φ sim = 60) was generated and added to the trend of a linear function of the coordinate (β 1sim = 0 for the intercept, 0.02 for the slope on coordinate). For the Bayesian approach, f 0 (φ) ∼ uniform was used, bounded by φ l,u = {10, 300}; these bounds also defined bounds for the REML parameter search. The priors for β and σ 2 are provided in Eq. (13). The above settings are referred to as the standard settings. The line was split into m = 10 equal onedimensional 'areas' or line sections. Finally, both z and the covariate over the areas were averaged to arrive at the observed meansz and the averaged design matrixX.

Results
The original signal (assumed to be unobservable, and represented as point values), the areal means (the 'observations') and the predictions are shown in Fig. 2. The difference between the REML and the full Bayesian approach is mainly in the prediction uncertainty: the Bayesian approach gives a slightly larger prediction interval. The marginal densities in Fig. 3 show that, based on this-small-simulated data set, it is rather difficult to identify the distance parameter φ, which has a very flat mode. The REML estimates for φ and σ 2 (point values) are close to the modes of the respective marginal posteriors from the Bayesian approach. For the trend parameters, the marginal Bayesian posterior distributions are slightly skewed and wider than the corresponding distributions based on REML (Gaussian distributions parameterised by the GLS estimate and estimation variance), indicating that more uncertainty is included.
The pairwise joint posteriors (φ with each other parameter) are shown in Fig. 4: σ 2 and φ seem to have a positive correlation (subfigure a), while β 1 has a slightly negative correlation with φ (b) and β 2 a slightly positive correlation (c). Figure 5 shows the variogram models obtained with REML and the Bayesian-uniform approach. Table  a   0   2 shows the validation results of this single simulation. The quality of the prediction (RMSE) is almost equal for the REML and Bayesian approaches, and better than for the baseline approach. The ME and max(MPP) are close to zero, indicating the absence of bias, and a discretisation grid (which is different from the prediction grid) of sufficient density to provide good approximations of areal-average covariances and areal-average values of covariates. The uncertainty validation value mean(StSE) shows that REML is on average optimistic, while the Bayesian approach is conservative. Note that the baseline approach does not provide a quantification of prediction uncertainty.

Simulation Ensemble
In this section, results are generalised by generating many simulations, varying only the random number seed, while comparing validation statistics on the outcomes of several approaches: REML, MML and full Bayesian with the three different priors for φ indicated earlier. The applied inverse-gamma prior for φ is set as somewhat informative with shape = 11 and rate = 600. This results in a mean of 60 au, emulating a situation where decent prior knowledge about the range is available. Also, the number of observations m is varied by dividing the line into m sections of equal length; this together is one 'session'. Furthermore, to investigate how the approaches behave for differently simulated data sets and different inference settings, both are varied into an 'ensemble' of many sessions. The settings as used in standard session 1 are given in Sect. 4.1.1. Sessions 2 and 3 vary the upper bound for the uniform prior and for the REML and MML searches for estimation of φ in comparison with standard session 1 (where φ = 300 au). In session 4, the trend is removed (so that the inferential model has to infer the mean only); in session 5, the trend is based on a separate Gaussian random field (GRF) rather than on the coordinates. Sessions 6, 7 and 8 introduce a misfit between the simulation model and the inferential model, where the correlation function in the simulation model is changed or a nugget component is added-the inferential model stays unchanged. Finally, sessions 9 and 10 show the effects of a misfit between the actual signal and the support of the available data (i.e., short or very long distance parameter used in simulation compared with the area sizes and total extent), which might make it difficult to identify parameters. Table 3 presents the results expressed as the average (and, in small font, the corresponding standard deviation) over 250 means of the standardised squared error (mean(StSE)). Table D1 in Online Resource D shows the results of the analogous two-dimensional simulations. Additional validation statistics and assessments regarding φ and σ 2 for the simulation ensemble can be found in Online Resource C, which also includes m = 15 and m = 30 for the one-dimensional simulations, and in Online Resource E for the corresponding figures of the two-dimensional simulations. Table 3 Mean standardised squared error (mean(StSE)) for the one-dimensional simulation ensemble, comparing restricted maximum likelihood (REML), maximum marginal likelihood (MML) and three full Bayesian approaches with different priors for φ In the online version of this paper, colours are used to highlight larger and smaller values for the readers' convenience

General Results
Referring to Online Resources C and E, the maximal difference with respect to the mass preserving property (max(MPP)) ranges between 0.09 and 0.28 in the case of the two-dimensional simulations. In the one-dimensional case, max(MPP) is much smaller. With all approaches in all simulations, the ME was small. The RMSE was, for a given simulation, almost equal for all, but the baseline approach was, on average, larger. The main difference between the approaches was in the prediction uncertainty (assessed by StSE).
The standard session 1 in Table 3 shows that m = 10 caused REML to be optimistic, while the Bayesian-uniform approach was less optimistic, Bayesian-inverse-gamma was closest to one (perhaps due to the knowledge captured in the prior distribution for φ), MML was slightly conservative and Bayesian-reference very conservative.
With increasing m, all mean(StSE) approached one, while the corresponding sd(StSE) decreased. Even with m = 20, the differences between approaches and the deviation from one became small, except for the Bayesian-reference approach. The results for two-dimensional simulations were similar, although differences between approaches were a bit larger for m = 9 and deviations from one were often still substantial for m = 25.

Changing Uniform Prior for φ (Sessions 2 and 3)
Sessions 2 and 3 vary only in the upper bound of the uniform prior for φ (au = 100 and 2000, respectively) used as the basis for inference, rather than in the simulating model; for comparison, the same bounds in the REML and MML parameter searches were applied. Note that Bayesian-inverse-gamma and Bayesian-reference results (see Sect. 3.2.1), having their own bounds, are not repeated here. Also recall that the extent of the simulated data set was 300 (au). The seemingly arbitrary choice of the upper bound of the uniform prior for φ influenced the results, especially with few data (small m) and with the two-dimensional simulations (see Online Resource D).
Although MML and the Bayesian-uniform approach use the same range of possible φ values, MML was far less influenced by the upper bound for its parameter search. The proportion ofφ values (estimated by REML and MML, respectively) that were very close to the upper and lower bounds are also given (Online Resource C/E). Interestingly, in the case of a larger upper bound (session 3), the fraction of REMLestimated φ's close to the unchanged lower bound was larger than in session 1, while the fraction of MML-estimated φ's close to the lower bound stayed the same.

Varying Simulation Trend (Sessions 4 and 5)
The trend on the spatial coordinate (the standard) was also compared with a trend that was a simulated GRF itself (β 1 = 0, β 2 = 2, τ 2 = 0, σ 2 = 0.5, φ = 30), session 4, and with a constant mean, session 5. In both cases, the form of the trend (i.e., the design matrix) was assumed to be known for inference and prediction. The only difference between the sessions was the means: the simulated error signals for sessions 1-5 were identical. Compared with a trend on the coordinate, both the GRF trend and a constant mean gave only minor differences in mean(St S E); this also held for the two-dimensional simulations.

Misspecified Model (Sessions 6, 7 and 8)
In sessions 6 and 7, the error signal was simulated using a Matérn covariance function with large and small values for the smoothness parameter ν sim (not to be confused with the degrees of freedom ν of a t-distribution used earlier). The inference in these sessions was still based on the exponential covariance model, which equals the Matérn model with ν sim = 0.5. These sessions were designed to provide a test of how the methods deal with a misspecified inferential model. The large ν sim in session 6 caused all mean(StSE) to be far too conservative, with the Bayesian approaches slightly more conservative, and with average mean(StSE) values becoming smaller with increasing m. In the two-dimensional simulations, the values stayed considerably closer to one. A small ν sim , as shown in session 7, caused almost all results to be optimistic. With increasing m, the mean(StSE) did not converge towards one, but rather seemed to stabilise at an optimistic value. With a nugget component added to the simulated data (session 8; with nugget-sill ratio 1/6), all approaches were optimistic (except Bayesian-reference and m = 10, and its two-dimensional counterpart with m = 9), and the average mean(StSE) increased with m in the one-dimensional simulations. In the two-dimensional simulations, the relation between m and mean(StSE) was ambiguous.

Simulation with Extreme Distance Parameter (Sessions 9 and 10)
If the distance parameter used for the simulations was very small in relation to the areas under consideration, such as in session 9, all approaches seemed to be quite optimistic, but this effect strongly decreased with increasing m. The worst performer was the Bayesian-inverse-gamma, where information encapsulated in f 0 (φ) was now mismatched with the simulation model, although Bayesian-uniform also performed badly. The Bayesian-reference approach performed best. In the two-dimensional simulations, values were more extreme, especially for m = 9. When, as in session 10, the distance parameter was large compared with the total extent under consideration, REML performed almost perfectly while other approaches tended to be slightly or fairly conservative, but improved with increasing m.

Synthetic Case Study: Vegetation Index Data, with Validation on Point Support
To briefly investigate how REML, MML and full Bayesian would perform for a real-world data set, a remote-sensing vegetation index, CFAPAR-27, was used as the variable of interest. These data are used as a covariate in the real case study (spatial prediction of crop yield in Burkina Faso, Sect. 5.2) and therefore concisely described in Online Resource F. This spatial variable is, obviously, available on pixel support. The CFAPAR-27 data were masked using the crop yield mask (see also Sect. 5.2), and subsequently aggregated over the 45 provinces of Burkina Faso. As covariates for inference, two climate variables broadly representing rainfall and temperature (CRAIN-EC-27 and TMIN-EC-21) and one variable representing soil pH (PHAQ) were used. Gaussianity for all real-world variables of interest was assumed. ATPK was applied using four approaches: (1) REML, (2) MML, (3) the full Bayesian approach using the uniform prior for φ, and (4) the full Bayesian approach using the reference prior for φ. For REML and MML, the parameter search for φ was bounded between 37 and 300 km, being roughly the smallest distance between the centres of any two areas, and one third of the largest extent of the region of interest, respectively. The same bounds defined the uniform prior for the full Bayesian approach.
The resulting mean(StSE) was 2.87, 2.73, 2.92 and 2.59 for the REML, MML, Bayesian-uniform and Bayesian-reference approaches, respectively, showing that prediction uncertainty was seriously underestimated by all approaches. The mean(StSE) of the Bayesian-uniform approach could be changed by several tenths by adjusting the bounds of the uniform prior. All RMSE values for the four approaches were around 6.19 (compared with the baseline approach RMSE of 16.54), indicating that they offered the same prediction quality and probably quite similar predictions.

Real Case Study: Crop Yield Data
As a real-world case study, this paper predicts yields of sorghum and millet, both cereal staple foods, in Burkina Faso, West Africa. The observation areas are the 45 provinces, for which only the average yields are known (averaged over the years 2000-2013, and provided by AGRHYMET), as shown in Fig. 6 for millet. Covariates for the trends as suggested by Brus et al. (2018) are used: for millet no covariates, and for sorghum four covariates are shown and briefly explained in Online Resource F. REML, MML, Bayesian-uniform and Bayesian-reference approaches were applied, with similar settings to those of Sect. 5.1 in the previous analysis of the vegetation index data. The observed millet yields, and the resulting predictions and prediction uncertainties (standard deviations of the predictive distributions) when applying MML, are presented in Fig. 6; maps are presented first over the entire study region, then focused on a subregion to reveal more detail. Similar maps for sorghum are presented in Online Resource F. Figure 7 shows the densities of the millet yield predictions and prediction standard deviations based on all four approaches, indicating that the Bayesian and, to a lesser extent, MML approaches generated larger prediction uncertainties than REML. For sorghum (see Online Resource F), the Bayesian-reference-calculated prediction diverted from the other approaches, due to the tendency of the distance parameter to move as close as possible to zero; the applied lower bound for the uniform prior for φ and for the REML and MML parameter searches imposed a limit on this effect. This shows again that a seemingly arbitrary choice of a uniform prior or of a parameter range for REML or MML might influence the resulting prediction uncertainty.

Setting Uniform Prior
Both in the simulations (Table 3 and Online Resource C) and in the case studies, the choice of the upper and lower bounds of a uniform prior for φ can influence the prediction uncertainty, especially (but not exclusively) with smaller data sets and if the posterior mode of φ coincides with one of the bounds of the prior. This effect can also occur with REML and MML approaches, where the search for the optimum value of φ is bounded by the same limits. It should be stressed that, in this context, this 'flat' uniform prior cannot be considered uninformative. The fact that the posterior modes of φ (resulting from Bayesian approaches), orφ (from the REML approach), often  coincided with one of the bounds (for example see the 'φ, mode(φ) ≈ min, max' columns in Online Resources C and E, but also sorghum in the case study) highlights the importance of carefully considering such prior or parameter search settings in geostatistical practice.

Reference Prior
According to the simulations, the reference prior did not perform well, being in many cases too conservative about prediction uncertainty, and pushing posterior distributions of the distance parameter too strongly towards zero (see also Online Resources C and D). In the case of small ν sim or τ 2 sim > 0, this conservatism compensated to some extent for model misspecification. Berger et al. (2001) derived the form of the reference prior for analysis of spatial point-support data, and the same logic was applied-with area-to-area average correlations replacing the point-to-point correlations of Berger et al. (2001)-to justify a similar prior for analysis of areal-support data. However, although the logic to derive the form of the prior follows analogously, the authors are unsure of the analogous logic to ensure propriety of the resulting prior and posterior distributions. As such, even if the simulations would have demonstrated a strong advantage, it would require further work to derive the required proofs of propriety. With the simulation results not demonstrating strong advantages, other priors for φ are currently recommended.

Misspecified Model
Validation statistics about prediction uncertainties are very sensitive to the misspecification of variogram parameters that determine the smoothness of the spatial signal. Examples are the nugget parameter and the smoothness parameter of the Matérn covariance function, as demonstrated in the simulation sessions 6, 7 and 8 and, in the authors' opinion, in the vegetation index synthetic case study. Short-range spatial relationships are however difficult, if not impossible, to assess if only areal means are available. Situations with areal data combined with some high-density point data could improve the results, see for example Moraga et al. (2017); another approach would be to use prior information, such as expert opinions, for the nugget (Truong et al. 2014). In cases in which more summary data per area are available than only the mean, Orton et al. (2012) proposed a method for incorporating this information. In this situation, the exponential covariance model without a nugget was applied for convenience, as is often the case in comparable research; this is however a quite arbitrary choice, and given the results, a careful consideration of all model parameters that determine the smoothness of realisations is suggested for future research.

Number of Observations
In simulation sessions 6, 7 and 8 (very smooth or very rough simulated signals), mean(StSE) did not converge to one with an increasing number of observations m as might be expected, and actually diverged from one in most cases. Therefore, more data do not alleviate a poor choice of model. Furthermore, in the simulation set-up, increasing m had the side effect of decreasing the size of the individual areas, which might also have influenced this behaviour due to short-range variations becoming better observable.
Very small data sets (nine or ten observations) were analysed in the simulations. The main point of interest was to see how the different approaches behave in such extreme situations, assessed by taking averages over many simulations. The authors stress that, even when using a Bayesian approach, any geostatistical conclusion based on nine or ten observations should be interpreted with caution, except perhaps if strong and honest prior information is available and can be incorporated.

One-Dimensional Versus Two-Dimensional Simulations
The effect of simulation choices and statistical inference approaches was quite similar in the one-dimensional and two-dimensional simulations. Differences might be due to different mutual spatial relationships. For example, for the two-dimensional simulations with nine observations, the closest pairs of units have centroids separated by 100 au. For the one-dimensional simulations with ten observations, neighbouring units' centroids are separated by 30 au. Therefore, despite there being an almost equal amount of data, there is much more short-range information in the one-dimensional data in the set-up used. This might explain the extremely large mean(StSE) values in session 9 of the two-dimensional study (up to 49.2) compared with the less extreme values in the one-dimensional study (up to 4.9).
The approximation of the average covariance matrices might have been less successful for the two-dimensional simulations. This would explain the relatively high max(MPP) and the unexpected irregular spatial pattern of the prediction error sd (see Online Resource D, Fig. D1 d).

The Algorithm
Although the authors did not compare the approach used here with more conventional MCMC methods, it proved an effective and efficient means of performing Bayesian (and also MML) spatial data analysis in the area-to-point context presented. As indicated in Sect. 3.2.3, several different methods can be used to approximate the average covariance matrices. For example, the Legendre-Gauss quadrature-as described by Orton et al. (2017)-is computationally and memory-wise much cheaper, but perhaps less accurate than the applied discretisation points method. Both methods, including some variations, are included in the code, as is area-to-area kriging. Future extensions might include directionality and point-to-point kriging.
The Integrated Nested Laplace Approximations (INLA) alternative to MCMC (Blangiardo et al. 2013) has some similarities to our partly analytical Bayesian approach, such as a gridded search in parameter space and numerical integration. However, it uses Laplace approximations of some integrals and is applicable to a much wider range of models, including hierarchical ones. Furthermore, its spatial implemen-tation assumes the Markovian property on the spatial Gaussian random field (meaning that any point or area in the region is influenced only by its immediate neighbours), leading to sparse covariance matrices and thus reductions in computational costs. In our opinion, our approach offers specific and transparent insights into the Bayesian approach of model-based geostatistics. For future research, however, it would be interesting to redo the calculations with INLA, or to integrate some of the sophisticated and cost-reducing details of INLA into the code.

General Remarks
The general impression obtained from the ensemble of simulations is that REML tended to underestimate prediction uncertainty the most, followed by Bayesianuniform. The Bayesian-reference approach tended to be more conservative, while MML was slightly conservative but seemed relatively stable. The differences between the approaches decreased with increasing m. Given a covariance model that is more or less accurate in terms of short-range behaviour, the conclusion is that, for data sets of sufficient size, or if a slight underestimation of prediction uncertainty is allowed, the REML approach as demonstrated by Brus et al. (2018) should be sufficient. For smaller data sets with no prior information available, the most robust and in many cases best approach, albeit somewhat conservative, appeared to be MML. An additional advantage of MML is its relative insensitivity to arbitrary choices such as bounds on the correlation distance parameter. In several sessions, MML even outperformed Bayesian-inverse-gamma when the supplied prior information about φ was correct. Finally, MML has additional computational benefits for prediction over the fully Bayesian approach.
The authors suggest focusing future research on modelling short-range variation and including a smoothness parameter in the inferential models. Using honest and informative priors-depending on the research question at hand-might also yield interesting results. The Matérn smoothness parameter ν M could be made an integral part of the Bayesian model, or alternatively incorporated as an extra model parameter to be optimised in an MML approach, which could then be used as a plug-in value for prediction.

Conclusions
All tested geostatistical approaches for ATPK (REML, MML, and Bayesian with different priors for the distance parameter) provided very similar predictions, but were different in the prediction uncertainties, with REML slightly underestimating the uncertainty in the case of very few data. Prediction uncertainties are quite sensitive to the parameters determining the smoothness of the spatial signal (i.e., nugget and smoothness parameter of the Matérn covariance function). Given correctly modelled short-range effects, for data sets of sufficient size, or if an underestimation of prediction uncertainty is allowed, the REML approach as demonstrated by Brus et al. (2018) is satisfactory. The MML approach (maximum likelihood with trend and variance integrated out) provided acceptable results while being relatively robust to arbitrary settings for the parameter search. Also, this approach does not need a choice of prior for the distance parameter. A useful and robust full Bayesian approach could not be accomplished, perhaps due to the lack of a good uninformative prior for the distance parameter of the covariance function; the reference prior as proposed by Berger et al. (2001) overestimated the prediction uncertainty in most cases. For real-world case studies, the demonstrated algorithms (https://doi.org/10.4121/uuid: 1fe0c01e-7f67-435b-a240-800579adc6e6) can be used.