Partly Analytical Bayesian Area-to-Point Algorithm
In this section, a partly analytical and partly numerical algorithm to execute Bayesian ATPK is described, based on integrating out the trend and variance parameters and systematically exploring gridded values in the correlation distance parameter space. This algorithm is developed as an alternative to methods based on sampling from the posterior distribution. Starting from Eq. (4), the likelihood of data generated by the geostatistical model is based on the multivariate normal distribution
$$\begin{aligned} f_l (\bar{\varvec{z}} | \varvec{\beta }, {\sigma ^2}, \phi ) = \frac{1}{(2\pi )^\frac{m}{2} ({\sigma ^2})^{\frac{m}{2}} \left| \varvec{ \bar{C}}\right| ^\frac{1}{2} } \hbox {exp} \left\{ - \frac{1}{2{\sigma ^2}}(\bar{\varvec{z}} - \bar{\varvec{X}}\varvec{\beta })^\mathrm{T} \varvec{ \bar{C}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}\varvec{\beta }) \right\} , \end{aligned}$$
(12)
where \(\left| \cdot \right| \) denotes the determinant.
Throughout this work, the priors for the trend, variance and correlation distance parameters are given by
$$\begin{aligned} \begin{aligned} f_0(\varvec{\beta }, \sigma ^2, \phi ) \propto \frac{1}{\sigma ^2} f_0(\phi ). \end{aligned} \end{aligned}$$
(13)
This prior represents a priori independence between the parameters. It combines an unbounded uniform (and thus improper) prior for the regression coefficient vector; a prior for the variance that is equivalent to an unbounded uniform prior for \(\ln (\sigma ^2)\), again an improper prior; and \(f_0 (\phi )\), for which different options are considered. It falls under the more general formulation of Berger et al. (2001), who considered appropriate objective (uninformative) priors for the analysis of spatial point-support data.
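The equivalence between the \(1/\sigma ^2\) prior and a flat prior on \(\ln (\sigma ^2)\) follows from a change of variables. As a brief sketch, writing \(u = \ln (\sigma ^2)\) (a symbol introduced here only for illustration):
$$\begin{aligned} f_0(u) \propto 1 \quad \Rightarrow \quad f_0(\sigma ^2) = f_0(u) \left| \frac{\mathrm {d}u}{\mathrm {d}\sigma ^2} \right| \propto \frac{1}{\sigma ^2}. \end{aligned}$$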
Given the above prior and likelihood function, the joint posterior distribution for the parameters is (up to a constant of proportionality)
$$\begin{aligned} \begin{aligned}&f_p(\varvec{\beta }, \sigma ^2, \phi | \bar{\varvec{z}}) \\&\quad \propto \frac{1}{(2\pi )^\frac{m}{2} ({\sigma ^2})^{\frac{m}{2}} \left| \varvec{ \bar{C}}\right| ^\frac{1}{2} } \hbox {exp} \left\{ - \frac{1}{2{\sigma ^2}}(\bar{\varvec{z}} - \bar{\varvec{X}}\varvec{\beta })^\mathrm{T} \varvec{ \bar{C}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}\varvec{\beta }) \right\} \frac{1}{\sigma ^2} f_0(\phi ). \end{aligned} \end{aligned}$$
(14)
Based on the above assumptions, Fig. 1 illustrates our partly analytical Bayesian algorithm, Bayesian areal kriging (BAK), to infer the marginal posterior distributions of all parameters and to calculate and summarise predictive distributions. The relevant equations and their derivation are given in Online Resource A; a summary of the main equations follows in the next sections.
Marginal Posterior Distance Parameter
Given the joint posterior (Eq. 14), \(\varvec{\beta }\) and \(\sigma ^2\) are analytically integrated out to arrive at the analytical solution for the marginal posterior for \(\phi \) given by
$$\begin{aligned} f_{p} (\phi | \bar{\varvec{z}} ) \propto f_0(\phi ) \frac{ 1 }{ \left| {\varvec{ \bar{C}}} \right| ^\frac{1}{2} \left| \bar{\varvec{X}}^\mathrm{T} {\varvec{ \bar{C}}}^{-1} \bar{\varvec{X}} \right| ^\frac{1}{2} \left[ (\bar{\varvec{z}} - \bar{\varvec{X}} \hat{\varvec{\beta }} ) ^\mathrm{T} \, {\varvec{ \bar{C}}}^{-1} \, (\bar{\varvec{z}} - \bar{\varvec{X}} \hat{\varvec{\beta }} ) \right] ^\frac{m- k}{2} }, \end{aligned}$$
(15)
where \(\hat{\varvec{\beta }}\) is defined according to Eq. (5).
Numerically, BAK creates a one-dimensional grid covering the parameter space of \(\phi \), calculates the marginal posterior for each \(\phi \), and normalises the marginal posterior to a distribution that integrates to one within the bounds of the \(\phi \) grid.
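The grid evaluation described above might be sketched as follows. This is an illustrative Python sketch, not the authors' R implementation. It assumes an exponential correlation function evaluated on point-to-point distances as a stand-in for the area-averaged matrix \(\bar{\varvec{C}}\), a flat \(f_0(\phi )\) within the grid bounds, and that \(\hat{\varvec{\beta }}\) in Eq. (5) is the generalised least squares estimate. All function and variable names are hypothetical.

```python
import numpy as np

def log_marginal_posterior_phi(phi, z, X, dist, log_prior=lambda phi: 0.0):
    """Log of Eq. (15), up to an additive constant, for one value of phi."""
    m, k = X.shape
    C = np.exp(-dist / phi)                      # ASSUMPTION: exponential correlation
    L = np.linalg.cholesky(C)                    # C = L L^T
    Lz = np.linalg.solve(L, z)                   # whitened data
    LX = np.linalg.solve(L, X)                   # whitened design matrix
    XtCX = LX.T @ LX                             # X^T C^{-1} X
    beta_hat = np.linalg.solve(XtCX, LX.T @ Lz)  # ASSUMPTION: GLS estimate for Eq. (5)
    resid = Lz - LX @ beta_hat
    S2 = resid @ resid                           # residual quadratic form
    log_det_C = 2.0 * np.sum(np.log(np.diag(L)))
    _, log_det_XtCX = np.linalg.slogdet(XtCX)
    return (log_prior(phi) - 0.5 * log_det_C - 0.5 * log_det_XtCX
            - 0.5 * (m - k) * np.log(S2))

def trapezoid(y, x):
    """Trapezoidal rule on a (possibly uneven) grid."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def posterior_phi_on_grid(phi_grid, z, X, dist):
    logp = np.array([log_marginal_posterior_phi(p, z, X, dist) for p in phi_grid])
    dens = np.exp(logp - logp.max())             # subtract the maximum for stability
    return dens / trapezoid(dens, phi_grid)      # integrates to one on the grid

# Toy data (purely synthetic, for illustration only).
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(15, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
X = np.ones((15, 1))                             # constant trend, k = 1
z = rng.normal(size=15)
phi_grid = np.linspace(0.5, 10.0, 40)
post = posterior_phi_on_grid(phi_grid, z, X, dist)
```

Working on the log scale and subtracting the maximum before exponentiating avoids underflow when \(m\) is large; the normalisation is then only valid within the chosen grid bounds, as noted in the text.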
Marginal Posterior Sill
For the marginal posterior of \(\sigma ^2\), \(\varvec{\beta }\) is analytically integrated out from the joint posterior (Eq. 14) to arrive at
$$\begin{aligned} f_p(\sigma ^2| \bar{\varvec{z}}) \propto \int f_{0}(\phi ) \frac{ \hbox {exp} \left\{ - \frac{1}{2{\sigma ^2}} \left[ (\bar{\varvec{z}} - \bar{\varvec{X}}\hat{\varvec{\beta }})^\mathrm{T} {\varvec{ \bar{C}}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}\hat{\varvec{\beta }}) \right] \right\} }{\left| {\varvec{ \bar{C}}} \right| ^\frac{1}{2} \left| \bar{\varvec{X}}^\mathrm{T} {\varvec{ \bar{C}}}^{-1} \bar{\varvec{X}} \right| ^\frac{1}{2} ({\sigma ^2})^{\frac{m- k+ 2}{2} } } \mathop {}\!\mathrm {d}\phi . \end{aligned}$$
(16)
As, to the authors’ knowledge, there is no analytical way of integrating out \(\phi \), BAK creates a two-dimensional grid over the parameter space of \(\phi \) and \(\sigma ^2\) and calculates the joint posterior for \(\sigma ^2\) and \(\phi \) (i.e., the integrand) for every grid point; then it performs a trapezoidal integration over \(\phi \) and normalises to arrive at the marginal distribution for \(\sigma ^2\).
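A minimal Python sketch of this two-dimensional grid step is given below. It assumes that the \(\phi \)-dependent terms of Eq. (16) have already been computed at each grid value of \(\phi \): `log_w` holds the log of \(f_0(\phi ) \left| \bar{\varvec{C}} \right| ^{-1/2} \left| \bar{\varvec{X}}^\mathrm{T} \bar{\varvec{C}}^{-1} \bar{\varvec{X}} \right| ^{-1/2}\) and `S2` holds the residual quadratic form. The names and the toy numbers are hypothetical.

```python
import math

def trapezoid(y, x):
    """Trapezoidal rule for samples y on a (possibly uneven) grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def marginal_sigma2(sigma2_grid, phi_grid, log_w, S2, m, k):
    """Eq. (16) on a grid: integrate over phi, then normalise over sigma^2."""
    n_phi = len(phi_grid)
    # Log of the integrand at every (sigma^2, phi) grid point.
    log_f = [[log_w[j] - S2[j] / (2.0 * s2) - 0.5 * (m - k + 2) * math.log(s2)
              for j in range(n_phi)] for s2 in sigma2_grid]
    shift = max(max(row) for row in log_f)       # guards against underflow
    over_phi = [trapezoid([math.exp(v - shift) for v in row], phi_grid)
                for row in log_f]
    norm = trapezoid(over_phi, sigma2_grid)
    return [v / norm for v in over_phi]

# Toy inputs (illustrative values only).
phi_grid = [1.0, 2.0, 3.0, 4.0]
log_w = [0.0, 0.3, 0.1, -0.4]
S2 = [10.0, 12.0, 11.0, 13.0]
sigma2_grid = [0.2 + 0.05 * i for i in range(117)]   # 0.2 to 6.0
dens = marginal_sigma2(sigma2_grid, phi_grid, log_w, S2, m=12, k=2)
mode = sigma2_grid[dens.index(max(dens))]
```

For a fixed \(\phi \), the integrand is an inverse-gamma kernel in \(\sigma ^2\), so the upper bound of the \(\sigma ^2\) grid should be chosen wide enough to capture its heavy right tail.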
Marginal Posterior Regression Coefficients
The marginal posteriors of the individual regression coefficients \(\beta _{q}\), \(q = 1\ldots k\) are based on the joint posterior for the vector \(\varvec{\beta }\)
$$\begin{aligned} f_p(\varvec{\beta }| \bar{\varvec{z}}) \propto \int f_{0}(\phi ) \frac{ 1 }{\left| {\varvec{ \bar{C}}} \right| ^\frac{1}{2} \left[ (\bar{\varvec{z}} - \bar{\varvec{X}}{\varvec{\beta }})^\mathrm{T} \varvec{ \bar{C}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}{\varvec{\beta }}) \right] ^{\frac{m}{2}}} \mathop {}\!\mathrm {d}\phi . \end{aligned}$$
(17)
The integrand here can be shown to be proportional to a multivariate t distribution (Roth 2013) for \(\varvec{\beta }\) with \(m- k\) degrees of freedom, location (vector) parameter \(\hat{\varvec{\beta }}\) and scale (matrix) parameter
$$\begin{aligned} \begin{gathered} \varvec{\Sigma }_\beta = \frac{(\bar{\varvec{z}} - \bar{\varvec{X}}\hat{\varvec{\beta }})^\mathrm{T} \varvec{ \bar{C}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}\hat{\varvec{\beta }})}{m- k} (\bar{\varvec{X}}^\mathrm{T} \varvec{\varvec{ \bar{C}}}^{-1} \bar{\varvec{X}})^{-1}. \end{gathered} \end{aligned}$$
(18)
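A brief sketch of why this holds, assuming that \(\hat{\varvec{\beta }}\) in Eq. (5) is the generalised least squares estimate, so that \(\bar{\varvec{X}}^\mathrm{T} \varvec{ \bar{C}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}\hat{\varvec{\beta }}) = \varvec{0}\), and writing \(S^2\) (shorthand introduced here only for illustration) for the residual quadratic form in the numerator of Eq. (18). Completing the square in \(\varvec{\beta }\) gives
$$\begin{aligned} (\bar{\varvec{z}} - \bar{\varvec{X}}\varvec{\beta })^\mathrm{T} \varvec{ \bar{C}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}\varvec{\beta }) = S^2 + (\varvec{\beta } - \hat{\varvec{\beta }})^\mathrm{T} \bar{\varvec{X}}^\mathrm{T} \varvec{ \bar{C}}^{-1} \bar{\varvec{X}} (\varvec{\beta } - \hat{\varvec{\beta }}). \end{aligned}$$
Raising this to the power \(-m/2\) and factoring out \(S^2\) leaves a kernel in \(\varvec{\beta }\) proportional to
$$\begin{aligned} \left[ 1 + \frac{1}{\nu } (\varvec{\beta } - \hat{\varvec{\beta }})^\mathrm{T} \varvec{\Sigma }_\beta ^{-1} (\varvec{\beta } - \hat{\varvec{\beta }}) \right] ^{-(\nu + k)/2}, \quad \nu = m - k, \end{aligned}$$
which is the kernel of a \(k\)-variate t distribution. The remaining factor \((S^2)^{-m/2}\) depends on \(\phi \) but not on \(\varvec{\beta }\).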
This integrand can be marginalised to a scaled t-distribution for the individual regression coefficients, as an implicit function of \(\phi \), and rearranged to give
$$\begin{aligned} \begin{aligned} f_p(\beta _q | \bar{\varvec{z}}) \propto \int f_{p}(\phi | \bar{\varvec{z}}) t_\nu (\beta _q ; \hat{\beta }_q,\Sigma _q) \mathop {}\!\mathrm {d}\phi \end{aligned} \end{aligned}$$
(19)
with \(f_{p}(\phi | \bar{\varvec{z}})\) as indicated in Eq. (15) and where
$$\begin{aligned} \begin{aligned}&t_\nu (\beta _q ; \hat{\beta }_q,\Sigma _q)\\&\quad = \frac{\Gamma [(\nu +1)/2]}{\Gamma (\nu /2)(\nu \pi )^{1/2} \left| \Sigma _q \right| ^{1/2}} \left[ 1+ \frac{1}{\nu } (\beta _q - \hat{\beta }_q)^\mathrm{T} \Sigma _q^{-1}(\beta _q-\hat{\beta }_q) \right] ^{-(\nu +1)/2} \end{aligned} \end{aligned}$$
(20)
defines a t-distribution for \(\beta _q\) with degrees of freedom \(\nu = m - k\), location parameter \(\hat{\beta }_q\) and scale parameter \(\Sigma _q\), the qth element on the diagonal of \(\varvec{\Sigma }_\beta \). Note that the variance of this t-distribution is \(\Sigma _q \nu / (\nu -2)\), which is finite only for \(\nu > 2\).
Similarly to Sect. 3.1.2, BAK creates two-dimensional grids covering the parameter spaces of \(\phi \) and \(\beta _q\) (for all q) and applies the trapezoidal rule to calculate the integral over \(\phi \) in Eq. (19); finally, it normalises to get the marginal distributions for each individual \(\beta _q\).
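The integral in Eq. (19) might be approximated as in the following Python sketch. It assumes that the normalised posterior of \(\phi \) and the \(\phi \)-dependent location \(\hat{\beta }_q\) and scale \(\Sigma _q\) are already available on the \(\phi \) grid; the toy values and all names are hypothetical.

```python
import math

def t_pdf(x, nu, loc, scale2):
    """Scaled Student-t density of Eq. (20) for a scalar, with squared scale scale2."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi * scale2))
    return math.exp(log_c - 0.5 * (nu + 1) * math.log1p((x - loc) ** 2 / (nu * scale2)))

def trapezoid(y, x):
    """Trapezoidal rule for samples y on a (possibly uneven) grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))

def marginal_beta_q(beta_grid, phi_grid, post_phi, beta_hat, scale2, nu):
    """Eq. (19): integrate the t density against the posterior of phi, then normalise."""
    dens = []
    for b in beta_grid:
        integrand = [post_phi[j] * t_pdf(b, nu, beta_hat[j], scale2[j])
                     for j in range(len(phi_grid))]
        dens.append(trapezoid(integrand, phi_grid))
    norm = trapezoid(dens, beta_grid)
    return [d / norm for d in dens]

# Toy inputs (illustrative values only); post_phi integrates to one on phi_grid.
phi_grid = [1.0, 2.0, 3.0]
post_phi = [0.2, 0.8, 0.2]
beta_hat = [0.9, 1.0, 1.1]        # location per phi value
scale2 = [0.04, 0.05, 0.06]       # Sigma_q per phi value
nu = 10                           # m - k
beta_grid = [-3.0 + 0.01 * i for i in range(801)]
dens_beta = marginal_beta_q(beta_grid, phi_grid, post_phi, beta_hat, scale2, nu)
mode_beta = beta_grid[dens_beta.index(max(dens_beta))]
```

Because the t density has heavy tails for small \(\nu \), the \(\beta _q\) grid should extend many scale units beyond the range of \(\hat{\beta }_q\) values before the final normalisation.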
Posterior Predictive Distribution
The conditional distribution for the variable of interest, given the data and any particular value of the distance parameter, \(f({\varvec{z}^*} | \bar{\varvec{z}}, \phi )\), is a t-distribution with degrees of freedom \(\nu = m- k\), with location parameter \(\hat{z}^{*} | \phi \)—an implicit function of \(\phi \)—already given in Eq. (7), and with scale parameter as provided in Eq. A85 in Online Resource A. The variance of this conditional distribution, also a function of \(\phi \), is given by
$$\begin{aligned} \begin{aligned}&\hbox {var}[\hat{{\varvec{z}^*} } | \phi ] = \frac{m- k}{m- k- 2} \frac{(\bar{\varvec{z}} - \bar{\varvec{X}}\hat{\varvec{\beta }})^\mathrm{T} {\varvec{ \bar{C}}}^{-1} (\bar{\varvec{z}} - \bar{\varvec{X}}\hat{\varvec{\beta }}) }{m- k} \\&\quad \times \left\{ {\varvec{C}^{**}}- {\bar{\varvec{C}}^*}\varvec{ \bar{C}}^{-1} ({\bar{\varvec{C}}^*})^\mathrm{T} + (\varvec{X}^* - {\bar{\varvec{C}}^*}\varvec{ \bar{C}}^{-1} \bar{\varvec{X}} ) (\bar{\varvec{X}} ^\mathrm{T} \varvec{ \bar{C}}^{-1} \bar{\varvec{X}} )^{-1} (\varvec{X}^* - {\bar{\varvec{C}}^*}\varvec{ \bar{C}}^{-1} \bar{\varvec{X}} )^\mathrm{T} \right\} . \end{aligned} \end{aligned}$$
(21)
Table 1 Geostatistical approaches from maximum likelihood to full Bayesian. Corresponding to their universal kriging counterparts, \({\hat{\varvec{z}}}^*_\mathrm{RK}\) and \({\hat{\varvec{v}}}^*_\mathrm{RK}\) indicate the regression kriging and regression kriging variance, respectively.
Note that Eq. (21) is an increased universal kriging variance [see for comparison Eq. (8)] because the uncertainty in \(\sigma ^2\) is also considered, hence the increment expressed in the first fraction. The second fraction equals the REML estimate for \(\sigma ^2\) given \(\phi \).
The posterior predictive distribution is defined as an integral of the above conditional distribution with respect to the posterior distribution of the distance parameter,
$$\begin{aligned} \begin{aligned} f_p({\varvec{z}^*} | \bar{\varvec{z}}) = \int _\phi f({\varvec{z}^*} | \bar{\varvec{z}}, \phi ) f_p(\phi | \bar{\varvec{z}}) \mathop {}\!\mathrm {d}\phi , \end{aligned} \end{aligned}$$
(22)
which is numerically approximated. BAK first creates for each prediction point \(s^*\) a vector of predictions and a vector of corresponding prediction variances, both as a function of \(\phi \). Finally, the algorithm calculates the mean and variance of the posterior predictive distribution (or, more formally, of a finite mixture distribution that approximates this distribution, with weights defined based on \(f_p (\phi | \bar{\varvec{z}})\) and the spacings of the \(\phi \) parameter grid).
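The summary step for a single prediction point can be sketched in Python as below. The text states only that the mixture weights are based on \(f_p (\phi | \bar{\varvec{z}})\) and the grid spacings; the trapezoidal weighting used here is therefore an assumption, and all names are hypothetical. The mixture variance follows from the law of total variance.

```python
def trapezoid_weights(x, dens):
    """Mixture weights: density times trapezoidal cell width, normalised to sum to one.

    ASSUMPTION: half the distance to each neighbouring grid value is used as the
    cell width, which reproduces the trapezoidal rule.
    """
    n = len(x)
    w = []
    for i in range(n):
        left = x[i] - x[i - 1] if i > 0 else 0.0
        right = x[i + 1] - x[i] if i < n - 1 else 0.0
        w.append(dens[i] * 0.5 * (left + right))
    total = sum(w)
    return [v / total for v in w]

def mixture_mean_var(weights, means, variances):
    """Mean and variance of a finite mixture with the given component moments."""
    mean = sum(w * mu for w, mu in zip(weights, means))
    # Within-component variance plus between-component spread of the means.
    var = sum(w * (v + (mu - mean) ** 2)
              for w, mu, v in zip(weights, means, variances))
    return mean, var

# Toy values for one prediction point (illustrative only).
phi_grid = [1.0, 2.0, 3.0]
post_phi = [0.2, 0.6, 0.2]            # posterior density of phi on the grid
pred_given_phi = [10.0, 11.0, 12.0]   # conditional predictions, cf. Eq. (7)
var_given_phi = [4.0, 5.0, 6.0]       # conditional variances, cf. Eq. (21)

w = trapezoid_weights(phi_grid, post_phi)
pred, pred_var = mixture_mean_var(w, pred_given_phi, var_given_phi)
```

The second term inside the variance sum is what distinguishes the fully Bayesian prediction variance from a plug-in variance: it adds the spread of the conditional predictions across plausible values of \(\phi \).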
Methodological Details
In this work, a number of increasingly Bayesian approaches to ATPK are applied and compared (Table 1). The first three rows of the table represent plug-in approaches for some of the parameters: the stated parameters are first estimated by maximising a likelihood or marginal likelihood function, and are then plugged into the relevant predictive distribution equation for prediction. The final row represents the fully Bayesian approach.
In maximum likelihood estimation (ML; not implemented in this work), all parameters (in the geostatistical context, the regression coefficients and spatial covariance parameters) are estimated by analytically or numerically maximising the likelihood. This general approach was consolidated by Fisher almost a century ago (Stigler 2007) and has been applied in geostatistics by, for example, Kitanidis (1983) and Lark (2000).
REML, which has been advocated for several decades in geostatistics, is based on a likelihood function for a set of projected data rather than the original data, and gives conditionally unbiased estimates for the spatial parameters (Webster and Oliver 2007; Lark and Cullis 2004); see also Sect. A2 in Online Resource A. REML represents a form of marginal likelihood (a likelihood function in which some parameters have been marginalised), and has been presented as such in a Bayesian framework, namely as the integral of the likelihood function with respect to the trend parameters, assuming a flat improper prior for these parameters (Harville 1974). Note that omitting \(f_0(\varvec{\beta })\) in this way amounts to assuming an uninformative prior; this is often valid for centrality parameters but not for other parameters. UK rests on the same approach in that it takes the uncertainty in the trend coefficients into account, which makes it a logical combination with REML. Within this research, the combined application of REML and UK is referred to as the ‘REML approach’.
The next gradation towards the fully Bayesian approach is maximum likelihood with both trend and variance integrated out, in the context of this paper indicated by the generic term ‘maximum marginal likelihood’ (MML). Finally, the full Bayesian approach (also referred to as ‘Bayesian approach’) provides a posterior distribution of all parameters, while in the prediction all parameters are integrated out and the uncertainty of all parameters is taken into account.
In the following sections, REML, MML and the Bayesian approach are compared, and for the Bayesian approach different priors for \(\phi \) (as defined in the following section) are applied. All algorithms (including the central BAK algorithm as presented in Fig. 1) are written in the statistical programming language R, and are available at Steinbuch et al. (2019).
Prior Distributions for \(\phi \)
For \(f_0(\phi )\), three potential forms of prior distribution are compared, intended to represent limited prior knowledge. These are (1) a uniform prior with limited bounds; (2) the reference prior as suggested by Berger et al. (2001) for analysis of point-support data, applied in the context of areal-support data, and explained in Online Resource B; and—in the simulation ensemble—(3) an inverse-gamma distribution. The bounded uniform and the inverse-gamma distributions are proper; the assumed propriety of the reference prior will be discussed later.
Estimation and Prediction with REML and MML
For REML, the approach as described by Brus et al. (2018) was applied. For MML, the posterior mode of \(\phi \) was calculated using the Bayesian approach with a uniform prior for \(\phi \). The predictive distribution was then defined conditionally on this value of \(\phi \). Mathematically, this equals integrating out \(\varvec{\beta }\) and \(\sigma ^2\) to arrive at an estimate \(\hat{\phi }\), which is subsequently used as a single plug-in value for MML prediction; the mean and variance of the predictive distribution (representing the prediction and prediction variance) are shown in Table 1 and Eqs. (7) and (21).
Estimating Average Covariances
The average correlation matrices, \(\bar{\varvec{C}} \) and \(\bar{\varvec{C}} ^*\), can be approximated in different ways. In this research, many discretisation points within each area are defined and the relevant Euclidean distances between those points are calculated, followed by construction of the corresponding correlation matrix, based on the correlation function—such as given in Eq. (2)—and distance parameter \(\phi \). Then, all correlations per area-area combination are averaged to arrive at \(\bar{\varvec{C}} \), and per area-prediction point combination to arrive at \(\bar{\varvec{C}} ^*\). The discretisation points were on a regular grid in the simulation study, and selected by simple random sampling in the two case studies.
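The discretisation procedure might look as follows in Python. This is a sketch under stated assumptions: areas are taken to be axis-aligned squares, discretisation points lie on a regular grid of cell centres, and an exponential correlation function is used as a stand-in for Eq. (2). All names are hypothetical.

```python
import math
from itertools import product

def exp_corr(h, phi):
    """ASSUMPTION: exponential correlation as a stand-in for Eq. (2)."""
    return math.exp(-h / phi)

def regular_grid(xmin, ymin, size, n):
    """n x n discretisation points at the cell centres of a square area."""
    step = size / n
    return [(xmin + (i + 0.5) * step, ymin + (j + 0.5) * step)
            for i in range(n) for j in range(n)]

def avg_corr(points_a, points_b, phi):
    """Average correlation over all point pairs between two point sets."""
    total = sum(exp_corr(math.dist(p, q), phi)
                for p, q in product(points_a, points_b))
    return total / (len(points_a) * len(points_b))

def area_to_area(areas, phi):
    """Area-to-area average correlation matrix (C-bar)."""
    return [[avg_corr(a, b, phi) for b in areas] for a in areas]

def area_to_point(pred_points, areas, phi):
    """Prediction-point-to-area average correlations (C-bar-star)."""
    return [[avg_corr([s], a, phi) for a in areas] for s in pred_points]

# Two adjacent unit squares, each discretised by 5 x 5 points.
areas = [regular_grid(0.0, 0.0, 1.0, 5), regular_grid(1.0, 0.0, 1.0, 5)]
C_bar = area_to_area(areas, phi=2.0)
C_bar_star = area_to_point([(0.5, 0.5)], areas, phi=2.0)
```

Note that the diagonal of the resulting matrix is below one, because the within-area average includes pairs of distinct points; this reflects the variance reduction caused by areal averaging.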
Validation
To quantify the performance of each approach—for the simulation study and synthetic case study, where the original point data were available—the predictions \(\varvec{z}^*\) and prediction uncertainties \(\varvec{v}^*\) were assessed in relation to the original signal \(\varvec{z}\). As an indication of the quality of the prediction, the root mean squared error (RMSE) defined as
$$\begin{aligned} \hbox {RMSE} = \sqrt{\frac{1}{n} \sum _{i=1}^{n} \left\{ z(s_i)-z^*(s_i) \right\} ^2} \end{aligned}$$
(23)
was calculated, where a smaller number indicates more accurate predictions (Oliver and Webster 2014). For comparison, a baseline approach was also included, for which point predictions were defined simply by the areal mean data for the corresponding area. Unbiasedness of predictions was tested using the mean error (ME), \(\frac{1}{n} \sum _{i=1}^{n} \left\{ z(s_i)-z^*(s_i) \right\} \). The mass-preserving property (MPP) of the predictions was checked, which states that, in the case of ATPK, the mean of all predictions in any observed area should equal the observed areal mean (Kyriakidis 2004). This check was summarised by showing the maximum observed difference between areal-average data and the mean of the corresponding predictions.
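These prediction checks can be sketched in a few lines of Python. The function names, the area labels and the toy values are hypothetical; the toy predictions correspond to the baseline approach, in which each point is predicted by its areal mean.

```python
import math

def rmse(z, z_pred):
    """Root mean squared error, Eq. (23)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_pred)) / len(z))

def mean_error(z, z_pred):
    """Mean error (ME); values near zero indicate unbiased predictions."""
    return sum(a - b for a, b in zip(z, z_pred)) / len(z)

def max_mpp_difference(area_ids, z_pred, areal_means):
    """Largest absolute gap between an observed areal mean and the mean of the
    predictions that fall within that area."""
    sums, counts = {}, {}
    for area, pred in zip(area_ids, z_pred):
        sums[area] = sums.get(area, 0.0) + pred
        counts[area] = counts.get(area, 0) + 1
    return max(abs(areal_means[a] - sums[a] / counts[a]) for a in sums)

# Toy example: two areas, two points each (illustrative values only).
z_true = [1.0, 2.0, 3.0, 4.0]
area_ids = ["a", "a", "b", "b"]
areal_means = {"a": 1.5, "b": 3.5}
z_baseline = [areal_means[a] for a in area_ids]   # baseline: areal mean everywhere

print(rmse(z_true, z_baseline))                               # 0.5
print(mean_error(z_true, z_baseline))                         # 0.0
print(max_mpp_difference(area_ids, z_baseline, areal_means))  # 0.0
```

The baseline satisfies the mass-preserving property by construction, so it gives a useful reference value for the RMSE of the ATPK predictions.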
As an indication of the quality of the prediction uncertainty, a motivating factor for this work, the standardised squared error (StSE) defined as
$$\begin{aligned} \hbox {StSE}(s_i) = \frac{\left\{ z(s_i)-z^*(s_i) \right\} ^2}{v^*(s_i)} \end{aligned}$$
(24)
was calculated. This StSE should ideally have a mean of one (Lark 2000). Higher values indicate an underestimation of uncertainty, which is labelled ‘optimistic’, and lower values indicate an overestimation of uncertainty, labelled ‘conservative’.
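A short Python sketch of this check follows, with hypothetical names and toy values. It computes the StSE of Eq. (24) per validation point and reports the mean, which is then compared with the ideal value of one.

```python
def stse(z, z_pred, v_pred):
    """Standardised squared error per validation point, Eq. (24)."""
    return [(a - b) ** 2 / v for a, b, v in zip(z, z_pred, v_pred)]

def mean_stse(z, z_pred, v_pred):
    values = stse(z, z_pred, v_pred)
    return sum(values) / len(values)

# Toy example (illustrative values only).
z_true = [1.0, 2.0, 3.0]
z_pred = [1.5, 2.0, 2.0]
v_pred = [0.25, 1.0, 0.5]

m_stse = mean_stse(z_true, z_pred, v_pred)
# Mean above one: uncertainty underestimated ('optimistic');
# mean below one: uncertainty overestimated ('conservative').
label = "optimistic" if m_stse > 1 else "conservative" if m_stse < 1 else "ideal"
print(m_stse, label)   # 1.0 ideal
```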