
A hierarchical approach to scalable Gaussian process regression for spatial data


Large scale and highly detailed geospatial datasets currently offer rich opportunities for empirical investigation, where finer-level investigation of spatial spillovers and spatial infill can now be done at the parcel level. Gaussian process regression (GPR) is particularly well suited for such investigations, but is currently limited by its need to manipulate and store large dense covariance matrices. The central purpose of this paper is to develop a more efficient version of GPR based on the hierarchical covariance approximation proposed by Chen et al. (J Mach Learn Res 18:1–42, 2017) and Chen and Stein (Linear-cost covariance functions for Gaussian random fields, arXiv:1711.05895, 2017). We provide a novel probabilistic interpretation of Chen’s framework, and extend his method to the analysis of local marginal effects at the parcel level. Finally, we apply these tools to a spatial dataset constructed from a 10-year period of Oklahoma County Assessor databases. In this setting, we are able to identify both regions of possible spatial spillovers and spatial infill, and to show more generally how this approach can be used for the systematic identification of specific development opportunities.


Large scale datasets such as County Assessor’s geodatabases offer novel opportunities to investigate spatial phenomena at much finer levels of resolution than in the past. Spatial spillovers, urban infill, renovation price effects and proximity to investment/amenity zones can now be examined at the parcel level, opening up new avenues for the bulk identification of specific development opportunities. The central purpose of this paper is to develop an approach to analyzing such data in an efficient manner. Our approach starts with Gaussian process regression (GPR), which is a well known prediction tool for analyzing spatial datasets. Moreover, the smooth nature of its prediction surfaces is particularly well suited for identifying the local marginal effects (LME) of key explanatory variables (as developed in Dearmon and Smith 2016, 2017). It is these effects that will allow an examination of more fine-grained spatial phenomena, such as the local development opportunities mentioned above.

However, the application of such GPR methods to large data sets has thus far been limited by the need to invert large dense covariance matrices. Thus, it is not surprising that this practical limitation has led to a variety of methods for approximating GPR models by more efficiently computable versions (as reviewed for example in Chen et al. 2017). In the present paper, we focus on one of the most promising of these approaches, namely the development of a hierarchical covariance approximation to GPR by Jie Chen ([C1] = Chen et al. 2017; [C2] = Chen and Stein 2017), which we denote by GPR-HCA. This hierarchical extension of Nyström’s low-rank approximation yields dramatic improvement in both speed and accuracy of predictions. In fact, this approximation allows matrix inversions that achieve the optimal efficiency level of \(O(n)\), i.e., are linear in the matrix dimension, n. Of equal importance, these approximations are guaranteed to yield positive definite matrices that generate well-defined Gaussian Processes. So, from a methodological perspective, our central objective is to extend such approximations to the analysis of local marginal effects in large-data contexts.

To do so, we begin in Sect. 2 with a review of the standard Gaussian Process Regression model, and in particular, its associated local marginal effects. In Sect. 3, we then develop the GPR-HCA method in detail. One contribution of this paper is to give an explicit probabilistic interpretation of this method, which we illustrate for two- and three-level hierarchies. In addition, we highlight some of the key auxiliary tools proposed by Chen ([C1],[C2]) which are particularly useful for our LME extensions. In Sect. 4, we test both the accuracy and scalability of this hierarchical approach by constructing a simple two-variable simulation model that allows for visual as well as numerical comparisons with other methods. Here we begin by comparing GPR-HCA with the standard Gaussian process regression model (GPR-FULL) over sample sizes small enough to allow the full version to be run. In addition, we compare GPR-HCA with two other large-scale prediction models for sample sizes up to half a million. Of particular relevance is the nearest-neighbor approximation of Gaussian processes (NNGP) first introduced by Datta et al. (2016), which also yields covariance approximations that are linear in matrix dimension and generate well-defined Gaussian Processes. We also compare GPR-HCA performance with one of the standard machine learning algorithms, namely the generalized boosted models (GBM) algorithm of Ridgeway (2007). In all cases we find comparable predictive performance, and much improved time costs over GPR-FULL in particular.

However, while such comparative tests are important, they are not of primary interest for our present purposes. More important is the technical extension of GPR-HCA to the evaluation of LME’s for large data sets. Within the same simulation framework, such estimated LME’s are shown to accurately replicate the derivatives of well-behaved functions corrupted by noise. We then turn to an empirical application in Sect. 5, where these HCA-tools are applied to the difficult and often ill-behaved relationship between house prices and attributes using data obtained from nearly a decade’s worth of County Assessor’s databases in Oklahoma County. In particular, we focus on two distinct regions of Oklahoma County; one just north of downtown where spatial spillovers appear to be present and the other a small, wealthy municipality, located further north, where spatial infill opportunities appear to exist. We investigate and analyze such phenomena using GPR-HCA, and provide confirmatory evidence of our findings using building permit data. Finally, we conclude in Sect. 6 with a brief discussion of several possible extensions of this work that are of both practical and technical importance.

Gaussian process regression

Given a spatial process with response variable, \(Y_{l}\), on a domain, \(S = \{ x_{l} = (x_{l1} , \ldots ,x_{ld} )\} \subseteq {\mathbb{R}}^{d}\) of possible explanatory variables [including the spatial coordinates of location, \(l\)], we start by assuming that stochastic variations in observed values of \(Y_{l}\) about their common mean, \(\mu\), are governed by an underlying (latent) zero-mean Gaussian process, \(f:S \to {\mathbb{R}}\), with observed values (measurements) corrupted by independent additive Gaussian noise,

$$Y_{l} = \mu + f(x_{l} ) + \varepsilon_{l} ,\quad \varepsilon_{l} \sim N\left( {0,\sigma^{2} } \right)$$

In essence this implies that latent responses, \(f = (f_{l} :\,l = 1, \ldots ,n)\), at any finite set of locations with associated explanatory variables, \(X = (x_{l} :l = 1, \ldots ,n)\) are multi-normally distributed as

$$f \sim N\left[ {0_{n} ,\,K(X,X)} \right]$$

with covariance matrix, \(K(X,X) = [k(x_{i} ,x_{j} ):i,j = 1, \ldots ,n]\), generated by a kernel function, \(k(x_{l} ,x_{h} )\,\,[ \equiv {\text{cov}} (f_{l} ,f_{h} )]\), depending only on the attribute profiles of response variates. By (1) this implies that the resulting observed responses, \(Y = (Y_{l} :\,l = 1, \ldots ,n)\), are distributed as

$$Y \sim N\left[ {\mu 1_{n} ,K\left( {X,X} \right) + \sigma^{2} I_{n} } \right]$$

where \(1_{n}\) and \(I_{n}\) denote respectively the unit vector and identity matrix of size n. To model spatial covariance, we employ the standard (anisotropic) squared exponential (SE) kernel function:

$$k(x_{l} ,x_{h} ) = v\,\exp \left[ { - \sum\nolimits_{i = 1}^{d} {\tfrac{1}{{2\tau_{i}^{2} }}\left( {x_{li} - x_{hi} } \right)^{2} } } \right]$$

where \(v\) denotes the common variance of all responses, i.e., \({\text{var}} (f_{l} )\,\, = \,\,k(x_{l} ,x_{l} ) = v\), and where each length-scale parameter, \(\tau_{j} > 0\), governs the degree to which variable, \(x_{j}\), influences covariance (Footnote 1).
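As a minimal numerical sketch of the kernel in (4), the following NumPy function evaluates \(K(X_{A} ,X_{B} )\) for two sets of attribute profiles. The function name `se_kernel`, the sample points and the parameter values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def se_kernel(XA, XB, v, tau):
    """Anisotropic squared exponential kernel matrix K(XA, XB).

    XA: (na, d) array, XB: (nb, d) array, v: common variance,
    tau: (d,) array of length scales (all names are illustrative).
    """
    A = np.atleast_2d(XA) / tau          # rescale each attribute by its length scale
    B = np.atleast_2d(XB) / tau
    # sum over attributes of squared scaled differences, for every point pair
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return v * np.exp(-0.5 * sq)

# Hypothetical example: 5 points with d = 2 attributes
rng = np.random.default_rng(0)
X = rng.uniform(size=(5, 2))
tau = np.array([0.5, 1.5])
K = se_kernel(X, X, v=2.0, tau=tau)
```

A larger length scale \(\tau_{j}\) makes the covariance less sensitive to differences in attribute \(j\), which is the sense in which these parameters govern the influence of each variable.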

With these assumptions, the fundamental Gaussian Process Regression (GPR) problem is to obtain the predictive (conditional) distribution of latent responses, \(f_{*} = f(X_{*} )\), at \(n_{t}\) test locations with attributes, \(X_{*} = (x_{*l} :l = 1, \ldots ,n_{t} )\), given observed responses, \(Y = (Y_{1} , \ldots ,Y_{n} )\), at n training locations with attributes, \(X = (x_{l} :l = 1, \ldots ,n)\). If we start with the joint distribution,

$$\left( {\begin{array}{*{20}c} {f_{*} } \\ Y \\ \end{array} } \right) \sim N\left[ {\left( {\begin{array}{*{20}c} {0_{{n_{t} }} } \\ {\mu \,1_{n} } \\ \end{array} } \right),\left( {\begin{array}{*{20}c} {K(X_{*} ,X_{*} )} & {K(X_{*} ,X)} \\ {K(X,X_{*} )} & {K(X,X) + \sigma^{2} I_{n} } \\ \end{array} } \right)} \right]$$

then the desired (conditional) predictive distribution is well known to be multi-normal

$$f_{*} |Y \sim N\left[ {E\left( {f_{*} |Y} \right),{\text{cov}} \left( {f_{*} |Y} \right)} \right]$$

with conditional mean and covariance (Footnote 2),

$$E\left( {f_{*} |Y} \right) = K\left( {X_{*} ,X} \right)\left[ {K(X,X) + \sigma^{2} I_{n} } \right]^{ - 1} \,\left( {Y - \mu } \right)$$
$${\text{cov}} \left( {f_{*} |Y} \right) = K\left( {X_{*} ,X_{*} } \right) - K\left( {X_{*} ,X} \right)\left[ {K(X,X) + \sigma^{2} I_{n} } \right]^{ - 1} K\left( {X,X_{*} } \right)$$
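For small samples, the conditional mean and covariance above can be evaluated directly. The sketch below does so with a Cholesky factorization of \(K(X,X) + \sigma^{2} I_{n}\) rather than an explicit inverse. The function names, the simulated data and the parameter values are assumptions made purely for illustration, and \(\mu\) is simply set to the sample mean.

```python
import numpy as np

def se_kernel(XA, XB, v, tau):
    """Anisotropic squared exponential kernel matrix K(XA, XB)."""
    A, B = np.atleast_2d(XA) / tau, np.atleast_2d(XB) / tau
    return v * np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def gpr_predict(X, Y, Xs, mu, v, tau, sigma2):
    """Conditional mean and covariance of f_* given Y (dense, O(n^3))."""
    C = se_kernel(X, X, v, tau) + sigma2 * np.eye(len(X))
    Ks = se_kernel(Xs, X, v, tau)                      # K(X_*, X)
    L = np.linalg.cholesky(C)                          # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y - mu))
    mean = Ks @ alpha                                  # K(X_*,X) C^{-1} (Y - mu)
    V = np.linalg.solve(L, Ks.T)                       # L^{-1} K(X, X_*)
    cov = se_kernel(Xs, Xs, v, tau) - V.T @ V          # K(X_*,X_*) - K(X_*,X) C^{-1} K(X,X_*)
    return mean, cov

# Hypothetical data: 30 training points, 4 test points, d = 2
rng = np.random.default_rng(1)
X = rng.uniform(size=(30, 2))
Y = 3.0 + np.sin(4 * X[:, 0]) + 0.1 * rng.standard_normal(30)
Xs = rng.uniform(size=(4, 2))
tau = np.array([0.3, 0.3])
mean, cov = gpr_predict(X, Y, Xs, mu=Y.mean(), v=1.0, tau=tau, sigma2=0.01)
```

Since \(f\) is a zero-mean process, predictions of the responses themselves are obtained by adding \(\mu\) back to `mean`. The cubic cost of the factorization is exactly the scaling limitation addressed in the next section.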

In a manner similar to Dearmon and Smith (2017), we also consider local marginal effects (LME),

$$\frac{{\partial E\left( {f_{*} |Y} \right)}}{{\partial x_{*l,j} }} = \frac{{\partial K\left( {x_{*l} ,X} \right)}}{{\partial x_{*l,j} }}\left[ {K(X,X) + \sigma^{2} I_{n} } \right]^{ - 1} \left( {Y - \mu } \right),\quad l = 1, \ldots ,n_{t}$$

capturing the expected impact of small changes in individual attributes, \(j = 1, \ldots ,d\), such as the impact of an additional square foot on the expected sales price of a given house with a specific set of attributes. Given our present interest in such local marginal effects, the continuous differentiability of the squared exponential kernel makes it particularly well suited for such analyses (Footnote 3).
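For the squared exponential kernel in (4), the kernel derivative in (9) has the closed form \(\partial k(x_{*} ,x_{l} )/\partial x_{*j} = - k(x_{*} ,x_{l} )(x_{*j} - x_{lj} )/\tau_{j}^{2}\). The following sketch evaluates local marginal effects in this way and compares them with central finite differences of the conditional mean. The function names, simulated data and parameter values are illustrative assumptions only.

```python
import numpy as np

def se_kernel(XA, XB, v, tau):
    """Anisotropic squared exponential kernel matrix K(XA, XB)."""
    A, B = np.atleast_2d(XA) / tau, np.atleast_2d(XB) / tau
    return v * np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def lme(X, alpha, Xs, v, tau):
    """Local marginal effects: one row per test point, one column per attribute."""
    Ks = se_kernel(Xs, X, v, tau)                  # (nt, n)
    diff = Xs[:, None, :] - X[None, :, :]          # (nt, n, d)
    dK = -Ks[:, :, None] * diff / tau ** 2         # derivative of k w.r.t. x_{*j}
    return np.einsum("tnd,n->td", dK, alpha)       # contract with C^{-1}(Y - mu)

# Hypothetical data
rng = np.random.default_rng(2)
X = rng.uniform(size=(30, 2))
Y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(30)
tau, v, sigma2 = np.array([0.3, 0.3]), 1.0, 0.01
alpha = np.linalg.solve(se_kernel(X, X, v, tau) + sigma2 * np.eye(30), Y - Y.mean())
Xs = rng.uniform(size=(3, 2))
effects = lme(X, alpha, Xs, v, tau)

# Central finite-difference check of the same derivatives
h = 1e-5
fd = np.zeros_like(effects)
for j in range(2):
    step = np.zeros(2)
    step[j] = h
    fd[:, j] = (se_kernel(Xs + step, X, v, tau) - se_kernel(Xs - step, X, v, tau)) @ alpha / (2 * h)
```

Note that the vector `alpha` is shared by the conditional mean and all of its derivatives, so that local marginal effects add little cost once the linear system has been solved.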

Hierarchical covariance approximation

For purposes of model calibration and prediction, a key scaling issue that arises is the size of the inverse to be calculated in (7), (8) and (9). Assuming that \(n\) is large, the objective of Chen’s procedure is to construct a hierarchical approximation to the \(n\)-square covariance matrix, \(K\). The approach starts by partitioning domain \(S\) into a collection of basic subdomains, \(S_{i} ,i = 1, \ldots ,b\), where each subset of sample points, \(X_{i} = S_{i} \cap X = [x_{i1} , \ldots ,x_{{in_{i} }} ]\), is sufficiently small to ensure that the associated covariance matrix, \(K_{ii} = K(X_{i} ,X_{i} )\), can easily be inverted. (Note that for notational simplicity, we have now dropped references to individual spatial locations, \(l\)). The second step is to approximate the covariances,

$$K_{ij} = K\left( {X_{i} ,X_{j} } \right) = \left[ {k\left( {x_{i} ,x_{j} } \right):x_{i} \in X_{i} ,x_{j} \in X_{j} } \right],\quad i,j = 1, \ldots ,b\;(i \ne j)$$

between distinct subdomains in terms of their mutual covariances with smaller sets of “landmark” points (Footnote 4). These concepts are best illustrated by simple examples.

Two-level hierarchical example

The simplest example involves a partitioning of \(S\) into two subdomains, \(S_{1}\) and \(S_{2}\), as illustrated in Fig. 1 below, where for graphical convenience we show only the spatial coordinates (\(d = 2\)).

Fig. 1

Two-level partition

To approximate the covariances between points in these two subdomains, one selects a small representative subset of points, \(X_{r} = [x_{r1} , \ldots ,x_{{r\,n_{r} }} ] \subset X_{1} \cup X_{2}\), designated as landmark points for \(X_{1}\) and \(X_{2}\). In this case, \(X_{r}\), is associated with the full domain, \(S = S_{1} \cup S_{2} \equiv S_{r}\). Moreover, given the hierarchical relations among these three domains (with respect to set containment, \(\subseteq\)), Fig. 1 can also be represented as a two-level tree structure with root, \(S_{r}\), and leaves, \((S_{1} ,S_{2} )\), as shown in Fig. 2. This underlying tree structure is of fundamental importance in the recursive calculation of the covariance approximations discussed below.

Fig. 2

Tree representation

In terms of these landmark points, Chen’s hierarchical approximation to \(K_{12}\) in (10) is given by Nyström’s (\(n_{r}\)-rank) approximation,

$$K_{12}^{H} = K_{1r} K_{rr}^{ - 1} K_{r2} = K\left( {X_{1} ,X_{r} } \right)K\left( {X_{r} ,X_{r} } \right)^{ - 1} K\left( {X_{r} ,X_{2} } \right) = \left( {K_{21}^{H} } \right)^{T}$$

where H denotes “hierarchical”. Note from the positive definiteness of the full covariance matrix, \(K\), that \(K_{rr}^{ - 1}\) is well defined and is also positive definite. In these terms, the full hierarchical approximation of \(K\) is given by

$$K^{H} = \left( {\begin{array}{*{20}c} {K_{11}^{H} } & {K_{12}^{H} } \\ {K_{21}^{H} } & {K_{22}^{H} } \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {K_{11} } & {K_{1r} K_{rr}^{ - 1} K_{r2} } \\ {K_{2r} K_{rr}^{ - 1} K_{r1} } & {K_{22} } \\ \end{array} } \right)$$

This is essentially the example in expression (4) of [C2] with only two subdomains. Note that since the diagonal blocks, \(K_{11}\) and \(K_{22}\), are themselves positive definite, it is not surprising that the overall approximation is of full rank, even though the off-diagonal approximations are not. What is far less obvious is that this approximation is actually positive definite, i.e., is itself a full-rank covariance matrix. While the proof of positive definiteness in this two-level case is a simple consequence of Schur Complementarity ([C1], Theorem 3), the higher-level cases developed below are considerably more subtle.
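The construction in (11) and (12) can be checked numerically on a small example. The sketch below builds \(K^{H}\) for a regular grid of 36 points split into two subdomains, with 9 landmark points drawn from the sample, and then inspects the rank of the off-diagonal block and the smallest eigenvalue of \(K^{H}\). The grid, the length scale and the choice of landmark points are assumptions made for illustration only.

```python
import numpy as np

def se_kernel(XA, XB, v=1.0, tau=0.15):
    """Squared exponential kernel with a single (assumed) length scale."""
    A, B = np.atleast_2d(XA) / tau, np.atleast_2d(XB) / tau
    return v * np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

# 6 x 6 grid on the unit square, split into left and right subdomains
g = np.linspace(0.0, 1.0, 6)
X = np.array([(a, b) for a in g for b in g])
left = X[:, 0] < 0.5
X1, X2 = X[left], X[~left]            # 18 points each
Xr = X[::4]                           # 9 landmark points taken from the sample

K11, K22 = se_kernel(X1, X1), se_kernel(X2, X2)
K1r, K2r, Krr = se_kernel(X1, Xr), se_kernel(X2, Xr), se_kernel(Xr, Xr)

# Nystrom approximation of the off-diagonal block: K_1r K_rr^{-1} K_r2
K12_H = K1r @ np.linalg.solve(Krr, K2r.T)
KH = np.block([[K11, K12_H], [K12_H.T, K22]])

rank12 = np.linalg.matrix_rank(K12_H, tol=1e-10)   # at most n_r
min_eig = np.linalg.eigvalsh(KH).min()             # positive if KH is positive definite
max_err = np.abs(KH - se_kernel(np.vstack([X1, X2]), np.vstack([X1, X2]))).max()
```

Here the diagonal blocks of `KH` are exact, the off-diagonal block has rank no greater than the number of landmark points, and `max_err` measures how much is lost by replacing \(K_{12}\) with its low-rank approximation.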

Probabilistic interpretation

With this in mind, it is instructive to develop a direct probabilistic approach to these hierarchical approximations, i.e., a full-dimensional Gaussian probability model with precisely this covariance, \(K^{H}\). To do so, we start with the latent process, \(f\sim N(0,K)\), in (2) and let \(f_{i} = f(X_{i} )\,\,,\,\,i = 1,2,r\). To approximate the covariance, \(K_{12}\), between \(f_{1}\) and \(f_{2}\) in terms of their relations with \(f_{r}\), we then consider their conditional means and covariances

$$E\left( {f_{i} |f_{r} } \right) = K_{ir} K_{rr}^{ - 1} f_{r} ,\quad i = 1,2$$
$${\text{cov}} \left( {f_{i} |f_{r} } \right) = K_{ii} - K_{ir} K_{rr}^{ - 1} K_{ri} ,\quad i = 1,2$$

which are essentially obtained from (7) and (8) by setting \(\sigma^{2} = 0\). A key feature of the multi-normal distribution is that while the conditional mean in (13) depends on the value of \(f_{r}\), the conditional covariance in (14) does not. This plays a crucial role in the following construction. As a first step, if we now designate the following zero-mean version of \(f_{i} |f_{r} \,\) as a centered conditional,

$$Z_{i\,|r} \sim N\left( {0_{{n_{i} }} ,K_{ii} - K_{ir} K_{rr}^{ - 1} K_{ri} } \right),\quad i = 1,2$$

then since \(f_{r}\) does not appear in the distribution of \(Z_{i|r}\), we may choose \(Z_{1|r}\) and \(Z_{2|r}\) to be independent not only of one another but also of \(f_{r}\). For notational consistency, we also let \(Z_{r} \sim N(0_{{n_{r} }} ,K_{rr} )\) denote a version of \(f_{r}\) that is independent of both \(Z_{1|r}\) and \(Z_{2|r}\), so that by construction the random vector, \(Z = (Z_{1|r} ,Z_{2|r} ,Z_{r} )\), is multi-normal (Footnote 5) with:

$$Z = \left( {\begin{array}{*{20}c} {Z_{1|r} } \\ {Z_{2|r} } \\ {Z_{r} } \\ \end{array} } \right) \sim N\left[ {\left( {\begin{array}{*{20}c} {0_{{n_{1} }} } \\ {0_{{n_{2} }} } \\ {0_{{n_{r} }} } \\ \end{array} } \right),\left( {\begin{array}{*{20}c} {K_{11} - K_{1r} K_{rr}^{ - 1} K_{r1} } & {} & {} \\ {} & {K_{22} - K_{2r} K_{rr}^{ - 1} K_{r2} } & {} \\ {} & {} & {K_{rr} } \\ \end{array} } \right)} \right]$$

The desired probability model can then be formed as linear combinations of these independent basis vectors. If we now define the coefficient matrices,

$$A_{ij} = K_{ij} \,K_{jj}^{ - 1} ,\quad i,j = 1,2,r$$

then the appropriate hierarchical model, \(H = (H_{1} ,H_{2} )\), for the present case is given by

$$H_{1} = Z_{1\,|\,r} + A_{1r} Z_{r}$$
$$H_{2} = Z_{2\,|\,r} + A_{2r} Z_{r}$$

where each vector of latent variables, \(H_{i} = (h_{ij} :\,j = 1, \ldots ,n_{i} )\), represents a hierarchical version of the original latent responses, \((f_{ij} :j = 1, \ldots ,n_{i} )\), in the full model (1). Intuitively, it is the second terms in these expressions (both containing \(Z_{r}\)) that govern the covariances between random vectors \(H_{1}\) and \(H_{2}\). As we shall see below, the first terms then serve to maintain the desired marginal distributions of \(H_{1}\) and \(H_{2}\). Note also that since (18) and (19) can be written in matrix form as

$$H\,\, = \,\,\left( {\begin{array}{*{20}c} {H_{1} } \\ {H_{2} } \\ \end{array} } \right)\,\, = \,\,\left[ {\begin{array}{*{20}c} {I_{{n_{1} }} } & 0 & {A_{1r} } \\ 0 & {I_{{n_{2} }} } & {A_{2r} } \\ \end{array} } \right]\left( {\begin{array}{*{20}c} {Z_{1\,|\,r} } \\ {Z_{2\,|\,r} } \\ {Z_{r} } \\ \end{array} } \right)$$

it follows that \(H\) is a linear transformation of Z, and thus is also multi-normally distributed (Footnote 6). So if it can be shown that cov(H) = \(K^{H}\), then since \(E\,(Z)\, = \,0\) by construction, we will obtain a well-defined probability model

$$H \sim N\left( {0_{n} ,K^{H} } \right)$$

with the desired covariance matrix, \(K^{H}\). It is this hierarchical model, H, that will replace f in expression (2) of the original model. So, all that remains to be shown is that this hierarchical model has the desired covariance structure. These same observations will continue to hold in more complex examples, and shall not be repeated.

In the present case, we begin by observing that expressions (14) through (18), together with the independence of the Z components, imply that

$$\begin{aligned} {\text{cov}} \left( {H_{1} } \right) & = {\text{cov}} \left( {Z_{1|r} } \right) + {\text{cov}} \left( {A_{1r} Z_{r} } \right) \\ & = \left( {K_{11} - K_{1r} K_{rr}^{ - 1} K_{r1} } \right) + A_{1r} {\text{cov}} \left( {Z_{r} } \right)A_{1r}^{T} \\ & = \left( {K_{11} - K_{1r} K_{rr}^{ - 1} K_{r1} } \right) + \left( {K_{1r} K_{rr}^{ - 1} } \right)\left( {K_{rr} } \right)K_{rr}^{ - 1} K_{r1} \\ & = K_{11} = K_{11}^{H} \\ \end{aligned}$$

and similarly, that \({\text{cov}} (H_{2} ) = K_{22}^{H}\). Moreover, the independence and zero-mean properties of the Z components also imply that

$$\begin{aligned} {\text{cov}} \left( {H_{1} ,H_{2} } \right) & = E\left[ {H_{1} H_{2}^{T} } \right] = E\left[ {\left( {Z_{1\,|\,r} + A_{1r} Z_{r} } \right)\left( {Z_{2\,|\,r} + A_{2r} Z_{r} } \right)^{T} } \right] \\ & = E\left[ {\left( {A_{1r} Z_{r} } \right)\left( {A_{2r} Z_{r} } \right)^{T} } \right] = A_{1r} E\left( {Z_{r} Z_{r}^{T} } \right)A_{2r}^{T} \\ & = A_{1r} {\text{cov}} \left( {Z_{r} } \right)A_{2r}^{T} = \left( {K_{1r} K_{rr}^{ - 1} } \right)\left( {K_{rr} } \right)\left( {K_{rr}^{ - 1} K_{r2} } \right) \\ & = K_{1r} K_{rr}^{ - 1} K_{r2} = K_{12}^{H} , \\ \end{aligned}$$

which together with \({\text{cov}} (H_{2} ,H_{1} )\, = \,{\text{cov}} (H_{1} ,H_{2} )^{T}\) yields the desired result, \({\text{cov}} (H) = K^{H}\).
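This covariance argument can also be verified numerically. The sketch below forms the block-diagonal covariance of \(Z = (Z_{1|r} ,Z_{2|r} ,Z_{r} )\) in (16) and the transformation matrix in (20), and then compares the implied covariance of \(H\) with \(K^{H}\) in (12). The one-dimensional sample, the length scale and the landmark points are assumptions chosen only to keep the example small.

```python
import numpy as np

def k(A, B, tau=0.1):
    """Squared exponential kernel (unit variance) for 1-D inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / tau) ** 2)

x = np.linspace(0.0, 1.0, 20)
X1, X2 = x[:10], x[10:]                      # two subdomains
Xr = x[::5]                                  # 4 landmark points from the sample

Krr = k(Xr, Xr)
A1r = np.linalg.solve(Krr, k(Xr, X1)).T      # A_1r = K_1r K_rr^{-1}
A2r = np.linalg.solve(Krr, k(Xr, X2)).T      # A_2r = K_2r K_rr^{-1}
S1 = k(X1, X1) - A1r @ k(Xr, X1)             # cov(Z_{1|r})
S2 = k(X2, X2) - A2r @ k(Xr, X2)             # cov(Z_{2|r})

n1, n2, nr = len(X1), len(X2), len(Xr)
D = np.zeros((n1 + n2 + nr, n1 + n2 + nr))   # block-diagonal cov(Z)
D[:n1, :n1] = S1
D[n1:n1 + n2, n1:n1 + n2] = S2
D[n1 + n2:, n1 + n2:] = Krr

# H = M Z with M = [[I, 0, A_1r], [0, I, A_2r]]
M = np.block([[np.eye(n1), np.zeros((n1, n2)), A1r],
              [np.zeros((n2, n1)), np.eye(n2), A2r]])
cov_H = M @ D @ M.T

# Direct construction of K^H for comparison
K12_H = k(X1, Xr) @ np.linalg.solve(Krr, k(Xr, X2))
KH = np.block([[k(X1, X1), K12_H], [K12_H.T, k(X2, X2)]])
```

Up to rounding error, `cov_H` and `KH` coincide, which is the content of (22) and (23).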

Three-level hierarchical example

If the full sample of locations, \(X \subset S\), is extremely large, then each of the subsets, \(X_{i} \subset S_{i} ,\quad i = 1,2\), may also be large. Suppose for example that \(S\) was partitioned into four smaller subdomains, \((S_{1} ,\,\,S_{2} ,\,\,S_{3} ,\,\,S_{4} )\), as shown in Fig. 3 below. While one could in principle use the same set of landmark points, \(X_{r} \subset S_{r} = S\), to approximate covariances among the points, \(X_{i} = X \cap S_{i} ,\quad i = 1, \ldots ,4\), it is now possible to refine these approximations. In the present spatial setting, it is reasonable to suppose that points in adjacent domains, say \(S_{i}\) and \(S_{j}\) are more closely related (have higher covariances) than other point pairs. If so, then a better approximation to covariances between \(S_{i}\) and \(S_{j}\) is obtained by using only landmark points in \(S_{i} \cup S_{j}\).

Fig. 3

Three-level partition

To model such relations, we first recall from the hierarchical tree structure in Fig. 2 above that subdomains \(S_{1}\) and \(S_{2}\) are also called children of the parent domain, \(S_{r} \,\). In these terms, the construction in (18) and (19) can be viewed as a “parent–child” relationship. Following Chen [C1, Sect. 2.2], we refine covariance approximations by extending this type of relationship. If we let \(S_{5} = S_{1} \cup S_{2}\) and \(S_{6} = S_{3} \cup S_{4}\), then as seen in Fig. 3, \((S_{1} ,S_{2} )\) and \((S_{3} ,S_{4} )\) are the respective children of \(S_{5}\) and \(S_{6}\). If landmark points, \(X_{i} = [x_{i1} , \ldots ,x_{{i\,n_{i} }} ] \subset S_{i}\), are chosen for \(i = 5,6\), then these can in principle be used to approximate covariances between their respective children. Similarly, if we again designate the root domain by \(S_{r} = S = S_{5} \cup S_{6}\), then the subdomains \((S_{5} ,S_{6} )\) are themselves children of \(S_{r}\). So, if we again choose landmark points for this parent domain, \(X_{r} = [x_{r1} , \ldots ,x_{{r\,n_{r} }} ] \subset S_{r}\), then these can also be used to approximate covariances between children in \(X_{5}\) and \(X_{6}\). These nesting relationships can alternatively be represented by the tree structure in Fig. 4, where the basic partition domains, \((S_{1} ,\,S_{2} ,\,S_{3} ,\,S_{4} )\), at the lowest level again constitute the leaf nodes of the tree with root node, \(S_{r}\), and intermediate nodes, \(S_{5}\) and \(S_{6}\). Every link between nodes now represents a parent–child relation.

Fig. 4

Tree representation

Extended probabilistic interpretation

To extend the probabilistic interpretation of the two-level hierarchical covariance approximation above, we start at the upper level and define hierarchical random vectors for \(S_{5}\) and \(S_{6}\) [paralleling (18) and (19) above] as,

$$H_{ir} = Z_{i|r} + A_{ir} Z_{r} ,\quad i = 5,6$$

where the centered conditionals, \(Z_{i|r}\), and coefficients, \(A_{ir}\), have exactly the same meaning as in (15) and (17) [with (5,6) replacing (1,2)]. So in particular, these upper-level variables are capturing relations between the \(n_{i}\) landmark points in \(X_{i}\) and the \(n_{r}\) landmark points in \(X_{r}\). The desired hierarchical model, \(H = (H_{1} ,H_{2} ,H_{3} ,H_{4} )\), is then defined at the lower level by:

$$H_{i} = Z_{i|5} + A_{i5} H_{5r} = Z_{i|5} + A_{i5} \left( {Z_{5|r} + A_{5r} Z_{r} } \right) = Z_{i|5} + A_{i5} Z_{5|r} + A_{i5} A_{5r} Z_{r} ,\quad i = 1,2$$
$$H_{i} = Z_{i|6} + A_{i6} H_{6r} = Z_{i|6} + A_{i6} \left( {Z_{6|r} + A_{6r} Z_{r} } \right) = Z_{i|6} + A_{i6} Z_{6|r} + A_{i6} A_{6r} Z_{r} ,\quad i = 3,4$$

The parentheses in the second equalities in (25) and (26) serve to highlight the recursive nature of these definitions, while the last equalities exhibit the linear relations between \(H\) and the hierarchical family of basis vectors, \(Z = \,\,\{ Z_{1|5} ,Z_{2|5} ,Z_{3|6} ,Z_{4|6} ,Z_{5|r} ,Z_{6|r} ,Z_{r} \}\), shown in Fig. 5 below. As an extension of the two-level model in (18) and (19), we now see from (25) for example that the second terms involving \(Z_{5|r}\) reflect the covariance relations between \(H_{1}\) and \(H_{2}\). Similarly, the last terms involving \(Z_{r}\) in both (25) and (26) reflect additional covariance relations among all four components of \(H = (H_{1} ,H_{2} ,H_{3} ,H_{4} )\).

Fig. 5

Random basis vectors

For this three-level example, the hierarchical approximation, \(K^{H}\), to \(K = K(X,X)\), can be defined by specifying the matrix cells shown in Fig. 6 (together with symmetry). Following expression (16) in [C1], there are only three types of covariance expressions to be considered, namely within domains (first-level interactions), between adjacent domains (second-level interactions) and between non-adjacent domains (higher-level interactions), as can be illustrated by \(K_{11}^{H} \,,\,K_{12}^{H} ,\) and \(K_{13}^{H}\):

$$K_{11}^{H} = {\text{cov}} \left( {H_{1} ,H_{1} } \right) = K_{11}$$
$$K_{12}^{H} = {\text{cov}} \left( {H_{1} ,H_{2} } \right) = K_{15} K_{55}^{ - 1} K_{52}$$
$$K_{13}^{H} = {\text{cov}} \left( {H_{1} ,H_{3} } \right) = K_{15} K_{55}^{ - 1} K_{5r} K_{rr}^{ - 1} K_{r6} K_{66}^{ - 1} K_{63}$$

Fig. 6

Hierarchical covariance matrix

But (27) follows from the argument in (22) together with the recursive nature of (25). A first application of (22) [to expression (24)] yields \({\text{cov}} (H_{5r} )\,\, = \,\,K_{55}\). But the independence of \(Z_{1|5}\) and \(H_{5r} \,( = \,\,Z_{5|r} \, + \,A_{5r} Z_{r} )\) together with a second application of (22) [to the first equality in (25)] shows that

$${\text{cov}} \left( {H_{1} } \right) = {\text{cov}} \left( {Z_{1|5} } \right) + A_{15} {\text{cov}} \left( {H_{5r} } \right)A_{15}^{T} = \left( {K_{11} - K_{15} K_{55}^{ - 1} K_{51} } \right) + \left( {K_{15} K_{55}^{ - 1} } \right)K_{55} \left( {K_{55}^{ - 1} K_{51} } \right) = K_{11}$$

Moreover, since \(Z_{1|5}\), \(Z_{2|5}\) and \(H_{5r} \,( = \,\,Z_{5|r} \, + \,A_{5r} Z_{r} )\) are mutually independent, it also follows that

$$\begin{aligned} {\text{cov}} \left( {H_{1} ,H_{2} } \right) & = {\text{cov}} \left[ {\left( {Z_{1|5} + A_{15} H_{5r} } \right),\left( {Z_{2|5} + A_{25} H_{5r} } \right)} \right] = A_{15} {\text{cov}} \left( {H_{5r} } \right)A_{25}^{T} \\ & = \left( {K_{15} K_{55}^{ - 1} } \right)K_{55} \left( {K_{55}^{ - 1} K_{52} } \right) = K_{15} K_{55}^{ - 1} K_{52} \\ \end{aligned}$$

Finally, since all components of \(Z\) are independent, the same argument shows that

$$\begin{aligned} {\text{cov}} \left( {H_{1} ,H_{3} } \right) & = {\text{cov}} \left[ {\left( {Z_{1|5} + A_{15} Z_{5|r} + A_{15} A_{5r} Z_{r} } \right),\left( {Z_{3|6} + A_{36} Z_{6|r} + A_{36} A_{6r} Z_{r} } \right)} \right] \\ & = A_{15} A_{5r} {\text{cov}} \left( {Z_{r} } \right)A_{6r}^{T} A_{36}^{T} = \left( {K_{15} K_{55}^{ - 1} } \right)\left( {K_{5r} K_{rr}^{ - 1} } \right)K_{rr} \left( {K_{rr}^{ - 1} K_{r6} } \right)\left( {K_{66}^{ - 1} K_{63} } \right) \\ & = K_{15} \,K_{55}^{ - 1} K_{5r} \,K_{rr}^{ - 1} \,K_{r6} \,K_{66}^{ - 1} \,K_{63} \\ \end{aligned}$$

So again, we see that \({\text{cov}} (H)\,\, = \,\,K^{H}\).
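The same independence argument can be reproduced numerically for the three-level case. In the sketch below, each leaf vector is represented by its coefficients on the basis vectors in (25) and (26), and covariances between leaves are obtained by summing over the basis vectors they share. The one-dimensional sample, landmark choices and helper names (`A_`, `S_`, `coef`, `cov_H`) are assumptions introduced only for this illustration.

```python
import numpy as np

def k(A, B, tau=0.1):
    """Squared exponential kernel (unit variance) for 1-D inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / tau) ** 2)

def A_(Xc, Xp):
    """Coefficient matrix A_cp = K_cp K_pp^{-1}."""
    return np.linalg.solve(k(Xp, Xp), k(Xp, Xc)).T

def S_(Xc, Xp):
    """Covariance of the centered conditional Z_{c|p}."""
    return k(Xc, Xc) - A_(Xc, Xp) @ k(Xp, Xc)

x = np.linspace(0.0, 1.0, 24)
X1, X2, X3, X4 = np.split(x, 4)              # four leaves
X5, X6, Xr = x[:12:4], x[12::4], x[::8]      # landmarks for S5, S6 and the root

# Covariances of the mutually independent basis vectors
D = {"1|5": S_(X1, X5), "2|5": S_(X2, X5), "3|6": S_(X3, X6), "4|6": S_(X4, X6),
     "5|r": S_(X5, Xr), "6|r": S_(X6, Xr), "r": k(Xr, Xr)}

def coef(Xi, i, Xp, p):
    """Coefficients of H_i on Z_{i|p}, Z_{p|r} and Z_r."""
    Aip = A_(Xi, Xp)
    return {f"{i}|{p}": np.eye(len(Xi)), f"{p}|r": Aip, "r": Aip @ A_(Xp, Xr)}

def cov_H(ci, cj):
    """cov(H_i, H_j): only shared basis vectors contribute."""
    return sum(ci[z] @ D[z] @ cj[z].T for z in ci if z in cj)

c1, c2, c3 = coef(X1, 1, X5, 5), coef(X2, 2, X5, 5), coef(X3, 3, X6, 6)

inv = np.linalg.inv
K12_H = k(X1, X5) @ inv(k(X5, X5)) @ k(X5, X2)
K13_H = (k(X1, X5) @ inv(k(X5, X5)) @ k(X5, Xr) @ inv(k(Xr, Xr))
         @ k(Xr, X6) @ inv(k(X6, X6)) @ k(X6, X3))
```

Here `cov_H(c1, c1)`, `cov_H(c1, c2)` and `cov_H(c1, c3)` reproduce \(K_{11}\), `K12_H` and `K13_H` respectively, up to rounding error.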

General modeling scheme

The above examples should make it sufficiently clear that the general hierarchical model consists of a family of random vectors, \(H = (H_{i} :i = 1, \ldots ,b)\), where for each basic subdomain, \(S_{i}\), of S (i.e., leaf of the associated tree), the random vector, \(H_{i}\), is a nested linear combination of the basis vectors, Z, such as in Fig. 5 above. In particular, if for each node, \(i_{1}\), in the tree we now designate the unique path, \(i_{1} \, \to \,i_{2} \, \to \,\, \cdots \,\, \to \,i_{m - 1} \to i_{m} \, \to \,r\), of successive parents (ancestors) up to the root node, \(r\), as the root path for \(i_{1}\), then the appropriate form of \(H_{i}\) for each leaf node, \(i\), with root path, \(i\, \to \,i_{1} \, \to \,i_{2} \to \, \cdots \,\, \to \,i_{m - 1} \to i_{m} \, \to \,r\), now takes the form:

$$H_{i} = Z_{{i|i_{1} }} + A_{{i\,i_{1} \,}} \left( {Z_{{i_{1} \,|\,i_{2} }} + A_{{i_{1} \,i_{2} }} \left( { \cdots \left( {Z_{{i_{m - 1} \,|\,i_{m} }} + A_{{i_{m - 1} \,i_{m} }} \left( {Z_{{i_{m} \,|\,r\,}} + A_{{i_{m} \,r}} Z_{r} } \right)} \right) \cdots } \right)} \right)$$

In terms of this notation, the desired covariance for \(H_{i}\) [given by the top half of expression (14) in [C1] for a representative point pair, \((x,x^{{\prime }} )\), in \(X_{i}\)] is simply the kernel covariance,

$${\text{cov}} \left( {H_{i} } \right) = K\left( {X_{i} ,X_{i} } \right) = K_{ii}$$

In addition, the desired covariance between any pair of leaf vectors, \(H_{i}\) and \(H_{j}\), with least common ancestor, \(s\) [possibly root, r, itself] and root paths

$$i \to i_{1} \to i_{2} \to \cdots \to i_{p - 1} \to i_{p} \to s \to h_{1} \cdots \to h_{m} \to r$$
$$j \to j_{1} \to j_{2} \to \cdots \to j_{q - 1} \to j_{q} \to s \to h_{1} \cdots \to h_{m} \to r$$

is given [in terms of expression (16) in [C1] for point pairs, \(x \in X_{i}\) and \(x^{{\prime }} \in X_{j}\)] by

$${\text{cov}} \left( {H_{i} ,H_{j} } \right) = K_{{i\,\,i_{1} }} K_{{i_{1} \,i_{1} }}^{ - 1} \,K_{{i_{1} \,\,i_{2} }} K_{{i_{2} \,i_{2} }}^{ - 1} \, \cdots \,K_{{i_{p} \,s}} K_{s\,s}^{ - 1} \,K_{{s\,j_{q} }} K_{{j_{q} \,j_{q} }}^{ - 1} \, \cdots \,K_{{j_{2} \,j_{1} }} \,K_{{j_{1} \,j_{1} }}^{ - 1} \,K_{{j_{1} \,j}} \,$$

Note in particular that the hierarchical covariances in (12) for our two-level example and in (27) through (29) for our three-level example are both instances of (34) and (37). In the Appendix it is shown that the hierarchical model in (33) continues to exhibit this covariance structure in all cases, and thus provides a general probabilistic formulation of hierarchical covariance matrices. This is of particular importance in that such matrices are themselves exact covariance matrices (as observed in [C2, p.5]), and need not themselves be interpreted as “approximations”.
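Expressions (34) and (37) translate directly into a short routine that walks the root paths of two leaves up to their least common ancestor. The following is only a naive sketch under assumed inputs: the tree is stored as a hypothetical `parent` dictionary, the sample and landmark points as a `pts` dictionary, and no attempt is made to reuse intermediate products.

```python
import numpy as np

def k(A, B, tau=0.1):
    """Squared exponential kernel (unit variance) for 1-D inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / tau) ** 2)

# Assumed three-level tree: leaves 1-4, intermediate nodes 5-6, root "r"
x = np.linspace(0.0, 1.0, 24)
parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: "r", 6: "r"}
pts = {1: x[:6], 2: x[6:12], 3: x[12:18], 4: x[18:],
       5: x[:12:4], 6: x[12::4], "r": x[::8]}

def root_path(i):
    """Node i followed by its successive ancestors up to the root."""
    path = [i]
    while path[-1] != "r":
        path.append(parent[path[-1]])
    return path

def up_chain(path):
    """Product K_{i i1} K_{i1 i1}^{-1} ... K_{ip s} K_{ss}^{-1} along a path ending at s."""
    out = np.eye(len(pts[path[0]]))
    for c, p in zip(path[:-1], path[1:]):
        out = out @ k(pts[c], pts[p]) @ np.linalg.inv(k(pts[p], pts[p]))
    return out

def hier_cov(i, j):
    """cov(H_i, H_j) for leaves i and j, following (34) and (37)."""
    if i == j:
        return k(pts[i], pts[i])
    pi, pj = root_path(i), root_path(j)
    s = next(a for a in pi if a in pj)            # least common ancestor
    Li = up_chain(pi[:pi.index(s) + 1])
    Lj = up_chain(pj[:pj.index(s) + 1])
    return Li @ k(pts[s], pts[s]) @ Lj.T

C12, C13 = hier_cov(1, 2), hier_cov(1, 3)
```

For leaves 1 and 2 the least common ancestor is node 5, so that `C12` reduces to \(K_{15} K_{55}^{ - 1} K_{52}\), while for leaves 1 and 3 the chain passes through the root as in (29).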

Efficient algorithms and storage

While the probabilistic development above provides a more concrete interpretation of hierarchical covariance matrices, it cannot be overemphasized that the real power of these hierarchical structures is their computational efficiency, which allows Gaussian Process Regression models to be extended to large data sets. Rather than storing the entire kernel matrix in memory, much smaller block diagonal matrices (covariances of leaves, \(H_{i}\)), are stored along with even smaller matrices found at the parent nodes of the space partitioning tree. Omitting off-diagonal blocks of the covariance matrix (covariances between leaf pairs, \(H_{i}\) and \(H_{j}\)), generates significant gains in scalability since these omitted blocks are only calculated on an as-needed basis using the appropriate tree traversal.

This may appear to simply trade the problem of storage for that of drastically increased calculation requirements. But careful inspection shows that this computation issue is not as serious as one might expect. Referring back to Eq. (37), suppose that leaves \(j\) and \(k\) share the same parent node, \(j_{1}\). Then the covariance between \(H_{i}\) and \(H_{k}\) is given by

$${\text{cov}} \left( {H_{i} ,H_{k} } \right) = K_{{i\,\,i_{1} }} K_{{i_{1} \,i_{1} }}^{ - 1} \,K_{{i_{1} \,\,i_{2} }} K_{{i_{2} \,i_{2} }}^{ - 1} \, \cdots \,K_{{i_{p} \,s}} K_{s\,s}^{ - 1} \,K_{{s\,j_{q} }} K_{{j_{q} \,j_{q} }}^{ - 1} \, \cdots \,K_{{j_{2} \,j_{1} }} \,K_{{j_{1} \,j_{1} }}^{ - 1} \,K_{{j_{1} \,k}}$$

which is seen to differ from (37) by only the last element, \(K_{{j_{1} k}}\). This type of overlap suggests that computational procedures can be recursively structured to avoid repeated calculations of common products such as in (37) and (38). Such recursive procedures are formalized in [C1] and [C2].

While the full set of procedures can be found in these references, the three most basic operations are matrix–vector products (O.1), matrix inversion (O.2), and determinant calculations (O.3). For our present purposes, the application of these operations is best illustrated in terms of the log likelihood function,

$$L(\theta |y) = - \tfrac{1}{2}\,\log \left[ {\det (C_{\theta } )} \right] - \tfrac{1}{2}y^{{\prime }} C_{\theta }^{ - 1} y - \tfrac{n}{2}\,\log (2\pi )$$

for a multinormal random vector, \(y \sim N(0,C_{\theta } )\) with hierarchical covariance matrix, \(C_{\theta }\) parameterized by \(\theta\). Such likelihood calculations are performed many times in the estimation of \(\theta\), and require efficient methods for large scale datasets. Having constructed and stored the matrix, \(C_{\theta }\), within the HCA framework, one calculates \(\det (C_{\theta } )\) by the determinant operation (O.3) [which actually calculates the log determinant directly]. One then constructs \(C_{\theta }^{ - 1}\) by the inverse operation (O.2). Finally, this is followed by the calculation of \(C_{\theta }^{ - 1} y\) using matrix–vector product operation (O.1), which in turn reduces the quadratic form, \(y^{\prime}C_{\theta }^{ - 1} y\), to a simple inner product of n-vectors.
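To indicate why such likelihood evaluations need not involve any full \(n\)-square matrix, the following sketch treats the two-level case only. It is not the recursive procedure of [C1] and [C2], but a simpler stand-in based on the Woodbury identity and the matrix determinant lemma, using the fact that \(K^{H} + \sigma^{2} I_{n}\) equals a block-diagonal matrix plus the low-rank term \(UK_{rr}^{ - 1} U^{T}\) with \(U = [K_{1r} ;K_{2r} ]\). All names, data and parameter values are illustrative assumptions.

```python
import numpy as np

def k(A, B, tau=0.2):
    """Squared exponential kernel (unit variance) for 1-D inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / tau) ** 2)

def loglik_two_level(X_leaves, y_leaves, Xr, sigma2):
    """Log likelihood of y ~ N(0, K^H + sigma2 I) using only leaf-sized and landmark-sized matrices."""
    Krr = k(Xr, Xr)
    G = Krr.copy()                    # accumulates K_rr + U^T B^{-1} U
    g = np.zeros(len(Xr))             # accumulates U^T B^{-1} y
    logdet_B, quad, n = 0.0, 0.0, 0
    for Xi, yi in zip(X_leaves, y_leaves):
        Kir = k(Xi, Xr)
        # diagonal block of B: K_ii + sigma2 I - K_ir K_rr^{-1} K_ri
        Bi = k(Xi, Xi) + sigma2 * np.eye(len(Xi)) - Kir @ np.linalg.solve(Krr, Kir.T)
        Binv_y = np.linalg.solve(Bi, yi)
        logdet_B += np.linalg.slogdet(Bi)[1]
        quad += yi @ Binv_y
        G += Kir.T @ np.linalg.solve(Bi, Kir)
        g += Kir.T @ Binv_y
        n += len(yi)
    quad -= g @ np.linalg.solve(G, g)                                   # Woodbury correction
    logdet = logdet_B + np.linalg.slogdet(G)[1] - np.linalg.slogdet(Krr)[1]
    return -0.5 * logdet - 0.5 * quad - 0.5 * n * np.log(2 * np.pi)

# Hypothetical data: 40 points, two leaves, 5 landmark points
x = np.linspace(0.0, 1.0, 40)
rng = np.random.default_rng(0)
y = np.sin(6 * x) + 0.3 * rng.standard_normal(40)
X_leaves, y_leaves, Xr, sigma2 = [x[:20], x[20:]], [y[:20], y[20:]], x[::8], 0.1
ll_fast = loglik_two_level(X_leaves, y_leaves, Xr, sigma2)

# Dense reference computed from the full matrix C = K^H + sigma2 I
K12_H = k(x[:20], Xr) @ np.linalg.solve(k(Xr, Xr), k(Xr, x[20:]))
C = np.block([[k(x[:20], x[:20]), K12_H], [K12_H.T, k(x[20:], x[20:])]]) + sigma2 * np.eye(40)
ll_dense = -0.5 * np.linalg.slogdet(C)[1] - 0.5 * y @ np.linalg.solve(C, y) - 20 * np.log(2 * np.pi)
```

With leaf and landmark sizes held fixed, the loop does a constant amount of work per leaf, which gives some intuition for the linear scaling established more generally in [C1] and [C2].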

In addition to these three main operations, which are used exclusively for calculations with training data \((y,X)\), there are also more specialized operations designed for calculations involving covariances, \(K(X_{*} ,X)\), with \(n_{t}\) prediction points, \(X_{*}\). In particular, there is a matrix–vector product operation (O.4) for calculating expressions such as the conditional means in (7) and local marginal effects in (9), while a quadratic form operation (O.5) is used for calculating the conditional covariances in (8). Here it should be noted that while these operations were originally developed in [C2] for the vector case of single prediction points (\(n_{t} = 1\)), such procedures are readily extendable to matrices. Hierarchical procedures such as O.4 and O.5 avoid the need to form full \(n_{t} \times n\) covariance matrices, \(K(X_{*} ,X)\).

To make matters more concrete, we conduct a series of simple experiments [using Matlab R2018b and GPStuff (Vanhatalo et al. 2013)]. Results of these experiments are displayed in Fig. 7 below (with HCA = GPR-HCA and FULL = GPR-FULL). For HCA (where we use 150 landmark points and a maximum of 1000 observations per leaf) we consider sample sizes ranging from 2000 to 128,000 observations. For FULL we cap the number of observations at 32,000 for Storage and 16,000 for the Matrix Inverse Operation (which uses an efficient mex file; see Footnote 7). As shown in Fig. 7, FULL has a dramatic acceleration of costs with increasing sample size, while HCA’s storage and operations are linear in the number of samples. These findings are consistent with [C1] and [C2] where it is shown that as long as the maximum number of landmark points on each level of the hierarchy is held constant, the overall costs of both computation and storage are linear in the number of samples, n.

Fig. 7

Computation and storage comparisons of matrix inversion for HCA versus FULL

Finally it should be noted that large kernel matrices tend to be ill-conditioned, and in particular, may lose their positive definiteness when inverted. In expression (3) above, the addition of measurement-error variance, \(\sigma^{2} I_{n}\), to matrix \(K\) tends to counteract this ill-conditioning for the diagonal blocks of hierarchical covariance matrices, \(K^{H}\), such as \(K_{11}\) and \(K_{22}\) in expression (12) above. But this is not true of off-diagonal “landmark” covariance matrices such as \(K_{rr}\) in the same expression. So following Chen [C1, Sect. 4.3] we add small regularizing effects to these matrices (which are similar in form to \(\sigma^{2} I_{n}\)).

Parameter estimation

Given this hierarchical covariance approximation structure, together with a set of observed responses, \(y = (y_{1} , \ldots ,y_{n} )^{{\prime }}\), and associated attributes, \(X = [x_{l} = (x_{l1} , \ldots ,x_{ld} ):l = 1, \ldots ,n]\), the estimation of mean and covariance parameters for GPR-HCA proceeds along standard lines. First, given that our primary interest is in covariance estimation, we employ the simple kriging convention of estimating the common mean, \(\mu\), of responses in (1) by their sample mean, \(\overline{y} = \tfrac{1}{n}\Sigma_{i} y_{i}\). In this way, we can focus on response deviations about this sample mean, and proceed to estimate the covariance kernel parameters, \((v,\tau_{1} , \ldots ,\tau_{d} )\) in (4), together with measurement variance, \(\sigma^{2}\), in (1). Thus by letting \(\theta = (v,\tau_{1} , \ldots ,\tau_{d} ,\sigma^{2} )\) denote the full vector of parameters to be estimated, and making the parameter dependency of \(K\) explicit by writing \(K_{\theta }\) in (3), we now treat \(Y\) in (3) as a deviation vector with distribution, \(N[0_{n} ,K_{\theta } (X,X) + \sigma^{2} I_{n} ]\), so that the log likelihood function in (39) takes the more explicit form,

$$L\left( {\theta |X,y} \right) = - \tfrac{1}{2}\,\log \left( {\det \left[ {K_{\theta } (X,X) + \sigma^{2} I_{n} } \right]} \right) - \tfrac{1}{2}y^{{\prime }} \left[ {K_{\theta } (X,X) + \sigma^{2} I_{n} } \right]^{ - 1} y - \tfrac{n}{2}\,\log (2\pi )$$

In these terms our (positive) parameters, \(\theta = (\theta_{i} :i = 1, \ldots ,d + 2)\), are postulated to have independent log-Gaussian priors, \(p(\ln \theta_{i} )\), yielding a log posterior density of the form:

$$\begin{aligned} \log p\left( {\theta |y,X} \right) & = \log p\left( {y|X,\theta } \right) + \sum\nolimits_{i = 1}^{d + 2} {\log p\left( {\ln \theta_{i} } \right)} - \log p(y) \\ & = L(\theta |X,y) + \sum\nolimits_{i = 1}^{d + 2} {\log p\left( {\ln \theta_{i} } \right)} - \log p(y) \\ \end{aligned}$$

It is this energy function that is maximized to obtain maximum a-posteriori (MAP) estimates, \(\hat{\theta }\), of all parameters. In the numerical simulations and applications to follow, all parameter priors are assumed to have the common form, \(\log \theta_{i} \sim N(2,9)\), which essentially yields vague priors with conservative mean values for both length scales and variances.
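The structure of this energy function can be sketched as follows in Python (NumPy). This is an illustrative sketch rather than the Matlab implementation used in the paper. It assumes that \(N(2,9)\) denotes a mean of 2 and a variance of 9 for each \(\ln \theta_{i}\), it drops the constant term \(\log p(y)\) (which does not affect the maximizer), and the names `log_prior` and `energy`, together with the user-supplied callable `log_likelihood`, are hypothetical.

```python
import numpy as np

def log_prior(theta, mean=2.0, var=9.0):
    """Sum of log N(ln theta_i | mean, var) densities over all parameters.

    Assumption: N(2, 9) is read as mean 2 and variance 9.
    """
    z = np.log(np.asarray(theta, dtype=float))
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * var) - 0.5 * (z - mean) ** 2 / var))

def energy(theta, log_likelihood):
    """Log posterior of theta up to the additive constant -log p(y).

    `log_likelihood` is any callable returning L(theta | X, y).
    """
    return log_likelihood(theta) + log_prior(theta)
```

A MAP estimate would then be obtained by passing the negative of `energy` to a numerical optimizer over positive \(\theta\), or equivalently over the unconstrained values \(\ln \theta_{i}\).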

The estimation procedure for this GPR-HCA model was programmed in Matlab, and optimized using Matlab’s fmincon routine. Matlab code is available from the authors.

Simulation analyses

To investigate the behavior of GPR-HCA, we begin with a simple simulation model that allows us to explore the computational efficiency of this method, as well as its predictive accuracy. To do so, we employ the following two-variable model with Gaussian noise,

$$y = \cos \left( {8x_{2} - 3.5} \right) + .8\left[ {\sin \left( {4x_{1} x_{2} } \right) + \cos \left( {2x_{1} + 6.66} \right)} \right] + \varepsilon ,\;\;\varepsilon \sim N(0,\gamma )$$

defined over the unit square \((x_{1} ,x_{2} ) \in [0,1]^{2}\). Unless otherwise noted, the noise variance, \(\gamma\), is set to 0.25. For our later purposes, the associated local marginal effects for this model are given by:

$$\frac{\partial E[y]}{{\partial x_{1} }} = 3.2\,x_{2} \cos \left( {4x_{1} x_{2} } \right) - 1.6\sin \left( {2x_{1} + 6.66} \right)$$
$$\frac{\partial E[y]}{{\partial x_{2} }} = - 8\,\sin \left( {8x_{2} - 3.5} \right) + 3.2\,x_{1} \cos \left( {4x_{1} x_{2} } \right)$$
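For readers who wish to reproduce this setup, the model mean and its two local marginal effects can be written down directly. The Python (NumPy) sketch below is illustrative only: the function names are hypothetical, and drawing \((x_{1} ,x_{2} )\) uniformly over the unit square is an assumption, since the sampling scheme for the inputs is not specified here.

```python
import numpy as np

def mean_y(x1, x2):
    """Model mean E[y] of the two-variable simulation model."""
    return np.cos(8 * x2 - 3.5) + 0.8 * (np.sin(4 * x1 * x2) + np.cos(2 * x1 + 6.66))

def lme_x1(x1, x2):
    """Exact local marginal effect dE[y]/dx1."""
    return 3.2 * x2 * np.cos(4 * x1 * x2) - 1.6 * np.sin(2 * x1 + 6.66)

def lme_x2(x1, x2):
    """Exact local marginal effect dE[y]/dx2."""
    return -8 * np.sin(8 * x2 - 3.5) + 3.2 * x1 * np.cos(4 * x1 * x2)

def simulate(n, gamma=0.25, seed=0):
    """Draw n noisy observations; uniform inputs on [0, 1]^2 are an assumption."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    noise = rng.normal(0.0, np.sqrt(gamma), size=n)   # gamma is the noise variance
    y = mean_y(X[:, 0], X[:, 1]) + noise
    return X, y
```

The derivative expressions can be checked against central finite differences of `mean_y`, which is a convenient safeguard when the model is modified.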

This two-variable setup allows the model mean, \(E(y)\), to be displayed visually as in Fig. 8a. This not only provides a contextual feel for the underlying relationship, but also allows a direct comparison with the estimated mean, \(\hat{E}(y)\), in Fig. 8b [to be discussed later].

Fig. 8

Contour plots of the a true mean dependent variable and b GPR-HCA estimated mean dependent variable

While the most natural benchmark for comparison is in terms of the full model (GPR-FULL), this estimation procedure is constrained to small sample sizes (at most 15,000 samples), rendering such comparisons infeasible on larger datasets. Consequently, we also employ the more scalable algorithms, NNGP and GBM, as mentioned in the introduction. This allows scalability comparisons at much larger sample sizes.

In Sect. 4.1, we begin by comparing the scalability and accuracy of GPR-HCA with GPR-FULL (again denoted as HCA and FULL) over a limited range of simulated sample sizes from model (42), and then consider more extended-range comparisons with GBM and NNGP in Sect. 4.2. Finally, we examine both the scalability and accuracy of Local Marginal Effects for GPR-HCA in Sect. 4.3.

Limited-range comparisons with GPR-FULL

Here it is instructive to compare these two methods both with respect to parameter estimation and out-of-sample predictions.

Parameter estimation

Turning first to the relative scalability of parameter estimation procedures, the computation times for estimating parameters, \(\theta = (v,\tau_{1} ,\tau_{2} ,\sigma^{2} )\), by both HCA and FULL are shown in Fig. 9 for a selected range of sample sizes up to 15,000. Here (as in all examples to follow) HCA is parameterized using leaves of maximum size 1,000 together with 150 landmark points at each hierarchical level. From this figure it is evident that even for sample sizes as small as a few hundred, HCA is already orders of magnitude faster than FULL.

Fig. 9

Comparison of computation times for GPR-HCA and GPR-FULL

To gauge the similarity of parameter estimates for these two methods, it is instructive to compare the energy functions (41) generated by HCA and FULL for a specific case (using a training set of 2117 points). Following Chen and Stein [C2, Fig. 5], we focus on a subplot of the \((\tau_{1} ,\tau_{2} )\) plane, holding all other parameters at their optimal values. Results for these two key parameters are shown for FULL and HCA in Fig. 10a, b, respectively, where length scales are plotted in terms of their log values, \(\ln (\tau_{1} )\) and \(\ln (\tau_{2} )\), and where the large dot in each figure denotes the optimal parameter values. Here it is clear that with only 150 landmark points, the energy functions are virtually identical in shape and size. More generally, it appears from further experiments with this model that Mean Absolute Errors of prediction are essentially flat beyond \(r = 150\). This is consistent with the simulation results of Chen and Stein [C2, p.16], who found that \(r = 125\) was sufficient. However, as discussed further in Sect. 5 below, larger numbers of landmark points may be required in situations where the data are less well behaved than in these simulation models.

Fig. 10

Energy function for a FULL and b HCA

Out-of-sample prediction

With respect to predictions, comparable batch-sample procedures were carried out for a range of random test samples up to 350 K. Computation times for HCA and FULL are shown in Fig. 11a [where the linearity of computation times for FULL as well as HCA results from the batch nature of such computations]. While such times are seen to be about twice as large for FULL in the present illustration, these time values depend critically on the size of the training set used (and large training sets are infeasible for FULL).

Fig. 11

Comparisons of batch-sample predictions for FULL and HCA with respect to a computation times, and b accuracy in terms of MAE

Turning to the comparison of mean absolute errors in Fig. 11b, these errors are in fact so close in value that the blue curve for FULL cannot even be seen. So for predictions as well as parameter estimates, the key point is again that even in the range where the full version of GPR is feasible, HCA displays a high degree of fidelity to the FULL outcomes while being dramatically faster. In the present example it is also of interest to note that these mean absolute errors are remarkably small. In fact, for the model in (42) with normal errors, the absolute deviations of simulated values about the mean are well known to be distributed as a “folded normal” with mean \(\sqrt {2\gamma /\pi }\), which for the present case of \(\gamma = 0.25\) yields 0.3989. So it should be clear that the prediction errors above are almost entirely due to fluctuations generated by the model error term itself.
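The folded-normal benchmark quoted above is easy to verify numerically. The short Python (NumPy) check below is a sketch with hypothetical variable names: it compares the closed-form mean \(\sqrt {2\gamma /\pi }\) with a Monte Carlo average of absolute deviations for \(\gamma = 0.25\).

```python
import numpy as np

gamma = 0.25                                        # noise variance of the simulation model
expected_abs_dev = np.sqrt(2.0 * gamma / np.pi)     # folded-normal mean, about 0.3989

# Monte Carlo check: average |eps| for eps ~ N(0, gamma)
rng = np.random.default_rng(0)
eps = rng.normal(0.0, np.sqrt(gamma), size=1_000_000)
monte_carlo_abs_dev = np.mean(np.abs(eps))
```

Any predictor evaluated against noisy out-of-sample responses cannot be expected to achieve a mean absolute error much below this value, which is why it serves as a natural floor for the MAE comparisons.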

Extended-range comparisons with NNGP and GBM

As should be evident from Fig. 9, the linear scalability properties of GPR-HCA allow for the analysis of data sets vastly larger than those feasible for GPR-FULL (Footnote 8). So for larger data sets, it is appropriate at this point to compare HCA with other well-known linearly scalable prediction models, namely NNGP and GBM, as mentioned above. Computation times (averaged across 3 different runs) over a selected range of sample sizes up to n = 500,000 are shown for HCA, NNGP, and GBM in Fig. 12a (Footnote 9). These times also include predictions for a randomly selected set of 10,000 out-of-sample points (Footnote 10). Using these points, the relative prediction accuracy is then compared in terms of mean absolute errors (MAE), as shown in Fig. 12b.

Fig. 12

Comparisons of GPR-HCA with both NNGP and GBM in terms of a Computing times, and b Mean absolute errors

The key feature of the computing-time plots in Fig. 12a is their approximate linearity, which underscores the demonstrable fact that all three procedures have complexity of order \(O(n)\). However, it should be stressed that the relative magnitudes of these computation times are more difficult to compare. On the one hand, both the NNGP and GBM models involve many alternative specifications, as well as tuning parameters that have not been fully optimized. In particular, the present version of NNGP used is the conjugate version (spConjNNGP) in the R package, spNNGP, with default settings including an exponential specification of the kernel function (as documented in Finley et al. 2020). With respect to GBM, the cross-validation method used to gauge iteration numbers involves many repetitions of model estimations (and can be replaced by faster but less accurate methods). On the other hand, it should be emphasized that both NNGP and GBM have been written in optimized C/C++ code, which is well known to be dramatically faster than the Matlab code used here for HCA.

Turning next to the relative accuracy of such predictions, it is clear from Fig. 12b that for this simulation example, HCA is uniformly more accurate than both NNGP and GBM. Moreover, while the MAE values exhibited by HCA are almost identical to the model fluctuations themselves (as mentioned at the end of Sect. 4.1 above), those of both NNGP and GBM are noticeably higher. However, it must again be emphasized that there is a speed-accuracy tradeoff here, especially for GBM. We elected to use 10,000 trees for GBM, which is a typical size in practice. The results for 100,000 trees (not shown) yield predictions very close to HCA, though with computing times that are actually slower than HCA. For NNGP, we have increased the default value of k = 2 to k = 5 in the k-fold cross-validation procedure for estimating covariance parameters. But even larger values appear to have little effect on prediction accuracy. However, it should also be noted that, unlike expression (4) above, the kernel functions employed in NNGP are isotropic, and thus somewhat less flexible than (4) for prediction purposes. In summary, the essential message of Fig. 12 from our point of view is that the present hierarchical covariance approximation method is competitive with existing alternative models both in terms of its scalability and prediction accuracy.

To gain further appreciation for the accuracy of this method, the two-dimensional nature of our present simulation model allows a direct visual comparison of the contours of \(E(y)\) in Fig. 8a above with the estimated contours, \(\hat{E}(y)\), for GPR-HCA as shown in Fig. 8b above (for a training sample of 12,569 observations and 150 landmark points; see Footnote 11). Here the remarkable similarity of these contours underscores the ability of GPR-HCA to faithfully capture the full structure of the underlying model.

Fig. 13

Contour plot of a \(LME(y|x_{1} )\,\) and b \(\widehat{LME}(y|x_{1} )\,\)

Fig. 14

Contour plot of a \(LME(y|x_{2} )\,\) and b \(\widehat{LME}(y|x_{2} )\,\)

Scalability and accuracy of local marginal effects

As detailed in Dearmon and Smith (2017), a key attractive feature of GPR-FULL is its ability to predict not only \(E(y)\) values at out-of-sample points but also to estimate the local rates of change of these values with respect to key explanatory variables, i.e., the Local Marginal Effects (LMEs) given by expression (9) above. In particular, for the specific squared exponential kernel in expression (4), it follows by direct calculation that for prediction points, \(X_{*} = [x_{*i} :i = 1, \ldots ,q]\),

$$\frac{{\partial K\left( {x_{*l} ,X} \right)}}{{\partial x_{*l,j} }} = - \tfrac{1}{{\tau_{j}^{2} }}\left[ {k\left( {x_{l} ,x_{*1} } \right)\left( {x_{lj} - x_{*1j} } \right), \ldots ,k\left( {x_{l} ,x_{*q} } \right)\left( {x_{lj} - x_{*qj} } \right)} \right],\quad j = 1, \ldots ,d$$
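To illustrate the kind of calculation involved, the following Python (NumPy) sketch evaluates a squared exponential kernel row and its gradient with respect to a single prediction point. Since expression (4) is not reproduced in this section, the kernel form \(k(x_{*} ,x_{l} ) = v\exp [ - \tfrac{1}{2}\sum\nolimits_{j} (x_{*j} - x_{lj} )^{2} /\tau_{j}^{2} ]\) used here is an assumption, and the function names are hypothetical.

```python
import numpy as np

def sq_exp_kernel(x_star, X, v, tau):
    """Kernel row k(x_star, x_l) for every training point x_l (rows of X).

    Assumed form: v * exp(-0.5 * sum_j ((x_star_j - x_lj) / tau_j)^2).
    """
    d = (x_star - X) / tau
    return v * np.exp(-0.5 * np.sum(d ** 2, axis=1))

def sq_exp_kernel_grad(x_star, X, v, tau):
    """Gradient of the kernel row with respect to x_star.

    Returns an (n, d) array whose column j holds
    d k(x_star, x_l) / d x_star_j = -k(x_star, x_l) * (x_star_j - x_lj) / tau_j^2.
    """
    k = sq_exp_kernel(x_star, X, v, tau)
    return -(k[:, None] * (x_star - X)) / tau ** 2
```

Each column of the returned array is the derivative of the full kernel row with respect to one explanatory variable, which is the quantity that enters the LME calculation in (9).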

Here we consider how well the more scalable GPR-HCA version captures these same effects (Footnote 12). With respect to computation times for LME predictions, it is enough to note that these times now depend not only on the particular batch scheme employed, but also on the number of explanatory variables considered (Footnote 13). Other than these more complex dependencies, the results for our simulation model (not shown) continue to be linear, and are qualitatively similar to the linear graph for HCA predictions in Fig. 11a.

Of more interest for our present purposes is the accuracy of these LME predictions. As with the comparisons of \(E(y)\) and \(\hat{E}(y)\) in Fig. 8 above, the quality of LME predictions is best seen visually. In Figs. 13 and 14 below we compare contour plots of the exact LMEs for this simulation model with their associated predictions based on GPR-HCA. If the true partial derivatives with respect to each variable, \(x_{j}\), [given in (43) and (44)] are denoted by \(LME(y|x_{j} )\,\), and if the associated estimates based on HCA [obtained from (9) together with (45)] are denoted by \(\widehat{LME}(y|x_{j} )\), then the contour plots for \(LME(y|x_{1} )\,\) and \(\widehat{LME}(y|x_{1} )\,\) are shown in Fig. 13, and those of \(LME(y|x_{2} )\,\) and \(\widehat{LME}(y|x_{2} )\,\) are shown in Fig. 14.

These two figures suggest that the ability of GPR-FULL to capture local marginal effects is indeed well preserved by GPR-HCA (even with only 150 landmark points at each hierarchy level). Moreover, while the smooth nature of our present two-dimensional example allows these derivatives to be easily plotted and visualized, the empirical example developed in the next section shows that such local marginal effects can also be identified in more realistic multi-dimensional applications (Footnote 14).

Empirical application: housing prices in Oklahoma County

In this final section, GPR-HCA is applied to residential parcels found in the Oklahoma County Assessor’s Database. As seen in Panel (a) of Fig. 15 below, Oklahoma County is centrally located within the state and contains several cities including Edmond, Bethany, and Nichols Hills as well as the largest portion of the state capital, Oklahoma City. From a real estate perspective, Oklahoma City represents a secondary or tertiary investment market; significant heterogeneity exists across parcels, more than would typically be present in a dense tier-one urban corridor.

Fig. 15

a Map of Counties in Oklahoma, b Map of Census Tracts in Oklahoma County

The spatially varying size of census tracts seen in Panel (b) is suggestive of this heterogeneity. The downtown core is found in the densest area of tracts. But this density dissipates as one moves outwards towards the borders of the county, especially to the North and East. This heterogeneity presents some unique modeling challenges that are largely absent from our well-behaved simulation above.

Technical considerations

Consequently, some key enhancements are necessary to improve HCA’s effectiveness for this real-world application. Of particular concern is the stability of the optimization routine, an issue previously noted by Chen and Stein (2017). With respect to Matlab in particular, we have found that greater stability can be achieved in all HCA algorithms by replacing the standard inverse operator, inv, with the backslash operator, which is better suited to near-singular matrices (Footnote 15).
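The same idea carries over to other numerical environments. As a rough analogue (not the Matlab code used in the paper), the Python (NumPy) sketch below solves a linear system by factorization rather than by forming an explicit inverse. The helper `solve_regularized` and its `jitter` argument are hypothetical, with `jitter` standing in for a small regularizing term added to the diagonal.

```python
import numpy as np

def solve_regularized(K, b, jitter=1e-6):
    """Solve (K + jitter * I) x = b by LU factorization, without forming an inverse."""
    A = K + jitter * np.eye(K.shape[0])
    return np.linalg.solve(A, b)

# Illustration on a smooth, and hence ill-conditioned, kernel matrix.
pts = np.linspace(0.0, 1.0, 50)
K = np.exp(-0.5 * ((pts[:, None] - pts[None, :]) / 0.2) ** 2)
b = np.sin(6.0 * pts)

x = solve_regularized(K, b)
A = K + 1e-6 * np.eye(50)
residual = np.linalg.norm(A @ x - b)   # how well the computed x satisfies the system
```

Solving the system directly generally gives smaller residuals than multiplying by a computed inverse when the matrix is close to singular, and it also avoids the extra cost of forming the inverse.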

As a secondary enhancement, a more judicious selection of landmark points is used here. Rather than randomly drawing points (as done in the simulations above), we opt for k-means clustering of our x values using scaled distances based on preliminary length-scale estimates (Footnote 16). The number of clusters is set to the desired number of landmark points, and the training point nearest the centroid of each cluster is selected as the landmark point for that cluster. This ensures a diverse spread of points across the appropriate region of the tree (Footnote 17).
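The selection rule just described can be sketched in a few lines of Python (NumPy). This is a simplified illustration under stated assumptions, not the procedure used for the results reported here: it runs a plain Lloyd-type k-means on a single set of points, it scales each coordinate by dividing by a preliminary length-scale vector `tau`, and the function name `select_landmarks` and its arguments are hypothetical.

```python
import numpy as np

def select_landmarks(X, tau, r, n_iter=50, seed=0):
    """Return indices of r training points nearest to k-means centroids of X / tau.

    X   : (n, d) array of training attributes
    tau : (d,) preliminary length scales used to scale distances (assumption)
    r   : desired number of landmark points (number of clusters)
    """
    rng = np.random.default_rng(seed)
    Z = X / tau                                              # scaled coordinates
    centroids = Z[rng.choice(len(Z), size=r, replace=False)].copy()
    for _ in range(n_iter):
        # squared scaled distances from every point to every centroid: (n, r)
        d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for c in range(r):
            members = Z[labels == c]
            if len(members) > 0:                             # keep old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=0)                                 # nearest training point per centroid
```

The returned indices identify actual training points, so the landmark covariances can be read directly from the training data. The dense distance array makes this sketch suitable only for moderate n, and in rare cases two centroids may share the same nearest point.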

Even with this enhanced selection procedure, it was found (by cross validation) that substantial gains in accuracy could be achieved by increasing the number of landmark points to 1250 at each level. Finally, it was also found that when larger numbers of landmark points are used, a two-stage estimation procedure can often improve efficiency. By using a small number of landmark points in a first stage to obtain initial parameter estimates, convergence times for a larger number of landmark points in a second stage can be substantially reduced.

Housing data

Data for this application are taken from the Oklahoma County Assessor’s database from 2010 to 2018 (as well as 2019 for building permit information); most of these are certification databases required for assessments. To minimize data errors and outlier events, we focus on residential sales greater than $20,000 and involving houses of more than 100 square feet on 3 acres or less. The training dataset consists of 110,837 residential sales, and is used to predict sales prices for 220,030 parcels as if sold in June 2018. For purposes of this exercise, just eight explanatory variables are used for price prediction: sale date, locational coordinates, lot size, square feet, year built, neighborhood code, and subdivision id (Footnote 18). Summary statistics for these datasets are provided in Table 1. Prediction data are, on average, associated with older, smaller homes located in more established neighborhoods. Sales data are consistent with the idea of suburban residential development where substantial numbers of new, large homes are developed and sold, pushing up the square feet and year built.

Table 1 Summary statistics for housing data

Turning next to model results, we begin in Fig. 16 with a spatial comparison of GPR-HCA predictions and corresponding Assessor assigned market values for each of the 220,030 parcels.

Fig. 16

a Sales Values predicted by GPR-HCA, b Oklahoma County Assessor Market Values (the yellow and green boxes are discussed below) (Color figure online)

From a visual perspective, the GPR-HCA predictions are seen to match quite well with County Assessor’s Market Values (with the mean predicted value, $167,000, slightly higher than the mean Assessor value, $163,000).

We next compare out-of-sample performance across the three techniques: GPR-HCA, GBM and NNGP. To do so, recall first that only assessed market values are available for the 220,030 prediction parcels shown in Fig. 16b. So, in order to compare model performance in terms of actual sales data, we randomly partition the full training set of 110,837 parcels with sales data into a smaller training set of 75,000 parcels, and an out-of-sample test set of 35,837 parcels. Mean Absolute Error (MAE) performance is then evaluated based on the match between out-of-sample predictions and corresponding sales values for parcels in this test set. For NNGP, these predictions use the spNNGP package in R with 15 neighbors and 10,000 MCMC draws. For GBM, the R package GBM/dismo is used with 100,000 trees. Results are displayed in Fig. 17 below. Here it is clear that the error distribution for GPR-HCA is more peaked around zero, with an MAE of $31,113 versus $33,275 for NNGP and $36,161 for GBM.

Fig. 17

a Histogram of prediction errors (sales–predicted) for GPR-HCA (MAE = $31,113), b Histogram of prediction errors for NNGP (MAE = $33,275), c Histogram of prediction errors for GBM (MAE = $36,161)

However, in the comparative plots of Predicted Values against Sales Values in Fig. 18 for GPR-HCA versus NNGP (Panel A) and versus GBM (Panel B) [again with blue denoting GPR-HCA predictions], GPR-HCA exhibits a few underestimation errors that are noticeably more extreme than those of either NNGP or GBM (Footnote 19). We return to this issue in the concluding remarks. But for the present we simply note that these outliers involve less than 0.01% of the entire sample.

Fig. 18

a Predicted versus Sales Values in millions of dollars, where blue points denote GPR-HCA predictions and red points denote NNGP predictions, b similar plot with red points now denoting GBM predictions (Color figure online)

Local marginal effects

We now turn our attention to a local marginal effect analysis for the full prediction set of 220,030 parcels. This represents a key contribution of the present work. For this empirical exercise, we focus on the local marginal effect of square feet (LME_sqft), which is the estimated impact of an additional square foot on sales price, given the other attributes of a house. We obtained estimates of LME_sqft for all prediction parcels. The overall distribution of these values, shown in Panel (a) of Fig. 19, is seen to be roughly normally distributed about a mean value of $64.55 (which is just below the low end of per-square-foot remodeling costs for adding new square footage; see Footnote 20). We also show the spatial distribution of LME_sqft values in Panel (b). A comparison with Fig. 16 above suggests that such magnitudes are sensitive to location, and that larger magnitudes of LME_sqft are roughly associated with higher home prices.

Fig. 19

a Frequency distribution of LME_Sqft for the 220,030 prediction parcels in Oklahoma County (with positive values in shades of red, and negative values in blue). b Spatial distribution of LME_Sqft for these parcels (boxes are repeated from Fig. 16) (Color figure online)

More importantly, with the finer resolution implicit in LME analysis, we can now uncover more nuanced and detailed economic phenomena that would have been obscured by less granular methods. As examples, we focus on two smaller areas within Oklahoma County which appear to involve somewhat different aspects of economic development. In view of space limitations, we provide only an informal examination of these aspects.

We begin with the densely populated area of Oklahoma City shown in Fig. 20 (corresponding to the green box in Fig. 16). Here residences are characterized by small lots laid out on a fairly uniform grid. The top two panels show predicted and assessed values in this area, again reflecting the goodness of fit seen at the county-wide level in Fig. 16 [where the smaller prices in the legend of panel (b) here reflect the sparsity of homes above $1 million in this area]. The most expensive homes in the southeast corner are just north of downtown, and consist of the two historic neighborhoods, Heritage Hills and Mesta Park. (The red neighborhoods further north, Edgemere and Crown Heights, are also historic areas.) Turning to estimates of LME_sqft in panel (c) [corresponding to the green box in Fig. 19] we see that within the highest price Heritage Hills area (denoted by the yellow ellipse), there are a number of negative LME_sqft values shown in blue. This is indicative of the large homes found in this historic neighborhood, where further expansion is evidently less attractive. In fact, the largest house in our training dataset, at a size of over 20,000 square feet, is located in this neighborhood.

Fig. 20

a Sales Values predicted by GPR-HCA, b Oklahoma County Assessor Market Values, c LME_sqft estimates (with yellow ellipse denoting the highest priced area), and d Building Permits issued in 2018–2019 (Color figure online)

But on the north and west peripheries of this area one sees more uniform positive values of LME_sqft, where proximity to both this higher priced housing area and downtown appears to offer attractive expansion opportunities. This is further supported by data on building permits for the same period [panel (d); see Footnote 21], which show that such permits are most highly concentrated in the same area. As one moves further away to the north and west, both LME_sqft values and the density of building tend to decrease. Taken together, the results are strongly suggestive of the spatial-spillover effects widely studied in the housing literature (see, for example, Defusco et al. 2018). But while such effects are typically analyzed at a broader regional scale (Footnote 22), the present results suggest that LME analysis can provide meaningful information at the local neighborhood level.

While spatial spillovers are associated with trends in LME_sqft values at the neighborhood level and higher, there are also more localized development opportunities associated with individual homes or parcels. One type of local development, referred to in the planning literature as spatial-infill development (Landis et al. 2006; Daisa and Parker 2009; McConnell and Wiley 2010), includes both the development of vacant land in nearly built-up areas and the redevelopment of underutilized parcels. Such development is driven less by spatial trends in housing prices than by local variation in such prices. Adjacent parcels exhibiting a high degree of price variation may have significant differences that can be exploited by developers for profit. For commercial properties, an empty parcel of land sandwiched between two urban high rises is the most obvious example (Footnote 23). For residential properties, such differences can be more subtle. For example, older and smaller homes might actually be demolished to make room for more stately homes, provided their locations are in highly desirable areas. Here one might expect smaller homes to exhibit positive LME_sqft effects on price, especially when in close proximity to larger more expensive homes. Moreover, if these larger homes themselves tend to be overbuilt, the effect of an additional square foot might in fact be negative, leading to high local variation in such values.

In our present data, a good example is provided by the small city of Nichols Hills just north of Oklahoma City as shown in panel (b) of Fig. 21 below (the slightly larger region shown in the other three panels corresponds to the yellow boxes in Figs. 16 and 19). The median housing value ($686,300) in this wealthy community is more than four times that of Oklahoma City. The highest priced homes (over $1 million) in Panel (a) are seen from Panel (b) to be clustered around the golf course on the left and the smaller park on the right. The corresponding values of LME_sqft in Panel (c) exhibit much more extreme volatility than those of Fig. 20 above, and in particular, contain many more negative values. The large size of these homes is also evident from the large lots seen in this area. Finally, the building permits shown in Panel (d) are seen to be clustered in and around this same area. So, as indicated in the discussion above, the presence of such price volatility may indeed be creating new opportunities for development.

Fig. 21

a Sales Values predicted by GPR-HCA, b Street Map of Nichols Hills, c LME_sqft estimates, and d Building Permits issued in 2018–2019

While such conjectures clearly require further analysis, the purpose of these examples is mainly to illustrate how this GPR-HCA model and its corresponding LME estimates can in principle be used to quickly identify possible areas for new development in large data sets. Finally, while we have here focused explicitly on the identification of development opportunities in a real estate context, it should be clear that a wide range of additional spatial applications are possible.

Conclusions and directions for further research

In this paper we have systematically developed the hierarchical covariance approximation to Gaussian process regression (GPR-HCA) created by Chen and his co-workers ([C1], [C2]), and have extended this method to include analyses of the local marginal effects (LMEs) generated by this model. Our main objective has been to show how this scalable extension of GPR can be applied to large spatial data sets, such as county assessor data. In particular, we have applied this model to county assessor data for Oklahoma County, where it was shown that the estimates of both price predictions and local marginal effects generated by GPR-HCA can be used to analyze such data at scales never before possible with standard GPR.

However, the present analysis leaves certain important questions unanswered. A first issue relates to the apparent instability of predictions for extreme values. Investigations with smaller subsets of the Oklahoma data show that this is a problem with GPR-FULL itself, and is not simply a feature of GPR-HCA. In the case of negative predictions, it should be noted that (as with ordinary regression) the fundamental Gaussian assumption itself necessarily allows negative predictions. The standard approach here is to analyze the log of the dependent variable, and convert back to make predictions. But conditional means of log-normal variates do not exhibit the same scalability properties as those of normal variates, and would require extensive parallel computing in order to be implemented for large data sets. However, for housing prices in particular, one possible alternative is to replace standard conditional-mean prediction with predictors more closely related to the common real estate practice of forming offer prices based on weighted averages of recent similar sales (known as “comps”). Initial results using GPR covariances as “similarity weights” appear to be promising, and will be reported in a subsequent paper.

A more fundamental issue that has important consequences for both practitioners and researchers is the treatment of uncertainty in statistical decision making. For example, with respect to the parcel-level investment decisions discussed in our Oklahoma application, measures of uncertainty could help individuals sift through thousands of parcels to identify investment opportunities with higher risk-adjusted rates of return.

But while the GPR model itself does allow for some degree of uncertainty in terms of the predictive distribution in expressions (6)–(8) above, no corresponding posterior distributions are available for either the derivatives of these predictions [i.e., the LMEs in expression (9)], or for the basic parameter estimates, \(\hat{\theta }\), underlying the model itself. While it is in principle possible to use GPR-HCA to approximate posterior distributions for all such quantities in terms of Markov Chain Monte Carlo methods, such an approach currently requires extensive use of parallel computing across many servers. Thus, a key task remaining for desktop applications is to develop direct approximations to the posterior distributions of both parameter estimates and LMEs. One possibility here is the following two-stage approach. First, by applying standard asymptotic likelihood approximations to the joint posterior distribution of \(\hat{\theta }\) (and employing certain extensions of the computational procedures sketched in Sect. 3.4), it is possible to obtain scalable approximations of this distribution. Second, by applying the Delta method to LMEs (as continuously differentiable functions of \(\theta\)), it is possible to obtain corresponding scalable approximations of LME posteriors as well. This approach will be developed in detail in a subsequent paper.
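The second stage of this two-stage approach rests on the standard Delta method: if \(\hat{\theta }\) is approximately normal with covariance matrix \(\Sigma\), then a smooth scalar function \(g(\hat{\theta })\) has approximate variance \(\nabla g^{\prime}\,\Sigma \,\nabla g\). The Python sketch below illustrates only this calculation (Python is used here for illustration, although the paper's own code is in Matlab). The function `toy_lme`, the parameter values, and the covariance matrix are hypothetical placeholders rather than quantities from the paper, and the gradient is taken by finite differences rather than from analytic LME derivatives.

```python
import numpy as np


def numerical_gradient(g, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function g at theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (g(theta + step) - g(theta - step)) / (2.0 * eps)
    return grad


def delta_method_variance(g, theta_hat, cov_theta, eps=1e-6):
    """Approximate var(g(theta_hat)) by grad' * cov_theta * grad."""
    grad = numerical_gradient(g, theta_hat, eps)
    return float(grad @ cov_theta @ grad)


def toy_lme(theta):
    # Hypothetical smooth function standing in for an LME; not from the paper.
    return theta[0] * np.exp(-theta[1])


# Hypothetical parameter estimates and their asymptotic covariance matrix.
theta_hat = np.array([2.0, 0.5])
cov_theta = np.array([[0.04, 0.01],
                      [0.01, 0.09]])

var_lme = delta_method_variance(toy_lme, theta_hat, cov_theta)
se_lme = np.sqrt(var_lme)
print("approximate LME standard error:", se_lme)
```

In an actual application, `cov_theta` would come from the first-stage likelihood approximation, and `g` would evaluate the LME of interest at a given parcel.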

A final question relates to model uncertainty itself. In the present paper, we have implicitly assumed that all key explanatory variables are known, and that only their relative contributions remain to be determined. However, in a previous paper (Dearmon and Smith 2016), the GPR model was combined with Bayesian model averaging (BMA) to allow both predictions and LMEs to be averaged over sub-models involving different possible subsets of variables. Such GPR-BMA models are of course even more limited in terms of scalability. But the present GPR-HCA model is directly extendable to this BMA framework, and will be developed in a subsequent paper.

Data availability

The Oklahoma County Assessor has database exports that are available from their office upon request and can be used for empirical research.

Code availability

Matlab; GPstuff; custom coding in Matlab.


  1. 1.

    To allow comparability of length scales, individual attribute variables are implicitly assumed to be standardized.

  2. 2.

    Note in particular from (4) above that for test locations, \(x_{*l}\), far from all training locations,\(X = (x_{1} , \ldots ,x_{n} )\), the covariance vector, \(K(x_{*l} ,X)\), must approach zero. This in turn implies from (7) together with model (1) that the corresponding predictions, \(\hat{y}_{l} = E(Y_{*l} |Y) = E(f_{*l} |Y)\,\, + \mu\), must necessarily approach the mean, \(\mu\). Such (extrapolated) predictions thus exhibit “mean reversion”.

  3. 3.

    In this regard, the most popular alternative kernel function, namely the simple exponential kernel (as for example in Genton 2001), is overly sensitive to small differences between similar attribute profiles. Within the larger family of Matern kernels, the squared exponential kernel is also the simplest to analyze from both estimation and inference perspectives.

  4. 4.

    These are also referred to as “inducing” points (as for example in Rasmussen and Quinonero-Candela 2005).

  5. 5.

    Note that whenever \(X_{i} \cap X_{r} \ne \emptyset\), the conditional covariance matrix, \(K_{ii} - K_{ir} \,K_{rr}^{ - 1} K_{ri}\), in (14) must be singular. But as will be seen in footnote 6 below, this has no substantive consequences for the model constructed.

  6. 6.

    As seen in (22) below, the random vectors, \(H_{1}\) and \(H_{2}\), have full rank covariance matrices, and thus are properly multi-normally distributed even when \(Z_{1|r}\) and \(Z_{2|r}\) are singular multi-normal [see for example Anderson (1958, Theorem 2.4.5)].

  7. 7.

    Blake (2015). Fast and Accurate Symmetric Positive Definite Matrix Inverse, Matlab Central File Exchange.

  8. 8.

    In fact, the temperature application of Chen and Stein [C2] involves more than 2 million observations.

  9. 9.

    The explicit sample sizes shown are [10,000, 20,000, 40,000, 80,000, 160,000, 320,000, 500,000].

  10. 10.

    Computation times for HCA automatically include calculations of Local Marginal Effects at each prediction point (which are not directly relevant for either NNGP or GBM). But these add little in the way of time differences.

  11. 11.

    These predictions are computed for a regular grid of points in \([0,1]^{2}\) and, in a manner similar to Fig. 8a, contours are then interpolated and plotted using the Matlab program, contour.m. Similar procedures are used to obtain Figs. 13b and 14b below.

  12. 12.

    It should also be noted that “local marginal effects” are much more problematic for both NNGP and GBM models, and are not considered here. In NNGP, the dominant effects of predictors are regression-like global mean effects with only the spatial error term modeled as a nearest-neighbor Gaussian process. In GBM, prediction surfaces are either locally flat or at best governed by the type of weak-learner functions used. So local behavior is of less interest in either of these prediction models.

  13. 13.

    In particular, by adding 'irrelevant' variables to our simulation model, experiments showed that the combined time for computing predictions together with LMEs for any given variable is approximately linear in the total number of variables. In the present case with 12,569 sample points, 150 landmark points, and using only in-sample calculations of LMEs, the added time for each new irrelevant variable was approximately 30 s.

  14. 14.

    LME performance deteriorates under higher error variance. Within this simulation setting, we multiplied the error standard deviation of 0.5 by a scaling factor which ranged from 0.5 to 2.5 in steps of 0.5. Regression results suggest that, for every one-unit increase in this scaling factor, MAE increases by 0.06 for the LME of \(x_{1}\) and by 0.12 for the LME of \(x_{2}\).

  15. 15.

    This is consistent with Matlab’s own findings that “Using A\b instead of inv(A)*b is two to three times faster, and produces residuals on the order of machine accuracy relative to the magnitude of the data”, (

  16. 16.

    For well-behaved data sets, a random selection of landmark points is probably sufficient and much less costly to execute. But, as noted by Chen and Stein (2021, p. 15), in more challenging cases such as our present housing-price data, a careful selection of landmarks can substantially reduce approximation errors. The primary drawback is computation time, since tree construction is much more costly with k-means. For a simulated data set with 311,422 observations and 500 landmark points, random selection took 28 s while k-means took almost 5 min. Nonetheless, k-means is an often-used way of selecting inducing points; see for example (Park and Choi 2010; Hensman et al. 2015).

  17. 17.

    Changes of a more indirect nature are to allow tree indexing on different sets of attributes. For our present purposes, we partitioned on all variables, except sale year, which made the grouping more spatial than temporal.

  18. 18.

    Assessor data also provides estimates of market value for each parcel. These are only used in the assessment of predictive model performance.

  19. 19.

    Note in particular the row of blue points with underestimated values (\(\approx \,\)$172,000) that roughly correspond to the sample mean price of the data. As mentioned in footnote 2 above, these are instances of parcels with extreme attribute profiles involving extrapolated price predictions that exhibit mean reversion. However, such outliers are usually identified easily and can be analyzed separately.

  20. 20.

    A cursory search suggests that the lower bound on a room addition is about $80 per square foot (as for example in,, and

  21. 21.

    This building permit data is taken from the County Assessor’s database, with dates issued in 2018 or later. Building costs for permits in our data set all exceed $5,000.

  22. 22.

    One exception is the recent paper by Cohen and Zabel (2020) which analyzes such spillover effects at the census tract level in the Greater Boston Area.

  23. 23.

    Such situations do not usually occur in tier-one markets.
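The mean reversion described in footnote 2 can be illustrated numerically. The following Python sketch assumes the usual GPR conditional-mean predictor, \(\mu + K(x_{*} ,X)\,[K(X,X) + \sigma^{2} I]^{ - 1} (y - \mu )\), with a squared exponential kernel. The data, length scale, noise variance, and the names `sq_exp_kernel` and `predict` are all hypothetical choices made for illustration, not values or code from the paper.

```python
import numpy as np


def sq_exp_kernel(A, B, variance=1.0, length_scale=0.2):
    """Squared exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return variance * np.exp(-0.5 * d2 / length_scale**2)


# Hypothetical training data on the unit square with known mean mu.
rng = np.random.default_rng(0)
n = 30
X = rng.uniform(0.0, 1.0, size=(n, 2))
mu = 5.0
y = mu + np.sin(4.0 * X[:, 0]) + 0.1 * rng.standard_normal(n)
noise_var = 0.01

# Solve the linear system once rather than forming an explicit inverse.
K = sq_exp_kernel(X, X) + noise_var * np.eye(n)
alpha = np.linalg.solve(K, y - mu)


def predict(X_star):
    """Conditional-mean prediction at the rows of X_star."""
    return mu + sq_exp_kernel(X_star, X) @ alpha


near = predict(np.array([[0.5, 0.5]]))    # inside the training region
far = predict(np.array([[10.0, 10.0]]))   # far from every training point

print("prediction near the data:", near[0])
print("prediction far from the data:", far[0], "(mean is", mu, ")")
```

Because every kernel value between the distant test point and the training points is numerically zero, the far prediction collapses to \(\mu\), which is the behavior footnote 19 points to for parcels with extreme attribute profiles.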


  1. Anderson TW (1958) Introduction to multivariate statistical analysis. Wiley, New York

  2. Chen J, Avron H, Sindhwani V (2017) Hierarchically compositional kernels for scalable nonparametric learning. J Mach Learn Res 18:1–42

  3. Chen J, Stein ML (2017) Linear-cost covariance functions for Gaussian random fields. arXiv:1711.05895

  4. Chen J, Stein ML (2021) Linear-cost covariance functions for Gaussian random fields. J Am Stat Assoc, pp 1–43

  5. Cohen JP, Zabel J (2020) Local house price diffusion. Real Estate Econ 48:710–743

  6. Daisa JM, Parker T (2009) Trip generation rates for urban infill land uses in California. ITE J 79(6):30–39

  7. Datta A, Banerjee S, Finley AO, Gelfand AE (2016) Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J Am Stat Assoc 111(514):800–812

  8. Dearmon J, Smith TE (2016) Gaussian process regression and Bayesian model averaging: an alternative approach to modeling spatial phenomena. Geogr Anal 48:82–111

  9. Dearmon J, Smith TE (2017) Local marginal analysis of spatial data: a Gaussian process regression approach with Bayesian model and kernel averaging. Spat Econom Qual Limit Depend Var 37:297–342

  10. DeFusco A, Ding W, Ferreira F, Gyourko J (2018) The role of price spillovers in the American housing boom. J Urban Econ 108:72–84

  11. Finley AO, Datta A, Banerjee S (2020) spNNGP R package for nearest neighbor Gaussian process models. arXiv:2001.09111v1 [stat.CO]

  12. Genton MG (2001) Classes of kernels for machine learning: a statistics perspective. J Mach Learn Res 2:299–312

  13. Hensman J, Matthews AG, Filippone M, Ghahramani Z (2015) MCMC for variationally sparse Gaussian processes. In: Advances in neural information processing systems, pp 1648–1656

  14. Landis JD, Hood H, Li G, Rogers T, Warren C (2006) The future of infill housing in California: opportunities, potential, and feasibility. Hous Policy Debate 17(4):681–725

  15. McConnell V, Wiley K (2010) Infill development: perspectives and evidence from economics and planning. Resour Fut 10:1–34

  16. Park S, Choi S (2010) Hierarchical Gaussian process regression. In: Proceedings of 2nd Asian conference on machine learning, pp 95–110

  17. Rasmussen C, Quinonero-Candela J (2005) A unifying view of sparse approximate Gaussian process regression. J Mach Learn Res 6:1939–1959

  18. Ridgeway G (2007) Generalized boosted models: a guide to the gbm package. Update 1(1):2007

  19. Vanhatalo J, Riihimäki J, Hartikainen J, Jylänki P, Tolvanen V, Vehtari A (2013) GPstuff: Bayesian modeling with Gaussian processes. J Mach Learn Res 14(Apr):1175–1179


Not applicable.

Author information



Corresponding author

Correspondence to Jacob Dearmon.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



To show that expressions (34) and (37) are indeed the actual covariances of the random vectors in expression (33) of the text, it is convenient to introduce further simplifying notation. For each possible root path, \(i_{1} \, \to \,i_{2} \, \to \, \cdots \, \to \,i_{m - 1} \, \to \,i_{m} \to \,r\), let \(H_{{i_{1} \,i_{2} \, \cdots \,i_{m} \,r}}\) be defined recursively for paths of length one by

$$H_{{i_{1} \,r}} \, = \,Z_{{i_{1} \,|\,r}} \, + \,A_{{i_{1} \,r}} \,Z_{r} \tag{A.1}$$

[as in (18) of the text] and for longer paths by

$$H_{{i_{1} \,i_{2} \, \cdots \,i_{m} \,r}} \, = \,Z_{{i_{1} \,|\,i_{2} }} + \,A_{{i_{1} \,i_{2} \,}} H_{{i_{2} \cdots \,i_{m} \,r\,}}$$

Then, (A.1) together with the argument in (14) through (17) of the text again shows that for paths of length one,

$${\text{cov}} \left( {H_{{i_{1} \,r}} } \right)\, = \,K_{{i_{1} \,i_{1} }}$$

So if it is hypothesized that

$${\text{cov}} (H_{{i_{1} \, \cdots \,i_{m} \,r}} ) = K_{{i_{1} \,i_{1} }} \tag{A.4}$$

holds for all paths of length \(m\), then for paths of length \(m + 1\) it follows from the independence of \(Z_{{i_{1} \,|\,i_{2} }}\) and \(H_{{i_{2} \cdots \,i_{m + 1} \,r\,}}\), together with (A.4), that [again from the argument in (14) through (17) in the text],

$$\begin{aligned} {\text{cov}} \left( {H_{{i_{1} \,i_{2} \, \cdots \,i_{m} \,i_{m + 1} \,r}} } \right) & = {\text{cov}} \left( {Z_{{i_{1} \,|\,i_{2} }} + A_{{i_{1} \,i_{2} \,}} H_{{i_{2} \cdots \,i_{m + 1} \,r\,}} } \right) \\ & = {\text{cov}} \left( {Z_{{i_{1} \,|\,i_{2} }} } \right) + A_{{i_{1} \,i_{2} }} {\text{cov}} \left( {H_{{i_{2} \cdots \,i_{m + 1} \,r}} } \right)A_{{i_{2} \,i_{1} }} \\ & = K_{{i_{1} \,i_{1} }} - K_{{i_{1} \,i_{2} }} K_{{i_{2} \,i_{2} }}^{ - 1} K_{{i_{2} \,i_{1} }} + \left( {K_{{i_{1} \,i_{2} }} K_{{i_{2} \,i_{2} }}^{ - 1} } \right)\left[ {K_{{i_{2} \,i_{2} }} } \right]\left( {K_{{i_{2} \,i_{2} }}^{ - 1} K_{{i_{2} \,i_{1} }} } \right) \\ & = K_{{i_{1} \,i_{1} }} - K_{{i_{1} \,i_{2} }} K_{{i_{2} \,i_{2} }}^{ - 1} K_{{i_{2} \,i_{1} }} + K_{{i_{1} \,i_{2} }} K_{{i_{2} \,i_{2} }}^{ - 1} K_{{i_{2} \,i_{1} }} = K_{{i_{1} \,i_{1} }} \\ \end{aligned} \tag{A.5}$$

So by induction, (A.4) must hold for all m. But for any leaf, \(i\), with root path, \(i\, \to i_{1} \, \to \cdots \, \to i_{m} \to \,r\), this implies at once that

$${\text{cov}} \left( {H_{i} } \right) = {\text{cov}} \left( {H_{{i\,\,i_{1} \cdots i_{m} \,r}} } \right) = K_{i\,i}$$

and thus that expression (34) in the text must hold.

It remains to establish expression (37) in the text for any distinct leaves, \(i\) and \(j\), with root paths as in (35) and (36) (where again this is taken to include the case, \(s = r\)). To do so, we first expand \(H_{i} = H_{{i\,\,i_{1} \cdots i_{p} \,s\,h_{\,1} \, \cdots \,h_{m} \,r}}\) and \(H_{j} = H_{{j\,\,j_{1} \cdots j_{q} \,s\,h_{\,1} \, \cdots \,h_{m} \,r}}\) as follows:

$$H_{i} = Z_{{i\,|\,i_{1} }} + A_{{i\,i_{1} }} Z_{{i_{1} \,|\,i_{2} }} + \left( {A_{{i\,i_{1} }} A_{{i_{1} \,i_{2} }} } \right)Z_{{i_{2} \,|\,i_{3} }} + \cdots + \left( {A_{{i\,i_{1} }} A_{{i_{1} \,i_{2} }} \cdots A_{{i_{p - 1} \,i_{p} }} } \right)Z_{{i_{p} \,|\,s}} + \left( {A_{{i\,i_{1} }} A_{{i_{1} \,i_{2} }} \cdots A_{{i_{p} \,s}} } \right)H_{{s\,h_{\,1} \cdots h_{m} \,r}}$$
$$H_{j} = Z_{{j\,|\,j_{1} }} + A_{{j\,j_{1} }} Z_{{j_{1} \,|\,j_{2} }} + \left( {A_{{j\,j_{1} }} A_{{j_{1} \,j_{2} }} } \right)Z_{{j_{2} \,|\,j_{3} }} + \cdots + \left( {A_{{j\,j_{1} }} A_{{j_{1} \,j_{2} }} \cdots A_{{j_{q - 1} \,j_{q} }} } \right)Z_{{j_{q} \,|\,s}} + \left( {A_{{j\,j_{1} }} A_{{j_{1} \,j_{2} }} \cdots A_{{j_{q} \,s}} } \right)H_{{s\,h_{\,1} \cdots h_{m} \,r}}$$

Next recall that since the random vectors \((Z_{{i\,|\,i_{1} }} ,Z_{{i_{1} \,|\,i_{2} }} ,\, \ldots ,\,Z_{{i_{p} \,|\,s}} ,\,Z_{{j\,|\,j_{1} }} ,Z_{{j_{1} \,|\,j_{2} }} ,\, \ldots ,\,Z_{{j_{q} \,|\,s}} ,H_{{s\,h_{\,1} \cdots h_{m} \,r}} )\) are all independent, it follows [as for example in (31) of the text] that all covariance terms between \(H_{i}\) and \(H_{j}\) are zero except for the shared term involving \(H_{{s\,h_{\,1} \cdots h_{m} \,r}}\), so that,

$$\begin{aligned} {\text{cov}} \left( {H_{i} ,H_{j} } \right) & = {\text{cov}} [(A_{{i\,i_{1} }} A_{{i_{1} \,i_{2} }} \cdots A_{{i_{p} \,s}} )\,H_{{s\,h_{\,1} \cdots h_{m} \,r}} \,,\,\,(A_{{j\,j_{1} }} A_{{j_{1} \,j_{2} }} \cdots A_{{j_{q} \,s}} )\,H_{{s\,h_{\,1} \cdots h_{m} \,r}} ] \\ & = \left( {A_{{i\,i_{1} }} A_{{i_{1} \,i_{2} }} \cdots A_{{i_{p} \,s}} } \right){\text{cov}} \left( {H_{{s\,h_{\,1} \cdots h_{m} \,r}} } \right)\left( {A_{{s\,j_{q} }} \cdots A_{{j_{2} \,j_{1} }} A_{{j_{1} \,j}} } \right) \\ \end{aligned}$$

But this implies at once from (A.5) that

$${\text{cov}} \left( {H_{i} ,H_{j} } \right) = A_{{i\,i_{1} }} A_{{i_{1} \,i_{2} }} \cdots A_{{i_{p} \,s}} \left( {K_{ss} } \right)A_{{s\,j_{q} }} \cdots A_{{j_{2} \,j_{1} }} A_{{j_{1} \,j}}$$

and thus that expression (37) must hold.
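The induction argument above can be checked numerically on a small two-level tree, with leaves \(i\) and \(j\) sharing a parent \(s\) below the root \(r\). The Python sketch below uses \(A_{ab} = K_{ab} K_{bb}^{ - 1}\), as in the third line of the induction step, and confirms that the leaf covariance reproduces \(K_{ii}\) while the cross-covariance equals \(K_{is} K_{ss}^{ - 1} K_{sj}\). The point sets, kernel length scale, jitter term, and function names are hypothetical choices made for illustration and are not taken from the paper's Matlab code.

```python
import numpy as np


def sq_exp_kernel(A, B, length_scale=0.3):
    """Squared exponential kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / length_scale**2)


def gain(K_ab, K_bb):
    """Return A_ab = K_ab K_bb^{-1} without forming an explicit inverse."""
    return np.linalg.solve(K_bb, K_ab.T).T


# Hypothetical point sets: landmarks at root r and node s, data at leaves i, j.
rng = np.random.default_rng(1)
X_r = rng.uniform(0.0, 1.0, size=(5, 2))
X_s = rng.uniform(0.0, 1.0, size=(5, 2))
X_i = rng.uniform(0.0, 1.0, size=(4, 2))
X_j = rng.uniform(0.0, 1.0, size=(3, 2))

jitter = 1e-8  # small diagonal term, added only for numerical stability
K_rr = sq_exp_kernel(X_r, X_r) + jitter * np.eye(5)
K_ss = sq_exp_kernel(X_s, X_s) + jitter * np.eye(5)
K_ii = sq_exp_kernel(X_i, X_i)
K_sr = sq_exp_kernel(X_s, X_r)
K_is = sq_exp_kernel(X_i, X_s)
K_js = sq_exp_kernel(X_j, X_s)

# Path of length one (s -> r): cov(Z_{s|r}) + A_sr K_rr A_rs.
A_sr = gain(K_sr, K_rr)
cov_H_s = (K_ss - A_sr @ K_sr.T) + A_sr @ K_rr @ A_sr.T

# Path of length two (i -> s -> r): cov(Z_{i|s}) + A_is cov(H_s) A_si.
A_is = gain(K_is, K_ss)
cov_H_i = (K_ii - A_is @ K_is.T) + A_is @ cov_H_s @ A_is.T

# Cross-covariance of distinct leaves i and j with common ancestor s.
A_js = gain(K_js, K_ss)
cov_H_ij = A_is @ cov_H_s @ A_js.T
target_ij = K_is @ np.linalg.solve(K_ss, K_js.T)

print("cov(H_s) equals K_ss:", np.allclose(cov_H_s, K_ss, atol=1e-8))
print("cov(H_i) equals K_ii:", np.allclose(cov_H_i, K_ii, atol=1e-8))
print("cov(H_i, H_j) equals K_is K_ss^-1 K_sj:",
      np.allclose(cov_H_ij, target_ij, atol=1e-8))
```

The check is purely algebraic, so it does not depend on the particular kernel; any positive definite covariance function could be substituted for `sq_exp_kernel`.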

About this article

Cite this article

Dearmon, J., Smith, T.E. A hierarchical approach to scalable Gaussian process regression for spatial data. J Spat Econometrics 2, 7 (2021).


  • Gaussian process regression
  • Spatial econometrics
  • Kriging
  • Nyström approximation
  • Hierarchical matrix

JEL Classification codes

  • C21- Spatial Models
  • C55- Large Datasets
  • R30- Real Estate Markets, Spatial Production Analysis, and Firm Location: General
  • R31- Housing Supply and Markets