Abstract
Large-scale and highly detailed geospatial datasets currently offer rich opportunities for empirical investigation, where finer-level investigation of spatial spillovers and spatial infill can now be done at the parcel level. Gaussian process regression (GPR) is particularly well suited for such investigations, but is currently limited by its need to manipulate and store large dense covariance matrices. The central purpose of this paper is to develop a more efficient version of GPR based on the hierarchical covariance approximation proposed by Chen et al. (J Mach Learn Res 18:1–42, 2017) and Chen and Stein (Linear-cost covariance functions for Gaussian random fields, arXiv:1711.05895, 2017). We provide a novel probabilistic interpretation of this framework, and extend the method to the analysis of local marginal effects at the parcel level. Finally, we apply these tools to a spatial dataset constructed from a 10-year period of Oklahoma County Assessor databases. In this setting, we are able to identify both regions of possible spatial spillovers and spatial infill, and to show more generally how this approach can be used for the systematic identification of specific development opportunities.
Data availability
The Oklahoma County Assessor has database exports that are available from their office upon request and can be used for empirical research.
Code availability
Matlab, using the GPstuff toolbox together with custom Matlab code.
Notes
To allow comparability of length scales, individual attribute variables are implicitly assumed to be standardized.
Note in particular from (4) above that for test locations, \(x_{*l}\), far from all training locations, \(X = (x_{1}, \ldots, x_{n})\), the covariance vector, \(K(x_{*l}, X)\), must approach zero. This in turn implies from (7) together with model (1) that the corresponding predictions, \(\hat{y}_{l} = E(Y_{*l}|Y) = E(f_{*l}|Y) + \mu\), must necessarily approach the mean, \(\mu\). Such (extrapolated) predictions thus exhibit “mean reversion”.
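This mean-reversion behavior is easy to see in a small sketch (written here in Python/NumPy rather than the authors' Matlab code; all data, the prior mean of 5.0, and the unit length scale are hypothetical choices for illustration):

```python
import numpy as np

# A 1-D GP with a squared exponential kernel: predictions at test points
# far from all training locations revert to the prior mean mu, because
# the covariance vector K(x_*, X) approaches zero.
def sq_exp_kernel(a, b, length_scale=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=20)       # training locations
mu = 5.0                                 # prior mean (hypothetical)
y = mu + rng.standard_normal(20)         # toy observations
K = sq_exp_kernel(X, X) + 0.1 * np.eye(20)

def predict(x_star):
    k_star = sq_exp_kernel(x_star, X)    # ~0 when x_star is far from X
    return mu + k_star @ np.linalg.solve(K, y - mu)

# Far from the data the prediction collapses to mu = 5.0.
print(predict(np.array([0.5])), predict(np.array([100.0])))
```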
In this regard, the most popular alternative kernel function, namely the simple exponential kernel (as for example in Genton 2001), is overly sensitive to small differences between similar attribute profiles. Within the larger family of Matern kernels, the squared exponential kernel is also the simplest to analyze from both estimation and inference perspectives.
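The sensitivity point can be made numerically (a sketch assuming a unit length scale): near zero distance, the exponential kernel \(e^{-d}\) loses correlation linearly in \(d\), while the squared exponential \(e^{-d^{2}/2}\) is flat to first order, so tiny attribute differences barely reduce similarity.

```python
import numpy as np

# Correlation lost at a small attribute difference d, for the simple
# exponential kernel exp(-d) versus the squared exponential exp(-d^2/2).
d = 0.01
drop_exp = 1.0 - np.exp(-d)              # ~ d       (order 1e-2)
drop_sq_exp = 1.0 - np.exp(-0.5 * d**2)  # ~ d^2 / 2 (order 5e-5)
print(drop_exp, drop_sq_exp)
```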
These are also referred to as “inducing” points (as for example in Rasmussen and Quinonero-Candela 2005).
Note that whenever \(X_{i} \cap X_{r} \ne \emptyset\), the conditional covariance matrix, \(K_{ii} - K_{ir}\,K_{rr}^{-1} K_{ri}\), in (14) must be singular. But as will be seen in footnote 3 below, this has no substantive consequences for the model constructed.
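A minimal sketch of this singularity (hypothetical one-dimensional points, not the paper's data): when the leaf set and landmark set share a location, conditioning leaves zero variance at the shared point, so the conditional covariance matrix has a zero row and column.

```python
import numpy as np

# Shared point between leaf set X_i and landmark set X_r makes the
# conditional covariance K_ii - K_ir K_rr^{-1} K_ri singular.
def sq_exp(a, b):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

X_i = np.array([0.0, 1.0])   # leaf locations
X_r = np.array([1.0, 3.0])   # landmark locations; 1.0 is shared

K_ii = sq_exp(X_i, X_i)
K_ir = sq_exp(X_i, X_r)
K_rr = sq_exp(X_r, X_r)

cond_cov = K_ii - K_ir @ np.linalg.solve(K_rr, K_ir.T)
# The row/column for the shared point is (numerically) zero,
# so the determinant vanishes.
print(cond_cov)
print(np.linalg.det(cond_cov))
```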
Blake (2015). Fast and Accurate Symmetric Positive Definite Matrix Inverse, Matlab Central File Exchange.
In fact, the temperature application of Chen and Stein (2021) involves more than 2 million observations.
The explicit sample sizes shown are [10,000, 20,000, 40,000, 80,000, 160,000, 320,000, 500,000].
Computation times for HCA automatically include calculations of local marginal effects at each prediction point (which are not directly relevant for either NNGP or GBM). But these add little to the overall time differences.
It should also be noted that “local marginal effects” are much more problematic for both NNGP and GBM models, and are not considered here. In NNGP, the dominant effects of predictors are regression-like global mean effects with only the spatial error term modeled as a nearest-neighbor Gaussian process. In GBM, prediction surfaces are either locally flat or at best governed by the type of weak-learner functions used. So local behavior is of less interest in either of these prediction models.
In particular, by adding 'irrelevant' variables to our simulation model, experiments showed that the combined time for computing predictions together with LMEs for any given variable is approximately linear in the total number of variables. In the present case with 12,569 sample points, 150 landmark points, and using only in-sample calculations of LMEs, the added time for each new irrelevant variable was approximately 30 s.
LME performance deteriorates under higher error variance. Within this simulation setting, we multiplied the error standard deviation of 0.5 by a scaling factor ranging from 0.5 to 2.5 in increments of 0.5. Regression results suggest that, for every one-unit increase in this scaling factor, MAE increases by 0.06 for LME × 1 and by 0.12 for LME × 2.
This is consistent with Matlab’s own findings that “Using A\b instead of inv(A)*b is two to three times faster, and produces residuals on the order of machine accuracy relative to the magnitude of the data”, (https://www.mathworks.com/help/matlab/ref/inv.html).
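The same point carries over to other linear-algebra environments; a sketch in NumPy terms (a hypothetical symmetric positive definite system, not the paper's matrices): `np.linalg.solve(A, b)` factorizes `A` once and back-substitutes, which is both faster and typically more accurate than forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)       # symmetric positive definite test matrix
b = rng.standard_normal(n)

x_solve = np.linalg.solve(A, b)   # analogue of Matlab's A\b
x_inv = np.linalg.inv(A) @ b      # analogue of inv(A)*b

# Both residuals sit at machine-accuracy scale relative to the data.
print(np.linalg.norm(A @ x_solve - b), np.linalg.norm(A @ x_inv - b))
```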
For well-behaved data sets, a random selection of landmark points is probably sufficient and much less costly to execute. But, as noted by Chen and Stein (2021, p. 15), in more challenging cases such as our present housing-price data, a careful selection of landmarks can substantially reduce approximation errors. The primary drawback is computation time: tree construction is much more costly with k-means. For a simulated data set with 311,422 observations and 500 landmark points, random selection took 28 s while k-means took almost 5 min. Nonetheless, k-means is an often-used way of selecting inducing points; see for example Park and Choi (2010) and Hensman et al. (2015).
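The trade-off can be sketched as follows (in Python/NumPy rather than the authors' Matlab code; the point cloud and landmark count are hypothetical): random selection just subsamples the data, while a few Lloyd (k-means) iterations cost extra passes over the data but spread the landmarks to cover the point cloud with lower distortion.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.standard_normal((2000, 2))   # stand-in for parcel coordinates
m = 20                                 # number of landmark points

# Random landmarks: just subsample (nearly free).
init = pts[rng.choice(len(pts), m, replace=False)].copy()
random_landmarks = init.copy()

# k-means landmarks: a few Lloyd iterations from the same random start.
centers = init.copy()
for _ in range(10):
    d2 = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d2.argmin(1)
    for k in range(m):
        if np.any(labels == k):
            centers[k] = pts[labels == k].mean(0)

def mean_sq_dist(landmarks):
    # Average squared distance from each point to its nearest landmark.
    d2 = ((pts[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    return d2.min(1).mean()

# Lloyd iterations never increase this distortion, so the k-means
# landmarks cover the cloud at least as well as the random start.
print(mean_sq_dist(random_landmarks), mean_sq_dist(centers))
```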
A more indirect type of change is to allow tree indexing on different sets of attributes. For our present purposes, we partitioned on all variables except sale year, which made the grouping more spatial than temporal.
Assessor data also provides estimates of market value for each parcel. These are only used in the assessment of predictive model performance.
Note in particular the row of blue points with underestimated values (\(\approx \,\)$172,000) that roughly correspond to the sample mean price of the data. As mentioned in footnote 2 above, these are instances of parcels with extreme attribute profiles involving extrapolated price predictions that exhibit mean reversion. However, such outliers are usually identified easily and can be analyzed separately.
A cursory search suggests that the lower bound on a room addition is about $80 per square foot (as for example in https://www.ownerly.com/home-improvement/home-addition-cost/, https://www.homeadvisor.com/cost/additions-and-remodels/build-an-addition/, and https://www.homelight.com/blog/room-addition-cost/).
This building permit data is taken from the County Assessor’s database, with dates issued in 2018 or later. Building costs for permits in our data set all exceed $5,000.
One exception is the recent paper by Cohen and Zabel (2020) which analyzes such spillover effects at the census tract level in the Greater Boston Area.
Such situations do not usually occur in tier-one markets.
References
Anderson TW (1958) Introduction to multivariate statistical analysis. Wiley, New York
Chen J, Avron H, Sindhwani V (2017) Hierarchically compositional kernels for scalable nonparametric learning. J Mach Learn Res 18:1–42
Chen J, Stein ML (2017) Linear-cost covariance functions for Gaussian random fields. arXiv:1711.05895
Chen J, Stein ML (2021) Linear-cost covariance functions for Gaussian random fields. J Am Stat Assoc, pp 1–43
Cohen JP, Zabel J (2020) Local house price diffusion. Real Estate Econ 48:710–743
Daisa JM, Parker T (2009) Trip generation rates for urban infill land uses in California. ITE J 79(6):30–39
Datta A, Banerjee S, Finley AO, Gelfand AE (2016) Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. J Am Stat Assoc 111(514):800–812
Dearmon J, Smith TE (2016) Gaussian process regression and Bayesian model averaging: an alternative approach to modeling spatial phenomena. Geogr Anal 48:82–111
Dearmon J, Smith TE (2017) Local marginal analysis of spatial data: a Gaussian process regression approach with Bayesian model and kernel averaging. Spat Econom Qual Limit Depend Var 37:297–342
DeFusco A, Ding W, Ferreira F, Gyourko J (2018) The role of price spillovers in the American housing boom. J Urban Econ 108:72–84
Finley AO, Datta A, Banerjee S (2020) spNNGP R package for nearest neighbor Gaussian process models. arXiv:2001.09111v1 [stat.CO]
Genton MG (2001) Classes of kernels for machine learning: a statistics perspective. J Mach Learn Res 2:299–312
Hensman J, Matthews AG, Filippone M, Ghahramani Z (2015) MCMC for variationally sparse Gaussian processes. In: Advances in neural information processing systems, pp 1648–1656
Landis JD, Hood H, Li G, Rogers T, Warren C (2006) The future of infill housing in California: opportunities, potential, and feasibility. Hous Policy Debate 17(4):681–725
McConnell V, Wiley K (2010) Infill development: perspectives and evidence from economics and planning. Resour Fut 10:1–34
Park S, Choi S (2010) Hierarchical Gaussian process regression. In: Proceedings of 2nd Asian conference on machine learning, pp 95–110
Rasmussen C, Quinonero-Candela J (2005) A unifying view of sparse approximate Gaussian process regression. J Mach Learn Res 6:1939–1959
Ridgeway G (2007) Generalized boosted models: a guide to the gbm package. Update 1(1):2007
Vanhatalo J, Riihimäki J, Hartikainen J, Jylänki P, Tolvanen V, Vehtari A (2013) GPstuff: Bayesian modeling with Gaussian processes. J Mach Learn Res 14(Apr):1175–1179
Funding
Not applicable.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix
To show that expressions (34) and (37) are indeed the actual covariances of the random vectors in expression (33) of the text, it is convenient to introduce further simplifying notation. For each possible root path, \(i_{1} \, \to \,i_{2} \, \to \, \cdots \, \to \,i_{m - 1} \, \to \,i_{m} \to \,r\), let \(H_{{i_{1} \,i_{2} \, \cdots \,i_{m} \,r}}\) be defined recursively for paths of length one by
[as in (18) of the text] and for longer paths by
Then, (A.1) together with the argument in (14) through (17) of the text again shows that for paths of length one,
So if it is hypothesized that
holds for all paths of length m, then for paths of length \(m + 1\) it follows from the independence of \(Y_{{i_{1} \,|\,i_{2} }}\) and \(H_{{i_{2} \cdots \,i_{m + 1} \,r\,}}\), together with (A.4) that [again from the argument in (14) through (17) in the text],
So by induction, (A.4) must hold for all m. But for any leaf, \(i\), with root path, \(i\, \to i_{1} \, \to \cdots \, \to i_{m} \to \,r\), this implies at once that
and thus that expression (34) in the text must hold.
It remains to establish expression (37) in the text for any distinct leaves, \(i\) and \(j\), with root paths as in (35) and (36) (where again this is taken to include the case, \(s = r\)). To do so, we first expand \(H_{i} = H_{{i\,\,i_{1} \cdots i_{p} \,s\,h_{\,1} \, \cdots \,h_{m} \,r}}\) and \(H_{j} = H_{{j\,\,j_{1} \cdots j_{q} \,s\,h_{\,1} \, \cdots \,h_{m} \,r}}\) as follows:
Next recall that since the random variables \((Y_{{i\,|\,i_{1} }} ,Y_{{i_{1} \,|\,i_{2} }} ,\, \ldots ,\,Y_{{i_{p} \,|\,s}} ,\,Y_{{j\,|\,j_{1} }} ,Y_{{j_{1} \,|\,j_{2} }} ,\, \ldots ,\,Y_{{j_{q} \,|\,s}} ,H_{{s\,h_{\,1} \cdots h_{m} \,r}} )\) are all independent, it follows [as for example in (31) of the text] that all covariance terms between \(H_{i}\) and \(H_{j}\) are zero except for the shared term involving \(H_{{s\,h_{\,1} \cdots h_{m} \,r}}\), so that,
But this implies at once from (A.5) that
and thus that expression (37) must hold.
Dearmon, J., Smith, T.E. A hierarchical approach to scalable Gaussian process regression for spatial data. J Spat Econometrics 2, 7 (2021). https://doi.org/10.1007/s43071-021-00012-5