Abstract
This study develops a spatial additive mixed modeling (AMM) approach that estimates spatial and non-spatial effects from large samples, such as millions of observations. Although fast AMM approaches are already well established, they are restrictive in that they assume a known spatial dependence structure. To overcome this limitation, this study develops a fast AMM that estimates the spatial structure in residuals and regression coefficients together with non-spatial effects. We rely on a Moran coefficient-based approach to estimate the spatial structure. The proposed approach pre-compresses large matrices whose size grows with the sample size N before the model estimation; thus, the computational complexity of the estimation is independent of the sample size. Furthermore, the pre-compression is done through a block-wise procedure that makes the memory consumption independent of N. As a result, the spatial AMM is memory free and fast even for millions of observations. The developed approach is compared with alternatives through Monte Carlo simulation experiments. The results confirm the estimation accuracy of the spatially varying coefficients and group coefficients, and the computational efficiency of the developed approach. Finally, we apply our approach to an income analysis using United States (US) data from 2015.
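The idea of block-wise pre-compression can be illustrated with a minimal sketch. It assumes, for illustration only, that the compressed quantities are the inner products X'X and X'y of an ordinary linear regression; the matrices actually compressed by the proposed approach are not given in the abstract. The names `blockwise_crossproducts` and `iter_blocks` are hypothetical helpers, and the simulated data stand in for blocks that would normally be read from disk one at a time.

```python
import numpy as np


def blockwise_crossproducts(blocks):
    """Accumulate X'X and X'y over an iterable of (X_block, y_block) pairs.

    Only one block plus the K x K and K x 1 accumulators are held at a time,
    so the working memory depends on the block size and K, not on N.
    """
    xtx, xty = None, None
    for X_b, y_b in blocks:
        if xtx is None:
            k = X_b.shape[1]
            xtx = np.zeros((k, k))
            xty = np.zeros(k)
        xtx += X_b.T @ X_b
        xty += X_b.T @ y_b
    return xtx, xty


def iter_blocks(X, y, block_size):
    """Hypothetical block reader; in practice blocks would come from disk."""
    for start in range(0, X.shape[0], block_size):
        yield X[start:start + block_size], y[start:start + block_size]


# Simulated data (an assumption for the demo, not the paper's data).
rng = np.random.default_rng(0)
N, K = 10_000, 5
X = rng.normal(size=(N, K))
beta_true = np.arange(1.0, K + 1.0)
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Pre-compression: after this step, N no longer appears in any array size.
xtx, xty = blockwise_crossproducts(iter_blocks(X, y, block_size=1_000))

# Estimation uses only the K x K and K x 1 compressed quantities.
beta_hat = np.linalg.solve(xtx, xty)
print(beta_hat)
```

In this sketch the estimation step costs the same whether N is ten thousand or ten million, which mirrors the abstract's claim that the computational complexity of the estimation is independent of the sample size.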
Notes
\({\mathbf{w}}\) yields a negatively dependent spatial process if \({\varvec{\Lambda}}\) is defined using the absolute values of negative eigenvalues and \({\mathbf{E}}\) is defined using their corresponding eigenvectors. Our approach is capable of analyzing negative spatial dependence situations (see Griffith, 2006).
They confirmed that the Moran eigenvector approach successfully captures/eliminates residual spatial dependence for large samples up to N = 80,000.
Our MC-based spatial process yields a rank-reduced GP (see Murakami et al., 2017). Therefore, the standard error of the MC-based process can be underestimated if the full GP is regarded as the true process. Murakami and Griffith (2019a) showed that the standard error bias is small if the true process is a positively dependent spatial one, which is the case the MC-based approach is intended for. For fitting a full GP, bias correction (see, e.g., Finley et al. 2009) will be an important future step.
We prefer the radial basis function-based GP approximation (see Kammann and Wand 2003) because it is one of the most popular GP approximations (Liu et al. 2018). In geostatistics, the predictive process modeling (Banerjee et al. 2008) and fixed rank kriging (Cressie and Johannesson 2008) assume radial basis functions too. Although soap film smoothing (Wood 2008), the stochastic partial differential equation approach (Rue et al. 2009), and other approaches are also available for low rank GP modeling, they are relatively computationally demanding. Thus, we adopt the radial basis function approach for the SVC modeling in GeoAMM and GeoAMM*.
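The construction in the first note can be sketched numerically. The example below assumes a simplified form \({\mathbf{w}} = {\mathbf{E}}{\varvec{\gamma}}\), where the coefficients in \({\varvec{\gamma}}\) have variances given by the diagonal of \({\varvec{\Lambda}}\); the exact model is not restated in these notes, so this form is an assumption. The connectivity matrix `C`, its exponential decay with range 0.2, and the helper `moran_coefficient` are likewise hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
coords = rng.uniform(size=(n, 2))

# Hypothetical spatial connectivity: exponential distance decay, zero diagonal.
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
C = np.exp(-d / 0.2)
np.fill_diagonal(C, 0.0)

# Doubly centered connectivity matrix MCM, with M = I - 11'/n.
M = np.eye(n) - np.ones((n, n)) / n
MCM = M @ C @ M
MCM = (MCM + MCM.T) / 2  # enforce exact symmetry before eigh
eigval, eigvec = np.linalg.eigh(MCM)

# Split eigenpairs by sign; near-zero eigenvalues are dropped.
pos = eigval > 1e-8
neg = eigval < -1e-8
E_pos, lam_pos = eigvec[:, pos], eigval[pos]
E_neg, lam_neg = eigvec[:, neg], np.abs(eigval[neg])  # absolute values


def moran_coefficient(v, C):
    """Moran coefficient of the vector v under connectivity matrix C."""
    z = v - v.mean()
    return (len(v) / C.sum()) * (z @ C @ z) / (z @ z)


# Random coefficients with variances equal to the eigenvalue magnitudes.
w_pos = E_pos @ (rng.normal(size=lam_pos.size) * np.sqrt(lam_pos))
w_neg = E_neg @ (rng.normal(size=lam_neg.size) * np.sqrt(lam_neg))

print(moran_coefficient(w_pos, C), moran_coefficient(w_neg, C))
```

Because each eigenvector's Moran coefficient is proportional to its eigenvalue, a process built only from the negative-eigenvalue eigenvectors has a negative Moran coefficient, which is the negatively dependent case the note describes.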
References
Anselin, L. (2003). Spatial externalities, spatial multipliers, and spatial econometrics. International Regional Science Review, 26(2), 153–166.
Anselin, L. (2010). Thirty years of spatial econometrics. Papers in Regional Science, 89(1), 3–25.
Arbia, G., Ghiringhelli, C., & Mira, A. (2019). Estimation of spatial econometric linear models with large datasets: How big can spatial Big Data be? Regional Science and Urban Economics, 76, 67–73.
Banerjee, S., Gelfand, A. E., Finley, A. O., & Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4), 825–848.
Bates, D. M. (2010). lme4: Mixed-effects modeling with R. http://lme4.r-forge.r-project.org/book/. Accessed 25 Nov 2011.
Cressie, N. (1992). Statistics for spatial data. New York: Wiley.
Cressie, N., & Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 209–226.
Datta, A., Banerjee, S., Finley, A. O., & Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514), 800–812.
Dray, S., Legendre, P., & Peres-Neto, P. R. (2006). Spatial modelling: A comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological Modelling, 196(3–4), 483–493.
Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.
Finley, A. O., Sang, H., Banerjee, S., & Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis, 53, 2873–2884.
Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.
Furrer, R., Genton, M. G., & Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3), 502–523.
Gelfand, A. E., Diggle, P., Guttorp, P., & Fuentes, M. (2010). Handbook of spatial statistics. Boca Raton: CRC Press.
Gelfand, A. E., Kim, H. J., Sirmans, C. F., & Banerjee, S. (2003). Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association, 98(462), 387–396.
Genton, M. G., & Kleiber, W. (2015). Cross-covariance functions for multivariate geostatistics. Statistical Science, 30(2), 147–163.
Goldstein, H. (2011). Multilevel statistical models. West Sussex: Wiley.
Gotway, C. A., & Young, L. J. (2002). Combining incompatible spatial data. Journal of the American Statistical Association, 97(458), 632–648.
Griffith, D. A. (2003). Spatial autocorrelation and spatial filtering: Gaining understanding through theory and scientific visualization. Berlin: Springer.
Griffith, D. A. (2006). Hidden negative spatial autocorrelation. Journal of Geographical Systems, 8(4), 335–355.
Griffith, D. A., & Chun, Y. (2019). Implementing Moran eigenvector spatial filtering for massively large georeferenced datasets. International Journal of Geographical Information Science, 33(9), 1–15.
Griffith, D. A., & Peres-Neto, P. R. (2006). Spatial modeling in ecology: The flexibility of eigenfunction spatial analyses. Ecology, 87(10), 2603–2613.
Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., et al. (2018). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24, 1–28.
Henderson, C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics, 31(2), 423–447.
Henderson, C. R. (1984). Applications of linear models in animal breeding. Guelph, ON: University of Guelph.
Hodges, J. S. (2016). Richly parameterized linear models: additive, time series, and spatial models using random effects. Boca Raton: Chapman and Hall/CRC.
Hox, J. (1998). Multilevel modeling: When and why. In I. Balderjahn, R. Mathar, & M. Schader (Eds.), Classification, data analysis, and data highways (pp. 147–154). Berlin: Springer.
Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 139–159.
Kammann, E. E., & Wand, M. P. (2003). Geoadditive models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 52(1), 1–18.
Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517), 201–214.
Kelejian, H. H., & Prucha, I. R. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. The Journal of Real Estate Finance and Economics, 17(1), 99–121.
Kneib, T., Hothorn, T., & Tutz, G. (2009). Variable selection and model choice in geoadditive regression models. Biometrics, 65(2), 626–634.
Krock, M., Kleiber, W., & Becker, S. (2019). Penalized basis models for very large spatial datasets. arXiv:1902.06877.
LeSage, J., & Pace, R. K. (2009). Introduction to Spatial Econometrics. Boca Raton: Chapman and Hall/CRC.
Li, Z., & Wood, S. N. (2019). Faster model matrix crossproducts for large generalized linear models with discretized covariates. Statistics and Computing. https://doi.org/10.1007/s11222-019-09864-2.
Lindgren, F., Rue, H., & Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 423–498.
Liu, H., Ong, Y. S., Shen, X., & Cai, J. (2018). When Gaussian process meets big data: A review of scalable GPs. arXiv:1807.01065.
Moran, P. A. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17–23.
Murakami, D., & Griffith, D. A. (2015). Random effects specifications in eigenvector spatial filtering: A simulation study. Journal of Geographical Systems, 17(4), 311–331.
Murakami, D., & Griffith, D. A. (2019a). Eigenvector spatial filtering for large data sets: Fixed and random effects approaches. Geographical Analysis, 51(1), 23–49.
Murakami, D., & Griffith, D. A. (2019b). Spatially varying coefficient modeling for large datasets: Eliminating N from spatial regressions. Spatial Statistics, 30, 39–64.
Murakami, D., Lu, B., Harris, P., Brunsdon, C., Charlton, M., Nakaya, T., et al. (2019). The importance of scale in spatially varying coefficient modeling. Annals of the American Association of Geographers, 109(1), 50–70.
Murakami, D., Yoshida, T., Seya, H., Griffith, D. A., & Yamagata, Y. (2017). A Moran coefficient-based mixed effects approach to investigate spatially varying relationships. Spatial Statistics, 19, 68–89.
Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 319–392.
Samuels, M. L. (1993). Simpson’s paradox and related phenomena. Journal of the American Statistical Association, 88(421), 81–88.
Stein, M. L. (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8, 1–19.
Tiefelsdorf, M., & Griffith, D. A. (2007). Semiparametric filtering of spatial autocorrelation: The eigenvector approach. Environment and Planning A, 39(5), 1193–1221.
Tsutsumi, M., & Seya, H. (2008). Measuring the impact of large-scale transportation projects on land price using spatial statistical models. Papers in Regional Science, 87(3), 385–401.
Tsutsumi, M., & Seya, H. (2009). Hedonic approaches based on spatial econometrics and spatial statistics: Application to evaluation of project benefits. Journal of Geographical Systems, 11(4), 357–380.
Wang, C., & Furrer, R. (2019). Combining heterogeneous spatial datasets with process-based spatial fusion models: A unifying framework. arXiv:1906.00364.
Wiesenfarth, M., & Kneib, T. (2010). Bayesian geoadditive sample selection models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(3), 381–404.
Wood, S. N. (2008). Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3), 495–518.
Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1), 3–36.
Wood, S. N. (2017). Generalized additive models: An introduction with R. Boca Raton: Chapman and Hall/CRC.
Wood, S. N., Goude, Y., & Shaw, S. (2015). Generalized additive models for large data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 64(1), 139–155.
Wood, S. N., Li, Z., Shaddick, G., & Augustin, N. H. (2017). Generalized additive models for gigadata: Modeling the UK black smoke network daily data. Journal of the American Statistical Association, 112(519), 1199–1210.
Zhang, K., & Kwok, J. T. (2010). Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks, 21(10), 1576–1587.
Acknowledgements
This study is funded by the JSPS KAKENHI Grant numbers 17K12974 and 18H03628.
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Murakami, D., Griffith, D.A. A memory-free spatial additive mixed modeling for big spatial data. Jpn J Stat Data Sci 3, 215–241 (2020). https://doi.org/10.1007/s42081-019-00063-x
DOI: https://doi.org/10.1007/s42081-019-00063-x