A memory-free spatial additive mixed modeling for big spatial data

  • Original Paper
  • Spatial statistics
Japanese Journal of Statistics and Data Science

Abstract

This study develops a spatial additive mixed modeling (AMM) approach that estimates spatial and non-spatial effects from large samples, such as millions of observations. Although fast AMM approaches are already well established, they are restrictive in that they assume a known spatial dependence structure. To overcome this limitation, this study develops a fast AMM that estimates the spatial structure in residuals and regression coefficients together with non-spatial effects. We rely on a Moran coefficient-based approach to estimate the spatial structure. The proposed approach pre-compresses the large matrices whose size grows with the sample size N before model estimation; thus, the computational complexity of the estimation is independent of the sample size. Furthermore, the pre-compression is performed through a block-wise procedure that makes memory consumption independent of N. As a result, the spatial AMM is memory free and fast even for millions of observations. The developed approach is compared with alternatives through Monte Carlo simulation experiments. The results confirm the estimation accuracy of the spatially varying coefficients and group coefficients, as well as the computational efficiency of the developed approach. Finally, we apply our approach to an income analysis using United States (US) data for 2015.
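
The block-wise pre-compression idea in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: it uses ordinary least squares as a stand-in, and the function names (`blockwise_crossproducts`, `simulate_blocks`), the simulated data, and the block size are assumptions made only for illustration. The point is that the small cross-product matrices X'X, X'y, and y'y can be accumulated one row block at a time, so memory use depends on the number of columns p and the block size, not on N.

```python
import numpy as np

def blockwise_crossproducts(blocks, p):
    """Accumulate X'X, X'y, and y'y over row blocks (hypothetical helper).

    Only p x p and length-p arrays are kept between blocks, so storage
    does not grow with the total number of rows N.
    """
    xtx = np.zeros((p, p))
    xty = np.zeros(p)
    yty = 0.0
    n = 0
    for X_b, y_b in blocks:
        xtx += X_b.T @ X_b
        xty += X_b.T @ y_b
        yty += float(y_b @ y_b)
        n += X_b.shape[0]
    return xtx, xty, yty, n

def simulate_blocks(n_blocks, block_size, beta, seed=0):
    """Yield simulated (X, y) blocks one at a time (illustrative data only)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_blocks):
        X_b = rng.normal(size=(block_size, beta.size))
        y_b = X_b @ beta + 0.1 * rng.normal(size=block_size)
        yield X_b, y_b

beta_true = np.array([1.0, -2.0, 0.5])
xtx, xty, yty, n = blockwise_crossproducts(
    simulate_blocks(50, 1000, beta_true), p=beta_true.size
)

# After compression, estimation touches only the small matrices.
beta_hat = np.linalg.solve(xtx, xty)
rss = yty - 2 * beta_hat @ xty + beta_hat @ xtx @ beta_hat
print(n, beta_hat, rss / n)
```

Once the cross-products are stored, both the coefficient estimate and the residual sum of squares are computed without revisiting the N rows, which is the sense in which the estimation cost becomes independent of the sample size in this simplified setting.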

Notes

  1. \({\mathbf{w}}\) yields a negatively dependent spatial process if \({\varvec{\Lambda}}\) is defined using the absolute values of negative eigenvalues and E is defined using their corresponding eigenvectors. Our approach is capable of analyzing negative spatial dependence situations (see Griffith, 2006).

  2. They confirmed that the Moran eigenvector approach successfully captures/eliminates residual spatial dependence for large samples up to N = 80,000.

  3. Our MC-based spatial process yields a rank-reduced GP (see Murakami et al., 2017). Therefore, the standard error of the MC-based process can be underestimated if the full GP is regarded as the true process. Murakami and Griffith (2019a) showed that the standard error bias is small if the true process is a positively dependent spatial process, which is the case the MC-based approach is intended for. For fitting a full GP, bias correction (see, e.g., Finley et al. 2009) will be an important future step.

  4. We prefer the radial basis function-based GP approximation (see Kammann and Wand 2003) because it is one of the most popular GP approximations (Liu et al. 2018). In geostatistics, the predictive process modeling (Banerjee et al. 2008) and fixed rank kriging (Cressie and Johannesson 2008) assume radial basis functions too. Although soap film smoothing (Wood 2008), the stochastic partial differential equation approach (Rue et al. 2009), and other approaches are also available for low rank GP modeling, they are relatively computationally demanding. Thus, we adopt the radial basis function approach for the SVC modeling in GeoAMM and GeoAMM*.
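
The eigenvalues and eigenvectors referred to in Notes 1 and 3 can be illustrated with a small sketch of Moran eigenvector extraction. This is a generic illustration under stated assumptions, not the paper's implementation: the exponential distance kernel, the `bandwidth` value, the random coordinates, and the function names are all hypothetical choices. The sketch doubly centers a spatial connectivity matrix C, keeps the eigenpairs with positive eigenvalues (positive spatial dependence), and checks the standard property that the Moran coefficient of an eigenvector is proportional to its eigenvalue.

```python
import numpy as np

def moran_eigenvectors(coords, bandwidth):
    """Eigenpairs of the doubly centered connectivity matrix M C M.

    The exponential kernel exp(-d / bandwidth) is an assumed choice of
    connectivity; other symmetric kernels could be substituted.
    """
    n = coords.shape[0]
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = np.exp(-d / bandwidth)
    np.fill_diagonal(C, 0.0)                  # no self-connection
    M = np.eye(n) - np.ones((n, n)) / n       # centering projector
    lam, E = np.linalg.eigh(M @ C @ M)        # symmetric eigendecomposition
    order = np.argsort(lam)[::-1]             # largest eigenvalue first
    lam, E = lam[order], E[:, order]
    keep = lam > 1e-8                         # positive dependence only
    return C, lam[keep], E[:, keep]

def moran_coefficient(z, C):
    """Moran coefficient of a variable z under connectivity C."""
    zc = z - z.mean()
    return z.size / C.sum() * (zc @ C @ zc) / (zc @ zc)

rng = np.random.default_rng(1)
coords = rng.uniform(size=(60, 2))            # illustrative locations
C, lam, E = moran_eigenvectors(coords, bandwidth=0.2)

# The Moran coefficient of each eigenvector scales with its eigenvalue.
mc_first = moran_coefficient(E[:, 0], C)
print(mc_first, coords.shape[0] / C.sum() * lam[0])
```

Keeping only a subset of the eigenpairs is what makes the resulting process rank reduced; using the eigenpairs with negative eigenvalues instead would describe negatively dependent patterns, as Note 1 indicates.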

References

  • Anselin, L. (2003). Spatial externalities, spatial multipliers, and spatial econometrics. International Regional Science Review, 26(2), 153–166.

  • Anselin, L. (2010). Thirty years of spatial econometrics. Papers in Regional Science, 89(1), 3–25.

  • Arbia, G., Ghiringhelli, C., & Mira, A. (2019). Estimation of spatial econometric linear models with large datasets: How big can spatial Big Data be? Regional Science and Urban Economics, 76, 67–73.

  • Banerjee, S., Gelfand, A. E., Finley, A. O., & Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4), 825–848.

  • Bates, D. M. (2010). lme4: Mixed-effects modeling with R. http://lme4.r-forge.r-project.org/book/. Accessed 25 Nov 2011.

  • Cressie, N. (1992). Statistics for spatial data. New York: Wiley.

  • Cressie, N., & Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1), 209–226.

  • Datta, A., Banerjee, S., Finley, A. O., & Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514), 800–812.

  • Dray, S., Legendre, P., & Peres-Neto, P. R. (2006). Spatial modelling: A comprehensive framework for principal coordinate analysis of neighbour matrices (PCNM). Ecological Modelling, 196(3–4), 483–493.

  • Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2153–2175.

  • Finley, A. O., Sang, H., Banerjee, S., & Gelfand, A. E. (2009). Improving the performance of predictive process modeling for large datasets. Computational Statistics & Data Analysis, 53, 2873–2884.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3), 432–441.

  • Furrer, R., Genton, M. G., & Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3), 502–523.

  • Gelfand, A. E., Diggle, P., Guttorp, P., & Fuentes, M. (2010). Handbook of spatial statistics. Boca Raton: CRC Press.

  • Gelfand, A. E., Kim, H. J., Sirmans, C. F., & Banerjee, S. (2003). Spatial modeling with spatially varying coefficient processes. Journal of the American Statistical Association, 98(462), 387–396.

  • Genton, M. G., & Kleiber, W. (2015). Cross-covariance functions for multivariate geostatistics. Statistical Science, 30(2), 147–163.

  • Goldstein, H. (2011). Multilevel statistical models. West Sussex: Wiley.

  • Gotway, C. A., & Young, L. J. (2002). Combining incompatible spatial data. Journal of the American Statistical Association, 97(458), 632–648.

  • Griffith, D. A. (2003). Spatial autocorrelation and spatial filtering: Gaining understanding through theory and scientific visualization. Berlin: Springer.

  • Griffith, D. A. (2006). Hidden negative spatial autocorrelation. Journal of Geographical Systems, 8(4), 335–355.

  • Griffith, D. A., & Chun, Y. (2019). Implementing Moran eigenvector spatial filtering for massively large georeferenced datasets. International Journal of Geographical Information Science, 33(9), 1–15.

  • Griffith, D. A., & Peres-Neto, P. R. (2006). Spatial modeling in ecology: The flexibility of eigenfunction spatial analyses. Ecology, 87(10), 2603–2613.

  • Heaton, M. J., Datta, A., Finley, A. O., Furrer, R., Guinness, J., Guhaniyogi, R., et al. (2018). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24, 1–28.

  • Henderson, C. R. (1975). Best linear unbiased estimation and prediction under a selection model. Biometrics, 31(2), 423–447.

  • Henderson, C. R. (1984). Applications of linear models in animal breeding. Guelph, ON: University of Guelph.

  • Hodges, J. S. (2016). Richly parameterized linear models: additive, time series, and spatial models using random effects. Boca Raton: Chapman and Hall/CRC.

  • Hox, J. (1998). Multilevel modeling: When and why. In I. Balderjahn, R. Mathar, & M. Schader (Eds.), Classification, data analysis, and data highways (pp. 147–154). Berlin: Springer.

  • Hughes, J., & Haran, M. (2013). Dimension reduction and alleviation of confounding for spatial generalized linear mixed models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), 139–159.

  • Kammann, E. E., & Wand, M. P. (2003). Geoadditive models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 52(1), 1–18.

  • Katzfuss, M. (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112(517), 201–214.

  • Kelejian, H. H., & Prucha, I. R. (1998). A generalized spatial two-stage least squares procedure for estimating a spatial autoregressive model with autoregressive disturbances. The Journal of Real Estate Finance and Economics, 17(1), 99–121.

  • Kneib, T., Hothorn, T., & Tutz, G. (2009). Variable selection and model choice in geoadditive regression models. Biometrics, 65(2), 626–634.

  • Krock, M., Kleiber, W., & Becker, S. (2019). Penalized basis models for very large spatial datasets. arXiv:1902.06877.

  • LeSage, J., & Pace, R. K. (2009). Introduction to spatial econometrics. Boca Raton: Chapman and Hall/CRC.

  • Li, Z., & Wood, S. N. (2019). Faster model matrix crossproducts for large generalized linear models with discretized covariates. Statistics and Computing. https://doi.org/10.1007/s11222-019-09864-2.

  • Lindgren, F., Rue, H., & Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 423–498.

  • Liu, H., Ong, Y. S., Shen, X., & Cai, J. (2018). When Gaussian process meets big data: A review of scalable GPs. arXiv:1807.01065.

  • Moran, P. A. (1950). Notes on continuous stochastic phenomena. Biometrika, 37(1/2), 17–23.

  • Murakami, D., & Griffith, D. A. (2015). Random effects specifications in eigenvector spatial filtering: A simulation study. Journal of Geographical Systems, 17(4), 311–331.

  • Murakami, D., & Griffith, D. A. (2019a). Eigenvector spatial filtering for large data sets: Fixed and random effects approaches. Geographical Analysis, 51(1), 23–49.

  • Murakami, D., & Griffith, D. A. (2019b). Spatially varying coefficient modeling for large datasets: Eliminating N from spatial regressions. Spatial Statistics, 30, 39–64.

  • Murakami, D., Lu, B., Harris, P., Brunsdon, C., Charlton, M., Nakaya, T., et al. (2019). The importance of scale in spatially varying coefficient modeling. Annals of the American Association of Geographers, 109(1), 50–70.

  • Murakami, D., Yoshida, T., Seya, H., Griffith, D. A., & Yamagata, Y. (2017). A Moran coefficient-based mixed effects approach to investigate spatially varying relationships. Spatial Statistics, 19, 68–89.

  • Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 319–392.

  • Samuels, M. L. (1993). Simpson’s paradox and related phenomena. Journal of the American Statistical Association, 88(421), 81–88.

  • Stein, M. L. (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8, 1–19.

  • Tiefelsdorf, M., & Griffith, D. A. (2007). Semiparametric filtering of spatial autocorrelation: The eigenvector approach. Environment and Planning A, 39(5), 1193–1221.

  • Tsutsumi, M., & Seya, H. (2008). Measuring the impact of large-scale transportation projects on land price using spatial statistical models. Papers in Regional Science, 87(3), 385–401.

  • Tsutsumi, M., & Seya, H. (2009). Hedonic approaches based on spatial econometrics and spatial statistics: Application to evaluation of project benefits. Journal of Geographical Systems, 11(4), 357–380.

  • Wang, C., & Furrer, R. (2019). Combining heterogeneous spatial datasets with process-based spatial fusion models: A unifying framework. arXiv:1906.00364.

  • Wiesenfarth, M., & Kneib, T. (2010). Bayesian geoadditive sample selection models. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(3), 381–404.

  • Wood, S. N. (2008). Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3), 495–518.

  • Wood, S. N. (2011). Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1), 3–36.

  • Wood, S. N. (2017). Generalized additive models: An introduction with R. Boca Raton: Chapman and Hall/CRC.

  • Wood, S. N., Goude, Y., & Shaw, S. (2015). Generalized additive models for large data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 64(1), 139–155.

  • Wood, S. N., Li, Z., Shaddick, G., & Augustin, N. H. (2017). Generalized additive models for gigadata: Modeling the UK black smoke network daily data. Journal of the American Statistical Association, 112(519), 1199–1210.

  • Zhang, K., & Kwok, J. T. (2010). Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks, 21(10), 1576–1587.

Acknowledgements

This study is funded by the JSPS KAKENHI Grant numbers 17K12974 and 18H03628.

Author information

Corresponding author

Correspondence to Daisuke Murakami.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Murakami, D., Griffith, D.A. A memory-free spatial additive mixed modeling for big spatial data. Jpn J Stat Data Sci 3, 215–241 (2020). https://doi.org/10.1007/s42081-019-00063-x

  • DOI: https://doi.org/10.1007/s42081-019-00063-x
