A statistical framework of data fusion for spatial prediction of categorical variables

Cao, Guofeng; Yoo, Eun-hye; Wang, Shaowen

doi:10.1007/s00477-013-0842-7

Original Paper
Published: 01 January 2014

Volume 28, pages 1785–1799 (2014)
Stochastic Environmental Research and Risk Assessment

Guofeng Cao¹,
Eun-hye Yoo³ &
Shaowen Wang²

Abstract

With rapid advances in geospatial technologies, the amount of spatial data has been increasing exponentially over the past few decades. Usually collected by diverse source providers, the available spatial data tend to be fragmented by a large variety of data heterogeneities, which highlights the need for sound methods capable of efficiently fusing diverse and incompatible spatial information. Within the context of spatial prediction of categorical variables, this paper describes a statistical framework for integrating and drawing inferences from a collection of spatially correlated variables while accounting for data heterogeneities and complex spatial dependencies. In this framework, we discuss the spatial prediction of categorical variables in the paradigm of latent random fields, and represent each spatial variable via spatial covariance functions, which define two-point similarities or dependencies of spatially correlated variables. The representation of spatial covariance functions derived from different spatial variables is independent of heterogeneous characteristics and can be combined in a straightforward fashion. It therefore provides a unified and flexible representation of heterogeneous spatial variables in spatial analysis while accounting for complex spatial dependencies. We show that in the spatial prediction of categorical variables, the sought-after class occurrence probability at a target location can be formulated as a multinomial logistic function of the spatial covariances of spatial variables between the target and sampled locations. The group least absolute shrinkage and selection operator (group lasso) is adopted for parameter estimation; it prevents the model from over-fitting and simultaneously selects an optimal subset of important information (variables). Synthetic and real case studies are provided to illustrate the introduced concepts and showcase the advantages of the proposed statistical framework.
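The two ingredients named in the abstract, a multinomial logistic function of spatial covariances and a group lasso penalty, can be illustrated with a minimal sketch. This is not the authors' implementation. The exponential covariance model, the range values, the random coefficients, and the helper names `exp_covariance`, `class_probabilities`, and `group_soft_threshold` are assumptions made for illustration only. Each hypothetical spatial variable is represented here by its own covariance function evaluated between target and sampled locations, and each variable's coefficients form one group.

```python
import numpy as np

def exp_covariance(a, b, range_=1.0):
    """Exponential covariance between two sets of 2-D locations (assumed model)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d / range_)

def class_probabilities(cov_blocks, coef_blocks, intercept):
    """Multinomial logistic (softmax) function of covariances to sampled locations.

    cov_blocks:  list of (n_targets, n_samples) arrays, one per spatial variable
    coef_blocks: list of (n_samples, n_classes) arrays, one per spatial variable
    intercept:   (n_classes,) array
    """
    scores = intercept + sum(c @ b for c, b in zip(cov_blocks, coef_blocks))
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

def group_soft_threshold(block, threshold):
    """Proximal step of a group lasso penalty: shrink a whole coefficient block.

    A block whose Frobenius norm is at most the threshold is set to zero,
    which is how an entire variable can drop out of the model.
    """
    norm = np.linalg.norm(block)
    if norm <= threshold:
        return np.zeros_like(block)
    return (1.0 - threshold / norm) * block

# Hypothetical data: 5 sampled locations, 3 target locations, 3 classes,
# and two spatial variables with different (assumed) covariance ranges.
rng = np.random.default_rng(0)
samples = rng.uniform(0, 10, size=(5, 2))
targets = rng.uniform(0, 10, size=(3, 2))
covs = [exp_covariance(targets, samples, r) for r in (2.0, 5.0)]
coefs = [rng.normal(size=(5, 3)) for _ in covs]
probs = class_probabilities(covs, coefs, np.zeros(3))
```

Each row of `probs` holds the class occurrence probabilities at one target location and sums to one. In a fitted model the coefficient blocks would come from minimizing a penalized multinomial log-likelihood; the thresholding step only shows why a group penalty selects variables as whole blocks rather than coefficient by coefficient.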

Acknowledgments

We gratefully acknowledge the funding provided by the National Science Foundation under grant number OCI-1047916 to support this research. We would like to thank Professors Bruce W. Hoagland and Todd D. Fagin from the University of Oklahoma for valuable discussions and the datasets they kindly provided. We also thank the anonymous reviewers for their constructive comments and suggestions, and Professor Jeff Lee from Texas Tech University for his proofreading, which has profoundly improved the composition of this manuscript.

Author information

Authors and Affiliations

Department of Geosciences, Texas Tech University, Lubbock, TX, USA
Guofeng Cao
Department of Geography and Geographic Information Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA
Shaowen Wang
Department of Geography, State University of New York at Buffalo, Buffalo, NY, USA
Eun-hye Yoo

Corresponding author

Correspondence to Guofeng Cao.

About this article

Cite this article

Cao, G., Yoo, Eh. & Wang, S. A statistical framework of data fusion for spatial prediction of categorical variables. Stoch Environ Res Risk Assess 28, 1785–1799 (2014). https://doi.org/10.1007/s00477-013-0842-7

Published: 01 January 2014
Issue Date: October 2014
DOI: https://doi.org/10.1007/s00477-013-0842-7
