A statistical framework of data fusion for spatial prediction of categorical variables

  • Guofeng Cao
  • Eun-hye Yoo
  • Shaowen Wang
Original Paper

Abstract

With rapid advances in geospatial technologies, the amount of spatial data has increased exponentially over the past few decades. Because these data are usually collected by diverse providers, they tend to be fragmented by a wide variety of heterogeneities, which highlights the need for sound methods capable of efficiently fusing diverse and incompatible spatial information. Within the context of spatial prediction of categorical variables, this paper describes a statistical framework for integrating and drawing inferences from a collection of spatially correlated variables while accounting for data heterogeneities and complex spatial dependencies. In this framework, we treat the spatial prediction of categorical variables in the paradigm of latent random fields and represent each spatial variable via spatial covariance functions, which define two-point similarities or dependencies of spatially correlated variables. The covariance-function representations derived from different spatial variables are independent of the variables' heterogeneous characteristics and can be combined in a straightforward fashion. The framework therefore provides a unified and flexible representation of heterogeneous spatial variables in spatial analysis while accounting for complex spatial dependencies. We show that, in the spatial prediction of categorical variables, the sought-after class occurrence probability at a target location can be formulated as a multinomial logistic function of the spatial covariances of spatial variables between the target and sampled locations. The group least absolute shrinkage and selection operator (group LASSO) is adopted for parameter estimation, which prevents the model from over-fitting and simultaneously selects an optimal subset of important information (variables). Synthetic and real case studies are provided to illustrate the introduced concepts and showcase the advantages of the proposed statistical framework.
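The core idea of the abstract, class probabilities as a multinomial logistic function of covariances between target and sampled locations, estimated under a group LASSO penalty, can be illustrated with a small sketch. This is not the authors' implementation: the exponential covariance model, the proximal-gradient solver, the synthetic data, and every name below (`exp_cov`, `fit_group_lasso`, `predict_proba`, `lam`) are assumptions chosen for illustration. Each data source contributes one covariance block, and each block forms one penalty group so that an uninformative source can be shrunk as a whole.

```python
import numpy as np

def exp_cov(a, b, length_scale=0.2):
    """Exponential covariance between two point sets (assumed model)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d / length_scale)

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_group_lasso(kernels, y, n_classes, lam=0.01, n_iter=300):
    """Proximal gradient for multinomial logistic regression with a
    group LASSO penalty; one group per data source (hypothetical solver)."""
    n = y.size
    X = np.hstack(kernels)                  # n x (n_sources * n)
    Y = np.eye(n_classes)[y]                # one-hot class labels
    W = np.zeros((X.shape[1], n_classes))
    # Step 1/L, with L = 0.5 * ||X||_2^2 / n bounding the gradient's
    # Lipschitz constant for the mean softmax loss.
    step = 2.0 * n / np.linalg.norm(X, 2) ** 2
    groups = np.split(np.arange(X.shape[1]), len(kernels))
    for _ in range(n_iter):
        W = W - step * X.T @ (softmax(X @ W) - Y) / n
        for g in groups:                    # block soft-thresholding
            norm = np.linalg.norm(W[g])
            shrink = max(0.0, 1.0 - step * lam / norm) if norm > 0 else 0.0
            W[g] = shrink * W[g]
    return W

def predict_proba(kernels_new, W):
    """Class probabilities from target-to-sample covariances."""
    return softmax(np.hstack(kernels_new) @ W)

# Synthetic example: source 1 is spatial, source 2 is pure noise.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 1, size=(60, 2))
labels = (coords[:, 0] > 0.5).astype(int) + (coords[:, 1] > 0.5).astype(int)
aux = rng.uniform(0, 1, size=(60, 2))       # uninformative attribute

kernels = [exp_cov(coords, coords), exp_cov(aux, aux)]
W = fit_group_lasso(kernels, labels, n_classes=3)

targets = np.array([[0.25, 0.25], [0.75, 0.75]])
aux_targets = rng.uniform(0, 1, size=(2, 2))
kernels_new = [exp_cov(targets, coords), exp_cov(aux_targets, aux)]
proba = predict_proba(kernels_new, W)

# Per-source weight norms; a norm of zero means the source was dropped.
source_norms = [np.linalg.norm(W[i * 60:(i + 1) * 60]) for i in range(2)]
print(proba, source_norms)
```

Stacking the per-source covariance blocks side by side is what lets heterogeneous variables enter one model on equal footing. Increasing `lam` pushes whole blocks to zero, which corresponds to the variable selection behaviour the abstract attributes to the group LASSO.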

Keywords

Categorical data, Data fusion, Kernel methods, Geostatistics, LASSO

Notes

Acknowledgments

We gratefully acknowledge the funding provided by the National Science Foundation under grant number OCI-1047916 to support this research. We would like to thank Professors Bruce W. Hoagland and Todd D. Fagin from the University of Oklahoma for valuable discussions and for the datasets they kindly provided. We also thank the anonymous reviewers for their constructive comments and suggestions, and Professor Jeff Lee from Texas Tech University for his proofreading, which has profoundly improved the composition of this manuscript.

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. Department of Geosciences, Texas Tech University, Lubbock, USA
  2. Department of Geography and Geographic Information Science, University of Illinois at Urbana-Champaign, Champaign, USA
  3. Department of Geography, State University of New York at Buffalo, Buffalo, USA
