Bayesian logistic regression for presence-only data

  • Fabio Divino
  • Natalia Golini
  • Giovanna Jona Lasinio
  • Antti Penttinen
Original Paper

Abstract

Presence-only data arise when a censoring mechanism acts on a binary response so that only one outcome, usually the presence of an attribute of interest, can be observed. A typical example is the recording of species presence in ecological surveys. This work presents a Bayesian approach to the analysis of presence-only data based on a two-level scheme. A probability law and a case-control design are combined to handle the two sources of uncertainty: one due to censoring and the other due to sampling. Using a stratified sampling design with non-overlapping strata, a new formulation of the logistic model for presence-only data is proposed; in particular, logistic regression with a linear predictor is considered. Estimation is carried out with a new Markov chain Monte Carlo algorithm with data augmentation, which does not require a priori knowledge of the population prevalence. The performance of the new algorithm is validated through extensive simulation experiments under three scenarios and by comparison with optimal benchmarks. An application to data from the literature is reported to discuss the model's behaviour in real-world situations, together with the results of an original study on termite occurrence data.
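
To make the data structure concrete, the following is a minimal sketch of how presence-only data can be simulated. It is an illustration only and not the authors' algorithm. It generates a population whose binary response follows a logistic model with a linear predictor, then draws a presence sample (response known to be 1) and a disjoint background sample whose response is censored. The function name `simulate_presence_only`, the single Gaussian covariate, the parameter names and the particular way the two samples are kept disjoint are all assumptions made for this example.

```python
import math
import random


def simulate_presence_only(n_population, beta0, beta1, n_presence, n_background, seed=0):
    """Simulate a presence-only sample (hypothetical helper, for illustration).

    Returns a list of (x, z, y) tuples. z = 1 marks a unit from the presence
    sample, whose response y is known to be 1. z = 0 marks a background unit
    whose true response is censored and reported as None.
    """
    rng = random.Random(seed)

    # Population level: covariate x and true binary response y, with
    # P(y = 1 | x) given by a logistic model with a linear predictor.
    population = []
    for _ in range(n_population):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        y = 1 if rng.random() < p else 0
        population.append((x, y))

    # Presence sample: drawn only from units where the attribute is present.
    presence_idx = [i for i, (_, y) in enumerate(population) if y == 1]
    chosen = set(rng.sample(presence_idx, min(n_presence, len(presence_idx))))

    # Background sample: drawn from the remaining units (assumption: the two
    # samples are disjoint); the true response is hidden from the analyst.
    remaining = [i for i in range(n_population) if i not in chosen]
    background_idx = rng.sample(remaining, min(n_background, len(remaining)))

    data = [(population[i][0], 1, 1) for i in sorted(chosen)]
    data += [(population[i][0], 0, None) for i in background_idx]
    return data


if __name__ == "__main__":
    sample = simulate_presence_only(1000, -1.0, 1.5, 50, 200, seed=1)
    n_censored = sum(1 for _, z, _ in sample if z == 0)
    print(len(sample), "units,", n_censored, "with censored response")
```

The censored responses in the background sample are the quantities that a data augmentation scheme would treat as latent variables. Doing so requires the full model described in the paper, which this sketch does not attempt to reproduce.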

Keywords

Case-control design; Censored data; Data augmentation; Markov Chain Monte Carlo algorithm; Stratified sampling; Two levels scheme

Notes

Acknowledgments

The authors would like to thank the anonymous referees and the associate editor, whose comments helped to improve the paper considerably.

Supplementary material

477_2015_1064_MOESM1_ESM.pdf (231 kb)

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Division of Physics, Computer Science and Mathematics, University of Molise, Pesche, Italy
  2. Department of Statistical Sciences, Sapienza University of Rome, Rome, Italy
  3. Department of Mathematics and Statistics, University of Jyväskylä, Jyväskylä, Finland
