Abstract
Presence-only data arise when a censoring mechanism acts on a binary response that can be observed for only one of its outcomes, usually the presence of an attribute of interest. A typical example is the recording of species presence in ecological surveys. In this work, a Bayesian approach to the analysis of presence-only data based on a two-level scheme is presented. A probability law and a case-control design are combined to handle the two sources of uncertainty: one due to censoring and the other due to sampling. Using a stratified sampling design with non-overlapping strata, a new formulation of the logistic model for presence-only data is proposed; in particular, logistic regression with a linear predictor is considered. Estimation is carried out with a new Markov chain Monte Carlo algorithm with data augmentation, which does not require a priori knowledge of the population prevalence. The performance of the new algorithm is validated through extensive simulation experiments under three scenarios and by comparison with optimal benchmarks. An application to data from the literature is reported to discuss the model's behaviour in real-world situations, together with the results of an original study on termite occurrence data.
Acknowledgments
The authors would like to thank the anonymous referees and the associate editor, whose comments helped to improve the paper considerably.
Appendix
Property 1
Under a stratified random sampling design adjusted for presence-only data, with non-overlapping strata \({\mathcal {U}}\) and \({\mathcal {U}}_p\), the inclusion probabilities in the sample are given by
for the stratum of cases and by
for the stratum of controls.
Proof
By definition one has
The term \(P(Z=z \mid Y=y)=\frac{P(Z=z, Y=y)}{P(Y=y)}\) can be computed directly from Table 1. In particular, one has
Now it is easy to derive
and
\(\square \)
Proposition 1
Let us consider the population \({\mathcal {U}}\) augmented with its subset \({\mathcal {U}}_p\). Then, under the assumption that the stratum variable \(Z\) is conditionally independent of \(X\) given \(Y\), the conditional probability of presence in the design population \({\mathcal {U}}_D\) is given by \(P(Y=1|x)=\dfrac{2\pi ^*(x)}{1+\pi ^*(x)}\).
Proof
The hypothesis of conditional independence results in
which can also be expressed as
For the case \(Y=1\) and \(Z=0\), one has
The probabilities enclosed in the second term can be derived from Table 1 and one has
Similarly, in the case \(Y=1\) and \(Z=1\) one obtains
From (17) one obtains \(P(Y=1|x)=2\,P(Z=1|x)\); substituting into (16) gives \(P(Z=0|x)=\dfrac{1}{1+\pi ^*(x)}\) and hence \(P(Z=1|x)=\dfrac{\pi ^*(x)}{1+\pi ^*(x)}\). It then follows that \(P(Y=1|x)=\dfrac{2\pi ^*(x)}{1+\pi ^*(x)}\).
\(\square \)
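The probabilities in this proof can be checked by simple counting. The sketch below assumes, consistently with \(P(Y=1|x)=2\,P(Z=1|x)\), that at a given covariate value the design population holds every unit of \({\mathcal {U}}\) (stratum \(Z=0\)) plus one copy of each presence in \({\mathcal {U}}_p\) (stratum \(Z=1\)). Table 1 is not reproduced here, so this layout, the function name `design_population_presence` and the example counts are illustrative assumptions rather than the authors' code.

```python
from fractions import Fraction


def design_population_presence(n_units, n_presences):
    """Return P(Z=0|x), P(Z=1|x) and P(Y=1|x) in the design population.

    Assumed layout (hypothetical, for illustration): stratum Z=0 holds all
    n_units of U at covariate value x, and stratum Z=1 holds one copy of
    each of the n_presences presences among them.
    """
    total = n_units + n_presences  # size of the augmented population at x
    p_z0 = Fraction(n_units, total)
    p_z1 = Fraction(n_presences, total)
    # Each presence appears once in each stratum, so it is counted twice.
    p_y1 = Fraction(2 * n_presences, total)
    return p_z0, p_z1, p_y1


# Example: 1000 units at covariate value x, 250 of them presences,
# so the population-level probability of presence is pi*(x) = 1/4.
pi_star = Fraction(250, 1000)
p_z0, p_z1, p_y1 = design_population_presence(1000, 250)

assert p_z0 == 1 / (1 + pi_star)
assert p_z1 == pi_star / (1 + pi_star)
assert p_y1 == 2 * p_z1 == 2 * pi_star / (1 + pi_star)
print(p_z0, p_z1, p_y1)  # 4/5 1/5 2/5
```

Exact fractions are used instead of floating-point numbers so that the identities hold with equality rather than up to rounding.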
Corollary 1
Under the assumption that, given \(Y\), the inclusion into the sample (\(C=1\)) is conditionally independent of the covariates \(X\), one has
and
Proof
In general we have
From the conditional independence between \(C=1\) and \(X\) given \(Y\), Eq. (18) becomes
hence
Recalling that \(P(Y=1|x)=\frac{2\pi ^*(x)}{1+\pi ^*(x)}\) and the definitions of \(\rho _0=P(C=1|Y=0)\) and \(\rho _1=P(C=1|Y=1)\) the proofs for \(Y=0\) and \(Y=1\) can be derived. \(\square \)
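The last step of this proof can be sketched with Bayes' rule, using only the quantities named above. The expressions below are derived here under the stated conditional independence assumption and are not copied from the published equations; the identity \(P(Y=0|x)=1-P(Y=1|x)\) is the only additional ingredient.

```latex
\begin{align*}
P(Y=y \mid C=1, x)
  &= \frac{P(C=1 \mid Y=y)\, P(Y=y \mid x)}
          {\rho_0\, P(Y=0 \mid x) + \rho_1\, P(Y=1 \mid x)},
  \qquad y \in \{0,1\}, \\
P(Y=0 \mid x)
  &= 1 - \frac{2\pi^*(x)}{1+\pi^*(x)} = \frac{1-\pi^*(x)}{1+\pi^*(x)}, \\
P(Y=1 \mid C=1, x)
  &= \frac{2\rho_1\,\pi^*(x)}{\rho_0\,\bigl(1-\pi^*(x)\bigr) + 2\rho_1\,\pi^*(x)}, \\
P(Y=0 \mid C=1, x)
  &= \frac{\rho_0\,\bigl(1-\pi^*(x)\bigr)}{\rho_0\,\bigl(1-\pi^*(x)\bigr) + 2\rho_1\,\pi^*(x)}.
\end{align*}
```

As a consistency check, the two probabilities sum to one, and when \(\rho _0=\rho _1\) they reduce to the design-population probabilities \(\frac{2\pi ^*(x)}{1+\pi ^*(x)}\) and \(\frac{1-\pi ^*(x)}{1+\pi ^*(x)}\).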
Property 2
Under the case-control design adjusted for presence-only data the logistic regression function \(\phi _{pod}(x)\) is given by
Proof
By the definition of logistic regression function for presence-only data and by simple algebra one has
\(\square \)
Proposition 2
Using the approximation (10) of the ratio (8), the posterior predictive probability of occurrence for an unobserved response \(Y=y\) in the sub-sample \(S_u\) is approximated by the probability law \({\mathbb {P}}\) that generates the data at the population level, that is
Proof
From the conditional independence between \(Z\) and \(X\) given \(Y\), the predictive probability of occurrence in \(S_u\) is given by
From Table 2 one has that \(P(Z=0|Y=1,C=1)= \frac{n_{1u}}{n_p+n_{1u}}\) and hence
Now, recalling that in the general case one has
and
by substituting (19) and (20) in (19), one obtains
\(\square \)
Divino, F., Golini, N., Jona Lasinio, G. et al. Bayesian logistic regression for presence-only data. Stoch Environ Res Risk Assess 29, 1721–1736 (2015). https://doi.org/10.1007/s00477-015-1064-y
DOI: https://doi.org/10.1007/s00477-015-1064-y