Bayesian logistic regression for presence-only data

  • Original Paper
Stochastic Environmental Research and Risk Assessment

Abstract

Presence-only data arise when a censoring mechanism acts on a binary response so that only one outcome, usually the presence of an attribute of interest, can be observed. A typical example is the recording of species presence in ecological surveys. In this work we present a Bayesian approach to the analysis of presence-only data based on a two-level scheme. A probability law and a case-control design are combined to handle the two sources of uncertainty: one due to censoring and the other due to sampling. Using a stratified sampling design with non-overlapping strata, we propose a new formulation of the logistic model for presence-only data; in particular, logistic regression with a linear predictor is considered. Estimation is carried out with a new Markov chain Monte Carlo algorithm with data augmentation, which does not require a priori knowledge of the population prevalence. The performance of the new algorithm is validated through extensive simulation experiments under three scenarios and by comparison with optimal benchmarks. An application to data from the literature is reported to discuss the model's behaviour in real-world situations, together with the results of an original study on termite occurrence data.

Acknowledgments

The authors thank the anonymous referees and the associate editor, whose comments considerably improved the paper.

Author information

Corresponding author

Correspondence to Giovanna Jona Lasinio.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 232 kb)

Appendix

Property 1

Under a stratified random sampling design adjusted for presence-only data, with non-overlapping strata \({\mathcal {U}}\) and \({\mathcal {U}}_p\), the inclusion probabilities in the sample are given by

$$\begin{aligned} \rho _{0}=\frac{n_{0u}}{(1-\pi )N} \end{aligned}$$

for the stratum of controls (\(Y=0\)) and by

$$\begin{aligned} \rho _{1}=\frac{n_{1u}+n_{p}}{2\pi N}\end{aligned}$$

for the stratum of cases (\(Y=1\)).

Proof

By definition one has

$$\begin{aligned} P(C=1 \mid y)=\, & {} P(C=1 \mid y,Z=0)P(Z=0 \mid y)\\ & {} \quad+ P(C=1 \mid y,Z=1)P(Z=1 \mid y). \end{aligned}$$

The term \(P(Z=z \mid Y=y)=\frac{P(Z=z, Y=y)}{P(Y=y)}\) can be computed easily from Table 1. In particular, one has

$$\begin{aligned}P(Z=0 \mid Y=0)=\frac{N_0}{N_0}=1,\end{aligned}$$
$$\begin{aligned}P(Z=1 \mid Y=0)=\frac{0}{N_0}=0,\end{aligned}$$
$$\begin{aligned}P(Z=0 \mid Y=1)=\frac{N_1}{2N_1}=\frac{1}{2},\end{aligned}$$
$$\begin{aligned}P(Z=1 \mid Y=1)=\frac{N_1}{2N_1}=\frac{1}{2}.\end{aligned}$$

Now it is easy to derive

$$\begin{aligned} \rho _{0}=\, & {} P(C=1 \mid Y=0)\\= & {} \dfrac{n_{0u}}{N_0} \times 1+ 0\\= & {} \dfrac{n_{0u}}{(1-\pi )N}\\ \end{aligned}$$

and

$$\begin{aligned} \rho _{1}=\, & {} P(C=1 \mid Y=1)\\= & {} \dfrac{n_{1u}}{N_1} \times \frac{1}{2}+\dfrac{n_{p}}{N_1} \times \frac{1}{2}\\= & {} \dfrac{n_{1u}+n_p}{2 \pi N}. \end{aligned}$$

\(\square \)
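As a quick numerical sanity check of Property 1, the following Python sketch evaluates the mixture over strata used in the proof and compares it with the two closed forms. The function name and all numbers (\(N\), \(\pi \), \(n_{0u}\), \(n_{1u}\), \(n_p\)) are illustrative assumptions, not values taken from the paper.

```python
from fractions import Fraction


def inclusion_probabilities(N, pi, n0u, n1u, n_p):
    """Inclusion probabilities (rho0, rho1) of Property 1.

    N        : size of the population U
    pi       : population prevalence, so N1 = pi * N and N0 = (1 - pi) * N
    n0u, n1u : sampled absences and presences from U (stratum Z = 0)
    n_p      : sampled presences from U_p (stratum Z = 1)
    """
    N1 = pi * N
    N0 = (1 - pi) * N
    # Units with Y = 0 belong to stratum Z = 0 with probability 1.
    rho0 = Fraction(n0u) / N0 * 1 + 0
    # Units with Y = 1 are split evenly between Z = 0 and Z = 1.
    half = Fraction(1, 2)
    rho1 = Fraction(n1u) / N1 * half + Fraction(n_p) / N1 * half
    return rho0, rho1


# Hypothetical design: every number below is illustrative only.
N, pi = 1000, Fraction(1, 5)
n0u, n1u, n_p = 80, 20, 60
rho0, rho1 = inclusion_probabilities(N, pi, n0u, n1u, n_p)

# Closed forms stated in Property 1.
assert rho0 == Fraction(n0u) / ((1 - pi) * N)
assert rho1 == Fraction(n1u + n_p) / (2 * pi * N)
print(rho0, rho1)  # 1/10 1/5
```

Exact rational arithmetic is used so that the comparison with the closed forms is not affected by floating-point rounding.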

Proposition 1

Let us consider the population \({\mathcal {U}}\) augmented with its subset \({\mathcal {U}}_p\). Then, under the assumption that the stratum variable \(Z\) is conditionally independent of \(X\) given \(Y\), the conditional probability of presence in the design population \({\mathcal {U}}_D\) is given by

$$\begin{aligned} P(Y=1|x)=\frac{2\pi ^*(x)}{1+\pi ^*(x)}. \end{aligned}$$

Proof

The hypothesis of conditional independence results in

$$\begin{aligned} P(Z|y,x)=P(Z|y), \end{aligned}$$

which can also be expressed as

$$\begin{aligned} \dfrac{P(Y|z,x)P(Z|x)}{P(Y|x)}=\dfrac{P(Y|z)P(Z)}{P(Y)}. \end{aligned}$$

Consider the case \(Y=1\) and \(Z=0\), for which one has

$$\begin{aligned} \dfrac{P(Y=1|Z=0,x)P(Z=0|x)}{P(Y=1|x)}=\dfrac{P(Y=1|Z=0)P(Z=0)}{P(Y=1)}. \end{aligned}$$

The probabilities on the right-hand side can be derived from Table 1 and, since \(P(Y=1|Z=0,x)=\pi ^*(x)\), one has

$$\begin{aligned} \dfrac{\pi ^*(x)P(Z=0|x)}{P(Y=1|x)}=\dfrac{\frac{N_1}{N}\frac{N}{N+N_1}}{\frac{2N_1}{N+N_1}}=\frac{1}{2}. \end{aligned}$$
(16)

Similarly, in the case \(Y=1\) and \(Z=1\), where \(P(Y=1|Z=1,x)=1\), one obtains

$$\begin{aligned} \dfrac{P(Z=1|x)}{P(Y=1|x)}=\dfrac{\frac{N_1}{N_1}\frac{N_1}{N+N_1}}{\frac{2N_1}{N+N_1}}=\frac{1}{2}. \end{aligned}$$
(17)

From (17) one obtains \(P(Y=1|x)=2\,P(Z=1|x)\). Substituting this into (16) and using \(P(Z=1|x)=1-P(Z=0|x)\) gives \(P(Z=0|x)=\dfrac{1}{1+\pi ^*(x)}\) and hence \(P(Z=1|x)=\dfrac{\pi ^*(x)}{1+\pi ^*(x)}\). It then follows that

$$\begin{aligned} P(Y=1|x)=\dfrac{2\pi ^*(x)}{1+\pi ^*(x)}. \end{aligned}$$

\(\square \)
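The result of Proposition 1 can also be read as a counting argument, sketched below in Python. The sketch assumes, as the counts \(N_1\) and \(N_1\) used in the proof of Property 1 suggest, that \({\mathcal {U}}_p\) contains one copy of every presence in \({\mathcal {U}}\), and that these copies share the covariate values of the originals (the conditional independence of \(Z\) and \(X\) given \(Y\)). The function names and the cell size `n_x` are hypothetical.

```python
from fractions import Fraction


def presence_prob_design(pi_star):
    """Closed form of Proposition 1: P(Y=1 | x) in the design population U_D."""
    return 2 * pi_star / (1 + pi_star)


def presence_prob_by_counting(n_x, pi_star):
    """Same probability obtained by counting units that share covariate value x.

    n_x     : hypothetical number of units of U with covariate value x
    pi_star : P(Y=1 | x) in U
    """
    presences_in_U = n_x * pi_star       # units with Y=1, Z=0
    presences_in_Up = presences_in_U     # their copies in U_p (Y=1, Z=1)
    total_in_UD = n_x + presences_in_Up  # all units of U_D at x
    return (presences_in_U + presences_in_Up) / total_in_UD


# The two computations agree for any prevalence at x.
for pi_star in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    assert presence_prob_by_counting(200, pi_star) == presence_prob_design(pi_star)
```

The cell size cancels in the ratio, which is why the closed form depends on \(x\) only through \(\pi ^*(x)\).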

Corollary 1

Under the assumption that, given \(Y\), the inclusion into the sample (\(C=1\)) is conditionally independent of the covariates \(X\), one has

$$\begin{aligned} P(Y=0|C=1,x)\,P(C=1|x)=\frac{1-\pi ^*(x)}{1+\pi ^*(x)}\,\rho _0 \end{aligned}$$

and

$$\begin{aligned} P(Y=1|C=1,x)\,P(C=1|x)=\frac{2\pi ^*(x)}{1+\pi ^*(x)}\,\rho _1. \end{aligned}$$

Proof

In general we have

$$\begin{aligned} P(Y|C=1,x)=\frac{P(C=1|y,x)\,P(Y|x)}{P(C=1|x)}. \end{aligned}$$
(18)

By the conditional independence of \(C\) and \(X\) given \(Y\), (18) becomes

$$\begin{aligned} P(Y|C=1,x)=\frac{P(C=1|y)\,P(Y|x)}{P(C=1|x)}, \end{aligned}$$

hence

$$\begin{aligned} P(Y|C=1,x)P(C=1|x)=P(C=1|y)\,P(Y|x). \end{aligned}$$

Recalling that \(P(Y=1|x)=\frac{2\pi ^*(x)}{1+\pi ^*(x)}\) and the definitions \(\rho _0=P(C=1|Y=0)\) and \(\rho _1=P(C=1|Y=1)\), the two identities follow by setting \(Y=0\) and \(Y=1\), respectively. \(\square \)
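For \(Y=0\), the only step not written out in the proof is the complement of Proposition 1. As a worked step, using only quantities defined above, one has

$$\begin{aligned} P(Y=0|x)=1-\frac{2\pi ^*(x)}{1+\pi ^*(x)}=\frac{1-\pi ^*(x)}{1+\pi ^*(x)}, \end{aligned}$$

so that \(P(C=1|Y=0)\,P(Y=0|x)=\rho _0\,\frac{1-\pi ^*(x)}{1+\pi ^*(x)}\), which is the first identity of Corollary 1.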

Property 2

Under the case-control design adjusted for presence-only data the logistic regression function \(\phi _{pod}(x)\) is given by

$$\begin{aligned} \phi _{pod}(x)={\text {log}}\frac{\pi ^*(x)}{1-\pi ^*(x)}+{\text {log}}\frac{n_{1u}+n_p}{n_{0u}}-{\text {log}}\frac{\pi }{1-\pi }. \end{aligned}$$

Proof

By the definition of the logistic regression function for presence-only data, together with Corollary 1 and Property 1, one has

$$\begin{aligned} \phi _{pod}(x)=\, & {} {\text {logit}} P(Y=1|C=1,x)\\=\, & {} {\text {log}} \frac{P(Y=1|C=1,x)}{P(Y=0|C=1,x)}\\=\, & {} {\text {log}} \frac{P(Y=1|C=1,x)P(C=1 \mid x)}{P(Y=0|C=1,x)P(C=1 \mid x)}\\=\, & {} {\text {log}} \frac{\frac{2\pi ^*(x)}{1+\pi ^*(x)}\,\rho _1}{\frac{1-\pi ^*(x)}{1+\pi ^*(x)}\,\rho _0}\\=\, & {} {\text {log}} \frac{2\pi ^*(x) \frac{n_{1u}+n_p}{2 \pi N}}{[1-\pi ^*(x)]\frac{n_{0u}}{(1-\pi )N}}\\=\, & {} {\text {log}} \frac{\pi ^*(x) \frac{n_{1u}+n_p}{\pi }}{[1-\pi ^*(x)]\frac{n_{0u}}{1-\pi }}\\=\, & {} {\text {log}}\frac{\pi ^*(x)}{1-\pi ^*(x)}+{\text {log}}\frac{n_{1u}+n_p}{n_{0u}}-{\text {log}}\frac{\pi }{1-\pi }. \end{aligned}$$

\(\square \)
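The chain of equalities in the proof of Property 2 can be checked numerically. The Python sketch below computes \(\phi _{pod}(x)\) once from the closed form and once from the ratio of the two expressions in Corollary 1. The function names and all numerical values are hypothetical and serve only as an illustration.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))


def phi_pod(pi_star, pi, n0u, n1u, n_p):
    """Closed form of Property 2."""
    return logit(pi_star) + math.log((n1u + n_p) / n0u) - logit(pi)


def phi_pod_from_ratio(pi_star, pi, N, n0u, n1u, n_p):
    """Log of the ratio of the two expressions in Corollary 1."""
    rho0 = n0u / ((1 - pi) * N)        # Property 1
    rho1 = (n1u + n_p) / (2 * pi * N)  # Property 1
    numerator = 2 * pi_star / (1 + pi_star) * rho1
    denominator = (1 - pi_star) / (1 + pi_star) * rho0
    return math.log(numerator / denominator)


# Hypothetical values, for illustration only.
pi_star, pi, N = 0.3, 0.2, 1000
n0u, n1u, n_p = 80, 20, 60
assert math.isclose(
    phi_pod(pi_star, pi, n0u, n1u, n_p),
    phi_pod_from_ratio(pi_star, pi, N, n0u, n1u, n_p),
)
```

The population size \(N\) cancels in the ratio \(\rho _1/\rho _0\), so only the prevalence \(\pi \) and the sample counts enter the offset.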

Proposition 2

Using the approximation (10) of the ratio (8), the posterior predictive probability of occurrence for an unobserved response \(Y=y\) in the sub-sample \(S_u\) is approximated by the probability law \({\mathbb {P}}\) that generates the data at the population level, that is

$$\begin{aligned} P(Y=1|Z=0,C=1,x) \approx \pi ^*(x). \end{aligned}$$

Proof

From the conditional independence between \(Z\) and \(X\) given \(Y\), the predictive probability of occurrence in \(S_u\) is given by

$$\begin{aligned} P(Y=1|Z=0,C=1,x)=\dfrac{P(Z=0|Y=1,C=1)P(Y=1|C=1,x)}{P(Z=0|C=1,x)}. \end{aligned}$$

From Table 2 one has that \(P(Z=0|Y=1,C=1)= \frac{n_{1u}}{n_p+n_{1u}}\) and hence

$$\begin{aligned} P(Y=1|Z=0,C=1,x)= \dfrac{n_{1u}}{n_p+n_{1u}}\,\dfrac{P(Y=1|C=1,x)}{P(Z=0|C=1,x)}. \end{aligned}$$

Now, recalling that in the general case one has

$$\begin{aligned} P(Y=1|C=1,x) \approx \frac{\left( 1+\frac{n_p}{n_{1u}}\right) \, {\text {exp}}\{\phi (x)\}}{1+\left( 1+\frac{n_p}{n_{1u}}\right) {\text {exp}}\{\phi (x)\}} \end{aligned}$$
(19)

and

$$\begin{aligned} P(Z=0|C=1,x) \approx \frac{1+{\text {exp}}\{\phi (x)\}}{1+\left( 1+\frac{n_p}{n_{1u}}\right) {\text {exp}}\{\phi (x)\}}, \end{aligned}$$
(20)

by substituting (19) and (20) into the preceding expression for \(P(Y=1|Z=0,C=1,x)\), one obtains

$$\begin{aligned} P(Y=1|Z=0,C=1,x)\approx \,& {} \dfrac{n_{1u}}{n_p+n_{1u}} \,\dfrac{\left( 1+\frac{n_p}{n_{1u}}\right) \, {\text {exp}} \{ \phi (x) \}}{1+ {\text {exp}} \{ \phi (x) \}} \\=\, & {} \dfrac{{\text {exp}}\{\phi (x)\}}{1+{\text {exp}}\{\phi (x)\}}\\=\, & {} \pi ^*(x). \end{aligned}$$

\(\square \)
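The cancellation in the last step of the proof can be verified numerically. The Python sketch below codes the approximations (19) and (20) and checks that their weighted ratio equals \({\text {exp}}\{\phi (x)\}/(1+{\text {exp}}\{\phi (x)\})\). The function names and the sample counts are hypothetical, and \(\phi (x)\) is passed in as a plain number.

```python
import math


def p_presence_in_sample(phi, n1u, n_p):
    """Approximation (19): P(Y=1 | C=1, x)."""
    k = 1 + n_p / n1u
    return k * math.exp(phi) / (1 + k * math.exp(phi))


def p_unobserved_stratum(phi, n1u, n_p):
    """Approximation (20): P(Z=0 | C=1, x)."""
    k = 1 + n_p / n1u
    return (1 + math.exp(phi)) / (1 + k * math.exp(phi))


def predictive_presence(phi, n1u, n_p):
    """P(Y=1 | Z=0, C=1, x) obtained by substituting (19) and (20)."""
    weight = n1u / (n_p + n1u)  # P(Z=0 | Y=1, C=1), from Table 2
    return weight * p_presence_in_sample(phi, n1u, n_p) / p_unobserved_stratum(phi, n1u, n_p)


def expit(phi):
    """Inverse logit, exp(phi) / (1 + exp(phi))."""
    return 1 / (1 + math.exp(-phi))


# Hypothetical sample counts; the identity holds for any phi.
for phi in (-2.0, 0.0, 0.7, 3.0):
    assert math.isclose(predictive_presence(phi, 20, 60), expit(phi))
```

The factor \(1+n_p/n_{1u}\) in (19) is exactly offset by the weight \(n_{1u}/(n_p+n_{1u})\), which is why the sample counts drop out of the final expression.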

About this article

Cite this article

Divino, F., Golini, N., Jona Lasinio, G. et al. Bayesian logistic regression for presence-only data. Stoch Environ Res Risk Assess 29, 1721–1736 (2015). https://doi.org/10.1007/s00477-015-1064-y
