Bayesian logistic regression for presence-only data

Divino, Fabio; Golini, Natalia; Jona Lasinio, Giovanna; Penttinen, Antti

doi:10.1007/s00477-015-1064-y

Original Paper
Published: 31 March 2015

Volume 29, pages 1721–1736, (2015)
Stochastic Environmental Research and Risk Assessment

Fabio Divino¹,
Natalia Golini²,
Giovanna Jona Lasinio ORCID: orcid.org/0000-0001-8912-5018² &
…
Antti Penttinen³

Abstract

Presence-only data arise when a censoring mechanism acts on a binary response so that only one outcome, usually the presence of an attribute of interest, can be observed. A typical example is the recording of species presence in ecological surveys. This work presents a Bayesian approach to the analysis of presence-only data based on a two-level scheme. A probability law and a case-control design are combined to handle the two sources of uncertainty: one due to censoring and the other due to sampling. Using a stratified sampling design with non-overlapping strata, the paper proposes a new formulation of the logistic model for presence-only data; in particular, logistic regression with a linear predictor is considered. Estimation is carried out with a new Markov chain Monte Carlo algorithm with data augmentation, which does not require a priori knowledge of the population prevalence. The performance of the new algorithm is validated through extensive simulation experiments using three scenarios and comparison with optimal benchmarks. An application to data from the literature is reported to discuss the model's behaviour in real-world situations, together with the results of an original study on termite occurrence data.

Acknowledgments

The authors thank the anonymous referees and the associate editor, whose comments considerably improved the paper.

Author information

Authors and Affiliations

Division of Physics, Computer Science and Mathematics, University of Molise, Contrada Fonte Lappone, 86090, Pesche, IS, Italy
Fabio Divino
Department of Statistical Sciences, Sapienza University of Rome, P.le Aldo Moro 5, 00185, Rome, Italy
Natalia Golini & Giovanna Jona Lasinio
Department of Mathematics and Statistics, University of Jyväskylä, Jyväskylä, Finland
Antti Penttinen

Corresponding author

Correspondence to Giovanna Jona Lasinio.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 232 kb)

Appendix

Property 1

Under a stratified random sampling design adjusted for presence-only data, with non-overlapping strata ${\mathcal {U}}$ and ${\mathcal {U}}_p$, the inclusion probabilities in the sample are given by

$$\begin{aligned} \rho _{0}=\frac{n_{0u}}{(1-\pi )N} \end{aligned}$$

for units with $Y=0$ (the controls, i.e. absences) and by

$$\begin{aligned} \rho _{1}=\frac{n_{1u}+n_{p}}{2\pi N}\end{aligned}$$

for units with $Y=1$ (the cases, i.e. presences).

Proof

By definition one has

$$\begin{aligned} P(C=1 \mid y)=\, & {} P(C=1 \mid y,Z=0)P(Z=0 \mid y)\\ & {} \quad+ P(C=1 \mid y,Z=1)P(Z=1 \mid y). \end{aligned}$$

The term $P(Z=z \mid Y=y)=\frac{P(Z=z,\, Y=y)}{P(Y=y)}$ can be computed easily from Table 1. In particular, one has

$$\begin{aligned}P(Z=0 \mid Y=0)=\frac{N_0}{N_0}=1,\end{aligned}$$

$$\begin{aligned}P(Z=1 \mid Y=0)=\frac{0}{N_0}=0,\end{aligned}$$

$$\begin{aligned}P(Z=0 \mid Y=1)=\frac{N_1}{2N_1}=\frac{1}{2},\end{aligned}$$

$$\begin{aligned}P(Z=1 \mid Y=1)=\frac{N_1}{2N_1}=\frac{1}{2}.\end{aligned}$$

Now it is easy to derive

$$\begin{aligned} \rho _{0}=\, & {} P(C=1 \mid Y=0)\\= & {} \dfrac{n_{0u}}{N_0} \times 1+ 0\\= & {} \dfrac{n_{0u}}{(1-\pi )N} \end{aligned}$$

and

$$\begin{aligned} \rho _{1}=\, & {} P(C=1 \mid Y=1)\\= & {} \dfrac{n_{1u}}{N_1} \times \frac{1}{2}+\dfrac{n_{p}}{N_1} \times \frac{1}{2}\\= & {} \dfrac{n_{1u}+n_p}{2 \pi N}. \end{aligned}$$

$\square $
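As a numerical illustration of Property 1, the two inclusion probabilities can be computed directly from the sample counts. This is only a sketch: the function name, argument names and example counts below are hypothetical, and in practice $n_{0u}$, $n_{1u}$ and $\pi$ are not observed with presence-only data.

```python
def inclusion_probabilities(n0u, n1u, n_p, prevalence, N):
    """Return (rho0, rho1) as given in Property 1.

    n0u, n1u   : sampled units from U with Y=0 and Y=1 (hypothetical counts)
    n_p        : observed presences sampled from U_p
    prevalence : population prevalence pi, strictly between 0 and 1
    N          : size of the population U
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    N0 = (1.0 - prevalence) * N  # absences in U
    N1 = prevalence * N          # presences in U (and units in U_p)
    # P(C=1 | Y=0): absences only occur in the stratum Z=0.
    rho0 = n0u / N0
    # P(C=1 | Y=1): average over Z=0 and Z=1, each with probability 1/2.
    rho1 = 0.5 * n1u / N1 + 0.5 * n_p / N1
    return rho0, rho1


# Illustrative values only, not taken from the paper.
rho0, rho1 = inclusion_probabilities(n0u=80, n1u=20, n_p=50,
                                     prevalence=0.2, N=10_000)
print(rho0, rho1)
```

With these made-up counts, $\rho_0 = 80/8000$ and $\rho_1 = 70/4000$, matching the closed forms $n_{0u}/((1-\pi)N)$ and $(n_{1u}+n_p)/(2\pi N)$.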

Proposition 1

Let us consider the population ${\mathcal {U}}$ augmented with its subset ${\mathcal {U}}_p$. Then, under the assumption that the stratum variable $Z$ is conditionally independent of $X$ given $Y$, the conditional probability of presence in the design population ${\mathcal {U}}_D$ is given by

$$\begin{aligned} P(Y=1|x)=\frac{2\pi ^*(x)}{1+\pi ^*(x)}. \end{aligned}$$

Proof

The hypothesis of conditional independence results in

$$\begin{aligned} P(Z|y,x)=P(Z|y), \end{aligned}$$

which can also be expressed as

$$\begin{aligned} \dfrac{P(Y|z,x)P(Z|x)}{P(Y|x)}=\dfrac{P(Y|z)P(Z)}{P(Y)}. \end{aligned}$$

Consider the case $Y=1$ and $Z=0$, for which one has

$$\begin{aligned} \dfrac{P(Y=1|Z=0,x)P(Z=0|x)}{P(Y=1|x)}=\dfrac{P(Y=1|Z=0)P(Z=0)}{P(Y=1)}. \end{aligned}$$

The probabilities on the right-hand side can be derived from Table 1. Writing $\pi ^*(x)=P(Y=1|Z=0,x)$, one has

$$\begin{aligned} \dfrac{\pi ^*(x)P(Z=0|x)}{P(Y=1|x)}=\dfrac{\frac{N_1}{N}\frac{N}{N+N_1}}{\frac{2N_1}{N+N_1}}=\frac{1}{2}. \end{aligned}$$

(16)

Similarly, in the case $Y=1$ and $Z=1$, where $P(Y=1|Z=1,x)=1$, one obtains

$$\begin{aligned} \dfrac{P(Z=1|x)}{P(Y=1|x)}=\dfrac{\frac{N_1}{N_1}\frac{N_1}{N+N_1}}{\frac{2N_1}{N+N_1}}=\frac{1}{2}. \end{aligned}$$

(17)

From (17) one obtains $P(Y=1|x)=2\,P(Z=1|x)$. Substituting this into (16) gives $\pi ^*(x)\,P(Z=0|x)=P(Z=1|x)=1-P(Z=0|x)$, so that $P(Z=0|x)=\dfrac{1}{1+\pi ^*(x)}$ and hence $P(Z=1|x)=\dfrac{\pi ^*(x)}{1+\pi ^*(x)}$. It follows that

$$\begin{aligned} P(Y=1|x)=\dfrac{2\pi ^*(x)}{1+\pi ^*(x)}. \end{aligned}$$

$\square $
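Proposition 1 can be checked by direct counting at a fixed covariate value, assuming (as the strata probabilities above indicate) that the design population is $\mathcal{U}$ plus one copy of its presences. The sketch below uses exact rational arithmetic; the function names and the counts are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction


def presence_prob_design(pi_star):
    """P(Y=1 | x) in the design population U_D, as in Proposition 1."""
    return 2 * pi_star / (1 + pi_star)


def presence_prob_by_counting(n_units, n_presences):
    """Same quantity by counting units that share the covariate value x.

    n_units     : units of U at this x (hypothetical count)
    n_presences : presences among them; U_p contributes the same number again
    """
    total_in_design = n_units + n_presences  # units of U plus units of U_p
    presences_in_design = 2 * n_presences    # presences in U and in U_p
    return Fraction(presences_in_design, total_in_design)


# 100 units at x, 30 of them presences, so pi*(x) = 3/10.
pi_star = Fraction(30, 100)
print(presence_prob_design(pi_star))       # formula
print(presence_prob_by_counting(100, 30))  # counting
```

Both calls return the same fraction, which is why the factor 2 and the denominator $1+\pi^*(x)$ appear: presences are counted twice, and the augmented population is larger than $\mathcal{U}$ by the number of presences.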

Corollary 1

Under the assumption that, given $Y$, the inclusion into the sample ($C=1$) is conditionally independent of the covariates $X$, one has

$$\begin{aligned} P(Y=0|C=1,x)\,P(C=1|x)=\frac{1-\pi ^*(x)}{1+\pi ^*(x)}\,\rho _0 \end{aligned}$$

and

$$\begin{aligned} P(Y=1|C=1,x)\,P(C=1|x)=\frac{2\pi ^*(x)}{1+\pi ^*(x)}\,\rho _1. \end{aligned}$$

Proof

In general we have

$$\begin{aligned} P(Y|C=1,x)=\frac{P(C=1|y,x)\,P(Y|x)}{P(C=1|x)}. \end{aligned}$$

(18)

By the conditional independence of $C$ and $X$ given $Y$, (18) becomes

$$\begin{aligned} P(Y|C=1,x)=\frac{P(C=1|y)\,P(Y|x)}{P(C=1|x)}, \end{aligned}$$

hence

$$\begin{aligned} P(Y|C=1,x)P(C=1|x)=P(C=1|y)\,P(Y|x). \end{aligned}$$

Recalling that $P(Y=1|x)=\frac{2\pi ^*(x)}{1+\pi ^*(x)}$, so that $P(Y=0|x)=\frac{1-\pi ^*(x)}{1+\pi ^*(x)}$, and the definitions $\rho _0=P(C=1|Y=0)$ and $\rho _1=P(C=1|Y=1)$, the two identities follow by setting $Y=0$ and $Y=1$ in turn. $\square $
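The two identities of Corollary 1 give the joint probabilities of each outcome and of inclusion in the sample, so normalising them yields $P(Y=1|C=1,x)$. The following sketch illustrates that step under the corollary's assumptions; the function name and the numerical inputs are hypothetical.

```python
def sample_presence_prob(pi_star, rho0, rho1):
    """P(Y=1 | C=1, x) obtained by normalising the terms of Corollary 1.

    pi_star : pi*(x), the population-level probability of presence at x
    rho0    : P(C=1 | Y=0)
    rho1    : P(C=1 | Y=1)
    """
    # P(Y=0 | C=1, x) P(C=1 | x)
    joint_absent = (1 - pi_star) / (1 + pi_star) * rho0
    # P(Y=1 | C=1, x) P(C=1 | x)
    joint_present = 2 * pi_star / (1 + pi_star) * rho1
    # The two joint terms sum to P(C=1 | x).
    p_included = joint_absent + joint_present
    return joint_present / p_included


# Illustrative values only.
print(sample_presence_prob(pi_star=0.3, rho0=0.01, rho1=0.0175))
```

When $\rho_0=\rho_1$ the sampling step has no effect and the function returns $2\pi^*(x)/(1+\pi^*(x))$, the design-population probability of Proposition 1.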

Property 2

Under the case-control design adjusted for presence-only data the logistic regression function $\phi _{pod}(x)$ is given by

$$\begin{aligned} \phi _{pod}(x)={\text {log}}\frac{\pi ^*(x)}{1-\pi ^*(x)}+{\text {log}}\frac{n_{1u}+n_p}{n_{0u}}-{\text {log}}\frac{\pi }{1-\pi }. \end{aligned}$$

Proof

By the definition of the logistic regression function for presence-only data, Corollary 1 and Property 1, one has

$$\begin{aligned} \phi _{pod}(x)=\, & {} {\text {logit}} P(Y=1|C=1,x)\\=\, & {} {\text {log}} \frac{P(Y=1|C=1,x)}{P(Y=0|C=1,x)}\\=\, & {} {\text {log}} \frac{P(Y=1|C=1,x)P(C=1 \mid x)}{P(Y=0|C=1,x)P(C=1 \mid x)}\\=\, & {} {\text {log}} \frac{\frac{2\pi ^*(x)}{1+\pi ^*(x)}\,\rho _1}{\frac{1-\pi ^*(x)}{1+\pi ^*(x)}\,\rho _0}\\=\, & {} {\text {log}} \frac{2\pi ^*(x) \frac{n_{1u}+n_p}{2 \pi N}}{[1-\pi ^*(x)]\frac{n_{0u}}{(1-\pi )N}}\\=\, & {} {\text {log}} \frac{\pi ^*(x) \frac{n_{1u}+n_p}{\pi }}{[1-\pi ^*(x)]\frac{n_{0u}}{1-\pi }}\\=\, & {} {\text {log}}\frac{\pi ^*(x)}{1-\pi ^*(x)}+{\text {log}}\frac{n_{1u}+n_p}{n_{0u}}-{\text {log}}\frac{\pi }{1-\pi }. \end{aligned}$$

$\square $
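Property 2 says that the case-control design shifts the population logit by a constant that does not depend on $x$. A minimal sketch of that computation follows; the function names and the example counts are assumptions made for illustration.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def phi_pod(pi_star, n0u, n1u, n_p, prevalence):
    """Logistic regression function for presence-only data (Property 2).

    pi_star    : pi*(x), population-level probability of presence at x
    n0u, n1u   : sampled absences and presences from U (hypothetical counts)
    n_p        : observed presences sampled from U_p
    prevalence : population prevalence pi
    """
    design_offset = math.log((n1u + n_p) / n0u) - logit(prevalence)
    return logit(pi_star) + design_offset


# Illustrative values only.
for pi_star in (0.1, 0.3, 0.7):
    value = phi_pod(pi_star, n0u=80, n1u=20, n_p=50, prevalence=0.2)
    # The difference from logit(pi*) is the same for every x.
    print(pi_star, value, value - logit(pi_star))
```

The printed difference is identical across the three values of $\pi^*(x)$, which reflects that only the intercept of the linear predictor is affected by the design.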

Proposition 2

Using the approximation (10) of the ratio (8), the posterior predictive probability of occurrence for an unobserved response $Y=y$ in the sub-sample $S_u$ is approximated by the probability law ${\mathbb {P}}$ that generates the data at the population level, that is

$$\begin{aligned} P(Y=1|Z=0,C=1,x) \approx \pi ^*(x). \end{aligned}$$

Proof

From the conditional independence between $Z$ and $X$ given $Y$, the predictive probability of occurrence in $S_u$ is given by

$$\begin{aligned} P(Y=1|Z=0,C=1,x)=\dfrac{P(Z=0|Y=1,C=1)P(Y=1|C=1,x)}{P(Z=0|C=1,x)}. \end{aligned}$$

From Table 2 one has that $P(Z=0|Y=1,C=1)= \frac{n_{1u}}{n_p+n_{1u}}$ and hence

$$\begin{aligned} P(Y=1|Z=0,C=1,x)= \dfrac{n_{1u}}{n_p+n_{1u}}\,\dfrac{P(Y=1|C=1,x)}{P(Z=0|C=1,x)}. \end{aligned}$$

Now, recalling that in the general case one has

$$\begin{aligned} P(Y=1|C=1,x) \approx \frac{\left( 1+\frac{n_p}{n_{1u}}\right) \, {\text {exp}}\{\phi (x)\}}{1+\left( 1+\frac{n_p}{n_{1u}}\right) {\text {exp}}\{\phi (x)\}} \end{aligned}$$

(19)

and

$$\begin{aligned} P(Z=0|C=1,x) \approx \frac{1+{\text {exp}}\{\phi (x)\}}{1+\left( 1+\frac{n_p}{n_{1u}}\right) {\text {exp}}\{\phi (x)\}}, \end{aligned}$$

(20)

by substituting (19) and (20) into the previous expression, one obtains

$$\begin{aligned} P(Y=1|Z=0,C=1,x)\approx \,& {} \dfrac{n_{1u}}{n_p+n_{1u}} \,\dfrac{\left( 1+\frac{n_p}{n_{1u}}\right) \, {\text {exp}} \{ \phi (x) \}}{1+ {\text {exp}} \{ \phi (x) \}} \\=\, & {} \dfrac{{\text {exp}}\{\phi (x)\}}{1+{\text {exp}}\{\phi (x)\}}\\=\, & {} \pi ^*(x). \end{aligned}$$

$\square $
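The cancellation in the last step can be confirmed numerically. The sketch below evaluates the right-hand sides of (19) and (20), combines them as in the proof, and compares the result with the inverse logit of $\phi(x)$; the function names and the counts are hypothetical, and $n_{1u}$ is treated as known purely for illustration.

```python
import math


def predictive_presence_unobserved(phi, n1u, n_p):
    """Approximate P(Y=1 | Z=0, C=1, x) assembled from (19) and (20)."""
    k = 1 + n_p / n1u                 # the factor (1 + n_p / n_1u)
    e = math.exp(phi)
    p_y1 = k * e / (1 + k * e)        # right-hand side of (19)
    p_z0 = (1 + e) / (1 + k * e)      # right-hand side of (20)
    return n1u / (n_p + n1u) * p_y1 / p_z0


def inverse_logit(phi):
    """exp(phi) / (1 + exp(phi)), i.e. pi*(x) in the proof."""
    return 1 / (1 + math.exp(-phi))


# Illustrative values only.
for phi in (-2.0, 0.0, 1.5):
    print(phi,
          predictive_presence_unobserved(phi, n1u=20, n_p=50),
          inverse_logit(phi))
```

The two printed columns agree up to floating-point error for any positive counts, since the factor $1+n_p/n_{1u}$ equals $(n_{1u}+n_p)/n_{1u}$ and cancels with the leading ratio.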

About this article

Cite this article

Divino, F., Golini, N., Jona Lasinio, G. et al. Bayesian logistic regression for presence-only data. Stoch Environ Res Risk Assess 29, 1721–1736 (2015). https://doi.org/10.1007/s00477-015-1064-y

Published: 31 March 2015
Issue Date: August 2015
DOI: https://doi.org/10.1007/s00477-015-1064-y
