Prevalence affects the evaluation of discrimination capacity in presence-absence species distribution models

  • Original Paper
Biodiversity and Conservation

Abstract

The aim of this study is to understand how prevalence—the ratio of instances of presence to total sample size—affects the estimation of three discrimination indexes commonly used in distribution modelling: the area under the receiver operating characteristic curve (AUC), the value of sensitivity at the threshold where sensitivity equals specificity (Se*), and the maximum value of the Youden index or true skill statistic (Y). For four sample size levels, samples of suitability scores for the instances of presence and absence with varying prevalences were simulated from known distributions so that the true values of the discrimination indexes were known, and the three indexes were empirically estimated (AUCest, Se*est, Yest). AUCest and Se*est are unbiased estimators, and the greatest precision is achieved with a balanced prevalence. As sample size increases, there is a larger prevalence interval around 0.5 in which precision is more or less stable. As a rule of thumb, in the case of n ≤ 100, at least ten observations of the rare state (either instances of presence or absence) should be considered, whereas the safety prevalence interval [0.01, 0.99] should be used for higher sample sizes. The lower the true discriminative power of the models, the higher the negative effect of prevalence. Yest is positively biased, and bias and precision become worse towards low and high prevalences. Highly unbalanced prevalences increase the imprecision in estimating the discrimination capacity of the models. Y is not recommended as a discrimination measure since it provides overoptimistic results.
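The simulation design described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code. It assumes untruncated normal score distributions with equal variance for presences and absences, and the function names (`simulate_scores`, `auc_est`, `threshold_indices`, `true_values`) and all parameter values are hypothetical choices made for illustration only. Under that equal-variance assumption the true index values have closed forms, so the empirical estimates can be compared against them across prevalences.

```python
import random
from statistics import NormalDist, mean


def simulate_scores(n, prevalence, mu_abs=0.4, mu_pres=0.6, sd=0.15, rng=None):
    """Draw suitability scores for presences and absences (assumed normal)."""
    if rng is None:
        rng = random.Random(0)
    n_pres = round(n * prevalence)
    pres = [rng.gauss(mu_pres, sd) for _ in range(n_pres)]
    absn = [rng.gauss(mu_abs, sd) for _ in range(n - n_pres)]
    return pres, absn


def auc_est(pres, absn):
    """Empirical AUC: share of presence/absence pairs ranked correctly (ties count 0.5)."""
    wins = sum((p > a) + 0.5 * (p == a) for p in pres for a in absn)
    return wins / (len(pres) * len(absn))


def threshold_indices(pres, absn):
    """Return (Se*est, Yest) by scanning every observed score as a threshold."""
    best_y, se_star, smallest_gap = -1.0, None, float("inf")
    for t in sorted(set(pres + absn)):
        se = sum(p >= t for p in pres) / len(pres)   # sensitivity at t
        sp = sum(a < t for a in absn) / len(absn)    # specificity at t
        best_y = max(best_y, se + sp - 1)            # maximum Youden index
        if abs(se - sp) < smallest_gap:              # closest to Se == Sp
            smallest_gap, se_star = abs(se - sp), se
    return se_star, best_y


def true_values(mu_abs, mu_pres, sd):
    """Closed-form index values for two equal-variance normal distributions."""
    d = (mu_pres - mu_abs) / sd
    phi = NormalDist().cdf
    return {"AUC": phi(d / 2 ** 0.5), "Se*": phi(d / 2), "Y": 2 * phi(d / 2) - 1}


if __name__ == "__main__":
    rng = random.Random(42)
    truth = true_values(0.4, 0.6, 0.15)
    print("true values:", {k: round(v, 3) for k, v in truth.items()})
    for prevalence in (0.05, 0.2, 0.5, 0.8, 0.95):
        runs = []
        for _ in range(200):
            pres, absn = simulate_scores(100, prevalence, rng=rng)
            runs.append((auc_est(pres, absn), *threshold_indices(pres, absn)))
        # Column means of (AUCest, Se*est, Yest) over the replicates
        print(prevalence, [round(mean(col), 3) for col in zip(*runs)])
```

Running the script prints the mean of each estimate per prevalence level next to the closed-form values. Because Yest is the maximum over all candidate thresholds in the same sample, it is the estimate one would expect to drift above its true value, which is the kind of optimism the abstract reports.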

Data availability

Not applicable since no new data were used or generated.

Acknowledgements

This paper is a contribution to the DistriMod (CGL2017-89000-P) project, and A. J.-V. is supported by the Spanish Ramón y Cajal Program (RYC-2013-14441); both are financed by the Spanish Ministry of Science, Innovation and Universities. I am grateful to Adam B. Smith for his useful comments. Lucia Maltez kindly reviewed the English.

Funding

Spanish Ministry of Science, Innovation and Universities (CGL2017-89000-P and RYC-2013-14441).

Author information

Contributions

I am the sole author responsible for the entire study.

Corresponding author

Correspondence to Alberto Jiménez-Valverde.

Ethics declarations

Conflict of interest

The author declares no conflict of interest.

Additional information

Communicated by David Hawksworth.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jiménez-Valverde, A. Prevalence affects the evaluation of discrimination capacity in presence-absence species distribution models. Biodivers Conserv 30, 1331–1340 (2021). https://doi.org/10.1007/s10531-021-02144-4

  • DOI: https://doi.org/10.1007/s10531-021-02144-4
