Abstract
Prevalence (the presence/absence ratio in the training data) is commonly thought to influence the reliability of the predictions of species distribution models. However, little is known about its precise impact. We studied its effects using a virtual species, which avoids unaccounted-for effects in the modeling process (false absences, non-explanatory predictors, etc.). We sampled the distribution of the virtual species to obtain several data subsets of varying sample size and prevalence, and then modeled these subsets using logistic regression. Our results show that model predictions can be highly accurate over a wide range of sample sizes and prevalence values, provided that the predictors are truly related to the distribution of the species and the training data are reliable. The effect of sample size becomes apparent for datasets of fewer than 70 data points, and the effect of prevalence is significant only for extremely unbalanced samples (prevalence <0.01 or >0.99). There is also a strong interaction between sample size and prevalence, indicating that the most detrimental factor is the sample size of each event (absence and/or presence), and not biased prevalence, as previously thought. We suggest that, in the real world, an interaction must exist between the sample size of each event and the quality of the training data. We argue that biased prevalence can be a desirable property of the data rather than a problem to be avoided, and we point out the importance of using the best absence data possible when modeling the distribution of species of narrow geographic range.
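The sampling design summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional environmental gradient, the virtual species, the helper names (`sample_with_prevalence`, `fit_logistic`, `auc`) and the grid of sample sizes and prevalences are all hypothetical. Prevalence is assumed here to mean the proportion of presences in the training subset, and the logistic regression is fitted by plain gradient ascent with NumPy only.

```python
import numpy as np

def sample_with_prevalence(presence, n, prevalence, rng):
    """Draw n cells so that round(n * prevalence) of them are presences."""
    n_pres = int(round(n * prevalence))
    n_abs = n - n_pres
    pres_idx = np.flatnonzero(presence == 1)
    abs_idx = np.flatnonzero(presence == 0)
    chosen_pres = rng.choice(pres_idx, size=n_pres, replace=False)
    chosen_abs = rng.choice(abs_idx, size=n_abs, replace=False)
    return np.concatenate([chosen_pres, chosen_abs])

def fit_logistic(x, y, n_iter=2000, lr=0.1):
    """Fit intercept and slope of a one-predictor logistic regression."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)  # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-eta))
        beta += lr * X.T @ (y - p) / y.size   # gradient ascent on the log-likelihood
    return beta

def auc(y, score):
    """Area under the ROC curve as the share of correctly ordered
    presence/absence pairs (ties count one half)."""
    pos = score[y == 1]
    neg = score[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (pos.size * neg.size)

# Hypothetical virtual species on a single standardized gradient (assumption).
rng = np.random.default_rng(0)
gradient = rng.normal(size=2000)
p_true = 1.0 / (1.0 + np.exp(-2.0 * gradient))
presence = (rng.random(gradient.size) < p_true).astype(int)

if __name__ == "__main__":
    for n in (20, 70, 500):                 # illustrative sample sizes
        for prevalence in (0.05, 0.5, 0.95):  # illustrative prevalences
            idx = sample_with_prevalence(presence, n, prevalence, rng)
            beta = fit_logistic(gradient[idx], presence[idx].astype(float))
            score = beta[0] + beta[1] * gradient  # linear predictor for all cells
            n_pres = int(presence[idx].sum())
            print(n, prevalence, n_pres, n - n_pres, round(auc(presence, score), 3))
```

Printing the number of presences and absences next to each AUC value makes the interaction visible: a subset of 20 points at a prevalence of 0.05 contains a single presence, so the sample size of the rarer event, not the prevalence value itself, is what becomes limiting.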
Abbreviations
- AUC: Area Under the Receiver Operating Characteristic Curve
- PCA: Principal Component Analysis
- ROC: Receiver Operating Characteristic Curve
Rights and permissions
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Jiménez-Valverde, A., Lobo, J.M. & Hortal, J. The effect of prevalence and its interaction with sample size on the reliability of species distribution models. COMMUNITY ECOLOGY 10, 196–205 (2009). https://doi.org/10.1556/ComEc.10.2009.2.9
DOI: https://doi.org/10.1556/ComEc.10.2009.2.9