Modelling skewed data with many zeros: A simple approach combining ordinary and logistic regression

Environmental and Ecological Statistics

Abstract

We discuss a method for analyzing data that are positively skewed and contain a substantial proportion of zeros. Such data commonly arise in ecological applications where the focus is on the abundance of a species; the form of the distribution then reflects the patchy nature of the environment and/or the inherent heterogeneity of the species. The method can be used whenever we wish to model the data as a response variable in terms of one or more explanatory variables. The analysis consists of three stages. The first involves creating two sets of data from the original: one shows whether or not the species is present; the other gives the logarithm of the abundance when the species is present. These are referred to as the ‘presence data’ and the ‘log-abundance data’, respectively. The second stage involves modelling the presence data using logistic regression, and separately modelling the log-abundance data using ordinary regression. The third stage involves combining the two models in order to estimate the expected abundance for a specific set of values of the explanatory variables. A common approach to analyzing this sort of data is to use a ln(y + c) transformation, where c is some constant (usually one). The method we use here avoids the need for an arbitrary choice of the value of c, and allows the modelling to be carried out in a natural and straightforward manner, using well-known regression techniques. The approach we put forward is not original, having been used in both conservation biology and fisheries. Our objectives in this paper are to (a) promote the application of this approach in a wide range of settings and (b) suggest that parametric bootstrapping be used to provide confidence limits for the estimate of expected abundance.
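The three stages and the bootstrap step described in the abstract can be sketched in Python as follows. This is an illustrative sketch only, not the authors' code: the function names, the simulated data, and the NumPy-only fitting routines are all hypothetical. It also makes two assumptions the abstract does not spell out: that the two models are combined through the usual lognormal mean, so expected abundance is taken as P(present) × exp(μ + σ²/2), and that the confidence limits are simple percentile limits from the parametric bootstrap.

```python
import numpy as np

def inv_logit(eta):
    # Clip the linear predictor to avoid overflow in exp().
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))

def fit_logistic(X, z, n_iter=50):
    """Logistic regression by Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = inv_logit(X @ beta)
        w = np.clip(p * (1.0 - p), 1e-9, None)
        # Newton step: solve (X' W X) delta = X' (z - p).
        delta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (z - p))
        beta = beta + delta
        if np.max(np.abs(delta)) < 1e-10:
            break
    return beta

def fit_ols(X, y):
    """Ordinary least squares; returns coefficients and residual variance."""
    gamma, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ gamma
    dof = max(len(y) - X.shape[1], 1)
    return gamma, float(resid @ resid) / dof

def fit_two_part(X, y):
    # Stage 1: split into presence data and log-abundance data.
    present = y > 0
    # Stage 2: logistic regression for presence, ordinary regression for log-abundance.
    beta = fit_logistic(X, present.astype(float))
    gamma, s2 = fit_ols(X[present], np.log(y[present]))
    return beta, gamma, s2

def expected_abundance(x0, beta, gamma, s2):
    # Stage 3 (assumed form): P(present) times the lognormal mean.
    return inv_logit(x0 @ beta) * np.exp(x0 @ gamma + s2 / 2.0)

def bootstrap_limits(X, x0, fit, n_boot=200, level=0.95, seed=0):
    """Percentile limits from a parametric bootstrap of the fitted two-part model."""
    rng = np.random.default_rng(seed)
    beta, gamma, s2 = fit
    p, mu, sd = inv_logit(X @ beta), X @ gamma, np.sqrt(s2)
    estimates = []
    for _ in range(n_boot):
        z = rng.random(len(p)) < p                      # simulated presence
        y_sim = np.where(z, np.exp(rng.normal(mu, sd)), 0.0)
        if z.sum() <= X.shape[1]:                       # too few presences to refit
            continue
        estimates.append(expected_abundance(x0, *fit_two_part(X, y_sim)))
    alpha = (1.0 - level) / 2.0
    return np.quantile(estimates, [alpha, 1.0 - alpha])

# Simulated example data (hypothetical): one explanatory variable x.
rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-1.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
is_present = rng.random(n) < inv_logit(0.5 + 1.0 * x)
y = np.where(is_present, np.exp(rng.normal(1.0 + 0.5 * x, 0.7)), 0.0)

fit = fit_two_part(X, y)
x0 = np.array([1.0, 0.0])                               # predict at x = 0
est = expected_abundance(x0, *fit)
lo, hi = bootstrap_limits(X, x0, fit)
print(f"estimated expected abundance: {est:.2f}, 95% limits: ({lo:.2f}, {hi:.2f})")
```

Note that no constant c appears anywhere: zeros are handled by the presence model, and logarithms are taken only of positive abundances. In practice a statistics package would normally replace the hand-written fitting routines; they are included here only to keep the sketch self-contained.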

Author information

Corresponding author

Correspondence to David Fletcher.

About this article

Cite this article

Fletcher, D., MacKenzie, D. & Villouta, E. Modelling skewed data with many zeros: A simple approach combining ordinary and logistic regression. Environ Ecol Stat 12, 45–54 (2005). https://doi.org/10.1007/s10651-005-6817-1

  • DOI: https://doi.org/10.1007/s10651-005-6817-1
