Modelling skewed data with many zeros: A simple approach combining ordinary and logistic regression
- Cite this article as:
- Fletcher, D., MacKenzie, D. & Villouta, E. Environ Ecol Stat (2005) 12: 45. doi:10.1007/s10651-005-6817-1
We discuss a method for analyzing data that are positively skewed and contain a substantial proportion of zeros. Such data commonly arise in ecological applications, when the focus is on the abundance of a species. The form of the distribution is then due to the patchy nature of the environment and/or the inherent heterogeneity of the species. The method can be used whenever we wish to model the data as a response variable in terms of one or more explanatory variables. The analysis consists of three stages. The first involves creating two sets of data from the original: one shows whether or not the species is present; the other indicates the logarithm of the abundance when it is present. These are referred to as the ‘presence data’ and the ‘log-abundance’ data, respectively. The second stage involves modelling the presence data using logistic regression, and separately modelling the log-abundance data using ordinary regression. Finally, the third stage involves combining the two models in order to estimate the expected abundance for a specific set of values of the explanatory variables. A common approach to analyzing this sort of data is to use a ln (y+c) transformation, where c is some constant (usually one). The method we use here avoids the need for an arbitrary choice of the value of c, and allows the modelling to be carried out in a natural and straightforward manner, using well-known regression techniques. The approach we put forward is not original, having been used in both conservation biology and fisheries. Our objectives in this paper are to (a) promote the application of this approach in a wide range of settings and (b) suggest that parametric bootstrapping be used to provide confidence limits for the estimate of expected abundance.