1 Introduction

One of the long-standing challenges in quantitative ecology is to model biodiversity quantities, such as a species’ presence/absence, abundance or biomass. In the last decade or so there have been rapid methodological developments and a wide range of applications (Guisan and Zimmermann 2000; Ferrier and Guisan 2006; Pitcher et al. 2007; Elith and Leathwick 2009; Gattone and Battista 2009; Lozier et al. 2009; Franklin 2010; Bax 2011). The core of the modelling challenge is typically a regression-type problem: how to relate biodiversity to a set of descriptors (covariates) such as environmental or anthropogenic variables. It involves partitioning the variability in the data into a part that is common to all data (the signal) and a part that remains unexplained (the noise). There are two types of variance that can adversely affect the model: one is due to sampling biodiversity (considered in this article), and the other is due to sampling/predicting the covariates (e.g. Foster et al. 2012; Stoklosa et al. 2015).

It is tempting, when modelling biodiversity attributes, to ignore any variance in the data due to sampling (including imperfect observations). This assumes that the manner in which data were collected is unimportant, or that it will simply add to the random part of the model and not the signal about biodiversity. However, this requires assumptions that are unlikely to be met. The unfortunate implication is that sampling issues can affect inferences. Accounting for sampling effects within a model requires careful consideration as they tend to vary from one survey to the next.

One sampling effect, which is often ignored, arises when the data are a sample (not a census) of the biological material at a sample location. We refer to these data as imperfect observations, and they are the central topic of the work presented here. An example of where imperfect observations occur is in marine surveys, where a large amount of biotic material is obtained (too much for scientific processing of all the material). The practical way to quantify all of the biotic material is to process only a sample of the catch (a process called sub-sampling). Sub-sampling can take a number of forms: sampling from all the biotic material as a single group, or sampling separately from different broad taxonomic or size strata. This process adds another layer of variability to the data.

The methods presented here have broader application. The effect of imperfect observations is explored in this work and simple methods for adjusting statistical models for analysis of these types of data are presented. The model unifies many of the disparate research areas that consider imperfect observations, which has not been done before. One key point of distinction in previous approaches for accounting for imperfect observations is whether the focus is on sampling an individual organism (individual detectability), or a species (species detectability). For example, species abundance and biomass are related to individual detectability, which has been studied as ‘ascertainment’ (Fisher 1934; Fisher et al. 1943; Rao 1965), ‘detection probability’ (Borchers et al. 2002; Buckland et al. 2004), or as ‘attenuation’ (Shimadzu and Darnell 2015). In contrast, species presence/absence and richness are more closely related to species detectability, which has been studied as ‘rarefaction’ (Sanders 1968; Hurlbert 1971; Simberloff 1972; Heck et al. 1975) and ‘occupancy’ (MacKenzie et al. 2002). However, these approaches concentrate on species-level data and fail to exploit the fact that, for a species to be detected, at least one of its individuals needs to be sampled. So individual detection must play a pivotal role in understanding species detection.

Our approach utilises a compound distribution of the possible number of individuals at a site. It highlights the precise data needed to disentangle the number of individuals and the probability of sampling. We present a modelling framework, which is trivially implemented in software packages, to handle imperfect observations (including detectability issues and sub-sampling). The formal modelling framework has some practical benefits as well: predictions of the true number of individuals at any sampling site can be made through predictive distributions, and the effect of imperfect observations can be incorporated easily when designing surveys. The principles and methods are illustrated throughout the manuscript with two marine data sets from tropical northern Australia. Both of these examples are for a particular case of imperfect observations, namely sub-sampling. However, we note that the methods presented apply to a wider class of applications, namely any situation where the probability of observation is less than one.

2 Conceptual framework for imperfect observation

Every ecological data set is the result of sampling from a population of interest and every ecological datum can likewise be thought of as a sample (sometimes a census) of the population at a site. Here and elsewhere we use the statistical term ‘population’ to mean the individuals at a sampling site, as this is our prime interest. We note, however, that much of the suggested framework could extend to a broader definition of population. For notational convenience, we omit possible site subscripts. Formally, the principle of sampling is the random partition of a population \({\mathcal {P}}=\{\omega _i\}_{i=1}^{M_0}\), of size \(M_0=|{\mathcal {P}}|\), into two disjoint categories: the sample \({\mathcal {S}}\), and the remainder of the population \({\mathcal {S}}^c = {\mathcal {P}}\setminus {\mathcal {S}}\). These two sets are disjoint, \({\mathcal {S}}^c \cap {\mathcal {S}} = \emptyset \). Each element of the population, \(\omega _i\), is typically an individual organism but it may also be a colony or a family in certain situations (e.g. corals and sponges). We shall use the ‘individual’ nomenclature to describe all possibilities. Note that we treat \(M_0\) as random throughout the paper, in order to delineate the extent to which the expected abundance, E[\({M_0}\)], responds to different environmental conditions; more details will be discussed in the later sections.

For ease of exposition, we introduce a random variable \(Z_i\) to indicate whether the population’s i-th element is in the sample \({\mathcal {S}}\) or not; it is defined as \(Z_i = I(\omega _i \in {\mathcal {S}})\). This simple variable enables us to efficiently describe key biodiversity measures such as species abundance, presence/absence and biomass of the population under study.

  • Abundance Species abundance in a sample, \(M_1 = |{\mathcal {S}}|\), is given as the compound form

    $$\begin{aligned} M_1 = {\left\{ \begin{array}{ll} \displaystyle \sum _{i=1}^{M_0} Z_i, &{}\quad (M_0 > 0);\\ 0, &{}\quad (M_0 = 0). \end{array}\right. } \end{aligned}$$
    (1)
  • Species presence/absence Species presence/absence in a sample \(Y_1\) can be defined by using an indicator function \(I(\cdot )\) as

    $$\begin{aligned} Y_1 = I(M_1 > 0). \end{aligned}$$
    (2)
  • Biomass Species biomass \(V_1\) in a sample can also be defined in a compound form as an extension of abundance. Let \(W_i\) be the weight of the i-th organism then biomass becomes

    $$\begin{aligned} V_1 = \sum _{i=1}^{M_0} W_i Z_i. \end{aligned}$$

    Note that \(V_1=0\) when \(M_0=0\) as in Eq. (1). If the individual weights \(W_i\) are independent gamma random variables and the abundance is Poisson distributed, then the biomass \(V_1\) follows a Poisson–gamma distribution. This formulation has been exploited previously (Foster and Bravington 2013) as a special case of the Tweedie distribution (Jørgensen 1997; Dunn and Smyth 2005).

These descriptions provide a natural basis to handle imperfect observations. When the observations are imperfect, the probability of an individual being sampled is less than one, that is \(\Pr (Z_i = 1) < 1\). We shall refer to this probability as the individual detection probability. This probability can be assumed to be common, or varying, over all individuals, \(\{\omega _i\}_{i=1}^{M_0}\). The situation when the individual detection probability is constant over individuals is classically described as simple random sampling. Simple random sampling and immediate extensions will be discussed in the following sections.

Another useful construct is the species detection probability, which is defined as one minus the probability that none of the individuals of the species is observed: \(1-\prod _{i=1}^{M_0}\left\{ 1-\Pr (Z_i = 1)\right\} \); this obviously assumes individual independence. The species detection probability is a function of individual detection probability, \(\Pr (Z_i = 1)\), and species abundance, \(M_0\).
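As an informal illustration of the definitions in this section, the following Python sketch draws the indicators \(Z_i\) for one site and computes the sampled abundance, presence/absence and biomass, together with the species detection probability. The function names, the gamma weights and all numeric values are assumptions made purely for illustration; they are not taken from the surveys analysed later.

```python
import random


def species_detection_probability(detect_probs):
    """1 - prod_i {1 - Pr(Z_i = 1)}, assuming individuals are sampled independently."""
    prob_none = 1.0
    for r in detect_probs:
        prob_none *= 1.0 - r
    return 1.0 - prob_none


def sample_site(weights, detect_prob, rng):
    """Sample each individual independently; return (M1, Y1, V1) for one site.

    weights     : W_i for the M0 individuals present (an empty list means M0 = 0)
    detect_prob : common individual detection probability Pr(Z_i = 1)
    """
    z = [1 if rng.random() < detect_prob else 0 for _ in weights]  # indicators Z_i
    m1 = sum(z)                                    # Eq. (1): sampled abundance
    y1 = 1 if m1 > 0 else 0                        # Eq. (2): sampled presence/absence
    v1 = sum(w * zi for w, zi in zip(weights, z))  # sampled biomass
    return m1, y1, v1


rng = random.Random(1)
# Hypothetical site with M0 = 8 individuals and gamma-distributed weights.
weights = [rng.gammavariate(2.0, 1.5) for _ in range(8)]
m1, y1, v1 = sample_site(weights, detect_prob=0.3, rng=rng)
p_species = species_detection_probability([0.3] * len(weights))  # equals 1 - 0.7**8
```

With a common individual detection probability, the species detection probability reduces to \(1-(1-r)^{M_0}\), which is why it rises with abundance even when \(r\) is fixed.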

3 Models for imperfect observation

3.1 Compound distributions

Since we can only deal with the sample (observable) abundance \(M_1\), its probability mass function, \(f(m_1|\varvec{x})\) say, plays the pivotal role in modelling. Here, \(\varvec{x}\) represents auxiliary information/variables that act as covariates in the regression model. The simple expression for abundance, Eq. (1), gives the mechanism for defining the probability mass function. It becomes a compound distribution (Feller 1968), also known as a generalised distribution (Gurland 1957) and a stopped-sum distribution (Johnson et al. 1992)

$$\begin{aligned} f(m_{1} | \varvec{x})= & {} \sum _{m_0=0}^\infty f(m_1, m_0 | \varvec{x}) \nonumber \\= & {} \sum _{m_0=0}^\infty f(m_0 | \varvec{x}) f(m_1 | m_0), \end{aligned}$$
(3)

where conditional independence has been assumed. This assumption implies that the auxiliary variables play no role in the process of sampling the population. The first term in Eq. (3), \(f(m_0|\varvec{x})\), describes the way that the true, and not the sampled, abundance varies with the covariates. This is appealing because a model defined on the sampled abundance cannot give inferences about the true abundance, which is what is ideally sought. The second term, \(f(m_1|m_0)\), defines the sampling process. The sampling process does not depend on covariates; it merely describes how the individuals are drawn from the population. Note that this assumption may not match reality, and the sampling process may depend on other auxiliary variables. It is possible to relax this assumption by specifying the joint conditional distribution as \(f(m_0|\varvec{x})f(m_1|m_0,\varvec{x})\). However, we keep this assumption in this work for ease of exposition of our framework.

3.2 Sampling mechanisms

In general, sampling mechanisms are survey specific and can explicitly be described by \(f(m_1 | m_0)\) in Eq. (3). Two major sampling procedures in ecological studies are: simple random sampling and stratified sampling (Cochran 1977). We consider these two cases as examples.

  • Simple random sampling Under the simple random sampling scheme, the probability of observing a sample \({\mathcal {S}}\) consisting of \(m_1\) individuals from the population \({\mathcal {P}}\) of \(m_0\) individuals is the product, over individuals, of the probability of being sampled, \(\Pr (\omega _i \in {\mathcal {S}})\), or of not being sampled, \(1-\Pr (\omega _i \in {\mathcal {S}})\). If it is assumed that the probability of being sampled is common among individuals, \(\Pr (\omega _i \in {\mathcal {S}})=r\) say, then the probability of sampling \(m_1\) individuals from a population with size \(m_0\) is a binomial distribution \(\mathsf{Bi}(m_{1}; m_{0}, r)\),

    $$\begin{aligned} f(m_1 | m_0) = {m_0 \atopwithdelims ()m_1} r^{m_1} (1-r)^{m_0 - m_1}. \end{aligned}$$

Simple random sampling assumes that the probability of individual detection is homogeneous, which can be inappropriate for some surveys. The idea of stratified sampling, discussed presently, allows the assumption of population homogeneity to be relaxed. This is achieved by defining strata, within which the individuals have a homogeneous probability of being sampled. The sampling probability between strata may vary, inducing population heterogeneity.

  • Stratified sampling Let \({\mathcal {U}}_j\) be the j-th stratum in stratified sampling. The population \({\mathcal {P}}\) then consists of k strata, \({\mathcal {P}} = \cup _{j=1}^k {\mathcal {U}}_j\), \({\mathcal {U}}_j \cap {\mathcal {U}}_{j'} = \emptyset \ (j \ne j')\), and the individuals in the population are partitioned into the strata. The \(m_0\) individuals are randomly partitioned as \(\varvec{m}_0 = (m_{01}, m_{02}, \ldots , m_{0k}), m_{0j}=|{\mathcal {U}}_j|\). This partitioning mechanism can be described by a multinomial distribution, \(\mathsf{Mn}(\varvec{m}_0; m_0, \varvec{p})\), with parameters \(\varvec{p}=(p_1, p_2, \ldots , p_k)\) giving the probability of belonging to each multinomial class. The samples are then randomly drawn from each stratum j with sampling fraction \(r_j\). This draw is independent between the strata, so each stratum’s sampling process can be described by a binomial distribution with simple random sampling, as above. In terms of the imperfect observation model (3) the probability mass function of the stratified sample is

    $$\begin{aligned} f(\varvec{m}_1 | m_0)= & {} f(\varvec{m}_1 | \varvec{m}_0) f(\varvec{m}_0 | m_0)\\= & {} \left\{ \prod _{j=1}^k {m_{0j} \atopwithdelims ()m_{1j}} r_j^{m_{1j}} (1-r_j)^{m_{0j}-m_{1j}} \right\} {m_0 \atopwithdelims ()m_{01}, \ldots , m_{0k} } \prod _{j=1}^k p_j^{m_{0j}}, \end{aligned}$$

    where \(\sum p_j=1\).

3.3 Marginal distributions

In theory, the probability mass function \(f(m_0|\varvec{x})\) in Eq. (3) can take any plausible form. In this paper, we focus our attention on the commonly used Poisson model, although other distributions could be used. That is, \(M_0 \sim \mathsf{Po}(\lambda ({\varvec{x})})\) where the Poisson probability function is

$$\begin{aligned} f(m_0 | \varvec{x}) = \frac{{\lambda (\varvec{x})}^{m_{0}}}{m_{0}!}e^{-\lambda (\varvec{x})}. \end{aligned}$$
(4)

The idea behind this is that there is a systematic component, \(\lambda (\varvec{x})\), which drives the expected number of individuals according to different environmental conditions, \(\varvec{x}\). When this is coupled with a sampling model, such as simple random sampling or stratified sampling, the distribution of the sampled species abundance \(M_1\) is marginalised over the true abundance, \(m_0\), and becomes another Poisson, \(\mathsf{Po}(m_{1}; \alpha \lambda (\varvec{x}))\):

$$\begin{aligned} f(m_1 | \varvec{x}) = \frac{\left( \alpha \lambda (\varvec{x})\right) ^{m_{1}}}{m_{1}!}e^{-\alpha \lambda (\varvec{x})}, \end{aligned}$$
(5)

where \(0 < \alpha \le 1\) is the individual detection probability, or sampling fraction. For simple random sampling, \(\alpha = r\) so that \(f(m_1 | \varvec{x}) = \mathsf{Po}(m_{1}; r\lambda (\varvec{x}))\). For stratified sampling, \(\alpha \) depends on each stratum j so it becomes \(\alpha _j = p_jr_j\) as

$$\begin{aligned} f(\varvec{m}_1 | \varvec{x}) = \prod _{j=1}^k \mathsf{Po}(m_{1j}; p_jr_j\lambda (\varvec{x})). \end{aligned}$$
(6)

See Appendix for the detailed derivations. Clearly the effect of sampling with these sampling schemes is to scale the expected abundance down by a constant factor, \(\alpha \).
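The marginalisation behind Eqs. (5) and (6) can be checked numerically. The sketch below, using only the Python standard library, evaluates the compound sum of Eq. (3) with a truncated upper limit and compares it with the closed-form Poisson results. The truncation points, the two-stratum set-up and all parameter values are assumptions chosen for illustration.

```python
import math


def poisson_pmf(m, rate):
    return math.exp(-rate + m * math.log(rate) - math.lgamma(m + 1))


def binomial_pmf(m1, m0, r):
    return math.comb(m0, m1) * r**m1 * (1.0 - r) ** (m0 - m1)


def marginal_simple(m1, lam, r, upper=200):
    """Eq. (3) with Poisson abundance and binomial sampling, truncated at `upper`."""
    return sum(poisson_pmf(m0, lam) * binomial_pmf(m1, m0, r)
               for m0 in range(m1, upper + 1))


def marginal_two_strata(m11, m12, lam, p, r, upper=120):
    """Stratified case with k = 2: sum over m0 and over the multinomial split."""
    total = 0.0
    for m0 in range(m11 + m12, upper + 1):
        for m01 in range(m11, m0 - m12 + 1):
            m02 = m0 - m01
            split = binomial_pmf(m01, m0, p[0])  # two-class multinomial, p[1] = 1 - p[0]
            total += (poisson_pmf(m0, lam) * split
                      * binomial_pmf(m11, m01, r[0])
                      * binomial_pmf(m12, m02, r[1]))
    return total


lam, r = 6.0, 0.25  # hypothetical intensity and sampling fraction
simple_gap = max(abs(marginal_simple(m1, lam, r) - poisson_pmf(m1, r * lam))
                 for m1 in range(15))

p, rs = (0.4, 0.6), (0.5, 0.2)  # hypothetical stratum probabilities and fractions
target = poisson_pmf(2, p[0] * rs[0] * lam) * poisson_pmf(1, p[1] * rs[1] * lam)
strata_gap = abs(marginal_two_strata(2, 1, lam, p, rs) - target)
```

Both gaps should be at the level of floating-point rounding, which is consistent with the sampled counts being Poisson with rate \(\alpha \lambda (\varvec{x})\) and, in the stratified case, independent across strata.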

In general, conventional ecological modelling can be regarded as a mix of design-based and model-based approaches. The design-based aspect, as we have discussed, leads to a general model (Eq. 5) that plays a key role in species abundance and presence/absence modelling, as we will see in Sect. 3.4. The model-based aspect can also be vital, because the assumption that the individual detection probability, \(\alpha \), is fixed by design may be unreasonable in some cases. The model-based approach treats the detection probability as an unknown function of another variable t, \(\alpha (t)\), which needs to be estimated, for example by maximum likelihood. The estimation component is thus model-based. Although this formulation allows more flexibility to cope with heterogeneity induced by different types of sampling, such as observer error and species rarity, it requires extra care: with the Poisson model the individual detection probability, \(\alpha \), cannot be disentangled from the expected abundance using the data alone. Sprott (1965) studied the condition of the probability generating function and identified this model as being inestimable, amongst other compound distributions. Further information about the sampling mechanism or the population’s rate is required.

As noted, the probability mass function \(f(m_0|\varvec{x})\) in Eq. (3) can take any plausible form, and a negative binomial distribution could also be used. When a negative binomial distribution \(\mathsf{NB}(m_0; s, t)\) is, instead of a Poisson distribution, coupled with a binomial sampling distribution, we still obtain an equivalent result. The marginal distribution is a compound negative binomial distribution and its form is explicitly written as \(\mathsf{NB}(m_1; s, t/\{1-(1-\alpha )(1-t)\})\) (see Appendix for the detailed derivation).
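The negative binomial result can be checked in the same numerical way. The sketch below assumes the parameterisation \(\mathsf{NB}(m; s, t) = \Gamma (m+s)/\{\Gamma (s)\,m!\}\,t^{s}(1-t)^{m}\), which is an assumption on our part since the parameterisation is given only in the Appendix; the parameter values and the truncation point are likewise illustrative.

```python
import math


def negbin_pmf(m, s, t):
    """Assumed parameterisation: Gamma(m+s) / (Gamma(s) m!) * t**s * (1-t)**m."""
    log_pmf = (math.lgamma(m + s) - math.lgamma(s) - math.lgamma(m + 1)
               + s * math.log(t) + m * math.log(1.0 - t))
    return math.exp(log_pmf)


def binomial_pmf(m1, m0, alpha):
    return math.comb(m0, m1) * alpha**m1 * (1.0 - alpha) ** (m0 - m1)


def thinned_negbin_pmf(m1, s, t, alpha, upper=400):
    """Compound sum of Eq. (3) with negative binomial abundance, truncated at `upper`."""
    return sum(negbin_pmf(m0, s, t) * binomial_pmf(m1, m0, alpha)
               for m0 in range(m1, upper + 1))


s, t, alpha = 2.5, 0.3, 0.4                       # hypothetical values
t_star = t / (1.0 - (1.0 - alpha) * (1.0 - t))    # parameter of the marginal distribution
negbin_gap = max(abs(thinned_negbin_pmf(m1, s, t, alpha) - negbin_pmf(m1, s, t_star))
                 for m1 in range(12))
```

Under this parameterisation the shape parameter \(s\) is unchanged by sampling and only the second parameter is modified, mirroring the way sampling rescales the Poisson rate.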

There is a close link between Eq. (5) and the class of N-mixture models introduced by Royle (2004). When the sampling replication is one, Eq. (5) gives the exact analytical expression of the N-mixture model, although Royle (2004) suggested a numerical approximation, calculating instead a finite summation over \(m_0\) up to a reasonably large number. We note that a recent study (Dennis et al. 2015) has pointed out that the choice of this arbitrarily large number in the numerical calculation can result in underestimation of abundance.

3.4 Modelling biological responses

We now discuss how to incorporate the sampling effect into modelling. Incorporation can be easily done using existing software that fit common models, such as generalised linear models (GLMs) (McCullagh and Nelder 1989) and their many extensions, including generalised additive models (GAMs) (Hastie and Tibshirani 1990).

  • Species abundance The sampled abundance \(m_1\) has the expected value, from Eq. (5), as

    $$\begin{aligned} \mu = \hbox {E}[{M_1}] = \alpha \lambda (\varvec{x}). \end{aligned}$$

    This form suggests that the effect of sampling can be treated as an offset using the log link function,

    $$\begin{aligned} \log (\mu ) = \log (\alpha ) + \eta (\varvec{x}), \end{aligned}$$

    where \(\eta (\varvec{x})\) is the (non-)linear predictor of environment variables. So, to convert a Poisson model for the sampled abundance into a model for the population abundance, all one has to do is to include \(\log (\alpha )\) as an offset.

Using an offset term with the log link function in modelling is also commonplace for normalisation, that is, for calculating an expected abundance per unit of sampling space such as area, time duration or the length of a transect. This appears to parallel the modelling approach above, as the size of the sampling space often varies among surveys. However, we stress here that \(0 < \alpha \le 1\) is the individual detection probability, whereas the offset term for normalisation is not related to imperfect observation due to sampling.

Note that the model formulation presented for biological responses is still valid for the model-based approach, which treats the detection probability as an unknown function \(\alpha (t)\), but model fitting will no longer be as simple as using glm or gam in R, unless \(\log (\alpha (t))\) is a linear function. Also note that the model is inestimable when the expected abundance and the sampling probability are confounded, for example when there are no repeat visits to a survey site, since these cannot be disentangled, as we have noted in Sect. 3.3.
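To make the role of the offset concrete, the following sketch fits a two-parameter Poisson regression by Newton–Raphson with and without \(\log (\alpha )\) as an offset. It is a minimal stand-in for what standard GLM software does, not the fitting routine used in this paper. The "counts" are set to their expected values so that the result is deterministic, and the coefficients, covariate grid and sampling fractions are all hypothetical.

```python
import math


def fit_poisson_offset(y, x, offset, iterations=50):
    """Newton-Raphson for log(mu_s) = offset_s + b0 + b1 * x_s (Poisson likelihood)."""
    b0 = math.log(sum(y) / sum(math.exp(off) for off in offset))  # intercept-only start
    b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for ys, xs, off in zip(y, x, offset):
            mu = math.exp(off + b0 + b1 * xs)
            g0 += ys - mu            # score for b0
            g1 += (ys - mu) * xs     # score for b1
            h00 += mu                # entries of the information matrix
            h01 += mu * xs
            h11 += mu * xs * xs
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: beta += H^{-1} g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Noise-free illustration: y_s = alpha_s * exp(beta0 + beta1 * x_s).
beta0, beta1 = 1.0, 0.8
x = [-1.0 + 2.0 * s / 40 for s in range(41)]
alpha = [0.55 - 0.4 * xs for xs in x]            # sampling fraction falls as x rises
y = [a * math.exp(beta0 + beta1 * xs) for a, xs in zip(alpha, x)]

adjusted = fit_poisson_offset(y, x, [math.log(a) for a in alpha])
naive = fit_poisson_offset(y, x, [0.0] * len(x))  # ignores sub-sampling (alpha = 1)
```

In this constructed example the fit with the offset recovers the coefficients of the population model, whereas the fit without it absorbs the trend in the sampling fraction into the slope. The distortion arises here because \(\alpha _s\) was deliberately made to vary with the covariate.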

  • Species presence/absence  From Eq. (2), the distribution of sampled presence/absence, \(Y_1\), is the binarisation of a Poisson random variable:

    $$\begin{aligned} f(y_1 | \varvec{x})= & {} \left( 1-e^{-\alpha \lambda (\varvec{x})}\right) ^{y_1} \left( e^{-\alpha \lambda (\varvec{x})}\right) ^{(1-y_1)}. \end{aligned}$$
    (7)

    It has the expected value

    $$\begin{aligned} \mu = \hbox {E}[{Y_1}] = 1-e^{-\alpha \lambda (\varvec{x})}. \end{aligned}$$

    This equals the probability that the Poisson random variable \(M_1\) takes a non-zero value. In the GLM and GAM contexts, a Bernoulli model for the presence/absence variable is also easily implemented, using the complementary log-log link function, viz

    $$\begin{aligned} \log (-\log (1-\mu )) = \log (\alpha )+ \eta (\varvec{x}). \end{aligned}$$
    (8)

    The sampling effect, \(\log (\alpha )\), again enters as an offset term. The early idea of the complementary log-log link can be found in Fisher (1922) for a dilution assay study and it is more formally stated by McCullagh and Nelder (1989).
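The identity in Eq. (8) can be confirmed with a few lines of Python. The sketch below computes the sampled presence probability from Eq. (7) and checks that its complementary log-log transform is the linear predictor plus \(\log (\alpha )\). The function names and the numeric values are illustrative assumptions.

```python
import math


def cloglog(mu):
    """Complementary log-log link, log(-log(1 - mu))."""
    return math.log(-math.log1p(-mu))


def presence_probability(alpha, eta):
    """E[Y1] = 1 - exp(-alpha * lambda) with lambda = exp(eta)."""
    return -math.expm1(-alpha * math.exp(eta))


alpha, eta = 0.25, 0.7       # hypothetical sampling fraction and linear predictor
mu_sampled = presence_probability(alpha, eta)
mu_census = presence_probability(1.0, eta)    # alpha = 1: no sub-sampling
offset_gap = cloglog(mu_sampled) - (math.log(alpha) + eta)
```

Setting \(\alpha = 1\) in the fitted model therefore gives the presence probability for the whole population at the site rather than for the sample.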

We note here a link to a modelling framework widely used for species presence/absence, the occupancy model (MacKenzie et al. 2002, 2006), which deals with species, not individual, detectability. In fact, the occupancy model can be interpreted as an approximation to Eq. (8) when the species has low abundance or, in other words, a low probability of occupancy. The occupancy model counterpart of Eq. (7) is expressed as a zero-inflated Bernoulli model,

$$\begin{aligned} f(y_1 | \varvec{x}) = \psi p^{y_1} (1-p)^{1-y_1} + (1-\psi ) I(y_1=0), \end{aligned}$$
(9)

where \(0 < \psi \le 1\) is the species occupancy (true presence) probability and \(0 < p \le 1\) is the species detection probability, each of which is modelled in the logit form, such as \(\mathrm{logit}(\psi )=\xi (\varvec{x})\). Since its expected value is \(\mu = \hbox {E}[{Y_1}] = \psi p\), it follows

$$\begin{aligned} \log (\alpha )+ \eta (\varvec{x}) = \log (-\log (1-\mu )) \approx \log (p) + \xi (\varvec{x}) - \psi . \end{aligned}$$

This suggests that for a species with low probability of occupancy, with a small \(\psi \), the occupancy model approximates Eq. (8), utilising species detectability. Note that we here used the facts that \(\log (-\log (1-b)) \approx \log (b)\) and \(\mathrm{logit}(b) \approx \log (b) + b\) for a small b.
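The quality of this approximation can be inspected numerically. The sketch below compares \(\log (-\log (1-\psi p))\) with \(\log (p) + \mathrm{logit}(\psi ) - \psi \) over a few occupancy probabilities; the chosen values of \(\psi \) and \(p\) are arbitrary and serve only as an illustration.

```python
import math


def cloglog(mu):
    return math.log(-math.log1p(-mu))


def occupancy_side(psi, p):
    """log(p) + logit(psi) - psi, the right-hand side of the approximation."""
    return math.log(p) + math.log(psi / (1.0 - psi)) - psi


def approximation_gap(psi, p):
    """Absolute difference between cloglog(psi * p) and the occupancy-based form."""
    return abs(cloglog(psi * p) - occupancy_side(psi, p))


# Gap for a rare species (small psi) through to a widespread one (large psi).
gaps = {psi: approximation_gap(psi, p=0.5) for psi in (0.01, 0.1, 0.5, 0.9)}
```

For these values the gap is small when \(\psi \) is small and becomes substantial as \(\psi \) approaches one, in line with the statement that the correspondence holds for species with a low probability of occupancy.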

3.5 The effect of binarisation

As we have shown, species presence/absence data can be treated as the binarisation of species abundance (Eq. 7), which allows us to deal with sampling effects. This also implies that a species presence/absence model is able to predict species abundance, by using the information in the estimated intensity function \(\hat{\lambda }(\varvec{x})\), as suggested by Royle and Nichols (2003) for example. Although this is appealing, the cost is wider standard errors for the estimates. This is due to the loss of information by binarisation. Let \(\hat{\lambda }(M_1; \alpha )\) and \(\hat{\lambda }(Y_1; \alpha )\) be the maximum likelihood estimators based on the species abundance and presence/absence data, respectively. As the asymptotic variances of the estimators are given by the inverse of the Fisher information, noting Eqs. (5) and (7), the efficiency of \(\hat{\lambda }(Y_1; \alpha )\) becomes

$$\begin{aligned} \text{ eff }(\hat{\lambda }(Y_1; \alpha )) = \frac{\hbox {Var}[{\hat{\lambda }(M_1; \alpha )]}}{\hbox {Var}[{\hat{\lambda }(Y_1; \alpha )]}} = \frac{\alpha \hat{\lambda }}{e^{\alpha \hat{\lambda }}-1}, \quad (0 < \alpha \le 1). \end{aligned}$$

This suggests that the variance of the parameter estimated from species presence/absence data, relative to that from abundance data, grows exponentially with the mean sampled abundance, \(\alpha \lambda \). For even moderate abundances, this is likely to be substantial, enough to suggest that in many situations modelling abundance from presence/absence data is a risky practice. This is formal verification of an intuitively unsurprising result. We also note Howard et al. (2014) as a recent study reporting empirical evidence of this issue.
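The efficiency expression is easy to evaluate. The sketch below computes it directly and also as a ratio of per-observation Fisher information for \(\lambda \), obtained by differentiating the log-likelihoods of Eqs. (5) and (7). The function names and the example values are illustrative assumptions.

```python
import math


def efficiency(alpha, lam):
    """alpha*lam / (exp(alpha*lam) - 1), the efficiency of the presence/absence estimator."""
    a = alpha * lam
    return a / math.expm1(a)


def information_abundance(alpha, lam):
    """Fisher information for lambda from one Poisson count with mean alpha*lam."""
    return alpha / lam


def information_presence(alpha, lam):
    """Fisher information for lambda from one Bernoulli draw with mean 1 - exp(-alpha*lam)."""
    return alpha**2 / math.expm1(alpha * lam)


alpha, lam = 0.5, 4.0        # hypothetical sampling fraction and intensity
ratio = information_presence(alpha, lam) / information_abundance(alpha, lam)
rare = efficiency(0.1, 0.1)      # very low sampled abundance
common = efficiency(1.0, 5.0)    # moderate sampled abundance
```

For very rare species little is lost by binarisation, whereas at a mean sampled abundance of five the presence/absence estimator retains only a few per cent of the information in the counts.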

3.6 Predictions of biological responses

The estimated intensity function \(\hat{\lambda }(\varvec{x})\) reveals how the biological responses are related to the environmental factors \(\varvec{x}\). It allows us to make predictions of unobserved biological responses: species population abundance, \(M_0\), and presence/absence, \(Y_0\). We describe here two types of predictions. The first is marginal prediction, which is the unconditional expectation of the observation (\(\hbox {E}[{M_0}]\) or \(\hbox {E}[{Y_0}]\)) and is directly derived from the distribution \(f(m_0; \hat{\lambda }(\varvec{x}))\), Eq. (4). The other is conditional prediction (\(\hbox {E}[{M_0 | M_1 = m_1}]\) or \(\hbox {E}[{Y_0 | Y_1 = y_1}]\)), calculated from the distribution \(f(m_0 | m_1;\hat{\lambda }(\varvec{x}))\). Note that \(\hat{\lambda }{(\varvec{x})}\) is used as a plug-in estimate. In a Bayesian analysis, one would incorporate uncertainty in this estimate into the predictive distribution. Due to the dependence on sample data, the conditional distribution (and predictions) are only available at previously sampled locations. However, at other locations one could define the conditional prediction to coincide with the marginal prediction, although there are then no extra data to inform the process.

The conditional predictive distribution is derived by using Bayes’ theorem and the distributions \(f(m_1 | m_0)\), \(f(m_0|\varvec{x})\) and \(f(m_1|\varvec{x})=\sum _{m_0}f(m_1|m_0)f(m_0|\varvec{x})\). The first two terms are already specified in the earlier sections. The conditional distribution is given as

$$\begin{aligned} f(m_0 | m_1)= & {} \frac{f(m_1|m_0)f(m_0|\varvec{x})}{f(m_1|\varvec{x})} \nonumber \\= & {} \frac{\left\{ (1-\alpha )\lambda (\varvec{x})\right\} ^{m_0-m_1}}{(m_0-m_1)!}e^{-(1-\alpha )\lambda (\varvec{x})}. \end{aligned}$$
(10)

Note that the distribution (10) shows that the difference \((m_0 - m_1)\), which is the part of the abundance that was present but lost through sampling, also follows a Poisson distribution, with rate parameter \((1-\alpha )\lambda (\varvec{x})\). A closed form for the conditional expectation is not generally available, but it is here due to the Poisson assumption for \(f(m_1|\varvec{x})\). If other assumptions are made then it is likely that the conditional predictive distribution will have to be calculated numerically.

  • Species abundance The predictors are derived respectively from Eqs. (4) and (10) as

    $$\begin{aligned} \hbox {E}[{M_0}]= & {} \lambda (\varvec{x}), \quad \hbox {and}\\ \hbox {E}[{M_0 | M_1=m_1}]= & {} \sum _{m_0=0}^\infty m_0 f(m_0 | m_1)\\= & {} (1-\alpha )\lambda (\varvec{x}) + m_1\\= & {} \hbox {E}[{M_0}] - \hbox {E}[{M_1}] + m_1. \end{aligned}$$

    This is the observed abundance plus the difference between expected true and expected observed abundance. In essence, it takes the observation and adjusts it for what is likely to be missed through sampling.

  • Presence/absence Viewing presence/absence data as the binarisation of a Poisson variable, the predictors are

    $$\begin{aligned} \hbox {E}[{Y_0}]= & {} 1-e^{-\lambda (\varvec{x})}, \quad \hbox {and}\\ \hbox {E}[{Y_0 | Y_1=y_1}]= & {} \sum _{y_0=0}^1 y_0 f(y_0 | y_1)\\= & {} \sum _{y_0=0}^1 y_0 \left\{ \left( 1-e^{-(1-\alpha ) \lambda (\varvec{x})}\right) ^{y_0} \left( e^{-(1-\alpha ) \lambda (\varvec{x})}\right) ^{(1-y_0)} I(y_1=0) \right. \\&\quad \left. + \,y_0 I(y_1=1)\right\} \\= & {} \left( 1-e^{-(1-\alpha ) \lambda (\varvec{x})}\right) I(y_1=0) + I(y_1=1)\\= & {} \frac{\hbox {E}[{Y_0}]-\hbox {E}[{Y_1}]}{1-\hbox {E}[{Y_1}]} I(y_1=0) + I(y_1=1). \end{aligned}$$

    The conditional expectation, \(\hbox {E}[{Y_0 | Y_1=y_1}]\), is 1 if the species is observed and a non-zero probability if the species is not observed. The non-zero probability reflects the difference in expectation between the true and the observed presence/absence record.
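The predictors of this section can be written as short functions and checked against a direct application of Bayes’ theorem. In the sketch below the conditional distribution is computed from the truncated compound sum and compared with the shifted Poisson form of Eq. (10). The function names, truncation point and parameter values are assumptions made for illustration, and \(\lambda \) is treated as known rather than estimated.

```python
import math


def poisson_pmf(m, rate):
    return math.exp(-rate + m * math.log(rate) - math.lgamma(m + 1))


def binomial_pmf(m1, m0, alpha):
    return math.comb(m0, m1) * alpha**m1 * (1.0 - alpha) ** (m0 - m1)


def conditional_pmf_bayes(m1, lam, alpha, upper=200):
    """f(m0 | m1) for m0 = m1, ..., upper via Bayes' theorem (truncated sum)."""
    joint = [poisson_pmf(m0, lam) * binomial_pmf(m1, m0, alpha)
             for m0 in range(m1, upper + 1)]
    total = sum(joint)                      # f(m1 | x), Eq. (3)
    return [j / total for j in joint]


def predict_abundance(m1, lam, alpha):
    """E[M0 | M1 = m1] = (1 - alpha) * lambda + m1."""
    return (1.0 - alpha) * lam + m1


def predict_presence(y1, lam, alpha):
    """E[Y0 | Y1 = y1]: 1 if observed, else 1 - exp(-(1 - alpha) * lambda)."""
    return 1.0 if y1 == 1 else -math.expm1(-(1.0 - alpha) * lam)


lam, alpha, m1 = 6.0, 0.25, 3               # hypothetical values
pmf = conditional_pmf_bayes(m1, lam, alpha)
shifted_poisson_gap = max(abs(f - poisson_pmf(k, (1.0 - alpha) * lam))
                          for k, f in enumerate(pmf[:40]))
numeric_mean = sum((m1 + k) * f for k, f in enumerate(pmf))

e_y0 = -math.expm1(-lam)                    # E[Y0]
e_y1 = -math.expm1(-alpha * lam)            # E[Y1]
ratio_form = (e_y0 - e_y1) / (1.0 - e_y1)   # final expression for E[Y0 | Y1 = 0]
```

The numerical mean of the Bayes-derived distribution agrees with the closed-form abundance predictor, and the ratio form agrees with \(1-e^{-(1-\alpha )\lambda (\varvec{x})}\).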

4 Data analysis

We analyse two ecological data sets from the marine realm. It is common, but not ubiquitous, in marine surveys to sub-sample the biological content at a survey location as the volume of biological material can be large. Sub-sampling (Heales et al. 2000) is performed to reduce the processing time and storage requirements for the biological material. The sub-sampling process sometimes divides the full ecological sample into strata, for example taxa groups or size classes, and then takes a proportion from each stratum. Often the sampling proportion changes between strata. We use these data sets to highlight the ideas and methods introduced in the previous sections. One data set exhibits simple random sampling and the other is generated from stratified sampling (Fig. 1).

Fig. 1 Survey locations. a Great Barrier Reef; b Carnarvon Shelf

4.1 Carnarvon Shelf data

4.1.1 Data and sampling method

Data were collected in a seabed mapping survey of the Carnarvon Shelf (Fig. 1b), offshore from central Western Australia (Brooke et al. 2009). The aim of the survey was to acquire physical and biological data to enable a range of environmental parameters to be tested as surrogates of benthic biodiversity patterns. A Smith-McIntyre grab was deployed at 142 sites. For each grab, a sediment sample (\(\sim \)50 ml) was retained for analysis of textural characteristics while the remaining sediments were processed for infauna. The infauna samples were separated by washing sediments through a 500 \(\upmu \)m sieve and then a sub-sample was taken if necessary. The proportion sampled was recorded as the sampling ratio r. The samples were classified into food guild groups and species.

As an illustration, we present the results for two food guild groups: non-selective and selective feeders. Both guilds ingest sediment and derive nutrients from the microorganisms living on the particles, but selective feeders often have a physical structure that enables them to select optimally sized particles for ingestion (e.g. a tentacle with a ciliated groove). Such a physical difference may give the two guilds different preferences for ambient sediment conditions. In our modelling, we assume that a species’ preference in sediment conditions is common over all survey sites. That is, there is no interaction between preference and spatial location.

4.1.2 Modelling

We develop a model to describe how the presence/absence of each food guild group responds to the seabed grain size. As abundance obviously differs among species, we fit a model to each species separately and then combine the species-level models within each food guild group.

Let \(Y_{1ks}\) be the presence/absence variable of the k-th species, taking the value 1 for presence or 0 for absence in the sample at the s-th site; \(Y_{1ks}\) is therefore drawn from the population of interest \(Y_{0ks}\), the presence/absence of the k-th species in the grab at the s-th site. The subscript 1 indicates that it is a sample and follows from Sect. 2. From Eq. (7) and using the complementary log-log link, \(\mu _{ks} = \hbox {E}[{Y_{1ks}}]\) can be modelled as

$$\begin{aligned} \log (-\log (1-\mu _{ks})) = \log (r_{s})+ \beta _0 + \beta _1 x_s, \end{aligned}$$

where \(x_s\) is the log-scaled seabed grain size. Here \(r_s\) is the sampling fraction given at site s.

The probability of presence of a group is calculated as one minus the product of the probabilities of absence of the group’s constituent species, which assumes that species occur independently. Given the two food guild groups, \({\mathcal {G}}_j, j = 1, 2\) say, the probability of presence of the j-th species-group is

$$\begin{aligned} \mu _{js} = 1 - \prod _{k \in {\mathcal {G}}_j} (1-\mu _{ks}), \end{aligned}$$
(11)

where \(\mu _{ks} = \hbox {E}[{Y_{1ks}}] = \Pr (Y_{1ks}=1)\) which is the probability of the k-th species presence.
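The calculation in Eq. (11) can be sketched as follows. The coefficients, grain-size value and sampling fraction are hypothetical and are not estimates from the Carnarvon Shelf data; the function names are likewise our own.

```python
import math


def species_presence(beta0, beta1, x, r):
    """Inverse of the cloglog model: mu = 1 - exp(-r * exp(beta0 + beta1 * x))."""
    return -math.expm1(-r * math.exp(beta0 + beta1 * x))


def group_presence(species_probs):
    """Eq. (11): 1 - prod_k (1 - mu_ks), assuming species occur independently."""
    prob_all_absent = 1.0
    for mu in species_probs:
        prob_all_absent *= 1.0 - mu
    return 1.0 - prob_all_absent


# Hypothetical coefficients (beta0, beta1) for three species in one guild.
coefficients = [(-1.0, -0.5), (-2.0, 0.1), (-0.5, -0.8)]
x_s, r_s = 0.3, 0.5          # hypothetical log grain size and sampling fraction

sampled = group_presence(
    [species_presence(b0, b1, x_s, r_s) for b0, b1 in coefficients])
population = group_presence(
    [species_presence(b0, b1, x_s, 1.0) for b0, b1 in coefficients])  # r_s = 1
```

Evaluating the fitted species models with \(r_s = 1\) gives the group presence probability for the whole grab, which is never smaller than the probability for the sub-sample.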

4.1.3 Result

Different inferences are obtained when the sub-sampling effect is taken into account or ignored (equivalent to assuming that \(r_s=1\)). Figure 2 illustrates the probability of presence of each food guild group (Eq. 11). The model that accounts for the sub-sampling effect (the left panel, Fig. 2) suggests that the probability of presence of the non-selective group species decreases as seabed sediments become coarser, but that the selective group species are little influenced by sediment size. In contrast, the model that ignores the effect (the right panel, Fig. 2) shows that both groups respond to the sediment size and that the probability of presence decreases as the sediment size increases. These contradictory results highlight the risk of misinterpretation when the sub-sampling effect is mis-specified in the model. To us, it seems more plausible that the selective feeders should not have much dependence on grain size as this is their evolutionary advantage.

4.2 Great Barrier Reef data

4.2.1 Data and sampling method

Data were collected by trawl nets from 437 sites during a survey of the Great Barrier Reef (GBR) lagoon (Fig. 1a) off the north eastern coast of Australia (Pitcher et al. 2007). The purpose of data collection was to map habitats, assemblages and species throughout the Great Barrier Reef Marine Park.

Fig. 2 Comparison of the probability of presence of each food guild group when the sub-sampling effect is taken into account (left) and when it is ignored (right). The solid and two-dashed lines represent the non-selective and selective species, respectively. The dashed lines represent the upper and lower bounds of the 95 % confidence interval

The biological samples were collected by a scientific trawl net towed behind a survey vessel. After each tow, the samples were processed entirely or sub-sampled for enumeration, weighing and identification. On the deck, the samples were sorted into rough phylogenetic groups (strata \({\mathcal {U}}_j, j=1, 2, \ldots , k\)) and then a sub-sample was taken from each stratum (group) if necessary. The proportion sub-sampled was recorded as the sub-sampling ratio \(r_j\) for the j-th stratum. On-board taxonomic stratification is a difficult task, and some mis-specification is inevitable: some species were sometimes observed in an unexpected stratum. The taxonomic sorting induced some heterogeneity (\(p_j \ne 1/k\)), and its mis-specification meant that \(p_j \ne 1\) for the j-th stratum, \({\mathcal {U}}_j\), given an organism belonging to the j-th group (stratum). This required extra consideration for \(p_j, j=1, 2, \ldots , k\), when modelling. We consider here four environment variables: depth, % carbonates, % gravel and % sand. Note that these percentages do not sum to 100 %.

4.2.2 Modelling

Let \(M_{1js}\) be the observed abundance of a species in the j-th stratum at the s-th site, which is a sample drawn from the population of interest \(M_{0js}\), the species abundance in the j-th stratum actually caught at the s-th site. From Eq. (6) with a log link, \(\mu _{js} = \hbox {E}[{M_{1js}}]\) can be modelled as

$$\begin{aligned} \log (\mu _{js}) = \log (r_{js})+ \log (p_{js}) + \eta _0 + \eta (\varvec{x}_s), \end{aligned}$$

where \(\varvec{x}_s\) is the vector of the environment variables (depth, % carbonates, % gravel and % sand), and \(\eta (\varvec{x}_s)\) is a non-linear smooth function. This model is a generalised additive model (GAM) (Wood 2006). Here \(r_{js}\) and \(p_{js}\) are the sampling fractions induced by the sub-sampling and by the taxonomic sorting, respectively. However, as noted before, \(p_{js}\) is not observed, so it needs to be estimated as a parameter. We therefore assume that the \(p_j\) are common over the sites and fit the model

$$\begin{aligned} \log (\mu _{js})= & {} \log (r_{js})+ \xi _{j} + \eta (\varvec{x}_s), \end{aligned}$$
(12)

where \(\xi _j = \log (p_{j}) + \eta _0\). Note that \(\xi _j\) combines the log probability of classification into the j-th stratum and the species intercept, so the two are estimated jointly as a single parameter.
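The reason \(p_j\) and \(\eta _0\) are merged into \(\xi _j\) can be illustrated numerically: different pairs \((p_j, \eta _0)\) with the same sum \(\log (p_j) + \eta _0\) give identical expected abundances. The sketch below uses a hypothetical helper `expected_abundance` and made-up values.

```python
import numpy as np

def expected_abundance(r, p, eta0, eta_x):
    """mu = r * p * exp(eta0 + eta_x), i.e. log(mu) = log(r) + log(p) + eta0 + eta_x."""
    return r * p * np.exp(eta0 + eta_x)

r, eta_x = 0.25, 0.7    # hypothetical sub-sampling ratio and smooth-term value

# Two different (p, eta0) pairs sharing the same xi = log(p) + eta0 = log(2)
mu_a = expected_abundance(r, p=0.5, eta0=np.log(4.0), eta_x=eta_x)
mu_b = expected_abundance(r, p=1.0, eta0=np.log(2.0), eta_x=eta_x)
print(mu_a, mu_b)
```

Because the two settings cannot be distinguished from the mean alone, only their combination \(\xi _j\) is estimated in Eq. (12), whereas the recorded \(r_{js}\) enters as a known offset.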

Fig. 3 The relationship between the abundance and each environment variable. Each panel shows the estimated linear predictor. The solid and two-dashed lines represent the model with and without the sub-sampling effect, respectively. The dashed lines represent the upper and lower bounds of the 95 % confidence interval

4.2.3 Result

The model is fitted to the abundance data of a squid species (Photololigo chinensis). Each panel in Fig. 3 represents the response of its abundance to one environment variable. The solid and two-dashed lines respectively represent the predictor, \(\eta (\varvec{x}_s)\), of the model with and without the sub-sampling effect taken into account. The sub-sampling effect appears as a constant shift in the linear predictor for all covariates except depth. This does not translate to a constant difference on the response scale, though. The model that ignores the sub-sampling effect underestimates the abundance. The confidence interval of the model that accounts for the effect also tends to be wider than that of the model ignoring it. These observations concur with the theoretical results in Sect. 5.
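The point that a constant shift on the link scale is not a constant difference on the response scale follows from the log link. In the short sketch below, the predictor values and the size of the shift are hypothetical.

```python
import numpy as np

eta = np.array([0.0, 1.0, 2.0])   # hypothetical linear-predictor values
shift = np.log(2.0)               # hypothetical constant shift on the log scale

mu_without = np.exp(eta)          # response when the shift is absent
mu_with = np.exp(eta + shift)     # response when the shift is present

difference = mu_with - mu_without   # grows with eta
ratio = mu_with / mu_without        # constant, equal to exp(shift)
print(difference, ratio)
```

Under a log link, the shift therefore acts multiplicatively: the ratio of the two fitted abundances is constant, while their absolute difference increases with the predicted abundance.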

The two types of species abundance prediction are illustrated in Fig. 4. The marginal predictions are more variable than the conditional predictions and deviate further from the observations. The conditional predictions are always greater than the observations that they are predicting. This is a consequence of the form of the conditional prediction, in which the predictive distribution is the observation with an adjustment term added on.
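To illustrate why a conditional prediction takes the form "observation plus adjustment", consider the simplest case, which is an assumption here and may differ from the model actually fitted: the catch is \(M_0 \sim \hbox {Poisson}(\lambda )\) and the sub-sample is \(M_1 \mid M_0 \sim \hbox {Binomial}(M_0, \alpha )\). The unobserved remainder \(M_0 - M_1\) is then Poisson with mean \((1-\alpha )\lambda \) and independent of \(M_1\), so \(\hbox {E}[M_0 \mid M_1] = M_1 + (1-\alpha )\lambda \). All values and names below are hypothetical.

```python
import numpy as np

def conditional_prediction(m1, alpha, lam):
    """E[M0 | M1 = m1] = m1 + (1 - alpha) * lam under Poisson-binomial thinning."""
    return m1 + (1.0 - alpha) * lam

# Simulation check with hypothetical values
rng = np.random.default_rng(0)
lam, alpha, m1_obs = 10.0, 0.5, 8
m0 = rng.poisson(lam, size=200_000)     # abundance actually caught
m1 = rng.binomial(m0, alpha)            # abundance in the sub-sample

simulated = m0[m1 == m1_obs].mean()     # average catch among sub-samples equal to m1_obs
print(simulated, conditional_prediction(m1_obs, alpha, lam), lam)
```

In this simplified setting the conditional prediction can never fall below the observation, whereas the marginal prediction for \(M_0\) (here \(\lambda \)) does not use the observation at all.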

Fig. 4 The marginal (left) and conditional (right) predictions against the observed abundance of Photololigo chinensis. When the predictions match the observations, all points lie on the solid line. Departure from this line signifies prediction inaccuracy

4.2.4 A technical matter

The stratified sampling used in the GBR data is a two-step sampling technique. However, biological data may be provided in a form in which all biological quantities are aggregated at each site, regardless of the stratification actually employed. For instance, if stratified sampling is applied, as in the GBR data, then the aggregate over the strata, \(\sum _j M_{1j}\), may provide insufficient information unless the sampling ratios \(r_j\) are common over the strata (\(r_j = r_{j'}, j \ne j'\)), which is exactly the case of simple random sampling. This can be seen from

$$\begin{aligned} \mu _s = \sum _j \hbox {E}[{M_{1js}}]= \sum _j \mu _{js} = e^{\eta _0 + \eta (\varvec{x}_s)}\sum _j r_{js}p_{js}. \end{aligned}$$

The aggregation makes it substantially harder to deal with the sampling effect properly, since the sampling effects of the individual strata have been aggregated and are no longer known. An alternative approach for this kind of already-aggregated data may be random-effects models that assume the sampling effects vary randomly over the sites.
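The aggregated mean can be sketched numerically. The example assumes, purely for illustration, that the sorting probabilities \(p_j\) sum to one over the strata; the helper `aggregated_mean` and all values are hypothetical.

```python
import numpy as np

def aggregated_mean(r, p, eta):
    """mu_s = exp(eta) * sum_j r_j * p_j for one site."""
    r, p = np.asarray(r, dtype=float), np.asarray(p, dtype=float)
    return np.exp(eta) * np.sum(r * p)

eta = 1.0                        # hypothetical eta_0 + eta(x_s)
p = np.array([0.6, 0.3, 0.1])    # hypothetical sorting probabilities, summing to 1

# Common ratio: the sum collapses to r * sum(p) = r, a single known offset
mu_common = aggregated_mean([0.5, 0.5, 0.5], p, eta)

# Unequal ratios: the effective fraction sum(r_j * p_j) depends on the unknown p_j
mu_unequal = aggregated_mean([1.0, 0.5, 0.1], p, eta)
print(mu_common, 0.5 * np.exp(eta), mu_unequal)
```

With a common ratio the aggregate behaves like simple random sampling with a known fraction. With unequal ratios the effective fraction is a weighted average that cannot be computed from the recorded \(r_j\) alone.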

5 Implications for survey design

Up until now, we have discussed a modelling framework that accounts for imperfect observations. Here, we view the framework from a slightly different perspective: designing surveys. Consider the Poisson model for imperfect observations (Eq. 5); the parameter of interest is \(\lambda \), the expected abundance of a species at a site, and the other notation remains the same as before. At the design phase, however, the sampling fraction \(\alpha \) is not yet specified and needs to be chosen. A common question here may be: “what is a sufficient number of sites to survey whilst retaining a predetermined precision for estimates of the model’s parameters?” The precision is a function of the number of observations, n, and the sampling fraction, \(\alpha \); the variance of the maximum likelihood estimator is given by the inverse of the Fisher information,

$$\begin{aligned} \hbox {Var}[{\hat{\lambda }\left( \varvec{M}_1, \varvec{\alpha }\right) }]= & {} -\mathrm{E}_{\varvec{M}_1, \varvec{\alpha }}\left[ \frac{\partial u\left( \lambda _0; \varvec{M}_1, \varvec{\alpha }\right) }{\partial \lambda } \right] ^{-1} \nonumber \\= & {} -n^{-1}\mathrm{E}_{M_1, \alpha }\left[ \frac{\partial u\left( \lambda _0; M_1, \alpha \right) }{\partial \lambda } \right] ^{-1}, \end{aligned}$$
(13)

for the observable abundance \(M_{1s}\) and the sampling fraction \(\alpha _s\) at a site \(s, s=1, 2, \ldots , n\). Note that the vector notation here is over the sites, so that \(\varvec{M}_1 = (M_{11}, M_{12}, \ldots , M_{1n})^\top \). The function \(u(\cdot )\) is the score function, the first derivative of the log-likelihood function, and \(\lambda _0\) is the true value of the parameter. The estimated variance can be calculated by substituting the maximum likelihood estimate \(\hat{\lambda }\) for \(\lambda _0\). Clearly, the variance decreases as the number of sites n increases (Eq. 13). However, the expectation term is also a function of the sampling fraction, \(\alpha \): the variance is still of order \(O(n^{-1})\), but it is inflated relative to full sampling unless \(\hbox {E}[{\alpha }]=1\) (note that \(0 < \alpha \le 1\)). This leads us to the idea that the accuracy of the estimator can be designed by balancing the number of observations n against the sampling fraction \(\alpha \).
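As a sketch of how \(\alpha \) enters the expectation in Eq. (13), suppose (an assumption here, since Eq. 5 is not restated in this section) that the model for a single site is \(M_1 \mid \alpha \sim \hbox {Poisson}(\alpha \lambda )\). The single-site log-likelihood, score and expected score derivative are then

$$\begin{aligned} \ell (\lambda ; M_1, \alpha )&= M_1 \log (\alpha \lambda ) - \alpha \lambda - \log (M_1!), \\ u(\lambda ; M_1, \alpha )&= \frac{M_1}{\lambda } - \alpha , \\ \frac{\partial u(\lambda ; M_1, \alpha )}{\partial \lambda }&= -\frac{M_1}{\lambda ^2}, \\ \mathrm{E}_{M_1, \alpha }\left[ \frac{\partial u\left( \lambda _0; M_1, \alpha \right) }{\partial \lambda } \right]&= -\frac{\hbox {E}[{\alpha }]\lambda _0}{\lambda _0^2} = -\frac{\hbox {E}[{\alpha }]}{\lambda _0}. \end{aligned}$$

Under this assumption, the information contributed by each site is proportional to \(\hbox {E}[{\alpha }]\), so smaller sampling fractions reduce the information that each site provides about \(\lambda \).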

From Eqs. (5) and (13), the variance of the maximum likelihood estimator \(\hat{\lambda }(\varvec{M}_1, \varvec{\alpha })\) is given as

$$\begin{aligned} \hbox {Var}[{\hat{\lambda }\left( \varvec{M}_1, \varvec{\alpha }\right) }] = \frac{\hat{\lambda }}{n\hbox {E}[{\alpha }]} \end{aligned}$$

where \(0 < \hbox {E}[{\alpha }] \le 1\). This suggests a wider standard error for the same value of \(\hat{\lambda }\) when a sub-sample is taken. If the catch is fully sampled, then \(\hbox {E}[{\alpha }]=1\) and \(\mathrm{Var} [\hat{\lambda }(\varvec{M}_1, \varvec{\alpha })]=\hat{\lambda }/n\). Let z be the number of survey sites required to obtain the same standard error, so that \(\hat{\lambda }/(z\hbox {E}[{\alpha }]) = \hat{\lambda }/n\); then

$$\begin{aligned} z=\frac{n}{\hbox {E}[{\alpha }]}. \end{aligned}$$

This means that, for example, if a survey is designed with \(\hbox {E}[{\alpha }]=0.5\), then it requires twice as many observations, 2n, to keep the same error range as when no sub-sampling is done. This gives some concrete guidance on whether to sample fewer sites in great detail (\(\alpha \) near 1) or more sites in less detail (small \(\alpha \)). The actual choice will depend on the relative cost of surveying more sites versus taking a less imperfect sample at each site. Consideration should also be given to the range of covariates that competing sampling strategies will cover. If sampling locations are to be stratified on a covariate, then more sites are likely to aid the estimation of that covariate’s effect.
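The trade-off can be checked with a small simulation. The sketch below assumes \(M_{1s} \sim \hbox {Poisson}(\alpha _s \lambda )\) with known \(\alpha _s\), for which the maximum likelihood estimate is \(\sum _s M_{1s} / \sum _s \alpha _s\). The function names, the distribution of \(\alpha _s\) and all numerical settings are hypothetical.

```python
import numpy as np

def required_sites(n_full, mean_alpha):
    """z = n / E[alpha], rounded up to a whole number of sites."""
    return int(np.ceil(n_full / mean_alpha))

def mle_lambda(m1, alpha):
    """MLE of lambda when M1_s ~ Poisson(alpha_s * lambda): sum(M1) / sum(alpha)."""
    return m1.sum(axis=-1) / alpha.sum()

rng = np.random.default_rng(1)
lam, n, reps = 5.0, 200, 5000                  # hypothetical settings
alpha = rng.uniform(0.2, 0.8, size=n)          # sampling fractions, mean about 0.5

m1 = rng.poisson(lam * alpha, size=(reps, n))  # replicate surveys, one per row
estimates = mle_lambda(m1, alpha)

empirical_var = estimates.var()
approx_var = lam / (n * alpha.mean())          # lambda / (n E[alpha]) with the true lambda
print(empirical_var, approx_var, required_sites(100, 0.5))
```

In this setting the empirical variance of the estimates should be close to \(\lambda /(n\hbox {E}[{\alpha }])\), and `required_sites(100, 0.5)` returns 200, matching the doubling described above.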

6 Summary

We have discussed how the imperfect observation effect due to sampling should be treated in ecological modelling, and presented how a general framework, the compound distribution, can accommodate individual detectability. The model is general and can handle many different types of sampling, including the two examples used in this paper: the commonly used simple random sampling and stratified random sampling. Implementing the sampling effect is straightforward: it enters the regression-type model as an offset term, given an appropriate link function. Other types of sampling mechanisms, such as cluster sampling, will require slightly more complex models that allow for between-individual correlation.

Our examples are typical of a sampling technique called ‘sub-sampling’, which is widely used in marine surveys. This is an entirely anthropogenic effect, induced during the survey process, that should be taken into account in modelling. To the authors’ knowledge, anthropogenic sub-sampling is under-studied; only two articles can be found in the literature (Heales et al. 2000, 2003). Another sampling effect in marine surveys is catchability (also called detectability), which describes how likely individuals are to be caught given the sampling gear employed. We have not discussed this as our data consist of a single observation at each site, so the probability of catching an individual, given presence, is completely confounded with the probability of presence. If a site were visited multiple times, then this could be incorporated into the compound distribution framework, and the catchability could be estimated. Commonly, this has been done using species occupancy models (MacKenzie et al. 2002). These types of models are an approximation to our framework; see Eq. (9).

Fisher (1934) clearly emphasises the importance of understanding the data collection procedure employed as a statistical commonplace. Accordingly, Rao (1965) generalises Fisher’s idea and proposes a general modelling framework that is able to accommodate a wide class of sampling mechanisms, such as non-observability of events by dealing with individual detectability. The compound model presented (Sect. 2 and Eq. 3) exhibits strong similarities with one of the models described in Rao (1965) and also in Patil and Rao (1978). Ecological studies will always have a limited number of observations from the population of interest, and so the statistical challenge has historically been centred around how to make effective inferences dealing properly with the sampling effect. This challenge will remain into the future. The compound distribution seems to have received little attention in ecological modelling and we show that it naturally underpins an effective modelling framework to account for imperfect observations.