
Capture–recapture estimation based upon the geometric distribution allowing for heterogeneity


Abstract

Capture–recapture methods aim to estimate the size of an elusive target population. Each member of the target population carries a count of identifications by some identifying mechanism, namely the number of times it has been identified during the observational period. Only positive counts are observed and inference needs to be based on the observed count distribution. A widely used assumption for the count distribution is a Poisson mixture. If the mixing distribution can be described by an exponential density, the geometric distribution arises as the marginal. This note discusses population size estimation on the basis of the zero-truncated geometric (which is itself again a geometric). In addition, population heterogeneity is considered for the geometric. Chao’s estimator is developed for the mixture of geometric distributions and provides a lower bound estimator which is valid under arbitrary mixing on the parameter of the geometric. However, Chao’s estimator is also known for its relatively large variance (compared to the maximum likelihood estimator). Another estimator based on a censored geometric likelihood is suggested which uses the entire sample information but is less affected by model misspecification. Simulation studies illustrate that the proposed censored estimator represents a good compromise between the maximum likelihood estimator and Chao’s estimator, balancing efficiency and bias.


References

  • Böhning D (2008) A simple variance formula for population size estimators by conditioning. Stat Methodol 5:410–423
  • Borchers DL, Buckland ST, Zucchini W (2004) Estimating animal abundance. Closed populations. Springer, London
  • Bunge J, Fitzpatrick M (1993) Estimating the number of species: a review. J Am Stat Assoc 88:364–373
  • Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–791
  • Chao A (1989) Estimating population size for sparse data in capture-recapture experiments. Biometrics 45:427–438
  • Chao A, Tsay PK, Lin SH, Shau WY, Chao DY (2001) Tutorial in biostatistics: the applications of capture-recapture models to epidemiological data. Stat Med 20:3123–3157
  • Dorazio RM, Royle JA (2003) Mixture models for estimating the size of a closed population when capture rates vary among individuals. Biometrics 59:351–364
  • Hay G, Smit F (2003) Estimating the number of drug injectors from needle exchange data. Addict Res Theory 11:235–243
  • Holzmann H, Munk A, Zucchini W (2006) On identifiability in capture-recapture models. Biometrics 62:934–939
  • Link WA (2003) Nonidentifiability of population size from capture-recapture data with heterogeneous detection probabilities. Biometrics 59:1123–1130
  • Link WA (2006) Response to a paper by Holzmann, Munk and Zucchini. Biometrics 62:936–939
  • Mao CX (2007a) Estimating population sizes for capture-recapture sampling with binomial mixtures. Comput Stat Data Anal 51:5211–5219
  • Mao CX (2007b) Estimating the number of classes. Ann Stat 35:917–930
  • Mao CX (2008a) On the nonidentifiability of population sizes. Biometrics 64:977–981
  • Mao CX (2008b) Lower bounds to the population size when capture probabilities vary over individuals. Aust N Z J Stat 50:125–134
  • Oosterlee A, Vink RM, Smit F (2009) Prevalence of family violence in adults and children: estimates using the capture-recapture method. Eur J Public Health 19:586–591
  • Palusci VJ, Wirtz SJ, Covington TM (2010) Using capture-recapture methods to better ascertain the incidence of fatal child maltreatment. Child Abuse Neglect 34:396–402
  • Pledger SA (2005) The performance of mixture models in heterogeneous closed population capture-recapture. Biometrics 61:868–876
  • Roberts JM, Brewer DD (2006) Estimating the prevalence of male clients of prostitute women in Vancouver with a simple capture-recapture method. J R Stat Soc Ser A 169:745–756
  • Van der Heijden PGM, Cruyff M, van Houwelingen HC (2003) Estimating the size of a criminal population from police records using the truncated Poisson regression model. Stat Neerlandica 57:1–16
  • Van Hest NAH, De Vries G, Smit F, Grant AD, Richardus JH (2008) Estimating the coverage of tuberculosis screening among drug users and homeless persons with truncated models. Epidemiol Infect 136:14–22
  • Wang J-P, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc 100:942–959
  • Wang J-P, Lindsay BG (2008) An exponential partial prior for improving nonparametric maximum likelihood estimation in mixture models. Stat Methodol 5:30–45
  • Wilson RM, Collins MF (1992) Capture-recapture estimation with samples of size one using frequency data. Biometrika 79:543–553


Acknowledgments

The authors would like to thank the Editor and two anonymous referees for their very helpful comments, which considerably improved the paper. We would also like to thank Jeerapa Sappakitkamjorn (Department of Applied Statistics, King Mongkut’s University of Technology North Bangkok) for her great support in finalizing the simulation study.

Author information


Corresponding author

Correspondence to Dankmar Böhning.

Additional information

The idea for this paper was developed while the second author was visiting the Department of Applied Statistics at King Mongkut’s University of Technology North Bangkok in the summers of 2009 and 2010; the second author would like to thank the department for the support received.

The paper was written while the first author was visiting the Department of Mathematics and Statistics at the University of Reading in the spring of 2011; the first author would like to thank the department for the support received.

Appendices

Appendix 1: Proofs of the theorems

Theorem 1

Let \(k_y(p)=(1-p)^yp\) for \(y=0,1,\cdots \) and \(p \in (0,1)\).

(a) Let \(\log L(p) = f_1\log (\pi _1)+f_2\log (\pi _2)\) with \(\pi _1 = 1/(2-p)\) and \(\pi _2 =(1-p)/(2-p)\) being the geometric probabilities truncated to counts of ones and twos. Then \(\log L(p)\) is maximized for \(\hat{p} = (f_1-f_2)/f_1.\)

(b) \(E(f_0|f_1,f_2;\hat{p})=f_1^2/f_2,\) for \(\hat{p}= (f_1-f_2)/f_1\).

Proof

For the first part, it is clear that \( f_1\log (\pi _1)+f_2\log (\pi _2)\) is maximal for \(\hat{\pi }_1 =f_1/(f_1+f_2)=1/(2-\hat{p})\), which is attained for \(\hat{p} = (f_1-f_2)/f_1\). For the second part, we see that with \(e_y=E(f_y|f_1,f_2;p)= k_y(p) N\) we have the following:

$$\begin{aligned} e_y = k_y(p) N=k_y(p)\left(e_0+f_1+f_2+\sum _{j=3}^\infty e_j\right) \end{aligned}$$

so that

$$\begin{aligned} e_0+e_3^+=[1-k_1(p)-k_2(p)] (e_0+e_3^+) + [1-k_1(p)-k_2(p)] (f_1+f_2) \end{aligned}$$

with \(e_3^+= \sum _{j=3}^{\infty } e_j\). Hence

$$\begin{aligned} e_0+e_3^+=\frac{1-k_1(p)-k_2(p)}{k_1(p)+k_2(p)}(f_1+f_2) \end{aligned}$$

and

$$\begin{aligned} e_0=k_0(p)(f_1+f_2+ e_0+e_3^+)&= k_0(p)(f_1+f_2)\left[1+\frac{1-k_1(p)-k_2(p)}{k_1(p)+k_2(p)}\right]\\&= \frac{k_0(p)}{k_1(p)+k_2(p)}(f_1+f_2)=\frac{f_1+f_2}{(1-p)(2-p)}. \end{aligned}$$

Plugging in the maximum likelihood estimate \(\hat{p}=(f_1-f_2)/f_1\) for \(p\) yields

$$\begin{aligned} \frac{f_1+f_2}{(1-\hat{p})(2-\hat{p})}=\frac{f_1+f_2}{\frac{f_2}{f_1}\frac{f_1+f_2}{f_1}}=f_1^2/f_2, \end{aligned}$$

the desired result. \(\square \)
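To make Theorem 1 concrete, the following minimal Python sketch (an illustration of ours, not part of the original paper; the frequencies f1 and f2 are hypothetical) checks numerically that the closed-form maximizer and the implied estimate of \(f_0\) behave as stated.

import numpy as np

# Hypothetical frequencies of counts "one" and "two"
f1, f2 = 120.0, 45.0

# Closed-form maximizer from Theorem 1(a)
p_hat = (f1 - f2) / f1

# Brute-force check: evaluate the truncated log-likelihood over a fine grid
grid = np.linspace(0.001, 0.999, 9999)
logL = f1 * np.log(1.0 / (2.0 - grid)) + f2 * np.log((1.0 - grid) / (2.0 - grid))
print(p_hat, grid[np.argmax(logL)])          # both approximately 0.625

# Theorem 1(b): E(f0 | f1, f2; p_hat) = (f1 + f2)/((1 - p_hat)(2 - p_hat)) = f1^2/f2
e0 = (f1 + f2) / ((1.0 - p_hat) * (2.0 - p_hat))
print(e0, f1**2 / f2)                        # both equal 320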

Theorem 2

Let \(k_y(p)=(1-p)^yp\) for \(y=0,1,\cdots \) and \(p \in (0,1)\). Then,

$$\begin{aligned} \lim _{N\rightarrow \infty }\frac{E(\hat{N})}{N} =1 \end{aligned}$$

for \(\hat{N}=\hat{N}_{ML}, \hat{N}_{C}\), or \(\hat{N}_{Cen}\).

Proof

Let \(\hat{N}=\hat{N}_{ML}=n/(1-n/S)\). Note that \(E(n)=N(1-p)\) and \(E(S/N)=(1-p)/p\) so that

$$\begin{aligned} \frac{E(n/(1-n/S))}{N} \mathop {\rightarrow }\limits _{N\rightarrow \infty } \frac{1-p}{1-\frac{1-p}{(1-p)/p}}=\frac{1-p}{1-p}=1. \end{aligned}$$

Let \(\hat{N}=\hat{N}_{C}=n + f_1^2/f_2\). Note that \(E(f_1)=Np(1-p)\) and \(E(f_2)=Np(1-p)^2\) so that

$$\begin{aligned} \frac{E(n+f_1^2/f_2)}{N} \mathop {\rightarrow }\limits _{N\rightarrow \infty } (1-p) + \frac{p^2(1-p)^2}{p(1-p)^2}=1. \end{aligned}$$

Finally, let \(\hat{N}=\hat{N}_{Cen}=\frac{n}{1-f_1/n}\). Using the above we have

$$\begin{aligned} \frac{E\left(\frac{n}{1-f_1/n}\right)}{N} \mathop {\rightarrow }\limits _{N\rightarrow \infty } \frac{1-p}{1-\frac{(1-p)p}{(1-p)}}=1, \end{aligned}$$

which ends the proof. \(\square \)
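As an informal check of Theorem 2, the following simulation sketch (ours, not from the paper; the values of \(N\), \(p\) and the number of replications are arbitrary) generates zero-truncated geometric samples and confirms that all three estimators are close to unbiased for large \(N\).

import numpy as np

rng = np.random.default_rng(1)
N, p, reps = 100_000, 0.4, 200
ratios = np.zeros((reps, 3))
for r in range(reps):
    y = rng.geometric(p, size=N) - 1          # counts 0,1,2,... with pmf (1-p)^y * p
    obs = y[y > 0]                            # only positive counts are observed
    n, S = obs.size, obs.sum()
    f1, f2 = np.sum(obs == 1), np.sum(obs == 2)
    N_ml = n / (1 - n / S)                    # maximum likelihood estimator
    N_c = n + f1**2 / f2                      # Chao-type estimator
    N_cen = n / (1 - f1 / n)                  # censored estimator
    ratios[r] = [N_ml / N, N_c / N, N_cen / N]
print(ratios.mean(axis=0))                    # all three mean ratios are close to 1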

Appendix 2: Standard errors

Let \(\hat{N}\) be the estimator of the population size \(N\) of interest, the latter being a fixed but unknown quantity. Also, let the random quantity \(n\) be the observed number of units. We will make use of the result

$$\begin{aligned} \text{ Var}(\hat{N}) = E_n \{\text{ Var}(\hat{N}|n)\} + \text{ Var}_n\{E(\hat{N}|n)\}, \end{aligned}$$
(9)

where \(\hat{N}|n\) refers to the distribution of \(\hat{N}\) conditional upon \(n\) and \(E_n(.)\) and \(\text{ Var}_n(.)\) refer to the first and second (central) moment w.r.t. the distribution of \(n\). For more details see Böhning (2008).
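The decomposition (9) is the usual conditioning (law of total variance) identity. The following small Monte Carlo sketch (a stylized illustration of ours, not one of the estimators of the paper) shows the two sides agreeing for a simple two-stage construction in which \(n\) is binomial and \(\hat{N}\) given \(n\) has mean \(n/(1-p)\) and variance \(n\).

import numpy as np

rng = np.random.default_rng(0)
N, p, reps = 2000, 0.4, 50_000
n = rng.binomial(N, 1 - p, size=reps)                         # observed units
N_hat = n / (1 - p) + rng.normal(0.0, np.sqrt(n), size=reps)  # stylized estimator given n
lhs = N_hat.var()                                             # Var(N_hat)
rhs = n.mean() + n.var() / (1 - p)**2                         # E{Var(N_hat|n)} + Var{E(N_hat|n)}
print(lhs, rhs)                                               # agree up to simulation error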

1.1 Maximum likelihood estimator

We consider the maximum likelihood estimator \(\hat{p}_{ML}=n/S\) and the associated population size estimator \(\hat{N}=\hat{N}_{ML}=n/(1-n/S)\). We start with the second term in (9): since \(E(\hat{N}|n)\approx n/(1-p)\), we have

$$\begin{aligned} \text{ Var}_n [n/(1-p)] = \frac{1}{(1-p)^2}Np(1-p). \end{aligned}$$

Note that \(N(1-p)\) can be estimated by \(n\) and \(p\) by the maximum likelihood estimator \(n/S\), so that the variance estimator \(\frac{Sn^2}{(S-n)^2}\) arises.

For the first term in (9), we use the \(\delta \)-method to determine \(\text{ Var}(\hat{N}|n)\) as

$$\begin{aligned} \frac{n^2}{(1-p)^4} \text{ Var}_n(\hat{p}_{ML}) \end{aligned}$$

and, using the Fisher information for \(p\), we can determine \( \text{ Var}_n(\hat{p}_{ML})\) as

$$\begin{aligned} \text{ Var}_n(\hat{p}_{ML}) \approx \frac{n(S-n)}{S^3}. \end{aligned}$$

The expected value \(E_n \{\text{ Var}(\hat{N}|n)\} \) is then replaced by its moment estimate \(\text{ Var}(\hat{N}|n)\) to obtain the total variance

$$\begin{aligned} \frac{Sn^2}{(S-n)^2}+\frac{n^2}{(1-n/S)^4}\frac{n(S-n)}{S^3} = \frac{S^2n^2}{(S-n)^3} \end{aligned}$$
(10)
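As a small numerical illustration (hypothetical values of \(n\) and \(S\), not taken from the paper), the estimate and its standard error from (10) can be computed as follows.

n, S = 900, 2400                      # hypothetical: 900 observed units, 2400 identifications in total
N_ml = n / (1 - n / S)                # = n*S/(S - n) = 1440
var_ml = S**2 * n**2 / (S - n)**3     # variance estimator (10)
print(N_ml, var_ml**0.5)              # point estimate and standard error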

1.2 Censored estimator

We consider the censored estimator \(\hat{p}_{Cen}=f_1/n\) and the associated population size estimator \(\hat{N}=\hat{N}_{Cen}=n/(1-f_1/n)\). Since \(E(\hat{N}|n)\approx n/(1-p)\), we have, as before, \(\text{ Var}_n [n/(1-p)] = \frac{1}{(1-p)^2}Np(1-p)\), which can be estimated as \(\frac{f_1}{(1-f_1/n)^2}\) by replacing \(N(1-p)\) by \(n\) and \(p\) by \(f_1/n\).

For the first term in (9), \(\text{ Var}(\hat{N}|n)\), using the \(\delta \)-method once more, we obtain the approximation

$$\begin{aligned} \text{ Var}\left(\frac{n}{1-f_1/n}|n\right)\approx \frac{n^2}{(1-f_1/n)^4} \text{ Var}\left(\frac{f_1}{n}|n\right), \end{aligned}$$

from which the variance estimator \(\frac{f_1(1-f_1/n)}{(1-f_1/n)^4}=\frac{f_1}{(1-f_1/n)^3}\) arises. In total, taking both variance terms into account, we obtain the variance estimator

$$\begin{aligned} \frac{f_1}{(1-f_1/n)^2}+\frac{f_1}{(1-f_1/n)^3} = \frac{f_1}{(1-f_1/n)^2}\frac{2n-f_1}{n-f_1} \end{aligned}$$
(11)
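Analogously, a brief sketch (again with hypothetical frequencies) evaluating the censored estimate and the variance estimator (11):

n, f1 = 900, 350                                          # hypothetical: 350 of 900 observed units seen exactly once
N_cen = n / (1 - f1 / n)                                  # censored population size estimate
var_cen = f1 / (1 - f1 / n)**2 * (2*n - f1) / (n - f1)    # variance estimator (11)
print(N_cen, var_cen**0.5)                                # point estimate and standard error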

1.3 Chao’s estimator

Finally, we consider the Chao-type estimator \(\hat{N}= \hat{N}_{C}=n+f_1^2/f_2\). Note that it differs from the original Chao estimator \(n+f_1^2/(2f_2)\), for which a variance estimator is provided in Chao (1987). If we were only interested in a variance estimator of \(\hat{f}_0\), we could simply multiply the Chao variance estimator by a factor of 4. However, interest is usually in the population size estimator \(\hat{N}\), for which this simple adjustment is not valid. Hence we provide a full analysis in the following, again using the conditioning technique (9).

We have \(E(\hat{N}|n)= E(n+\frac{f_1^2}{f_2}|n)\approx n + (g_1^+)^2 n/g_2^+=n(1+(g_1^+)^2/g_2^+)\). Recall that \(g_y^+=g_y/(1-g_0)\) for \(y=1,2,\ldots \); note that \(E(\hat{N}|n)\) refers to the conditional count distribution \(g_1^+, g_2^+, \ldots \), which is estimated by \(f_1/n, f_2/n, \ldots \). Hence \(\text{ Var}_n \{n(1+(g_1^+)^2/g_2^+)\}= (1+(g_1^+)^2/g_2^+)^2 Ng_0(1-g_0)\), which can be estimated as follows. \(Ng_0\) can be estimated by \(\hat{f}_0 =f_1^2/f_2\) and \((1-g_0)\) by \(1-f_1^2/(\hat{N} f_2)=\frac{f_2n}{f_2n+f_1^2}\), so that in total the estimate \((1+\frac{f_1^2}{f_2 n})^2 \frac{f_1^2n}{f_2n+f_1^2}\) arises, which we can simplify as

$$\begin{aligned} \left(1+\frac{f_1^2}{f_2n}\right)^2 \frac{f_1^2n}{f_2n+f_1^2}= f_1^2/f_2+f_1^4/(f_2^2n). \end{aligned}$$
(12)

For the first term in (9), \(\text{ Var}(\hat{N}|n)\), using the bivariate \(\delta \)-method, we obtain the approximation

$$\begin{aligned} \text{ Var}(\hat{N}|n) \approx \nabla \phi _0(f_1,f_2)^T cov(f_1,f_2) \nabla \phi _0(f_1,f_2) \end{aligned}$$

where \(\phi _0(f_1,f_2)=f_1^2/f_2\) and \(\nabla \phi _0(f_1,f_2)\) is the two-vector of partial derivatives with respect to \(f_1\) and \(f_2\):

$$\begin{aligned} \nabla \phi _0(f_1,f_2)^T=(2f_1/f_2,-f_1^2/f_2^2). \end{aligned}$$

The covariance matrix, conditional on \(n\), is \( cov(f_1,f_2)=n (\text{ diag}({{\mathbf g}^{+}}) -{{\mathbf g}^{+}}{{\mathbf g}^{+T}}) \), where \({\mathbf g}^{+}\) is the two-vector of probabilities, conditional on \(n\), of observing a one or a two, respectively, and \(\text{ diag}({{\mathbf g}^+})\) is the \(2\times 2\) diagonal matrix with \(g_1^+\) and \(g_2^+\) on the main diagonal. This matrix is estimated by

$$\begin{aligned} \left(\begin{array}{ll} f_1-f_1^2/n&-f_1f_2/n\\ -f_1f_2/n&f_2-f_2^2/n \end{array}\right). \end{aligned}$$

Hence we obtain

$$\begin{aligned} \nabla \phi _0(f_1,f_2)^T \widehat{cov}(f_1,f_2) \nabla \phi _0(f_1,f_2)=\frac{4f_1^3}{f_2^2}+\frac{f_1^4}{f_2^3} -\frac{f_1^4}{f_2^2n}. \end{aligned}$$
(13)

Finally, combining (12) and (13), we obtain the variance estimator for \(\hat{N} =\hat{N}_C=n+f_1^2/f_2\) as

$$\begin{aligned} \frac{f_1^4}{f_2^3}+\frac{4f_1^3}{f_2^2}+\frac{f_1^2}{f_2}, \end{aligned}$$
(14)

which is of remarkably simple form.
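For completeness, a sketch (hypothetical frequencies, consistent with the examples above) evaluating the Chao-type estimate and the variance estimator (14):

n, f1, f2 = 900, 350, 190                                  # hypothetical frequencies of ones and twos
N_c = n + f1**2 / f2                                       # Chao-type population size estimate
var_c = f1**4 / f2**3 + 4 * f1**3 / f2**2 + f1**2 / f2     # variance estimator (14)
print(N_c, var_c**0.5)                                     # point estimate and standard error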


Cite this article

Niwitpong, Sa., Böhning, D., van der Heijden, P.G.M. et al. Capture–recapture estimation based upon the geometric distribution allowing for heterogeneity. Metrika 76, 495–519 (2013). https://doi.org/10.1007/s00184-012-0401-0
