Capture–recapture estimation based upon the geometric distribution allowing for heterogeneity

Niwitpong, Sa-aat; Böhning, Dankmar; van der Heijden, Peter G. M.; Holling, Heinz

doi:10.1007/s00184-012-0401-0

Published: 27 July 2012

Metrika, Volume 76, pages 495–519 (2013)


Abstract

Capture–recapture methods aim to estimate the size of an elusive target population. Each member of the target population carries a count of identifications by some identifying mechanism, namely the number of times it has been identified during the observational period. Only positive counts are observed, and inference needs to be based on the observed count distribution. A widely used assumption for the count distribution is a Poisson mixture. If the mixing distribution can be described by an exponential density, the geometric distribution arises as the marginal. This note discusses population size estimation on the basis of the zero-truncated geometric (which is itself again a geometric). In addition, population heterogeneity is considered for the geometric. Chao’s estimator is developed for the mixture of geometric distributions and provides a lower bound estimator which is valid under arbitrary mixing on the parameter of the geometric. However, Chao’s estimator is also known for its relatively large variance (compared to the maximum likelihood estimator). Another estimator based on a censored geometric likelihood is suggested, which uses the entire sample information but is less affected by model misspecification. Simulation studies illustrate that the proposed censored estimator offers a good compromise between the maximum likelihood estimator and Chao’s estimator, that is, between efficiency and bias.
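The statement that an exponential mixing density turns the Poisson into a geometric can be checked numerically. The following is a minimal sketch, not taken from the paper: it assumes the exponential density is parametrized by a rate `theta` (a hypothetical name), in which case the marginal is $p(1-p)^y$ with $p=\theta/(1+\theta)$, and it approximates the mixing integral with a simple midpoint rule.

```python
import math


def poisson_exponential_marginal(y, theta, upper=50.0, steps=50_000):
    """Midpoint-rule approximation of the Poisson pmf at y, integrated
    against an exponential density with rate theta over [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h  # midpoint of the i-th subinterval
        poisson_pmf = math.exp(-lam) * lam**y / math.factorial(y)
        total += poisson_pmf * theta * math.exp(-theta * lam) * h
    return total


def geometric_pmf(y, p):
    """Geometric probability p * (1 - p)**y for y = 0, 1, ..."""
    return p * (1.0 - p) ** y


theta = 1.5                 # assumed rate of the exponential mixing density
p = theta / (1.0 + theta)   # implied geometric parameter
for y in range(5):
    print(y, poisson_exponential_marginal(y, theta), geometric_pmf(y, p))
```

The two printed columns should agree up to the integration error, which is small for this step size.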

References

Böhning D (2008) A simple variance formula for population size estimators by conditioning. Stat Methodol 5:410–423
Borchers DL, Buckland ST, Zucchini W (2004) Estimating animal abundance. Closed populations. Springer, London
Bunge J, Fitzpatrick M (1993) Estimating the number of species: a review. J Am Stat Assoc 88:364–373
Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–791
Chao A (1989) Estimating population size for sparse data in capture-recapture experiments. Biometrics 45:427–438
Chao A, Tsay PK, Lin SH, Shau WY, Chao DY (2001) Tutorial in biostatistics: The applications of capture-recapture models to epidemiological data. Stat Med 20:3123–3157
Dorazio RM, Royle JA (2005) Mixture models for estimating the size of a closed population when capture rates vary among individuals. Biometrics 59:351–364
Hay G, Smit F (2003) Estimating the number of drug injectors from needle exchange data. Addict Res Theory 11:235–243
Holzmann H, Munk A, Zucchini W (2006) On identifiability in capture-recapture models. Biometrics 62:934–939
Link WA (2003) Nonidentifiability of population size from capture-recapture data with heterogeneous detection probabilities. Biometrics 59:1123–1130
Link WA (2006) Response to a paper by Holzmann, Munk and Zucchini. Biometrics 62:936–939
Mao CX (2007a) Estimating population sizes for capture-recapture sampling with binomial mixtures. Comput Stat Data Anal 51:5211–5219
Mao CX (2007b) Estimating the number of classes. Ann Stat 35:917–930
Mao CX (2008a) On the nonidentifiability of population sizes. Biometrics 64:977–981
Mao CX (2008b) Lower bounds to the population size when capture probabilities vary over individuals. Aust N Z J Stat 50:125–134
Oosterlee A, Vink RM, Smit F (2009) Prevalence of family violence in adults and children: estimates using the capture-recapture method. Eur J Public Health 19:586–591
Paluscia VJ, Wirtz SJ, Covington TM (2010) Using capture-recapture methods to better ascertain the incidence of fatal child maltreatment. Child Abuse Neglect 34:396–402
Pledger SA (2005) The performance of mixture models in heterogeneous closed population capture-recapture. Biometrics 61:868–876
Roberts JM, Brewer DD (2006) Estimating the prevalence of male clients of prostitute women in Vancouver with a simple capture-recapture method. J R Stat Soc Ser A 169:745–756
Van der Heijden PGM, Cruyff M, van Houwelingen HC (2003) Estimating the size of a criminal population from police records using the truncated poisson regression model. Stat Neerlandica 57:1–16
Van Hest NAH, De Vries G, Smit F, Grant AD, Richardus JH (2008) Estimating the coverage of Tuberculosis screening among drug users and homeless persons with truncated models. Epidemiol Infect 136:14–22
Wang J-P, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc 100:942–959
Wang J-P, Lindsay BG (2008) An exponential partial prior for improving nonparametric maximum likelihood estimation in mixture models. Stat Methodol 5:30–45
Wilson RM, Collins MF (1992) Capture-recapture estimation with samples of size one using frequency data. Biometrika 79:543–553

Acknowledgments

The authors would like to thank the Editor and two anonymous referees for their very helpful comments which considerably improved the paper. We also would like to thank Jeerapa Sappakitkamjorn (Department of Applied Statistics, King Mongkut’s University of Technology North–Bangkok) for her great support in finalizing the simulation study.

Author information

Authors and Affiliations

Department of Applied Statistics, Faculty of Applied Science, King Mongkut’s University of Technology, North-Bangkok, Thailand
Sa-aat Niwitpong
Southampton Statistical Sciences Research Institute and School of Mathematics, University of Southampton, Southampton, UK
Dankmar Böhning
Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, The Netherlands
Peter G. M. van der Heijden
Statistics and Quantitative Methods, Faculty of Psychology and Sports Science, University of Münster, Münster, Germany
Heinz Holling

Corresponding author

Correspondence to Dankmar Böhning.

Additional information

The idea for this paper was developed while the second author was visiting the Department of Applied Statistics at King Mongkut’s University of Technology North–Bangkok in the summers of 2009 and 2010; the second author would like to thank the department for all support received.

The paper was written while the first author was visiting the Department of Mathematics and Statistics at the University of Reading in the spring of 2011; the first author would like to thank the department for all support received.

Appendices

Appendix 1: Proof of theorems

Theorem 1

Let $k_y(p)=(1-p)^yp$ for $y=0,1,\cdots $ and $p \in (0,1)$.

(a)
Let $\log L(p) = f_1\log (\pi _1)+f_2\log (\pi _2)$ with $\pi _1 = 1/(2-p)$ and $\pi _2 =(1-p)/(2-p)$ being the geometric probabilities truncated to counts of ones and twos. Then $\log L(p)$ is maximized for $\hat{p} = (f_1-f_2)/f_1.$
(b)
$E(f_0|f_1,f_2;\hat{p})=f_1^2/f_2,$ for $\hat{p}= (f_1-f_2)/f_1$.

Proof

For the first part, it is clear that $ f_1\log (\pi _1)+f_2\log (\pi _2)$ is maximal for $\hat{\pi }_1 =f_1/(f_1+f_2)=1/(2-\hat{p})$, which is attained for $\hat{p} = (f_1-f_2)/f_1$. For the second part, we see that with $e_y=E(f_y|f_1,f_2;p)= k_y(p) N$ we have the following:

$$\begin{aligned} e_y = k_y(p) N=k_y(p)\left(e_0+f_1+f_2+\sum _{j=3}^\infty e_j\right) \end{aligned}$$

so that

$$\begin{aligned} e_0+e_3^+=[1-k_1(p)-k_2(p)] (e_0+e_3^+) + [1-k_1(p)-k_2(p)] (f_1+f_2) \end{aligned}$$

with $e_3^+= \sum _{j=3}^{\infty } e_j$. Hence

$$\begin{aligned} e_0+e_3^+=\frac{1-k_1(p)-k_2(p)}{k_1(p)+k_2(p)}(f_1+f_2) \end{aligned}$$

and

$$\begin{aligned} e_0=k_0(p)(f_1+f_2+ e_0+e_3^+)&= k_0(p)(f_1+f_2)\left[1+\frac{1-k_1(p)-k_2(p)}{k_1(p)+k_2(p)}\right]\\&= \frac{k_0(p)}{k_1(p)+k_2(p)}(f_1+f_2)=\frac{f_1+f_2}{(1-p)(2-p)}. \end{aligned}$$

Plugging in the maximum likelihood estimate $\hat{p}=(f_1-f_2)/f_1$ for $p$ yields

$$\begin{aligned} \frac{f_1+f_2}{(1-\hat{p})(2-\hat{p})}=\frac{f_1+f_2}{\frac{f_2}{f_1}\frac{f_1+f_2}{f_1}}=f_1^2/f_2, \end{aligned}$$

the desired result. $\square $
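The two parts of Theorem 1 can be verified in exact arithmetic. The sketch below is illustrative only; the function name `chao_type_f0` and the frequencies $f_1=120$, $f_2=45$ are hypothetical and not taken from the paper.

```python
from fractions import Fraction


def chao_type_f0(f1, f2):
    """Return (p_hat, f0_hat) from the frequencies of ones and twos.

    p_hat is the maximizer from Theorem 1(a); f0_hat evaluates
    e_0 = (f1 + f2) / ((1 - p)(2 - p)) at p = p_hat, as in the proof.
    """
    f1, f2 = Fraction(f1), Fraction(f2)
    p_hat = (f1 - f2) / f1
    f0_hat = (f1 + f2) / ((1 - p_hat) * (2 - p_hat))
    return p_hat, f0_hat


# Hypothetical frequencies, chosen only for illustration.
p_hat, f0_hat = chao_type_f0(120, 45)
print(p_hat, f0_hat, f0_hat == Fraction(120**2, 45))
```

Using `Fraction` avoids rounding, so the equality with $f_1^2/f_2$ is checked exactly rather than up to floating-point error.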

Theorem 2

Let $k_y(p)=(1-p)^yp$ for $y=0,1,\cdots $ and $p \in (0,1)$. Then,

$$\begin{aligned} \lim _{N\rightarrow \infty }\frac{E(\hat{N})}{N} =1 \end{aligned}$$

for $\hat{N}=\hat{N}_{ML}, \hat{N}_{C}$, or $\hat{N}_{Cen}$.

Proof

Let $\hat{N}=\hat{N}_{ML}=n/(1-n/S)$. Note that $E(n)=N(1-p)$ and $E(S/N)=(1-p)/p$ so that

$$\begin{aligned} \frac{E(n/(1-n/S))}{N} \mathop {\rightarrow }\limits _{N\rightarrow \infty } \frac{1-p}{1-\frac{1-p}{(1-p)/p}}=1. \end{aligned}$$

Let $\hat{N}=\hat{N}_{C}=n + f_1^2/f_2$. Note that $E(f_1)=Np(1-p)$ and $E(f_2)=Np(1-p)^2$ so that

$$\begin{aligned} \frac{E(n+f_1^2/f_2)}{N} \mathop {\rightarrow }\limits _{N\rightarrow \infty } (1-p) + \frac{p^2(1-p)^2}{p(1-p)^2}=1. \end{aligned}$$

Finally, let $\hat{N}=\hat{N}_{Cen}=\frac{n}{1-f_1/n}$. Using the above we have

$$\begin{aligned} \frac{E\left(\frac{n}{1-f_1/n}\right)}{N} \mathop {\rightarrow }\limits _{N\rightarrow \infty } \frac{1-p}{1-\frac{(1-p)p}{(1-p)}}=1, \end{aligned}$$

which ends the proof. $\square $
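A small simulation can illustrate Theorem 2 for a homogeneous geometric population. This is a sketch under stated assumptions rather than the simulation design of the paper: the function name, the seed, and the values $N=20000$, $p=0.5$ are hypothetical, and counts are drawn by inverse-transform sampling from $k_y(p)=(1-p)^yp$.

```python
import math
import random


def simulate_estimators(N, p, seed=0):
    """Draw N geometric counts with P(Y = y) = (1 - p)**y * p, discard the
    zeros, and return the three population size estimates."""
    rng = random.Random(seed)
    log_q = math.log(1.0 - p)
    # Inverse transform: with U uniform on (0, 1], floor(log U / log(1 - p))
    # has the geometric distribution above.
    counts = [int(math.log(1.0 - rng.random()) / log_q) for _ in range(N)]
    observed = [y for y in counts if y > 0]
    n = len(observed)                          # number of observed units
    S = sum(observed)                          # total number of identifications
    f1 = sum(1 for y in observed if y == 1)    # units seen exactly once
    f2 = sum(1 for y in observed if y == 2)    # units seen exactly twice
    return {
        "ML": n / (1 - n / S),
        "Chao": n + f1**2 / f2,
        "Cen": n / (1 - f1 / n),
    }


N = 20_000
estimates = simulate_estimators(N, p=0.5)
for name, value in estimates.items():
    print(name, round(value / N, 3))  # ratios should be close to 1
```

A single run only shows one realization of each ratio; averaging over many seeds would approximate $E(\hat{N})/N$ more closely.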

Appendix 2: Standard errors

Let $\hat{N}$ be the estimator of the population size $N$ of interest, the latter being a fixed but unknown quantity. Also, let the random quantity $n$ be the observed number of units. We will make use of the result

$$\begin{aligned} \text{ Var}(\hat{N}) = E_n \{\text{ Var}(\hat{N}|n)\} + \text{ Var}_n\{E(\hat{N}|n)\}, \end{aligned}$$

(9)

where $\hat{N}|n$ refers to the distribution of $\hat{N}$ conditional upon $n$ and $E_n(.)$ and $\text{ Var}_n(.)$ refer to the first and second (central) moment w.r.t. the distribution of $n$. For more details see Böhning (2008).

1.1 Maximum likelihood estimator

We consider the maximum likelihood estimator $\hat{p}_{ML}=n/S$ and the associated population size estimator $\hat{N}=\hat{N}_{ML}=n/(1-n/S)$. We start with the second term in (9) and have that $E(\hat{N}|n)\approx n/(1-p)$, so that

$$\begin{aligned} \text{ Var}_n [n/(1-p)] = \frac{1}{(1-p)^2}Np(1-p). \end{aligned}$$

Note that $N(1-p)$ can be estimated by $n$ and $p$ by the maximum likelihood estimator $n/S$, so that the variance estimator $\frac{Sn^2}{(S-n)^2}$ arises.

For the first term in (9), we use the $\delta $-method to determine $\text{ Var}(\hat{N}|n)$ as

$$\begin{aligned} \frac{n^2}{(1-p)^4} \text{ Var}_n(\hat{p}_{ML}) \end{aligned}$$

and, using the Fisher information for $p$, we can determine $ \text{ Var}_n(\hat{p}_{ML})$ as

$$\begin{aligned} \text{ Var}_n(\hat{p}_{ML}) \approx \frac{n(S-n)}{S^3}. \end{aligned}$$

The expected value $E_n \{\text{ Var}(\hat{N}|n)\} $ is then replaced by its moment estimate $\text{ Var}(\hat{N}|n)$ to obtain the total variance estimator

$$\begin{aligned} \frac{Sn^2}{(S-n)^2}+\frac{n^2}{(1-n/S)^4}\frac{n(S-n)}{S^3} = \frac{S^2n^2}{(S-n)^3} \end{aligned}$$

(10)
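The simplification in (10) can be confirmed in exact arithmetic. The sketch below is illustrative; the function name and the values $n=300$, $S=750$ are hypothetical.

```python
from fractions import Fraction


def ml_variance(n, S):
    """Return the two terms of decomposition (9) for the ML estimator,
    evaluated at the plug-in estimates, together with their sum."""
    n, S = Fraction(n), Fraction(S)
    p_hat = n / S
    # Second term of (9): Var_n{E(N_hat | n)} estimated by S n^2 / (S - n)^2.
    var_of_cond_mean = S * n**2 / (S - n) ** 2
    # First term of (9): delta-method factor times the variance of p_hat.
    mean_of_cond_var = n**2 / (1 - p_hat) ** 4 * (n * (S - n) / S**3)
    return var_of_cond_mean, mean_of_cond_var, var_of_cond_mean + mean_of_cond_var


# Hypothetical data: 300 observed units with 750 identifications in total.
_, _, total = ml_variance(n=300, S=750)
closed_form = Fraction(750**2 * 300**2, (750 - 300) ** 3)
print(total == closed_form, float(total) ** 0.5)  # identity check, standard error
```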

1.2 Censored estimator

We consider the censored estimator $\hat{p}_{Cen}=f_1/n$ and the associated population size estimator $\hat{N}=\hat{N}_{Cen}=n/(1-f_1/n)$. We have $E(\hat{N}|n)\approx n/(1-p)$, so that, as before, $\text{ Var}_n [n/(1-p)] = \frac{1}{(1-p)^2}Np(1-p)$, which can be estimated as $\frac{f_1}{(1-f_1/n)^2}$ by replacing $N(1-p)$ by $n$ and $p$ by $f_1/n$.

For the first term in (9), $\text{ Var}(\hat{N}|n)$, using the $\delta $-method once more, we obtain the approximation

$$\begin{aligned} \text{ Var}\left(\frac{n}{1-f_1/n}|n\right)\approx \frac{n^2}{(1-f_1/n)^4} \text{ Var}\left(\frac{f_1}{n}|n\right), \end{aligned}$$

from which the variance estimator $\frac{f_1(1-f_1/n)}{(1-f_1/n)^4}=\frac{f_1}{(1-f_1/n)^3}$ arises. In total, taking both variance terms into account, we obtain the variance estimator

$$\begin{aligned} \frac{f_1}{(1-f_1/n)^2}+\frac{f_1}{(1-f_1/n)^3} = \frac{f_1}{(1-f_1/n)^2}\frac{2n-f_1}{n-f_1} \end{aligned}$$

(11)
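The two sides of (11) can likewise be compared exactly. This is a minimal sketch with a hypothetical function name and hypothetical values $n=300$, $f_1=120$.

```python
from fractions import Fraction


def censored_variance(n, f1):
    """Sum of the two variance terms for the censored estimator n / (1 - f1/n)."""
    n, f1 = Fraction(n), Fraction(f1)
    r = 1 - f1 / n                     # estimate of 1 - p
    var_of_cond_mean = f1 / r**2       # second term of (9)
    mean_of_cond_var = f1 / r**3       # first term of (9), via the delta method
    return var_of_cond_mean + mean_of_cond_var


# Hypothetical data: 300 observed units, 120 of them seen exactly once.
total = censored_variance(n=300, f1=120)
closed_form = (
    Fraction(120) / (1 - Fraction(120, 300)) ** 2 * Fraction(2 * 300 - 120, 300 - 120)
)
print(total == closed_form, float(total) ** 0.5)  # identity check, standard error
```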

1.3 Chao’s estimator

Finally, we consider the Chao-type estimator $\hat{N}= \hat{N}_{C}=n+f_1^2/f_2$. Note that it differs from the original Chao estimator $n+f_1^2/(2f_2)$, for which a variance estimator is provided in Chao (1987). If we were only interested in a variance estimator of $\hat{f}_0$, we could simply multiply the Chao variance estimator by a factor of 4. However, interest is usually in the population size estimator $\hat{N}$, for which this simple adjustment is not valid. Hence we provide a full analysis in the following, again using the conditioning technique (9).

We have $E(\hat{N}|n)= E(n+\frac{f_1^2}{f_2}|n)\approx n + (g_1^+)^2 n/g_2^+=n(1+(g_1^+)^2/g_2^+)$. Recall that $g_y^+=g_y/(1-g_0)$ for $y=1,2,\ldots $. (Note that $E(\hat{N}|n)$ refers to the conditional count distribution $g_1^+, g_2^+, \ldots $, which is estimated by $f_1/n, f_2/n, \ldots $.) Hence $\text{ Var}_n \{n(1+(g_1^+)^2/g_2^+)\}= (1+(g_1^+)^2/g_2^+)^2 Ng_0(1-g_0)$, which can be estimated as follows. $Ng_0$ can be estimated as $\hat{f}_0 =f_1^2/f_2$ and $(1-g_0)$ as $1-f_1^2/(\hat{N} f_2)=\frac{f_2n}{f_2n+f_1^2}$, so that in total the estimate $(1+\frac{f_1^2}{f_2 n})^2 \frac{f_1^2n}{f_2n+f_1^2}$ arises, which we can simplify as

$$\begin{aligned} \left(1+\frac{f_1^2}{f_2n}\right)^2 \frac{f_1^2n}{f_2n+f_1^2}= f_1^2/f_2+f_1^4/(f_2^2n). \end{aligned}$$

(12)

For the first term in (9), $\text{ Var}(\hat{N}|n)$, using the bivariate $\delta $-method, we obtain the approximation

$$\begin{aligned} \text{ Var}(\hat{N}|n) \approx \nabla \phi _0(f_1,f_2)^T cov(f_1,f_2) \nabla \phi _0(f_1,f_2) \end{aligned}$$

where $\phi _0(f_1,f_2)=f_1^2/f_2$ and $\nabla \phi _0(f_1,f_2)$ is the two-vector of partial derivatives with respect to $f_1$ and $f_2$:

$$\begin{aligned} \nabla \phi _0(f_1,f_2)^T=(2f_1/f_2,-f_1^2/f_2^2). \end{aligned}$$

The covariance matrix, conditional on $n$, is $ cov(f_1,f_2)=n (\text{ diag}({{\mathbf g}^{+}}) -{{\mathbf g}^{+}}{{\mathbf g}^{+T}}) $, where ${\mathbf g}^{+}$ is the two-vector of probabilities, conditional on $n$, for observing a one or a two, respectively. Also, $\text{ diag}({{\mathbf g}^+})$ is the diagonal $2\times 2$ matrix with $g_1^+$ and $g_2^+$ on the main diagonal. This matrix is estimated by

$$\begin{aligned} \left(\begin{array}{ll} f_1-f_1^2/n&-f_1f_2/n\\ -f_1f_2/n&f_2-f_2^2/n \end{array}\right). \end{aligned}$$

Hence we find

$$\begin{aligned} \nabla \phi _0(f_1,f_2)^T \widehat{cov}(f_1,f_2) \nabla \phi _0(f_1,f_2)=\frac{4f_1^3}{f_2^2}+\frac{f_1^4}{f_2^3} -\frac{f_1^4}{f_2^2n}. \end{aligned}$$

(13)

Ultimately, taking (12) and (13) together, we obtain the variance estimator for $\hat{N} =\hat{N}_C=n+f_1^2/f_2$ as

$$\begin{aligned} \frac{f_1^4}{f_2^3}+\frac{4f_1^3}{f_2^2}+\frac{f_1^2}{f_2}, \end{aligned}$$

(14)

being of remarkably simple form.
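The cancellation leading from (12) and (13) to (14) can be reproduced in exact arithmetic by evaluating the quadratic form directly. The sketch below is illustrative only; the function name and the values $n=300$, $f_1=120$, $f_2=45$ are hypothetical.

```python
from fractions import Fraction


def chao_variance_terms(n, f1, f2):
    """Return the estimates (12) and (13) for the Chao-type estimator n + f1^2/f2."""
    n, f1, f2 = Fraction(n), Fraction(f1), Fraction(f2)
    # (12): estimate of Var_n{E(N_hat | n)}.
    term_12 = f1**2 / f2 + f1**4 / (f2**2 * n)
    # Gradient of phi_0(f1, f2) = f1^2 / f2.
    grad = (2 * f1 / f2, -(f1**2) / f2**2)
    # Estimated covariance matrix of (f1, f2), conditional on n.
    cov = (
        (f1 - f1**2 / n, -f1 * f2 / n),
        (-f1 * f2 / n, f2 - f2**2 / n),
    )
    # (13): quadratic form grad^T cov grad from the bivariate delta method.
    term_13 = sum(grad[i] * cov[i][j] * grad[j] for i in range(2) for j in range(2))
    return term_12, term_13


# Hypothetical data, chosen only for illustration.
t12, t13 = chao_variance_terms(n=300, f1=120, f2=45)
closed_form_14 = (
    Fraction(120**4, 45**3) + Fraction(4 * 120**3, 45**2) + Fraction(120**2, 45)
)
print(t12 + t13 == closed_form_14, float(t12 + t13) ** 0.5)
```

The terms $\pm f_1^4/(f_2^2n)$ in (12) and (13) cancel, which is why (14) no longer depends on $n$.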

About this article

Cite this article

Niwitpong, Sa., Böhning, D., van der Heijden, P.G.M. et al. Capture–recapture estimation based upon the geometric distribution allowing for heterogeneity. Metrika 76, 495–519 (2013). https://doi.org/10.1007/s00184-012-0401-0

Received: 25 October 2011
Published: 27 July 2012
Issue Date: May 2013
DOI: https://doi.org/10.1007/s00184-012-0401-0
