Abstract
Capture–recapture methods aim to estimate the size of an elusive target population. Each member of the target population carries a count of identifications by some identifying mechanism, namely the number of times it has been identified during the observational period. Only positive counts are observed, and inference needs to be based on the observed count distribution. A widely used assumption for the count distribution is a Poisson mixture. If the mixing distribution can be described by an exponential density, the geometric distribution arises as the marginal. This note discusses population size estimation on the basis of the zero-truncated geometric (which is itself again a geometric). In addition, population heterogeneity is considered for the geometric. Chao’s estimator is developed for the mixture of geometric distributions and provides a lower bound estimator which is valid under arbitrary mixing on the parameter of the geometric. However, Chao’s estimator is also known for its relatively large variance (compared with the maximum likelihood estimator). Another estimator, based on a censored geometric likelihood, is suggested; it uses the entire sample information but is less affected by model misspecification. Simulation studies illustrate that the proposed censored estimator offers a good compromise between the maximum likelihood estimator and Chao’s estimator, i.e. between efficiency and bias.
References
Böhning D (2008) A simple variance formula for population size estimators by conditioning. Stat Methodol 5:410–423
Borchers DL, Buckland ST, Zucchini W (2004) Estimating animal abundance. Closed populations. Springer, London
Bunge J, Fitzpatrick M (1993) Estimating the number of species: a review. J Am Stat Assoc 88:364–373
Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–791
Chao A (1989) Estimating population size for sparse data in capture-recapture experiments. Biometrics 45:427–438
Chao A, Tsay PK, Lin SH, Shau WY, Chao DY (2001) Tutorial in biostatistics: The applications of capture-recapture models to epidemiological data. Stat Med 20:3123–3157
Dorazio RM, Royle JA (2003) Mixture models for estimating the size of a closed population when capture rates vary among individuals. Biometrics 59:351–364
Hay G, Smit F (2003) Estimating the number of drug injectors from needle exchange data. Addict Res Theory 11:235–243
Holzmann H, Munk A, Zucchini W (2006) On identifiability in capture-recapture models. Biometrics 62:934–939
Link WA (2003) Nonidentifiability of population size from capture-recapture data with heterogeneous detection probabilities. Biometrics 59:1123–1130
Link WA (2006) Response to a paper by Holzmann, Munk and Zucchini. Biometrics 62:936–939
Mao CX (2007a) Estimating population sizes for capture-recapture sampling with binomial mixtures. Comput Stat Data Anal 51:5211–5219
Mao CX (2007b) Estimating the number of classes. Ann Stat 35:917–930
Mao CX (2008a) On the nonidentifiability of population sizes. Biometrics 64:977–981
Mao CX (2008b) Lower bounds to the population size when capture probabilities vary over individuals. Aust N Z J Stat 50:125–134
Oosterlee A, Vink RM, Smit F (2009) Prevalence of family violence in adults and children: estimates using the capture-recapture method. Eur J Public Health 19:586–591
Paluscia VJ, Wirtz SJ, Covington TM (2010) Using capture-recapture methods to better ascertain the incidence of fatal child maltreatment. Child Abuse Neglect 34:396–402
Pledger SA (2005) The performance of mixture models in heterogeneous closed population capture-recapture. Biometrics 61:868–876
Roberts JM, Brewer DD (2006) Estimating the prevalence of male clients of prostitute women in Vancouver with a simple capture-recapture method. J R Stat Soc Ser A 169:745–756
Van der Heijden PGM, Cruyff M, van Houwelingen HC (2003) Estimating the size of a criminal population from police records using the truncated Poisson regression model. Stat Neerlandica 57:1–16
Van Hest NAH, De Vries G, Smit F, Grant AD, Richardus JH (2008) Estimating the coverage of tuberculosis screening among drug users and homeless persons with truncated models. Epidemiol Infect 136:14–22
Wang J-P, Lindsay BG (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J Am Stat Assoc 100:942–959
Wang J-P, Lindsay BG (2008) An exponential partial prior for improving nonparametric maximum likelihood estimation in mixture models. Stat Methodol 5:30–45
Wilson RM, Collins MF (1992) Capture-recapture estimation with samples of size one using frequency data. Biometrika 79:543–553
Acknowledgments
The authors would like to thank the Editor and two anonymous referees for their very helpful comments which considerably improved the paper. We also would like to thank Jeerapa Sappakitkamjorn (Department of Applied Statistics, King Mongkut’s University of Technology North–Bangkok) for her great support in finalizing the simulation study.
Additional information
The idea for this paper was developed while the second author was visiting the Department of Applied Statistics at King Mongkut’s University of Technology North–Bangkok in the summers of 2009 and 2010; the second author would like to thank the department for the support received.
The paper was written while the first author was visiting the Department of Mathematics and Statistics at the University of Reading in the spring of 2011; the first author would like to thank the department for the support received.
Appendices
Appendix 1: Proof of theorems
Theorem 1
Let \(k_y(p)=(1-p)^yp\) for \(y=0,1,\cdots \) and \(p \in (0,1)\).
(a) Let \(\log L(p) = f_1\log (\pi _1)+f_2\log (\pi _2)\) with \(\pi _1 = 1/(2-p)\) and \(\pi _2 =(1-p)/(2-p)\) being the geometric probabilities truncated to counts of ones and twos. Then \(\log L(p)\) is maximized for \(\hat{p} = (f_1-f_2)/f_1.\)
(b) \(E(f_0|f_1,f_2;\hat{p})=f_1^2/f_2,\) for \(\hat{p}= (f_1-f_2)/f_1\).
Proof
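For the first part, it is clear that \( f_1\log (\pi _1)+f_2\log (\pi _2)\) is maximal for \(\hat{\pi }_1 =f_1/(f_1+f_2)=1/(2-\hat{p})\), which is attained for \(\hat{p} = (f_1-f_2)/f_1\). For the second part, we see that with \(e_y=E(f_y|f_1,f_2;p)= k_y(p) N\) we have the following:
\[ e_0 = pN, \qquad e_y = (1-p)^y p N \quad (y \ge 3), \]
so that
\[ N = e_0 + f_1 + f_2 + e_3^+ = pN + f_1 + f_2 + (1-p)^3 N \]
with \(e_3^+= \sum _{j=3}^{\infty } e_j\). Hence
\[ N = \frac{f_1+f_2}{1-p-(1-p)^3} = \frac{f_1+f_2}{p(1-p)(2-p)} \]
and
\[ e_0 = pN = \frac{f_1+f_2}{(1-p)(2-p)}. \]
Plugging in the maximum likelihood estimate \(\hat{p}=(f_1-f_2)/f_1\) for \(p\), for which \(1-\hat{p}=f_2/f_1\) and \(2-\hat{p}=(f_1+f_2)/f_1\), yields
\[ \hat{e}_0 = (f_1+f_2)\,\frac{f_1}{f_2}\,\frac{f_1}{f_1+f_2} = \frac{f_1^2}{f_2}, \]
the desired result. \(\square \)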
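A minimal Python sketch of Theorem 1 may help as a numerical check. It assumes the expression \(e_0=(f_1+f_2)/\{(1-p)(2-p)\}\), which follows from \(N=e_0+f_1+f_2+e_3^+\) with \(e_y=k_y(p)N\); the function names and the frequencies \(f_1=120\), \(f_2=45\) are hypothetical and chosen only for illustration.

```python
import math

def censored_loglik(p, f1, f2):
    """Log-likelihood of Theorem 1(a): geometric truncated to counts 1 and 2."""
    pi1 = 1.0 / (2.0 - p)
    pi2 = (1.0 - p) / (2.0 - p)
    return f1 * math.log(pi1) + f2 * math.log(pi2)

def p_hat(f1, f2):
    """Closed-form maximizer from Theorem 1(a); needs f1 > f2 > 0."""
    return (f1 - f2) / f1

def f0_hat(f1, f2):
    """e_0 = (f1 + f2) / ((1 - p)(2 - p)), evaluated at p = p_hat."""
    p = p_hat(f1, f2)
    return (f1 + f2) / ((1.0 - p) * (2.0 - p))

# Hypothetical frequencies of ones and twos (illustration only).
f1, f2 = 120, 45

# Crude grid search over (0, 1) to confirm the closed-form maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda q: censored_loglik(q, f1, f2))

print(p_hat(f1, f2), p_grid)            # closed form vs. grid maximizer
print(f0_hat(f1, f2), f1 ** 2 / f2)     # part (b): both equal f1^2 / f2
```

The grid search is deliberately naive; it serves only to confirm the closed form and is not an estimation routine.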
Theorem 2
Let \(k_y(p)=(1-p)^yp\) for \(y=0,1,\cdots \) and \(p \in (0,1)\). Then,
for \(\hat{N}=\hat{N}_{ML}, \hat{N}_{C}\), or \(\hat{N}_{Cen}\).
Proof
Let \(\hat{N}=\hat{N}_{ML}=n/(1-n/S)\). Note that \(E(n)=N(1-p)\) and \(E(S/N)=(1-p)/p\) so that
Let \(\hat{N}=\hat{N}_{C}=n + f_1^2/f_2\). Note that \(E(f_1)=Np(1-p)\) and \(E(f_2)=Np(1-p)^2\) so that
Finally, let \(\hat{N}=\hat{N}_{Cen}=\frac{n}{1-f_1/n}\). Using the above we have
which ends the proof. \(\square \)
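The substitution behind the three cases can be sketched numerically: if every statistic is replaced by its expectation under \(k_y(p)=(1-p)^yp\), each estimator returns \(N\). This is only a first-order check, not the formal statement of the theorem. It uses \(E(n)=N(1-p)\), since \(k_0(p)=p\) is the probability of a zero count, together with \(E(S)=N(1-p)/p\), \(E(f_1)=Np(1-p)\) and \(E(f_2)=Np(1-p)^2\). The function names and the values \(N=1000\), \(p=0.3\) are hypothetical.

```python
import math

def n_ml(n, S):
    """Maximum likelihood estimator n / (1 - n/S)."""
    return n / (1.0 - n / S)

def n_chao(n, f1, f2):
    """Chao-type estimator n + f1^2 / f2."""
    return n + f1 ** 2 / f2

def n_cen(n, f1):
    """Censored estimator n / (1 - f1/n)."""
    return n / (1.0 - f1 / n)

# Hypothetical population size and geometric parameter (illustration only).
N, p = 1000.0, 0.3

# Expected values of the statistics under k_y(p) = (1 - p)^y p.
n = N * (1 - p)               # expected number of observed units
S = N * (1 - p) / p           # expected sum of all counts
f1 = N * p * (1 - p)          # expected frequency of ones
f2 = N * p * (1 - p) ** 2     # expected frequency of twos

estimates = {
    "ML": n_ml(n, S),
    "Chao": n_chao(n, f1, f2),
    "Cen": n_cen(n, f1),
}
print(estimates)  # all three agree with N up to rounding
```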
Appendix 2: Standard errors
Let \(\hat{N}\) be the estimator of the population size \(N\) of interest, the latter being a fixed but unknown quantity. Also, let the random quantity \(n\) be the observed number of units. We will make use of the result
\[ \text{Var}(\hat{N}) = E_n\{\text{Var}(\hat{N}|n)\} + \text{Var}_n\{E(\hat{N}|n)\}, \qquad (9) \]
where \(\hat{N}|n\) refers to the distribution of \(\hat{N}\) conditional upon \(n\) and \(E_n(.)\) and \(\text{Var}_n(.)\) refer to the first and second (central) moment w.r.t. the distribution of \(n\). For more details see Böhning (2008).
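The conditioning result, \(\text{Var}(\hat{N}) = E_n\{\text{Var}(\hat{N}|n)\} + \text{Var}_n\{E(\hat{N}|n)\}\), can be checked exactly on a toy example. The sketch below assumes \(n\sim \text{Binomial}(N,1-p)\) and, given \(n\), \(f_1\sim \text{Binomial}(n,p)\), as in the zero-truncated geometric. The statistic `stat` is a hypothetical stand-in (chosen so that it is defined for every outcome), and \(N=4\), \(p=3/10\) are illustrative values.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, m, q):
    """Binomial(m, q) probability of k successes, exact with Fractions."""
    return comb(m, k) * q ** k * (1 - q) ** (m - k)

def stat(n, f1):
    """Hypothetical statistic standing in for an estimator N_hat."""
    return n + 2 * f1

N, p = 4, Fraction(3, 10)  # toy values (illustration only)

# Joint distribution of (n, f1): n ~ Bin(N, 1 - p), f1 | n ~ Bin(n, p).
joint = {}
for n in range(N + 1):
    for f1 in range(n + 1):
        joint[(n, f1)] = binom_pmf(n, N, 1 - p) * binom_pmf(f1, n, p)

# Left-hand side: unconditional variance of the statistic.
mean = sum(w * stat(n, f1) for (n, f1), w in joint.items())
total_var = sum(w * (stat(n, f1) - mean) ** 2 for (n, f1), w in joint.items())

# Right-hand side: E_n{Var(. | n)} + Var_n{E(. | n)}.
exp_cond_var = Fraction(0)
var_cond_mean = Fraction(0)
for n in range(N + 1):
    pn = binom_pmf(n, N, 1 - p)
    cm = sum(binom_pmf(f1, n, p) * stat(n, f1) for f1 in range(n + 1))
    cv = sum(binom_pmf(f1, n, p) * (stat(n, f1) - cm) ** 2 for f1 in range(n + 1))
    exp_cond_var += pn * cv
    var_cond_mean += pn * (cm - mean) ** 2

print(total_var, exp_cond_var + var_cond_mean)  # identical fractions
```

Exact rational arithmetic is used so that the two sides agree identically rather than only up to rounding.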
1.1 Maximum likelihood estimator
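We consider the maximum likelihood estimator \(\hat{p}_{ML}=n/S\) and the associated population size estimator \(\hat{N}=\hat{N}_{ML}=n/(1-n/S)\). We start with the second term in (9) and have that \(E(\hat{N}|n)\approx n/(1-p)\), so that
\[ \text{Var}_n\{E(\hat{N}|n)\} \approx \text{Var}_n\left\{ \frac{n}{1-p}\right\} = \frac{Np(1-p)}{(1-p)^2} = \frac{Np}{1-p}. \]
Note that \(N(1-p)\) can be estimated by \(n\) and \(p\) by the maximum likelihood estimator \(n/S\), so that the variance estimator \(\frac{Sn^2}{(S-n)^2}\) arises.
For the first term in (9), we use the \(\delta \)-method to determine \(\text{Var}(\hat{N}|n)\) as
\[ \text{Var}(\hat{N}|n) \approx \left( \frac{n}{(1-p)^2}\right) ^2 \text{Var}(\hat{p}_{ML}|n) \]
and, using the Fisher information for \(p\) in the zero-truncated geometric, we can determine \(\text{Var}(\hat{p}_{ML}|n)\) as
\[ \text{Var}(\hat{p}_{ML}|n) \approx \frac{p^2(1-p)}{n}. \]
The expected value \(E_n \{\text{Var}(\hat{N}|n)\} \) is then replaced by its moment estimate \(\text{Var}(\hat{N}|n)\), evaluated at \(\hat{p}_{ML}=n/S\), to achieve the total variance
\[ \widehat{\text{Var}}(\hat{N}_{ML}) = \frac{n^3S}{(S-n)^3} + \frac{Sn^2}{(S-n)^2} = \frac{n^2S^2}{(S-n)^3}. \]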
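A minimal Python sketch of the two variance components for the maximum likelihood estimator follows. It assumes the \(\delta \)-method term \(np^2/(1-p)^3\), obtained from \(\text{Var}(\hat{p}_{ML}|n)\approx p^2(1-p)/n\) and evaluated at \(\hat{p}_{ML}=n/S\), together with the term \(Sn^2/(S-n)^2\) given in the text. The function name and the values \(n=700\), \(S=1500\) are hypothetical.

```python
import math

def var_ml(n, S):
    """Variance estimate for N_hat_ML = n / (1 - n/S); requires S > n.

    Returns (conditional term, term due to n, total).
    """
    # Estimate of E_n{Var(N_hat | n)}: n p^2 / (1 - p)^3 at p = n / S.
    term_cond = n ** 3 * S / (S - n) ** 3
    # Estimate of Var_n{E(N_hat | n)}: N p / (1 - p) with N(1 - p) -> n.
    term_n = S * n ** 2 / (S - n) ** 2
    return term_cond, term_n, term_cond + term_n

# Hypothetical data summary: n observed units, S total identifications.
n, S = 700, 1500
term_cond, term_n, total = var_ml(n, S)

# The sum collapses to the closed form n^2 S^2 / (S - n)^3.
closed_form = n ** 2 * S ** 2 / (S - n) ** 3
print(total, closed_form, math.sqrt(total))  # variance twice, then std. error
```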
1.2 Censored estimator
We consider the censored estimator \(\hat{p}_{Cen}=f_1/n\) and the associated population size estimator \(\hat{N}=\hat{N}_{Cen}=n/(1-f_1/n)\). We have \(E(\hat{N}|n)\approx n/(1-p)\), so that, as before, \(\text{Var}_n \{n/(1-p)\} = \frac{1}{(1-p)^2}Np(1-p)\), which can be estimated as \(\frac{f_1}{(1-f_1/n)^2}\) by replacing \(N(1-p)\) by \(n\) and \(p\) by \(f_1/n\).
For the first term in (9), \(\text{Var}(\hat{N}|n)\), using the \(\delta \)-method once more we achieve the approximation
\[ \text{Var}(\hat{N}|n) \approx \left( \frac{n}{(1-p)^2}\right) ^2 \text{Var}(\hat{p}_{Cen}|n) = \frac{n^2}{(1-p)^4}\,\frac{p(1-p)}{n} = \frac{np(1-p)}{(1-p)^4}, \]
from where the variance estimator \(\frac{f_1(1-f_1/n)}{(1-f_1/n)^4}=\frac{f_1}{(1-f_1/n)^3}\) arises. In total, taking both variance terms into account, we achieve the variance estimator
\[ \widehat{\text{Var}}(\hat{N}_{Cen}) = \frac{f_1}{(1-f_1/n)^3} + \frac{f_1}{(1-f_1/n)^2} = \frac{f_1(2-f_1/n)}{(1-f_1/n)^3}. \]
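The two terms for the censored estimator, \(\frac{f_1}{(1-f_1/n)^3}\) and \(\frac{f_1}{(1-f_1/n)^2}\), can be combined in a short Python sketch. The closed form \(f_1(2-f_1/n)/(1-f_1/n)^3\) used for comparison is simply their algebraic sum. The function name and the values \(n=700\), \(f_1=210\) are hypothetical.

```python
import math

def var_cen(n, f1):
    """Variance estimate for N_hat_Cen = n / (1 - f1/n); requires f1 < n.

    Returns (conditional term, term due to n, total).
    """
    q = 1.0 - f1 / n          # estimate of 1 - p
    term_cond = f1 / q ** 3   # delta-method term, Var(N_hat | n)
    term_n = f1 / q ** 2      # term from the variation of n
    return term_cond, term_n, term_cond + term_n

# Hypothetical data summary: n observed units, f1 of them seen exactly once.
n, f1 = 700, 210
term_cond, term_n, total = var_cen(n, f1)

# Algebraic sum of the two terms.
closed_form = f1 * (2.0 - f1 / n) / (1.0 - f1 / n) ** 3
n_hat = n / (1.0 - f1 / n)
print(n_hat, total, math.sqrt(total))  # estimate, variance, standard error
```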
1.3 Chao’s estimator
Finally, we consider the Chao-type estimator \(\hat{N}= \hat{N}_{C}=n+f_1^2/f_2\). Note that it differs from the original Chao estimator \(n+f_1^2/(2f_2)\), for which a variance estimator is provided in Chao (1987). If we were interested only in a variance estimator of \(\hat{f}_0\), we could simply multiply the Chao variance estimator by a factor of 4. However, interest is usually in the population size estimator \(\hat{N}\), for which this simple adjustment is not valid. Hence we provide a full analysis in the following, again using the conditioning technique (9).
We have \(E(\hat{N}|n)= E(n+\frac{f_1^2}{f_2}|n)\approx n + (g_1^+)^2 n/g_2^+=n(1+(g_1^+)^2/g_2^+)\). Recall that \(g_y^+=g_y/(1-g_0)\) for \(y=1,2,\ldots \). (Note that \(E(\hat{N}|n)\) refers to the conditional count distribution \(g_1^+, g_2^+, \ldots \), which is estimated by \(f_1/n, f_2/n, \ldots \).) Hence \(\text{Var}_n \{n(1+(g_1^+)^2/g_2^+)\}= (1+(g_1^+)^2/g_2^+)^2 Ng_0(1-g_0)\), which can be estimated as follows. \(Ng_0\) can be estimated as \(\hat{f}_0 =f_1^2/f_2\) and \((1-g_0)\) as \(1-f_1^2/(\hat{N} f_2)=\frac{f_2n}{f_2n+f_1^2}\), so that in total the estimate \((1+\frac{f_1^2}{f_2 n})^2 \frac{f_1^2n}{f_2n+f_1^2}\) arises, which we can simplify as
\[ \frac{f_1^2}{f_2} + \frac{f_1^4}{nf_2^2}. \qquad (12) \]
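For the first term in (9), \(\text{Var}(\hat{N}|n)\), using the bivariate \(\delta \)-method, we achieve the approximation
\[ \text{Var}(\hat{N}|n) = \text{Var}\left( \frac{f_1^2}{f_2}\Big| n\right) \approx \nabla \phi _0(f_1,f_2)^T \, cov(f_1,f_2) \, \nabla \phi _0(f_1,f_2), \]
where \(\phi _0(f_1,f_2)=f_1^2/f_2\) and \(\nabla \phi _0(f_1,f_2)\) is the two-vector of partial derivatives with respect to \(f_1\) and \(f_2\):
\[ \nabla \phi _0(f_1,f_2) = \left( \frac{2f_1}{f_2}, \; -\frac{f_1^2}{f_2^2}\right) ^T. \]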
The covariance matrix, conditional on \(n\), is \( cov(f_1,f_2)=n (\text{diag}({{\mathbf g}^{+}}) -{{\mathbf g}^{+}}{{\mathbf g}^{+T}}) \), where \({\mathbf g}^{+}\) is the two-vector of probabilities, conditional on \(n\), for observing a one or a two, respectively. Also, \(\text{diag}({{\mathbf g}^+})\) is the diagonal \(2\times 2\) matrix with \(g_1^+\) and \(g_2^+\) on the main diagonal. This matrix is estimated by
\[ \begin{pmatrix} f_1(1-f_1/n) & -f_1f_2/n \\ -f_1f_2/n & f_2(1-f_2/n) \end{pmatrix}. \]
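Hence we find for \(\text{Var}(\hat{N}|n)\) the estimate
\[ \frac{4f_1^3}{f_2^2} + \frac{f_1^4}{f_2^3} - \frac{f_1^4}{nf_2^2}. \qquad (13) \]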
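Ultimately, taking (12) and (13) together, we achieve the variance estimator for \(\hat{N} =\hat{N}_C=n+f_1^2/f_2\) as
\[ \widehat{\text{Var}}(\hat{N}_C) = \frac{f_1^4}{f_2^3} + \frac{4f_1^3}{f_2^2} + \frac{f_1^2}{f_2}, \]
being of remarkably simple form.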
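The bivariate \(\delta \)-method calculation for the Chao-type estimator can be sketched in Python. The sketch evaluates the quadratic form of the gradient \((2f_1/f_2,\,-f_1^2/f_2^2)\) with the estimated multinomial covariance of \((f_1,f_2)\) given \(n\), then adds the estimate \((1+\frac{f_1^2}{f_2 n})^2 \frac{f_1^2n}{f_2n+f_1^2}\) of the term due to \(n\). The closed form used for comparison, \(f_1^4/f_2^3+4f_1^3/f_2^2+f_1^2/f_2\), is the algebraic sum of the two pieces. The function name and the values \(n=700\), \(f_1=210\), \(f_2=147\) are hypothetical.

```python
import math

def var_chao(n, f1, f2):
    """Variance estimate for N_hat_C = n + f1^2 / f2 (needs f2 > 0).

    Returns (conditional term, term due to n, total).
    """
    # Gradient of phi_0(f1, f2) = f1^2 / f2.
    grad = (2.0 * f1 / f2, -(f1 ** 2) / f2 ** 2)
    # Estimated covariance of (f1, f2) given n (multinomial form).
    cov = (
        (f1 * (1.0 - f1 / n), -f1 * f2 / n),
        (-f1 * f2 / n, f2 * (1.0 - f2 / n)),
    )
    # Quadratic form grad^T cov grad: estimate of Var(N_hat | n).
    term_cond = sum(
        grad[i] * cov[i][j] * grad[j] for i in range(2) for j in range(2)
    )
    # Estimate of Var_n{E(N_hat | n)}.
    term_n = (1.0 + f1 ** 2 / (f2 * n)) ** 2 * f1 ** 2 * n / (f2 * n + f1 ** 2)
    return term_cond, term_n, term_cond + term_n

# Hypothetical frequencies (illustration only).
n, f1, f2 = 700, 210, 147
term_cond, term_n, total = var_chao(n, f1, f2)

# The terms involving n cancel in the sum.
closed_form = f1 ** 4 / f2 ** 3 + 4.0 * f1 ** 3 / f2 ** 2 + f1 ** 2 / f2
print(n + f1 ** 2 / f2, total, math.sqrt(total))
```

Because the terms in \(1/n\) cancel, the total depends on the data only through \(f_1\) and \(f_2\).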
Cite this article
Niwitpong, Sa., Böhning, D., van der Heijden, P.G.M. et al. Capture–recapture estimation based upon the geometric distribution allowing for heterogeneity. Metrika 76, 495–519 (2013). https://doi.org/10.1007/s00184-012-0401-0
DOI: https://doi.org/10.1007/s00184-012-0401-0