Population size estimation based upon zero-truncated, one-inflated and sparse count data

Estimating the size of a hard-to-count population is a challenging matter. In particular, when only few observations of the population to be estimated are available. The matter gets even more complex when one-inflation occurs. This situation is illustrated with the help of two examples: the size of a dice snake population in Graz (Austria) and the number of flare stars in the Pleiades. The paper discusses how one-inflation can be easily handled in likelihood approaches and also discusses how variances and confidence intervals can be obtained by means of a semi-parametric bootstrap. A Bayesian approach is mentioned as well and all approaches result in similar estimates of the hidden size of the population. Finally, a simulation study is provided which shows that the unconditional likelihood approach as well as the Bayesian approach using Jeffreys’ prior perform favorable.


Introduction and motivation
The objective here is to determine the size N of an elusive target population. To accomplish the purpose some mechanism (life trapping, register, surveillance system) is available which identifies a unit of the target population repeatedly. Hence, there is a count X informing about the number of identifications of each unit in the target population. Furthermore, suppose a sample X 1 ; X 2 ; . . .; X N of size N is available which leads to the empirical count distribution as presented in Table 1. There is, however, the well-known complication (Böhning et al. 2018;McCrea and Morgan 2015) that any sample units with X i ¼ 0 would not be observed leading to a reduced observable sample X 1 ; X 2 ; . . .; X n ; where-w.l.g.-we assume that In conclusion, we have that f 0 ¼ N À n is unknown. f 0 is also known as the dark or hidden figure and is the prime interest in this paper. In the following we illustrate the situation with two applications.
Estimating the size of a dice snake population in Graz. Tranninger and Friedl (2018) tried to estimate the size of a dice snake population in a closed area at the river Mur in Graz (Austria). The work was motivated by a resettlement project of the population due to the development of a water power plant in the vicinity of the living ground of the dire snakes. The major questions here was: how many dice snakes are there? We focus here on the year 2014 in which there were 31 capture occasions during the year. As above, X denotes the identification count per dice snake. The empirical distribution of X is provided in Table 3.
The number of flare stars in the Pleiades. Shortly after the appearance of two recent books on capture-recapture methods by McCrea and Morgan (2015); Böhning et al. (2018), it was pointed out to the authors by Akopian (2018) that capture-recapture methods are also used in astro-physics to estimate the size of star clusters. Indeed, Ambartsumyan et al. (1970) published work where the number of stars in the Plaiades is estimated using capture-recapture techniques. The Pleiades is a star cluster about 444 light years away from planet Earth and consists of 100s of stars, only some of these are visible at certain times, the flare stars. In Table 4, we see the empirical distribution of X representing the number of flares seen per star, for example, 123 stars were only seen once, 16 twice etc.
Both data situations have in common that there is perhaps sparsity exhibited by an abundance of unobserved zeros and relatively small non-zero counts. The second salient feature of both data sets is the occurrence of a relative large number of ones, Table 1 Frequency distribution of count X of repeated identifications indicating potential one-inflation. One-inflation models have recently experienced some attention in capture-recapture modelling; see Böhning and van der Heijden (2019), Böhning et al. (2019), Farcomeni (2020), Godwin (2017Godwin ( , 2019, or Godwin and Böhning (2017). In Böhning and Ogden (2020) a more general investigation of inflation models is delivered and their close connection to truncation models established. We see two major objectives for our paper: 1. With the motivation of the two case studies as background we are interested in raising the awareness of one-inflation in the capture-recapture context and its overestimation bias as consequence, 2. we are interested in illustrating some of the available approaches in estimating population size, with an emphasis on target populations that are small in size.
The rest of the paper is organized as follows: a probabilistic class of models that is based on zero-truncation and one-inflation is introduced in Sect. 2. In Sect. 3 goodness-of-fits of the case studies with respect to various count distributions are provided and it is found that the relatively simple geometric model seems to show up the best fit. Thus Horvitz-Thompson estimates based on zero-truncated oneinflated models are discussed in Sect. 4. Unconditional profile likelihood estimators under a geometric and a one-inflated geometric model are derived in Sect. 5. Section 6 is dedicated the idea to estimate the population size under a Bayesian setting. The performance of all estimating techniques discussed so far is evaluated by means of a Monte Carlo simulation study in Sect. 7. Section 8 presents ideas on how a semi-parametric bootstrap algorithm can be applied in order to find variance estimates and confidence intervals. The paper concludes with a short discussion that Table 2 Frequency distribution of count X of repeated identifications Table 3 Frequencies of the number of times dice snakes have been identified in the target area in 2014 Frequency of count of sightings per dice snake Table 4 Frequency distribution of the count (per star) of flares (Ambartsumyan et al. 1970) Frequency of count of flares is provided in Sect. 9. The analysis has been performed within the R environment and exploits various functions written by the authors that are available on request.

Modelling
For predicting f 0 some sort of modelling is unavoidable as the nonparametric estimates f x , x ¼ 1; . . .; m carry no information for f 0 . Hence, we need to find a base model for PðX ¼ xÞ ¼ b x ðhÞ so that an estimateĥ for h can be achieved. This leads to fitted probabilities b x ðĥÞ for x ¼ 0; 1; . . .; m, where m denotes the largest number of identifications. In particular, we can use for x ¼ 0 the Horvitz-Thompson-type estimator for estimating f 0f from which, ultimately, the population size estimatorN ¼ n þf 0 follows. For justified inference, the valid specification of the model b x ðhÞ is crucial.
For both case studies mentioned in the previous section, we see a large number of counts of ones, the singletons. Hence, we are concerned about one-inflation, a situation where more counts of ones occur than compatible with the baseline model b 1 ðhÞ as this can lead to a highly inflated estimate of f 0 as the following example shows. See also Godwin (2017) for further illustrations of this point.
A synthetic example. The empirical distribution of 500 counts sampled from a Poisson distribution with parameter 1 and 500 extra-counts of 1 so that N ¼ 1000 is shown in Table 5.
If we ignore the zeros and estimate h by means of zero-truncated maximum likelihood we findĥ ¼ 0:42344 and f 0 ¼ n expðÀĥÞ 1 À expðÀĥÞ ¼ 1582; clearly overestimating f 0 almost by a factor of 10.
To accommodate one-inflation we need to include it into the modelling. Hence we will focus on one-inflation modelling where p x ðhÞ ¼ b x ðhÞ=ð1 À b 0 ðhÞÞ is a zero-truncated base distribution and a 2 ½0; 1. As also mentioned in the synthetic example above, for the Poisson case we generally set The modelling is greatly simplified using the following general result from Böhning and van der Heijden (2019). Consider an arbitrary inflation point x 1 and an arbitrary count density p x ðhÞ with associated x 1 -inflation as Then the likelihood and log-likelihood functions are respectively, where p 1 ðhÞ ¼ p x 1 ðhÞ, f 1 ¼ f x 1 , and n is the sample size. Therefore the profile log-likelihood in h is andâ maximizes (2) for fixed h. It follows that 1 Àâ þâp 1 ðhÞ ¼ 1 À 1 À f 1 =n 1 À p 1 ðhÞ þ 1 À f 1 =n 1 À p 1 ðhÞ p 1 ðhÞ ¼ f 1 =n and the profile log-likelihood (3) becomes log PLðhjxÞ ¼f 1 log½1 Àâ þâp 1 ðhÞ þ X x6 ¼x 1 f x log p x ðhÞ þ ðn À f 1 Þ logâ as P x6 ¼x 1 f x ¼ n À f 1 . Thus, this x 1 -inflated profile log-likelihood equals the x 1 -truncated log-likelihood X x6 ¼x 1 f x log p x ðhÞ 1 À p 1 ðhÞ plus f 1 logðf 1 =nÞ þ ðn À f 1 Þ logð1 À f 1 =nÞ; which is independent of h. This result implies that x 1 -inflation models can be simply fitted by x 1 -truncated models.
To diagnose x 1 -inflation we may fit the x 1 -truncated log-likelihood and finally form the likelihood ratio statistic k ¼ 2 logðPL 1 ðĥjxÞ=L 0 ðĥjxÞÞ where is the non-inflated log-likelihood using all data. We apply these ideas to zero-truncated distributions. For an arbitrary count density b x ðhÞ, the base density, consider the associated zero-truncated count density According to the previous result, for the one-inflated density we can restrict inference on the zero-one-truncated density which then provides the one-inflated, zero-truncated density.

Finding the base distributions in the case studies
An important issue is the choice of the base distribution in the case studies. Graphical analysis using ratio plotting has been previously suggested; see Böhning et al. (2013Böhning et al. ( , 2018 or Böhning (2016). However, these techniques require large samples sizes to avoid misleading conclusions, and in the cases discussed here we have clearly small sizes. Hence we base our analysis on likelihood methods including information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). To cope with small samples we specifically use the modified version of the AIC in which the penalty for the model complexity, say 2k, is further increased by the factor 1 þ ðk þ 1Þ=ðn À k À 1Þ (see also McCrea and Morgan 2015). Table 6 provides a comparative analysis including the Poisson and geometric distribution as well as the negative-binomial distribution. For both data situations the best model is the geometric model since it shows up the smallest AIC c and BIC values.
For completeness, we mention the probability mass function of the negativebinomial with h ¼ ðl; dÞ: using the mean parameterization, so that E ðXÞ ¼ l and Var ðXÞ ¼ ð1 þ dlÞl, where l [ 0 is the mean and d [ 0 is the dispersion parameter. This yields the geometric distribution for d ¼ 1 and the Poisson as the limiting case when d ! 0. Table 6 gives evidence for both case studies that the geometric is a reasonable distribution here. However, the question arises if there is any evidence of one-inflation as the mere existence of many ones does not necessarily mean that there is one-inflation. Table 7 provides a diagnostic analysis of one-inflation. Note that we are testing here H 0 : a ¼ 1 vs. H 1 : a\1, so that the null-hypothesis is in the boundary of the alternative hypothesis and non-standard inference applies. In this case, the asymptotic distribution of the likelihood ratio test statistic 2 logðkÞ is a v 2distribution, namely where v 2 k is the v 2 -distribution with k degrees of freedom (Self and Liang 1987). v 2 0 is the singular distribution putting all its mass at 0. In practice, this means that conventional v 2 -values need to be halved. For example, for the dice snake data we have a value of the likelihood ratio statistic of 3.0 which would give a conventional p value of 0.084 under a v 2 -distribution with 1 df. Halving this leads to the correct p value of 0.042. For the Pleiades data we have a clear indication of one-inflation, whereas this is borderline for the dice snake data. As the overestimation effect of the population size is severe when ignoring one-inflation, we will use the one-inflation model when estimating population size.

Horvitz-Thompson estimation
The conventional Horvitz-Thompson estimator has the property E ðf 0 Þ ¼ Np 0 ðhÞ, if there is no inflation. The estimator (5) needs to be modified here as n contains the one-inflated part. This leads tô which again has the property E ðf 0 Þ ¼ Np 0 ðhÞ and, ultimately, we can define the modified Horvitz-Thompson estimator which is unbiased in the sense that E ðNÞ ¼ N.
As h is unknown, a plug-in estimate is used based on the 0-1-truncated geometric as evidenced in the previous analysis. In Table 8 we see the estimated population sizes for the two case studies. The conventional Horvitz-Thompson estimator (cHTE) uses the 0-truncated geometric distribution whereas the modified Horvitz-Thompson estimator (mHTE) uses the 0-1-truncated geometric as described above.

Unconditional maximum likelihood estimation
So far we maximized the conditional (zero-truncated) likelihood of the observed counts. In the following we discuss the general sampling mechanism that has generated these observations. In particular we will discuss the unconditional maximum likelihood approach which has a long history in capture-recapture modelling. Chao and Bunge (2002) give a nice discussion on the conditional and unconditional approach and how they connect. Sanathanan (1972Sanathanan ( , 1977 provides a comprehensive analysis of their statistical properties. The unconditional likelihood approach leads naturally, as we will see below, to a profile likelihood in N. The latter suggests also a generic way of constructing confidence intervals as outlined in Venzon and Moolgavkar (1988). Cormack (1992) provides an application to capture-recapture settings as well as do Lebreton et al. (1992). Let m denote the largest number of sightings, then the joint pmf of the sample is a multinomial model defined on the counts 0; 1; . . .; m coming from the population of size N. Since we only observe the counts of 1; . . .; m, the conditional model used is a zero-truncated multinomial for the n observed counts. This conditioning process is described by a binomial variable that is responsible for splitting the population into an observed part (of size n) and an unobserved part (of size N À n ¼ f 0 ). Together we have multinomðb 0 ðhÞ; b 1 ðhÞ; . . .; b m ðhÞjNÞ ¼multinom b 1 ðhÞ which allows now to check the validity of this factorization. Since f 1 ; . . .; f m are fixed given the observed counts, the relevant part of the unconditional likelihood is Therefore, we have to maximize the unconditional log-likelihood function For a given value of f 0 , the h-score function is This estimator depends on the value of N and thus on the unknown f 0 . We propose to evaluate the profile log-likelihood 'ðf 0 ;ĥjÁÞ for a grid of f 0 values to find the maximizerf 0 . This is shown in Fig. 1. Sincef 0 ¼ 286 with 95% profile confidence interval (159, 527) for f 0 , the total size of the population is estimated to be 356 snakes, which seems to be a reasonable number. This unconditional estimate can now be compared to the respective conditional estimateN c ¼ 358 given in Table 8.
We also apply this model to estimate the size of the flare stars. From the right panel of Fig. 1 we getf 0 ¼ 523 with 95% profile confidence interval (353, 782). The respective estimate of the population sizeN ¼ 668 is again slightly smaller compared withN c ¼ 671 from Table 8.
Under an arbitrary one-inflated count model the respective unconditional loglikelihood function (7) is . . .; m. Since for any fixed value of f 0 the derivation of the maximizer of this function w.r.t. a is the analogue to finding the maximizer (4) of the respective conditional log-likelihood (2), we immediately havê and define the profile log-likelihood function as log PLðh; f 0 jf 1 ; . . .; f m Þ ¼f 1 log Notice that in this unconditional log-likelihood the total population size N takes over the role of the observed sample size n in its conditional version and the sum also includes the additional term for x ¼ 0.
Under the one-inflated geometric situation the relevant term depending on h becomes the above profile log-likelihood simplifies to log PLðh; f 0 jf 1 ; . . .; f m Þ ¼f 1 log Since N ðÀ1Þ is a sum over all frequencies except f 1 , this score function actually depends on both, h and the unobserved f 0 . Thus, it is natural to find the maximizer of this profile log-likelihood using a grid of f 0 values and maximize the corresponding likelihood function in h conditional on each f 0 value. For the snake dataf 0 ¼ 45 maximizes the profile likelihood as shown in the left panel of Fig. 2. The respective population size estimateN ¼ 115 is therefore rather small. The reason for this surprising result might be the fairly wide 95% profile confidence interval (7, 399), reflecting the large variance of the estimatorf 0 . The situation is similar with the flare stars data shown in the right panel of Fig. 2. Sincef 0 ¼ 53 with 95% profile confidence interval (17, 165), the estimateN ¼ 198 is fairly small compared to the results under the previously considered models.

Bayesian analysis
We use the geometric density in the form 1 expðaÞ þ 1 expðaÞ x ðexpðaÞ þ 1Þ x ¼ hð1 À hÞ x with h ¼ 1=ðexpðaÞ þ 1Þ and for x ¼ 0; 1; . . .. Now, as we have 0-1-truncated data the associated 0-1-truncated density is 1 expðaÞ þ 1 expðaÞ xÀ2 ðexpðaÞ þ 1Þ xÀ2 ¼ hð1 À hÞ xÀ2 which is the form we use for the analysis. A non-informative normal prior (mean zero and standard deviation 100) is used for a and the Bayesian analysis is implemented using 10 parallel Markov chains, each run to produce a sample size of 10,000 after 2500 burn-in iterations. A nonparametric density estimate (Epanechnikov kernel with optimal bandwidth) for the posterior distribution of a is given in Fig. 3 for the dice snake data (left panel) and the flare stars data (right panel). To find the posterior distribution of N we use (6) for the geometric pmf, i.e. the transformationN Note that the latter is a monotone increasing function in the interval (0, 1). The associated values for the median as well as for the 95% HPD credible interval are given in Table 9.
As an alternative we might consider a Bayesian analysis using Jeffreys' invariance prior. This leads here to a prior distribution proportional to 1=h ffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi ð1 À hÞ p which corresponds to a beta distribution with parameters a ¼ n and The corresponding values for median, 0.025-quantile, and 0.975-quantile of this posterior are provided in Table 10.

Monte Carlo simulation
To allow a comparison on the performance of all suggested estimators we are including some simulation work to provide a more in-depth study of these suggestions. We have considered the following population sizes for N: 50, 100, 200, 500, 1000. We did not consider any sizes larger than 1000 as this deemed not appropriate for our setting and also most differences between estimators can be expected for smaller sizes. We have chosen the geometric distribution as baseline distribution with parameter values h ¼ 0:3; 0:4; 0:5. We studied three settings: no,  Fig. 3 Posterior distributions of the a parameter of the 0-1-truncated geometric distribution for the dice snake data (left) and the flare stars data (right); the smooth curves represent the normal densities with parameters replaced by respective posterior sample estimates  10% and 30% one-inflation. N units were sampled under the respective setting and zero-counts removed. Then six estimators were considered: the modified Chao estimator (1) discussed in Böhning et al. (2019), the estimator based on the conditional likelihood with no one-inflation (2), the estimator based on the conditional likelihood with one-inflation modelled (3), the estimator based on the unconditional likelihood with no one-inflation (4), the estimator based on the unconditional likelihood with one-inflation modelled (5) and the Bayes estimator using Jeffreys' prior (6). We like to provide results as relative bias and relative standard deviation defined as respectively. Here R is the number of replications andN r is the estimate of interest in the rÀth simulation run while N is the mean estimate of interest over all simulation runs. These relative definition forms are required to allow comparisons across different population sizes and also to permit meaningful asymptotic statements. Note that the usual relationship rb 2 þ rsd 2 ¼ rmse holds where ðN r À NÞ 2 =N 2 is the relative mean squared error. The results for R ¼ 1000 are presented visually in Figs. 4, 5 and 6 and show a clear picture. Estimators (2) and (4) show high overestimation bias under oneinflation, all other estimators behave reasonable in all settings with respect to bias and appear to be asymptotically unbiased. The modified Chao estimator shows larger variance than the conditional and unconditional estimators as well as the Bayes estimator. However, it should be kept in mind that the modified Chao estimator does allow heterogeneity under the geometric sampling distribution. In summary, the unconditional and Bayes estimator seem to perform best among the considered estimators.

Variance and bootstrap
For finding a variance and confidence interval estimate of the population size estimate under the zero-truncated one-inflated model we use the bootstrap approach. The conventional, nonparametric bootstrap works as follows: 1. Draw a sample of size N from the observed distribution defined by the relative frequencies f 0 =N; f 1 =N; . . .; f m =N.  The problem with this bootstrap algorithm is that neither f 0 nor N are known. This has been acknowledged in the capture-recapture community for some time. Norris and Pollock (1996) suggest three bootstrap methods to account for the uncertainty involved in estimating N. One method bases the bootstrap only on the observed sample size n and estimates the remaining uncertainty analytically. Another method uses a bootstrap based on a complete modelling approach. The third method they inflation modelled (5) and the Bayes estimator using Jeffreys' prior (6). The results of the bootstrap procedure are provided in Table 11. Due to the small sample size, the confidence intervals are rather wide. The upper interval end provides for both case studies valuable information on an upper bound for the hidden population units. Due to the sparsity of the data the bootstrap samples generate occasionally very large population size estimates. Typical, we would expect the bootstrap mean to be close to the population size estimate. However, this is not the case due to the occasional occurrence of spurious large size estimates. The bootstrap median does get close to the sample population size estimate. This aspect is of interest in practice as we might want to check if the bootstrap median is in agreement with the population size estimate for the given sample as this could indicate that the latter is spurious. It can be expected that also the conventional bootstrap standard deviation experiences a similar inflation and estimating the true variation by means of (8) is likely more useful. Note that the ranking of estimators according to BTse (8) is in line with the results of the simulation study. We can ignore the two estimators under no inflation as these are using a wrong model which contributes to their large variance in this case. Under the remaining estimators, Chao's modified estimator has by far the largest standard error. Jeffreys' Bayes estimator and the conditional under one-inflation are on par whereas the unconditional under one-inflation seems to perform best.

Discussion and conclusion
It is widely known that parameter heterogeneity which is not accounted for in the modelling can lead to severe bias in the estimation of population size. However, it is mostly assumed that the bias occurs in a form of underestimation. IWGDMF (1995) provide a generic argument for this fact. In particular, this is justified in zerotruncated count models as any heterogeneity which can be modelled as a mixture of parametric densities leads to an underestimation bias if the mixture is ignored and only a homogeneous model is fitted (Böhning andSchön 2005, van der Heijden et al. 2003). Here we have seen that in the case of one-inflation heterogeneity serious overestimation of population size can occur. This is particularly disturbing if Chao's lower bound estimator (Chao 1987(Chao , 1989) is used which seemingly provides a lower bound estimate for population size whereas under one-inflation the opposite is true. We have seen that under sparsity modelling of the remaining counts (after truncating inflated counts) is crucial for the predictive value. Of course, having a well-fitting model for the observed data does not automatically apply it is also a good fit for the unobserved part as the model might not be valid for this part. This assumption needs to be made and it is untestable given the data constellation for this paper. The inclusion of covariates (if available) will always help to improve the fit of the model and increase the likelihood of valid predictions of population size. This aspect needs to be investigated in future research.
All estimation methods provide similar results. The modified Chao estimator is least favourable as its standard error is relatively large when compared to the others, but has the benefit of avoiding distributional assumptions. The unconditional and the Bayesian approach both seem to perform better than the conditional one. Most important is, and this cannot be emphasized enough, that one-inflation is not ignored, as, if it is, it leads not only to large bias in the estimate but also inflates standard errors considerably.
Clearly, due to the small sample sizes, confidence intervals based on profile loglikelihoods are rather wide, but we like to notice, however, that standard errors of estimators based on the appropriate one-inflated model are remarkably smaller than those ignoring one-inflation (see Table 11).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creativecommons.org/licenses/by/4.0/.