Introduction

Diversity indices are quantitative measures that capture both richness, the number of categories, and the evenness of their relative abundances. See Rao (1982), Ludwig and Reynolds (1988), and Patil and Taillie (1979) for further information. Measuring the diversity index of a population is important. For example, in ecology, a gradual decline in diversity over time may indicate the gradual extinction of an ecosystem, while a rapid decline may indicate an extinction caused by some sudden impact. On this basis, scientists have argued that the extinction of the dinosaurs was due to a large asteroid impact roughly contemporaneous with the end of the Cretaceous. The Gini-Simpson index (GS) and Shannon’s entropy are the two best-known diversity measures. They are widely used in modern sciences such as ecology, demography, anthropology, and information theory. See Hurlbert (1971), Peet (1974), Hunter and Gaston (1988), and Rogers and Hsu (2001).

Consider a population with K species for which \(p_{i}\) denotes the relative abundance of species i (i=1,…,K) such that \(\sum \limits _{i=1}^{K}p_{i}=1\). Simpson (1949) proposed the index

$$ \lambda=\sum\limits_{i=1}^{K}p_{i}^{2} $$
(1)

to measure the degree of concentration for the population. Gini-Simpson’s index is defined as

$$ GS=\sum\limits_{i=1}^{K} p_{i}(1-p_{i})=1-\lambda. $$
(2)

There are also many other indices in the literature. See Shannon (1948), Good (1953), Renyi (1961), and Hill (1973), among others. In the biodiversity literature, according to Ricotta (2005), there is a “jungle” of biological measures of diversity. For a comprehensive discussion of the various relationships among these indices, one may refer to Rennolls and Laumonier (2006) and Mao (2007).

Let \(\{X_{i}\}_{i=1}^{n}\) be an iid sample from the population \(\{p_{k}; k=1,\dots,K\}\), and let \(f_{k}\) denote the observed frequency of the kth category. Let \(\hat {p}_{k}=\frac {f_{k}}{n}\) and \(\hat {P}=\{\hat {p}_{k}; k=1,\dots, K\}\). The most important estimator of GS is the MLE

$$ \widehat{GS}=1-\sum\limits_{k=1}^{K} \hat{p}_{k}^{2}. $$
(3)

When K is finite, the MLE is asymptotically normal if the underlying distribution is inhomogeneous and asymptotically Chi-square distributed if the underlying distribution is homogeneous. Another closely related estimator is given by

$$ \frac{n}{n-1}\left[1-\sum\limits_{k=1}^{K}\hat{p}_{k}^{2}\right]=\frac{n}{n-1}\left[1-\sum\limits_{k=1}^{K}\left(\frac{f_{k}}{n}\right)^{2}\right]. $$
(4)

Bhargava and Uppuluri (1977) showed that it is unbiased and established its asymptotic distribution.
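For concreteness, both estimators are simple functions of the observed frequencies \(f_{k}\); here is a minimal sketch of the two (the frequency vector below is purely illustrative):

```python
import numpy as np

def gs_mle(counts):
    """MLE of the Gini-Simpson index, Eq. (3): 1 - sum_k (f_k/n)^2."""
    n = counts.sum()
    return 1.0 - np.sum((counts / n) ** 2)

def gs_unbiased(counts):
    """Bias-corrected estimator of Eq. (4): n/(n-1) times the MLE."""
    n = counts.sum()
    return n / (n - 1) * gs_mle(counts)

counts = np.array([30, 25, 20, 15, 10])  # illustrative frequencies f_k
print(gs_mle(counts), gs_unbiased(counts))
```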

Although the MLE is asymptotically efficient when K is not large relative to the sample size, it does not work well when K is large or infinite. This is easy to understand: there are only about n/K observations on average for estimating each parameter, so the MLE is inefficient when n/K is small. In fact, \(\widehat {GS}\) is inconsistent when \(K=\infty\) or when \(K=K_{n}\rightarrow\infty\) too fast, and one cannot use modern penalized estimation, for example the lasso, to estimate the \(p_{k}\), since there is no sparsity structure here. As will be shown in this paper, the MLE also works in the case \(K=\infty\), but only under some restrictions. Most existing methodologies make some adjustment to deal with this problem but result in very complicated forms with less tractable distributional characteristics. Practical techniques include the jackknife and the bootstrap; see Fritsch and Hsu (1999). Zhang and Zhou (2010) studied a group of estimators for \(\zeta_{u,v}\). Because of these problems, little is known about the asymptotic distributional characteristics except in a naive approach. This motivates us to propose a new approach to estimating the GS index. Our new estimator is unbiased, asymptotically normal, and efficient in all cases with respect to the number of species K.

The remainder of the paper is organized as follows. In “A general birthday problem” section, the birthday problem is generalized to cases with unequal probabilities and infinitely many categories, and the connection between the generalized birthday problem and the GS index is established. In “The estimator” section, based on this connection, an unbiased estimator of the GS index is proposed. In “Asymptotic properties” section, asymptotic normality is derived under all three cases with respect to the number of species in the population. In “Examples and simulation studies” section, an empirical study of dinosaur extinction data and a simulation study are employed to demonstrate the performance of our estimator.

A general birthday problem

The birthday problem is a classic example in standard textbooks such as Feller (1971). The problem is to find the probability that, among n students in a class, no two or more students share the same birthday, under the assumption that individuals’ birthdays are independent and that, for every individual, all 365 days of the year are equally likely as possible birthdays. It has been generalized in many ways under the uniform probability assumption. See Johnson and Kotz (1977) and Fang (1985), among others. Birthday problems with unequal probabilities have also been studied over the years. For recent works, see Joag-Dev and Proschan (1992) and Wagner (2002), among others.

Similar to the Bernoulli trial, we define a categorical trial X as a random experiment with K possible outcomes (categories) and probability distribution \(P=\{p_{k}: k=1,\dots,K\}\), where K is finite (known or unknown) or infinite. We call it “a success of category i” if the outcome of a categorical trial belongs to category i. Consider an independent sequence of categorical trials \(\{X_{i}; i=1,2,\dots\}\) in which the probability of success of each category remains the same across trials. Let \(H_{m}\) be the number of distinct categories appearing in the first m trials. We assume m≥2, since the case m=1 is trivial. Calculating the probability distribution of \(H_{m}\) is generally referred to as the birthday problem with unequal probabilities. See the references above. Let \(Y_{k}\) be the number of successes of the kth category in the first m trials, and let \(I_{k}=1_{(Y_{k}=0)}\) for k=1,…,K be the indicator function with \(I_{k}=1\) if the kth category does not appear in the sample. Then

$$ H_{m}=\sum\limits_{k=1}^{K} \left(1-I_{k}\right). $$
(5)

Theorem 1

For fixed m and finite or infinite value of K, we have

$$\begin{array}{@{}rcl@{}} E(H_{m})&=&{m\choose 1}\sum\limits_{k=1}^{K} p_{k}- {m\choose 2}\sum\limits_{k=1}^{K} p_{k}^{2}+\cdots +{m\choose m}(-1)^{m+1}\sum\limits_{k=1}^{K} p_{k}^{m},\\ Var(H_{m}) &=& \sum\limits_{k=1}^{K} (1-p_{k})^{m}\left[1-(1-p_{k})^{m}\right]+2\sum\limits_{1\leq i<j\leq K}\left[(1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m}\right]. \end{array} $$

The proof of the theorem is given in the Appendix.
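As a quick numeric sanity check of these moment formulas, one can simulate \(H_{m}\) directly; a minimal sketch, in which the distribution, m, and the replication count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])  # illustrative distribution
m, reps = 4, 200_000

# E(H_m) via Theorem 1 (equivalently, sum_k [1 - (1 - p_k)^m])
e_hm = np.sum(1 - (1 - p) ** m)

# Var(H_m) via the variance formula of Theorem 1
var_hm = np.sum((1 - p) ** m * (1 - (1 - p) ** m))
for i in range(len(p)):
    for j in range(i + 1, len(p)):
        var_hm += 2 * ((1 - p[i] - p[j]) ** m
                       - (1 - p[i]) ** m * (1 - p[j]) ** m)

# simulate H_m: the number of distinct categories in m trials
samples = rng.choice(len(p), size=(reps, m), p=p)
h = np.array([len(set(row)) for row in samples])
print(e_hm, h.mean())    # should agree up to Monte Carlo error
print(var_hm, h.var())   # should agree up to Monte Carlo error
```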

Remark 1

It is easy to see that \(Var(H_{m})\) is finite for fixed m. In fact, \(Var(H_{m})<m^{2}\).

Now we are ready to establish the connection between the generalized birthday problem and the GS index. A categorical trial with K categories and probability distribution \(P=\{p_{k}: k=1,\dots,K\}\) corresponds to a population of K species with relative abundances \(\{p_{k}: k=1,\dots,K\}\). The first m trials in the independent sequence of categorical trials \(\{X_{i}; i=1,2,\dots\}\) can therefore be viewed as a random sample of size m from the corresponding population, and consequently \(H_{m}\) represents the number of distinct species in a random sample of size m. The following theorem shows that \(GS=1-{\sum \nolimits }_{k=1}^{K} p_{k}^{2}\) equals \(E(H_{2})-1\).

Theorem 2

Consider a population with K species and relative abundances \(P=\{p_{k}; k=1,\dots,K\}\). Then,

$$ GS=E(H_{2})-1, $$
(6)

where \(H_{2}\) is the number of distinct species in a random sample of size 2.

The theorem follows directly from Theorem 1 with m=2 and the definition of GS (Eq. (2)). It indicates that GS is an estimable parameter under population P.
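Since \(H_{2}=2-1_{(X_{1}=X_{2})}\), Theorem 2 amounts to the identity \(E(H_{2})=2-\sum_{k}p_{k}^{2}\), which can be checked in a few lines; a minimal sketch with an illustrative distribution:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])      # illustrative distribution
gs = 1.0 - np.sum(p ** 2)          # Eq. (2)
e_h2 = 2.0 - np.sum(p ** 2)        # E(H_2) = 2 - P(X_1 = X_2)
assert np.isclose(gs, e_h2 - 1.0)  # Theorem 2: GS = E(H_2) - 1
```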

The estimator

Let \(\{X_{i}\}_{i=1}^{n}\) be an iid random sample of size n from population P with finite or infinite K. For any sub-sample \(\{X_{i_{1}},\dots, X_{i_{m}}:\, 1\leq i_{1}<\dots <i_{m}\leq n\}\) from this sample, \(H_{m}(X_{i_{1}},\dots, X_{i_{m}})\) is the number of distinct species in the sub-sample; in particular, \(H_{m}(X_{i_{1}},\dots, X_{i_{m}})\) is a symmetric function of its arguments. Define the following U-statistic:

$$ Z_{n,m}={{n}\choose{m}}^{-1}\sum\limits_{c} H_{m}(X_{i_{1}},\dots, X_{i_{m}}), $$
(7)

where \({\sum \nolimits }_{c}\) denotes the summation over all the \({{n}\choose {m}}\) combinations of m distinct elements \(\{i_{1},\dots,i_{m}\}\) from \(\{1,2,\dots,n\}\). Then \(Z_{n,m}\rightarrow E(H_{m})\) almost surely as \(n\rightarrow\infty\), by the asymptotic theory of U-statistics in DasGupta (2008). This motivates us to estimate GS by

$$ \widehat{GS}_{1}=Z_{n,2}-1. $$
(8)

It is easy to verify that \(\widehat {GS}_{1}=Z_{n,2}-1\) is always an unbiased estimator of GS. In fact, \(H_{2}-1\) is an unbiased estimator of GS by Theorem 2, and \(Z_{n,2}-1\) is the average of \(H_{2}-1\) over all \({n\choose 2}\) sub-samples of size 2 from the full set of observations.
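Computationally, \(Z_{n,2}\) need not be formed by enumerating all \({n\choose 2}\) sub-samples: since \(H_{2}(x,y)=2-I_{x=y}\) (as used in the proofs below), category k contributes \({f_{k}\choose 2}\) matching pairs, giving \(\widehat{GS}_{1}=1-\sum_{k} f_{k}(f_{k}-1)/\left(n(n-1)\right)\). A minimal sketch of this shortcut (the sample below is illustrative):

```python
import numpy as np

def gs1(counts):
    """U-statistic estimator of Eq. (8): GS_1 = Z_{n,2} - 1.

    H_2(x, y) = 2 - 1{x = y}, and category k contributes C(f_k, 2)
    matching pairs, so Z_{n,2} = 2 - sum_k C(f_k,2) / C(n,2) and
    GS_1 = 1 - sum_k f_k (f_k - 1) / (n (n - 1)).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    return 1.0 - np.sum(counts * (counts - 1.0)) / (n * (n - 1.0))

# illustrative usage on a simulated sample of size 100
rng = np.random.default_rng(0)
x = rng.choice(5, size=100, p=[0.4, 0.25, 0.15, 0.12, 0.08])
print(gs1(np.bincount(x)))
```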

Asymptotic properties

Asymptotic properties of the MLE

We first prove the asymptotic normality of \(\widehat {GS}\) when \(K=\infty\), that is, when there are infinitely many species in the population. Assume the probability distribution is \(P=\{p_{i}; i=1,2,\dots\}\) with \(p_{i}\geq p_{i+1}\) for all i and \(\sum \limits _{i=1}^{\infty } p_{i}=1\), with corresponding Gini-Simpson index \(GS=1-\sum \limits _{i=1}^{\infty } p_{i}^{2}=1-\lambda\). We have the following result.

Theorem 3

Let \(P=\{p_{i}; i=1,2,\dots\}\) be the probability distribution of a population with infinitely many species, and write \(p_{N,+}=\sum_{i=N}^{\infty}p_{i}\). Assume that there exists a sequence \(\{N_{n}\}_{n=1}^{\infty }\) such that \(np_{N_{n}+1,+}\rightarrow 0\). Then we have the following:

$$\frac{\sqrt{n}\left(\widehat{GS}-GS\right)}{\hat{\sigma}}\overset{d}{\to} N(0,1), $$

where

$$ \hat{\sigma}^{2}=4\left[\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{3}-\left(\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{2}\right)^{2}\right]. $$
(9)

The proof is given in the Appendix.

The following theorem is implied by Bhargava and Uppuluri (1977) when K is finite, whether the underlying distribution is homogeneous or inhomogeneous.

Theorem 4

If the underlying population distribution is inhomogeneous, then

$$ \frac{\sqrt{n}\left(\widehat{GS}-GS\right)}{\hat{\sigma}}\overset{d}{\to} N(0,1) $$
(10)

where

$$\hat{\sigma}^{2}=4\left[\sum\limits_{k=1}^{K} \hat{p}_{k}^{3}-\left(\sum\limits_{k=1}^{K} \hat{p}_{k}^{2}\right)^{2}\right]. $$

If the underlying population distribution is homogeneous, we have

$$ nK\left(\widehat{GS}-GS\right)\overset{d}{\to}-\chi_{K-1}^{2}. $$
(11)

Asymptotic properties of \(\widehat {GS}_{1}\)

The above U-statistic construction paves the way to establishing the asymptotic normality of \(Z_{n,2}\). For an iid random sample \(\{X_{i}; i=1,\dots,n\}\) under the distribution P, \(\theta=\theta(P)\) is an estimable parameter and \(h(X_{1},\dots,X_{m})\) is a symmetric kernel satisfying \(E_{P}\{h(X_{1},\dots,X_{m})\}=\theta(P)\). Let \(U_{n}={{n}\choose {m}}^{-1}\sum \limits _{c}h(X_{i_{1}},\dots, X_{i_{m}})\), where \({\sum \nolimits }_{c}\) is the summation over the \({{n}\choose {m}}\) combinations of m distinct elements \(\{i_{1},\dots,i_{m}\}\) from \(\{1,\dots,n\}\). Let \(h_{1}(x_{1})=E_{P}\{h(x_{1},X_{2},\dots,X_{m})\}\) be the conditional expectation of h given \(X_{1}=x_{1}\), and \(\sigma _{1}^{2}=Var_{P}\{h_{1}(X_{1})\}\). Then we have the following proposition by Hoeffding (1948).

Proposition 1

If \(E_{P}(h^{2})<\infty\) and \(\sigma _{1}^{2}>0\), then \(\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\sigma _{1}^{2}\right)\).

From Remark 1, \(E_{P}(h^{2})=Var(H_{2}(X_{1},X_{2}))+\left(E(H_{2}(X_{1},X_{2}))\right)^{2}\leq 4+(1+GS)^{2}<\infty\). Note that \(h_{1}(x_{1})=E_{P}\left(2-I_{X_{2}=x_{1}}\right)=2-p_{x_{1}}\). It follows that

$$ \sigma_{1}^{2}=Var_{P}\left(h_{1}(X_{1})\right)=Var_{P}\left(2-p_{X_{1}}\right)=Var_{P}\left(p_{X_{1}}\right)=\sum\limits_{k=1}^{K}p_{k}^{3}-\left(\sum\limits_{k=1}^{K}p_{k}^{2}\right)^{2}\geq 0. $$
(12)

The equality holds if and only if the probability distribution \(\{p_{k}: k=1,\dots,K\}\) is uniform. Of course, if \(K=\infty\), the inequality holds strictly, since the distribution can never be uniform. Therefore, we have the following theorem.

Theorem 5

If the distribution \(\{p_{k}: k=1,\dots,K\}\) is not uniform, then

$$ \sqrt{n}\left(\widehat{GS}_{1}-GS\right) \overset{d}{\to} N\left(0,4\sigma_{1}^{2}\right). $$
(13)

Remark 2

Non-uniform distributions include two cases: non-uniform finite distributions (\(K<\infty\)) and infinite distributions (\(K=\infty\)).

By (7), (12), and Theorem 1, it is easy to see that

$$ \hat{\sigma}_{1}^{2}=Z_{n,3}-Z_{n,2}^{2}+Z_{n,2}-1 $$
(14)

is a consistent estimator of \(\sigma _{1}^{2}\). Hence the following corollary is established.

Corollary 1

Under the conditions of Theorem 5, we have

$$ \frac{\sqrt{n}\left(\widehat{GS}_{1}-GS\right)}{2\hat{\sigma}_{1}} \overset{d}{\to} N(0,1). $$
(15)
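Both \(Z_{n,2}\) and \(Z_{n,3}\) in Eq. (14) are computable directly from the frequencies, since category k appears in \({n\choose m}-{n-f_{k}\choose m}\) of the \({n\choose m}\) sub-samples of size m. A minimal sketch of the resulting interval (the function names are ours):

```python
import numpy as np
from scipy.special import comb
from scipy.stats import norm

def z_nm(counts, m):
    """Z_{n,m} of Eq. (7): category k appears in C(n,m) - C(n-f_k,m)
    of the C(n,m) sub-samples of size m."""
    n = counts.sum()
    return np.sum(1.0 - comb(n - counts, m) / comb(n, m))

def gs1_ci(counts, level=0.95):
    """Normal-approximation interval for GS from Corollary 1,
    with the variance estimator of Eq. (14)."""
    n = counts.sum()
    z2, z3 = z_nm(counts, 2), z_nm(counts, 3)
    sigma1_sq = max(z3 - z2 ** 2 + z2 - 1.0, 0.0)   # Eq. (14)
    half = norm.ppf(0.5 + level / 2) * 2.0 * np.sqrt(sigma1_sq / n)
    return z2 - 1.0 - half, z2 - 1.0 + half
```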

For homogeneous distributions, we have the following result.

Theorem 6

If the distribution \(\{p_{k}: k=1,\dots,K\}\) is homogeneous, then

$$ nK\left(\widehat{GS}_{1}-GS\right)\overset{d}{\to} -\left(\chi_{K-1}^{2}-K+1\right). $$
(16)

The proof is given in the Appendix. Compared with the MLE, our estimator achieves the same asymptotic behavior in the homogeneous case.

Examples and simulation studies

Example 1

(Dinosaur Extinction) The cause of the extinction of dinosaurs at the end of the Cretaceous period remains a mystery. Among all the theories, it is now widely accepted that the extinction was due to a large asteroid impact at the end of the Cretaceous. Sheehan et al. (1991) argued that diversity remained relatively constant throughout the Cretaceous period. The scientists reasoned that if the disappearance of the dinosaurs had been gradual, one should observe a decline in diversity prior to extinction.

The data were organized by dividing the formation into three equally spaced stratigraphic levels, each of which represented a period of approximately 730,000 years. Fossils were cross-tabulated according to the stratigraphic level and the family to which the dinosaur belonged. The families represented are Ceratopsidae, Hadrosauridae, Hypsilophodontidae, Pachycephalosauridae, Tyrannosauridae, Ornithomimidae, Saurornithoididae, and Dromaeosauridae. The summarized data are shown in Table 1, available in Rogers and Hsu (2001).

Table 1 Dinosaur counts by family and stratigraphic level

Let us denote the true values of the GS index at the Lower, Middle, and Upper levels by \(GS_{L}\), \(GS_{M}\), and \(GS_{U}\), respectively. It is interesting to ask whether the dinosaur diversity changed.

To address the question, we present 95% simultaneous confidence intervals for all the pairwise contrasts: \(GS_{L}-GS_{M}\), \(GS_{L}-GS_{U}\), and \(GS_{M}-GS_{U}\).

Using the expressions for \(\widehat {GS}_{1}\) and \(\hat {\sigma }_{1}^{2}\) from the previous section and the normal approximation in our theorems, we obtain simultaneous confidence intervals for all pairwise contrasts. The results are provided in Table 2.

Table 2 95% simultaneous confidence intervals for all pairwise contrasts
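The paper does not spell out the multiplicity adjustment behind Table 2. One plausible reconstruction treats the three levels as independent samples, estimates each variance by \(4\hat{\sigma}_{1}^{2}/n\) (Corollary 1), and applies a Bonferroni correction over the three contrasts; the sketch below follows that assumption and is not necessarily the exact construction used:

```python
import numpy as np
from scipy.special import comb
from scipy.stats import norm

def z_nm(counts, m):
    n = counts.sum()
    return np.sum(1.0 - comb(n - counts, m) / comb(n, m))

def gs1_and_var(counts):
    """GS_1 (Eq. (8)) and its estimated variance 4*sigma_1^2/n (Eq. (14))."""
    n = counts.sum()
    z2, z3 = z_nm(counts, 2), z_nm(counts, 3)
    return z2 - 1.0, 4.0 * max(z3 - z2 ** 2 + z2 - 1.0, 0.0) / n

def contrast_ci(counts_a, counts_b, n_contrasts=3, level=0.95):
    """Bonferroni-adjusted interval for GS_a - GS_b,
    assuming independent samples at the two levels."""
    ga, va = gs1_and_var(np.asarray(counts_a, dtype=float))
    gb, vb = gs1_and_var(np.asarray(counts_b, dtype=float))
    z = norm.ppf(1.0 - (1.0 - level) / (2.0 * n_contrasts))
    half = z * np.sqrt(va + vb)
    return (ga - gb) - half, (ga - gb) + half
```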

Since all the confidence intervals contain zero, we may infer that the three communities were practically equivalent with respect to the GS index. That is, there is no significant change or decline in diversity over time. Therefore, our study supports the theory of a sudden extinction of the dinosaurs.

Our proposed estimator has advantages over the MLE when the sample size n is not large relative to the number of species K, especially when \(K=\infty\). In the following, we conduct a simulation study for \(K=\infty\); simulations for other scenarios are omitted to save space.

Example 2

(\(K=\infty\)) Consider the population \(\{p_{k}=e^{-(k-1)/10}-e^{-k/10}:\ k\geq 1\}\). It is easy to calculate the true value of GS for this distribution:

$$GS=1-\sum\limits_{k=1}^{\infty} p_{k}^{2}=0.95004. $$

We generate random samples of sizes n=10, 50, and 100, and calculate the MLE \(\widehat {GS}\) and our proposed estimator \(\widehat {GS}_{1}\), together with their estimated standard deviations (Eqs. (9) and (14)). The simulation is based on 500 replications, and the results are obtained by averaging the corresponding estimates over the replications. Also, since the population distribution is known to be non-uniform, we apply \(\widehat {GS}_{1}\) directly, for the reason mentioned before. The simulation results are summarized in Table 3.

Table 3 Estimates of GS for Example 2
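A sketch reproducing this design; the truncation point \(K_{\max}\) (used only for sampling, with numerically negligible tail mass) and the seed are our choices:

```python
import numpy as np

rng = np.random.default_rng(2023)
K_max = 500
k = np.arange(1, K_max + 1)
p = np.exp(-(k - 1) / 10) - np.exp(-k / 10)   # population of Example 2
p = p / p.sum()                               # renormalize the tiny truncated tail

a = np.exp(-0.1)
gs_true = 1.0 - (1.0 - a) / (1.0 + a)         # sum_k p_k^2 = (1-a)/(1+a), so GS = 0.95004...

for n in (10, 50, 100):
    mle, prop = [], []
    for _ in range(500):
        f = np.bincount(rng.choice(K_max, size=n, p=p)).astype(float)
        mle.append(1.0 - np.sum((f / n) ** 2))                    # Eq. (3)
        prop.append(1.0 - np.sum(f * (f - 1.0)) / (n * (n - 1)))  # Eq. (8)
    print(n, gs_true, np.mean(mle), np.mean(prop))
```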

From Table 3, we see that the deviations of the MLE from the true value \(GS=0.95004\) are much greater than those of our proposed estimates. This is due to the facts that \(\widehat {GS}\) has a large bias and that the sample coverage is limited when the sample size is small relative to the number of species. Our proposed estimator, instead, overcomes these obstacles, since it is an unbiased estimator of GS. It is also shown that our proposed estimator has a smaller variance.

Discussion

The birthday problem has been studied and extended in different forms and in many different areas. The same is true for diversity measures. The connection between these two topics is established in this paper through \(H_{2}\) and the most widely used Gini-Simpson index. There are many other related diversity indices in the literature, such as Shannon’s entropy and Renyi’s index. For these indices, one can also construct corresponding estimators in a similar way through the result in Theorem 1. The advantage of our approach over the MLE is obvious when the sample size is not large relative to the number of species. There are many other open problems built on this connection between the birthday problem and diversity measures. For example, further investigation is needed to study the estimation of mutual information in view of the generalized birthday problem. Our approach provides a framework for solving various problems inherited from diversity measures.

Appendix 1: Proof of Theorem 1

Theorem 1

For fixed m and finite or infinite value of K, we have

$$ E(H_{m})={m\choose 1}\sum\limits_{k=1}^{K} p_{k}- {m\choose 2}\sum\limits_{k=1}^{K} p_{k}^{2}+\cdots +{m\choose m}(-1)^{m+1}\sum\limits_{k=1}^{K} p_{k}^{m}, $$
$$\begin{array}{*{20}l} Var(H_{m}) = \sum\limits_{k=1}^{K} (1-p_{k})^{m}\left[1-(1-p_{k})^{m}\right]+2\sum\limits_{1\leq i<j\leq K}\left[(1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m}\right]. \end{array} $$

Proof

We first establish the following lemma.

Lemma 1

For the class of random variables \(\{I_{k}; k=1,\dots,K\}\), we have

$$\begin{array}{*{20}l} E(I_{k}) &=(1-p_{k})^{m}; \end{array} $$
(17)
$$\begin{array}{*{20}l} Var(I_{k}) &=(1-p_{k})^{m}-(1-p_{k})^{2m}, \end{array} $$
(18)
$$\begin{array}{*{20}l} Cov(I_{i},I_{j}) &= (1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m},\ \text{for}\ i\neq j. \end{array} $$
(19)

Lemma 1 can be verified easily.

When K is finite, the following equations are easily established.

$${}\begin{aligned} E(H_{m})&=\sum\limits_{k=1}^{K}\left(1-EI_{k}\right) \\ &= \sum\limits_{k=1}^{K}\left(1-(1-p_{k})^{m}\right) \\ &= \sum\limits_{k=1}^{K}\left({m\choose 1}p_{k}- {m\choose 2}p_{k}^{2}+\cdots +{m\choose m}(-1)^{m+1}p_{k}^{m} \right) \\ &= {m\choose 1}\sum\limits_{k=1}^{K} p_{k}- {m\choose 2}\sum\limits_{k=1}^{K} p_{k}^{2}+\cdots +{m\choose m}(-1)^{m+1}\sum\limits_{k=1}^{K} p_{k}^{m}, \text{ and }\\ Var\left(H_{m}\right) &=Var \left[\sum\limits_{k=1}^{K}\left(1-I_{k}\right)\right] \\ &=\sum\limits_{k=1}^{K} Var\left(1-I_{k} \right)+2\sum\limits_{1\leq i<j\leq K} Cov(1-I_{i},1-I_{j}) \\ &= \sum\limits_{k=1}^{K} Var\left(I_{k} \right)+2\sum\limits_{1\leq i<j\leq K} Cov(I_{i},I_{j}) \\ &= \sum\limits_{k=1}^{K} (1-p_{k})^{m}\left[1-(1-p_{k})^{m}\right]+2\sum\limits_{1\leq i<j\leq K}\left[(1-p_{i}-p_{j})^{m}-(1-p_{i})^{m}(1-p_{j})^{m}\right]. \end{aligned} $$

When K is infinite, the above equations are guaranteed by the dominated convergence theorem. In fact, we have \(H_{m}\leq m\) and \(H_{m}^{2}\leq m^{2}\). □

Appendix 2: Proof of Theorem 3

Theorem 3

Let \(P=\{p_{i}; i=1,2,\dots\}\) be the probability distribution of a population with infinitely many species, and write \(p_{N,+}=\sum_{i=N}^{\infty}p_{i}\). Assume that there exists a sequence \(\{N_{n}\}_{n=1}^{\infty }\) such that \(np_{N_{n}+1,+}\rightarrow 0\). Then we have the following:

$$\frac{\sqrt{n}\left(\widehat{GS}-GS\right)}{\hat{\sigma}}\overset{d}{\to} N(0,1), $$

where

$$ \hat{\sigma}^{2}=4\left[\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{3}-\left(\sum\limits_{i=1}^{N_{n}}\hat{p}_{i}^{2}\right)^{2}\right]. $$
(9’)

Proof

Now let us consider a sequence of populations with probability distributions \(P_{N}=\{p_{1},p_{2},\dots,p_{N-1},p_{N,+}\}\), where \(p_{N,+}=\sum \limits _{i=N}^{\infty } p_{i}\). The corresponding Gini-Simpson index is

$$GS_{N}=1-\left(\sum\limits_{i=1}^{N-1}p_{i}^{2}+p_{N,+}^{2}\right)=1-\lambda_{N}. $$

It is easy to check that

$$\lambda_{N} \rightarrow \lambda $$

as \(N\rightarrow\infty\).

Let \(\{X_{i}\}_{i=1}^{n}\) be an iid sample from the population P. The MLE of GS is

$$\widehat{GS}=1-\sum\limits_{i=1}^{\infty} \hat{p}_{i}^{2}. $$

For fixed N, let’s re-label the same sample \(\{X_{i}\}_{i=1}^{n}\) to another sample \(\{Y_{i}\}_{i=1}^{n}\) as follows:

$$\begin{array}{*{20}l} & Y_{i}=X_{i}\, \text{if}\; X_{i}< N \\ & Y_{i}=N\, \text{if}\; X_{i}\geq N \end{array} $$

Then \(\{Y_{i}\}_{i=1}^{n}\) can be regarded as an iid sample from \(P_{N}\) with Gini-Simpson index \(GS_{N}\).

The MLE of G S N is

$$\widehat{GS}_{N}=1-\left(\sum\limits_{i=1}^{N-1} \hat{p}_{i}^{2}+\hat{p}^{2}_{N,+}\right). $$

It is easy to see that

$$\widehat{GS}_{N}-\widehat{GS}\rightarrow 0 $$

as \(N\rightarrow\infty\). In fact,

$$ \widehat{GS}-\widehat{GS}_{N}=0 $$
(20)

if \(X_{i}\leq N\) for all i=1,2,…,n.

Therefore,

$$\begin{array}{@{}rcl@{}} \sqrt{n}\left(\widehat{GS}-GS\right) &=&\sqrt{n}\left(\widehat{GS}-\widehat{GS}_{N}\right)+\sqrt{n}\left(\widehat{GS}_{N}-GS_{N}\right)+\sqrt{n}\left(GS_{N}-GS\right) \\ &=& \sqrt{n}\left(\widehat{GS}-\widehat{GS}_{N}\right)+\sqrt{n}\left(\widehat{GS}_{N}-GS_{N} \right)+\sqrt{n}(\lambda-\lambda_{N}), \end{array} $$

since \(GS_{N}-GS=(1-\lambda_{N})-(1-\lambda)=\lambda-\lambda_{N}\).

For any positive integer n, consider a corresponding integer \(N_{n}\). The probability that all the observations in the sample \(\{X_{i}\}_{i=1}^{n}\) are less than or equal to \(N_{n}\) is

$$\left(1-p_{N_{n}+1,+}\right)^{n}=\left(1-\sum\limits_{i=N_{n}+1}^{\infty} p_{i}\right)^{n}. $$

Therefore, if

$$\left(1-\sum\limits_{i=N_{n}+1}^{\infty} p_{i}\right)^{n} =\left[\left(1-p_{N_{n}+1,+}\right)^{1/p_{N_{n}+1,+}}\right]^{np_{N_{n}+1,+}}\approx e^{-np_{N_{n}+1,+}} \rightarrow 1, $$

that is,

$$ np_{N_{n}+1,+}\rightarrow 0 $$
(21)

then all the observations in the sample \(\{X_{i}\}_{i=1}^{n}\) fall into the first \(N_{n}\) species with probability tending to 1 as n increases. In turn,

$$\widehat{GS}-\widehat{GS}_{N_{n}} $$

equals zero with probability tending to 1, due to Eq. (20). Therefore,

$$\sqrt{n}\left(\widehat{GS}-\widehat{GS}_{N_{n}}\right) $$

converges to 0 with probability tending to 1 as n increases.

In addition,

$$\begin{array}{*{20}l} \sqrt{n}\left(\lambda_{N_{n}}-\lambda \right) &=\sqrt{n}\left(\sum\limits_{i=1}^{N_{n}-1}p_{i}^{2}+p_{N_{n},+}^{2}-\sum\limits_{i=1}^{\infty}p_{i}^{2}\right) \\ &=\sqrt{n}\left[\left(\sum\limits_{i=N_{n}}^{\infty} p_{i}\right)^{2}-\sum\limits_{i=N_{n}}^{\infty} p_{i}^{2}\right] \\ &\leq \sqrt{n}\left(\sum\limits_{i=N_{n}}^{\infty} p_{i}\right)^{2} = \sqrt{n}\, p_{N_{n},+}^{2} \\ &\leq \sqrt{n}\, p_{N_{n},+}. \end{array} $$

Therefore, if \(\sqrt {n}p_{N_{n},+}\rightarrow 0\), which is a weaker condition than (21), we have

$$ \sqrt{n}\left(\lambda_{N_{n}}-\lambda \right)\rightarrow 0. $$

Since \(\sqrt{n}\left(\widehat{GS}_{N_{n}}-GS_{N_{n}}\right)\) is asymptotically normal by the finite-K result (Theorem 4), the theorem follows from Slutsky’s theorem. □

Appendix 3: Proof of Theorem 6

Theorem 6

If the distribution {p k :k=1,…,K} is homogeneous, then

$$ nK\left(\widehat{GS}_{1}-GS\right)\overset{d}{\to} -\left(\chi_{K-1}^{2}-K+1\right). $$
(16’)

Proof

For an iid random sample \(\{X_{i}; i=1,\dots,n\}\) under the distribution P, \(\theta=\theta(P)\) is an estimable parameter and \(h(X_{1},\dots,X_{m})\) is a symmetric kernel satisfying \(E_{P}\{h(X_{1},\dots,X_{m})\}=\theta(P)\). Let \(U_{n}={{n}\choose {m}}^{-1}\sum \limits _{c}h(X_{i_{1}},\dots, X_{i_{m}})\), where \({\sum \nolimits }_{c}\) is the summation over the \({{n}\choose {m}}\) combinations of m distinct elements \(\{i_{1},\dots,i_{m}\}\) from \(\{1,\dots,n\}\). Let \(h_{1}(x_{1})=E_{P}\{h(x_{1},X_{2},\dots,X_{m})\}\) be the conditional expectation of h given \(X_{1}=x_{1}\), and \(\zeta_{1}=Var_{P}\{h_{1}(X_{1})\}\). Also let \(h_{2}(x_{1},x_{2})=E_{P}\{h(x_{1},x_{2},X_{3},\dots,X_{m})\}\) be the conditional expectation of h given \(X_{1}=x_{1}, X_{2}=x_{2}\), and \(\zeta_{2}=Var_{P}\{h_{2}(X_{1},X_{2})\}\). Define

$$\tilde{h}_{2}=h_{2}-\theta. $$

Then we have the following lemmas by Hoeffding (1948).

Lemma 2

If \(E_{P}(h^{2})<\infty\) and \(\zeta_{1}>0\), then \(\sqrt {n}\left (U_{n}-\theta \right) \overset {d}{\to } N\left (0,m^{2}\zeta _{1}\right)\).

Lemma 3

If \(E_{P}(h^{2})<\infty\) and \(\zeta_{1}=0<\zeta_{2}\), then \(n\left (U_{n}-\theta \right) \overset {d}{\to } \frac {m(m-1)}{2}Y\), where Y is a random variable of the form

$$Y=\sum\limits_{j}\lambda_{j}\left(\chi^{2}_{1j}-1\right), $$

where \(\chi^{2}_{11},\chi^{2}_{12},\dots\) are independent \(\chi _{1}^{2}\) variates and the \(\lambda_{j}\)’s are the eigenvalues of the following operator on the function space \(L_{2}(R,P)\):

$$ Ag(x)=\int_{-\infty}^{\infty} \tilde{h}_{2}(x,y)g(y)dP(y),\ x\in R, g\in L_{2}. $$
(22)

For our case, we have \(\theta=GS+1\) and the kernel function given by

$$ h(x_{1}, x_{2})=2-I_{x_{1}=x_{2}}. $$
(23)

That is,

$$GS+1=E_{P}\{h(X_{1},X_{2})\} $$

for given population distribution P.

Under the assumption of a homogeneous population distribution, \(\zeta_{1}=0\); indeed, \(h_{1}(x_{1})=2-p_{x_{1}}=2-\frac{1}{K}\) is constant. Since

$$h(X_{1},X_{2})=2-I_{X_{1}=X_{2}}=\left\{ \begin{aligned} 1 &\,\text{if}\,X_{1}=X_{2} \\ 2 &\,\text{if}\,X_{1}{\neq} X_{2} \end{aligned} \right.$$

We have

$$\begin{array}{*{20}l} \zeta_{2}&=Var\left(h(X_{1},X_{2})\right) \\ &=E\left(h^{2}(X_{1},X_{2})\right)-\left(E\left(h(X_{1},X_{2})\right)\right)^{2} \\ &=P(X_{1}=X_{2})+4P(X_{1}\neq X_{2})-\left(P(X_{1}=X_{2})+2P(X_{1}\neq X_{2})\right)^{2} \\ &= P(X_{1}=X_{2})+4P(X_{1}\neq X_{2})-P^{2}(X_{1}=X_{2})-4P(X_{1}=X_{2})P(X_{1}\neq X_{2})-4P^{2}(X_{1}\neq X_{2}) \\ &=P(X_{1}=X_{2})\left(1-P(X_{1}=X_{2})\right)+4P(X_{1}\neq X_{2})\left[1-P(X_{1}=X_{2})-P(X_{1}\neq X_{2})\right] \\ &=P(X_{1}=X_{2})P(X_{1}\neq X_{2}) \\ &=\sum\limits_{i=1}^{K} p_{i}^{2}\left(1-\sum\limits_{i=1}^{K} p_{i}^{2}\right) >0. \end{array} $$

Also

$$\theta=GS+1=2-\sum\limits_{i=1}^{K}\frac{1}{K^{2}}=2-\frac{1}{K}. $$

Now let’s find the eigenvalues of operator A under the homogeneous distribution. We have \(\tilde {h}_{2}(x,y)=2-I_{x=y}-\theta =\frac {1}{K}-I_{x=y}\). And

$$\begin{array}{*{20}l} Ag(x)&=\int_{-\infty}^{\infty} \tilde{h}_{2}(x,y)g(y)dP(y) \\ &= \int_{-\infty}^{\infty} \left(\frac{1}{K}-I_{x=y}\right) g(y)dP(y) \\ &=\frac{1}{K^{2}}\sum\limits_{i=1}^{K}g(i)-\frac{1}{K}g(x) \\ &=\frac{1}{K^{2}} \sum\limits_{i\neq x} g(i)+\left(\frac{1}{K^{2}}-\frac{1}{K}\right) g(x) \end{array} $$

Since \(g:\{1,2,\dots,K\}\rightarrow R\), it can be viewed as a vector in \(R^{K}\), and A is a linear operator on \(R^{K}\). The matrix representation of A is

$$A\vec{g}=T\vec{g} $$

where T is a \(K\times K\) matrix with \(T(i,i)=\frac {1}{K^{2}}-\frac {1}{K}\) and \(T(i,j)=\frac {1}{K^{2}}\) for \(i\neq j\). The matrix T has two eigenvalues: \(\lambda=0\) with multiplicity one and \(\lambda=-\frac {1}{K}\) with multiplicity K−1.
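Since \(T=\frac{1}{K^{2}}J-\frac{1}{K}I\), with J the all-ones matrix (eigenvalues K and 0), these eigenvalues also admit a quick numeric check; a minimal sketch with an illustrative K:

```python
import numpy as np

K = 6  # illustrative
T = np.full((K, K), 1.0 / K**2) - np.eye(K) / K
print(np.sort(np.linalg.eigvalsh(T)))  # K-1 copies of -1/K, then a single 0
```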

Therefore, by Lemma 3, \(n\left(\widehat{GS}_{1}-GS\right)\overset{d}{\to}\sum_{j}\lambda_{j}\left(\chi^{2}_{1j}-1\right)=-\frac{1}{K}\left(\chi^{2}_{K-1}-K+1\right)\), and the theorem is proved. □

Appendix 4: About the variances of \(\widehat {GS}\) and \(\widehat {GS}_{1}\)

From Eq. (12) and the computation of \(\zeta_{2}\) in Appendix 3, we have

$$ \zeta_{1}= \sum\limits_{i=1}^{K} p_{i}^{3}-\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2} $$

and

$$\zeta_{2}=\sum\limits_{i=1}^{K} p_{i}^{2}\left(1-\sum\limits_{i=1}^{K} p_{i}^{2}\right). $$

We use the following lemma from Hoeffding (1948).

Lemma 4

The variance of U n is given by

$$ Var_{F}(U_{n})=\dbinom{n}{m}^{-1} \sum\limits_{c=1}^{m} \dbinom{m}{c} \dbinom{n-m}{m-c} \zeta_{c} $$
(24)

Therefore,

$$\begin{array}{*{20}l} Var\left(\widehat{GS}_{1}\right) &=\dbinom{n}{2}^{-1}\left(2(n-2)\zeta_{1}+\zeta_{2}\right) \\ &=\frac{2}{n(n-1)}\left[2(n-2)\left(\sum\limits_{i=1}^{K} p_{i}^{3}-\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}\right)+\sum\limits_{i=1}^{K} p_{i}^{2}-\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}\right] \\ &=\frac{2}{n(n-1)}\left[2(n-2)\sum\limits_{i=1}^{K} p_{i}^{3}-(2n-3)\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}+\sum\limits_{i=1}^{K} p_{i}^{2}\right] \end{array} $$

From Bhargava and Uppuluri (1977), we have

$$Var\left(\widehat{GS}\right)=\frac{(n-1)^{2}}{n^{2}}\frac{2}{n(n-1)}\left[2(n-2)\sum\limits_{i=1}^{K} p_{i}^{3}-(2n-3)\left(\sum\limits_{i=1}^{K} p_{i}^{2}\right)^{2}+\sum\limits_{i=1}^{K} p_{i}^{2}\right] $$

Therefore, we have the following theorem.

Theorem 7

When K is finite, we have

$$Var\left(\widehat{GS}\right)=\frac{(n-1)^{2}}{n^{2}}Var\left(\widehat{GS}_{1}\right). $$