Machine Learning, Volume 106, Issue 11, pp 1771–1785

# Asymptotic properties of Turing’s formula in relative error

• Michael Grabchak
• Zhiyi Zhang

## Abstract

Turing’s formula allows one to estimate the total probability associated with letters from an alphabet, which are not observed in a random sample. In this paper we give conditions for the consistency and asymptotic normality of the relative error of Turing’s formula of any order. We then show that these conditions always hold when the distribution is regularly varying with index $$\alpha \in (0,1]$$.

## Keywords

Asymptotic normality, Consistency, Distributions on alphabets, Missing mass, Regular variation, Turing’s formula

## 1 Introduction

In many situations one works with data that has no natural ordering and is categorical in nature. In such cases, an important problem is to estimate the probability of seeing a new category that has not been observed before. This probability is called the missing mass. See McAllester and Ortiz (2003), Berend and Kontorovich (2013), Ben-Hamou et al. (2017), Decrouez et al. (2016) and the references therein for a discussion of its properties. The problem of estimating the missing mass arises in many applications, including ecology (Good and Toulmin 1956; Chao 1981; Chao et al. 2015), genomics (Mao and Lindsay 2002), speech recognition (Gupta et al. 1992; Chen and Goodman 1999), authorship attribution (Efron and Thisted 1976; Thisted and Efron 1987; Zhang and Huang 2007), and computer networks (Zhang 2005). Perhaps the most famous estimator of the missing mass is Turing’s formula, sometimes also called the Good–Turing formula. This formula was first published by Good (1953), where the idea is credited, largely, to Alan Turing. To discuss this estimator and how it works, we begin by formally defining our framework.

Let $${\mathcal {A}}=(a_1,a_2,\dots )$$ be a countable alphabet and let $${\mathcal {P}}=(p_1,p_2,\dots )$$ be a probability distribution on $${\mathcal {A}}$$. We refer to the elements in $${\mathcal {A}}$$ as letters. These represent the various categories of our data. Assume that $$X_1,\dots ,X_n$$ is a random sample of size n from $${\mathcal {A}}$$ according to $${\mathcal {P}}$$, let
\begin{aligned} y_{k,n} = \sum _{i=1}^n 1_{[X_i=a_k]} \end{aligned}
be the number of times that letter $$a_k$$ appears in the sample, and let
\begin{aligned} N_{r,n} = \sum _k 1_{[y_{k,n}=r]} \end{aligned}
be the number of letters observed exactly r times.
Define
\begin{aligned} \pi _{0,n} = \sum _{k}p_k1_{[y_{k,n}=0]}. \end{aligned}
This is the missing mass, i.e. the probability that the next observation will be of a letter that has not yet been observed. Turing’s formula is an estimator of $$\pi _{0,n}$$ given by
\begin{aligned} T_{0,n} = \frac{N_{1,n}}{n}. \end{aligned}
One way to see how well this estimator works is through simulations; for a recent simulation study see Grabchak and Cosme (2017). Other approaches, based on decision theory and Bayesian inference, are given in Cohen and Sackrowitz (1990) and Favaro et al. (2016). A simpler approach is to consider the bias of Turing’s formula and to see when
\begin{aligned} \mathrm E\left[ T_{0,n}-\pi _{0,n}\right] \approx 0. \end{aligned}
Discussions of this bias are given in, e.g., Robbins (1968), McAllester and Schapire (2000) and Zhang and Huang (2007). Along similar lines, it can be easily shown that
\begin{aligned} \left( T_{0,n}-\pi _{0,n}\right) \mathop {\rightarrow }\limits ^{p}0 \text{ as } n\rightarrow \infty . \end{aligned}
(1)
While this may seem to resolve the issue of consistency, it is not as informative as it first appears. This is because, as is easy to see, we have both $$T_{0,n}\mathop {\rightarrow }\limits ^{p}0$$ and $$\pi _{0,n}\mathop {\rightarrow }\limits ^{p}0$$ as $$n\rightarrow \infty$$. Thus, we cannot know if (1) tells us that $$T_{0,n}$$ is estimating $$\pi _{0,n}$$ well, or if both are just very small. This is an issue with the studies of bias as well.
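
The point above can be illustrated with a small simulation. The following is a minimal sketch, not part of the original analysis: the truncated power-law distribution, the alphabet size `K`, the seed, and the helper name `missing_mass_and_turing` are all assumptions made for illustration. It prints the missing mass, Turing’s formula, and both the absolute and the relative error, so that one can see the absolute error shrink simply because both quantities are small.

```python
import numpy as np

def missing_mass_and_turing(sample, p):
    """Return (pi_{0,n}, T_{0,n}) for a sample of letter indices drawn from p."""
    n = len(sample)
    counts = np.bincount(sample, minlength=len(p))  # y_{k,n} for every letter
    pi0 = p[counts == 0].sum()                      # missing mass pi_{0,n}
    n1 = np.count_nonzero(counts == 1)              # N_{1,n}: letters seen exactly once
    return pi0, n1 / n                              # T_{0,n} = N_{1,n} / n

# Hypothetical example: p_k proportional to k^{-2}, truncated to K letters.
K = 100_000
p = 1.0 / np.arange(1, K + 1) ** 2.0
p /= p.sum()

rng = np.random.default_rng(0)
for n in (1_000, 10_000, 100_000):
    sample = rng.choice(K, size=n, p=p)
    pi0, t0 = missing_mass_and_turing(sample, p)
    # Columns: n, missing mass, Turing's formula, absolute error, relative error.
    print(n, pi0, t0, t0 - pi0, (t0 - pi0) / pi0)
```

Because the alphabet is truncated, this only mimics a heavy-tailed distribution while n is small relative to `K`.
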
A different and, arguably, more meaningful way to think about consistency was introduced in Ohannessian and Dahleh (2012). There, the question under consideration was when does the relative error of Turing’s formula approach zero, i.e. when does
\begin{aligned} \frac{T_{0,n}-\pi _{0,n}}{\pi _{0,n}}\mathop {\rightarrow }\limits ^{p}0 \text{ as } n\rightarrow \infty \end{aligned}
(2)
hold. Specifically, Ohannessian and Dahleh (2012) and Ben-Hamou et al. (2017) showed that a sufficient condition for (2) is that the underlying distribution, $${\mathcal {P}}$$, is regularly varying.
Another approach to understanding and using Turing’s formula is to ask when asymptotic normality holds. In particular, conditions under which there exists a deterministic sequence $$g_n$$ with
\begin{aligned} g_n\left( T_{0,n}-\pi _{0,n}\right) \mathop {\rightarrow }\limits ^{d}N(0,1) \text{ as } n\rightarrow \infty \end{aligned}
(3)
are given in Esty (1983), Zhang and Huang (2008), Zhang and Zhang (2009) and Zhang (2013). Results of this type are important for constructing statistical tests and confidence intervals. See Zhang and Huang (2007) for an application to authorship attribution.
In this paper we give conditions for the asymptotic normality of the relative error of Turing’s formula. Specifically, we give conditions under which there exists a deterministic sequence $$h_n$$ with
\begin{aligned} h_n\frac{T_{0,n}-\pi _{0,n}}{\pi _{0,n}}\mathop {\rightarrow }\limits ^{d}N(0,1) \text{ as } n\rightarrow \infty . \end{aligned}
(4)
These conditions also imply consistency of the relative error in the sense of (2). We note that (4) leads to confidence intervals and hypothesis tests quite different from those determined by (3). The nature of this difference will be studied in a future work. All of our results are presented not just for Turing’s formula, but also for higher order Turing’s formulae, which we discuss in the next section.

We will prove (4) under a new sufficient condition. Interestingly, this condition turns out to be more restrictive than the one for (3). This is likely due to the fact that, since we are now dividing by $$\pi _{0,n}$$, we must make sure that it does not approach zero too quickly. We note that our approach is quite different from the one used in Ohannessian and Dahleh (2012) to prove (2). That approach uses tools specific to regularly varying distributions, which we will not need.

The remainder of the paper is organized as follows. In Sect. 2 we recall Turing’s formulae of higher orders and give conditions for the asymptotic normality and consistency of their relative errors. In Sect. 3, we give some comments on the assumptions of the main results and give alternate ways of checking them. Then, in Sect. 4, we show that the assumptions always hold when the distribution, $${\mathcal {P}}$$, is regularly varying with index $$\alpha \in (0,1]$$. For $$\alpha \ne 1$$ this is the condition under which the consistency results of Ohannessian and Dahleh (2012) were obtained. Finally, proofs of the main results are given in Sect. 5.

Before proceeding we introduce some notation. For $$x>0$$ we denote the gamma function by $$\Gamma (x)=\int _0^\infty e^{-t} t^{x-1}\mathrm dt$$. For real valued functions f and g, we write $$f(x)\sim g(x)$$ as $$x\rightarrow c$$ to mean $$\lim _{x\rightarrow c}\frac{f(x)}{g(x)} = 1$$. For sequences $$a_n$$ and $$b_n$$, we write $$a_n\sim b_n$$ to mean $$\lim _{n\rightarrow \infty }\frac{a_n}{b_n}=1$$. We write $$N(\mu ,\sigma ^2)$$ to refer to a normal distribution with mean $$\mu$$ and variance $$\sigma ^2$$. We write $$\mathop {\rightarrow }\limits ^{d}$$ to refer to convergence in distribution and $$\mathop {\rightarrow }\limits ^{p}$$ to refer to convergence in probability.

## 2 Main results

In this section we give our main results about the asymptotic normality and consistency of the relative error of Turing’s formula. We begin by recalling higher order Turing’s formulae, which were introduced in Good (1953). For any $$r=0,1,\dots ,n-1$$ let
\begin{aligned} \pi _{r,n} = \sum _k p_k1_{[y_{k,n}=r]} \end{aligned}
be the probability that the next observation will be of a letter that has been observed exactly r times, and let
\begin{aligned} \mu _{r,n} := \mathrm E[\pi _{r,n}] = {n\atopwithdelims ()r}\sum _k p_k^{r+1}(1-p_k)^{n-r}. \end{aligned}
An estimator of $$\pi _{r,n}$$ is given by
\begin{aligned} T_{r,n} = \frac{r+1}{n-r} N_{r+1,n}, \end{aligned}
which is Turing’s formula of order r. Turing’s formula of order 0 is just called Turing’s formula and is the most useful in applications as it estimates the probability of seeing a letter that has never been observed before. We now recall the results about asymptotic normality given in Zhang and Huang (2008) and Zhang (2013).
Let $$g_n$$ be a deterministic sequence of positive numbers such that
\begin{aligned} \limsup _{n\rightarrow \infty }\frac{g_n}{n^{1-\beta }}<\infty \text{ for } \text{ some } \beta \in (0,1/2). \end{aligned}
(5)
For an integer s, we say that Condition $$A_s$$ is satisfied if
\begin{aligned} \lim _{n\rightarrow \infty }g_n^2n^{s-2}\sum _k p_k^s(1-p_k)^{n-s} = c_s \end{aligned}
(6)
for some $$c_s\ge 0$$. The following result is given in Zhang (2013) and, for the case $$r = 0$$, in Zhang and Huang (2008).

### Lemma 1

Fix any integer $$r\ge 0$$ and let $$g_n$$ be a deterministic sequence of positive numbers satisfying (5). If Conditions $$A_{r+1}$$ and $$A_{r+2}$$ hold with $$c_{r+1}+c_{r+2}>0$$, then
\begin{aligned} g_n\left( T_{r,n}-\pi _{r,n}\right) \mathop {\rightarrow }\limits ^{d}N\left( 0,\frac{(r+1)c_{r+1}+c_{r+2}}{r!}\right) \text{ as } n\rightarrow \infty . \end{aligned}

We are now ready to state our main result.

### Theorem 1

Fix any integer $$r\ge 0$$ and let $$g_n$$ be a deterministic sequence of positive numbers satisfying (5). If $$r\ge 2$$ assume that
\begin{aligned} \limsup _{n\rightarrow \infty } g_n^2\sum _k p_k^2(1-p_k)^{n-2}<\infty . \end{aligned}
(7)
If Conditions $$A_{r+1}$$ and $$A_{r+2}$$ hold with $$c_{r+1}>0$$ and $$c_{r+2}\ge 0$$, then
\begin{aligned} \mu _{r,n}g_n\left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) \mathop {\rightarrow }\limits ^{d}N\left( 0,\frac{(r+1)c_{r+1}+c_{r+2}}{r!}\right) \text{ as } n\rightarrow \infty . \end{aligned}

### Proof

The proof is given in Sect. 5. $$\square$$

### Remark 1

Note that Theorem 1 does not, in general, give $$\sqrt{n}$$-convergence. In fact, the rate of convergence is different for different distributions. In Sect. 4 we will characterize the rates for the case of regularly varying distributions.

Since the most important case is when $$r=0$$, we restate Theorem 1 for this case.

### Corollary 1

Let $$g_n$$ be a deterministic sequence of positive numbers satisfying (5). If Conditions $$A_{1}$$ and $$A_{2}$$ hold with $$c_{1}>0$$ and $$c_{2}\ge 0$$, then
\begin{aligned} \mu _{0,n}g_n\left( \frac{T_{0,n}-\pi _{0,n}}{\pi _{0,n}}\right) \mathop {\rightarrow }\limits ^{d}N(0,c_1+c_2) \text{ as } n\rightarrow \infty . \end{aligned}

The results of Theorem 1 may not appear to be of practical use since we generally do not know the values of $$g_n$$, $$\mu _{r,n}$$, $$c_{r+1}$$, or $$c_{r+2}$$. However, it turns out that we do not need to know these quantities. So long as a sequence $$g_n$$ satisfying the assumptions exists, it and everything else can be estimated.

### Corollary 2

If the conditions of Theorem 1 are satisfied, then
\begin{aligned} \frac{\sqrt{r+1}\mathrm E[N_{r+1,n}]}{\sqrt{(r+1)\mathrm E[N_{r+1,n}] +(r+2)\mathrm E[N_{r+2,n}]}} \left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) \mathop {\rightarrow }\limits ^{d}N(0,1) \end{aligned}
and
\begin{aligned} \frac{\sqrt{r+1}N_{r+1,n}}{\sqrt{(r+1)N_{r+1,n} +(r+2)N_{r+2,n}}}\left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) \mathop {\rightarrow }\limits ^{d}N(0,1). \end{aligned}

### Proof

The proof is given in Sect. 5. $$\square$$
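
To make the second display of Corollary 2 concrete, the sketch below computes the random scaling factor from observed counts. The interval it returns comes from inverting the normal approximation by elementary algebra; that inversion, the function name `turing_relative_ci`, and the example counts are assumptions for illustration and are not taken from the paper.

```python
import math
import numpy as np

def turing_relative_ci(counts, r=0, z=1.96):
    """Turing's formula of order r with an approximate interval for pi_{r,n}.

    counts: array of y_{k,n} for the letters (zeros are allowed).
    z: standard normal quantile; 1.96 is roughly the two-sided 95% level.
    """
    counts = np.asarray(counts)
    n = int(counts.sum())
    n_r1 = int(np.count_nonzero(counts == r + 1))   # N_{r+1,n}
    n_r2 = int(np.count_nonzero(counts == r + 2))   # N_{r+2,n}
    if n_r1 == 0:
        raise ValueError("N_{r+1,n} = 0: the statistic is undefined")
    t_r = (r + 1) / (n - r) * n_r1                  # T_{r,n}
    # Random scaling from the second display of Corollary 2.
    scale = math.sqrt(r + 1) * n_r1 / math.sqrt((r + 1) * n_r1 + (r + 2) * n_r2)
    if scale <= z:
        raise ValueError("scale <= z: the inverted interval is unbounded")
    # |scale * (T/pi - 1)| <= z  <=>  T/(1 + z/scale) <= pi <= T/(1 - z/scale)
    return t_r, scale, (t_r / (1 + z / scale), t_r / (1 - z / scale))

# Hypothetical counts: 100 letters seen once, 30 seen twice, 8 seen five times.
example_counts = [1] * 100 + [2] * 30 + [5] * 8
print(turing_relative_ci(example_counts, r=0))
```

Note that the interval is not symmetric about $$T_{r,n}$$, which is one visible difference from an interval based on (3).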

In the proof of Theorem 1, it is shown that, under the assumptions of that theorem, $$\mu _{r,n} g_n\rightarrow \infty$$. This means that we can immediately get consistency.

### Corollary 3

If the conditions of Theorem 1 are satisfied, then
\begin{aligned} \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\mathop {\rightarrow }\limits ^{p}0. \end{aligned}
(8)

The assumptions of Corollary 3 are quite general. Different conditions are given in Corollary 5.3 of Ben-Hamou et al. (2017). The most general possible conditions for (8) are not known, but it is known that some conditions are necessary. In fact, Mossel and Ohannessian (2015) showed that there cannot exist an estimator of $$\pi _{0,n}$$ for which (8) holds for every distribution.

## 3 Discussion

In this section we discuss the assumptions of Theorem 1 and give alternate ways to verify them. We begin by giving several equivalent ways to check that Condition $$A_s$$ holds. Toward this end, we introduce the notation
\begin{aligned} \Phi _s(n) = \frac{n^{s}}{s!}\sum _k p_k^se^{-np_k}. \end{aligned}

### Lemma 2

For any integer $$s\ge 1$$ the following are equivalent:
\begin{aligned} \mathrm {(a)}&\lim _{n\rightarrow \infty }g_n^2 n^{s-2}\sum _k p_k^s(1-p_k)^{n-s}= c_s,\\ \mathrm {(b)}&\lim _{n\rightarrow \infty }(s-1)!\frac{g_n^2}{n}\mu _{s-1,n-1} = c_s,\\ \mathrm {(c)}&\lim _{n\rightarrow \infty }s!\frac{g_n^2}{n^2}\mathrm E[N_{s,n}] = c_s,\\ \mathrm {(d)}&\lim _{n\rightarrow \infty }s!\frac{g_n^2}{n^2}\Phi _s(n)= c_s. \end{aligned}

### Proof

The proof is given in Sect. 5. $$\square$$

### Remark 2

An intuitive interpretation of $$\Phi _s(n)$$ is as follows. Consider the case where the sample size is not fixed at n, but is a random variable $$n^*$$, where $$n^*$$ follows a Poisson distribution with mean n. In this case $$y_{k,n^*}$$ follows a Poisson distribution with mean $$np_k$$ and $$\Phi _s(n)=\mathrm E[N_{s,n^*}]$$. In this sense, Condition (d) can be thought of as a Poissonization of Condition (c). Poissonization is a useful tool when studying the occupancy problem and is discussed, at length, in Gnedin et al. (2007).
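
As a numerical illustration of this Poissonization, the sketch below evaluates the exact expectation $$\mathrm E[N_{s,n}]$$ and its Poissonized counterpart $$\Phi _s(n)$$ and prints their ratio. The truncated power-law distribution and the function names `expected_N` and `phi` are assumptions chosen for the example.

```python
import math
import numpy as np

def expected_N(p, n, s):
    """E[N_{s,n}] = C(n, s) * sum_k p_k^s (1 - p_k)^(n - s) for a fixed sample size n."""
    return math.comb(n, s) * np.sum(p ** s * (1.0 - p) ** (n - s))

def phi(p, n, s):
    """Phi_s(n) = n^s / s! * sum_k p_k^s exp(-n p_k), the Poissonized version."""
    return n ** s / math.factorial(s) * np.sum(p ** s * np.exp(-n * p))

# Hypothetical example: p_k proportional to k^{-2}, truncated to K letters.
K = 100_000
p = 1.0 / np.arange(1, K + 1) ** 2.0
p /= p.sum()

n = 2_000
for s in (1, 2, 3):
    # Columns: s, exact expectation, Poissonized expectation, their ratio.
    print(s, expected_N(p, n, s), phi(p, n, s), expected_N(p, n, s) / phi(p, n, s))
```

For this example the two quantities agree closely, which is consistent with the equivalence of (c) and (d) in Lemma 2.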

We now turn to the effects of Condition $$A_s$$.

### Lemma 3

1. Let $$s\ge 1$$ be an integer. If Condition $$A_s$$ holds with $$c_s>0$$ then
\begin{aligned} \lim _{n\rightarrow \infty }\frac{g_n}{n^{1/2}}=\infty \end{aligned}
(9)
and
\begin{aligned} \lim _{n\rightarrow \infty }\frac{g_n}{g_{n+1}}=1. \end{aligned}
(10)

2. When $$r\ge 2$$ and Condition $$A_{r+1}$$ holds, (7) is equivalent to
\begin{aligned} \limsup _{n\rightarrow \infty } \sum _{k:p_k<1/n} \left( g_np_k\right) ^2<\infty . \end{aligned}

### Proof

The proof is given in Sect. 5. $$\square$$

It is important to note that (9) is implicitly used in the proof of Lemma 1 as given in Zhang and Huang (2008) and Zhang (2013), although it is not directly mentioned there. Further, (9) implies that the assumption in (5) that $$\beta \in (0,1/2)$$ is not much of a restriction. It also tells us that $$g_n$$ must approach infinity quickly, but (5) tells us that it should not do so too quickly. On the other hand, (10) is a smoothness assumption. It looks like a regular variation condition, but is a bit weaker; see Theorem 1.9.8 in Bingham et al. (1987).

### Remark 3

Lemmas 2 and 3 help to explain what kind of distributions satisfy the assumptions of Theorem 1. Specifically, the two lemmas imply that $$r!n^{-1}g_n^2\mu _{r,n} \rightarrow c_{r+1}>0$$. In light of (5), this means that $$\mu _{r,n}$$ cannot approach zero too quickly. Thus, $$\pi _{r,n}$$ cannot approach zero quickly either. This condition means that the distribution must have heavy tails of some kind. Arguably, the best known distributions with heavy tails are those that are regularly varying, which we focus on in the next section.

## 4 Regular variation

In this section we show that the assumptions of Theorem 1 are always satisfied when $${\mathcal {P}}$$ is regularly varying with index $$\alpha \in (0,1]$$. The concept of regular variation of a probability measure on an alphabet seems to have originated in the classical paper by Karlin (1967); see also Gnedin et al. (2007) for a recent review. We begin by introducing the measure
\begin{aligned} \nu (\mathrm dx) = \sum _k \delta _{p_k}(\mathrm dx), \end{aligned}
where $$\delta _y$$ denotes the Dirac mass at y, and the function
\begin{aligned} {{\bar{\nu }}}(x) = \nu ([x,1]) = \sum _k 1_{[p_k\ge x]}. \end{aligned}
We say that $${\mathcal {P}}$$ is regularly varying with index $$\alpha \in [0,1]$$ if
\begin{aligned} {{\bar{\nu }}}(x)\sim \ell (1/x) x^{-\alpha } \text{ as } x\downarrow 0, \end{aligned}
(11)
where $$\ell$$ is slowly varying at infinity, i.e. for any $$t>0$$ it satisfies
\begin{aligned} \lim _{x\rightarrow \infty }\frac{\ell (xt)}{\ell (x)}=1. \end{aligned}
In this case we write $${\mathcal {P}}\in {{\mathcal {R}}}{{\mathcal {V}}}_\alpha (\ell )$$. When $$\alpha =0$$ we say that $${\mathcal {P}}$$ is slowly varying and when $$\alpha =1$$ we say that it is rapidly varying. These cases have different behavior from the others and we will discuss them separately.
To better understand the meaning of regular variation, we recall the following result from Gnedin et al. (2007). It says that $${\mathcal {P}}\in {{\mathcal {R}}}{{\mathcal {V}}}_\alpha (\ell )$$ with $$\alpha \in (0,1)$$ if and only if
\begin{aligned} p_k\sim \ell ^*(k)k^{-1/\alpha } \text{ as } k\rightarrow \infty , \end{aligned}
where $$\ell ^*$$ is a function that is slowly varying at infinity and is, in general, different from $$\ell$$. We now state the main results of this section.

### Proposition 1

If $${\mathcal {P}}\in {{\mathcal {R}}}{{\mathcal {V}}}_\alpha (\ell )$$ for some $$\alpha \in (0,1)$$ then, for any integer $$r\ge 0$$,
\begin{aligned} \kappa _\alpha n^{\alpha /2}[\ell (n)]^{1/2}\left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) \mathop {\rightarrow }\limits ^{d}N(0,1) \text{ as } n\rightarrow \infty , \end{aligned}
where $$\kappa _\alpha =\sqrt{\frac{\alpha \Gamma (r+1-\alpha )}{r!(2r+2-\alpha )} }$$.

### Proof

Let $$g_n = n^{1-\alpha /2}[\ell (n)]^{-1/2}$$. Proposition 17 in Gnedin et al. (2007) implies that for every integer $$s\ge 1$$
\begin{aligned} \lim _{n\rightarrow \infty } s!\frac{g_n^2}{n^2}\Phi _s(n)= & {} \alpha \Gamma (s-\alpha ). \end{aligned}
By Lemmas 2 and 3 this implies that
\begin{aligned} \mu _{r,n} \sim \frac{\alpha }{r!}\Gamma (r+1-\alpha )n^{-(1-\alpha )}\ell (n). \end{aligned}
Thus
\begin{aligned} \mu _{r,n}g_n \sim \frac{\alpha }{r!}\Gamma (r+1-\alpha )n^{\alpha /2}[\ell (n)]^{1/2} \end{aligned}
and $$(r+1)c_{r+1}+c_{r+2} = (r+1)\alpha \Gamma (r+1-\alpha )+\alpha \Gamma (r+2-\alpha ) = \alpha (2r+2-\alpha )\Gamma (r+1-\alpha )$$. From here the result follows by Theorem 1. $$\square$$
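
The limit used in this proof can be checked numerically in a simple special case. The sketch below assumes the pure power law $$p_k=ck^{-1/\alpha }$$, for which $${{\bar{\nu }}}(x)\sim c^{\alpha }x^{-\alpha }$$, so that $$\ell$$ is the constant $$c^{\alpha }$$. The sum is cut at `K` terms, and the function names `kappa` and `scaled_phi` are hypothetical.

```python
import math
import numpy as np

def kappa(alpha, r):
    """kappa_alpha from Proposition 1."""
    return math.sqrt(alpha * math.gamma(r + 1 - alpha)
                     / (math.factorial(r) * (2 * r + 2 - alpha)))

def scaled_phi(n, s, alpha, K=1_000_000):
    """s! * g_n^2 / n^2 * Phi_s(n) for p_k = c k^{-1/alpha}, with the sum cut at K terms."""
    k = np.arange(1, K + 1, dtype=float)
    c = 1.0 / np.sum(k ** (-1.0 / alpha))   # normalizing constant of the truncated law
    p = c * k ** (-1.0 / alpha)
    ell = c ** alpha                        # constant slowly varying function (assumption)
    # With g_n = n^{1 - alpha/2} ell^{-1/2}:
    # s! * g_n^2 / n^2 * Phi_s(n) = n^{s - alpha} / ell * sum_k p_k^s exp(-n p_k)
    return n ** (s - alpha) / ell * np.sum(p ** s * np.exp(-n * p))

alpha = 0.5
for s in (1, 2):
    target = alpha * math.gamma(s - alpha)  # limit stated in the proof
    for n in (100, 1_000, 10_000):
        print(s, n, scaled_phi(n, s, alpha), target)

print("kappa for alpha = 0.5, r = 0:", kappa(0.5, 0))
```

The truncation at `K` terms means the check is only meaningful while n is much smaller than $$1/p_K$$.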

Next, we turn to the case when $$\alpha =1$$.

### Proposition 2

Assume that $${\mathcal {P}}\in {{\mathcal {R}}}{{\mathcal {V}}}_1(\ell )$$ and let
\begin{aligned} \ell _1(y) = \int _y^\infty u^{-1} \ell (u)\mathrm du. \end{aligned}
If $$r=0$$ then
\begin{aligned} n^{1/2}[\ell _1(n)]^{1/2}\left( \frac{T_{0,n}-\pi _{0,n}}{\pi _{0,n}}\right) \mathop {\rightarrow }\limits ^{d}N(0,1) \text{ as } n\rightarrow \infty , \end{aligned}
and if $$r\ge 1$$ then
\begin{aligned} \kappa _1 n^{1/2}[\ell (n)]^{1/2}\left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) \mathop {\rightarrow }\limits ^{d}N(0,1) \text{ as } n\rightarrow \infty , \end{aligned}
where $$\kappa _1 = [r(2r+1)]^{-1/2}$$.

We note that the integral in the definition of $$\ell _1$$ converges, see the proof of Proposition 14 in Gnedin et al. (2007). Further, by Karamata’s Theorem (see e.g. Theorem 2.1 in Resnick 2007), $$\ell _1$$ is slowly varying at infinity.

### Proof

We begin with the case $$r=0$$. In this case we let $$g_n = n^{1/2}[\ell _1(n)]^{-1/2}$$. Proposition 18 in Gnedin et al. (2007) implies that
\begin{aligned} \lim _{n\rightarrow \infty } \frac{g_n^2}{n^2}\Phi _1(n) = 1 \quad \text{ and } \quad 2\Phi _2(n) \sim n \ell (n). \end{aligned}
This means that $$\mu _{0,n} \sim \ell _1(n)$$ and that
\begin{aligned} 2\frac{g_n^2}{n^2}\Phi _2(n) \sim \frac{\ell (n)}{\ell _1(n)}\rightarrow 0, \end{aligned}
(12)
where the convergence follows by Karamata’s Theorem, see e.g. Theorem 2.1 in Resnick (2007). Thus $$\mu _{0,n}g_n\sim n^{1/2}[\ell _1(n)]^{1/2}$$ and $$c_1+c_2=1$$. From here the first part follows by Corollary 1.
Now assume that $$r\ge 1$$. In this case we let $$g_n = n^{1/2}[\ell (n)]^{-1/2}$$. Proposition 18 in Gnedin et al. (2007) says that for $$s\ge 2$$
\begin{aligned} \lim _{n\rightarrow \infty } s!\frac{g_n^2}{n^2}\Phi _s(n) = (s-2)!, \end{aligned}
which means that
\begin{aligned} \mu _{r,n} \sim \frac{\ell (n)}{r}. \end{aligned}
This implies that $$\mu _{r,n}g_n\sim r^{-1}\sqrt{n\ell (n)}$$ and that $$(r+1)c_{r+1}+c_{r+2}=(2r+1)(r-1)!$$. Putting everything together and applying Theorem 1 gives the result. $$\square$$

### Remark 4

From the proof of Proposition 2 we see that, when $$\alpha =1$$ and $$r\ge 1$$, we have $$g_n=n^{1/2}[\ell (n)]^{-1/2}$$. Further, $$\Phi _1(n)\sim n\ell _1(n)$$, which means that
\begin{aligned} \frac{g^2_n}{n^2} \Phi _1(n)\sim \frac{\ell _1(n)}{\ell (n)}\rightarrow \infty , \end{aligned}
where the convergence follows by (12). Thus Condition $$A_1$$ fails to hold. However, Conditions $$A_s$$ for $$s\ge 2$$ hold.

### Remark 5

When $$\alpha =0$$ the distributions may no longer be heavy tailed and the results of Theorem 1 need not hold. In fact, while all geometric distributions are regularly varying with $$\alpha =0$$, Ohannessian and Dahleh (2012) showed that for some of them (8) does not hold, and thus neither does the result of Theorem 1.

We can also show that, under a mild additional condition, the assumptions of Theorem 1 do not hold for $$\alpha =0$$. Specifically, assume that $${\mathcal {P}}\in {{\mathcal {R}}}{{\mathcal {V}}}_0(\ell )$$ and that there is a slowly varying at infinity function $$\ell _0$$ such that
\begin{aligned} \sum _{k} p_k 1_{[p_k\le x]} \sim x\ell _0(1/x) \text{ as } x\downarrow 0. \end{aligned}
In this case, Proposition 19 in Gnedin et al. (2007) implies that
\begin{aligned} \ell (x)\sim \int _1^x u^{-1}\ell _0(u)\mathrm du \text{ as } x\rightarrow \infty , \end{aligned}
and that for each $$s\ge 1$$
\begin{aligned} \Phi _s(n) \sim \frac{1}{s}\ell _0(n). \end{aligned}
Thus, to get $$s!\frac{g_n^2}{n^2}\Phi _s(n)$$ to converge to a positive constant $$c_s$$, we must take $$g_n \sim n\sqrt{\frac{c_s}{(s-1)!\ell _0(n)}}$$. However, since $$\ell _0$$ is slowly varying at infinity, $$g_n$$ does not satisfy (5) for any $$\beta \in (0,1/2)$$, and the assumptions of Theorem 1 do not hold. Thus, the question of when and if asymptotic normality holds in this case cannot be answered using Theorem 1.

Combining Corollaries 2 and 3 with Propositions 1 and 2 gives the following.

### Corollary 4

If $${\mathcal {P}}\in {{\mathcal {R}}}{{\mathcal {V}}}_\alpha (\ell )$$ with $$\alpha \in (0,1]$$ then for any integer $$r\ge 0$$ we have
\begin{aligned} \frac{\sqrt{r+1}N_{r+1,n}}{\sqrt{(r+1)N_{r+1,n} +(r+2)N_{r+2,n}}}\left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) \mathop {\rightarrow }\limits ^{d}N(0,1) \end{aligned}
and
\begin{aligned} \left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) \mathop {\rightarrow }\limits ^{p}0. \end{aligned}

Note that, in the above, we do not need to know what $$\alpha$$ and $$\ell$$ are, only that they exist.
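
A Monte Carlo experiment is one way to get a feel for the first convergence in Corollary 4. The sketch below is illustrative only: the truncated power-law distribution, the sample size, the number of replications, the seed, and the name `studentized_relative_error` are all assumptions, and the true distribution is used to compute $$\pi _{r,n}$$, which is possible only in a simulation.

```python
import numpy as np

def studentized_relative_error(counts, p, r=0):
    """Statistic from Corollary 4, computed with the true p (simulation only)."""
    n = counts.sum()
    n_r1 = np.count_nonzero(counts == r + 1)          # N_{r+1,n}
    n_r2 = np.count_nonzero(counts == r + 2)          # N_{r+2,n}
    pi_r = p[counts == r].sum()                       # pi_{r,n}
    t_r = (r + 1) / (n - r) * n_r1                    # T_{r,n}
    scale = np.sqrt(r + 1) * n_r1 / np.sqrt((r + 1) * n_r1 + (r + 2) * n_r2)
    return scale * (t_r - pi_r) / pi_r

# Hypothetical setup: p_k proportional to k^{-2}, truncated to K letters.
K, n, reps = 100_000, 5_000, 200
p = 1.0 / np.arange(1, K + 1) ** 2.0
p /= p.sum()

rng = np.random.default_rng(1)
z = np.array([
    studentized_relative_error(rng.multinomial(n, p), p)
    for _ in range(reps)
])
# Sample mean, sample standard deviation, and the fraction of |z| <= 1.96.
print(z.mean(), z.std(), np.mean(np.abs(z) <= 1.96))
```

If the normal approximation is reasonable at this sample size, the printed mean and standard deviation should be in the vicinity of 0 and 1; since the rate of convergence is slow for small $$\alpha$$, noticeable deviations would not be surprising.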

### Remark 6

Using a different approach, Ohannessian and Dahleh (2012) showed that, when $$\alpha \in (0,1),$$ the second convergence in Corollary 4 can be replaced by almost sure convergence. This was extended to the case $$\alpha =1$$ and $$r=0$$ by Corollary 5.3 of Ben-Hamou et al. (2017).

## 5 Proofs

In this section we give the proofs of our results. We begin by introducing some notation and giving several lemmas that may be of independent interest. For any integer $$r\ge 0$$ let
\begin{aligned} \Pi _{r,n} = \sum _{i=0}^r \pi _{i,n} = \sum _{k} p_k 1_{[y_{k,n}\le r]} \end{aligned}
be the total probability of all letters observed at most r times, and let
\begin{aligned} M_{r,n}=\mathrm E[\Pi _{r,n}]. \end{aligned}
Note that, for $$r\ge 1$$, we have $$\pi _{r,n}=\Pi _{r,n} - \Pi _{r-1,n}$$ and $$\mu _{r,n}=M_{r,n} - M_{r-1,n}$$.

### Lemma 4

For any integer $$r\ge 1$$ and any $$\epsilon >0$$
\begin{aligned} P(|\pi _{r,n}-\mu _{r,n}|>\epsilon )\le & {} P(|\Pi _{r,n}-M_{r,n}|>\epsilon /2) + P(|\Pi _{r-1,n}-M_{r-1,n}|>\epsilon /2)\\\le & {} 4\epsilon ^{-2}\left[ \mathrm {Var}(\Pi _{r,n})+\mathrm {Var}(\Pi _{r-1,n})\right] . \end{aligned}

The proof is similar to that of Lemma 20 in Ohannessian and Dahleh (2012). We include it for completeness.

### Proof

Define the events $$A=[-\epsilon /2<\Pi _{r,n}-M_{r,n}<\epsilon /2]$$, $$B=[-\epsilon /2<M_{r-1,n}-\Pi _{r-1,n}<\epsilon /2]$$, and $$C=[-\epsilon<\pi _{r,n}-\mu _{r,n}<\epsilon ]$$. Since $$A\cap B\subset C$$ it follows that $$P(C^c) \le P( A^c\cup B^c)\le P(A^c)+P(B^c)$$, which gives the first inequality. The second follows by Chebyshev’s inequality. $$\square$$

We will need bounds on the variances in the above lemma.

### Lemma 5

For any integer $$r\ge 0$$ we have
\begin{aligned} \mathrm {Var}(\Pi _{r,n})\le \sum _{i=2}^{r+2} n^{i-2} \sum _k p_k^{i} (1-p_k)^{n-i}. \end{aligned}

This result follows from the fact that the random variables $$\{y_{k,n}:k=1,2,\dots \}$$ are negatively associated, see Dubhashi and Ranjan (1998). For completeness we give a detailed proof.

### Proof

Note that
\begin{aligned} \mathrm {Var}(\Pi _{r,n})= & {} \mathrm E[\Pi _{r,n}^2] - \left( \mathrm E[\Pi _{r,n}]\right) ^2\\= & {} \mathrm E\left[ \left( \sum _k p_k 1_{[y_{k,n}\le r]}\right) ^2\right] - \left( \sum _k p_k P(y_{k,n}\le r)\right) ^2\\= & {} \sum _k p_k^2 P(y_{k,n}\le r) + \sum _{k\ne \ell }p_k p_\ell P(y_{k,n}\le r,y_{\ell ,n}\le r)\\&-\sum _k p_k^2 \left[ P(y_{k,n}\le r)\right] ^2 - \sum _{k\ne \ell }p_k p_\ell P(y_{k,n}\le r)P(y_{\ell ,n}\le r)\\\le & {} \sum _k p_k^2 P(y_{k,n}\le r)\\&+ \sum _{k\ne \ell }p_k p_\ell \left[ P(y_{k,n}\le r,y_{\ell ,n}\le r)-P(y_{k,n}\le r)P(y_{\ell ,n}\le r)\right] \\\le & {} \sum _k p_k^2 P(y_{k,n}\le r), \end{aligned}
where the last inequality follows by the well-known fact that, for a multinomial distribution, $$\left[ P(y_{k,n}\le r,y_{\ell ,n}\le r)-P(y_{k,n}\le r)P(y_{\ell ,n}\le r)\right] \le 0$$, see e.g. Mallows (1968). Combining the above with the fact that $$y_{k,n}$$ has a binomial distribution with parameters n and $$p_k$$ gives
\begin{aligned} \mathrm {Var}(\Pi _{r,n})\le & {} \sum _k p_k^2 \sum _{i=0}^r {n \atopwithdelims ()i} p_k^i(1-p_k)^{n-i}\\= & {} \sum _{i=0}^r {n \atopwithdelims ()i} \sum _k p_k^{i+2} (1-p_k)^{n-i}\\\le & {} \sum _{i=0}^r n^i \sum _k p_k^{i+2} (1-p_k)^{n-i}\\= & {} \sum _{i=2}^{r+2} n^{i-2} \sum _k p_k^{i} (1-p_k)^{n+2-i}\\\le & {} \sum _{i=2}^{r+2} n^{i-2} \sum _k p_k^{i} (1-p_k)^{n-i}, \end{aligned}
which completes the proof. $$\square$$
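
For $$r=0$$ the bound of Lemma 5 can be checked directly on a small alphabet, where the variance of the missing mass is available in closed form from the pairwise probabilities $$P(y_{k,n}=0,y_{\ell ,n}=0)=(1-p_k-p_\ell )^n$$. The sketch below is a sanity check under assumed inputs; the distribution and the function names `var_missing_mass` and `lemma5_bound_r0` are hypothetical.

```python
import numpy as np

def var_missing_mass(p, n):
    """Exact Var(pi_{0,n}) under multinomial sampling, via pairwise probabilities."""
    q = (1.0 - p) ** n                                   # P(y_k = 0)
    # P(y_k = 0, y_l = 0) for k != l; clip guards against tiny negative round-off.
    joint = np.clip(1.0 - p[:, None] - p[None, :], 0.0, None) ** n
    cov = joint - np.outer(q, q)                         # covariances of the indicators
    np.fill_diagonal(cov, q - q ** 2)                    # variances on the diagonal
    return p @ cov @ p

def lemma5_bound_r0(p, n):
    """Right-hand side of Lemma 5 for r = 0: sum_k p_k^2 (1 - p_k)^(n - 2)."""
    return np.sum(p ** 2 * (1.0 - p) ** (n - 2))

# Hypothetical example: p_k proportional to k^{-2} on 500 letters.
p = 1.0 / np.arange(1, 501) ** 2.0
p /= p.sum()
for n in (10, 100, 1_000):
    print(n, var_missing_mass(p, n), lemma5_bound_r0(p, n))
```

The off-diagonal covariances are nonpositive, which is the negative dependence used in the proof.
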

To help simplify the above bound, we give the following result.

### Lemma 6

If $$1\le s\le t\le u<\infty$$ then
\begin{aligned} n^{t-2}\sum _k p_k^t(1-p_k)^{n-t}\le & {} n^{s-2}\sum _k p_k^s(1-p_k)^{n-s}\\&+\, n^{u-2}\sum _k p_k^u(1-p_k)^{n-u}. \end{aligned}

### Proof

Recall that $$x^t\le \max \{x^s,x^u\}\le x^s+x^u$$ for every $$x\ge 0$$ whenever $$s\le t\le u$$. Observing that
\begin{aligned} n^{-2}\sum _k \left( \frac{np_k}{1-p_k}\right) ^t(1-p_k)^{n}\le & {} n^{-2}\sum _k \max \left\{ \left( \frac{np_k}{1-p_k}\right) ^s,\left( \frac{np_k}{1-p_k}\right) ^u\right\} (1-p_k)^{n}\\\le & {} n^{s-2}\sum _k p_k^s(1-p_k)^{n-s}+n^{u-2}\sum _k p_k^u(1-p_k)^{n-u} \end{aligned}
gives the result. $$\square$$

### Proof of Lemma 2

Observing that
\begin{aligned} (s-1)! \frac{g_n^2}{n}\mu _{s-1,n-1}= & {} (s-1)! \frac{g_n^2}{n} {n-1\atopwithdelims ()s-1}\sum _{k}p_k^s(1-p_k)^{n-s}\\\sim & {} g_n^2n^{s-2}\sum _{k}p_k^s(1-p_k)^{n-s} \end{aligned}
and
\begin{aligned} s!\frac{g_n^2}{n^2}\mathrm E[N_{s,n}]= & {} s!\frac{g_n^2}{n^2}{n\atopwithdelims ()s}\sum _k p_k^s(1-p_k)^{n-s}\\\sim & {} g_n^2n^{s-2}\sum _{k}p_k^s(1-p_k)^{n-s} \end{aligned}
gives the equivalence between (a), (b), and (c). The equivalence between (c) and (d) is shown in Zhang (2013). $$\square$$

### Proof of Lemma 3

First note that for $$s\ge 1$$
\begin{aligned} \lim _{n\rightarrow \infty }n^{s-1}\sum _k p_k^s(1-p_k)^{n-s}\le & {} \lim _{n\rightarrow \infty }\sum _k p_k(np_k)^{s-1}e^{-p_k(n-s)}\\\le & {} e^s\sum _k p_k \lim _{n\rightarrow \infty }(np_k)^{s-1}e^{-p_kn}=0, \end{aligned}
where we use the well-known fact that $$(1-x)\le e^{-x}$$ for $$x\ge 0$$ (see, e.g., 4.2.29 in Abramowitz and Stegun 1972) and we interchange limit and summation by dominated convergence and the fact that the function $$f(x)=x^{s-1}e^{-x}$$ is bounded for $$x\ge 0$$. Combining the above with Condition $$A_s$$ and the assumption that $$c_s>0$$ gives $$g_n^2/n\rightarrow \infty$$, which implies (9).
We now turn to (10). Throughout the proof of this part we use the formulation of Condition $$A_s$$ given in (d) of Lemma 2. We have
\begin{aligned} g^2_{n+1}(n+1)^{s-2}\sum _k p_k^se^{-(n+1)p_k} \rightarrow c_s, \end{aligned}
and hence
\begin{aligned} g_{n}^2n^{s-2} \sum _k p_k^se^{-(n+1)p_k}= & {} \frac{g_{n}^2}{g_{n+1}^2}\frac{n^{s-2}}{(n+1)^{s-2}}g_{n+1}^2(n+1)^{s-2}\sum _k p_k^se^{-(n+1)p_k}\\\sim & {} c_s\frac{g_{n}^2}{g_{n+1}^2}. \end{aligned}
Combining this with the fact that
\begin{aligned} g^2_{n}n^{s-2}\sum _k p_k^se^{-(n+1)p_k} \le g^2_{n}n^{s-2}\sum _k p_k^se^{-np_k} \rightarrow c_s \end{aligned}
gives
\begin{aligned} \limsup _{n\rightarrow \infty }\frac{g_n}{g_{n+1}}\le 1. \end{aligned}
Now, let $$A_n=\{k:p_k\le n^{-1/2}\}$$ and let $$B_n=A_n^c=\{k:p_k> n^{-1/2}\}$$. Note that the cardinality of $$B_n$$ is bounded by $$n^{1/2}$$. Using the facts that the function $$f(t) = t^se^{-nt}$$ is decreasing on $$(s/n,1]$$ and that $$sn^{-1}<n^{-1/2}$$ for large enough n gives
\begin{aligned} 0\le \limsup _{n\rightarrow \infty }g_n^2n^{s-2}\sum _{k\in B_n} p_k^s e^{-np_k}\le & {} \limsup _{n\rightarrow \infty }g_n^2n^{s-2} \sum _{k\in B_n} n^{-s/2} e^{-n^{1/2}}\\\le & {} \lim _{n\rightarrow \infty }\left( \frac{g_n}{n^{1-\beta }}\right) ^2n^{.5(s+1)-2\beta } e^{-n^{1/2}} = 0, \end{aligned}
where the convergence follows by (5). This implies that
\begin{aligned} \liminf _{n\rightarrow \infty }g_n^2n^{s-2}\sum _{k} p_k^s e^{-(n+1)p_k}\ge & {} \liminf _{n\rightarrow \infty }g_n^2n^{s-2}\sum _{k\in A_n} p_k^s e^{-np_k} e^{-p_k}\\\ge & {} \liminf _{n\rightarrow \infty }g_n^2n^{s-2}e^{-n^{-1/2}}\sum _{k\in A_n} p_k^s e^{-np_k} \\= & {} \lim _{n\rightarrow \infty }g_n^2n^{s-2}\sum _{k} p_k^s e^{-np_k} = c_s. \end{aligned}
Now note that
\begin{aligned} g_n^2n^{s-2}\sum _{k} p_k^s e^{-(n+1)p_k}= & {} \frac{g_{n}^2}{g_{n+1}^2}\frac{n^{s-2}}{(n+1)^{s-2}}g_{n+1}^2(n+1)^{s-2}\sum _{k} p_k^s e^{-(n+1)p_k}\\\sim & {} c_s\frac{g_{n}^2}{g_{n+1}^2}. \end{aligned}
This implies that
\begin{aligned} \liminf _{n\rightarrow \infty } \frac{g_n}{g_{n+1}} \ge 1, \end{aligned}
which completes the proof of the first part.
We now turn to the second part. Note that when $$A_{r+1}$$ holds for $$r\ge 2$$ we have
\begin{aligned} \limsup _{n\rightarrow \infty } g_n^2 \sum _{k:p_k\ge 1/n} p_k^2 (1-p_k)^{n-2}= & {} \limsup _{n\rightarrow \infty } \frac{g_n^2}{n^2} \sum _{k:p_k\ge 1/n} (np_k)^2 (1-p_k)^{n-2}\\\le & {} \limsup _{n\rightarrow \infty } \frac{g_n^2}{n^2} \sum _{k} (np_k)^{r+1} (1-p_k)^{n-(r+1)}=c_{r+1} \end{aligned}
and that for $$n\ge 2$$
\begin{aligned} \left( 1-\frac{1}{n}\right) ^{n-2}g_n^2 \sum _{k:p_k<1/n} p_k^2\le g_n^2 \sum _{k:p_k<1/n} p_k^2 (1-p_k)^{n-2} \le g_n^2 \sum _{k:p_k<1/n} p_k^2. \end{aligned}
From here, the fact that $$\left( 1-\frac{1}{n}\right) ^{n-2}\rightarrow e^{-1}$$ gives the result. $$\square$$

### Proof of Theorem 1

We have
\begin{aligned} \mu _{r,n}g_n\left( \frac{T_{r,n}-\pi _{r,n}}{\pi _{r,n}}\right) = \frac{\mu _{r,n}}{\pi _{r,n}}g_n\left( T_{r,n}-\pi _{r,n}\right) . \end{aligned}
By Lemma 1 and Slutsky’s Theorem it suffices to show that
\begin{aligned} \frac{\pi _{r,n}}{\mu _{r,n}}\mathop {\rightarrow }\limits ^{p}1. \end{aligned}
When $$r=0$$, Chebyshev’s inequality implies that for any $$\epsilon >0$$
\begin{aligned} P\left( \left| \frac{\pi _{0,n}}{\mu _{0,n}}-1\right|>\epsilon \right) = P\left( \left| \pi _{0,n}-\mu _{0,n}\right| >\mu _{0,n}\epsilon \right) \le \epsilon ^{-2}\frac{\mathrm {Var}\left( \pi _{0,n}\right) }{\mu _{0,n}^2}. \end{aligned}
Similarly, when $$r\ge 1$$, Lemma 4 implies that for any $$\epsilon >0$$
\begin{aligned} P\left( \left| \frac{\pi _{r,n}}{\mu _{r,n}}-1\right| >\epsilon \right) \le 4\epsilon ^{-2}\left[ \frac{\mathrm {Var}(\Pi _{r,n})}{\mu _{r,n}^2}+\frac{\mathrm {Var}(\Pi _{r-1,n})}{\mu _{r,n}^2}\right] . \end{aligned}
In both cases we can combine the above with Lemma 5 to show that
\begin{aligned} P\left( \left| \frac{\pi _{r,n}}{\mu _{r,n}}-1\right| >\epsilon \right)\le & {} 8\epsilon ^{-2} \frac{\sum _{i=2}^{r+2} n^{i-2} \sum _k p_k^{i} (1-p_k)^{n-i}}{\mu _{r,n}^2}\\= & {} 8\epsilon ^{-2} \frac{\sum _{i=2}^{r+2} g_n^2n^{i-2} \sum _k p_k^{i} (1-p_k)^{n-i}}{g_n^2\mu _{r,n}^2}. \end{aligned}
We must now show that this approaches zero. By (7), the fact that Condition $$A_{r+2}$$ holds, and Lemma 6, the limsup of the numerator is bounded, so it suffices to show that $$[g_{n}\mu _{r,n}]^{-1}\rightarrow 0$$. To see this, note that by (5) and Lemmas 2 and 3
\begin{aligned} \frac{1}{g_{n}\mu _{r,n}} = \frac{g_{n}}{n} \frac{n}{g_{n}^2\mu _{r,n}} \sim \frac{r!}{c_{r+1}}\frac{g_{n}}{n}= \frac{r!}{c_{r+1}}\frac{g_n}{n^{1-\beta }}n^{-\beta } \rightarrow 0, \end{aligned}
which concludes the proof. $$\square$$
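The case $$r=0$$ of the key step, $$\pi _{0,n}/\mu _{0,n}\mathop {\rightarrow }\limits ^{p}1$$, can be illustrated numerically. The following is a minimal Monte Carlo sketch, not the authors' code. It assumes the standard definitions: $$\pi _{0,n}$$ is the realized missing mass, $$\mu _{0,n}=\sum _k p_k(1-p_k)^n$$ is its mean, and Turing's formula of order zero is $$N_{1,n}/n$$. The truncated power-law distribution, the function names, and all parameter values are hypothetical choices made only for illustration.

```python
import numpy as np


def power_law_probs(num_letters, exponent):
    """Hypothetical test distribution: p_k proportional to k**(-exponent), truncated."""
    weights = np.arange(1, num_letters + 1, dtype=float) ** (-exponent)
    return weights / weights.sum()


def expected_missing_mass(n, probs):
    """mu_{0,n} = sum_k p_k (1 - p_k)**n, the mean of the missing mass."""
    return float(np.sum(probs * (1.0 - probs) ** n))


def simulate_once(n, probs, rng):
    """Draw one sample of size n and return (pi_0, t_0).

    pi_0 is the realized missing mass (total probability of unseen letters);
    t_0 = N_1 / n is Turing's formula, with N_1 the number of letters seen once.
    """
    counts = rng.multinomial(n, probs)
    pi_0 = float(probs[counts == 0].sum())
    t_0 = np.count_nonzero(counts == 1) / n
    return pi_0, t_0


def run_experiment(sample_sizes, probs, seed=0):
    """For each n, return (n, pi_0 / mu_0, relative error of Turing's formula)."""
    rng = np.random.default_rng(seed)
    rows = []
    for n in sample_sizes:
        pi_0, t_0 = simulate_once(n, probs, rng)
        mu_0 = expected_missing_mass(n, probs)
        rows.append((n, pi_0 / mu_0, (t_0 - pi_0) / pi_0))
    return rows


if __name__ == "__main__":
    # Assumed setup: 50,000 letters with p_k proportional to k**(-2).
    probs = power_law_probs(num_letters=50_000, exponent=2.0)
    for n, ratio, rel_err in run_experiment([1_000, 10_000, 100_000], probs):
        print(f"n={n:>7}  pi_0/mu_0={ratio:.3f}  (T_0-pi_0)/pi_0={rel_err:+.3f}")
```

Each row comes from a single replication, so the printed values are noisy. Averaging over many seeds would give a clearer picture of how the ratio concentrates near one as $$n$$ grows.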

### Proof of Corollary 2

We begin with the first part. Note that Lemma 2 combined with Lemma 3 implies that
\begin{aligned} \mu _{r,n}g_n \sim \frac{c_{r+1}}{r!}\frac{n}{g_n}\sim (r+1)\frac{g_n}{n}\mathrm E[N_{r+1,n}] \end{aligned}
and
\begin{aligned} (r+1)^2\frac{g_n^2}{n^2}\mathrm E[N_{r+1,n}] +(r+2)(r+1)\frac{g_n^2}{n^2}\mathrm E[N_{r+2,n}] \rightarrow \frac{(r+1)c_{r+1}+c_{r+2}}{r!}. \end{aligned}
Putting everything together and applying Slutsky’s Theorem gives the first part. For the second part, we note that in the proof of Theorem 3.3 in Zhang (2013) it was shown that
\begin{aligned} (r+1)!\frac{g_n^2}{n^2}N_{r+1,n}\mathop {\rightarrow }\limits ^{p}c_{r+1} \text{ and } (r+2)!\frac{g_n^2}{n^2}N_{r+2,n}\mathop {\rightarrow }\limits ^{p}c_{r+2}. \end{aligned}
From here the result follows as in the previous part. $$\square$$
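The constant in the second display of the first part can be checked term by term. This worked computation is a sketch that assumes $$\frac{g_n^2}{n^2}\mathrm E[N_{j,n}]\rightarrow \frac{c_j}{j!}$$ for $$j=r+1,r+2$$, the expectation analogue of the in-probability limits quoted from Zhang (2013). For $$j=r+1$$ this is the first display rearranged. Substituting gives
\begin{aligned} (r+1)^2\frac{c_{r+1}}{(r+1)!}+(r+2)(r+1)\frac{c_{r+2}}{(r+2)!} = \frac{(r+1)c_{r+1}}{r!}+\frac{c_{r+2}}{r!} = \frac{(r+1)c_{r+1}+c_{r+2}}{r!}, \end{aligned}
which agrees with the stated limit.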

## Notes

### Acknowledgements

The authors wish to thank the anonymous referees whose detailed comments led to improvements in the presentation of this paper.

## References

1. Abramowitz, M., & Stegun, I. A. (1972). Handbook of mathematical functions (10th ed.). New York: Dover Publications.
2. Ben-Hamou, A., Boucheron, S., & Ohannessian, M. I. (2017). Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli, 23(1), 249–287.
3. Berend, D., & Kontorovich, A. (2013). On the concentration of the missing mass. Electronic Communications in Probability, 18(3), 1–7.
4. Bingham, N. H., Goldie, C. M., & Teugels, J. L. (1987). Regular variation. Encyclopedia of mathematics and its applications. Cambridge: Cambridge University Press.
5. Chao, A. (1981). On estimating the probability of discovering a new species. The Annals of Statistics, 9(6), 1339–1342.
6. Chao, A., Hsieh, T. C., Chazdon, R. L., Colwell, R. K., & Gotelli, N. J. (2015). Unveiling the species-rank abundance distribution by generalizing the Good–Turing sample coverage theory. Ecology, 96(5), 1189–1201.
7. Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4), 359–394.
8. Cohen, A., & Sackrowitz, H. B. (1990). Admissibility of estimators of the probability of unobserved outcomes. Annals of the Institute of Statistical Mathematics, 42(4), 623–636.
9. Decrouez, G., Grabchak, M., & Paris, Q. (2016). Finite sample properties of the mean occupancy counts and probabilities. Bernoulli (to appear). arXiv:1601.06537v2.
10. Dubhashi, D., & Ranjan, D. (1998). Balls and bins: A study in negative dependence. Random Structures and Algorithms, 13(2), 99–124.
11. Efron, B., & Thisted, R. (1976). Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3), 435–447.
12. Esty, W. W. (1983). A normal limit law for a nonparametric estimator of the coverage of a random sample. Annals of Statistics, 11(3), 905–912.
13. Favaro, S., Nipoti, B., & Teh, Y. W. (2016). Rediscovery of Good–Turing estimators via Bayesian nonparametrics. Biometrics, 72(1), 136–145.
14. Gnedin, A., Hansen, B., & Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: General asymptotics and power laws. Probability Surveys, 4, 146–171.
15. Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3/4), 237–264.
16. Good, I. J., & Toulmin, G. H. (1956). The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43(1–2), 45–63.
17. Grabchak, M., & Cosme, V. (2017). On the performance of Turing’s formula: A simulation study. Communications in Statistics: Simulation and Computation, 46(6), 4199–4209.
18. Gupta, V., Lennig, M., & Mermelstein, P. (1992). A language model for very large-vocabulary speech recognition. Computer Speech and Language, 6(4), 331–344.
19. Karlin, S. (1967). Central limit theorems for certain infinite urn schemes. Journal of Mathematics and Mechanics, 17, 373–401.
20. Mallows, C. L. (1968). An inequality involving multinomial probabilities. Biometrika, 55(2), 422–424.
21. Mao, C. X., & Lindsay, B. G. (2002). A Poisson model for the coverage problem with a genomic application. Biometrika, 89(3), 669–681.
22. McAllester, D. A., & Schapire, R. E. (2000). On the convergence rate of Good–Turing estimators. In COLT ’00: Proceedings of the thirteenth annual conference on computational learning theory (pp. 1–6).
23. McAllester, D. A., & Ortiz, L. E. (2003). Concentration inequalities for the missing mass and for histogram rule error. Journal of Machine Learning Research, 4(Oct), 895–911.
24. Mossel, E., & Ohannessian, M. I. (2015). On the impossibility of learning the missing mass. arXiv:1503.03613v1.
25. Ohannessian, M. I., & Dahleh, M. A. (2012). Rare probability estimation under regularly varying heavy tails. In JMLR workshop and conference proceedings (Vol. 23, pp. 21.1–21.24).
26. Resnick, S. I. (2007). Heavy-tail phenomena: Probabilistic and statistical modeling. New York: Springer.
27. Robbins, H. E. (1968). Estimating the total probability of the unobserved outcomes of an experiment. Annals of Mathematical Statistics, 39(1), 256–257.
28. Thisted, R., & Efron, B. (1987). Did Shakespeare write a newly discovered poem? Biometrika, 74(3), 445–455.
29. Zhang, C. H. (2005). Estimation of sums of random variables: Examples and information bounds. The Annals of Statistics, 33(5), 2022–2041.
30. Zhang, C. H., & Zhang, Z. (2009). Asymptotic normality of a nonparametric estimator of sample coverage. Annals of Statistics, 37(5A), 2582–2595.
31. Zhang, Z. (2013). A multivariate normal law for Turing’s formulae. Sankhya A, 75(1), 51–73.
32. Zhang, Z., & Huang, H. (2007). Turing’s formula revisited. Journal of Quantitative Linguistics, 14(2–3), 222–241.
33. Zhang, Z., & Huang, H. (2008). A sufficient normality condition for Turing’s formula. Journal of Nonparametric Statistics, 20(5), 431–446.