Abstract
The estimation of the minimum probability of a multinomial distribution is important for a variety of application areas. However, standard estimators such as the maximum likelihood estimator and the Laplace smoothing estimator fail to work reasonably in many situations: for small sample sizes these estimators are fully deterministic and completely ignore the data. Inspired by a smooth approximation of the minimum used in optimization theory, we introduce a new estimator, which takes advantage of the entire data set. We consider both the case where the number of categories is known and the case where it is unknown. We characterize the asymptotic distributions of the proposed estimator and conduct a small-scale simulation study to better understand its finite-sample performance.

References
Boyd S, Vandenberghe L (2004) Convex Optimization. Cambridge University Press, Cambridge
Chao A (1984) Nonparametric estimation of the number of classes in a population. Scandinavian J Stat 11:265–270
Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783–791
Chung K, AitSahlia F (2003) Elementary Probability Theory with Stochastic Processes and an Introduction to Mathematical Finance, 4th edn. Springer, New York
Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Philos Trans Biol Sci 345:101–118
Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, pp 1–22
Grabchak M, Marcon E, Lang G, Zhang Z (2017) The generalized Simpson’s entropy is a measure of biodiversity. PLOS ONE 12:e0173305
Grabchak M, Zhang Z (2018) Asymptotic normality for plug-in estimators of diversity indices on countable alphabets. J Nonparam Stat 30:774–795
Gu Z, Shao M, Li L, Fu Y (2012) Discriminative metric: Schatten norm vs. vector norm. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp 1213–1216
Haykin S (1994) Neural networks: a comprehensive foundation. Pearson Prentice Hall, New York
Lange M, Zühlke D, Holz O, Villmann T (2014) Applications of \(l_p\)-norms and their smooth approximations for gradient based learning vector quantization. In: ESANN 2014: Proceedings of the 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pp 271–276
May WL, Johnson WD (1998) On the singularity of the covariance matrix for estimates of multinomial proportions. J Biopharmaceut Stat 8:329–336
Shao J (2003) Mathematical Statistics, 2nd edn. Springer, New York
Turney P, Littman ML (2003) Measuring praise and criticism: inference of semantic orientation from association. ACM Trans Inf Syst 21:315–346
Zhai C, Lafferty J (2017) A study of smoothing methods for language models applied to ad hoc information retrieval. ACM SIGIR Forum 51:268–276
Zhang Z, Chen C, Zhang J (2020) Estimation of population size in entropic perspective. Commun Stat Theory Methods 49:307–324
Acknowledgements
This paper was inspired by the question of Dr. Zhiyi Zhang (UNC Charlotte): How to estimate the minimum probability of a multinomial distribution? We thank Ann Marie Stewart for her editorial help. The authors wish to thank two anonymous referees whose comments have improved the presentation of this paper. The second author’s work was funded, in part, by the Russian Science Foundation (Project No. 17-11-01098).
Ethics declarations
Conflict of interest.
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Appendix: Proofs
Throughout the section, let \(\Sigma ={\mathrm {diag}}({\mathbf{p}} )-{\mathbf{p}} {\mathbf{p}} ^T\), \(\sigma _n=\sqrt{\nabla g_n({\mathbf{p}} )^{T}\Sigma \nabla g_n({\mathbf{p}} )}\), \(\Lambda =\lim \limits _{n \rightarrow \infty } \nabla g_n({\mathbf{p}} )\), and \(\sigma =\sqrt{\Lambda ^T \Sigma \Lambda }\). It is well known that \(\Sigma\) is a positive definite matrix; see, e.g., [12]. For simplicity, we use the standard notation \(O(\cdot )\), \(o(\cdot )\), \(O_p(\cdot )\), and \(o_p(\cdot )\); see, e.g., [13] for the definitions. In the case of matrices and vectors, this notation should be interpreted componentwise.
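For ease of reference, recall the smooth-minimum form on which the estimator is based (restated here for the reader's convenience; see the main text for the precise definition and for the role of the tuning parameter \(\alpha >0\)): with \(p_k=1-\sum _{i=1}^{k-1}p_i\) and \(w= \sum \limits _{i=1}^k e^{-n^{\alpha } p_i}\),
$$\begin{aligned} g_n({\mathbf{p}} )=\frac{\sum _{i=1}^{k}p_i e^{-n^{\alpha }p_i}}{\sum _{i=1}^{k}e^{-n^{\alpha }p_i}}=\sum _{i=1}^{k}p_i\,\frac{e^{-n^{\alpha }p_i}}{w}. \end{aligned}$$
For readers who wish to experiment, the following is a minimal Python sketch of the plug-in version \(g_n(\hat{\mathbf{p}} )\); the function name, the default value of \(\alpha\), and the toy example are illustrative assumptions and are not taken from the paper.

import numpy as np

def smooth_min_estimate(counts, alpha=0.25):
    # Minimal sketch of the smooth-minimum (plug-in) estimator g_n(p_hat).
    # counts: observed multinomial counts for the k categories.
    # alpha: tuning exponent; the asymptotic results assume 0 < alpha < 0.5.
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()                     # sample size
    p_hat = counts / n                   # maximum likelihood estimates of p_1, ..., p_k
    w = np.exp(-n**alpha * p_hat)        # exponential weights favouring the smallest p_hat_i
    return float(np.sum(p_hat * w) / np.sum(w))

# Toy example with true probabilities (0.1, 0.3, 0.6), so the target is p_0 = 0.1.
rng = np.random.default_rng(0)
counts = rng.multinomial(1_000_000, [0.1, 0.3, 0.6])
print(smooth_min_estimate(counts))       # approaches 0.1 as the sample size grows

By the same argument as in Lemma A.1, the output always lies between the smallest \(\hat{p}_i\) and the smallest \(\hat{p}_i\) plus an exponentially small term.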
It may, at first, appear that Theorem 2.1 can be proved using the delta method. However, the difficulty lies in the fact that the function \(g_n(\cdot )\) depends on n. For this reason, the proof requires a more subtle approach. We begin with several lemmas.
Lemma A.1
1. There is a constant \(\epsilon >0\) such that \(p_0 \le g_n({\mathbf{p}} ) \le p_0+ (k-r)e^{-n^{\alpha } \epsilon }\). When \(r\ne k\), we can take \(\epsilon = \min \limits _{j:p_j > p_0}(p_j-p_0)\).
2. For any constant \(\beta \in {\mathbb {R}}\),
$$\begin{aligned} n^{\beta }\{g_n({\mathbf{p}} )-p_0\} \rightarrow 0 {\text{ as }} n\rightarrow \infty . \end{aligned}$$
3. For any \(1\le j \le k\) and any constant \(\beta \in {\mathbb {R}}\),
$$\begin{aligned} n^{\beta }e^{-n^{\alpha }p_j}w^{-1}\{g_n({\mathbf{p}} )-p_j\} \rightarrow 0 {\text{ as }} n\rightarrow \infty . \end{aligned}$$
Proof
We begin with the first part. First, assume that \(r=k\). In this case, it is immediate that \(g_n({\mathbf{p}} )=k^{-1}=p_0\) and the result holds with any \(\epsilon >0\). Now assume \(r\ne k\). In this case,
To show the other inequality, note that
and that, for any \(p_i>p_0\), we have
Setting \(\epsilon = \min \limits _{j:p_j> p_0}(p_j-p_0)>0\), it follows that, for \(p_i>p_0\),
We thus get
The second part follows immediately from the first. We now turn to the third part. When \(p_j=p_0\) Eq. (11) and Part 1 imply that \(e^{-n^{\alpha }p_j}w^{-1} \le r^{-1}\) and that there is an \(\epsilon >0\) such that
It follows that when \(p_j=p_0\)
On the other hand, when \(p_j > p_0\), by Part 1 there is an \(\epsilon >0\) such that
Using this and Eq. (12) gives
as \(n \rightarrow \infty\). \(\square\)
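For the first part, the chain of estimates can be summarized as follows (a sketch, using \(w\ge e^{-n^{\alpha }p_0}\) and \(p_i-p_0\le 1\)):
$$\begin{aligned} 0\le g_n({\mathbf{p}} )-p_0=\sum _{i:p_i>p_0}(p_i-p_0)\frac{e^{-n^{\alpha }p_i}}{w}\le \sum _{i:p_i>p_0}(p_i-p_0)e^{-n^{\alpha }(p_i-p_0)}\le (k-r)e^{-n^{\alpha }\epsilon }. \end{aligned}$$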
We now consider the case when the probabilities are estimated.
Lemma A.2
Let \({\mathbf{p}} ^*_n={\mathbf{p}} ^*=(p^*_1,\dots ,p^*_{k-1})\) be a sequence of random vectors with \(p^*_i\ge 0\) and \(\sum _{i=1}^{k-1} p_i^*\le 1\). Let \(p^*_k = 1-\sum _{i=1}^{k-1} p_i^*\), \(w^*= \sum \limits _{i=1}^k e^{-n^\alpha p^*_i}\), and assume that \({\mathbf{p}} _n^* \xrightarrow {} {\mathbf{p}}\) a.s. and \(n^\alpha ({\mathbf{p}} _n^* -{\mathbf{p}} )\xrightarrow {p}0\). For every \(j=1,2,\dots ,k\), we have
$$\begin{aligned} n^\alpha \left( p_j^*-p_0\right) e^{-n^\alpha p_j^*}\frac{1}{w^{*}} {\mathop {\rightarrow }\limits ^{p}}0 \end{aligned}$$
and
$$\begin{aligned} n^{\alpha }e^{-n^{\alpha }p^*_j}\frac{1}{w^{*}}\{g_n({\mathbf{p}} ^*_n)-p_j^*\} {\mathop {\rightarrow }\limits ^{p}}0 {\text{ as }} n\rightarrow \infty . \end{aligned}$$
Proof
First note that, from the definition of \(w^*\), we have
Assume that \(p_j=p_0\). In this case, the first equation follows from (13) and the fact that \(n^\alpha \left( p_j^*-p_0\right) =n^\alpha \left( p_j^*-p_j\right) {\mathop {\rightarrow }\limits ^{p}}0\). In particular, this completes the proof of the first equation in the case where \(k= r\).
Now assume that \(k\ne r\). Let \(p^*_0=\min \{p^*_i:i=1,2,\dots ,k\}\), \(\epsilon =\min \limits _{i:p_i\ne p_0}\{ p_i-p_0\}\), and \(\epsilon _n^*=\min \limits _{i:p_i\ne p_0}\{ p^*_i-p^*_0\}\). Since \({\mathbf{p}} _n^* \rightarrow {\mathbf{p}}\) a.s., it follows that \(\epsilon ^*_n\rightarrow \epsilon\) a.s. Further, by arguments similar to the proof of Eq. (12), we can show that, if \(p_j\ne p_0\) then there is a random variable N, which is finite a.s., such that for \(n\ge N\)
It follows that for such j and \(n\ge N\)
This completes the proof of the first limit.
We now allow either \(k=r\) or \(k\ne r\). For the second limit, note that
From here the result follows by the first limit and (13). \(\square\)
Lemma A.3
1. If \(r= k\), then for each \(i=1,2,\dots ,k\)
2. If \(r\ne k\), then for each \(i=1,2,\dots ,k\)
Proof
When \(r= k\), the result is immediate from (4). Now assume that \(r\ne k\). We can rearrange equation (4) as
$$\begin{aligned} \frac{\partial g_n({\mathbf{p}} )}{\partial p_i}=\frac{e^{-n^{\alpha }p_i}-e^{-n^{\alpha }p_k}}{w}+r_n, \end{aligned}$$
where \(r_n=n^{\alpha }e^{-n^{\alpha }p_i}w^{-1}\{g_n({\mathbf{p}} )-p_i\}-n^{\alpha }e^{-n^{\alpha }p_k}w^{-1}\{g_n({\mathbf{p}} )-p_k\}\). Note that Lemma A.1 implies that \(r_n\rightarrow 0\) as \(n \rightarrow \infty\). It follows that
Consider the case where \(p_k \ne p_0\) and \(p_i = p_0\). In this case, the denominator of the first fraction has \(r\) terms equal to one (\(e^{0}\)), while the remaining \(k-r\) terms go to zero individually, so the first fraction converges to \(r^{-1}\). However, since \(p_k\ne p_0\), the denominator of the second fraction has \(r\) terms of the form \(e^{-n^{\alpha } (p_0-p_k) }\), which go to \(+\infty\), while the other terms go to 0, 1, or \(+\infty\); hence the second fraction converges to zero. Thus, in this case, the limit is \(r^{-1}-0=r^{-1}\). The arguments in the other cases are similar and are thus omitted. \(\square\)
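For concreteness, when \(r\ne k\) the limits computed above can be written compactly as follows (a sketch covering all of the cases; here \(\mathbf{1}_{\{\cdot \}}\) denotes an indicator function):
$$\begin{aligned} \lim \limits _{n \rightarrow \infty }\frac{\partial g_n({\mathbf{p}} )}{\partial p_i}=\lim \limits _{n \rightarrow \infty }\frac{e^{-n^{\alpha }p_i}-e^{-n^{\alpha }p_k}}{w}=\frac{\mathbf{1}_{\{p_i=p_0\}}-\mathbf{1}_{\{p_k=p_0\}}}{r},\qquad i=1,\dots ,k-1. \end{aligned}$$
In particular, at least one coordinate of \(\Lambda\) is nonzero when \(r\ne k\), which is the fact used in the proof of Lemma A.6.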
Lemma A.4
Assume that \(r\ne k\) and let \({\mathbf{p}} ^*_n\) be as in Lemma A.2. In this case, \(\frac{\partial g_n({\mathbf{p}} )}{\partial p_i}=O(1)\), \(\frac{\partial g_n({\mathbf{p}} ^*_n)}{\partial p_i}=O_p(1)\), \(\frac{\partial ^2 g_n({\mathbf{p}} )}{\partial p_i \partial p_j}=O(n^{\alpha })\), \(\frac{\partial ^2 g_n({\mathbf{p}} ^*_n)}{\partial p_i \partial p_j}=O_p(n^{\alpha })\), \(\frac{\partial ^3 g_n({\mathbf{p}} )}{\partial p_\ell \partial p_i \partial p_j}=O(n^{2\alpha })\), and \(\frac{\partial ^3 g_n({\mathbf{p}} ^*_n)}{\partial p_\ell \partial p_i \partial p_j}=O_p(n^{2\alpha })\).
Proof
The results for the first derivatives follow immediately from (4), (13), Lemma A.2, and Lemma A.3. Now let \(\delta _{ij}\) be 1 if \(i=j\) and zero otherwise. It is straightforward to verify that
that for \(\ell \ne i\) and \(\ell \ne j\) we have
and that for \(i=j=\ell\) we have
Combining this with Lemma A.2 and the fact that \(0\le w^{-1}e^{-n^\alpha p_s}\le 1\) for any \(1 \le s \le k\) gives the result. \(\square\)
Lemma A.5
Assume that \(r\ne k\) and \(0<\alpha <0.5\). Then \(\nabla g_n(\hat{\mathbf{p}} )-\nabla g_n({\mathbf{p}} )=O_p(n^{\alpha -\frac{1}{2}})\).
Proof
By the mean value theorem, we have
where \({\mathbf{p}} ^*= {\mathbf{p}} +{\mathrm {diag}}(\varvec{\omega })(\hat{\mathbf{p}} -{\mathbf{p}} )\) for some \(\varvec{\omega } \in [0,1]^{k-1}\). Note that by the strong law of large numbers \(\hat{\mathbf{p}} \rightarrow {\mathbf{p}}\) a.s., which implies that \({\mathbf{p}} ^*-{\mathbf{p}} \rightarrow 0\) a.s. Similarly, by the multivariate central limit theorem and Slutsky's theorem, \(n^\alpha \left( \hat{\mathbf{p}} -{\mathbf{p}} \right) {\mathop {\rightarrow }\limits ^{p}}0\), which implies that \(n^\alpha \left( {\mathbf{p}} ^* -{\mathbf{p}} \right) {\mathop {\rightarrow }\limits ^{p}}0\). Thus, the assumptions of Lemma A.4 are satisfied and that lemma gives
From here, the result is immediate. \(\square\)
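In symbols, the argument can be summarized as (a sketch):
$$\begin{aligned} \nabla g_n(\hat{\mathbf{p}} )-\nabla g_n({\mathbf{p}} )=\nabla ^2 g_n({\mathbf{p}} ^*)(\hat{\mathbf{p}} -{\mathbf{p}} )=O_p(n^{\alpha })\,O_p(n^{-1/2})=O_p(n^{\alpha -\frac{1}{2}}). \end{aligned}$$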
Lemma A.6
Assume that \(r\ne k\). In this case, \(\sigma >0\) and \(\lim \limits _{n \rightarrow \infty }\sigma ^{-1}_n \sigma =1\). Further, if \(0<\alpha <0.5\), then \(\hat{\sigma }_n^{-1}\sigma _n \xrightarrow {p} 1\).
Proof
Since \(\Sigma\) is positive definite and, by Lemma A.3, \(\Lambda \ne 0\), it follows that \(\sigma >0\). From here, the fact that \(\lim \limits _{n \rightarrow \infty } \sigma _n=\sigma\) gives the first result. Now assume that \(0<\alpha <0.5\). It is easy to see that \(\hat{p}_i\hat{p}_j- p_i p_j=\hat{p}_j(\hat{p}_i-p_i)+p_i({\hat{p}}_j-p_j) =O_p(n^{-1/2})\) and \(\hat{p}_i(1-\hat{p}_i)- p_i(1-p_i)=({\hat{p}}_i-p_i)(1-p_i-{\hat{p}}_i)=O_p(n^{-1/2})\). Thus, \(\hat{\Sigma }=\Sigma +O_p(n^{-1/2})\). This, together with Lemmas A.3 and A.5, leads to
which completes the proof. \(\square\)
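In symbols, writing \(\hat{\sigma }_n\) for the plug-in version of \(\sigma _n\), the argument gives (a sketch):
$$\begin{aligned} \hat{\sigma }_n^2=\nabla g_n(\hat{\mathbf{p}} )^{T}\hat{\Sigma }\nabla g_n(\hat{\mathbf{p}} )=\nabla g_n({\mathbf{p}} )^{T}\Sigma \nabla g_n({\mathbf{p}} )+O_p(n^{\alpha -\frac{1}{2}})=\sigma _n^2+o_p(1), \end{aligned}$$
so that \(\hat{\sigma }_n^{-1}\sigma _n\xrightarrow {p}1\), since \(\sigma _n\rightarrow \sigma >0\).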
Lemma A.7
If \(r\ne k\) and \(0<\alpha <0.5\), then \(\sqrt{n}\hat{\sigma }_n^{-1}\{g_n(\hat{\mathbf{p }})-g_n({\mathbf{p}} )\}\xrightarrow {D} {{\mathcal {N}}}(0,1)\).
Proof
Taylor’s theorem implies that
where \({\mathbf{p}} ^*= {\mathbf{p}} +{\text{ diag }}(\varvec{\omega })(\hat{\mathbf{p }}-{\mathbf{p}} )\) for some \(\varvec{\omega } \in [0,1]^{k-1}\). Using Lemma A.4 and arguments similar to those used in the proof of Lemma A.5 gives \(n^{-\alpha }\nabla ^2 g_n({\mathbf{p}} ^*) =O_p(1)\), \(\sqrt{n}(\hat{\mathbf{p }}-{\mathbf{p}} )=O_p(1)\), and \(n^{\alpha }(\hat{\mathbf{p }}-{\mathbf{p}} )=o_p(1)\). It follows that the second term on the RHS above is \(o_p(1)\) and hence that
It is well known that \(\sqrt{n}(\hat{\mathbf{p }}-{\mathbf{p}} )^T\xrightarrow {D} {\mathcal N}(0,\Sigma )\). Hence
and, by Slutsky’s theorem,
By Lemma A.6, \(\sigma ^{-1}_n \sigma \rightarrow 1\), and \(\hat{\sigma }_n^{-1}\sigma _n \xrightarrow {p} 1\). Hence, the result follows by another application of Slutsky’s theorem. \(\square\)
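In outline, the expansion underlying the proof is (a sketch):
$$\begin{aligned} \sqrt{n}\{g_n(\hat{\mathbf{p}} )-g_n({\mathbf{p}} )\}&=\nabla g_n({\mathbf{p}} )^T\sqrt{n}(\hat{\mathbf{p}} -{\mathbf{p}} )+\tfrac{1}{2}\,\sqrt{n}(\hat{\mathbf{p}} -{\mathbf{p}} )^T\left[ n^{-\alpha }\nabla ^2 g_n({\mathbf{p}} ^*)\right] n^{\alpha }(\hat{\mathbf{p}} -{\mathbf{p}} )\\&=\nabla g_n({\mathbf{p}} )^T\sqrt{n}(\hat{\mathbf{p}} -{\mathbf{p}} )+o_p(1), \end{aligned}$$
after which the asymptotic normality follows from \(\sqrt{n}(\hat{\mathbf{p}} -{\mathbf{p}} )\xrightarrow {D} {{\mathcal {N}}}(0,\Sigma )\), \(\sigma _n\rightarrow \sigma >0\), and Lemma A.6.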
Lemma A.8
Let \(\mathbf {A}=-n^{-\alpha }\nabla ^2 g_n({\mathbf{p}} )\) and let \(\mathbf {I}_{k-1}\) be the \((k-1)\times (k-1)\) identity matrix. If \(r=k\), then \(\Sigma ^{\frac{1}{2}}\mathbf {A}\Sigma ^{\frac{1}{2}}=2k^{-2}\mathbf {I}_{k-1}\).
Proof
Let \(\mathbf {1}\) be the column vector in \({\mathbb {R}}^{k-1}\) with all entries equal to 1. By Eq. (16), we have
Note that \(\Sigma ={\mathrm {diag}}({\mathbf{p}} )-{\mathbf{p}} {} {\mathbf{p}} ^T=k^{-2}[k\mathbf {I}_{k-1}-\mathbf {1}\mathbf {1}^T]\). It follows that
Now multiplying both sides by \(\Sigma ^{1/2}\) on the left and \(\Sigma ^{-1/2}\) on the right gives the result. \(\square\)
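For completeness, a direct computation at the uniform point \(p_1=\dots =p_k=k^{-1}\) (a sketch, using the smooth-minimum form of \(g_n\) recalled at the beginning of this appendix) gives
$$\begin{aligned} \nabla ^2 g_n({\mathbf{p}} )=-\frac{2n^{\alpha }}{k}\left( \mathbf {I}_{k-1}+\mathbf {1}\mathbf {1}^T\right) , \qquad {\text{ so that }}\qquad \mathbf {A}\Sigma =\frac{2}{k}\left( \mathbf {I}_{k-1}+\mathbf {1}\mathbf {1}^T\right) k^{-2}\left( k\mathbf {I}_{k-1}-\mathbf {1}\mathbf {1}^T\right) =2k^{-2}\mathbf {I}_{k-1}, \end{aligned}$$
where the last equality uses \(\mathbf {1}^T\mathbf {1}=k-1\).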
Proof of Theorem 2.1
(i) If \(r \ne k\), then
The first term converges in distribution to \({\mathcal {N}}(0,1)\) by Lemma A.7, and the second term converges to zero in probability by Lemmas A.6 and A.1. The first part of the theorem then follows from Slutsky's theorem.
(ii) Assume that \(r =k\). In this case, \(g_n({\mathbf{p}} )=p_0=k^{-1}\), and by Lemma A.3, \(\nabla g_n({\mathbf{p}} )=0\). Thus, Taylor’s theorem gives
where \(r_n=- 6^{-1} \sum \limits _{q=1}^{k-1} \sum \limits _{r=1}^{k-1} \sum \limits _{s=1}^{k-1} \sqrt{n} (\hat{p}_q-p_q) \sqrt{n} (\hat{p}_r-p_r) n^{\alpha } (\hat{p}_s-p_s) n^{-2\alpha } \frac{\partial ^3 g_n({\mathbf{p}} ^*) }{\partial p_q \partial p_r \partial p_s}\), and \({\mathbf{p}} ^*= {\mathbf{p}} +{\text{ diag }}(\varvec{\omega })(\hat{\mathbf{p }}-{\mathbf{p}} )\) for some \(\varvec{\omega } \in [0,1]^{k-1}\). Lemma A.4 implies that \(n^{-2\alpha }\frac{\partial ^3 g_n({\mathbf{p}} ^*) }{\partial p_q \partial p_r \partial p_s}=O_p(1)\). Combining this with the facts that \(\sqrt{n} (\hat{p}_q-p_q)\) and \(\sqrt{n} (\hat{p}_r-p_r)\) are \(O_p(1)\) and that, for \(\alpha \in (0,0.5)\), \(n^{\alpha } (\hat{p}_s-p_s)=o_p(1)\), it follows that \(r_n{\mathop {\rightarrow }\limits ^{p}}0\).
Let \(\mathbf {x}_n=\sqrt{n}(\hat{\mathbf{p}} -{\mathbf{p}} )\), \(\mathbf {T}_n=\Sigma ^{-\frac{1}{2}}\mathbf {x}_n\), and \(\mathbf {A}=-n^{-\alpha }\nabla ^2 g_n({\mathbf{p}} )\). Lemma A.8 implies that
Since \(\mathbf {x}_n \xrightarrow {D} {{\mathcal {N}}}(0, \Sigma )\), we have \(\mathbf {T}_n\xrightarrow {D} \mathbf {T}\), where \(\mathbf {T}\sim {{\mathcal {N}}}(0,\mathbf {I}_{k-1}).\) Let \(T_i\) be the ith component of vector \({\mathbf {T}}\). Applying the continuous mapping theorem, we obtain
Thus, Eq. (22) becomes
The result follows from the fact that the \(T_i^2\) are independent and identically distributed chi-square random variables, each with one degree of freedom, so that \(\sum \limits _{i=1}^{k-1}T_i^2\) follows the chi-square distribution with \(k-1\) degrees of freedom. \(\square\)
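As a quick numerical illustration of part (ii), the following Python sketch compares the sample mean of \(n^{1-\alpha }\{g_n(\hat{\mathbf{p}} )-k^{-1}\}\) in the uniform case with the mean \(-(k-1)/k^{2}\) of the limit \(-k^{-2}\chi ^2_{k-1}\) obtained in the proof above; the constants, the helper function, and the Monte Carlo design are illustrative assumptions, not taken from the paper's simulation study.

import numpy as np

def smooth_min_estimate(counts, alpha):
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    p_hat = counts / n
    w = np.exp(-n**alpha * p_hat)              # smooth-minimum weights
    return float(np.sum(p_hat * w) / np.sum(w))

k, alpha, n, reps = 4, 0.25, 100_000, 2000
rng = np.random.default_rng(1)
p = np.full(k, 1.0 / k)                        # uniform case, so r = k
vals = np.empty(reps)
for b in range(reps):
    counts = rng.multinomial(n, p)
    vals[b] = n**(1 - alpha) * (smooth_min_estimate(counts, alpha) - 1.0 / k)
# Mean of the simulated statistic versus the mean -(k-1)/k^2 of the claimed limit.
print(vals.mean(), -(k - 1) / k**2)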
The proof of Theorem 4.1 is very similar to that of Theorem 2.1 and is thus omitted. However, to help the reader to reconstruct the proof, we note that the partial derivatives of \(g_n^\vee\) can be calculated using the facts that
Further, we formulate a version of Lemmas A.1 and A.2 for the maximum.
Lemma A.9
1. There is a constant \(\epsilon >0\) such that \(p_\vee - (k-r_\vee )e^{-n^{\alpha } \epsilon } \le g^\vee _n({\mathbf{p}} ) \le p_\vee\). When \(r_\vee \ne k\), we can take \(\epsilon = \min \limits _{j:p_j < p_\vee }(p_\vee -p_j)\).
2. For any constant \(\beta \in {\mathbb {R}}\),
$$\begin{aligned} n^{\beta }\{g^\vee _n({\mathbf{p}} )-p_\vee \} \rightarrow 0 {\text{ as }} n\rightarrow \infty . \end{aligned}$$
3. For any \(1\le j \le k\) and any constant \(\beta \in {\mathbb {R}}\),
$$\begin{aligned} n^{\beta }\frac{e^{n^{\alpha }p_j}}{w^\vee }\{g_n^\vee ({\mathbf{p}} )-p_j\} \rightarrow 0 {\text{ as }} n\rightarrow \infty . \end{aligned}$$
4. If \({\mathbf{p}} ^*_n\) is as in Lemma A.2 and \(w^{\vee *}= \sum \limits _{i=1}^k e^{n^\alpha p^*_i}\), then for every \(j=1,2,\dots ,k\) we have
$$\begin{aligned} n^\alpha \left( p_j^*-p_\vee \right) e^{n^\alpha p_j^*}\frac{1}{w^{\vee *}} {\mathop {\rightarrow }\limits ^{p}}0 \end{aligned}$$
and
$$\begin{aligned} n^{\alpha }e^{n^{\alpha }p^*_j}\frac{1}{w^{\vee *}}\{g^\vee _n({\mathbf{p}} ^*_n)-p_j^*\} {\mathop {\rightarrow }\limits ^{p}}0 {\text{ as }} n\rightarrow \infty . \end{aligned}$$
Proof
We only prove the first part, as proofs of the rest are similar to those of Lemmas A.1 and A.2. If \(r_\vee =k\), then \(g_n^\vee ({\mathbf{p}} )=1/k=p_\vee\) and the result holds with any \(\epsilon >0\). Now, assume that \(k\ne r_\vee\) and let \(\epsilon\) be as defined above. First note that
Note further that for \(p_i<p_\vee\)
It follows that
From here the result follows. \(\square\)
Cite this article
Mahzarnia, A., Grabchak, M. & Jiang, J. Estimation of the Minimum Probability of a Multinomial Distribution. J Stat Theory Pract 15, 24 (2021). https://doi.org/10.1007/s42519-020-00163-y