Asymptotically Normal Estimators for Zipf’s Law

Chebunin, Mikhail; Kovalevskii, Artyom

doi:10.1007/s13171-018-0135-9

Published: 13 July 2018

Sankhya A, Volume 81, pages 482–492 (2019)

Abstract

We study an infinite urn scheme with probabilities corresponding to a power function. Urns here represent words from an infinitely large vocabulary. We propose asymptotically normal estimators of the exponent of the power function. The estimators use the number of different elements and a few similar statistics. If we use only one of the statistics, we need to know the asymptotics of a normalizing constant (a function of a parameter), and all the estimators are implicit in this case. If we use two statistics, the estimators are explicit, but their rates of convergence are lower than those of the estimators with a known normalizing constant.
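The urn scheme described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes probabilities p_i proportional to i^(-1/θ) with 0 < θ < 1, truncates the infinite vocabulary at a hypothetical `n_urns` so that sampling is finite, and counts the number of non-empty urns together with the number of urns holding exactly one ball. The ratio of these two counts is a classical consistent quantity for θ under regular variation (Karlin, 1967); it is shown only as an example of an explicit two-statistic construction and is not claimed to be one of the estimators proposed here.

```python
import random
from collections import Counter
from itertools import accumulate


def simulate_counts(theta, n, n_urns=200_000, seed=0):
    """Throw n balls into urns with p_i proportional to i**(-1/theta).

    The scheme in the text is infinite; n_urns is a truncation introduced
    here (an assumption) only to make the simulation finite.
    Returns (non-empty urns, urns containing exactly one ball).
    """
    rng = random.Random(seed)
    # Unnormalised cumulative weights; random.choices normalises internally.
    cum_weights = list(
        accumulate(i ** (-1.0 / theta) for i in range(1, n_urns + 1))
    )
    balls = rng.choices(range(n_urns), cum_weights=cum_weights, k=n)
    occupancy = Counter(balls)
    distinct = len(occupancy)
    singletons = sum(1 for count in occupancy.values() if count == 1)
    return distinct, singletons


distinct, singletons = simulate_counts(theta=0.5, n=50_000)
ratio = singletons / distinct  # should be roughly theta for large n
print(distinct, singletons, round(ratio, 3))
```

For θ close to 1 the tail of the distribution is heavy, so the truncation level `n_urns` would have to grow considerably before the simulated counts resemble those of the infinite scheme.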

References

Bahadur, R.R. (1960). On the number of distinct values in a large sample from an infinite discrete distribution. Proceedings of the National Institute of Sciences of India 26A, Supp II, 67–75.
Barbour, A.D. (2009). Univariate approximations in the infinite occupancy scheme. Alea 6, 415–433.
Barbour, A.D. and Gnedin, A.V. (2009). Small counts in the infinite occupancy scheme. Electron. J. Probab. 14, 365–384.
Ben-Hamou, A., Boucheron, S. and Gassiat, E. (2016). Pattern coding meets censoring: (almost) adaptive coding on countable alphabets. arXiv:1608.08367.
Ben-Hamou, A., Boucheron, S. and Ohannessian, M.I. (2017). Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications. Bernoulli 23, 249–287.
Bogachev, L.V., Gnedin, A.V. and Yakubovich, Y.V. (2008). On the variance of the number of occupied boxes. Adv. Appl. Math. 40, 401–432.
Boonta, S. and Neammanee, K. (2007). Bounds on random infinite urn model. Bull. Malays. Math. Sci. Soc. Second Series 30.2, 121–128.
Chebunin, M.G. (2014). Estimation of parameters of probabilistic models which is based on the number of different elements in a sample. Sib. Zh. Ind. Mat. 17:3, 135–147. (in Russian).
Chebunin, M. and Kovalevskii, A. (2016). Functional central limit theorems for certain statistics in an infinite urn scheme. Statist. Probab. Lett. 119, 344–348.
Durieu, O. and Wang, Y. (2016). From infinite urn schemes to decompositions of self-similar Gaussian processes. Electron. J. Probab. 21, 43.
Dutko, M. (1989). Central limit theorems for infinite urn models. Ann. Probab. 17, 1255–1263.
Gnedin, A., Hansen, B. and Pitman, J. (2007). Notes on the occupancy problem with infinitely many boxes: general asymptotics and power laws. Probab. Surv. 4, 146–171.
Grubel, R. and Hitczenko, P. (2009). Gaps in discrete random samples. J. Appl. Probab. 46, 1038–1051.
Heaps, H.S. (1978). Information retrieval, computational and theoretical aspects. Academic Press.
Herdan, G. (1960). Type-token mathematics. The Hague, Mouton.
Hwang, H.-K. and Janson, S. (2008). Local limit theorems for finite and infinite urn models. Ann. Probab. 36, 992–1022.
Karlin, S. (1967). Central limit theorems for certain infinite urn schemes. J. Math. Mech. 17, 373–401.
Key, E.S. (1992). Rare Numbers. J. Theor. Probab. 5, 375–389.
Key, E.S. (1996). Divergence rates for the number of rare numbers. J. Theor. Probab. 9, 413–428.
Khmaladze, E.V. (2011). Convergence properties in certain occupancy problems including the Karlin-Rouault law. J. Appl. Probab. 48, 1095–1113.
Mandelbrot, B. (1965). Information theory and psycholinguistics. In Scientific psychology. Basic Books, (B.B. Wolman and E. Nagel, eds.)
Muratov, A. and Zuyev, S. (2016). Bit flipping and time to recover. J. Appl. Probab. 53, 650–666.
Nicholls, P.T. (1987). Estimation of Zipf parameters. J. Am. Soc. Inf. Sci. 38, 443–445.
Ohannessian, M.I. and Dahleh, M.A. (2012). Rare probability estimation under regularly varying heavy tails. In Proceedings of the 25th Annual Conference on Learning Theory PMLR, pp. 23:21.1–21.24.
Petersen, A.M., Tenenbaum, J.N., Havlin, S., Stanley, H.E. and Perc, M. (2012). Languages cool as they expand: allometric scaling and the decreasing need for new words. Scientific Reports 2. Article No 943.
Zakrevskaya, N.S. and Kovalevskii, A.P. (2001). One-parameter probabilistic models of text statistics. Sib. Zh. Ind. Mat. 4:2, 142–153. (in Russian).
Zipf, G.K. (1949). Human behavior and the principle of least effort. University Press, Cambridge.

Acknowledgments

Our research was partially supported by RFBR grant 17-01-00683 and by the program of fundamental scientific research of the SB RAS No. I.1.3., project No. 0314-2016-0008.

Author information

Authors and Affiliations

Sobolev Institute of Mathematics, Novosibirsk, Russia
Mikhail Chebunin
Novosibirsk State University, Novosibirsk, Russia
Mikhail Chebunin & Artyom Kovalevskii
Novosibirsk State Technical University, Novosibirsk, Russia
Artyom Kovalevskii

Corresponding author

Correspondence to Mikhail Chebunin.

Appendix: Functional Central Limit Theorem

For t ∈ [0, 1] and k ≥ 1, let

$${Y}_{n,k}^{*}(t) = \frac{{R}_{[nt],k}^{*} - \mathbf{E} {R}_{[nt],k}^{*}}{(\alpha(n))^{1/2}}, \qquad Y_{n,k}(t) = \frac{R_{[nt],k} - \mathbf{E} R_{[nt],k}}{(\alpha(n))^{1/2}}. $$

Theorem 4.

Assume that (1.2) holds and that ν ≥ 1 is an integer. Then the random process $ \left ((Y^{*}_{n,1}(t), Y_{n,1}(t),\ldots , Y_{n,\nu }(t)), 0 \leq t \leq 1 \right ) $ converges weakly in the uniform metric in D(0, 1) to a (ν + 1)-dimensional Gaussian process with continuous sample paths, zero expectation and covariance function $(c_{ij}(\tau ,t))_{i,j = 0}^{\nu }$,

$$\begin{array}{@{}rcl@{}}
c_{ij}(\tau,t) &=& \frac{\theta \tau^{i} (t-\tau)^{j-i} t^{\theta-j} {\Gamma}(j-\theta)}{i!(j-i)!} - \frac{\theta \tau^{i} t^{j} (t+\tau)^{\theta-i-j} {\Gamma}(i+j-\theta)}{i!j!}\\
&& \text{for } 1 \leq i \leq j,\ \tau \leq t,\\
c_{ij}(\tau,t) &=& - \frac{\theta \tau^{i} t^{j} (t+\tau)^{\theta-i-j} {\Gamma}(i+j-\theta)}{i!j!} \quad \text{for } i > j \geq 1,\ \tau \leq t,\\
c_{00}(\tau,t) &=& \left( (t+\tau)^{\theta}-t^{\theta}\right) {\Gamma}(1-\theta) \quad \text{for } \tau \leq t,\\
c_{i0}(\tau,t) &=& - \frac{\theta \tau^{i} (t+\tau)^{\theta-i} {\Gamma}(i-\theta)}{i!} \quad \text{for } i > 0,\ \tau \leq t,\\
c_{0j}(\tau,t) &=& \frac{\theta ((t-\tau)^{j} t^{\theta-j} - t^{j} (t+\tau)^{\theta-j}) {\Gamma}(j-\theta)}{j!} \quad \text{for } j > 0,\ \tau \leq t,
\end{array} $$

$c_{ji}(t, \tau) = c_{ij}(\tau, t)$.

Proof.

Theorem 3 of Chebunin and Kovalevskii (2016) states weak convergence of the vector random process $ \left ((Y^{*}_{n,1}(t), \ldots , Y^{*}_{n,\nu }(t)), 0 \leq t \leq 1 \right ) $ in the uniform metric in D(0, 1) to a (ν + 1)-dimensional Gaussian process with continuous sample paths, zero expectation and covariance function $(c^{*}_{ij}(\tau ,t))_{i,j = 0}^{\nu }$.

The main focus of that paper was to prove tightness of the components $(Y^{*}_{n,i}(t), 0 \leq t \leq 1 )$ by Poissonization and the construction of an appropriate inequality for covariances.

As $Y_{n,i}(t)=Y^{*}_{n,i}(t)-Y^{*}_{n,i+1}(t)$, we obtain tightness of the components $(Y_{n,i}(t), 0 \leq t \leq 1)$ and calculate $c_{ij}(\tau ,t)$ by the formulas

$$c_{ij}(\tau,t)=c^{*}_{ij}(\tau,t)-c^{*}_{i + 1,j}(\tau,t)-c^{*}_{i,j + 1}(\tau,t) +c^{*}_{i + 1,j + 1}(\tau,t), $$

$$c_{0j}(\tau,t)=c^{*}_{1j}(\tau,t)-c^{*}_{1,j + 1}(\tau,t), \qquad c_{i0}(\tau,t)=c^{*}_{i1}(\tau,t)-c^{*}_{i + 1,1}(\tau,t). $$

The proof is complete. □
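The covariance formulas used in the proof are an instance of bilinearity of the covariance. The sketch below writes $Y^{*}_{i}$ and $Y_{i}$ for the limiting components (notation introduced only for this illustration) and assumes that $R_{n,i}$ counts urns with exactly i balls while $R^{*}_{n,i}$ counts urns with at least i balls, so that $Y_{i} = Y^{*}_{i} - Y^{*}_{i+1}$:

$$\begin{array}{@{}rcl@{}}
c_{ij}(\tau,t) &=& \mathrm{cov}\left( Y^{*}_{i}(\tau) - Y^{*}_{i+1}(\tau),\ Y^{*}_{j}(t) - Y^{*}_{j+1}(t) \right)\\
&=& c^{*}_{ij}(\tau,t) - c^{*}_{i+1,j}(\tau,t) - c^{*}_{i,j+1}(\tau,t) + c^{*}_{i+1,j+1}(\tau,t),\\
c_{0j}(\tau,t) &=& \mathrm{cov}\left( Y^{*}_{1}(\tau),\ Y^{*}_{j}(t) - Y^{*}_{j+1}(t) \right) = c^{*}_{1j}(\tau,t) - c^{*}_{1,j+1}(\tau,t).
\end{array} $$

The expression for $c_{i0}$ follows in the same way with the roles of the two arguments exchanged.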

The limiting (ν + 1)-dimensional Gaussian process is self-similar with Hurst parameter H = θ/2 < 1/2. Its first component coincides in distribution with the first component of the limiting process in Theorem 1 of Durieu and Wang (2016).
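The value of the Hurst parameter can be read off from the covariance function, since every formula in Theorem 4 is homogeneous of degree θ in (τ, t). As an illustrative check for the component $c_{00}$, with a scaling factor a > 0 (a symbol introduced here only for this sketch):

$$c_{00}(a\tau, at) = \left( (at+a\tau)^{\theta} - (at)^{\theta} \right) {\Gamma}(1-\theta) = a^{\theta} c_{00}(\tau,t). $$

For a zero-mean Gaussian process, self-similarity with index H means that the covariance scales by $a^{2H}$ when both time arguments are multiplied by a, so 2H = θ.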

We need the following corollary to calculate the limiting variance in Theorem 2.

Corollary 3.

Under the assumptions of Theorem 4, the random vector $(Y^{*}_{n,1}(1), Y_{n,1}(1))$ converges weakly to a normal vector with zero mean and covariance matrix

$${\Gamma}(1-\theta) \left( \begin{array}{cc} 2^{\theta}-1 & -\theta 2^{\theta-1}\\ -\theta 2^{\theta-1} & \theta(1-2^{\theta-2}(1-\theta)) \end{array} \right). $$
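The matrix in Corollary 3 is the covariance function of Theorem 4 evaluated at τ = t = 1 for the indices 0 and 1. A minimal numerical sketch of that substitution is given below; the function names `cov` and `corollary_matrix` are hypothetical helpers written for this illustration, and the code simply transcribes the formulas as stated above for τ ≤ t and 0 < θ < 1.

```python
from math import factorial, gamma, isclose


def cov(i, j, tau, t, theta):
    """c_ij(tau, t) as stated in Theorem 4, for tau <= t and 0 < theta < 1."""
    if tau > t:
        raise ValueError("the formulas are stated for tau <= t")
    if i == 0 and j == 0:
        return ((t + tau) ** theta - t ** theta) * gamma(1 - theta)
    if j == 0:  # i > 0
        return (-theta * tau ** i * (t + tau) ** (theta - i)
                * gamma(i - theta) / factorial(i))
    if i == 0:  # j > 0
        bracket = ((t - tau) ** j * t ** (theta - j)
                   - t ** j * (t + tau) ** (theta - j))
        return theta * bracket * gamma(j - theta) / factorial(j)
    # Term shared by the cases 1 <= i <= j and i > j >= 1.
    cross = (theta * tau ** i * t ** j * (t + tau) ** (theta - i - j)
             * gamma(i + j - theta) / (factorial(i) * factorial(j)))
    if i > j:
        return -cross
    first = (theta * tau ** i * (t - tau) ** (j - i) * t ** (theta - j)
             * gamma(j - theta) / (factorial(i) * factorial(j - i)))
    return first - cross


def corollary_matrix(theta):
    """Covariance matrix from Corollary 3 as a nested list."""
    g = gamma(1 - theta)
    off = -theta * 2 ** (theta - 1)
    return [
        [g * (2 ** theta - 1), g * off],
        [g * off, g * theta * (1 - 2 ** (theta - 2) * (1 - theta))],
    ]


theta = 0.5
from_theorem = [[cov(i, j, 1.0, 1.0, theta) for j in (0, 1)] for i in (0, 1)]
print(from_theorem)
print(corollary_matrix(theta))
```

The two printed matrices agree up to floating-point rounding, which relies on the identity Γ(2 − θ) = (1 − θ)Γ(1 − θ) for the lower-right entry.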

About this article

Cite this article

Chebunin, M., Kovalevskii, A. Asymptotically Normal Estimators for Zipf’s Law. Sankhya A 81, 482–492 (2019). https://doi.org/10.1007/s13171-018-0135-9

Received: 14 June 2017
Published: 13 July 2018
Issue Date: December 2019
DOI: https://doi.org/10.1007/s13171-018-0135-9

Keywords and phrases.

AMS (2000) subject classification.

Primary 62F10; Secondary 62F12
