How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams

Silva, Joaquim F.; Cunha, Jose C.

doi:10.1007/978-981-97-2259-4_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14647)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Predicting the numbers of distinct word n-grams and their frequency distributions in text corpora is important in domains such as information processing and language modelling. With big data corpora, application complexity increases because of the large volume of data. Traditional studies have been confined to small or moderate-size corpora, leading to statistical laws on word frequency distributions. However, for very large corpora, some of the assumptions underlying those laws, related to the corpus vocabulary and the numbers of word occurrences, need to be revised. So, although it becomes critical to know how corpus size influences those distributions, models that characterise this influence are lacking. This paper aims to fill this gap, enabling prediction of the impact of corpus growth on application time and space complexities. It presents a fully principled model which, distinctively, considers words and multiwords in very large corpora. Assuming a finite n-gram vocabulary for a language, the model predicts, as a function of corpus size, the cumulative number of distinct n-grams with frequency greater than or equal to a given value, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams. The model applies to low occurrence frequencies, which encompass the larger populations of n-grams. Practical assessment with real corpora shows relative errors around \(3\%\), stable over the considered ranges of n-gram frequencies, n-gram sizes and corpus sizes from millions to billions of words, for English and French.
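The two quantities the abstract says the model predicts can be illustrated by counting them directly on a toy corpus. The following is a minimal sketch, not the authors' counting pipeline: the function names and the whitespace tokenisation are assumptions made only for illustration. It computes the cumulative number of distinct n-grams with frequency at least k, and the size of the group of n-grams with frequency exactly k as the difference between two consecutive cumulative counts.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count occurrences of each word n-gram (stored as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cumulative_distinct(counts, k):
    """Number of distinct n-grams that occur at least k times."""
    return sum(1 for c in counts.values() if c >= k)

def group_size(counts, k):
    """Number of distinct n-grams that occur exactly k times,
    obtained as the difference of two cumulative counts."""
    return cumulative_distinct(counts, k) - cumulative_distinct(counts, k + 1)

# Toy corpus with naive whitespace tokenisation (illustrative assumption).
tokens = "the cat sat on the mat and the cat ran".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

for k in (1, 2, 3):
    print(k, cumulative_distinct(unigrams, k), group_size(unigrams, k))
```

On real corpora the same counts would be taken at several corpus sizes, so that the cumulative count can be studied as a function of corpus size for each k and n.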

Notes

1. Obviously, this excludes corpus collection and n-gram counting, which are done only once, for parameter estimation and validation of the model.
2. Equation (1) is equivalent to \(\frac{\frac{dD(k,C)}{D(k,C)}}{\frac{dC}{C}}=g_k \, \frac{V - D(k,C)}{V}\). The infinite V assumption would imply that the ratio on the left-hand side of the equation is constant (equal to \(g_k\)) with respect to C, but empirical observations showed that the ratio decreases instead. This decrease is captured by the vocabulary finiteness assumption (the second factor on the right-hand side).
3. Indeed, (2) can be written as \(\frac{V - D(k,C)}{D(k,C)}= (h_k\,C)^{-g_k} \), which, for each k and n, is a power law with respect to C, since \(g_k\) and \(h_k\) were found to be constant with respect to C.
4. Empirical counts were obtained from the corpora with the help of Carlos Gonçalves.
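
The relations in Notes 2 and 3 can be checked against each other numerically. Solving the form in Note 3 for D gives \(D(k,C) = V / (1 + (h_k\,C)^{-g_k})\). The sketch below evaluates this expression and compares a finite-difference estimate of the left-hand ratio in Note 2 with \(g_k\,(V - D(k,C))/V\). The values of V, \(g_k\) and \(h_k\) are hypothetical, chosen only for illustration, and are not the parameters estimated in the paper; the function names are likewise assumptions.

```python
import math

def distinct_ngrams(C, V, g_k, h_k):
    """D(k, C) obtained by solving Note 3 for D: V / (1 + (h_k * C) ** (-g_k))."""
    return V / (1.0 + (h_k * C) ** (-g_k))

def elasticity(C, V, g_k, h_k, rel_step=1e-6):
    """Central-difference estimate of (dD/D) / (dC/C) at corpus size C."""
    dC = C * rel_step
    d_plus = distinct_ngrams(C + dC, V, g_k, h_k)
    d_minus = distinct_ngrams(C - dC, V, g_k, h_k)
    d_mid = distinct_ngrams(C, V, g_k, h_k)
    return ((d_plus - d_minus) / d_mid) / (2.0 * dC / C)

# Hypothetical parameter values, for illustration only (not from the paper).
V, g_k, h_k = 1.0e8, 0.8, 1.0e-8

for C in (1e6, 1e7, 1e8, 1e9):
    D = distinct_ngrams(C, V, g_k, h_k)
    lhs = elasticity(C, V, g_k, h_k)   # left-hand side of the form in Note 2
    rhs = g_k * (V - D) / V            # right-hand side of the form in Note 2
    print(f"C={C:.0e}  D={D:.4e}  elasticity={lhs:.6f}  g_k*(V-D)/V={rhs:.6f}")
```

With these illustrative values the two columns agree at every corpus size, and the ratio falls below \(g_k\) as C grows, which is the decrease that Note 2 attributes to the finite vocabulary.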

Acknowledgment

This work is supported by NOVA LINCS (UIDB/04516/2020) with the financial support of FCT.IP.

Author information

Authors and Affiliations

NOVA LINCS, NOVA School of Science and Technology, Caparica, Portugal
Joaquim F. Silva & Jose C. Cunha

Corresponding author

Correspondence to Joaquim F. Silva.

Editor information

Editors and Affiliations

Academia Sinica, Taipei, Taiwan
De-Nian Yang
Microsoft Research Asia, Beijing, China
Xing Xie
National Yang Ming Chiao Tung University, Hsinchu, Taiwan
Vincent S. Tseng
Duke University, Durham, NC, USA
Jian Pei
National Cheng Kung University, Tainan, Taiwan
Jen-Wei Huang
Silesian University of Technology, Gliwice, Poland
Jerry Chun-Wei Lin

About this paper

Cite this paper

Silva, J.F., Cunha, J.C. (2024). How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science (LNAI), vol 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_16

DOI: https://doi.org/10.1007/978-981-97-2259-4_16
Published: 25 April 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2261-7
Online ISBN: 978-981-97-2259-4
eBook Packages: Computer Science, Computer Science (R0)
