Abstract
Predicting the numbers of distinct word n-grams and their frequency distributions in text corpora is important in domains such as information processing and language modelling. With big data corpora, application complexity increases because of the large volume of data. Traditional studies have been confined to corpora of small or moderate size, and have led to statistical laws on word frequency distributions. For very large corpora, however, some of the assumptions underlying those laws need to be revised, namely those concerning the corpus vocabulary and the numbers of word occurrences. Although it thus becomes critical to know how corpus size influences those distributions, there is a lack of models that characterise this influence. This paper aims to fill this gap, enabling the prediction of the impact of corpus growth on the time and space complexities of applications. It presents a fully principled model which, distinctively, considers both words and multiwords in very large corpora. Assuming a finite n-gram vocabulary for a language, the model predicts, as a function of corpus size, the cumulative numbers of distinct n-grams with frequency greater than or equal to a given value, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams. The model applies to low occurrence frequencies, which encompass the larger populations of n-grams. Practical assessment with real corpora shows relative errors around \(3\%\), stable over the considered ranges of n-gram frequencies, n-gram sizes and corpus sizes, from millions to billions of words, for English and French.
Notes
1. Obviously, this excludes corpus collection and n-gram counting, which are performed only once, for parameter estimation and validation of the model.
2. Equation (1) is equivalent to \(\frac{\frac{dD(k,C)}{D(k,C)}}{\frac{dC}{C}}=g_k \, \frac{V - D(k,C)}{V}\). The infinite V assumption would imply that the ratio on the left-hand side of the equation is constant (equal to \(g_k\)) with respect to C, but empirical observations showed that this ratio decreases instead. This decrease is captured by the vocabulary finiteness assumption (the second factor).
3. Indeed, (2) can be written as \(\frac{V - D(k,C)}{D(k,C)}= (h_k\,C)^{-g_k} \), which, for each k and n, is a power law with respect to C, since \(g_k\) and \(h_k\) were found to be constant with respect to C.
4. Empirical counts were obtained from the corpora with the help of Carlos Gonçalves.
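The relations quoted in Notes 2 and 3 can be checked numerically. The sketch below solves the Note 3 form of (2) for \(D(k,C)\), giving \(D(k,C) = V / (1 + (h_k\,C)^{-g_k})\), and compares a finite-difference estimate of the left-hand ratio in Note 2 with \(g_k \, (V - D(k,C))/V\). The function names and the values of `V`, `g_k` and `h_k` are hypothetical, chosen only for illustration; they are not parameters estimated in the paper.

```python
def distinct_ngrams(C, V, g_k, h_k):
    """D(k, C) obtained by solving (V - D) / D = (h_k * C) ** (-g_k) for D."""
    return V / (1.0 + (h_k * C) ** (-g_k))


def elasticity(C, V, g_k, h_k, rel_step=1e-6):
    """Estimate (dD/D) / (dC/C) with a central difference in C."""
    dC = C * rel_step
    d_hi = distinct_ngrams(C + dC, V, g_k, h_k)
    d_lo = distinct_ngrams(C - dC, V, g_k, h_k)
    d_mid = distinct_ngrams(C, V, g_k, h_k)
    return ((d_hi - d_lo) / d_mid) / (2.0 * rel_step)


# Hypothetical parameter values (assumptions, not taken from the paper).
V = 5.0e7     # finite n-gram vocabulary size
g_k = 0.8     # exponent, constant with respect to C
h_k = 1.0e-8  # scale, constant with respect to C

for C in (1e6, 1e7, 1e8, 1e9):
    D = distinct_ngrams(C, V, g_k, h_k)
    lhs = elasticity(C, V, g_k, h_k)   # left-hand ratio in Note 2
    rhs = g_k * (V - D) / V            # right-hand side in Note 2
    print(f"C={C:.0e}  D={D:.4e}  lhs={lhs:.6f}  rhs={rhs:.6f}")
```

Under these assumed values the two sides agree up to finite-difference error, and the ratio falls as C grows, which is the behaviour that Note 2 attributes to the finite vocabulary.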
Acknowledgment
This work is supported by NOVA LINCS (UIDB/04516/2020) with the financial support of FCT.IP.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Silva, J.F., Cunha, J.C. (2024). How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_16
DOI: https://doi.org/10.1007/978-981-97-2259-4_16
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-2261-7
Online ISBN: 978-981-97-2259-4
eBook Packages: Computer Science, Computer Science (R0)