How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2024)

Abstract

Predicting the numbers of distinct word n-grams and their frequency distributions in text corpora is important in domains such as information processing and language modelling. With big data corpora, application complexity increases due to the large volume of data. Traditional studies have been confined to small or moderate-size corpora, leading to statistical laws on word frequency distributions. However, for very large corpora, some of the assumptions underlying those laws, related to the corpus vocabulary and the numbers of word occurrences, need to be revised. So, although it becomes critical to know how corpus size influences those distributions, there is a lack of models that characterise such influence. This paper aims to fill this gap, enabling the prediction of the impact of corpus growth on application time and space complexities. It presents a fully principled model which, distinctively, considers words and multiwords in very large corpora. Assuming a finite n-gram vocabulary, the model predicts, for a given language and as a function of corpus size, the cumulative numbers of distinct n-grams with frequency greater than or equal to a given value, as well as the sizes of equal-frequency n-gram groups, from unigrams to hexagrams. The model applies to low occurrence frequencies, which encompass the larger populations of n-grams. Practical assessment with real corpora shows relative errors around \(3\%\), stable over the considered ranges of n-gram frequencies, n-gram sizes and corpus sizes from millions to billions of words, for English and French.
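
The two quantities the model predicts can be made concrete by counting them directly on a toy token list. The following is a minimal sketch only: the function names (`ngram_counts`, `cumulative_distinct`, `group_size`) and the toy sentence are illustrative assumptions, not taken from the paper, which predicts these quantities with a model rather than by exhaustive counting.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count occurrences of every word n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cumulative_distinct(counts, k):
    """Number of distinct n-grams occurring at least k times."""
    return sum(1 for c in counts.values() if c >= k)


def group_size(counts, k):
    """Number of distinct n-grams occurring exactly k times."""
    return cumulative_distinct(counts, k) - cumulative_distinct(counts, k + 1)


# Toy corpus (hypothetical); the corpora assessed in the paper range from
# millions to billions of words.
tokens = "the cat sat on the mat and the cat ran".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

for k in (1, 2, 3):
    print(k, cumulative_distinct(unigrams, k), group_size(unigrams, k))
```

The size of an equal-frequency group is obtained here as the difference between two consecutive cumulative counts, which is why a model of the cumulative numbers also yields the group sizes.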

Notes

  1. Obviously, this excludes corpus collection and n-gram counting, which are done only once, for parameter estimation and validation of the model.

  2. Equation (1) is equivalent to \(\frac{\frac{dD(k,C)}{D(k,C)}}{\frac{dC}{C}}=g_k \, \frac{V - D(k,C)}{V}\). The infinite V assumption would imply that the ratio on the left side of the equation is constant (equal to \(g_k\)) with respect to C, but empirical observations showed that this ratio decreases instead. This decrease is captured by the vocabulary finiteness assumption (the second factor on the right side).

  3. Indeed, (2) can be written as \(\frac{V - D(k,C)}{D(k,C)}= (h_k\,C)^{-g_k}\), which, for each k and n, is a power law with respect to C, since \(g_k\) and \(h_k\) were found to be constant with respect to C.

  4. Empirical counts were obtained from the corpora with the help of Carlos Gonçalves.
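
Notes 2 and 3 can be checked against each other numerically. Solving the form in note 3 for D gives \(D(k,C) = V / (1 + (h_k\,C)^{-g_k})\). The sketch below evaluates that expression and compares a finite-difference estimate of the left side of the equation in note 2 with its right side. The values chosen for V, \(g_k\) and \(h_k\) are arbitrary placeholders, not estimates from the paper, and the function names are hypothetical.

```python
import math


def distinct_ngrams(C, V, g_k, h_k):
    """D(k, C) obtained by solving (V - D) / D = (h_k * C) ** (-g_k) for D."""
    return V / (1.0 + (h_k * C) ** (-g_k))


def elasticity(C, V, g_k, h_k, rel_step=1e-6):
    """Central finite-difference estimate of (dD / D) / (dC / C) at corpus size C."""
    hi = distinct_ngrams(C * (1.0 + rel_step), V, g_k, h_k)
    lo = distinct_ngrams(C * (1.0 - rel_step), V, g_k, h_k)
    d_log_c = math.log(1.0 + rel_step) - math.log(1.0 - rel_step)
    return (math.log(hi) - math.log(lo)) / d_log_c


# Placeholder parameters (assumptions for illustration only).
V, g_k, h_k = 1.0e7, 0.8, 1.0e-8

for C in (1.0e6, 1.0e8, 1.0e10):
    D = distinct_ngrams(C, V, g_k, h_k)
    lhs = elasticity(C, V, g_k, h_k)   # left side of the equation in note 2
    rhs = g_k * (V - D) / V            # right side of the equation in note 2
    print(f"C={C:.0e}  D={D:.4e}  lhs={lhs:.6f}  rhs={rhs:.6f}")
```

Under these assumptions the two sides agree at every corpus size, and the ratio falls from near \(g_k\) for small corpora towards zero as D approaches V, which is the decrease that note 2 attributes to the finite vocabulary.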

Acknowledgment

This work is supported by NOVA LINCS (UIDB/04516/2020) with the financial support of FCT.IP.

Author information

Corresponding author

Correspondence to Joaquim F. Silva.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Silva, J.F., Cunha, J.C. (2024). How Large Corpora Sizes Influence the Distribution of Low Frequency Text n-grams. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_16

  • DOI: https://doi.org/10.1007/978-981-97-2259-4_16

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2261-7

  • Online ISBN: 978-981-97-2259-4

  • eBook Packages: Computer Science, Computer Science (R0)
