n-gram Cache Performance in Statistical Extraction of Relevant Terms in Large Corpora

  • Carlos Gonçalves
  • Joaquim F. Silva
  • José C. Cunha
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11537)


Statistical extraction of relevant n-grams from natural language corpora is important for text indexing and classification, since it can be language independent. We show how a theoretical model identifies the distribution properties of the distinct n-grams and singletons appearing in large corpora, and how this knowledge contributes to understanding the performance of an n-gram cache system used for the extraction of relevant terms. We show how this approach allowed us to evaluate the benefits of using Bloom filters to exclude singletons and of statically prefetching non-singletons into an n-gram cache. In the context of the distributed and parallel implementation of the LocalMaxs extraction method, we analyze the cache miss ratio and cache size, and the efficiency of the n-gram cohesion calculation with LocalMaxs.
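The abstract's core idea, that a Bloom filter built over known singletons can keep one-off n-grams from polluting the cache, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names (`BloomFilter`, `NgramCache`), the filter parameters, and the `fetch` callback standing in for a remote frequency lookup are all assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter. n-grams known to be singletons are inserted,
    so later membership tests can flag them (with a small false-positive rate)."""
    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive num_hashes independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class NgramCache:
    """n-gram frequency cache that consults the singleton filter before
    admitting an entry: known singletons are fetched but never cached."""
    def __init__(self, singleton_filter):
        self.filter = singleton_filter
        self.store = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, ngram, fetch):
        if ngram in self.store:
            self.hits += 1
            return self.store[ngram]
        self.misses += 1
        freq = fetch(ngram)
        if not self.filter.might_contain(ngram):  # skip likely singletons
            self.store[ngram] = freq
        return freq


# Usage: build the filter from singletons, then serve frequency lookups.
bf = BloomFilter()
bf.add("rare singleton")
cache = NgramCache(bf)
freqs = {"common phrase": 42, "rare singleton": 1}
cache.lookup("common phrase", freqs.get)   # miss, cached
cache.lookup("rare singleton", freqs.get)  # miss, excluded from cache
cache.lookup("common phrase", freqs.get)   # hit
```

Because singletons dominate the distinct n-grams of large corpora (per the Zipf-like distributions the paper's model captures), excluding them from the cache shrinks its size substantially while barely affecting the hit ratio, since a singleton would never be requested a second time anyway.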


Keywords: Large corpora · Statistical extraction · Multiword terms · Parallel processing · n-gram cache performance · Cloud computing



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Instituto Superior de Engenharia de Lisboa, Lisbon, Portugal
  2. NOVA Laboratory for Computer Science and Informatics, Caparica, Portugal
