Building a Wikipedia N-GRAM Corpus

Part of the Advances in Intelligent Systems and Computing book series (AISC,volume 1251)

Abstract

In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We apply these approaches to build a 1- to 5-gram corpus data set, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora such as the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss their efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.
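To make the idea of a 1- to 5-gram corpus concrete, the following is a minimal sketch of counting n-grams over already-extracted plain text. It is not the authors' method: the tokenizer, the function names `tokenize` and `count_ngrams`, and the sample sentence are all assumptions chosen for illustration, and a real pipeline would first extract article text from the XML dump and would need a strategy for data that does not fit in memory.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption: a simple regex tokenizer; the paper's actual
    tokenization rules are not stated in the abstract.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(tokens, max_n=5):
    """Return a dict mapping n (1..max_n) to a Counter of n-gram strings."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][" ".join(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample text standing in for extracted article content.
sample = "the cat sat on the mat and the cat slept"
counts = count_ngrams(tokenize(sample))
```

Here `counts[2]["the cat"]` is 2, because that bigram occurs twice in the sample. Holding every count in memory works only for small inputs, which is one reason space and time efficiency matter at Wikipedia scale.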

Keywords

  • NGRAM
  • NLP
  • Wikipedia
  • OCR
  • Wiki

Author information

Corresponding author

Correspondence to Jorge Ramón Fonseca Cacho.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Cacho, J.R.F., Cisneros, B., Taghva, K. (2021). Building a Wikipedia N-GRAM Corpus. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1251. Springer, Cham. https://doi.org/10.1007/978-3-030-55187-2_23
