Abstract
In this paper, we introduce a set of approaches to building an n-gram corpus from the Wikipedia monthly XML dumps. We then apply these approaches to build a 1- to 5-gram corpus data set, which we describe in detail, explaining its benefits as a supplement to larger n-gram corpora such as the Google Web 1T 5-gram corpus. We analyze our algorithms and discuss their efficiency in terms of space and time. The dataset is publicly available at www.unlv.edu.
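As a rough illustration of what a 1- to 5-gram corpus contains, the following minimal sketch counts n-grams in a short string. It is not the authors' algorithm: the helper name count_ngrams, the lowercase word tokenization, and the in-memory Counter storage are all assumptions made for the example, and it does not parse the Wikipedia XML dumps.

```python
import re
from collections import Counter


def count_ngrams(text, max_n=5):
    """Count every 1- to max_n-gram in text (hypothetical helper, not the paper's code)."""
    # Assumption: tokens are lowercase runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[n][" ".join(tokens[i:i + n])] += 1
    return counts


counts = count_ngrams("the cat sat on the mat and the cat slept")
print(counts[1]["the"])      # frequency of the unigram "the"
print(counts[2]["the cat"])  # frequency of the bigram "the cat"
```

Holding every count in memory like this would likely not scale to a full Wikipedia dump, which is presumably why the space and time analysis mentioned above matters.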
Keywords
- NGRAM
- NLP
- Wikipedia
- OCR
- Wiki
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Cacho, J.R.F., Cisneros, B., Taghva, K. (2021). Building a Wikipedia N-GRAM Corpus. In: Arai, K., Kapoor, S., Bhatia, R. (eds) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol 1251. Springer, Cham. https://doi.org/10.1007/978-3-030-55187-2_23
DOI: https://doi.org/10.1007/978-3-030-55187-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-55186-5
Online ISBN: 978-3-030-55187-2
eBook Packages: Intelligent Technologies and Robotics (R0)
