Abstract
Data storage and information retrieval are among the most important aspects of developing a language corpus. Currently, most corpora use either relational databases or indexed file systems. When selecting a data storage system, the most important factors to consider are the speeds of data insertion and information retrieval. Beyond these two approaches, a variety of database systems now exist whose differing strengths may suit a corpus better. This paper compares the performance of data storage and retrieval mechanisms based on relational databases, graph databases, column store databases and indexed file systems across steps such as inserting data into the corpus and retrieving information from it, and attempts to suggest an optimal storage architecture for a language corpus.
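To make the two measured quantities concrete, the following is a minimal sketch of how insertion time and retrieval time could be compared for two storage styles. It is not the benchmark used in the paper: an in-memory SQLite table stands in for a relational store, a plain Python dictionary stands in for an indexed file, and the table name `token`, the toy sentences and all function names are assumptions made for illustration only.

```python
import sqlite3
import time
from collections import defaultdict

# Toy corpus (hypothetical); a real corpus would hold millions of tokens.
SENTENCES = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks at the quick fox",
    "a lazy afternoon in the park",
]


def build_relational(sentences):
    """Load tokens into an in-memory SQLite table indexed on the word column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token (sentence_id INTEGER, position INTEGER, word TEXT)"
    )
    rows = [
        (sid, pos, word)
        for sid, sentence in enumerate(sentences)
        for pos, word in enumerate(sentence.split())
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_token_word ON token (word)")
    conn.commit()
    return conn


def build_inverted_index(sentences):
    """Build a word -> [(sentence_id, position)] map, a stand-in for an indexed file."""
    index = defaultdict(list)
    for sid, sentence in enumerate(sentences):
        for pos, word in enumerate(sentence.split()):
            index[word].append((sid, pos))
    return index


def frequency_relational(conn, word):
    """Count occurrences of a word with a SQL query."""
    row = conn.execute(
        "SELECT COUNT(*) FROM token WHERE word = ?", (word,)
    ).fetchone()
    return row[0]


def frequency_inverted(index, word):
    """Count occurrences of a word by looking up its postings list."""
    return len(index.get(word, []))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    conn, t_rel_insert = timed(build_relational, SENTENCES)
    index, t_idx_insert = timed(build_inverted_index, SENTENCES)
    n_rel, t_rel_query = timed(frequency_relational, conn, "the")
    n_idx, t_idx_query = timed(frequency_inverted, index, "the")
    print(f"relational: insert {t_rel_insert:.6f}s, query {t_rel_query:.6f}s, count {n_rel}")
    print(f"inverted:   insert {t_idx_insert:.6f}s, query {t_idx_query:.6f}s, count {n_idx}")
```

Timings from a corpus this small say nothing about real behaviour; a meaningful comparison would repeat each measurement over a large dataset and over the actual systems under study, including the graph and column store databases that this sketch omits.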
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Upeksha, D. et al. (2015). Comparison Between Performance of Various Database Systems for Implementing a Language Corpus. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. BDAS 2015. Communications in Computer and Information Science, vol 521. Springer, Cham. https://doi.org/10.1007/978-3-319-18422-7_7
DOI: https://doi.org/10.1007/978-3-319-18422-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18421-0
Online ISBN: 978-3-319-18422-7
eBook Packages: Computer Science, Computer Science (R0)