Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments)

Mukhamedshin, Damir; Suleymanov, Dzhavdet; Nevzorova, Olga

doi:10.1007/978-3-030-21005-2_10

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 146)

Included in the following conference series:

International conference on the Sciences of Electronics, Technologies of Information and Telecommunications

Abstract

Corpus management systems are widely used to solve problems of human-computer interaction. Many tools exist for managing language corpora, for example Sketch Engine [1], Manatee [2], and EXMARaLDA [3]. We developed a system that, on the one hand, takes into account certain specific features of Turkic languages and, on the other hand, offers new search functions and components.

The corpus management system “Tugan Tel” (http://tugantel.tatar) is designed specifically for the National Corpus of Tatar, but it can also be used with linguistic corpora of other Turkic languages and of other languages. The system supports search for lexical units, morphological and lexical search, search for syntactic units, n-gram search, named entity extraction, and other functions.
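To make the n-gram search function concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the whitespace tokenization, and the toy English sentences are all assumptions made for illustration. The idea is to index every run of n consecutive tokens together with the sentence and position where it starts.

```python
from collections import defaultdict


def build_ngram_index(sentences, n):
    """Map each n-gram (tuple of lowercased tokens) to the
    (sentence_id, start_position) pairs where it occurs."""
    index = defaultdict(list)
    for sent_id, sentence in enumerate(sentences):
        # Assumption: naive whitespace tokenization; a real corpus
        # would use a language-aware tokenizer.
        tokens = sentence.lower().split()
        for pos in range(len(tokens) - n + 1):
            index[tuple(tokens[pos:pos + n])].append((sent_id, pos))
    return index


def search_ngram(index, query):
    """Return all occurrences of the query n-gram, or an empty list."""
    return index.get(tuple(query.lower().split()), [])


# Hypothetical toy sentences standing in for corpus text.
sentences = [
    "the corpus stores word forms",
    "each word forms part of the corpus",
]
bigrams = build_ngram_index(sentences, 2)
hits = search_ngram(bigrams, "word forms")
print(hits)  # [(0, 3), (1, 1)]
```

Building the index once makes each query a single dictionary lookup, at the cost of memory that grows with corpus size and n.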

The semantic model of Tatar language data representation is the core of the system. Corpus data are stored, processed, and searched using open-source tools (the MariaDB DBMS and the Redis data store).
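The abstract names the two storage tools but does not say how work is divided between them. One common arrangement, assumed here purely for illustration, is to keep the authoritative data in the relational database and use the key-value store as a cache for query results. The sketch below uses Python's built-in `sqlite3` as a stand-in for MariaDB and a plain dictionary as a stand-in for Redis so that it runs without any server. The table layout, key format, and sample rows are hypothetical.

```python
import json
import sqlite3

# Stand-in for the relational store (sqlite3 here, not MariaDB).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE occurrences (wordform TEXT, sentence_id INTEGER)")
db.executemany(
    "INSERT INTO occurrences VALUES (?, ?)",
    [("kitap", 1), ("kitap", 4), ("tel", 2)],  # hypothetical toy rows
)

# Stand-in for the key-value store: a dict holding string values,
# mimicking simple GET/SET usage (not a real Redis client).
cache = {}


def find_sentences(wordform):
    """Return (sentence_ids, source), where source says which store answered."""
    key = f"wf:{wordform}"  # hypothetical key naming scheme
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached), "cache"
    rows = db.execute(
        "SELECT sentence_id FROM occurrences "
        "WHERE wordform = ? ORDER BY sentence_id",
        (wordform,),
    ).fetchall()
    ids = [row[0] for row in rows]
    cache[key] = json.dumps(ids)  # store the result for later lookups
    return ids, "database"


first = find_sentences("kitap")
second = find_sentences("kitap")
print(first)   # ([1, 4], 'database')
print(second)  # ([1, 4], 'cache')
```

With this pattern, repeated queries for frequent word forms avoid the database entirely, but cached entries must be invalidated or expired whenever the corpus is updated.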

Developing a search engine for a corpus management system involves three basic stages: developing the data model, developing the system architecture, and developing the database architecture. The collection and processing of corpus data must also be considered.

The main task of our research is to identify and describe solutions for storing, collecting, and processing corpus data. The developed data model can be used for supervised and unsupervised document classification, as well as for corpus exploration. The proposed solutions have been implemented in the corpus management system that is currently used to represent and process data for the National Corpus of Tatar “Tugan Tel”.

References

1. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Suchomel, V.: The sketch engine: ten years on. Lexicography 1(1), 7–36 (2014)
2. Rychlý, P.: Manatee/bonito – a modular corpus manager. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 65–70, December 2007
3. Schmidt, T., Wörner, K.: EXMARaLDA – creating, analyzing and sharing spoken language corpora for pragmatics research. Pragmat.-Q. Publ. Int. Pragmat. Assoc. 19(4), 565 (2009)
4. Memcached: A distributed memory object caching system. https://memcached.org/. Accessed 30 June 2018
5. MemcacheDB: Wikipedia. https://en.wikipedia.org/wiki/MemcacheDB. Accessed 30 June 2018
6. MemcacheDB: Bauman National Library. https://en.bmstu.wiki/MemcacheDB. Accessed 30 June 2018
7. Nelson, J.: Mastering Redis. Packt Publishing Ltd, Birmingham (2016)
8. How fast is Redis? – Redis. https://redis.io/topics/benchmarks. Accessed 30 June 2018
9. FoundationDB | Home. https://www.foundationdb.org/. Accessed 30 June 2018
10. Performance: FoundationDB 5.2. https://apple.github.io/foundationdb/performance.html. Accessed 30 June 2018
11. Sphinx | Open Source Search Engine. http://sphinxsearch.com/. Accessed 30 June 2018
12. Gormley, C., Tong, Z.: Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O’Reilly Media, Inc., Newton (2015)
13. Bartholomew, D.: Getting Started with MariaDB. Packt Publishing Ltd, Birmingham (2013)
14. Nevzorova, O., Mukhamedshin, D., Gataullin, R.: Developing corpus management system: architecture of system and database. In: Proceedings of the 2017 International Conference on Information and Knowledge Engineering. CSREA Press, United States of America, pp. 108–112 (2017)

Author information

Authors and Affiliations

The Tatarstan Academy of Sciences, Kazan, Russia
Damir Mukhamedshin, Dzhavdet Suleymanov & Olga Nevzorova

Corresponding author

Correspondence to Damir Mukhamedshin.

Editor information

Editors and Affiliations

SETIT Lab, University of Sfax, Sfax, Tunisia
Med Salim Bouhlel
DIBRIS - University of Genoa, Genova, Italy
Stefano Rovetta

Cite this paper

Mukhamedshin, D., Suleymanov, D., Nevzorova, O. (2020). Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments). In: Bouhlel, M., Rovetta, S. (eds) Proceedings of the 8th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT’18), Vol.1. SETIT 2018. Smart Innovation, Systems and Technologies, vol 146. Springer, Cham. https://doi.org/10.1007/978-3-030-21005-2_10

DOI: https://doi.org/10.1007/978-3-030-21005-2_10
Published: 11 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21004-5
Online ISBN: 978-3-030-21005-2
eBook Packages: Intelligent Technologies and Robotics (R0)
