Abstract

Corpus management systems are widely used to solve problems of human-computer interaction. Many tools exist for managing language corpora, for example Sketch Engine [1], Manatee [2], and EXMARaLDA [3]. We developed a system that, on the one hand, takes into account specific features of Turkic languages and, on the other hand, offers new search functions and components.

The corpus management system “Tugan Tel” (http://tugantel.tatar) is specifically designed to work with the National Corpus of Tatar, and it can also be used with corpora of other Turkic languages and of other languages. The system supports search for lexical units, morphological and lexical search, search for syntactic units, n-gram search, named entity extraction, and other functions.

The semantic model of Tatar language data representation is the core of the system. Corpus data is stored, processed, and searched using open source tools (the MariaDB DBMS and the Redis data store).
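The abstract does not describe the storage schema, so the following is only a minimal sketch of how n-gram search over a key-value store might work. It assumes a positional inverted index in which each word form maps to its (sentence, position) occurrences; a plain Python dictionary stands in for a store such as Redis, and the names `build_index` and `find_ngram` are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each lowercased token to a list of (sentence_id, position) pairs.

    The dict is a stand-in for a key-value store (assumption, not the
    authors' schema): key = word form, value = list of occurrences.
    """
    index = defaultdict(list)
    for sent_id, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.lower().split()):
            index[token].append((sent_id, pos))
    return index


def find_ngram(index, ngram):
    """Return sorted (sentence_id, start_position) pairs for each n-gram match."""
    tokens = [t.lower() for t in ngram]
    if not tokens:
        return []
    # One set of occurrences per query token, for fast membership tests.
    postings = [set(index.get(tok, ())) for tok in tokens]
    return sorted(
        (sent_id, pos)
        for sent_id, pos in postings[0]
        # The k-th query token must occur k positions after the first one.
        if all((sent_id, pos + k) in postings[k] for k in range(1, len(tokens)))
    )


# Toy corpus for illustration only.
sentences = [
    "the corpus stores texts",
    "the search engine queries the corpus",
]
index = build_index(sentences)
print(find_ngram(index, ["the", "corpus"]))
```

Keeping positions in the index lets a multi-word query be answered by intersecting occurrence lists rather than scanning the texts, which is one common reason to pair a relational database with a fast key-value store.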

Developing a corpus management search engine involves three basic stages: developing the data model, the system architecture, and the database architecture. The collection and processing of corpus data should also be considered.

The main task of our research is to identify and describe solutions for the storage, collection, and processing of corpus data. The developed data model can be used for supervised and unsupervised document classification, as well as for corpus exploration. The proposed solutions have been implemented in the corpus management system, which is currently used for data representation and processing for the National Corpus of Tatar “Tugan Tel”.

References

  1. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Suchomel, V.: The sketch engine: ten years on. Lexicography 1(1), 7–36 (2014)

  2. Rychlý, P.: Manatee/Bonito – a modular corpus manager. In: 1st Workshop on Recent Advances in Slavonic Natural Language Processing, pp. 65–70, December 2007

  3. Schmidt, T., Wörner, K.: EXMARaLDA – creating, analyzing and sharing spoken language corpora for pragmatics research. Pragmat.-Q. Publ. Int. Pragmat. Assoc. 19(4), 565 (2009)

  4. Memcached: A distributed memory object caching system. https://memcached.org/. Accessed 30 June 2018

  5. MemcacheDB: Wikipedia. https://en.wikipedia.org/wiki/MemcacheDB. Accessed 30 June 2018

  6. MemcacheDB: Bauman National Library. https://en.bmstu.wiki/MemcacheDB. Accessed 30 June 2018

  7. Nelson, J.: Mastering Redis. Packt Publishing Ltd, Birmingham (2016)

  8. How fast is Redis? – Redis. https://redis.io/topics/benchmarks. Accessed 30 June 2018

  9. FoundationDB | Home. https://www.foundationdb.org/. Accessed 30 June 2018

  10. Performance: FoundationDB 5.2. https://apple.github.io/foundationdb/performance.html. Accessed 30 June 2018

  11. Sphinx | Open Source Search Engine. http://sphinxsearch.com/. Accessed 30 June 2018

  12. Gormley, C., Tong, Z.: Elasticsearch: The Definitive Guide: A Distributed Real-Time Search and Analytics Engine. O’Reilly Media, Inc., Newton (2015)

  13. Bartholomew, D.: Getting Started with MariaDB. Packt Publishing Ltd, Birmingham (2013)

  14. Nevzorova, O., Mukhamedshin, D., Gataullin, R.: Developing corpus management system: architecture of system and database. In: Proceedings of the 2017 International Conference on Information and Knowledge Engineering. CSREA Press, United States of America, pp. 108–112 (2017)


Author information

Corresponding author

Correspondence to Damir Mukhamedshin.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this paper

Mukhamedshin, D., Suleymanov, D., Nevzorova, O. (2020). Choosing the Right Storage Solution for the Corpus Management System (Analytical Overview and Experiments). In: Bouhlel, M., Rovetta, S. (eds) Proceedings of the 8th International Conference on Sciences of Electronics, Technologies of Information and Telecommunications (SETIT’18), Vol.1. SETIT 2018. Smart Innovation, Systems and Technologies, vol 146. Springer, Cham. https://doi.org/10.1007/978-3-030-21005-2_10
