New Parallel Corpora of Baltic and Slavic Languages — Assumptions of Corpus Construction

  • Conference paper
Text, Speech, and Dialogue (TSD 2021)

Abstract

In this article, we describe the design principles of the ten newly published CLARIN-PL corpora of Slavic and Baltic languages. In relation to other non-commercial online corpora, we highlight the distinctive features of these CLARIN-PL corpora: resource selection, preprocessing, manual segmentation at the sentence level, lemmatisation, annotation and metadata. We also present current and planned work on the development of the CLARIN-PL Balto–Slavic corpora.

Notes

  1. While preparing the corpora, we gained access to KLC Morfologijos servisas, a newly developed tool by the Centre of Computational Linguistics team at Vytautas Magnus University. The tool had not yet been published at the time this article was written.

Acknowledgements

This work was partially supported by (1) the Polish Ministry of Education and Science, CLARIN-PL Project; (2) CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.

Author information

Corresponding author

Correspondence to Roman Roszko.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Duszkin, M., Roszko, D., Roszko, R. (2021). New Parallel Corpora of Baltic and Slavic Languages — Assumptions of Corpus Construction. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_15

  • DOI: https://doi.org/10.1007/978-3-030-83527-9_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-83526-2

  • Online ISBN: 978-3-030-83527-9

  • eBook Packages: Computer Science (R0)
