Abstract
In this article, we describe the design principles of the ten newly published CLARIN-PL corpora of Slavic and Baltic languages. In relation to other non-commercial online corpora, we highlight the distinctive features of these CLARIN-PL corpora: resource selection, preprocessing, manual segmentation at the sentence level, lemmatisation, annotation and metadata. We also present current and planned work on the development of the CLARIN-PL Balto–Slavic corpora.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
While preparing the corpora, we gained access to KLC Morfologijos servisas, a newly developed tool by the Centre of Computational Linguistics team at the Vytautas Magnus University, the tool had not yet been published at the time this article was written.
References
Corpus of Parallel Russian and Bulgarian Texts. http://rbcorpus.com
InterCorp. https://intercorp.korpus.cz
KonText. https://kontext.clarin-pl.eu
Polish-Russian Parallel Corpus. http://www.pol-ros.polon.uw.edu.pl
Russian National Corpus: Corpora structure. https://ruscorpora.ru/new/corpora-structure.html
Speller. http://ws.clarin-pl.eu/speller.shtml
Tokenizer. http://ws.clarin-pl.eu/tokenizer.shtml
TxtClean. http://ws.clarin-pl.eu/txtclean.shtml
Bień, J.: Rozprawy Uniwersytetu Warszawskiego / Dissertationes Univesitatis Varsoviensis, chap. Koncepcja słownikowej informacji morfologicznej i jej kompu-terowej weryfikacji, Wydawnictwa Uniwersytetu Warszawskiego (1991)
Čermák, F., Rosen, A.: The case of InterCorp, a multilingual parallel corpus. Int. J. Corpus Ling. 17, 411-427 (2012)
Dimitrova, L., Koseska, V., Roszko, D., Roszko, R.: Trilingual aligned corpus — current state and new applications. Cogn. Stud. Études Cogn. 14, 13–20 (2014)
Dobrovol’sky, D., Kretov, A., Sharoff, S.: Natsional’nyy korpus russkogo yazyka: 2003–2005, chap. Korpus parallel’nykh tekstov: Arkhitektura i vozmozhnosti ispol’zovaniya. Indrik (2005)
Janz, A., Kocoń, J., Piasecki, M., Zaśko-Zielińska, M.: PlWordNet as a basis for large emotive lexicons of Polish. In: LTC 2017 8th Language & Technology Conference, pp. 189–193 (2017)
Kisiel, A., Koseska-Toszewa, V., Kotsyba, N., Staśkowiak-Satoła, J., Sosnowski, W.: Polish-Bulgarian-Russian Parallel Corpus (2016). http://hdl.handle.net/11321/308. CLARIN-PL Digital Repository
Koeva, S., Genov, A.: Bulgarian language processing chain. In: Proceeding of the Workshop on the Integration of Multilingual Resources and Tools in Web Applications, 26 September 2011 (2011)
Kotsyba, N.: Polskojęzyczne korpusy równoległe. Polish-language Parallel Corpora, chap. Polsko-Ukraiński Korpus Równoległy PolUKR i jego następca PolUKR-2. Instytut Lingwistyki Stosowanej (2016)
Marcinkevičienė, R.: Vytauto Didžiojo universiteto mokslo klasteriai, chap. Teksto ir balso skaitmeniniai tyrimai, ištekli ir technologiju kūrimas bei taikymas (2012)
Pęzik, P., Ogrodniczuk, M., Przepiórkowski, A.: Parallel and spoken corpora in an open repository of Polish language resources. In: Proceedings of the 5th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (2011)
Piasecki, M., Walentynowicz, W.: Morphodita-based tagger adapted to the Polish language technology. In: Proceedings of Human Language Technologies as a Challenge for Computer Science and Linguistics (2017)
Przepiórkowski, A.: Korpus IPI PAN. Wersja wstępna, Instytut Podstaw Informatyki PAN (2004)
Rimkutė, E., Valskys, V., Vaskelienė, J.: Lietuvi kalbos leksem morfologinis anotavimas: ypatumai ir sunkumai. Kalb studijos 15, 63–70 (2009)
Roszko, D., Roszko, R.: Polish-Lithuanian Parallel Corpus (2016). https://clarin-pl.eu/dspace/handle/11321/309. CLARIN-PL Digital Repository
Saloni, Z.: Podstawy teoretyczne “Słownika gramatycznego języka polskiego” (2012/2020). http://sgjp.pl/static/pdf/Podstawy_teoretyczne_SGJP.pdf
Segalovich, I.: A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In: Proceedings of the International Conference on Machine Learning; Models, Technologies and Applications, MLMTA 2003, Las Vegas, Nevada, USA, 23–26 June 2003 (2003)
Simov, K., Simov, A., Osenova, P.: An XML architecture for shallow and deep processing. In: The Proceedings of the ESSLLI 2004 Workshop on Combining Shallow and Deep Processing for NLP (2004)
Straka, M., Straková, J.: UDPipe (2016). http://hdl.handle.net/11234/1-1702. LINDAT/CLARIAH-CZ Digital Library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University
Tiedemann, J.: OPUS — parallel corpora for everyone. In: Proceedings of the 19th Annual Conference of the European Association for Machine Translation (EAMT) (2016). Baltic Journal of Modern Computing
von Waldenfels, R., Meyer, R.: ParaSol: A Parallel Corpus of Slavic and Other Languages (2006)
Walentynowicz, W., Piasecki, M., Oleksy, M.: Tagger for Polish computer mediated communication texts. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019) (2019)
Walkowiak, T.: Language processing modelling notation — orchestration of NLP microservices. In: Advances in Dependability Engineering of Complex Systems: Proceedings of the Twelfth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX (2017)
Waszczuk, J.: Harnessing the CRF complexity with domain-specific constraints. The case of morphosyntactic tagging of a highly inflected language. In: Proceedings of COLING 2012 (2012)
Acknowledgements
This work was partially supported by (1) the Polish Ministry of Education and Science, CLARIN-PL Project; (2) CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Duszkin, M., Roszko, D., Roszko, R. (2021). New Parallel Corpora of Baltic and Slavic Languages — Assumptions of Corpus Construction. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science(), vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_15
Download citation
DOI: https://doi.org/10.1007/978-3-030-83527-9_15
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer ScienceComputer Science (R0)