New Parallel Corpora of Baltic and Slavic Languages — Assumptions of Corpus Construction

Duszkin, Maksim; Roszko, Danuta; Roszko, Roman

doi:10.1007/978-3-030-83527-9_15

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12848)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

In this article, we describe the design principles of the ten newly published CLARIN-PL corpora of Slavic and Baltic languages. In relation to other non-commercial online corpora, we highlight the distinctive features of these CLARIN-PL corpora: resource selection, preprocessing, manual segmentation at the sentence level, lemmatisation, annotation and metadata. We also present current and planned work on the development of the CLARIN-PL Balto–Slavic corpora.

Notes

1. While preparing the corpora, we gained access to KLC Morfologijos servisas, a tool newly developed by the Centre of Computational Linguistics team at Vytautas Magnus University. The tool had not yet been published at the time this article was written.

Acknowledgements

This work was partially supported by (1) the Polish Ministry of Education and Science, CLARIN-PL Project; (2) CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.

Author information

Authors and Affiliations

Institute of Slavic Studies PAS, Warsaw, Poland
Maksim Duszkin & Roman Roszko
University of Warsaw, Warsaw, Poland
Danuta Roszko

Authors

Maksim Duszkin
Danuta Roszko
Roman Roszko

Corresponding author

Correspondence to Roman Roszko.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein
University of West Bohemia, Pilsen, Czech Republic
František Pártl
University of West Bohemia, Pilsen, Czech Republic
Miloslav Konopík

Cite this paper

Duszkin, M., Roszko, D., Roszko, R. (2021). New Parallel Corpora of Baltic and Slavic Languages — Assumptions of Corpus Construction. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_15

DOI: https://doi.org/10.1007/978-3-030-83527-9_15
Published: 30 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science, Computer Science (R0)
