MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Language Resources and Evaluation

Abstract

The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel “1984” by George Orwell, which is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset, unique in terms of languages covered and the wealth of encoding, is extensively documented, and freely available for research purposes. The paper overviews the MULTEXT-East resources by type and language and gives some conclusions and directions for further work.
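The token-level annotation described above (word forms paired with lemmas and morphosyntactic descriptions in TEI-encoded XML) can be sketched as follows. This is a minimal illustration, not the corpus format itself: it assumes that tokens are `w` elements in the TEI namespace carrying `lemma` and `ana` attributes, and the sample sentence, its tags, and the function name `read_tokens` are hypothetical rather than taken from the distributed resources.

```python
import xml.etree.ElementTree as ET

# TEI namespace in the Clark notation used by ElementTree.
TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample sentence. The element names, attribute names and
# tag values are assumptions made for illustration only.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0" xml:id="demo.s1">
  <w lemma="it" ana="#Pp3ns">It</w>
  <w lemma="be" ana="#Vmis3s">was</w>
  <w lemma="a" ana="#Di">a</w>
  <w lemma="day" ana="#Ncns">day</w>
  <c>.</c>
</s>"""


def read_tokens(xml_text):
    """Return (form, lemma, msd) triples for every word element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(TEI + "w"):
        # The 'ana' value is assumed to be a pointer such as '#Ncns';
        # strip the leading '#' to obtain the bare description string.
        msd = w.get("ana", "").lstrip("#")
        tokens.append((w.text, w.get("lemma"), msd))
    return tokens


if __name__ == "__main__":
    for form, lemma, msd in read_tokens(SAMPLE):
        print(form, lemma, msd, sep="\t")
```

Punctuation (`c` elements in this sketch) is skipped deliberately, so the result contains only word tokens.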

Notes

  1. EAGLES-based harmonized tagsets have also been used for various other language resources, such as those of the LE-PAROLE project, which produced a multilingual corpus and associated lexica for 14 European languages (Zampolli 1997).

References

  • Alexin, Z., Gyimóthy, T., Hatvani, C., Tihanyi, L., Csirik, J., Bibok, K., et al. (2003). Manually annotated Hungarian corpus. In Proceedings of the tenth conference on European chapter of the association for computational linguistics (EACL’03) (pp. 53–56).

  • Arhar, Š., & Gorjanc, V. (2007). Korpus FidaPLUS: Nova generacija slovenskega referenčnega korpusa (the FidaPLUS corpus: A new generation of the Slovene reference corpus). Jezik in slovstvo, 52(2), 95–110.

  • Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the tenth conference on computational natural language learning (CoNLL-X) (pp. 149–164). Morristown, NJ, USA: ACL.

  • Chiarcos, C., & Erjavec, T. (2011). OWL/DL formalization of the MULTEXT-East morphosyntactic specifications. In Proceedings of the 5th linguistics annotation workshop (LAW-V), ACL.

  • Derzhanski, I. A., & Kotsyba, N. (2009). Towards a consistent morphological tagset for Slavic languages: Extending MULTEXT-East for Polish, Ukrainian and Belarusian. In Proceedings of the Mondilex third open workshop: Metalanguage and encoding scheme design for digital lexicography (pp. 9–26). Bratislava, Slovakia: Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences.

  • Dimitrova, L., & Rashkov, P. (2009). A new version for Bulgarian MTE morphosyntactic specifications for some verbal forms. In Proceedings of the Mondilex second open workshop: Organization and development of digital lexical resources (pp. 30–37). Kyiv, Ukraine: Dovira Publishing House.

  • Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H. J., Petkevič, V., & Tufiş, D. (1998). MULTEXT-East: Parallel and comparable corpora and lexicons for six Central and Eastern European languages. In Proceedings of the COLING-ACL’98 (pp. 315–319). Montréal, QC, Canada: ACL.

  • Džeroski, S., Erjavec, T., Ledinek, N., Pajas, P., Žabokrtsky, Z., & Žele, A. (2006). Towards a Slovene dependency treebank. In Proceedings of the fifth international conference on language resources and evaluation (LREC’06), Genoa.

  • EAGLES. (1996). Expert advisory group on language engineering standards. http://www.ilc.pi.cnr.it/EAGLES/home.html.

  • Erjavec, T. (2004). MULTEXT-East version 3: Multilingual morphosyntactic specifications, lexicons and corpora. In Proceedings of the fourth international conference on language resources and evaluation (LREC’04), Lisbon.

  • Erjavec, T. (2010). MULTEXT-East version 4: Multilingual morphosyntactic specifications, lexicons and corpora. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10), Valletta.

  • Erjavec, T., & Džeroski, S. (2004). Machine learning of language structure: Lemmatising unknown Slovene words. Applied Artificial Intelligence, 18(1), 17–41.

  • Erjavec, T., Fišer, D., Krek, S., & Ledinek, N. (2010). The JOS linguistically tagged corpus of Slovene. In Proceedings of the seventh international conference on language resources and evaluation (LREC’10), Valletta.

  • Farrar, S., & Langendoen, D. T. (2003). A linguistic ontology for the semantic web. GLOT International, 7(3), 97–100.

  • Feldman, A., & Hana, J. (2010). A resource-light approach to morpho–syntactic tagging. Language and computers: Studies in practical linguistics (Vol. 70). Amsterdam: Rodopi.

  • Garabík, R., & Gianitsová-Ološtiaková, L. (2005). Manual morphological annotation of the Slovak translation of Orwell’s novel 1984: Methods and findings. In Proceedings of the Slovko conference “Computer treatment of Slavic and East European languages”. Bratislava: Veda.

  • Garabík, R., Majchráková, D., & Dimitrova, L. (2009). Comparing Bulgarian and Slovak MULTEXT-East morphology tagset. In Proceedings of the Mondilex second open workshop: Organization and development of digital lexical resources (pp. 38–46). Kyiv, Ukraine: Dovira Publishing House.

  • Hajič, J. (2000). Morphological tagging: Data versus dictionaries. In Proceedings of the ANLP/NAACL 2000 (pp. 94–101). Seattle.

  • Hajič, J. (2002). Disambiguation of rich inflection (computational morphology of Czech) (Vol. 1). Prague: Karolinum Charles University Press.

  • Horák, A., Gianitsová, L., Šimková, M., Šmotlák, M., & Garabík, R. (2004). Slovak national corpus. In Proceedings of the text speech and dialogue conference (TSD’04), Brno.

  • Ide, N. (1998). Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In Proceedings of the first international conference on language resources and evaluation (LREC’98) (pp. 463–470). Granada.

  • Ide, N. (2000). Cross-lingual sense determination: Can it work? Computers and the Humanities, 34, 223–234.

  • Ide, N., & Véronis, J. (1994). Multext (multilingual tools and corpora). In Proceedings of the 15th international conference on computational linguistics (CoLing’94) (pp. 90–96). Kyoto.

  • Ivanovska, A., Zdravkova, K., Džeroski, S., & Erjavec, T. (2005). Learning rules for morphological analysis and synthesis of Macedonian nouns. In Proceedings of the 8th international conference information society, IS 2005. Ljubljana: Jožef Stefan Institute.

  • Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2008). ISOcat: Corralling data categories in the wild. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08), Marrakech.

  • Kopotev, M., & Mustajoki, A. (2003). Principy sozdanija Hel’sinkskogo annotirovannogo korpusa russkih tekstov (HANCO) v seti internet. Naučno-tehničeskaja informacija (Ser. 2, pp. 33–37) (in Russian).

  • Kotsyba, N., Radziszewski, A., & Derzhanski, I. (2009). Integrating the Polish language into the MULTEXT-East family. In Proceedings of the Mondilex fifth open workshop: Research infrastructure for digital lexicography. Ljubljana, Slovenia: Jožef Stefan Institute.

  • Krek, S., Stabej, M., Gorjanc, V., Erjavec, T., Romih, M., & Holozan, P. (1998). FIDA: A corpus of the Slovene language. http://www.fida.net/.

  • Krstev, C., Vitas, D., & Erjavec, T. (2004). MULTEXT-East resources for Serbian. In Proceedings B of the 7th international multiconference information society: Language technologies (pp. 108–114). Ljubljana: Jožef Stefan Institute.

  • Martin, J., Mihalcea, R., & Pedersen, T. (2005). Word alignment for languages with scarce resources. In Proceedings of the ACL workshop on building and using parallel texts (pp. 65–74). Ann Arbor.

  • Petrovski, A. (2004). Morphological processing of nouns in Macedonian language. In Proceedings of the 7th intex/nooj workshop, Tours.

  • Piasecki, M. (2007). Polish tagger TaKIPI: Rule based construction and optimisation. Task Quarterly, 11, 151–167.

  • Prószéky, G. (1995). Humor: A morphological system for corpus analysis. In Proceedings of the first European TELRI seminar: Language resources for language technology (pp. 149–158). Tihany, Hungary.

  • Prószéky, G., & Kis, B. (1999). A unification-based approach to morpho-syntactic parsing of agglutinative and other (highly) inflectional languages. In Proceedings of the 37th ACL, association for computational linguistics (pp. 261–268).

  • Przepiórkowski, A. (2006). The potential of the IPI PAN corpus. Poznań Studies in Contemporary Linguistics, 41, 31–48.

  • Przepiórkowski, A., & Woliński, M. (2003). A flexemic tagset for Polish. In Proceedings of the EACL workshop on morphological processing of Slavic languages. ACL.

  • QasemiZadeh, B., & Rahimi, S. (2006). Persian in MULTEXT-East framework. In Proceedings of the 5th international conference on natural language processing (FinTAL’06) (pp. 541–551). Turku, Finland.

  • Rosen, A. (2010). Morphological tags in parallel corpora. In F. Čermák, A. Klégr, & P. Corness (Eds.), InterCorp: Exploring a Multilingual corpus. Praha: Nakladatelství Lidové noviny.

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (pp. 44–49).

  • Sharoff, S. (2005). Methods and tools for development of the Russian reference corpus. In D. Archer, A. Wilson, & P. Rayson (Eds.), Corpus linguistics around the world (pp. 167–180). Amsterdam: Rodopi.

  • Sharoff, S., Kopotev, M., Erjavec, T., Feldman, A., & Divjak, D. (2008). Designing and evaluating a Russian tagset. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08). Marrakech.

  • Silberztein, M. (1999). Text indexing with INTEX. Computers and the Humanities, 33(3). Kluwer Academic Publishers.

  • Simov, K., Popova, G., & Osenova, P. (2002). HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In A. Wilson, P. Rayson, & T. McEnery (Eds.), A rainbow of corpora: Corpus linguistics and the languages of the world (pp. 135–142). Munich: Lincom-Europa.

  • Slavcheva, M. (1997). A comparative representation of two Bulgarian morphosyntactic tagsets and the EAGLES encoding standard. Technical Report TELRI (Trans European Language Resources Infrastructure).

  • Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (1994). Guidelines for electronic text encoding and interchange P3. Chicago and Oxford: Association for Computers and the Humanities/Association for Computational Linguistics/Association for Literary and Linguistic Computing.

  • Steenwijk, H. (1992). The Slovene Dialect of Resia San Giorgio. Amsterdam-Atlanta: Rodopi.

  • Stolić, M., & Zdravkova, K. (2010). Resources for machine translation of the Macedonian language. In Proceedings of the ICT innovations conference, Ohrid.

  • Tadić, M. (2002). Building the Croatian national corpus. In Proceedings of the third international conference on language resources and evaluation (LREC’02) (pp. 441–446). Las Palmas.

  • Tadić, M. (2003). Building the Croatian morphological lexicon. In Proceedings of the EACL workshop on morphological processing of Slavic languages, ACL.

  • TEI Consortium. (2007). TEI P5: Guidelines for electronic text encoding and interchange. TEI Consortium, URL: http://www.tei-c.org/Guidelines/P5/.

  • Toutanova, K., & Cherry, C. (2009). A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the 47th annual meeting of the ACL (ACL’09) (pp. 486–494). Singapore.

  • Tufiş, D. (1999). Tiered tagging and combined language model classifiers. In F. Jelinek & E. Noth (Eds.), Text, speech and dialogue no. 1692 in lecture notes in artificial intelligence (pp. 28–33). Berlin: Springer.

  • Tufiş, D. (2002). A cheap and fast way to build useful translation lexicons. In Proceedings of the 19th annual meeting of the ACL (ACL’02). Association for Computational Linguistics.

  • Tufiş, D., Cristea, D., & Stamou, S. (2004). BalkaNet: Aims, methods, results and perspectives: A general overview. Romanian Journal of Information Science and Technology, 7(1–2), 9–43.

  • Vitas, D., & Krstev, C. (2001). Intex and Slavonic morphology. In 4es Journées INTEX. Bordeaux.

  • Vojnovski, V., Džeroski, S., & Erjavec, T. (2005). Learning PoS tagging from a tagged Macedonian text corpus. In Proceedings of the 8th international conference information society, IS 2005. Ljubljana: Jožef Stefan Institute.

  • Zampolli, A. (1997). The PAROLE project. In Proceedings of the second European TELRI seminar: Language applications for multilingual Europe (pp. 185–210). Kaunas, Lithuania.

  • Zdravkova, K., & Petrovski, A. (2007). Derivation of Macedonian verbal adjectives. In Proceedings of international conference “recent advances in natural language processing” (RANLP’07) (pp. 661–665).

Acknowledgments

The author would like to thank Radovan Garabik, Natalia Kotsyba, Katerina Zdravkova, and Darja Fišer for their helpful comments and suggestions. Work on the MULTEXT-East resources was initially supported by the EU project MULTEXT-East “Multilingual Text Tools and Corpora for Central and Eastern European Languages”, the US NSF grant IRI-9413451 and the EU Concerted Action TELRI “Trans-European Language Resources Infrastructure”. Work on the second release was supported by the EU Project CONCEDE “Consortium for Central European Dictionary Encoding”, while the work on the third release was partially funded by the NEH grant to the TEI Task Force “SGML–XML migration”. Work on the fourth release was supported by the EU project MONDILEX “Conceptual Modeling of Networking of Centres for High-Quality Research in Slavic Lexicography and their Digital Resources”. The work on the resources has been additionally supported by bilateral projects between Slovenia and Serbia and between Slovenia and Macedonia, as well as by individual partners’ grants and contracts.

Author information

Corresponding author

Correspondence to Tomaž Erjavec.

About this article

Cite this article

Erjavec, T. MULTEXT-East: morphosyntactic resources for Central and Eastern European languages. Lang Resources & Evaluation 46, 131–142 (2012). https://doi.org/10.1007/s10579-011-9174-8

  • DOI: https://doi.org/10.1007/s10579-011-9174-8
