Language Resources and Evaluation, Volume 46, Issue 1, pp 131–142

MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Brief Report

Abstract

The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research focused on the morphosyntactic level of linguistic description. The dataset comprises morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, George Orwell's novel “1984”, which is sentence-aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, following the Text Encoding Initiative Guidelines (TEI P5), and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset, unique in the languages it covers and the richness of its encoding, is extensively documented and freely available for research purposes. The paper gives an overview of the MULTEXT-East resources by type and language, and closes with conclusions and directions for further work.

Keywords

Morphosyntactic annotation, Multilinguality, Language encoding standards

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
