Language Resources and Evaluation

, Volume 46, Issue 1, pp 131–142

MULTEXT-East: morphosyntactic resources for Central and Eastern European languages

Brief Report

DOI: 10.1007/s10579-011-9174-8

Cite this article as:
Erjavec, T. Lang Resources & Evaluation (2012) 46: 131. doi:10.1007/s10579-011-9174-8

Abstract

The paper presents the MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the morphosyntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel “1984” by George Orwell, which is sentence aligned and contains hand-validated morphosyntactic descriptions and lemmas. The resources are uniformly encoded in XML, using the Text Encoding Initiative Guidelines, TEI P5, and cover 16 languages, mainly from Central and Eastern Europe: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Macedonian, Persian, Polish, Resian, Romanian, Russian, Serbian, Slovak, Slovene, and Ukrainian. This dataset, unique in terms of languages covered and the wealth of encoding, is extensively documented, and freely available for research purposes. The paper overviews the MULTEXT-East resources by type and language and gives some conclusions and directions for further work.

Keywords

Morphosyntactic annotation Multilinguality Language encoding standards 

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. 1.Department of Knowledge TechnologiesJožef Stefan InstituteLjubljanaSlovenia

Personalised recommendations