Starting in 2006, the European Commission’s Joint Research Centre and other European Union organisations have made available a number of large-scale highly-multilingual parallel language resources. In this article, we give a comparative overview of these resources and we explain the specific nature of each of them. This article provides answers to a number of questions, including: What are these linguistic resources? What is the difference between them? Why were they originally created and why was the data released publicly? What can they be used for and what are the limitations of their usability? What are the text types, subject domains and languages covered? How to avoid overlapping document sets? How do they compare regarding the formatting and the translation alignment? What are their usage conditions? What other types of multilingual linguistic resources does the EU have? This article thus aims to clarify what the similarities and differences between the various resources are and what they can be used for. It will also serve as a reference publication for those resources, for which a more detailed description has been lacking so far (EAC-TM, ECDC-TM and DGT-Acquis).
All EU corpora discussed here can be downloaded from https://ec.europa.eu/jrc/language-technologies.
See https://ec.europa.eu/jrc/. All URLs were last visited on 7 February 2014.
For details, see https://ec.europa.eu/jrc/en/research-topic/internet-surveillance-systems.
The EMM websites can be accessed publicly via http://emm.newsbrief.eu/overview.html.
See http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:32003L0098:EN:NOT for details and to read the full text of the regulation.
See the META-NET report http://www.meta-net.eu/whitepapers/key-results-and-cross-language-comparison.
Events dedicated to building and exploiting parallel corpora are, for instance, the workshop series on ‘Annotation and exploitation of parallel corpora’ (e.g. http://www.bultreebank.org/AEPC2/); ‘Slavic parallel corpora’ (http://www.slavistik.uni-mainz.de/606.php); ‘Parallel corpora and linguistic theory’ (http://paralleltext.info/sle2013/); ‘Annotation and Alignment of parallel corpora for linguistic research’ (http://www.dagstuhl.de/13043); ‘ATA-AMTA Workshop on users and uses for parallel corpora’ (http://permalink.gmane.org/gmane.science.linguistics.corpora/11156); and ‘Workshop on building and using parallel texts: data-driven machine translation and beyond’ (http://www.statmt.org/wpt05/). The CLEF Initiative and its evaluation labs are also highly relevant for this field (http://www.clef-initiative.eu/).
The 24 official EU languages as of January 2014 are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish and Swedish. We use the two-letter ISO codes to represent the languages.
The structure of CELEX document numbers is explained at http://eur-lex.europa.eu/en/tools/faq.htm#1.12.
At http://pelcra.pl/res/parallel/word-aligned/, for instance, Polish-English word alignments can be found.
The Vanilla software used implements the Gale and Church (1993) alignment algorithm.
While the Vanilla alignment was performed at the JRC, the separate HunAlign alignment was carried out by the Media Research Centre at Budapest University of Technology and Economics. The Romanian documents were collected and pre-processed by the Research Institute for Artificial Intelligence at the Romanian Academy of Sciences.
See http://publications.europa.eu/official/index_en.htm for more information on the Official Journal.
See http://dragoman.org/muset/ for details.
For details on ECDC, see http://www.ecdc.europa.eu.
For details on DG EAC, see http://ec.europa.eu/dgs/education_culture/.
i.e. the Ukrainian boxer and politician, see http://emm.newsexplorer.eu/NewsExplorer/entities/en/19011.html.
For download and more information, see http://datahub.io/dataset/jrc-names.
See https://open-data.europa.eu/. Quote extracted on 7 February 2014.
Balahur, A., Steinberger, R., Kabadjov, M., Zavarella, V., van der Goot, E., Halkia, M., et al. (2010). Sentiment analysis in the news. In Proceedings of the 7th international conference on language resources and evaluation (LREC’2010), Valletta, Malta, 19–21 May 2010, pp. 2216–2220.
Carrasco-Benitez, M. T. (2008). Open architecture for multilingual parallel texts. http://arxiv.org/ftp/arxiv/papers/0808/0808.3889.pdf.
Chen, Y., Kay, M., & Eisele, A. (2009). Intersecting multilingual data for faster and better statistical translations. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the association for computational linguistics, Boulder, Colorado, pp. 128–136.
Chiao, Y.-C., Kraif, O., Laurent, D., Nguyen, T. M. H., Semmar, N., Stuck, F., et al. (2006). Evaluation of multilingual text alignment systems: The ARCADE II project. In Proceedings of the 5th international conference on language resources and evaluation (LREC’2006), Genoa, Italy, pp. 1975–1978.
Cohn, T., & Lapata, M. (2007). Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proceedings of the 45th annual meeting of the association for computational linguistics, Prague, Czech Republic, pp. 728–735.
EC&DGT (2008). European Commission & Directorate General for Translation—Translation tools and workflow. Office for Official Publications, Brussels, Belgium.
Ehrmann, M., Turchi, M., & Steinberger, R. (2011). Building a multilingual named entity-annotated corpus. In Proceedings of the 8th international conference recent advances in natural language processing (RANLP’2011), Hissar, Bulgaria.
Eisele, A., & Chen, Y. (2010). MultiUN: A multilingual corpus from United Nation documents. In Proceedings of the international conference on language resources and evaluation (LREC 2010), Valletta, Malta, pp. 2868–2872.
Erjavec, T. (2010). MULTEXT-East version 4: Multilingual morphosyntactic specifications, lexicons and corpora. In Proceedings of the seventh international conference on language resources and evaluation (LREC), Valletta, Malta, pp. 2544–2547.
Erjavec, T., & Ide, N. (1998). The MULTEXT-East corpus. In Proceedings of the first international conference on language resources and evaluation (LREC), Granada, Spain.
Gale, W., & Church, K. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.
Hajlaoui, N., Kolovratnik, D., Väyrynen, J., Varga, D., & Steinberger, R. (2014). DCEP—Digital Corpus of the European Parliament. In Proceedings of the 9th international conference on language resources and evaluation (LREC’2014), Reykjavik, Iceland.
Ide, N., & Véronis, J. (1994). MULTEXT: Multilingual text tools and corpora. In Proceedings of the 15th international conference on computational linguistics (CoLing), Kyoto, Japan, pp. 588–592.
Koehn, P. (2005). EuroParl: A parallel corpus for statistical machine translation. In Proceedings of the machine translation summit, Phuket, Thailand, pp. 79–86.
Koehn, P., Birch, A., & Steinberger, R. (2009). 462 machine translation systems for Europe. In L. Gerber, P. Isabelle, R. Kuhn, N. Bemish, M. Dillinger, & M.-J. Goulet (Eds.), Proceedings of the twelfth machine translation summit (MT-Summit XII), Ottawa, Canada, August 2009, pp. 65–72.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the annual meeting of the association for computational linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
Lardilleux, A., & Lepage, Y. (2009). Sampling-based multilingual alignment. In International conference on recent advances in natural language processing (RANLP’2009), Borovets, Bulgaria, pp. 214–218.
Lefever, E., & Hoste, V. (2010). SemEval-2010 task 3: Cross-lingual word sense disambiguation. In Proceedings of the 5th international workshop on semantic evaluation (SemEval’2010), Uppsala, Sweden, pp. 15–20.
Landauer, T., & Littman, M. (1991). A statistical method for language-independent representation of the topical content of text segments. In Proceedings of the 11th international conference ‘Expert Systems and Their Applications’, Vol. 8, pp. 77–85.
Mehdad, Y., Negri, M., & Federico, M. (2010). Towards cross-lingual textual entailment. In Proceedings of human language technologies, Los Angeles, CA, USA, pp. 321–324.
Naseem, T., Snyder, B., Eisenstein, J., & Barzilay, R. (2009). Multilingual part-of-speech tagging: Two unsupervised approaches. Journal of Artificial Intelligence Research, 36, 341–385.
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Padó, S., & Lapata, M. (2009). Cross-lingual annotation projection of semantic roles. Journal of Artificial Intelligence Research, 36, 307–340.
Potthast, M., Barrón-Cedeño, A., Stein, B., & Rosso, P. (2011). Cross-language plagiarism detection. Language Resources and Evaluation, Special issue on plagiarism and authorship analysis, 45(1), 45–62.
Resnik, P., Olsen, M. B., & Diab, M. (1999). The Bible as a parallel corpus: Annotating the ‘Book of 2000 Tongues’. Computers and the Humanities, 33(1–2), 129–153.
Steinberger, R. (2011). A survey of methods to ease the development of highly multilingual Text Mining applications. Language Resources and Evaluation Journal, 46(2), 155–176.
Steinberger, R. (2013). Multilingual and cross-lingual news analysis in the Europe Media Monitor (EMM). In M. Lupu, E. Kanoulas, & F. Loizides (Eds.), Multidisciplinary information retrieval. 6th information retrieval facility conference (IRFC’2013), Limassol, Cyprus. Springer Lecture Notes in Computer Science, Vol. 8201, pp. 1–4.
Steinberger, R., Ebrahim, M., & Turchi, M. (2012a). JRC EuroVoc Indexer JEX—A freely available multi-label categorisation tool. In Proceedings of the 8th international conference on language resources and evaluation (LREC’2012), Istanbul, 21–27 May 2012.
Steinberger, R., Eisele, A., Klocek, S., Pilos, S., & Schlüter, P. (2012b). DGT-TM: A freely available translation memory in 22 languages. In Proceedings of the 8th international conference on language resources and evaluation (LREC’2012), Istanbul, 21–27 May 2012, pp. 454–459.
Steinberger, J., Lenkova, P., Kabadjov, M., Steinberger, R., & van der Goot, E. (2011a). Multilingual entity-centered sentiment analysis evaluated by parallel corpora. In Proceedings of the 8th international conference recent advances in natural language processing (RANLP’2011), Hissar, Bulgaria, 12–14 September 2011.
Steinberger, R., Pouliquen, B., Kabadjov, M., & van der Goot, E. (2011b). JRC-Names: A freely available, highly multilingual named entity resource. In Proceedings of the 8th international conference recent advances in natural language processing (RANLP’2011), Hissar, Bulgaria, 12–14 September 2011, pp. 104–110.
Steinberger, R., Pouliquen, B., & van der Goot, E. (2009). An Introduction to the Europe media monitor family of applications. In F. Gey, N. Kando, & J. Karlgren (Eds.), Information access in a multilingual world—proceedings of the SIGIR 2009 workshop (SIGIR-CLIR’2009), Boston, USA, 23 July 2009, pp. 1–8.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th international conference on language resources and evaluation (LREC’2006), Genoa, Italy, 24–26 May 2006, pp. 2142–2147.
Tiedemann, J. (2009). News from OPUS—A collection of multilingual parallel corpora with tools and interfaces. In Recent advances in natural language processing (Vol. V, pp. 237–248). John Benjamins, Amsterdam/Philadelphia.
Tiedemann, J., & Nygaard, L. (2004). The OPUS corpus—Parallel and free. In Proceedings of the 4th international conference on language resources and evaluation (LREC), Lisbon, Portugal, pp. 1183–1186.
Tufiş, D. (2004). Term translations in parallel corpora: Discovery and consistency check. In Proceedings of the 4th international conference on language resources and evaluation (LREC), Lisbon, Portugal, pp. 1981–1984.
Turchi, M., Steinberger, J., Kabadjov, M., & Steinberger, R. (2010). Using parallel corpora for multilingual (multi-document) summarisation evaluation. In Multilingual and multimodal information access evaluation. Springer Lecture Notes in Computer Science, Vol. 6360, pp. 52–63.
Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., & Trón, V. (2005). Parallel corpora for medium density languages. In Proceedings of RANLP’2005, Borovets, Bulgaria, pp. 590–596.
Vinokourov, A., Shawe-Taylor, J., & Cristianini, N. (2003). Inferring a semantic representation of text via cross-language correlation analysis. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems 15 (pp. 1473–1480). Cambridge, MA: MIT Press.
Wei, C.-P., Yang, C. C., & Lin, C.-M. (2008). A latent semantic indexing-based approach to multilingual document clustering. Decision Support Systems, 45(2008), 606–620.
Yarowsky, D., Ngai, G., & Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of HLT’01, San Diego.
Zhechev, V., & Way, A. (2008). Automatic generation of parallel treebanks. In Proceedings of the 22nd international conference on computational linguistics (CoLing’2008), Manchester, UK, Vol. 1, pp. 1105–1112.
About this article
Cite this article
Steinberger, R., Ebrahim, M., Poulis, A. et al. An overview of the European Union’s highly multilingual parallel corpora. Lang Resources & Evaluation 48, 679–707 (2014). https://doi.org/10.1007/s10579-014-9277-0
- Parallel corpora
- Linguistic resources
- Highly multilingual
- European Union
- Translation memory
- JRC EuroVoc Indexer JEX