Language Resources and Evaluation, Volume 48, Issue 4, pp 679–707

An overview of the European Union’s highly multilingual parallel corpora

  • Ralf Steinberger
  • Mohamed Ebrahim
  • Alexandros Poulis
  • Manuel Carrasco-Benitez
  • Patrick Schlüter
  • Marek Przybyszewski
  • Signe Gilbro
Project Notes

Abstract

Since 2006, the European Commission’s Joint Research Centre and other European Union organisations have made available a number of large-scale, highly multilingual parallel language resources. In this article, we give a comparative overview of these resources and explain the specific nature of each of them. The article answers a number of questions, including: What are these linguistic resources? What is the difference between them? Why were they originally created and why was the data released publicly? What can they be used for and what are the limitations of their usability? What are the text types, subject domains and languages covered? How can overlapping document sets be avoided? How do they compare regarding the formatting and the translation alignment? What are their usage conditions? What other types of multilingual linguistic resources does the EU have? This article thus aims to clarify what the similarities and differences between the various resources are and what they can be used for. It will also serve as a reference publication for those resources for which a more detailed description has been lacking so far (EAC-TM, ECDC-TM and DGT-Acquis).

Keywords

Parallel corpora; Linguistic resources; Highly multilingual; European Union; Translation memory; JRC-Acquis; DGT-Acquis; DGT-TM; DCEP; ECDC-TM; EAC-TM; JRC EuroVoc Indexer JEX; EuroVoc; Eur-Lex

Copyright information

© European Union 2014

Authors and Affiliations

  • Ralf Steinberger (1)
  • Mohamed Ebrahim (2)
  • Alexandros Poulis (3)
  • Manuel Carrasco-Benitez (4)
  • Patrick Schlüter (4)
  • Marek Przybyszewski (5)
  • Signe Gilbro (6)

  1. European Commission – Joint Research Centre (JRC), Ispra, Italy
  2. Cognizant-SetCon GmbH, Munich, Germany
  3. Lionbridge Technologies, Inc., Tampere, Finland
  4. European Commission – Directorate General for Translation (DGT), Luxembourg, Luxembourg
  5. European Commission – Directorate General Education and Culture (EAC), Brussels, Belgium
  6. European Centre for Disease Prevention and Control (ECDC), Stockholm, Sweden