Text Genre – An Unexplored Parameter in Statistical Machine Translation

  • Conference paper
Human Language Technology Challenges for Computer Science and Linguistics (LTC 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8387)

Abstract

It is generally accepted that the performance of a statistical machine translation (SMT) system depends significantly on how closely the domain of the training data matches that of the test data. In recent years, several methods have been proposed to deal with out-of-domain words. Little to no attention has been paid, however, to text genre within the same domain. In this paper we demonstrate that the style of the training corpus may influence the quality of the translation output even when the domain of the training and test data remains almost unchanged but the text genre changes. We use the JRC-Acquis as training data and the Europarl corpus as test data. We also include experiments with out-of-domain test data, as a point of comparison for the variation in performance of the SMT system.

Notes

  1. http://ipsc.jrc.ec.europa.eu/index.php?id=198

  2. http://www.statmt.org/europarl/

  3. http://www.atlasproject.eu

  4. http://www.systranet.com/

  5. http://atlasproject.eu

  6. See http://www.meta-net.eu/whitepapers

  7. www.statmt.org/wmt11/baseline.html

  8. www.statmt.org/moses/

  9. The tag <p> from the initial HTML files.

  10. In the Moses description, all sentences longer than forty tokens are excluded.

  11. Status: February 2011; http://www.statmt.org/europarl/

  12. A one-to-one comparison is not possible, as the training and test data are not the same.

  13. Word-form = declension form, conjugation form, etc.
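
The length restriction mentioned in note 10 can be illustrated with a minimal sketch. This is not the Moses cleaning script and not necessarily the authors' exact preprocessing; it assumes whitespace tokenization, and the function name `filter_by_length` and the decision to also drop empty sentences are hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, Tuple

MAX_TOKENS = 40  # threshold taken from note 10


def filter_by_length(
    pairs: Iterable[Tuple[str, str]],
    max_tokens: int = MAX_TOKENS,
) -> Iterator[Tuple[str, str]]:
    """Yield only sentence pairs whose two sides each have 1..max_tokens tokens.

    Assumption: tokens are separated by whitespace, and empty sentences
    are dropped as well (the source only states the upper limit).
    """
    for source, target in pairs:
        source_len = len(source.split())
        target_len = len(target.split())
        if 1 <= source_len <= max_tokens and 1 <= target_len <= max_tokens:
            yield source, target
```

Applied to an aligned corpus, a filter of this kind removes a pair as soon as either side exceeds the limit, so both sides of the parallel data stay aligned.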

References

  1. Calude, A.: Machine translation of various text genres. Presented at 7th Language and Society Conference of the New Zealand Linguistic Society. Hamilton, New Zealand, 12 p., November 2002. (unpublished) (http://www.mt-archive.info/Calude-2003.pdf)

  2. Cristea, D.: Romanian language technology and resources go to Europe. Presentation held at the FP7 Language Technology Informative Days, January, 20–11 (2009)

  3. Doddington, G.: Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In: Proceedings of the Second International Conference on Human Language Technology Research, pp. 138–145. Morgan Kaufmann Publishers Inc., San Francisco (2002)

  4. Gavrila, M.: Improving recombination in a linear EBMT system by use of constraints. Ph.D. thesis, University of Hamburg (2012)

  5. Gavrila, M., Elita, N.: Roger - un corpus paralel aliniat. In: Resurse Lingvistice si Instrumente pentru Prelucrarea Limbii Romane Workshop Proceedings, pp. 63–67, Ed. Univ. Alexandru Ioan Cuza, December 2006. Workshop held in November 2006. ISBN: 978-973-703-208-9

  6. Ignat, C.: Improving Statistical Alignment and Translation Using Highly Multilingual Corpora. Ph.D. thesis, INSA - LGeco- LICIA, Strasbourg, France, 16 June 2009

  7. Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation. MT Summit (2005)

  8. Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A., Herbst, E.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, pp. 177–180, Prague, Czech Republic, June 2007

  9. Koehn, P., Birch, A., Steinberger, R.: 462 Machine Translation Systems for Europe. MT Summit (2009)

  10. Niehues, J., Waibel, A.: Domain adaptation in statistical machine translation using factored translation models. In: Proceedings of EAMT, Saint-Raphael (2010)

  11. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)

  12. Papineni, K., Roukos, S., Ward, T., Zhu, W-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Session: Machine Translation and Evaluation, pp. 311–318. Association for Computational Linguistics Morristown, Philadelphia (2002)

  13. Rousu, J., SMART Project: Workpackage 3 advanced language models. Report of the EU project: SMART (2008)

  14. Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of Association for Machine Translation in the Americas, pp. 223–231, August 2006

  15. Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufis, D., Varga, D.: The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), pp. 2142–2147, May, Genoa, Italy (2006)

  16. Stolcke, A.: SRILM - An extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language (2002)


Acknowledgments

Part of the work presented in this paper was carried out within the EU project ATLAS, supported through the ICT-PSP Programme of the European Commission (topic “Multilingual Web”), and part within the PhD research conducted by Monica Gavrila at the University of Hamburg (see [4]).

Author information

Corresponding author

Correspondence to Cristina Vertan.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Gavrila, M., Vertan, C. (2014). Text Genre – An Unexplored Parameter in Statistical Machine Translation. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_37

  • DOI: https://doi.org/10.1007/978-3-319-08958-4_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08957-7

  • Online ISBN: 978-3-319-08958-4

  • eBook Packages: Computer Science, Computer Science (R0)
