Machine Translation, Volume 23, Issue 1, pp 1–22

Automatically generated parallel treebanks and their exploitability in machine translation

Article

Abstract

Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for improvements to the current state of the art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no means of building them other than by hand. In this paper, we describe how we make use of new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PB-SMT) system leads to significant improvements in translation quality. Following this, we describe experiments in which we exploit the information encoded in the parallel treebank in other areas of the PB-SMT framework, while investigating the conditions under which the incorporation of parallel treebank data performs optimally. Finally, we discuss the possibility of exploiting automatically generated parallel treebanks further in syntax-aware paradigms of MT.
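To make the idea of adding phrase pairs to a translation model concrete, here is a minimal sketch. It assumes the simplest possible combination scheme: pooling the baseline and treebank-derived phrase pairs and re-estimating translation probabilities by relative frequency. The abstract does not say how the combination is actually performed, so this is an illustration under that assumption, not the authors' method. The function name and the toy English-French phrase pairs are invented for the example.

```python
from collections import Counter, defaultdict


def estimate_translation_model(phrase_pairs):
    """Relative-frequency estimate of p(target | source).

    phrase_pairs is a list of (source_phrase, target_phrase) tuples;
    repeated tuples count as repeated observations.
    """
    pair_counts = Counter(phrase_pairs)

    # Total number of observations for each source phrase.
    source_totals = Counter()
    for (source, _target), count in pair_counts.items():
        source_totals[source] += count

    # p(target | source) = count(source, target) / count(source)
    model = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        model[source][target] = count / source_totals[source]
    return dict(model)


# Hypothetical toy data, not taken from the paper.
baseline_pairs = [
    ("la maison", "the house"),
    ("la maison", "the house"),
    ("la maison", "house"),
]
treebank_pairs = [
    ("la maison", "the house"),
    ("maison bleue", "blue house"),
]

baseline_model = estimate_translation_model(baseline_pairs)
combined_model = estimate_translation_model(baseline_pairs + treebank_pairs)
```

In this toy setting, pooling has two effects: it shifts probability mass towards translations that both sources agree on, and it adds coverage for source phrases the baseline never extracted (here, "maison bleue").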

Keywords

Parallel treebanks; Statistical machine translation; Phrase-based statistical machine translation; Syntax in machine translation; Sub-tree alignment; Translation modelling; Resource combination; Word alignment; Phrase alignment; Hybrid models

Copyright information

© Springer Science+Business Media B.V. 2010

Authors and Affiliations

  1. School of Computing, National Centre for Language Technology, Dublin City University, Dublin 9, Ireland
