A comparison of discriminative training criteria for continuous space translation models

Machine Translation

Abstract

This paper explores a new discriminative training procedure for continuous-space translation models (CTMs) which correlates better with translation quality than conventional training methods. The core of the method lies in the definition of a novel objective function which enables us to effectively integrate the CTM with the rest of the translation system through N-best rescoring. Using a fixed architecture, where we iteratively retrain the CTM parameters and the log-linear coefficients, we compare various ways to define and combine training criteria for each of these steps, drawing inspiration both from max-margin and from learning-to-rank techniques. We experimentally show that a recently introduced loss function, which combines these two techniques, outperforms several objective functions from the literature. We also show that ensuring the consistency of the losses used to train these two sets of parameters is beneficial to the overall performance.
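
To make the rescoring step more concrete, the sketch below illustrates one way of combining a max-margin criterion with pairwise learning-to-rank on an N-best list; it is a minimal illustration, not the authors' exact objective. The hypothesis feature vectors (with the CTM score assumed to be included as one of the features), the BLEU-based quality of each hypothesis, and the learning rate are assumptions introduced for the example; only the log-linear coefficients are updated here.

```python
# Minimal sketch (assumptions noted in the text): pairwise max-margin updates
# of the log-linear coefficients used to rescore an N-best list.
import numpy as np


def rescore(nbest, weights):
    """Model score of each hypothesis: dot product between the log-linear
    coefficients and its feature vector (the CTM score is assumed to be one
    of these features)."""
    return [float(np.dot(weights, hyp["features"])) for hyp in nbest]


def pairwise_hinge_update(nbest, weights, lr=0.01, margin=1.0):
    """One pass over a single N-best list.

    Hypotheses are compared in pairs: whenever a higher-quality hypothesis
    (larger sentence-level BLEU, stored under "quality") is not scored at
    least `margin` above a lower-quality one, the weights are moved along
    the feature difference, as in max-margin learning-to-rank."""
    scores = rescore(nbest, weights)
    for i, better in enumerate(nbest):
        for j, worse in enumerate(nbest):
            if better["quality"] <= worse["quality"]:
                continue  # only consider correctly ordered quality pairs
            if scores[i] < scores[j] + margin:  # margin violated
                delta = np.asarray(better["features"]) - np.asarray(worse["features"])
                weights = weights + lr * delta
    return weights
```

In the iterative architecture described above, an update of this kind for the log-linear coefficients would alternate with retraining the CTM parameters, ideally under a consistent loss, which is the question the paper investigates.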

Notes

  1. Note, however, that they could be used with any phrase-based system.

  2. Note that the complete model for a sentence pair involves latent variables that specify the reordering of the source sentence, as well as its segmentation into translation units. These are omitted henceforth for the sake of clarity.

  3. The features used in our experiments are standard phrase-based features, see, e.g., Crego et al. (2011).

  4. http://www.statmt.org/moses/. For MIRA, we use the KB MIRA implementation (Cherry and Foster 2012).

  5. Note that these corpora need not be distinct and may partly overlap. For the sake of this presentation, we refer to them respectively as the out-of-domain and the in-domain data; this also corresponds to our experimental setting.

  6. Two variants of expected-BLEU exist in the literature: one (the one we use here) takes the expectation of the BLEU score over an approximation of the search space; the other, used for instance in Rosti et al. (2010), computes BLEU with expected n-gram statistics. A sketch of the first variant is given after these notes.

  7. http://workshop2014.iwslt.org/.

  8. http://ncode.limsi.fr/.

  9. This is reflected in the train column, where we observe a substantial difference in BLEU score between the two scenarios.
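
As an illustration of the first expected-BLEU variant (footnote 6), the sketch below computes the expectation of a sentence-level BLEU score over an N-best approximation of the search space, with the posterior obtained as a softmax of scaled model scores. The model scores, the per-hypothesis sentence-level BLEU values, and the scaling factor alpha are assumed inputs.

```python
# Hedged sketch of expected BLEU over an N-best list:
#   E[BLEU] = sum_e p(e|f) * BLEU(e),  with p(e|f) proportional to exp(alpha * score(e)).
import math


def expected_bleu(nbest_scores, nbest_bleus, alpha=1.0):
    """nbest_scores: model score of each hypothesis in the N-best list.
    nbest_bleus: sentence-level BLEU of each hypothesis against the reference.
    Returns the expectation of BLEU under the softmax posterior."""
    m = max(nbest_scores)  # subtract the max for numerical stability
    exps = [math.exp(alpha * (s - m)) for s in nbest_scores]
    z = sum(exps)
    return sum((w / z) * b for w, b in zip(exps, nbest_bleus))
```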

References

  • Allauzen A, Pécheux N, Do QK, Dinarelli M, Lavergne T, Max A, Le H, Yvon F (2013) LIMSI @ WMT13. In: Proceedings of the workshop on statistical machine translation, Sofia, Bulgaria, pp 62–69

  • Auli M, Gao J (2014a) Decoder integration and expected BLEU training for recurrent neural network language models. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 136–142

  • Auli M, Gao J (2014b) Decoder integration and expected BLEU training for recurrent neural network language models. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (ACL’14), pp 136–142

  • Auli M, Galley M, Quirk C, Zweig G (2013) Joint language and translation modeling with recurrent neural networks. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1044–1054

  • Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473

  • Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155

  • Blunsom P, Osborne M (2008) Probabilistic inference for machine translation. In: Proceedings of the conference on empirical methods in natural language processing, pp 215–223

  • Blunsom P, Cohn T, Osborne M (2008) A discriminative latent variable model for statistical machine translation. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 200–208

  • Casacuberta F, Vidal E (2004) Machine translation with inferred stochastic finite-state transducers. Comput Linguist 30(3):205–225

  • Cherry C, Foster G (2012) Batch tuning strategies for statistical machine translation. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), pp 427–436

  • Chiang D, Marton Y, Resnik P (2008) Online large-margin training of syntactic and structural translation features. In: Proceedings of the conference on empirical methods in natural language processing, pp 224–233

  • Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, pp 1724–1734

  • Collins M (2002) Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1–8

  • Collobert R, Weston J (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning. ACM, New York, pp 160–167

  • Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537

  • Crammer K, Singer Y (2003) Ultraconservative online algorithms for multiclass problems. J Mach Learn Res 3:951–991

  • Crego JM, Mariño JB (2006) Improving statistical MT by coupling reordering and decoding. Mach Transl 20(3):199–215

  • Crego JM, Yvon F, Mariño JB (2011) N-code: an open-source bilingual N-gram SMT toolkit. Prague Bull Math Linguist 96:49–58

  • Devlin J, Zbib R, Huang Z, Lamar T, Schwartz R, Makhoul J (2014) Fast and robust neural network joint models for statistical machine translation. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics. Long papers, vol 1, Baltimore, MD, pp 1370–1380

  • Do QK (2016) Discriminative training of continuous space translation models. PhD Thesis, Université Paris-Sud and Université Paris-Saclay

  • Do Q-K, Allauzen A, Yvon F (2014) Discriminative adaptation of continuous space translation models. In: International workshop on spoken language translation (IWSLT 2014), Lake Tahoe, USA

  • Do Q-K, Allauzen A, Yvon F (2015a) Apprentissage discriminant des modèles continus de traduction [Discriminative training of continuous translation models]. In: Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles (TALN), Caen, France. Association pour le Traitement Automatique des Langues, pp 267–278

  • Do QK, Allauzen A, Yvon F (2015b) A discriminative training procedure for continuous translation models. In: Conference on empirical methods in natural language processing (EMNLP 2015), Lisboa, Portugal, p 7

  • Dyer C, Resnik P (2010) Context-free reordering, finite-state translation. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), pp 858–866

  • Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296

  • Gao J, He X (2013) Training MRF-based phrase translation models using gradient ascent. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), Atlanta, pp 450–459

  • Gao J, He X, Yih W-t, Deng L (2014) Learning continuous phrase representations for translation modeling. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, Baltimore, MD

  • Gutmann M, Hyvärinen A (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: Teh YW, Titterington M (eds) Proceedings of the international conference on artificial intelligence and statistics (AISTATS), vol 9, pp 297–304

  • He X, Deng L (2012) Maximum expected BLEU training of phrase and lexicon translation models. In: Proceedings of the 50th annual meeting of the Association for Computational Linguistics: long papers, vol 1, pp 292–301

  • Hopkins M, May J (2011) Tuning as ranking. In: Proceedings of the 2011 conference on empirical methods in natural language processing, Edinburgh, Scotland, UK, pp 1352–1362

  • Lavergne T, Crego JM, Allauzen A, Yvon F (2011) From n-gram-based to CRF-based translation models. In: Proceedings of the sixth workshop on statistical machine translation, pp 542–553

  • Lavergne T, Allauzen A, Yvon F (2013) Un cadre d’apprentissage intégralement discriminant pour la traduction statistique [A fully discriminative learning framework for statistical machine translation]. In: TALN-RÉCITAL 2013, p 450

  • Le H-S, Oparin I, Allauzen A, Gauvain J-L, Yvon F (2011) Structured output layer neural network language model. In: Proceedings of the international conference on audio, speech and signal processing, pp 5524–5527

  • Le H-S, Allauzen A, Yvon F (2012) Continuous space translation models with neural networks. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), Montréal, Canada, pp 39–48

  • Liang P, Bouchard-Côté A, Klein D, Taskar B (2006) An end-to-end discriminative approach to machine translation. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 761–768

  • Mariño JB, Banchs RE, Crego JM, de Gispert A, Lambert P, Fonollosa JA, Costa-Jussà MR (2006) N-gram-based machine translation. Comput Linguist 32(4):527–549

  • McDonald R, Crammer K, Pereira F (2005) Online large-margin training of dependency parsers. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 91–98

  • Mnih A, Hinton GE (2008) A scalable hierarchical distributed language model. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in neural information processing systems, vol 21, pp 1081–1088

  • Mnih A, Teh YW (2012) A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the international conference of machine learning (ICML)

  • Neubig G, Watanabe T (2016) Optimization for statistical machine translation: a survey. Comput Linguist 42(1):1–54

  • Niehues J, Waibel A (2012) Continuous space language models using restricted Boltzmann machines. In: Proceedings of international workshop on spoken language translation (IWSLT), Hong-Kong, China, pp 164–170

  • Och FJ (2003) Minimum error rate training in statistical machine translation. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp 160–167

  • Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 311–318

  • Rosti A-V, Zhang B, Matsoukas S, Schwartz R (2010) BBN system description for WMT10 system combination task. In: Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR, Uppsala, Sweden. Association for Computational Linguistics, pp 321–326

  • Schwenk H (2007) Continuous space language models. Comput Speech Lang 21(3):492–518

  • Schwenk H, Costa-Jussa MR, Fonollosa JAR (2007) Smooth bilingual n-gram translation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Prague, Czech Republic, pp 430–438

  • Shen L, Joshi AK (2005) Ranking and reranking with perceptron. Mach Learn 60(1–3):73–96

  • Shen L, Sarkar A, Och FJ (2004) Discriminative reranking for machine translation. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (HLT-NAACL), pp 177–184

  • Shen S, Cheng Y, He Z, He W, Wu H, Sun M, Liu Y (2015) Minimum risk training for neural machine translation. CoRR. arXiv:1512.02433

  • Simianer P, Riezler S, Dyer C (2012) Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 11–21

  • Socher R, Bauer J, Manning CD, Andrew YN (2013) Parsing with compositional vector grammars. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, pp 455–465

  • Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems 27 (NIPS), Montréal, Canada, pp 3104–3112

  • Vaswani A, Zhao Y, Fossum V, Chiang D (2013) Decoding with large-scale neural language models improves translation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Seattle, Washington, USA, pp 1387–1392

  • Watanabe T, Suzuki J, Tsukada H, Isozaki H (2007) Online large-margin training for statistical machine translation. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Prague, Czech Republic, pp 764–773

  • Yang N, Liu S, Li M, Zhou M, Yu N (2013) Word alignment modeling with context dependent deep neural networks. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, pp 166–175

  • Zens R, Och FJ, Ney H (2002) Phrase-based statistical machine translation. In: KI ’02: proceedings of the 25th annual German conference on AI. Springer, London, pp 18–32

  • Zens R, Hasan S, Ney H (2007) A systematic comparison of training criteria for statistical machine translation. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Prague, Czech Republic, pp 524–532

Author information

Corresponding author

Correspondence to Alexandre Allauzen.

About this article

Cite this article

Allauzen, A., Do, Q.K. & Yvon, F. A comparison of discriminative training criteria for continuous space translation models. Machine Translation 31, 19–33 (2017). https://doi.org/10.1007/s10590-017-9195-1
