Abstract
This paper explores a new discriminative training procedure for continuous-space translation models (CTMs) that correlates better with translation quality than conventional training methods. The core of the method lies in the definition of a novel objective function which enables us to effectively integrate the CTM with the rest of the translation system through \(N\)-best rescoring. Using a fixed architecture, in which we iteratively retrain the CTM parameters and the log-linear coefficients, we compare various ways to define and combine training criteria for each of these steps, drawing inspiration from both max-margin and learning-to-rank techniques. We experimentally show that a recently introduced loss function, which combines these two techniques, outperforms several objective functions from the literature. We also show that ensuring the consistency of the losses used to train these two sets of parameters is beneficial to the overall performance.
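To make the flavor of such objectives concrete, a pairwise max-margin ranking loss over an \(N\)-best list can be written schematically as
\[ \ell(\theta) \;=\; \sum_{(h^{+},\, h^{-})} \max\bigl(0,\; \alpha - s_{\theta}(h^{+}) + s_{\theta}(h^{-})\bigr), \]
where \(h^{+}\) denotes a hypothesis ranked above \(h^{-}\) by sentence-level BLEU, \(s_{\theta}\) is the model score, and \(\alpha\) is a margin hyper-parameter. This formulation is given for illustration only and is not necessarily the exact criterion studied in the paper.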
Notes
Note, however, that they could be used with any phrase-based system.
Note that the complete model for a sentence pair involves latent variables that specify the reordering of the source sentence, as well as its segmentation into translation units. These are omitted henceforth for the sake of clarity.
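Schematically, writing \(\mathbf{a}\) for these latent variables (notation introduced here for illustration only), the model marginalizes over them: \( p(\mathbf{t} \mid \mathbf{s}) = \sum_{\mathbf{a}} p(\mathbf{t}, \mathbf{a} \mid \mathbf{s}) \).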
The features used in our experiments are standard phrase-based features, see, e.g., Crego et al. (2011).
http://www.statmt.org/moses/. For MIRA, we use the KB MIRA implementation (Cherry and Foster 2012).
Note that these corpora need not be distinct and may partly overlap. For the sake of this presentation, we refer to these corpora respectively as the out-of-domain and the in-domain data. This also corresponds to our experimental setting.
Two variants of expected-BLEU exist in the literature: one (that we use here) takes the expectation of the BLEU score over an approximation of the search space; the other, used for instance in Rosti et al. (2010), computes BLEU with expected n-gram statistics.
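For illustration, the first variant can be written as \( \mathbb{E}[\text{BLEU}] = \sum_{h \in \mathcal{N}(s)} p_{\lambda}(h \mid s)\, \text{BLEU}(h, r) \), where \(\mathcal{N}(s)\) is the \(N\)-best list for source sentence \(s\) and \(r\) the reference (notation introduced here only for this note); the second variant instead plugs expected n-gram counts into the BLEU formula.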
This is reflected in the train column, where we observe a large difference in BLEU score between the two scenarios.
References
Allauzen A, Pécheux N, Do QK, Dinarelli M, Lavergne T, Max A, Le H, Yvon F (2013) LIMSI @ WMT13. In: Proceedings of the workshop on statistical machine translation, Sofia, Bulgaria, pp 62–69
Auli M, Gao J (2014) Decoder integration and expected BLEU training for recurrent neural network language models. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (ACL), pp 136–142
Auli M, Galley M, Quirk C, Zweig G (2013) Joint language and translation modeling with recurrent neural networks. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1044–1054
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473
Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. J Mach Learn Res 3:1137–1155
Blunsom P, Osborne M (2008) Probabilistic inference for machine translation. In: Proceedings of the conference on empirical methods in natural language processing, pp 215–223
Blunsom P, Cohn T, Osborne M (2008) A discriminative latent variable model for statistical machine translation. In: ACL, pp 200–208
Casacuberta F, Vidal E (2004) Machine translation with inferred stochastic finite-state transducers. Comput Linguist 30(3):205–225
Cherry C, Foster G (2012) Batch tuning strategies for statistical machine translation. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), pp 427–436
Chiang D, Marton Y, Resnik P (2008) Online large-margin training of syntactic and structural translation features. In: Proceedings of the conference on empirical methods in natural language processing, pp 224–233
Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), Doha, Qatar, pp 1724–1734
Collins M (2002) Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp 1–8
Collobert R, Weston J (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th international conference on machine learning. ACM, New York, pp 160–167
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Crammer K, Singer Y (2003) Ultraconservative online algorithms for multiclass problems. J Mach Learn Res 3:951–991
Crego JM, Mariño JB (2006) Improving statistical MT by coupling reordering and decoding. Mach Transl 20(3):199–215
Crego JM, Yvon F, Mariño JB (2011) N-code: an open-source bilingual N-gram SMT toolkit. Prague Bull Math Linguist 96:49–58
Devlin J, Zbib R, Huang Z, Lamar T, Schwartz R, Makhoul J (2014) Fast and robust neural network joint models for statistical machine translation. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics. Long papers, vol 1, Baltimore, MD, pp 1370–1380
Do QK (2016) Discriminative training of continuous space translation models. PhD Thesis, Université Paris-Sud and Université Paris-Saclay
Do Q-K, Allauzen A, Yvon F (2014) Discriminative adaptation of continuous space translation models. In: International workshop on spoken language translation (IWSLT 2014), Lake Tahoe, USA
Do Q-K, Allauzen A, Yvon F (2015a) Apprentissage discriminant des modèles continus de traduction. In: Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles, Caen, France. Association pour le Traitement Automatique des Langues, pp 267–278
Do QK, Allauzen A, Yvon F (2015b) A discriminative training procedure for continuous translation models. In: Conference on empirical methods in natural language processing (EMNLP 2015), Lisboa, Portugal, p 7
Dyer C, Resnik P (2010) Context-free reordering, finite-state translation. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), pp 858–866
Freund Y, Schapire RE (1999) Large margin classification using the perceptron algorithm. Mach Learn 37(3):277–296
Gao J, He X (2013) Training MRF-based phrase translation models using gradient ascent. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), Atlanta, pp 450–459
Gao J, He X, Yih W-t, Deng L (2014) Learning continuous phrase representations for translation modeling. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, Baltimore, MD
Gutmann M, Hyvärinen A (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In: Teh YW, Titterington M (eds) Proceedings of the international conference on artificial intelligence and statistics (AISTATS), vol 9, pp 297–304
He X, Deng L (2012) Maximum expected BLEU training of phrase and lexicon translation models. In: Proceedings of the 50th annual meeting of the Association for Computational Linguistics: long papers, vol 1, pp 292–301
Hopkins M, May J (2011) Tuning as ranking. In: Proceedings of the 2011 conference on empirical methods in natural language processing, Edinburgh, Scotland, UK, pp 1352–1362
Lavergne T, Crego JM, Allauzen A, Yvon F (2011) From n-gram-based to CRF-based translation models. In: Proceedings of the sixth workshop on statistical machine translation, pp 542–553
Lavergne T, Allauzen A, Yvon F (2013) Un cadre d’apprentissage intégralement discriminant pour la traduction statistique. TALN-RÉCITAL 2013, p 450
Le H-S, Oparin I, Allauzen A, Gauvain J-L, Yvon F (2011) Structured output layer neural network language model. In: Proceedings of the international conference on audio, speech and signal processing, pp 5524–5527
Le H-S, Allauzen A, Yvon F (2012) Continuous space translation models with neural networks. In: Proceedings of the North American chapter of the Association for Computational Linguistics: human language technologies (NAACL-HLT), Montréal, Canada, pp 39–48
Liang P, Bouchard-Côté A, Klein D, Taskar B (2006) An end-to-end discriminative approach to machine translation. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 761–768
Mariño JB, Banchs RE, Crego JM, de Gispert A, Lambert P, Fonollosa JA, Costa-Jussà MR (2006) N-gram-based machine translation. Comput Linguist 32(4):527–549
McDonald R, Crammer K, Pereira F (2005) Online large-margin training of dependency parsers. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 91–98
Mnih A, Hinton GE (2008) A scalable hierarchical distributed language model. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in neural information processing systems, vol 21, pp 1081–1088
Mnih A, Teh YW (2012) A fast and simple algorithm for training neural probabilistic language models. In: Proceedings of the international conference of machine learning (ICML)
Neubig G, Watanabe T (2016) Optimization for statistical machine translation: a survey. Comput Linguist 42(1):1–54
Niehues J, Waibel A (2012) Continuous space language models using restricted Boltzmann machines. In: Proceedings of the international workshop on spoken language translation (IWSLT), Hong Kong, China, pp 164–170
Och FJ (2003) Minimum error rate training in statistical machine translation. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, pp 160–167
Papineni K, Roukos S, Ward T, Zhu W-J (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 311–318
Rosti A-V, Zhang B, Matsoukas S, Schwartz R (2010) BBN system description for WMT10 system combination task. In: Proceedings of the joint fifth workshop on statistical machine translation and MetricsMATR, Uppsala, Sweden. Association for Computational Linguistics, pp 321–326
Schwenk H (2007) Continuous space language models. Comput Speech Lang 21(3):492–518
Schwenk H, Costa-Jussa MR, Fonollosa JAR (2007) Smooth bilingual n-gram translation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Prague, Czech Republic, pp 430–438
Shen L, Joshi AK (2005) Ranking and reranking with perceptron. Mach Learn 60(1–3):73–96
Shen L, Sarkar A, Och FJ (2004) Discriminative reranking for machine translation. In: HLT-NAACL, pp 177–184
Shen S, Cheng Y, He Z, He W, Wu H, Sun M, Liu Y (2015) Minimum risk training for neural machine translation. CoRR. arXiv:1512.02433
Simianer P, Riezler S, Dyer C (2012) Joint feature selection in distributed stochastic learning for large-scale discriminative training in SMT. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), pp 11–21
Socher R, Bauer J, Manning CD, Ng AY (2013) Parsing with compositional vector grammars. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, pp 455–465
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Advances in neural information processing systems 27 (NIPS), Montréal, Canada, pp 3104–3112
Vaswani A, Zhao Y, Fossum V, Chiang D (2013) Decoding with large-scale neural language models improves translation. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Seattle, Washington, USA, pp 1387–1392
Watanabe T, Suzuki J, Tsukada H, Isozaki H (2007) Online large-margin training for statistical machine translation. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Prague, Czech Republic, pp 764–773
Yang N, Liu S, Li M, Zhou M, Yu N (2013) Word alignment modeling with context dependent deep neural networks. In: Proceedings of the annual meeting of the Association for Computational Linguistics (ACL), Sofia, Bulgaria, pp 166–175
Zens R, Och FJ, Ney H (2002) Phrase-based statistical machine translation. In: KI ’02: proceedings of the 25th annual German conference on AI. Springer, London, pp 18–32
Zens R, Hasan S, Ney H (2007) A systematic comparison of training criteria for statistical machine translation. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL), Prague, Czech Republic, pp 524–532
Cite this article
Allauzen, A., Do, Q.K. & Yvon, F. A comparison of discriminative training criteria for continuous space translation models. Machine Translation 31, 19–33 (2017). https://doi.org/10.1007/s10590-017-9195-1