
Induction of latent domains in heterogeneous corpora: a case study of word alignment

Machine Translation

Abstract

This paper addresses the insensitivity of existing word alignment models to domain differences, which often yields suboptimal results on large heterogeneous corpora. We propose a novel latent domain word alignment model that induces domain-focused lexical and alignment statistics. The model is trained on a heterogeneous corpus under partial supervision, using a small number of seed samples from different domains. The seed samples allow the model to estimate sharper, domain-focused word alignment statistics for sentence pairs. Our experiments show that the derived domain-focused statistics, once combined, yield significant improvements in both word alignment accuracy and the translation accuracy of the resulting SMT systems. Going beyond these findings, we surmise that virtually any large corpus (e.g., Europarl, Hansards, Common Crawl) harbors a diversity of hidden domains that are unknown in advance. We therefore address the novel challenge of unsupervised induction of hidden domains in parallel corpora, applied within a domain-focused word alignment modeling framework. On the technical side, we contrast flat estimation for the unsupervised induction of domains with a simple form of hierarchical estimation: a two-step procedure that aims to avoid bad local maxima. Extensive experiments over seven language pairs, with fully unsupervised induction of domains for word alignment, demonstrate significant improvements in alignment accuracy.
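At its core, the model described above pairs each hidden domain with its own alignment statistics and weights them by a per-sentence domain posterior. A minimal sketch of this weighting, assuming an IBM-1-style lexical model and hypothetical table and function names (not the authors' code):

```python
import math

# Hedged sketch (not the authors' implementation): each hidden domain z
# carries its own lexical translation table t[z][(f, e)]; a sentence
# pair is scored under every domain, and the resulting posterior
# P(z | f, e) weights the domain-focused statistics when combined.

def sentence_loglik(t, z, src, tgt):
    """log P(src | tgt, z) under a simple IBM-1-style lexical model."""
    ll = 0.0
    for f in src:
        # Sum over all target words (plus NULL) as candidate alignments.
        s = sum(t[z].get((f, e), 1e-9) for e in tgt + ["NULL"])
        ll += math.log(s / (len(tgt) + 1))
    return ll

def domain_posterior(t, prior, src, tgt):
    """P(z | src, tgt) proportional to P(z) P(src | tgt, z)."""
    scores = {z: math.log(prior[z]) + sentence_loglik(t, z, src, tgt)
              for z in prior}
    # Normalize in log space for numerical stability.
    m = max(scores.values())
    exps = {z: math.exp(s - m) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: v / total for z, v in exps.items()}
```

In an EM training loop, these posteriors would serve as soft responsibilities when accumulating per-domain expected counts.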


Notes

  1. Although our work focuses on the HMM-based alignment model, the approach can also be applied straightforwardly to fertility-based alignment models (Brown et al. 1993).

  2. In this work we explicitly model distances in the range \(\pm 5\).
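A jump-distance parameterization of this kind can be sketched as follows; `MAX_JUMP` and `jump_bucket` are illustrative names, and clamping out-of-range distances into the boundary buckets is an assumption about how the \(\pm 5\) restriction is realized:

```python
# Hedged sketch: HMM alignment models parameterize transitions by the
# jump distance between consecutive aligned target positions. Only
# distances in [-5, 5] get their own parameter here; anything farther
# is collapsed into the nearest boundary bucket (an assumption).

MAX_JUMP = 5

def jump_bucket(prev_pos, cur_pos):
    """Map a raw jump distance to one of the 11 explicit buckets in [-5, 5]."""
    jump = cur_pos - prev_pos
    return max(-MAX_JUMP, min(MAX_JUMP, jump))
```

For example, a jump from position 2 to position 9 falls into the same bucket as any other forward jump of 5 or more.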

  3. \(P(z \mid \mathbf{f}, \mathbf{e})\) can also be computed heuristically with a symmetrized strategy, \(P(z \mid \mathbf{f}, \mathbf{e}) \propto P(z)\bigl(P(\mathbf{f} \mid \mathbf{e}, z)P(\mathbf{e} \mid z) + P(\mathbf{e} \mid \mathbf{f}, z)P(\mathbf{f} \mid z)\bigr)\). However, we found that this strategy does not significantly improve the final alignment accuracy.
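The symmetrized combination in this note can be sketched as follows, assuming the four per-domain likelihood terms are already available (e.g., from the two alignment directions and latent-domain LMs); the function name and dict-based interface are illustrative, not the paper's API:

```python
# Hedged sketch of the symmetrized posterior from this note:
# P(z | f, e) proportional to
#   P(z) * ( P(f | e, z) P(e | z) + P(e | f, z) P(f | z) ).
# All arguments are dicts keyed by domain z; the likelihood terms
# are assumed to be computed elsewhere.

def symmetrized_posterior(prior, p_f_given_e, p_e, p_e_given_f, p_f):
    scores = {z: prior[z] * (p_f_given_e[z] * p_e[z] +
                             p_e_given_f[z] * p_f[z])
              for z in prior}
    total = sum(scores.values())
    return {z: s / total for z, s in scores.items()}
```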

  4. During initialization, we assume that the pool of remaining sentence pairs in the heterogeneous data serves as the exemplifying sample of the out-domain.

  5. Note that adding word-level features from both translation sides does not help much, as observed by Och et al. (2004) and Huck et al. (2012). We thus add word-level features from only one translation side.

  6. Naturally, the data, like any large and complex dataset, contains a wide variety of hidden sub-domains that are not specified in advance. This motivates us to induce these domains automatically. In principle, we could induce domains without reference to the alignment problem and then use the latent domain variable within alignment models. However, we believe this would be suboptimal, as such domains would be induced to capture phenomena potentially irrelevant to word alignment (e.g., monolingual co-occurrence information).

  7. The corpus consists of 1.1M sentence pairs and is available at http://www.isi.edu/natural-language/download/hansard/index.html. After removing duplicate sentences, we kept 808.39K sentence pairs as the training data.
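The duplicate-removal step mentioned here amounts to keeping the first occurrence of each sentence pair; a minimal sketch with illustrative names (not the authors' preprocessing script):

```python
# Hedged sketch: drop exact duplicate sentence pairs from a parallel
# corpus, keeping the first occurrence of each (source, target) pair.

def dedup_pairs(pairs):
    seen = set()
    out = []
    for src, tgt in pairs:
        key = (src, tgt)
        if key not in seen:
            seen.add(key)
            out.append((src, tgt))
    return out
```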

  8. The corpus is available at http://www.statmt.org/europarl.

  9. Similarly, the original corpus (which contains duplicate sentences) consists of 1.0M sentence pairs and is available at http://optima.jrc.it/Acquis/JRC-Acquis.3.0/alignments/index.html.

  10. We train interpolated 3-gram latent-domain LMs with expected Kneser–Ney smoothing in our experiments.
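Combining such latent-domain LMs by linear interpolation can be sketched as follows; the expected Kneser–Ney smoothing itself (Zhang and Chiang 2014) is not reproduced, and each domain LM is simply assumed to expose a conditional probability lookup (illustrative interface):

```python
# Hedged sketch (not the paper's implementation): linearly interpolate
# per-domain 3-gram LM probabilities with fixed mixture weights.
# domain_lms maps each domain z to a callable p(word, history);
# weights maps each domain z to its mixture weight (summing to 1).

def interpolated_trigram_prob(domain_lms, weights, word, history):
    """P(word | history) as a weighted mixture over latent-domain LMs."""
    return sum(weights[z] * domain_lms[z](word, history)
               for z in domain_lms)
```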

  11. We also tried other choices of the hyperparameter, but did not observe significant differences in model performance.

References

  • Axelrod A, He X, Gao J (2011) Domain adaptation via pseudo in-domain data selection. In: EMNLP

  • Beal MJ (2003) Variational algorithms for approximate Bayesian inference. PhD Thesis, Gatsby Computational Neuroscience Unit, University College, London

  • Bojar O, Buck C, Callison-Burch C, Federmann C, Haddow B, Koehn P, Monz C, Post M, Soricut R, Specia L (2013) Findings of the 2013 workshop on statistical machine translation. In: WMT

  • Bojar O, Chatterjee R, Federmann C, Haddow B, Huck M, Hokamp C, Koehn P, Logacheva V, Monz C, Negri M, Post M, Scarton C, Specia L, Turchi M (2015) Findings of the 2015 workshop on statistical machine translation. In: WMT

  • Brown PF, Pietra VJD, Pietra SAD, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19:263–311


  • Carpuat M, Goutte C, Foster G (2014) Linear mixture models for robust machine translation. In: WMT

  • Chang YW, Rush AM, DeNero J, Collins M (2014) A constrained Viterbi relaxation for bidirectional word alignment. In: ACL

  • Cherry C, Foster G (2012) Batch tuning strategies for statistical machine translation. In: NAACL HLT

  • Clark JH, Dyer C, Lavie A, Smith NA (2011) Better hypothesis testing for statistical machine translation: controlling for optimizer instability. In: ACL HLT (short papers)

  • Cuong H, Sima’an K (2014a) Latent domain phrase-based models for adaptation. In: EMNLP

  • Cuong H, Sima’an K (2014b) Latent domain translation models in mix-of-domains haystack. In: COLING

  • Cuong H, Sima’an K, Titov I (2016) Adapting to all domains at once: rewarding domain invariance in SMT. TACL. https://transacl.org/ojs/index.php/tacl/article/view/768/176

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B 39(1):1–38


  • Denkowski M, Lavie A (2011) METEOR 1.3: automatic metric for reliable optimization and evaluation of machine translation systems. In: WMT

  • Devlin J, Zbib R, Huang Z, Lamar T, Schwartz R, Makhoul J (2014) Fast and robust neural network joint models for statistical machine translation. In: ACL

  • Duh K, Sudoh K, Tsukada H (2010) Analysis of translation model adaptation in statistical machine translation. In: IWSLT

  • Farajian MA, Bertoldi N, Federico M (2014) Online word alignment for online adaptive machine translation. In: Proceedings of the EACL 2014 workshop on humans and computer-assisted translation

  • Fraser A, Marcu D (2006) Semi-supervised training for statistical word alignment. In: COLING-ACL

  • Galley M, Manning CD (2008) A simple and effective hierarchical phrase reordering model. In: EMNLP

  • Gao Q, Bach N, Vogel S (2010) A semi-supervised word alignment algorithm with partial manual alignments. In: WMT

  • Gao Q, Lewis W, Quirk C, Hwang MY (2011) Incremental training and intentional over-fitting of word alignment. In: MT Summit

  • Gao Q, Vogel S (2010) Consensus versus expertise: a case study of word alignment with mechanical turk. In: NAACL HLT 2010 workshop on creating speech and language data with Amazon’s Mechanical Turk

  • Graça JV, Ganchev K, Taskar B (2010) Learning tractable word alignment models with complex constraints. Comput Linguist 36(3):481–504. https://doi.org/10.1162/coli_a_00007


  • Graca J, Pardal JP, Coheur L, Caseiro D (2008) Building a golden collection of parallel multi-language word alignment. In: LREC

  • Holmqvist M, Ahrenberg L (2011) A gold standard for English–Swedish word alignment. In: Proceedings of the 18th Nordic conference of computational linguistics NODALIDA 2011, vol 11

  • Hua W, Haifeng W, Zhanyi L (2005) Alignment model adaptation for domain-specific word alignment. In: ACL

  • Huck M, Peitz S, Freitag M, Nuhn M, Ney H (2012) The RWTH Aachen machine translation system for WMT 2012. In: WMT

  • Kirchhoff K, Bilmes J (2014) Submodularity for data selection in machine translation. In: EMNLP

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: MT Summit

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) MOSES: open source toolkit for statistical machine translation. In: ACL on interactive poster and demonstration sessions

  • Koehn P, Och FJ, Marcu D (2003) Statistical phrase-based translation. In: NAACL HLT

  • Liang P, Taskar B, Klein D (2006) Alignment by agreement. In: HLT-NAACL

  • Liu C, Liu Y, Sun M, Luan H, Yu H (2015) Generalized agreement for bidirectional word alignment. In: EMNLP

  • Mansour Y, Mohri M, Rostamizadeh A (2009a) Domain adaptation with multiple sources. In: NIPS

  • Mansour Y, Mohri M, Rostamizadeh A (2009b) Multiple source adaptation and the Rényi divergence. In: UAI

  • Mihalcea R, Pedersen T (2003) An evaluation exercise for word alignment. In: Proceedings of the HLT-NAACL 2003 workshop on building and using parallel texts: data driven machine translation and beyond, vol 3

  • Och FJ, Gildea D, Khudanpur S, Sarkar A, Yamada K, Fraser A, Kumar S, Shen L, Smith D, Eng K, Jain V, Jin Z, Radev D (2004) A smorgasbord of features for statistical machine translation. In: HLT-NAACL

  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51. https://doi.org/10.1162/089120103321337421


  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: ACL

  • Riley D, Gildea D (2012) Improving the IBM alignment models using variational Bayes. In: Proceedings of ACL (short paper)

  • Shah K, Barrault L, Schwenk H (2010) Translation model adaptation by resampling. In: WMT

  • Shen S, Liu Y, Sun M, Luan H (2015) Consistency-aware search for word alignment. In: EMNLP

  • Simion A, Collins M, Stein C (2013) A convex alternative to IBM model 2. In: EMNLP

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA

  • Steinberger R, Eisele A, Klocek S, Pilos S, Schlüter P (2012) DGT-TM: a freely available translation memory in 22 languages. In: LREC

  • Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In: LREC

  • Tam YC, Lane I, Schultz T (2007) Bilingual LSA-based adaptation for statistical machine translation. Mach Transl 21(4):187–207. https://doi.org/10.1007/s10590-008-9045-2


  • Tamura A, Watanabe T, Sumita E (2014) Recurrent neural networks for word alignment model. In: ACL

  • Vogel S, Ney H, Tillmann C (1996) HMM-based word alignment in statistical translation. In: COLING, p 836–841. http://dblp.uni-trier.de/db/conf/coling/coling1996.html#VogelNT96

  • Wang X, Utiyama M, Finch A, Watanabe T, Sumita E (2015) Leave-one-out word alignment without garbage collector effects. In: EMNLP

  • Zhang H, Chiang D (2014) Kneser–Ney smoothing on expected counts. In: Proceedings of ACL

  • Zhao B, Xing EP (2008) HM-BiTAM: bilingual topic exploration, word alignment, and translation. In: NIPS


Acknowledgements

We thank the anonymous reviewers and Ivan Titov for their input. The second author is supported by VICI Grant No. 277-89-002 from the Netherlands Organization for Scientific Research (NWO).

Corresponding author

Correspondence to Hoang Cuong.


About this article


Cite this article

Cuong, H., Sima’an, K. Induction of latent domains in heterogeneous corpora: a case study of word alignment. Machine Translation 31, 225–249 (2017). https://doi.org/10.1007/s10590-018-9215-9

