Machine Translation, Volume 31, Issue 1–2, pp 3–18

The representational geometry of word meanings acquired by neural machine translation models

  • Felix Hill
  • Kyunghyun Cho
  • Sébastien Jean
  • Yoshua Bengio

Abstract

This work is the first comprehensive analysis of the properties of word embeddings learned by neural machine translation (NMT) models trained on bilingual texts. We show that the word representations of NMT models outperform those learned from monolingual text by established algorithms such as Skipgram and CBOW on tasks that require knowledge of semantic similarity and/or lexical–syntactic role. These effects hold when translating from English to French and from English to German, and we argue that the desirable properties of NMT word embeddings should emerge largely independently of the source and target languages. Further, we apply a recently proposed heuristic method for training NMT models with very large vocabularies, and show that this vocabulary expansion method results in minimal degradation of embedding quality. This allows us to make a large vocabulary of NMT embeddings available for future research and applications. Overall, our analyses indicate that NMT embeddings should be used in applications that require word concepts to be organised according to similarity and/or lexical function, while monolingual embeddings are better suited to modelling (nonspecific) inter-word relatedness.
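
For readers unfamiliar with the evaluation protocol behind claims like those above, similarity benchmarks such as SimLex-999 are typically scored as follows: each word pair is scored by the cosine similarity of its two embedding vectors, and the resulting model scores are correlated with the human ratings via Spearman's rho. The sketch below is illustrative only, not the authors' code; the embeddings dictionary and pairs list are assumed inputs.

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        # Cosine similarity between two embedding vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def similarity_correlation(embeddings, pairs):
        # embeddings: dict mapping a word to its vector (np.ndarray).
        # pairs: iterable of (word1, word2, human_rating) triples,
        # e.g. rows of SimLex-999. Pairs containing out-of-vocabulary
        # words are skipped, as is standard in these evaluations.
        model_scores, human_scores = [], []
        for w1, w2, rating in pairs:
            if w1 in embeddings and w2 in embeddings:
                model_scores.append(cosine(embeddings[w1], embeddings[w2]))
                human_scores.append(rating)
        rho, _ = spearmanr(model_scores, human_scores)
        return rho

Under this protocol, the similarity-versus-relatedness contrast drawn in the abstract would show up as a higher correlation for NMT embeddings on similarity-oriented benchmarks and a higher correlation for monolingual embeddings on relatedness-oriented ones.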

Keywords

Machine translation · Word embeddings · Representation

Acknowledgements

This work was in part funded by a Google European Doctoral Fellowship and a Google Faculty Award.


Copyright information

© Springer Science+Business Media Dordrecht 2017

Authors and Affiliations

  1. DeepMind, London, UK
  2. Courant Institute of Mathematical Sciences, New York University, New York, USA
  3. MILA, Université de Montréal, Montreal, Canada
