
Learning to Scale Multilingual Representations for Vision-Language Tasks

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)

Abstract

Current multilingual vision-language models either require a large number of additional parameters for each supported language or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed-size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3–4% with less than one fifth of the training parameters required by other word embedding methods.
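To make the parameter-saving idea concrete, below is a minimal, hypothetical PyTorch sketch of (a) a hybrid vocabulary embedding in which most tokens share one language-agnostic table while a small set keeps per-language vectors, and (b) a consistency loss that pushes a query and its machine translation toward similar retrieval score distributions. The class and function names, the table sizes, and the symmetric-KL formulation are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridMultilingualEmbedding(nn.Module):
    """Toy hybrid vocabulary: most tokens use a shared, language-agnostic
    table; the remaining tokens fall back to per-language vectors."""

    def __init__(self, vocab_size, languages, dim=300):
        super().__init__()
        # Shared table for the aligned (language-agnostic) part of the vocabulary.
        self.shared = nn.Embedding(vocab_size, dim)
        # Per-language tables; in a setting like the paper's, only a small
        # fraction of words would remain language-specific, keeping these small.
        self.lang_specific = nn.ModuleDict(
            {lang: nn.Embedding(vocab_size, dim) for lang in languages}
        )

    def forward(self, token_ids, lang, use_shared):
        # use_shared: bool mask, True where a token maps to the shared table.
        shared_vecs = self.shared(token_ids)
        specific_vecs = self.lang_specific[lang](token_ids)
        return torch.where(use_shared.unsqueeze(-1), shared_vecs, specific_vecs)


def consistency_loss(scores_query, scores_translation):
    """Symmetric KL between the retrieval score distributions of a query and
    its machine translation, so both rank candidate images similarly."""
    log_p = F.log_softmax(scores_query, dim=-1)
    log_q = F.log_softmax(scores_translation, dim=-1)
    return 0.5 * (
        F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
        + F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
    )


# Example usage of the sketch above.
emb = HybridMultilingualEmbedding(vocab_size=1000, languages=["en", "de"], dim=64)
ids = torch.randint(0, 1000, (2, 5))
mask = torch.rand(2, 5) > 0.2          # most tokens routed to the shared table
vecs = emb(ids, "de", mask)            # shape: (2, 5, 64)
```

The parameter savings in this picture come from the routing: because most of the vocabulary resolves to a single shared table, adding a language only adds rows for its few language-specific words rather than a full embedding matrix.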

Keywords

Scalable vision-language models · Multilingual word embeddings · Image-sentence retrieval

Notes

Acknowledgements

This work is funded in part by the NSF, DARPA LwLL, and DARPA XAI grants, including NSF grant 1838193.

Supplementary material

Supplementary material 1: 504439_1_En_12_MOESM1_ESM.pdf (1.4 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Boston University, Boston, USA
  2. MIT-IBM Watson AI Lab, Cambridge, USA
