Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

Conference paper
In: Advances in Information Retrieval (ECIR 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12656)

Abstract

Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete. However, questions remain as to what extent this finding generalizes 1) to unsupervised settings and 2) to ad-hoc cross-lingual IR (CLIR) tasks. Therefore, in this work we present a systematic empirical study of the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval across a large number of language pairs. In contrast to supervised language understanding, our results indicate that for unsupervised document-level CLIR – a setup with no relevance judgments for IR-specific fine-tuning – pretrained encoders fail to significantly outperform models based on CLWEs. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved, but peak performance is not reached with general-purpose multilingual text encoders used ‘off-the-shelf’; rather, it requires their variants that have been further specialized for sentence understanding tasks.
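The unsupervised sentence-level CLIR setup studied in the paper can be illustrated with a minimal sketch: encode queries and documents with an off-the-shelf multilingual sentence encoder and rank by cosine similarity, with no relevance judgments or IR-specific fine-tuning. The sentence-transformers library and the distiluse-base-multilingual-cased checkpoint are assumptions for illustration, not necessarily the exact tooling or models of the paper's experiments.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Off-the-shelf multilingual sentence encoder (assumed checkpoint name).
encoder = SentenceTransformer("distiluse-base-multilingual-cased")

queries = ["Wie hoch ist der Eiffelturm?"]              # German query
documents = [
    "The Eiffel Tower is 330 metres tall.",             # English document collection
    "The Louvre is the world's largest art museum.",
]

# Encode both sides into the shared multilingual space and L2-normalise,
# so that the dot product equals cosine similarity.
q_emb = encoder.encode(queries)
d_emb = encoder.encode(documents)
q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
d_emb = d_emb / np.linalg.norm(d_emb, axis=1, keepdims=True)

# Unsupervised ranking: sort documents by similarity to each query.
scores = q_emb @ d_emb.T
for qi, query in enumerate(queries):
    ranked = [documents[i] for i in np.argsort(-scores[qi])]
    print(query, "->", ranked)
```

Document-level CLIR in the paper follows the same unsupervised principle, the difference being how full documents are represented (e.g., aggregated CLWEs or encoder outputs) rather than single sentences.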

Notes

  1. Note that self-supervised learning can come in different flavors depending on the training objective [10], but language modeling objectives still seem to be the most popular choice.

  2. In CLM, the model is trained to predict the probability of a word given the previous words in the sentence. TLM is a cross-lingual variant of standard masked LM (MLM), with the core difference that the model is given pairs of parallel sentences and may attend to the aligned sentence when reconstructing a word in the current sentence.

  3. In our initial experiments, taking the vector of a term's first subword consistently outperformed averaging the vectors of all its subwords (see the pooling sketch after these notes).

  4. http://catalog.elra.info/en-us/repository/browse/ELRA-E0008/.

  5. Russian is not included in Europarl, and we therefore exclude it from the sentence-level experiments. Further, since some multilingual encoders have not seen Finnish data in pretraining, we additionally report results over the subset of language pairs that do not involve Finnish.

  6. https://fasttext.cc/docs/en/pretrained-vectors.html.

  7. Working with mBERT directly instead of its distilled version led to similar scores while increasing running times.

  8. As expected, m-USE and \(\text{DISTIL}_\text{USE}\) perform poorly on language pairs involving Finnish, as they have not been trained on any Finnish data.
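The subword-pooling choice mentioned in note 3 can be made concrete with a minimal sketch, assuming the Hugging Face transformers library and the public mBERT checkpoint (the paper's exact layer selection and context handling may differ): a term is represented either by its first subword's vector or by the mean over all of its subword vectors.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Public multilingual BERT checkpoint (assumption for illustration).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

term = "retrieval"
# Tokenise a single term into its subword pieces, without [CLS]/[SEP].
inputs = tokenizer(term, return_tensors="pt", add_special_tokens=False)
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state[0]   # shape: (num_subwords, hidden_dim)

first_subword_vec = hidden[0]          # strategy reported to work better (note 3)
mean_subword_vec = hidden.mean(dim=0)  # alternative: average over all subwords
print(first_subword_vec.shape, mean_subword_vec.shape)
```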

References

  1. Artetxe, M., Labaka, G., Agirre, E.: A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In: Proceedings of ACL, pp. 789–798 (2018)

  2. Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 7, 597–610 (2019)

  3. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer. arXiv preprint arXiv:2004.05150 (2020)

  4. Braschler, M.: CLEF 2003 – Overview of Results. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 44–63. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30222-3_5

  5. Brown, T.B., et al.: Language models are few-shot learners. In: Proceedings of NeurIPS (2020)

  6. Cao, S., Kitaev, N., Klein, D.: Multilingual alignment of contextual word representations. In: Proceedings of ICLR (2020)

  7. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. In: Proceedings of SemEval, pp. 1–14 (2017)

  8. Cer, D., et al.: Universal sentence encoder for English. In: Proceedings of EMNLP, pp. 169–174 (2018)

  9. Chidambaram, M., et al.: Learning cross-lingual sentence representations via a multi-task dual-encoder model. In: Proceedings of the ACL Workshop on Representation Learning for NLP, pp. 250–259 (2019)

  10. Clark, K., Luong, M., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. In: Proceedings of ICLR (2020)

  11. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of ACL, pp. 8440–8451 (2020)

  12. Conneau, A., Kiela, D.: SentEval: an evaluation toolkit for universal sentence representations. In: Proceedings of LREC, pp. 1699–1704 (2018)

  13. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised learning of universal sentence representations from natural language inference data. In: Proceedings of EMNLP, pp. 670–680 (2017)

  14. Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Proceedings of NeurIPS, pp. 7059–7069 (2019)

  15. Conneau, A., et al.: XNLI: evaluating cross-lingual sentence representations. In: Proceedings of EMNLP, pp. 2475–2485 (2018)

  16. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL, pp. 4171–4186 (2019)

  17. Ethayarajh, K.: How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In: Proceedings of EMNLP-IJCNLP, pp. 55–65 (2019)

  18. Feng, F., Yang, Y., Cer, D., Arivazhagan, N., Wang, W.: Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852 (2020)

  19. Glavaš, G., Litschko, R., Ruder, S., Vulić, I.: How to (properly) evaluate cross-lingual word embeddings: on strong baselines, comparative analyses, and some misconceptions. In: Proceedings of ACL, pp. 710–721 (2019)

  20. Guo, M., et al.: Effective parallel corpus mining using bilingual sentence embeddings. In: Proceedings of WMT, pp. 165–176 (2018)

  21. Hoogeveen, D., Verspoor, K.M., Baldwin, T.: CQADupStack: a benchmark data set for community question-answering research. In: Proceedings of ADCS, pp. 3:1–3:8 (2015)

  22. Jiang, Z., El-Jaroudi, A., Hartmann, W., Karakos, D., Zhao, L.: Cross-lingual information retrieval with BERT. In: Proceedings of LREC, p. 26 (2020)

  23. Karthikeyan, K., Wang, Z., Mayhew, S., Roth, D.: Cross-lingual ability of multilingual BERT: an empirical study. In: Proceedings of ICLR (2020)

  24. Koehn, P.: Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th Machine Translation Summit (MT SUMMIT), pp. 79–86 (2005)

  25. Lei, T., et al.: Semi-supervised question retrieval with gated convolutions. In: Proceedings of NAACL, pp. 1279–1289 (2016)

  26. Liang, Y., et al.: XGLUE: a new benchmark dataset for cross-lingual pre-training, understanding and generation. In: Proceedings of EMNLP (2020)

  27. Litschko, R., Glavaš, G., Vulić, I., Dietz, L.: Evaluating resource-lean cross-lingual embedding models in unsupervised retrieval. In: Proceedings of SIGIR, pp. 1109–1112 (2019)

  28. Liu, Q., Kusner, M.J., Blunsom, P.: A survey on contextual embeddings. arXiv preprint arXiv:2003.07278 (2020)

  29. Liu, Q., McCarthy, D., Vulić, I., Korhonen, A.: Investigating cross-lingual alignment methods for contextualized embeddings with token-level evaluation. In: Proceedings of CoNLL, pp. 33–43 (2019)

  30. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  31. MacAvaney, S., Soldaini, L., Goharian, N.: Teaching a new dog old tricks: resurrecting multilingual retrieval using zero-shot learning. In: Proceedings of ECIR, pp. 246–254 (2020)

  32. MacAvaney, S., Yates, A., Cohan, A., Goharian, N.: CEDR: contextualized embeddings for document ranking. In: Proceedings of SIGIR, pp. 1101–1104 (2019)

  33. Nogueira, R., Yang, W., Cho, K., Lin, J.: Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424 (2019)

  34. Pires, T., Schlinger, E., Garrette, D.: How multilingual is multilingual BERT? In: Proceedings of ACL, pp. 4996–5001 (2019)

  35. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proceedings of SIGIR, pp. 275–281 (1998)

  36. Ponti, E.M., Glavaš, G., Majewska, O., Liu, Q., Vulić, I., Korhonen, A.: XCOPA: a multilingual dataset for causal commonsense reasoning. In: Proceedings of EMNLP (2020)

  37. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  38. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks. In: Proceedings of EMNLP, pp. 3973–3983 (2019)

  39. Reimers, N., Gurevych, I.: Making monolingual sentence embeddings multilingual using knowledge distillation. In: Proceedings of EMNLP (2020)

  40. Rogers, A., Kovaleva, O., Rumshisky, A.: A primer in BERTology: what we know about how BERT works. Trans. Assoc. Comput. Linguist. 8, 842–866 (2020)

  41. Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 65, 569–631 (2019)

  42. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)

  43. Smith, S.L., Turban, D.H., Hamblin, S., Hammerla, N.Y.: Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In: Proceedings of ICLR (2017)

  44. Vaswani, A., et al.: Attention is all you need. In: Proceedings of NeurIPS, pp. 5998–6008 (2017)

  45. Vulić, I., Glavaš, G., Reichart, R., Korhonen, A.: Do we really need fully unsupervised cross-lingual embeddings? In: Proceedings of EMNLP, pp. 4406–4417 (2019)

  46. Vulić, I., Moens, M.F.: Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In: Proceedings of SIGIR, pp. 363–372 (2015)

  47. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of ICLR (2019)

  48. Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference. In: Proceedings of NAACL, pp. 1112–1122 (2018)

  49. Wu, S., Dredze, M.: Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. In: Proceedings of EMNLP, pp. 833–844 (2019)

  50. Yang, Y., et al.: Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In: Proceedings of AAAI, pp. 5370–5378 (2019)

  51. Yang, Y., et al.: Multilingual universal sentence encoder for semantic retrieval. In: Proceedings of ACL: System Demonstrations, pp. 87–94 (2020)

  52. Yang, Y., et al.: Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. In: Proceedings of IJCAI, pp. 5370–5378 (2019)

  53. Yu, P., Allan, J.: A study of neural matching models for cross-lingual IR. In: Proceedings of SIGIR, pp. 1637–1640 (2020)

  54. Zaheer, M., et al.: Big Bird: transformers for longer sequences. arXiv preprint arXiv:2007.14062 (2020)

  55. Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22(2), 179–214 (2004)

  56. Zhao, W., Eger, S., Bjerva, J., Augenstein, I.: Inducing language-agnostic multilingual representations. arXiv preprint arXiv:2008.09112 (2020)

  57. Zhao, W., Glavaš, G., Peyrard, M., Gao, Y., West, R., Eger, S.: On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In: Proceedings of ACL, pp. 1656–1671 (2020)

  58. Ziemski, M., Junczys-Dowmunt, M., Pouliquen, B.: The United Nations parallel corpus v1.0. In: Proceedings of LREC, pp. 3530–3534 (2016)

  59. Zweigenbaum, P., Sharoff, S., Rapp, R.: Overview of the third BUCC shared task: spotting parallel sentences in comparable corpora. In: Proceedings of LREC (2018)

Acknowledgments

The work of Ivan Vulić is supported by the ERC Consolidator Grant LEXICAL: Lexical Acquisition Across Languages (no. 648909). Robert Litschko and Goran Glavaš are supported by the Baden-Württemberg Stiftung (Eliteprogramm, AGREE grant).

Author information

Correspondence to Robert Litschko.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Litschko, R., Vulić, I., Ponzetto, S.P., Glavaš, G. (2021). Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval. In: Hiemstra, D., Moens, M.F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2021. Lecture Notes in Computer Science, vol. 12656. Springer, Cham. https://doi.org/10.1007/978-3-030-72113-8_23

  • DOI: https://doi.org/10.1007/978-3-030-72113-8_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-72112-1

  • Online ISBN: 978-3-030-72113-8

  • eBook Packages: Computer Science; Computer Science (R0)
