
Neural Approaches to Multilingual Information Retrieval

  • Conference paper
  • In: Advances in Information Retrieval (ECIR 2023)

Abstract

Providing access to information across languages has been a goal of Information Retrieval (IR) for decades. While progress has been made on Cross Language IR (CLIR), where queries are expressed in one language and documents in another, the multilingual (MLIR) task of creating a single ranked list of documents across many languages is considerably more challenging. This paper investigates whether advances in neural document translation and pretrained multilingual neural language models enable improvements in the state of the art over earlier MLIR techniques. The results show that although combining neural document translation with neural ranking yields the best Mean Average Precision (MAP), 98% of that MAP score can be achieved with an 84% reduction in indexing time by using a pretrained XLM-R multilingual language model to index documents in their native language, and that 2% difference in effectiveness is not statistically significant. Key to achieving these results for MLIR is to fine-tune XLM-R using mixed-language batches from neural translations of MS MARCO passages.


Notes

  1. https://www.2lingual.com/.

  2. Batches include the same query paired with document passages translated into each language.

  3. For a complete list: https://github.com/hltcoe/ColBERT-X/blob/main/scripts/stopstructure.txt.

  4. https://ir-measur.es/.

  5. Although Marian [23] is faster than Sockeye 2, benchmark results from Sockeye 1 [20] and Sockeye 2 [19] confirm that Sockeye 2 is within a factor of 2 to 3 of Marian’s speed, leaving our conclusions unchanged.

  6. https://neuclir.github.io/.

  7. https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz.

  8. https://github.com/hltcoe/ColBERT-X/blob/main/xlmr_colbert/training/lazy_batcher.py.

References

  1. Aljlayl, M., Frieder, O.: Effective Arabic-English cross-language information retrieval via machine-readable dictionaries and machine translation. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 295–302 (2001)

  2. Bajaj, P., et al.: MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)

  3. Bendersky, M., Kurland, O.: Utilizing passage-based language models for document retrieval. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 162–174. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_17

  4. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Sebastopol (2009)

  5. Blloshmi, R., Pasini, T., Campolungo, N., Banerjee, S., Navigli, R., Pasi, G.: IR like a SIR: sense-enhanced information retrieval for multiple languages. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1030–1041, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.79, https://aclanthology.org/2021.emnlp-main.79

  6. Bonifacio, L.H., et al.: mMARCO: a multilingual version of MS MARCO passage ranking dataset. arXiv preprint arXiv:2108.13897 (2021)

  7. Braschler, M.: CLEF 2001 — overview of results. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2001. LNCS, vol. 2406, pp. 9–26. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45691-0_2

  8. Braschler, M.: CLEF 2002 — overview of results. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2002. LNCS, vol. 2785, pp. 9–27. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45237-9_2

  9. Braschler, M.: CLEF 2003 – overview of results. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (eds.) CLEF 2003. LNCS, vol. 3237, pp. 44–63. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30222-3_5

  10. Choudhury, M., Deshpande, A.: How linguistically fair are multilingual pre-trained language models? In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 12710–12718 (2021)

  11. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451. Association for Computational Linguistics, Online, July 2020. https://aclanthology.org/2020.acl-main.747

  12. Costello, C., Yang, E., Lawrie, D., Mayfield, J.: Patapsco: a Python framework for cross-language information retrieval experiments. In: Hagen, M., et al. (eds.) ECIR 2022. LNCS, vol. 13186, pp. 276–280. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99739-7_33

  13. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 985–988 (2019)

  14. Darwish, K., Oard, D.W.: Probabilistic structured query methods. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 338–344 (2003)

  15. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Association for Computational Linguistics, Minneapolis, June 2019. https://aclanthology.org/N19-1423

  16. Domhan, T., Denkowski, M., Vilar, D., Niu, X., Hieber, F., Heafield, K.: The Sockeye 2 neural machine translation toolkit at AMTA 2020. In: Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pp. 110–115, Association for Machine Translation in the Americas, Virtual, October 2020

  17. Gao, L., Ma, X., Lin, J.J., Callan, J.: Tevatron: an efficient and flexible toolkit for dense retrieval. arXiv preprint arXiv:2203.05765 (2022)

  18. Granell, X.: Multilingual Information Management: Information, Technology and Translators. Chandos Publishing, Cambridge (2014)

  19. Hieber, F., Domhan, T., Denkowski, M., Vilar, D.: Sockeye 2: a toolkit for neural machine translation. In: EAMT 2020 (2020). https://www.amazon.science/publications/sockeye-2-a-toolkit-for-neural-machine-translation

  20. Hieber, F., et al.: Sockeye: a toolkit for neural machine translation. arXiv preprint arXiv:1712.05690 (2017)

  21. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: spaCy: industrial-strength natural language processing in Python. Technical report, Explosion (2020)

  22. Hull, D.A., Grefenstette, G.: Querying across languages: a dictionary-based approach to multilingual information retrieval. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 49–57 (1996)

  23. Junczys-Dowmunt, M., Heafield, K., Hoang, H., Grundkiewicz, R., Aue, A.: Marian: cost-effective high-quality neural machine translation in C++. arXiv preprint arXiv:1805.12096 (2018)

  24. Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Association for Computational Linguistics, Online, November 2020. https://aclanthology.org/2020.emnlp-main.550

  25. Kassner, N., Dufter, P., Schütze, H.: Multilingual LAMA: investigating knowledge in multilingual pretrained language models. arXiv preprint arXiv:2102.00894 (2021)

  26. Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48 (2020)

  27. Lawrie, D., Mayfield, J., Oard, D.W., Yang, E.: HC4: a new suite of test collections for ad hoc CLIR. In: Hagen, M., et al. (eds.) ECIR 2022. LNCS, vol. 13185, pp. 351–366. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99736-6_24

  28. MacAvaney, S., Macdonald, C., Ounis, I.: Streamlining evaluation with ir-measures. In: Hagen, M., et al. (eds.) ECIR 2022. LNCS, vol. 13186, pp. 305–310. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99739-7_38

  29. Magdy, W., Jones, G.J.F.: Should MT systems be used as black boxes in CLIR? In: Clough, P., et al. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 683–686. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20161-5_70

  30. McCarley, J.S.: Should we translate the documents or the queries in cross-language information retrieval? In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 208–214 (1999)

  31. Mitamura, T., et al.: Overview of the NTCIR-7 ACLIA tasks: advanced cross-lingual information access. In: NTCIR (2008)

  32. Nair, S., et al.: Transfer learning approaches for building cross-language dense retrieval models. In: Hagen, M., et al. (eds.) ECIR 2022. LNCS, vol. 13185, pp. 382–396. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99736-6_26

  33. Nie, J.-Y., Jin, F.: A multilingual approach to multilingual information retrieval. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2002. LNCS, vol. 2785, pp. 101–110. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45237-9_8

  34. Oard, D.W., Dorr, B.J.: A survey of multilingual text retrieval. Technical report, UMIACS-TR-96019 CS-TR-3615, UMIACS (1996)

  35. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, Philadelphia, July 2002. https://doi.org/10.3115/1073083.1073135, https://aclanthology.org/P02-1040

  36. Peters, C., Braschler, M.: The importance of evaluation for cross-language system development: the CLEF experience. In: LREC (2002)

  37. Peters, C., Braschler, M., Clough, P.: Multilingual Information Retrieval: From Research to Practice. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-23008-0

  38. Rahimi, R., Shakery, A., King, I.: Multilingual information retrieval in the language modeling framework. Inf. Retrieval J. 18(3), 246–281 (2015). https://doi.org/10.1007/s10791-015-9255-1

  39. Rehder, B., Littman, M.L., Dumais, S.T., Landauer, T.K.: Automatic 3-language cross-language information retrieval with latent semantic indexing. In: TREC, pp. 233–239. Citeseer (1997)

  40. Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: BM25 and beyond. Found. Trends® Inf. Retrieval 3(4), 333–389 (2009)

  41. Santhanam, K., Khattab, O., Potts, C., Zaharia, M.: PLAID: an efficient engine for late interaction retrieval. arXiv preprint arXiv:2205.09707 (2022)

  42. Shi, P., Lin, J.: Cross-lingual relevance transfer for document retrieval. arXiv preprint arXiv:1911.02989 (2019)

  43. Si, L., Callan, J., Cetintas, S., Yuan, H.: An effective and efficient results merging strategy for multilingual information retrieval in federated search environments. Inf. Retrieval 11(1), 1–24 (2008)

  44. Sorg, P., Cimiano, P.: Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data Knowl. Eng. 74, 26–45 (2012). https://www.sciencedirect.com/science/article/pii/S0169023X12000213

  45. Tsai, M.F., Wang, Y.T., Chen, H.H.: A study of learning a merge model for multilingual information retrieval. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 195–202 (2008)

  46. Xu, H., Van Durme, B., Murray, K.: BERT, mBERT, or BiBERT? A study on contextualized embeddings for neural machine translation. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6663–6675. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://aclanthology.org/2021.emnlp-main.534

  47. Xu, Y.: Global divergence and local convergence of utterance semantic representations in dialogue. In: Proceedings of the Society for Computation in Linguistics 2021, pp. 116–124. Association for Computational Linguistics, Online, February 2021. https://aclanthology.org/2021.scil-1.11

  48. Yang, E., Nair, S., Chandradevan, R., Iglesias-Flores, R., Oard, D.W.: C3: continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2022). https://arxiv.org/abs/2204.11989

  49. Zhang, X., Ma, X., Shi, P., Lin, J.: Mr. TyDi: a multi-lingual benchmark for dense retrieval. In: Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 127–137. Association for Computational Linguistics, Punta Cana, Dominican Republic, November 2021. https://aclanthology.org/2021.mrl-1.12

Author information

Correspondence to Dawn Lawrie.


A MTT Implementation Details

As described in Sect. 3.2, MTT-M consists of training batches whose examples are in different languages. We implement it by mixing the translated MS MARCO triples round-robin. Specifically, each triple consists of an English query and a positive and a negative passage translated into one of the target languages. We constructed such triples using the translated documents provided by mMARCO [6]. Each language results in a triple file with the same structure as triples.train.small.tar.gz (see footnote 7). The following Bash command creates a combined triple file that mixes all languages:

[Figure a: Bash command that round-robin mixes the per-language triple files]
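A minimal sketch of what such a round-robin mixing command could look like (not the paper's actual command; the per-language file names `triples.zh.tsv`, `triples.fa.tsv`, and `triples.ru.tsv` are hypothetical). `paste -d '\n'` emits line 1 of every input file, then line 2 of every file, and so on, which is exactly a round-robin interleaving of the per-language triples:

```shell
# Create tiny per-language triple files (query \t positive \t negative)
# purely for demonstration; real files would come from mMARCO translations.
printf 'q1\tpos_zh1\tneg_zh1\nq2\tpos_zh2\tneg_zh2\n' > triples.zh.tsv
printf 'q1\tpos_fa1\tneg_fa1\nq2\tpos_fa2\tneg_fa2\n' > triples.fa.tsv
printf 'q1\tpos_ru1\tneg_ru1\nq2\tpos_ru2\tneg_ru2\n' > triples.ru.tsv

# Interleave the files line by line: triple 1 of each language,
# then triple 2 of each language, etc.
paste -d '\n' triples.zh.tsv triples.fa.tsv triples.ru.tsv > triples.mixed.tsv
```

Because consecutive lines of the mixed file cycle through the languages, any batcher that reads the file sequentially will see every language in every sufficiently large batch.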

Training with four GPUs and a per-GPU batch size of 32 triples guarantees that each batch consists of examples in different languages under ColBERT-X's batching scheme (see footnote 8).

For MTT-S, we modified the ColBERT-X batching mechanism to load multiple triple files and supply a batch of examples from only one source file whenever the training process requests one. After each request, we switch the source triple file to ensure all languages are presented equally to the model during training.
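The MTT-S scheme described above can be sketched as follows. This is an illustrative reimplementation, not the actual modified ColBERT-X batcher; the class name and file layout are invented, assuming one tab-separated triple file per language:

```python
import itertools


class SingleLanguageBatcher:
    """Serve batches drawn from one language's triple file at a time,
    rotating languages between requests so all are seen equally."""

    def __init__(self, files_by_lang, batch_size):
        # One open reader per language; each file holds tab-separated
        # (query, positive passage, negative passage) triples.
        self.readers = {lang: open(path) for lang, path in files_by_lang.items()}
        self.lang_cycle = itertools.cycle(self.readers)  # round-robin over languages
        self.batch_size = batch_size

    def next_batch(self):
        # Every example in the returned batch comes from a single
        # language; successive calls switch to the next language.
        lang = next(self.lang_cycle)
        lines = itertools.islice(self.readers[lang], self.batch_size)
        triples = [tuple(line.rstrip("\n").split("\t")) for line in lines]
        return lang, triples
```

Contrast with MTT-M: there the languages are mixed within one file (and hence within one batch), while here the rotation happens across batches, so each gradient step sees a single language but training as a whole covers all of them evenly.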

Rights and permissions

Reprints and permissions

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lawrie, D., Yang, E., Oard, D.W., Mayfield, J. (2023). Neural Approaches to Multilingual Information Retrieval. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_33

  • DOI: https://doi.org/10.1007/978-3-031-28244-7_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28243-0

  • Online ISBN: 978-3-031-28244-7

  • eBook Packages: Computer Science, Computer Science (R0)
