An Experimental Study on Pretraining Transformers from Scratch for IR

  • Conference paper
  • Published in: Advances in Information Retrieval (ECIR 2023)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13980)

Abstract

Finetuning Pretrained Language Models (PLMs) for IR has been the de facto standard practice since they achieved breakthrough effectiveness a few years ago. But is this approach well understood? In this paper, we study the impact of the pretraining collection on the final IR effectiveness. In particular, we challenge the current hypothesis that PLMs must be trained on a large enough generic collection, and we show that pretraining from scratch on the collection of interest is surprisingly competitive with the current approach. We benchmark first-stage rankers and cross-encoders for reranking on general passage retrieval with MSMARCO, on Mr. TyDi for Arabic, Japanese and Russian, and on TripClick for domain-specific (health) retrieval. Contrary to popular belief, we show that, for finetuning first-stage rankers, models pretrained solely on their target collection have equivalent or better effectiveness compared to more general models. However, there is a slight effectiveness drop for rerankers pretrained only on the target collection. Overall, our study sheds new light on the role of the pretraining collection and should lead our community to reconsider building specialized models by pretraining from scratch. Last but not least, doing so could enable better control of efficiency, data bias and replicability, which are key research questions for the IR community.
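
The setup described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical example of pretraining a BERT-style encoder from scratch on the target collection with masked language modelling, using the Hugging Face Transformers and Datasets libraries; the tokenizer path, file names and all hyperparameters are illustrative assumptions, not the exact recipe used in the paper.

```python
# Hypothetical sketch: pretrain a BERT-style encoder from scratch on the
# target collection (e.g. MSMARCO passages) with masked language modelling.
# Paths and hyperparameters are illustrative, not those used in the paper.
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# 1. A tokenizer trained on the target collection (assumed to exist already).
tokenizer = BertTokenizerFast.from_pretrained("./msmarco-tokenizer")

# 2. A randomly initialised BERT-base encoder: no transfer from a generic PLM.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# 3. The collection of interest, one passage per line.
dataset = load_dataset("text", data_files={"train": "msmarco_passages.txt"})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

# 4. Standard MLM objective (15% masking).
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./bert-from-scratch-msmarco",
                           per_device_train_batch_size=32,
                           learning_rate=1e-4,
                           max_steps=100_000),
    train_dataset=dataset["train"],
    data_collator=collator)
trainer.train()
```

The resulting checkpoint would then be finetuned as a first-stage ranker or as a cross-encoder for reranking, exactly as one would finetune an off-the-shelf PLM.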

Notes

  1. For instance, freezing the BERT encoding and learning an additional linear layer is sufficient to obtain good performance in NLP [11], while such an approach is not as effective in IR (a minimal sketch of this probe follows these notes).

  2. We could not find in the literature an easy/practical way to perform statistical significance testing over BEIR.

  3. We were not able to find the parameters used in the experiments.

  4. Note that since MContriever TyDi (first row) is not available, statistical tests cannot be performed. We do our best to evaluate fairly under our training setting (second row).

  5. We suspect they use more compute, but could not find accurate compute information.
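
Note 1 contrasts NLP, where a frozen BERT encoder plus a linear layer already performs well, with IR, where it does not. Below is a minimal sketch of such a linear probe, assuming a standard Hugging Face checkpoint and a binary relevance head; the model name and output dimension are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the probe mentioned in note 1: freeze a BERT encoder
# and train only a linear layer on top of the [CLS] representation.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
for p in encoder.parameters():          # freeze the pretrained encoder
    p.requires_grad = False

probe = torch.nn.Linear(encoder.config.hidden_size, 2)  # e.g. relevant / not

def score(query: str, passage: str) -> torch.Tensor:
    inputs = tokenizer(query, passage, return_tensors="pt", truncation=True)
    with torch.no_grad():
        cls = encoder(**inputs).last_hidden_state[:, 0]  # frozen [CLS] vector
    return probe(cls)                                    # only the probe is trained
```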

References

  1. Aroca-Ouellette, S., Rudzicz, F.: On losses for modern language models. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4970–4981. Association for Computational Linguistics, Online, November 2020. https://doi.org/10.18653/v1/2020.emnlp-main.403, https://aclanthology.org/2020.emnlp-main.403

  2. Bai, B., et al.: Supervised semantic indexing. In: Proceedings of the 18th ACM International Conference on Information and Knowledge Management, pp. 187–196. ACM (2009). https://doi.org/10.1145/1645953.1645979

  3. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text (2019). https://doi.org/10.48550/ARXIV.1903.10676, https://arxiv.org/abs/1903.10676

  4. Bommasani, R., et al.: On the opportunities and risks of foundation models (2021). https://doi.org/10.48550/ARXIV.2108.07258, https://arxiv.org/abs/2108.07258

  5. Bonifacio, L.H., Campiotti, I., Jeronymo, V., Lotufo, R., Nogueira, R.: MMARCO: a multilingual version of the MS MARCO passage ranking dataset. arXiv preprint arXiv:2108.13897 (2021)

  6. Chang, W.C., Yu, F.X., Chang, Y.W., Yang, Y., Kumar, S.: Pre-training tasks for embedding-based large-scale retrieval. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=rkg-mA4FDr

  7. Clinchant, S., Jung, K.W., Nikoulina, V.: On the use of BERT for neural machine translation. In: Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 108–117. Association for Computational Linguistics, Hong Kong, November 2019. https://doi.org/10.18653/v1/D19-5611, https://aclanthology.org/D19-5611

  8. Conneau, A., et al.: Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116 (2019)

  9. Craswell, N., Zoeter, O., Taylor, M., Ramsey, B.: An experimental comparison of click position-bias models. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 87–94 (2008)

  10. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.B.: Neural ranking models with weak supervision. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 2017, pp. 65–74. Association for Computing Machinery, New York (2017). https://doi.org/10.1145/3077136.3080832

  11. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805

  12. El-Nouby, A., Izacard, G., Touvron, H., Laptev, I., Jegou, H., Grave, E.: Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740 (2021)

  13. Formal, T., Lassance, C., Piwowarski, B., Clinchant, S.: From distillation to hard negative sampling: making sparse neural IR models more effective. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 2022, pp. 2353–2359. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3477495.3531857

  14. Gao, L., Callan, J.: Condenser: a pre-training architecture for dense retrieval. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 981–993. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.75, https://aclanthology.org/2021.emnlp-main.75

  15. Gao, L., Callan, J.: Unsupervised corpus aware language model pre-training for dense passage retrieval. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2843–2853. Association for Computational Linguistics, Dublin, Ireland, May 2022. https://doi.org/10.18653/v1/2022.acl-long.203, https://aclanthology.org/2022.acl-long.203

  16. Gao, L., Dai, Z., Callan, J.: COIL: Revisit exact lexical match in information retrieval with contextualized inverted list. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3030–3042. Association for Computational Linguistics, Online, June 2021. https://doi.org/10.18653/v1/2021.naacl-main.241, https://aclanthology.org/2021.naacl-main.241

  17. Gu, Y., et al.: Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3(1), 1–23 (2022). https://doi.org/10.1145/3458754

  18. Guo, Y., et al.: Webformer: pre-training with web pages for information retrieval. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 2022, pp. 1502–1512. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3477495.3532086

  19. He, P., Liu, X., Gao, J., Chen, W.: DeBERTa: decoding-enhanced BERT with disentangled attention. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=XPZIaotutsD

  20. Hofstätter, S., Althammer, S., Schröder, M., Sertkan, M., Hanbury, A.: Improving efficient neural ranking models with cross-architecture knowledge distillation (2020)

  21. Hofstätter, S., Althammer, S., Sertkan, M., Hanbury, A.: Establishing strong baselines for tripclick health retrieval (2022)

  22. Hofstätter, S., Lin, S.C., Yang, J.H., Lin, J., Hanbury, A.: Efficiently teaching an effective dense retriever with balanced topic aware sampling. In: Proceedings of SIGIR (2021)

  23. Izacard, G., et al.: Towards unsupervised dense information retrieval with contrastive learning (2021)

  24. Kaplan, J., et al.: Scaling laws for neural language models. arXiv abs/2001.08361 (2020)

  25. Karpukhin, V., et al.: Dense passage retrieval for open-domain question answering. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781. Association for Computational Linguistics, Online, November 2020. https://doi.org/10.18653/v1/2020.emnlp-main.550, https://www.aclweb.org/anthology/2020.emnlp-main.550

  26. Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 2020, pp. 39–48. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3397271.3401075

  27. Kim, T., Yoo, K.M., Lee, S.G.: Self-guided contrastive learning for BERT sentence representations. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2528–2540. Association for Computational Linguistics, Online, August 2021. https://doi.org/10.18653/v1/2021.acl-long.197, https://aclanthology.org/2021.acl-long.197

  28. Lassance, C., Clinchant, S.: An efficiency study for splade models. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR 2022, pp. 2220–2226. Association for Computing Machinery, New York (2022). https://doi.org/10.1145/3477495.3531833

  29. Lin, J., Ma, X.: A few brief notes on deepimpact, coil, and a conceptual framework for information retrieval techniques. CoRR abs/2106.14807 (2021). https://arxiv.org/abs/2106.14807

  30. Lin, J., Nogueira, R., Yates, A.: Pretrained transformers for text ranking: BERT and beyond. arXiv preprint arXiv:2010.06467 (2020). http://arxiv.org/abs/2010.06467

  31. Lin, S.C., Yang, J.H., Lin, J.: In-batch negatives for knowledge distillation with tightly-coupled teachers for dense retrieval. In: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pp. 163–173. Association for Computational Linguistics, Online, August 2021. https://doi.org/10.18653/v1/2021.repl4nlp-1.17, https://aclanthology.org/2021.repl4nlp-1.17

  32. Liu, Z., Shao, Y.: Retromae: pre-training retrieval-oriented transformers via masked auto-encoder (2022). https://doi.org/10.48550/ARXIV.2205.12035, https://arxiv.org/abs/2205.12035

  33. Ma, X., Guo, J., Zhang, R., Fan, Y., Ji, X., Cheng, X.: B-prop: bootstrapped pre-training with representative words prediction for ad-hoc retrieval. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (2021)

  34. Ma, X., Guo, J., Zhang, R., Fan, Y., Ji, X., Cheng, X.: Prop: pre-training with representative words prediction for ad-hoc retrieval. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining (2021)

  35. Ma, Z., et al.: Pre-training for ad-hoc retrieval: hyperlink is also you need. In: Proceedings of the 30th ACM International Conference on Information and Knowledge Management (2021)

  36. Muennighoff, N.: SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904 (2022)

  37. Nair, S., et al.: Transfer learning approaches for building cross-language dense retrieval models. In: Hagen, M., et al. (eds.) ECIR 2022. LNCS, vol. 13185, pp. 382–396. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99736-6_26

  38. Nair, S., Yang, E., Lawrie, D., Mayfield, J., Oard, D.W.: Learning a sparse representation model for neural CLIR. In: Design of Experimental Search and Information REtrieval Systems (DESIRES) (2022)

  39. Nguyen, T., et al.: MS MARCO: a human generated machine reading comprehension dataset. In: CoCo@ NIPs (2016)

  40. Nogueira, R., Cho, K.: Passage re-ranking with BERT (2019)

  41. Paria, B., Yeh, C.K., Yen, I.E.H., Xu, N., Ravikumar, P., Póczos, B.: Minimizing flops to learn efficient sparse representations (2020)

  42. Qu, Y., et al.: RocketQA: an optimized training approach to dense passage retrieval for open-domain question answering. In: Proceedings of NAACL (2021)

  43. Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. http://arxiv.org/abs/1908.10084

  44. Rekabsaz, N., Lesota, O., Schedl, M., Brassey, J., Eickhoff, C.: Tripclick: the log files of a large health web search engine. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2507–2513 (2021). https://doi.org/10.1145/3404835.3463242

  45. Ren, R., et al.: RocketQAv2: a joint training method for dense passage retrieval and passage re-ranking. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 2825–2835. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, November 2021. https://doi.org/10.18653/v1/2021.emnlp-main.224, https://aclanthology.org/2021.emnlp-main.224

  46. Robertson, S.E., Walker, S., Beaulieu, M., Gatford, M., Payne, A.: Okapi at TREC-4. Nist Special Publication Sp, pp. 73–96 (1996)

  47. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)

  48. Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., Zaharia, M.: ColBERTv2: effective and efficient retrieval via lightweight late interaction (2021)

  49. Tay, Y., et al.: Are pretrained convolutions better than pretrained transformers? In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4349–4359. Association for Computational Linguistics, Online, August 2021. https://doi.org/10.18653/v1/2021.acl-long.335, https://aclanthology.org/2021.acl-long.335

  50. Tay, Y., et al.: Scale efficiently: insights from pre-training and fine-tuning transformers. arXiv abs/2109.10686 (2022)

  51. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., Gurevych, I.: BEIR: a heterogenous benchmark for zero-shot evaluation of information retrieval models. CoRR abs/2104.08663 (2021). https://arxiv.org/abs/2104.08663

  52. Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation (2016)

  53. Xiong, L., et al.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. In: International Conference on Learning Representations (2021). https://openreview.net/forum?id=zeFrfgyZln

  54. Zhang, X., Ma, X., Shi, P., Lin, J.: Mr. Tydi: a multi-lingual benchmark for dense retrieval. In: Proceedings of the 1st Workshop on Multilingual Representation Learning, pp. 127–137 (2021)

Author information

Corresponding author

Correspondence to Hervé Dejean.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lassance, C., Dejean, H., Clinchant, S. (2023). An Experimental Study on Pretraining Transformers from Scratch for IR. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13980. Springer, Cham. https://doi.org/10.1007/978-3-031-28244-7_32

  • DOI: https://doi.org/10.1007/978-3-031-28244-7_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28243-0

  • Online ISBN: 978-3-031-28244-7

  • eBook Packages: Computer Science, Computer Science (R0)
