
Investigating the Effects of Sparse Attention on Cross-Encoders

  • Conference paper
  • First Online:
Advances in Information Retrieval (ECIR 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14608)


Abstract

Cross-encoders are effective passage and document re-rankers but less efficient than other neural or classic retrieval models. A few previous studies have applied windowed self-attention to make cross-encoders more efficient. However, these studies did not investigate the potential and limits of different attention patterns or window sizes. We close this gap and systematically analyze how token interactions can be reduced without harming the re-ranking effectiveness. Experimenting with asymmetric attention and different window sizes, we find that the query tokens do not need to attend to the passage or document tokens for effective re-ranking and that very small window sizes suffice. In our experiments, even windows of 4 tokens still yield effectiveness on par with previous cross-encoders while reducing the memory requirements by at least 22%/59% and being 1%/43% faster at inference time for passages/documents. Our code is publicly available (https://github.com/webis-de/ECIR-24).
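To make the asymmetric, windowed attention pattern described in the abstract concrete, the sketch below builds a boolean attention mask in PyTorch in which query tokens attend only to other query tokens, while passage tokens attend to all query tokens and to a small local window of neighbouring passage tokens. This is an illustrative reconstruction, not the authors' implementation: the function name, the mask convention (True marks an allowed interaction), and the half-window arithmetic are assumptions.

```python
import torch

def asymmetric_window_mask(num_query: int, num_doc: int, window: int = 4) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed) for one query-passage pair.

    Illustrative sketch of the pattern described in the abstract:
    - query tokens attend only to other query tokens (not to passage tokens),
    - passage tokens attend to every query token and to a local window of
      neighbouring passage tokens.
    """
    n = num_query + num_doc
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Query tokens attend to the query only.
    mask[:num_query, :num_query] = True

    # Passage tokens keep full attention to the query tokens.
    mask[num_query:, :num_query] = True

    # Passage tokens attend to a symmetric local window of passage tokens.
    doc_pos = torch.arange(num_doc)
    local = (doc_pos[:, None] - doc_pos[None, :]).abs() <= window // 2
    mask[num_query:, num_query:] = local

    return mask

if __name__ == "__main__":
    m = asymmetric_window_mask(num_query=8, num_doc=24, window=4)
    print(m.shape)          # torch.Size([32, 32])
    print(m[:8, 8:].any())  # tensor(False): query tokens ignore passage tokens
```

In a cross-encoder, such a mask would be applied inside the self-attention layers so that disallowed token pairs are masked out before the softmax; with a window of 4, each passage token interacts only with the query and a handful of neighbouring passage tokens, which is what reduces memory and inference time.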


Notes

  1. https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2.



Acknowledgments

This work has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070014 (OpenWebSearch.EU, https://doi.org/10.3030/101070014).

Author information


Corresponding author

Correspondence to Ferdinand Schlatt.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Schlatt, F., Fröbe, M., Hagen, M. (2024). Investigating the Effects of Sparse Attention on Cross-Encoders. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Springer, Cham. https://doi.org/10.1007/978-3-031-56027-9_11


  • DOI: https://doi.org/10.1007/978-3-031-56027-9_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56026-2

  • Online ISBN: 978-3-031-56027-9

  • eBook Packages: Computer Science, Computer Science (R0)
