Abstract
Neural retrieval models are often trained on (subsets of) the millions of queries in the MS MARCO/ORCAS datasets and then tested on the 250 Robust04 queries or other TREC benchmarks, often with only 50 queries. In such setups, many of the few test queries can be very similar to queries from the huge training data; in fact, 69% of the Robust04 queries have near-duplicates in MS MARCO/ORCAS. We investigate the impact of this unintended train–test leakage by training neural retrieval models on combinations of a fixed number of MS MARCO/ORCAS queries that are very similar to actual test queries and an increasing number of other queries. We find that leakage can improve effectiveness and even change the ranking of systems. However, these effects diminish as the extent of leakage in the training data shrinks to smaller, more realistic levels.
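The training setups described above, a fixed set of leaked (near-duplicate) queries combined with increasingly many other training queries, can be sketched as follows. This is an illustrative sketch only, not the authors' code; `leaked` and `others` are hypothetical toy query lists:

```python
import random

def build_training_mixtures(leaked, others, other_counts, seed=42):
    """Combine a fixed set of leaked queries with an increasing
    number of non-leaked queries, one training set per count."""
    rng = random.Random(seed)
    pool = list(others)
    rng.shuffle(pool)  # one shuffled pool so smaller sets are subsets of larger ones
    return {n: list(leaked) + pool[:n] for n in other_counts}

# Toy example: 2 leaked queries, mixtures with 1, 2, and 3 other queries.
leaked = ["dog breeds list", "symptoms of flu"]
others = ["python tutorial", "weather today", "best laptops", "tax forms"]
mixtures = build_training_mixtures(leaked, others, [1, 2, 3])
```

Holding the leaked queries fixed while growing the rest of the training data isolates how the *proportion* of leakage affects test effectiveness.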
Notes
- 2. All code and data are publicly available at https://github.com/webis-de/SPIRE-22.
- 6. Of the available pre-trained Sentence-BERT models, we use the paraphrase detection model: https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2.
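Note 6 names the Sentence-BERT paraphrase model used to find near-duplicate queries. The underlying idea, flagging a test query as leaked when its embedding is very close to some training query, might look like the following sketch; the tiny hand-made vectors stand in for real `paraphrase-MiniLM-L6-v2` embeddings, and the threshold is an assumed illustrative value:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def leaked_queries(test_vecs, train_vecs, threshold=0.9):
    """Return indices of test queries whose most similar
    training query reaches the cosine-similarity threshold."""
    hits = []
    for i, t in enumerate(test_vecs):
        if max(cosine(t, s) for s in train_vecs) >= threshold:
            hits.append(i)
    return hits

# Toy vectors in place of real sentence embeddings.
test_vecs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
train_vecs = [np.array([0.99, 0.05]), np.array([0.5, 0.5])]
print(leaked_queries(test_vecs, train_vecs))  # → [0]
```

In practice one would embed all queries with the Sentence-BERT model and use an approximate nearest-neighbor index (e.g., FAISS, which the paper's references include) instead of the brute-force loop above.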
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Fröbe, M., Akiki, C., Potthast, M., Hagen, M. (2022). How Train–Test Leakage Affects Zero-Shot Retrieval. In: Arroyuelo, D., Poblete, B. (eds) String Processing and Information Retrieval. SPIRE 2022. Lecture Notes in Computer Science, vol 13617. Springer, Cham. https://doi.org/10.1007/978-3-031-20643-6_11
DOI: https://doi.org/10.1007/978-3-031-20643-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20642-9
Online ISBN: 978-3-031-20643-6
eBook Packages: Computer Science (R0)