1 Introduction

A commonly used ranking pipeline consists of a first-stage retriever, e.g. BM25 [2], that efficiently retrieves a set of documents from the full document collection, followed by one or more re-rankers [3, 4] that improve the initial ranking. An effective re-ranking strategy is the use of BERT-based models with a cross-encoder architecture, which concatenate the query and the candidate document in the input [4,5,6,7]. In this paper, we refer to these re-rankers as Cross-Encoder\(_{\textrm{CAT}}\) (CE\(_{\textrm{CAT}}\)). In the common re-ranking setup, BM25 [2] is widely leveraged [8,9,10] for finding the top-k documents to be re-ranked; however, the relevance score produced by BM25, which is based on exact lexical matching, is not explicitly taken into account in the second stage. Besides, although cross-encoder re-rankers substantially improve retrieval effectiveness compared to BM25 alone [11], Rau et al. [12] show that BM25 is a more effective exact lexical matcher than CE\(_{\textrm{CAT}}\) rankers; in their exact-matching experiment they only use the words from the passage that also appear in the query as the input to the CE\(_{\textrm{CAT}}\). This suggests that CE\(_{\textrm{CAT}}\) re-rankers can be further improved by better exact word matching, as the presence of query words in the document is one of the strongest signals for relevance in ranking [13, 14]. Moreover, obtaining improvements in effectiveness by interpolating the scores (score fusion [15]) of BM25 and CE\(_{\textrm{CAT}}\) is challenging: a linear combination of the two scores has been shown to decrease effectiveness on the MSMARCO Passage collection compared to only using the CE\(_{\textrm{CAT}}\) re-ranker in the second-stage retrieval [11].

To tackle this problem, we propose a method to enhance CE\(_{\textrm{CAT}}\) re-rankers (illustrated in Fig. 1) by directly injecting the BM25 score as a string into the input of the Transformer. Figure 2 shows our method for the injection of BM25 into the input of the CE re-ranker. We refer to our method as CE\(_{\textrm{BM25CAT}}\). Our idea is inspired by the finding of Wallace et al. [16] that BERT models can capture numeracy. In this regard, we address the following research questions:

Fig. 1 Regular cross-encoder input

Fig. 2 Injection of BM25 in input

RQ1: What is the effectiveness of BM25 score injection in addition to the query and document text in the input of CE re-rankers?

This research question is designed to explore the integration of a commonly used lexical retrieval model (BM25) with CE re-rankers, specifically examining the impact of including the BM25 score on the retrieval effectiveness of CE re-rankers. Addressing this question is crucial as it seeks to bridge the gap between lexical retrieval and CE re-rankers, potentially improving ranking effectiveness. To answer this question we set up two experiments on three datasets: MSMARCO, TREC DL’19 and ’20. First, since the BM25 score has no defined range, we investigate the effect of different representations of the BM25 score by applying various normalization methods. We also analyze the effect of converting the normalized BM25 scores to integers. Second, we evaluate the best representation of BM25, based on our empirical study, on four cross-encoders: BERT-base, BERT-large [17], DistilBERT [18], and MiniLM [19], comparing CE\(_{\textrm{BM25CAT}}\) to CE\(_{\textrm{CAT}}\) across Transformer models with smaller and larger numbers of parameters. Next, we compare our proposed approach to common interpolation approaches:

RQ2: What is the effectiveness of CE\(_{\textrm{BM25CAT}}\) compared to common approaches for combining the final relevance scores of CE\(_{\textrm{CAT}}\) and BM25?

This research question is designed to empirically evaluate the effectiveness of the novel CE\(_{\textrm{BM25CAT}}\) model against common methodologies that combine the relevance scores of CE\(_{\textrm{CAT}}\) and BM25. Addressing this question is essential as it highlights the improvements or drawbacks of adopting this new approach for the effectiveness of CE re-rankers. To analyze CE\(_{\textrm{BM25CAT}}\) and CE\(_{\textrm{CAT}}\) in terms of exact matching compared to BM25, we address the following question:

RQ3: How effectively can CE\(_{\textrm{BM25CAT}}\) capture exact matching relevance compared to BM25 and CE\(_{\textrm{CAT}}\)?

This research question is designed to empirically assess the performance of the CE\(_{\textrm{BM25CAT}}\) model in capturing exact matching relevance, which is essential to analyze against BM25 since we inject the BM25 score into the input of CE re-rankers. By comparing CE\(_{\textrm{BM25CAT}}\) with traditional BM25 and the CE\(_{\textrm{CAT}}\) model, we aim to highlight the specific advantages or limitations of our proposed method. Next, we analyze what the optimal position for injecting the retriever score is, addressing the following question:

RQ4: What is the optimal position for injecting the retriever score: before the query, between query and document, or after the document?

This research question is designed to identify the most effective strategy for integrating the first-stage retriever score within CE re-rankers, which is crucial for identifying the optimal design of our proposed method in terms of retrieval quality. We find that injecting the relevance score before the query is slightly more effective than injecting it between query and document or after the document.

To investigate the generalizability of the proposed method, we assess if injecting other available relevance scores during re-ranking (i.e., a dense passage retriever’s relevance score) also improves the effectiveness; we call this model CE\(_{\textrm{DPRCAT}}\). To do so, we address the following question:

RQ5: What is the effectiveness of injecting the dense passage retriever (DPR) score into the input, and what is its optimal representation?

This research question is designed to explore the impact of injecting the score of a dense passage retriever (DPR) into CE re-rankers. By assessing both the effectiveness and the optimal representation of the DPR score when injected into the re-ranker input, this question addresses a crucial gap in understanding how CE re-rankers can be improved by utilizing the DPR score in their input representation. Identifying the optimal method for incorporating the DPR score will enable the development of more effective retrieval strategies, providing insights into how to improve the effectiveness of search systems in real-world applications. Furthermore, to provide an explanation of the improvement of CE\(_{\textrm{BM25CAT}}\), we perform a qualitative analysis of a case where CE\(_{\textrm{CAT}}\) fails to identify the relevant document that is found by CE\(_{\textrm{BM25CAT}}\) with the help of the BM25 score.Footnote 1

Moreover, previous studies have shown that using larger cross-encoder re-rankers as teachers for smaller cross-encoder re-rankers or dense retrievers improves effectiveness [6]. We analyze injecting the BM25 or DPR score into the input while training the student cross-encoders with three larger cross-encoders as teachers, to assess the effectiveness of injection combined with knowledge distillation.

To the best of our knowledge, there is no prior work on improving the effectiveness of cross-encoder re-rankers by injecting a retrieval model’s score into their input. Our main contributions in this work are seven-fold:

1. We introduce a novel strategy for optimizing the utilization of initial-stage retriever scores, including both BM25 and DPR scores, in cross-encoder re-rankers. Our approach demonstrates statistically significant improvements across all official metrics, confirmed through extensive experiments and detailed analysis.

2. Our investigation reveals superior performance of our method over traditional techniques that linearly interpolate the scores from BM25 and cross-encoders. We provide empirical evidence supporting the effectiveness of our non-linear integration strategy.

3. Through rigorous comparison, we establish the superior exact matching capabilities of our cross-encoder model, CE\(_{\textrm{BM25CAT}}\), over the standard BM25, while highlighting the limitations of CE\(_{\textrm{CAT}}\) in similar scenarios.

4. We conduct a thorough analysis of the performance of CE\(_{\textrm{CAT}}\) and CE\(_{\textrm{BM25CAT}}\) across various query types, demonstrating consistent superiority of CE\(_{\textrm{BM25CAT}}\) over CE\(_{\textrm{CAT}}\) in handling diverse queries.

5. We demonstrate the generalizability of our relevance score injection approach by incorporating DPR scores into the cross-encoder, CE\(_{\textrm{DPRCAT}}\). This adaptation shows even greater effectiveness than CE\(_{\textrm{BM25CAT}}\), further validating our method’s robustness.

6. We find that injecting the relevance score into the student model can improve effectiveness in a teacher-student training setup.

7. We explore and analyze the impact of different positions for injecting retriever scores into the input of cross-encoder re-rankers. Our findings identify the optimal position and provide insights into the positional effects of retriever score injection on the performance of cross-encoder re-rankers.

After a discussion of related work in Sect. 2, we describe the retrieval models employed in Sect. 3 and the specifics of our experiments and methods in Sect. 4. The results are examined and the research questions are addressed in Sect. 5. Finally, we conclude in Sect. 6.

2 Related work

2.1 Modifying the input of re-rankers

Boualili et al. [20, 21] propose a method for highlighting exact matching signals by adding markers at the start and the end of each occurrence of the query terms in the input. In addition, they modify original passages and expand each passage with a set of generated queries using Doc2query [22] to overcome the vocabulary mismatch problem. This strategy differs from ours in two aspects: (1) the type of information added to the input: they add four tokens as markers for each occurrence of query terms, adding a burden to the limited input length of 512 tokens for query and document together, while we only add the BM25 score; (2) the need for data augmentation: they need to train a Doc2query model to provide the exact matching signal for improving the BERT re-ranker, while our strategy does not need any extra overhead in terms of data augmentation. A few recent, but less related examples are Al-Hajj et al. [23], who experiment with the use of different supervised signals in the input of the cross-encoder to emphasize target words in context, and Li et al. [24], who insert boundary markers into the input between contiguous words for Chinese named entity recognition.

Additionally, various studies show that modifying cross-encoder re-ranker inputs by adding additional information represented as splitter tokens can improve their effectiveness [25,26,27,28]. BERT-FP [26] demonstrates that adding splitter tokens between the utterances of a dialogue can improve the effectiveness of BERT-based re-rankers for the response retrieval task. CLosER [25] proposes an expertise-aware post-training that modifies the input of BERT-based re-rankers with three different splitter tokens to differentiate the end of an utterance by the questioner, the end of an utterance by the professional responder with deep expertise, and the end of an utterance by the professional responder with shallow expertise. They show that this boosts the effectiveness of BERT-based re-rankers for conversational search in the legal domain statistically significantly. Askari et al. [27] show that injecting question tags alongside the question title and description in a modified fine-grained structured input can improve the effectiveness of cross-encoder re-rankers for answer retrieval in community question answering systems. Gretkowski et al. [28] show that modifying the input of BERT-based classifiers by injecting commonsense concepts in addition to the question and solution can improve their commonsense reasoning ability.

2.2 Numerical information in transformer models

The incorporation of numerical information into Transformer models encompasses a range of tasks [29] including basic arithmetic operations, numeration, magnitude comparison, arithmetic word problems, exact facts, measurement estimation, and numerical language modeling. Simple arithmetic tasks test models on basic operations such as addition and subtraction, often using synthetic datasets designed for both masked and causal language models [30, 31]. Numeration, or the task of decoding numeric strings into their corresponding values, has been explored with both static and contextualized embeddings [32, 33]. Magnitude comparison tasks assess the ability of models to determine the larger of two or more numbers [32]. Arithmetic word problems and exact facts reasoning require models to apply numerical knowledge in more complex textual scenarios, with several datasets challenging models with such tasks [34,35,36,37,38,39]. Measurement estimation further explores models’ capabilities in approximating and comparing quantities, using diverse benchmarks [40,41,42,43,44]. Numerical language modeling, an extension of traditional language modeling, involves predicting numeric values in text, evaluated through regression metrics [30]. Downstream applications of these numeracy capabilities are extensive, ranging from sarcasm detection in tweets to enhancing performance on numeric-heavy QA tasks [30, 45, 46], showcasing both the utility and versatility of Transformer models in handling numerical data.

Wallace et al. [16] analyze the ability of BERT models to work with numbers and conclude that the models capture numeracy and are able to do numerical reasoning; however, the models appear to struggle with interpreting floats. Moreover, Zhang et al. [47] show that BERT models capture a significant amount of information about numerical scale, except for general common-sense reasoning. Various studies are inspired by the fact that Transformer models can correctly process numbers [28, 46, 48,49,50,51]. Gu et al. [52] incorporate text, categorical, and numerical data as different modalities with Transformers using a combining module across different classification tasks. They discover that adding tabular features increases effectiveness, while using only text is insufficient and results in the worst performance. Recently, [53] has shown the impact of injecting a credibility score as a statement into the input of cross-encoder re-rankers to guide re-rankers to rank relevant and credible documents above relevant but less credible documents.

2.3 Methods for combining rankers

Linearly interpolating different rankers’ scores has been studied extensively in the literature [11, 15, 54,55,56]. In information retrieval (IR), various methodologies have been developed for combining search results from multiple rankers. These strategies can be categorized into score-based, rank-based, probabilistic, and voting-based methods, each addressing different aspects of the fusion challenge [57]. Score-based methods integrate the relevance scores provided by individual search engines to formulate the final relevance score [58,59,60]. Rank-based methods [61,62,63], in contrast, do not rely on explicit relevance scores but instead utilize the ordinal positions of documents across different search results. This method is particularly valuable when dealing with data from web search aggregators, such as Kayak and Skyscanner, where individual relevance scores might not be accessible. Rank-based strategies focus solely on the synthesis of ranking positions, streamlining the fusion process under constraints of limited data. Probabilistic methods [64,65,66,67] introduce a statistical approach by estimating the probability distribution of document relevance across ranking positions for each search engine. These methods necessitate a training phase to accurately model these probabilities, thus adding a layer of complexity but potentially increasing the accuracy of the fused rankings by incorporating probabilistic inference. Voting-based methods [64, 68] adapt traditional voting procedures to metasearch. Methods like the Borda Count and the Condorcet method are used to amalgamate the preferences of multiple search engines, treating each engine as an "expert" voter. This approach is often seen as an extension of rank-based methods, as it primarily relies on the ranks rather than scores.

In this paper, we investigate multiple score-based linear and non-linear interpolation ensemble methods to analyze their performance for combining BM25 and CE\(_{\textrm{CAT}}\) scores in comparison to CE\(_{\textrm{BM25CAT}}\). For the sake of a fair analysis, we do not compare CE\(_{\textrm{BM25CAT}}\) with a learning-to-rank approach that is trained on 87 features by Zhang et al. [69]. The use of ensemble methods brings additional overhead in terms of efficiency because it adds one extra step to the re-ranking pipeline. It is worth noting that in this paper, we concentrate on analyzing the improvement obtained by combining the first-stage retriever and a BERT-based re-ranker: BM25 and CE\(_{\textrm{CAT}}\), respectively. However, we are aware that combining the scores of BM25 and dense retrievers, which are both first-stage retrievers, has also shown improvements [70,71,72]; these are outside the scope of our study. In particular, CLEAR [10] proposes an approach to train dense retrievers to encode semantics that BM25 fails to capture for first-stage retrieval. However, in this study, our aim is to improve re-ranking in the second stage of a two-stage retrieval setting.

2.4 Knowledge distillation

Knowledge distillation has become an important approach in information retrieval [6, 73]. In machine learning, knowledge distillation refers to the process of transferring knowledge from a more capable model (called the teacher) to a less capable model (called the student) [74]. Following this convention, the goal in information retrieval is to train a small ranker (the student) with one or multiple larger ranker models (the teachers) on a given dataset.

Knowledge distillation methods can be categorized into methods that employ multiple teachers of one type of model, called teacher ensembles, or methods that use different types of models as teachers [75]. In this paper, where we experiment with knowledge distillation, we use a highly effective method called Marginal Mean Square Error (Margin-MSE) [6] which uses the ensemble of three relatively large BERT-based cross-encoder re-rankers as teachers and a smaller BERT-based cross-encoder re-ranker as the student. Margin-MSE utilizes the margin between the scores of relevant and non-relevant passages as distilled knowledge.

3 Methods

In the following, we first describe the first-stage retrievers and re-rankers that are used and then describe the proposed method in detail.

3.1 First stage rankers

We experiment with two widely used first-stage rankers namely BM25 [76] and dense passage retrievers [77].

3.1.1 BM25

Lexical retrievers estimate the relevance of a document to a query based on word overlap [76]. Many lexical methods, including vector space models, Okapi BM25, and query likelihood, have been developed in previous decades. We use BM25 because of its popularity as a first-stage ranker in current systems. Based on the statistics of the words that overlap between the query and the document, BM25 calculates a score for the pair:

$$\begin{aligned} s_{lex}(q,d)&= BM25(q,d) \nonumber \\&= \sum _{t \in q \cap d }{rsj_t \cdot \frac{tf_{t,d}}{tf_{t,d} + k_{1} \left\{ (1-b) + b \frac{|d|}{l} \right\} }} \end{aligned}$$
(1)

where t is a term, \(tf_{t,d}\) is the frequency of t in document d, \(rsj_t\) is the Robertson-Spärck Jones weight [2] of t, and l is the average document length. \(k_1\) and b are parameters [78, 79].
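For illustration, the sketch below implements Eq. (1) for a single query–document pair, assuming pre-computed collection statistics. The variable names mirror the notation above; the default \(k_1\) and b values and the exact form of the RSJ weight are illustrative assumptions, not the tuned Anserini parameters used in our experiments.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=0.9, b=0.4):
    """Minimal BM25 sketch following Eq. (1): sum over terms shared by query and document."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in set(query_terms) & set(tf):
        # Robertson-Sparck Jones weight (idf-like component) of term t
        rsj = math.log((num_docs - doc_freqs[t] + 0.5) / (doc_freqs[t] + 0.5) + 1)
        # Term-frequency saturation with document-length normalization
        norm = tf[t] + k1 * ((1 - b) + b * len(doc_terms) / avg_doc_len)
        score += rsj * tf[t] / norm
    return score
```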

3.1.2 DPR

Dense passage retrieval (DPR) models [77] provide an efficient BERT-based first-stage retriever that estimates relevance beyond word overlap by matching query and document text in a continuous representation space. In contrast to cross-encoder re-rankers, which process the query and the document together to estimate relevance and are therefore computationally heavy, the DPR model processes them in two phases, which makes DPR models an efficient alternative first-stage retriever. The representations of the collection’s passages are pre-computed in an offline setup, where a single CLS vector represents the contextualized representation of a passage (\(\textbf{p}\)). Given a query, DPR models represent the query as a single CLS vector (\(\textbf{q}\)) and estimate the relevance of a document to a query based on the dot product of their corresponding CLS vectors. This is also called BERT\(_{\textrm{DOT}}\) [6], formally defined as:

$$\begin{aligned} {BERT_{DOT}(q_{1:m},p_{1:n}) = \textbf{q} \cdot \textbf{p}} \end{aligned}$$
(2)
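A minimal sketch of this two-phase scoring follows, assuming the passage CLS vectors have been pre-computed offline; the random tensors stand in for the outputs of any BERT\(_{\textrm{DOT}}\)-style bi-encoder.

```python
import torch

def dpr_scores(query_vec: torch.Tensor, passage_matrix: torch.Tensor) -> torch.Tensor:
    """Eq. (2): relevance is the dot product of the query CLS vector with each
    pre-computed passage CLS vector (passage_matrix shape: [num_passages, dim])."""
    return passage_matrix @ query_vec

# Illustrative usage with random vectors standing in for encoder outputs.
query_vec = torch.randn(768)             # CLS vector of the query
passage_matrix = torch.randn(1000, 768)  # offline-encoded passage CLS vectors
scores = dpr_scores(query_vec, passage_matrix)
top10 = torch.topk(scores, k=10).indices
```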

3.2 CE\(_{\textrm{CAT}}\): cross-encoder re-rankers without BM25 injection

Concatenating query and passage input sequences is the typical method for using cross-encoder (e.g., BERT) architectures with pre-trained Transformer models in a re-ranking setup [4, 6, 80, 81]. This basic design is referred to as CE\(_{\textrm{CAT}}\) and shown in Fig. 1. The query \(q_{1:m}\) and passage \(p_{1:n}\) sequences are concatenated with the [SEP] token, and the [CLS] token representation computed by CE is scored with a single linear layer \(W_s\) in the CE\(_{\textrm{CAT}}\) ranking model:

$$\begin{aligned} CE_{CAT}(q_{1:m},p_{1:n}) = CE([CLS]\,q\,[SEP]\,p\,[SEP]) * W_s \end{aligned}$$
(3)

We use CE\(_{\textrm{CAT}}\) as our baseline re-ranker architecture. We evaluate different cross-encoder models in our experiments and all of them follow the above design.
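As a concrete illustration of Eq. (3), the sketch below scores a query–passage pair with a Huggingface sequence-classification cross-encoder; the checkpoint name is only an example, and any CE\(_{\textrm{CAT}}\)-style model with a single-score head could be substituted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Example checkpoint; any cross-encoder re-ranker with a single-score head works here.
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

query = "what is the capital of france"
passage = "Paris is the capital and most populous city of France."

# The tokenizer builds '[CLS] query [SEP] passage [SEP]' as in Eq. (3).
inputs = tokenizer(query, passage, truncation=True, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(score)
```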

3.3 CE\(_{\textrm{BM25CAT}}\): cross-encoder re-rankers with BM25 injection

To study the effectiveness of injecting the BM25 score into the input, we modify the basic input format as follows and call the resulting model CE\(_{\textrm{BM25CAT}}\):

$$\begin{aligned}&CE_{BM25CAT}(q_{1:m},p_{1:n})\nonumber \\&\quad =CE([CLS]\,q\,[SEP] \,BM25\,[SEP]\,p\,[SEP]) * W_s \end{aligned}$$
(4)

where BM25 represents the relevance score produced by BM25 for the query and the passage.
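The sketch below shows one way to build such an input in practice, assuming the BM25 score has already been normalized and converted to an integer string as described later in this section; the helper function is ours and is not the exact implementation used in our experiments.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def build_bm25cat_input(query: str, passage: str, bm25_int: int):
    """One way to realize Eq. (4): the normalized, integer-converted BM25 score is
    injected as a string between the query and the passage, separated by [SEP]."""
    score_and_passage = f"{bm25_int} [SEP] {passage}"
    # The tokenizer recognizes the literal [SEP] as the separator special token,
    # yielding: [CLS] query [SEP] bm25 [SEP] passage [SEP]
    return tokenizer(query, score_and_passage, truncation=True, max_length=512,
                     return_tensors="pt")

enc = build_bm25cat_input("what is the capital of france",
                          "Paris is the capital of France.", 87)
print(tokenizer.decode(enc["input_ids"][0]))
```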

We study different representations of the BM25 score to find the optimal approach for injecting BM25 into the cross-encoders. The reasons are: (1) BM25 scores do not have an upper bound and should be normalized to obtain an interpretable score for a given query and passage; (2) BERT-based models can process integers better than floating point numbers [16], so we analyze whether converting the normalized score to an integer is more effective than injecting the floating point score. For normalizing BM25 scores, we compare three different normalization methods: Min-Max, StandardizationFootnote 2 (Z-score), and Sum:

$$\begin{aligned}&Min\text {-}Max(s_{BM25}) = \frac{s_{BM25} -s_{min}}{s_{max}-s_{min}} \end{aligned}$$
(5)
$$\begin{aligned}&Standard(s_{BM25}) = \frac{s_{BM25}- \mu (S)}{\sigma (S)} \end{aligned}$$
(6)
$$\begin{aligned}&Sum(s_{BM25}) = \frac{s_{BM25}}{sum(S)} \end{aligned}$$
(7)

where \(s_{BM25}\) is the original score, and \(s_{max}\) and \(s_{min}\) are the maximum and minimum scores, respectively, in the ranked list. sum(S), \(\mu (S)\), and \(\sigma (S)\) refer to the sum, average, and standard deviation over the scores of all passages retrieved for a query. The anticipated effect of the Sum normalizer is that the sum of the scores of all passages in the ranked list is 1; thus, if the top-n passages receive much higher scores than the rest, their normalized scores will differ more strongly from the scores of the remaining passages in the ranked list; this distance could give a useful signal to CE\(_{\textrm{BM25CAT}}\). We experiment with Min-Max and Standardization in a local and a global setting. In the local setting, we take the minimum and maximum (for Min-Max) and the mean and standard deviation (for Standardization) from the ranked list of scores per query. In the global setting, we use \(\{0,50,42,6\}\) as {minimum, maximum, mean, standard deviation}, as these values have been empirically suggested in prior work as defaults for globally normalizing BM25 scores across different queries [82]. In our data, the {minimum, maximum, mean, standard deviation} values are \(\{0,98,7,5\}\) across all queries. Because of the differences between the recommended defaults and the statistics of our collections, we explore other global values for Min-Max, using 25, 50, 75, and 100 as maximum and 0 as minimum. However, we obtained the best result using the default values of [82]. To convert the float numbers to integers, we multiply the normalized score by 100 and discard the decimals. Finally, we store the number as a string.
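The sketch below illustrates the three normalizers and the float-to-integer conversion on a ranked list of BM25 scores; the global min/max defaults of 0 and 50 follow [82], and the function names are ours.

```python
import numpy as np

def min_max(scores, s_min=None, s_max=None):
    # Local setting: min/max from the ranked list; global setting: pass e.g. 0 and 50.
    s_min = np.min(scores) if s_min is None else s_min
    s_max = np.max(scores) if s_max is None else s_max
    return (scores - s_min) / (s_max - s_min)

def standardize(scores, mu=None, sigma=None):
    mu = np.mean(scores) if mu is None else mu
    sigma = np.std(scores) if sigma is None else sigma
    return (scores - mu) / sigma

def sum_norm(scores):
    return scores / np.sum(scores)

def to_injected_string(normalized_score):
    # Multiply by 100, discard decimals, and store as a string for the tokenizer.
    return str(int(normalized_score * 100))

bm25_scores = np.array([31.2, 24.7, 18.3, 9.1])
print([to_injected_string(s) for s in min_max(bm25_scores, s_min=0, s_max=50)])
```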

3.4 Linear interpolation ensembles of BM25 and CE\(_{\textrm{CAT}}\)

We compare our approach to common ensemble methods [11, 83] for interpolating BM25 and BERT re-rankers. We combine the scores linearly using the following methods: (1) Sum: the sum of the BM25 and CE\(_{\textrm{CAT}}\) scores, (2) Max: the maximum of the BM25 and CE\(_{\textrm{CAT}}\) scores, and (3) Weighted-Sum:

$$\begin{aligned} s_i = \alpha \cdot s_{BM25} + (1-\alpha ) \cdot s_{CE_{CAT}} \end{aligned}$$
(8)

where \(s_i\) is the weighted sum produced by the interpolation, \(s_{BM25}\) is the normalized BM25 score, \(s_{CE_{CAT}}\) is the CE\(_{\textrm{CAT}}\) score, and \(\alpha \in [0,1]\) is a weight that indicates the relative importance. Since the CE\(_{\textrm{CAT}}\) score \(\in [0,1]\), we also normalize the BM25 score using Min-Max normalization. Furthermore, we train ensemble models that take \(s_{BM25}\) and \(s_{CE_{CAT}}\) as features. We experiment with four different classifiers for this purpose: SVM with a linear kernel, SVM with an RBF kernel, Naive Bayes, and a Multi Layer Perceptron (MLP) as a non-linear method, and report the best classifier performance in Sect. 5.3.
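For reference, a minimal sketch of the Weighted-Sum baseline of Eq. (8) and of a feature-based classifier ensemble over the two scores; the scikit-learn estimator and the toy data are illustrative stand-ins for the classifiers listed above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def weighted_sum(s_bm25_norm, s_ce, alpha=0.1):
    """Eq. (8): linear interpolation of the Min-Max-normalized BM25 score and the CE score."""
    return alpha * s_bm25_norm + (1 - alpha) * s_ce

# Classifier ensemble: the two scores are the only features; labels indicate relevance.
X_train = np.array([[0.8, 0.95], [0.2, 0.10], [0.6, 0.70], [0.1, 0.30]])
y_train = np.array([1, 0, 1, 0])
clf = GaussianNB().fit(X_train, y_train)
# Rank candidates for a query by the predicted probability of relevance.
candidate_scores = clf.predict_proba(np.array([[0.5, 0.85], [0.9, 0.40]]))[:, 1]
```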

3.5 CE\(_{\textrm{DPRCAT}}\): cross-encoder re-rankers with DPR injection

We inject the DPR score into the input of the cross-encoders, similarly to CE\(_{\textrm{BM25CAT}}\). We compare different representations of the DPR score to find the optimal approach for injecting DPR into the cross-encoders. This is necessary because the relevance score computed via the dot product is not normalized to a fixed range. Therefore, in addition to injecting the original score, we investigate Min-Max normalization in the global and local settings with float and integer representations. We use a pre-trained BERT-Base\(_{\textrm{DOT}}\) as our DPR model, which is a dense retrieval model trained with knowledge distillation [6].Footnote 3

3.6 Knowledge distillation with score injection

Previous studies have shown that using larger cross-encoder re-rankers as teachers for smaller cross-encoder re-rankers or dense retrievers improves effectiveness. We analyze injecting the BM25 or DPR score into the input while training the student cross-encoders with three larger cross-encoders as teachers. To do so, following [6], we train MiniLM to predict the ensemble score of three large teacher models, BERT-Base\(_{\textrm{CAT}}\), BERT-Large-WM\(_{\textrm{CAT}}\), and ALBERT-Large\(_{\textrm{CAT}}\), per query–document sample. We use Mean Square Error (MSE) as the loss function and the scores of the teacher models published by Hofstätter et al. [6].Footnote 4 This gives MiniLM performance comparable to large models, while being 18 times faster. Our motivation is to analyze whether, even in such a scenario, injecting the BM25 or DPR score can improve effectiveness.
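A schematic of this distillation step is sketched below, assuming the released teacher ensemble scores are available per query–passage pair; the student forward pass is written as a Huggingface-style sequence-classification call, and the batch layout is an assumption made for illustration.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def distillation_step(student, batch, optimizer):
    """One training step: the student regresses the pre-computed teacher ensemble score.
    batch["inputs"] already contains the injected BM25 or DPR score string when applicable."""
    optimizer.zero_grad()
    student_scores = student(**batch["inputs"]).logits.squeeze(-1)   # shape: [batch_size]
    teacher_scores = batch["teacher_ensemble_scores"]                # released by Hofstätter et al. [6]
    loss = mse(student_scores, teacher_scores)
    loss.backward()
    optimizer.step()
    return loss.item()
```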

4 Experimental design

4.1 Dataset and metrics

We conduct our experiments on the MSMARCO-passage collection [84] and the two TREC Deep Learning tracks (TREC-DL’19 and TREC-DL’20) [85, 86]. The MSMARCO-passage dataset contains about 8.8 million passages (average length: 73.1 words) and about 1 million natural language queries (average length: 7.5 words) and has been extensively used to train deep language models for ranking because of its large number of queries. Following prior work on MSMARCO [11, 87,88,89,90], we use the dev set (\(\sim 7k\) queries) for our empirical evaluation. MAP@1000 and nDCG@10 are calculated in addition to the official evaluation metric MRR@10. The passage corpus of MSMARCO is shared with the TREC DL’19 and DL’20 collections, which contain 43 and 54 queries respectively. We evaluate our experiments on these collections using nDCG@10 and MAP@1000, as is standard practice in TREC DL [85, 86], to make our results comparable to previously published and upcoming research. We cap the query length at 30 tokens and the passage length at 200 tokens following prior work [6].

4.2 Training configuration and model parameters

We use the Huggingface library [91], the Cross-Encoder package of the Sentence-Transformers library [92], and PyTorch [93] for cross-encoder re-ranker training and inference. We fine-tune all of the cross-encoder re-rankers in a similar configuration to ensure the fairness and reliability of our comparison. We train a distinct cross-encoder re-ranker for each query set in the evaluation. For each TREC DL collection, we use the other TREC DL query set as the validation set, and we select both TREC DL (’19 and ’20) query sets as the validation set to train CEs for the MSMARCO DEV Passage collection. Please note that the train/validation/test data are the same for the compared CE\(_{\textrm{CAT}}\), CE\(_{\textrm{BM25CAT}}\), and CE\(_{\textrm{DPRCAT}}\) models to ensure that the only difference during fine-tuning and evaluation of a cross-encoder re-ranker is the presence or absence of score injection in the input. For the dense passage retriever model, we use an existing state-of-the-art dense passage retrieval model that is trained on the MSMARCO training set.Footnote 5

For injecting the BM25 or DPR score as text, we pass the score in string format to the BERT tokenizer in the same way as the query and document. Please note that the integer numbers are already included in the BERT tokenizer’s vocabulary, allowing for appropriate tokenization. Following prior work [6], we use the Adam [94] optimizer with a learning rate of \(7*10^{-6}\) for all cross-encoder layers, regardless of the number of layers trained. We employ early stopping based on the nDCG@10 value on the validation set. We use a training batch size of 32. For all cross-encoder re-rankers, we use Cross-Entropy loss [95]. For lexical retrieval with BM25, we employ the tuned parameters from the Anserini documentation [78, 79].Footnote 6

4.3 Knowledge distillation

For knowledge distillation (KD), we use the mean square error loss following Hofstätter et al. [6]. For the teachers’ relevance scores, we use the publicly available scores for pairs of queries and documents from the MSMARCO training set released by Hofstätter et al. [6]. For injection in the KD setup, we compute and inject either the BM25 or the DPR score.

5 Results

5.1 Choice of BM25 score representation

As introduced in Sect. 3.3, we compare different representations of the BM25 score in Table 1 for injection into CE\(_{\textrm{BM25CAT}}\). We chose MiniLM [19] for this study as it has shown competitive results in comparison to BERT-based models while being 3 times smaller and 6 times faster.Footnote 7 Our first interesting observation is that injecting the original float score of BM25, rounded down to 2 decimal points (row b), into the input seems to slightly improve the effectiveness of the re-ranker. We assume this is because the average query and passage length is relatively small in the MSMARCO Passage collection, which prevents high BM25 scores with low interpretability for BERT. Second, we find that the normalized BM25 score with Min-Max in the global normalization setting converted to integer (row f) is the most effective representationFootnote 8 for injecting BM25, yielding a statistically significant improvement.

Table 1 Effectiveness results

The global normalization setting gives better results for both Min-Max (rows e, f) and Standardization (rows i, j) than local normalization (rows c, d and g, h).Footnote 9 The reason is probably that in the global setting a candidate document obtains a high normalized score (close to 1 in the floating point representation) only if its original score is close to the default maximum (for Min-Max normalization), so the normalized score is more interpretable across different queries. In the local setting, on the other hand, the passage ranked at position 1 always receives a normalized score of 1 with Min-Max, even if its original score is not high and does not differ much from that of the last passage in the ranked list.

Moreover, converting the normalized float score to integers gives better results for both Min-Max (rows d, f) and Standardization (rows h, j) than the float representation (rows c, e and g, i). We find that Min-Max normalization is a better representation for injecting BM25 than Standardization, which could be due to the fact that with Min-Max the normalized score cannot be negative, and, as a result, interpreting the injected score is easier for CE\(_{\textrm{BM25CAT}}\). We find that the Sum normalizer (rows k and l) decreases effectiveness. Apparently, our expectation that Sum would help distinguish between the top-n passages and the remaining passages in the ranked list (see Sect. 3.3) does not hold.

Table 2 Effectiveness results
Table 3 The effectiveness of injecting the BM25 score into the input (BERT-Base\(_{\textrm{BM25CAT}}\)) compared to the interpolation performance of BM25 and BERT-Base\(_{\textrm{CAT}}\) using common ensemble methods

5.2 Impact of BM25 injection for various cross-encoders (RQ1)

Table 2 shows that injecting the BM25 score, using the best normalizer (Min-Max in the global normalization setting converted to integer), into all four cross-encoders improves their effectiveness on all of the metrics compared to using them without injecting BM25. This shows that injecting the BM25 score into the input, as a small modification to the current re-ranking pipeline, improves re-ranking effectiveness. This comes without any additional computational burden, as we train CE\(_{\textrm{CAT}}\) and CE\(_{\textrm{BM25CAT}}\) in a completely equal setting in terms of number of epochs, batch size, etc. We obtain the highest result with BERT-Large\(_{\textrm{BM25CAT}}\) among the cross-encoders with BM25 injection, which could be due to the larger number of parameters of the model. We find that the results of MiniLM are similar to those of BERT-Base on MSMARCO-DEV, while the former is more efficient.

5.3 Comparing BM25 injection with ensemble methods (RQ2)

Table 3 shows that while injecting BM25 leads to improvement, regular ensemble methods and the Naive Bayes classifier fail to do so; combining the scores of BM25 and BERT\(_{\textrm{CAT}}\) in both linear and non-linear (MLP) interpolation ensemble settings even leads to lower effectiveness than using the cross-encoder as the sole re-ranker. Therefore, our strategy is a better solution than linear interpolation. We only report results for Naive Bayes (with the BM25 and BERT\(_{\textrm{CAT}}\) scores as features) as it had the highest effectiveness of the four estimators. Still, its effectiveness is much lower than that of BERT\(_{\textrm{BM25CAT}}\) and also lower than a simple Weighted-Sum. Weighted-Sum (tuned) in Table 3 is tuned on the validation set, for which \(\alpha =0.1\) was found to be optimal. We analyze the effect of different \(\alpha\) values in a weighted linear interpolation (Weighted-Sum) to draw a more complete picture of the impact of combining scores on the DEV set. Figure 3 shows that increasing the weight of BM25 decreases effectiveness. The figure also shows that the \(\alpha\) tuned on the validation set in Table 3 is not the optimal value for the DEV set. The highest effectiveness at \(\alpha =0.0\) in Fig. 3 confirms that we should not combine the scores with current interpolation methods and that only using the scores of BERT-Base\(_{\textrm{CAT}}\) is better, at least for the MSMARCO Passage collection.

Fig. 3 Effectiveness on MSMARCO DEV when varying the interpolation weight of the BM25 and BERT-Base\(_{\textrm{CAT}}\) scores. \(\alpha =0\) means only BERT\(_{\textrm{CAT}}\) scores are used

5.4 Exact matching relevance results (RQ3)

To conduct the exact matching analysis, we replace the passage words that do not appear in the query with the [MASK] token, leaving the model with only a skeleton of the original passage and forcing it to rely on the exact word matches between query and passage [12]. We do not train models on this input but use our models that were fine-tuned on the original data. Table 4 shows that BERT-Base\(_{\textrm{BM25CAT}}\) performs better than both BM25 and BERT-Base\(_{\textrm{CAT}}\) in the exact matching setting on all metrics. Moreover, we find that the percentage of relevant passages ranked in the top-10 that are shared between BM25 and BERT\(_{\textrm{BM25CAT}}\) is \(40\%\), which is higher than the percentage shared between BM25 and BERT\(_{\textrm{CAT}}\) (\(37\%\)). Therefore, the higher effectiveness of BERT\(_{\textrm{BM25CAT}}\) in the exact matching setting could be at least partly because it mimics BM25 more than BERT\(_{\textrm{CAT}}\) does. In comparison, this percentage is \(57\%\) between BERT\(_{\textrm{BM25CAT}}\) and BERT\(_{\textrm{CAT}}\).
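A sketch of this exact-matching input construction is shown below, using simple whitespace tokenization for illustration; the exact tokenization granularity in our experiments follows [12].

```python
def mask_non_query_words(query: str, passage: str, mask_token: str = "[MASK]") -> str:
    """Replace every passage word that does not occur in the query with [MASK],
    leaving only the exact-match skeleton of the passage."""
    query_words = {w.lower() for w in query.split()}
    return " ".join(w if w.lower() in query_words else mask_token for w in passage.split())

print(mask_non_query_words("capital of france",
                           "Paris is the capital and most populous city of France ."))
# -> "[MASK] [MASK] [MASK] capital [MASK] [MASK] [MASK] [MASK] of France [MASK]"
```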

Table 4 Comparing exact matching effectiveness of BERT-Base\(_{\textrm{BM25CAT}}\) and BERT-Base\(_{\textrm{CAT}}\) by keeping only the query words in each passage for re-ranking

5.5 Impact of position of injection (RQ4)

To assess whether injection position has an impact on the effectiveness of CE\(_{\textrm{BM25CAT}}\), we try three different positions for injecting the BM25 relevance score into the input of CE\(_{\textrm{BM25CAT}}\):

  • Between query and candidate document, in which the position of the injection is influenced by the query length: ‘[CLS] query [SEP] BM25 [SEP] document [SEP]’

  • Before the query, in which the position of the injection is independent from query or document length: ‘[CLS] BM25 [SEP] query [SEP] document [SEP]’

  • After candidate document, in which the position of injection is dependent on the query and document length: ‘[CLS] query [SEP] document [SEP] BM25 [SEP]’

The motivation behind this analysis comes from the fact that injecting the BM25 score in the middle, between query and candidate document, gives the BM25 score a different position each time, while there is no actual meaningful difference based on the position of the BM25 score. As a result, using a fixed position for injection might lead to an improvement. Table 5 confirms this analysis and shows that injecting BM25 before the query achieves the highest effectiveness. Since the gain from modifying the position of injection is not significant, we do not re-run the previous experiments and keep the position of injection in the previous experiments in the middle, as explained in Sect. 3.

Table 5 Analyzing the position of the BM25 score injection

5.6 Effectiveness of injecting DPR into CE\(_{\textrm{DPRCAT}}\) (RQ5)

5.6.1 Choice of DPR score representation

As introduced in Sect. 3.5, we compare different representations of the DPR score in Table 6 for injection into CE\(_{\textrm{DPRCAT}}\), using Min-Max as normalization strategy. We chose MiniLM [19] for this study, similar to the previous analysis for CE\(_{\textrm{BM25CAT}}\). We observe a similar pattern as for injecting BM25 into CE\(_{\textrm{BM25CAT}}\): first, injecting the original float DPR score, rounded down to 2 decimal points, into the input seems to slightly improve the effectiveness of the re-ranker. Second, we find that the normalized score with Min-Max in the global normalization setting converted to integer is the most effective representation for injecting DPR. We found 118 as the global maximum value and 89 as the global minimum value of the DPR scores on the MS MARCO training set and used them as the global values for Min-Max. The position of score injection is in the middle, between the query and the candidate document.

Table 6 Effectiveness results

5.6.2 Finding the most effective two-stage setting

To determine the optimal configuration for CE\(_{\textrm{BM25CAT}}\) and CE\(_{\textrm{DPRCAT}}\), we employ two initial rankers, BM25 and DPR, followed by two re-rankers, CE\(_{\textrm{BM25CAT}}\) and CE\(_{\textrm{DPRCAT}}\). Table 7 presents our findings, which reveal that utilizing DPR as the initial ranker in combination with CE\(_{\textrm{DPRCAT}}\) as the re-ranker yields the highest effectiveness. This combination slightly outperforms the use of CE\(_{\textrm{BM25CAT}}\) as the re-ranker in the same setup.

Additionally, we observe that re-ranking without injection, using the effective DPR model as the initial ranker, does not consistently result in improved effectiveness over the first-stage retrieval alone. Specifically, when considering nDCG@10 for TREC’19, DPR achieves a score of 0.721, while MiniLM\(_{\textrm{CAT}}\) scores 0.717. In contrast, employing either MiniLM\(_{\textrm{BM25CAT}}\) or MiniLM\(_{\textrm{DPRCAT}}\) as the re-ranker consistently leads to significant improvements over DPR.

Furthermore, we find that MiniLM\(_{\textrm{BM25CAT}}\) is less effective than MiniLM\(_{\textrm{DPRCAT}}\) for re-ranking over BM25 as the initial ranker. This shows that even if BM25 is used as the first-stage retriever, there is still a benefit in terms of effectiveness in using MiniLM\(_{\textrm{DPRCAT}}\) as the re-ranker. The position of score injection is in the middle, between the query and the candidate document.

Table 7 Effectiveness results of injecting DPR and BM25 scores using DPR or BM25 as initial ranker
Table 8 Effectiveness results

5.6.3 Impact of DPR injection for various cross-encoders

Table 8 shows that injecting the DPR score, using the best normalizer (Min-Max in the global normalization setting converted to integer), into all four cross-encoders improves their effectiveness on all of the metrics compared to using them without injecting DPR. This shows that injecting the DPR score into the input, as a small modification to the current re-ranking pipeline, improves re-ranking effectiveness. This comes without any additional computational burden, as we train CE\(_{\textrm{CAT}}\) and CE\(_{\textrm{DPRCAT}}\) in a completely equal setting in terms of number of epochs, batch size, etc. We obtain the highest result with BERT-Large\(_{\textrm{DPRCAT}}\) among the cross-encoders with DPR injection, which is likely due to the large number of parameters of the model.

We observe that MiniLM yields results on par with BERT-Base on the MSMARCO-DEV dataset while demonstrating superior efficiency. Furthermore, our investigation reveals that across all four cross-encoders, incorporating the DPR score consistently improves effectiveness more than injecting the BM25 score. This difference is clearly observable when comparing the results presented in Tables 2 and 8. However, it is worth noting that CE\(_{\textrm{DPRCAT}}\) exhibits slower convergence during training compared to CE\(_{\textrm{BM25CAT}}\), a phenomenon that is discussed in more detail in Sect. 5.9.

5.7 Knowledge distillation

Table 9 demonstrates the significant impact of knowledge distillation on retrieval and ranking effectiveness. In this configuration, we consistently achieve higher effectiveness when training MiniLM for all three variants (CE\(_{\textrm{CAT}}\), CE\(_{\textrm{BM25CAT}}\), and CE\(_{\textrm{DPRCAT}}\)) than in Table 7, where we do not use knowledge distillation. Notably, our observations indicate that through knowledge distillation, MiniLM\(_{\textrm{DPRCAT}}\) attains even greater effectiveness than BERT-Large\(_{\textrm{DPRCAT}}\). For instance, when evaluating nDCG@10 on TREC DL’20, MiniLM\(_{\textrm{DPRCAT}}\) achieves a score of 0.759, surpassing BERT-Large\(_{\textrm{DPRCAT}}\)’s 0.757, as illustrated in Tables 8 and 9, respectively.

We consider this analysis to be of great importance from an industrial perspective, as knowledge distillation can significantly enhance the effectiveness of small cross-encoder re-rankers. This is important for maintaining efficiency while increasing effectiveness in industry. Our findings also underscore the generalizability of CE\(_{\textrm{BM25CAT}}\) and CE\(_{\textrm{DPRCAT}}\) across various training configurations.

Table 9 Effectiveness results of injecting DPR and BM25 scores using DPR or BM25 as initial ranker in a knowledge distillation training setup

5.8 Analysis of the results

5.8.1 Query types

In order to analyze the effectiveness of BERT-Base\(_{\textrm{CAT}}\) and BERT-Base\(_{\textrm{BM25CAT}}\) across different types of questions, we classify questions based on their lexical answer type. We use the rule-based answer type classifierFootnote 10 inspired by Li and Roth [97] to extract answer types. We classify MSMARCO queries into 6 answer types: abbreviation, location, description, human, numerical, and entity. 4105 queries have a valid answer type and at least one relevant passage in the top-1000. We perform our analysis in two different settings: normal (full-text) and exact-matching (keeping only query words and replacing non-query words with [MASK]). The average MRR@10 per query type is shown in Table 10. The table shows that BERT\(_{\textrm{BM25CAT}}\) is consistently more effective than BERT\(_{\textrm{CAT}}\) on all types of queries.

Table 10 MRR@10 on MSMARCO-DEV per query type for comparing BERT-Base\(_{\textrm{BM25CAT}}\) and BERT-Base\(_{\textrm{CAT}}\) on different query types in full-text and exact-matching (only keeping query words) settings
Fig. 4 Example query and two passages in the input of BERT\(_{\textrm{BM25CAT}}\). The color of each word indicates the word-level attribution value according to Integrated Gradients (IG) [98], where red is positive, blue is negative, and white is neutral. The brightness of the colors indicates the magnitude of these attributions

5.8.2 Qualitative analysis

We show a qualitative analysis of one particular case in Fig. 4 to analyze in more depth what the effect of BM25 injection is and why it works. In the top row, while BERT\(_{\textrm{CAT}}\) mistakenly ranked the relevant passage at position 104, BM25 ranked that passage at position 3 and BERT\(_{\textrm{BM25CAT}}\), apparently helped by BM25, ranked that relevant passage at position 1. In the bottom row, BERT\(_{\textrm{CAT}}\) mistakenly ranked the irrelevant passage at position 1, whereas BERT\(_{\textrm{BM25CAT}}\), informed by the low BM25 score, ranked it much lower, at position 69. In order to interpret the importance of the injected BM25 score in the input of CE\(_{\textrm{BM25CAT}}\) and show its contribution to the matching score in comparison to other words in the query and passage, we use Integrated Gradients (IG) [98], which has proven to be a stable and reliable interpretation method in many different applications, including information retrieval [99,100,101].Footnote 11 In both rows of Fig. 4, we see that the BM25 score (‘22’ in the top row and ‘11’ in the bottom row) is a highly attributed term in comparison to other terms. This shows that injecting the BM25 score assists BERT\(_{\textrm{BM25CAT}}\) in identifying relevant and non-relevant passages better than BERT\(_{\textrm{CAT}}\).

As a more general analysis, we randomly sampled 100 queries from MSMARCO-DEV. For each query, we took the top-1000 passages retrieved by BM25, fed all pairs of queries and their corresponding retrieved passages (100k pairs) into BERT\(_{\textrm{BM25CAT}}\), and computed the attribution scores over the input at the word level. We ranked tokens by their importance using the absolute value of their attribution score and found that the mode of the rank of the BM25 token over all samples is 3. This shows that BERT\(_{\textrm{BM25CAT}}\) attributes high importance to the BM25 token for ranking.
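The aggregation over samples can be summarized as follows; the sketch assumes the word-level attribution scores have already been computed with IG (e.g., via a library such as Captum) and only illustrates how the rank of the BM25 token and its mode are obtained. The toy sample and token values are made up for illustration.

```python
import numpy as np
from statistics import mode

def bm25_token_rank(tokens, attributions, bm25_token):
    """Rank tokens by the absolute attribution value (rank 1 = most important)
    and return the rank of the injected BM25 score token."""
    order = np.argsort(-np.abs(np.asarray(attributions)))
    ranked_tokens = [tokens[i] for i in order]
    return ranked_tokens.index(bm25_token) + 1

# Each sample: (tokens of the input, IG attributions per token, the injected score token).
samples = [
    (["what", "is", "caffeine", "22", "coffee", "contains", "caffeine"],
     [0.10, 0.01, 0.60, 0.45, 0.05, 0.02, 0.55], "22"),
]
ranks = [bm25_token_rank(t, a, s) for t, a, s in samples]
print(mode(ranks))  # mode of the BM25-token rank over all query-passage pairs
```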

Fig. 5 Convergence analysis of MiniLM\(_{\textrm{CAT}}\), MiniLM\(_{\textrm{BM25CAT}}\), and MiniLM\(_{\textrm{DPRCAT}}\). The nDCG@10 on the validation set is reported at each step during training

5.9 Convergence analysis

To assess the relation between the effectiveness of the models and the training time required to reach the optimal weights, we present Fig. 5, which illustrates the nDCG@10 performance on the validation set throughout the training process. Our observations indicate that although CE\(_{\textrm{DPRCAT}}\) exhibits slightly superior effectiveness compared to CE\(_{\textrm{BM25CAT}}\), it requires a longer convergence time. This implies that when training resources are limited, choosing CE\(_{\textrm{BM25CAT}}\) may yield a more effective cross-encoder re-ranker in a shorter training time than CE\(_{\textrm{CAT}}\) and CE\(_{\textrm{DPRCAT}}\).

5.10 Entropy analysis

To delve deeper into the impact of score normalization variations within CE\(_{\textrm{BM25CAT}}\), we conducted an in-depth assessment of MiniLM\(_{\textrm{BM25CAT}}\)’s effectiveness on TREC DL’19. This assessment involved manipulating the global maximum value in the Min-Max normalization process. Specifically, we set the global minimum to zero and incrementally raised the global maximum from 10 to 100 in steps of 10. We kept the global minimum at 0; the reason for this choice is explained in Sect. 3.3. The results, presented in Table 11, reveal an interesting pattern.

As we increase the global maximum value, we observe a reduction in the entropy of the normalized scores, coinciding with an increase in effectiveness. However, a notable trend emerges: once the global maximum exceeds 50, there is a drop in effectiveness. This suggests a potential trade-off between diminishing entropy in the scores and achieving enhanced effectiveness when injecting BM25 scores into CE\(_{\textrm{BM25CAT}}\).
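The entropy reported in Table 11 can be computed along the following lines; this is a sketch under the assumption that the entropy is taken over the distribution of integer-converted normalized scores, which is one natural reading of the setup above, and the synthetic gamma-distributed scores only stand in for real BM25 scores.

```python
import numpy as np
from scipy.stats import entropy

def normalized_score_entropy(bm25_scores, global_max, global_min=0):
    """Entropy of the distribution of integer-converted, Min-Max-normalized scores.
    Scores above the global maximum are clipped to it."""
    norm = np.clip((np.asarray(bm25_scores) - global_min) / (global_max - global_min), 0, 1)
    ints = (norm * 100).astype(int)
    _, counts = np.unique(ints, return_counts=True)
    return entropy(counts / counts.sum())

rng = np.random.default_rng(0)
scores = rng.gamma(shape=2.0, scale=6.0, size=10000)  # synthetic BM25-like scores
for gmax in range(10, 101, 10):
    print(gmax, round(normalized_score_entropy(scores, gmax), 3))
```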

Table 11 Analyzing the relationship between entropy of normalized scores and effectiveness of MiniLM\(_{\textrm{BM25CAT}}\) trained on them

5.11 How do CE\(_{\textrm{BM25CAT}}\) and BM25 rankings vary?

To interpret the difference between CE\(_{\textrm{CAT}}\) and CE\(_{\textrm{BM25CAT}}\) in more depth, inspired by Rau et al. [102], we plot the proportions of overlapping documents between BERT-Base\(_{\textrm{CAT}}\) and BM25, and between BERT-Base\(_{\textrm{BM25CAT}}\) and BM25, in Figs. 6 and 7. Intuitively, each row indicates to what extent documents stem from different rank ranges. For example, the top row can be read as follows: the documents at ranks 1-10 of the CE\(_{\textrm{CAT}}\) re-ranking originate 34% from ranks 1-10, 40% from ranks 11-100, 20% from ranks 101-500, and 5.6% from ranks 501-1000 in the initial BM25 ranking. In Fig. 7, we observe that there is more similarity between the rankings of CE\(_{\textrm{BM25CAT}}\) and BM25 than between those of CE\(_{\textrm{CAT}}\) and BM25. This could be one reason why CE\(_{\textrm{BM25CAT}}\) is a more powerful exact matcher.
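A sketch of how such rank-range overlap proportions can be computed from two rankings of the same candidate set is given below; the rank buckets follow the figure (1-10, 11-100, 101-500, 501-1000), and the function name and toy ids are ours.

```python
def overlap_proportions(reranked_ids, bm25_ids, top_k=10,
                        buckets=((1, 10), (11, 100), (101, 500), (501, 1000))):
    """For the top_k documents of the re-ranking, report which fraction originated
    from each rank range of the initial BM25 ranking."""
    bm25_rank = {doc_id: r + 1 for r, doc_id in enumerate(bm25_ids)}
    counts = {b: 0 for b in buckets}
    for doc_id in reranked_ids[:top_k]:
        r = bm25_rank[doc_id]
        for lo, hi in buckets:
            if lo <= r <= hi:
                counts[(lo, hi)] += 1
                break
    return {b: c / top_k for b, c in counts.items()}

# Illustrative usage: document ids are ranked lists produced by the re-ranker and BM25.
bm25_ids = list(range(1000))
reranked_ids = [3, 42, 7, 150, 880, 12, 99, 640, 1, 27]
print(overlap_proportions(reranked_ids, bm25_ids))
```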

Fig. 6 Proportions of overlapping documents between the BERT-Base\(_{\textrm{CAT}}\) and BM25 rankings

Fig. 7 Proportions of overlapping documents between the BERT-Base\(_{\textrm{BM25CAT}}\) and BM25 rankings

6 Conclusion and future work

In this paper, we have proposed an efficient and effective way of combining first-stage retrievers and cross-encoder re-rankers. Prior research has primarily focused on the independent optimization of retrieval stages or simple linear combinations of scores from different models. Our approach deviates from these traditional methods by introducing a non-linear and continuous strategy for score integration, by injecting the first-stage retriever score as text in the input of the cross-encoder re-rankers. We find that the resulting models, CE\(_{\textrm{BM25CAT}}\) and CE\(_{\textrm{DPRCAT}}\), achieve a statistically significant improvement for all evaluated cross-encoders. Furthermore, the generalizability of our approach is demonstrated across various query types.

Our research builds upon the foundations of previous work that suggested the capability of BERT-based models in processing numeric data in textual representation. Our work provides a robust empirical example of the application of this ability of BERT-based models in information retrieval. We also found that injecting the BM25 or DPR relevance score in a knowledge distillation training setup can lead to statistically significant improvements.

Based on the experiments on injecting two different first-stage retrievers in two different training setups, we conclude that the first-stage retriever relevance score is an impactful and readily available signal whose injection leads to significant improvements in the effectiveness of cross-encoder re-rankers.

In conclusion, this work contributes to the information retrieval community by offering a refined, empirically validated method for the integration of first-stage retriever scores into cross-encoder re-rankers that might open new avenues for future research, potentially improving existing paradigms and encouraging a reevaluation of current retrieval practices. It provides a step forward in the development of more effective multi-stage retrieval systems. Future research, inspired by our findings, could further explore the implications of our approach in other contexts and for other types of data, e.g., score injection for first-stage dense passage retrievers.