1 Introduction

Pretrained Language Models (PLMs), such as BERT (Devlin et al., 2019), ELECTRA (Clark et al., 2020) and T5 (Raffel et al., 2020), have become the core components for building highly effective ranking models. The success of PLMs is largely owed to heavy pre-training on language modeling objectives on the one hand, and to learning deeply-contextualized representations of input sequences using the transformer architecture (Vaswani et al., 2017) on the other. Thanks to the fine-tuning strategy and the availability of large publicly-released training datasets, applying a PLM to document ranking is straightforward. Nogueira and Cho (2019) were the first to propose a simple application of BERT to text ranking by fine-tuning on the large public MS MARCO (Nguyen et al., 2016) dataset. In this work, BERT was deployed as a relevance classifier trained to estimate the probability that each document is “relevant” w.r.t. a given query.

Table 1 Extracts from top ranked passages by Vanilla BERT for the query: “causes of left ventricular hypertrophy”

Compared to the first wave of neural ranking models including DRMM (Guo et al., 2016), DUET (Mitra et al., 2017), and KNRM (Xiong et al., 2017), referred to as pre-BERT models, BERT and its variants do not appear to require any specialized neural architectural components to capture different aspects of relevance between a query and a document (Lin et al., 2020). The same architecture based on homogeneous transformer layers is employed regardless of the downstream task. Qiao et al. (2019) studied the behaviour of BERT for ranking and revealed that it focuses more on document terms that directly match the query. Compared to pre-BERT models such as Conv-KNRM (Dai et al., 2018) that prefer terms related to the query, BERT’s pretraining on surrounding contexts favors text sequence pairs that are closer in their semantic meaning (Qiao et al., 2019). Qiao et al. (2019) conclude that BERT can be considered an interaction-based sequence-to-sequence soft matching model that owes its effectiveness to the transformer’s cross-match attention. While soft semantic matching is, undeniably, a valuable signal for relevance that alleviates the vocabulary mismatch problem, a ranking model needs proper handling of exact matching cues as well (Guo et al., 2016; Mitra et al., 2017; Luan et al., 2020). Let us take, as an example, the query “Causes of left ventricular hypertrophy” from the MS MARCO passage ranking task. Table 1 reports extracts from the top passages retrieved by BERT. We can see that all top-ranked passages are related to “right ventricular hypertrophy” due to the soft matching between “left” and “right”. This example is a reminder of the importance of exact matching for relevance ranking. Boualili et al. (2020) suggest that a PLM like BERT can benefit from explicit exact matching signals for passage ranking. The authors propose MarkedBERT, a model that uses marker tokens to convey exact matches between the query and document terms in the input sequence. Special tokens, i.e., \([e_i]\) and \([/e_i]\), were added to the textual input sequence of BERT to indicate the start and the end, respectively, of terms that match exactly with the i-th term of the query. For example:

Query: Causes of \([e_2]\)left\([/e_2]\) \([e_3]\)ventricular\([/e_3]\) \([e_4]\)hypertrophy\([/e_4]\)

Passage: \([e_2]\)Left\([/e_2]\) \([e_3]\)ventricular\([/e_3]\) \([e_4]\)hypertrophy\([/e_4]\) can occur...

Exact term-matching integration via marking has proven to induce significant gains on the MS MARCO passage ranking task over “Vanilla” BERT (monoBERT) (Nogueira and Cho, 2019). Analysis of the attention shows that marker tokens bring more focus to the exact matches, allowing more relevant documents to be ranked higher. Table 2 shows extracts from the top-ranked passages returned by MarkedBERT for the query “Causes of left ventricular hypertrophy”, where more passages are related to “left ventricular hypertrophy” without introducing an explicit bias, since passage 47203, ranked first by BERT, is still ranked high (second) by MarkedBERT.

Table 2 Extracts from top ranked passages by MarkedBERT for the query: “causes of left ventricular hypertrophy”

In this work, we follow the same hypothesis, namely that exact matching cues can enhance PLMs, and extend the previously proposed marking-based approach to ad hoc document ranking. We introduce new simple marking strategies to identify which aspects of exact match marking matter for ad hoc document ranking: Does the model require marking both the query and document segments, or is marking the document enough? Does the model require query-term identification in the marker, or is using the same marker for all query terms enough? And which combination works best? We conduct extensive experiments to determine the contribution of exact match marking for the most widely used PLM, BERT, and for the more recent and effective ELECTRA model on standard ad hoc benchmarks. We empirically demonstrate the effectiveness of explicit exact match marking across different experimental scenarios including in-domain, zero-shot transfer and multi-phase fine-tuning settings. Since our approach aims at injecting an established traditional IR cue into recent pretrained transformers, we also study the effectiveness of our models when interpolating the traditional BM25 scores. We find that the best match scores obtained by BM25 are still valuable since they contribute to the end-to-end effectiveness. Furthermore, the marking-based models require less intervention from BM25 scores to achieve better ranking performance than the vanilla baseline.

Our main contributions can be summarized as follows:

  • We present, to our knowledge, the first work investigating the impact of exact match integration into BERT for long document ranking.

  • We extend the idea of exact match marking by introducing a new simple and unique marker token for highlighting all the exact term-matches without distinction, and by exploring two marking levels: document and pair marking.

  • We conduct extensive experiments to evaluate the effectiveness of our proposed marking strategies on in-domain data using the MS MARCO document ranking benchmark, and zero-shot generalizability to out-of-domain data using the standard TREC ad hoc Robust04 and GOV2 benchmarks.

  • We investigate the impact of short keyword queries vs. long natural language descriptions and propose a hybrid pipeline taking advantage of both the retriever and ranker strengths.

  • We study the contribution of exact match scores from a bag-of-words model to the out-of-domain effectiveness of our models.

  • We study the contribution of multi-phase fine-tuning with additional in-domain fine-tuning to the out-of-domain performance.

  • We evaluate the robustness of our approach by considering different PLMs: BERT and ELECTRA.

  • We compare our best configurations with diverse state-of-the-art approaches.

  • We publish our source code as well as our ready-to-use checkpoints at: https://github.com/BOUALILILila/ExactMatchMarking

2 Background and related work

In this paper, we focus on ad hoc document retrieval (also referred to as document ranking) over corpora comprising either news articles or web pages. Following the standard formulation: given a corpus of documents C, potentially large, the task of a ranking system is to produce a ranked list of k documents from the corpus in response to a user’s information need expressed as a query q.

2.1 Exact matching in pre-BERT models

Deep Learning approaches have steadily grown in popularity since their introduction in IR over a decade ago. Even though Learning to Rank reached its zenith early in the 2010s (Liu, 2009; Li, 2011), its use of discrete hand-crafted features, numbering in the hundreds or more, was a major limitation. The promise of Deep Learning models was precisely to obviate the need for such costly manually-engineered features by relying on neural networks and continuous vector representations. Soon, numerous neural ranking models emerged, such as DRMM (Guo et al., 2016), DUET (Mitra et al., 2017), KNRM (Xiong et al., 2017) and Conv-KNRM (Dai et al., 2018). We do not have sufficient space to thoroughly review early neural ranking models and therefore refer the reader to existing overviews (Mitra et al., 2018; Onal et al., 2018). Aside from the models that were specifically designed for document ranking, models from the NLP community built for semantic similarity share some architectural similarities, and there has been cross-fertilization between NLP and IR (Lin et al., 2020). This interaction led IR researchers to realise that relevance matching and semantic matching (e.g., sentence similarity) are different tasks (Guo et al., 2016). While the former requires proper handling of exact matching signals, the latter requires accurately capturing semantics. Thus, neural ranking models required new architecture designs to handle both semantic and exact matching signals. Mitra et al. (2017) proposed a duet architecture composed of two deep neural networks: a local model that captures exact matching signals and a distributed model for semantic matching. Despite the reported successes of these neural models, there has recently been some skepticism about whether these successes, in the absence of large amounts of data, are merely inflated by comparison to weak baselines. The study conducted over 100 papers by Yang et al. (2019) on the Robust04 dataset showed that most models failed against a strong non-neural baseline (RM3 (Lavrenko and Croft, 2001)).

2.2 PLMs for multi-stage reranking

Recently, the inception of the transformer architecture (Vaswani et al., 2017) instigated a new wave of approaches (Nogueira and Cho, 2019; MacAvaney et al., 2019; Akkalyoncu Yilmaz et al., 2019) that, at last, were able to significantly outperform well-tuned traditional IR baselines such as RM3 (Lavrenko and Croft, 2001). Nogueira and Cho (2019) describe the first successful application of BERT (Devlin et al., 2019), known as monoBERT, to passage reranking, where the ranking task is modeled as a binary classification problem over individual candidate passages. This work marks the beginning of the “BERT revolution”. The results of the TREC Deep Learning Track 2019 (Craswell et al., 2020) clearly demonstrated the effectiveness of BERT-based models and revealed a significant gap with pre-BERT models. Regardless of its effectiveness, BERT has a key limitation for document ranking: it cannot handle input sequences longer than 512 tokens. In order to address this challenge, Yang et al. (2019) apply inference on sentences individually, and then interpolate the original document score (obtained by a traditional ranker) with the weighted top n sentence scores to rerank the documents. Following the same strategy, Birch (Akkalyoncu Yilmaz et al., 2019) reports state-of-the-art effectiveness on the TREC newswire test collections Robust04, Core17 and Core18 using monoBERT fine-tuned exclusively on out-of-domain passage-level datasets (TREC Microblog, MS MARCO and TREC CAR). Their experiments demonstrate that relevance models can be transferred across different domains, which circumvents the lack of passage-level relevance annotations in the target domain. Similarly, Dai and Callan (2019) use passage-level evidence to fine-tune BERT by considering all passages from a relevant document as relevant. For inference, the document is split into overlapping passages and each passage is scored individually. Document scores based either on the score of the first passage, the best passage, or the sum of all passage scores were investigated, and the simple best-passage score was found to be the best approach (BERT-MaxP). This was the first work to highlight BERT’s capacity to exploit linguistically rich descriptions, as opposed to previous keyword search techniques. MacAvaney et al. (2019) propose a new approach (CEDR) that incorporates BERT’s classification token [CLS], which encodes a representation of the full input, into existing pre-BERT neural IR models. The authors show that this joint approach outperforms a vanilla BERT ranker. Instead of aggregating the scores of individual passages as in Birch and BERT-MaxP, Parade (Li et al., 2020) aggregates the passage representations. This yields an end-to-end differentiable model like CEDR but without the use of pre-BERT models. In order to obtain the document representation, several aggregation methods were investigated, and using a small stack of transformer encoders was found to be the best method. Arguing that exact matching is a valuable cue for ranking, Boualili et al. (2020) propose a new adaptation of monoBERT, entitled MarkedBERT, that uses a marking technique to highlight exact match signals in the input sequence. The authors demonstrate the effectiveness of MarkedBERT on the MS MARCO passage ranking task and confirm, through attention analysis, that marker tokens bring focus to exact matching terms.
Beyond BERT, Nogueira et al. (2020) report new state-of-the-art effectiveness on Robust04 using a novel adaptation of the pretrained sequence-to-sequence model T5 (Raffel et al., 2020) to the document ranking task. This generation-based approach proved to be more effective than BERT in the data-poor regime where training data is limited.

2.3 PLMs for sparse and dense retrieval

The commonly adopted monoBERT approach feeds the concatenated query-document text through BERT and uses BERT’s [CLS] output token to produce a relevance score. These PLM rerankers compute full cross-attention between contextualized token representations, and are thus referred to as cross-encoders. However, their cross-attention operations are too expensive for full-collection retrieval. To overcome this challenge, a line of work resorted to augmenting lexical retrieval with PLMs. Nogueira et al. (2019) propose DocT5Query, a document expansion technique for reducing the vocabulary gap between queries and documents. The idea is to train a sequence-to-sequence model (T5 (Raffel et al., 2020)) that, given a document from a corpus, produces queries for which that document might be relevant. Dai and Callan (2019) propose a different framework, DeepCT, for estimating a term’s context-specific importance based on contextual embeddings from BERT. These term importance weights are then mapped to integers so that they can be directly interpreted as term frequencies, replacing term frequencies in a standard bag-of-words inverted index.

Another line of research proposes bi-encoders as an alternative, trading off the higher effectiveness of cross-encoders for improved efficiency by encoding the query and document separately. Single-vector systems encode each query and each document into a single dense vector, and relevance is modeled as a simple measure of vector similarity (Reimers and Gurevych, 2019; Karpukhin et al., 2020; Xiong et al., 2021). MacAvaney et al. (2020) propose PreTTR, a hybrid between bi- and cross-encoders obtained by eliminating cross-attention in some layers of a cross-encoder model. Luan et al. (2020) raise the limited capacity of single-vector representations to support retrieval of long documents and propose ME-BERT, which encodes documents into a set of vectors. Similarly, the poly-encoder (Humeau et al., 2020) encodes queries into a set of vectors. Following the same paradigm, ColBERT (Khattab and Zaharia, 2020) represents both queries and documents with token-level vectors and estimates relevance using a late-interaction mechanism capturing rich interactions between the two sets of vectors. However, encoding documents with all their tokens imposes an order-of-magnitude larger index than all previous models.

For an exhaustive review of all research lines using BERT-like models we refer readers to this recent survey (Lin et al., 2020).

2.4 Understanding BERT’s success

In light of the improvements brought by BERT to a wide range of IR tasks, many researchers have investigated the reasons behind such substantial improvements. Padigela et al. (2019) empirically study a set of hypotheses and show that BM25 is biased towards high query term frequency, which hurts its performance, while BERT retrieves passages with more novel words. However, they find that BERT fails at capturing the query context for long queries. Dai and Callan (2019) demonstrate that, unlike traditional IR models, BERT takes advantage of stop words and punctuation thanks to its capacity to model language structure. Qiao et al. (2019) show that BERT is an interaction-based model (Guo et al., 2016) whose advantage lies in the cross query-document attentions; discarding these cross-sequence interactions leads to performance close to random. They also find that BERT assigns extreme matching scores to query-document pairs, with most pairs receiving a ranking score of either one or zero, showing it is well tuned by pre-training on large corpora. Câmara and Hauff (2020) analyze BERT using diagnostic datasets built from retrieval heuristics (Rennings et al., 2019). Their experiments show that BERT does not fulfil most retrieval heuristics created by IR experts, and they argue that these axioms are not suitable for understanding BERT’s performance. MacAvaney et al. (2020) introduce ABNIRML, a new framework for analysing the behavior of neural IR models. The authors find that neural ranking models have fundamentally different characteristics from prior ranking models, such as a high sensitivity to word order and increasing relevance scores when non-relevant content is added to the document.

Our work falls into the category of cross-encoders, and this paper represents, to the best of our knowledge, the first work detailing a general approach for highlighting exact matching signals to enhance contextualized pretrained language models such as BERT, together with an exhaustive set of experiments on long-document ranking benchmarks. Using a marking technique to emphasize exact term matches in the query-document pair was first proposed in our own previous work, MarkedBERT (Boualili et al., 2020), which represents our initial study. That study was limited to one marking technique on a passage ranking task with a weak training regime, raising the question of its full potential (Lin et al., 2020). Aside from MarkedBERT, marking techniques were mentioned in the descriptions of our TREC-COVID challenge (Voorhees et al., 2021) submissions. The present work is a generalisation of the approach followed in MarkedBERT, in which we give a complete description of our ideas and a comprehensive evaluation on in- and out-of-domain TREC ad hoc benchmarks with a better training regime, making it directly comparable to state-of-the-art models.

3 Augmenting pretrained contextualized language models with exact match signals

In this section, we first describe the general architecture we adopt in this work and then present the marking strategies we propose to explicitly highlight exact term matches in the query-document pairs before feeding them to the BERT model. We consider the traditional formulation of exact matching where two terms \(t_1\) and \(t_2\) match exactly if their stems are identical. We use the Porter algorithm for stemming, and stop words are not considered during marking. By adding explicit indications of exact matching signals to the textual inputs, the models can benefit from this traditional hint and adapt better to the ad hoc task.
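The following is a minimal sketch of this exact-match test: two terms match exactly if their Porter stems are identical, and stop words are never marked. NLTK's PorterStemmer and the small stop-word set are assumed stand-ins for illustration, not the exact resources used in our implementation.

```python
# Sketch of the exact-match definition used throughout this section.
from nltk.stem import PorterStemmer

_stem = PorterStemmer().stem
STOP_WORDS = {"of", "the", "a", "an", "in", "to", "and", "or"}  # illustrative subset

def exact_match(t1: str, t2: str) -> bool:
    """True if t1 and t2 have identical Porter stems (case-insensitive)."""
    return _stem(t1.lower()) == _stem(t2.lower())
```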

Fig. 1 BERT sentence-pair classification architecture (Devlin et al., 2019) used in vanilla BERT/monoBERT (Nogueira and Cho, 2019)

3.1 Model architecture

We adopt the model configuration described by Nogueira and Cho (2019) referred to as monoBERT or vanilla BERT. In this configuration, BERT is applied as a binary relevance classifier for text ranking. The architecture of the model is shown in Fig. 1. Using the same notation as Devlin et al. (2019), the query q is fed as Segment A and the candidate document d as Segment B. The special token [CLS] is prepended to the input sequence, and the special delimiter token [SEP] is placed at the beginning and end of the document segment to build the input sequence S as follows:

$$\begin{aligned} S= [[CLS],Q,[SEP],D,[SEP]] \end{aligned}$$
(1)

where Q and D represent the sequences of tokens obtained after applying the WordPiece tokenizer to the query q and document d texts, respectively.

Once the sequence S is passed through BERT, the final vector representation C of the standard classification token [CLS], which captures the interaction between the query and the document, is used as input to a single-layer neural network that estimates a score R(d, q) quantifying how relevant the candidate document d is to the query q. That is:

$$\begin{aligned} R(d,q) = P(Relevant=1|q,d) \end{aligned}$$
(2)

The details of the fine-tuning and inference process are given in Sect. 4.
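As an illustration of Eqs. (1) and (2), the sketch below shows a monoBERT-style relevance scorer using the Hugging Face Transformers sequence-classification head as a stand-in for the single-layer network over the [CLS] vector; the checkpoint name and truncation policy are assumptions for illustration, not our exact setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def relevance_score(query: str, document: str) -> float:
    """Build S = [[CLS], Q, [SEP], D, [SEP]] and return P(Relevant = 1 | q, d)."""
    inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```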

Table 3 Example of the proposed marking strategies applied to the query Q: “causes of left ventricular hypertrophy”, and the document D: “Left ventricular hypertrophy can occur when some factor ...”

3.2 Exact match marking

We propose different marking strategies that intervene only at the textual input level to augment the input sequence S defined in Eq. (1). Instead of altering the model’s architecture to integrate the desired traditional signal, we prefer to let the model learn how to use the given hints and avoid the risk of introducing a systematic bias towards exact term matching. A marking strategy is defined by two parameters:

  1. Marker-token type: we introduce two types of marker tokens, namely Simple Markers and Precise Markers.

  2. Marking level: we investigate two levels of marking: Document Marking and Pair Marking.

Table 3 illustrates the four marking strategies that can be defined by combining the two marker-token types with the two marking levels. Note that the Pre-Pair marking strategy corresponds to the strategy used in the MarkedBERT model (Boualili et al., 2020).

3.2.1 Marker-token type

We consider two types of marker tokens in order to investigate whether distinguishing query terms is important to model performance.

Simple Markers. Uses a single, unique marker (#) for all query terms, without explicit distinction. Considering a query \(Q= \{q_1,\dots , q_{|Q |}\}\) whose terms \(q_n\) and \(q_m\), with \(1< n< m <|Q |\), occur in the document and thus have to be marked, we obtain the marked query segment \(\tilde{Q}\) as follows:

$$\begin{aligned} \tilde{Q} = \{q_1,\dots ,\# q_n \#,\dots ,\# q_m \#,\dots ,q_{|Q |}\} \end{aligned}$$
(3)

Precise Markers. Uses newly introduced tokens \([e_k]\) and \([/e_k]\), where \(k \in \{1,\dots , |Q |\}\) identifies the query term, to mark the start and the end of each matched term, respectively. This marking technique associates each unique query term \(q_k\) with a unique pair of marker tokens \([e_k]\) and \([/e_k]\) that identifies it and its occurrences. If a term is repeated in the query, all its occurrences are highlighted using the same identifier, i.e., that of its first occurrence. For example, the query Q described in the previous paragraph would be marked with precise markers as follows:

$$\begin{aligned} \tilde{Q} = \{q_1,\dots ,[e_n] q_n [/e_n],\dots ,[e_m] q_m [/e_m],\dots ,q_{|Q |}\} \end{aligned}$$
(4)
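Since the precise markers \([e_k]\) and \([/e_k]\) are unknown to BERT’s vocabulary, they must be registered as new tokens. The sketch below shows one common way to do so with the Hugging Face tokenizer (an assumed implementation detail, not necessarily our exact code); the new embeddings are then learned during fine-tuning.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Register precise markers for queries of up to 20 terms (illustrative bound).
markers = [f"[e{k}]" for k in range(1, 21)] + [f"[/e{k}]" for k in range(1, 21)]
tokenizer.add_tokens(markers, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))  # embeddings for the new tokens are trained
```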

3.2.2 Marking level

In order to better understand whether it is relevant to mark both the query and the document segments or the document segment only, we investigate two marking levels: Document and Pair marking. In the former, the occurrences of query terms in the document are marked in the document segment only, while in the latter, the exact matching terms are marked in both the document and query segments, as shown in Table 3. We use the same notation defined in the model’s architecture, where Q refers to the query segment and D refers to the document segment that constitute the input sequence S.

Document marking. Only the document segment D is augmented with marker tokens indicating the start and the end of each query-term occurrence in the document. Considering a query \(Q= \{q_1,\dots , q_{|Q |}\}\) and a document \(D= \{d_1,\dots , d_{|D |}\}\), if \(\{d_i, d_j\}\) are occurrences of query term \(q_n\) and \(d_l\) is the only occurrence of \(q_m\) in D, with \(1<n<m<|Q |\) and \(1<i<j<l<|D |\), the augmented query and document sequences \(\tilde{Q}\) and \(\tilde{D}\), respectively, are as follows when using the simple markers:

$$\begin{aligned} \tilde{Q}&= \{q_1,\dots ,q_n,\dots ,q_m,\dots ,q_{|Q |}\} \\ \tilde{D}&= \{d_1,\dots ,\# d_i \#,\dots ,\# d_j \#,\dots ,\# d_l \#,\dots ,d_{|D |}\} \end{aligned}$$

and as follows when using the precise markers:

$$\begin{aligned} \tilde{Q}&= \{q_1,\dots ,q_n,\dots ,q_m,\dots ,q_{|Q |}\} \\ \tilde{D}&= \{d_1,\dots ,[e_n] d_i [/e_n],\dots , [e_n] d_j [/e_n],\dots ,[e_m] d_l [/e_m],\dots ,d_{|D |}\} \end{aligned}$$

Pair marking. Both the query and document sequences are augmented with marker tokens indicating the start and the end of each exactly matched term between the query and the document. In our experiments, a query term with no occurrence in the document is not marked. Considering the same example as for the Document marking level, the augmented query and document sequences \(\tilde{Q}\) and \(\tilde{D}\), respectively, are as follows when using the simple markers (a code sketch of all four strategies follows the formal definitions below):

$$\begin{aligned} \tilde{Q}&= \{q_1,\dots ,\# q_n \#,\dots ,\# q_m \#,\dots ,q_{|Q |}\} \\ \tilde{D}&= \{d_1,\dots ,\# d_i \#,\dots ,\# d_j \#,\dots ,\# d_l \#,\dots ,d_{|D |}\} \end{aligned}$$

and as follows when using the precise markers:

$$\begin{aligned} \tilde{Q}&= \{q_1,\dots ,[e_n] q_n [/e_n],\dots ,[e_m] q_m [/e_m],\dots ,q_{|Q |}\} \\ \tilde{D}&= \{d_1,\dots ,[e_n] d_i [/e_n],\dots , [e_n] d_j [/e_n],\dots ,[e_m] d_l [/e_m],\dots ,d_{|D |}\} \end{aligned}$$
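The following is an illustrative implementation of the four marking strategies of Table 3. It operates over whitespace-tokenized text for readability; in the actual models, marking is applied to the raw text before WordPiece tokenization, and the precise-marker indexing scheme shown (1-based, first occurrence wins) is an assumption consistent with the description above rather than our exact code.

```python
from nltk.stem import PorterStemmer

_stem = PorterStemmer().stem
STOP_WORDS = {"of", "the", "a", "an", "in", "to", "and", "or"}  # illustrative subset

def mark(query_terms, doc_terms, marker="simple", level="pair"):
    """Return the marked query and document term sequences (Q~, D~)."""
    # Stem of each non-stop-word query term -> index of its first occurrence.
    q_index = {}
    for k, q in enumerate(query_terms):
        if q.lower() not in STOP_WORDS:
            q_index.setdefault(_stem(q.lower()), k)
    # Document positions whose stem matches a query term.
    doc_to_q = {j: q_index[_stem(d.lower())] for j, d in enumerate(doc_terms)
                if _stem(d.lower()) in q_index}
    matched_q = set(doc_to_q.values())

    def wrap(term, k):
        if marker == "simple":                       # same token (#) for every match
            return f"# {term} #"
        return f"[e{k + 1}] {term} [/e{k + 1}]"      # precise markers identify the term

    marked_d = [wrap(d, doc_to_q[j]) if j in doc_to_q else d
                for j, d in enumerate(doc_terms)]
    if level == "doc":                               # Document marking: query untouched
        return list(query_terms), marked_d
    marked_q = [wrap(q, k) if k in matched_q else q  # Pair marking: both segments marked
                for k, q in enumerate(query_terms)]
    return marked_q, marked_d
```

For instance, calling `mark` on the query and document of Table 3 with `marker="precise"` and `level="pair"` reproduces the Pre-Pair row, up to the exact marker indices.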

4 Experimental setup

This section describes the experimental setup used to study the effectiveness of our models for document ranking. We present the detailed fine-tuning of our models on the large-scale MS MARCO passage dataset, describe the MS MARCO document ranking benchmark used for in-domain evaluations, and describe the standard TREC Robust04 and GOV2 benchmarks used for studying the out-of-domain transfer capabilities of our models. We further describe the inference process and the diverse state-of-the-art baselines we use to comparatively evaluate our approach. We report results using the official metrics of each collection, namely nDCG@10 and MAP@100 for the MS MARCO document ranking collection in the context of the TREC Deep Learning 2019 and 2020 tracks, and nDCG@20 and P@20 for Robust04 and GOV2, thus enabling direct comparisons with previous work.

4.1 Datasets

We conduct experiments on two standard ad hoc benchmarks: Robust04 and GOV2. In addition to these traditional benchmarks, we use the recent TREC Deep Learning (DL) Document Ranking benchmark from the 2019 and 2020 tracks. Robust04 is a newswire collection comprising 500K documents (TREC Disks 4 and 5) and 249 judged topics. Each topic is composed of three fields: the “title” is a short keyword query, the “description” is a longer well-formed natural language sentence that describes the information need, and the “narrative” is a paragraph that provides guidance for relevance assessment. Table 4 provides an example of a TREC Robust04 topic. GOV2 is a web collection crawled from government websites in early 2004, comprising 25M documents and only 149 topics in the same format as Robust04 topics, with title, description and narrative. Documents in the GOV2 corpus are on average much longer than those in the Robust04 corpus; see Table 5. The MS MARCO Document Ranking dataset is a benchmark for web search used in the TREC DL 2019-2020 tracks (Craswell et al., 2020, 2021). The dataset contains more than 3M documents composed of three fields: title, URL and body. Dense NIST judgments are provided for 43 and 45 topics for DL 2019 and 2020, respectively.

Table 5 summarizes some statistics of the evaluation benchmarks.

Table 4 Example of Robust04 search topic: Topic 302
Table 5 Benchmarks statistics

4.2 Baselines

We compare our models against diverse baselines including: Traditional non-neural approaches also known as Lexical Retrieval methods, sparse retrieval approaches, dense retrieval models (bi-encoders), and strong reranking models (cross-encoders).

4.2.1 Lexical retrieval baselines

  • BM25, we use the Anserini (Yang et al., 2017) implementation with default parameters. For description queries, we set \(k_1=0.9\) for Robust04 and \(k_1=2.0\) for GOV2 and \(b=0.6\) for both datasets. This unsupervised model serves both as a baseline and as the first stage retriever in all our experiments.

  • BM25+RM3, a query expansion model based on RM3 (Lavrenko and Croft, 2001) considered as a strong non-neural baseline. We use the Anserini (Yang et al., 2017) implementation with the default parameters. For description queries, we use 20 expansion terms following (Li et al., 2020).

4.2.2 Sparse retrieval baselines

  • DeepCT (Dai and Callan, 2020), we report results on Robust04 and GOV2 obtained using the BOW+DeepCT-Query model (Dai and Callan, 2019), and use the re-weighted MS MARCO documents provided by the authors using the HDCT model (Dai and Callan, 2020), in combination with Anserini’s BM25 with default parameters, for the TREC DL 2019 and 2020 evaluations.

  • DocT5Query (Nogueira et al., 2019), following the paper’s setup, we generate 40 expansion queries per document and use Anserini’s BM25 with default parameters. Due to the large size of the GOV2 collection (see Table 5) and the high computational cost of DocT5Query, we do not report results on this collection.

4.2.3 Dense retrieval baselines

  • DPR (Karpukhin et al., 2020), we use DPR as a retriever with the open-source implementation from the Transformers library (Wolf et al., 2020) and the publicly released DPR checkpoints for the Query and Context encoders.

  • ANCE (Xiong et al., 2021), we use ANCE as a retriever with the Sentence Transformers library (Reimers and Gurevych, 2019) and the publicly released checkpoint.

  • ColBERT (Khattab and Zaharia, 2020), we use ColBERT as a dense retriever using the authors’ released code: after encoding the whole collection, we use the top-1000 documents retrieved using ANN with faiss (Johnson et al., 2017) and rerank them using ColBERT’s late-interaction operation. Considering the size of the GOV2 collection (25M documents) and the large space footprint of ColBERT indexes, we could not produce results on GOV2.

4.2.4 Reranking baselines

  • Vanilla baseline, the vanilla monoBERT model is our main baseline since it represents the core model we augment with explicit exact match cues in our proposed models. The vanilla baseline as well as our models share the same configuration and evaluation setup making it suitable for evaluating the impact of exact match marking.

  • Birch (MS) and Birch (MS-MB) (Akkalyoncu Yilmaz et al., 2019), the notation in parentheses indicates the fine-tuning dataset(s): MS for MS MARCO, and MS-MB refers to the model fine-tuned first on MS MARCO and then further fine-tuned on Microblog (MB) data. We use the results reported by Li et al. (2020), which use BM25 instead of BM25+RM3 as the first-stage retriever.

  • BERT-MaxP (MS) (Dai and Callan, 2019), we report the results obtained with the re-implementation by Li et al. (2020), where the results are improved by using a BERT model fine-tuned on MS MARCO rather than the Bing search log.

  • Parade (Li et al., 2020), we report results obtained using both the BERT and ELECTRA variants from the paper.

  • T5 (Nogueira et al., 2020), the T5 model, also known as monoT5, with 3B parameters holds the state-of-the-art across many ad hoc benchmarks such as Robust04. We report the original results from the paper.

4.3 Training

We use the base version (12 layers, 768 hidden size, 12 heads, 110M parameters) of BERT due to hardware limitations. We fine-tune both our vanilla baseline and our models augmented with the different marking strategies on the large publicly released MS MARCO passage dataset. We use a batch size of 128 and the maximum sequence length (\(128~sequences \times 512~tokens = 65536~tokens/batch\)) for 100k steps on free Google Colab TPUs. We use the Adam optimizer (Kingma and Ba, 2015) with the initial learning rate set to \(3e^{-6}\) and linear decay of the learning rate. The dropout rate is set to 0.1 for all our experiments. We use the open-source implementation of BERT by Hugging Face (Wolf et al., 2020). It is important to note that fine-tuning an augmented model with a marking strategy does not add a computational cost compared to the vanilla model.
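A compact sketch of this optimizer and schedule configuration is shown below (Adam, initial learning rate 3e-6 with linear decay over 100k steps); the TPU training loop and the MS MARCO data pipeline are omitted, and the 0.1 dropout corresponds to the BERT-base default.

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-6)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=100_000)
# Each step consumes a batch of 128 (query, passage) pairs of up to 512 tokens:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```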

4.4 Inference

We use a two-stage ranking pipeline. We retrieve an initial candidate list of the top 1,000 documents per query using BM25. We use the BM25 implementation from the off-the-shelf Anserini open-source IR toolkit (Yang et al., 2017).

The length of BERT’s input sequence cannot exceed 512 tokens because the positional embeddings were trained on sequences of a maximum length of 512 tokens. This limitation prevents us from directly applying our models to long documents. Following the strategy proposed by Dai and Callan (2019), we split each document into overlapping passages that can be handled individually by BERT. For Robust04 and GOV2, passages are generated using a sliding window of 150 words and a stride of 75 words, formally expressed as \(d=\{p_1,..., p_n\}\) where n is the number of passages in the document d. As a trade-off between latency and effectiveness, we only consider a maximum of 30 passages per document. The first and last passages are always picked, while the remaining 28 are randomly chosen. The models fine-tuned exclusively on out-of-domain data are then used to predict the relevance of each passage w.r.t. a query q independently. The best-scoring passage is then taken as a proxy for the document-level relevance:

$$\begin{aligned} R(d,q) = max(R(p_1,q), ..., R(p_n,q)) \end{aligned}$$
(5)

For the queries, we consider both the topic titles, which are preferred by most pre-BERT models including BM25, and the descriptions, which are more similar to MS MARCO’s natural language questions.

For the TREC DL document ranking evaluation, we split each document into overlapping passages with a maximum length of 384 tokens and a stride of 192, following the splitting strategy in Yan et al. (2019). In addition, the title is added to the beginning of every passage when available. As for Robust04 and GOV2, we use the best-scoring passage as a proxy for the whole document’s relevance.
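The sketch below illustrates the inference-time document handling for Robust04 and GOV2 under the settings above (sliding window of 150 words, stride 75, at most 30 passages with the first and last always kept, MaxP aggregation of Eq. (5)); `score_fn` stands for any query-passage scorer, e.g., the relevance classifier sketched in Sect. 3.1, and the sampling details are an illustrative assumption.

```python
import random

def split_passages(doc_words, window=150, stride=75, max_passages=30):
    """Overlapping passages; keep the first and last, sample the rest if too many."""
    passages = [" ".join(doc_words[i:i + window])
                for i in range(0, max(1, len(doc_words) - window + stride), stride)]
    if len(passages) > max_passages:
        sampled = random.sample(passages[1:-1], max_passages - 2)
        passages = [passages[0]] + sampled + [passages[-1]]
    return passages

def maxp_score(query, doc_words, score_fn):
    """Document relevance = best passage score, Eq. (5)."""
    return max(score_fn(query, p) for p in split_passages(doc_words))
```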

5 Results and analysis

We address our research questions in this section. First, we investigate the effectiveness of our proposed exact match marking strategies with a BERT core on in-domain data, i.e., the MS MARCO document ranking benchmark, and their robustness to out-of-domain collections, i.e., Robust04 and GOV2. Then, we study how to improve the domain-transfer capabilities of our models using score interpolation with a bag-of-words model. We further investigate the contribution of additional fine-tuning on limited target-domain data in a multi-phase fine-tuning setting, and how our exact match marking contributes in each phase. Finally, we verify the contribution of our exact match marking on the more effective ELECTRA model, and compare our best configurations to diverse state-of-the-art baselines.

5.1 Performance of the models augmented with exact match marking

We evaluate the contribution of our proposed exact match marking strategies and discuss our research question RQ1: Is exact match marking beneficial to pretrained transformers exemplified by BERT?, by comparing the augmented models with exact match marking to the vanilla baseline. We consider results in the in-domain setting with the MS MARCO Document dataset and in the zero-shot transfer setting on out-of-domain datasets, namely Robust04 and GOV2.

5.1.1 In-domain effectiveness

We re-rank the initial list of candidate documents retrieved by BM25 with RM3 query expansion, using all our models and the vanilla baseline. We report the performance on the TREC DL 2019 and 2020 test sets in Table 6, in terms of the official evaluation metrics: nDCG@10 and MAP@100.

Table 6 Reranking effectiveness on the TREC DL 2019 and DL 2020 Document ranking tasks

Comparison with baselines. Compared to BM25 and the first-stage retriever (BM25+RM3), all BERT-based models perform significantly better. Interestingly, the non-neural methods perform better on the DL 2020 test set, while the BERT-based models perform better on the DL 2019 test set. Adding exact match marking, regardless of the marking strategy, leads to better or at least the same performance as the vanilla baseline (which corresponds to a marking ablation). The Pre-Pair BERT model achieves the overall best performance on the DL 2019 test topics, and also on DL 2020 along with Sim-Pair BERT.

Impact of the marker type and marking level on performance. On TREC DL 2019, the pair marking strategy brings substantial gains in performance when used in combination with the precise marker type: Pre-Pair BERT achieves a \(+3.7\%\) relative gain over the Pre-Doc BERT model. In contrast, it leads to a drop in performance when combined with the simple marker: Sim-Pair BERT has a relative loss of \(-0.9\%\) compared to Sim-Doc BERT. Interestingly, on TREC DL 2020, using the Pair marking level has the same impact regardless of the marker type.

Marking both the query and document segments seems to be more beneficial considering results on both test collections. Using the precise marker type brings further gains in performance on DL 2019.

5.1.2 Out-of-domain effectiveness

We use the models fine-tuned exclusively on MS MARCO passages to rerank the documents retrieved by BM25 in the first stage. We do not train the models on the target collections (Robust04, GOV2); we use all their queries and relevance judgements as a held-out test set. This evaluation is thus an instance of a zero-shot transfer setting.

Table 7 Reranking effectiveness in the zero-shot transfer setting of the different models on Robust04 and GOV2 collections

Table 7 shows the reranking effectiveness of our different models and baselines on the top 1,000 candidate documents retrieved by BM25 from the Robust04 and GOV2 collections, using both the title and description fields of the TREC topics. We recall that titles are short keyword queries preferred by traditional bag-of-words models like BM25, and descriptions are well-written natural language queries similar to MS MARCO’s questions on which the BERT models are fine-tuned. We report results using the commonly used nDCG@20 and P@20 metrics to enable direct comparisons with previous work on these collections.

Comparison with baselines. All BERT-based models achieve substantially better performance on both collections compared to the traditional non-neural baselines, with the only exception of GOV2 titles. We observe a discrepancy in the impact of the exact match marking on GOV2 compared to Robust04. While all our models, except Sim-Doc BERT, significantly outperform the vanilla baseline on Robust04 descriptions or at least achieve similar performance on titles, our models have no significant impact on GOV2. Importantly, in no case does a marking-based model lead to a significant degradation of performance on GOV2. The disparity in the behavior of the models on the two benchmarks is probably due to the nature of the documents involved. While Robust04 comprises well-written news articles, GOV2 documents are web pages that include navigation bars, advertisements, tables and discontinuous text. The zero-shot domain transfer, from the MS MARCO fine-tuned models to Robust04 articles, seems to be more attainable than to GOV2 web pages, even though MS MARCO passages were extracted from the web. We hypothesise that further fine-tuning on domain-specific data may be required to learn better domain-specific text representations. We investigate this in-domain adaptation in Sect. 5.3.

Impact of the marker type and marking level on performance. On Robust04, marking both the query and the document (models based on pair marking) has more impact with the simple marker than with the precise marker. On the description queries, Sim-Pair BERT achieves an nDCG@20 of 0.4931 while Sim-Doc BERT reaches only 0.4166, and they achieve 0.4773 and 0.4447, respectively, on title queries. The marking level has a lower impact on the models using the precise markers (Pre-Doc BERT and Pre-Pair BERT), especially on descriptions. On the other hand, results on the GOV2 collection are rather mixed.

Marking both the query and the document segments with a simple marker (#) appears to be the best setting: Sim-Pair BERT has the best ranking accuracy among the four strategies tested, with clear margins on the Robust04 collection, especially on descriptions. We therefore continue our analysis using the Sim-Pair BERT strategy; the full results using all the marking strategies can be found in Appendix 1.

Table 8 Recall of BM25 on the Robust04 and GOV2 collections for both title and description queries

Title versus description queries. Since we are in a reranking configuration, it is important to note that the first-stage retriever BM25, like most pre-BERT ranking models, prefers short keyword queries to longer natural language descriptions (Dai and Callan, 2019; Nogueira et al., 2020). Table 8 shows the recall at rank 1,000 of BM25 for both title and description queries, where we notice a substantial difference in recall affecting the quality of the candidate documents that the reranking models receive. Despite this disadvantageous initialization, the reranking models manage to reduce the gap between title and description runs. The improvement rate over BM25 is much higher for description queries than for title queries on both collections, especially on GOV2, where vanilla BERT has a change rate of \(-5.0\%\) over BM25 on titles while it achieves a gain of over \(+10\%\) on descriptions. The descriptions, which are longer natural language queries carrying richer information that cannot be fully harnessed by the traditional bag-of-words method, are leveraged more effectively in the reranking stage. This ability of BERT was already noted in previous work (Dai and Callan, 2019), and Sim-Pair BERT follows the same preference, as it improves the search accuracy of the description runs more than that of the title runs. The overall performance reported for our model using descriptions clearly surpasses that obtained using titles, by \(+4.1\%\) on average, despite the lower recall in the initial stage.

Impact of the initial-stage retriever. Considering that the first-stage ranker BM25 has higher recall on title queries, and that the marking-based models prefer description queries, we propose a hybrid reranking pipeline in which the documents retrieved by BM25 using title queries are reranked by the BERT-based models using the description queries. This hybrid pipeline allows us to obtain a higher recall in the first stage, since BM25 performs better on short keyword queries, and thus better candidate documents for reranking. Description queries are longer statements of information needs, more suitable for pretrained reranking models to fulfill their potential. This pipeline remains realistic, as natural language queries may be generated from standard keyword queries (Padaki et al., 2020). This hybrid approach is also adopted in the recent state-of-the-art ranking model based on T5 (Nogueira et al., 2020).

Table 9 Reranking effectiveness in the zero-shot transfer setting of the different models on Robust04 and GOV2 collections using the hybrid pipeline

Table 9 shows the results obtained using the hybrid reranking pipeline on both test collections. Unsurprisingly, using better candidate documents for reranking with descriptions yields even better accuracy. The vanilla BERT model achieves an improvement rate of \(+14\%\) over BM25 on Robust04 and \(+3.4\%\) on GOV2 (we recall that BM25 results are obtained using titles). Adding exact match marking in the hybrid reranking pipeline outperforms the vanilla baseline on both collections, significantly so on Robust04 with a gain of over \(+8\%\).

5.1.3 In-domain versus out-of-domain effectiveness.

Results on both in-domain and out-of-domain benchmarks clearly indicate that exact match marking, apart from the Sim-Doc marking strategy which significantly underperforms the vanilla baseline on Robust04, is more beneficial than using the vanilla baseline. The Sim-Pair (especially for out-of-domain experiments) and Pre-Pair (especially for in-domain experiments) marking strategies appear to work best.

In the next two sections, we focus on out-of-domain effectiveness and study common techniques used in the literature to enhance the effectiveness of BERT-based models, and how our models behave in combination with these techniques. The MS MARCO document ranking benchmark is therefore not suitable, and we only report results on the Robust04 and GOV2 collections.

5.2 Contribution of the first-stage retriever scores to the end-to-end effectiveness

Our experimental design is based on a two-stage ranking architecture, also known as a retrieve-then-rerank architecture, where our BERT-based models rerank the documents retrieved by the BM25 model. In this section, we evaluate the contribution of the best match scores from the initial bag-of-words retriever to the end-to-end effectiveness by simply combining BM25’s document-level scores with the passage-level evidence from the reranker using linear interpolation. We follow the linear combination defined in the Birch model (Akkalyoncu Yilmaz et al., 2019).

Birch uses a monoBERT sentence-level relevance classifier at its core. To determine the document relevance \(s_f\), inference is applied over each individual sentence \(s_i\) in a candidate document d, and the top n sentence scores are then combined with the original document score \(s_{doc}\) given by the first-stage retriever as follows:

$$\begin{aligned} s_f = \alpha \cdot s_{doc} + (1-\alpha ) \cdot \sum _{i=1}^{n} w_i \cdot s_i \end{aligned}$$
(6)

where \(s_i\) is the score of the i-th top-scoring sentence according to monoBERT. The parameters \(\alpha\) and \(w_i\) are tuned via cross-validation. In other words, the relevance score of a document comes from the combination of its document-level term-matching score and the evidence contributions from the top sentences in the document as determined by monoBERT.
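A minimal sketch of this interpolation for the single best-passage case (\(n=1\)) used in our experiments is shown below; how (or whether) the two score distributions are normalized per query before interpolation is not specified here and is left as an implementation assumption.

```python
def interpolate(s_doc: float, s_bert: float, alpha: float) -> float:
    """s_f = alpha * s_doc + (1 - alpha) * s_BERT, Eq. (6) with n = 1.

    alpha is tuned by cross-validation; alpha = 0 keeps only the reranker
    score and alpha = 1 keeps only the BM25 document score.
    """
    return alpha * s_doc + (1 - alpha) * s_bert
```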

For our experiments, the linear interpolation is applied to the results obtained in the zero-shot transfer setting with the best-scoring passage (\(n=1\)). In other words, we apply the score combination defined in Eq. (6) to the document scores obtained by the BM25 retriever at cutoff 1,000 and the corresponding scores estimated with the best-scoring passage method by the reranking models. Table 10 first shows the results of the traditional BM25 model alone; the second and third sections are each dedicated to a reranker: the vanilla and Sim-Pair BERT models. For both rerankers, we recall the results of the model alone obtained in the zero-shot transfer setting and then present the end-to-end effectiveness after interpolating BM25 scores (\(+\) BM25), with the indication of the change rate (%) over the reranker-only effectiveness. These results allow us to answer our research question RQ2: Do exact match scores from the first-stage retriever contribute to the end-to-end effectiveness of the pretrained transformers, and how does exact match marking affect this contribution?

5.2.1 Impact of interpolating BM25 scores

Interpolating BM25 scores (Best Match), which are based solely on surface-level features such as TF and IDF, leads to a significant gain in performance, indicating that BM25 document-level scores provide an additional relevance signal that the BERT-based models alone could not effectively capture. We notice that the improvement rate resulting from interpolating BM25 scores is much more substantial on the GOV2 collection (\(+15\%\) on average) than on Robust04 (\(+5.7\%\) on average). The fact that the BERT models outperform BM25 by a large margin on Robust04, while this margin is much smaller on GOV2, can explain why BM25 scores have more influence on the end-to-end effectiveness on GOV2 than on Robust04.

5.2.2 Impact of exact match marking

From Table 10, we can clearly see that on Robust04, where the exact match marking is effective, the improvement rate over the reranker-only effectiveness is lower when using exact match marking, about \(+12\%\) on average, than for the vanilla model, which gains \(+22\%\) on average. In other words, the impact of the BM25 scores is more important for the vanilla model than for the Sim-Pair model. On GOV2, the improvement rate after BM25 score interpolation compared to the reranker-only performance is either comparable or slightly higher when using exact match marking than with the vanilla baseline. However, the performance of the Sim-Pair BERT model with BM25 score interpolation is, in all cases, higher than the vanilla BERT + BM25 performance, regardless of the improvement rate brought by the score combination. Since we use the results obtained in the zero-shot domain transfer setting where, we recall, the exact match marking is more effective, the gains of the Sim-Pair BERT+BM25 configuration over the vanilla BERT+BM25 are more substantial on Robust04 than on GOV2.

Table 10 Reranking effectiveness of the different models before and after interpolating BM25 scores on Robust04 and GOV2 collections

5.2.3 Contribution of BM25 scores

The contribution of BM25 scores is controlled by the parameter \(\alpha\) in Eq. (6), which we tuned via 5-fold in-collection cross-validation. In all scenarios, the weight \(\alpha\) is non-negligible; in other words, the contribution of BM25 signals remains important, an observation also reported for the Birch model (Lin et al., 2020). However, we notice that \(\alpha\) is always smaller when combining with the Sim-Pair BERT model that uses exact match marking. For Robust04 descriptions, the vanilla BERT+BM25 baseline puts a weight of \(\alpha \in \{0.3,0.4\}\) on BM25 scores, whereas Sim-Pair BERT+BM25 only considers a contribution of \(\alpha =0.2\) from BM25, while achieving substantially better performance. This indicates that the vanilla model relies more on BM25 to complete its relevance estimation, unlike the marking-based model, which is able to effectively capture more relevance signals and thus needs less contribution from BM25 scores.

Figure 2 visualizes the end-to-end ranking accuracy measured by nDCG@20 for \(\alpha \in [0,1]\) on both the Robust04 and GOV2 collections. On Robust04, we can clearly see that Sim-Pair BERT+BM25 reaches the most effective combination with a smaller contribution from BM25 scores (smaller \(\alpha\)), while the vanilla baseline requires more intervention from BM25 and still cannot reach the performance of Sim-Pair BERT+BM25, especially on descriptions. It is only logical that the best-performing model, which outperforms BM25 by a large margin, requires less contribution from the latter. Nevertheless, if we take the example of the GOV2 descriptions, despite the similar starting performance of the vanilla and Sim-Pair BERT models at \(\alpha =0.0\), the gap between their performances starts widening at only \(\alpha =0.1\) and reaches its peak at \(\alpha =0.2\).

Combining the original document score obtained by the first-stage retriever with passage-level evidence from BERT-based reranking models to determine the final relevance score of a document yields substantial gains in performance. Relevance scores based on traditional IR axioms complement the relevance signals captured by contextual pretrained LMs such as BERT. Moreover, using our simple marking strategy to highlight the exact matching signals in the query-document pairs enhances BERT’s own ability to estimate relevance, which thus requires less contribution from BM25 to achieve the best performance.

Fig. 2 The end-to-end ranking accuracy of the vanilla BERT and Sim-Pair BERT models with BM25 score interpolation on the Robust04 and GOV2 collections. \(\alpha =0.0\) indicates the reranking model’s effectiveness alone, without BM25 scores, and \(\alpha =1.0\) means that only BM25 scores are used

5.3 Multi-phase fine-tuning

In the previous experiments, we leveraged out-of-domain relevance assessments to fine-tune our BERT models. This fine-tuning aims at providing the model with general notions of relevance matching. However, transferring these relevance patterns to the target corpus may, in some cases, be ineffective. To overcome this domain-transfer limitation, we use additional fine-tuning on labeled data drawn from the same distribution as the final task, in other words, in-domain fine-tuning. This approach is known as “stage-wise” or “multi-phase” fine-tuning (Lin et al., 2020).

Once the models are fine-tuned on the MS MARCO passage dataset following the training setting described in Sect. 4.3, we further fine-tune them on the target task using 5-fold cross-validation for both the Robust04 and GOV2 collections. We use the folds from Yang et al. (2019) for Robust04 and the 5-fold configuration adopted by Li et al. (2020).

Following prior work by Dai and Callan (2019), we consider a maximum of 30 passages per document as a trade-off between latency and effectiveness. During training, passages drawn from the top 1,000 documents retrieved by BM25 for the queries in the training folds are sub-sampled to avoid catastrophic forgetting. Aside from the first passage, which is always kept, passages in a document are randomly preserved with a probability of 0.1. Passages from a document that is relevant according to the ground truth (TREC relevance judgements) are taken as positive examples, and passages from the remaining documents as negative examples. We use a pointwise cross-entropy loss and fine-tune the models for a single epoch with a batch size of 32 training instances, each comprising a query and a passage. We use the Adam optimizer with a learning rate of \(1e^{-5}\) and warm-up over the first \(10\%\) of the total training steps.
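A hedged sketch of this passage sub-sampling is given below: the first passage of a document is always kept and every other passage survives with probability 0.1; the function name and interface are illustrative assumptions.

```python
import random

def subsample_passages(passages, keep_prob=0.1):
    """Keep the first passage; keep each remaining passage with probability keep_prob."""
    return [p for i, p in enumerate(passages) if i == 0 or random.random() < keep_prob]
```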

For the queries in the left-out test fold, we set the rerank threshold to 100 as a trade-off between latency and effectiveness. We report the average performance across all test folds, measured in terms of P@20 and nDCG@20 using pytrec_eval. In this setting, our vanilla baseline corresponds to the pointwise-trained BERT-MaxP model (Dai and Callan, 2019) initialized with monoBERT fine-tuned on MS MARCO rather than with Google’s pretrained BERT checkpoint without any prior fine-tuning on the text ranking task.

Table 11 Reranking effectiveness in the multi-phase vs. zero-shot transfer setting for the Sim-Pair and vanilla models on Robust04 and GOV2 collections

Table 11 reports the reranking effectiveness obtained using the multi-phase fine-tuning setting compared to the single-phase MS MARCO fine-tuning (zero-shot transfer setting) for both the Robust04 and GOV2 collections. We report results obtained when reranking the top 100 documents retrieved by BM25 in both settings. Thanks to the additional in-domain fine-tuning on the target collection, the performance on both collections improves regardless of the topic field used. We notice that, in this setting, Sim-Pair BERT is able to achieve significant gains over the vanilla baseline on the GOV2 collection, confirming our hypothesis that the zero-shot domain transfer from MS MARCO was not sufficient for this collection.

In the multi-phase fine-tuning setting, the BERT-based models achieve better performance on descriptions than on titles on Robust04, by \(+7.5\%\) and \(+8.3\%\) for the vanilla and Sim-Pair models respectively, despite the lower retrieval effectiveness of BM25 on descriptions compared to titles (\(-4.3\%\)). On the other hand, the difference in BM25 retrieval effectiveness between descriptions and titles is larger on GOV2, about \(-11\%\). The BERT-based rerankers reduce this gap to \(-5.5\%\) and \(-5.9\%\) for the vanilla and Sim-Pair models respectively, but not enough to reverse the tendency. The end-to-end effectiveness on this collection is thus higher on titles than on descriptions, as observed for previous state-of-the-art models such as BERT-MaxP (Dai and Callan, 2019) or Parade (Li et al., 2020) (see results in Sect. 5.5). Still, the hybrid pipeline outperforms both the title and description runs on both collections. The reranking accuracy achieved by the hybrid runs is the highest reported result using a BERT-based model on both collections at the time this article was written.

Table 12 Reranking effectiveness with exact matching ablation at different phases of the multi-phase fine-tuning configuration of Sim-Pair BERT on Robust04 and GOV2 collections

5.3.1 Phase-wise marking

The results of the Sim-Pair BERT model presented in Table 11 for the multi-phase setting are obtained using exact match marking throughout the two fine-tuning phases. While the first fine-tuning phase focuses on learning general notions of relevance from a large passage collection, the goal of the additional in-domain fine-tuning is to learn directly from labeled data with the same distribution as the target task. It is therefore important to determine in which of the two phases the marking strategy is more beneficial, and in which phase it can be omitted. To this aim, we conduct an ablation study on the Sim-Pair BERT model. Table 12 shows the results of the marking-strategy ablation on the Robust04 and GOV2 collections using the different topic fields. With these results, we can now address our research question RQ3: At which phase is exact match marking most beneficial in a multi-phase fine-tuning configuration?

MS marking (labelled run A in Table 12) uses exact match marking in the MS MARCO (MS) fine-tuning phase only, then uses the original, unmarked data for the in-domain (ID) fine-tuning phase. We can see in Table 12 that using the marking strategy in the general fine-tuning phase is sufficient to outperform the vanilla baseline, or at least perform comparably on Robust04 titles. In other words, initializing BERT with weights learnt from marked inputs is better than initializing it with weights learnt from unmarked inputs. Ablating marking in the in-domain fine-tuning phase can even surpass the performance of the Sim-Pair BERT that uses marking across both fine-tuning phases, as observed for descriptions on both collections and for the hybrid run on GOV2.

ID marking (labelled run B in Table 12) uses the marking strategy to augment the inputs during fine-tuning on the in-domain data, while the BERT model is initialized with the weights learnt from unmarked MS MARCO inputs. This first-phase marking ablation either has no substantial impact on the model's performance or leads to a degradation. This behavior is expected, since there is not enough in-domain data for BERT to learn useful representations of the marker tokens and their contribution to the relevance prediction.

Using the marking strategy during the first, general-purpose fine-tuning phase (MS marking) is thus already enough to outperform the vanilla baseline, without requiring additional marking during the in-domain fine-tuning phase. In other words, a model fine-tuned with the Sim-Pair marking strategy on MS MARCO is able to reuse the relevance matching patterns learned from marked out-of-domain data in later phases, even without the guidance of explicit markers. Nevertheless, the additional marking in the in-domain fine-tuning phase used in the classical Sim-Pair BERT approach is beneficial for title queries, where it brings an additional gain of \(+1.6\%\) and \(+1.4\%\) over MS marking only (run A) on Robust04 and GOV2, respectively.

5.4 Impact of exact match marking on ELECTRA variant

While BERT is the most widely adopted pretrained language model, variants such as RoBERTa (Liu et al., 2019) and ELECTRA (Clark et al., 2020) were proposed to improve the model along different dimensions. Recent state-of-the-art results on the Robust04 and GOV2 collections were achieved using ELECTRA, which appears to outperform BERT. ELECTRA (Clark et al., 2020) replaces Masked Language Modeling (MLM) with a novel, more sample-efficient pretraining task called replaced token detection, in which the model learns to distinguish real input tokens from plausible but synthetic replacements produced by a small "generator" model. This approach involves two components, both of which require training: the generator, a small BERT-like model that predicts masked tokens, and the ELECTRA discriminator. The new objective, however, allows the model to learn from all input positions rather than from only the \(15\%\) of positions masked in the MLM task.
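
As a rough illustration of the replaced token detection objective (a simplified sketch, not ELECTRA's official implementation; tensor names are ours):

```python
import torch.nn.functional as F

def replaced_token_detection_loss(disc_logits, corrupted_ids, original_ids, attention_mask):
    """Simplified sketch of ELECTRA's discriminator objective.

    disc_logits: (batch, seq_len) per-token logits from the discriminator.
    corrupted_ids / original_ids: token ids after / before the generator
    replaced the masked positions.
    A token is labeled 1 if the generator replaced it, 0 otherwise; the
    loss is computed over every real token, not only the ~15% masked
    positions used by MLM.
    """
    labels = (corrupted_ids != original_ids).float()
    per_token = F.binary_cross_entropy_with_logits(disc_logits, labels, reduction="none")
    mask = attention_mask.float()
    return (per_token * mask).sum() / mask.sum()
```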

To gain confidence in our approach, we investigate whether exact match marking is beneficial for a BERT variant pretrained on this more robust task, and study RQ4: Is exact match marking beneficial in alternative transformer-based models such as ELECTRA?

For our experiments, we use the base version of the ELECTRA model as the core of the model architecture illustrated in Fig. 1, in place of BERT. We use the same single-layer neural network to estimate a score R(d, q) quantifying how relevant the candidate document d is to the query q, and the same fine-tuning hyperparameters used with BERT.
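
A minimal sketch of this setup, assuming a HuggingFace-style model class with a linear scoring head on the [CLS] representation (the exact head details of our implementation may differ):

```python
import torch
from transformers import AutoModel

class CrossEncoderRanker(torch.nn.Module):
    """Cross-encoder reranker: a PLM core with a single-layer scoring head.

    Swapping BERT for ELECTRA only changes `core_name`; the scoring head
    and the fine-tuning hyperparameters stay identical.
    """

    def __init__(self, core_name="google/electra-base-discriminator"):
        super().__init__()
        self.core = AutoModel.from_pretrained(core_name)
        self.head = torch.nn.Linear(self.core.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.core(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]   # [CLS] representation
        return self.head(cls).squeeze(-1)   # relevance score R(d, q)
```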

5.4.1 In-domain effectiveness

Using the same setting as for the BERT-based models, we report the results obtained on the TREC DL 2019 and 2020 test collections in Table 13. For clarity, we only show results with the Sim-Pair marking strategy; full results with all strategies can be found in Appendix 3.

Table 13 Reranking effectiveness on the TREC DL 2019 and DL 2020 Document ranking tasks for Sim-Pair and vanilla models with both BERT and ELECTRA cores

Interestingly, using the ELECTRA core in place of BERT in the vanilla baseline does not lead to increased performance; we even observe a slight drop on TREC DL 2020. Adding exact match marking leads to similar gains over the vanilla baselines with both cores. While the gain in average precision is more pronounced with ELECTRA on both DL 2019 and 2020, effectiveness in terms of nDCG@10 is higher with the BERT core on the DL 2020 test collection.

5.4.2 Zero-shot transfer setting

We take the models fine-tuned exclusively on out-of-domain data, i.e., the MS MARCO passage dataset, and apply inference on the window passages obtained by splitting each document with the same passage length of 150 words and 75-word stride used in the BERT experiments. Table 14 shows the results obtained at cutoff 1,000 on both the Robust04 and GOV2 collections. We recall the results of the vanilla and Sim-Pair models with the BERT core for comparison.
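
The window-passage construction can be sketched as follows (an illustrative helper; names are ours):

```python
def split_into_windows(doc_words, window=150, stride=75, max_passages=30):
    """Split a document into overlapping word windows.

    150-word passages with a 75-word stride, capped at 30 passages per
    document, as in the BERT experiments.
    """
    passages = []
    for start in range(0, max(len(doc_words), 1), stride):
        passages.append(" ".join(doc_words[start:start + window]))
        if start + window >= len(doc_words) or len(passages) == max_passages:
            break
    return passages
```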

Table 14 Reranking effectiveness in the zero-shot transfer setting for the Sim-Pair and vanilla models on Robust04 and GOV2 collections using both BERT and ELECTRA cores

Exact Match Marking on ELECTRA The results clearly indicate that adding exact match marking is still beneficial for the ELECTRA variant. As for the BERT version, Sim-Pair ELECTRA is more effective on Robust04, with an average improvement rate of \(+5\%\), compared to only half of that, \(+2.5\%\), on GOV2. However, exact match marking has a more notable impact on titles than on descriptions, whereas the vanilla ELECTRA baseline clearly prefers description queries.

ELECTRA versus BERT core The Sim-Pair ELECTRA variant achieves better performance than its BERT counterpart regardless of the topic field on the GOV2 collection. In contrast, the BERT core is more effective on Robust04 for titles, descriptions and the hybrid pipeline. The same tendency can be observed for the vanilla baseline, with smaller margins.

Table 15 Reranking effectiveness in the multi-phase fine-tuning setting for the Sim-Pair and vanilla models on Robust04 and GOV2 collections using both BERT and ELECTRA cores

5.4.3 Multi-phase fine-tuning

Table 15 shows the results obtained using multi-phase fine-tuning on both the MS MARCO passage dataset and in-domain labeled data, as described for BERT in Sect. 5.3. The ELECTRA-based models outperform the BERT-based models on both collections regardless of the topic field used, indicating that ELECTRA is a more effective core PLM than BERT in a multi-phase fine-tuning setting. However, adding exact match marking has no significant impact in this setting. Sim-Pair ELECTRA performs slightly better than the vanilla ELECTRA baseline on the Robust04 collection across the title, description and hybrid runs. On GOV2, exact match marking leads to better ranking accuracy on titles, but to a slight degradation when the description field is used for reranking (description and hybrid runs).

In summary, exact match marking is indeed beneficial for the ELECTRA model, especially in the zero-shot transfer setting where no labeled data is available in the target domain. Sim-Pair ELECTRA achieves significant gains on titles, where Sim-Pair BERT is less effective. However, for the description and hybrid runs that use descriptions for reranking, exact match marking appears to have a more substantial impact with a BERT core. On the TREC DL 2019 and 2020 benchmarks, both the vanilla and Sim-Pair models perform similarly with the BERT and ELECTRA cores; the only advantage of the ELECTRA core is an increased average precision with Sim-Pair. Overall, in most cases, the ELECTRA-based versions of our models are more effective than their BERT counterparts.

5.5 Comparison with state-of-the-art baselines

In this section we situate our approach with regard to existing work on document ranking. First, we conduct comparative evaluations against models with a similar experimental setup to ensure a fair comparison. Then, we compare our best runs to a wide variety of SOTA approaches with different configurations.

5.5.1 Comparison in the same experimental design

In order to fairly compare a novel approach with previously proposed ones, it is important to conduct the evaluation under the same experimental conditions. Here, we reproduce as closely as possible the original settings used to produce the results of the Birch and BERT-MaxP baselines, respectively.

Birch (MS) This baseline is fine-tuned exclusively on MS MARCO passages; we therefore use our Sim-Pair BERT + BM25 model, likewise fine-tuned on MS MARCO passages and augmented with BM25 score interpolation following the same Equation 6 used in Birch (Akkalyoncu Yilmaz et al., 2019). All Robust04 and GOV2 topics and relevance judgements are used as a held-out test set.
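
A sketch of the Birch-style interpolation, assuming the published form of the aggregation in which the document score combines BM25 with the weighted top passage scores (the coefficient values below are placeholders, not those used in our experiments):

```python
def interpolate_scores(bm25_score, passage_scores, alpha=0.5, weights=(1.0,)):
    """Birch-style interpolation of document- and passage-level evidence.

    final = alpha * BM25(d, q) + (1 - alpha) * sum_i w_i * s_i,
    where s_i is the i-th highest BERT passage score of the document.
    alpha and the weights w_i are tuned on held-out data; the defaults
    here are placeholders.
    """
    top = sorted(passage_scores, reverse=True)[:len(weights)]
    evidence = sum(w * s for w, s in zip(weights, top))
    return alpha * bm25_score + (1 - alpha) * evidence
```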

Table 16 Reranking effectiveness of the Sim-Pair BERT with interpolating BM25 scores vs. Birch (MS) baseline on both Robust04 and GOV2 collections

Table 16 shows the results of our Sim-Pair BERT + BM25 model compared to the Birch (MS) baseline. The results clearly indicate that our model outperforms Birch (MS). Since our model already outperforms the baseline with a BERT Base core, it is unnecessary to repeat the experiment with BERT Large, whose computational cost unfortunately exceeds our hardware limitations.

BERT-MaxP (MS) The configuration of this baseline is the same as the one we used in the multi-phase fine-tuning setting. Table 17 compares BERT-MaxP (MS) with Sim-Pair BERT fine-tuned first on MS MARCO and then further fine-tuned on the target task with 5-fold cross-validation. We report results when exact match marking is used during fine-tuning on MS MARCO passages only [MS], and with full marking on both MS MARCO and in-domain data, i.e., Sim-Pair BERT. Our approach clearly outperforms the BERT-MaxP baseline on titles and performs slightly better on descriptions. It is important to note that the BERT-MaxP results reported by Li et al. (2020) are better than those of our vanilla BERT baseline in the multi-phase fine-tuning setting, especially on GOV2. This slight difference can be explained by our use of the traditional pointwise loss function (monoBERT (Nogueira and Cho, 2019)) while they use a pairwise loss function.

Table 17 Reranking effectiveness of the Sim-Pair BERT with multi-phase fine-tuning vs. BERT-MaxP (MS) baseline on both Robust04 and GOV2 collections

5.5.2 Comparison with different experimental designs

Each approach has its own optimal experimental conditions leading to the best possible ranking accuracy, and these conditions are rarely the same across the models we want to compare. Independently of the experimental framework employed to obtain the results, or of the nature of the approach, Table 18 compares our best runs with BERT and ELECTRA cores, obtained in the multi-phase fine-tuning setting, with the best baseline runs, while Table 19 compares our best in-domain runs to both the best TREC runs from the TREC DL 2019 and 2020 tracks and the SOTA baselines.

Table 18 Reranking effectiveness on Robust04 and GOV2 of our best runs versus the best baseline runs

Robust04 and GOV2 collections Unsurprisingly, the reranking models achieve the best results and largely outperform all other baselines. For a fair comparison with the sparse and dense retrieval methods (runs [03-07]), which do not use target-domain fine-tuning, we add our runs in the zero-shot setting on descriptions (runs [08-09]). Nevertheless, our rerankers still outperform the retrievers.

Results obtained with the best Sim-Pair BERT, run [17] in Table 18, outperform all the state-of-the-art BERT-based models and achieve better performance than both the base and large versions of T5 on Robust04. The Sim-Pair ELECTRA variant (run [18]) achieves comparable performance to the T5-3B model while using only \(3.6\%\) of its parameters, and outperforms the Parade ELECTRA model on both the Robust04 and GOV2 collections by margins ranging from \(+3\%\) to more than \(+4\%\). T5 is by far the strongest baseline; it is important to note that it uses a zero-shot transfer setting without in-domain fine-tuning, as opposed to BERT-MaxP, Parade and our best runs [17-18]. However, its large size makes it impractical compared to a BERT Base or ELECTRA Base model.

Table 19 Reranking effectiveness on TREC DL 2019 and 2020 Document ranking tasks of our Sim-Pair models with both BERT and ELECTRA cores versus the best TREC runs and baselines

TREC DL Document Ranking task Similarly to the Robust04 and GOV2 results, the best TREC runs, which are cross-encoding rerankers, outperform all other baselines. For TREC DL 2019, we include the best idst_bert_r1 run (Yan et al., 2019), which uses StructBERT (Wang et al., 2020), a BERT model that better captures sentence relationships thanks to an improved Next Sentence Prediction task, and ucas_runid1 (Chen et al., 2019), which uses BERT-MaxP (Dai and Callan, 2019). We also include the Parade results (Li et al., 2020). Our runs outperform Parade and ucas_runid1, but cannot outperform idst_bert_r1, with its StructBERT core, in terms of nDCG@10. In TREC DL 2020, the best run, d_d2q_duo (Pradeep et al., 2020), is a large multi-stage ranking model comprising a BM25 retriever, DocT5Query document expansion and two cascaded T5-3B rerankers, making it hard to outperform. ICIP_run1 (Chen et al., 2020) uses a BERT-Large model at its core with a refined fine-tuning process including passage filtering and better negative sampling, which explains its higher performance. Nevertheless, our runs are still competitive and outperform Parade, which has the same model size as our models. Interestingly, performance on TREC DL 2020 is lower in terms of nDCG@10 than on TREC DL 2019 for the same model, as observed for both our runs and the Parade run.

6 Discussion and future work

Our research is related to effectively harnessing the exact matching signals from the query-document pairs to enhance document ranking with pretrained language models (PLMs) exemplified by BERT. We have shown through the empirical experiments reported in this paper that PLMs such as BERT can benefit from explicit exact match cues conveyed via marker tokens to be more effective for ad hoc ranking.

BERT, as the most prominent PLM, has been successfully applied to text ranking as well as a wide range of other tasks without requiring any specialized neural architectural components to capture different relevance signals, as opposed to pre-BERT neural ranking models. Previous work by Qiao et al. (2019) studied the behaviour of BERT for ranking and found that it is able to capture semantic matching signals between paraphrase tokens. However, research from the pre-BERT era has shown that, in addition to semantic matching, exact matching remains an important cue for neural ranking models (Guo et al., 2016; Mitra et al., 2017). Guo et al. (2016) argue that "exact matching of terms in documents with those in queries is still the most important signal in ad hoc retrieval due to the indexing and search paradigm in modern search engines". This is why Boualili et al. (2020) suggest emphasizing the exact match signals for BERT using a marking technique that does not involve redesigning the model's architecture, which would forfeit the immense benefits of self-supervised pretraining.

In this paper, we extend Boualili et al. (2020) and study four research questions that investigate the effectiveness of our newly proposed marking strategies for ad hoc document ranking.

First, we investigated the benefits of exact match marking for a BERT-based model in both in-domain and zero-shot transfer settings. The experiments showed that combining a simple soft marker with a pair-level marking strategy (Sim-Pair) is the simplest yet most effective marking strategy. Moreover, experiments on Robust04 and GOV2 showed that this exact match marking approach is more effective on the description field of the topic than on the title field. This preference for well-written natural language questions is in line with BERT's preference for descriptions revealed by Dai and Callan (2019). On the other hand, we follow a retrieve-then-rerank architecture where the retriever is a bag-of-words model that prefers short keyword queries, while the reranker is a BERT-based model that prefers long natural language questions (Dai and Callan, 2019; Nogueira et al., 2020). To get the best of both stages, we propose a hybrid pipeline where titles are used for retrieval and then replaced by descriptions for reranking, which leads to substantial gains in performance.
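
A schematic view of the hybrid pipeline (the searcher and reranker interfaces are illustrative, not the paper's code):

```python
def hybrid_rerank(topic, bm25_searcher, reranker, k=100):
    """Hybrid pipeline: retrieve with the short title, rerank with the
    verbose description of the same topic.
    """
    candidates = bm25_searcher.search(topic["title"], k=k)            # keyword query
    scored = [(doc.docid, reranker.score(topic["description"], doc.text))
              for doc in candidates]                                   # natural language query
    return sorted(scored, key=lambda s: s[1], reverse=True)
```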

Second, we investigated how to improve effectiveness on the out-of-domain collections using two methods: (1) linear interpolation of BM25 document-level scores with BERT-based passage-level scores, and (2) adding in-domain fine-tuning on the target collection. With the first method, we find that exact term-matching scores from traditional bag-of-words models like BM25 are still beneficial for BERT-based document reranking on out-of-domain collections. Indeed, combining document-level BM25 scores with passage-level evidence from our BERT-based models through a simple linear interpolation leads to substantial gains in performance. The document-level scores from the initial BM25 retrieval, based on traditional IR cues (TF, IDF), provide additional relevance signals that complement the passage-level scores of the BERT-based models. Furthermore, exact match marking appears to take better advantage of the combination with BM25 scores, achieving better performance than the vanilla model.

With the second method, adding in-domain fine-tuning on top of the first, general-purpose fine-tuning phase on out-of-domain data, we demonstrated through an ablation study that using exact match marking in the general-purpose fine-tuning phase on large out-of-domain data is enough to achieve substantial gains in performance, especially on descriptions. We publish our checkpoints fine-tuned on MS MARCO so that they are accessible to the community as a more effective alternative to a vanilla checkpoint.

Third, we studied the contribution of our exact match marking strategy on a BERT variant, ELECTRA, which has recently been used in state-of-the-art models such as Parade (Li et al., 2020). Experiments showed that exact match marking is indeed beneficial for ELECTRA, especially in the zero-shot transfer setting where no in-domain annotated data is used for training. In addition, the ELECTRA-based models outperform their BERT counterparts in most cases.

Finally, we compared our best runs using both BERT and ELECTRA to a wide range of transformer-based ranking models representing the state of the art at the time this article was written. On the one hand, the comparative evaluation showed that our exact match marking approach combined with the hybrid pipeline, which uses titles for BM25 retrieval and descriptions for BERT reranking, achieves near state-of-the-art results on Robust04 compared to the strong and much larger T5-3B baseline, and outperforms previously proposed models on GOV2. On the other hand, the comparative evaluation on the TREC DL 2019 and 2020 Document ranking tasks showed that our marking-based approach is competitive with the best TREC runs. Even though this evaluation is an in-domain setting, the benefits of exact match marking seem less prominent than those observed on Robust04 or GOV2. Unlike the title and description queries used with Robust04 and GOV2, the TREC DL queries are questions; the documents and other aspects of the evaluation also differ. Further analysis is needed to determine the factors behind this discrepancy.

In the end, what does this mean when choosing between deploying a vanilla BERT or a Sim-Pair BERT? We would argue that Sim-Pair BERT induces focus on exact match signals, leading to better performance than vanilla BERT (in 24 comparisons, 9 of which are significant), or at least to comparable performance (in only 4 comparisons, with no significant loss). Importantly, our extensive experiments did not show a single case where Sim-Pair BERT performs significantly worse; we would therefore recommend it. On the efficiency side, our approach inherits the efficiency issues of the monoBERT cross-encoder. However, we do not add any complexity to the model, making our approach a drop-in substitute for a vanilla BERT with the exact same number of parameters (110M).

Our approach was empirically proven to be effective on standard ad hoc benchmarks. In terms of explainability, however, much analysis remains to be done to understand how exactly the marking conveys exact match signals to BERT and how they are integrated in the relevance prediction process. To this day, little is understood about the inner workings of BERT and PLMs in general, despite all the efforts put into studying their behavior. Previous research attempted to reveal insights about how BERT "works" in the limited context of passage retrieval, but such studies are lacking when it comes to ranking long documents. Aside from this explainability limitation, our approach is rather simple and considers all query terms to be of equal importance when, in reality, they rarely have the same importance in the query, especially in long descriptions.

For future work, we plan to develop diagnostic tests in an attempt to shed light on the contribution of the exact match marking to the inner workings of BERT. Once the intervention of the markers is determined, their representations could be leveraged for relevance classification in addition to, or instead of, the current standard [CLS] representation. Identifying the subset of queries most likely to be improved by explicit exact match cues could also be used to decide whether to apply marking. Furthermore, our approach could be further improved by integrating query term importance. Finally, other methods may be investigated to better integrate exact match signals into BERT.

7 Conclusion

Pretrained language models perform well on an impressively wide range of tasks. They have proven to excel at semantic matching; nevertheless, exact matching remains essential for relevance ranking. In light of this, we proposed to use marker tokens to convey exact match cues from the textual input, yielding strong performance while maintaining the same architecture, i.e., the same number of parameters. We showed through empirical experiments that using a simple marker combined with pair-level marking is the simplest strategy yielding the best effectiveness. We also showed that applying this marking strategy in a hybrid retrieve-then-rerank pipeline, which uses short keyword queries for the first bag-of-words retriever and long natural language queries for reranking with PLMs like BERT and ELECTRA, produces competitive effectiveness compared to state-of-the-art models. We published our checkpoints fine-tuned on marked data on the HuggingFace model hub so they can be easily used by the community via the popular "transformers" library, without changes to existing setups, while benefiting from the improvements brought by exact match marking and building upon them.