1 Introduction

The sentence retrieval (SR) task consists of finding relevant sentences from a document base given a query. This task is useful in a wide range of Information Retrieval (IR) applications, such as summarization, question answering, and opinion mining. SR is a challenging problem area that has attracted a great deal of attention recently (Allan et al. 2003; Losada 2008; Losada and Fernández 2007; Murdock 2006; White et al. 2005). The bulk of SR methods proposed in the literature are straightforward adaptations of standard retrieval models (such as tf-idf, BM25, and Language Models), where the sentence, rather than the document, is the unit of retrieval. This leads to SR models which estimate relevance based only on the match between query and sentence terms. The state-of-the-art SR method is known as term frequency-inverse sentence frequency (tfisf), which is analogous to the traditional tf-idf method used in document retrieval (Allan et al. 2003; Losada 2008). While numerous attempts to develop more sophisticated models employing techniques such as Natural Language Processing and clustering have been proposed (Kallurkar et al. 2003; Li and Croft 2005; Zhang et al. 2003), they have failed to significantly and consistently outperform the tfisf method. Consequently, little progress has been made in terms of improving sentence retrieval effectiveness.

To develop a more effective sentence retrieval method, we argue that the assumption inherited from the naive application of document retrieval models, i.e. that all sentences are independent, does not hold. This is because a sentence is surrounded by other sentences which help to contextualize it. A sentence is also part of a document, and it may or may not be important in representing the document's topic. Presently, this local context is either ignored or underutilized by existing methods. We posit that by incorporating the local context within SR models, more effective SR methods can be developed.

The reasons for this are as follows: Any model using only standard term statistics to match query and sentences will suffer severely from the vocabulary mismatch problem because there is little overlap between the query and sentence terms. Intuitively, the local context could be used to improve retrieval, by helping to mitigate the difficulties posed by the vocabulary mismatch rooted in the sparsity of sentences. Additionally, current methods do not exploit the importance of a sentence in a document, which we posit is an important factor in determining the relevance of a sentence. A relevant sentence needs to be indicative of the query topic, but also representative and important in the context of the document, i.e. we assume that key statements within a document are more likely to be relevant.

To this end, we propose a novel reformulation of the SR problem that includes the local context in a Language Modeling (LM) framework. Within this principled framework, it is possible to naturally include additional evidence into the smoothing process in order to enrich the representation of sentences. Also, the model provides a way to include a query-independent probability that encodes the importance of a sentence in a document. In a set of experiments performed over several TREC test collections, we compare the proposed models against existing SR models and demonstrate that using local context within a LM framework delivers retrieval performance that significantly outperforms the current state of the art in sentence retrieval.

The remainder of this paper is organized as follows. Section 2 presents previous work related to this research. Section 3 explains the methods we propose to address the SR problem. Section 4 reports on the conducted experiments and analyzes the outcomes. The paper concludes with Sect. 5, where a summary of our findings and directions for future work are presented.

2 Related work

In this paper, we adopt the same definition of the sentence retrieval problem as proposed in the TREC Novelty Tracks (Harman 2002; Soboroff 2004; Soboroff and Harman 2003). Although these tracks are mostly focused on researching redundancy filtering, they also involve a SR task that enables research into how to retrieve sentences that are relevant to a given query.

As previously mentioned, numerous SR methods have been proposed in the literature. One of the first methods was coined as tfisf (Allan et al. 2003). It is an adaptation of the document retrieval method tf-idf, but at the sentence level. This simple approach is regarded as the state of the art in SR as it has been shown to consistently outperform other methods (Allan et al. 2003; Fernández and Losada 2009; Losada and Fernández 2007). As a matter of fact, this parameter-free method has been shown to perform at least as well as the best performing empirically tuned and trained SR models based on BM25 or LMs (Fernández and Losada 2009; Losada and Fernández 2007). While this tends not to be the case in document retrieval, on other tasks where the unit of retrieval is smaller, such as passage retrieval, vector-space models have performed well empirically. For instance, Kaszkiel and Zobel (1997, 2001) showed that some cosine and pivoted models are highly effective for document ranking based on passages. Although we evaluate SR here (rather than document retrieval), past studies on passage-based document retrieval confirm that vector-space methods are also state-of-the-art models for query-passage scoring.

Li and Croft (2005) analyzed the components of sentences and identified patterns (such as phrases, named entities and combinations of query terms) to estimate the relevance of the sentences. Although this method succeeded in detecting redundant information, it was not able to improve on the tfisf baseline when estimating relevance. Clustering methods have also been considered as alternative techniques to improve SR models; such methods have shown mixed performance (Kallurkar et al. 2003; Zhang et al. 2003), seldom improving upon the tfisf baseline. These clustering methods also incur additional computational cost and increased complexity, making them unattractive to implement. Query expansion techniques have also been proposed to improve the performance of current sentence retrieval approaches. Among them, the most common is query expansion via pseudo-relevance feedback (Collins-Thompson et al. 2002; Losada 2008) and with selective feedback (Jaleel et al. 2004; Losada and Fernández 2007), or relevance models (Liu and Croft 2002). While query expansion techniques tend to improve performance by addressing the vocabulary mismatch problem, they rely on good performance during the first pass of retrieval to realize such improvements.

In this paper, we reformulate the problem of sentence retrieval within the LM framework, where localized smoothing is employed to improve the representation of sentences. The work most related to this research has been performed by Losada and Fernández (2007) and Murdock (2006). In Losada and Fernández (2007), the local context of a sentence was informally introduced into the computation of sentence similarity. Basically, extra weight was given to those terms that have high frequency in the associated documents. In Murdock (2006), the estimation of the sentence language model included some local context, combining evidence from the sentence and document levels. More specifically, a simple mixture model of the sentence, document and collection was proposed in order to form a better representation of the sentence. From the limited experiments reported, Murdock showed that the mixture model was better than other LM methods with the TREC novelty data. However, the results are far from conclusive because competitive SR methods, such as tfisf, were not evaluated. Nor was any indication of the sensitivity of the method w.r.t. the smoothing parameters reported. In this paper, we provide a more general framework that encompasses both previous formulations using Language Models, but also provides avenues for incorporating other forms of local context.

3 Sentence retrieval models

The SR task consists of estimating the relevance of each sentence s in a given document set, and supplying the user with a ranked list of sentences that satisfy his/her need (expressed as a user query q). In this section, we first outline the standard LM approach applied to the problem of SR. Then, we propose a novel reformulation which includes local context seamlessly and intuitively within the model. Finally, we conclude the section with a description of baseline SR models (tfisf and BM25).

3.1 Sentence retrieval with language models (standard method)

Language Models are probabilistic mechanisms to explain the generation of text (Ponte and Croft 1998). The simplest LM is the unigram LM, which consists of associating a probability to each word of the vocabulary (Hiemstra 2001; Miller et al. 1999; Zhai and Lafferty 2001). This is a very intuitive and powerful approach that has been shown to be very effective in many IR tasks, such as ad-hoc retrieval (Zhai and Lafferty 2001), distributed IR (Si et al. 2002), and expert finding (Balog et al. 2009).

Given the SR problem, the idea is to estimate relevance according to the probability of generating a sentence s given the query q, expressed as p(s|q). Instead of directly estimating this probability, Bayes' Theorem is applied, and sentences can be ranked using the query-likelihood approach, p(q|s). Footnote 1 The probability of a query q given the sentence s can then be estimated using the standard LM approach, where for each sentence s, a sentence LM is inferred. From the sentence model θ s it is assumed that each query term t is sampled independently and identically, such that:

$$ p(q|\theta_s) = \prod_{t \in q} p(t|\theta_s)^{c(t,q)} $$

where c(t,q) is the number of times the term t appears in q. The sentence model is constructed through a mixture between the probability of a term in the sentence and the probability of a term occurring in some background collection (i.e. maximum likelihood estimators of sentence and collection, respectively). This is usually performed in one of two ways by using (a) Jelinek–Mercer (JM) smoothing as shown in (2), or (b) Dirichlet (DIR) smoothing as shown in (3).

$$ p(t|\theta_s) = (1-\lambda) p(t|s) + \lambda p(t) $$
$$ p(t|\theta_s) = \frac{c(t,s) + \mu p(t)}{c(s) + \mu} $$

where c(t,s) is the number of times that t appears in s, and c(s) is the number of terms in the sentence. λ and μ are parameters that control the amount of smoothing. Note that, in (2) and (3), the smoothing expression ignores any local context and resorts immediately to the most general background knowledge p(t). This is a strong assumption because it focuses the computation on sentence and collection statistics, without any reference to other terms and phrases in sentences within the same document. As previously mentioned, many SR models (Allan et al. 2003) make similar simplifications, as the query-sentence similarity values do not take into account any information from the document (i.e. all sentences are treated independently).
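As a concrete illustration, the two smoothing estimators above can be sketched as follows; the function names, tokenized inputs and default parameter values are illustrative, not taken from the paper's experimental setup.

```python
import math
from collections import Counter

def score_jm(query, sentence, coll_prob, lam=0.5):
    """Log query likelihood under Jelinek-Mercer smoothing (Eq. 2).

    `coll_prob[t]` is the background probability p(t), assumed
    non-zero for every query term."""
    s_counts = Counter(sentence)
    s_len = len(sentence)
    score = 0.0
    for t in query:
        p_ts = s_counts[t] / s_len if s_len else 0.0
        score += math.log((1 - lam) * p_ts + lam * coll_prob[t])
    return score

def score_dir(query, sentence, coll_prob, mu=100.0):
    """Log query likelihood under Dirichlet smoothing (Eq. 3)."""
    s_counts = Counter(sentence)
    s_len = len(sentence)
    score = 0.0
    for t in query:
        score += math.log((s_counts[t] + mu * coll_prob[t]) / (s_len + mu))
    return score
```

Both functions work in log space, which is rank-equivalent to the product in (1) and numerically safer for longer queries.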

JM and DIR smoothing yield retrieval matching functions with specific length retrieval trends, which were studied in Losada and Azzopardi (2008a) and Smucker and Allan (2005). Losada and Azzopardi (2008a) reported that DIR smoothing performs better than JM smoothing by showing that its document length pattern resembles the relevance pattern. They showed that DIR priors balance the query modeling and document modeling roles, whereas JM smoothing does not consider the document length in the smoothing process. Thus, JM leads to poor retrieval performance because the documents it retrieves tend to be longer than those retrieved by DIR, and the smoothing cannot compensate for this. Smucker and Allan (2005) demonstrated that DIR smoothing's performance advantage arises from an implicit document prior that favors longer documents by smoothing them less. They tested the performance of a DIR prior and JM smoothing with and without the document prior and showed that both methods smooth documents identically, except that the DIR prior smooths longer documents less. As a result, the DIR prior tends to favor the retrieval of longer documents. Given the sentence retrieval problem, it is an open question what kind of length correction is appropriate for this task and whether the implicit length correction of the smoothing methods employed helps or hinders the retrieval of relevant sentences.

3.2 Sentence retrieval using language models with local context

In this section, we relax the independence assumption between sentences and assume that the document (i.e. the local context) plays an important role in determining the relevance of a sentence. Therefore, we treat the SR problem as a problem of estimating the probability of the query and the document given the sentence, i.e. is the sentence likely to be a generator of both the query and the document? This assumes that there is a correlation between this likelihood, p(q,d|s) (where d is the document that contains s), and the relevance of the sentence. Thus, we posit that relevance is affected by how well the sentence explains both the document and the query topic (as opposed to the query topic alone). In order to simplify the estimation of the conditional joint probability, we can rewrite it as follows:

$$ p(q,d|s) = p(q|s,d) p(d|s) $$

where p(q|s,d) is the probability of the query given the sentence and document, and p(d|s) is the probability of the document given the sentence. Now we can clearly see that the estimation of the query likelihood will depend on both the sentence and the document. In addition, p(d|s) provides another way in which the local context is captured, by encoding the importance of a sentence within the document. In the next subsections we consider how these probabilities can be estimated.

3.3 Estimating p(d|s)

The probability of generating the document given the sentence, p(d|s), can be regarded as a measure of the importance of the sentence within the topic of the document. Formally, this expression can be rewritten using Bayes’ rule:

$$ p(d|s) = \frac{p(s|d)p(d)}{p(s)} $$

where p(s|d) is the probability of a sentence given a document, p(s) is the probability of a sentence, and p(d) is the prior probability of a document. Here, we assume that there is no a priori preference towards any of the documents, and treat p(d) as a constant. Footnote 2 p(s|d) represents how likely the sentence is to be generated from the document, whereas p(s) represents how likely the sentence is to be generated at random. The ratio between the two expresses the importance of the sentence. Hence, in order to estimate p(d|s), we compute p(s) as:

$$ p(s) = \prod_{t \in s} p(t)^{c(t,s)} $$

where p(t) can be calculated using the maximum likelihood estimator of the term in a large collection: p(t|C) (where C is the collection). Analogously, we define the probability of a sentence s given a document d as:

$$ p(s|d) = \prod_{t \in s} p(t|d)^{c(t,s)} $$

where p(t|d) is the probability of generating t from the maximum likelihood estimator of the document, and c(t,s) usually equals one since most terms appear only once in a sentence (unless the term is a stop word). Note that null probabilities cannot arise from these estimates because terms that occur in a sentence will have non-zero probability in the LM of the document. Observe that p(d|s) will give preference to those sentences that are central to the document's topics (i.e. high p(s|d)) but also rare within the collection (i.e. low p(s)). In this paper we carefully study the effect of p(d|s) on performance, and have designed a complete set of experiments comparing the estimation described above against the simplest (and naive) assumption: that p(d|s) is uniform.
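Under the uniform p(d) assumption, the ratio p(s|d)/p(s) can be computed in log space as a rank-equivalent sentence-importance score. The following sketch assumes precomputed maximum likelihood estimates held in dictionaries; it is only an illustration of the estimation above.

```python
import math
from collections import Counter

def log_p_d_given_s(sentence, doc_prob, coll_prob):
    """Rank-equivalent log p(d|s) = log p(s|d) - log p(s) + const.

    `doc_prob[t]` is the ML estimate p(t|d) and `coll_prob[t]` is p(t|C);
    both are assumed non-zero for the sentence's terms."""
    score = 0.0
    for t, c in Counter(sentence).items():
        score += c * (math.log(doc_prob[t]) - math.log(coll_prob[t]))
    return score
```

Terms frequent in the document but rare in the collection push the score up, matching the intuition that such sentences are central to the document yet distinctive.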

3.4 Estimating p(q|s,d)

In order to estimate the query likelihood given the sentence and the document, we proceed in a similar manner to the standard approach: first we assume that there is a model θ s,d which generates the query terms, such that the probability of the query given the sentence and the document is:

$$ p(q|s,d) = \prod_{t \in q} p(t|\theta_{s,d})^{c(t,q)} $$

The LM p(t|θ s,d ) is determined by the sentence and the local context denoted by d; thus we can represent the model as a mixture between the probability of a term in the sentence and the probability of a term in a document, which is then smoothed by the background model. The idea is that the terms in the document provide meaning to the sentence, and can improve the estimate of the relevance of a sentence.

For the time being, we assume that p(t|d) is the normalized term frequency of t in d, but later we explore restricting this estimate to the sentences surrounding the sentence s.

There are several ways in which a mixture model can be defined using smoothing:

3.4.1 Three mixture model (3MM)

The first model we propose here is a mixture of three LMs. This model assumes that queries are generated from a mixture of three different probability distributions: a LM for the sentence, p(t|s), a LM for the document, p(t|d), and a LM for the collection, p(t|C) (or, simply, p(t)). Formally, we define this approach as:

$$ p(t|\theta_{s,d}) =\lambda p(t|s) + \gamma p(t|d) + (1-\lambda -\gamma) p(t) $$

where λ and γ are smoothing parameters such that λ, γ ∈ [0, 1]. This estimator was initially proposed by Murdock (2006). Other authors have also applied 3MMs to other tasks such as question answering (Xue et al. 2008). Since the 3MM is very general, it is worth considering alternatives which smooth the sentence with the document and the collection but in a length-dependent way. This can be achieved either by first smoothing with the document proportionally to the sentence length and then interpolating with the collection (i.e. the two-stage model), or by first interpolating the sentence and the document and then smoothing with the collection proportionally to the sentence length. We shall detail these methods next.
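A minimal sketch of the 3MM estimator in (9); the dictionary-based LMs and default parameter values are illustrative assumptions, not the tuned settings used later in the paper.

```python
def p_3mm(t, s_prob, d_prob, c_prob, lam=0.4, gamma=0.3):
    """Three-mixture model p(t|theta_{s,d}): sentence, document and
    collection LMs interpolated with weights lam, gamma, 1-lam-gamma."""
    return (lam * s_prob.get(t, 0.0)
            + gamma * d_prob.get(t, 0.0)
            + (1 - lam - gamma) * c_prob.get(t, 0.0))
```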

3.4.2 Two-stage model (2S)

The two-stage model adopted here is a variant of the well-known two-stage model used for document retrieval (Zhai and Lafferty 2002). This model is a combination of Dirichlet (DIR) and Jelinek–Mercer (JM) smoothing. Rather than smoothing with the collection model in both stages, we adapt here the model to the characteristics of the SR task and, therefore, the DIR stage uses p(t|d) while the JM stage uses p(t) for smoothing purposes. This is a simple and natural application of the two-stage smoothing for our problem. The formal expression is:

$$ p(t|\theta_{s,d}) = (1-\lambda) \frac{c(t,s) + \mu p(t|d)}{c(s) + \mu} + \lambda p(t) $$

3.4.3 Two-stage model, stages inverted (2S-I)

We propose here a two-stage model where the order in which DIR and JM smoothing methods are applied is inverted:

$$ p(t|\theta_{s,d}) = ( 1 - \beta ) ( (1-\lambda) p(t|s) + \lambda p(t|d) ) +\beta p(t) $$

where \(\beta = \frac{\mu}{c(s)+\mu}\). The sentence model is first smoothed using linear interpolation with the document's model. Next, Dirichlet smoothing is applied with the collection model. Footnote 3 By smoothing in this way, the first stage provides a new estimate of the foreground terms by combining the sentence and the document (through linear interpolation), and then the next stage adjusts the estimates with the background language model proportionally to the length of the sentence. By inverting the smoothing methods, different length normalization schemes are applied to the sentence language models. In later sections, we shall analytically and empirically show how the 2S and 2S-I models differ in this respect.
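The difference between the two orderings can be made explicit in code. This sketch assumes raw counts and precomputed document/collection LMs in dictionaries, with illustrative parameter values.

```python
def p_2s(t, c_ts, s_len, d_prob, c_prob, lam=0.5, mu=100.0):
    """Two-stage model (2S): Dirichlet smoothing with the document
    model, then JM interpolation with the collection model."""
    dir_part = (c_ts + mu * d_prob.get(t, 0.0)) / (s_len + mu)
    return (1 - lam) * dir_part + lam * c_prob.get(t, 0.0)

def p_2s_i(t, c_ts, s_len, d_prob, c_prob, lam=0.5, mu=100.0):
    """Inverted two-stage model (2S-I): JM interpolation of sentence
    and document models, then Dirichlet-style smoothing with the
    collection model."""
    beta = mu / (s_len + mu)
    s_prob = c_ts / s_len if s_len else 0.0
    foreground = (1 - lam) * s_prob + lam * d_prob.get(t, 0.0)
    return (1 - beta) * foreground + beta * c_prob.get(t, 0.0)
```

Note that in 2S the document model enters with a sentence-length-dependent weight, while in 2S-I it is the collection model that receives the length-dependent weight β.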

Observe that DIR and JM smoothing can also be included within this framework by assuming that p(q|s,d) = p(q|s) and applying DIR or JM to estimate the likelihood. If p(d|s) is uniform, then these models are equivalent to the ones discussed in Sect. 3.1. However, if p(d|s) is not uniform then we get a novel combination of these popular smoothing strategies with the estimation of the importance of sentences in documents. Table 1 summarizes the different proposed models and indicates which configurations are novel (and, therefore, have not been tested in the literature).

Table 1 Language models included in our study

3.5 Baseline sentence retrieval models

For completeness, we also include the score functions for popular SR models, tfisf (Allan et al. 2003) and BM25 (Robertson et al. 1999), which we shall employ as baselines. tfisf has been adopted in the literature as the state-of-the-art sentence retrieval method (Allan et al. 2003). In Losada and Fernández (2007) we demonstrated that it performs similarly to a tuned BM25. Our BM25 baseline is a simple adaptation of the popular BM25 formula used in document retrieval to the SR case, such that:

$$ sim_{\rm BM25}(s,q) = \sum_{t \in q \cap s} \log \frac{N - sf(t) + 0.5}{sf(t) + 0.5} \cdot \frac{(k_1+1) c(t,s)}{k_1 \left( (1-b) + b \frac{c(s)}{avsl}\right) + c(t,s)} \cdot \frac{(k_3+1)c(t,q)}{k_3 + c(t,q)} $$

where N is the number of sentences in the collection, sf(t) is the number of sentences that contain t, avsl is the average sentence length, and k 1, b and k 3 are parameters.
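A direct transcription of the sentence-level BM25 formula above; the count-dictionary inputs and default k1/b values are illustrative assumptions (the paper tunes these on a training collection).

```python
import math

def sim_bm25(query_counts, sent_counts, s_len, N, sf, avsl,
             k1=1.2, b=0.75, k3=0.0):
    """Sentence-level BM25 score. With k3 = 0 (as used for short
    queries in the paper) the query-frequency factor reduces to 1."""
    score = 0.0
    for t, c_tq in query_counts.items():
        c_ts = sent_counts.get(t, 0)
        if c_ts == 0:
            continue  # only terms in both query and sentence contribute
        idf = math.log((N - sf[t] + 0.5) / (sf[t] + 0.5))
        tf = ((k1 + 1) * c_ts) / (k1 * ((1 - b) + b * s_len / avsl) + c_ts)
        qtf = ((k3 + 1) * c_tq) / (k3 + c_tq)
        score += idf * tf * qtf
    return score
```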

On the other hand, we also used tfisf, which is a state-of-the-art SR baseline. This measure is an adaptation of tf-idf at the sentence level:

$$ sim_{\rm tfisf}(s,q) = \sum_{t \in q \cap s} \log(c(t,q)+1) \log(c(t,s)+1) \log \left( \frac{N+1}{0.5 + sf(t)} \right) $$

Unlike the BM25 method, this method is parameter-free. Its performance for sentence retrieval has been shown to be comparable to the best performance obtained by BM25 (Losada 2008; Losada and Fernández 2007).
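The tfisf score is likewise a short loop over the terms shared by the query and the sentence; the dictionary-based inputs are an assumption of this sketch.

```python
import math

def sim_tfisf(query_counts, sent_counts, N, sf):
    """Parameter-free tfisf score: log-damped query and sentence term
    frequencies weighted by inverse sentence frequency."""
    score = 0.0
    for t, c_tq in query_counts.items():
        c_ts = sent_counts.get(t, 0)
        if c_ts == 0:
            continue
        score += (math.log(c_tq + 1) * math.log(c_ts + 1)
                  * math.log((N + 1) / (0.5 + sf.get(t, 0))))
    return score
```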

Besides these models, we also experimented with variants of tfisf and BM25 that support the combination of sentence and contextual statistics. These variants are discussed in Sect. 4.2.

4 Empirical study

This section presents the experimental methodology employed to thoroughly evaluate the performance of the proposed models against existing and state of the art models. Particular attention is paid to examining the differences in performance brought about by the inclusion of the local context. Specifically, we hypothesize that:

  1. localized smoothing will improve the estimate of the sentence models, resulting in improved effectiveness, and

  2. the centrality of a sentence in a document helps to infer the relevance of a sentence, i.e. sentences that briefly summarize a document tend to be more relevant than the rest of the sentences in the document.

4.1 Experimental setup

As previously mentioned, we adopt the SR task as defined in the TREC novelty tracks: given a textual query that represents an information need, a ranked set of documents is supplied and systems have to process this ranking to extract the sentences that are estimated to be relevant to the information need. Along with this definition, we used all three TREC Novelty Track collections: 2002, 2003 and 2004 (Harman 2002; Soboroff 2004; Soboroff and Harman 2003). Each collection provides the same sentence retrieval task, but under different conditions. In TREC 2002, the track contains 50 topics, extracted from earlier ad hoc tracks. TREC 2003 and TREC 2004 also contain 50 topics each, but these were built specifically by assessors for this task. Because the aim in TREC 2002 and TREC 2003 was to find relevant sentences in relevant documents, all the documents in the ranked lists of those collections are relevant. In contrast, in TREC 2004 the ranked set of documents contains both relevant and non-relevant documents. In TREC 2002, on average, only 2% of sentences were judged as relevant, while in TREC 2003 and TREC 2004 the proportion of sentences judged as relevant is higher (39.07% and 15.97%, respectively). All of these collections include complete relevance judgments (i.e. human assessors judged every sentence in the retrieved documents as relevant or non-relevant). By using all three test collections it is possible to assess the robustness of the sentence retrieval methods and thoroughly evaluate their performance.

The baseline methods and the LM models were implemented using the Lemur toolkit. Footnote 4 For the experiments, each collection was indexed with standard stop words removed and no stemming applied. The corresponding set of topics for each collection was used, where short queries were constructed from the title field of the TREC Topic. Observe that we use short queries while the teams participating in the TREC novelty tracks were allowed to use the whole topic. This means that the results presented here are not directly comparable to the official TREC results.

For all of our experiments, we report the performance of each method using three standard measures: precision at ten sentences (P@10), mean average precision (MAP) and R-Prec. Observe that the proposed models are recall-oriented in nature, so we would expect to witness gains in terms of MAP, and to some extent R-Prec. This is because the new models are able to promote sentences that do not necessarily match many query terms, but whose context matches some of the query terms. This should enhance the recall of relevant sentences (in particular sentences which may not overlap with the query terms). The usefulness of recall in sentence retrieval can be illustrated using the application scenario presented in the TREC novelty track (Harman 2002): a user is examining the ranked list of documents, and is interested in reviewing all the on-topic sentences but wants to skip through the non-relevant sentences. In this case, navigation could be made more efficient so that the user can traverse all the relevant sentences in all the documents. In the context of multi-document summarization, having access to all the relevant sentences is also very important. However, the precision-oriented measures, P@10 and to some extent R-Prec, are also important for tasks like query-biased summarization, snippet generation, and question answering. Ideally, the proposed models will be able to enhance both precision- and recall-based measures, but are likely to gain the largest improvements in terms of recall.

To compare the differences in performance between the different methods, statistical significance tests were applied using the t-test with a 95% confidence level. Footnote 5

During the course of our experiments, each method presented in Sect. 3 was evaluated. Since many of the methods required parameter tuning, we ensured a fair comparison by employing a train-test methodology. Training of each method (except tfisf, which is parameter-free) was performed on one of the three TREC novelty datasets. For BM25 we considered the following range of values: k 1 = 1.0–2.0 (steps of 0.1), b = 0.0–1.0 (steps of 0.1) and k 3 was fixed to 0 (the effect of k 3 is negligible with short queries). For the LM methods, λ was set to 0.1–0.9 (steps of 0.1), the range of values of μ (for 2S and 2S-I) was {1, 5, 10, 25, 50, 100, 250, 500, 1,000, 2,500, 5,000, 10,000} and the range of values for γ (for the 3MM model) was 0.1–0.9 (steps of 0.1). The parameter settings showing best performance were then fixed. These were then used to conduct the remainder of the evaluation, which was performed on the two remaining datasets. We experimented with the three possible training/testing configurations (training with TREC 2002 and testing with TREC 2003 and TREC 2004; training with TREC 2003 and testing with TREC 2002 and TREC 2004; and training with TREC 2004 and testing with TREC 2002 and TREC 2003) and found the same trends. In the next sections we report and discuss the results achieved by training with TREC 2002 and testing with TREC 2003 and TREC 2004. However, we include the results for the other training/testing configurations in “Appendix” to further demonstrate that our methods are robust.

Three models may be needed in order to estimate the relevance of a sentence: a sentence model, a local context model (where all the sentences in the document or the surrounding sentences were considered, depending on the type of smoothing applied) and the background model (which is generated from all the documents in the collection).

When evaluating the LM approaches, we considered different alternatives. On the one hand, we study the impact of p(d|s) to specifically assess the effect that this extra and novel component has on SR effectiveness. On the other hand, we considered two different contexts: the document (as shown in Sect. 3) and the surrounding sentences (see the subsection below).

4.1.1 Smoothing with surrounding sentences

In the previous sections we studied smoothing methods that included p(t|d) within the sentence model, where p(t|d) was estimated using the maximum likelihood estimate of a term in a document. This implies that all terms in the document are related to the sentence. Here, we propose an alternative estimate of p(t|d) which relaxes this assumption, and assumes that only the sentences surrounding the sentence being scored are related. So given a sentence s, the sentences immediately preceding and following s are directly related to it and, therefore, they constitute a closer context to the sentence s. In this way, considering the surrounding sentences only, a more accurate representation of the sentence LM should be obtained, which we anticipate will also lead to improved performance.

In this case, given a sentence s, its context c s is composed of the previous sentence s prev, the sentence s itself and the next sentence in the document, s next. Footnote 6 Smoothing is performed by using p(t|c s) instead of p(t|d) in (9)–(11), where p(t|c s) is the normalized count of occurrences of t in s prev, s and s next.
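Building p(t|c s) is then a matter of pooling counts over the window of sentences. The paper's footnote specifies how document boundaries are handled, so the one-sided fallback below is an assumption of this sketch.

```python
from collections import Counter

def context_probs(sentences, i):
    """ML estimate p(t|c_s) pooled from sentence i and its immediate
    neighbours; sentences at document boundaries get a one-sided
    context (an assumption of this sketch, not from the paper)."""
    window = []
    for j in (i - 1, i, i + 1):
        if 0 <= j < len(sentences):
            window.extend(sentences[j])
    counts = Counter(window)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}
```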

In the next subsection we show the results of this approach and compare them against the results obtained when smoothing with documents instead of surrounding sentences.

4.2 Experimental results

The first set of experiments tested the effect of localized smoothing without p(d|s) (i.e. sentence importance is not considered, all sentences are considered as equally important). Then, we perform a second set of experiments that examines the impact of sentence importance. Finally, we present additional experiments to determine whether or not the baseline models can also be enhanced by including local context.

4.2.1 Influence of localized smoothing

Table 2 reports the parameter settings that optimized performance. With TREC 2002 as the training collection, Table 3 shows the performance of the methods against the baselines on the test collections in terms of P@10, MAP and R-Prec. The table shows the performance of models that use either the document as context, or the surrounding sentences. The best performance is presented in bold. Statistically significant differences between a given result and tfisf are marked with an asterisk, and statistically significant differences w.r.t. standard DIR smoothing are marked with a † (DIR provides the LM baseline, which is referred to as LMB). The test results obtained when TREC 2003 and TREC 2004 were used as the training collection are also provided in the “Appendix”.

Table 2 Optimal parameter settings in the training collection (TREC 2002) for BM25 and LMs without p(d|s)
Table 3 P@10, MAP and R-Prec in the test collections (TREC 2003 & TREC 2004)

In Table 3, where the language models have been trained using TREC 2002, the first prominent result is that the 2S-I smoothing method is the best performing method in terms of MAP and R-Prec. This novel method is significantly better than the tfisf and DIR baselines, whether the surrounding sentences or the entire document are used in the estimate. This is a good result, as it provides a simple and intuitive method that outperforms the long-standing benchmark held on these standard test collections. The results in Tables 11 and 13 also show similar improvements.

In terms of P@10, though, most of the contextually smoothed models perform slightly worse than the baselines. The 2S-I method does provide the best P@10 on the TREC 2004 collection when the surrounding sentences are used to smooth the language models, although this difference is not always statistically significant.

As previously mentioned, this is perhaps to be expected because the proposed methods are more likely to improve recall. Still, it is very encouraging to see that early precision can also be increased if the smoothing parameters are appropriately set. Recall that we have trained the parameters on a separate training collection, so the performance reported here is not necessarily the best that could be obtained with improved parameter estimation methods. For the remainder of this paper, the focus of the discussion will be on performance with respect to the recall-oriented measures, MAP and R-Prec, unless otherwise specified.

In terms of the type of smoothing, i.e. using surrounding sentences or documents, there were no significant differences between the performance obtained with the different estimates, though using the complete document was slightly better overall. The other notable point is that the 3MM and 2S localized smoothing methods did not improve performance. This suggests that the advantage of the 2S-I smoothing method over these other smoothing methods may not necessarily stem from the local information used. We explore the reasons in Sect. 4.3.
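To make the localized smoothing idea concrete, the following sketch shows a localized Dirichlet estimate and a two-stage style estimate in which the sentence's own document, rather than only the collection, acts as the reference model. This is an illustration with hypothetical toy unigram models (`doc_lm`, `col_lm` are assumed precomputed dictionaries); the exact 2S and 2S-I estimates are defined earlier in the paper.

```python
import math
from collections import Counter

def p_dir_local(term, sent, doc_lm, mu=100.0):
    """Dirichlet smoothing of a sentence model, localized: the reference
    model is the sentence's document instead of the whole collection."""
    c = Counter(sent)
    return (c[term] + mu * doc_lm.get(term, 0.0)) / (len(sent) + mu)

def p_two_stage(term, sent, doc_lm, col_lm, mu=100.0, lam=0.5):
    """Two-stage (2S) style estimate: Dirichlet-smooth the sentence with
    its local context, then interpolate with the collection model."""
    return (1 - lam) * p_dir_local(term, sent, doc_lm, mu) + lam * col_lm.get(term, 0.0)

def query_likelihood(query, sent, doc_lm, col_lm, mu=100.0, lam=0.5):
    """Rank sentences by the sum of log term probabilities."""
    return sum(math.log(p_two_stage(t, sent, doc_lm, col_lm, mu, lam) + 1e-12)
               for t in query)
```

A sentence whose document language model supports the query terms receives extra probability mass even for terms the sentence itself does not contain, which is precisely the contextualizing effect discussed above.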

4.2.2 Impact of sentence importance

In this set of experiments we considered the influence of the local context stemming from the importance of a sentence within a document. Table 4 reports the best settings in the training collection for the proposed LM methods with the sentence importance component. The performance of each method is shown in Table 5, while Figs. 1, 2 and 3 provide bar graphs of the P@10, MAP and R-Prec of each method with and without p(d|s). It is clear from these results that the inclusion of sentence importance yields significantly better retrieval performance for all the LMs over the state-of-the-art method (tfisf). It appears that the impact of sentence importance dominates that of localized smoothing. For instance, given the query “Chinese earthquake”, the 3MM with sentence importance is able to retrieve the following relevant sentence within the top-10 sentences: “Chinese architects from the Ministry of Construction and Hebei Province and the city of Zhangjiakou have begun work on rebuilding earthquake-damaged parts of Hebei and have completed design work on ten types of residential housing for nine villages as models”. This sentence does not appear in the top-10 of the version of 3MM that does not include sentence importance. The sentence summarizes the document well and, therefore, the p(d|s) factor promotes it.
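The sentence importance factor can be sketched as follows. The exact estimate of p(d|s) is given in Sect. 3.3; here we assume, purely for illustration, that p(d|s) is proportional to the product over sentence terms of p(t|d)/p(t), which is consistent with the later observation (Sect. 4.3) that p(t|d) ≫ p(t) makes this factor promote sentences that are representative of their document.

```python
import math
from collections import Counter

def log_p_d_given_s(sent, doc_lm, col_lm):
    """Illustrative sentence-importance estimate (assumption, not the
    paper's exact Sect. 3.3 formula): log p(d|s) up to a constant,
    summing log(p(t|d)/p(t)) over sentence terms. Terms much more
    likely under the document than under the collection mark the
    sentence as representative of its document."""
    return sum(math.log(doc_lm.get(t, 1e-9) / col_lm.get(t, 1e-6)) for t in sent)

def score_with_importance(query, sent, doc_lm, col_lm, mu=100.0):
    """Combine the query likelihood with the sentence-importance factor."""
    c = Counter(sent)
    ql = sum(math.log((c[t] + mu * doc_lm.get(t, 0.0)) / (len(sent) + mu) + 1e-12)
             for t in query)
    return ql + log_p_d_given_s(sent, doc_lm, col_lm)
```

Under this assumed estimate, every on-topic term contributes a positive amount to log p(d|s), so longer sentences that summarize the document accumulate a higher importance score, matching the behavior of the “Chinese earthquake” example.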

Table 4 Optimal parameter settings in the training collection (TREC 2002) for LMs with p(d|s)
Table 5 P@10, MAP and R-Prec in the test collections (TREC 2003 & TREC 2004)
Fig. 1 P@10 in the test collections (TREC 2003 & TREC 2004) of the LMs with and without sentence importance

Fig. 2 MAP in the test collections (TREC 2003 & TREC 2004) of the LMs with and without sentence importance

Fig. 3 R-Prec in the test collections (TREC 2003 & TREC 2004) of the LMs with and without sentence importance

The different smoothing methods do not differ significantly in effectiveness. Observe also that the performance of 2S-I is not substantially affected by the sentence importance factor.

All the models that include p(d|s) are novel, as previous proposals using LMs are based solely on query likelihood estimations. Note also that the three-mixture model as proposed in Murdock (2006) (i.e. without p(d|s)) performs worse than the strong and weak baselines (results shown in the 5th column of Table 3).

4.2.3 Incorporating context into the baselines

The baseline models (tfisf and BM25) are unaware of the local context. Given the findings we have obtained from incorporating local context in the LM framework, it is natural to wonder whether introducing local context into the baselines can also improve their performance. First, we present several straightforward adaptations of BM25 and tfisf that include local context; then we compare these variations under the same experimental conditions as above.

A natural solution to introduce document statistics into BM25 (Robertson 2005) is to use the extended version of this model that handles multiple weighted fields, i.e. BM25f (Robertson et al. 2004). BM25f estimates the relevance of a document by considering it as a set of components, each of which may be assigned a specific weight within the document. For our case, a sentence (s) can be considered as an aggregate of the sentence itself and the context containing it (i.e. the document or the surrounding sentences provide local context to the sentence). Given these two components, the BM25f model can be instantiated as follows:

$$ sim_{\rm BM25f} (s,q) = \sum_{t\in q \cap s} \log \frac{N - sf(t) + 0.5}{sf(t) + 0.5} \cdot \frac{weight(t,s)}{k_1 + weight(t,s)} \cdot \frac{(k_3+1)c(t,q)}{k_3+c(t,q)} $$
$$ weight(t,s) = \frac{c(t,s) \cdot \alpha}{(1-b_{sen}) + b_{sen} \frac{c(s)}{avsl}} + \frac{c(t,context) \cdot (1-\alpha)} {(1-b_{context}) + b_{context} \frac{c(context)}{avcl}} $$

where b_sen and b_context are normalizing constants associated with the field length of s and its context, respectively; α is a boost factor that controls the term-frequency mixture between context statistics and sentence statistics; c(context) (resp. c(s)) is the number of terms in the context (resp. in s); c(t,context) is either c(t,d) or c(t,c_s), depending on whether we apply document-level or surrounding-sentences context; and avcl (resp. avsl) is the average context (resp. sentence) length in the collection. To reduce the number of parameters to be tuned, b_context was fixed to 0.75 (the value usually recommended for document length normalization in BM25 (Robertson 2005)), k_1 was set to the optimal value found with BM25 (Table 2), and k_3 was again set to 0. The remaining parameters, α and b_sen, were tuned on the training collection (ranging from 0 to 1 in steps of 0.1).
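The instantiated BM25f scorer can be sketched as follows, assuming precomputed sentence-level statistics (the container names `stats`, `sf`, `avsl`, `avcl` below are illustrative, not from the paper):

```python
import math
from collections import Counter

def bm25f_sentence(query, sent, context, stats, k1=1.2, k3=0.0,
                   alpha=1.0, b_sen=0.0, b_context=0.75):
    """BM25f instantiated for sentence retrieval: the sentence and its
    local context (document or surrounding sentences) are two weighted
    fields, following the sim_BM25f and weight(t,s) expressions above."""
    N, sf = stats["N"], stats["sf"]            # number of sentences, sentence freqs
    avsl, avcl = stats["avsl"], stats["avcl"]  # average sentence/context lengths
    cs, cc, cq = Counter(sent), Counter(context), Counter(query)
    score = 0.0
    for t in set(query) & set(sent):           # sum over t in q ∩ s
        idf = math.log((N - sf.get(t, 0) + 0.5) / (sf.get(t, 0) + 0.5))
        # field-weighted term frequency: sentence field plus context field
        w = (cs[t] * alpha) / ((1 - b_sen) + b_sen * len(sent) / avsl) \
          + (cc[t] * (1 - alpha)) / ((1 - b_context) + b_context * len(context) / avcl)
        score += idf * (w / (k1 + w)) * ((k3 + 1) * cq[t] / (k3 + cq[t]))
    return score
```

Note that with α = 1 the context field contributes nothing, so BM25f collapses to BM25 regardless of the context used, which is exactly the degenerate optimum reported below.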

Regarding tfisf, no extensions have been defined to handle local context and, therefore, we defined ad-hoc adjustments to mix context statistics with sentence statistics. We tested the following variants of tfisf:

  (a) tfmix: c(t,s) is replaced by α·c(t,s) + (1 − α)·c(t,context);

  (b) idfdoc: sf(t) is replaced by df(t) (i.e. the inverse frequency is computed at the document level rather than at the sentence level);

  (c) tfmix + idfdoc: both (a) and (b) are applied.
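The variants can be sketched as follows. The base tfisf weighting used here (a log-tf times inverse sentence frequency product) is an assumption for illustration, but the substitutions follow definitions (a)-(c) above; `N_s`/`sf` denote sentence-level and `N_d`/`df` document-level statistics (illustrative names).

```python
import math
from collections import Counter

def tfisf_variant(query, sent, context, stats, variant="tfmix", alpha=0.6):
    """Sketch of the tfisf context variants: 'tfmix' interpolates sentence
    and context term frequencies; 'idfdoc' computes the inverse frequency
    over documents instead of sentences; 'tfmix+idfdoc' applies both."""
    N_s, sf = stats["N_s"], stats["sf"]   # sentence-level stats
    N_d, df = stats["N_d"], stats["df"]   # document-level stats
    cs, cc = Counter(sent), Counter(context)
    score = 0.0
    for t in set(query) & (set(sent) | set(context)):
        tf = cs[t]
        if variant in ("tfmix", "tfmix+idfdoc"):
            tf = alpha * cs[t] + (1 - alpha) * cc[t]
        if variant in ("idfdoc", "tfmix+idfdoc"):
            isf = math.log(1 + N_d / df.get(t, 1))
        else:
            isf = math.log(1 + N_s / sf.get(t, 1))
        score += math.log(1 + tf) * isf
    return score
```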

At training time, only α needs to be tuned (between 0 and 1 in steps of 0.1). Again, TREC 2002 was the training collection and TREC 2003 and TREC 2004 were the test collections. The optimal performance was reached with b_sen = 0 and α = 1 (BM25f), and α = 1 (tfisf). This means that these models obtain their best performance when the local context is largely ignored! Tables 6 and 7 report the results achieved in the test collections. Not surprisingly, the variations perform virtually the same as the original models. As a matter of fact, BM25f with α = 1 (considering either the surrounding sentences or the document as local context) yields the same SR strategy as BM25. The same happens for tfisf + tfmix (α = 1) with respect to tfisf when the document is considered as the local context. Nevertheless, tfisf + tfmix considering the surrounding sentences (α = 0.6) performs worse than tfisf on TREC 2003 and the same as tfisf on TREC 2004. With idfdoc there are some slight variations in performance with respect to the baseline, but they are insignificant.

Table 6 Performance of the BM25 and its variations (BM25f) to include context in the test collections (TREC 2003 & TREC 2004)
Table 7 Performance of tfisf and its variations to include context in the test collections (TREC 2003 & TREC 2004)

While it appears that local context can be useful, the model in which it is incorporated determines how successfully this evidence can be used. In the Language Modeling approach, the framework provides a natural and intuitive way to encode and incorporate the local context through the smoothing process. However, it is unclear how to effectively incorporate this evidence within the other models. We leave this direction for future work, and instead study more precisely why and how the Language Models are able to capitalize on this additional evidence.

4.3 Analysis

In this section, we conduct a detailed analysis to understand precisely the reasons behind the differences in effectiveness of the LMs designed. To explain the improvements in performance brought about by the 2S-I model when no sentence importance is used, we derived the retrieval formulas associated with these LMs [similar to the derivations in Losada and Azzopardi (2008b) and Zhai and Lafferty (2001)]. The retrieval formulas in sum-log form are shown in Table 8. Examining the models in this way, we can see the differences between the smoothing methods. It is instructive to pay attention to the second addend of these formulas, which usually incorporates some form of length correction. In the DIR and 2S methods, this component penalizes long sentences and acts as a length normalization component (which is useful for document retrieval) (Losada and Azzopardi 2008a). In the JM and 3MM methods, this component is independent of the length of the sentence. In the 2S-I method, however, this component promotes long sentences: a high c(s) means that β is low, which makes the sum greater overall (because, usually, p(t|d) ≫ p(t)).

Table 8 Sum-log retrieval formulas for the SR models based on LMs (without p(d|s))

To illustrate this point further, Fig. 4 shows the behavior of the length correction that the DIR, 2S and 2S-I methods produce with respect to the sentence length. This correction is given by the second addend of the expressions in Table 8. In this example, a query q with three terms (q_A, q_B, q_C) is used, where c(q_A,q) = c(q_B,q) = c(q_C,q) = 1, p(q_A) = 10^−6, p(q_B) = 10^−12, p(q_C) = 10^−3, p(q_A|d) = p(q_B|d) = p(q_C|d) = 10^−2, λ = 0.5 and μ = 100. The sentence length was then varied from 1 to 50 (in steps of 1). Note that in DIR and 2S the correction factor decreases with sentence length, while in 2S-I it increases. This illustrates graphically that the DIR and 2S methods are likely to promote short sentences, while the 2S-I method is likely to promote long sentences.
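The opposite trends can be reproduced numerically. The DIR correction below follows the standard sum-log decomposition of Dirichlet smoothing; the 2S-I style correction is a hypothetical form (an assumption, not the exact Table 8 expression), written only to match the description above that a low β for long sentences increases the addend when p(t|d) ≫ p(t).

```python
import math

# parameters from the worked example in the text (λ is used by the full
# 2S formula and omitted in this simplified sketch)
mu = 100.0
p_bg = {"qA": 1e-6, "qB": 1e-12, "qC": 1e-3}   # collection probs p(t)
p_doc = {"qA": 1e-2, "qB": 1e-2, "qC": 1e-2}   # document probs p(t|d)

def dir_correction(sent_len):
    """Non-matching addend of the Dirichlet sum-log form:
    sum_t c(t,q) * log(mu * p(t) / (mu + c(s))); decreases with c(s)."""
    return sum(math.log(mu * p_bg[t] / (mu + sent_len)) for t in p_bg)

def twostage_inv_correction(sent_len):
    """Hypothetical 2S-I style addend: sum_t log(beta*p(t) + (1-beta)*p(t|d))
    with beta = mu/(mu + c(s)); longer sentences shrink beta, and since
    p(t|d) >> p(t) the addend grows with sentence length."""
    beta = mu / (mu + sent_len)
    return sum(math.log(beta * p_bg[t] + (1 - beta) * p_doc[t]) for t in p_bg)

# correction vs. length, as in Fig. 4
for L in (1, 10, 50):
    print(L, dir_correction(L), twostage_inv_correction(L))
```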

Fig. 4 Effect of the non-matching component (length correction) in DIR, 2S and 2S-I against sentence length. The plots show that the score assigned to a sentence is adjusted in proportion to its length. Note that the 2S-I method favors longer sentences, while the other methods penalize longer sentences

This seems to indicate that promoting long sentences, rather than using more information, is what achieves better performance. Observe also that the best parameter setting in BM25 fixes b to 0 (Table 2), meaning that sentences are not penalized for their length. To further support this claim, we analyzed the average length of sentences in these collections and compared it to the average length of relevant sentences. The average sentence length is around 9 terms in all collections, while the average length of relevant sentences is around 14 terms. Furthermore, we analyzed the top 100 sentences retrieved by every model and found that 2S-I yields an average length of 13.71 and 13.66 (TREC 2003 & TREC 2004, respectively), while the other models retrieve shorter sentences on average (e.g. 3MM retrieves sentences whose average length is 12.68 and 12.67, respectively). These statistics suggest that 2S-I is superior to the other models because it promotes longer sentences, and this is required to achieve better performance for the task of sentence retrieval.
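The top-100 length statistic used above can be computed with a helper such as the following (a trivial sketch; `ranked_sentences` is assumed to be a ranked list of tokenized sentences):

```python
def avg_top_k_length(ranked_sentences, k=100):
    """Average token length of the top-k ranked sentences, the statistic
    used to compare 2S-I (about 13.7 terms) with the other models."""
    top = ranked_sentences[:k]
    return sum(len(s) for s in top) / len(top)
```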

Further to this analysis, it is interesting to note that in the estimation of p(d|s), longer sentences also attract a higher probability. In Table 9 and Fig. 5 we compare the performance of the DIR and JM methods against variants of them that incorporate a sentence length prior. These variants significantly outperform their corresponding original versions. However, they do not outperform the 2S-I model and, therefore, sentence length is not the only component that makes the 2S-I model effective.
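A minimal sketch of the length-prior variant, assuming a prior p(s) proportional to sentence length (one simple choice; the exact prior estimate used in the experiments may differ):

```python
import math
from collections import Counter

def dir_score(query, sent, col_lm, mu=100.0):
    """Standard Dirichlet-smoothed query likelihood for a sentence."""
    c = Counter(sent)
    return sum(math.log((c[t] + mu * col_lm.get(t, 1e-9)) / (len(sent) + mu))
               for t in query)

def dir_with_length_prior(query, sent, col_lm, mu=100.0):
    """DIR variant with a sentence-length prior: rank-equivalent to
    multiplying the query likelihood by p(s) proportional to c(s)."""
    return dir_score(query, sent, col_lm, mu) + math.log(len(sent))
```

With the prior, a long sentence that matches the query can overtake a short one that plain DIR would rank first, reproducing the effect measured in Table 9.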

Table 9 Comparison of DIR and JM against their variants with the sentence length prior (trained with TREC 2002 and tested on TREC 2003 and TREC 2004)
Fig. 5 Comparison of DIR and JM against their variants considering a sentence length prior (trained with TREC 2002 and tested on TREC 2003 and TREC 2004)

Observe that p(d|s), as estimated in Sect. 3.3, is a factor that favors long sentences (because, for the vast majority of the terms in a sentence, p(t|d) ≫ p(t)). This explains why 2S-I does not receive any significant benefit from p(d|s) (as 2S-I already retrieves many long sentences), while the other LM techniques receive significant increases. As a matter of fact, analyzing the top 100 sentences retrieved by each method with p(d|s), we found that the average lengths are quite uniform across models (around 20 terms). This analysis suggests that the local context indirectly promotes longer sentences, which results in improved retrieval effectiveness.

4.3.1 Summary and discussion

To sum up, incorporating the importance of sentences within documents, p(d|s), improves the performance of the LMs significantly beyond the existing state of the art. When p(d|s) is ignored, 2S-I is the only approach whose document-level smoothing handles the retrieval of long sentences well.

It is quite remarkable that every LM method with p(d|s) is superior to the baselines. This suggests that retrieval methods such as tfisf and BM25 are limited because they are simple adaptations of document retrieval techniques: they include corrections to avoid retrieving too many long texts (e.g. b in BM25), but they lack the opposite tool, a correction that promotes longer texts. Standard models without length normalization (tfisf, or BM25 with b set to 0) already have some tendency towards long pieces of text (because long sentences match more terms), but, given our findings, this is not sufficient to improve the models' performance. However, this also opens the door to future developments or extensions of current SR models that account for this tendency. Such work will also help to clarify whether the substantial benefits reported here come exclusively from promoting long sentences or, on the contrary, whether the combination of retrieving long sentences and localized smoothing is the reason behind such good performance.

5 Conclusions and future work

In this paper, we proposed several novel probabilistic LMs to address the SR problem by including the local context. The context provided by the document meant that the estimate of relevance was based on the sentence, the document and the query. As part of the sentence language model, localized smoothing was included to provide a better estimate of the probability of a term in a sentence. The importance of sentences within the document was also included in our models. In a comprehensive set of experiments performed over several TREC test collections, we have compared the proposed models against existing SR models. Our experiments showed that using both forms of local context significantly outperforms the standard LM approach applied to sentence retrieval and the current state of the art sentence retrieval models. This is an important advancement in the development of effective SR methods. More specifically, it was found that:

  • Using localized smoothing (2S-I) improves the performance of the LM methods by up to 13.8% in mean average precision (MAP).

  • Including sentence importance significantly improves the performance of all the LM approaches.

  • LMs that use local context significantly outperform the current state of the art.

It was also shown that the improvements in the proposed methods were partly due to their tendency to favor longer sentences. This finding demonstrates that the naive application of document retrieval models to other retrieval tasks can lead to non-optimal performance; and warrants the development of sentence retrieval methods which account for the length normalization problem. These findings suggest that further progress in the area of sentence retrieval is possible, and that more sophisticated, and more effective models can be developed by incorporating the local context within the LM framework. This work motivates future research and development on:

  (i) developing other methods in a principled fashion to also include local context, e.g. changing the vector representation in tfisf, including a sentence importance factor, or including the local context in the classic Probabilistic Model for IR;

  (ii) considering a variable number of surrounding sentences, instead of only the closest ones (previous and next);

  (iii) defining a four-mixture model that combines the sentence, the local context, the document and the background model;

  (iv) modifying pivoted length normalization (Singhal et al. 1996) or BM25 to perform SR that promotes long sentences, or using sentence priors for LMs to investigate the length normalization issues;

  (v) exploring other estimation methods for the LMs and priors, along with automatic parameter estimation techniques; and

  (vi) applying and extending the Language Modeling framework to other tasks, such as query-biased summarization or novelty detection.