1 Introduction

An information retrieval (IR) system is supposed to fulfil every information need of its users with an acceptable level of satisfaction, where a posed information need is a query that may be formulated by any inquirer. Given a set of queries, high average retrieval effectiveness is, in this respect, a necessary but not a sufficient criterion. An IR system may have a relatively high level of retrieval effectiveness on average while still failing abjectly on a few queries. Averaging over a set of queries in general hides the per-query effectiveness of IR systems (Voorhees 2004).

In addition to high average retrieval effectiveness, an IR system should also be robust in per-query effectiveness, robust in the sense of making no abject failure on any query. We argue in this study (Sect. 2) that robustness in per-query effectiveness can be maintained by means of a selective approach that predicts which retrieval strategy should be applied to which query, given that a retrieval strategy may perform well on one query but poorly on a second, while other strategies perform poorly on the first query but succeed on the second (Buckley 2009).

In this article, we propose a selective approach to index term weighting (Sect. 3). The approach is of the pre-retrieval type, where the model selection is made before the actual search takes place. It predicts the best model among a set of 8 well-established probabilistic term weighting models, namely BM25, PL2, DFRee, DPH, DLH13, LGD, DFIC and the language model with Dirichlet smoothing (DLM). For any given query, the best model is determined by utilizing only one source of information, the frequency distributions of (query) terms over the target document collection. As a feature, term frequency distributions relate the underlying assumptions of probabilistic term weighting models to queries, and hence they provide information on the expected effectiveness of the models.

The contributions of the work presented in this paper can be summarized as follows:

  • A query-based selective term weighting algorithm that predicts the best index term weighting method, for any given query, among a predefined set of index term weighting methods (Sect. 3).

  • An empirical justification in support of the claim that probabilistic index term weighting models can be characterized, with respect to retrieval effectiveness, on the basis of the frequency distributions of query terms over documents.

From the results of the experiments presented in this article (Sect. 4), we observe that the proposed approach is on average more effective and also more robust than the 4 strong baselines considered, including the current state-of-the-art selective approach to index term weighting in the IR literature (He and Ounis 2003b, 2004). On the other hand, we also note that there is still room for improvement in this research direction. On this account, the experimental results reveal that the proposed approach shows a significantly lower performance on average than an optimal/oracle approach that could predict the most effective model for any given query with 100% accuracy. We speculate on the reasons behind the latter observation and discuss possible improvements over the proposed approach in the discussion section (Sect. 5).

The experimental evaluations presented in this article are performed using the set of 200 queries released by the TREC Web track studies performed between 2009 and 2012, and the set of 562 queries released by the TREC Million Query (MQ) track study in 2009. The official document collection used in those two studies is the “ClueWeb09 collection” (Callan et al. 2009). For the set of 200 TREC Web track queries, we use the English portion of the ClueWeb09 collection, consisting of about 500 million English Web pages, and for the set of 562 MQ track queries, the Category B subset of the English portion, consisting of about 50 million English Web pages. These subsets are the original subsets used in the corresponding TREC tracks. The details of the data sets and the experimental setup are given in the “Appendix.”

2 Motivation

The reliable information access (RIA) workshop (Footnote 1) was the pioneering effort to investigate the factors that affect the variability in retrieval effectiveness; its goal was to perform a per-query failure analysis of individual IR systems (Harman and Buckley 2004). A major result of the RIA workshop is that most of the failures could in fact be fixed by applying an existing retrieval strategy. On this account, Harman and Buckley (2009) later state that “it may be more important for research to discover what current techniques should be applied to which topics [i.e. queries], rather than to come up with new techniques.”

Selective information retrieval is in theory capable of fulfilling every information need of users with a level of satisfaction that the current IR strategies could jointly provide. Table 1, for instance, lists the 50 TREC 2012 Web track queries along with the highest nDCG@100 score observed over all of the participating IR systems. The average of the per-topic highest nDCG@100 scores over the 50 queries is 0.4239, whereas the mean nDCG@100 score of the most effective IR system participating in the TREC 2012 Web track is 0.2784. In this instance, selective information retrieval is therefore capable of being roughly 1.5 times as effective as the most effective single IR system, on average.

Table 1 The highest nDCG@100 scores observed for 50 TREC 2012 Web Track queries over all of the participating IR systems

In the context of selective information retrieval, it is presumed that a query that one particular retrieval strategy fails to fulfil can be fulfilled by another existing retrieval strategy. Basically, the success of any selective approach depends on the degree to which this fundamental assumption holds in practice. As seen in Table 1, the assumption holds for the TREC 2012 Web track queries.

Actually, the truth of this assumption in turn depends on the richness or diversity of the alternative retrieval strategies among which the selection is made, simply because similar retrieval strategies in general show similar performances on the same queries. In this respect, the potential retrieval effectiveness of a selective approach increases as the number of distinct retrieval strategies increases.

In the TREC 2012 Web track, the total number of participating IR systems is 48, and the number of distinct systems yielding the highest nDCG@100 score for at least one of the 50 queries in Table 1 is 24. This means that a set of 24 distinct retrieval strategies is diverse enough to be roughly 1.5 times as effective as the state-of-the-art TREC 2012 IR system, on average.

A full-fledged IR system, such as the systems participating in the TREC Web track, usually employs a multi-stage retrieval strategy (Mackenzie et al. 2018), including query expansion techniques, index term weighting models, learning-to-rank techniques, spam filtering, etc. Index term weighting is the core component of such multi-stage retrieval strategies, since it quantifies the degree of relevance between a document and a given query. Thus, the resulting effectiveness of any retrieval strategy basically depends on the effectiveness of the index term weighting model in use. In this respect, it can be said that the key to effective retrieval is to determine what index term weighting model should be applied to which query.

In this study, we consider a set of probabilistic term weighting models that is diverse enough to cover the major approaches in the IR literature, including the information theoretic models (e.g. LGD), the language models (e.g. DLM), the divergence from randomness models (e.g. PL2, DFRee, DPH and DLH13), the divergence from independence models (e.g. DFIC) and Harter’s two-Poisson model (e.g. BM25). The models under consideration are listed in Table 2, along with the probability distribution that each model assumes.

Table 2 The 8 probabilistic index term weighting models and the probability distributions that each model assumes for the frequency distributions of (query) terms

Table 3 lists the same TREC 2012 Web track queries along with the highest nDCG@100 scores observed over those term weighting models. The average of the per-topic highest nDCG@100 scores is 0.1760. The most effective single model is “PL2,” with a mean nDCG@100 score of 0.1368. This means that, for these term weighting models, an optimal selective approach could provide nearly a 30% increase in average nDCG@100, compared to the most effective single term weighting model.

Table 3 The highest nDCG@100 scores observed for 50 TREC 2012 Web Track queries over the eight index term weighting models under consideration

Table 3 also shows the within-query performance variation among the 8 term weighting models, in the column labeled “CoV.” Here, within-query performance variation is expressed as a standardized measure of dispersion, the “Coefficient of Variation.” For each query in Table 3, the associated “CoV” value is the ratio of the standard deviation (s) of the 8 models’ within-query nDCG@100 scores to the corresponding mean (\(\mu\)) of the scores, i.e., \(s/\mu\). For this reason, the Coefficient of Variation is also known as the relative standard deviation, i.e., s “relative” to \(\mu\). In the current context, the Coefficient of Variation can be interpreted as how informative a query is with respect to the performance differences between the 8 term weighting models under consideration. Since the Coefficient of Variation is a standardized measure of dispersion, CoV values can be compared with each other. In other words, two queries with different within-query mean scores may have the same CoV value, and hence provide equal information on the within-query rankings of the 8 term weighting models, irrespective of the mean scores. A “supervised” selective approach to index term weighting is thus likely to benefit more from queries with high CoV values than from queries with low CoV values.
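For reference, the CoV of a single query can be computed from the 8 models’ within-query scores as follows; this is a minimal sketch, and the function name and the use of the sample standard deviation (ddof=1) are our choices rather than details taken from the paper.

import numpy as np

def coefficient_of_variation(scores):
    # scores: the within-query nDCG@100 values of the 8 models for one query;
    # queries on which every model scores 0 are discarded elsewhere, so mean > 0
    scores = np.asarray(scores, dtype=float)
    return scores.std(ddof=1) / scores.mean()   # CoV = s / mu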

In addition to improved average retrieval effectiveness, selective approaches are also capable of providing robustness, in a way that the “risk-sensitive” measures of IR can quantify (Collins-Thompson 2009). Risk-sensitive measures assess the extent to which a system is more effective for a given query than a baseline system. For any given query, the baseline effectiveness can, in general, be thought of as the level of performance that a state-of-the-art IR system would, on average, show for that query. In this respect, as the per-query effectiveness of an IR system increases, the level of robustness of the system increases. In particular, Table 3 represents an instance of the highest level of robustness, in terms of nDCG@100, that an optimal selective approach could achieve by using the 8 index term weighting models under consideration. Similarly, Table 1 represents an instance of the highest level of robustness with respect to full-fledged IR systems. This notion of robustness can be quantified directly by the Geometric Risk measure, “GeoRisk” (Dinçer et al. 2016), as demonstrated in Sect. 4 (Results).

Although the uncertainty associated with a selective approach to index term weighting is relatively high (with 8 term weighting models, a blind guess would pick a sub-optimal model 87.5% of the time) and the per-query effectiveness of a term weighting model is difficult to estimate, probabilistic term weighting models have a common property that in fact enables selective term weighting. Every probabilistic term weighting model makes an assumption (Table 2) about the shape of the frequency distributions of terms over documents, and this property can be exploited for selective term weighting, as explained in the next section.

3 The proposed selective approach to index term weighting

The key to success in selective term weighting is to determine a source of information that explains the variation in the retrieval effectiveness of individual term weighting models across queries. In other words, the retrieval effectiveness of term weighting models should somehow be related to the characteristics of queries, in a way that permits predicting the model that is most likely to show the highest performance for any given query. Here, we argue that one of the primary sources of information for this purpose is the observed frequency distributions of query terms over the document collection in use.

A simple but powerful selective approach to index term weighting can be built upon pairwise query similarity, as demonstrated in the inspiring works of He and Ounis (2003b, 2004). In this approach, it is assumed that the same term weighting model will show similar levels of effectiveness for two queries that are similar to each other in terms of some measurable query characteristics. The application of such an approach can vary in practice depending on the measure of similarity adopted and the query characteristics chosen for similarity measurements. For instance, He and Ounis (2004) use Euclidean distance as the measure of similarity, and vectors of 3 query properties for similarity measurements, where the properties are (1) the number of terms in a given query, (2) the number of documents that contain at least one of the query terms, and (3) the ratio of the minimum Inverse Document Frequency (IDF) to the maximum IDF associated with the query terms.

In this study, we use only a single query property, the frequency distributions of query terms, and, as the similarity measure, we use the Chi-square statistic.

3.1 The Chi-square statistic as a query similarity measure

We claim that the frequency distributions of query terms over documents, as a query characteristic, can explain the variation in the retrieval effectiveness of individual index term weighting models across queries. The underlying theoretical basis for this claim is simple and can be expressed as follows. Every probabilistic term weighting model assumes a particular probability distribution (e.g., Poisson, Geometric, etc.) for the observed term frequencies on documents (i.e., the empirical distribution). Such an assumed probability distribution characterizes the corresponding term weighting model with respect to the degree of relevance quantified by the model for a given document-query pair. Thus, for any given query, the effectiveness of a probabilistic term weighting model is expected to be proportional to the goodness-of-fit between the assumed probability distribution and the actual distribution of term frequencies on documents. This implies that any probabilistic term weighting model should show similar performance for queries that are similar to each other with respect to term frequency distributions. Indeed, the results of the experiments presented in Sect. 4 provide empirical evidence in support of this claim.

To measure the similarity in distribution between two queries, we use Pearson’s Chi-square statistic. The Chi-square statistic, which is the test statistic of the Chi-square goodness-of-fit test (Agresti 2002; Press et al. 2007), can be expressed for the frequency distributions of two (query) terms, \(t_1\) and \(t_2\), as:

$$\begin{aligned} \chi ^2 = \sum _{i=0}^{n} \frac{\left[ F_{t_1}(i) - F_{t_2}(i) \right] ^2}{F_{t_1}(i) + F_{t_2}(i)} \end{aligned}$$
(1)

In Eq. 1, n denotes the number of relative frequency groups taken into account (here, \(n=1000\)), and \(F_{t_1}(i)\) and \(F_{t_2}(i)\) (\(i=0,1,2,\ldots ,n\)) refer to the observed document density in the ith bin for the terms \(t_1\) and \(t_2\), respectively. In particular, \(F_{t_1}(0)\) and \(F_{t_2}(0)\) refer to the density of documents at a relative term frequency of 0 for \(t_1\) and \(t_2\), respectively. A low \(\chi ^2\) value implies a high degree of goodness-of-fit, and \(\chi ^2=0\) indicates a perfect fit between the two term frequency distributions. It is worth noticing here that, for any term t, the document density at 0, \(F_{t}(0)\), increases with the inverse document frequency (IDF) of the term t.
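For reference, Eq. 1 can be computed directly on two binned term distributions. The following is a minimal sketch; the function name and the skipping of bins that are empty for both terms (to avoid division by zero) are our assumptions.

import numpy as np

def chi_square_distance(f_t1, f_t2):
    # f_t1, f_t2: binned document densities F_t(0..n) of two terms (see Eq. 1)
    f_t1 = np.asarray(f_t1, dtype=float)
    f_t2 = np.asarray(f_t2, dtype=float)
    denom = f_t1 + f_t2
    nonempty = denom > 0   # bins empty for both terms contribute nothing
    return float(np.sum((f_t1[nonempty] - f_t2[nonempty]) ** 2 / denom[nonempty]))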

Every probabilistic term weighting model applies some form of normalization to within-document raw term frequencies (He and Ounis 2003a, 2005), in order to avoid a bias towards long documents when quantifying the relevance of a document to a given query. The reason behind this practice can be explained as follows. A term may occur in two documents with the same frequency, while the documents in general differ in length. On the other hand, for any given term, the number of occurrences of the term is expected to increase as the length of the document increases. Probabilistic term weighting models assume that the number of occurrences of a query term in a particular document is proportional to the relevance of the document to the query. Taken at face value, this would imply that longer documents are more relevant than shorter documents to any given query, which is not always true. Hence, in order to make such frequency values comparable across documents, probabilistic term weighting models employ “document length normalization.” For this purpose, we use relative term frequencies, i.e. the ratio of the number of occurrences of a term in a document to the length of that document.

Here, raw term frequencies constitute a discrete distribution, whereas relative term frequencies constitute a continuous one. The Chi-square statistic can only be applied to discrete distributions. Thus, for any given term, the calculated relative frequencies have to be grouped into a finite number of bins in order to obtain the required discrete distribution. Under our normalization scheme, relative term frequencies vary between 0 and 1. Excluding the relative frequency value of 0, we divide this range into 1000 intervals of equal length: (0.000–0.001], (0.001–0.002], and so on. The value 0 is a special case because, in contrast to those 1000 bins, it refers to the density of the documents in which the term under consideration does not occur. In query similarity measurements, we take into account both the relative frequency value of 0 (as a separate group) and the relative frequency values grouped into the 1000 bins, so that the calculated relative frequencies for any given term sum to 1 over all of the documents in the target collection, i.e. they form a proper probability distribution.
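A minimal sketch of this binning scheme, assuming per-document raw term counts and document lengths are available; the function and variable names are illustrative and not taken from the paper.

import numpy as np

def binned_relative_frequencies(term_counts, doc_lengths, n_bins=1000):
    # term_counts[d]: raw frequency of the term in document d; doc_lengths[d]: length of d
    term_counts = np.asarray(term_counts, dtype=float)
    doc_lengths = np.asarray(doc_lengths, dtype=float)
    rel = term_counts / doc_lengths                 # relative frequencies in [0, 1]
    hist = np.zeros(n_bins + 1)
    hist[0] = np.sum(term_counts == 0)              # bin 0: documents not containing the term
    idx = np.ceil(rel[term_counts > 0] * n_bins)    # (0, 1/n] -> 1, (1/n, 2/n] -> 2, ...
    idx = np.clip(idx.astype(int), 1, n_bins)
    np.add.at(hist, idx, 1)
    return hist / len(term_counts)                  # densities sum to 1 over the collection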

Figure 1 illustrates the relative frequency distributions of the terms family, for, of and wedding, where the relative frequency value of 0 is excluded for ease of interpretation (Footnote 2). The observed frequency distributions of family and wedding are quite different from those of for and of, while the distributions within each pair are relatively similar to each other. As seen in Fig. 1, the frequency distributions of the terms for and of resemble a Poisson distribution. Given that the terms for and of are used out of grammatical necessity rather than to impart knowledge (i.e. they are semantically non-selective words), and that the terms family and wedding are semantically selective words, it is reasonable to expect that, for a query including these 4 terms, a term weighting model assuming a Poisson distribution for the frequency distributions of terms is more likely to distinguish the semantically selective words (i.e. index terms) from the semantically non-selective words (i.e. function words) than a term weighting model assuming a different probability distribution.

Fig. 1 Grouped relative frequency distributions for the terms family, for, of and wedding

Our approach is based on pairwise query similarity, but queries in general differ from each other in length. This is an issue because the Chi-square statistic can in fact only be used as a measure of pairwise similarity between terms, rather than between queries, except for queries composed of a single term. In order to measure the similarity between two queries having more than one term, we adopt a simple heuristic that aggregates the term similarity measurements over the queries.

Assuming that the occurrence of a term in a query is independent of the occurrences of the other terms, we can construct a cartesian table for any given pair of queries. Table 4 illustrates a cartesian table for the two queries \(Q_1 = \{internet, phone, service\}\) and \(Q_2 = \{air, travel, information\}\). Each cell of such a cartesian table contains the Chi-square distance measured between the frequency distribution of the term in the associated row and the frequency distribution of the term in the corresponding column. For instance, the measured Chi-square distance is 0.163 for the term pair (internet, air).

Table 4 Cartesian table for the queries \(Q_1 = \{internet, phone, service\}\) and \(Q_2 = \{air, travel, information\}\)

In Table 4, the smallest measured Chi-square distance is 0.001 and it is observed for the term pair (internet, information). This suggests that internet and information are the terms with the highest degree of similarity in frequency distribution among all pairs of terms in the cross-product of \(Q_1\) and \(Q_2\). According to our heuristic algorithm, after determining the most similar pair of query terms, we remove the corresponding row and column from the table. For two queries each consisting of n terms, this operation results in a cartesian table of \((n-1)\times (n-1)\) cells, as illustrated in Table 5 for the example queries. In the resulting cartesian table, the most similar terms are phone and air, with the smallest Chi-square distance of 0.006. For this example, there then remains only one pair of terms, \(({ service},{ travel})\), with a Chi-square distance of 0.014, and as a result we can now aggregate the obtained term-level Chi-square measurements into an overall similarity value for the queries \(Q_1\) and \(Q_2\). We examined two aggregation methods, the arithmetic mean and the Euclidean distance, and the aggregation method that serves best with respect to average effectiveness is the Normalized Euclidean Distance, as given by:

$$\begin{aligned} sim(Q_1,Q_2)=\frac{\sqrt{0.001^2+0.006^2+0.014^2}}{3}=0.005 \end{aligned}$$
(2)

The pseudocode of the demonstrated sim measure is given in Algorithm 1. As with the \(\chi ^2\) measure, a low value of this sim measure indicates a high level of similarity between the queries \(Q_1\) and \(Q_2\), and \(sim(Q_1,Q_2)=0\) indicates that \(Q_1\) is identical to \(Q_2\) with respect to term frequency distributions.

Table 5 The resulting cartesian table after removing the most similar pair of terms internet and information from the original table in Table 4
Algorithm 1 Pseudocode of the sim measure
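To make the procedure concrete, the following is a minimal Python sketch of the greedy pairing and the Normalized Euclidean Distance aggregation of Eq. 2. It assumes the chi-square distances between all term pairs have already been computed (as in Table 4); the function and variable names are ours.

import numpy as np

def query_sim(dist_matrix):
    # dist_matrix[i][j]: chi-square distance between the i-th term of Q1 and
    # the j-th term of Q2; both queries must contain the same number of terms, n
    d = np.asarray(dist_matrix, dtype=float)
    n = d.shape[0]
    rows, cols = list(range(n)), list(range(n))
    picked = []
    for _ in range(n):
        sub = d[np.ix_(rows, cols)]                    # remaining cartesian table
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        picked.append(sub[i, j])                       # most similar remaining term pair
        del rows[i]                                    # drop its row and column
        del cols[j]
    return float(np.sqrt(np.sum(np.square(picked))) / n)   # Eq. 2

With the three picked distances of the worked example (0.001, 0.006 and 0.014), this sketch returns approximately 0.005, matching Eq. 2.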

This sim measure requires that the two queries to be compared have equal lengths, as measured by the number of terms. In our heuristic, when the lengths of the queries differ, we label the two queries as \(Q_{long}\) and \(Q_{short}\). We then generate all combinations of the terms of \(Q_{long}\) that have the same length as \(Q_{short}\). To give a concrete example, consider the two queries \(X = \{internet, phone, service\}\) and \(Y = \{disneyland, hotel\}\). First, we obtain \(\left( {\begin{array}{c}|Q_{long}|\\ |Q_{short}|\end{array}}\right) = \left( {\begin{array}{c}3\\ 2\end{array}}\right) = 3\) sub-queries of the long query X: [internet, phone], [internet, service] and [phone, service]. Then we apply sim, as is, to each sub-query paired with \(Q_{short}\):

  • sim(disneyland hotel, internet phone)

  • sim(disneyland hotel, internet service)

  • sim(disneyland hotel, phone service).

This process results in a list of similarity values, one for each sub-query of the long query; hence an aggregation method is required to obtain an overall similarity value for the queries X and Y. To obtain the final similarity score, we use the average of the minimum and the maximum of this list: \(sim=\frac{max(list)+min(list)}{2}\). The whole process of handling unequal query lengths is given in Algorithm 2.

Algorithm 2 Handling of queries with unequal lengths
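A minimal sketch of this unequal-length handling, assuming a helper that returns the chi-square distance between the binned distributions of two terms and the equal-length sim of Algorithm 1 (both passed in as callables; all names are illustrative):

from itertools import combinations

def query_sim_unequal(q_long, q_short, term_dist, equal_length_sim):
    # q_long, q_short: lists of terms; term_dist(a, b) gives the chi-square distance
    # between terms a and b; equal_length_sim is the sim measure of Algorithm 1
    scores = []
    for sub in combinations(q_long, len(q_short)):
        dists = [[term_dist(a, b) for b in q_short] for a in sub]
        scores.append(equal_length_sim(dists))
    return (min(scores) + max(scores)) / 2           # average of the min and the max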

Figure 2 shows a plot of the calculated pairwise similarities between the 194 TREC Web track queries. The scatter plot in Fig. 2 is obtained by “Multidimensional Scaling” (MDS) of the matrix of calculated pairwise similarities. In such an MDS plot, the distances between points correspond to the magnitudes of the differences between rows and columns, as measured by the similarity/difference measure in use. Here, points represent queries. Thus, in Fig. 2, queries that are similar with respect to the similarity scores calculated as given above are shown close to each other, and vice versa. In the plot, selected queries are labeled with their query texts for ease of interpretation. As seen in Fig. 2, queries that differ from each other in length are scattered along the x axis, suggesting that the difference between one-term queries and multiple-term queries is exhibited along the x axis of this MDS plot. Similarly, the y axis appears to depict the differences between queries of equal length. The queries containing terms such as the, of, and a are grouped at the upper right corner. One-term queries are positioned at the upper left corner, with the exception of the query maps, which is neither specific nor common. The queries composed of terms with properties similar to those of the term maps are grouped around the origin of the plot, suggesting that these queries differ little from each other compared to the differences observed for the queries scattered towards the edges of the plot. The MDS plot in Fig. 2 suggests, in general, that the term frequency distribution, as a feature, is capable of characterizing queries with respect to the types of terms that they are composed of.

Fig. 2 The multidimensional scaling analysis of the ClueWeb09 queries
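Such a plot can be reproduced from the matrix of pairwise sim values. The following is a minimal sketch using scikit-learn; the function name and the direct use of the sim values as a precomputed dissimilarity matrix are our assumptions.

import numpy as np
from sklearn.manifold import MDS

def mds_coordinates(pairwise_sim, seed=0):
    # pairwise_sim: symmetric matrix of sim values between queries (low = similar),
    # treated here as a precomputed dissimilarity matrix
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    return mds.fit_transform(np.asarray(pairwise_sim))   # one 2-D point per query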

3.2 The win-sets, the loss-sets and the small sample size problem

Our selective approach is of the supervised classification type and hence requires training data to learn the association between queries and term weighting models with respect to retrieval performance. Given a set of training queries, we first measure the per-query performances of the 8 term weighting models in order to determine the best model among them for each training query. We then compose a win-set of queries for each of the 8 term weighting models, by grouping the training queries for which the corresponding term weighting model has the highest per-query performance score. Ties are handled as follows: when there is more than one winner model for a training query, the query is added to the win-sets of all of the winner models; and when every term weighting model has a per-query performance score of 0, or all of the observed scores are equal in magnitude, the query is simply discarded. Having the win-sets for the 8 term weighting models, the model that is likely to show the highest performance for a new query can be predicted by measuring the similarity of the new query to the associated win-sets. The predicted model for a new query is, in this respect, the one whose win-set consists of queries that are, on average, more similar to the new query than the queries in the win-sets of the other models.
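A minimal sketch of the win-set construction with the tie handling described above; the data layout and function name are our assumptions.

import numpy as np
from collections import defaultdict

def build_win_sets(train_scores):
    # train_scores[q][m]: per-query nDCG@100 of model m on training query q
    win_sets = defaultdict(list)
    for q, per_model in train_scores.items():
        values = np.array(list(per_model.values()))
        if np.isclose(values, values[0]).all():     # all scores equal (e.g. all 0): discard
            continue
        best = values.max()
        for model, score in per_model.items():
            if np.isclose(score, best):             # ties: the query joins every winner's win-set
                win_sets[model].append(q)
    return win_sets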

This classification algorithm suffers from the same universal weakness as every supervised statistical classification algorithm, the lack of sufficient training data. In theory, the information provided by the win-sets can be assumed sufficient to fully explain the differences in per-query effectiveness between individual term weighting models, as long as the training set is large enough. For instance, the number of queries released by the TREC Web track studies between 2009 and 2012 is 200. In our approach, there are 8 term weighting models to be classified with respect to their per-query performances. This means that each model would ideally have the highest per-query performance score for at most 25 queries, if the win-sets were somehow balanced in size across the 8 term weighting models. In practice, the win-sets associated with individual term weighting models usually vary in size. For the 200 TREC Web track queries and the 8 term weighting models under consideration, the number of training queries in each win-set varies from 10 (for DLM) to 38 (for BM25). Compared to the sample size of 8! = 40,320 required by a full factorial design to have at least one sample query for every possible within-query ranking of the 8 term weighting models, a set of 200 training queries is quite small.

To alleviate this weakness, we also use loss-sets in addition to the win-sets, which in theory doubles the amount of information that can be obtained for each term weighting model from the same set of queries. Analogously to the win-sets, we compose a loss-set of queries for each of the 8 term weighting models, by grouping the training queries for which the corresponding term weighting model has the lowest per-query performance score. On this account, the most likely term weighting model is the one whose loss-set consists of queries that are, on average, more dissimilar to the new query than the queries in the loss-sets of the other models. Combining the win-sets and the loss-sets, the term weighting model that is likely to show the highest performance for a given query is the one whose win-set is the most similar to, and, simultaneously, whose loss-set is the most dissimilar to, the given query. To obtain an overall score for a query with respect to both the win-set and the loss-set associated with a particular term weighting model, we use the ratio of the win-set similarity score to the loss-set dissimilarity score.
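A minimal sketch of the resulting prediction rule; the exact formulation, the mean win-set distance divided by the mean loss-set distance, minimized over the models, is our reading of the description above rather than a verbatim reproduction of the paper's implementation.

import numpy as np

def predict_model(new_query, win_sets, loss_sets, query_dist):
    # query_dist: the sim measure of Sect. 3.1 (a low value means similar queries)
    best_model, best_ratio = None, np.inf
    for model in win_sets:
        win_d = np.mean([query_dist(new_query, q) for q in win_sets[model]])
        loss_d = np.mean([query_dist(new_query, q) for q in loss_sets[model]])
        ratio = win_d / loss_d if loss_d > 0 else np.inf
        if ratio < best_ratio:                       # close to win-set, far from loss-set
            best_model, best_ratio = model, ratio
    return best_model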

Lastly, we compose the training sets of queries by choosing the queries on which the 8 term weighting models show high variation in performance. In other words, given a set of queries, the training set is composed of the 75% of the original queries having the highest CoV scores (i.e., the highest Coefficient of Variation scores). This selection process discards the queries that carry relatively little or no information about the within-query performance differences between the models, i.e. it eliminates noise and extreme/outlier cases from the training data.

4 Results

In this section, we demonstrate the effectiveness of the proposed selective approach to index term weighting, using the standard TREC test collections. Two sets of queries are used for this purpose: (1) the official set of 200 queries from the TREC Web track and (2) the official set of 562 queries from the TREC MQ track. In accordance with the sets of queries at hand, we divided this section into two subsections, and at the end of the section we summarize the results of individual experiments.

The proposed selective approach is evaluated with respect to two aspects of retrieval effectiveness: (1) the observed average retrieval performance and (2) the accuracy in classifying the test queries into the true classes of the 8 term weighting models, i.e. the classification accuracy. The measure of retrieval effectiveness that we use in the evaluations is the normalized Discounted Cumulative Gain at 100 documents, nDCG@100 (Järvelin and Kekäläinen 2002). Although the main analysis is made using the nDCG@100 values, nDCG@20 values are also reported, in order to make the performance gains reported in this paper comparable with existing work.

To measure the robustness of the proposed selective approach, we use the current state-of-the-art risk-sensitive evaluation measure, GeoRisk (Dinçer et al. 2016). GeoRisk is a well-founded measure used for the risk-sensitive evaluation of IR experiments (Collins-Thompson 2009; Wang et al. 2012; Dinçer et al. 2014). As a risk-sensitive measure, GeoRisk combines the average performance of an IR system and the level of risk associated with the system, i.e. it is the geometric mean of the average retrieval performance of a system and the associated level of risk:

$$\begin{aligned} {\mathrm{GeoRisk}}\left( s_i\right) = \sqrt{{\mathrm{RP}}\left( s_i\right) \times \varPhi {\left( Z_{Risk}\left( s_i\right) /c\right) }} \end{aligned}$$
(3)

where \({\mathrm{RP}}\left( s_i\right)\) stands for the average retrieval performance of the system \(s_i\) as measured by a performance measure (e.g., nDCG, ERR, etc.) and \(Z_{Risk}\left( s_i\right)\) stands for the level of risk associated with the system \(s_i\). Here, c is the number of queries under consideration and \(0 \le \varPhi () \le 1\) is the cumulative distribution function of the standard normal distribution. \(\varPhi ()\) is used to normalize \(Z_{Risk}\) values into [0,1], because \(-\,\infty \le Z_{Risk}/c \le +\infty\).

The measure of risk in GeoRisk is ZRisk. In the context of risk-sensitive IR evaluation, “risk” refers to the risk of performing worse than a baseline system for a given query. In this respect, ZRisk rewards the system under evaluation for the queries on which it is better than the baseline and, conversely, punishes it for the queries on which it is worse than the baseline, as given by

$$\begin{aligned} {\mathrm{Z}}_{Risk}\left( s_i\right) =\left[ \sum _{q\in Q_+} z_{iq} + (1+\alpha ) \sum _{q\in Q_-} z_{iq}\right] \end{aligned}$$
(4)

For any system \(s_i\) (\(i=1,2,\dots ,r\)), \(Q_+\) (\(Q_-\)) is the set of queries for which \(z_{iq} > 0\) (\(z_{iq} < 0\), respectively), i.e. the queries on which the system \(s_i\) outperforms (falls behind) the baseline. In Eq. (4), the risk sensitivity parameter \(\alpha \ge 0\) (Footnote 3) controls the tradeoff between reward and risk (or win and loss). Here, \(z_{iq}=\left( x_{iq}-e_{iq}\right) /\sqrt{e_{iq}}\), where \(x_{iq}\) and \(e_{iq}\) are, respectively, the performance score of the system \(s_i\) for query q and the expected performance score for q from the baseline system(s). Given a system \(s_i\), the expected performance score for a particular query j (\(j=1,2,\dots ,c\)) is calculated as \(e_{ij} = \left( S_i \times T_j\right) /N\), where N is the total performance score over all systems and queries (i.e., \(N=\sum _i\sum _j x_{ij}\)), \(S_i\) is the total performance score of the system \(s_i\) over all queries (i.e., \(S_i=\sum _j x_{ij}\)), and \(T_j\) is the total performance score for the query j over all systems (i.e., \(T_j=\sum _i x_{ij}\)).

As a result, the risk measure ZRisk promotes one system over another if it is more robust, or rather less “risky,” than the other. The ZRisk measure permits deriving the baseline performance of a query from multiple (baseline) systems. In this study, we derive the per-query baseline performances from the set of 8 term weighting models under consideration.
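A compact sketch of Eqs. (3) and (4) over a matrix of per-query scores, assuming NumPy and SciPy; the function name, the default value of alpha, and the requirement that all-zero queries be removed beforehand are our assumptions, not the paper's specification.

import numpy as np
from scipy.stats import norm

def georisk(x, alpha=1.0):
    # x: r-by-c matrix of per-query scores (rows = systems/models, columns = queries);
    # queries on which every system scores 0 should be removed beforehand (e_ij > 0)
    x = np.asarray(x, dtype=float)
    r, c = x.shape
    S, T, N = x.sum(axis=1), x.sum(axis=0), x.sum()
    e = np.outer(S, T) / N                                    # e_ij = (S_i * T_j) / N
    z = (x - e) / np.sqrt(e)
    zrisk = np.where(z > 0, z, (1 + alpha) * z).sum(axis=1)   # Eq. (4)
    rp = x.mean(axis=1)                                       # RP(s_i)
    return np.sqrt(rp * norm.cdf(zrisk / c))                  # Eq. (3)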

It is worth mentioning that the baselines in the ZRisk measurements are different from the baselines used for the comparative evaluation of the proposed approach: the former are implicit, embedded in the GeoRisk measurements, in contrast to the latter. For the comparative evaluation of the proposed approach, we use four (explicit) baselines. One of the 4 baselines is the most effective single term weighting model on average. For the TREC Web track queries, the most effective term weighting model is LGD, with an average nDCG@100 score of 0.1808, and the most effective term weighting model for the TREC MQ track queries is DPH, with an average nDCG@100 score of 0.3585. In addition, we define two theoretical selection strategies as baselines: (1) a random selection strategy (RND), and (2) a maximum likelihood estimation/selection strategy (MLE). The fourth and last baseline is the current state-of-the-art selective approach to index term weighting in the IR literature (He and Ounis 2003b, 2004), which is referred to as “MS7” in this study.

Lastly, we use both the Student’s t test and the Wilcoxon signed-rank test to test the significance of the experimental results presented in this section.
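For reference, both paired tests can be run on the aligned per-query scores of two runs as follows; this is a minimal sketch using SciPy, and the function name is ours.

from scipy.stats import ttest_rel, wilcoxon

def significance(scores_a, scores_b):
    # scores_a, scores_b: per-query nDCG@100 values of two runs, aligned by query
    _, p_t = ttest_rel(scores_a, scores_b)    # paired Student's t test
    _, p_w = wilcoxon(scores_a, scores_b)     # Wilcoxon signed-rank test
    return p_t, p_w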

4.1 The results for the TREC Web track queries

The results of the experiment on the TREC Web track queries are presented in Table 6. 194 queries are actively used in this experiment; the remaining 6 queries are excluded due to the lack of relevant documents in the result sets of the 8 term weighting models under consideration.

Table 6 Selective term weighting result for ClueWeb09A dataset over 194 queries

The 13 models listed in Table 6 are ranked according to their average nDCG@100 scores. As seen in the table, the proposed selective approach, “SEL,” has the highest average nDCG@100 score, 0.1934, except for the score associated with the virtual model, “Oracle,” which selects the best model for every query. Similarly, “SEL” has the highest GeoRisk score, 0.3111. Relating these effectiveness and robustness scores, we can say that, for the TREC Web track queries, the proposed method “SEL” is more effective and more robust than the state-of-the-art term weighting models under consideration.

Except for the “MS7” model, the differences in average nDCG@100 scores between “SEL” and each of the models listed in Table 6 are statistically significant, with a p value less than 0.05, according to both the t test and the signed-rank test. The two hypothesis tests fail to give significance to the observed difference between “SEL” and “MS7.” This suggests either that the difference may be attributed to chance fluctuation on the population of queries, or that the size of the sample at hand is not sufficient to provide a reasonable chance (power) for the hypothesis tests to detect the population effect between “SEL” and “MS7.” Considering the results presented in the next subsection, it would appear that the latter is true: a set of 200 queries is not sufficient in size to provide reasonable power. Indeed, a set of approximately 500 queries provides enough power for both the t test and the signed-rank test to give significance to the observed difference between “SEL” and “MS7,” as demonstrated in Sect. 4.2.

The risk-sensitive evaluation measure GeoRisk quantifies the degree of robustness associated with each model. As seen in Table 6, the robustness ranking of the models is identical to the ranking based on their average performance scores, except for the models “DFRee” and “DFIC.” This means that each model distributes its total performance over the queries in proportion to the expected performance for each query, where the expected performance for each query is the baseline performance for that query. Thus, of two models, the one with the higher GeoRisk score is the one that is better, on average, at avoiding abject failures.

The number of queries on which a model has the highest score, which corresponds to the “Accuracy” in this study, is actually a measure of robustness in a sense similar to that of GeoRisk. However, they differ in that GeoRisk takes into account the per-query baseline performances, in contrast to the “Accuracy.” In Table 6, the number of queries on which a model has the highest score (Footnote 4) is listed under the column labeled “0 \(\times\) SE.” Here, for a query, “SE” stands for the standard error of the within-query average nDCG@100 score over the 8 models. Thus, a “0 \(\times\) SE” difference from the highest score for a query corresponds to exactly that highest score. A “1 \(\times\) SE” difference from the highest score means that the query is also counted as a hit for the models whose scores are less than the highest score but within one standard error of it. Similarly, “2 \(\times\) SE” refers to the range of two standard errors from the highest score. Factoring out the hits associated with a model in this way allows us to interpret the GeoRisk score of the model in some detail.
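Under our reading of this definition (the sample standard deviation of the 8 models' scores divided by \(\sqrt{8}\)), the hit counts at k \(\times\) SE can be computed as follows; this is a sketch with illustrative names, not the paper's exact implementation.

import numpy as np

def hits_at_k_se(scores, k=0):
    # scores: c-by-r matrix of per-query nDCG@100 values (rows = queries, columns = models);
    # a model gets a hit on a query if its score is within k standard errors of the
    # query's highest score
    scores = np.asarray(scores, dtype=float)
    c, r = scores.shape
    se = scores.std(axis=1, ddof=1) / np.sqrt(r)          # per-query standard error
    threshold = scores.max(axis=1) - k * se
    return (scores >= threshold[:, None]).sum(axis=0)     # number of hits per model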

A visual comparison of the proposed selective term weighting method (“SEL”) with the 4 baselines is given in Fig. 3. For a pair of models, such a visual comparison allows us to fully explore the per-query score differences between the models, and hence the robustness of one model with respect to the other. There are 4 plots in Fig. 3, each of which is dedicated to the comparison of the proposed method with one baseline model. Each plot in Fig. 3 shows, on the y axis, the per-query differences in nDCG@100 scores between “SEL” and one of the 4 baselines. The x axis enumerates the queries, with the per-query score differences sorted in increasing order of magnitude. Thus, the left side of each plot (i.e. the low values on the x axis) shows the queries on which the baseline has a higher nDCG@100 score than “SEL,” and conversely, the right side of the plot shows the queries on which “SEL” has a higher score than the baseline. The middle part of the plots, where the score difference is equal to 0, shows ties (i.e. no risk). In such a plot, the ideal case is to have no score difference less than zero; that is, the area under the origin line on the left part would be equal to zero for a model that is absolutely more robust than the baseline.

Fig. 3 Visual comparison of the proposed selective term weighting method “SEL” with the 4 baselines for the 194 TREC Web track queries. Each of the 4 plots shows the per-query nDCG@100 score differences, in ascending order of magnitude along the x axis, for “SEL” and one of the 4 baselines. For each plot, the label at the left, on the origin line, shows the percent of queries for which the corresponding baseline has higher scores than “SEL.” The label at the right shows the percent of queries for which “SEL” has higher scores than the corresponding baseline

Figure 3 shows, in general, that the proposed selective term weighting method is more robust than all of the 4 baselines. In particular, it appears that the observed difference in average nDCG@100 scores between “SEL” and “MS7” is due to a few queries (i.e., a large positive score difference for the few queries shown on the right side of the corresponding plot), while for the majority of the queries there is a tie. On the other hand, this plot also shows that the GeoRisk scores reflect the fact that “SEL” makes no abject failure compared to the model “MS7.” For the other 3 baselines, the superiority of the model “SEL” with respect to robustness is clearly exposed by the plots.

Overall, the results of the experiment on the TREC Web track queries show that term frequency distributions are a viable source of information for predicting the per-query effectiveness of individual term weighting models. Indeed, as we demonstrate in the following subsection, a selective term weighting method built upon this single source of information can outperform every single term weighting model, as well as the existing approach to selective term weighting, “MS7.”

4.2 The results for the TREC million query track queries

The TREC MQ track provides 562 queries in total, 34 of which are eliminated due to the lack of relevant documents in the result sets of the 8 base models. The resulting set of 528 queries is used for the evaluation of the proposed selective term weighting method. Table 7 lists the results of the experiment performed on the TREC MQ track queries.

Table 7 Selective term weighting result for Million Query 2009 dataset over 528 queries

For the TREC MQ track queries, the proposed selective term weighting method “SEL” has the highest average nDCG@100 score (0.3740), except for the score of the virtual model “Oracle” (0.4498). As for the TREC Web track queries, the t test and the signed-rank test give significance to the observed differences in average nDCG@100 scores between “SEL” and each of the other models, with a p value less than 0.05. In contrast to the TREC Web track queries, this now includes “MS7”: the observed average performance difference between “SEL” and “MS7” is statistically significant, suggesting that a set of approximately 500 queries is sufficient in size to detect the population effect between “SEL” and “MS7.”

This time, the calculated GeoRisk scores indicate that the most effective single term weighting model, DPH, is more robust than the baseline selective approach “MS7.” As listed in the column “Accuracy,” the model “DPH” has more hits than “MS7” at every level of standard error. It is worth mentioning that DPH also has more hits than the proposed method “SEL” at “0 \(\times\) SE,” while “SEL” is in fact more robust than “DPH,” as the number of hits at the deeper levels of standard error indicates. This means that the model DPH is a strong alternative to selective term weighting. On the other hand, since the observed difference in average nDCG@100 scores between “SEL” and “DPH” is statistically significant, it is expected, on the population of queries, that using “DPH” for every query will cause significant performance losses compared to “SEL.” This is also true with respect to robustness, as indicated by the GeoRisk scores associated with “SEL” and “DPH.”

A visual comparison of the proposed term weighting method with the 4 baselines is given in Fig. 4, which has the same layout as Fig. 3, given for the visual comparison on the TREC Web track queries in Sect. 4.1.

Fig. 4 Visual comparison of the proposed selective term weighting method “SEL” with the 4 baselines for the 528 TREC Million Query track queries. Each of the 4 plots shows the per-query nDCG@100 score differences, in ascending order of magnitude along the x axis, for “SEL” and one of the 4 baselines. For each plot, the label at the left, on the origin line, shows the percent of queries for which the corresponding baseline has higher scores than “SEL.” The label at the right shows the percent of queries for which “SEL” has higher scores than the corresponding baseline

For the TREC MQ track queries, the observed difference in average nDCG@100 scores between “SEL” and “MS7” is not attributable to a few queries: the right side of the corresponding plot in Fig. 4 (i.e. SEL > MS7) has an apparently larger area than the left side of the plot (i.e. MS7 > SEL), as also indicated by the associated GeoRisk scores. “SEL” has a higher score than “MS7” for 248 queries in total (i.e. \(47\%\)), the two have the same score for 73 queries (i.e. \(14\%\)), and “MS7” has a higher score than “SEL” for 207 queries (i.e. \(39\%\)).

The case of the most effective single term weighting model, “DPH,” is similar to that of “MS7.” The proposed method “SEL” has a higher score than “DPH” for 263 queries (i.e. \(50\%\)), there is a tie for 24 queries (i.e. \(4\%\)), and “DPH” has a higher score than “SEL” for 241 queries (i.e. \(46\%\)).

In summary, Fig. 4 shows that the proposed selective term weighting method “SEL” is better in both performance and robustness than the most effective single term weighting model, as well as the existing approach to selective term weighting, “MS7.”

4.3 Overall analysis

We evaluated the proposed selective term weighting method in comparison with 4 baselines, using the two official sets of queries from previous TREC studies. Relating the results in Table 6 to those in Table 7, it would appear that, as the set of queries changes, the most effective single term weighting model changes. For the TREC MQ track, the most effective term weighting model is “DPH,” whereas it is “LGD” for the TREC Web track queries. As seen in Table 6, where the results for the TREC Web track queries are presented, the model “DPH” has a rank of 8: it is listed below the baseline “MLE” and above the baseline “RND.” Thus, in the general context of deciding between a single term weighting model and a selective approach, it can be said, on the body of data at hand, that the best decision is, on average, to choose a selective approach to index term weighting. In particular, compared to the existing selective approach “MS7,” the proposed selective term weighting method “SEL” is the better choice with respect to both average retrieval performance and robustness.

5 Discussion

From the results of the experiments presented in Sect. 4, we observe that the proposed selective term weighting method has significantly lower effectiveness than the “Oracle,” the optimal selective approach that could predict the best model for any given query with 100% accuracy. We speculate that one of the reasons behind this optimality gap is the existence of supplementary components in the functional forms of the probabilistic term weighting models. Probabilistic term weighting models are, in theory, built upon a particular assumed probability distribution, but in practice the implemented functional forms usually include supplementary components in addition to the theoretical basis. For instance, the PL2 weighting method, which is an instance of the divergence from randomness framework (Amati and Van Rijsbergen 2002), assumes a Poisson distribution, denoted by “P” in PL2, but its functional form includes an additional component derived from the Laplace law of succession, denoted by “L,” and it also applies a term frequency normalization scheme, denoted by “2.” Similarly, the BM25 method, which is one of the successful implementations (Robertson et al. 1981; Robertson and Walker 1994) of Harter’s 2-Poisson model (Harter 1975a, b), assumes in principle a Poisson distribution, but its functional form additionally includes an “IDF” component and applies a term frequency normalization scheme. A remedy for this issue could be factoring each model into its components and then composing a term weighting model on the fly for the given query. However, such an approach would suffer from the lack of a sufficient number of training queries. A future work in this line of research will perhaps be to experiment with such a component-based selective approach, once a large enough set of queries is obtained.

6 Related work

Selective IR is an attractive subject of interest, simply because it promises, at least in theory, a great deal of improvement in retrieval effectiveness, as well as robustness, compared to traditional IR methods. Unfortunately, this potential has not been fully realized in practice yet, though several attempts exist in the literature.

The scope of selective IR is wide, and it covers virtually every phase of the IR process. A typical example of the successful application of selective IR is perhaps query expansion, where the expansion is applied only to the queries that are likely to benefit from automatic query expansion (Amati et al. 2004; Yom-Tov et al. 2005).

Regarding the different tasks in IR, a selective approach to personalization, for instance, is introduced by Teevan et al. (2008). Similarly, in the work of White et al. (2008), commercial search engines are the subject of selection and in the works of Peng et al. (2010) and Balasubramanian and Allan (2010), the subject is learning-to-rank methods. Search result diversification (Santos et al. 2010) and collection enrichment (Peng et al. 2009a, b) are also known subjects of selective IR.

In addition to the selective application of alternative IR techniques, it is also possible to selectively apply different document representations (Plachouras et al. 2004, 2006) or query-independent features (Peng and Ounis 2009), to select among different query sets for the purpose of training a machine learning technique (Geng et al. 2008), or to dynamically prune the result sets to be re-ranked via a learning-to-rank technique (Tonellotto et al. 2013).

Among all of the IR tasks, the least studied one is index term weighting. In this respect, the pioneering work is that of He and Ounis (2003b, 2004). In that study, queries are represented by vectors of three features: (1) the number of query terms, (2) the number of documents that contain at least one of the query terms (Footnote 5), and (3) the ratio of the minimum IDF to the maximum IDF associated with the query terms. The candidate model set used in the original work consists of 11 term weighting models derived from the divergence-from-randomness (DFR) framework (Amati and Van Rijsbergen 2002). Given a set of training queries, the proposed approach clusters the queries into k clusters and assigns the most effective DFR model to each cluster. The term weighting model to be applied to a new query is then determined according to the distance of the new query from the k clusters.

Recently, Petersen et al. (2016) presented an extension to the DFR framework, called the Adaptive Distributional Ranking (ADR) model. In that work, given a dataset consisting of a document collection and a query set, the distribution that best fits the non-informative query terms is first selected from a candidate set of statistical distributions including the geometric, negative binomial, Poisson, power law and Yule-Simon distributions. Then, the corresponding term weighting model is derived from the DFR framework and applied to any given new query. In this respect, ADR can be considered a per-dataset selective approach to index term weighting.

In summary, the aforementioned works suggest that selective IR is a promising line of research, with the potential of being a viable remedy for the long-standing challenge of robustness in IR.

7 Conclusions

A great deal of research has been dedicated to developing term weighting models for IR. However, IR research has shown that there is no single term weighting model that can satisfy every information need of users with an acceptable level of satisfaction. Rather, high performance fluctuation across information needs has repeatedly been shown empirically. This issue is referred to as robustness in retrieval effectiveness. The study presented in this paper investigates the selective application of existing term weighting models on a per-query basis, in order to tackle the challenge of robustness in retrieval effectiveness.

We test the proposed selective method on the ClueWeb09 English corpus and its corresponding TREC tasks, namely the Web track and the Million Query track. The experimental results show that selective term weighting does improve retrieval effectiveness on average, compared to a baseline in which a single term weighting model is applied uniformly to every query. The experimental results also show that the proposed method forms a robust system that avoids abject failures on individual queries while maintaining high average retrieval effectiveness. In other words, we show that a robust and effective system can be built by leveraging only the existing term weighting models in a selective manner, without inventing a new one.

Most importantly, to the best of our knowledge, this study is the first to provide empirical evidence in favor of the fundamental assumption of probabilistic term weighting models, which relates the relevance of a document to a query by means of probability distributions. In particular, we empirically justify the presumed relationship between the frequency distributions of (query) terms and the retrieval effectiveness of probabilistic term weighting models.