
1 Introduction

It is well understood that the performance of retrieval models is not consistent across queries and corpora; some queries yield notably lower performance and are often referred to as hard or difficult queries [1]. The area of Query Performance Prediction (QPP) is therefore concerned with estimating the performance of a retrieval system for a given query. A well-established body of work explores QPP through either a post-retrieval or a pre-retrieval strategy [2]. Post-retrieval methods measure query difficulty by analyzing the results returned by the retrieval system in response to the query. In contrast, pre-retrieval methods, which are the focus of this work, rely on linguistic and statistical features of the query and the documents.

While existing work in pre-retrieval query performance prediction has predominantly focused on defining statistical measures based on term- and corpus-level frequency, the IR community has recently begun exploring the impact and importance of neural IR techniques [5,6,7]. Some recent work proposes to use neural networks for QPP based on a host of signals [8], but to the best of our knowledge, only one recent work specifically utilizes neural embeddings of query terms for QPP [9]. Neural embeddings maintain interesting geometric properties between embedded terms [10], manifested in how term vectors are distributed in the embedding space. We explore exploiting these geometric properties to define beyond-frequency QPP metrics. Our work distinguishes itself from the recent work [9], which clusters neural embeddings based on vector similarity to perform QPP, in that we consider not only term similarity but also term neighborhood and association through a network representation of neural embeddings. More specifically, we use term vector associations in the neural embedding space to formalize term specificity, which is known to be correlated with query difficulty [3, 4, 11].

We base our work on the intuition that a term closely surrounded by many other terms in the embedding space is more likely to be specific, while a term with fewer close neighbors is more likely to be generic. We conceptualize the space surrounding a term through an ego network representation, in which the term of interest serves as the ego and is contextualized by a set of alter nodes, i.e., other terms that are similar to it in the embedding space. We apply various node centrality measures to the ego node to determine the specificity of the term represented by the ego, which in turn indicates query difficulty [16]. We have performed experiments on three widely used TREC corpora, namely Robust04, ClueWeb09, and Gov2, and their corresponding topic sets. Our experiments show that the proposed metrics are effective for QPP using pre-trained neural embeddings.

2 Proposed Approach

This paper is concerned with the design of effective metrics for pre-retrieval QPP based on pre-trained neural embeddings. We focus on the distribution of neural embedding vectors in the embedding space to define specificity metrics for QPP. Existing work in the literature [3, 12] has already shown that measures of term specificity are suitable indicators of query difficulty, i.e., more specific terms are more discriminative and hence easier to handle when used as queries.

Fig. 1. Schematic of two \(\alpha \)-depth \(\beta \)-cut ego networks.

Our work is driven by the intuition that more specific terms are more likely to be surrounded by a larger number of terms than generic terms. For instance, as shown in Fig. 1, the set of terms related to the specific term ‘Arsenal’ with an association degree (computed as the cosine similarity of the terms’ vector representations) above 0.75 includes terms such as ‘Wenger’, ‘Tottenham’, and ‘Everton’, which are themselves very specific; whereas the generic term ‘soccer’ has only one closely associated term (association degree above 0.75), namely ‘football’, which is quite generic itself. While frequency information cannot be recovered from neural embeddings, it is straightforward to identify the set of terms highly similar to a given term based on vector similarity. We exploit this to formalize a recursive notion of specificity: the extent to which a term is specific can be determined from the context created by the highly similar terms surrounding it in the neural embedding space. In order to formalize specificity, we define an ego network, as follows:

Definition 1

Let \(\mathcal {P}(t_i,t_j)\) be the degree of similarity between the vectors of terms \(t_i\) and \(t_j\), V be the complete vocabulary set, and \(\mathcal {P_M}(t_i)\) be the highest degree of similarity to \(t_i\) from any term in V. We define an \(\alpha \)-depth ego network for an ego node \(t_i\) as a fully connected graph with a maximum depth \(\alpha \) around the ego, where the edge weight between any two nodes \(t_k\) and \(t_l\) is \(\mathcal {P}(t_k,t_l)\). We further refine the \(\alpha \)-depth ego network into an \(\alpha \)-depth \(\beta \)-cut ego network by pruning any edge with a weight less than \(\beta \times \mathcal {P_M}(t_i)\).

In simple terms, we build an ego network for a term \(t_i\) such that \(t_i\) is the ego node and is connected directly to other terms only if the degree of similarity between the ego and the neighbor is above a discounted rate (\(\beta \)) of the similarity of the most similar term to the ego. For instance, assuming ‘Arsenal’ is the ego and \(\beta =0.8\), and given that ‘Gunners’ is the most similar term to the ego with a similarity of 0.854, the immediate neighbors of the ego will consist of all terms in V with a similarity above 0.6832 to ‘Arsenal’. Furthermore, we allow the ego network to have a depth of \(\alpha \) from the ego. For a depth of one (\(\alpha =1\)), the ego network consists only of the ego and its immediate neighbors. For a depth of two (\(\alpha =2\)), each node in layer one becomes the ego of its own sub-ego network with a \(\beta \)-cut, as explained above. Figure 1 shows a schematic of the \(\alpha \)-depth \(\beta \)-cut ego networks for the specific term ‘Arsenal’ and the generic term ‘soccer’. In Arsenal’s case, the graph is populated with many terms closely related to the ego; in the second layer, the nodes immediately connected to the ego, e.g., ‘Wenger’, become ego nodes of second-layer subgraphs and are in turn connected to their own alters, e.g., ‘Mourinho’, ‘Benitez’, and ‘Ferguson’. In contrast, the network of the generic term ‘soccer’ is quite sparse, with only two additional nodes present when \(\alpha =2\).
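The following is a minimal Python sketch of this construction, assuming the pre-trained Google News word2vec vectors used in Section 3 loaded through gensim (the file name below is the standard distribution name and may differ locally); it expands the network layer by layer, recomputing the \(\beta \)-cut for each ego, and is meant as an illustration rather than an optimized implementation.

```python
# Illustrative sketch of the alpha-depth beta-cut ego network construction.
import networkx as nx
from gensim.models import KeyedVectors

# Pre-trained Google News embeddings (Section 3); standard file name assumed.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def ego_network(ego, alpha=2, beta=0.8, topn=100):
    graph = nx.Graph()
    graph.add_node(ego)
    frontier = [ego]
    for _ in range(alpha):                       # one iteration per layer
        next_frontier = []
        for node in frontier:
            # neighbors come back sorted, so the first similarity is P_M(node)
            neighbors = vectors.most_similar(node, topn=topn)
            cutoff = beta * neighbors[0][1]      # beta x P_M(node)
            for term, sim in neighbors:
                if sim >= cutoff:
                    if term not in graph:
                        next_frontier.append(term)
                    graph.add_edge(node, term, weight=sim)
        frontier = next_frontier
    return graph

# With beta = 0.8 and 'Gunners' at similarity 0.854, the first layer around
# 'Arsenal' contains all terms with similarity above 0.8 * 0.854 = 0.6832.
arsenal_net = ego_network("Arsenal", alpha=2, beta=0.8)
```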

Table 1. Node centrality metrics on the ego network.

Based on the constructed ego network, we propose to measure the specificity of the ego through node centrality metrics [13, 16]. Since queries can be composed of more than one term, we adopt an integration approach that applies aggregation functions [14] over the specificity of individual query terms. Table 1 provides an overview of the metrics used in this paper.
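A minimal sketch of this scoring step follows, reusing the `ego_network` function above; the three centralities shown (degree, closeness, betweenness) are illustrative stand-ins for the full set in Table 1, and `max` is one possible aggregation function in the spirit of [14].

```python
# Illustrative specificity scoring via ego-node centrality, aggregated over
# query terms; names mirror the DC/CC/BC abbreviations used in Section 3.
import networkx as nx

def term_specificity(term, metric="degree"):
    g = ego_network(term)                          # from the sketch above
    if metric == "degree":
        return nx.degree_centrality(g)[term]       # DC
    if metric == "closeness":
        return nx.closeness_centrality(g, u=term)  # CC
    if metric == "betweenness":
        return nx.betweenness_centrality(g)[term]  # BC
    raise ValueError(f"unknown metric: {metric}")

def query_specificity(query, metric="degree", aggregate=max):
    # aggregate per-term specificity scores over a multi-term query
    return aggregate(term_specificity(t, metric) for t in query.split())
```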

3 Experiments

Corpora and Topics: We employed three widely used corpora, namely Robust04, ClueWeb09, and Gov2. For Robust04, TREC topics 301–450 and 601–650 were used; for Gov2, topics 701–850; and for ClueWeb09, topics 1–200. Topic difficulty was determined based on the Average Precision of each topic, computed using the Query Likelihood (QL) model [15].

Baselines: We adopt the widely used pre-retrieval metrics reported in [2]; the formulation of these metrics is provided in Table 2. As another baseline, we adopt the recent approach by Roy et al. [9], which utilizes embedded word vectors to predict query performance. Their specificity metric, known as \(P_{clarity}\), is based on the idea that the number of clusters in the neighbourhood of a query term is a potential indicator of its specificity. To apply their approach to our embedding vectors, we used the implementation provided by the authors.
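For reference, the gist of \(P_{clarity}\) can be sketched as follows. This is only a rough illustration of the clustering intuition (fitting a Gaussian mixture over a term's embedding neighborhood and scoring the term under the dominating component, cf. the notation in Table 2); the reported experiments use the authors' released implementation, not this sketch.

```python
# Rough illustration of the P_clarity intuition only; the experiments use
# the implementation released by Roy et al. [9], not this sketch.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def p_clarity_sketch(term, topn=100, n_components=5):
    words = [w for w, _ in vectors.most_similar(term, topn=topn)]
    X = np.stack([vectors[w] for w in words])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag", random_state=0).fit(X)
    m = int(np.argmax(gmm.weights_))     # dominating sense (prior pi_m)
    log_density = multivariate_normal(
        mean=gmm.means_[m], cov=np.diag(gmm.covariances_[m])
    ).logpdf(vectors[term])
    # log of pi_m * P(t | N(mu_m, Sigma_m)), per the Table 2 notation
    return np.log(gmm.weights_[m]) + log_density
```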

Neural Embeddings: We used a word2vec model pre-trained on the Google News corpus (https://goo.gl/wQ8eQ1).

Evaluation: A common approach for measuring the performance of a QPP metric is to compute the correlation between the list of queries (1) ordered by their difficulty for the retrieval method (ascending order of average precision), and (2) ordered by the QPP metric. Kendall's \(\tau \) and Pearson's \(\rho \) coefficients are common correlation metrics in this space.
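Concretely, the evaluation reduces to a correlation computation of the following form, assuming two aligned arrays holding each topic's average precision under QL and the QPP metric's scores:

```python
# Correlation between per-topic retrieval effectiveness (average precision
# under QL) and the scores assigned by a QPP metric, aligned by topic.
from scipy.stats import kendalltau, pearsonr

def qpp_correlation(avg_precision, qpp_scores):
    tau, tau_pvalue = kendalltau(avg_precision, qpp_scores)
    rho, rho_pvalue = pearsonr(avg_precision, qpp_scores)
    # the p-values support significance marks like the daggers in Table 3
    return (tau, tau_pvalue), (rho, rho_pvalue)
```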

Empirical studies on pre-retrieval QPP metrics have shown that no single metric or set of metrics outperforms the others on all topics and corpora [2], and our experiments confirm this. Therefore, to be able to rank the different metrics over a range of topics, we compute the rank of each metric on each topic set and report each metric's median rank over all topic sets of each document collection. This is specified as rank and is reported separately for Kendall's \(\tau \) and Pearson's \(\rho \); these ranks show how a metric performs across the different topic sets. Given that our metrics depend on the \(\alpha \) and \(\beta \) parameters, we set them using 5-fold cross-validation optimized for Pearson correlation.
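Under one reading of this protocol (per-topic-set ranks summarized by their median), the rank computation can be sketched as follows, assuming a pandas DataFrame `corr` with one row per QPP metric and one column per topic set, holding, e.g., Kendall \(\tau \) values:

```python
# Median-rank summary across topic sets: rank metrics within each topic
# set (1 = highest correlation), then take each metric's median rank.
import pandas as pd

def median_rank(corr: pd.DataFrame) -> pd.Series:
    per_set_ranks = corr.rank(axis=0, ascending=False)
    return per_set_ranks.median(axis=1).sort_values()
```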

Table 2. Baseline metrics. \(t\) is a term in query \(q\). \(d\) is a document in collection \(D\). \(D_t\) is the set of documents containing \(t\). \(tf(t, D)\) is the term frequency of \(t\) in \(D\). \(Pr(t|D)=tf(t,D)/|D|\). \(\pi _m\) is the prior probability of the most dominating sense of term \(t\) and \(P(t|N(\mu _m, \varSigma _m))\) is the posterior probability of term \(t\) for the selected cluster.
Table 3. Results on Robust04. Gray rows are baselines. Bold metrics are the top-3 on Kendall \(\tau \) (left) and Pearson \(\rho \) (right). \(\dagger \) indicates statistical significance at the 0.05 level.

Findings: The results of our experiments are shown in Tables 3, 4 and 5. As shown, our metrics are among the top-3 on both measures on all corpora. On Robust04, two of our metrics, i.e., BC and IEF, are among the top-3 metrics based on Kendall \(\tau \); based on Pearson \(\rho \), IEF and EWS are among the top-3 along with IDF. On Robust04, there is little consistency in metric performance between Kendall \(\tau \) and Pearson \(\rho \). Among the metrics that perform well on both measures, IEF and BC are the most consistent: IEF ranks first on both Kendall \(\tau \) and Pearson \(\rho \), whereas BC ranks third and fifth on these measures, respectively. The other metrics, both the baselines and the ones we proposed, show a large performance difference between the two measures. For instance, while the baseline VAR metric ranks first on \(\tau \), it ranks twelfth on \(\rho \). On ClueWeb09 and Gov2, unlike Robust04, the top metrics are consistent between Kendall and Pearson: the top-3 metrics include the proposed DC and CC metrics on both measures. On ClueWeb09, these two metrics are accompanied by the BC and PR metrics on \(\tau \) and \(\rho \), respectively; on Gov2, they are followed by the baseline SCQ metric on \(\tau \) and our IEF and EWS metrics on \(\rho \).

In summary, balancing between the evaluation measures and performance on all topics and corpora, we find our CC metric to perform well across the board. It is among the best metrics on Gov2 and ClueWeb09 and has a balanced performance on Robust04. However, CC has a high time complexity of \(O(V^3)\). The DC metric, on the other hand, performs well on both ClueWeb09 and Gov2 (in the top-3) but is less effective on Robust04; its benefit is its low complexity of \(O(1)\). Overall, CC is the preferred metric given that QPP computations are performed offline, and DC can serve as an alternative under computational limitations.

Table 4. Results on ClueWeb09. Table format is similar to Table 3.
Table 5. Results on Gov2. Table format is similar to Table 3.

4 Concluding Remarks

We have shown that it is possible to devise metrics based on the neural embedding-based representation of terms for pre-retrieval QPP. Specifically, we have shown that the specificity of a query term, estimated based on an ego network representation, can lead to better QPP performance compared to several baselines, including one that clusters terms based on their neural embeddings [9].