# Neural Embedding-Based Metrics for Pre-retrieval Query Performance Prediction

## Abstract

Query Performance Prediction (QPP) is concerned with estimating the effectiveness of a query within the context of a retrieval model. It allows for operations such as query routing and segmentation, leading to improved retrieval performance. *Pre-retrieval* QPP methods are oblivious to the performance of the retrieval model as they predict query difficulty prior to observing the set of documents retrieved for the query. Since neural embedding-based models are showing wider adoption in the Information Retrieval (IR) community, we propose a set of pre-retrieval QPP metrics based on the properties of *pre-trained* neural embeddings and show that such metrics are more effective for query performance prediction compared to the widely known QPP metrics such as SCQ, PMI and SCS. We report our findings based on Robust04, ClueWeb09 and Gov2 corpora and their associated TREC topics.

## Keywords

Query Performance Prediction, Neural embeddings, Specificity

## 1 Introduction

It is understood that the performance of retrieval models is not always consistent across queries and corpora, and some queries have lower performance; these are often referred to as *hard* or *difficult* queries [1]. As such, the area of *Query Performance Prediction* is concerned with estimating the performance of a retrieval system for a given query. There is already a well-established body of work that explores query performance prediction through either a *post-retrieval* or a *pre-retrieval* strategy [2]. Post-retrieval methods measure query difficulty by analyzing the results that the retrieval system returns in response to the query. In contrast, pre-retrieval methods, which are the focus of this work, are based on linguistic and statistical features of the query and documents.

While existing work in pre-retrieval query performance prediction has predominantly focused on defining various statistical measures based on term and corpus-level frequency, the IR community has recently begun exploring the impact and importance of neural IR techniques [5, 6, 7]. Some recent work proposes to use neural networks for QPP based on a host of signals [8], but to the best of our knowledge, there is only one recent work that specifically utilizes *neural embeddings* of query terms for performing QPP [9]. Neural embeddings maintain interesting *geometric properties* between embedded terms [10], which are manifested by how term vectors are distributed in the embedding space. We exploit these geometric properties to define beyond-frequency QPP metrics. Our work *distinguishes* itself from the recent work [9], which proposes to cluster neural embeddings based on their vector similarity to perform QPP, by considering not only *term similarity* but also term neighborhood and association through a *network representation* of neural embeddings. More specifically, we benefit from term vector associations in the neural embedding space for formalizing *term specificity*, which is correlated with query difficulty [3, 4, 11].

We base our work on the intuition that a term that is closely surrounded by several other terms in the embedding space is more likely to be *specific*, while a term with *fewer* closely surrounding terms is more likely to be *generic*. We conceptualize the space surrounding a term by using an *ego network* representation where the term of interest serves as the *ego* and is contextualized by a set of *alter* nodes, which are other terms that are similar to it in the embedding space. We apply various measures of node centrality on the ego node to determine the specificity of the term that is being represented by the ego, which would then indicate query difficulty [16]. We have performed experiments on three widely used TREC corpora, namely Robust04, ClueWeb09 and Gov2, and their corresponding topic sets. Our experiments show that the proposed metrics are effective in QPP using pre-trained neural embeddings.

## 2 Proposed Approach

*term specificity* are suitable indicators of query difficulty, i.e., more specific terms are more discriminative and are hence easier to handle when used as queries.

Our work is driven by the *intuition* that more specific terms have a higher likelihood of being surrounded by a larger number of terms compared to generic terms. For instance, as shown in Fig. 1, the set of terms related to the specific term ‘Arsenal’, with an association degree (computed based on cosine similarity of the terms’ vector representations) above 0.75, includes terms such as ‘Wenger’, ‘Tottenham’, ‘Everton’, among others, which are themselves very specific; whereas the generic term ‘soccer’ has only one closely associated term (association degree above 0.75), namely ‘football’, which is quite generic itself. While it is not possible to measure frequency information from neural embeddings, it is convenient to identify the set of terms highly similar to a given term based on vector similarity. We benefit from this to formalize the notion of an *ego network* that is based on vector similarities within the embedding space, and with it our recursive definition of specificity, i.e., the extent to which a term is specific can be determined from the context created by the surrounding highly similar terms within the neural embedding space. In order to formalize *specificity*, we define an *ego network*, as follows:

### Definition 1

Let \(\mathcal {P}(t_i,t_j)\) be the degree of similarity between vectors of terms \(t_i\) and \(t_j\), *V* be the complete vocabulary set, and \(\mathcal {P_M}(t_i)\) be the highest degree of similarity to \(t_i\) from any term in *V*. We define an \(\alpha \)-depth *ego network* for an ego node \(t_i\) in the form of a fully connected graph with a maximum depth \(\alpha \) around the ego where the edge weights are \(\mathcal {P}(t_k,t_l)\) between any two nodes \(t_k\) and \(t_l\). We further refine the \(\alpha \)-depth *ego network* into an \(\alpha \)-depth \(\beta \)-cut *ego network* where any edge with a weight less than \(\beta \times \mathcal {P_M}(t_i)\) is pruned.

*V* that have a similarity above 0.6832 to ‘Arsenal’. Furthermore, we allow the ego network to have a depth of \(\alpha \) from the ego. For a depth of one (\(\alpha =1\)), the ego network will only consist of the ego and its immediate neighbors. For a depth of two (\(\alpha =2\)), each node in layer one will become the ego for another sub-ego network with a \(\beta \)-cut, as explained earlier. Figure 1 shows a schematic of the \(\alpha \)-depth \(\beta \)-cut *ego network* for the specific term ‘Arsenal’ and the generic term ‘soccer’. As seen, in Arsenal’s case, the graph is populated with many terms closely related to the ego. In the second layer, the nodes immediately connected to the ego, e.g., ‘Wenger’, each become an ego node for a second-layer subgraph and are in turn connected to their own alters, e.g., ‘Mourinho’, ‘Benitez’ and ‘Ferguson’. In contrast, the network associated with the generic term ‘soccer’ is quite sparse, with only two additional nodes present when \(\alpha =2\).
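The construction described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the toy three-dimensional vectors, the function names `cosine`, `alters` and `build_ego_network`, and the default `beta=0.9` are all assumptions made for the example. It also assumes one reading of the definition, in which every expanded node is pruned against its own maximum similarity \(\beta \times \mathcal {P_M}\).

```python
import math

# Toy embeddings (hypothetical values, chosen only to mimic the Arsenal/soccer contrast).
embeddings = {
    "arsenal":   [1.00,  0.00, 0.0],
    "wenger":    [0.95,  0.10, 0.0],
    "tottenham": [0.95, -0.10, 0.0],
    "soccer":    [0.00,  1.00, 0.0],
    "football":  [0.10,  0.95, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def alters(term, embeddings, beta):
    """Terms whose similarity to `term` is at least beta * P_M(term)."""
    sims = {t: cosine(embeddings[term], vec)
            for t, vec in embeddings.items() if t != term}
    threshold = beta * max(sims.values())  # beta times the highest similarity to `term`
    return {t: s for t, s in sims.items() if s >= threshold}

def build_ego_network(ego, embeddings, alpha=1, beta=0.9):
    """Return {(u, v): weight} for an alpha-depth, beta-cut ego network (sketch)."""
    edges, frontier, seen = {}, [ego], {ego}
    for _ in range(alpha):                 # expand one layer per unit of depth
        next_frontier = []
        for node in frontier:
            for alter, weight in alters(node, embeddings, beta).items():
                edges[tuple(sorted((node, alter)))] = weight  # undirected edge
                if alter not in seen:
                    seen.add(alter)
                    next_frontier.append(alter)
        frontier = next_frontier
    return edges

arsenal_net = build_ego_network("arsenal", embeddings, alpha=1)
soccer_net = build_ego_network("soccer", embeddings, alpha=1)
```

With these toy vectors the network around ‘arsenal’ keeps two alters while the one around ‘soccer’ keeps a single alter, mirroring the dense versus sparse neighborhoods discussed in the text. With real pre-trained embeddings, the exhaustive similarity scan in `alters` would normally be replaced by a nearest-neighbor lookup.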

Table 1. Node centrality metrics on the ego network.

Based on the developed ego network, we propose to measure the *specificity* of the ego through the use of *node centrality* metrics [13, 16]. Given queries can be composed of more than one term, we adopt the integration approach that uses aggregation functions [14] over the specificity of individual query terms. Table 1 provides an overview of the metrics used in this paper.
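As a hedged illustration of this step, the sketch below scores each query term by the degree of its ego node and then aggregates the per-term scores over the query. The choice of ego degree as the centrality, the function names `ego_degree` and `query_specificity`, and the edge weights are assumptions for the example only; it does not reproduce the specific metrics listed in Table 1.

```python
from statistics import mean

# Hypothetical pruned ego networks, stored as {(u, v): similarity}.
networks = {
    "arsenal": {("arsenal", "wenger"): 0.80,
                ("arsenal", "tottenham"): 0.78,
                ("arsenal", "everton"): 0.76},
    "soccer":  {("football", "soccer"): 0.77},
}

def ego_degree(ego, edges):
    """Number of edges incident to the ego (unnormalised degree centrality)."""
    return sum(1 for u, v in edges if ego in (u, v))

def query_specificity(query_terms, networks, aggregate=max):
    """Aggregate the per-term centrality scores of a multi-term query."""
    scores = [ego_degree(term, networks[term]) for term in query_terms]
    return aggregate(scores)

query = ["arsenal", "soccer"]
by_max = query_specificity(query, networks, max)
by_min = query_specificity(query, networks, min)
by_mean = query_specificity(query, networks, mean)
```

Swapping the `aggregate` argument is enough to compare aggregation functions; any other node centrality could be substituted for `ego_degree` without changing the aggregation logic.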

## 3 Experiments

**Corpora and Topics:** We employed three widely used corpora, namely, Robust04, ClueWeb09, and Gov2. We used TREC topics 301–450 and 601–650 for Robust04, topics 701–850 for Gov2, and topics 1–200 for ClueWeb09. Topic difficulty was based on the Average Precision of each topic computed using the query likelihood (QL) model [15].

**Baselines:** We adopt the widely used pre-retrieval metrics reported in [2]. The formulation of these metrics is provided in Table 2. As another baseline, we adopt the recent approach by Roy et al. [9] that utilizes embedded word vectors to predict query performance. Their *specificity metric*, known as \(P_{clarity}\), is based on the idea that the number of clusters around the neighbourhood of a query term is a potential indicator of its specificity. To apply their approach on our embedding vectors, we have used the implementation provided by the authors.

**Neural Embeddings:** We used a pre-trained word2vec model based on the Google News corpus (https://goo.gl/wQ8eQ1).

**Evaluation:** A common approach for measuring the performance of a QPP metric is to use rank correlation metrics to measure the correlation between the list of queries (1) ordered by their difficulty for the retrieval method (ascending order of average precision), and (2) ordered by the QPP metric. Kendall’s \(\tau \) and Pearson’s \(\rho \) coefficients are common correlation metrics in this space.
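The evaluation protocol can be illustrated with a small, dependency-free sketch. The per-query average precision values and predictor scores below are made up for the example, and `kendall_tau` implements the simple tau-a variant, which assumes ties are not a concern.

```python
import math
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

def pearson(x, y):
    """Pearson's linear correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

# Hypothetical values: one entry per query, in the same query order.
average_precision = [0.10, 0.25, 0.40, 0.55]   # actual retrieval effectiveness
predicted_score = [1.0, 3.0, 2.0, 4.0]         # output of some QPP metric

tau = kendall_tau(average_precision, predicted_score)
rho = pearson(average_precision, predicted_score)
```

A predictor that orders the queries exactly as average precision does yields a \(\tau \) of 1; here one swapped pair lowers it. In practice a library routine that also handles ties and reports significance would be the more usual choice.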

*rank* and is reported separately for Kendall’s \(\tau \) and Pearson’s \(\rho \). These ranks show how a metric has performed over the different topic sets. Given our metrics are dependent on the \(\alpha \) and \(\beta \) parameters, we set them using 5-fold cross validation optimized for Pearson correlation.

Table 2. Baseline metrics. *t* is a term in query *q*. *d* is a document in collection *D*. \(D_t\) is the set of documents with *t*. *tf*(*t*, *D*) is term frequency of term *t* in *D*. \(Pr(t|D)=tf(t,D)/|D|\). \(\pi _m\) is the prior probability of the most dominating sense of term *t* and \(P(t|N(\mu _m, \varSigma _m))\) is the posterior probability of term *t* for the selected cluster.

Metric | Formulation | Ref |
---|---|---|
IDF | \(idf(t)=\log \left( \frac{\vert D\vert }{\vert D_{t}\vert } \right)\) | [2] |
VAR | Variance of query term weights | [2] |
SCQ | \(SCQ(t) = (1 + \log (tf(t,D))) \cdot idf(t)\) | [17] |
SCS | \(SCS(q)=\sum _{t \in q} Pr(t \mid q)\log \left( \frac{Pr(t \mid q)}{Pr(t \mid D)} \right)\) | [3] |
PMI | \(PMI(t_{1},t_{2})=\log \frac{Pr(t_{1},t_{2} \mid D)}{Pr(t_{1} \mid D)Pr(t_{2} \mid D)}\) | [18] |
\(P_{clarity}\) | \(P_{clarity}(t) = \pi _m P(t \mid N(\mu _m, \varSigma _m))\) | [9] |
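To make the frequency-based baselines concrete, the following sketch evaluates IDF, SCQ and SCS on a four-document toy collection. The collection, the function names and the handling of \(|D|\) are assumptions for illustration: the number of documents is used as \(|D|\) in IDF, and the total number of tokens is used when estimating \(Pr(t|D)\).

```python
import math
from collections import Counter

# Hypothetical toy collection; each document is a list of tokens.
docs = [
    ["arsenal", "beat", "tottenham"],
    ["soccer", "is", "football"],
    ["arsenal", "signed", "a", "soccer", "player"],
    ["football", "and", "soccer", "news"],
]

tf = Counter(t for d in docs for t in d)        # collection term frequency tf(t, D)
df = Counter(t for d in docs for t in set(d))   # document frequency |D_t|
n_docs = len(docs)                              # |D| as used in IDF (assumption)
n_tokens = sum(tf.values())                     # |D| as used in Pr(t|D) (assumption)

def idf(t):
    """idf(t) = log(|D| / |D_t|); t must occur in the collection."""
    return math.log(n_docs / df[t])

def scq(t):
    """SCQ(t) = (1 + log tf(t, D)) * idf(t)."""
    return (1 + math.log(tf[t])) * idf(t)

def scs(query):
    """SCS(q): KL divergence between the query and collection language models."""
    counts = Counter(query)
    total = len(query)
    return sum((c / total) * math.log((c / total) / (tf[t] / n_tokens))
               for t, c in counts.items())

scores = {t: (idf(t), scq(t)) for t in ["arsenal", "soccer", "tottenham"]}
query_scs = scs(["arsenal", "soccer"])
```

In this toy collection ‘tottenham’ appears in a single document and therefore receives the highest IDF, while ‘soccer’, which appears in three of the four documents, receives the lowest. Terms absent from the collection are not handled in this sketch and would need smoothing.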

Table 3. Results on Robust04. Gray rows are baselines. Bold metrics are the top-3 on Kendall \(\tau \) (left) and Pearson \(\rho \) (right). \(\dagger \) indicates statistical significance at alpha = 0.05.

**Findings:** The results of our experiments are shown in Tables 3, 4 and 5. As shown, our metrics are among the top-3 on both measures on all corpora. On Robust04, two of our metrics, i.e., BC and IEF, are among the top-3 metrics based on Kendall \(\tau \). Based on Pearson \(\rho \), IEF and EWS are among the top-3 along with IDF. On Robust04, there is little metric performance consistency on Kendall \(\tau \) and Pearson \(\rho \). When looking for those metrics that perform well on both measures, IEF and BC are consistent metrics where IEF ranks first on both Kendall \(\tau \) and Pearson \(\rho \) whereas BC ranks third and fifth on these measures, respectively. The other metrics, both baseline metrics and the ones we proposed, have a high performance difference on the two measures. For instance, while the baseline VAR metric ranks first on \(\tau \), it ranks twelfth on \(\rho \). On ClueWeb09 and Gov2, unlike Robust04, the top metrics are consistent for Kendall and Pearson where the top-3 metrics include the proposed DC and CC metrics for both measures. On ClueWeb09, these two metrics are accompanied by the BC and PR metrics for \(\tau \) and \(\rho \), respectively. However, on Gov2, these metrics are followed by the baseline SCQ metric on \(\tau \) and our IEF and EWS metrics on \(\rho \). In summary, balancing between the evaluation measures and performance on all topics and corpora, we find our *CC metric* to perform well across the board. It is among the best metrics on Gov2 and ClueWeb09 and has a balanced performance on Robust04. However, CC has a high time complexity of \(O(V^3)\). On the other hand, the DC metric performs well on both ClueWeb09 and Gov2 (in the top-3) but is less effective on Robust04. The benefit of DC is its low complexity: \(O(1)\). Overall, CC is the preferred metric given QPP computations are performed offline. DC can serve as an alternative if computation limitations exist.

Table 4. Results on ClueWeb09. Table format is similar to Table 3.

Table 5. Results on Gov2. Table format is similar to Table 3.

## 4 Concluding Remarks

We have shown that it is possible to devise metrics based on the neural embedding-based representation of terms to perform pre-retrieval QPP. Specifically, we have shown that specificity of a query term, estimated based on an *ego network* representation, can lead to better performance on QPP compared to several baselines such as the one that considers term clusters based on neural embeddings [9].

## References

- 1. Mizzaro, S., Mothe, J.: Why do you think this query is difficult?: A user study on human query prediction. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1073–1076. ACM (2016)
- 2. Carmel, D., Yom-Tov, E.: Estimating the query difficulty for information retrieval. Synthesis Lectures Inf. Concepts Retrieval Serv. **2**(1), 1–89 (2010)
- 3. He, B., Ounis, I.: Inferring query performance using pre-retrieval predictors. In: Apostolico, A., Melucci, M. (eds.) SPIRE 2004. LNCS, vol. 3246, pp. 43–54. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30213-1_5
- 4. He, J., Larson, M., de Rijke, M.: Using coherence-based measures to predict query difficulty. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 689–694. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_80
- 5. Zuccon, G., Koopman, B., Bruza, P., Azzopardi, L.: Integrating and evaluating neural word embeddings in information retrieval. In: Proceedings of the 20th Australasian Document Computing Symposium, p. 12. ACM (2015)
- 6. Zhang, L., Zhang, S., Balog, K.: Table2Vec: neural word and entity embeddings for table population and retrieval. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1029–1032. ACM (2019)
- 7. Mitra, B., Craswell, N.: An introduction to neural information retrieval. Found. Trends Inf. Retrieval **13**(1), 1–126 (2018)
- 8. Zamani, H., Croft, W.B., Culpepper, J.S.: Neural query performance prediction using weak supervision from multiple signals. In: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 105–114. ACM (2018)
- 9. Roy, D., Ganguly, D., Mitra, M., Jones, G.J.: Estimating Gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction. Inf. Process. Manage. **56**(3), 1026–1045 (2019)
- 10. Mimno, D., Thompson, L.: The strange geometry of skip-gram with negative sampling. In: Empirical Methods in Natural Language Processing (2017)
- 11. Hauff, C., Hiemstra, D., de Jong, F.: A survey of pre-retrieval query performance predictors. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 1419–1420. ACM (2008)
- 12. Thomas, P., Scholer, F., Bailey, P., Moffat, A.: Tasks, queries, and rankers in pre-retrieval performance prediction. In: Proceedings of the 22nd Australasian Document Computing Symposium, p. 11. ACM (2017)
- 13. Segarra, S., Ribeiro, A.: Stability and continuity of centrality measures in weighted graphs. IEEE Trans. Signal Process. **64**(3), 543–555 (2015)
- 14. Hauff, C., Kelly, D., Azzopardi, L.: A comparison of user and system query performance predictions. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, pp. 979–988. ACM (2010)
- 15. Song, F., Croft, W.B.: A general language model for information retrieval. In: Proceedings of the Eighth International Conference on Information and Knowledge Management, pp. 316–321. ACM (1999)
- 16. Arabzadeh, N., Zarrinkalam, F., Jovanovic, J., Bagheri, E.: Geometric estimation of specificity within embedding spaces. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2109–2112. ACM (2019)
- 17. Zhao, Y., Scholer, F., Tsegay, Y.: Effective pre-retrieval query performance prediction using similarity and variability evidence. In: Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., White, R.W. (eds.) ECIR 2008. LNCS, vol. 4956, pp. 52–64. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_8
- 18. Hauff, C.: Predicting the effectiveness of queries and retrieval systems. In: SIGIR Forum, vol. 44, no. 1, p. 88. ACM (2010)