A general approximation framework for direct optimization of information retrieval measures
Qin, T., Liu, T., & Li, H. Information Retrieval (2010) 13: 375. doi: 10.1007/s10791-009-9124-x
Abstract
Recently direct optimization of information retrieval (IR) measures has become a new trend in learning to rank. In this paper, we propose a general framework for direct optimization of IR measures, which enjoys several theoretical advantages. The general framework, which can be used to optimize most IR measures, addresses the task by approximating the IR measures and optimizing the approximated surrogate functions. Theoretical analysis shows that a high approximation accuracy can be achieved by the framework. We take average precision (AP) and normalized discounted cumulated gains (NDCG) as examples to demonstrate how to realize the proposed framework. Experiments on benchmark datasets show that the algorithms deduced from our framework are very effective when compared to existing methods. The empirical results also agree well with the theoretical results obtained in the paper.
Keywords
Learning to rank · Direct optimization of IR measures · Position function approximation · Truncation function approximation · Accuracy analysis

1 Introduction
In this paper, we consider direct optimization of IR measures in learning to rank. This has been regarded as one of the most important directions for the area (Xu et al. 2008).
Several methods that directly optimize IR measures have been developed. In general, they can be grouped into two categories. The methods in the first category introduce upper bounds of the IR measures and try to optimize the upper bounds as surrogate objective functions (Chapelle et al. 2007; Xu and Li 2007; Yue et al. 2007). The methods in the other category approximate the IR measures using some smooth functions and conduct optimization on the surrogate objective functions (Guiver and Snelson 2008; Taylor et al. 2008).
Previous studies have shown that the approach of directly optimizing IR measures can achieve high performances when compared to the other approaches (Chapelle et al. 2007; Taylor et al. 2008; Xu and Li 2007; Xu et al. 2008; Yue et al. 2007). This is mainly because IR measures are explicitly considered in the direct optimization approach. However, there are still some open problems regarding the approach, as shown below.
First, although there seems to be some relationship between the surrogate functions and the corresponding IR measures, the relationship has not been sufficiently studied. This is a critical issue, because it is necessary to know whether optimizing the surrogate functions can indeed optimize the corresponding IR measures.
Second, some of the proposed surrogate functions are not easy to optimize. Complicated techniques have to be employed for the optimization. For example, both SVM^{map} (Yue et al. 2007) and SVM^{ndcg} (Chapelle et al. 2007) use Structured SVM to optimize the surrogate objective functions. However, the optimization technologies (e.g., the construction of the joint feature map and the way of finding the most violated constraints) are measure-specific, and thus it is not trivial to extend them to new measures.
In this work, we propose a general direct optimization framework, which can effectively address the aforementioned problems. The framework can accurately approximate any position-based IR measure, and then transform the optimization of an IR measure to that of an approximated surrogate function.
The key idea of our proposed framework is as follows. The difficulty in directly optimizing IR measures lies in the fact that the measures are position based, and thus non-continuous and non-differentiable with respect to the scores output by the ranking function. If we can accurately approximate the positions of documents by a continuous and differentiable function of the scores of the documents, then we will be able to approximate any position based IR measure. Our theoretical analysis demonstrates that a highly accurate approximation of a position based IR measure can be obtained, and thus high test performance in ranking can be achieved.
Taking average precision (AP) and normalized discounted cumulated gains (NDCG) as examples, we show that it is easy to derive learning algorithms (ApproxAP and ApproxNDCG) to optimize the surrogate functions in the proposed framework. Experimental results show that the derived algorithms can outperform existing algorithms.
Our main contributions can be summarized as follows:
- 1.
We set up a general framework for direct optimization, which is applicable to any position based IR measure, theoretically justifiable, and empirically effective;
- 2.
We show that it is easy to derive algorithms to optimize position based IR measures within the framework. Two effective algorithms are proposed as examples to optimize two popular IR measures, AP and NDCG.
The remainder of this paper is organized as follows. We start with a review of existing methods in Sect. 2. Section 3 sets up a general framework to approximate and optimize IR measures, and shows two examples of using this framework. Theoretical analysis of the approximation accuracy is given in Sect. 4. Experimental results are presented in Sect. 5. We conclude the paper and discuss future directions in the last section.
2 Related work
2.1 Learning to rank for information retrieval
The key problem for document retrieval is ranking, specifically, how to create a ranking model (function) that can sort documents based on their relevance to a given query. It is a common practice in IR to tune the parameters of a ranking model using some labeled data and a performance measure. For example, the state-of-the-art methods of BM25 (Robertson and Hull 2000) and LMIR (Language Models for Information Retrieval) (Zhai and Lafferty 2001) all have parameters to tune. As ranking models become more sophisticated (more features are used) and more labeled data become available, how to tune or train ranking models turns out to be a challenging issue.
The learning to rank technology can successfully leverage multiple features for ranking, and can automatically tune the parameters in ranking models based on a large volume of training data. This technology has been gaining increasing attention from both the research community and the industry in the past several years. The setting of learning to rank, when applied to document retrieval and web search, is as follows. Assume that there is a corpus of documents. In training, a number of queries are provided; each query is associated with a set of documents with relevance judgments. Each query-document pair is represented by a feature vector. A ranking function is then created using the training data, such that the model can precisely predict the ranked lists in the training data by appropriately combining the features. In retrieval (i.e., testing), given a new query, the ranking function is used to create a ranked list for the documents associated with the query.
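The setting above can be sketched in a few lines of code. The linear scoring function and the feature values below are illustrative assumptions, not part of the original experiments:

```python
# Sketch of the learning-to-rank setting: each query-document pair is a
# feature vector, a (here linear) ranking function scores it, and documents
# are ranked by descending score. Feature values are made up.

def score(theta, features):
    """Linear ranking function f(x; theta) = theta . x."""
    return sum(t * f for t, f in zip(theta, features))

def rank(theta, docs):
    """Return document ids sorted by descending score."""
    return sorted(docs, key=lambda d: score(theta, docs[d]), reverse=True)

# Hypothetical query with three documents and two features (e.g., BM25, PageRank).
docs = {"d1": [0.2, 0.9], "d2": [0.8, 0.1], "d3": [0.5, 0.5]}
theta = [1.0, 0.5]
print(rank(theta, docs))  # documents ordered by theta . x
```

Learning then amounts to choosing theta so that the predicted order agrees with the relevance judgments.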
Many learning to rank methods have been proposed and applied to different IR applications.
One approach in previous work takes document pairs as instances and reduces the problem of ranking to that of classification on the orders of document pairs. It then applies existing classification techniques to ranking. The methods include Ranking SVM (Herbrich et al. 1999; Joachims 2002), RankBoost (Freund et al. 2003), and RankNet (Burges et al. 2005). Ranking SVM solves the problem of pairwise classification using Support Vector Machines, RankBoost using boosting techniques, and RankNet using Neural Networks. See also Tsai et al. (2007) and Zheng et al. (2007) for other pairwise methods.
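A minimal sketch of this pairwise reduction, using a margin-based (hinge) loss in the spirit of Ranking SVM; the linear model and the data below are hypothetical:

```python
# Pairwise reduction: each pair (x_i, x_j) with x_i preferred over x_j
# becomes one classification example, and a hinge loss penalizes score
# inversions (s_i not exceeding s_j by a margin of 1).

def pairwise_hinge_loss(theta, pairs):
    """Sum of hinge losses max(0, 1 - (s_i - s_j)) over preference pairs."""
    def s(x):
        return sum(t * f for t, f in zip(theta, x))
    return sum(max(0.0, 1.0 - (s(xi) - s(xj))) for xi, xj in pairs)

# One preference pair: the first feature vector should outrank the second.
pairs = [([1.0, 0.0], [0.0, 1.0])]
print(pairwise_hinge_loss([2.0, 0.0], pairs))  # margin satisfied -> loss 0.0
```

Minimizing this loss over theta (plus regularization) recovers the flavor of the pairwise methods, though as Sect. 1 notes, the loss is only loosely connected to IR measures.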
Another approach regards ranking lists as instances and conducts learning on the lists of documents. For instance, Cao et al. proposed using a permutation probability model in the rank learning and employing a listwise ranking algorithm called ListNet (Cao et al. 2007). In their recent work (Xia et al. 2008), they further studied the properties of the related algorithms and derived a new algorithm based on Maximum Likelihood Estimation called ListMLE. See also Qin et al. (2008c), Volkovs and Zemel (2009) for other listwise methods.
2.2 Direct optimization of IR measures
Recently, a new approach, direct optimization of IR measures, has attracted much attention in learning to rank. The basic idea of the direct optimization approach is to find an optimal ranking function by directly maximizing some IR measures such as AP (Voorhees and Harman 2005) and NDCG (Järvelin and Kekäläinen 2002) on the training set. This new approach seems more straightforward and appealing, because what is used in evaluation is exactly an IR measure. There are two major categories of algorithms for direct optimization of IR measures.
One group of algorithms tries to optimize objective functions that are bounds of the IR measures. For example, SVM^{map} (Yue et al. 2007) optimizes an upper bound of (1 − AP) in the predicted rankings. Specifically, a joint feature map is constructed for each possible ranking, and Structured SVM is used to iteratively optimize the most violated constraint (the way of finding the most violated constraint depends on the property of AP). The idea of SVM^{map} is further extended to optimize other IR evaluation measures, and the corresponding algorithms include SVM^{ndcg} (Chapelle et al. 2007) and SVM^{mrr} (Chakrabarti et al. 2008). In these new algorithms, different joint feature maps and different ways of finding the most-violated constraints are proposed. AdaRank (Xu and Li 2007) minimizes an exponential loss function which can upper bound either (1 − AP) or (1 − NDCG) using boosting methods. It repeatedly constructs weak rankers on the basis of re-weighted training queries and finally linearly combines the weak rankers for making ranking predictions. Two sub methods have been proposed in (Xu and Li 2007). AdaRank.MAP utilizes AP to measure the goodness of a weak ranker, and AdaRank.NDCG utilizes NDCG to measure the goodness of a weak ranker.
Another group of algorithms manages to smooth the IR measures with easy-to-optimize functions. For example, SoftRank (Guiver and Snelson 2008; Taylor et al. 2008) introduces randomness to the ranking scores of the documents, so as to smooth NDCG. It assumes the ranking score of a document to be governed by a Gaussian distribution, and then derives a rank distribution of the document in an iterative manner. Based on the rank distributions of all the documents associated with a query, SoftRank computes the expectation of NDCG as the objective function for learning to rank. The gradient descent method is used to learn the ranking function.
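SoftRank's smoothing idea can be illustrated with Monte-Carlo sampling, although SoftRank itself derives the rank distributions analytically rather than by sampling; the scores, relevance grades, and noise variance below are all assumptions for illustration:

```python
import math, random

# Monte-Carlo illustration of SoftRank's idea: treat each ranking score as
# Gaussian, sample rankings, and average NDCG. The expectation is a smooth
# function of the underlying scores even though NDCG itself is not.

def ndcg(order, rel):
    """order: doc ids best-first; rel: doc id -> graded relevance."""
    dcg = sum((2 ** rel[d] - 1) / math.log2(i + 2) for i, d in enumerate(order))
    ideal = sum((2 ** g - 1) / math.log2(i + 2)
                for i, g in enumerate(sorted(rel.values(), reverse=True)))
    return dcg / ideal

def expected_ndcg(scores, rel, sigma=0.5, samples=2000, seed=0):
    """Monte-Carlo estimate of E[NDCG] under Gaussian score noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        noisy = {d: s + rng.gauss(0.0, sigma) for d, s in scores.items()}
        total += ndcg(sorted(noisy, key=noisy.get, reverse=True), rel)
    return total / samples

scores = {"a": 2.0, "b": 1.0, "c": 0.0}   # hypothetical ranking scores
rel = {"a": 2, "b": 1, "c": 0}            # hypothetical relevance grades
print(round(expected_ndcg(scores, rel), 3))  # a smooth function of the scores
```

As sigma shrinks, the expectation approaches the NDCG of the deterministic ranking.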
However, these direct optimization methods still have the following problems:
- 1.
The relationships between the surrogate functions and the corresponding IR measures have not been sufficiently studied. Therefore, it is unknown whether optimizing the surrogate functions can indeed optimize the corresponding IR measures.
- 2.
Some of the proposed surrogate functions are not easy to optimize. Existing methods (e.g., the Structured SVM series) have to employ complicated, measure-specific techniques in the optimization. It is not trivial to extend them to new measures.
3 A general approximation framework
In this section, we propose a general framework for direct optimization of IR measures. The framework is applicable to any position based IR measure, and is theoretically justifiable.
Specifically, the framework consists of the following four steps:
- 1.
Reformulating an IR measure from ‘indexed by positions’ to ‘indexed by documents’. The newly formulated IR measure then contains a position function and optionally a truncation function. Both functions are non-continuous and non-differentiable.
- 2.
Approximating the position function with a logistic function of ranking scores of documents.
- 3.
Approximating the truncation function with a logistic function of positions of documents.
- 4.
Applying a global optimization technique to optimize the approximated measure (surrogate function).
We first give a brief introduction to several popular IR measures used in learning to rank, and then take some measures as examples to introduce the four steps of the framework.
3.1 Review on IR measures
To evaluate the effectiveness of a ranking model, many IR measures have been proposed. Here we give a brief introduction to several popular ones which are widely used in learning to rank. See also (Moffat and Zobel 2008) for other measures.
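For reference, the standard position-indexed definitions of Precision@k, AP, and NDCG@k can be written as follows; the paper's equations (1)–(4), omitted in this excerpt, are of this general form:

```latex
% r(j) in {0,1} is the relevance at position j for Precision@k and AP, and a
% graded relevance for NDCG; |D_+| is the number of relevant documents, and
% Z_k normalizes NDCG@k to 1 for the ideal ranking.
\mathrm{Precision@}k = \frac{1}{k}\sum_{j=1}^{k} r(j), \qquad
\mathrm{AP} = \frac{1}{|D_+|}\sum_{j=1}^{n} r(j)\cdot \mathrm{Precision@}j, \qquad
\mathrm{NDCG@}k = \frac{1}{Z_k}\sum_{j=1}^{k} \frac{2^{r(j)}-1}{\log_2(1+j)}.
```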
3.2 Step 1: Measure reformulation
Most IR measures, for example Precision@k, AP, and NDCG, are position based. Specifically, the summations in their definitions are taken over positions, as can be seen in (1)–(4). Unfortunately, the position of a document may change during the training process, which makes the optimization of the IR measures difficult. To deal with this problem, we reformulate the IR measures using the indexes of documents.
The reformulated IR measures [e.g., (5), (7)–(9)] contain two kinds of functions: position function π(x) and truncation functions 1{π(x) < π(y)} and 1{π(x) ≤ k}. Both of them are non-continuous and non-differentiable. We will discuss how to approximate them separately in the next two subsections.
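To illustrate Step 1, NDCG can be rewritten from a sum over positions j to a sum over documents x via the position function π(x); the document-indexed forms below are in the spirit of (5)–(9):

```latex
% Position-indexed form on the left, document-indexed form on the right;
% the truncated measure needs the additional indicator 1{pi(x) <= k}.
\mathrm{NDCG} = \frac{1}{Z_n}\sum_{j=1}^{n}\frac{2^{r(j)}-1}{\log_2(1+j)}
  \;=\; \frac{1}{Z_n}\sum_{x\in D}\frac{2^{r(x)}-1}{\log_2\bigl(1+\pi(x)\bigr)}, \qquad
\mathrm{NDCG@}k = \frac{1}{Z_k}\sum_{x\in D}\frac{2^{r(x)}-1}{\log_2\bigl(1+\pi(x)\bigr)}\,
  \mathbf{1}\{\pi(x)\le k\}.
```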
3.3 Step 2: Position function approximation
The position of a document can be written as a function of the ranking scores: \(\pi(x) = 1 + \sum_{y \neq x} \mathbf{1}\{s_y > s_x\}\). That is, positions can be regarded as outputs of functions of ranking scores. Due to the indicator function in it, the position function is non-continuous and non-differentiable.
Examples of position approximation
| Document | s_{x} | π(x) | \(\hat{\pi}(x)\) (α = 100) |
|---|---|---|---|
| x_{1} | 4.20074 | 2 | 2.00118 |
| x_{2} | 3.12378 | 4 | 4.00000 |
| x_{3} | 4.40918 | 1 | 1.00000 |
| x_{4} | 1.55258 | 5 | 5.00000 |
| x_{5} | 4.13330 | 3 | 2.99882 |
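The table above can be reproduced with a short script. The logistic position approximation below follows the form of (12); the numerically stable sigmoid implementation is our own choice:

```python
import math

# Position approximation: the true position pi(x) is 1 plus the number of
# documents scored higher, and each indicator 1{s_y > s_x} is replaced by a
# logistic function of the score difference. The scores reproduce Table 1.

def sigmoid(t):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def approx_positions(scores, alpha):
    """pi_hat(x) = 1 + sum_{y != x} sigmoid(alpha * (s_y - s_x))."""
    return [1.0 + sum(sigmoid(alpha * (sy - sx))
                      for j, sy in enumerate(scores) if j != i)
            for i, sx in enumerate(scores)]

scores = [4.20074, 3.12378, 4.40918, 1.55258, 4.13330]
print([round(p, 5) for p in approx_positions(scores, alpha=100)])
# -> [2.00118, 4.0, 1.0, 5.0, 2.99882], matching the table
```

Note that a naive `1 / (1 + exp(-t))` would overflow for large negative `alpha * (s_y - s_x)`, hence the two-branch sigmoid.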
3.4 Step 3: Truncation function approximation
As can be seen in Sect. 3.2, some measures, such as Precision@k, AP, and NDCG@k, have truncation functions in their definitions and thus require a further approximation of the truncation function. In this subsection we introduce how this can be achieved. Other measures, including NDCG, do not have truncation functions; for them, the techniques introduced below can be skipped.
With (16), one can approximate measures like Precision@k and NDCG@k. Here we omit the details.
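As an illustration, here is a sketch of a smoothed Precision@k in which the indicator 1{π(x) ≤ k} is replaced by a logistic function of the (approximated) position. The half-position offset that centers the transition between ranks k and k+1 is our own choice and not necessarily the exact form of (16); the relevance labels are hypothetical:

```python
import math

# Truncation-function approximation: replace the hard cutoff 1{pi(x) <= k}
# by a logistic function of the approximated position pi_hat(x).

def sigmoid(t):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def approx_precision_at_k(pi_hat, labels, k, beta=100.0):
    """Smoothed Precision@k: (1/k) * sum_x r(x) * sigmoid(beta * (k + 0.5 - pi_hat(x)))."""
    return sum(r * sigmoid(beta * (k + 0.5 - p))
               for p, r in zip(pi_hat, labels)) / k

pi_hat = [2.00118, 4.0, 1.0, 5.0, 2.99882]   # approximated positions (alpha = 100)
labels = [1, 0, 1, 0, 0]                     # hypothetical binary relevance
print(round(approx_precision_at_k(pi_hat, labels, k=3), 4))  # 0.6667, the exact P@3
```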
3.5 Step 4: Surrogate function optimization
With the aforementioned approximation technique, the surrogate objective functions (e.g., \(\widehat{\hbox{AP}}\) and \(\widehat{\hbox{NDCG}}\)) become continuous and differentiable with respect to the parameter θ in the ranking model, and many optimization algorithms can be used to maximize them. Measure specific optimization techniques are no longer needed.
However, considering that the original IR measures contain many local optima, their approximations will also contain local optima. Therefore, it is better to choose global optimization methods such as random restart (Hu et al. 1994) and simulated annealing (Kirkpatrick et al. 1983) in order to avoid being trapped in local optima. Note that there are also alternative ways of dealing with the issue of local optima. For example, one can use a robust but likely less effective learning to rank method (e.g., Ranking SVM) to obtain an initial guess of the ranking model, and then use it as the starting point for the optimization of the approximated IR measure.
In this work, we choose the random restart technique (Hu et al. 1994) as an example. That is, we first use a gradient descent method to find a local optimum of the objective function given a certain initial value of the ranking model, and then we randomly re-initialize the model parameters and perform another round of optimization. We repeat this several times. Finally, we regard the best local optimum found as an approximation of the global optimum.
Algorithm 1. ApproxAP (ApproxNDCG)
Input:
1: m training queries, their associated documents and relevance judgments;
2: Number of random restarts K;
3: Stop threshold δ;
4: Learning rate η.
Training:
5: Set iteration number t = 0;
6: For k = 1:K Do {
7:   Randomly initialize the parameter θ_{t} of the ranking model f(x;θ);
8:   Do {
9:     Set θ = θ_{t};
10:    Shuffle the m training queries;
11:    For i = 1 to m Do {
12:      Feed the i-th training query (after shuffling) to the learning system;
13:      Compute the gradient Δθ of \(\widehat{\hbox{AP}}\, (\widehat{\hbox{NDCG}})\) with respect to θ using (38) [using (35)];
14:      Update parameter θ = θ + η × Δθ;
15:    }
16:    Set t = t + 1, θ_{t} = θ;
17:  } While (||θ_{t} − θ_{t−1}|| > δ)
18:  Set ω_{k} = θ_{t}.
19: }
Output:
20: Compute the objectives \((\widehat{\hbox{AP}}\hbox{ or }\widehat{\hbox{NDCG}})\) of the K parameters {ω_{1}, ω_{2}, ..., ω_{K}}.
21: Output the parameter ω_{k} with the maximal objective.
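The random-restart structure of Algorithm 1 can be sketched on a toy objective. The one-dimensional function below, which has several local maxima, merely stands in for the approximated IR measure, and its analytic gradient stands in for (38) or (35):

```python
import math, random

# Toy sketch of Algorithm 1's random-restart strategy: gradient ascent from
# K random initializations, keeping the restart with the best objective.

def objective(t):
    # surrogate stand-in with several local maxima; global max is about 0.973
    return math.sin(3.0 * t) - 0.1 * t * t

def gradient(t):
    return 3.0 * math.cos(3.0 * t) - 0.2 * t

def random_restart_ascent(K=10, eta=0.01, delta=1e-6, seed=0):
    rng = random.Random(seed)
    best_t, best_val = None, float("-inf")
    for _ in range(K):
        t = rng.uniform(-5.0, 5.0)          # random initialization
        prev = float("inf")
        while abs(t - prev) > delta:        # gradient ascent to a local optimum
            prev = t
            t = t + eta * gradient(t)
        if objective(t) > best_val:         # keep the best local optimum
            best_t, best_val = t, objective(t)
    return best_t, best_val

t_star, v_star = random_restart_ascent()
print(round(v_star, 3))  # best local optimum found over the restarts
```

With more restarts (larger K), the chance of missing the basin of the global maximum drops quickly, mirroring the discussion in Sect. 5.5.1.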
3.6 Comparison with previous methods
- 1.
Our framework approximates IR measures by approximating document positions, while SoftRank smooths NDCG by smoothing the document scores.
- 2.
The gradient of our approach can be computed with O(n^{2}) time complexity, while SoftRank requires O(n^{3}), where n is the number of documents for a query. That is, the computational complexity of our approach is much lower than that of SoftRank. In our experiments, our method took less than 0.5 seconds to compute the gradients of all the training queries on the OHSUMED dataset, while SoftRank took about 10 seconds.
- 3.
We propose a general framework, which can be used to optimize any position based measure, while SoftRank focuses only on NDCG, and considerable effort is needed to generalize it to other measures.
- 4.
Our approach has a solid theoretical justification of its approximation accuracy (see the next section), while no such justification yet exists for SoftRank.
4 Theoretical analysis of the framework
As mentioned in Sect. 1, the relationships between the surrogate objective functions and the corresponding IR measures are not clear for the previous methods. In contrast, the relation between the surrogate functions obtained by our framework and the IR measures can be well justified. In this section, we will study this issue.
4.1 Accuracy of position function approximation
The approximation of positions is a basic component in our framework. In order to approximate an IR measure, we need to approximate positions first; in order to analyze the accuracy of approximation of IR measures, we need to analyze the accuracy of approximation of positions first.
The following theorem shows that the position approximation in (12) can achieve very high accuracy. The proof can be found in the appendix.
Theorem 1
A corollary of Theorem 1 is given below:
Corollary 1
4.2 Accuracy of IR measure approximation
The following theorems quantify the errors in the approximations of AP and NDCG. The proofs can be found in the Appendix.
Theorem 2
The theorem indicates that when ɛ is small and β is large, the approximation of AP can be very accurate. In the extreme case, we have \(\lim_{\varepsilon\rightarrow 0, \beta\rightarrow\infty}{\widehat{\hbox{AP}}}=\hbox{AP}.\) For the example in Table 1, if setting β = 100, |D_{+}| = 1, we have \(|\widehat{\hbox{AP}}-\hbox{AP}|<0.0024.\) That is, the AP approximation is very accurate in this case.
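This accuracy can also be checked numerically. The script below uses the positions from Table 1, hypothetical binary relevance labels, and a document-indexed AP in which the position comparison 1{π(x) ≤ π(y)} is replaced by a logistic function of the approximated positions; this exact functional form is our assumption for illustration:

```python
import math

# Numerical check in the spirit of Theorem 2: the smoothed AP stays close
# to the exact AP for the documents of Table 1 under assumed labels.

def sigmoid(t):
    # numerically stable logistic function
    return 1.0 / (1.0 + math.exp(-t)) if t >= 0 else math.exp(t) / (1.0 + math.exp(t))

def exact_ap(positions, labels):
    """AP indexed by documents: average over relevant y of
    (number of relevant documents ranked at or above y) / pi(y)."""
    rel = [p for p, r in zip(positions, labels) if r == 1]
    return sum(sum(1 for q in rel if q <= p) / p for p in rel) / len(rel)

def approx_ap(pi_hat, labels, beta=100.0):
    """Smoothed AP: positions replaced by pi_hat, and each comparison
    1{pi(x) < pi(y)} replaced by sigmoid(beta * (pi_hat(y) - pi_hat(x)))."""
    rel = [p for p, r in zip(pi_hat, labels) if r == 1]
    total = 0.0
    for i, py in enumerate(rel):
        hits = 1.0 + sum(sigmoid(beta * (py - px))
                         for j, px in enumerate(rel) if j != i)
        total += hits / py
    return total / len(rel)

positions = [2, 4, 1, 5, 3]                    # exact positions from Table 1
pi_hat = [2.00118, 4.0, 1.0, 5.0, 2.99882]     # approximated positions (alpha = 100)
labels = [1, 0, 1, 0, 0]                       # hypothetical binary relevance
print(abs(approx_ap(pi_hat, labels) - exact_ap(positions, labels)))  # well below 1e-3
```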
Theorem 3
This theorem indicates that when ɛ is small, the approximation of NDCG can be very accurate. In the extreme case, we have \(\lim_{\varepsilon\rightarrow 0}{\widehat{\hbox{NDCG}}}=\hbox{NDCG}.\) For the example in Table 1, we have \(|\widehat{\hbox{NDCG}}-\hbox{NDCG}|<\frac{{\varepsilon}}{ {2\ln2}}\approx 0.00085.\) That is, the NDCG approximation is very accurate in this case.
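A similar numerical check works for NDCG, which has no truncation function: the surrogate simply replaces the exact position π(x) in the discount with the approximated position. The relevance grades below are hypothetical; the positions come from Table 1 (α = 100):

```python
import math

# Numerical check in the spirit of Theorem 3: NDCG computed from exact
# positions versus NDCG computed from approximated positions.

def ndcg_from_positions(positions, grades):
    gains = [2 ** g - 1 for g in grades]
    dcg = sum(g / math.log2(1.0 + p) for g, p in zip(gains, positions))
    ideal = sum(g / math.log2(2.0 + i)
                for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / ideal

positions = [2, 4, 1, 5, 3]                    # exact pi(x) from Table 1
pi_hat = [2.00118, 4.0, 1.0, 5.0, 2.99882]     # approximated positions
grades = [1, 0, 2, 0, 1]                       # hypothetical relevance grades
exact = ndcg_from_positions(positions, grades)
approx = ndcg_from_positions(pi_hat, grades)
print(abs(approx - exact))  # well below the 0.00085 bound discussed above
```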
From these two examples (AP and NDCG), one can see that the surrogate functions obtained by the proposed framework can be very accurate approximations to the IR measures.
4.3 Justification of accurate approximation
As shown in the previous subsection, the surrogate objective functions we obtained can be very close to the original IR measures. One may ask whether such an accurate approximation really benefits the learning of the ranking model. We address this question with the following discussion.
If an algorithm can directly optimize an IR measure on the training set, then the learned ranking model will definitely be the optimal model in terms of the IR measure on the training set. Note that in statistical machine learning, the training performance is computed as an average on the training set, while the test performance is measured as an expectation on the entire instance space (Vapnik 1998). If the training set is extremely large, the training performance will converge to the test performance (i.e., the average will converge to the expectation when the number of samples is infinite). Therefore, directly optimizing the IR measure on an extremely large training set can guarantee the optimal test performance in terms of the same IR measure.
Furthermore, it is easy to see that if the surrogate measures are very close to the IR measures (i.e., the approximations are very accurate), optimizing the surrogate measures will also lead to high performance in terms of the IR measures on the training set. Again, if the training set is very large, optimizing the surrogate measures will also yield high test performance. This intuitively justifies the necessity of accurately approximating the IR measures.
One possible issue with the accurate approximation of IR measures is that the more accurate the approximation is, the more complex the surrogate function will be. This is because the IR measure itself is very complex (as a function of the ranking model) and contains many local optima. In this case, the optimization of the surrogate function is likely to be trapped in a local optimum, and the learned model may not have the desired performance. This is why we propose using global optimization techniques in our framework.
Due to space restrictions, we have only given some high level discussions here. More details can be found in Qin et al. (2008a).
5 Experimental results
We conducted a set of experiments to test the effectiveness of the proposed framework.
5.1 Datasets
We used LETOR datasets (Liu et al. 2007) in our experiments. LETOR is a benchmark collection for research on learning to rank for information retrieval. It has been widely used in the research community for learning-to-rank research (Duh and Kirchhoff 2008; Guiver and Snelson 2008; Qin et al. 2008b; Xu et al. 2008; Zhou et al. 2008). The first version of LETOR was released in April 2007 and used in the SIGIR 2007 workshop on learning to rank for information retrieval (http://www.research.microsoft.com/users/LR4IR-2007/). At the end of 2007, the second version of LETOR was released, which was later used in the SIGIR 2008 workshop on learning to rank for IR (http://www.research.microsoft.com/users/LR4IR-2008/). The third version, LETOR 3.0, was released in December 2008.^{1}
Datasets
| Dataset | # Queries | Relevance levels | # Docs per query |
|---|---|---|---|
| TD2003 | 50 | 2 | ∼1,000 |
| TD2004 | 75 | 2 | ∼1,000 |
| OHSUMED | 106 | 3 | ∼150 |
We note that most baseline algorithms in LETOR used linear ranking models. For a fair comparison, we also used linear ranking models for ApproxAP and ApproxNDCG in the experiments, although our algorithms can also make use of other kinds of ranking models.
5.2 On the approximation of IR measures
We first evaluated the accuracy of the approximations of AP and NDCG.
We can see that for all three α values the approximation accuracy is very high, above 95%. Furthermore, as α increases, the approximation becomes more accurate: the accuracy exceeds 98% when α = 100.
We then fixed α = 100 and tried different values of β. Figure 1b shows the error ρ with respect to different β values. As can be seen, when β increases, the accuracy of the approximation also improves.
All these results verify the correctness of the discussions in Sect. 4.2, and indicate that the approximation of IR measures using our proposed method can achieve high accuracy.
5.3 On the performance of ApproxAP
We determined the hyper parameters α and β of ApproxAP as follows:
- (a)
we first chose a set of α values {50, 100, 150, 200, 250, 300} and a set of β values {1, 10, 20, 50, 100}.
- (b)
we set δ = 0.001, η = 0.01, K = 10 in Algorithm 1. That is, we made 10 random restarts.
- (c)
for each combination of α and β, we learned a ranking model with 10 random restarts to avoid local optima. We learned 30 models in total.
- (d)
we tested the performance of each model on the validation set and selected the model with the highest MAP as the final model;
- (e)
we tested the performance of the final model on the test set.
As baselines, we used AdaRank.MAP and SVM^{map}, which directly optimize AP. We also compared with Ranking SVM and ListNet, two state-of-the-art algorithms that do not belong to the direct optimization approach. We cited the results of AdaRank.MAP, SVM^{map}, Ranking SVM, and ListNet directly from the official LETOR website (http://www.research.microsoft.com/~letor). According to the information on the website, the hyper parameters of these algorithms were carefully tuned and the validation set was used for model selection. In this regard, the experimental settings for our methods and these baselines are the same, which ensures a fair comparison.
Ranking accuracy in terms of MAP
| Algorithm | TD2003 | TD2004 |
|---|---|---|
| AdaRank.MAP | 0.2283 | 0.2189 |
| SVM^{map} | 0.2445 | 0.2049 |
| Ranking SVM | 0.2628 | 0.2237 |
| ListNet | 0.2753 | 0.2231 |
| ApproxAP | 0.2834 | 0.2224 |
Furthermore, ApproxAP is better than Ranking SVM and ListNet on TD2003 and achieves results similar to theirs on TD2004. We also find that AdaRank.MAP and SVM^{map} are not as good as Ranking SVM and ListNet. We hypothesize the reason is as follows. AdaRank.MAP and SVM^{map} optimize upper bounds of AP, and it is not clear whether these bounds are tight. If a bound is loose, optimizing it does not necessarily optimize AP, and so these methods may not perform well on some datasets. This is in accordance with the discussions in He and Liu (2008).
5.4 On the performance of ApproxNDCG
We used a strategy similar to that for ApproxAP to select the hyper parameter α for ApproxNDCG. We chose the same set of α values {50, 100, 150, 200, 250, 300} and the same values of δ, η, and K, but used NDCG@n instead of MAP for model selection on the validation set.
We compared ApproxNDCG with AdaRank.NDCG and SoftRank, which directly optimize NDCG. We also compared with Ranking SVM and ListNet. We cited the results of AdaRank.NDCG, Ranking SVM, and ListNet from the official LETOR website (http://www.research.microsoft.com/~letor). Again, according to the information on the website, the hyper parameters of these algorithms were carefully tuned and the validation set was used for model selection. SoftRank has a hyper parameter σ; we tuned it and used the validation set to select the best value. That is, the same experimental strategy was applied to all the algorithms for a fair comparison.
Ranking accuracy in terms of NDCG on OHSUMED dataset
| Algorithm | @1 | @3 | @5 | @10 |
|---|---|---|---|---|
| AdaRank.NDCG | 0.5330 | 0.4790 | 0.4673 | 0.4496 |
| SoftRank | 0.5229 | 0.4732 | 0.4580 | 0.4539 |
| Ranking SVM | 0.4958 | 0.4207 | 0.4164 | 0.4140 |
| ListNet | 0.5326 | 0.4732 | 0.4432 | 0.4410 |
| ApproxNDCG | 0.5771 | 0.5037 | 0.4794 | 0.4620 |
5.5 Discussions
In this subsection, we investigate the algorithms derived from the proposed framework in more depth.
5.5.1 Approximation accuracy versus optimization feasibility
As mentioned in Sect. 4.3, the larger the hyper-parameters (α and β) are, the more accurate the approximations of the IR measures are, but the more difficult the optimization of the surrogate functions becomes.
Training performance (NDCG@5) of ApproxNDCG on fold 1 of OHSUMED dataset
| K | 1 | 100 |
|---|---|---|
| α = 50 | 0.4818 | 0.4828 |
| α = 100 | 0.4849 | 0.4862 |
| α = 300 | 0.4793 | 0.5073 |
From this table, we conclude that a larger value of α indeed makes the objective more difficult to maximize; to learn a good ranking model with a large α, more random restarts (or, more generally, more effective global optimization methods) are needed. We observed similar behavior for ApproxAP; the details are omitted here.
5.5.2 Comparison with SoftRank
In Sect. 3.6, we have performed some analysis on the comparison with SoftRank, which belongs to the same sub category of the direct optimization approach as our methods. Here we make some experimental studies, including training performance and time complexity.
After tuning its hyper parameter, SoftRank achieved a best training performance (NDCG@5) of 0.4940 on fold 1 of OHSUMED. Comparing with the results in Table 5, we see that the best training accuracy of ApproxNDCG is higher than that of SoftRank. From Tables 4 and 5, we conclude that ApproxNDCG achieved better ranking accuracy than SoftRank on both the training and test sets.
Running time per iteration on fold 1 of OHSUMED dataset
| Algorithm | Time (seconds) |
|---|---|
| ApproxNDCG | <0.5 |
| SoftRank | ∼10 |
6 Conclusions and future work
In this paper, we have set up a general framework to approximate position based IR measures. The key part of the framework is to approximate the positions of documents by logistic functions of their scores. There are several advantages of this framework: (1) the way of approximating position based measures is simple yet general; (2) many existing techniques can be directly applied to the optimization and the optimization process itself is measure independent; (3) it is easy to conduct analysis on the accuracy of the approach and high approximation accuracy can be achieved by setting appropriate parameters.
We have taken AP and NDCG as examples to show how to approximate IR measures within the proposed framework, how to analyze the accuracy of the approximation, and how to derive effective learning algorithms to optimize the approximated functions. Experiments on public benchmark datasets have verified the correctness of the theoretical analysis and have proved the effectiveness of our algorithms.
There are several directions for future work:
- 1.
The approximated measures are not convex, and there may be many local optima in training. We have used a random restart strategy to find a good solution. We plan to study other global optimization methods to further improve the performance of the proposed algorithms.
- 2.
We have used linear ranking models in the experiments. Our algorithms can be directly applied to other function classes such as neural networks; we will conduct experiments to test our algorithms with them.