1 Introduction

Ranking is the central problem for many applications of information retrieval (IR). These include document retrieval (Cao et al. 2006), collaborative filtering (Harrington 2003), key term extraction (Collins 2002), definition finding (Xu et al. 2005), important email routing (Chirita et al. 2005), sentiment analysis (Pang et al. 2005), product rating (Dave et al. 2003), and anti web spam (Gyöngyi et al. 2004). In the task of ranking, given a set of objects, we utilize a ranking model (function) to create a ranked list of the objects. The relative order of objects in the list may represent their degrees of relevance, preference, or importance, depending on the application. Among the aforementioned applications, document retrieval is arguably the most important one, and therefore we will take it as the running example throughout this paper.

Learning to rank, when applied to document retrieval, can be described as follows. Assume that there is a corpus of documents. In training, a number of queries are provided; each query is associated with a set of documents with relevance judgments; a ranking function is then created from the training data, such that the model can precisely predict the ranked lists in the training data. In retrieval (i.e., testing), given a new query, the ranking function is used to create a ranked list for the documents associated with the query. Since learning to rank can effectively leverage multiple features for ranking and can automatically learn the optimal way of combining these features, it has been gaining increasing attention in recent years. Many learning to rank methods have been proposed and applied to different IR applications.

To facilitate research on learning to rank, an experimental platform is sorely needed, which contains indexed document corpora, selected queries for training and test, feature vectors extracted for each document, implementations of baseline algorithms, and standard evaluation tools. However, there was no such environment, and this largely hindered the advancement of the related research. Researchers had to use their own datasets (i.e., different document corpora, different query sets, different features, and/or different evaluation tools), and thus it was not possible to make meaningful comparisons among different methods. This is in sharp contrast with several other fields where research has been significantly enhanced by the availability of benchmark collections, such as Reuters-21578 (Footnote 1) and RCV1 (Lewis et al. 2004) for text categorization, and UCI (Asuncion et al. 2007) for general classification. In order to accelerate the research on learning to rank, we decided to build the benchmark collection LETOR. The construction of such a collection is, however, not easy, because it requires rich domain knowledge and a lot of engineering effort. Thanks to the contributions from many people, we were able to release LETOR and upgrade it several times.

LETOR was constructed based on multiple data corpora and query sets, which have been widely used in IR. The documents in the corpora were sampled according to carefully designed strategies, and then features and metadata were extracted for each query-document pair. Additional information including hyperlink graph, similarity relationship, and sitemap was also included. The data was partitioned into five folds for cross validation, and standard evaluation tools were provided. In addition, the ranking performances of several state-of-the-art ranking methods were also provided, which can serve as baselines for newly developed methods.

LETOR has been widely used in the research community since its release. The first version of LETOR was released in April 2007 and used in the SIGIR 2007 workshop on learning to rank for information retrieval (http://www.research.microsoft.com/users/LR4IR-2007/). At the end of 2007, the second version of LETOR was released, which was later used in the SIGIR 2008 workshop on learning to rank for IR (http://www.research.microsoft.com/users/LR4IR-2008/). Based on the valuable feedback and suggestions we collected, the third version of LETOR was released in December 2008. The focus of this paper is on the third version, LETOR 3.0 (Footnote 2).

The contributions of LETOR to the research community lie in the following aspects.

  1. It eases the development of ranking algorithms.

    Researchers can focus on algorithm development, and do not need to worry about experimental setup (e.g., creating datasets and extracting features). In that sense, LETOR has greatly reduced the barrier of the research on learning to rank.

  2. It makes the comparison among different learning to rank algorithms possible.

    The standard document corpora, query sets, features, and partitions in LETOR enable researchers to conduct comparative experiments. The inclusion of baselines in LETOR also greatly saves researchers’ experimental efforts.

  3. It offers opportunities for new research topics on learning to rank.

    In addition to algorithm comparison, LETOR can also be used to study the problems of ranking model construction, feature creation, feature selection, dependent ranking, and transfer/multitask ranking.

The remaining part of the paper is organized as follows. We introduce the problem of learning to rank for IR in Sect. 2. Section 3 gives a detailed description about LETOR. Section 4 reports the performances of several state-of-the-art learning to rank algorithms on LETOR. We then show how LETOR can be used to study other research topics beyond algorithm comparison in Sect. 5. Finally limitations of LETOR are discussed in Sect. 6 and concluding remarks are given in Sect. 7.

2 Learning to rank for IR

There are two major approaches to tackling the ranking problem in IR: the learning to rank approach and traditional non-learning approaches such as BM25 (Robertson et al. 2000) and language models (Zhai et al. 2001).

The main difference between the two approaches is that the former can automatically learn the parameters of the ranking function from training data, while the latter usually determines the parameters heuristically. If a ranking model has only a few parameters, heuristic tuning may be feasible; if there are many parameters, however, it becomes very difficult. As more and more sources of evidence are proven useful for ranking, the traditional non-learning approach will face challenges in using them effectively.

In contrast, the learning to rank approach can make good use of multiple sources of evidence. Therefore learning to rank has been drawing broad attention in both the machine learning and IR communities, and many learning to rank algorithms have been proposed recently. Roughly speaking, there are three kinds of algorithms, namely, the pointwise approach (Li et al. 2006, 2008), the pairwise approach (Burges et al. 2005; Freund et al. 2003; Herbrich et al. 1999; Matveeva et al. 2006; Qin et al. 2007; Tsai et al. 2007), and the listwise approach (Cao et al. 2007; Huang et al. 2009; Qin et al. 2008b, c, 2008d, e; Taylor et al. 2008; Volkovs et al. 2009; Xia et al. 2008; Xu et al. 2007; Yue et al. 2007).

The pointwise approach regards a single document as its input in learning and defines its loss function based on individual documents. According to different output spaces of the ranking function, the pointwise approach can be further categorized as regression based algorithms (Li et al. 2008), classification based algorithms (Li et al. 2008), and ordinal regression based algorithms (Li et al. 2006).

The pairwise approach takes document pairs as instances in learning, and formalizes the problem of learning to rank as that of pairwise classification. Specifically, in learning it collects or generates document pairs from the training data, with each document pair assigned a label representing the relative order of the two documents. It then trains a ranking model using classification technologies. Using support vector machines (SVM), boosting, and neural networks as the classification model leads to Ranking SVM (Herbrich et al. 1999), RankBoost (Freund et al. 2003), and RankNet (Burges et al. 2005), respectively. Many other algorithms have also been proposed, such as FRank (Tsai et al. 2007), the multiple hyperplane ranker (Qin et al. 2007), and the nested ranker (Matveeva et al. 2006).
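
To make the pairwise formulation concrete, the following sketch shows one common way to generate labeled document pairs from the graded judgments of a single query; the function name and data layout are our own illustration, not part of LETOR or of any of the cited algorithms.

```python
from itertools import combinations

def make_pairs(docs):
    """Generate preference pairs from one query's judged documents.

    `docs` is a list of (feature_vector, relevance_label) tuples; the output
    is a list of (features_a, features_b, pair_label), where pair_label is +1
    if the first document should be ranked above the second and -1 otherwise.
    Pairs with equal labels carry no preference and are skipped.
    """
    pairs = []
    for (x_a, y_a), (x_b, y_b) in combinations(docs, 2):
        if y_a == y_b:
            continue
        pairs.append((x_a, x_b, 1 if y_a > y_b else -1))
    return pairs

# Example: three documents of one query with relevance labels 2, 1, and 0.
docs = [([0.9, 0.3], 2), ([0.5, 0.2], 1), ([0.1, 0.8], 0)]
print(make_pairs(docs))  # yields three preference pairs
```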

The listwise approach takes document lists as instances in learning and defines the loss function on that basis. Representative work includes ListNet (Cao et al. 2007), RankCosine (Qin et al. 2008e), relational ranking (Qin et al. 2008d), global ranking (Qin et al. 2008c), and StructRank (Huang et al. 2009). A sub-branch of the listwise approach is usually referred to as direct optimization of IR measures. Example algorithms include AdaRank (Xu et al. 2007), SoftRank (Taylor et al. 2008), SVM-MAP (Yue et al. 2007), PermuRank (Xu et al. 2008), ApproxRank (Qin et al. 2008b), and BoltzRank (Volkovs et al. 2009).

3 Creating LETOR collection

In this section, we introduce the processes of creating the LETOR collection, including four main steps: selecting document corpora (together with query sets), sampling documents, extracting learning features and meta information, and finalizing datasets.

3.1 Selecting document corpora

In the LETOR collection, we selected two document corpora: the “Gov” corpus and the OHSUMED corpus. These two corpora were selected because (1) they were publicly available (Voorhees et al. 2005) and (2) they had been widely used by previous work on ranking in the literature of IR (Cao et al. 2006; Craswell et al. 2004, 2003; Robertson et al. 2000).

3.1.1 The “Gov” corpus and six query sets

In TREC 2003 and 2004, there was a special track for web IR, named the Web track (Craswell et al. 2004, 2003). These tracks used the “Gov” corpus, which is based on a January 2002 crawl of the “Gov” domain. There are about one million html documents in this corpus.

Three search tasks are defined in the Web track: topic distillation (TD), homepage finding (HP) and named page finding (NP). Topic distillation aims to find a list of entry points of good websites principally devoted to the topic. The focus is on returning entry pages of good websites rather than web pages containing relevant information, because entry pages give a better overview of a website. Homepage finding aims at returning the homepage described by the query. Named page finding is about finding the page whose name is identical to the query. In principle, there is only one answer for homepage finding and named page finding. Many papers (Qin et al. 2005, 2007; Xu et al. 2007; Xue et al. 2005) have used the three tasks on the “Gov” corpus as the basis for evaluation.

The following example illustrates the differences among these three tasks (TREC 2004). Consider the query ‘USGS’, which is the acronym for the US Geological Survey.

The numbers of queries in these three tasks are shown in Table 1. For simplicity, we use acronyms in the following sections: TD2003 for the topic distillation query set in TREC 2003, TD2004 for the topic distillation query set in TREC 2004, NP2003 for the named page finding query set in TREC 2003, NP2004 for the named page finding query set in TREC 2004, HP2003 for the homepage finding query set in TREC 2003, and HP2004 for the homepage finding query set in TREC 2004.

Table 1 Number of queries in TREC Web track

3.1.2 The OHSUMED corpus

The OHSUMED corpus (Hersh et al. 1994) is a subset of MEDLINE, a database on medical publications. It consists of about 0.3 million records (out of over 7 million) from 270 medical journals during the period of 1987–1991. The fields of a record include title, abstract, MeSH indexing terms, author, source, and publication type.

A query set with 106 queries on the OHSUMED corpus has been widely used in previous work (Qin et al. 2007; Xu et al. 2007), in which each query describes a medical search need (associated with patient information and topic information). The relevance degrees of the documents with respect to the queries were judged by human annotators on three levels: definitely relevant, partially relevant, and irrelevant. There are in total 16,140 query-document pairs with relevance judgments.

3.2 Sampling documents

Due to the large scale of the corpora, it is not feasible to judge the relevance of all the documents to a given query. As a common practice in IR, given a query, only some “possibly” relevant documents are selected for judgment. For similar reasons, it is not necessary to extract feature vectors from all the documents in the corpora. A reasonable approach is to sample some “possibly” relevant documents, and then extract feature vectors from the corresponding query-document pairs. In this section, we introduce the sampling strategy used in the construction of LETOR.

For the “Gov” corpus, given a query, the annotators organized by the TREC committee labeled some relevant documents. The TREC committee treated the remaining unlabeled documents as irrelevant in the evaluation process (Craswell et al. 2003). Following this practice, in LETOR we also treated the unlabeled documents for a query as irrelevant. Specifically, following the suggestions in Qin et al. (2008a) and Minka et al. (2008), we performed the document sampling in the following way. We first used the BM25 model to rank all the documents with respect to each query, and then selected the top 1,000 documents for each query for feature extraction. Note that for some rare queries, fewer than 1,000 documents could be retrieved; as a result, some queries have fewer than 1,000 associated documents in LETOR.

Different from the “Gov” corpus, in which unjudged documents are regarded as irrelevant, in OHSUMED the judgments explicitly contain the category of “irrelevant” and the unjudged documents are ignored in evaluation (Hersh et al. 1994). Following this practice, we only sampled judged documents for feature extraction and ignored the unjudged documents. As a result, a query has on average about 152 associated documents for feature extraction.

3.3 Extracting learning features

In IR, given a query, we would like to rank a set of documents according to their relevance and importance to the query. In learning to rank, each query-document pair is represented by a multi-dimensional feature vector, and each dimension of the vector is a feature indicating how relevant or important the document is with respect to the query. For example, the first element of the vector could be the BM25 score of the document with respect to the query; the second element could be the frequency of the query terms appearing in the document; and the third could be the PageRank score of the document. In this section, we introduce how the features in LETOR were extracted.

The following principles were used in the feature extraction process of LETOR.

  1. To cover as many classical IR features as possible.

  2. To reproduce as many features proposed in recent SIGIR papers as possible, specifically those used in experiments on the OHSUMED corpus or the “Gov” corpus.

  3. To conform to the settings in the original documents or papers. If the authors suggested parameter tuning for a feature, parameter tuning was also conducted in the feature extraction of LETOR; otherwise, the default parameters reported in the documents or papers were applied directly.

The feature file in LETOR is formatted in a matrix style. A sample file is shown in Fig. 1. Each row represents a query-document pair. The first column is the relevance judgment of the query-document pair; the second column is the query ID, followed by the feature IDs and feature values.

Fig. 1 Sample data from LETOR
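
As a concrete illustration of this row format, here is a minimal parser sketch. It assumes the layout described above (a relevance label, a qid:<id> token, then <feature_id>:<value> tokens, with anything after '#' treated as a comment) and is not the official LETOR reader.

```python
def parse_letor_line(line):
    """Parse one row of a LETOR-style feature file into (label, qid, features)."""
    line = line.split("#", 1)[0].strip()      # drop a trailing comment, if any
    tokens = line.split()
    label = int(tokens[0])                    # first column: relevance judgment
    query_id = tokens[1].split(":", 1)[1]     # second column: qid:<query id>
    features = {}
    for token in tokens[2:]:                  # remaining columns: <id>:<value>
        fid, value = token.split(":", 1)
        features[int(fid)] = float(value)
    return label, query_id, features

# Example with a made-up row in the format described above:
print(parse_letor_line("2 qid:10 1:0.31 2:0.0 3:1.7 # docid = GX001"))
```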

3.3.1 Extracting features for the “Gov” corpus

For the “Gov” corpus, we extracted 64 features in total for each query-document pair, as shown in Table 2. Note that the id of each feature in this table is the same as in the LETOR datasets. We categorized the features into three classes: Q-D means that the feature is dependent on both the query and the document, D means that the feature only depends on the document, and Q means that the feature only depends on the query. Here we would like to point out that a linear ranking function cannot make use of the class-Q features, since these features are the same for all the documents under a query.

Table 2 Learning features for the “Gov” corpus

Some details about these features are listed below.

  1. We considered five streams/fields (Robertson et al. 2004) of a document: body, anchor, title, URL, and their union.

  2. q represents a query, which contains terms q_1, ..., q_t. In the two corpora, a query is expressed in two parts: the title of the query and the description of the query. Here we only considered the words appearing in the title of the query as query terms for feature extraction.

  3. c(q_i, d) denotes the number of occurrences of query term q_i in document d. Note that when talking about a stream (e.g., title), c(q_i, d) means the number of occurrences of q_i in that specific stream of document d.

  4. The inverse document frequency (IDF) of query term q_i was computed as follows,

    $$ idf(q_i) = \log{\frac{|C| - df(q_i) + 0.5}{df(q_i) + 0.5}}, $$
    (1)

    where document frequency df(q_i) is the number of documents containing q_i in a stream, and |C| is the total number of documents in the document collection. Note that IDF is document-independent, and all the documents associated with the same query have the same IDF value.

  5. |d| denotes the length (i.e., the number of words) of document d. When considering a specific stream, |d| means the length of that stream. For example, |d| of body means the length of the body of document d.

  6. The BM25 score of a document d was computed as follows,

    $$ {\rm BM25}(d,q) = \sum_{q_i \in q}{{\frac{idf(q_i)\cdot c(q_i, d) \cdot (k_1 + 1)}{c(q_i, d) + k_1 \cdot (1 - b + b \cdot {\frac{|d|} {avgdl}})}} \cdot {\frac{(k_3+1)c(q_i,q)}{k_3+c(q_i,q)}}}, $$
    (2)

    where avgdl denotes the average document length in the entire document corpus, and k_1, k_3 and b are free parameters. For features 21–25 and 61, we set k_1 = 2.5, k_3 = 0 and b = 0.8. The avgdl for each stream can be obtained from the meta information, as described in the next subsection. (A minimal sketch implementing Eqs. 1 and 2 is given after this list.)

  7. For the language model features (26–40, 62–64), the implementation and the suggested parameters given in Zhai et al. (2001) were used: for the LMIR.ABS features, we set δ = 0.7; for the LMIR.DIR features, μ = 2,000; for the LMIR.JM features, λ = 0.1.

  8. For the sitemap-based propagation features (41–42), we set the propagation rate α = 0.9 following Qin et al. (2005). For the hyperlink-based propagation features (43–48), we set the propagation rate α = 0.85 following Shakery et al. (2003).

  9. Because the original PageRank score of a document is usually very small, we amplified it by a factor of 10^5. We also scaled up the original HostRank score by a factor of 10^3.

  10. For topical PageRank and topical HITS, the same 12 categories as in Nie et al. (2006) were used.

  11. To count the number of children of a web page (feature 60), we simply used the URLs of the documents to recover the parent–child relationships between web pages.

  12. The “extracted title” was derived from the content of an html document (Hu et al. 2005); it is sometimes more meaningful than the original title of the html document.
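
As referenced in item 6 above, the following is a minimal sketch of Eqs. 1 and 2; the parameter defaults follow the “Gov” settings reported in the text, while the variable names and the simple dictionaries holding term statistics are our own assumptions.

```python
import math

def idf(df_t, num_docs):
    """Inverse document frequency of a term, as in Eq. 1."""
    return math.log((num_docs - df_t + 0.5) / (df_t + 0.5))

def bm25(query_terms, doc_tf, query_tf, doc_len, avgdl, num_docs, df,
         k1=2.5, k3=0.0, b=0.8):
    """BM25 score of one document (stream) for one query, as in Eq. 2.

    doc_tf[t]   : c(t, d), count of term t in the document stream
    query_tf[t] : c(t, q), count of term t in the query
    df[t]       : document frequency of t in the stream
    """
    score = 0.0
    for t in query_terms:
        c_td = doc_tf.get(t, 0)
        c_tq = query_tf.get(t, 0)
        if c_tq == 0:
            continue
        denom = c_td + k1 * (1.0 - b + b * doc_len / avgdl)
        score += (idf(df.get(t, 0), num_docs) * c_td * (k1 + 1.0) / denom
                  * (k3 + 1.0) * c_tq / (k3 + c_tq))
    return score

# Tiny usage example with made-up statistics:
print(bm25(["usgs"], {"usgs": 3}, {"usgs": 1}, doc_len=120, avgdl=900,
           num_docs=1_000_000, df={"usgs": 1500}))
```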

3.3.2 Extracting features for the OHSUMED corpus

For the OHSUMED corpus, 45 features in total were extracted from the streams/fields of title, abstract, and ‘title + abstract’, as shown in Table 3. Note that the id of each feature in this table is the same as in the LETOR datasets. We categorized the features into two classes: Q-D means that the feature depends on both the query and the document, and Q means that the feature only depends on the query.

Table 3 Learning features for the OHSUMED corpus

Some details about these features are listed below.

  1. For the language model features (13–15, 28–30, 43–45), the implementation and suggested parameters given in Zhai et al. (2001) were used: for the LMIR.ABS features, we set δ = 0.5; for the LMIR.DIR features, μ = 50; for the LMIR.JM features, λ = 0.5. (A sketch of these smoothing methods is given after this list.)

  2. For the BM25 feature, we set its parameters as k_1 = 1.2, k_3 = 7 and b = 0.75.
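
As referenced in item 1 above, the sketch below illustrates the three smoothing methods from Zhai et al. (2001) that underlie the LMIR features; it only shows the smoothed term probability p(t|d), from which the feature is typically computed as the log-likelihood of the query terms. Function names and default parameters (set to the OHSUMED values above) are our own.

```python
def p_jm(c_td, doc_len, p_tc, lam=0.5):
    """Jelinek-Mercer (LMIR.JM): interpolate with the collection model p(t|C)."""
    return (1.0 - lam) * (c_td / doc_len) + lam * p_tc

def p_dir(c_td, doc_len, p_tc, mu=50.0):
    """Dirichlet prior smoothing (LMIR.DIR)."""
    return (c_td + mu * p_tc) / (doc_len + mu)

def p_abs(c_td, doc_len, num_unique_terms, p_tc, delta=0.5):
    """Absolute discounting (LMIR.ABS)."""
    return (max(c_td - delta, 0.0) / doc_len
            + delta * num_unique_terms / doc_len * p_tc)
```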

3.4 Extracting meta information

In addition to the learning features, meta information that can be used to reproduce these learning features and even to create new features is also provided in LETOR. With the help of the meta information, one can conduct research on feature engineering, which is very important to learning to rank. For example, one can tune the parameters in existing learning features such as BM25 and LMIR, or investigate new features.

There are three kinds of meta information. The first is about the statistical information of a corpus; the second is about the raw information of the documents associated with a query; and the third is about the relationship among documents in a corpus.

3.4.1 Corpus meta information

An xml file was created to contain the information about the corpus, as shown in Fig. 2. As can be seen from the figure, this file contains the number of documents, the number of streams, the number of (unique) terms in each stream, and so on. One can easily obtain quantities like avgdl based on such information.

Fig. 2 XML file to describe a corpus

3.4.2 Query meta information

An xml file was created for each individual query, as shown in Fig. 3, containing the raw information about the query and its associated documents. Line 3 in the figure indicates the ID of the query, which comes from the original corpus. Other information includes streaminfo, terminfo, and docinfo.

  1. The node “streaminfo” (lines 5–10 in Fig. 3) describes the statistical information of the query terms with respect to each stream. If there are n streams in the corpus, there will be n “stream” nodes under the “streaminfo” node. The node “streamid” is consistent with the XML file containing the corpus information. We say that “a term appears in a stream (e.g., title) of a corpus” if this term is included in that stream of at least one document in the corpus. The element “streamspecificquerylength” refers to the number of query terms that appear in this specific stream. It is a building block for the language model features.

  2. The node “terminfo” (lines 12–23 in Fig. 3) contains the information of each query term with respect to each individual stream. We processed queries by stemming and removing stop words, and as a result the queries here may be slightly different from the original ones. The node “termnum” indicates the total number of terms in the query. The node “streamfrequency” indicates the number of documents in the corpus that contain the query term in stream “streamid”. The node “streamtermfrequency” indicates the number of times that the query term appears in stream “streamid” over all the documents in the corpus. A node “term” may contain multiple “stream” nodes as its children, depending on the number of streams in the corpus; a node “terminfo” may have multiple “term” nodes as its children, depending on the number of terms in the query.

  3. The node “docinfo” (lines 25–40 in Fig. 3) plays a role similar to a forward index (Brin et al. 1998), but is limited to the query terms and the selected documents. The node “docnum” indicates the number of documents selected for a query, followed by the information of these selected documents. Here “docid” is consistent with the files containing the learning features. The node “label” refers to the judgment of the document with respect to the query. The node “length” under the node “stream” means the total number of words (not limited to the query terms) in the stream of a document, and the node “uniquelength” means the number of unique words contained in the stream of a document. The node “termfrequency” indicates the number of times that a query term appears in the stream of a document. Similarly, a node “docinfo” may contain multiple “doc” child nodes; a node “doc” may contain multiple “stream” nodes; and a node “stream” may contain multiple “term” child nodes.

Fig. 3 XML file to describe a query

3.4.3 Additional meta information

As requested by researchers working on learning to rank, we have created several additional files that contain the hyperlink graph, the sitemap information, and the similarity relationship matrix of the corpora. The hyperlink graph and the sitemap information (built by analyzing the URLs of all the documents according to some heuristic rules) are specific to the “Gov” corpus. With these data, one can study link analysis algorithms and relevance propagation algorithms (Qin et al. 2005). The similarity matrix is provided for the OHSUMED corpus; it describes the similarity between all the sampled documents associated with a query. With this kind of information, one can study the problem of dependent ranking (Qin et al. 2008c, 2008d; Zhai et al. 2003).

3.5 Finalizing datasets

As described in Sect. 3.1, there are seven datasets in the LETOR collection: TD2003, TD2004, NP2003, NP2004, HP2003, HP2004, and OHSUMED. There are three versions for each dataset: Feature_NULL, Feature_MIN, and QueryLevelNorm.

  1. Feature_NULL: Since some documents may not contain query terms in certain streams (such as URL), we used “NULL” to indicate that the corresponding feature is absent for a query-document pair. One may need to preprocess these “NULL” values before applying a learning to rank algorithm. One can also study how to handle such absent features (Chechik et al. 2008) with this version of the data.

  2. Feature_MIN: In this version, the “NULL” values have been replaced by the minimum value of the corresponding feature over all the documents associated with the given query. This version of the dataset can be directly used for learning.

  3. QueryLevelNorm: Considering that feature values may vary greatly across different features or different queries, this version further conducts query-level normalization of the feature values on top of the Feature_MIN version. (A minimal sketch of such normalization is given after this list.)
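
As referenced in item 3 above, the following sketch shows one common way to implement query-level normalization, namely per-query min-max scaling of every feature; the exact procedure used to produce the official QueryLevelNorm files may differ.

```python
def query_level_normalize(rows):
    """Min-max scale each feature over the documents of a single query.

    `rows` is a list of feature-value lists, one per document of the query.
    Every feature is mapped into [0, 1]; a feature that is constant for the
    query is mapped to 0 for all of its documents.
    """
    if not rows:
        return []
    num_features = len(rows[0])
    mins = [min(r[j] for r in rows) for j in range(num_features)]
    maxs = [max(r[j] for r in rows) for j in range(num_features)]
    return [[0.0 if maxs[j] == mins[j]
             else (r[j] - mins[j]) / (maxs[j] - mins[j])
             for j in range(num_features)]
            for r in rows]

# Example: two documents of one query, each with three features.
print(query_level_normalize([[10.0, 0.5, 3.0], [20.0, 0.5, 1.0]]))
```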

We partitioned each dataset into five parts with about the same number of queries, denoted as S1, S2, S3, S4, and S5, for five-fold cross validation. In each fold, three parts are used for training, one part for validation, and the remaining part for test (see Table 4). The training set is used to learn ranking models. The validation set is used to tune the hyper parameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models. Note that since we conducted five-fold cross validation, the reported performance of a ranking method is the average performance over the five trials.
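
For concreteness, a sketch of the five-fold rotation is given below; the actual assignment of S1–S5 to training, validation and test in each fold is the one listed in Table 4, and the rotation shown here is only one plausible reading of that table.

```python
def five_fold_splits(parts):
    """Rotate five query subsets S1..S5 into (train, validation, test) splits.

    `parts` is a list of five lists of query IDs.  Each fold uses three subsets
    for training, one for validation, and one for test, as described above.
    """
    folds = []
    for i in range(5):
        train = parts[i % 5] + parts[(i + 1) % 5] + parts[(i + 2) % 5]
        validation = parts[(i + 3) % 5]
        test = parts[(i + 4) % 5]
        folds.append((train, validation, test))
    return folds

# Example with dummy query IDs:
S = [["q1", "q2"], ["q3"], ["q4"], ["q5"], ["q6"]]
for fold, (tr, va, te) in enumerate(five_fold_splits(S), start=1):
    print(fold, tr, va, te)
```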

Table 4 Data partitioning for five-fold cross validation

The LETOR collection, containing the aforementioned feature representations of documents, their relevance judgments with respect to queries, and the partitioned training, validation and test sets, can be downloaded from http://www.research.microsoft.com/~letor/.

4 Benchmarking the collection

On top of the data collections introduced above, standard evaluation tools and the evaluation results of several representative learning to rank algorithms are also provided in LETOR. The availability of these baseline results eases the comparison between existing algorithms and newly designed algorithms, and helps ensure the objectivity and correctness of the comparison.

In this section, we introduce the experimental settings for the evaluation of the baseline algorithms, and discuss the experimental results.

4.1 Evaluation measures and tools

Three widely used measures were adopted for evaluation: precision at position k (Baeza-Yates et al. 1999), average precision (AP) (Baeza-Yates et al. 1999), and normalized discounted cumulative gain (NDCG) (Järvelin et al. 2002).

4.1.1 Precision at position k (P@k)

P@k is a measure for evaluating top k positions of a ranked list using two levels (relevant and irrelevant) of relevance judgment:

$$ \hbox{P@}k={\frac{1}{k}}\sum_{j=1}^{k}{r_j}, $$
(3)

where k denotes the truncation position and

$$ r_j=\left\{ \begin{array}{ll} 1 & \hbox{if document in}\,\,j\hbox{-th position is relevant,}\\ 0 & \hbox{otherwise.}\\ \end{array} \right. $$

For a set of queries, we averaged the P@k values of all the queries to get the mean P@k value. Since P@k requires binary judgments while there are three levels of relevance judgments in the OHSUMED corpus, we simply treat “definitely relevant” as relevant and the other two levels as irrelevant when computing P@k.

4.1.2 Average precision (AP)

Average precision (AP), another measure based on two levels of relevance judgment, is defined on the basis of P@k:

$$ \hbox{AP}={\frac{1}{|D_+|}}\sum_{j=1}^N{r_j\times \hbox{P@}j}, $$
(4)

where N indicates the number of retrieved documents and |D +| denotes the number of relevant documents with respect to the query. Given a ranked list for a query, we can compute AP for this query. Then MAP is defined as the mean of AP over a set of queries.

4.1.3 Normalized discounted cumulative gain (NDCG)

NDCG@k is a measure for evaluating top k positions of a ranked list using multiple levels (labels) of relevance judgment. It is defined as follows,

$$ \hbox{NDCG@}k=N_k^{-1}\sum_{j=1}^{k}{g(r_j)d(j)}, $$
(5)

where k has the same meaning as in Eq. 3; N_k denotes the maximum (Footnote 3) of ∑_{j=1}^{k} g(r_j)d(j); r_j denotes the relevance level of the document ranked at the j-th position; g(r_j) denotes a gain function:

$$ g(r_j)=2^{r_j}-1; $$

and d(j) denotes a discount function:

$$ d(j)=\left\{ \begin{array}{ll} 1 & \hbox{for}\,\,j=1,2\\ {\frac{1}{\log_2(j)}} & \hbox{otherwise.} \\ \end{array} \right. $$

Note that in order to calculate NDCG scores, we need to define the rating of each document. For the OHSUMED dataset, we defined three ratings 0, 1, 2, corresponding to “irrelevant”, “partially relevant”, and “definitely relevant” respectively; and for the other six datasets, we defined two ratings 0, 1 corresponding to “irrelevant” and “relevant”.
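
For readers who want a quick reference implementation of Eqs. 3–5, a sketch follows. It is not the official Perl evaluation script mentioned below and may differ from it in corner cases (for example, tie handling and queries without relevant documents).

```python
import math

def precision_at_k(rels, k):
    """P@k over a ranked list of binary relevance labels (Eq. 3)."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP over the whole ranked list of binary labels (Eq. 4)."""
    num_relevant = sum(rels)
    if num_relevant == 0:
        return 0.0
    return sum(r * precision_at_k(rels, j + 1)
               for j, r in enumerate(rels)) / num_relevant

def dcg_at_k(labels, k):
    """DCG with the gain 2^r - 1 and the discount of Eq. 5."""
    def discount(j):                      # j is the 1-based rank position
        return 1.0 if j <= 2 else 1.0 / math.log2(j)
    return sum((2 ** r - 1) * discount(j + 1) for j, r in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """NDCG@k: DCG normalized by the DCG of the ideal (label-sorted) ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# Example: graded labels of a ranked list for one OHSUMED-style query.
print(ndcg_at_k([2, 0, 1, 0], 3), average_precision([1, 0, 1, 0]))
```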

To avoid differences in the evaluation results due to different implementations of these measures, a set of standard evaluation tools is provided in LETOR. The tools are written in Perl, and can output Precision, MAP and NDCG for the ranking results of a given ranking algorithm, as well as significance test results between two given ranking algorithms. We encourage all users to use these tools. By using a single set of evaluation tools, the experimental results of different methods can be easily and impartially compared. The input to the evaluation tools should be one of the original datasets in the collection, because the evaluation tools (such as Eval-Score-3.0.pl) sort documents according to their input order when the documents have the same ranking scores. That is, the evaluation tools are sensitive to the order of documents in the input file.

4.2 Baseline ranking algorithms

We tested a number of learning to rank algorithms on LETOR and provided the corresponding results as baselines. To make fair comparisons, we tried to use the same setting for all the algorithms. Firstly, most algorithms adopted linear ranking functions, except RankBoost and FRank, which adopted non-linear ranking functions by combining multiple binary weak rankers. Secondly, all the algorithms utilized MAP on the validation set for model selection.

4.2.1 Pointwise approach

As for the pointwise approach, we tested the linear regression based algorithm on LETOR. This algorithm aims to learn a linear ranking function which maps a feature vector to a real value. Since only a relevance label is provided for each query-document pair, one needs to map the label to a real value for training. The general rule is that, after the mapping, the real value of a more relevant document should be larger than that of a less relevant document. In our experiments, we used the validation set to select a good mapping from labels to real values.
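
A minimal sketch of this pointwise baseline is shown below, assuming an ordinary least-squares fit and a simple label-to-score mapping; the actual mapping used in LETOR's baseline was selected on the validation set, as stated above.

```python
import numpy as np

def fit_pointwise_regression(features, labels, label_to_score):
    """Least-squares fit of a linear scoring function f(x) = w . x.

    `features` is an (n_docs, n_features) array, `labels` holds the graded
    relevance judgments, and `label_to_score` maps each label to a real value
    (e.g., {0: 0.0, 1: 1.0, 2: 2.0}).  Returns the weight vector w.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray([label_to_score[l] for l in labels], dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Usage: rank new documents of a query by sorting X_new @ w in descending order.
```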

4.2.2 Pairwise approach

Several pairwise ranking algorithms were tested on LETOR, including Ranking SVM, RankBoost, and FRank.

The basic idea of Ranking SVM (Herbrich et al. 1999; Joachims 2002) is to formalize learning to rank as a problem of binary classification on document pairs, and then to solve the classification problem using support vector machines. The public tool SVMlight was used in our experiments. We chose the linear ranking function, and tuned the parameter c using the validation set. To reduce the training time, we set -# 5000 when running SVMlight.

RankBoost (Freund et al. 2003) also formalizes learning to rank as a problem of binary classification, but solves the classification problem by means of boosting. Like all the boosting algorithms, RankBoost trains one weak ranker at each round of iteration, and combines these weak rankers to get the final ranking function. After each round, the document pairs are re-weighted by decreasing the weights of correctly ranked pairs and increasing the weights of incorrectly ranked pairs. In our implementation, we defined each weak ranker on the basis of a single feature. With a proper threshold, the weak ranker has binary output {0, 1}. For each round of iteration, we selected the best weak ranker from (number of features) × (255 thresholds) candidates. The validation set was used to determine the best number of iterations.
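
The following sketch illustrates the weak-ranker selection step described above: for each feature, candidate thresholds are enumerated and the (feature, threshold) pair that best agrees with the current pair weights is kept. The weight-update and weak-ranker weighting formulas of Freund et al. (2003) are omitted, and the uniform threshold grid is our own simplification.

```python
def select_weak_ranker(pairs, num_features, num_thresholds=255):
    """Pick the best single-feature threshold ranker h(x) = [x[f] > theta].

    `pairs` is a list of (x_pref, x_other, weight) tuples, where x_pref should
    be ranked above x_other under the current pair-weight distribution.
    Returns (feature_index, threshold, r), where r is the weighted agreement.
    """
    best = (None, None, 0.0)
    if not pairs:
        return best
    for f in range(num_features):
        values = [x[f] for x_pref, x_other, _ in pairs for x in (x_pref, x_other)]
        lo, hi = min(values), max(values)
        for i in range(1, num_thresholds + 1):
            theta = lo + (hi - lo) * i / (num_thresholds + 1)
            # agreement of the weak ranker with the weighted preferences
            r = sum(w * ((x_pref[f] > theta) - (x_other[f] > theta))
                    for x_pref, x_other, w in pairs)
            if abs(r) > abs(best[2]):
                best = (f, theta, r)
    return best
```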

FRank (Tsai et al. 2007) is another pairwise ranking algorithm that utilizes a novel loss function called the fidelity loss within the probabilistic ranking framework (Burges et al. 2005). The fidelity loss has several nice properties for ranking. To efficiently minimize the fidelity loss, a generalized additive model is adopted. In our experiments, the validation set was employed to determine the number of weak learners in the additive model.

4.2.3 Listwise approach

Four listwise ranking algorithms were tested on LETOR, i.e., ListNet, AdaRank-MAP, AdaRank-NDCG, and SVMMAP.

ListNet (Cao et al. 2007) is based on a probability distribution on permutations. Specifically, for a document list, ListNet first uses the Luce model to define a permutation probability distribution based on the scores outputted by the ranking function, and then defines another distribution based on the ground truth labels. After that, the cross entropy between the two distributions is used to define the loss function. To define the distribution based on the ground truth, ListNet needs to map the relevance label of a query-document pair to a real-valued score. In our experiments, the validation set was used to determine the mapping.
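
A minimal sketch of the ListNet loss for a single query is given below. It uses the top-one approximation of the permutation probability (a softmax over scores), which is the form commonly used in practice; the label-to-score mapping is assumed to have been chosen on the validation set as described above.

```python
import numpy as np

def top_one_probabilities(scores):
    """Top-one probabilities of the Luce model, i.e., a softmax over the scores."""
    s = np.asarray(scores, dtype=float)
    e = np.exp(s - s.max())          # subtract the max for numerical stability
    return e / e.sum()

def listnet_loss(model_scores, label_scores):
    """Cross entropy between the label-induced and model-induced distributions."""
    p_label = top_one_probabilities(label_scores)
    p_model = top_one_probabilities(model_scores)
    return float(-(p_label * np.log(p_model + 1e-12)).sum())

# Example: one query with three documents, labels mapped to scores 2, 1, 0.
print(listnet_loss(model_scores=[1.5, 0.3, -0.2], label_scores=[2.0, 1.0, 0.0]))
```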

AdaRank-MAP (Xu et al. 2007) manages to directly optimize MAP. The basic idea of AdaRank-MAP is to repeatedly construct ‘weak rankers’ on the basis of re-weighted training queries and finally linearly combine the weak rankers for making ranking predictions. AdaRank-MAP utilizes MAP to measure the goodness of a weak ranker. In our experiments, the validation set was employed to determine the number of weak rankers.

AdaRank-NDCG (Xu et al. 2007) manages to directly optimize NDCG. The basic idea of AdaRank-NDCG is similar to that of AdaRank-MAP; the difference lies in that AdaRank-NDCG utilizes NDCG to measure the goodness of a weak ranker. In our experiments, the validation set was employed to determine the number of weak rankers.

SVMMAP (Yue et al. 2007) is a structured support vector machine (SVM) method that optimizes an upper bound of (1-AP) in the predicted rankings. There is a hyper parameter in the loss function of SVMMAP. In our experiments, the validation set was employed to determine the value of the hyper parameter.

4.3 Results

The ranking performances of the aforementioned algorithms are listed in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, and 18. From the experimental results, we find that the listwise ranking algorithms perform very well on most datasets. Among the four listwise ranking algorithms, ListNet seems to be better than the others, while AdaRank-MAP, AdaRank-NDCG and SVMMAP obtain similar performances. The pairwise ranking algorithms achieve good ranking accuracy on some (but not all) datasets. For example, RankBoost offers the best performance on TD2004 and NP2003; Ranking SVM shows very promising results on NP2003 and NP2004; and FRank achieves very good results on TD2004 and NP2004. In contrast, simple linear regression performs worse than the pairwise and listwise ranking algorithms; its results are not good on most datasets.

Table 5 NDCG on TD2003 dataset
Table 6 Precision and MAP on TD2003 dataset
Table 7 NDCG on TD2004 dataset
Table 8 Precision and MAP on TD2004 dataset
Table 9 NDCG on NP2003 dataset
Table 10 Precision and MAP on NP2003 dataset
Table 11 NDCG on NP2004 dataset
Table 12 Precision and MAP on NP2004 dataset
Table 13 NDCG on HP2003 dataset
Table 14 Precision and MAP on HP2003 dataset
Table 15 NDCG on HP2004 dataset
Table 16 Precision and MAP on HP2004 dataset
Table 17 NDCG on OHSUMED dataset
Table 18 Precision and MAP on OHSUMED dataset

We observe that most ranking algorithms perform differently on different datasets: they may perform very well on some datasets but not so well on others. To evaluate the overall ranking performance of an algorithm, we used the number of other algorithms that it can beat over all seven datasets as a measure. That is,

$$ S_i(M) = \sum_{j=1}^7\sum_{k=1}^{8}{{\bf 1}_{\{M_i(j)> M_k(j)\}}} $$

where j is the index of a dataset, i and k are the indices of algorithms, M_i(j) is the performance of the i-th algorithm on the j-th dataset in terms of measure M (such as MAP), and \({\bf 1}_{\{M_i(j)>M_k(j)\}}\) is the indicator function

$$ {\bf 1}_{\{M_i(j)>M_k(j)\}}=\left\{ \begin{array}{ll} 1 & \hbox{if}\,\,M_i(j)>M_k(j),\\ 0 & \hbox{otherwise.}\\ \end{array} \right.$$
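
A small sketch that computes this quantity from a table of per-dataset scores is given below; the score table in the example is hypothetical.

```python
def winning_numbers(performance):
    """Winning number S_i(M) for every algorithm i.

    `performance[i][j]` is the score of algorithm i on dataset j in terms of
    some measure M (e.g., MAP).  Algorithm i wins over algorithm k on dataset j
    whenever performance[i][j] > performance[k][j].
    """
    return [sum(1
                for j in range(len(row_i))
                for row_k in performance
                if row_i is not row_k and row_i[j] > row_k[j])
            for row_i in performance]

# Example with two algorithms on three (hypothetical) datasets:
print(winning_numbers([[0.30, 0.25, 0.40],
                       [0.28, 0.27, 0.35]]))   # -> [2, 1]
```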

It is clear that the larger S_i(M) is, the better the i-th algorithm performs. For ease of reference, we call this measure the winning number. Figure 4 shows the winning number in terms of NDCG for all the algorithms under investigation. From this figure, we have the following observations (Footnote 4).

  1. In terms of NDCG@1, among the four listwise ranking algorithms, ListNet is better than AdaRank-MAP and AdaRank-NDCG, while SVMMAP performs a little worse than the others. The three pairwise ranking algorithms achieve comparable results, among which Ranking SVM seems to be slightly better than the other two. Overall, the listwise algorithms seem to perform better than the pointwise and pairwise algorithms.

  2. In terms of NDCG@3, ListNet and AdaRank-MAP perform much better than the other algorithms, while the performances of Ranking SVM, RankBoost, AdaRank-NDCG, and SVMMAP are very similar to each other.

  3. For NDCG@10, one can get similar conclusions to those for NDCG@3.

Fig. 4 Comparison across seven datasets by NDCG

Comparing NDCG@1, NDCG@3, and NDCG@10, it seems that the listwise ranking algorithms have certain advantages over the other algorithms, especially at the top position (position 1) of the ranking results. Here we give a possible explanation. Because the loss functions of the listwise algorithms are defined on all the documents of a query, these algorithms can consider all the documents and make use of their position information. In contrast, the loss functions of the pointwise and pairwise algorithms are defined on a single document or a document pair; such algorithms cannot access the scores of all the documents at the same time and cannot see the position information of each document. Since most IR measures (such as MAP and NDCG) are position based, listwise algorithms, which can see the position information in their loss functions, should perform better than pointwise and pairwise algorithms, which cannot.

Figure 5 shows the winning number in terms of Precision and MAP. We have the following observations from the figure.

  1. In terms of P@1, among the four listwise ranking algorithms, ListNet is better than AdaRank-NDCG, while AdaRank-MAP and SVMMAP perform worse than AdaRank-NDCG. The three pairwise ranking algorithms achieve comparable results, among which Ranking SVM seems to be slightly better. Overall, the listwise algorithms seem to perform better than the pointwise and pairwise algorithms.

  2. For P@3, one can get similar conclusions to those for P@1.

  3. In terms of P@10, ListNet performs much better than all the other algorithms; the performances of Ranking SVM, RankBoost, and FRank are better than those of AdaRank-MAP, AdaRank-NDCG, and SVMMAP.

  4. In terms of MAP, ListNet is the best; Ranking SVM, AdaRank-MAP, and SVMMAP achieve similar results, and are better than the remaining algorithms. Furthermore, the variance among the three pairwise ranking algorithms is much larger than in terms of the other measures (P@1, 3 and 10). A possible explanation is that, since MAP involves all the documents associated with a query in the evaluation process, it can better differentiate algorithms.

Fig. 5 Comparison across seven datasets by precision and MAP

To summarize, the experimental results show that the listwise algorithms (ListNet, AdaRank-MAP, AdaRank-NDCG, and SVMMAP) have certain advantages over other algorithms, especially for the top positions of the ranking results.

Note that the above experimental results are in some sense still preliminary, since the result of almost every algorithm can be further improved. For example, for regression, we can add a regularization term to make it more robust; for Ranking SVM, if time complexity is not an issue, we can remove the constraint of -# 5000 to achieve better convergence of the algorithm; for ListNet, we can also add a regularization term to its loss function and make it generalize better to the test set. Considering these issues, we would like to call for contributions from the research community. Researchers are encouraged to submit the results of their newly developed algorithms as well as their carefully tuned existing algorithms to LETOR. In order to let others reproduce the submitted results, contributors are kindly asked to prepare a package for the algorithm, including

  1. a brief document introducing the algorithm;

  2. an executable file of the algorithm;

  3. a script to run the algorithm on the seven datasets of LETOR.

We believe that with the collaborative efforts of the entire community, we can build more versatile and reliable baselines on LETOR and better facilitate the research on learning to rank.

5 Supporting new research directions

So far LETOR has mainly been used as an experimental platform to compare different algorithms. In this section, we show that LETOR can also be used to support many new research directions.

5.1 Ranking models

Most of the previous work (as reviewed in Sect. 2) focuses on developing better loss functions, and simply uses a scoring function as the ranking model. Investigating new ranking models may be one of the major topics for the next step. For example, one can study new algorithms with pairwise or listwise ranking functions. Please note the challenge of using a pairwise/listwise ranking function: the test complexity will be much higher than that of using a scoring function. One should pay attention to this issue when performing research on pairwise and listwise ranking functions.

5.2 Feature engineering

Features are of critical importance to learning to rank algorithms. Since LETOR contains rich meta information, it can be used to study feature related problems.

5.2.1 Feature extraction

The performance of a ranking algorithm greatly depends on the effectiveness of the features used. LETOR contains the low-level information such as term frequency and document length. It also contains rich meta information about the corpora and the documents. These can be used to derive new features, and study their contributions to ranking.

5.2.2 Feature selection

Feature selection has been extensively studied for classification. However, as far as we know, studies on feature selection for ranking are still very limited. LETOR contains tens of standard features, and it is feasible to use LETOR to study the selection of the most effective features for ranking.

5.2.3 Dimensionality reduction

Different from feature selection, dimensionality reduction tries to reduce the number of features by transforming/combining the original features. Dimensionality reduction has been shown to be very effective in many applications, such as face detection and signal processing. As with feature selection, little work has been done on dimensionality reduction for ranking. It is an important research topic, and LETOR can be used to support such research.

5.3 New ranking scenarios

LETOR also offers opportunities to investigate some new ranking scenarios. Several examples are given below.

5.3.1 Query classification and query dependent ranking

In most previous work, a single ranking function is used to handle all kinds of queries. This may not be appropriate, particularly for web search. Queries in web search may vary greatly in semantics and user intentions. Using a single model alone would require compromises among queries and result in lower accuracy in relevance ranking. Instead, it would be better to exploit different ranking models for different queries. Since LETOR contains several different kinds of query sets (such as topic distillation, homepage finding, and named page finding) and rich information about queries, it is possible to study the problems of query classification and query dependent ranking (Geng et al. 2008).

5.3.2 Beyond independent ranking

Existing technologies on learning to rank assume that the relevance of a document is independent of the relevance of other documents. This assumption makes it possible to first score each document independently and then sort the documents according to their scores. In reality, the assumption may not always hold. There are many retrieval applications in which documents are not independent and relation information among documents can be or must be exploited. For example, Web pages from the same site form a sitemap hierarchy. If both a page and its parent page are about the topic of the query, then it would be better to rank the parent page higher for this query. As another example, when similarities between documents are available, we can leverage this information to enhance relevance ranking. Other problems like subtopic retrieval (Qin et al. 2007) also need to utilize relation information. LETOR contains rich relation information, including the hyperlink graph, the similarity matrix, and the sitemap hierarchy, and therefore can well support research on dependent ranking.

5.3.3 Multitask ranking and transfer ranking

Multitask learning aims at learning several related tasks at the same time, so that the learning of the tasks can benefit from each other. In other words, the information provided by the training signal for each task serves as a domain-specific inductive bias for the other tasks. Transfer learning uses the data in one or more auxiliary domains to help the learning task in the main domain. Because LETOR contains seven query sets and three different retrieval tasks, it is a good test bed for multitask ranking and transfer ranking.

To summarize, although the current use of LETOR is mostly about algorithm comparison, LETOR can actually support a much richer research agenda. We hope that more and more interesting research can be carried out with the help of LETOR, and that the state of the art of learning to rank can be significantly advanced.

6 Limitations

Although LETOR has been widely used, it has certain limitations as listed below.

6.1 Document sampling strategy

For the “Gov” datasets, the retrieval problem is essentially cast as a re-ranking task (over the top 1,000 documents) in LETOR. On one hand, this is a common practice for real-world Web search engines. Usually two rankers are used by a search engine for the sake of efficiency: first a simple ranker (e.g., BM25) is used to select some candidate documents, and then a more complex ranker (e.g., the learning to rank algorithms mentioned in this paper) is used to produce the final ranking result. On the other hand, however, there are also some retrieval applications that should not be cast as a re-ranking task. We will add datasets beyond the re-ranking setting to LETOR in the future.

For the “Gov” datasets, we sampled documents for each query using a cutoff number of 1,000. We will study the impact of the cutoff number on the performances of the ranking algorithms. It is possible that the dataset should be refined using a better cutoff number.

6.2 Features

In both the academic and industrial communities, more and more features have been studied and applied to improve ranking accuracy. The feature list provided in LETOR is far from comprehensive. For example, document features (such as document length) are not included in the OHSUMED dataset, and proximity features are not included in any of the seven datasets. We will add more features to the LETOR datasets in the future.

6.3 Scale and diversity of datasets

Compared with real-world web search, the scale (number of queries) of the datasets in LETOR is not yet very large. To verify the performance of learning to rank techniques for real-world web search, large scale datasets are needed. We are working on some large scale datasets and plan to release them in future versions of LETOR.

Although there are seven query sets in LETOR 3.0, only two document corpora are involved. We will create new datasets using more document corpora in the future.

6.4 Baselines

Most baseline algorithms in LETOR use linear ranking functions. From Sect. 4.3, we can see that the performances of these algorithms are not good enough, since a perfect ranking would achieve an accuracy of 1 in terms of all the measures (P@k, MAP and NDCG). As pointed out in Sect. 3.3, class-Q features cannot be effectively used by linear ranking functions. We will add more algorithms with non-linear ranking functions as baselines of LETOR. We also encourage researchers in the community to test more non-linear ranking models.

7 Conclusions

By explaining the data creation process and reporting the results of several state-of-the-art learning to rank algorithms in this paper, we have provided the information needed for people to better understand the nature of LETOR and to utilize the datasets more effectively in their research work.

We have received a lot of comments and feedback about LETOR after its first release. We hope we can get more suggestions from the research community. We also encourage researchers to contribute to LETOR by submitting their results.

Finally, we expect that LETOR is just a start for the research community in building benchmark datasets for learning to rank. With more and more such efforts, the research on learning to rank for IR can be significantly advanced.