1 Introduction

In recent years, learning to rank has been successfully applied to a wide variety of applications in Information Retrieval (IR) such as Document Retrieval (DR) (Duh & Kirchhoff, 2008; Liu, 2011). Learning to rank can be seen as a supervised learning task (Chapelle & Chang, 2011; Liu, 2011) whose applications usually require a large set of labeled data to accurately train a model. Since these labels might be costly to acquire, active learning (Ailon, 2012; Brinker, 2004; Long et al., 2014; Settles, 2010) and semi-supervised learning (Amini et al., 2008; Duh & Kirchhoff, 2008; Li et al., 2009; Zhu, 2005) techniques aim to reduce the manual labeling workload. In both families of methods, the model is constructed with a small set of labeled examples and a large set of unlabeled ones. Semi-supervised learning focuses more on exploiting the data, while active learning is dedicated to exploring these data. The latter interacts with an expert (oracle) to label the selected most informative examples, which is usually expensive and time-consuming. Consequently, using active learning or semi-supervised learning independently may lead to poor performance in some cases. Moreover, in the literature there are few studies that combine these two learning methods using selectively sampled and automatically labeled data, and to our knowledge, this combination has not been studied in the context of learning to rank classifiers (Dammak et al., 2017a).

Therefore, our first aim is to evaluate two active learning to rank algorithms of alternatives that combine an active learning to rank method (Dammak et al., 2015) with semi-supervised learning to reduce the labeling effort for DR. We pay particular attention to the adjustment of their respective parameters and to the impact of this adjustment on their performance. Our motivation remains to learn with a small set of labeled data in order to save time, to take advantage of both types of learning (active and semi-supervised), and to avoid some of the problems caused by employing only active or only semi-supervised learning. Our second aim is to consider some parameter settings, especially the number of labeled examples and the number of examples to be labeled by the automatic labeling method. This method is used in the labeling phase of the two algorithms and hence removes the need for an expert labeler. We expect the combination to bring a gain in time (efficiency) and to improve the experimental results (effectiveness), paying particular attention to the learning parameter settings presented in the rest of this paper. These algorithms, referred to as “Semi-Active Learning to Rank” (SAL2R) and “Active-Semi-Supervised Learning to Rank” (ASSL2R), deal with the most informative examples to improve training performance (Dammak et al., 2017a). The idea, in these algorithms, is to select only the most informative unlabeled query-document pair at each round and then to determine whether the document is relevant to this query.

Finally, we reconsider the algorithms proposed in Dammak et al. (2017b) to further consolidate the validation part by analyzing the respective influence of each parameter of the learning model. These algorithms select at each round more than one query-document pair from the unlabeled training data instead of choosing a single pair. Referred to as “Semi-Active List Learning to Rank” (SALL2R) and “Active-Semi-Supervised List Learning to Rank” (ASSLL2R), they handle a list of the most informative query-document pairs, which improves the selection strategy and thereby improves their performance and reduces selection time (Dammak et al., 2017b).

In this paper, DR is considered as an application of learning to rank. In this framework, given a set of training queries with their associated labeled documents, the learning to rank system focuses on learning an effective ranking function that assigns a score to each document and ranks the documents with respect to the query in descending order of their scores. Each query-document pair is characterized by a feature vector.

The rest of the paper is organized as follows. Section 2 discusses some related works in the domain of IR and briefly introduces the active and semi-supervised learning to rank literature. Section 3 describes in detail the learning to rank algorithms of interest in this paper, emphasizing the parameters considered in the evaluation; we carry out an experimental study to identify which parameters influence the performance of the learned model the most. Section 4 presents and analyses the experimental results related to each algorithm defined by its fixed parameters. The main conclusions of this research study are drawn and some potential perspectives are suggested in Sect. 5.

2 Related works

The central problem in IR is to extend Information Retrieval Systems (IRS) with efficient and fast models that take into account the user’s information need. Thus, many works have focused on proposing models for automatically optimizing the ranking of the search results returned by the IRS.

In recent years, more and more machine learning technologies have been used to build ranking models. A new area of research called Learning to Rank (discriminative learning) has gradually emerged as an attractive technique. It is dedicated to the optimization of ranking results and is based on automatic learning techniques. The ability to combine a large number of features is a very important advantage of learning to rank.

There are two major approaches to learning to rank, referred to as the pairwise approach (Burges et al., 2005) and the listwise approach (Cao et al., 2007). These approaches learn to rank in different ways and have been successfully applied to IR. In the pairwise approach (Cao et al., 2007; Burges et al., 2005), pairs of documents for a given query are considered as input to the learning system. The objective here is to determine which document is more relevant than the other. This approach can take into account the order relationships between pairs of documents. In the learning to rank literature, several pairwise ranking algorithms have been proposed, based on boosting (Freund et al., 2003), neural networks (Burges et al., 2005), support vector machines (Joachims, 2002) and other learning machines.

The listwise approach (Xia et al., 2008) takes all of the documents associated with a query in the learning data and predicts their labels. The input space of this approach contains a set of documents related to a query. The output space contains the ordered list (or permutation) of the documents according to their relevance or the list of their relevancy scores.

In general, the performance of ranking models is greatly affected by the number of labeled examples in the training set (Chapelle & Chang, 2011; Duh & Kirchhoff, 2008; Liu, 2011). Since labeled data are usually scarce and costly to obtain in many applications (Li, 2011; Settles, 2010), semi-supervised learning and active learning technologies (Settles & Craven, 2008) tackle the same issue of exploiting unlabeled data to reduce the number of labels needed to learn a model. Their ranking algorithms have attracted a great deal of research interest (Liu, 2011; Pan et al., 2013).

Generally, semi-supervised learning concentrates more on the exploitation of unlabeled data. It lets the machine itself label examples (Li et al., 2009; Liu, 2011; Zhu, 2005). At each round, it selects the example with the highest confidence and adds the label predicted by the machine without any human involvement (Chapelle et al., 2006; Zhu, 2005).

Transductive and inductive learning are two useful and complementary paradigms that arise from semi-supervised learning. Both seek to label unlabeled data in order to improve the performance of semi-supervised learning algorithms.

The transductive framework is only interested in the unlabeled instances of the training set. It is therefore unable to order new data absent from the learning phase. The inductive framework has a very different purpose: the aim is to be able to order any data set. It consists first in finding a function (model) from the training data and then applying this function to new test data. Inductive learning is therefore able to order new data absent from the learning phase. In fact, both inductive and transductive methods seek to label any unlabeled data. Their disadvantage is that they become computationally expensive when dealing with a large amount of unlabeled data. In order to improve the performance of learning to rank, active methods, based on active learning, have been proposed. The active learning approach (Freund et al., 1997; Roy & McCallum, 2001; Settles, 2010; Tong, 2001) selects, at each round, the example with the lowest confidence as the most informative one for labeling (Ailon, 2012; Brinker, 2004; Long et al., 2014; Settles, 2010). It needs human involvement for the labeling and incorporates the obtained information to select new examples.

On the one hand, this type of learning typically reduces the number of unlabeled data that need to be labeled (Kuwadekar & Neville, 2011). Indeed, the learner can influence the choice of the learning examples selected for labeling. This paradigm is therefore more dedicated to the exploration of unlabeled data (Settles, 2010; Huang et al., 2010). Thus, active learning can significantly improve the model’s performance and accelerate convergence. On the other hand, it offers the user some optimal selection strategies for the ranking of alternatives, in order to construct the training set of the model (Ailon, 2012) and determine which alternatives are the most informative (Truong, 2009; Settles & Craven, 2008). The most well-known strategies are uncertainty sampling, Query By Committee (QBC) (Seung et al., 1992) and expected error reduction (Truong, 2009).

Despite the advantages of both active and semi-supervised learning methods in reducing the amount of labeled data required, there is little research focusing on combining them with an automatic step that labels the most informative examples for learning to rank. However, there is a good deal of research on combining these techniques in other fields related to IR (Huang et al., 2010; Gu et al., 2014; Song et al., 2011; Krithara et al., 2011; Muslea et al., 2002; Leng et al., 2013).

Furthermore, other research studies suggested introducing a step of automatically labeling the unlabeled data (Tur et al., 2005; Zhou et al., 2006) and showed that these methods improved their results. In the same way, Dammak et al. (2017a) proposed two new inductive learning to rank algorithms for DR which combine active and semi-supervised learning to assign relevance scores to an unlabeled set of query-document pairs, and Dammak et al. (2017b) improved these inductive algorithms by selecting multiple query-document pairs in the selection stage. The results obtained demonstrated the performance of these algorithms and showed that this is a promising line of research that enhances the labeling process of the unlabeled data and thus increases the efficiency of costly human labelers. In this paper, we would like to further consolidate the evaluation part.

Techniques and methodologies have been proposed to construct learning to rank data sets that lead to efficient learning to rank at a reduced cost of obtaining relevance judgments. These methods face the challenge of selecting the appropriate queries, the appropriate documents to be judged and the evaluation metric, for an efficient, reliable and effective evaluation and learning to rank. The major goal of these methods is to select only a small subset of documents. The document selection, though, should be done in a way that does not harm the effectiveness of learning.

These methods are based on random sampling and use statistical methods to estimate the values of measures, such as Inferred Average Precision (InfAP) (Yilmaz & Aslam, 2006; Aslam et al., 2006). Other methods utilize stratified sampling, such as Statistical Average Precision (StatAP) sampling (Pavlu, 2008), or a greedy online algorithm, such as Minimal Test Collection (MTC) (Carterette et al., 2006). They test whether low-cost methods produce reliable evaluations when used to select documents and how many queries are necessary to draw robust conclusions.

In the next section, we present the algorithms (Dammak et al., 2017a, 2017b) as well as the parameters that we set and deem relevant to consider in the experimental study.

3 Learning to rank algorithms

In this section, we consider two inductive learning to rank algorithms which combine active and semi-supervised learning in order to build ranking models. These learning to rank algorithms are well adapted to DR (Truong, 2009), where documents are considered as alternatives and queries as entries or observations. These algorithms rely on active learning to rank in the context of alternatives (Dammak et al., 2015) to select the appropriate query-document pairs, and use supervised and semi-supervised learning as auxiliaries to learn ranking functions. Combining them efficiently can further reduce the manual labeling task and takes advantage of both frameworks. Moreover, the original idea is to use an automatic labeling algorithm instead of resorting to an expert for the labeling process.

In both algorithms a small data set of labeled examples denoted \(S_{L}=\{(x_{i},y_{i});i\in \{1,\ldots , m\}\}\) and a large data set of unlabeled examples denoted \(S_{U}=\{(x^{'}_{i});i\in \{1+m ,\ldots , n+m\}\}\) are considered. Each \(y_{i}\) is a vector of variable size \(m_{i}\), where \(m_{i}\) is the number of candidate alternatives for \(x_{i}\), thus \(y_{i}=(y^{1}_{i},y^{2}_{i},\ldots ,y^{m_{i}}_{i})\), where \(y^{k}_{i}\) expresses the degree of relevance of the \(k^{th}\) alternative. In this setting, the process is generally given a set of queries (observations or entries) \(X=(x_{1},x_{2},\ldots ,x_{n})\), a set of documents (alternatives) A and a set of labels (real output scores) \(Y (y^{k}_{i} \in Y)\). We assume that each query \(x_{i}\in X\) is related to a subset of known alternatives \(A_{x_{i}}\subset A\) whose labels are grouped in a variable-size vector \(y_{i}\). The \(y_{i}\) vector specifies the order that is to be predicted on the alternatives. The score function h, which should predict this order, takes an input pair \((x_{i},k)\) and returns a real score reflecting the similarity between an observation and an alternative, where \(x_{i}\) represents an observation (query) and k the index of a candidate alternative (document) for \(x_{i}\).
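To make this formulation concrete, the following minimal Python sketch shows one possible way to represent the labeled and unlabeled sets and a linear score function; the names (Query, features, score) and the linear form of h are illustrative assumptions, not part of the original formulation.

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class Query:
    """An observation x_i together with its candidate alternatives (documents)."""
    qid: str
    features: Dict[int, np.ndarray]  # k -> feature vector of the pair (x_i, k)
    labels: Dict[int, int]           # k -> relevance degree y_i^k (labeled pairs only)

def score(h: np.ndarray, query: Query, k: int) -> float:
    """A linear score function h(x_i, k): a dot product between a weight
    vector and the feature vector of the query-document pair."""
    return float(np.dot(h, query.features[k]))

# S_L: queries whose pairs carry relevance labels; S_U: queries with unlabeled pairs.
S_L: List[Query] = []
S_U: List[Query] = []
```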

Unlike the semi-supervised framework, two types of labeling strategies exist for active learning to rank: the first deals with only one entry and one alternative, whereas the second deals with all the alternatives related to the entry. The first methods use an uncertainty measure, while the most recent ones select the examples that seem to change the current model the most. Experimentally, these latter methods seem to be more competitive but suffer from greater complexity (Truong, 2009).

The basic idea, in these algorithms, is to select only one entry-alternative (query-document) pair at each round and then determine whether the alternative is relevant according to the considered entry. In this context, the algorithms employ the effective QBC selection strategy (Melville & Mooney, 2004) to select the pair on which most of the members of the set of models, called “committee models”, disagree (Freund et al., 1997). This strategy is a typical one that maintains a committee of models. All models are trained only once on the initial labeled set but represent competing hypotheses. Each committee model is then considered in order to choose the appropriate pair. The goal of this strategy is to minimize the version space. Hence, the algorithms start by learning P ranking models, called representative committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\), and then learn a ranking model h (score function). Thereafter, they randomly select a model \(h_{p}\) among the P representative committee models.

Subsequently, they select the most informative query-document pair from the unlabeled dataset in each round. The pair corresponds to the one having the maximum measure of disagreement between the representative committee model \(h_{p}\) and the model h. Once the pair is selected, the labeling process is carried out automatically with a labeling algorithm. Nevertheless, these algorithms will proceed in two distinct ways (Fig. 1).

Fig. 1  Active approach for inductive learning with a transductive knn

In what follows, we detail the particularities of each one.

3.1 Semi-active learning to rank algorithm: SAL2R

The SAL2R algorithm (Algorithm 2), described in Fig. 2, involves a supervised learning algorithm since the initial training set includes a small set of labeled query-document pairs in addition to the unlabeled ones.

SAL2R deals with two auxiliary algorithms:

  • A supervised learning to rank algorithm (SRA) to learn the P representative committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) on \(S_{L}\).

  • A Transductive-knn labeling algorithm (Algorithm 1) whose role is to assign the appropriate label to the selected query-document pair, considered the most informative one.

At first, SAL2R includes an initial phase which consists in learning P representative committee models on the currently labeled pairs. For that, \(S_{L}\) is subdivided into P partitions to which the supervised learning algorithm is applied to generate P ranking committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\). Each one is defined by a ranking function. Then, iteratively, the algorithm randomly chooses a model \(h_{p}\) among the learned P models. The effectiveness of this algorithm depends on the learning of the committee models, which must be varied enough and representative of the input space, as well as on the choice of the measure of disagreement. SAL2R also applies, iteratively, the same supervised learning algorithm to learn a ranking model h, characterized by a ranking function, from \(S_{L}\). This function is updated at each iteration, since the labeled set is increased by the newly selected labeled pair. Once the models are learned, SAL2R selects the most informative query-document pair from the unlabeled data set \(S_{U}\). This pair (\(x_{max},kmax\)), denoted \(p^{max}_{U}\), corresponds to the one that maximizes the measure of disagreement between the randomly chosen representative committee model \(h_{p}\) and the model h over all unlabeled query-document pairs \((x^{'}_{i},k)\) where \(x^{'}_{i}\in S_{U}\). This measure is defined as follows:

$$\begin{aligned}&(x_{max},kmax)=argMax_{(x^{'}_{i},k)\in S_{U}}\; d_{c}(h,h_{p})_{(x^{'}_{i},k)} \end{aligned}$$
(1)
$$\begin{aligned}&d_{c}(h,h_{p})_{(x^{'}_{i},k)}=\sqrt{|h(x^{'}_{i},k)- h_{p}(x^{'}_{i},k)|} \end{aligned}$$
(2)
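A minimal sketch of this selection step is given below; it assumes that the ranking models h and \(h_{p}\) are available as plain Python callables returning a real score for a query-document pair, and it implements only the disagreement of Eq. (2) and the arg-max of Eq. (1).

```python
import math
from typing import Callable, Iterable, Tuple

Pair = Tuple[object, int]              # (query x'_i, index k of a candidate document)
ScoreFn = Callable[[object, int], float]

def disagreement(h: ScoreFn, h_p: ScoreFn, x: object, k: int) -> float:
    """d_c(h, h_p)_{(x, k)} = sqrt(|h(x, k) - h_p(x, k)|), as in Eq. (2)."""
    return math.sqrt(abs(h(x, k) - h_p(x, k)))

def select_most_informative(h: ScoreFn, h_p: ScoreFn,
                            unlabeled_pairs: Iterable[Pair]) -> Pair:
    """Return the pair (x_max, kmax) maximizing the disagreement, as in Eq. (1)."""
    return max(unlabeled_pairs, key=lambda p: disagreement(h, h_p, p[0], p[1]))
```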

The basic idea at this stage is to introduce a labeling algorithm to label the selected pair \(p^{max}_{U}\), referred to as the transductive-knn algorithm (Algorithm 1). The main idea, inspired by the knn algorithm, is to seek the k nearest labeled query-document pairs to the selected most informative pair. After that, we choose the label L most represented among the k nearest labeled pairs (belonging to \(S_{L}\)). Finally, L is assigned as a label to the selected pair \(p^{max}_{U}\).

Fig. 2  Semi-active learning to rank proposition: SAL2R

At last, SAL2R withdraws the selected pair \(p^{max}_{U}\) from \(S_{U}\) and adds it to \(S_{L}\). These steps are repeated until the desired number of examples to be labeled is reached. As output, the algorithm provides the model H1 characterized by the required score function.

Algorithm 1

Transductive-knn Labeling Algorithm

Inputs

Labeled pairs from \(S_{L}\)

The most informative unlabeled query-document pair selected from \(S_{U}\) : \(p^{max}_{U}\)

Begin

Calculate the scores \({\{sc^{i}_{L}\}}_{i \in \{1,\ldots ,m\}}\) of labeled query-document pairs by the learned function

Calculate the score \(sc^{max}_{U}\) of \(p^{max}_{U}\)

Calculate the difference between the score of the unlabeled pair and all scores of labeled pairs

Search the k nearest labeled pairs from \(p^{max}_{U}\)

Select the label L predominantly represented for the k nearest pairs

Assign this label L to the unlabeled pair \(p^{max}_{U}\)

End

Output: Labeled selected pair: \(p^{max}_{L}\)
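The following Python sketch mirrors Algorithm 1 under the assumption that the learned function is available as a score callable and that each labeled pair carries its label; as in the algorithm, the distance between pairs is taken as the absolute difference between their scores.

```python
from collections import Counter
from typing import Callable, List, Tuple

LabeledPair = Tuple[object, int, int]          # (query, document index, label)

def transductive_knn_label(score: Callable[[object, int], float],
                           labeled_pairs: List[LabeledPair],
                           p_max: Tuple[object, int],
                           k: int = 10) -> int:
    """Assign a label to the selected unlabeled pair p_max (Algorithm 1)."""
    sc_max = score(*p_max)                      # score of the selected pair
    # Difference between the score of p_max and the score of every labeled pair.
    diffs = [(abs(score(x, d) - sc_max), y) for (x, d, y) in labeled_pairs]
    diffs.sort(key=lambda t: t[0])              # the k nearest labeled pairs
    nearest_labels = [y for (_, y) in diffs[:k]]
    # Label predominantly represented among the k nearest labeled pairs.
    return Counter(nearest_labels).most_common(1)[0][0]
```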

Many learning to rank algorithms consider pairs of entries in the learning process. They are referred to as pairwise approaches (Cao et al., 2007), such as the supervised algorithms RankBoost (Freund et al., 2003) and LambdaMART (Qiang et al., 2010). Other learning to rank algorithms solve the ranking problem by minimizing a loss function defined on lists of objects. They are referred to as listwise approaches (Cao et al., 2007), such as the supervised algorithm AdaRank (Xu & Li, 2007). The pairwise and listwise approaches may therefore consider different input and output spaces, deal with different hypotheses, and rely on different loss functions (Liu, 2011). In DR, the input space of the pairwise approach consists of pairs of documents for a given query, each represented by a feature vector. The output space is the pairwise preference (taking values in \(\left\{ -1, +1 \right\}\)) between each pair of documents. The input space of the listwise approach, however, consists of the entire set of documents associated with a query in the training data, and its output space is the ranked list of the documents. Over the years, each approach has shown strong empirical ranking performance in the IR field (Xia et al., 2008; Liu, 2011). We recommend choosing, as supervised algorithms, boosting algorithms well known in DR: RankBoost (Freund et al., 2003), AdaRank (Xu & Li, 2007) and LambdaMART (Qiang et al., 2010). The resulting algorithms are referred to as SAL2R-RankBoost, SAL2R-AdaRank and SAL2R-LambdaMART respectively.

In the following, we give the SAL2R algorithm:

Algorithm 2

Semi-Active Learning to Rank algorithm: SAL2R

Inputs

Small set of labeled data \(S_{L}=\{(x_{i},y_{i});i\in \{1,\ldots , m\}\}\)

Large set of unlabeled data \(S_{U}=\{(x^{'}_{i});i\in \{1+m ,\ldots , n+m\}\}\)

Supervised learning to rank algorithm SRA : RankBoost \(\setminus\) AdaRank \(\setminus\) LambdaMART

Labeling algorithm: Transductive knn

Number of \(S_{L}\) partitions: P

Number of required examples to be labeled: NbLab

Begin

Learn P committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) with SRA

\(nbIter \leftarrow 1\)

While \(nbIter <= NbLab\) do

Learn a ranking function h with SRA on \(S_{L}\)

Choose randomly a committee model \(h_{p}\)

Select the most informative query-document pair (\(p^{max}_{U}(x_{max},kmax)\)) from \(S_{U}\) which maximizes the measure of disagreement \(d_{c}(h,h_{p})_{(x^{'}_{i},k) \in S_{U}}\)

Label the selected pair with the labeling algorithm

Withdraw this pair from \(S_{U}\) and add it to \(S_{L}\)

\(nbIter \leftarrow nbIter +1\)

End while

End

Output: Model H1

At this stage, we think that the following parameters are relevant and may influence the performance of the learned ranking function (a sketch of the resulting training loop is given after the list).

  • the considered supervised learning to rank algorithm: SRA.

  • the number of \(S_{L}\) partitions: P.

  • the number of labeled examples: \(S_{L}\).

  • the number of examples to be labeled NbLab. We notice here that \(NbLab < n = |S_{U}|\).
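As an illustration of how these parameters fit together, the loop below outlines SAL2R (Algorithm 2) in Python. The SRA is abstracted as a train function returning a score callable, and select_most_informative and transductive_knn_label are the sketches given earlier; this is a sketch of the control flow, not the authors' implementation.

```python
import random

def sal2r(S_L, S_U, train, P=5, NbLab=100, k=10):
    """SAL2R sketch. `train(pairs)` stands for the supervised SRA (e.g. RankBoost,
    AdaRank or LambdaMART) and returns a score callable; S_L is a list of labeled
    (query, doc, label) triples and S_U a list of unlabeled (query, doc) pairs."""
    # Learn the P committee models, one per partition of S_L.
    partitions = [S_L[p::P] for p in range(P)]
    committee = [train(part) for part in partitions]

    for _ in range(NbLab):
        h = train(S_L)                          # ranking function on the current S_L
        h_p = random.choice(committee)          # randomly chosen committee model
        p_max = select_most_informative(h, h_p, S_U)
        label = transductive_knn_label(h, S_L, p_max, k=k)
        S_U.remove(p_max)                       # withdraw the pair from S_U ...
        S_L.append((p_max[0], p_max[1], label)) # ... and add it, labeled, to S_L
    return train(S_L)                           # final model H1
```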

3.2 Active-semi-supervised learning to rank algorithm: ASSL2R

The ASSL2R algorithm (Algorithm 3) differs from the SAL2R algorithm in the way it learns the ranking functions (Fig. 3). Indeed, we further assume that the training set contains partially labeled pairs. The idea introduced here is to add a subset \(S_{U1}\), extracted from the large set of unlabeled data \(S_{U}\), to the small labeled training set \(S_{L}\). ASSL2R deals with three auxiliary algorithms:

  • A supervised learning to rank algorithm (SRA) to learn the P representative committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) on \(S_{L}\).

  • A semi-supervised learning to rank algorithm (SSRA) to learn a model h on \(S_{L}\cup S_{U1}.\)

  • A Transductive-knn labeling algorithm (Algorithm 1).

As in the SAL2R algorithm, the most informative unlabeled pair \(p^{max}_{U}\) is the one that maximizes the measure of disagreement between the model h and the model \(h_{p}\) randomly chosen at each iteration, nevertheless this pair will be selected from the remaining dataset \(S_{U2}\).

Fig. 3  Active-semi-supervised learning to rank proposition: ASSL2R

The Semi-AdaRank algorithm (Dammak et al., 2017a) is a two-stage algorithm that combines the label propagation process (Zhu & Ghahramani, 2002) and a regularized version of AdaRank (Xu & Li, 2007). It is based on the LP-AdaRank algorithm proposed by Miao and Tang (2013). The key concern of this work is the label-propagation phase, which consists in labeling only the \(S_{U1}\) subset and then learning the model h on \(S_{L}\cup S_{U1}\). As for the regularized version of AdaRank, Semi-AdaRank optimizes a novel performance measure and provides a flexible framework that shares the advantages of theoretical soundness, efficiency in training and high performance in testing. The label propagation process proposed in Dammak et al. (2017a) is a graph-based semi-supervised learning framework (Fujiwara & Irie, 2014). The idea is to propagate the relevance labels from labeled examples to unlabeled ones, so that more training data become available to learn the ranking function. In the graph, nodes represent the training data and edges represent similarities between them; the similarities are encoded in a weight matrix W defined over the training examples. In the following, we present the ASSL2R algorithm.
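Before giving the algorithm, the generic label propagation scheme on which this phase relies can be sketched as follows; this is a simplified illustration of Zhu and Ghahramani's method under an assumed Gaussian similarity between feature vectors, not the exact Semi-AdaRank procedure.

```python
import numpy as np

def label_propagation(X_l, y_l, X_u, n_classes, sigma=1.0, n_iter=100):
    """Propagate labels from labeled examples (X_l, y_l) to unlabeled examples X_u.
    Returns the predicted class index for each unlabeled example."""
    X = np.vstack([X_l, X_u])
    n_l = len(X_l)
    # Gaussian similarity (weight) matrix between all training examples.
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-dist2 / (2 * sigma ** 2))
    T = W / W.sum(axis=1, keepdims=True)         # row-normalized transition matrix
    # Label matrix: one row per example, one column per relevance class.
    Y = np.zeros((len(X), n_classes))
    Y[np.arange(n_l), y_l] = 1.0
    for _ in range(n_iter):
        Y = T @ Y                                # propagate labels through the graph
        Y[:n_l] = 0.0
        Y[np.arange(n_l), y_l] = 1.0             # clamp the labeled examples
    return Y[n_l:].argmax(axis=1)
```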

Algorithm 3

Active-Semi-Supervised Learning to Rank algorithm: ASSL2R

Inputs

Small set of labeled data \(S_{L}=\{(x_{i},y_{i});i\in \{1,\ldots , m\}\}\)

Large set of unlabeled data \(S_{U}=\{(x^{'}_{i});i\in \{1+m ,\ldots , n+m\}\} = S_{U1}\cup S_{U2}\)

Supervised learning to rank algorithm SRA: RankBoost \(\setminus\) AdaRank \(\setminus\) LambdaMART

Semi-supervised learning to rank algorithm SSRA: Semi-AdaRank

Labeling algorithm: Transductive knn

Number of \(S_{L}\) partitions: P

Number of required examples to be labeled: NbLab

Begin

Learn P committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) with SRA

\(nbIter \leftarrow 1\)

While \(nbIter <= NbLab\) do

Learn a ranking function h with Semi-AdaRank on \(S_{L}\cup S_{U1}\)

Choose randomly a committee model \(h_{p}\)

Select the most informative query-document pair from \(S_{U2}\) which maximizes the measure of disagreement \(d_{c}(h,h_{p})_{(x^{'}_{i},k) \in S_{U2}}\)

Label the selected pair with the labeling algorithm

Withdraw this pair from \(S_{U2}\) and add it to \(S_{L}\)

\(nbIter \leftarrow nbIter +1\)

End while

End

Output: Model H2

For this algorithm, we consider as parameters, on the one hand, the number of partitions P, which corresponds to the number of representative committee models, as well as the number of labeled examples \(|S_{L}|\) and the number of examples to be labeled NbLab. On the other hand, we focus particularly on the distribution of \(S_{U}\) between \(S_{U1}\) and \(S_{U2}\).

Selecting the most useful features within the ranking functions and decreasing execution times are important issues in learning to rank. In the next two sections, we present an improvement of the two previous algorithms SAL2R and ASSL2R that selects multiple pairs in the selection phase in order to accelerate it. These algorithms, denoted “Semi-Active List Learning to Rank” (SALL2R) and “Active-Semi-Supervised List Learning to Rank” (ASSLL2R) (Dammak et al., 2017b), select at each iteration a list of the most informative query-document pairs from the unlabeled training data instead of a single pair, and then determine their relevance with an automatic labeling method.

3.3 Semi-active list learning to rank algorithm: SALL2R

We describe in this section the SALL2R algorithm (Algorithm 4), which, unlike the SAL2R algorithm, considers more than one query-document pair to be labeled at each round.

Like the SAL2R algorithm, SALL2R uses two auxiliary algorithms:

  • A listwise supervised learning to rank algorithm to learn once the P representative committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) on \(S_{L}\), and then to iteratively learn a model h on \(S_{L}\) augmented with the newly labeled selected pairs.

  • A Transductive-knn labeling algorithm (Algorithm 1).

Firstly, SALL2R consists in learning P representative committee models on \(S_{L}\) denoted \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\). Then, SALL2R will randomly choose at each round a model \(h_{p}\) among the learned P models.

Secondly, SALL2R uses the same supervised learning algorithm at each round to learn a ranking model h, characterized by a ranking function, from \(S_{L}\). Once the models are learned, SALL2R selects a list of the most informative query-document pairs from the unlabeled data set \(S_{U}\), namely those that maximize the measure of disagreement (Eq. 1) between the randomly chosen representative committee model \(h_{p}\) and the model h. The ranking function is updated iteratively since \(S_{L}\) is increased by the newly selected labeled pairs.
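A sketch of this multi-pair selection step, assuming the same disagreement function as in the earlier sketch, reduces to a top-Nbp selection instead of a single arg-max:

```python
import heapq

def select_top_pairs(h, h_p, unlabeled_pairs, nbp=10):
    """Return the Nbp unlabeled pairs with the largest disagreement d_c(h, h_p)."""
    return heapq.nlargest(
        nbp, unlabeled_pairs,
        key=lambda p: disagreement(h, h_p, p[0], p[1]))
```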

As the supervised algorithm, we propose the well-known boosting algorithm AdaRank (Xu & Li, 2007), which is among the first listwise learning to rank algorithms of alternatives.

Algorithm 4

Semi-Active List Learning to Rank algorithm: SALL2R

Inputs

Small set of labeled data \(S_{L}=\{(x_{i},y_{i});i\in \{1,\ldots , m\}\}\)

Large set of unlabeled data \(S_{U}=\{(x^{'}_{i});i\in \{1+m ,\ldots , n+m\}\}\)

Supervised learning to rank algorithm SRA : AdaRank \(\setminus\) LambdaMART

Labeling algorithm: Transductive knn

Number of \(S_{L}\) partitions: P

Number of required examples to be labeled: NbLab

Number of most informative query-document pairs: Nbp

Begin

Learn P committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) with SRA

\(nbIter \leftarrow 1\)

While \(nbIter <= NbLab\) do

Learn a ranking function h with AdaRank on \(S_{L}\)

Choose randomly a committee model \(h_{p}\)

Select the Nbp most informative query-document pairs from \(S_{U}\) which maximize the measure of disagreement \(d_{c}(h,h_{p})_{(x^{'}_{i},k) \in S_{U}}\)

Label the Nbp selected pairs with the labeling algorithm

Remove these pairs from \(S_{U}\) and add them to \(S_{L}\)

\(nbIter \leftarrow nbIter +1\)

End while

End

Output: Model H

We consider for this algorithm the same parameters for the experimental study as for Algorithm 2. We focus particularly on the number of pairs to be labeled in a given round (Nbp).

3.4 Active-semi-supervised list learning to rank algorithm: ASSLL2R

The “Active-Semi-Supervised List Learning to Rank” (ASSLL2R) algorithm (Algorithm 5) constitutes an improvement of the ASSL2R algorithm: it selects, at each round, a list of the most informative query-document pairs from the unlabeled training data \(S_{U2}\), namely those that maximize the measure of disagreement \(d_{c}(h,h_{p})_{(x^{'}_{i},k)}\). The ASSLL2R algorithm uses a partially labeled initial training set, as it considers a subset \(S_{U1}\), extracted from the large set of unlabeled data \(S_{U}\), together with the small labeled training set \(S_{L}\).

Like the ASSL2R algorithm (Algorithm 3), ASSLL2R uses three auxiliary algorithms:

  • A supervised AdaRank algorithm for learning of P representative committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) on \(S_{L}\).

  • A semi-supervised Semi-AdaRank algorithm to learn a model h on \(S_{L}\cup S_{U1}\).

  • A Transductive-knn labeling algorithm (Algorithm 1).

In the following, we give the ASSLL2R algorithm.

Algorithm 5

Active-Semi-Supervised List Learning to Rank algorithm: ASSLL2R

Inputs

Small set of labeled data \(S_{L}=\{(x_{i},y_{i});i\in \{1,\ldots , m\}\}\)

Large set of unlabeled data \(S_{U}=\{(x^{'}_{i});i\in \{1+m ,\ldots , n+m\}\} = S_{U1}\cup S_{U2}\)

Supervised learning to rank algorithm SRA : AdaRank \(\setminus\) LambdaMART

Semi-supervised learning to rank algorithm SSRA: Semi-AdaRank

Labeling algorithm: Transductive knn

Number of \(S_{L}\) partitions : P

Number of required examples to be labeled: NbLab

Number of most informative query-document pairs: Nbp

Begin

Learn P committee models \(\{h_{p}\}_{p\in \{1,\ldots ,P\}}\) with SRA

\(nbIter \leftarrow 1\)

While \(nbIter <= NbLab\) do

Learn a ranking function h with Semi-AdaRank on \(S_{L}\cup S_{U1}\)

Choose randomly a committee model \(h_{p}\)

Select the Nbp most informative query-document pairs from \(S_{U2}\) which maximize the measure of disagreement \(d_{c}(h,h_{p})_{(x^{'}_{i},k) \in S_{U2}}\)

Label the Nbp selected pairs with the labeling algorithm

Remove these pairs from \(S_{U2}\) and add them to \(S_{L}\)

\(nbIter \leftarrow nbIter +1\)

End while

End

Output: Model H

The two crucial parameters to consider for this algorithm are the respective sizes of \(S_{U1}\) and \(S_{U2}\) in addition to Nbp.

In the following section, we discuss the experimental part of this paper.

4 Experimental study

We chose DR (Liu, 2011) as an experimental framework to validate the proposed learning to rank algorithms and their improvements, and to show the interest of semi-supervised and active learning to rank in improving results. We carried out a number of empirical experiments, varying several parameters, in order to compare the different results obtained and to evaluate the importance of unlabeled data for learning the ranking functions in the proposed algorithms.

4.1 Experiment setup

Experiments are conducted on LETOR (LEarning TO Rank) (Liu et al., 2007), the standard benchmark for learning to rank released by Microsoft Research Asia. LETOR, which was constructed from multiple data corpora and query sets, is widely used in IR. We mainly exploited the “MQ2007”, “MQ2008”, “MQ2007-semi” and “MQ2008-semi” (Million Query track) collections from LETOR 4.0. These collections are built on the .GOV2 corpus using the TREC 2007 and TREC 2008 Million Query tracks respectively, with documents extracted from Web sites in the .gov domain. Each of these collections is partitioned into five parts, denoted S1, S2, S3, S4 and S5, in order to perform five-fold cross validation experiments (Liu et al., 2007). For each fold, there are three subsets: a training set, a validation set and a testing set. There are about 1700 queries with labeled documents in the MQ2007 dataset, and about 70,000 query-document pairs, while MQ2008 has 800 queries and about 15,000 query-document pairs. MQ2007-semi and MQ2008-semi contain a small set of labeled query-document pairs and a large amount of unlabeled query-document pairs (in the training set but not in the validation and testing sets). There are about 2000 queries in these datasets. On average, each query is associated with about 40 labeled documents and about 1000 unlabeled documents. Each query-document pair in the dataset is given a relevance level (\(-1\), 0, 1 or 2), where \(-1\) means “unlabeled” and a greater number means more relevance. Each query-document pair is represented by a feature vector that contains 46 features such as TF-IDF, BM25 and LMIR (Figs. 4, 5).

Fig. 4  Query-document fields corresponding to a line of the MQ2007-semi and MQ2008-semi collections

Fig. 5  Example of a line extracted from the MQ2007-semi collection
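For illustration, each line of these collections follows an SVMlight-style format (relevance label, query id, then index:value feature pairs, followed by a comment with the document identifier); a minimal parser might look as follows. The exact trailing comment fields may differ between LETOR releases, so this is a sketch rather than a definitive reader.

```python
import numpy as np

def parse_letor_line(line: str):
    """Parse one LETOR 4.0 line of the form
    '<label> qid:<id> 1:v1 2:v2 ... 46:v46 #<comment>'.
    Returns (label, qid, feature vector); a label of -1 denotes an unlabeled pair."""
    body = line.split('#', 1)[0].split()     # drop the trailing comment
    label = int(body[0])
    qid = body[1].split(':', 1)[1]
    feats = np.zeros(46)
    for tok in body[2:]:
        idx, val = tok.split(':')
        feats[int(idx) - 1] = float(val)
    return label, qid, feats

# Hypothetical usage:
# label, qid, x = parse_letor_line("2 qid:10 1:0.03 2:0.70 46:0.12 #docid = GX001-23")
```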

Normalized Discounted Cumulative Gain (NDCG@n), Precision at position n (P@n) and Mean Average Precision (MAP) (Järvelin & Kekäläinen, 2000) are used as standard ranking performance measures to evaluate the retrieval effectiveness of our experiments on LETOR.

Mean Average Precision (MAP) is a measure for assessing the quality of a ranked list of results in the case of binary relevance judgments (relevant vs irrelevant documents). It is defined using the Precision at position n (P@n). P@n for a query q corresponds to the ratio of relevant documents among the top n documents returned in the ranking results for this query.

$$\begin{aligned}&P@n=\frac{\#relevant\, documents\, in\; top\; n\, results}{n} \end{aligned}$$
(3)
$$\begin{aligned}&MAP=\frac{1}{Q}\sum _{q=1}^{Q}\frac{\sum _{n=1}^{N}(P@n \cdot rel(n))}{\# total\, relevant\, documents\, of\, this\, query} \end{aligned}$$
(4)

Q is the number of queries in the considered collection, N is the number of retrieved documents for the query, and rel(n) is a binary indicator equal to 1 if the document at position n is relevant and 0 otherwise.

The Normalized Discounted Cumulative Gain at position n (NDCG@n) is calculated by the following equation, where r(j) is the degree of relevance of the document at position j in the ranking list and \(Z_{n}\) is a normalization constant chosen so that a perfect ranking obtains an NDCG@n of 1.

$$\begin{aligned} NDCG@n=\frac{1}{Q}\sum _{q=1}^{Q}\frac{1}{Z_{n}}\sum _{j=1}^{n}\frac{2^{r(j)}-1}{log(1+j)} \end{aligned}$$
(5)
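A minimal implementation of these measures, assuming binary relevance for P@n and MAP and graded relevance r(j) for NDCG@n, could look like the sketch below; the normalization \(Z_{n}\) is computed as the DCG of the ideal (sorted) ranking.

```python
import numpy as np

def precision_at_n(rels, n):
    """P@n: fraction of relevant documents among the top n (rels is a ranked 0/1 list)."""
    return float(np.sum(rels[:n])) / n

def average_precision(rels):
    """AP for one query; MAP is the mean of AP over all queries (Eq. 4)."""
    rels = np.asarray(rels)
    if rels.sum() == 0:
        return 0.0
    precisions = [precision_at_n(rels, i + 1) for i in range(len(rels)) if rels[i]]
    return float(np.sum(precisions) / rels.sum())

def dcg_at_n(grades, n):
    """Discounted cumulative gain of the top n graded relevance values."""
    grades = np.asarray(grades, dtype=float)[:n]
    return float(np.sum((2 ** grades - 1) / np.log2(np.arange(2, len(grades) + 2))))

def ndcg_at_n(grades, n):
    """NDCG@n for one query (Eq. 5); Z_n is the DCG of the ideal ranking."""
    ideal = dcg_at_n(sorted(grades, reverse=True), n)
    return dcg_at_n(grades, n) / ideal if ideal > 0 else 0.0
```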

In the next section, we present a series of experiments in order to compare the proposed learning to rank algorithms with reference algorithms on which they are based.

4.2 Experimental results

The first part of our experimental study (Sect. 4.2.1) concerns the two algorithms SAL2R and ASSL2R. Our principal objective is to evaluate them against some supervised (RankBoost, AdaRank and LambdaMART), semi-supervised (Semi-RankBoost and Semi-AdaRank) and active learning to rank algorithms (Active-RankBoost and Active-AdaRank; Dammak et al., 2015). Furthermore, we compare them with each other by considering the different supervised learning to rank algorithms SRA as auxiliaries, which leads to the following variants:

  • SAL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART

  • ASSL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART

The evaluation was carried out on the MQ2007, MQ2008 (Tables 1, 2, 3, 4), MQ2007-semi and MQ2008-semi (Tables 5, 6, 7) collections based on the NDCG@n, P@n and MAP measures. Moreover, we want to study the impact on the effectiveness of the trained ranking function of varying the number of labeled data considered for training \(|S_{L}|\) (Figs. 6, 7), the number of \(S_{L}\) partitions P (the number of committee models) (Fig. 9) and the number of required examples to be labeled NbLab (Fig. 8). Finally, we focus on the distribution of the unlabeled examples \(S_{U}\) between \(S_{U1}\) and \(S_{U2}\) (Fig. 10).

Table 1 NDCG@n measures on the MQ2007 collection
Table 2 P@n and MAP measures on the MQ2007 collection
Table 3 NDCG@n measures on the MQ2008 collection
Table 4 P@n and MAP measures on the MQ2008 collection
Table 5 NDCG@n measures on the MQ2007-semi collection
Table 6 NDCG@n measures on the MQ2008-semi collection
Table 7 MAP measures on MQ2007-semi and MQ2008-semi
Fig. 6  Variation of MAP as a function of the number of labeled examples (Active-RankBoost, SAL2R-RankBoost, ASSL2R-RankBoost) on MQ2008-semi

Fig. 7  Variation of MAP as a function of the number of labeled examples (Active-AdaRank, SAL2R-AdaRank, ASSL2R-AdaRank) on MQ2008-semi

Fig. 8  Variation of MAP as a function of the number of labeled examples \(S_{L}\) (\(NbLab = 100\), 500 and 1000) for the ASSL2R-AdaRank algorithm

The second part of our experimental study (Sect. 4.2.2) treats the evaluation of the two algorithms SALL2R and ASSLL2R, paying particular attention to the variation of Nbp (the number of most informative query-document pairs to be labeled at a given iteration). The results obtained for the MQ2007 and MQ2008 collections in terms of NDCG@n are reported in Tables 9 and 10, and in terms of MAP in Table 11. Those obtained for MQ2007-semi and MQ2008-semi in terms of MAP are presented in Table 12.

There were two learning parameters to be tuned. We tested several values of the k nearest neighbors parameter of the labeling algorithm: \(k = 5\), 7, 10 and 12. We noted that this variation had no significant influence on the results. This parameter was fixed to 10 (\(k = 10\)) in our series of experiments.

4.2.1 Evaluation of SAL2R and ASSL2R

Tables 1 and 2 show an improvement in the results obtained by SAL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART and ASSL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART on the MQ2007 dataset (improvement \(\displaystyle \simeq 10.5 \%\)) compared to RankBoost\(\backslash\) AdaRank\(\backslash\) LambdaMART. The overall results are very similar to those obtained on the MQ2008 dataset in Tables 3 and 4 (improvement \(\displaystyle \simeq 11.5 \%\)). From these results, we can see clear performance improvements brought by the algorithms on all the metrics. This series of experiments validates the idea of combining the active (or active-semi-supervised) approach with a step of automatic labeling.

From Tables 5 and 7, we found that the performances in terms of NDCG@n and MAP of the SAL2R-RankBoost and ASSL2R-RankBoost algorithms on the MQ2007-semi collection were better than those obtained by Active-RankBoost, except for NDCG@5 of SAL2R-RankBoost. Moreover, we notice that the results of SAL2R-AdaRank and ASSL2R-AdaRank were better than those obtained by Active-AdaRank in terms of the NDCG@n and MAP measures (Tables 6 and 7) (improvement \(\displaystyle \simeq 14.5 \%\)). Nevertheless, the MAP of ASSL2R-RankBoost on the MQ2008-semi collection was lower than that obtained by SAL2R-RankBoost and Active-RankBoost. Conversely, we observe that the MAP of ASSL2R-AdaRank is quite high (0.6227) according to Table 7. Table 6 shows, on the MQ2008-semi collection, that the ASSL2R-RankBoost algorithm performs better than the Active-RankBoost algorithm (improvement \(\displaystyle \simeq 13.5 \%\)). It also shows that ASSL2R-AdaRank is more effective than Active-AdaRank in terms of all measures except NDCG@3 (improvement \(\displaystyle \simeq 12 \%\)).

From these experiments, we also compared the SAL2R and ASSL2R algorithms by examining the results obtained respectively by SAL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART and ASSL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART. These results show improvements indicating that the ASSL2R algorithm is more effective than SAL2R (improvement \(\displaystyle \simeq 2.5 \%\)). This observation supports the idea of using partially labeled pairs in the training set, more precisely adding a subset \(S_{U1}\), extracted from the large set of unlabeled data \(S_{U}\), to the small labeled training set \(S_{L}\), together with the use of the semi-supervised learning algorithm to learn a model h on \(S_{L}\cup S_{U1}\).

To deepen our experimental study, we looked for some examples of queries for which the AP measures obtained are better than the MAP obtained over the whole set of queries (Table 8).

Table 8 AP measures of three queries extracted from the MQ2008-semi collection

We conclude from these tables that, in general, the use of unlabeled data consistently improves the effectiveness of the proposed algorithms in terms of NDCG@n, P@n and MAP. These results show that the proposed algorithms are more effective than the supervised\(\backslash\) semi-supervised learning to rank algorithms and the active learning to rank algorithms Active-RankBoost and Active-AdaRank.

The effect of varying the number of labeled examples \(\textit{S}_{L}\) We study here the effect of increasing the number of labeled examples on the MAP measure, first by running the SAL2R-RankBoost, ASSL2R-RankBoost and Active-RankBoost algorithms, then by running the Active-AdaRank, SAL2R-AdaRank and ASSL2R-AdaRank algorithms. We examined the MAP variation on the MQ2008-semi collection based on the obtained results. Figures 6 and 7 indicate that the curves of these algorithms decrease as labeled examples are added in the learning phase.

The effect of increasing the number of labeled examples \(\textit{S}_{L}\) by varying NbLab In this subsection, we tested the effect of increasing the number of examples to be labeled NbLab and the number of labeled examples \(S_{L}\) on the ASSL2R-AdaRank algorithm. We examined the variation of the MAP on the MQ2008-semi collection by varying the values of NbLab and \(S_{L}\). Figure 8 indicates that the curves decrease as the number of labeled examples increases. This observation holds for the three curves corresponding to the different values fixed for NbLab (\(NbLab =100\), 500 and 1000). Figure 8 shows that ASSL2R-AdaRank performs best when \(NbLab=500\) and \(|S_{L}|=50\) (10% of NbLab), which represents the following percentages: 90.91% for \(S_{U}\) and 9.09% for NbLab relative to the entire learning set. It is clear from Figs. 6, 7 and 8 that the MAP values decrease as the number of labeled pairs increases. This justifies the idea of learning with a small set of labeled examples and a large set of unlabeled ones.

The effect of varying the committee size P We include experiments on the MQ2007 collection that test the effect of varying the number of \(S_{L}\) partitions P on the MAP measure. We examined the results obtained with the SAL2R-RankBoost\(\backslash\) -AdaRank and ASSL2R-RankBoost\(\backslash\) -AdaRank algorithms. Figure 9 shows that, for the chosen algorithms, the best results are obtained when \(P=5\).

The effect of varying the distribution of \(\textit{S}_{U}\) between \({\textit{S}}_{U1}\) and \(\textit{S}_{U2}\) We focus particularly on the distribution of \(S_{U}\) between \(S_{U1}\) and \(S_{U2}\). Figure 10 shows that the curves of these algorithms decrease when the size of \(S_{U1}\) exceeds that of \(S_{U2}\); the best performance is achieved when \(S_{U1}\ll S_{U2}\). This reinforces the idea that learning with a small set of labeled examples and a large set of unlabeled ones is more effective.

In the next section, we present a series of experiments aimed at evaluating the SALL2R and ASSLL2R algorithms.

Fig. 9  Variation of MAP as a function of the number of \(S_{L}\) partitions P on MQ2007

Fig. 10  Variation of MAP as a function of the distribution of \(S_{U}\) between \(S_{U1}\) and \(S_{U2}\) on MQ2007-semi

4.2.2 Evaluation of SALL2R and ASSLL2R

Our goal in this series of experiments is to evaluate the behavior of the SALL2R and ASSLL2R algorithms according to the number of pairs (Nbp) to be labeled at each round. We chose \(Nbp = 3\), 7, 10 and 15. The evaluation results of SALL2R-n and ASSLL2R-n (\(\hbox {n}=3\), 7, 10 and 15) on the testing set are summarized in Tables 9, 10, 11 and 12. From these tables, we note that the results improve in terms of the NDCG@n and MAP measures on the MQ2008 collection. We also found that the performances of SALL2R-n (\(\hbox {n}=3\), 7, 10 and 15) are better than those obtained by Active-AdaRank and by SAL2R-AdaRank.

Table 9 NDCG@n measures on the MQ2007 collection
Table 10 NDCG@n measures on the MQ2008 collection
Table 11 MAP measures on MQ2007 and MQ2008
Table 12 MAP measures on MQ2007-semi and MQ2008-semi

According to Table 11, we note that the results of ASSLL2R-n (\(n=3\), 7, 10 and 15) on the MQ2008 collection are more effective than those of Active-AdaRank and ASSL2R-AdaRank, and we notice an improvement in the results. From these experiments, we also note that the SALL2R-n1 results are better than those of SALL2R-n2 if \(\hbox {n}1 > \hbox {n}2\); we can deduce that the results improve as the number of pairs increases. However, the variation between the values of ASSLL2R-3 and ASSLL2R-7 (respectively SALL2R-3 and SALL2R-7) is about \(0.1 \%\), whereas the variation between ASSLL2R-10 and ASSLL2R-15 (respectively SALL2R-10 and SALL2R-15) is about \(0.01 \%\), which is why we did not test any other value of Nbp. This validates the advantage of learning to rank algorithms based on the listwise approach with more than one query-document pair, which appears as a promising direction to improve the performance of the SAL2R and ASSL2R algorithms.

The experiments we carried out demonstrate improvements on the MQ2007\(\backslash\) MQ2007-semi and MQ2008\(\backslash\) MQ2008-semi collections and highlight the importance of combining the active and semi-supervised types of learning.

In general, both the SAL2R and ASSL2R algorithms have shown good performance relative to several other supervised, semi-supervised and active learning to rank algorithms. These results prove the utility of introducing unlabeled data into the learning to rank process by combining active learning and semi-supervised learning methods. Likewise, both the SALL2R and ASSLL2R algorithms have shown improved performance as the number of selected pairs grows. This justifies the advantage of introducing the listwise approach with more than one query-document pair to be labeled, which appears as a promising direction to improve the performance of SALL2R and ASSLL2R. Moreover, introducing an automatic labeling method into active learning to rank can potentially improve the evaluation results and might be necessary in some cases, particularly when the training set contains a small set of labeled examples.

To verify the improvement hypothesis for the algorithms that we proposed, we selected the paired parametric t-test (Student's t-test) (Demšar, 2006) as well as its non-parametric alternative, the Wilcoxon signed-ranks test (Wilcoxon, 1992). The first test checks whether the average difference in the performances of two compared algorithms over the data sets is significantly different from zero. The second test ranks the differences in performance ignoring the signs and compares the ranks of the positive and negative differences. The purpose is to try to reject the null hypothesis that both algorithms perform equally well.
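For reference, both tests can be run with SciPy on the per-collection scores of two algorithms, as in the sketch below; the score arrays are placeholders, not the values reported in the tables.

```python
from scipy import stats

# Placeholder NDCG@10 values of two algorithms on N = 4 collections
# (MQ2007, MQ2008, MQ2007-semi, MQ2008-semi); not the values of this paper.
baseline = [0.40, 0.45, 0.38, 0.42]
proposed = [0.44, 0.49, 0.41, 0.47]

t_stat, p_ttest = stats.ttest_rel(proposed, baseline)     # paired t-test
w_stat, p_wilcoxon = stats.wilcoxon(proposed, baseline)   # Wilcoxon signed-ranks test

print(f"paired t-test p-value: {p_ttest:.4f}")
print(f"Wilcoxon signed-ranks p-value: {p_wilcoxon:.4f}")
```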

Tables 13, 14, 15 and 16 show the evaluation results obtained by the ASSL2R and SAL2R algorithms compared with the RankBoost and AdaRank algorithms. In these tables, we add the p-value, which allows us to statistically compare the proposed SAL2R and ASSL2R algorithms (with RankBoost and AdaRank as auxiliary SRA) with RankBoost and AdaRank respectively.

Table 13 NDCG@10 measures on MQ2007, MQ2008, MQ2007-semi and MQ2008-semi collections
Table 14 NDCG@10 measures on MQ2007, MQ2008, MQ2007-semi and MQ2008-semi collections
Table 15 NDCG@10 measures on MQ2007, MQ2008, MQ2007-semi and MQ2008-semi collections
Table 16 NDCG@10 measures on MQ2007, MQ2008, MQ2007-semi and MQ2008-semi collections

If we consider the SAL2R algorithm, we observe that for the RankBoost and AdaRank variants we obtain the respective p-values 0.000981 and 0.005746 (\(< 0.05\)). The collections considered are MQ2007\(\backslash\) MQ2007-semi and MQ2008\(\backslash\) MQ2008-semi; and the effectiveness metric is NDCG@10. Moreover, for the ASSL2R algorithm we obtain the respective p-values 0.00172 and 0.01648 for the same variants (\(< 0.05\)).

For a significance level of \(\alpha =0.05\) and \(\hbox {N}=4\) datasets, we form the null hypothesis that the compared algorithms perform equally well and determine whether we can reject it. When the p-value is low, we can feel comfortable rejecting the null hypothesis. The rejection is supported by the t-test but not by the Wilcoxon test, which can be explained by the fact that the Wilcoxon test is more conservative with such a small number of datasets (Demšar, 2006). According to the values in Tables 13, 14, 15 and 16, ASSL2R and SAL2R are significantly better than RankBoost and AdaRank with p-value \(<0.05\); we therefore reject the null hypothesis.

5 Conclusion

To take advantage of active and semi-supervised learning methods, we have conducted an enhanced experimental study of two inductive learning to rank algorithms for DR, SAL2R and ASSL2R, which combine the two learning methods and compensate for the small number of labeled data with the information contained in a large set of unlabeled data. We focus in particular on the number of examples to be labeled (NbLab) and the distribution of \(S_{U}\) between \(S_{U1}\) and \(S_{U2}\). These algorithms are based on the principle of active learning to rank of alternatives and use supervised and semi-supervised learning as auxiliaries to learn ranking functions; they select only the most informative unlabeled query-document pair at each round and then determine whether the document is relevant to the query. For this purpose, we proposed different variants of the algorithms according to the SRA applied, in order to compare them. For the proposed algorithms we also relied on the QBC selection strategy and on an automatic labeling algorithm for the labeling process instead of resorting to an expert. Consequently, the number of committee models P can be considered a parameter that could influence learning performance and deserves attention. Finally, we dealt with the SALL2R and ASSLL2R algorithms, which use a list of the Nbp most informative unlabeled query-document pairs instead of a single most informative one, and which showed their effectiveness through a set of experiments.

To evaluate the impact of varying all these parameters, we used collections from the standard benchmark LETOR 4.0 and the P@n, NDCG@n and MAP metrics. The results obtained by the SAL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART and ASSL2R-RankBoost\(\backslash\) -AdaRank\(\backslash\) -LambdaMART algorithms were compared to those of supervised, semi-supervised and active learning to rank algorithms. Our experiments demonstrate significant improvements on the MQ2007, MQ2008, MQ2007-semi and MQ2008-semi collections and highlight the importance of combining the active and semi-supervised types of learning. This justifies the use of unlabeled data for ranking. Moreover, automatic labeling in active learning to rank can potentially improve the evaluation results and might be necessary in some cases, particularly when the training set contains very few labeled examples. These results also demonstrate the influence of the pairwise and listwise approaches on these algorithms. The main conclusion that emerges from our study of the combination of these parameters is that a semi-supervised listwise algorithm used as auxiliary gives the best performance compared to the pairwise algorithms, provided that a greater proportion of \(S_{U}\) (NbLab) is used relative to \(S_{L}\) (10% of \(S_{L}\) against 90% of \(S_{U}\)). Furthermore, by setting the number of partitions P to 5 and the number of pairs Nbp to 10 or 15, we obtain the best performance.

A further perspective consists in implementing another selection strategy for active learning through integrating either the transductive-knn algorithm or the Semi-AdaRank algorithm. Coupling weak learning with deep learning can constitute an interesting research avenue for processing unlabeled data for learning.