
1 Introduction

In recent years, machine learning problems with structured outputs have received increasing interest. These problems appear in a variety of fields, including biology [33], image analysis [23], natural language processing [5], and so on.

In this paper, we look at label ranking (LR), where one has to learn a mapping from instances to rankings (strict total orders) defined over a finite, usually limited, number of labels. Most solutions to this problem reduce its initial complexity, either by fitting a probabilistic model with few parameters (Mallows, Plackett-Luce [7]), or through a decomposition scheme. For example, ranking by pairwise comparison (RPC) [24] transforms the initial problem into binary problems. Constraint classification and log-linear models [13], as well as SVM-based methods [30], learn for each label a (linear) utility function from which the ranking is deduced. These latter approaches are close to other proposals [18] that perform a label-wise decomposition.

In ranking problems, it may also be interesting [9, 18] to predict partial rather than complete rankings, abstaining from making a precise prediction in the presence of too little information. Such predictions can be seen as extensions of the reject option [4] or of partial predictions [11]. They can prevent harmful decisions based on incorrect predictions, and have been applied to different decomposition schemes, be they pairwise [10] or label-wise [18], always producing cautious predictions in the form of partial order relations.

In this paper, we propose a new label ranking method, called LR-CSP, based on a label-wise decomposition where each sub-problem intends to predict a set of ranks. More precisely, we propose to learn for each label an imprecise ordinal regression model of its rank [19], and to use these models to infer a set of possible ranks. To do this, we use imprecise probabilistic (IP) approaches, which are well tailored to make partial predictions [11] and to represent a potential lack of knowledge, describing our uncertainty by means of a convex set of probability distributions \(\mathscr {P}\) [31] rather than by a single precise probability distribution \(\mathbb {P}\). An interesting point of our method, whose principle can be used with any set of probabilities, is that it does not require any modification of the underlying imprecise classifier, as long as the classifier can produce lower and upper bounds \([\underline{P}, \overline{P}]\) over binary classification problems.

We then use CSP techniques on the set of resulting predictions to check whether the prediction outputs are consistent with a global ranking (i.e. that each label can be assigned a different rank).

Section 2 introduces the problem and our notations. Section 3 shows how ranks can be predicted from imprecise probabilistic models and presents the proposed inference method based on robust optimization techniques. Section 4 discusses related work. Finally, Sect. 5 is devoted to the experimental evaluation, showing that our approach reaches higher accuracy by allowing partial outputs, and remains quite competitive with alternative approaches to the same learning problem.

2 Problem Setting

Multi-class problems consist in associating an instance \({\mathbf {x}}\) coming from an input space \({\mathcal {X}}\) to a single label of the output space \({\varLambda }=\{{\lambda _{1}},\ldots ,{\lambda _{k}}\}\) representing the possible classes. In label ranking, an instance \({\mathbf {x}}\) is no longer associated to a unique label of \({\varLambda }\) but to an order relation \(\succ _{\mathbf {x}}\) over \({\varLambda }\times {\varLambda }\), or equivalently to a complete ranking over the labels in \({\varLambda }\). Hence, the output space is the set \(\mathcal {L}({\varLambda })\) of complete rankings of \({\varLambda }\) that contains \(|\mathcal {L}({\varLambda })|=k!\) elements (i.e., the set of all permutations). Table 1 illustrates a label ranking data set example with \(k=3\).

Table 1. An example of label ranking data set \(\mathbb {D}\)

We can identify a ranking \(\succ _{\mathbf {x}}\) with a permutation \(\sigma _{\mathbf {x}}\) on \(\{1,\ldots ,k\}\) such that \(\sigma _{\mathbf {x}}(i) < \sigma _{\mathbf {x}}(j)\) iff \({\lambda _{i}} \succ _{\mathbf {x}}{\lambda _{j}}\), as they are in one-to-one correspondence. \(\sigma _{\mathbf {x}}(i)\) is the rank of label \({\lambda _{i}}\) in the order relation \(\succ _{\mathbf {x}}\). As there is a one-to-one correspondence between permutations and complete rankings, we use the terms interchangeably.

Example 1

Consider the set \({\varLambda }=\{{\lambda _{1}},{\lambda _{2}},{\lambda _{3}}\}\) and the observation \({\lambda _{3}} \succ {\lambda _{1}} \succ {\lambda _{2}}\), then we have \(\sigma _{\mathbf {x}}(1)=2, \; \sigma _{\mathbf {x}}(2)=3, \; \sigma _{\mathbf {x}}(3)=1.\)

The usual objective in label ranking is to use the training instances \(\mathbb {D}={ \{ ({\mathbf {x}}_i,y_i) \;\vert \; i=1,\ldots ,n \} }\) with \({\mathbf {x}}_i \in \mathcal {X}\), \(y_i \in \mathcal {L}({\varLambda })\) to learn a predictor, or a ranker \(h:{\mathcal {X}}\rightarrow \mathcal {L}({\varLambda })\). While in theory this problem can be transformed into a multi-class problem where each ranking is a separate class, this is intractable in practice, as the number of classes increases factorially with k. The most usual ways to solve this issue are either to decompose the problem into many simpler ones, or to fit a parametric probability distribution over the rankings [7]. In this paper, we shall focus on a label-wise decomposition of the problem.

This rapid increase of \(|\mathcal {L}({\varLambda })|\) also means that obtaining reliable, precise predictions of rankings is very difficult in practice as k increases. Hence it may be useful to allow the ranker to return partial but reliable predictions.

3 Label-Wise Decomposition: Learning and Predicting

This section details how we propose to reduce the initial ranking problem to a set of k label-wise problems that we can then solve separately. The idea is the following: since a complete observation corresponds to each label being associated with a unique rank, we can learn a probabilistic model \(p_{i}: K\rightarrow [0,1]\) with \(K=\{1,2,\ldots ,k\}\), where \(p_{ij}:=p_{i}(j)\) is interpreted as the probability \(P(\sigma (i)=j)\) that label \({\lambda _{i}}\) has rank j. Note that \(\sum _j p_{ij}=1\).

A first step is to decompose the original data set \(\mathbb {D}\) into k data sets \(\mathbb {D}_j={ \{ ({\mathbf {x}}_i,\sigma _{{\mathbf {x}}_i}(j)) \;\vert \; i=1,\ldots ,n \} }\), \(j=1,\ldots ,k\). The decomposition is illustrated by Fig. 1. Estimating the probabilities \(p_{ij}\) for a label \({\lambda _{i}}\) then comes down to solving an ordinal regression problem [27]. In such problems, the rank associated with a label is the one minimizing the expected cost \({\mathbb {E}}_{ij}\) of assigning label \({\lambda _{i}}\) to rank j, which depends on \(p_{ij}\) and a distance \(D:K\times K \rightarrow \mathbb {R}\) between ranks as follows:

$$\begin{aligned} {\mathbb {E}}_{ij}=\sum \nolimits _{\ell =1}^k D(j,\ell )\, p_{i\ell }. \end{aligned}$$
(1)

Common choices for the distances are the \(L_1\) and \(L_2\) norms, corresponding to

$$\begin{aligned} D_1(j,\ell )=|j-\ell | \quad \text {and} \quad D_2(j,\ell )=(j-\ell )^2. \end{aligned}$$
(2)

Other choices include, for instance, the pinball loss [29], which penalizes asymmetrically giving a higher or a lower rank than the actual one. An advantage of these losses in the imprecise setting we adopt next is that they produce predictions in the form of intervals of ranks, in the sense that \(\{1, 3\}\) cannot be a prediction but \(\{1, 2, 3\}\) can. In this paper, we will focus on the \(L_1\) loss, as it is the most commonly considered in ordinal classification problems.
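As an illustration of this decision rule, here is a minimal Python sketch (the function name `expected_costs` and the example distribution are ours, not from the paper) in which the precise prediction under \(D_1\) is simply the rank minimizing Eq. (1):

```python
import numpy as np

def expected_costs(p_i, D):
    """Expected costs E_ij of Eq. (1) for every candidate rank j.

    p_i : array of shape (k,), where p_i[l-1] = P(sigma(i) = l)
    D   : distance between ranks, e.g. D_1 or D_2 of Eq. (2)
    """
    k = len(p_i)
    return np.array([sum(D(j, l) * p_i[l - 1] for l in range(1, k + 1))
                     for j in range(1, k + 1)])

D1 = lambda j, l: abs(j - l)          # the L1 distance of Eq. (2)

p_i = np.array([0.1, 0.5, 0.3, 0.1])  # hypothetical rank distribution, k = 4
costs = expected_costs(p_i, D1)
print(int(costs.argmin()) + 1)        # 2: under L1, the optimal rank is the median
```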

Fig. 1. Label-wise decomposition of rankings
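The decomposition itself is immediate; a minimal sketch (the function name `decompose` is ours), assuming each training output is stored as the full permutation \(\sigma_{\mathbf{x}}\):

```python
def decompose(D, k):
    """Label-wise decomposition of Fig. 1.

    D : list of (x, sigma) pairs, sigma being the observed permutation
        stored as a tuple (sigma(1), ..., sigma(k)).
    Returns the k ordinal-regression data sets D_1, ..., D_k.
    """
    return [[(x, sigma[j]) for (x, sigma) in D] for j in range(k)]

# Example: the ranking of Example 1, lambda_3 > lambda_1 > lambda_2
D = [("x1", (2, 3, 1))]
print(decompose(D, 3))  # [[('x1', 2)], [('x1', 3)], [('x1', 1)]]
```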

3.1 Probability Set Model

Precise estimates for \(p_{i}\) obtained from the finite data set \(\mathbb {D}_i\) may be unreliable, especially if these estimates rely on little, noisy or incomplete data. Rather than relying on precise estimates in all cases, we propose to consider an imprecise probabilistic model, that is, to consider for each label \({\lambda _{i}}\) a polytope (a convex set) \(\mathscr {P}_i\) of possible probabilities. In our setting, a particularly interesting model is the imprecise cumulative distribution [15], as it naturally encodes the ordinal nature of rankings and is a common choice in the precise setting [22]. It consists in providing bounds \(\left[ \underline{P}(A_\ell ),\overline{P}(A_\ell )\right] \) on events \(A_\ell =\{1,\ldots ,\ell \}\) and considering the resulting set

$$\begin{aligned} \mathscr {P}_i=\left\{ p_i : \underline{P}_i(A_\ell )\le \sum \nolimits _{j=1}^\ell p_{ij}\le \overline{P}_i(A_\ell ), \sum \nolimits _{j\in K}p_{ij}=1\right\} . \end{aligned}$$
(3)

We will denote by \({\underline{F}}_{ij}=\underline{P}_i(A_j)\) and \({\overline{F}}_{ij}=\overline{P}_i(A_j)\) the given bounds. Table 2 provides an example of a cumulative distribution that could be obtained in a ranking problem where \(k=5\) and for a label \({\lambda _{i}}\). For other kinds of sets \(\mathscr {P}_i\) we could consider, see [17].

Table 2. Imprecise cumulative distribution for \({\lambda _{i}}\)

This approach requires learning k different models, one for each label. This is to be compared with the RPC [24] approach, in which \(\nicefrac {k(k-1)}{2}\) models (one for each pair of labels) have to be learned. There is therefore a clear computational advantage for the current approach as k increases. It should also be noted that the two approaches rely on different models: while the label-wise decomposition uses learning methods issued from ordinal regression, the RPC approach usually uses learning methods issued from binary classification.

3.2 Rank-Wise Inferences

The classical means to compare two ranks as possible predictions, given the probability \(p_{i}\), is to say that rank \(\ell \) is preferable to rank m (denoted \(\ell \succ m\)) iff

$$\begin{aligned} \sum \nolimits _{j=1}^k D_1(j,m)p_{ij} \ge \sum \nolimits _{j=1}^k D_1(j,\ell )p_{ij} \end{aligned}$$
(4)

That is, rank \(\ell \) is preferable to m if the expected cost (loss) of predicting m is higher than the expected cost of predicting \(\ell \). The final prediction is then the rank to which no other rank is preferred (with typically a random choice when there is some indifference between the top ranks).

When precise probabilities \(p_i\) are replaced by probability sets \(\mathscr {P}_i\), a classical extension of this rule is to consider that rank \(\ell \) is preferable to rank m iff it is so for every probability in \(\mathscr {P}_i\), that is, if

$$\begin{aligned} \inf \nolimits _{p_i \in \mathscr {P}_i} \sum \nolimits _{j=1}^k (D_1(j,m) - D_1(j,\ell )) p_{ij}\end{aligned}$$
(5)

is positive. Note that under this definition we may simultaneously have \(m \not \succ \ell \) and \(\ell \not \succ m\); there may therefore be multiple undominated, incomparable ranks, in which case the final prediction is set-valued.

In general, obtaining the set of predicted values requires solving Eq. (5) at most a quadratic number of times (once for each pairwise comparison). However, it has been shown [16, Prop. 1] that when considering \(D_1\) as a cost function, the set of predicted values corresponds to the set of possible medians within \(\mathscr {P}_i\), which is straightforward to compute if one uses the generalized p-box [15] as an uncertainty model. Namely, if \({\underline{F}}_i,{\overline{F}}_i\) are the cumulative distributions for label \({\lambda _{i}}\), then the predicted ranks under the \(D_1\) cost are

$$\begin{aligned} \hat{R}_i=\left\{ j \in K : {\underline{F}}_{i(j-1)} \le 0.5 \le {\overline{F}}_{ij},~~{\underline{F}}_{i(0)} = 0\right\} , \end{aligned}$$
(6)

a set that is always non-empty and straightforward to obtain. Looking back at Table 2, our prediction would have been \(\hat{R}=\{2,3,4\}\), as these are the three possible median values.
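Equation (6) is direct to implement; a minimal sketch with our own naming and hypothetical bounds (the numerical values of Table 2 are not reproduced here):

```python
import numpy as np

def predicted_ranks(F_lower, F_upper):
    """Set of possible medians under the p-box [F_lower, F_upper] (Eq. 6).

    F_lower, F_upper : arrays of shape (k,) holding the bounds at j = 1..k,
    with F_lower[-1] == F_upper[-1] == 1.
    """
    k = len(F_lower)
    # j is a possible rank iff F_lower(j-1) <= 0.5 <= F_upper(j), F_lower(0) = 0
    return {j for j in range(1, k + 1)
            if (F_lower[j - 2] if j > 1 else 0.0) <= 0.5 <= F_upper[j - 1]}

F_lo = np.array([0.10, 0.30, 0.60, 0.85, 1.0])  # hypothetical bounds, k = 5
F_up = np.array([0.25, 0.55, 0.80, 0.95, 1.0])
print(sorted(predicted_ranks(F_lo, F_up)))      # [2, 3]
```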

As for the RPC approach (and its cautious versions [9]), the label-wise decomposition requires aggregating all decomposed models into a single (partial) prediction. Indeed, focusing only on the decomposed models \(\mathscr {P}_i\), nothing forbids predicting the same rank for multiple labels. In the next section, we discuss cautious predictions in the form of sets of ranks, as well as how to resolve inconsistencies.

3.3 Global Inferences

Once we have retrieved the different set-valued predictions of ranks for each label, two important questions remain:

  1. Are those predictions consistent with the constraint that each label should receive a distinct rank?

  2. If so, can we reduce the obtained predictions by integrating the aforementioned constraint?

Example 2

To illustrate the issue, let us consider the case where we have four labels \({\lambda _{1}},{\lambda _{2}},{\lambda _{3}},{\lambda _{4}}\). Then the following predictions

$$\hat{R}_1=\{1,2\},~\hat{R}_2=\{1,2\},~\hat{R}_3=\{1,2\},~\hat{R}_4=\{3,4\}$$

are inconsistent, simply because labels \({\lambda _{1}},{\lambda _{2}},{\lambda _{3}}\) cannot simultaneously be given distinct ranks among \(\{1,2\}\) (note that pairwise, they are not conflicting). On the contrary, the following predictions

$$\hat{R}_1=\{1,2\},~\hat{R}_2=\{1,2,3\},~\hat{R}_3=\{2\},~\hat{R}_4=\{1,2,3,4\}$$

are consistent, and could also be reduced to the unique ranking

$$\hat{R}'_1=\{1\},~\hat{R}'_2=\{3\},~\hat{R}'_3=\{2\},~\hat{R}'_4=\{4\},$$

as the strong constraint \(\hat{R}_3=\{2\}\) propagates to all other predictions by removing rank 2 from them, which results in a new strong constraint \(\hat{R}'_1=\{1\}\) that also propagates to all other predictions. This elimination is repeated as new strong constraints emerge, until we obtain the unique ranking above.

Such a problem is well known in Constraint Programming [12], where it corresponds to the alldifferent constraint. In the case where all rank predictions are intervals, that is, where a prediction \(\hat{R}_i\) contains all values between \(\min \hat{R}_i\) and \(\max \hat{R}_i\), efficient algorithms exist that exploit the fact that one can concentrate on the bounds alone, and we use them to speed up computations [28].
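As a naive sketch of such propagation (names are ours), the following reproduces the reduction of Example 2 by repeatedly eliminating fixed ranks; unlike the bounds-consistency algorithms of [28], it does not detect the inconsistency of the first example, where the interval \(\{1,2\}\) is shared by three labels:

```python
def propagate_singletons(rank_sets):
    """Naive alldifferent propagation over rank-set predictions.

    rank_sets : dict label -> set of candidate ranks.
    Returns the reduced sets, or None if a set empties (inconsistency).
    """
    sets = {lab: set(r) for lab, r in rank_sets.items()}
    changed = True
    while changed:
        changed = False
        for lab, r in sets.items():
            if len(r) == 1:
                fixed = next(iter(r))  # this label's rank is decided
                for other, r2 in sets.items():
                    if other != lab and fixed in r2:
                        r2.discard(fixed)
                        if not r2:
                            return None  # inconsistent prediction
                        changed = True
    return sets

print(propagate_singletons({1: {1, 2}, 2: {1, 2, 3}, 3: {2}, 4: {1, 2, 3, 4}}))
# {1: {1}, 2: {3}, 3: {2}, 4: {4}}
```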

4 Discussion of Related Approaches

As said in the introduction, one of our main goals in this paper is to introduce a label ranking method that allows the ranker to partially abstain when it has insufficient information, therefore producing a corresponding set of possible rankings. We discuss here the usefulness of such rank-wise partial predictions (mainly w.r.t. approaches producing partial orders), as well as some related work.

4.1 Partial Orders vs Imprecise Ranks

Most existing methods [9, 10] that propose to make set-valued or cautious predictions in ranking problems consider partial orders as their final predictions, that is, pairwise relations \(\succ _{\mathbf {x}}\) that are transitive and asymmetric, but no longer necessarily complete. To do so, they often rely on decomposition approaches estimating preferences between each pair of labels [24].

However, while a complete order can be equivalently described by the relation \(\succ _{\mathbf {x}}\) or by the rank associated to each label, this is no longer true when one considers partial predictions. Indeed, consider for instance the case where the set of rankings over three labels \(\{{\lambda _{1}},{\lambda _{2}},{\lambda _{3}}\}\) we would like to predict is \(S=\{{\lambda _{1}}\succ {\lambda _{2}}\succ {\lambda _{3}},{\lambda _{1}}\prec {\lambda _{2}}\prec {\lambda _{3}}\}\), which could correspond to an instance where \({\lambda _{2}}\) is a good compromise, and where the population is quite divided about \({\lambda _{1}}\) and \({\lambda _{3}}\) that represent more extreme options.

While the set S can be efficiently and exactly represented by providing sets of ranks for each label, none of the information it contains can be retained in a partial order. Indeed, the prediction \(\hat{R}_1=\{1,3\}, \hat{R}_2=\{2\}, \hat{R}_3=\{1,3\}\) perfectly represents S, while representing it by a partial order would result in the empty relation (since for all pairs i, j, we have \({\lambda _{i}} \succ {\lambda _{j}}\) in one ranking of S and \({\lambda _{j}} \succ {\lambda _{i}}\) in the other).

Conversely, one could find an example that disadvantages a rank-wise cautious prediction compared to one using partial orders, as neither representation is more general than the other. Yet, our small example shows that considering both approaches makes sense, as neither can encapsulate the other.
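The example above can be checked mechanically; a small sketch (the encoding is ours) contrasting the two representations of S:

```python
# The two rankings of S, as permutations (sigma(1), sigma(2), sigma(3)):
# lambda_1 > lambda_2 > lambda_3  ->  (1, 2, 3)
# lambda_3 > lambda_2 > lambda_1  ->  (3, 2, 1)
S = [(1, 2, 3), (3, 2, 1)]

# Rank-wise representation: the set of possible ranks of each label
rank_sets = [{sigma[i] for sigma in S} for i in range(3)]
print(rank_sets)  # [{1, 3}, {2}, {1, 3}] -- with alldifferent, exactly S

# Partial-order representation: keep lambda_i > lambda_j only if true in all of S
partial = {(i + 1, j + 1) for i in range(3) for j in range(3)
           if i != j and all(sigma[i] < sigma[j] for sigma in S)}
print(partial)  # set() -- the empty relation, nothing of S is retained
```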

4.2 Score-Based Approaches

A recent literature survey [30] shows that there are many score-based approaches, already studied and compared in [24], such as constraint classification, log-linear models, etc. Such approaches learn from the samples a function \(h_j\) for each label \({\lambda _{j}}\) that predicts a strength \(h_j(\varvec{x}^{*})\) for a new instance. Labels are then ranked according to their predicted strengths.

We will consider a typical example of such approaches, based on SVM, that we will call SVM label ranking (SVM-LR). Vembu and Gärtner [30] show that the SVM method [20] solving multi-label problems can be straightforwardly generalized to label ranking. In contrast to our approach, where each model is learned separately, SVM-LR fits all the functions at once, even if at prediction time they are evaluated independently. While this may account for label dependencies, it comes at a computational cost, since we have to solve a quadratic optimization problem (the dual problem introduced in [20]) whose scale increases rapidly as the number of training samples and labels grows.

More precisely, the score functions \(h_j(\varvec{x}^{*})=\left\langle \varvec{w}_j ~|~ \varvec{x}^{*} \right\rangle \) are scalar products between a weight vector \(\varvec{w}_j\) and \(\varvec{x}^{*}\). If \(\alpha _{ijq}\) are coefficients that represent the existence of either the preference \(\lambda _q\succ _{\varvec{x}_i}\lambda _j\) or \(\lambda _j\succ _{\varvec{x}_i}\lambda _q\) of the instance \(\varvec{x}_i\), \(\varvec{w}_j\) can be obtained from the dual problem in [20, Sect. 5] as follows:

$$\begin{aligned} \varvec{w}_j = \frac{1}{2}\sum _{i=1}^n \left[ \sum _{(j,q) \in E_i} \alpha _{ijq} - \sum _{(p,j) \in E_i} \alpha _{ipj} \right] \varvec{x}_i \end{aligned}$$
(7)

where the \(\alpha _{ipq}\) are the coefficients optimized in the dual problem, and \(E_i\) contains all preferences of the training instance \(\varvec{x}_i\), i.e. \((p,q)\in E_i \iff \lambda _p\succ _{\varvec{x}_i}\lambda _q\).
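For concreteness, Eq. (7) could be sketched as follows, given already-optimized dual coefficients (a sketch only: the array layout, 0-indexed labels and the container for the \(\alpha\) coefficients are our assumptions, and the dual optimization of [20] that produces them is not shown):

```python
import numpy as np

def svm_lr_weights(X, E, alpha, k):
    """Weight vectors of Eq. (7) from given dual coefficients (a sketch).

    X     : array (n, d) of training instances
    E     : list of preference sets, E[i] = {(p, q): lambda_p > lambda_q for x_i},
            with labels 0-indexed here, unlike the 1-indexed notation above
    alpha : dict mapping (i, p, q) to the dual coefficient alpha_ipq
    """
    n, d = X.shape
    W = np.zeros((k, d))
    for j in range(k):
        for i in range(n):
            coeff = (sum(alpha[i, j, q] for (p, q) in E[i] if p == j)
                     - sum(alpha[i, p, j] for (p, q) in E[i] if q == j))
            W[j] += 0.5 * coeff * X[i]
    return W
```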

It may seem at first that such approaches, once made imprecise, could be closer to ours. Indeed, the models \(h_i\) obtained after training also provide label-wise information. However, if we were to turn these methods imprecise and obtain imprecise scores \([\underline{h}_i,\overline{h}_i]\), the most natural way to build a partial prediction would be to consider that \({\lambda _{i}} \succ {\lambda _{j}}\) when \(\underline{h}_i > \overline{h}_j\), that is, when the score of \({\lambda _{i}}\) is certainly higher than that of \({\lambda _{j}}\). Such a partial prediction would be an interval order, and would again not encompass the same family of subsets of rankings, as it constitutes a restricted setting compared to one allowing the prediction of any partial order.

5 Experiments

This section describes the experiments we made to test whether our approach is (1) competitive with existing ones and whether (2) the partial predictions indeed provide more reliable inferences by abstaining on badly predicted ranks.

5.1 Data Sets

The data sets used in the experiments come from the UCI machine learning repository [21] and the Statlog collection [25]. They are synthetic label ranking data sets built either from classification or from regression problems. From each original data set, a transformed data set \(({\mathbf {x}}_i,y_i)\) with complete rankings was obtained by following the procedure described in [8]. A summary of the data sets used in the experiments is given in Table 3. We perform a 10 \(\times \) 10-fold cross-validation procedure on all of them.

Table 3. Experimental data sets

5.2 Completeness/Correctness Trade-Off

To answer the question of whether our method correctly identifies the labels on which it is desirable to abstain or to deliver a set of possible ranks, it is necessary to measure two aspects: how accurate and how precise the predictions are. Indeed, a good balance should be sought between the informativeness and the reliability of the predictions. For this reason, and similarly to what was proposed in the pairwise setting [9], we use a completeness and a correctness measure to assess the quality of the predictions. Given the prediction \(\hat{R}=\{\hat{R}_i, i=1,\dots ,k\}\), we propose as completeness (CP) and correctness (CR) measures

$$\begin{aligned} CP(\hat{R}) = \frac{k^2 - \sum _{i=1}^k |\hat{R}_i|}{k^2 - k} \quad \text {and}\quad CR(\hat{R}) = 1-\frac{\sum _{i=1}^k \min _{\hat{r}_i \in \hat{R}_i} |\hat{r}_i - r_i|}{0.5k^2} \end{aligned}$$
(8)

where CP is null if all \(\hat{R}_i\) contain the k possible ranks and equals one if all \(\hat{R}_i\) are reduced to singletons, whilst CR reduces to the Spearman footrule when the prediction is precise. Note that classical evaluation measures [36] used in an IP setting cannot be straightforwardly applied here, as they only extend the 0/1 loss and are not consistent with the Spearman footrule, and adapting cost-sensitive extensions [34] to the ranking setting would require some development.
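Both measures are straightforward to compute from a rank-set prediction; a minimal sketch (names and the example prediction are ours):

```python
def completeness(R_hat, k):
    """CP of Eq. (8): 0 if every rank set is vacuous, 1 if all are singletons."""
    return (k**2 - sum(len(R) for R in R_hat)) / (k**2 - k)

def correctness(R_hat, r_true, k):
    """CR of Eq. (8): 1 minus a normalized, set-valued Spearman footrule."""
    return 1 - sum(min(abs(r_hat - r) for r_hat in R)
                   for R, r in zip(R_hat, r_true)) / (0.5 * k**2)

# Hypothetical prediction for k = 4 with true ranks (1, 2, 3, 4)
R_hat = [{1, 2}, {2}, {3, 4}, {4}]
print(completeness(R_hat, 4))               # 0.833...
print(correctness(R_hat, (1, 2, 3, 4), 4))  # 1.0: the true rank is in every set
```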

5.3 Our Approach

As mentioned in Sect. 3, our proposal is to fit an imprecise ordinal regression model for every label-wise decomposition \(\mathbb {D}_i\), in which the lower and upper bounds of the cumulative distribution \([{\underline{F}}_i,{\overline{F}}_i]\) must be estimated in order to predict the set of ranks (Eq. 6) of an unlabeled instance \(\varvec{x}^{*}\). In that regard, we propose to use an extension of Frank and Hall's method [22] to imprecise probabilities, already studied in detail in [19].

Frank and Hall's method takes advantage of the k ordered label values by transforming the original k-label ordinal problem into \(k-1\) binary classification sub-problems. Each sub-problem estimates the probability \(P_i(A_\ell ):=F_i(\ell )\), where \(A_\ell =\{1, \dots , \ell \}\subseteq K\), and the mapping \(F_i:K\rightarrow [0,1]\) can be seen as a discrete cumulative distribution. We simply propose to make these estimates imprecise and to use bounds

$$ \underline{P}_i(A_j) := {\underline{F}}_{i}(j) \quad \text {and}\quad \overline{P}_i(A_j):={\overline{F}}_{i}(j) $$

which is indeed a generalized p-box model [15], as defined in Eq. (3).
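A minimal sketch of this assembly follows (the `interval_proba` interface is our assumption, not the actual API of the classifier introduced next); it includes a repair step enforcing monotone cumulative bounds:

```python
import numpy as np

def pbox_from_binary_models(models, x, k):
    """Assemble the generalized p-box [F_lower, F_upper] for one label.

    models : k-1 imprecise binary classifiers; models[l] is assumed to expose
             interval_proba(x) -> (lower, upper) bounds on P(rank <= l+1),
             e.g. trained on the Frank-and-Hall sub-problem for A_{l+1}.
    """
    F_lower, F_upper = np.ones(k), np.ones(k)  # F(k) = 1 by construction
    for l, m in enumerate(models[:k - 1]):
        F_lower[l], F_upper[l] = m.interval_proba(x)
    # enforce non-decreasing cumulative bounds
    F_lower = np.maximum.accumulate(F_lower)
    F_upper = np.minimum.accumulate(F_upper[::-1])[::-1]
    return F_lower, F_upper
```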

To estimate these bounds, we use the naive credal classifier (NCC) [35], which extends the classical naive Bayes classifier (NBC), as a base classifier. This classifier's imprecision level is controlled through a hyper-parameter \(s\in \mathbb {R}\): the higher s, the wider the intervals \([\underline{P}_i(A_j),\overline{P}_i(A_j)]\). For \(s=0\), we retrieve the classical NBC with precise predictions, and for \(s\gg 0\), the NCC model makes vacuous predictions (i.e. all ranks for every label).

However, the imprecision induced by a particular value of s differs from one data set to another (as the values in Fig. 2 show), and it is essential to have an adaptive way to quickly obtain two values:

  • the value \(s_{\min }\) for which the average completeness is close to 1, making the corresponding classifier close to a precise one. This is the value we will use to compare our approach to standard, precise ones;

  • the value \(s_{\max }\) for which the average correctness is close to 1, and for which the predictions made are almost always right. The corresponding completeness gives an idea of how much we should abstain to get strong guarantees on the prediction, hence of how “hard” a given data set is.

To find those values, we proceed as follows: we start from an initial interval of values \([\underline{s},\overline{s}]\) and from target intervals \([\underline{CP},\overline{CP}]\) and \([\underline{CR},\overline{CR}]\) of average completeness and correctness, typically [0.95, 1]. Note that in case of inconsistent predictions, \(\hat{R}_i=\emptyset \) and the completeness is higher than 1 (in such cases, we consider \(CR=0\)). For \(s_{\min }\), we will typically start from \(\underline{s}=0\) (for which \(CP>1\)) and consider a value \(\overline{s}\) large enough for which \(CP<0.95\) (e.g., starting from \(s=2\) as advised in [32] and doubling s iteratively until \(CP<0.95\), since as s increases, completeness decreases and correctness increases on average). We then proceed by dichotomy to find a value \(s_{\min }\) for which the average completeness falls within \([\underline{CP},\overline{CP}]\). We proceed similarly for \(s_{\max }\).
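A sketch of this search for \(s_{\min }\) (function names, initial bracket and tolerance are ours; the search for \(s_{\max }\) is symmetric, using the average correctness, which increases with s):

```python
def find_s_min(avg_completeness, lo=0.0, hi=2.0, target=(0.95, 1.0), tol=1e-3):
    """Dichotomic search for s_min.

    avg_completeness : callable s -> average CP over cross-validation folds;
    completeness decreases (on average) as s grows.
    """
    while avg_completeness(hi) >= target[0]:  # double s until CP < 0.95, as in [32]
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        cp = avg_completeness(mid)
        if target[0] <= cp <= target[1]:
            return mid
        if cp > target[1]:  # CP > 1 (inconsistencies): still too precise, raise s
            lo = mid
        else:               # CP below the target interval: too imprecise, lower s
            hi = mid
    return (lo + hi) / 2
```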

Once \(s_{\min }\) and \(s_{\max }\) are found, a last issue is how to pick intermediate values \(s\in [s_{\min }, s_{\max }]\) in order to get an adaptive evolution of completeness/correctness, as in Fig. 2. This is done through a simple procedure: first, we compute the completeness/correctness for the middle value \((s_{\min } + s_{\max })/2\). We then compute the distance between all pairs of completeness/correctness values obtained for consecutive s values, and add a new s point in the middle of the two points with the biggest Euclidean distance. We repeat the process until we get the requested number of s values, for which we provide completeness/correctness values.
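A sketch of this gap-filling procedure (names are ours):

```python
import math

def fill_s_values(evaluate, s_min, s_max, n_points):
    """Adaptive choice of intermediate s values.

    evaluate : callable s -> (completeness, correctness), e.g. averaged over
    cross-validation folds. New points bisect the widest gap in the CP/CR plane.
    """
    points = {s: evaluate(s) for s in (s_min, (s_min + s_max) / 2, s_max)}
    while len(points) < n_points:
        ss = sorted(points)
        _, a, b = max((math.dist(points[a], points[b]), a, b)
                      for a, b in zip(ss, ss[1:]))
        points[(a + b) / 2] = evaluate((a + b) / 2)
    return points
```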

Fig. 2. Evolution of the hyper-parameter s on the glass, stock and calhousing data sets

Figure 2 shows that the boundary values of the imprecision hyper-parameter s depend significantly on the data set. Our approach enables us to find the proper “optimal” value \(s_{\min }\) for each data set, which can be small (as in glass, where \(s_{\min } = 1\)) or large (as in calhousing, where \(s_{\min } = 160\)).

Figure 2 is already sufficient to show that our abstention method works as expected: correctness increases quickly when we allow abstention, that is, when completeness decreases. Figure 2(a) shows that for some data sets, one can have almost perfect correctness while not being totally vacuous (correctness of almost 1 is reached for a completeness slightly below 0.5, for a value \(s=4\)), while this may not be the case for other, more difficult data sets such as calhousing, for which one has to choose a trade-off between completeness and correctness to avoid fully vacuous predictions. Yet, for all data sets (only three being shown for lack of space), we witness a regular increase of correctness.

5.4 Comparison with Other Methods

A remaining question is whether our approach is competitive with other state-of-the-art approaches. To answer it, we compare the results obtained on the test data sets (in a 10 \(\times \) 10-fold cross-validation) for \(s=s_{\min }\) with those of several methods. These results are indeed the closest we can get to precise predictions in our setting. The methods to which we compare ourselves are the following:

  • The ranking by pairwise comparisons (RPC), as implemented in [3];

  • The label ranking tree (LRT [8]), which adopts a local, non-decomposed scheme;

  • The SVM-LR approach that we already described in Sect. 4.2.

As the NCC deals with discrete attributes, we need to discretize continuous attributes into z intervals before training. While z could be optimized, we use in this paper only two arbitrarily chosen levels of discretization, \(z=5\) and \(z=6\) (i.e. the LR-CSP-5 and LR-CSP-6 models), to compare our method against the others, for simplicity and because our goal is only to show the competitiveness of our approach.

As mentioned, we perform the comparison by picking the value \(s_{\min }\). Fixing this hyper-parameter regulating the imprecision level of our approach, we then compare the correctness measure (8) with the Spearman footrule loss obtained for the RPC and LRT methods, as implemented in existing software [3]. For SVM-LR, of which we did not find an online implementation, we used a Python package that solves the quadratic problem with known solvers [1] for small data sets, or with a Frank-Wolfe algorithm for bigger data sets; the Frank-Wolfe algorithm guarantees convergence to the global minimum for convex objectives and to a local minimum for non-convex ones [26].

A last issue to solve is how to handle inconsistent predictions, i.e. those for which the alldifferent constraint admits no solution, whether precise or partial. Here, such predictions are ignored, and our results consider correctness and the Spearman footrule on consistent solutions only; dealing with inconsistent predictions will be the object of future work.

5.5 Results

The average performances, with their ranks in parentheses, obtained in terms of the correctness (CR) measure are shown in Table 4(a) and 4(b), with discretization levels \(z=5\) and \(z=6\) respectively applied to our proposed method LR-CSP.

Table 4. Average correctness accuracies (%) compared to LR-CSP-5 (left) and LR-CSP-6 (right)

A Friedman test [14] on the ranks yields p-values of 0.00006176 and 0.0001097 for LR-CSP-5 and LR-CSP-6, respectively, thus strongly suggesting performance differences between the algorithms. The Nemenyi post-hoc test (see Table 5) further indicates that LR-CSP-5 (and LR-CSP-6) is significantly better than SVM-LR. Our approach also remains competitive with LRT and RPC.

Finally, recall that our method is also quite fast to compute, thanks to the simultaneous use of decomposition (requiring to build only k classifiers) and of probability sets and loss functions offering computational advantages that make the prediction step very efficient. Also, since our predictions are intervals, i.e. sets of ranks without holes in them, we can use very efficient algorithms to handle the alldifferent constraint [28].

Note also that our proposal discretized into \(z=6\) intervals yields more accurate predictions (and a slight drop in the p-values of all comparisons in Table 5), which suggests that an optimized value \(\hat{z}\) may improve prediction performance (this remains, of course, hypothetical).

Table 5. Nemenyi post-hoc test: null hypothesis \(H_0\) and p-value

6 Conclusion and Perspectives

In this paper, we have proposed a method to make partial predictions in label ranking, using a label-wise decomposition as well as a new kind of partial prediction in terms of sets of possible ranks. Experiments on synthetic data sets show that our proposed model (LR-CSP) produces reliable and cautious predictions, and performs close to, or even outperforms, existing alternative models.

This is quite encouraging, as we left a lot of room for optimization, e.g., in the base classifiers or in the discretization. However, while our method extends straightforwardly to partially observed rankings in the training data when those are top-k rankings (considering, for instance, the rank of all remaining labels as \(k+1\)), it may be trickier to apply to pairwise rankings, another popular way to obtain such data. Some of our future work will focus on that.