Ranking data with ordinal labels: optimality and pairwise aggregation
Abstract
This paper develops key insights into the nature of K-partite ranking. From the theoretical side, the various characterizations of optimal elements are fully described, as well as the likelihood ratio monotonicity condition on the underlying distribution which guarantees that such elements exist. Then, a pairwise aggregation procedure based on Kendall tau is introduced to relate learning rules dedicated to bipartite ranking to solutions of the K-partite ranking problem. Criteria reflecting ranking performance under these conditions, such as the ROC surface and its natural summary, the volume under the ROC surface (VUS), are then considered as targets for empirical optimization. The consistency of pairwise aggregation strategies is studied under these criteria and shown to hold under reasonable assumptions. Finally, numerical results illustrate the relevance of the proposed methodology.
Keywords
K-partite ranking · Ordinal data · ROC surface · Volume under the ROC surface · Empirical risk minimization · Median ranking
1 Introduction
In many situations, a natural ordering can be considered over a set of observations. When observations are documents in information retrieval applications, the ordering reflects the degree of relevance for a specific query. In order to predict the ordering of future data, the learning process uses past data for which some relevance feedback is provided, such as ratings, say from 0 to 4, from the poorly relevant to the extremely relevant. For an example of such data, we refer to the LETOR benchmark data repository, see http://research.microsoft.com/en-us/um/people/letor/. A similar situation occurs in medical applications, where decision-making support tools provide a scoring of the population of patients based on diagnostic test statistics in order to rank the individuals according to the advancement of a disease, described in terms of discrete grades; see Pepe (2003), Dreiseitl et al. (2000), Edwards et al. (2005), Mossman (1999) or Nakas and Yiannoutsos (2004) for instance.
A particular case which has received increasing attention in both the machine learning and statistics literatures is when only binary feedback is available (relevant vs. not relevant, ill vs. healthy); this is known as the bipartite ranking problem (see Clémençon and Vayatis 2009b, 2010; Freund et al. 2003; Agarwal et al. 2005; Clémençon et al. 2008, etc.). In the presence of ordinal feedback (i.e. an ordinal label taking a finite number of values, K≥3 say), the task consists in learning how to order temporarily unlabeled observations so as to reproduce as accurately as possible the ordering induced by the labels not observed yet. This problem is referred to as K-partite ranking and various approaches have been proposed in order to develop efficient algorithms in that case (see Rudin et al. 2005; Pahikkala et al. 2007). A closely related approach, which points at both parametric and nonparametric statistical estimation, is ordinal regression modeling (see Waegeman et al. 2008b; Herbrich et al. 2000). To compare and assess the quality of these methods, a first concern is how to extend typical performance measures, such as the ROC curve and the AUC, to this setup; this issue has been tackled in Scurfield (1996) and Flach (2004). However, many interesting issues remain unexplored, such as the theoretical optimality of learning rules, the statistical consistency of empirical performance maximization procedures, and error bounds for K-partite ranking algorithms.
In the present paper, we tackle some of these open problems. In particular, we explore the connection between bipartite and K-partite ranking. Indeed, a natural approach is to transfer efficient bipartite ranking methods in order to derive optimal and consistent rules for K-partite ranking. This idea has been quite successful in the multiclass classification setup (see Hastie and Tibshirani 1998 or Fürnkranz 2002 for instance). We propose to build on the original proposition in Fürnkranz et al. (2009) to combine bipartite ranking tasks in order to solve the K-partite case. A first intuition suggests that rules which are optimal for all bipartite ranking subproblems simultaneously should be optimal for the global problem. We provide examples showing that this is not always the case, and we state sufficient conditions for optimality, which we call likelihood ratio monotonicity conditions. Based on this finding, we examine strategies which combine rules dedicated to the pairwise subproblems for consecutive labels in order to derive interesting rules for the initial problem. We describe an efficient procedure for the pairwise aggregation of scoring rules which establishes a ranking consensus, called a median scoring rule, through an extension of the Kendall tau metric. It is also shown that such a median scoring rule always exists in the important situation where the scoring functions one seeks to summarize/aggregate are piecewise constant, and that the computation of this median rule is feasible. Next, we consider concepts such as the ROC surface and the Volume Under the ROC Surface (or VUS) which can be used to assess the performance of scoring rules in K-partite ranking. Consistency can then be understood as convergence to optimal elements in terms of ROC surface or VUS. We then study conditions under which the consistency of pairwise aggregation can be achieved.
Indeed, it can be shown that, under the monotone likelihood ratio condition together with a margin condition on the posterior distributions, the median scoring rule built out of pairwise AUC-consistent rules is VUS-consistent. We also consider specific strategies to derive scoring rules for this problem, such as the empirical maximization of the VUS or the plug-in scoring rule. We also provide an analysis of the empirical performance of the Kendall-type pairwise aggregation method using the TreeRank algorithm developed by the authors (Clémençon and Vayatis 2009b). An extensive comparison with state-of-the-art ranking methods is presented on both artificial and real data sets, and we exhibit performance in terms of VUS, as well as the form of the level sets of the estimated scoring rules. The latter visualization offers interesting insights into the geometry of risk segments in the input space.
The rest of the paper is structured as follows. In Sect. 2, the probabilistic setting is introduced and optimal scoring rules for K-partite ranking are successively defined and characterized. A specific likelihood ratio monotonicity condition is stated, which is shown to guarantee the existence of a natural optimal ordering over the input space. A novel Kendall-type aggregation procedure is presented in Sect. 3 and performance metrics, such as the VUS, are the subject matter of Sect. 4. Consistency results and insights on the passage from bipartite subproblems to the full K-partite case are discussed in Sect. 5. Finally, Sect. 6 displays a series of numerical results and illustrations for the aggregation principle considered in this paper. Mathematical proofs are postponed to Appendix A.
2 Optimal elements in ranking data with ordinal labels
2.1 Probabilistic setup and notations
2.2 Optimal scoring rules
The problem considered in this paper is to infer an order relationship over ℝ^{ d } after observing vector data with ordinal labels. For this purpose, we consider real-valued decision rules of the form s : ℝ^{ d }→ℝ called scoring rules. In the case of ordinal labels, the main idea is that good scoring rules s are those which assign a high score s(X) to the observations X with large values of the label Y. We now introduce the concept of optimal scoring rule for ranking data with ordinal labels.
Definition 1
(Optimal scoring rule)
The rationale behind this definition can be understood by considering the case K=2. The class Y=2 should receive higher scores than the class Y=1. In this case, an optimal scoring rule s ^{∗} should score observations x in the same order as the posterior probability η _{2} of the class Y=2 (or equivalently as the ratio η _{2}/(1−η _{2})). Since η _{1}(x)+η _{2}(x)=1, for all x, it is easy to see that this is equivalent to the condition described in the previous definition (see Clémençon and Vayatis 2009b for details). In the general case (K>2), optimality of a scoring rule s ^{∗} means that s ^{∗} is optimal for all bipartite subproblems with classes Y=k and Y=l, with l<k.
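To make this equivalence explicit in the case K=2, write \(p_{k}=\mathbb{P}(Y=k)\) for the class probabilities and \(\varPhi_{2,1}(x)=f_{2}(x)/f_{1}(x)\) for the likelihood ratio; Bayes' formula then gives:
$$\frac{\eta_{2}(x)}{1-\eta_{2}(x)} = \frac{p_{2} f_{2}(x)}{p_{1} f_{1}(x)} = \frac{p_{2}}{p_{1}}\,\varPhi_{2,1}(x), $$
so that ranking by the posterior η _{2} and ranking by the likelihood ratio Φ _{2,1} coincide, the factor p _{2}/p _{1} being a positive constant independent of x.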
An important remark is that, in the probabilistic setup introduced above, an optimal scoring rule may not exist as shown in the next example.
Example 1
2.3 Existence and characterization of optimal scoring rules
The previous example shows that the existence of optimal scoring rules cannot be guaranteed under any joint distribution. Our first important result is the characterization of those distributions for which the family of optimal scoring rules is not an empty set. The next proposition offers a necessary and sufficient condition on the distribution which ensures the existence of optimal scoring rules.
Assumption 1
Proposition 1
 (1)
Assumption 1 holds.
 (2)
There exists an optimal scoring rule s ^{∗}.
 (3)
The regression function \(\eta(x) = \mathbb{E}(Y \mid X=x)\) is an optimal scoring rule.
 (4)For any k∈{1,…,K−1}, for all \(x,x'\in\mathcal{X}_{k}\), we have:$$\varPhi_{k+1,k}(x)< \varPhi_{k+1,k}\bigl(x'\bigr) \Rightarrow s^*(x)< s^*\bigl(x'\bigr). $$
 (5)
For any k,l∈{1,…,K} such that l<k, the ratio Φ _{ k,l }(x) is a nondecreasing function of s ^{∗}(x).
Assumption 1 characterizes the class distributions for the random pair (X,Y) for which the very concept of an optimal scoring rule makes sense. The proposition says that if this condition is not satisfied, then the ordinal nature of the labels, when seen through the observation X, is violated. We point out that a related condition, called ERA ranking representability, has been introduced in Waegeman and De Baets (2011), see Definition 2.1 therein. Precisely, it can easily be checked that the condition in the previous proposition means that the collection of (bipartite) ranking functions {Φ _{ k+1,k }:1≤k<K} is an ERA ranking representable set of ranking functions. Statement (3) suggests that plug-in rules based on the statistical estimation of the regression function η and multiple thresholding of the estimate will offer candidates for the practical resolution of K-partite ranking. Such strategies are indeed reminiscent of ordinal logistic regression methods and will be discussed in Sect. 5.3.2. Statement (4) offers an alternative characterization of optimal scoring rules to that of Definition 1. Statement (5) means that the family of densities of the class-conditional distributions f _{ k } has a monotone likelihood ratio (we refer to standard textbooks of mathematical statistics which use this terminology, e.g. Lehmann and Romano 2005).
Proposition 2
2.4 Examples and counterexamples of monotone likelihood ratio families
It is easy to see that, in the absence of Assumption 1, the notion of K-partite ranking hardly makes sense. However, assessing whether data arise from a mixture of distributions F _{ k } with monotone likelihood ratio is a challenging statistical task. We now provide examples and counterexamples of such cases.
Disjoint supports
Consider the separable case where: ∀k,l, \(\mathcal{X}_{k} \cap\mathcal{X}_{l} = \emptyset\). Then Assumption 1 is clearly fulfilled since, for k≠l, we have either Φ _{ k,l }=0 or ∞. It is worth mentioning that in this case, the nature of the K-partite ranking problem does not differ from the multiclass classification setup where there is no order relation between classes.
Exponential families

κ:{1,…,K}→ℝ is strictly increasing,

T:ℝ^{ d }→ℝ such that \(\psi(k)=\int_{x\in\mathbb{R}^{d}}\exp\{\kappa(k)T(x)\}f(x) dx<+\infty\), for 1≤k≤K.
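For this family, the monotone likelihood ratio property can be checked directly: writing \(f_{k}(x)=\exp\{\kappa(k)T(x)\}f(x)/\psi(k)\) for the class-conditional densities, one gets, for any l<k:
$$\varPhi_{k,l}(x) = \frac{f_{k}(x)}{f_{l}(x)} = \frac{\psi(l)}{\psi(k)}\exp\bigl\{\bigl(\kappa(k)-\kappa(l)\bigr)T(x)\bigr\}, $$
which is a strictly increasing function of T(x), since κ is strictly increasing. Hence Assumption 1 is satisfied, with s ^{∗}(x)=T(x) as an optimal scoring rule.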
1D Gaussian distributions
Uniform noise
3 Pairwise aggregation: from bipartite to K-partite ranking
In the present section, we propose a practical strategy for building scoring rules which approximate optimal scoring rules for K-partite ranking based on data. The principle of this strategy is the aggregation of scoring rules obtained for the pairwise subproblems. We emphasize the fact that the situation is very different from multiclass classification, where aggregation boils down to linear combination, or majority voting, over binary classifiers (for “one against one” and “one versus all”, we refer to Allwein et al. 2001; Hastie and Tibshirani 1998; Venkatesan and Amit 1999; Debnath et al. 2004; Dietterich and Bakiri 1995; Beygelzimer et al. 2005a, 2005b and the references therein for instance). We propose here, in the K-partite ranking setup, a metric-based barycentric approach to build the aggregate scoring rule from the collection of scoring rules estimated for the bipartite subproblems. In order to avoid technical discussions dealing with special cases, we assume in the sequel that all class-conditional distributions have a continuous density f _{ k } and share the same support \(\mathcal{X}\subset\mathbb{R}^{d}\).
3.1 Median scoring rules and optimal aggregation
Every scoring rule induces an order relation over the input space ℝ^{ d } and, for the ranking problem considered here, a measure of similarity between two scoring functions should only take into consideration the similarity in the ranking induced by each one of them. We propose here a measure of agreement between scoring rules which is based on the probabilistic Kendall τ for a pair of random variables.
Definition 2
(Probabilistic Kendall τ)
This definition of agreement between scoring rules s _{1} and s _{2} indeed coincides with the Kendall τ between the real-valued random variables s _{1}(X) and s _{2}(X). Note that the contribution of the last two terms in the definition of τ(s _{1},s _{2}) vanishes when the distributions of the s _{ i }(X)’s are continuous.
From there one can define the notion of median scoring rule which accounts for the consensus of many realvalued scoring rules over a given class of candidates.
Definition 3
(Median scoring rule)
In general, the supremum appearing on the right hand side of Eq. (1) is not attained. However, when the supremum over \(\mathcal{S}_{1}\) can be replaced by a maximum over a finite set \(\mathcal{S}'_{1}\subset \mathcal{S}_{1}\), a median scoring rule always exists (but it is not necessarily unique). In particular, this is the case when considering piecewise constant scoring functions such as those produced by the bipartite ranking algorithms proposed in Clémençon et al. (2011a), Clémençon and Vayatis (2009a, 2010) (we also refer to Clémençon and Vayatis 2009c for a discussion of consensus computation/approximation in this case). The idea underlying the measure of consensus through Kendall metric in order to aggregate scoring functions that are nearly optimal for bipartite ranking subproblems is clarified by the following result.
Definition 4
(Pairwise optimal scoring rule)
We denote by \(\mathcal{S}^{*}_{l,k}\) the set of such optimal rules and, in particular, \(\mathcal{S}^{*}_{k} = \mathcal{S}^{*}_{k, k+1}\).
Proposition 3
 1.
A median scoring rule \(\overline{s}^{*}\) for \((\mathcal{S}, \varSigma ^{*}_{K})\) is an optimal scoring rule for the K-partite ranking problem.
 2.Any optimal scoring rule s ^{∗} for the K-partite ranking problem satisfies:$$\sum_{k=1}^{K-1} \tau\bigl(s^*,s^*_k\bigr) = K-1. $$
The proposition above reveals that “consensus scoring rules”, in the sense of Definition 3, based on K−1 optimal scoring rules are still optimal solutions for the global K-partite ranking problem and that, conversely, optimal elements necessarily achieve the equality in Statement (2) of the previous proposition. This naturally suggests the following two-stage procedure: (1) solve the bipartite ranking subproblem related to the pairwise case (k,k+1) of consecutive class labels, yielding a scoring function s _{ k }, for 1≤k<K, and (2) compute a median according to Definition 3, when feasible, based on the latter over a set \(\mathcal{S}_{1}\) of scoring functions. Beyond the difficulty of solving each ranking subproblem separately (refer for instance to Clémençon and Vayatis 2009b for a discussion of the nature of the bipartite ranking issue), the performance/complexity of the method sketched above is governed by the richness of the class \(\mathcal{S}_{1}\) of scoring function candidates: too complex a class clearly makes median computation infeasible, while too poor a class may not contain sufficiently accurate scoring rules.
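The two-stage procedure can be illustrated in code. The following is a minimal sketch under simplifying assumptions: scoring rules are represented by their score vectors over a common unlabeled sample, the K−1 pairwise solutions are treated as given (produced by any bipartite ranking learner), and ties are handled by a simple half-credit convention standing in for the exact tie terms of Definition 2.

```python
from itertools import combinations

def kendall_tau(s1, s2):
    """Empirical Kendall tau agreement between two score vectors
    evaluated on the same sample of observations."""
    n = len(s1)
    agree = 0.0
    for i, j in combinations(range(n), 2):
        d1, d2 = s1[i] - s1[j], s2[i] - s2[j]
        if d1 * d2 > 0 or (d1 == 0 and d2 == 0):
            agree += 1.0          # pair ranked concordantly (or tied) by both
        elif d1 == 0 or d2 == 0:
            agree += 0.5          # tied in exactly one ranking: half credit
    return agree / (n * (n - 1) / 2)

def median_scoring_rule(candidates, pairwise_solutions):
    """Stage 2: among candidate score vectors, return the one maximizing
    the total Kendall tau agreement with the K-1 pairwise solutions."""
    return max(candidates,
               key=lambda s: sum(kendall_tau(s, sk) for sk in pairwise_solutions))
```

Here the maximum over `candidates` plays the role of the supremum over \(\mathcal{S}_{1}\); exact median computation is NP-hard in general, so metaheuristics are used in practice when the candidate class is large.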
3.2 A practical aggregation procedure

a sample \(\mathcal{D}=\{(X_{i},Y_{i}){:}\; 1\leq i \leq n \} \) with i.i.d. labeled observations,

a sample \(\mathcal{D}'=\{X'_{i}{:}\; 1\leq i \leq n' \}\) of unlabeled observations.
Definition 5
(Empirical Kendall τ)
Practical implementation issues
Motivated by practical problems such as the design of meta-search engines, collaborative filtering or combining results from multiple databases, consensus ranking, of which the second stage of the procedure described above is a special case, has recently enjoyed renewed popularity and received much attention in the machine-learning literature, see Meila et al. (2007), Fagin et al. (2004) or Lebanon and Lafferty (2002) for instance. As shown in Hudry (2008) or Wakabayashi (1998) in particular, median computations are NP-hard problems in general. Except in the case where \(\mathcal {S}_{1}\) is of very low cardinality, the (approximate) computation of a supremum involves in practice the use of metaheuristics such as simulated annealing, tabu search or genetic algorithms. The description of these computational approaches to consensus ranking is beyond the scope of this paper and we refer to Barthélemy et al. (1989), Charon and Hudry (1998), Laguna et al. (1999) or Mandhani and Meila (2009) and the references therein for further details on their implementation. We also underline that the implementation of the Kendall aggregation approach could naturally be based on K(K−1)/2 scoring functions, corresponding to solutions of the bipartite subproblems defined by all possible pairs of labels (the theoretical analysis carried out below can be straightforwardly extended so as to establish the validity of this variant), at the price, however, of an additional computational cost in the median computation stage.
Rank prediction vs. scoring rule learning
When the goal is to rank accurately new unlabeled datasets, rather than to learn a nearly optimal scoring rule explicitly, the following variant of the procedure described above can be considered. Given an unlabeled sample \(\mathcal{D}_{X}=\{ X_{1},\ldots,X_{m} \}\) of i.i.d. copies of the input r.v. X, instead of aggregating scoring functions s _{ k } defined on the feature space \(\mathcal{X}\) and using a consensus rule for ranking the elements of \(\mathcal{D}_{X}\), one may aggregate their restrictions to the finite set \(\mathcal{D}_{X}\subset\mathcal {X}\), or simply the ranks of the unlabeled data as defined by the s _{ k }’s.
4 Performance measures for K-partite ranking
We now turn to the main concepts for assessing performance in the K-partite ranking problem. We focus on the notion of ROC surface and Volume Under the ROC Surface (VUS) in the case where K=3 in order to keep the presentation simple. These concepts are generalizations of the well-known ROC curve and AUC criterion, which are popular performance measures for bipartite ranking.
4.1 ROC surface
Definition 6
(ROC surface)
By “continuous extension”, it is meant that discontinuity points, due to jumps or flat parts in the cdfs F _{ s,k }, are connected by linear segments (parts of hyperplanes). The same convention is adopted in the definition of the ROC curve in the bipartite case given in Clémençon and Vayatis (2009b). In the case K=3, to which we restrict our attention from now on for simplicity (all results stated in the sequel can be straightforwardly extended to the general situation), the ROC surface thus corresponds to a continuous manifold of dimension 2 in the unit cube of ℝ^{3}. We also point out that the ROC surface contains the ROC curves of the pairwise problems (f _{1},f _{2}), (f _{2},f _{3}) and (f _{1},f _{3}), which can be obtained as the intersections of the ROC surface with planes orthogonal to each of the axes of the unit cube.
Proposition 4
(Change of parameterization)
We point out that, in the case where s has no capacity to discriminate between the three distributions, i.e. when F _{ s,1}=F _{ s,2}=F _{ s,3}, the ROC surface boils down to the surface delimited by the triangle connecting the points (1,0,0), (0,1,0) and (0,0,1); we then have ROC(s,α,γ)=1−α−γ. By contrast, in the separable situation (see Sect. 2.4), the optimal ROC surface coincides with the surface of the unit cube [0,1]^{3}.
Lemma 1
 1.
ROC(s,α,γ)>0
 2.
\(\mathrm{ROC}_{f_{1},f_{3}}(s,1-\alpha) >\gamma\).
Other notions of ROC surface have been considered in the literature, depending on the learning problem considered and the goal pursued. In the context of multiclass pattern recognition, they provide a visual display of classification accuracy, as in Ferri et al. (2003) (see also Fieldsend and Everson 2005, 2006 and Hand and Till 2001) from a one-versus-one angle, or in Flach (2004) when adopting the one-versus-all approach. The concept of ROC analysis described above is better adapted to the situation where a natural order on the set of labels exists, just as in ordinal regression, see Waegeman et al. (2008b).
4.2 ROC-optimality and optimal scoring rules
The ROC surface provides a visual tool for assessing ranking performance of a scoring rule. The next theorem provides a formal statement to justify this practice.
Theorem 1
 1.
Assumption 1 is fulfilled and s ^{∗} is an optimal scoring rule in the sense of Definition 1.
 2.We have, for any scoring rule s and for all (α,γ)∈[0,1]^{2},$$\mathrm{ROC}(s,\alpha,\gamma)\leq\mathrm{ROC}\bigl(s^*,\alpha ,\gamma\bigr). $$
A nontrivial byproduct of the proof of the previous theorem is that optimizing the ROC surface amounts to simultaneously optimizing the ROC curves related to the two pairs of distributions (f _{1},f _{2}) and (f _{2},f _{3}).
The theorem indicates that optimality for scoring rules in the sense of Definition 1 is equivalent to optimality in the sense of the ROC surface. Therefore, the ROC surface provides a complete characterization of the ranking performance of a scoring rule in the K-partite problem.
 the quantile of order (1−α) of the conditional distribution of the random variable s(X) given Y=k:$$Q^{(k)}(s,\alpha) = F^{-1}_{s, k}(1-\alpha), $$
 the level set of the scoring rule s with the top elements of class Y=k:$$R^{(k)}_{s,\alpha}=\bigl\{x\in\mathcal{X}: s(x)>Q^{(k)}(s,\alpha) \bigr\} . $$
Proposition 5
The previous proposition provides a key inequality for the statistical results developed in the sequel.
4.3 Volume Under the ROC Surface (VUS)
In the bipartite case, a standard summary of ranking performance is the Area Under the ROC Curve (or AUC). In a similar manner, one may consider the volume under the ROC surface (VUS in abbreviated form) in the three-class framework. We follow here Scurfield (1996) but we mention that other notions of ROC surface can be found in the literature, leading to other summary quantities, also referred to as VUS, such as those introduced in Hand and Till (2001).
Definition 7
(Volume Under the ROC Surface)
The next proposition describes two extreme cases.
Proposition 6
 1.
If F _{ s,1}=F _{ s,2}=F _{ s,3}, then VUS(s)=1/6.
 2.
If the density functions of F _{ s,1}, F _{ s,2}, F _{ s,3} have disjoint supports, then VUS(s)=1.
Like the AUC criterion, the VUS can be interpreted in a probabilistic manner. For completeness, we recall the following result.
Proposition 7
(Scurfield 1996)
In the case where the distribution of s(X) is continuous, the last three terms on the right-hand side vanish, and the VUS boils down to the probability that, given three random instances X _{1}, X _{2} and X _{3} with respective labels Y _{1}=1, Y _{2}=2 and Y _{3}=3, the scoring rule s ranks them in the right order.
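This probabilistic interpretation yields a direct Monte Carlo approximation of the VUS in the continuous case. The sketch below (sample sizes and the Gaussian score distributions are illustrative assumptions) also recovers the two extreme values 1/6 and 1 of Proposition 6.

```python
import numpy as np

rng = np.random.default_rng(0)

def vus_monte_carlo(scores_1, scores_2, scores_3, n_triples=200_000):
    """Estimate VUS(s) = P(s(X1) < s(X2) < s(X3)), where scores_k holds
    the (continuous) scores of observations with label Y=k."""
    a = rng.choice(scores_1, size=n_triples)
    b = rng.choice(scores_2, size=n_triples)
    c = rng.choice(scores_3, size=n_triples)
    return np.mean((a < b) & (b < c))

# Non-informative scores (same distribution in every class): VUS close to 1/6.
flat = vus_monte_carlo(*(rng.normal(size=10_000) for _ in range(3)))

# Well-separated class-conditional score distributions: VUS close to 1.
sep = vus_monte_carlo(rng.normal(0, 1, 10_000),
                      rng.normal(10, 1, 10_000),
                      rng.normal(20, 1, 10_000))
```

The two checks mirror the extreme cases of Proposition 6 up to Monte Carlo error.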
4.4 VUS-optimality
We now consider the notion of optimality with respect to the VUS criterion and provide expressions of the deficit of VUS for any scoring rule which highlight the connection with AUC maximizers for the bipartite subproblems.
Proposition 8
(VUS optimality)
This result shows that optimal scoring rules in the sense of Definition 1 coincide with optimal elements in the sense of the VUS. This simple statement grounds the use of empirical VUS maximization strategies for the K-partite ranking problem.
When Assumption 1 is not fulfilled, the VUS can still be used as a performance criterion, both in the multiclass classification context (Landgrebe and Duin 2006; Ferri et al. 2003) and in the ordinal regression setup (Waegeman et al. 2008b). However, the interpretation of maximizers of the VUS as optimal orderings is then highly questionable. For instance, in the situation described in Example 1, one may easily check that, when ω _{1,1}=4/11, ω _{1,2}=6/11, ω _{1,3}=ω _{3,1}=1/11, ω _{2,1}=ω _{2,2}=3/11 and ω _{2,3}=ω _{3,2}=ω _{3,3}=5/11, the maximum VUS (equal to 0.2543) is reached simultaneously by the scoring rules corresponding to the strict orders ≺ and ≺′ such that x _{3}≺x _{2}≺x _{1} and x _{2}≺′x _{3}≺′x _{1} respectively.
We introduce the definition for the AUC of the bipartite ranking problem with the pair of distributions (f _{ k },f _{ k+1}):
Definition 8
(AUC)
We now state the result which establishes the relevance of AUC as an optimality criterion for the bipartite ranking problem.
Proposition 9
The next result makes clear that if a scoring rule s solves simultaneously all the bipartite ranking subproblems, then it also solves the global K-partite ranking problem. For simplicity, we present the result in the case K=3.
Theorem 2
(Deficit of VUS)
5 Consistency of pairwise aggregation and other strategies for K-partite ranking
5.1 Definition of VUS-consistency and main result
In this section, we assume that a data sample \(\mathcal{D}_{n}=\{(X_{1},Y_{1}), \ldots, (X_{n},Y_{n})\}\) is available, composed of n i.i.d. copies of the random pair (X,Y). Our goal here is to learn from the sample \(\mathcal{D}_{n}\) how to build a real-valued scoring rule \(\widehat{s}_{n}\) such that its ROC surface is as close as possible to the optimal ROC surface. We propose to consider a weak concept of consistency which relies on the VUS.
Definition 9
(VUS-consistency)
 the sequence {s _{ n }} is called VUS-consistent if$$\mathrm{VUS}^*-\mathrm{VUS}(s_n)\rightarrow0 \quad\text{in probability}, $$
 the sequence {s _{ n }} is called strongly VUS-consistent if$$\mathrm{VUS}^*-\mathrm{VUS}(s_n)\rightarrow0\quad\text{with probability one}. $$
Remark 1
In order to state the main result, we need an additional assumption on the distribution of the random pair (X,Y). The reason why this assumption is needed will be explained in the next section.
Assumption 2
In the statistical learning literature, Assumption 2 is referred to as the noise condition and goes back to the work of Tsybakov (2004). It has been adapted to the framework of bipartite ranking in Clémençon et al. (2008). For completeness, we state a result from the latter paper (see Corollary 8 therein) which offers a simple sufficient condition for Assumption 2 to be fulfilled.
Proposition 10
If the distribution of the r.v. η _{ k+1}(X)/(η _{ k }(X)+η _{ k+1}(X)) has a bounded density, then Assumption 2 is satisfied.
We will also need to use the notion of AUC consistency for the bipartite ranking subproblems.
Definition 10
(AUC consistency)
We can now state the main consistency result of the paper, which concerns the Kendall aggregation procedure described in Sect. 3.2. Indeed, the following theorem reveals that the notion of median scoring rule introduced in Definition 3 preserves AUC-consistency for the bipartite subproblems and thus yields a VUS-consistent scoring rule for the K-partite problem. It is assumed that the solutions to the bipartite subproblems are AUC-consistent for each specific pair of class distributions (f _{ k },f _{ k+1}), 1≤k<K. For simplicity, we formulate the result in the case K=3.
Theorem 3
 1.
Assumptions 1 and 2 hold true.
 2.
The class \(\mathcal{S}_{1}\) contains an optimal scoring rule.
 3.
The sequences \((s^{(1)}_{n})_{n\geq1}\) and \((s^{(2)}_{n})_{n\geq1}\) are (strongly) AUC-consistent for the bipartite ranking subproblems related to the pairs of distributions (f _{1},f _{2}) and (f _{2},f _{3}) respectively.
 4.
Assume that, for all n, there exists a median scoring rule \(\overline{s}_{n}\) in the sense of Definition 3 with respect to \((\mathcal{S}_{1}, \varSigma_{2,n})\).
Discussion
The first assumption of Theorem 3 puts a restriction on the class of distributions for which such a consistency result holds. Assumption 1 actually guarantees that the very problem of K-partite ranking makes sense and that an optimal scoring rule exists. Assumption 2 can be seen as a “light” restriction, since it still covers a large class of distributions commonly used in probabilistic modeling. The third and fourth assumptions are natural, as we expect to have efficient solutions to the bipartite subproblems before considering reasonable solutions to the K-partite problem. The most restrictive assumption is definitely the second one, which requires the class of candidates to contain an optimal element. Indeed, it is easy to weaken this assumption, at the price of an additional bias term, by assuming that the scoring rules \(s^{(1)}_{n}\), \(s^{(2)}_{n}\) and \(\overline{s}_{n}\) belong to a set \(\mathcal{S}_{1}^{(n)}\) such that there exists a sequence \((s_{n}^{*})_{n\geq1}\) with \(s_{n}^{*}\in\mathcal{S}_{1}^{(n)}\) and \(\mathrm{VUS} (s_{n}^{*})\rightarrow\mathrm{VUS}^{*}\) as n→∞. We decided not to include this refinement, as it is merely a technical argument which does not offer additional insight into the nature of the problem.
5.2 From AUC consistency to VUS consistency
In this section, we introduce auxiliary results which contribute to the proof of the main theorem (details are provided in the Appendix). Key arguments rely on the relationship between the solutions of the bipartite ranking subproblems and those of the K-partite problem. In particular, a sequence of scoring rules that is simultaneously AUC-consistent for the bipartite ranking problems related to the two pairs of distributions (f _{1},f _{2}) and (f _{2},f _{3}) is VUS-consistent. Indeed, we have the following corollary.
Corollary 1
 (i)
The sequence (s _{ n })_{ n } of scoring rules is (strongly) VUS-consistent.
 (ii)We have simultaneously, as n→∞:$$\mathrm{AUC}_{f_1,f_2}(s_n)\rightarrow\mathrm{AUC}^*_{f_1,f_2} \quad\text{and}\quad \mathrm{AUC}_{f_2,f_3}(s_n)\rightarrow\mathrm{AUC}^*_{f_2,f_3}$$ in probability (with probability one, respectively).
It follows from this result that the 3-partite ranking problem can be cast in terms of a double-criterion optimization task, consisting in finding a scoring rule s that simultaneously maximizes \(\mathrm{AUC}_{f_{1},f_{2}}(s)\) and \(\mathrm{AUC}_{f_{2},f_{3}}(s)\). This result provides a theoretical basis for the justification of our pairwise aggregation procedure. We mention that the idea of decomposing the K-partite ranking problem into several bipartite ranking subproblems has also been considered in Fürnkranz et al. (2009), but the aggregation stage is performed there with a different strategy.
The other type of result which is needed concerns the connection between the aggregation principle based on a consensus approach (Kendall τ) and the performance metrics involved in the K-partite ranking problem. The next results establish inequalities which relate the AUC and the Kendall τ in a quantitative manner.
Proposition 11
We point out that it is generally vain to look for a reverse control: indeed, scoring functions yielding different rankings may have exactly the same AUC. However, the following result guarantees that a scoring function with a nearly optimal AUC is close to optimal scoring functions in a certain sense, under the additional assumption that the noise condition introduced in Clémençon et al. (2008) is fulfilled.
Proposition 12
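The probabilistic Kendall τ distance between two scoring rules, which these inequalities relate to the AUC, admits a straightforward empirical counterpart over a sample. A minimal sketch follows; the tie-handling convention used here (a discordant pair counts 1, a pair tied under exactly one of the two rules counts 1/2) is a common one and may differ in detail from the paper's definition:

```python
import numpy as np

def empirical_kendall_distance(u, v):
    """Empirical Kendall tau distance between two score vectors u, v
    evaluated on the same sample: average disagreement over pairs."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    du = np.sign(u[:, None] - u[None, :])   # pairwise order under rule 1
    dv = np.sign(v[:, None] - v[None, :])   # pairwise order under rule 2
    iu = np.triu_indices(len(u), k=1)       # each unordered pair once
    du, dv = du[iu], dv[iu]
    discordant = np.mean(du * dv < 0)                 # strict disagreement
    half_tied = np.mean((du == 0) ^ (dv == 0))        # tie under one rule only
    return discordant + 0.5 * half_tied
```

Two rules inducing the same ranking are at distance 0, and fully reversed rankings are at distance 1.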
5.3 Alternative approaches to K-partite ranking
For completeness, we also mention in this section two other approaches to K-partite ranking.
5.3.1 Empirical VUS maximization
The theoretical analysis would rely on concentration properties of U-processes in order to control, uniformly over the class \(\mathcal{S}_{1}\), the deviation between the empirical and theoretical versions of the VUS criterion. Such an analysis was performed in the bipartite case in Clémençon et al. (2008) and we expect that it can be extended to the K-partite case. In contrast, the algorithmic aspects of maximizing the empirical VUS criterion (or a concave surrogate of it) are much less straightforward, and extending optimization strategies such as those introduced in Clémençon and Vayatis (2009b) or Clémençon and Vayatis (2010) requires significant methodological progress.
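To make the object of this optimization concrete: the empirical VUS is a three-sample U-statistic, counting the proportion of triples, one observation per class, that the candidate scores rank in the correct order. A naive O(n³) evaluation (ignoring ties, names ours) can be sketched as follows:

```python
import numpy as np

def empirical_vus(s1, s2, s3):
    """Empirical VUS of a scoring rule s from its scores on the three
    classes: fraction of triples (x1, x2, x3), one per class, with
    s(x1) < s(x2) < s(x3). Naive O(n^3) evaluation via broadcasting."""
    a = np.asarray(s1, float)[:, None, None]   # class-1 scores
    b = np.asarray(s2, float)[None, :, None]   # class-2 scores
    c = np.asarray(s3, float)[None, None, :]   # class-3 scores
    return np.mean((a < b) & (b < c))
```

A scoring rule oblivious to the classes yields a value close to 1/6, the VUS of random ordering in the 3-class case.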
5.3.2 Plug-in scoring rule
As shown by Proposition 1, when Assumption 1 is fulfilled, the regression function η is an optimal scoring function. The plug-in approach consists of estimating the latter and using the resulting estimate as a scoring rule. For instance, one may estimate the posterior probabilities (η _{1}(x),…,η _{ K }(x)) by an empirical counterpart \((\widehat{\eta}_{1}(x),\ldots,\widehat{\eta}_{K}(x))\) based on the training data and consider the ordering on ℝ^{ d } induced by the estimator \(\widehat{\eta}(x)=\sum_{k=1}^{K} k \widehat{\eta}_{k}(x)\). We refer to Clémençon and Vayatis (2009a) and Clémençon and Robbiano (2011) for preliminary theoretical results based on this strategy in the bipartite context, and to Audibert and Tsybakov (2007) for an account of the plug-in approach in binary classification. It is expected that an accurate estimate of η(x) will define a ranking rule similar to the optimal one, with nearly maximal VUS. As an illustration of this approach, the next result relates the deficit of VUS of a scoring function \(\widehat{\eta}\) to its L _{1}(μ)-error as an estimate of η. We assume for simplicity that all class-conditional distributions have the same support.
Proposition 13
This result reveals that an L _{1}(μ)-consistent estimator, i.e. an estimator \(\widehat{\eta}_{n}\) such that \(\mathbb{E}[\vert\eta(X)-\widehat{\eta}_{n}(X)\vert]\) converges to zero in probability as n→∞, yields a VUS-consistent ranking procedure. However, from a practical perspective, such procedures should be avoided when dealing with high-dimensional data, since they are confronted with the curse of dimensionality.
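As an illustrative sketch of the plug-in strategy on one-dimensional inputs, one may estimate the posteriors by class frequencies within histogram bins and then score by \(\widehat{\eta}(x)=\sum_{k} k\,\widehat{\eta}_{k}(x)\). The estimator below is ours, chosen purely for simplicity; it is not the estimator analyzed in the paper:

```python
import numpy as np

def plugin_score(x_train, y_train, x_new, n_bins=10, K=3):
    """Plug-in scoring rule on 1-d inputs: estimate the posterior
    probabilities eta_k(x), k = 1..K, by class frequencies within
    quantile bins, then score by eta_hat(x) = sum_k k * eta_hat_k(x)."""
    x_train = np.asarray(x_train, float)
    y_train = np.asarray(y_train)
    edges = np.quantile(x_train, np.linspace(0.0, 1.0, n_bins + 1))
    bins_tr = np.clip(np.searchsorted(edges, x_train, side="right") - 1, 0, n_bins - 1)
    bins_new = np.clip(np.searchsorted(edges, x_new, side="right") - 1, 0, n_bins - 1)
    eta = np.zeros((n_bins, K))
    for b in range(n_bins):
        in_bin = bins_tr == b
        if in_bin.any():
            for k in range(K):
                eta[b, k] = np.mean(y_train[in_bin] == k + 1)   # eta_hat_k on bin b
    # ordering induced by eta_hat(x) = sum_k k * eta_hat_k(x)
    return eta[bins_new] @ np.arange(1, K + 1)
```

Only the ordering induced by the returned scores matters for ranking; in higher dimension, histogram estimators of this kind degrade quickly, which is exactly the curse of dimensionality mentioned above.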
5.4 Connections with regression estimation and ordinal regression
Whereas standard multiclass classification ignores the possible ordinal structure of the output space, ordinal regression takes the latter into account by penalizing the error of a classifier candidate C on an example (X,Y) more and more heavily as |C(X)−Y| increases. In general, the loss function chosen is of the form ψ(c,y)=Ψ(|c−y|), (c,y)∈{1,…,K}^{2}, where Ψ:{0,…,K−1}→ℝ_{+} is some non-decreasing mapping. The most commonly used choice is Ψ(u)=u, corresponding to the risk \(L(C)=\mathbb{E}[\vert C(X)-Y \vert]\), sometimes referred to as the expected ordinal regression error, cf. Agarwal (2008). In this case, it is shown that the optimal classifier can be built by thresholding the regression function at specific levels \(t_{0}=0<t^{*}_{1}<\cdots<t^{*}_{K-1}<1=t_{K}\), that is to say it is of the form \(C^{*}(x)=\sum_{k=1}^{K}k\cdot\mathbb{I}\{ t^{*}_{k-1}\leq\eta(x)<t^{*}_{k} \}\), when assuming that \(\eta(X)=\mathbb{E}[Y\mid X]\) is a continuous r.v. for simplicity. Based on this observation, a popular approach to ordinal regression consists in first estimating the regression function η by an empirical counterpart \(\widehat{\eta}\) (through minimization of an estimate of \(R(f)=\mathbb{E}[(Y-f(X))^{2}]\) over a specific class \(\mathcal{F}\) of function candidates f, in general) and next choosing a collection t of thresholds t _{0}=0<t _{1}<⋯<t _{ K−1}<1=t _{ K } minimizing a statistical version of L(C _{ t }), where \(C_{\mathbf{t}}(x)=\sum_{k=1}^{K}k\cdot\mathbb{I}\{ t_{k-1}\leq\widehat{\eta}(x)<t_{k} \}\). Such procedures are sometimes termed regression-based algorithms, see Agarwal (2008). One may refer to Kramer et al. (2001) for an instance based on regression trees.
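The threshold-selection step of such a regression-based scheme can be sketched as follows: given estimates \(\widehat{\eta}(x_{i})\) on a sample, an exhaustive search over a small grid selects the thresholds minimizing the empirical ordinal regression error. The code is illustrative (names ours) and assumes a grid small enough for brute force:

```python
import numpy as np
from itertools import combinations

def fit_thresholds(eta_hat, y, K):
    """Given estimates eta_hat(x_i) and labels y_i in {1,...,K}, pick
    increasing thresholds t_1 < ... < t_{K-1} on a grid minimizing the
    empirical ordinal regression error mean |C_t(x_i) - y_i|."""
    eta_hat, y = np.asarray(eta_hat, float), np.asarray(y)
    grid = np.unique(eta_hat)              # candidate threshold values
    best_err, best_t = np.inf, None
    for t in combinations(grid, K - 1):    # exhaustive small-grid search
        c = np.digitize(eta_hat, t) + 1    # C_t(x_i): predicted class in {1,...,K}
        err = np.mean(np.abs(c - y))       # empirical ordinal regression error
        if err < best_err:
            best_err, best_t = err, t
    return best_t, best_err
```

In practice the regression estimate \(\widehat{\eta}\) would come from a least-squares fit over a class \(\mathcal{F}\), and smarter search (e.g. dynamic programming over sorted scores) would replace the brute-force loop.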
6 Illustrative numerical experiments
The purpose of this section is to illustrate the approach described above by numerical results and to provide some empirical evidence for its efficacy. Our goal here is to show that, beyond its theoretical validity, the Kendall aggregation approach to multiclass ranking actually works in practice, rather than to provide a detailed empirical study of its performance on benchmark artificial/real datasets compared to that of possible competitors (this will be the subject of a forthcoming paper). In the subsequent experimental analysis we have therefore considered two simple data generative models, for which one may easily check Assumption 1 and compute the optimal ROC surface (as well as the optimum value VUS^{∗}) to which the results obtained must be compared. The first example involves mixtures of Gaussian distributions, while the second one is based on mixtures of uniform distributions, the target ROC surface being piecewise linear in the latter case (cf. assertion 4 in Proposition 14). Here, the simulated artificial data are split into a training sample and a test sample, the latter used for plotting the “test ROC surfaces”.
The learning algorithm used for solving the bipartite ranking subproblems at the first stage of the procedure is the TreeRank procedure based on locally weighted versions of the CART method (with axis-parallel splits), see Clémençon et al. (2011a) for a detailed description of the algorithm (as well as Clémençon and Vayatis 2009b for rigorous statistical foundations of this method). Precisely, we used a package for the R statistical software (see http://www.rproject.org) implementing TreeRank (with the “default” parameters: minsplit = (size of training sample)/20, maxdepth = 10, mincrit = 0), available at http://treerank.sourceforge.net, see Baskiotis et al. (2010). The scoring rules produced at stage 1 are thus (tree-structured and) piecewise constant, making the aggregating procedure described in Sect. 3.2 quite feasible. Indeed, if s _{1},…,s _{ M } are scoring functions that are all constant on the cells of a finite partition \(\mathcal{P}\) of the input space \(\mathcal{X}\), one easily sees that the infimum \(\inf_{s\in\mathcal{S}_{0}}\sum_{m=1}^{M}d_{\tau_{\mu }}(s,s_{m})\) reduces to a minimum over a finite collection of scoring functions that are also constant on \(\mathcal{P}\)’s cells and is thus attained. As underlined in Sect. 3.2, when the number of cells is large, median computation may become practically unfeasible and the use of a metaheuristic can then be considered for approximation purposes (simulated annealing, tabu search, etc.). Here, the ranking obtained by taking the mean ranks over the K−1 rankings of the test data has been improved in the Kendall consensus sense by means of a standard simulated annealing technique.
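The mean-rank initialization mentioned above, used before the simulated-annealing refinement, amounts to a Borda-style aggregation of the K−1 rankings: each test point receives the average of its ranks under the individual scoring rules. A minimal numpy sketch (names ours):

```python
import numpy as np

def mean_rank_aggregation(score_matrix):
    """Aggregate M scoring rules evaluated on n test points by averaging,
    for each point, its rank under each rule (a Borda-style starting
    point for the Kendall-median search)."""
    score_matrix = np.asarray(score_matrix, float)   # shape (M, n)
    order = np.argsort(score_matrix, axis=1)         # permutation sorting each row
    ranks = np.empty_like(order)
    m, n = score_matrix.shape
    rows = np.arange(m)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]  # rank 1 = lowest score
    return ranks.mean(axis=0)                        # mean rank = aggregated score
```

The resulting mean-rank vector induces a ranking of the test points, which can then be refined towards the Kendall median by local moves inside a simulated annealing loop.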
For comparison purposes, we have also implemented two ranking algorithms, RankBoost (aggregating 30 stumps, see Rudin et al. 2005) and SVMRank (with linear and Gaussian kernels, with respective parameters C=20 and (C,γ)=(0.01), see Herbrich et al. 2000), using the SVMlight implementation available at http://svmlight.joachims.org/. We have also used the RankRLS method (http://www.tucs.fi/RLScore, see Pahikkala et al. 2007), which implements a regularized least-squares algorithm with linear kernel (“bias=1”) and with Gaussian kernel (γ=0.01), selection of the intercept on a grid being performed through a leave-one-out procedure. For completeness, the Kendall aggregation procedure has also been implemented with RankBoost for solving the bipartite subproblems.
First example (mixtures of Gaussian distributions)
Comparison of the VUS: “Gaussian” experiment—VUS^{∗}=0.4369
Method  \(\overline{\mathrm{VUS}} (\widehat{\sigma})\) 

TreeRank 1v2  0.3703 (±0.0102) 
TreeRank 2v3  0.3728 (±0.0104) 
TreeRank 1v3  0.3972 (±0.0053) 
TreeRank Agg  0.4118 (±0.0054) 
RankBoostVUS  0.4281 (±0.0024) 
RankBoost Agg  0.4305 (±0.0019) 
SVMrank lin  0.4367 (±0.0003) 
SVMrank gauss  0.4363 (±0.0009) 
RLScore lin  0.4368 (±0.0003) 
RLScore gauss  0.4366 (±0.0006) 
Second example (mixtures of uniform distributions)
Values of the η _{ k }’s on each of the nine sub-squares of [0,1]^{2}, cf. Fig. 5b
s ^{∗}  \(s^{*}_{1,2}\)  \(s^{*}_{2,3}\)  η _{1}  η _{2}  η _{3} 

0.2  0.2  0.2  0.7692  0.2000  0.0308 
0.4  0.4  0.2  0.6250  0.3250  0.0500 
0.6  0.8  0.6  0.3968  0.4127  0.1905 
0.8  0.8  0.8  0.3731  0.3881  0.2388 
1  1  1  0.3030  0.3939  0.3030 
1.25  1.25  1  0.2581  0.4194  0.3226 
1.66  1.66  1.66  0.1682  0.3645  0.4673 
2.5  2.5  2.5  0.0952  0.3095  0.5952 
5  2.5  5  0.0597  0.1940  0.7463 
Comparison of the VUS: “uniform” experiment—VUS^{∗}=0.3855
Method  \(\overline{\mathrm{VUS}}(\widehat{\sigma})\) 

TreeRank 1v2  0.3681 (±0.0060) 
TreeRank 2v3  0.3611 (±0.0056) 
TreeRank 1v3  0.3774 (±0.0037) 
TreeRank Agg  0.3818 (±0.0027) 
RankBoostVUS  0.3681 (±0.0013) 
RankBoost Agg  0.3687 (±0.0013) 
SVMrank lin  0.3557 (±0.0008) 
SVMrank gauss  0.3734 (±0.0008) 
RLScore lin  0.3554 (±0.0005) 
RLScore gauss  0.3742 (±0.0007) 
Cardiotocography data
We also illustrate the methodology promoted in this paper by implementing it on a real data set, namely the Cardiotocography Data Set considered in Frank and Asuncion (2010). The data have been collected as follows: 2126 fetal cardiotocograms (CTGs in abbreviated form) were automatically processed and the respective diagnostic features measured. The CTGs were next analyzed by three expert obstetricians and a consensus ordinal label was then assigned to each of them, depending on the degree of anomaly observed: 1 for “normal”, 2 for “suspect” and 3 for “pathologic”.
Comparison of the VUS test—“Cardiotocography” experiment
Method  VUS test 

TreeRank 1v2  0.2357 
TreeRank 2v3  0.3314 
TreeRank 1v3  0.6932 
TreeRank Agg  0.8141 
RankBoostVUS  0.8346 
RankBoost Agg  0.8959 
SVMrank lin  0.7202 
SVMrank gauss  0.7856 
RLScore lin  0.7652 
RLScore gauss  0.7829 
Discussion
We observe that, in each of these experiments, Kendall aggregation clearly improves ranking accuracy, when measured in terms of VUS. In addition, looking at the standard deviation, we see that the aggregated scoring function is more stable. In terms of level sets, Kendall aggregation yielded more complex subsets and thus sharper results. Notice additionally that, as the level sets are linear in the “Gaussian” experiment, it is not surprising that the kernel methods outperform the tree-based ones in this situation. In contrast, for the “uniform” experiment, the tree-based methods performed much better than the others; the performance of TreeRank Agg is nearly optimal. Looking at the level sets (see Fig. 6), they seem to recover their geometric structure well. Observe also that Kendall aggregation of (bipartite) scoring functions produced by RankBoost has always led to (slightly) better results than those obtained by a direct use of RankBoost on the 3-class population, with, moreover, a computation time smaller by a factor of 10. Finally, notice that, on the Cardiotocography data set, the Kendall aggregation approach based on RankBoost is the method that produced the scoring function with the largest VUS test among the candidate algorithms. In particular, it provides the best discrimination for the bipartite subproblem “1 vs 2”, apparently the most difficult one to solve, in view of the ROC surfaces plotted in Fig. 7.
Psychometric data
Comparison of the VUS test—“ERA” experiment
Method  class 1–7  class 1–9  

VUS test  C-index  JP-stat  VUS test  C-index  JP-stat  
TreeRank Agg  0.0068  0.7099  0.7292  0.0027  0.7330  0.8023 
TreeRankF Agg  0.0074  0.7125  0.7326  0.0028  0.7359  0.8050 
RankBoostVUS  0.0082  0.7141  0.7347  0.0029  0.7344  0.8065 
RankBoost Agg  0.0077  0.7130  0.7331  0.0028  0.7329  0.8042 
SVMrank lin  0.0088  0.7158  0.7359  0.0034  0.7380  0.8103 
SVMrank gauss  0.0054  0.7033  0.7215  0.0020  0.7284  0.7969 
RLScore lin  0.0090  0.7151  0.7354  0.0034  0.7386  0.8102 
RLScore gauss  0.0080  0.7130  0.7331  0.0029  0.7339  0.8052 
Comparison of the VUS test—“ESL” experiment (class 3–7)
Method  VUS test  C-index  JP-stat 

TreeRank Agg  0.6209  0.9536  0.9551 
TreeRankF Agg  0.6415  0.9588  0.9591 
RankBoostVUS  0.5745  0.9496  0.9493 
RankBoost Agg  0.5887  0.9513  0.9514 
SVMrank lin  0.6337  0.9579  0.9583 
SVMrank gauss  0.6074  0.9560  0.9544 
RLScore lin  0.6387  0.9579  0.9590 
RLScore gauss  0.6342  0.9568  0.9577 
Comparison of the VUS test—“LEV” experiment
Method  class 0–3  class 0–4  

VUS test  C-index  JP-stat  VUS test  C-index  JP-stat  
TreeRank Agg  0.4226  0.8347  0.8586  0.2932  0.8547  0.8758 
TreeRankF Agg  0.4893  0.8617  0.8787  0.2995  0.8620  0.8761 
RankBoostVUS  0.4842  0.8631  0.8773  0.2884  0.8637  0.8680 
RankBoost Agg  0.4700  0.8570  0.8743  0.2761  0.8576  0.8703 
SVMrank lin  0.4968  0.8668  0.8828  0.3124  0.8668  0.8753 
SVMrank gauss  0.4870  0.8637  0.8783  0.2847  0.8638  0.8705 
RLScore lin  0.4983  0.8668  0.8827  0.3122  0.8670  0.8751 
RLScore gauss  0.4954  0.8639  0.8799  0.3215  0.8663  0.8797 
Comparison of the VUS test—“SWD” experiment
Method  class 2–5  class 3–5  

VUS test  C-index  JP-stat  VUS test  C-index  JP-stat  
TreeRank Agg  0.4221  0.8154  0.8674  0.5537  0.8072  0.8223 
TreeRankF Agg  0.4169  0.8189  0.8659  0.5706  0.8150  0.8295 
RankBoostVUS  0.3304  0.8141  0.8404  0.5619  0.8125  0.8280 
RankBoost Agg  0.3562  0.8020  0.8498  0.5611  0.8127  0.8280 
SVMrank lin  0.3278  0.8071  0.8369  0.5493  0.8083  0.8219 
SVMrank gauss  0.3612  0.8140  0.8495  0.5599  0.8098  0.8238 
RLScore lin  0.3316  0.8076  0.8386  0.5483  0.8078  0.8214 
RLScore gauss  0.3680  0.8135  0.8518  0.5616  0.8123  0.8260 
We highlight the fact that the results we obtained with the approach promoted in this paper are quite comparable to those in Fürnkranz et al. (2009): they report a C-index of 0.7418 and a JP-stat of 0.7265 in the case of ERA with nine classes, as well as a C-index of 0.8660 and a JP-stat of 0.8757 in the case of LEV with five classes. Contrary to the C-index and the JP-stat, for which all the values obtained are very close to each other, the VUS seems to reveal more contrast in ranking performance. For instance, the aggregation procedure based on the TreeRank algorithm clearly outperforms the other competitors when considering the SWD dataset with classes 2–5, attesting to the relevance of this approach in a situation where the dimension of the input space is not small (namely 10) and the population is very skewed (the size of class “2” is very small compared to that of the others). Observe also that, in the other cases, the aggregation technique implemented using Ranking Forest has performance very similar to the state of the art: sometimes not considerably below (cf. ERA with 7 classes, ERA and LEV), sometimes slightly better (cf. ESL and SWD with 3 classes).
These empirical results only aim at illustrating the Kendall aggregation approach for K-partite ranking, the limited goal pursued here being to show how aggregation helps to improve results. Beyond the theoretical validity framework sketched in Sect. 3, and since a variety of bipartite ranking algorithms have been proposed in the literature and dedicated libraries are readily available, one of the main advantages of the Kendall aggregation approach lies in the fact that it is very easy to implement when applied to bipartite rules that are not too complex, so that the (approximate) median computation is feasible, see Sect. 3.2. A more complete and detailed empirical analysis of the merits and limitations of this procedure is the subject of ongoing work, in which comparisons with competitors are carried out and computational issues are investigated at length, provided that more real datasets with ordinal labels can be obtained.
7 Conclusion
In this article, we have presented theoretical work on ranking data with ordinal labels. In the first part of the paper, the issue of optimality has been tackled. We have proposed a likelihood ratio monotonicity condition that guarantees the existence and uniqueness of an “optimal” preorder on the input space, in the sense that it is optimal for any bipartite ranking subproblem, considering all possible pairs of labels. In particular, the regression function is proved to define an optimal ranking rule in this setting, highlighting the connection between K-partite ranking and ordinal regression. The second part is dedicated to describing a specific method for decomposing the multiclass ranking problem into a series of bipartite ranking tasks, as proposed in Fürnkranz et al. (2009). We have introduced a specific notion of median scoring function based on the (probabilistic) Kendall τ distance. We have next shown that the notion of ROC manifold/surface and its summary, the volume under the ROC surface (VUS), provide quantitative criteria for evaluating ranking accuracy in the ordinal setup: under the aforementioned likelihood ratio monotonicity condition, the scoring functions whose ROC surface is as high as possible everywhere coincide exactly with those forming the optimal set (i.e. the set of scoring functions that are optimal for all bipartite subproblems, defined with no reference to the notions of ROC surface and VUS). Conversely, we have proved that the existence of a scoring function with such a dominating ROC surface implies that the likelihood ratio monotonicity condition is fulfilled. It is shown that the aggregation procedure leads to a consistent ranking rule, when applied to scoring functions that are, each, consistent for the bipartite ranking subproblem related to a specific pair of consecutive class distributions.
This approach allows the use of ranking algorithms originally designed for the bipartite situation to be extended to the ordinal multiclass context. It is illustrated by three numerical examples. Further experiments, based on more real datasets in particular, will be carried out in the future in order to determine precisely the situations in which this method is competitive, compared to alternative ranking techniques in the ordinal multiclass setup. In this respect, we underline that, so far, very few practical algorithms tailored for ROC graph optimization have been proposed in the literature. Whereas, as shown at length in Clémençon and Vayatis (2009b) and Clémençon et al. (2011a), partitioning techniques for AUC maximization, in the spirit of the CART method for classification, can be implemented in a very simple manner, by recursively solving cost-sensitive classification problems (with a local cost, depending on the data lying in the cell to be split), recursive VUS maximization remains a challenging issue, for which no simple interpretation is currently available. Hence, the scarcity of possible strategies for direct optimization of the ranking criterion in the K-partite situation, in contrast with the bipartite context, strongly advocates, for the moment, for considering techniques that transform multiclass ranking into a series of bipartite tasks, such as the method analyzed in this article.
References
 Agarwal, S. (2008). Generalization bounds for some ordinal regression algorithms. In Proceedings of the 19th international conference on algorithmic learning theory, ALT ’08 (pp. 7–21). Berlin: Springer.
 Agarwal, S., Graepel, T., Herbrich, R., Har-Peled, S., & Roth, D. (2005). Generalization bounds for the area under the ROC curve. Journal of Machine Learning Research, 6, 393–425.
 Allwein, E., Schapire, R., & Singer, Y. (2001). Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
 Audibert, J., & Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics, 35, 608–633.
 Barthélemy, J., Guénoche, A., & Hudry, O. (1989). Median linear orders: heuristics and a branch and bound algorithm. European Journal of Operational Research, 42(3), 313–325.
 Baskiotis, N., Clémençon, S., Depecker, M., & Vayatis, N. (2010). TreeRank: an R package for bipartite ranking. In Proceedings of SMDTA 2010, stochastic modeling techniques and data analysis international conference.
 Beygelzimer, A., Dani, V., Hayes, T., Langford, J., & Zadrozny, B. (2005a). Error limiting reductions between classification tasks. In Machine learning, proceedings of the twenty-second international conference (ICML 2005) (pp. 49–56).
 Beygelzimer, A., Langford, J., & Zadrozny, B. (2005b). Weighted one against all. In Proceedings of the 20th national conference on artificial intelligence, AAAI ’05 (Vol. 2, pp. 720–725).
 Charon, I., & Hudry, O. (1998). Lamarckian genetic algorithms applied to the aggregation of preferences. Annals of Operations Research, 80, 281–297.
 Clémençon, S., & Robbiano, S. (2011). Minimax learning rates for bipartite ranking and plug-in rules. In Proceedings of the 28th international conference on machine learning, ICML ’11 (pp. 441–448).
 Clémençon, S., & Vayatis, N. (2009a). On partitioning rules for bipartite ranking. Journal of Machine Learning Research, 5, 97–104.
 Clémençon, S., & Vayatis, N. (2009b). Tree-based ranking methods. IEEE Transactions on Information Theory, 55(9), 4316–4336.
 Clémençon, S., & Vayatis, N. (2009c). Adaptive estimation of the optimal ROC curve and a bipartite ranking algorithm. In Proceedings of the 20th international conference on algorithmic learning theory, ALT ’09 (pp. 216–231).
 Clémençon, S., & Vayatis, N. (2010). Overlaying classifiers: a practical approach to optimal scoring. Constructive Approximation, 32(3), 619–648.
 Clémençon, S., Lugosi, G., & Vayatis, N. (2008). Ranking and empirical risk minimization of U-statistics. The Annals of Statistics, 36(2), 844–874.
 Clémençon, S., Depecker, M., & Vayatis, N. (2011a). Adaptive partitioning schemes for bipartite ranking. Machine Learning, 43(1), 31–69.
 Clémençon, S., Depecker, M., & Vayatis, N. (2011b). Avancées récentes dans le domaine de l’apprentissage statistique d’ordonnancements. Revue d’Intelligence Artificielle, 25(3), 345–368.
 David, A. B. (2008). Ordinal real-world data sets repository.
 Debnath, R., Takahide, N., & Takahashi, H. (2004). A decision based one-against-one method for multiclass support vector machine. Pattern Analysis and Its Applications, 7(2), 164–175.
 Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. The Journal of Artificial Intelligence Research, 2, 263–286.
 Dreiseitl, S., Ohno-Machado, L., & Binder, M. (2000). Comparing three-class diagnostic tests by three-way ROC analysis. Medical Decision Making, 20, 323–331.
 Edwards, D., Metz, C., & Kupinski, M. (2005). The hypervolume under the ROC hypersurface of ‘near-guessing’ and ‘near-perfect’ observers in n-class classification tasks. IEEE Transactions on Medical Imaging, 24(3), 293–299.
 Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., & Vee, E. (2004). Comparing and aggregating rankings with ties. In Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’04 (pp. 47–58).
 Ferri, C., Hernández-Orallo, J., & Salido, M. (2003). Volume under the ROC surface for multiclass problems. In Proceedings of the 14th European conference on machine learning (pp. 108–120).
 Fieldsend, J., & Everson, R. (2005). Formulation and comparison of multiclass ROC surfaces. In Proceedings of the ICML 2005 workshop on ROC analysis in machine learning (pp. 41–48).
 Fieldsend, J., & Everson, R. (2006). Multiclass ROC analysis from a multiobjective optimisation perspective. Pattern Recognition Letters, 27, 918–927.
 Flach, P. (2004). Tutorial: “the many faces of ROC analysis in machine learning”. Part III (Technical report). International conference on machine learning 2004.
 Frank, A., & Asuncion, A. (2010). UCI machine learning repository.
 Freund, Y., Iyer, R. D., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
 Fürnkranz, J. (2002). Round robin classification. Journal of Machine Learning Research, 2, 721–747.
 Fürnkranz, J., Hüllermeier, E., & Vanderlooy, S. (2009). Binary decomposition methods for multipartite ranking. In Proceedings of the European conference on machine learning and knowledge discovery in databases: Part I, ECML PKDD ’09 (pp. 359–374).
 Hand, D., & Till, R. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171–186.
 Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. The Annals of Statistics, 26(2), 451–471.
 Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In Advances in large margin classifiers (pp. 115–132). Cambridge: MIT Press.
 Higgins, J. (2004). Introduction to modern nonparametric statistics. N. Scituate: Duxbury Press.
 Hudry, O. (2008). NP-hardness results for the aggregation of linear orders into median orders. Annals of Operations Research, 163, 63–88.
 Huhn, J., & Hüllermeier, E. (2008). Is an ordinal class structure useful in classifier learning? International Journal of Data Mining, Modelling and Management, 1(1), 45–67.
 Kramer, S., Pfahringer, B., Widmer, G., & Groeve, M. D. (2001). Prediction of ordinal regression trees. Fundamenta Informaticae, 47, 1001–1013.
 Laguna, M., Marti, R., & Campos, V. (1999). Intensification and diversification with elite tabu search solutions for the linear ordering problem. Computers and Operations Research, 26(12), 1217–1230.
 Landgrebe, T., & Duin, R. (2006). A simplified extension of the area under the ROC to the multiclass domain. In Seventeenth annual symposium of the pattern recognition association of South Africa (pp. 241–245).
 Lebanon, G., & Lafferty, J. (2002). Conditional models on the ranking poset. In Advances in neural information processing systems (Vol. 15, pp. 415–422).
 Lehmann, E., & Romano, J. P. (2005). Testing statistical hypotheses. Berlin: Springer.
 Li, J., & Zhou, X. (2009). Nonparametric and semiparametric estimation of the three-way receiver operating characteristic surface. Journal of Statistical Planning and Inference, 139, 4133–4142.
 Mandhani, B., & Meila, M. (2009). Tractable search for learning exponential models of rankings. Journal of Machine Learning Research, Proceedings Track, 5, 392–399.
 Meila, M., Phadnis, K., Patterson, A., & Bilmes, J. (2007). Consensus ranking under the exponential model. In Proceedings of the twenty-third annual conference on uncertainty in artificial intelligence (UAI-07) (pp. 285–294).
 Mossman, D. (1999). Three-way ROCs. Medical Decision Making, 19(1), 78–89.
 Nakas, C., & Yiannoutsos, C. (2004). Ordered multiple-class ROC analysis with continuous measurements. Statistics in Medicine, 23(22), 3437–3449.
 Pahikkala, T., Tsivtsivadze, E., Airola, A., Boberg, J., & Salakoski, T. (2007). Learning to rank with pairwise regularized least-squares. In Proceedings of the SIGIR 2007 workshop on learning to rank for information retrieval (pp. 27–33).
 Pepe, M. (2003). Statistical evaluation of medical tests for classification and prediction. Oxford: Oxford University Press.
 Rajaram, S., & Agarwal, S. (2005). Generalization bounds for k-partite ranking. In NIPS workshop on learning to rank.
 Robbiano, S. (2010). Note on confidence regions for the ROC surface (Technical report). Telecom ParisTech.
 Rudin, C., Cortes, C., Mohri, M., & Schapire, R. E. (2005). Margin-based ranking and boosting meet in the middle. In Proceedings of the 18th annual conference on learning theory, COLT ’05 (pp. 63–78). Berlin: Springer.
 Scurfield, B. (1996). Multiple-event forced-choice tasks in the theory of signal detectability. Journal of Mathematical Psychology, 40, 253–269.
 Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1), 135–166.
 Vapnik, V. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
 Venkatesan, G., & Amit, S. (1999). Multiclass learning, boosting, and error-correcting codes. In Proceedings of the twelfth annual conference on computational learning theory, COLT ’99 (pp. 145–155).
 Waegeman, W., & Baets, B. D. (2011). On the ERA ranking representability of pairwise bipartite ranking functions. Artificial Intelligence, 175, 1223–1250.
 Waegeman, W., Baets, B. D., & Boullart, L. (2008a). On the scalability of ordered multiclass ROC analysis. Computational Statistics and Data Analysis, 52, 3371–3388.
 Waegeman, W., Baets, B. D., & Boullart, L. (2008b). ROC analysis in ordinal regression learning. Pattern Recognition Letters, 29, 1–9.
 Wakabayashi, Y. (1998). The complexity of computing medians of relations. Resenhas, 3(3), 323–349.