Calibration and regret bounds for order-preserving surrogate losses in learning to rank
Abstract
Learning to rank is usually reduced to learning to score individual objects, leaving the “ranking” step to a sorting algorithm. In that context, the surrogate loss used for training the scoring function needs to behave well with respect to the target performance measure which only sees the final ranking. A characterization of such a good behavior is the notion of calibration, which guarantees that minimizing (over the set of measurable functions) the surrogate risk allows us to maximize the true performance.
In this paper, we consider the family of order-preserving (OP) losses which includes popular surrogate losses for ranking such as the squared error and pairwise losses. We show that they are calibrated with performance measures like the Discounted Cumulative Gain (DCG), but also that they are not calibrated with respect to the widely used Mean Average Precision and Expected Reciprocal Rank. We also derive, for some widely used OP losses, quantitative surrogate regret bounds with respect to several DCG-like evaluation measures.
Keywords
Learning to rank · Calibration · Surrogate regret bounds

1 Introduction
Learning to rank has emerged as a major field of research in machine learning due to its wide range of applications. Typical applications include creating the query-dependent document ranking in search engines, where one learns to order sets of documents, each of these sets being attached to a query, using relevance judgments for each document as supervision. This task is known as subset ranking (Cossock and Zhang 2008). Another application is label ranking (see e.g. Dekel et al. 2003; Vembu and Gärtner 2009), where one learns to order a fixed set of labels depending on an input with a training set composed of observed inputs and the corresponding weak or partial order over the label set. Label ranking is a widely used framework to deal with multiclass/multilabel classification when the application accepts a ranking of labels according to the posterior probability of class membership instead of a hard decision about class membership.
In a similar way to other prediction problems in discrete spaces like classification, the optimization of the empirical ranking performance over a restricted class of functions is most frequently an intractable problem. Just like one optimizes the hinge loss or the log-loss in binary classification as a surrogate for the classification error, the usual approach in learning to rank is to replace the original performance measure by a continuous, preferably differentiable and convex function of the predictions. This has led many researchers to reduce the problem of learning to rank to learning a scoring function which assigns a real value to each individual item of the input set. The final ranking is then produced with a sorting algorithm. Many existing learning algorithms follow this approach, both for label ranking (see e.g. Weston and Watkins 1999; Crammer and Singer 2002; Dekel et al. 2003) and subset ranking (Joachims 2002; Burges et al. 2005; Yue et al. 2007; Cossock and Zhang 2008; Liu 2009; Cambazoglu et al. 2010; Chapelle and Chang 2011). This relaxation has two advantages. First, the sorting algorithm is a very efficient way to obtain a ranking (without scores, obtaining a full ranking is usually a very difficult problem). Second, defining a continuous surrogate loss on the space of predicted scores is a much easier task than defining one in the space of permutations.
While the computational advantage of such surrogate formulations is clear, one needs guarantees that minimizing the surrogate formulation (i.e. what the learning algorithm actually does) also enables us to maximize the ranking performance (i.e. what we want the algorithm to do). That is, we want the learning algorithm to be consistent with the true ranking performance we want to optimize. Steinwart (2007) presents general definitions and results showing that an asymptotic guarantee of consistency is equivalent to a notion of calibration of the surrogate loss with respect to the ranking performance measure, while the existence of non-asymptotic guarantees in the form of surrogate regret bounds is equivalent to a notion of uniform calibration. A surrogate regret bound quantifies how fast the evaluation measure is maximized as the surrogate loss is minimized. We note here that such non-asymptotic guarantees are critical in machine learning, where it is delusive to hope for learning near-optimal functions in a strong sense. Calibration and uniform calibration have been extensively studied in (cost-sensitive) binary classification (see e.g. Bartlett and Jordan 2006; Zhang 2004; Steinwart 2007; Scott 2011) and multiclass classification (Zhang 2004; Tewari and Bartlett 2007). In particular, under natural continuity assumptions, it was shown that calibration and uniform calibration of a surrogate loss are equivalent for margin losses. In the context of learning to rank with pairwise preferences, the non-calibration of many existing surrogate losses with respect to the pairwise disagreement was studied in depth in Duchi et al. (2010). On the other hand, in the context of learning to rank for information retrieval, surrogate regret bounds for square-loss regression with respect to a ranking performance measure called the Discounted Cumulative Gain (DCG, see Järvelin and Kekäläinen 2002) were shown in Cossock and Zhang (2008).
These bounds were further extended in Ravikumar et al. (2011) to a larger class of surrogate losses.
In this paper, we analyze the calibration and uniform calibration of losses that possess an order-preserving property. This property of a surrogate loss implies (and, to some extent, is equivalent to) calibration with respect to any ranking performance measure in a family we call the generalized positional performance measures (GPPMs). A GPPM is a performance measure which, up to a suitable parametrization, can be written like a DCG. The study of GPPMs offers, in particular, the possibility to extend the DCG to arbitrary supervision spaces (e.g. linear orders, pairwise preferences) by first mapping the supervision to scores for each item, scores that can be interpreted as utility values. The family of GPPMs includes widely known performance measures for ranking like the DCG and its normalized version the NDCG, the precision-at-rank-K as well as the recall-at-rank-K, and Spearman's rank correlation coefficient (when the supervision is a linear ordering). We also give practical examples of template order-preserving losses, which can be instantiated for any specific GPPM to obtain a calibrated surrogate loss.
To go further, we investigate conditions under which the stronger notion of uniform calibration holds in addition to simple calibration. Under natural continuity conditions on the loss function, we show that any loss calibrated with a GPPM is uniformly calibrated when the supervision space is finite, which entails the existence of a regret bound. Finally, we prove explicit regret bounds for several convex template order-preserving losses. These bounds can be instantiated for any GPPM, such as the (N)DCG and recall/precision-at-rank-K. In particular, we obtain the first regret bounds with respect to GPPMs for losses based on pairwise comparisons, and recover the surrogate regret bounds of Cossock and Zhang (2008) and Ravikumar et al. (2011). Our proof technique is different, though, and we are able to slightly improve the constant factor in the bounds.
As a byproduct of our analysis, we investigate whether a loss with some order-preserving property can be calibrated with two measures other than GPPMs, namely the Expected Reciprocal Rank (ERR, Chapelle et al. 2009), used as the reference measure in the recent Yahoo! Learning to Rank Challenge (Chapelle and Chang 2011), and the Average Precision (AP), which was used in past Text REtrieval Conference (TREC) competitions (Voorhees and Harman 2005). Surprisingly, we show a negative result, even though these measures assume that the supervision itself takes the form of real values (relevance scores). Our result implies that for any transformation of the relevance scores given as supervision, the regression function of these transformations is not optimal for the ERR or the AP in general. We believe that this result can help understand the limitations of the score-and-sort approach to ranking, and it puts emphasis on an often neglected fact: the choice of the surrogate formulation is not really a matter of supervision space, but should be made carefully depending on the target measure. For example, one can use an appropriate regression approach to optimize Pearson's correlation coefficient when the supervision is a full ordering of the set, but for the ERR or the AP, regression approaches cannot be calibrated even though one gets real values for each item as supervision.
The rest of the paper is organized as follows. Sect. 2 describes the framework and the basic definitions. In Sect. 3, we introduce a family of surrogate losses called the order-preserving losses, and we study their calibration with respect to a wide range of performance measures. Then, we propose an analysis of sufficient conditions under which calibration is equivalent to the existence of a surrogate regret bound; this analysis is carried out by studying the stronger notion of uniform calibration in Sect. 4. In Sect. 5, we describe several methods to find explicit formulas for surrogate regret bounds, and we exhibit some examples for common surrogate losses. Related work is discussed in Sect. 6, where we also summarize our main contributions.
This paper extends our prior work with D. Buffoni published in the International Conference on Machine Learning (Buffoni et al. 2011). More specifically, the definition of order-preserving losses, as well as several theorems and their proofs, already appeared in Buffoni et al. (2011); all other results are new.
2 Ranking performance measures and surrogate losses
Notation
A boldface character always denotes a function taking values in \(\mathbb{R}^{n}\) or an element of \(\mathbb{R}^{n}\) for some n>1. If \({\bf f}\) is a function of x, then \(f_{i}(x)\), using normal font and subscript, denotes the ith component of \({\bf f}(x)\). Likewise, if \({\bf x} \in \mathbb{R}^{n}\), \(x_{i}\) denotes its ith component.
2.1 Definitions and examples
Scoring functions and scoring performance measures
Notice that with these definitions, for \(\sigma\in \operatorname{arg\,sort}( {\bf f}(x))\), σ(k) denotes the integer of \(\left \{ 1,\ldots,n \right \}\) whose predicted rank is k. Following the tradition in information retrieval, “item i has a better rank than item j according to σ” means “σ ^{−1}(i)<σ ^{−1}(j)”, i.e. low ranks are better. Likewise, the top-d ranks stand for the set of ranks \(\left \{ 1, \ldots, d \right \}\). Also notice that \(\operatorname {arg\, sort}\) is a set-valued function because of possible ties.
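These conventions can be made concrete with a small sketch (the function names below are ours, not from the paper; ties are broken arbitrarily by item index rather than returning the full set of permutations):

```python
# Score-and-sort conventions: sigma(k) is the item placed at rank k
# (rank 1 is best), so sigma sorts items by decreasing score.
def arg_sort(scores):
    """Return one permutation sigma, as a list with sigma[k-1] = item at rank k.

    arg_sort is set-valued in the text because of ties; here we break
    ties arbitrarily by item index.
    """
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))

def rank_of(sigma, i):
    """sigma^{-1}(i): the (1-based) rank of item i; lower ranks are better."""
    return sigma.index(i) + 1

scores = [0.2, 1.5, 0.7]
sigma = arg_sort(scores)
assert sigma == [1, 2, 0]          # item 1 gets rank 1, item 0 gets rank 3
assert rank_of(sigma, 1) == 1
assert rank_of(sigma, 0) == 3
```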
Predicting a total order over a finite set of objects is of use in many applications. One example is information retrieval, where x represents a tuple (query, set of documents) and f _{ i }(x) is the score given to the ith document in the set given the query. In practice, x contains joint feature representations of (query, document) pairs and f _{ i }(x) is the predicted relevance of document i with respect to the query. The learning task associated with this prediction problem has been called subset ranking in Cossock and Zhang (2008). Note that in practice, the number of documents may vary from one query to another, while in our work it is assumed to be constant. Nonetheless, all our results hold if the set size is allowed to vary, as long as it remains uniformly bounded. Another example of application is label ranking (see e.g. Dekel et al. 2003) where x is some observed object such as a text document or an image, and the set to order is a fixed set of class labels. In that case, x usually contains a feature representation of the object, and a function f _{ i } is learned for each label index i; higher values of f _{ i }(x) represent higher predicted class-membership probabilities. Large-margin approaches to multiclass classification (see e.g. Weston and Watkins 1999; Crammer and Singer 2002) are special cases of a score-and-sort approach to label ranking, where the prediction is the top-ranked label.
In the supervised learning setting, the prediction function is trained using a set of examples for which some feedback, or supervision, indicative of the desired ordering is given. In order to measure the quality of a predicted ordering of \(\left \{ 1,\ldots,n \right \}\), a ranking performance measure is used. It is a measurable function \(r: \mathcal {Y}\times {\mathfrak {S}_{n}} \rightarrow \mathbb{R}_{+}\), where \((\mathcal {Y}, \varSigma_{\mathcal {Y}})\) is a measurable space which will be called the supervision space, and each \(y\in \mathcal {Y}\) provides information about which orderings are desired. We take the convention that larger values of \({ r\left .(y, \sigma\right .)}\) mean that σ is an ordering of \(\left \{ 1,\ldots,n \right \}\) in accordance with y. The supervision space may differ from one application to another. In search engine applications, the Average Precision (AP) used in past TREC competitions (Voorhees and Harman 2005), the Expected Reciprocal Rank (ERR) used in the Yahoo! Learning to Rank Challenge (Chapelle et al. 2009; Chapelle and Chang 2011), and the (Normalized) Discounted Cumulative Gain ((N)DCG) (Järvelin and Kekäläinen 2002) assume the supervision space \(\mathcal {Y}\) is defined as {0,…,p}^{ n } for some integer p>0, where the ith component of \(y\in \mathcal {Y}\) is a judgment of the relevance of the ith item to rank w.r.t. the query (these performance measures always favor better ranks for items of higher relevance). Other forms of supervision spaces may be used, though: for instance, in recommendation tasks we may allow user ratings on a continuous (yet usually bounded) scale, or allow the supervision to be a preference relation over the set of items, as proposed in one of the earliest papers on learning to rank (Cohen et al. 1997).
Definition 2.1
(Generalized Positional Performance Measure)
 1.
ϕ(1)>0 and ∀0<k<n,ϕ(k)≥ϕ(k+1)≥0
 2.
\(\displaystyle\exists b: \mathcal {Y}\rightarrow \mathbb{R}\), such that \(r: (y, \sigma ) \mapsto b(y)+ \sum_{k = 1}^{n}{\phi }(k)u_{\sigma (k)}(y)\).
The most popular example of a GPPM is the DCG, for which \({\phi }(i) = \frac{{\bf1}_{\left \{ i \le k \right \}}}{\log(1 + i)}\) and \(u_{i}({\bf y}) = 2^{y_{i}} - 1\). The function u can be seen as mapping the supervision y to utility scores for each individual item to rank. We may therefore refer to u as the utility function of the (u,ϕ)-GPPM r. The name we use for this family of measures comes from the positional models described in Chapelle et al. (2009), in which one orders the documents according to the utility (i.e. the relevance) of a document w.r.t. the query. We call them “generalized” because the utility function is, in our case, only a means to transform the supervision so that a given performance measure can be cast as a positional model. In a specific context, however, this transformation does not necessarily coincide with the relevance of a document with respect to the query as a user may define it. In particular, in information retrieval, the relevance of a document is usually defined independently of the other documents, while in the case of GPPMs, the utility score may depend on the relevance of the other documents. The Normalized DCG is a typical example of such a GPPM.
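As a sanity check of this parametrization (a minimal sketch; we use the base-2 logarithm and a truncation rank k, following the usual DCG@k convention, and take b(y) = 0):

```python
import math

def dcg_at_k(y, sigma, k):
    """DCG@k written directly in the GPPM form
    r(y, sigma) = b(y) + sum_j phi(j) * u_{sigma(j)}(y), with b(y) = 0."""
    def phi(j):                       # position discount, zero beyond rank k
        return 1.0 / math.log2(1 + j) if j <= k else 0.0
    def u(i):                         # utility from graded relevance y_i
        return 2 ** y[i] - 1
    return sum(phi(j) * u(sigma[j - 1]) for j in range(1, len(y) + 1))

y = [3, 0, 1]                         # graded relevance judgments
best = [0, 2, 1]                      # items sorted by decreasing relevance
worse = [1, 2, 0]
# Sorting by decreasing utility maximizes a GPPM, since phi is nonincreasing:
assert dcg_at_k(y, best, k=3) > dcg_at_k(y, worse, k=3)
```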
Summary of common performance measures. The function b is equal to zero for all measures except for the AUC (\(b({\bf y}) = \frac{(\lVert {\bf y}\rVert _{1}1)}{2(n  \lVert {\bf y}\rVert _{1})}\)) and for Spearman Rank Correlation Coefficient (\(b({\bf y}) =  \frac{3(n1)}{(n + 1)}\)). The details of the calculations are given in Appendix B.1
Learning objective and surrogate scoring loss
A major issue in the field of learning to rank is the design of surrogate scoring losses that are, in some sense, well-behaved with respect to the target ranking performance measure. The next subsection will define criteria that should be satisfied by a reasonable surrogate loss. But before going into more detail, and in order to give a concrete example of a family of losses that may be useful when the performance measure is a GPPM, we define the following family of template losses:
Definition 2.2
(Template Scoring Loss)
A typical example of template loss is the squared loss defined by \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i=1}^{n} (v_{i}s_{i})^{2}\) on \(\varGamma =\mathbb{R}^{n}\), as proposed in Cossock and Zhang (2008). Other examples of template losses will be given in Sect. 3.
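As an illustration, the squared template loss and its instantiation with the DCG utility u_i(y) = 2^{y_i} − 1 can be sketched as follows (variable names and data are ours):

```python
def squared_template_loss(v, s):
    """Template loss ell(v, s) = sum_i (v_i - s_i)^2 (Cossock and Zhang 2008)."""
    return sum((vi - si) ** 2 for vi, si in zip(v, s))

# Instantiating the template for a (u, phi)-GPPM plugs in the utility
# values; here u_i(y) = 2^{y_i} - 1 as for the DCG (hypothetical data).
y = [2, 0, 1]                          # relevance judgments
v = [2 ** yi - 1 for yi in y]          # utility targets: [3, 0, 1]
s = [2.5, 0.2, 1.1]                    # predicted scores
assert abs(squared_template_loss(v, s) - 0.3) < 1e-9
```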
Note that many surrogate losses have been proposed for learning to rank (see Liu 2009 for an exhaustive review), and many of them are actually not template losses. SVM^{ map } (Yue et al. 2007), and many other instances of the structural SVM approach to ranking, are good examples (Le and Smola 2007; Chakrabarti et al. 2008). Their advantage is that they are designed for a specific performance measure, which may work better in practice when this performance measure is used for evaluation. On the other hand, template losses have the algorithmic advantage of providing an interface that can easily be specialized for a specific GPPM.
2.2 Calibration and surrogate regret bounds
We now describe some natural properties that surrogate loss functions should satisfy. This subsection defines the notation, and briefly summarizes the definitions and results from Steinwart (2007) which are the basis of our work. The notations defined in this section are used in the rest of the paper without further notice.
Calibration

\(\mathcal {L}(P, {\bf f}) = \int_{\mathcal {X}} \int_{\mathcal {Y}} { \ell \left (y, {\bf f}(x)\right )} {\rm d}P(y|x) {\rm d}P_{\mathcal {X}}(x)\) the scoring risk of \({\bf f}\);

\(\displaystyle {\underline {\mathcal {L}}}(P) = \inf_{\substack{{\bf f}:\mathcal {X}\rightarrow \mathbb{R}^n\\{\bf f}~\text{measurable}}} \mathcal {L}(P, {\bf f})\) the optimal scoring risk;

\(\displaystyle {\overline {\mathcal {R}}}(P) = \sup_{\substack{{\bf f}:\mathcal {X}\rightarrow \mathbb{R}^n\\{\bf f}~\text{measurable}}} \mathcal {R}(P, {\bf f})\) the optimal ranking performance.
Let \(\mathcal {D}\) denote the set of probability distributions over \(\mathcal {Y}\), and let \({\Delta }\subset \mathcal {D}\). Following Steinwart (2007, Definition 2.6), we say that P is a distribution of type Δ if P(.|x)∈Δ for all x. Then, Steinwart (2007, Theorem 2.8) shows that (1) holds for any distribution of type Δ such that \({\overline {\mathcal {R}}}(P)<+\infty\) and \({\underline {\mathcal {L}}}(P)<+\infty\) if and only if ℓ is r-calibrated on Δ, according to the following definition:
Definition 2.3
(Calibration)
Let r be a ranking performance measure, ℓ a scoring loss and \({\Delta }\subset \mathcal {D}\) where \(\mathcal {D}\) is the set of probability distributions over \(\mathcal {Y}\).

\(\displaystyle\forall {\bf s}\in \mathbb{R}^n, { L\left ({\scriptstyle \triangle },{\bf s}\right )} = \int_{\mathcal {Y}} { \ell \left (y, {\bf s}\right )} {\rm d}{\scriptstyle \triangle }(y)\) and \(\displaystyle { {\underline {L}}\left ({\scriptstyle \triangle }\right )} = \inf_{{\bf s}\in \mathbb{R}^n} { L\left ({\scriptstyle \triangle },{\bf s}\right )}\);

\(\displaystyle\forall {\bf s}\in \mathbb{R}^n,{ R\left .({\scriptstyle \triangle },{\bf s}\right .)} = \int_{\mathcal {Y}} { r\left .(y, {\bf s}\right .)} {\rm d}{\scriptstyle \triangle }(y)\) and \(\displaystyle { {\overline {R}}\left ({\scriptstyle \triangle }\right )} = \sup_{{\bf s}\in \mathbb{R}^n} { R\left .({\scriptstyle \triangle },{\bf s}\right .)}\).
The definition of calibration allows us to reduce the study of the implication (1), which involves risks and performances defined with respect to the whole data distribution, to the study of the inner risks, which are much easier to deal with since they are only functions of the distribution over the supervision space and a score vector. Thus, the inner risk and the inner performance are the essential quantities we investigate in this paper. The calibration of some surrogate losses w.r.t. GPPMs will be studied in Sect. 3.
Remark 1
The criterion given by Eq. (1) studies the convergence to the performance of the best possible scoring function, even though reaching this function is practically unfeasible on a finite training set, since we need to consider a restricted class of functions. Nonetheless, as discussed in Zhang (2004) in the context of multiclass classification, the best possible performance can be achieved asymptotically as the number of examples grows to infinity, using the method of sieves or structural risk minimization, that is, by progressively increasing the model complexity as the training set size increases.
Uniform calibration and surrogate regret bounds
Definition 2.4
(Uniform Calibration)
Some criteria to establish the uniform calibration of scoring losses w.r.t. GPPMs are provided in Sect. 4. Quantitative regret bounds for specific template scoring losses will then be given in Sect. 5.
3 Calibration of orderpreserving losses
In this section, we address the following question: which surrogate losses are calibrated w.r.t. GPPMs? This leads us to define the order-preserving property for surrogate losses. Since there is no reason to believe that these losses are calibrated only with respect to GPPMs, we also address the question of whether they can be calibrated with two other popular performance measures, namely the ERR and the AP. In the remainder of the paper, we make extensive use of the notations of Definition 2.3.
Notation
We now introduce additional notation. For a ranking performance measure r, we denote \(\mathcal {D}_{r}= \left \{ {\scriptstyle \triangle }\in \mathcal {D} \mid \forall {\bf s}\in \mathbb{R}^{n}, { R\left .({\scriptstyle \triangle },{\bf s}\right .)}<+\infty \right \}\). Likewise, we define \(\mathcal {D}_{\ell }= \left \{ {\scriptstyle \triangle }\in \mathcal {D} \mid \forall {\bf s}\in \mathbb{R}^{n}, { L\left ({\scriptstyle \triangle },{\bf s}\right )}<+\infty \right \} \) for a scoring loss ℓ, and denote by \(\mathcal {D}_{\ell , r}\) the intersection of \(\mathcal {D}_{r}\) and \(\mathcal {D}_{\ell }\). Finally, let r be a (u,ϕ)-GPPM and let \({\scriptstyle \triangle }\in \mathcal {D}_{r}\). We denote by \(\boldsymbol {U}({\scriptstyle \triangle }) = \int_{\mathcal {Y}} \boldsymbol {u}(y)d{\scriptstyle \triangle }(y)\) the expected value of u. One may notice that \(\mathcal {D}_{r}= \left \{ {\scriptstyle \triangle } \mid \lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}<+\infty \right \}\).
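When △ has finite support, U(△) is just a weighted average of utility vectors, as in this small sketch (both the distribution and the utility function are hypothetical examples; the utility is the DCG one):

```python
# U(triangle) = E_{y ~ triangle}[u(y)]; for a finitely supported triangle
# this is a weighted average over the support.
support = {(2, 0): 0.25, (1, 1): 0.75}     # triangle over Y = {0, 1, 2}^2

def u(y):
    """DCG-style utility: u_i(y) = 2^{y_i} - 1."""
    return [2 ** yi - 1 for yi in y]

n = 2
U = [sum(p * u(y)[i] for y, p in support.items()) for i in range(n)]
assert U == [1.5, 0.75]    # 0.25*3 + 0.75*1 = 1.5 and 0.25*0 + 0.75*1 = 0.75
```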
3.1 Orderpreserving scoring losses
This result was already noticed in Cossock and Zhang (2008), where the authors advocated regression approaches for optimizing the DCG, and in Ravikumar et al. (2011), where the authors studied a generalization of regression losses based on Bregman divergences (see Eq. (5) below). This result emphasizes the fact that optimizing a GPPM is, in general, a much weaker objective than regressing the utility values: preserving the ordering induced by the utility function is sufficient. Consequently, it is natural to look for surrogate losses for which the inner risk is minimized only by scores which order the items like U: by the definition of calibration, any such loss is r-calibrated with any (u,ϕ)-GPPM r. This leads us to the following definition:
Definition 3.1
(OrderPreserving Loss)
 Pointwise template scoring losses:
$$ \forall {\bf v}\in\varGamma\subset \mathbb{R}^n, {\bf s}\in \mathbb{R}^n, \quad { \ell \left ({\bf v}, {\bf s}\right )} = \sum _{i=1}^n{ \lambda (v_i, s_i)} . $$
(3)
As mentioned in Sect. 2.1, one may take \(\varGamma= \mathbb{R}^{n}\) and λ(v _{ i },s _{ i })=(v _{ i }−s _{ i })^{2} as in Cossock and Zhang (2008). This template loss is obviously order-preserving since the optimal value of the scores is precisely the expected value of \({\bf v}\) (and thus U(△) when the template loss is instantiated).
We may also consider, given η>0, the form λ(v _{ i },s _{ i })=v _{ i } φ(s _{ i })+(η−v _{ i })φ(−s _{ i }), which is convex with respect to \({\bf s}\) for any \({\bf v}\) in Γ=[0,η]^{ n } if φ is convex. As we shall see in Sect. 5, this loss is order-preserving for many choices of φ, including the log-loss (t↦log(1+e ^{−t })), the exponential loss (t↦e ^{−t }) and differentiable versions of the hinge loss. The log-loss proposed in Kotlowski et al. (2011) in the context of bipartite instance ranking for optimizing the AUC follows the same idea as the latter pointwise losses. The surrogate regret bounds proved in Dembczynski et al. (2012) in the same ranking framework as the one we consider here apply to pointwise losses of a similar form, although with a value of η that depends on the supervision at hand.
 Pairwise template scoring losses:
$$ { \ell \left ({\bf v}, {\bf s}\right )} = \sum _{i<j} \lambda (v_i, v_j, s_i - s_j) $$
(4)
with \(\varGamma=\mathbb{R}_{+}^{n}\). For example, taking λ(v _{ i },v _{ j },s _{ i }−s _{ j })=(s _{ i }−s _{ j }−v _{ i }+v _{ j })^{2} also obviously leads to an order-preserving template loss. But we may also take λ(v _{ i },v _{ j },s _{ i }−s _{ j })=v _{ i } φ(s _{ i }−s _{ j })+v _{ j } φ(s _{ j }−s _{ i }) (the latter being convex with respect to \({\bf s}\) for any \({\bf v}\) in Γ whenever φ is so). Such a choice leads to an order-preserving template loss whenever φ is nonincreasing and differentiable with φ′(0)<0, and the infimum (over \({\bf s}\in \mathbb{R}^{n}\)) of \({ L\left ({\scriptstyle \triangle },\cdot\right )}\) is achieved for any △ (see Remark 2 below). Pairwise losses are natural candidates for surrogate scoring losses because they share a natural invariant with the scoring performance measure (invariance by translation of the scores).
 Listwise scoring losses: as proposed in Ravikumar et al. (2011), we may consider a general form of surrogate losses defined by Bregman divergences. Let \(\psi:\varGamma\subset \mathbb{R}^{n}\rightarrow \mathbb{R}\) be a strictly convex, differentiable function on a set Γ, and define the Bregman divergence associated to ψ by \(B_{\psi}({\bf v} \,\Vert\, {\bf s}) = \psi({\bf v}) - \psi({\bf s}) - \langle{\nabla\psi({\bf s})},{{\bf v}-{\bf s}}\rangle \). Let \({\bf g}:\mathbb{R}^{n} \rightarrow\varGamma\) be invertible and such that for any \({\bf s}\in \mathbb{R}^{n}\), \(s_{i}>s_{j} \Rightarrow g_{i}({\bf s})>g_{j}({\bf s})\). Then, we can use the following template loss:
$$ { \ell \left ({\bf v}, {\bf s}\right )} = B_\psi \bigl({\bf v}\,\Vert\,{\bf g}({\bf s}) \bigr) , $$
(5)
which is an order-preserving template loss (Ravikumar et al. 2011) as soon as the closure of \({\bf g}(\mathbb{R}^{n})\) contains Γ. This follows from a characterization of Bregman divergences by Banerjee et al. (2005): the expectation of a Bregman divergence (for a distribution over the left-hand argument) is uniquely minimized over the right-hand argument when the latter equals the expected value of the former.
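A minimal sketch of the pairwise template loss with the convex choice λ(v_i, v_j, s_i − s_j) = v_i φ(s_i − s_j) + v_j φ(s_j − s_i), here with the logistic φ, illustrating the translation invariance mentioned above (the function names and values are ours):

```python
import math

def phi(t):
    """Logistic loss t -> log(1 + e^{-t}): convex, decreasing, phi'(0) < 0."""
    return math.log(1 + math.exp(-t))

def pairwise_template_loss(v, s):
    """ell(v, s) = sum_{i<j} v_i*phi(s_i - s_j) + v_j*phi(s_j - s_i)."""
    n = len(s)
    return sum(v[i] * phi(s[i] - s[j]) + v[j] * phi(s[j] - s[i])
               for i in range(n) for j in range(i + 1, n))

v = [3.0, 0.0, 1.0]                    # utility values (hypothetical)
s = [1.2, -0.3, 0.5]                   # predicted scores
shifted = [si + 10.0 for si in s]
# Translation invariance: the loss depends only on score differences,
# just like the ranking produced by sorting the scores.
assert abs(pairwise_template_loss(v, s)
           - pairwise_template_loss(v, shifted)) < 1e-9
```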
Remark 2
(Pairwise Losses)
The categorization of surrogate scoring losses into “pointwise”, “pairwise” and “listwise” we use here is due to Cao and Liu (2007). Note, however, that the pairwise template loss we consider in (4) with λ(v _{ i },v _{ j },s _{ i }−s _{ j })=v _{ i } φ(s _{ i }−s _{ j })+v _{ j } φ(s _{ j }−s _{ i }) does not correspond to what is usually called the “pairwise comparison approach” to ranking, used in many algorithms including the very popular RankBoost (Freund et al. 2003) and Ranking SVMs (see e.g. Joachims 2002; Cao et al. 2006). Indeed, the latter can be written as \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i,j} {\bf1}_{v_{i}>v_{j}}\varphi(s_{i} - s_{j})\) (or some weighted version of this formula). This usual loss was shown to be non-calibrated with respect to the pairwise disagreement error for ranking by Duchi et al. (2010) for any convex φ in many general settings. This result shows that the loss is not order-preserving in general (because the pairwise disagreement error, when the supervision space is \(\left \{ 0,1 \right \}^{n}\), is minimized when we order the items according to their probability of belonging to class 1). On the other hand, with the form we propose in this paper, the inner risk for the u-instance of ℓ can be written as \(\mathcal {L}^{\boldsymbol {u}}({\scriptstyle \triangle }, {\bf s}) = \sum_{i=1}^{n}U_{i}\left ({\scriptstyle \triangle }\right )\sum_{j\neq i} \varphi(s_{i} - s_{j})\), which has the same form as the inner risk of the multiclass pairwise loss studied in Zhang (2004) and is order-preserving under the appropriate assumptions (Zhang 2004, Theorem 5).
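The rewriting of the inner risk used in this remark rests on a per-sample identity, Σ_{i<j} [v_i φ(s_i − s_j) + v_j φ(s_j − s_i)] = Σ_i v_i Σ_{j≠i} φ(s_i − s_j), which can be checked numerically (a sketch with arbitrary values; the names are ours):

```python
import math

def phi(t):                       # any phi works for the identity
    return math.exp(-t)

def loss_pair_form(v, s):
    """sum_{i<j} v_i*phi(s_i - s_j) + v_j*phi(s_j - s_i)."""
    n = len(s)
    return sum(v[i] * phi(s[i] - s[j]) + v[j] * phi(s[j] - s[i])
               for i in range(n) for j in range(i + 1, n))

def loss_grouped_form(v, s):
    """sum_i v_i * sum_{j != i} phi(s_i - s_j); taking expectations over v
    then replaces each v_i by U_i, giving the stated inner risk."""
    n = len(s)
    return sum(v[i] * sum(phi(s[i] - s[j]) for j in range(n) if j != i)
               for i in range(n))

v = [2.0, 0.5, 1.0, 0.0]
s = [0.3, -1.0, 0.7, 0.1]
assert abs(loss_pair_form(v, s) - loss_grouped_form(v, s)) < 1e-9
```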
Remark 3
(A note on terminology)
We use the qualifier order-preserving for scoring losses in a sense similar to that of Zhang (2004) in the context of multiclass classification. We may note that Ravikumar et al. (2011) use the term order-preserving to qualify a function \({\bf g}:\mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\) such that \(s_{i}>s_{j} \Rightarrow g_{i}({\bf s})>g_{j}({\bf s})\), which corresponds to a different notion than the one used here.
3.2 Calibration of orderpreserving losses
As already noticed, it follows from the definitions that if r is a (u,ϕ)-GPPM, then any loss that is order-preserving w.r.t. u is r-calibrated. The reverse implication is also true: given a measurable u, only a loss that is order-preserving w.r.t. u is calibrated with every (u,ϕ)-GPPM (that is, for every ϕ). This latter claim can be found, with different definitions, in Ravikumar et al. (2011, Lemmas 3 and 4). We summarize these results in the following theorem and give the proof for completeness:
Theorem 3.2
 1.
If ℓ is order-preserving w.r.t. u on Δ, then ℓ is r-calibrated on Δ.
 2.
If ϕ is strictly decreasing and ℓ is r-calibrated on Δ, then ℓ is order-preserving w.r.t. u on Δ.
Proof
The first claim and the remark on the template loss essentially follow from the definitions and from (2). For point 2, it is sufficient to show that for a given △∈Δ, if \(\operatorname {arg\,sort}({\bf s}) \nsubseteq \operatorname {arg\,sort}(\boldsymbol {U}({\scriptstyle \triangle }))\), then there is a c>0 such that \({ {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} \geq c\).
Note that, obviously, if a scoring loss is order-preserving w.r.t. u, then it is calibrated with any ranking performance measure such that \(\operatorname{arg\,sort}(\boldsymbol {U}({\scriptstyle \triangle })) \subset \operatorname{arg\,max}_{\sigma } \mathcal {R}({\scriptstyle \triangle }, \sigma )\). This gives us a full characterization of the ranking performance measures with respect to which order-preserving losses are calibrated.
While the order-preserving property is all we need for calibration w.r.t. a GPPM, one may ask whether it can be of use for two other widely used performance measures: the AP and the ERR. The question is important because, apart from the usual precision/recall at rank K and the (N)DCG, these are the most widely used measures in search engine evaluation. Unfortunately, the answer is negative:
Theorem 3.3
Let \(\mathcal {Y}=\{0,1\}^{n}\), \(\boldsymbol {u}: \mathcal {Y}\rightarrow \mathbb{R}^{n}_{+}\) and ℓ an order-preserving loss w.r.t. u on \(\mathcal {D}\). Then ℓ is calibrated with neither the \({\tt ERR} \) nor the AP.
The proof of Theorem 3.3 can be found in Appendix B.2. To the best of our knowledge, there is no existing study of the calibration of any surrogate scoring loss w.r.t. the ERR or the AP.
The theorem implies, in particular, that a regression approach cannot be calibrated with the ERR or the AP, whatever function of the relevance scores we try to regress. We believe that the theorem does not reveal any weakness of the class of order-preserving (template) losses, but rather provides strong evidence that these measures are difficult objectives for learning, and that score-and-sort approaches are probably not suited for optimizing such performance measures.
4 Calibration and uniform calibration
We provided a characterization of surrogate losses calibrated with GPPMs, as well as a characterization of the performance measures with respect to which these losses are calibrated. In this section, we investigate the stronger notion of uniform calibration, which gives a non-asymptotic guarantee and implies the existence of a surrogate regret bound (Steinwart 2007, Theorems 2.13 and 2.17). Afterwards, in Sect. 5, we derive explicit regret bounds for some specific popular surrogate losses. In fact, we express conditions on the supervision space under which uniform calibration w.r.t. a GPPM is equivalent to simple calibration w.r.t. the same GPPM for learning to rank.
The equivalence between calibration and uniform calibration with respect to the classification error has already been proved in Bartlett and Jordan (2006) for the binary case, and in Zhang (2004) and Tewari and Bartlett (2007) for the multiclass case. Both studies concerned margin losses, which are similar to the scoring losses we consider in this paper except that \(\mathcal {Y}\) is restricted (in our notation) to be the canonical basis of \(\mathbb{R}^{n}\) and u is the identity function. We extend these results to the case of GPPMs, but will not obtain an equivalence between calibration and uniform calibration in general, because of the more general form of scoring loss functions and the possible unboundedness of u. Yet, we are able to present a number of special cases, depending on the loss function and the considered set of distributions over the supervision space, where such an equivalence holds.
The existence of a surrogate regret bound independent of the data distribution (even without an explicit statement of the bound) is a critical tool in the proofs of consistency of structural risk minimization of the surrogate formulation in Bartlett and Jordan (2006), Zhang (2004), Tewari and Bartlett (2007). Indeed, if one performs empirical minimization of the surrogate risk in function classes that grow (sufficiently slowly) with the number of examples, so that the surrogate risk tends to its infimum, the surrogate regret bound is sufficient to show that the sequence of surrogate risk minimizers tends to have maximal performance. The major tool used in Bartlett and Jordan (2006) and Scott (2011) for deriving explicit regret bounds also corresponds precisely to proving uniform calibration. In our case, the criterion we develop for showing the equivalence between calibration and uniform calibration (Theorem 4.2) unfortunately does not lead to tight regret bounds. However, the following technical lemma, which we need to prove this criterion, will also prove crucial for the statement of explicit regret bounds.
Lemma 4.1
 1.
∀(i,j)≠(z,t)∈C _{ σ }, we have \(\left \{ i,j \right \}\cap \left \{ z,t \right \}=\emptyset\),
 2.
\(\forall(i,j) \in C_{\sigma}, U_{i}\left ({\scriptstyle \triangle }\right )>U_{j}\left ({\scriptstyle \triangle }\right ) \) and σ ^{−1}(i)>σ ^{−1}(j),
 3.
\({ {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },\sigma \right .)} \le\sum_{(i,j) \in C_{\sigma}} (U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )) ({\phi }(\nu ^{-1}(i)) - {\phi }(\nu^{-1}(j)) ) \).
Proof
For the proof, we will use the notation \(C_{\!{\scriptscriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\) for the set C _{ σ } to make all the dependencies clear. We prove the existence of \(C_{\!{\scriptscriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\) by induction on n, the number of items to rank. It is easy to see that the result holds for \(n\in \left \{ 1,2 \right \}\). Let n>2 and assume that the result holds for any k<n.
Since j is the item with the worst true rank among the top-d predicted items, we can only decrease the performance by exchanging it, in the predicted ranking, with the top-ranked item. More formally, denoting by \(\tau_{wz} \in {\mathfrak {S}_{n}}\) the transposition of w and z, we thus have \({ R\left .({\scriptstyle \triangle },\sigma \right .)} \geq { R\left .({\scriptstyle \triangle },\sigma \circ\tau_{1p}\right .)}\) (σ∘τ _{1p } is the ranking created by exchanging the items at predicted ranks 1 and p=σ ^{−1}(j)).
 Case i≠j

In that case, define \(r^{{\scriptscriptstyle \prime }}\) as a \((\boldsymbol {u}^{{\scriptscriptstyle \prime }}, {\phi }^{{\scriptscriptstyle \prime }})\)-GPPM on lists of items of size n−2, such that \(\boldsymbol {u}^{{\scriptscriptstyle \prime }}\), \({\phi }^{{\scriptscriptstyle \prime }}\), \(\nu^{{\scriptscriptstyle \prime }}\) and \(\sigma ^{{\scriptscriptstyle \prime }}\) are equal to u, ϕ, ν and σ on indices different from i and j, up to an appropriate reindexing of the remaining n−2 items. Using the induction hypothesis, we can find a set \(C_{\!{\scriptstyle \triangle },\sigma ^{{\scriptscriptstyle \prime }}}^{\boldsymbol {u}^{{\scriptscriptstyle \prime }},{\phi }^{{\scriptscriptstyle \prime }}}\!(\nu ^{{\scriptscriptstyle \prime }})\) satisfying the three conditions of the lemma, to which we add the pair (i,j) after reindexing to build \(C_{\!{\scriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\). Notice that for now, we do not exactly meet condition 2, since we only have \(\forall(i,j) \in C_{\!{\scriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu ), U_{\sigma(i)}\left ({\scriptstyle \triangle }\right )\geq U_{\sigma(j)}\left ({\scriptstyle \triangle }\right )\), while condition 2 requires a strict inequality. However, if \(U_{\sigma(i)}\left ({\scriptstyle \triangle }\right )= U_{\sigma(j)}\left ({\scriptstyle \triangle }\right )\) for some pair (i,j) in \(C_{\!{\scriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\), then the pair has no influence on the bound and can simply be discarded.
 Case i=j

In that case, define \(r^{{\scriptscriptstyle \prime }}\) as a \((\boldsymbol {u}^{{\scriptscriptstyle \prime }}, {\phi }^{{\scriptscriptstyle \prime }})\)-GPPM on lists of items of size n−1. We then directly apply the induction hypothesis, ignoring the top-ranked element and considering the set of pairs on the remaining n−1 elements.
An important characteristic of the set C _{ σ } in the lemma is condition 1, which ensures that the pairs (i,j) are independent (each index i appears in at most one pair). This condition is critical in the derivation of the explicit surrogate bounds of the next section. Another important technical feature of the bound is that it is based on misordered pairs, and can thus be applied to any loss. In contrast, the bounds on DCG suboptimality used in Cossock and Zhang (2008) or Ravikumar et al. (2011) depend on how well (a function of) the score vector \({\bf s}\) approximates U(△), a bound which, consequently, can only be used for regression-like template losses.
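The pair-based decomposition can be checked numerically on a toy instance. The sketch below (Python, with made-up utilities and the standard DCG discount as an illustrative GPPM) verifies the bound of point 3 in the simple case where the predicted ranking differs from the optimal one by a single transposition, so that C _{ σ } contains a single pair; in this case the bound is tight:

```python
import math

# Toy example (n = 3 items) for a DCG-like GPPM: the performance sums
# U[item] * phi(rank of item), with discount phi(k) = 1 / log2(1 + k).
def phi(rank):
    return 1.0 / math.log2(1.0 + rank)

def perf(U, sigma):
    # sigma is the predicted ordering: sigma[r] = item placed at rank r+1
    return sum(U[item] * phi(r + 1) for r, item in enumerate(sigma))

U = [3.0, 2.0, 1.0]            # made-up expected utilities, sorted
optimal = [0, 1, 2]            # optimal ranking: decreasing utility
sigma = [1, 0, 2]              # predicted ranking swaps items 0 and 1

regret = perf(U, optimal) - perf(U, sigma)

# Here C_sigma = {(0, 1)}, and the bound of point 3 reads
# (U_0 - U_1) * (phi(1) - phi(2)); for a single transposition it is tight.
bound = (U[0] - U[1]) * (phi(1) - phi(2))
assert regret <= bound + 1e-12
```
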
We are now ready to give a new characterization of uniform calibration w.r.t. GPPMs. This characterization is easier to handle than the initial definition of uniform calibration. We note here that it applies equally to losses of arbitrary structure, and thus also to non-template losses.
Theorem 4.2
 (a)There is a function \(\delta:\mathbb{R}_{+}\rightarrow \mathbb{R}_{+}\) s.t. ∀ε>0,δ(ε)>0 and:$$ \forall i\neq j, \forall {\scriptstyle \triangle }\in {\Delta }_{i,j}(\varepsilon), \quad { {\underline {L}}\left ({\scriptstyle \triangle }\right )} +\delta(\varepsilon) \leq \inf_{{\bf s}\in \varOmega _{i,j}} { L\left ({\scriptstyle \triangle },{\bf s}\right )} . $$
 (b)
ℓ is uniformly r-calibrated on Δ.
Proof
We start with (a) ⇒ (b). Fix ε>0, \({\bf s}\in \mathbb{R}^{n}\) and △∈Δ. From (a), we know that if \({ L\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline {L}}\left ({\scriptstyle \triangle }\right )} < \delta(\varepsilon)\) then for any i,j satisfying \((U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right ) )(s_{i}-s_{j})\leq0\), we have \(\lvert U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\rvert <\varepsilon\). By Lemma 4.1, we obtain \({ {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} < \frac{n}{2}{\phi }(1) \varepsilon\), since there are at most n/2 non-overlapping pairs of indexes (i,j), i≠j, in \(\left \{ 1,\ldots,n \right \}\), and ϕ(i)−ϕ(j)≤ϕ(1) for any i,j. This bound being independent of △, this proves the uniform calibration of ℓ w.r.t. r on Δ.
We now prove (b) ⇒ (a) when ∀0<i<n,ϕ(i)>ϕ(i+1) by contraposition. Suppose (a) does not hold. Then, we can find ε>0, a sequence (i _{ k },j _{ k })_{ k≥0} with i _{ k }≠j _{ k } for all k, and a sequence (△_{ k })_{ k≥0} with \(\forall k, {\scriptstyle \triangle }_{k}\in {\Delta }_{i_{k},j_{k}}(\varepsilon)\), satisfying \(\inf_{{\bf s}\in \varOmega _{i_{k},j_{k}}}{ L\left ({\scriptstyle \triangle }_{k},{\bf s}\right )} - { {\underline {L}}\left ({\scriptstyle \triangle }_{k}\right )} \mathop{\longrightarrow}\limits_{k\rightarrow+\infty} 0\).
Using this new characterization, we now address the problem of finding losses ℓ and sets of distributions Δ such that if ℓ is r-calibrated on Δ for some GPPM r, then condition (a) of Theorem 4.2 holds, implying uniform calibration and the existence of a regret bound. The interest of the characterization of Theorem 4.2 is that, in some cases, it is satisfied by large families of losses. Before turning to examples, we provide the main corollary. Examples for more specific losses or supervision spaces are given in Corollary 4.5 and in Appendix A.
Corollary 4.3
 1.
Δ is compact;
 2.
the map \(\displaystyle \left ( \begin{array}{rcl} {\Delta }& \rightarrow & \mathbb{R}^n_+ \\ {\scriptstyle \triangle }&\mapsto& \boldsymbol {U}({\scriptstyle \triangle }) \end{array} \right )\) is continuous;
 3.
\(\forall i, j, \displaystyle \left ( \begin{array}{rcl} {\Delta }& \rightarrow&\mathbb{R}\\ {\scriptstyle \triangle }& \mapsto& \displaystyle\!\!\inf_{{\bf s}\in \varOmega _{i,j}}\!\!{ L\left ({\scriptstyle \triangle },{\bf s}\right )}  { {\underline {L}}\left ({\scriptstyle \triangle }\right )} \end{array} \right )\) is continuous, with Ω _{ i,j } defined by (8).
Proof
Since uniform calibration implies calibration, we only have to show the “only if” part.
First, we show using conditions 1 and 2 that for any ε>0 and any i,j, the set Δ_{ i,j }(ε) defined by (7) is compact. Since U is continuous on Δ and Δ is compact, U(Δ) is a compact subset of \(\mathbb{R}_{+}^{n}\). Therefore, U(Δ) is bounded. Let B=sup_{△∈Δ}∥U(△)∥_{∞} and consider now the function \(h_{i,j}({\scriptstyle \triangle }) = U_{i}\left ({\scriptstyle \triangle }\right )-U_{j}\left ({\scriptstyle \triangle }\right )\). h _{ i,j } is continuous from Δ to \(\mathbb{R}\) with Δ compact. Therefore, h _{ i,j } is a proper map, i.e. the preimage of any compact is compact (see e.g. Lee 2003, Lemma 2.14, p. 45). Thus, \({\Delta }_{i,j}(\varepsilon) = h_{i,j}^{-1}([\varepsilon, B])\) is compact in Δ.
We now proceed to the proof of the result. Let i≠j and denote by \(g_{i,j}:{\Delta }\rightarrow \mathbb{R}\) the function defined in condition 3. Since ℓ is r-calibrated on Δ, we have g _{ i,j }(△)>0 for any △∈Δ_{ i,j }(ε) as soon as ε>0. Since g _{ i,j } is continuous and Δ_{ i,j }(ε) is compact, g _{ i,j }(Δ_{ i,j }(ε)) is a compact subset of \(\mathbb{R}\) and its minimum is attained. Defining δ(ε)=min_{ i≠j }ming _{ i,j }(Δ_{ i,j }(ε)), we thus have δ(ε)>0. Using Theorem 4.2, this proves that ℓ is uniformly r-calibrated. □
Corollary 4.3 gives conditions on the accepted form of supervision (conditions 1 and 2) and on the loss structure (condition 3) under which r-calibration on Δ for a GPPM r implies uniform r-calibration on Δ. Conditions 1 and 2 are obviously satisfied when the supervision space is finite, and, as we shall see later, condition 3 is then automatically satisfied as well. We may also expect the same result to hold when we restrict U to be bounded. The case of a finite supervision space is treated below; the more technical case of an infinite supervision space is detailed in Appendix A. We first recall the following result, which will help us discuss these special cases:
Lemma 4.4
(Zhang 2004, Lemma 27) Let K>0, and let \(\psi_{k}:\mathbb{R}\rightarrow \mathbb{R}_{+}, k=1,\ldots,K\) be K continuous functions. Let \(\varOmega\subseteq \mathbb{R}^{n}\), Ω≠∅ and \(\mathcal{Q}\) be a compact subset of \(\mathbb{R}_{+}^{K}\). Then, the function \({\underline {\varPsi}}\) defined as \(\left ( \begin{array}{rcl} \mathcal{Q} & \rightarrow& \mathbb{R}_{+}\\ {\bf q} &\mapsto& \displaystyle\inf_{{\bf s}\in \varOmega}\sum_{k=1}^K q_k\psi_k({\bf s}) \end{array} \right ) \) is continuous.
From now on, we suppose that the supervision space \(\mathcal {Y}\) is finite. Then \({\Delta }=\mathcal {D}\) can be identified with the \(\mathcal {Y}\)-simplex, which is compact under its natural topology. Moreover, in that case, U is necessarily continuous with respect to this topology on Δ, and thus conditions 1 and 2 of Corollary 4.3 are satisfied. The only remaining question is whether the class of loss functions we consider satisfies condition 3, a question which is settled by Lemma 4.4. We can now give a full answer to the question of uniform calibration w.r.t. a GPPM when the supervision space is finite:
Corollary 4.5
 1.
ℓ is r-calibrated on Δ if and only if it is uniformly r-calibrated on Δ.
 2.
If ℓ is order-preserving w.r.t. u on Δ, then it is uniformly r-calibrated on Δ.
 3.
If ϕ(i)>ϕ(i+1) for all 0<i<n, then ℓ is order-preserving w.r.t. u on Δ if and only if it is uniformly r-calibrated on Δ.
Proof
Since \(\mathcal {Y}=\left \{ y_{1}, \ldots, y_{K} \right \}\) is finite (with \(K=\lvert \mathcal {Y}\rvert \)), we already showed that both conditions 1 and 2 of Corollary 4.3 are satisfied, identifying \(\mathcal {D}\) with the K-simplex. Then, for any scoring loss, we have \({ L\left ({\scriptstyle \triangle },{\bf s}\right )} = \sum_{k=1}^{K} {\scriptstyle \triangle }(\{y_{k}\}) \ell (y_{k}, {\bf s})\), which satisfies condition 3 of Corollary 4.3 by Lemma 4.4. Thus, using Corollary 4.3, we know that for any (u,ϕ)-GPPM r, ℓ is r-calibrated if and only if it is uniformly r-calibrated, which gives the first claim of the corollary.
The second claim comes from the fact that an order-preserving loss is calibrated with any GPPM. The third claim comes from the fact that only order-preserving losses are calibrated w.r.t. a (u,ϕ)-GPPM with ϕ(i)>ϕ(i+1) for all 0<i<n, together with the equivalence of r-calibration and uniform r-calibration when the supervision space is finite. □
This result shows that when the supervision space is finite, any surrogate loss calibrated with respect to a GPPM admits a regret bound; thus, any such loss has non-asymptotic guarantees. Since the exact form of the regret bound depends on the loss at hand, Corollary 4.5 is the strongest result we can obtain for arbitrary losses. We refer to Appendix A for a similar result concerning template losses in a special case where the supervision space is infinite. In the next section, we provide more quantitative surrogate regret bounds for specific template losses.
5 Surrogate regret bounds
The previous section dealt with the existence of surrogate regret bounds through the study of uniform calibration. We now derive practical surrogate regret bounds for commonly used template surrogate losses: pointwise, pairwise, or losses that can be written as a Bregman divergence. As in classification, the main idea is to find a convex lower bound of the surrogate regret as a function of the performance regret. However, contrary to classification, computing the calibration function as in Steinwart (2007) or the Ψ-transform as in Bartlett and Jordan (2006) is actually infeasible. If one tries to find the function δ of Theorem 4.2, the resulting bound is worse than the ones we reach in this section, because it does not use non-overlapping pairs of indexes.
 For any distribution P on \(\mathcal {X}\times \mathcal {Y}\) and prediction function \({\bf f}\),$$\mathcal {L}^{\boldsymbol {u}}(P, {\bf f}) = \int_{\mathcal {X}} \int _{\mathcal {Y}} { \ell ^{\boldsymbol {u}}\left (y, {\bf f}(x)\right )} \mathrm{dP}( y,x) = \int_{\mathcal {X}} \int_{\mathcal {Y}} { \ell \left (\boldsymbol {u}(y), {\bf f}(x)\right )} \mathrm{dP}(y,x), $$
 For any \({\scriptstyle \triangle }\in \mathcal {D}\), and \({\bf s}\in \mathbb{R}^{n}\),$${ L^{\boldsymbol {u}}\left ({\scriptstyle \triangle },{\bf s}\right )} = \int_{\mathcal {Y}} { \ell ^{\boldsymbol {u}}\left (y, {\bf s}\right )} {\rm d}{\scriptstyle \triangle }(y) = \int _{\mathcal {Y}} { \ell \left (\boldsymbol {u}(y), {\bf s}\right )} {\rm d}{\scriptstyle \triangle }(y). $$
Summary of surrogate regret bounds. Recalling that the v _{ i } are upper-bounded, η can be chosen as \(\eta> \max_{i}U_{i}\left ({\scriptstyle \triangle }\right )\), and φ _{ α } is a differentiable version of the hinge loss, where \(\alpha\in (0,\frac{ \eta }{ 2 } )\) is a parameter to choose: φ _{ α }(x)=0 if x≤0, \(\varphi_{\alpha}(x) =\frac{x^{2}}{2\alpha}\) if x∈[0,α], and \(\varphi_{\alpha}(x) =x - \frac{\alpha}{2}\) otherwise
Pointwise Losses (3): \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i=1}^{n} { \lambda (v_{i}, s_{i})} \)

Name | λ(v _{ i },s _{ i }) | c
Squared Error | (v _{ i }−s _{ i })^{2} | \(\sqrt{2}\)
Logistic | \(v_{i} \log(1 + e^{-s_{i}}) + (\eta - v_{i}) \log(1 + e^{s_{i}})\) | \(\sqrt{\eta }\)
Exponential | \(v_{i} e^{-s_{i}} + (\eta - v_{i}) e^{s_{i}}\) | \(\sqrt {\eta }\)
Square Hinge | v _{ i }max(0,t−s _{ i })^{2}+(η−v _{ i })max(0,s _{ i })^{2} | \(\frac{\sqrt{2\eta }}{t}\)
Differentiable Hinge | v _{ i } φ _{ α }(1−s _{ i })+(η−v _{ i })φ _{ α }(s _{ i }) | \(4\sqrt{\frac{\eta }{\alpha}}\)

Pairwise Losses (4): \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i<j} \lambda (v_{i}, v_{j}, s_{i} - s_{j}) \)

Name | λ(v _{ i },v _{ j },d _{ ij }) | c
Squared Error | (v _{ i }−v _{ j }−d _{ ij })^{2} | 1
Logistic | \(v_{i} \log(1 + e^{-d_{ij}}) + v_{j} \log(1 + e^{d_{ij}})\) | \(2\sqrt{\lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}}\)
Exponential | \(v_{i} e^{-d_{ij}} + v_{j} e^{d_{ij}}\) | \(2\sqrt{\lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}}\)

Bregman Divergence (5): \({ \ell \left ({\bf v}, {\bf s}\right )} = B_{\psi}({\bf v}\,\Vert\, {\bf g}({\bf s}) )\)

ψ(.) | c
μ-strongly convex (12) | \(\frac{2}{\sqrt {\mu}}\)
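As a concrete reading aid for Table 2, here is a minimal Python transcription of the differentiable hinge φ _{ α } defined above (the value α = 0.5 is an arbitrary illustration). The checks confirm that the pieces agree at the breakpoints and that the one-sided slopes match at x = α, which is what makes this hinge differentiable:

```python
# Piecewise definition of phi_alpha from Table 2:
# phi_alpha(x) = 0 for x <= 0, x^2 / (2*alpha) on [0, alpha],
# and x - alpha/2 beyond alpha.
def phi_alpha(x, alpha):
    if x <= 0.0:
        return 0.0
    if x <= alpha:
        return x * x / (2.0 * alpha)
    return x - alpha / 2.0

alpha = 0.5
# The pieces agree at the breakpoints, so phi_alpha is continuous...
assert phi_alpha(0.0, alpha) == 0.0
assert abs(phi_alpha(alpha, alpha) - alpha / 2.0) < 1e-12
# ...and the one-sided slopes at x = alpha both equal 1 (numerically),
# so the quadratic part joins the linear part smoothly.
h = 1e-6
left = (phi_alpha(alpha, alpha) - phi_alpha(alpha - h, alpha)) / h
right = (phi_alpha(alpha + h, alpha) - phi_alpha(alpha, alpha)) / h
assert abs(left - right) < 1e-4
```
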
5.1 Regret bounds for common surrogate losses
We first give a summary of the different bounds obtained in the remainder of the section for pointwise losses, Bregman divergences, and pairwise losses, and then present the three methods used on these families of losses to achieve the bounds.
Table 2 details the different examples of Bregman divergences, pointwise losses and pairwise losses satisfying the surrogate regret bound (9), by giving the constant c. The methods for achieving such bounds are detailed in the remainder of the section: Theorem 5.2 for pointwise losses, Theorem 5.3 for Bregman divergences, and Theorem 5.4 for pairwise losses. The proofs ensuring that the surrogate losses given in Table 2 satisfy the assumptions of the corresponding theorems are given in Appendix B.
The differences in the constant factor c come from the fact that it represents a scaling factor between the surrogate loss and the expected utilities; the magnitude of the loss may vary considerably from one loss to another. Furthermore, the bounds on the pointwise Square Hinge and the pointwise Differentiable Hinge depend respectively on t and α. Indeed, these parameters control the range within which the optimal scores vary, and hence the scaling between the optimal scores and the expected utilities.
Notice that C _{ ϕ }(p) is generally strictly lower than ∥ϕ∥_{ p }; thus, for the pointwise Squared Error, our approach allows us to obtain a slightly better bound than in Cossock and Zhang (2008, Theorem 2). The regret bound for the pointwise Squared Error is a crucial result, since it helps to obtain the regret bounds for Bregman divergences. This explains why, applying the method of Ravikumar et al. (2011, Theorem 10) for Bregman divergences in our Theorem 5.3, we also reach a slightly better bound than theirs. Finally, for pairwise losses, to the best of our knowledge, no bound had previously been proposed.
5.2 General results to derive regret bounds
We now describe the methods that allow us to derive the results of Table 2. The main argument is to combine a lower bound on the surrogate regret with the upper bound on the performance regret given by Lemma 4.1. We always use the same upper bound on the performance regret deduced from Lemma 4.1, so we state it here as the following lemma. Afterwards, we only need to work on the surrogate regret to obtain the bounds.
Lemma 5.1
The proof can be found in Appendix B.3.
We first treat the case of pointwise losses, then Bregman divergences and finally pairwise losses.
Specific order-preserving pointwise losses
Theorem 5.2
Let r be a (u,ϕ)-GPPM and ℓ a pointwise template loss. If there exist c>0 and q≥1 such that for any \({\scriptstyle \triangle }\in \mathcal {D}_{r, \ell }\),
Proof
Bregman divergence
Theorem 5.3
Proof
Specific order-preserving pairwise losses
In this section, we study the popular family of pairwise losses (see (4)) through two subfamilies. We propose the first one to overcome the non-consistency of the classic pairwise hinge loss brought to light by Duchi et al. (2010). The second one is simply a mean squared error on pairs of indexes.
Pairwise surrogate losses integrate complex correlations between the different dimensions of the predicted score vector during optimization. This is why it is not immediate to benefit from the independence given by the bound of Lemma 4.1. For pairwise surrogate losses, the main idea of the method is to treat them as pointwise losses on pairs of items with some additional constraints, and then to compare the optima of the loss with and without the constraints.
Theorem 5.4
 1.
\({ {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} = \inf_{{\bf d}\in D} \sum_{i<j} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} = \inf_{{\bf d}\in \mathbb{R}^{n} \times \mathbb{R}^{n}} \sum_{i<j} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})}\)
 2.There exist c>0 and q≥1 such that$$ \inf_{d_{ij} \le0} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})}  \inf_{d_{ij} \in \mathbb{R}} { \varLambda ^{u_{i}, u_{j}}({\scriptstyle \triangle }, d_{ij})} \ge \frac{1}{c^q} \lvert U_{i}\left ( {\scriptstyle \triangle }\right )  U_{j}\left ({\scriptstyle \triangle }\right )\rvert ^q $$
Proof
6 Discussion and related work
In this section, we discuss the most closely related works, and then summarize our results and discuss some of their practical implications.
Surrogate regret bounds for learning to rank
Calibration and uniform calibration have been extensively studied in (cost-sensitive) binary classification (see e.g. Bartlett and Jordan 2006; Zhang 2004; Steinwart 2007; Scott 2011) and multiclass classification (Zhang 2004; Tewari and Bartlett 2007). In the context of learning to rank, the calibration of surrogate losses has previously been studied by Cossock and Zhang (2008), Duchi et al. (2010), and Ravikumar et al. (2011). Cossock and Zhang (2008) proved the calibration of some variants of regression losses based on the mean squared error with respect to the DCG, and proved the first surrogate regret bound for ranking. In Ravikumar et al. (2011), the authors generalized this work to obtain the calibration of losses based on Bregman divergences (which include the squared error loss) with respect to the (N)DCG, and provided surrogate regret bounds for this class of surrogate losses. In this paper, we extend the work of Ravikumar et al. (2011) in several ways. First, we consider a wider class of ranking performance measures, the GPPMs, essentially by noticing that it is not necessary to restrict the supervision to relevance judgments. Second, we consider a much larger class of surrogate losses (the order-preserving ones), which, in particular, are not constrained to have a unique minimizer. Relaxing these two assumptions, we obtain a new and general result on the existence of surrogate regret bounds for any loss calibrated with respect to a GPPM when the supervision space is finite, through the equivalence of calibration and uniform calibration for GPPMs (Corollary 4.3). Furthermore, our deeper study of the performance measures (Lemma 4.1) allows us to prove both slightly better regret bounds than Zhang (2004) and Ravikumar et al.
(2011) for the mean squared regression and for the Bregman divergences, as well as new regret bounds for other forms of surrogate losses, such as pairwise losses or pointwise losses that do not have a unique minimizer (Sect. 5). While all these works studied the DCG, Dembczynski et al. (2012) proved regret bounds for pointwise losses in the special case of the AUC metric. The pointwise losses they consider are similar to the ones we consider in Sect. 5 (the difference being that, in their work, the values of η in these losses are not constant). While our proof technique could be adapted to their specific loss, the bounds we prove are more general, since they apply to a larger variety of losses and to different performance measures.
We may note here that surrogate regret bounds have also been studied in another context of learning to rank, namely instance ranking (Clemençon et al. 2005; Kotlowski et al. 2011). Instance ranking, of which bipartite ranking is the best-known example (the case with binary relevance judgments), is a framework where the prediction task is to order a single set (the sample space itself), and learning is carried out based on an i.i.d. sample from this set. In contrast, in the task we consider here, the goal is to predict the ordering of a finite set for each instance, and learning is carried out using an i.i.d. sample of such instances with a supervision that indicates how to rank the finite set given the instance. The evaluation measures for instance ranking are usually the Area Under the ROC Curve, or more generally linear rank statistics (Clémençon and Vayatis 2007), which are similar in nature to what we call GPPMs. However, since the underlying sampling assumptions are different in instance ranking and in the framework we consider here, all the notions of inner risks are different, and the analyses carried out in one framework do not apply to the other.
Fitting utility values
When the supervision takes the form of relevance scores on a discrete scale (as is usual in search engine applications), it may be natural to simply try to fit them, for instance using classification or ordinal regression approaches. In the presence of noise, however, our results show that one should not try to predict the value of the label, but rather its corresponding utility. More precisely, one should learn to rank according to the expected value of the utility; fitting the expected value of the utilities, for instance by minimizing the squared error, leads to a calibrated formulation, but it is only a special case of what one can do: in general, applying any order-preserving template loss is valid. Considering that many performance measures are GPPMs, for instance the (N)DCG, the precision-at-k, the recall-at-k, the AUC, or Spearman's rank correlation coefficient (see Table 1), our results allow us to provide template calibrated surrogate losses that can be easily instantiated for each of these measures (Sects. 2 and 3).
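The "rank by expected utility" principle can be illustrated with a small brute-force check in Python: with the DCG discount ϕ(k) = 1/log₂(1+k), sorting items by decreasing expected utility attains the maximum of the expected GPPM over all permutations, as guaranteed by the rearrangement inequality (the expected utilities below are made-up illustrative numbers):

```python
import itertools
import math

# Expected GPPM of a ranking, DCG-style: sum of expected utility times
# the discount phi(k) = 1 / log2(1 + k) at the item's rank k.
def dcg(expected_utility, sigma):
    return sum(expected_utility[item] / math.log2(1 + r + 1)
               for r, item in enumerate(sigma))

U = [0.4, 2.1, 1.3, 0.7]       # made-up expected utilities E[u_i]
best = max(dcg(U, s) for s in itertools.permutations(range(4)))
by_utility = sorted(range(4), key=lambda i: -U[i])
# Sorting by expected utility is optimal (rearrangement inequality).
assert abs(dcg(U, by_utility) - best) < 1e-12
```

The same check goes through for any nonnegative decreasing discount ϕ, which is exactly what makes score-and-sort with expected utilities valid for GPPMs.
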
Another important result we obtain in the paper is the non-calibration, with respect to the AP and the ERR, of any surrogate loss that tries to reproduce the order given by the relevance judgments (Theorem 3.3). Importantly, the non-calibration holds for any utility function that one may associate to these metrics. Despite the importance of these measures in search engine evaluations, our result thus proves that many common surrogate losses used in learning-to-rank algorithms are not AP- or ERR-calibrated.
Consequently, the exact form of the supervision we have for the problem at hand, which may be relevance judgments, a preference relation, or total orders, does not dictate the kind of algorithm we should use. Spearman's correlation coefficient (see Table 1), which considers total orders as supervision, is actually a GPPM, and thus any template order-preserving loss can be calibrated with respect to it. This contrasts with the case of the ERR or the AP, with respect to which no order-preserving loss is calibrated, even though these performance measures consider real-valued relevance judgments as their supervision.
Pairwise losses
As already mentioned in Remark 2, a traditional approach to learning to rank is to use pairwise-comparison-based losses, as in Ranking SVMs or RankBoost (Joachims 2002; Freund et al. 2003; Cao et al. 2006). To take a concrete example, consider the case where the supervision is a vector of relevance judgments. The idea of pairwise-comparison-based losses is then to take a loss of the form \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i,j} {\bf 1}_{\left \{ v_{i}>v_{j} \right \}}\varphi(s_{i} - s_{j})\) (\(\bf v\) here takes the place of the supervision, or of any monotonic transform of it). The motivation of these approaches is that only the relative ordering between any two items matters for ranking, so it is somewhat natural to only consider the relative ordering given by the supervision for learning. However, such losses are not order-preserving when φ is convex (see Remark 2; this result is actually a direct consequence of the non-calibration result of Duchi et al. (2010)), and they are consequently not calibrated with respect to any GPPM. This is why in this work we propose an alternative formulation, \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i<j} ( v_{i}\varphi(s_{i} - s_{j}) + v_{j}\varphi(s_{j}-s_{i}) )\), which is convex when the values of v _{ i } are nonnegative and φ is convex, and which, as we show in Sect. 5, is also order-preserving. Consequently, this alternative formulation provides a template loss whose instances are calibrated with respect to any GPPM. Notice that, from a computational perspective, the two losses (the initial formulation and the alternative we propose here) are comparable, and we thus strongly encourage practitioners to consider the alternative formulation in practice.
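The two formulations above can be sketched in a few lines of Python, using the convex logistic choice φ(t) = log(1 + e^{−t}) and made-up relevance and score vectors (this is an illustration of the two formulas, not an implementation taken from the paper):

```python
import math

# Convex surrogate phi(t) = log(1 + exp(-t)), small when t is large.
def logistic(t):
    return math.log1p(math.exp(-t))

def classic_pairwise(v, s):
    # Classic formulation: sum over pairs with v_i > v_j of phi(s_i - s_j).
    n = len(v)
    return sum(logistic(s[i] - s[j])
               for i in range(n) for j in range(n) if v[i] > v[j])

def order_preserving_pairwise(v, s):
    # Proposed alternative: sum over i < j of
    # v_i * phi(s_i - s_j) + v_j * phi(s_j - s_i).
    n = len(v)
    return sum(v[i] * logistic(s[i] - s[j]) + v[j] * logistic(s[j] - s[i])
               for i in range(n) for j in range(i + 1, n))

v = [2.0, 1.0, 0.0]            # relevance judgments (utilities)
s_good = [3.0, 1.0, -1.0]      # scores agreeing with the relevances
s_bad = [-1.0, 1.0, 3.0]       # scores reversing them

# Both losses prefer the scores that agree with the supervision; only
# the alternative formulation, however, is order-preserving in general.
assert order_preserving_pairwise(v, s_good) < order_preserving_pairwise(v, s_bad)
assert classic_pairwise(v, s_good) < classic_pairwise(v, s_bad)
```
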
Limitations of scoring approaches for ranking?
The difficulty of designing (convex) surrogate formulations for the score-and-sort approach to ranking has previously been addressed in Duchi et al. (2010), where the authors show that a number of existing surrogate losses are not calibrated with respect to the pairwise disagreement, a performance measure used when the supervision contains arbitrary pairwise preferences, which counts the number of pairs of items for which the predicted ordering does not match the supervision. Duchi et al. (2010) also conjecture that no convex loss of the scores can be calibrated with respect to the pairwise disagreement. In this work, we prove additional results concerning the possible limitations of scoring approaches: no order-preserving loss can be calibrated with respect to the AP or the ERR in general (Theorem 3.3). While this suggests that approaches other than scoring may be useful for these evaluation measures, it also gives new insights on the intrinsic limitations of scoring approaches (in particular regression approaches) for information retrieval.
7 Conclusion
Calibration, uniform calibration and surrogate regret bounds are crucial tools to assess the quality of surrogate losses. We have proposed an in-depth study of the calibration of order-preserving losses with respect to GPPMs.
A large body of work remains to be done in learning to rank. As Duchi et al. (2010) pointed out, learning from pairwise preferences is still an open issue unless strong assumptions are made on the preference relations available as supervision. Closer to our work, designing losses with better regret bounds for GPPMs with a cutoff (i.e. ϕ(i)=0 for i>k with k≪n), as in Cossock and Zhang (2008), but without any strong prior knowledge of which items should be ranked first and while keeping easy-to-optimize surrogate losses, remains critical in many applications and mostly an open problem.
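As an illustration of the cutoff setting, a DCG@k whose discount vanishes beyond position k can be sketched as follows (the gain 2^v − 1 and the log2 discount are one standard choice, not necessarily the exact GPPM instantiation used in the paper):

```python
import math

def dcg_at_k(relevances, scores, k):
    # Rank items by decreasing score, then sum discounted gains; the
    # discount is zero beyond position k, i.e. the cutoff phi(i) = 0
    # for i > k discussed above.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum((2 ** relevances[item] - 1) / math.log2(rank + 2)
               for rank, item in enumerate(order[:k]))

print(dcg_at_k([2, 1, 0], [3.0, 2.0, 1.0], 2))
```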
Footnotes
 1.
The rearrangement inequality states that for any real numbers x _{1}≥⋯≥x _{ n }≥0 and y _{1}≥⋯≥y _{ n }, and for any permutation \(\sigma\in {\mathfrak {S}_{n}}\), we have x _{1} y _{ σ(1)}+⋯+x _{ n } y _{ σ(n)}≤x _{1} y _{1}+⋯+x _{ n } y _{ n } (the dot product is maximized by pairing larger x _{ i }'s with larger y _{ i }'s). Moreover, if the x _{ i }'s are strictly decreasing, then equality holds if and only if y _{ σ(1)}≥⋯≥y _{ σ(n)}.
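A quick numerical sanity check of the inequality on a small example (illustrative only):

```python
# Verify the rearrangement inequality exhaustively for n = 3.
from itertools import permutations

x = [3.0, 2.0, 1.0]  # strictly decreasing, non-negative
y = [5.0, 4.0, 2.0]  # decreasing
identity_dot = sum(a * b for a, b in zip(x, y))
# Every permutation sigma satisfies sum_i x_i * y_sigma(i) <= x . y.
assert all(sum(a * y[j] for a, j in zip(x, sigma)) <= identity_dot
           for sigma in permutations(range(3)))
```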
 2.
There is no loss in generality here. If u does not satisfy this assumption, we can simply increase \(\mathcal {Y}\) and extend u so that the assumption is satisfied. We will then show that the equivalence between calibration and uniform calibration holds on a larger space of distributions.
 3.
We simply have to identify the sets of △ that have the same expected value under u and assimilate each such set to the corresponding real value. The only thing that has to be verified is that \(\Delta=\left \{ {\scriptstyle \triangle }\in \mathcal {D}: \lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}\leq B \right \}\) is closed, which is the case with our assumption on u since \(\left \{ \boldsymbol {U}({\scriptstyle \triangle }), {\scriptstyle \triangle }\in\Delta \right \} = [\min K, \min(\sup K, B) ]^{n}\).
 4.
Note that if the regret bound for some loss ℓ does not depend on B, as for squared-loss-based scoring losses, then we can directly obtain a regret bound using Jensen’s inequality.
Notes
Acknowledgements
This work was partially funded by the French DGA, as well as the French Government and Région île de France through the FUI project OpenWay III. The authors thank the anonymous reviewers for their helpful comments and suggestions.
References
 Banerjee, A., Guo, X., & Wang, H. (2005). On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7), 2664–2669.
 Bartlett, P., & Jordan, M. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
 Buffoni, D., Calauzènes, C., Gallinari, P., & Usunier, N. (2011). Learning scoring functions with order-preserving losses and standardized supervision. In Proceedings of the International Conference on Machine Learning (pp. 825–832).
 Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. N. (2005). Learning to rank using gradient descent. In Proceedings of the International Conference on Machine Learning (pp. 89–96).
 Cambazoglu, B. B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., & Degenhardt, J. (2010). Early exit optimizations for additive machine learned ranking systems. In Proceedings of the ACM International Conference on Web Search and Data Mining (pp. 411–420).
 Cao, Y., Xu, J., Liu, T. Y., Li, H., Huang, Y., & Hon, H. W. (2006). Adapting ranking SVM to document retrieval. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 186–193).
 Cao, Z., & Liu, T. Y. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the International Conference on Machine Learning (pp. 129–136).
 Chakrabarti, S., Khanna, R., Sawant, U., & Bhattacharyya, C. (2008). Structured learning for non-smooth ranking losses. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 88–96).
 Chapelle, O., & Chang, Y. (2011). Yahoo! learning to rank challenge overview. Journal of Machine Learning Research, 14, 1–24.
 Chapelle, O., Metlzer, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the ACM Conference on Information and Knowledge Management (pp. 621–630).
 Clémençon, S., Lugosi, G., & Vayatis, N. (2005). Ranking and scoring using empirical risk minimization. In Proceedings of the Conference on Learning Theory (pp. 783–800).
 Clémençon, S., & Vayatis, N. (2007). Ranking the best instances. Journal of Machine Learning Research, 8, 2671–2699.
 Cohen, W. W., Schapire, R. E., & Singer, Y. (1997). Learning to order things. In Proceedings of Advances in Neural Information Processing Systems (pp. 243–270).
 Cossock, D., & Zhang, T. (2008). Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11), 5140–5154.
 Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
 Dekel, O., Manning, C. D., & Singer, Y. (2003). Log-linear models for label ranking. In Proceedings of Advances in Neural Information Processing Systems.
 Dembczynski, K., Kotlowski, W., & Huellermeier, E. (2012). Consistent multilabel ranking through univariate losses. In Proceedings of the International Conference on Machine Learning (pp. 1319–1326).
 Duchi, J., Mackey, L. W., & Jordan, M. I. (2010). On the consistency of ranking algorithms. In Proceedings of the International Conference on Machine Learning (pp. 327–334).
 Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
 Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20, 422–446.
 Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 133–142).
 Kotlowski, W., Dembczynski, K., & Huellermeier, E. (2011). Bipartite ranking through minimization of univariate loss. In Proceedings of the International Conference on Machine Learning (pp. 1113–1120).
 Le, Q. V., & Smola, A. J. (2007). Direct optimization of ranking measures. Technical report, NICTA.
 Lee, J. (2003). Introduction to smooth manifolds. Graduate Texts in Mathematics. Berlin: Springer.
 Liu, T. Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3, 225–331.
 Ravikumar, P. D., Tewari, A., & Yang, E. (2011). On NDCG consistency of listwise ranking methods. Journal of Machine Learning Research—Proceedings Track, 15, 618–626.
 Scott, C. (2011). Surrogate losses and regret bounds for cost-sensitive classification with example-dependent costs. In Proceedings of the International Conference on Machine Learning (pp. 153–160).
 Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26(2), 225–287.
 Tewari, A., & Bartlett, P. (2007). On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 1007–1025.
 Vembu, S., & Gärtner, T. (2009). Label ranking algorithms: a survey. In Preference Learning (pp. 1530–1537). Berlin: Springer.
 Voorhees, E., & Harman, D. (2005). TREC: experiment and evaluation in information retrieval. Digital libraries and electronic publishing. Cambridge: MIT Press.
 Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of the European Symposium on Artificial Neural Networks (pp. 219–224).
 Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (pp. 271–278).
 Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225–1251.