Machine Learning, Volume 93, Issue 2–3, pp 227–260

Calibration and regret bounds for order-preserving surrogate losses in learning to rank

  • Clément Calauzènes
  • Nicolas Usunier
  • Patrick Gallinari

Abstract

Learning to rank is usually reduced to learning to score individual objects, leaving the “ranking” step to a sorting algorithm. In that context, the surrogate loss used for training the scoring function needs to behave well with respect to the target performance measure which only sees the final ranking. A characterization of such a good behavior is the notion of calibration, which guarantees that minimizing (over the set of measurable functions) the surrogate risk allows us to maximize the true performance.

In this paper, we consider the family of order-preserving (OP) losses which includes popular surrogate losses for ranking such as the squared error and pairwise losses. We show that they are calibrated with performance measures like the Discounted Cumulative Gain (DCG), but also that they are not calibrated with respect to the widely used Mean Average Precision and Expected Reciprocal Rank. We also derive, for some widely used OP losses, quantitative surrogate regret bounds with respect to several DCG-like evaluation measures.

Keywords

Learning to rank · Calibration · Surrogate regret bounds

1 Introduction

Learning to rank has emerged as a major field of research in machine learning due to its wide range of applications. Typical applications include creating the query-dependent document ranking in search engines, where one learns to order sets of documents, each of these sets being attached to a query, using relevance judgments for each document as supervision. This task is known as subset ranking (Cossock and Zhang 2008). Another application is label ranking (see e.g. Dekel et al. 2003; Vembu and Gärtner 2009), where one learns to order a fixed set of labels depending on an input with a training set composed of observed inputs and the corresponding weak or partial order over the label set. Label ranking is a widely used framework to deal with multiclass/multilabel classification when the application accepts a ranking of labels according to the posterior probability of class membership instead of a hard decision about class membership.

In a similar way to other prediction problems in discrete spaces like classification, the optimization of the empirical ranking performance over a restricted class of functions is most frequently an intractable problem. Just as one optimizes the hinge loss or the log-loss in binary classification as a surrogate for the classification error, the usual approach in learning to rank is to replace the original performance measure by a continuous, preferably differentiable and convex function of the predictions. This has led many researchers to reduce the problem of learning to rank to learning a scoring function which assigns a real value to each individual item of the input set. The final ranking is then produced with a sorting algorithm. Many existing learning algorithms follow this approach both for label ranking (see e.g. Weston and Watkins 1999; Crammer and Singer 2002; Dekel et al. 2003) and subset ranking (Joachims 2002; Burges et al. 2005; Yue et al. 2007; Cossock and Zhang 2008; Liu 2009; Cambazoglu et al. 2010; Chapelle and Chang 2011). This relaxation has two advantages. First, sorting is a very efficient way to obtain a ranking (without scores, obtaining a full ranking is usually a very difficult problem). Second, defining a continuous surrogate loss on the space of predicted scores is a much easier task than defining one in the space of permutations.

While the computational advantage of such surrogate formulations is clear, one needs guarantees that minimizing the surrogate formulation (i.e. what the learning algorithm actually does) also enables us to maximize the ranking performance (i.e. what we want the algorithm to do). That is, we want the learning algorithm to be consistent with the true ranking performance we want to optimize. Steinwart (2007) presents general definitions and results showing that an asymptotic guarantee of consistency is equivalent to a notion of calibration of the surrogate loss with respect to the ranking performance measure, while the existence of non-asymptotic guarantees in the form of surrogate regret bounds is equivalent to a notion of uniform calibration. A surrogate regret bound quantifies how fast the evaluation measure is maximized as the surrogate loss is minimized. We note here that such non-asymptotic guarantees are critical in machine learning, where it is illusory to hope to learn near-optimal functions in a strong sense. Calibration and uniform calibration have been extensively studied in (cost-sensitive) binary classification (see e.g. Bartlett and Jordan 2006; Zhang 2004; Steinwart 2007; Scott 2011) and multiclass classification (Zhang 2004; Tewari and Bartlett 2007). In particular, under natural continuity assumptions, it was shown that calibration and uniform calibration of a surrogate loss are equivalent for margin losses. In the context of learning to rank with pairwise preferences, the non-calibration of many existing surrogate losses with respect to the pairwise disagreement was studied in depth in Duchi et al. (2010). On the other hand, in the context of learning to rank for information retrieval, surrogate regret bounds for square-loss regression with respect to a ranking performance measure called the Discounted Cumulative Gain (DCG, see Järvelin and Kekäläinen 2002) were shown in Cossock and Zhang (2008). These bounds were further extended in Ravikumar et al. (2011) to a larger class of surrogate losses.

In this paper, we analyze the calibration and uniform calibration of losses that possess an order-preserving property. This property of a surrogate loss implies (and, to some extent, is equivalent to) calibration with respect to any ranking performance measure in the family of what we call the generalized positional performance measures (GPPMs). A GPPM is a performance measure which, up to a suitable parametrization, can be written like a DCG. The study of GPPMs offers, in particular, the possibility to extend the DCG to arbitrary supervision spaces (e.g. linear orders, pairwise preferences) by first mapping the supervision to scores for each item, scores that can be interpreted as utility values. The family of GPPMs includes widely known performance measures for ranking like the DCG and its normalized version the NDCG, the precision-at-rank-K as well as the recall-at-rank-K, and Spearman's rank correlation coefficient (when the supervision is a linear ordering). We also give practical examples of template order-preserving losses, which can be instantiated for any specific GPPM to obtain a calibrated surrogate loss.

To go further, we investigate conditions under which the stronger notion of uniform calibration holds in addition to simple calibration. Under natural continuity conditions on the loss function, we show that any loss calibrated with a GPPM is uniformly calibrated when the supervision space is finite, which implies the existence of a regret bound. Finally, we prove explicit regret bounds for several convex template order-preserving losses. These bounds can be instantiated for any GPPM, such as the (N)DCG and recall/precision-at-rank-K. In particular, we obtain the first regret bounds with respect to GPPMs for losses based on pairwise comparisons, and recover the surrogate regret bounds of Cossock and Zhang (2008) and Ravikumar et al. (2011). Our proof technique is different though, and we are able to slightly improve the constant factor in the bounds.

As a by-product of our analysis, we investigate whether a loss with some order-preserving property can be calibrated with two measures other than GPPMs, namely the Expected Reciprocal Rank (Chapelle et al. 2009), used as reference in the recent Yahoo! Learning to Rank Challenge (Chapelle and Chang 2011), and the Average Precision, which was used in past Text REtrieval Conference (TREC) competitions (Voorhees and Harman 2005). Surprisingly, we obtain a negative result, even though these measures assume that the supervision itself takes the form of real values (relevance scores). Our result implies that for any transformation of the relevance scores given as supervision, the regression function of these transformations is not optimal for the ERR or the AP in general. We believe that this result can help understand the limitations of the score-and-sort approach to ranking, and it puts emphasis on an often neglected fact: the surrogate formulation is not really a matter of supervision space, but should be chosen carefully depending on the target measure. For example, one can use an appropriate regression approach to optimize Spearman's rank correlation coefficient when the supervision is a full ordering of the set, but for the ERR or the AP, regression approaches cannot be calibrated even though one gets real values for each item as supervision.

The rest of the paper is organized as follows. Sect. 2 describes the framework and the basic definitions. In Sect. 3, we introduce a family of surrogate losses called the order-preserving losses, and we study their calibration with respect to a wide range of performance measures. Then, we analyze sufficient conditions under which calibration is equivalent to the existence of a surrogate regret bound; this analysis is carried out by studying the stronger notion of uniform calibration in Sect. 4. In Sect. 5, we describe several methods to obtain explicit formulas for surrogate regret bounds, and we exhibit examples for common surrogate losses. Related work is discussed in Sect. 6, where we also summarize our main contributions.

This paper extends our prior work with D. Buffoni published at the International Conference on Machine Learning (Buffoni et al. 2011). More specifically, the definition of order-preserving losses, as well as several theorems and their proofs, already appeared in Buffoni et al. (2011); all other results are new.

2 Ranking performance measures and surrogate losses

Notation

A boldface character always denotes a function taking values in \(\mathbb{R}^{n}\) or an element of \(\mathbb{R}^{n}\) for some n>1. If \({\bf f}\) is a function of x, then \(f_{i}(x)\), using normal font and a subscript, denotes the i-th component of \({\bf f}(x)\). Likewise, if \({\bf x} \in \mathbb{R}^{n}\), \(x_{i}\) denotes its i-th component.

2.1 Definitions and examples

Scoring functions and scoring performance measures

The prediction task we consider is the following: given a measurable space \((\mathcal {X}, \varSigma_{\mathcal {X}})\) and some integer n>1, the goal is to predict an ordering of a fixed set of n objects, which we identify with the set of indices \(\left \{ 1, \ldots, n \right \}\), for any \(x\in \mathcal {X}\). This ordering is predicted with a score-and-sort approach. A scoring function \({\bf f}\) is any measurable function \({\bf f}:\mathcal {X}\rightarrow \mathbb{R}^{n}\), and the ordering of the set \(\left \{ 1,\ldots,n \right \}\) given \(x\in \mathcal {X}\) is obtained by sorting the integers i by decreasing values of \(f_{i}(x)\). Identifying the linear orders of \(\left \{ 1,\ldots,n \right \}\) with the set \({\mathfrak {S}_{n}}\) of permutations of \(\left \{ 1,\ldots,n \right \}\), the predicted ordering can thus be any permutation in \(\operatorname{arg\,sort}( {\bf f}(x))\), where:
$$ \operatorname{arg\,sort}:{\bf s}\in \mathbb{R}^n \mapsto \left \{ \sigma \in {\mathfrak {S}_{n}} \vert \forall0<k<n, s_{\sigma(k)}\geq s_{\sigma(k+1)} \right \} $$

Notice that with these definitions, for \(\sigma\in \operatorname{arg\,sort}( {\bf f}(x))\), σ(k) denotes the integer of \(\left \{ 1,\ldots,n \right \}\) whose predicted rank is k. Following the tradition in information retrieval, “item i has a better rank than item j according to σ” means “\(\sigma^{-1}(i)<\sigma^{-1}(j)\)”, i.e. low ranks are better. Likewise, the top-d ranks stand for the set of ranks \(\left \{ 1, \ldots, d \right \}\). Also notice that \(\operatorname {arg\,sort}\) is a set-valued function because of possible ties.
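For concreteness, here is a minimal sketch of the score-and-sort prediction step (in Python with NumPy; the function name and the random tie-breaking device are ours, anticipating the convention, adopted below for scoring performance measures, that ties are broken randomly):

```python
import numpy as np

def predict_ranking(scores, rng=None):
    """Return one permutation sigma in arg sort(scores).

    sigma[k] is the (0-based) item whose predicted rank is k + 1;
    following the paper's convention, low ranks are better, and ties
    are broken uniformly at random.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.random(len(scores))  # random tie-breaker among equal scores
    return sorted(range(len(scores)), key=lambda i: (-scores[i], noise[i]))

scores = np.array([0.3, 1.2, 0.3, -0.5])
sigma = predict_ranking(scores)            # e.g. [1, 0, 2, 3] or [1, 2, 0, 3]
ranks = {i: k + 1 for k, i in enumerate(sigma)}   # sigma^{-1}: item -> rank
```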

Predicting a total order over a finite set of objects is of use in many applications. One example is information retrieval, where x represents a tuple (query, set of documents) and \(f_{i}(x)\) is the score given to the i-th document in the set given the query. In practice, x contains joint feature representations of (query, document) pairs, and \(f_{i}(x)\) is the predicted relevance of document i with respect to the query. The learning task associated with this prediction problem has been called subset ranking in Cossock and Zhang (2008). Note that in practice, the set of documents may vary in size from one query to another, while in our work it is supposed to be constant. Nevertheless, all our results hold if the set size is allowed to vary, as long as it remains uniformly bounded. Another example of application is label ranking (see e.g. Dekel et al. 2003), where x is some observed object such as a text document or an image, and the set to order is a fixed set of class labels. In that case, x usually contains a feature representation of the object, and a function \(f_{i}\) is learned for each label index i; higher values of \(f_{i}(x)\) represent higher predicted class-membership probabilities. Large-margin approaches to multiclass classification (see e.g. Weston and Watkins 1999; Crammer and Singer 2002) are special cases of a score-and-sort approach to label ranking, where the prediction is the top-ranked label.

In the supervised learning setting, the prediction function is trained using a set of examples for which some feedback, or supervision, indicative of the desired ordering is given. In order to measure the quality of a predicted ordering of \(\left \{ 1,\ldots,n \right \}\), a ranking performance measure is used. It is a measurable function \(r: \mathcal {Y}\times {\mathfrak {S}_{n}} \rightarrow \mathbb{R}_{+}\), where \((\mathcal {Y}, \varSigma_{\mathcal {Y}})\) is a measurable space which will be called the supervision space, and each \(y\in \mathcal {Y}\) provides information about which orderings are desired. We take the convention that larger values of \({ r\left .(y, \sigma\right .)}\) mean that σ is an ordering of \(\left \{ 1,\ldots,n \right \}\) in accordance with y. The supervision space may differ from one application to the other. In search engine applications, the Average Precision (AP) used in past TREC competitions (Voorhees and Harman 2005), the Expected Reciprocal Rank (ERR) used in the Yahoo! Learning to Rank Challenge (Chapelle et al. 2009; Chapelle and Chang 2011) and the (Normalized) Discounted Cumulative Gain ((N)DCG) (Järvelin and Kekäläinen 2002) assume that the supervision space \(\mathcal {Y}\) is {0,…,p} n for some integer p>0, where the i-th component of \(y\in \mathcal {Y}\) is a judgment of the relevance of the i-th item to rank w.r.t. the query (these performance measures always favor better ranks for items of higher relevance). Other forms of supervision may be used though: for instance, in recommendation tasks, we may allow user ratings on a continuous (yet usually bounded) scale, or allow the supervision to be a preference relation over the set of items, as proposed in one of the earliest papers on learning to rank (Cohen et al. 1997).

Most ranking performance measures used in practical applications have the form described above: they are defined on a prediction space which is exactly the set of linear orderings, and do not directly take ties into account. In order to define performance measures for scoring functions, we take the convention that ties are broken randomly. Thus, given a ranking performance measure r, we overload the notation r (the context being clear from the arguments' names) to define a scoring performance measure as follows:
$$ \forall {\bf s}\in \mathbb{R}^n, \forall y\in \mathcal {Y}, \quad r(y, {\bf s}) = \frac{1}{|\operatorname{arg\,sort}({\bf s})|} \sum_{\sigma \in \operatorname{arg\,sort}({\bf s})} r(y, \sigma ) $$
where |S| denotes the cardinality of a set S. Of particular importance in our work will be the following family of ranking performance measures, which we call generalized positional performance measures (GPPMs):

Definition 2.1

(Generalized Positional Performance Measure)

Let \(r:\mathcal {Y}\times {\mathfrak {S}_{n}} \rightarrow \mathbb{R}_{+}\) be a ranking performance measure. We say that r is a (u,ϕ)-generalized positional performance measure (abbreviated (u,ϕ)-GPPM) if \({\phi }:\left \{ 1,\ldots,n \right \}\rightarrow \mathbb{R}_{+}\), \(\boldsymbol {u}:\mathcal {Y}\rightarrow \mathbb{R}_{+}^{n}\) is measurable, and:
  1. ϕ(1)>0 and ∀0<k<n, ϕ(k)≥ϕ(k+1)≥0;

  2. \(\exists b: \mathcal {Y}\rightarrow \mathbb{R}\) such that \(r: (y, \sigma ) \mapsto b(y)+ \sum_{k = 1}^{n}{\phi }(k)u_{\sigma (k)}(y)\).

The most popular example of a GPPM is the DCG, for which \({\phi }(i) = \frac{{\bf1}_{\left \{ i \le k \right \}}}{\log(1 + i)}\) for some truncation rank k, and \(u_{i}({\bf y}) = 2^{y_{i}} - 1\). The function u can be seen as mapping the supervision y to utility scores for the individual items to rank. We may therefore refer to u as the utility function of the (u,ϕ)-GPPM r. The name we use for this family of measures comes from the positional models described in Chapelle et al. (2009), in which one orders the documents according to the utility (i.e. the relevance) of a document w.r.t. the query. We call them “generalized” because the utility function is, in our case, only a means to transform the supervision so that a given performance measure can be assimilated to a positional model. In a specific context, however, this transformation is not necessarily the relevance of a document with respect to the query as a user would define it. In particular, in information retrieval, the relevance of a document is usually defined independently of the other documents, while in the case of GPPMs, the utility score may depend on the relevance of the other documents. The Normalized DCG is a typical example of such a GPPM.
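As an illustration, a (u,ϕ)-GPPM is straightforward to evaluate once ϕ and u are given; the sketch below instantiates the decomposition \(r(y,\sigma) = b(y) + \sum_{k}\phi(k)u_{\sigma(k)}(y)\) for the truncated DCG above (Python with NumPy; the natural logarithm and the truncation rank K=10 are illustrative assumptions of this sketch):

```python
import numpy as np

def gppm(y, sigma, phi, u):
    """r(y, sigma) = b(y) + sum_k phi(k) u_{sigma(k)}(y); here b = 0."""
    util = u(y)
    return sum(phi(k) * util[sigma[k - 1]] for k in range(1, len(sigma) + 1))

K = 10                                           # assumed truncation rank
phi_dcg = lambda k: 1.0 / np.log(1 + k) if k <= K else 0.0
u_dcg = lambda y: 2.0 ** np.asarray(y, dtype=float) - 1.0

y = [2, 0, 1, 0]              # graded relevance judgments (p = 2)
sigma = [0, 2, 1, 3]          # sigma[k-1] = 0-based item placed at rank k
print(gppm(y, sigma, phi_dcg, u_dcg))
```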

Table 1 summarizes the formulas for several widely used GPPMs: the (Normalized) Discounted Cumulative Gain ((N)DCG), the precision at rank K (Prec@K), the recall at rank K (Rec@K) and the Area Under the ROC Curve (AUC). All of these performance measures assume \(\mathcal {Y}\subset \mathbb{R}_{+}^{n}\). We also provide the formula of Spearman's rank correlation coefficient (Spearman) as a GPPM, to give an example where \(\mathcal {Y}\not\subset \mathbb{R}^{n}\) but is the set of total orders of \(\left \{ 1,\ldots, n \right \}\) instead. For completeness, we also give the formula and the supervision space for the Expected Reciprocal Rank (ERR) and the Average Precision (AP). Note that for GPPMs, the utility function may not be unique; Table 1 only gives one possible formulation.
Table 1 Summary of common performance measures. The function b is equal to zero for all measures except the AUC (\(b({\bf y}) = \frac{(\lVert {\bf y}\rVert _{1}-1)}{2(n - \lVert {\bf y}\rVert _{1})}\)) and Spearman's rank correlation coefficient (\(b({\bf y}) = - \frac{3(n-1)}{(n + 1)}\)). The details of the calculations are given in Appendix B.1

Learning objective and surrogate scoring loss

In the remainder of the paper, we will use many results due to Steinwart (2007), and thus follow its basic assumptions that \(\mathcal {Y}\) is a Polish space (i.e. a separable completely metrizable space) and that \(\mathcal {X}\) is complete (in the sense of Steinwart 2007, p. 3). These assumptions are purely technical and allow us to deal with ranking tasks in their full generality (note, in particular, that any open or closed subset of \(\mathbb{R}^{n}\) is a Polish space, as is any finite set). Consider a probability measure P on \(\mathcal {X}\times \mathcal {Y}\), which is unknown, and a ranking (or, equivalently, scoring) performance measure r. Given a sample drawn i.i.d. from P, the goal of learning to rank is to find a scoring function \({\bf f}\) with high ranking performance \(\mathcal {R}(P, {\bf f}) \) on P, defined by:
$$ \mathcal {R}({\rm P}, {\bf f}) = \int _{\mathcal {X}\times \mathcal {Y}} { r\left .(y, {\bf f}(x)\right .)} {\rm d}P(x,y) = \int_{\mathcal {X}} \int_{\mathcal {Y}} { r\left .(y, {\bf f}(x)\right .)} {\rm d}P(y|x) {\rm d}P_\mathcal {X}(x) $$
where \(P_{\mathcal {X}}\) is the marginal distribution of P over \(\mathcal {X}\) and P(.|.) is a regular conditional probability. As usual in learning, the performance measure we intend to maximize is neither continuous nor differentiable. The optimization of the empirical performance is thus intractable, and the common practice is to minimize a surrogate scoring risk as a substitute for directly optimizing the ranking performance. This surrogate is chosen to ease the optimization of its empirical risk. A natural way to obtain computationally efficient algorithms is to consider as surrogate a continuous and differentiable function of the predicted scores. More generally, we define a scoring loss as a measurable function \(\ell :\mathcal {Y}\times \mathbb{R}^{n}\rightarrow \mathbb{R}_{+}\). We use the convention that scoring losses, which are substitutes for the ranking/scoring performance, are minimized, while the latter are maximized. This avoids ambiguities about which function is the surrogate and which one is the target.
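Both the ranking performance and the scoring risk are, in practice, estimated on the i.i.d. sample; a minimal sketch of the corresponding Monte Carlo estimates (our helper names, assuming r and the scoring loss are available as Python callables taking a supervision y and a score vector):

```python
import numpy as np

def empirical_performance(r, f, sample):
    """Monte Carlo estimate of R(P, f) from (x, y) pairs drawn i.i.d. from P."""
    return float(np.mean([r(y, f(x)) for x, y in sample]))

def empirical_risk(loss, f, sample):
    """Monte Carlo estimate of the scoring risk L(P, f) on the same sample."""
    return float(np.mean([loss(y, f(x)) for x, y in sample]))
```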

A major issue of the field of learning to rank is the design of surrogate scoring losses that are, in some sense, well-behaved with respect to the target ranking performance measure. The next subsection will define criteria that should be satisfied for a reasonable surrogate loss. But before going into more details and in order to give a concrete example of a family of losses that may be useful when the performance measure is a GPPM, we define the following family of template losses:

Definition 2.2

(Template Scoring Loss)

Let Γ be a subset of \(\mathbb{R}_{+}^{n}\). A template scoring loss is a measurable function \(\ell :\varGamma\times \mathbb{R}^{n} \rightarrow \mathbb{R}_{+}\). For any measurable function \(\boldsymbol {u}:\mathcal {Y}\rightarrow \mathbb{R}_{+}^{n}\) with \(\boldsymbol {u}(\mathcal {Y})\subset\varGamma\), the u-instance of \(\ell\), denoted \(\ell^{\boldsymbol {u}}\), is defined by:
$$ \forall y\in \mathcal {Y}, \forall {\bf s}\in \mathbb{R}^n, \quad { \ell ^{\boldsymbol {u}}\left (y, {\bf s}\right )} = { \ell \left (\boldsymbol {u}(y), {\bf s}\right )} $$

A typical example of template loss is the squared loss defined by \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i=1}^{n} (v_{i}-s_{i})^{2}\) on \(\varGamma =\mathbb{R}^{n}\), as proposed in Cossock and Zhang (2008). Other examples of template losses will be given in Sect. 3.
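This instantiation mechanism is direct to express in code; a small sketch with the squared template loss and the DCG utility \(u_i(y)=2^{y_i}-1\) from above (the helper names are ours):

```python
import numpy as np

def squared_template(v, s):
    """Template scoring loss l(v, s) = sum_i (v_i - s_i)^2 on Gamma = R^n."""
    v, s = np.asarray(v, dtype=float), np.asarray(s, dtype=float)
    return float(np.sum((v - s) ** 2))

def instance(template, u):
    """u-instance l^u(y, s) = l(u(y), s) of a template scoring loss."""
    return lambda y, s: template(u(y), s)

# the DCG utility: u_i(y) = 2^{y_i} - 1
loss_dcg = instance(squared_template, lambda y: 2.0 ** np.asarray(y, float) - 1.0)
print(loss_dcg([2, 0, 1], [3.0, 0.0, 1.0]))  # 0.0: the scores match the utilities
```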

Note that many surrogate losses have been proposed for learning to rank (see Liu 2009 for an exhaustive review), and many of them are actually not template losses. SVMmap (Yue et al. 2007) and many other instances of the structural SVM approach to ranking (Le and Smola 2007; Chakrabarti et al. 2008) are good examples. Their advantage is to be designed for a specific performance measure, which may work better in practice when this performance measure is used for evaluation. On the other hand, template losses have the algorithmic advantage of providing an interface that can easily be specialized for a specific GPPM.

2.2 Calibration and surrogate regret bounds

We now describe some natural properties that surrogate loss functions should satisfy. This subsection defines the notations and briefly summarizes the definitions and results from Steinwart (2007) which are the basis of our work. The notations defined in this section are used in the rest of the paper without further notice.

Calibration

A natural property that a surrogate loss should satisfy is the following: if one manages, in some way, to find a scoring function minimizing its associated risk, then the ranking performance of this scoring function should be optimal as well. More formally, consider any scoring function \({\bf f}\) and let us denote:
  • \(\mathcal {L}(P, {\bf f}) = \int_{\mathcal {X}} \int_{\mathcal {Y}} { \ell \left (y, {\bf f}(x)\right )} {\rm d}P(y|x) {\rm d}P_{\mathcal {X}}(x)\) the scoring risk of \({\bf f}\);

  • \(\displaystyle {\underline {\mathcal {L}}}(P) = \inf_{\substack{{\bf f}:\mathcal {X}\rightarrow \mathbb{R}^n\\{\bf f}~\text{measurable}}} \mathcal {L}(P, {\bf f})\) the optimal scoring risk;

  • \(\displaystyle {\overline {\mathcal {R}}}(P) = \sup_{\substack{{\bf f}:\mathcal {X}\rightarrow \mathbb{R}^n\\{\bf f}~\text{measurable}}} \mathcal {R}(P, {\bf f})\) the optimal ranking performance.

Then, we want the following proposition to be true for any sequence \(({\bf f}_{k})_{k\geq0}\) of scoring functions:
$$ \mathcal {L}(P, {\bf f}_k) \mathop{ \longrightarrow}\limits _{k\rightarrow\infty} {\underline {\mathcal {L}}}(P) \quad\Rightarrow\quad \mathcal {R}(P, {\bf f}_k) \mathop{\longrightarrow}\limits _{k\rightarrow\infty} {\overline {\mathcal {R}}}(P) $$
(1)
Condition (1) is, in fact, equivalent to the notion of calibration (see Steinwart 2007, Definition 2.7), which we describe now.

Let \(\mathcal {D}\) denote the set of probability distributions over \(\mathcal {Y}\), and let \({\Delta }\subset \mathcal {D}\). Following Steinwart (2007, Definition 2.6), we say that P is a distribution of type Δ if P(.|x)∈Δ for all x. Then, Steinwart (2007, Theorem 2.8) shows that (1) holds for any distribution of type Δ such that \({\overline {\mathcal {R}}}(P)<+\infty\) and \({\underline {\mathcal {L}}}(P)<+\infty\) if and only if \(\ell\) is r-calibrated on Δ, according to the following definition:

Definition 2.3

(Calibration)

Let r be a ranking performance measure, \(\ell\) a scoring loss and \({\Delta }\subset \mathcal {D}\), where \(\mathcal {D}\) is the set of probability distributions over \(\mathcal {Y}\).

We say that \(\ell\) is r-calibrated on Δ if for any ε>0 and any △∈Δ, there exists δ>0 such that:
$$ \forall {\bf s}\in \mathbb{R}^n, { L\left ( {\scriptstyle \triangle },{\bf s}\right )} - { {\underline {L}}\left ({\scriptstyle \triangle }\right )} < \delta \quad\Longrightarrow\quad { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} < \varepsilon $$
where \(({\scriptstyle \triangle }, {\bf s})\mapsto { L\left ({\scriptstyle \triangle },{\bf s}\right )}\) and \(({\scriptstyle \triangle }, {\bf s})\mapsto { R\left .({\scriptstyle \triangle },{\bf s}\right .)}\) are called respectively the inner risk and inner performance, and the quantities \({ L\left ({\scriptstyle \triangle },{\bf s}\right )}\), \({ {\underline {L}}\left ({\scriptstyle \triangle }\right )}\), \({ R\left .({\scriptstyle \triangle },{\bf s}\right .)}\) and \({ {\overline {R}}\left ({\scriptstyle \triangle }\right )}\) are respectively defined by:
  • \(\displaystyle\forall {\bf s}\in \mathbb{R}^n, { L\left ({\scriptstyle \triangle },{\bf s}\right )} = \int_{\mathcal {Y}} { \ell \left (y, {\bf s}\right )} {\rm d}{\scriptstyle \triangle }(y)\) and \(\displaystyle { {\underline {L}}\left ({\scriptstyle \triangle }\right )} = \inf_{{\bf s}\in \mathbb{R}^n} { L\left ({\scriptstyle \triangle },{\bf s}\right )}\);

  • \(\displaystyle\forall {\bf s}\in \mathbb{R}^n,{ R\left .({\scriptstyle \triangle },{\bf s}\right .)} = \int_{\mathcal {Y}} { r\left .(y, {\bf s}\right .)} {\rm d}{\scriptstyle \triangle }(y)\) and \(\displaystyle { {\overline {R}}\left ({\scriptstyle \triangle }\right )} = \sup_{{\bf s}\in \mathbb{R}^n} { R\left .({\scriptstyle \triangle },{\bf s}\right .)}\).

The definition of calibration allows us to reduce the study of the implication (1), which involves risks and performances defined over the whole data distribution, to the study of the inner risks, which are much easier to deal with since they are only functions of the distribution over the supervision space and of a score vector. Thus, the inner risk and the inner performance are the essential quantities we investigate in this paper. The calibration of some surrogate losses w.r.t. GPPMs will be studied in Sect. 3.

Remark 1

The criterion given by Eq. (1) studies the convergence to the performance of the best possible scoring function, even though reaching this function is infeasible in practice on a finite training set, since we need to consider a restricted class of functions. Nonetheless, as discussed in Zhang (2004) in the context of multiclass classification, the best possible performance can be achieved asymptotically as the number of examples grows to infinity, using the method of sieves or structural risk minimization, that is, by progressively increasing the model complexity as the training set size increases.

Uniform calibration and surrogate regret bounds

While calibration gives us an asymptotic relation between the minimization of the surrogate loss and the maximization of the performance, it does not give us any information on how fast the performance regret of \({\bf f}_{k}\), defined by \({\overline {\mathcal {R}}}(P) - \mathcal {R}(P, {\bf f}_k)\), decreases to 0 when the surrogate regret of \({\bf f}_{k}\), defined by \(\mathcal {L}(P, {\bf f}_{k})- {\underline {\mathcal {L}}}(P) \), tends to 0. An answer to this question can be given by a surrogate regret bound, which is a function \(\varUpsilon:\mathbb{R}_{+} \rightarrow \mathbb{R}_{+}\) with ϒ(0)=0, continuous at 0, such that, for any distribution P of type Δ satisfying \({\overline {\mathcal {R}}}(P)<+\infty\) and \({\underline {\mathcal {L}}}(P)<+\infty\), we have, for any scoring function \({\bf f}\):
$$ {\overline {\mathcal {R}}}(P) - \mathcal {R}(P, {\bf f}) \le\varUpsilon \bigl( \mathcal {L}(P, {\bf f})- \displaystyle {\underline {\mathcal {L}}}(P) \bigr) $$
Steinwart (2007, Theorems 2.13 and 2.17) shows that the existence of such a surrogate regret bound is equivalent to a notion stronger than calibration, called uniform calibration (see Steinwart 2007, Definition 2.15):

Definition 2.4

(Uniform Calibration)

With the notations of Definition 2.3, we say that \(\ell\) is uniformly r-calibrated on Δ if, for any ε>0, there exists δ>0 such that for any △∈Δ and any \({\bf s}\in \mathbb{R}^{n}\):
$$ { L\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline {L}}\left ( {\scriptstyle \triangle }\right )} < \delta \quad\Longrightarrow\quad { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} < \varepsilon $$

Some criteria to establish the uniform calibration of scoring losses w.r.t. GPPMs are provided in Sect. 4. Quantitative regret bounds for specific template scoring losses will then be given in Sect. 5.

3 Calibration of order-preserving losses

In this section, we address the following question: which surrogate losses are calibrated w.r.t. GPPMs? This leads us to define the order-preserving property for surrogate losses. Since there is no reason to believe that these losses are calibrated only with respect to GPPMs, we also address the question of whether they can be calibrated with two other popular performance measures, namely the ERR and the AP. In the remainder of the paper, we make extensive use of the notations of Definition 2.3.

Notation

We now introduce additional notations. For a ranking performance measure r, we denote \(\mathcal {D}_{r}= \left \{ {\scriptstyle \triangle }\in \mathcal {D}| \forall {\bf s}\in \mathbb{R}^{n}, { R\left .({\scriptstyle \triangle },{\bf s}\right .)}<+\infty \right \}\). Likewise, we define \(\mathcal {D}_{\ell }= \left \{ {\scriptstyle \triangle }\in \mathcal {D}| \forall {\bf s}\in \mathbb{R}^{n}, { L\left ({\scriptstyle \triangle },{\bf s}\right )}<+\infty \right \} \) for a scoring loss \(\ell\), and denote by \(\mathcal {D}_{\ell , r}\) the intersection of \(\mathcal {D}_{r}\) and \(\mathcal {D}_{\ell }\). Finally, let r be a (u,ϕ)-GPPM and let \({\scriptstyle \triangle }\in \mathcal {D}_{r}\). We denote by \(\boldsymbol {U}({\scriptstyle \triangle }) = \int_{\mathcal {Y}} \boldsymbol {u}(y)d{\scriptstyle \triangle }(y)\) the expected value of u. One may notice that \(\mathcal {D}_{r}= \left \{ {\scriptstyle \triangle }| \lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}<+\infty \right \}\).

3.1 Order-preserving scoring losses

As the starting point of our analysis, we first notice that, by definition of a (u,ϕ)-GPPM, the function ϕ is a non-increasing function of the rank. Thus, for any given value of the supervision, the (u,ϕ)-GPPM is maximized by predicting items of higher utility values at better ranks, by the rearrangement inequality; and considering the additive structure of a (u,ϕ)-GPPM, the expected value of the (u,ϕ)-GPPM over a distribution \({\scriptstyle \triangle }\in \mathcal {D}_{r}\) is maximized by ranking the items according to their expected utility values. More formally, for any (u,ϕ)-GPPM and any \({\scriptstyle \triangle }\in \mathcal {D}_{r}\):
$$ \operatorname{arg\,sort}({\bf s}) \subseteq \operatorname{arg\, sort}\bigl(\boldsymbol {U}({\scriptstyle \triangle })\bigr) \quad\Rightarrow\quad { R\left .({\scriptstyle \triangle },{\bf s}\right .)} = { {\overline {R}}\left ({\scriptstyle \triangle }\right )} $$
(2)
Moreover, the reverse implication holds when ϕ is strictly decreasing (i.e. ϕ(i)>ϕ(i+1) for any 0<i<n).
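Implication (2) can be checked numerically by brute force; the sketch below (our code) draws a random expected-utility vector U(△) and a decreasing ϕ, and verifies that an ordering consistent with U(△) attains the maximal inner performance over all permutations (n is kept small so that the enumeration is exhaustive):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5
U = rng.random(n)                       # plays the role of U(delta)
phi = np.sort(rng.random(n))[::-1]      # decreasing phi (strictly, a.s.)

def inner_perf(sigma):
    # R(delta, sigma) = sum_k phi(k) U_{sigma(k)}, with 0-based ranks here
    return sum(phi[k] * U[item] for k, item in enumerate(sigma))

best = max(inner_perf(s) for s in itertools.permutations(range(n)))
sigma_star = tuple(np.argsort(-U))      # an ordering consistent with U
assert np.isclose(inner_perf(sigma_star), best)
```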

This result was already noticed by Cossock and Zhang (2008), where the authors advocated regression approaches for optimizing the DCG, and in Ravikumar et al. (2011), where the authors studied a generalization of regression losses based on Bregman divergences (see Eq. (5) below). The result emphasizes the fact that optimizing a GPPM is, in general, a much weaker objective than regressing the utility values: preserving the ordering induced by the utility function is sufficient. Consequently, it is natural to look for surrogate losses for which the inner risk is minimized only by scores which order the items like U: by the definition of calibration, any such loss is r-calibrated with any (u,ϕ)-GPPM r. This leads us to the following definition:

Definition 3.1

(Order-Preserving Loss)

Let \(\boldsymbol {u}: \mathcal {Y}\rightarrow \mathbb{R}^{n}_{+}\) be a measurable function, \(\ell : \mathcal {Y}\times \mathbb{R}^{n} \rightarrow \mathbb{R}_{+}\) be a scoring loss and \({\Delta }\subset \mathcal {D}_{\ell }\). We say that \(\ell\) is order-preserving w.r.t. u on Δ if, for any △∈Δ, we have:
$$ { {\underline {L}}\left ({\scriptstyle \triangle }\right )} < \inf \big \{ { L\left ({\scriptstyle \triangle }, {\bf s}\right )} | {\bf s}\in \mathbb{R}^n, \operatorname{arg\,sort}({\bf s}) \nsubseteq \operatorname{arg\, sort}\bigl(\boldsymbol {U}({\scriptstyle \triangle })\bigr) \big \} $$
Moreover, a template scoring loss \(\ell\) (see Definition 2.2) is called order-preserving if it is order-preserving with respect to the identity function of \(\mathbb{R}^{n}\) on \(\mathcal {D}_{\ell }\).

It is clear that, in general, if a loss is order-preserving w.r.t. some function u, then it is not order-preserving w.r.t. another utility function \(\boldsymbol {u}^{{\scriptscriptstyle \prime }}\) unless there is a strong relationship between the two functions (e.g. they are equal up to an additive constant or a positive multiplicative factor). As such, in order to obtain loss functions calibrated with any GPPM, template scoring losses are a natural choice. We provide here some examples of such losses, for which surrogate regret bounds are given in Sect. 5 (a code sketch instantiating these templates follows the list):
  • Pointwise template scoring losses:
    $$ \forall {\bf v}\in\varGamma\subset \mathbb{R}^n, {\bf s}\in \mathbb{R}^n, { \ell \left ({\bf v}, {\bf s}\right )} = \sum _{i=1}^n{ \lambda (v_i, s_i)} . $$
    (3)
    As mentioned in Sect. 2.1, one may take \(\varGamma= \mathbb{R}^{n}\) and \(\lambda(v_{i}, s_{i})=(v_{i}-s_{i})^{2}\) as in Cossock and Zhang (2008). This template loss is obviously order-preserving, since the optimal value of the scores is precisely the expected value of \({\bf v}\) (and thus U(△) when the template loss is instantiated).

    We may also consider, given η>0, the form \(\lambda(v_{i}, s_{i})=v_{i}\varphi(s_{i})+(\eta-v_{i})\varphi(-s_{i})\), which is convex with respect to \({\bf s}\) for any \({\bf v}\) in \(\varGamma=[0,\eta]^{n}\) if φ is convex. As we shall see in Sect. 5, this loss is order-preserving for many choices of φ, including the log-loss (\(t\mapsto\log(1+e^{-t})\)), the exponential loss (\(t\mapsto e^{-t}\)) and differentiable versions of the hinge loss. The log-loss proposed in Kotlowski et al. (2011) in the context of bipartite instance ranking for optimizing the AUC follows the same idea as these pointwise losses. The surrogate regret bounds proved in Dembczynski et al. (2012), in the same ranking framework as the one we consider here, apply to pointwise losses of a similar form, although with a value of η that depends on the supervision at hand.

  • Pairwise template scoring losses:
    $$ { \ell \left ({\bf v}, {\bf s}\right )} = \sum _{i<j} \lambda (v_i, v_j, s_i - s_j) $$
    (4)
    with \(\varGamma=\mathbb{R}_{+}^{n}\). For example, taking \(\lambda(v_{i},v_{j},s_{i}-s_{j})=(s_{i}-s_{j}-v_{i}+v_{j})^{2}\) also obviously leads to an order-preserving template loss. But we may also take \(\lambda(v_{i},v_{j},s_{i}-s_{j})=v_{i}\varphi(s_{i}-s_{j})+v_{j}\varphi(s_{j}-s_{i})\) (the latter being convex with respect to \({\bf s}\) for any \({\bf v}\) in Γ whenever φ is). Such a choice leads to an order-preserving template loss whenever φ is non-increasing and differentiable with \(\varphi'(0)<0\), and the infimum (over \({\bf s}\in \mathbb{R}^{n}\)) of \({ L\left ({\scriptstyle \triangle },\cdot\right )}\) is attained for any △ (see Remark 2 below). Pairwise losses are natural candidates for surrogate scoring losses because they share a natural invariance with the scoring performance measure (invariance under translation of the scores).
  • Listwise scoring losses: as proposed in Ravikumar et al. (2011), we may consider a general form of surrogate losses defined by Bregman divergences. Let \(\psi:\varGamma\subset \mathbb{R}^{n}\rightarrow \mathbb{R}\) be a strictly convex, differentiable function on a set Γ, and define the Bregman divergence associated to ψ by \(B_{\psi}({\bf v}\| {\bf s}) = \psi({\bf v}) - \psi({\bf s}) - \langle{\nabla\psi({\bf s})},{{\bf v}- {\bf s}}\rangle \). Let \({\bf g}:\mathbb{R}^{n} \rightarrow\varGamma\) be invertible and such that for any \({\bf s}\in \mathbb{R}^{n}\), \(s_{i}>s_{j} \Rightarrow g_{i}({\bf s})>g_{j}({\bf s})\). Then, we can use the following template loss:
    $$ { \ell \left ({\bf v}, {\bf s}\right )} = B_\psi \bigl({\bf v}\| {\bf g}({\bf s}) \bigr) , $$
    (5)
    which is an order-preserving template loss (Ravikumar et al. 2011) as soon as the closure of \({\bf g}(\mathbb{R}^{n})\) contains Γ. This follows from a characterization of Bregman divergences due to Banerjee et al. (2005): the expectation of a Bregman divergence (for a distribution over the left-hand argument) is uniquely minimized over the right-hand argument when the latter equals the expected value of the former.
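To make the three templates concrete, here is a minimal sketch instantiating them (Python with NumPy; the choices \(\varphi(t)=\log(1+e^{-t})\), η=1, \(\psi({\bf v})=\lVert{\bf v}\rVert^{2}\) and g=identity are illustrative assumptions; with this ψ and g, the Bregman template reduces to the squared loss):

```python
import numpy as np

phi = lambda t: np.log1p(np.exp(-t))    # log-loss: convex, non-increasing

def pointwise(v, s, eta=1.0):
    """Eq. (3) with lambda(v_i, s_i) = v_i phi(s_i) + (eta - v_i) phi(-s_i)."""
    v, s = np.asarray(v, float), np.asarray(s, float)
    return float(np.sum(v * phi(s) + (eta - v) * phi(-s)))

def pairwise(v, s):
    """Eq. (4) with lambda = v_i phi(s_i - s_j) + v_j phi(s_j - s_i)."""
    v, s = np.asarray(v, float), np.asarray(s, float)
    n = len(v)
    return float(sum(v[i] * phi(s[i] - s[j]) + v[j] * phi(s[j] - s[i])
                     for i in range(n) for j in range(i + 1, n)))

def bregman_listwise(v, s):
    """Eq. (5) with psi(v) = ||v||^2 and g = identity: B_psi(v || g(s))
    then reduces to the squared loss ||v - s||^2."""
    v, s = np.asarray(v, float), np.asarray(s, float)
    return float(np.sum((v - s) ** 2))

v = np.array([1.0, 0.0, 0.5])           # utilities u(y), in Gamma = [0, 1]^3
s = np.array([2.0, -1.0, 0.3])          # predicted scores
print(pointwise(v, s), pairwise(v, s), bregman_listwise(v, s))
```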

Remark 2

(Pairwise Losses)

The categorization of surrogate scoring losses into “pointwise”, “pairwise” and “listwise” we use here is due to Cao and Liu (2007). Note, however, that the pairwise template loss we consider in (4) with \(\lambda(v_{i},v_{j},s_{i}-s_{j})=v_{i}\varphi(s_{i}-s_{j})+v_{j}\varphi(s_{j}-s_{i})\) does not correspond to what is usually called the “pairwise comparison approach” to ranking, used in many algorithms including the very popular RankBoost (Freund et al. 2003) and Ranking SVMs (see e.g. Joachims 2002; Cao et al. 2006). Indeed, the latter can be written as \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i,j} {\bf1}_{v_{i}>v_{j}}\varphi(s_{i} - s_{j})\) (or some weighted version of this formula). This usual loss was shown by Duchi et al. (2010) to be non-calibrated with respect to the pairwise disagreement error for any convex φ in many general settings. That result shows that the loss is not order-preserving in general (because the pairwise disagreement error, when the supervision space is \(\left \{ 0,1 \right \}^{n}\), is minimized when the items are ordered according to their probability of belonging to class 1). On the other hand, with the form we propose in this paper, the inner risk of the u-instance of \(\ell\) can be written as \(L^{\boldsymbol {u}}({\scriptstyle \triangle }, {\bf s}) = \sum_{i=1}^{n}U_{i}\left ({\scriptstyle \triangle }\right )\sum_{j\neq i} \varphi(s_{i} - s_{j})\), which has the same form as the inner risk of the multiclass pairwise loss studied in Zhang (2004) and is order-preserving under appropriate assumptions (Zhang 2004, Theorem 5).
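For completeness, the inner-risk identity used in this remark follows from a single exchange of integral and sum, using the linearity of the expectation and the definition of U(△):
$$ L^{\boldsymbol {u}}({\scriptstyle \triangle }, {\bf s}) = \int_{\mathcal {Y}} \sum_{i<j} \bigl( u_{i}(y)\varphi(s_{i}-s_{j}) + u_{j}(y)\varphi(s_{j}-s_{i}) \bigr) \,{\rm d}{\scriptstyle \triangle }(y) = \sum_{i<j} \bigl( U_{i}\left ({\scriptstyle \triangle }\right )\varphi(s_{i}-s_{j}) + U_{j}\left ({\scriptstyle \triangle }\right )\varphi(s_{j}-s_{i}) \bigr) = \sum_{i=1}^{n}U_{i}\left ({\scriptstyle \triangle }\right )\sum_{j\neq i} \varphi(s_{i} - s_{j}) . $$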

Remark 3

(A note on terminology)

We use the qualifier order-preserving for scoring losses in a sense similar to the one used by the author of Zhang (2004) in the context of multiclass classification. We may note that Ravikumar et al. (2011) use the term order-preserving to qualify a function \({\bf g}:\mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\) such that \(s_{i}>s_{j} \Rightarrow g_{i}({\bf s})>g_{j}({\bf s})\), which corresponds to a notion different from the one used here.

3.2 Calibration of order-preserving losses

As already noticed, it follows from the definitions that if r is a (u,ϕ)-GPPM, then any loss that is order-preserving w.r.t. u is r-calibrated. The reverse implication is also true: given a measurable u, only a loss order-preserving w.r.t. u is calibrated with every (u,ϕ)-GPPM (that is, for any ϕ). The latter claim can be found, with different definitions, in Ravikumar et al. (2011, Lemmas 3 and 4). We summarize these results in the following theorem and give the proof for completeness:

Theorem 3.2

Let \(\boldsymbol {u}: \mathcal {Y}\rightarrow \mathbb{R}^{n}_{+}\) be a measurable function, \(\ell\) be a scoring loss, r be a (u,ϕ)-GPPM, and \({\Delta }\subset \mathcal {D}_{\ell , r}\). The following claims are true:
  1. If \(\ell\) is order-preserving w.r.t. u on Δ, then \(\ell\) is r-calibrated on Δ.

  2. If ϕ is strictly decreasing and \(\ell\) is r-calibrated on Δ, then \(\ell\) is order-preserving w.r.t. u on Δ.

Moreover, if \(\ell\) is an order-preserving template loss, then the u-instance of \(\ell\) is r-calibrated on \(\mathcal {D}_{ \ell ^{\boldsymbol {u}}, r}\).

Proof

The first claim and the remark on the template loss essentially follow from the definitions and from (2). For claim 2, it is sufficient to show that, for a given △∈Δ, if \(\operatorname {arg\,sort}({\bf s}) \nsubseteq \operatorname {arg\,sort}(\boldsymbol {U}({\scriptstyle \triangle }))\), then there is a c>0 such that \({ {\overline {R}}\left ({\scriptstyle \triangle }\right )} -{ R\left .({\scriptstyle \triangle },{\bf s}\right .)} \geq c\).

Notice that if \(\operatorname {arg\,sort}({\bf s}) \nsubseteq \operatorname {arg\,sort}(\boldsymbol {U}({\scriptstyle \triangle }))\), then there is a pair (i,j) with \(U_{i}\left ({\scriptstyle \triangle }\right ) > U_{j}\left ({\scriptstyle \triangle }\right ) \) but \(s_{i}\leq s_{j}\). Then, there is at least one permutation σ in \(\operatorname {arg\,sort}({\bf s})\) with \(\sigma^{-1}(i)>\sigma^{-1}(j)\). If \(\tau_{ij} \in {\mathfrak {S}_{n}}\) is the transposition of i and j, we then have:
$$ { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle }, \sigma \right .)} = \underbrace{{ {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle }, \tau_{ij}\circ \sigma \right .)}}_{\geq 0} {}+ \underbrace{{ R\left .( {\scriptstyle \triangle },\tau_{ij}\circ \sigma \right .)} - { R\left .({\scriptstyle \triangle }, \sigma \right .)}}_{ {= (U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right ) ) ({\phi }( \sigma ^{-1}(j)) - {\phi }( \sigma ^{-1}(i)) )}} $$
(6)
Since \(|\operatorname {arg\,sort}({\bf s})| \leq n!\), we have:
$$ { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} \geq \min _{k<n}\,\bigl |{\phi }(k)-{\phi }(k+1)\bigr| \times \min _{i,j:U_{i}\left ({\scriptstyle \triangle }\right ) \neq U_{j}\left ({\scriptstyle \triangle }\right )}\frac{\lvert U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\rvert }{n!} , $$
which proves the result. □

Note that, obviously, if a scoring loss is order-preserving w.r.t. u, then it is calibrated with any ranking performance measure r such that \(\operatorname{arg\,sort}(\boldsymbol {U}({\scriptstyle \triangle })) \subset \operatorname{arg\,max}_{\sigma } { R\left .({\scriptstyle \triangle }, \sigma \right .)}\). This gives us a full characterization of the ranking performance measures with respect to which order-preserving losses are calibrated.

While the order-preserving property is all we need for calibration w.r.t. a GPPM, one may ask whether it can be of use for the two other widely used performance measures: the AP and the ERR. The question is important because, apart from the usual precision/recall at rank K and the (N)DCG, these are the most widely used measures in search engine evaluation. Unfortunately, the answer is negative:

Theorem 3.3

Let \(\mathcal {Y}=\{0,1\}^{n}\), \(\boldsymbol {u}: \mathcal {Y}\rightarrow \mathbb{R}^{n}_{+}\), and let \(\ell\) be an order-preserving loss w.r.t. u on \(\mathcal {D}\). Then \(\ell\) is calibrated with neither the \({\tt ERR} \) nor the AP.

The proof of Theorem 3.3 can be found in Appendix B.2. To the best of our knowledge, there is no existing study of the calibration of any surrogate scoring loss w.r.t. the ERR or the AP.

The theorem implies, in particular, that a regression approach is necessarily not calibrated with the ERR or the AP, whatever function of the relevance scores we are trying to regress. We believe that the theorem does not reveal a weakness of the class of order-preserving (template) losses, but rather provides strong evidence that these measures are difficult objectives for learning, and that score-and-sort approaches are probably not suited to optimizing such performance measures.

4 Calibration and uniform calibration

We have provided a characterization of the surrogate losses calibrated with GPPMs, as well as a characterization of the performance measures with respect to which these losses are calibrated. In this section, we investigate the stronger notion of uniform calibration, which gives a non-asymptotic guarantee and implies the existence of a surrogate regret bound (Steinwart 2007, Theorems 2.13 and 2.17). Afterwards, in Sect. 5, we derive explicit regret bounds for some specific popular surrogate losses. In fact, we give conditions on the supervision space under which uniform calibration w.r.t. a GPPM is equivalent to simple calibration w.r.t. the same GPPM for learning to rank.

The equivalence between calibration and uniform calibration with respect to the classification error has already been proved in Bartlett and Jordan (2006) for the binary case, and in Zhang (2004) and Tewari and Bartlett (2007) for the multiclass case. Both studies concerned margin losses, which are similar to the scoring losses we consider in this paper, except that \(\mathcal {Y}\) is restricted (in our notation) to the canonical basis of \(\mathbb{R}^{n}\) and u is the identity function. We extend these results to the case of GPPMs, but we will not obtain an equivalence between calibration and uniform calibration in general, because of the more general form of the scoring loss functions and the possible unboundedness of u. Yet, we are able to present a number of special cases, depending on the loss function and the considered set of distributions over the supervision space, where such an equivalence holds.

The existence of a surrogate regret bound independent of the data distribution (even without an explicit statement of the bound) is a critical tool in the proofs of consistency of structural risk minimization over the surrogate formulation in Bartlett and Jordan (2006), Zhang (2004), Tewari and Bartlett (2007). Indeed, if one performs empirical minimization of the surrogate risk in function classes that grow (sufficiently slowly) with the number of examples, so that the surrogate risk tends to its infimum, the surrogate regret bound is sufficient to show that the sequence of surrogate risk minimizers tends to maximal performance. The major tool used in Bartlett and Jordan (2006) and Scott (2011) for deriving explicit regret bounds also precisely corresponds to proving uniform calibration. In our case, the criterion we develop for showing the equivalence between calibration and uniform calibration (Theorem 4.2) unfortunately does not lead to tight regret bounds. However, the following technical lemma, which we need to prove this criterion, will also prove crucial for the statement of explicit regret bounds.

Lemma 4.1

Let r be a (u,ϕ)-GPPM, \({\scriptstyle \triangle }\in {\Delta }\subset \mathcal {D}_{r}\), and \(\nu\in \operatorname {arg\,sort}(\boldsymbol {U}({\scriptstyle \triangle }))\). Then, for any \(\sigma \in {\mathfrak {S}_{n}}\), there is a set \(C_{\sigma}\subset \left \{ 1,\ldots,n \right \}^{2}\) satisfying:

  1. \(\forall(i,j)\neq(z,t)\in C_{\sigma}\), we have \(\left \{ i,j \right \}\cap \left \{ z,t \right \}=\emptyset\);

  2. \(\forall(i,j) \in C_{\sigma}\), \(U_{i}\left ({\scriptstyle \triangle }\right )>U_{j}\left ({\scriptstyle \triangle }\right ) \) and \(\sigma^{-1}(i)>\sigma^{-1}(j)\);

  3. \({ {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },\sigma \right .)} \le\sum_{(i,j) \in C_{\sigma}} (U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )) ({\phi }(\nu ^{-1}(i)) - {\phi }(\nu^{-1}(j)) ) \).
Consequently, for any \({\bf s}\in \mathbb{R}^{n}\), if we take \(C_{{\bf s}}= C_{\sigma}\) for some \(\displaystyle \sigma \in \operatorname {arg\,min}_{\sigma ^{\scriptscriptstyle \prime }\in \operatorname {arg\,sort}({\bf s})}\!{ R\left .({\scriptstyle \triangle },\sigma ^{\scriptscriptstyle \prime }\right .)}\), we have \(\forall(i,j)\in C_{{\bf s}}, U_{i}\left ({\scriptstyle \triangle }\right )>U_{j}\left ({\scriptstyle \triangle }\right ), s_{i}\leq s_{j}\) and:
$$ { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} \le \sum_{(i,j) \in C_{\bf s}} \bigl(U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right ) \bigr) \bigl({\phi }\bigl( \nu^{-1}(i)\bigr) - {\phi }\bigl(\nu^{-1}(j)\bigr) \bigr) . $$

Proof

For the proof, we will use the notation \(C_{\!{\scriptscriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\) for the set C σ, to make all the dependencies clear. We prove the existence of \(C_{\!{\scriptscriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\) by induction on n, the number of items to rank. It is easy to see that the result holds for \(n\in \left \{ 1,2 \right \}\). Let n>2 and assume that the result holds for any k<n.

Let \(\nu\in \operatorname {arg\,sort}(\boldsymbol {U}({\scriptstyle \triangle }))\) be an optimal ordering and \(\sigma \in {\mathfrak {S}_{n}}\). The idea of the proof is to build a permutation, consisting of a set of non-overlapping transpositions \(C_{\!{\scriptscriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\), whose inner performance is no better than that of σ. For clarity, Fig. 1 illustrates the exchanges that we now present. To simplify the proof, we make the following abuses of language: the “true rank of i” stands for \(\nu^{-1}(i)\), i.e. the rank of item i according to the optimal ν, and the “predicted rank of i” stands for \(\sigma^{-1}(i)\). Take i=ν(1), the true top-ranked item, and denote by \(d=\sigma^{-1}(i)\) its predicted rank. Now, consider the items in the top-d predicted ranks, and, in that set, denote by j the one with the worst true rank. Denote by p its predicted rank, that is, \(p=\sigma^{-1}(j)\) with \(\displaystyle j=\operatorname {arg\,max}_{ q: \sigma ^{-1}(q) \leq d}\{ \nu^{-1}(q) \}\). Notice that we have \(U_{i}\left ({\scriptstyle \triangle }\right ) \geq U_{j}\left ({\scriptstyle \triangle }\right )\) and \(\nu^{-1}(j)\geq d\).
Fig. 1

Pictorial representation of \(\sigma\circ\tau_{1p}\circ\tau_{d\nu^{-1}(\sigma (p))}\). The item j is put at first rank (i.e. the rank of i), and the item i is put at the true rank of j (i.e. the rank of j according to the true ranking ν). By the definition of j, this rank is greater than or equal to d

Since j is the item with the worst true rank among the top-d predicted items, we can only decrease the performance by exchanging it, in the predicted ranking, with the top-ranked item. In more formal terms, denoting by \(\tau_{wz} \in {\mathfrak {S}_{n}}\) the transposition of w and z, we thus have \({ R\left .({\scriptstyle \triangle },\sigma \right .)} \geq { R\left .({\scriptstyle \triangle },\sigma \circ\tau_{1p}\right .)}\) (\(\sigma \circ\tau_{1p}\) is thus the ranking created by exchanging the items at predicted ranks 1 and \(p=\sigma^{-1}(j)\)).

Likewise, since the true rank of j is greater than d and i is the true top-ranked element, we can, as well, only decrease performance if, in the predicted ranking, we exchange i with the item whose predicted rank is the true rank of j. More formally, we have:
$$ { R\left .({\scriptstyle \triangle },\sigma \right .)} \geq { R\left .({\scriptstyle \triangle },\sigma \circ\tau_{1p}\circ\tau_{d\nu^{-1}(\sigma (p))}\right .)} $$
In words, using \(\sigma \circ\tau_{1p}\circ\tau_{d\nu^{-1}(\sigma (p))}\), we put i at the true rank of j (which is worse than the predicted rank of i), and we put j at rank 1 (i.e. at the true rank of i). Even though we may have moved some other items, the important point is that these exchanges can only decrease the performance.
The interest of these exchanges is that, in \(\sigma \circ\tau_{1p}\circ\tau_{d\nu^{-1}(\sigma (p))}\), i and j have exchanged their positions relative to the optimal ranking ν. Consequently, we have:
$$\begin{aligned} { R\left .({\scriptstyle \triangle },\nu\right .)} - { R\left .({\scriptstyle \triangle },\sigma \right .)} \le & { R\left .({\scriptstyle \triangle },\nu\right .)} - { R\left .({\scriptstyle \triangle },\sigma \circ \tau_{1p}\circ\tau_{d\nu^{-1}(\sigma (p))}\right .)} \\ = & \bigl(U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\bigr) \bigl({\phi }(1) - {\phi }\bigl(\nu^{-1}(j)\bigr)\bigr) \\ & + \underbrace{\sum_{k\notin \left \{ i,j \right \}}U_{\nu(k)}\left ( {\scriptstyle \triangle }\right ){\phi }(k)}_{={ R^{{\scriptscriptstyle \prime }}\left ({\scriptstyle \triangle },\nu^{{\scriptscriptstyle \prime }}\right )}} - \underbrace{\sum _{k\notin \left \{ i,j \right \}}U_{\sigma (k)}\left ({\scriptstyle \triangle }\right ){\phi }(k)}_{={ R^{{\scriptscriptstyle \prime }}\left ({\scriptstyle \triangle },\sigma ^{\scriptscriptstyle \prime }\right )}} \end{aligned}$$
where we define \(r^{{\scriptscriptstyle \prime }}\) as a \((\boldsymbol {u}^{{\scriptscriptstyle \prime }}, {\phi }^{{\scriptscriptstyle \prime }})\)-GPPM on lists of items of size n−1 or n−2 depending on i and j:
Case i≠j

In that case, define \(r^{{\scriptscriptstyle \prime }}\) as a \((\boldsymbol {u}^{{\scriptscriptstyle \prime }}, {\phi }^{{\scriptscriptstyle \prime }})\)-GPPM on lists of items of size n−2, such that \(\boldsymbol {u}^{{\scriptscriptstyle \prime }}\), \({\phi }^{{\scriptscriptstyle \prime }}\), \(\nu^{{\scriptscriptstyle \prime }}\) and \(\sigma ^{{\scriptscriptstyle \prime }}\) are equal to u, ϕ, ν and σ on indices different from i and j, up to an appropriate re-indexing of the remaining n−2 items. Using the induction assumption, we can find a set \(C_{\!{\scriptstyle \triangle },\sigma ^{{\scriptscriptstyle \prime }}}^{\boldsymbol {u}^{{\scriptscriptstyle \prime }},{\phi }^{{\scriptscriptstyle \prime }}}\!(\nu ^{{\scriptscriptstyle \prime }})\) satisfying the three conditions of the lemma, which we add to the pair (i,j), after re-indexing, to build \(C_{\!{\scriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\). Notice that, for now, we do not exactly meet condition 2, since we only have \(\forall(i,j) \in C_{\!{\scriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu ), U_{\sigma(i)}\left ({\scriptstyle \triangle }\right )\geq U_{\sigma(j)}\left ({\scriptstyle \triangle }\right )\), while condition 2 requires a strict inequality. However, if \(U_{\sigma(i)}\left ({\scriptstyle \triangle }\right )= U_{\sigma(j)}\left ({\scriptstyle \triangle }\right )\) for some pair (i,j) in \(C_{\!{\scriptstyle \triangle },\sigma }^{\boldsymbol {u},{\phi }}\!(\nu )\), then the pair has no influence on the bound and can simply be discarded.

Case i=j

In that case, define \(r^{{\scriptscriptstyle \prime }}\) as a \((\boldsymbol {u}^{{\scriptscriptstyle \prime }}, {\phi }^{{\scriptscriptstyle \prime }})\)-GPPM on lists of items of size n−1. We then directly apply the induction hypothesis, ignoring the top-ranked element and considering the set of pairs on the remaining n−1 elements.

 □

An important characteristic of the set \(C_{\sigma}\) in the lemma is its first condition, which ensures that the pairs (i,j) are independent (each index i appears in at most one pair). This condition is critical in the derivation of the explicit surrogate bounds of the next section. Another important technical feature of the bound is that it is based on misordered pairs, and can thus be applied to any loss. In contrast, the bounds on DCG suboptimality used in Cossock and Zhang (2008) or Ravikumar et al. (2011) depend on how well (a function of) the score vector \({\bf s}\) approximates U(△); consequently, those bounds can only be used for regression-like template losses.

We are now ready to give a new characterization of uniform calibration w.r.t. GPPMs. This characterization is easier to work with than the initial definition of uniform calibration. Note that it applies to losses of arbitrary structure, and thus also to non-template losses.

Theorem 4.2

Let r be a (u,ϕ)-GPPM, ℓ be a scoring loss and \({\Delta }\subseteq \mathcal {D}_{\ell , r}\). For any ε>0, and \(i,j \in \left \{ 1,\ldots,n \right \}\), define:
$$ {\Delta }_{i,j}(\varepsilon) = \left \{ {\scriptstyle \triangle }\in {\Delta }| U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right ) \geq \varepsilon \right \} $$
(7)
and denote
$$ \varOmega _{i,j}= \left \{ {\bf s}\in \mathbb{R}^n \vert s_i\leq s_j \right \} . $$
(8)
Consider the two following statements:
  (a) There is a function \(\delta:\mathbb{R}_{+}\rightarrow \mathbb{R}_{+}\) s.t. ∀ε>0, δ(ε)>0 and:
    $$ \forall i\neq j, \forall {\scriptstyle \triangle }\in {\Delta }_{i,j}(\varepsilon), \quad { {\underline {L}}\left ({\scriptstyle \triangle }\right )} +\delta(\varepsilon) \leq \inf_{{\bf s}\in \varOmega _{i,j}} { L\left ({\scriptstyle \triangle },{\bf s}\right )} . $$
  (b) ℓ is uniformly r-calibrated on Δ.
We have (a) ⇒ (b) and, if ∀0<i<n,ϕ(i)>ϕ(i+1) then (b) ⇒ (a).

Proof

We start with (a) ⇒ (b). Fix ε>0, \({\bf s}\in \mathbb{R}^{n}\) and △∈Δ. From (a), we know that if \({ L\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline {L}}\left ({\scriptstyle \triangle }\right )} < \delta(\varepsilon)\) then for any i,j satisfying \((U_{i}\left ({\scriptstyle \triangle }\right ) -U_{j}\left ({\scriptstyle \triangle }\right ) )(s_{i}-s_{j})\leq0\), we have \(\lvert U_{i}\left ({\scriptstyle \triangle }\right ) -U_{j}\left ({\scriptstyle \triangle }\right )\rvert <\varepsilon\). By Lemma 4.1, we obtain \({ {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} < \frac{n}{2}{\phi }(1) \varepsilon\), since there are at most n/2 non-overlapping pairs of indexes (i,j), i≠j in \(\left \{ 1,\ldots,n \right \}\) and |ϕ(i)−ϕ(j)|≤ϕ(1) for any i,j. This bound being independent of △, this proves the uniform calibration of ℓ w.r.t. r on Δ.

We now prove (b) ⇒ (a) when ϕ(i)>ϕ(i+1) for all 0<i<n, by contrapositive. Suppose (a) does not hold. Then, we can find ε>0, a sequence \((i_k,j_k)_{k\geq0}\) with \(i_k\neq j_k\) for all k and a sequence \(({\scriptstyle \triangle }_k)_{k\geq0}\) with \(\forall k, {\scriptstyle \triangle }_{k}\in {\Delta }_{i_{k},j_{k}}(\varepsilon)\) satisfying \(\inf_{{\bf s}\in \varOmega _{i_{k},j_{k}}}{ L\left ({\scriptstyle \triangle }_{k},{\bf s}\right )} - { {\underline {L}}\left ({\scriptstyle \triangle }_{k}\right )} \mathop{\longrightarrow}\limits_{k\rightarrow+\infty} 0\).

Thus, for any η>0, we can find i≠j, \({\scriptstyle \triangle }\in {\Delta }_{i,j}(\varepsilon)\) and \({\bf s}\in \mathbb{R}^{n}\) with \(s_i\leq s_j\) such that \({ L\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline {L}}\left ({\scriptstyle \triangle }\right )} < \eta\). However, if one considers the lower bound of (6), we obtain, for some permutation σ in \(\operatorname {arg\,sort}({\bf s})\) s.t. σ −1(i)>σ −1(j):
$$ { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },\sigma \right .)} \geq\varepsilon\min _{j<n}\,\bigl|{\phi }(j)-{\phi }(j+1)\bigr| $$
Finally, since \(|\operatorname {arg\,sort}({\bf s})| \leq n!\), we obtain \({ {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} \geq \varepsilon \frac{\min_{j<n}\lvert {\phi }(j)-{\phi }(j+1)\rvert }{n!}\). This lower bound holds for any i≠j, any \({\scriptstyle \triangle }\in {\Delta }_{i,j}(\varepsilon)\) and any \({\bf s}\) such that \((U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right ) )(s_{i}-s_{j})\leq0\); the performance regret thus stays bounded away from 0 while the surrogate regret goes to 0, so ℓ is not uniformly r-calibrated on Δ. □

Using this new characterization, we now address the problem of finding losses and sets of distributions Δ such that if ℓ is r-calibrated on Δ for some GPPM r, then condition (a) of Theorem 4.2 holds, implying uniform calibration and the existence of the regret bound. The interest of the characterization of Theorem 4.2 is that, in some cases, condition (a) follows from plain calibration for large families of losses. Before going to some examples, we provide here the main corollary. Examples for more specific losses or supervision spaces are given in Corollary 4.5 and in Appendix A.

Corollary 4.3

Let r be a (u,ϕ)-GPPM, ℓ be a scoring loss, and \({\Delta }\subseteq \mathcal {D}_{\ell ,r}\). Assume Δ can be given a topology such that:
  1. Δ is compact;
  2. the map \(\displaystyle \left ( \begin{array}{rcl} {\Delta }& \rightarrow & \mathbb{R}^n_+ \\ {\scriptstyle \triangle }&\mapsto& \boldsymbol {U}({\scriptstyle \triangle }) \end{array} \right )\) is continuous;
  3. \(\forall i, j, \displaystyle \left ( \begin{array}{rcl} {\Delta }& \rightarrow&\mathbb{R}\\ {\scriptstyle \triangle }& \mapsto& \displaystyle\!\!\inf_{{\bf s}\in \varOmega _{i,j}}\!\!{ L\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline {L}}\left ({\scriptstyle \triangle }\right )} \end{array} \right )\) is continuous, with \(\varOmega _{i,j}\) defined by (8).

Then, ℓ is r-calibrated on Δ if and only if it is uniformly r-calibrated on Δ.

Proof

Since uniform calibration implies calibration, we only have to show the “only if” part.

First, we show using conditions 1 and 2 that for any ε>0 and any i,j, the set \({\Delta }_{i,j}(\varepsilon)\) defined by (7) is compact. Since U is continuous on Δ and Δ is compact, U(Δ) is a compact subset of \(\mathbb{R}_{+}^{n}\); in particular, U(Δ) is bounded. Let \(B=\sup_{{\scriptstyle \triangle }\in {\Delta }}\lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert \) and consider now the function \(h_{i,j}({\scriptstyle \triangle }) = U_{i}\left ({\scriptstyle \triangle }\right )-U_{j}\left ({\scriptstyle \triangle }\right )\). \(h_{i,j}\) is continuous from the compact set Δ to \(\mathbb{R}\). Therefore, \(h_{i,j}\) is a proper map, i.e. the preimage of any compact set is compact (see e.g. Lee 2003, Lemma 2.14, p. 45). Thus, \({\Delta }_{i,j}(\varepsilon) = h_{i,j}^{-1}([\varepsilon, B])\) is compact in Δ.

We now go on to the proof of the result. Let i≠j and denote by \(g_{i,j}:{\Delta }\rightarrow \mathbb{R}\) the function defined in condition 3. Since ℓ is r-calibrated on Δ, we have \(g_{i,j}({\scriptstyle \triangle })>0\) for any \({\scriptstyle \triangle }\in {\Delta }_{i,j}(\varepsilon)\) as soon as ε>0. Since \(g_{i,j}\) is continuous and \({\Delta }_{i,j}(\varepsilon)\) is compact, \(g_{i,j}({\Delta }_{i,j}(\varepsilon))\) is a compact subset of \(\mathbb{R}\) and the minimum is attained. Defining \(\delta(\varepsilon)=\min_{i\neq j} \min g_{i,j}({\Delta }_{i,j}(\varepsilon))\), we thus have δ(ε)>0. By Theorem 4.2, this proves that ℓ is uniformly r-calibrated. □

Corollary 4.3 gives conditions on the accepted form of supervision (conditions 1 and 2) and on the loss structure (condition 3) under which r-calibration on Δ for a GPPM r implies uniform r-calibration on Δ. Conditions 1 and 2 are obviously satisfied when the supervision space is finite, and, as we shall see later, condition 3 is then automatically satisfied as well. We may also expect the same result to hold when we restrict U to be bounded. The case of a finite supervision space is treated below; the more technical case of an infinite supervision space is detailed in Appendix A. We first recall the following result, which will help us discuss these special cases:

Lemma 4.4

(Zhang 2004, Lemma 27) Let K>0, and let \(\psi_{k}:\mathbb{R}\rightarrow \mathbb{R}_{+}, k=1..K\) be K continuous functions. Let \(\varOmega\subseteq \mathbb{R}^{n}\), Ω≠∅ and \(\mathcal{Q}\) be a compact subset of \(\mathbb{R}_{+}^{K}\). Then, the function \({\underline {\varPsi}}\) defined as \(\left ( \begin{array}{rcl} \mathcal{Q} & \rightarrow& \mathbb{R}_{+}\\ {\bf q} &\mapsto& \displaystyle\inf_{{\bf s}\in \varOmega}\sum_{k=1}^K q_k\psi_k({\bf s}) \end{array} \right ) \) is continuous.

From now on, we suppose that the supervision space \(\mathcal {Y}\) is finite. Then, \({\Delta }=\mathcal {D}\) can be identified with the \(|\mathcal {Y}|\)-simplex, which is compact under its natural topology. Moreover, in that case, U is necessarily continuous with respect to this topology on Δ, and thus conditions 1 and 2 of Corollary 4.3 are satisfied. The only remaining question is whether the class of loss functions we consider satisfies condition 3, a question which is solved by Lemma 4.4. We can now give a full answer to the question of uniform calibration w.r.t. a GPPM when the supervision space is finite:

Corollary 4.5

Suppose that \(\mathcal {Y}\) is finite. Let r be a (u,ϕ)-GPPM and ℓ a scoring loss such that ℓ(y,·) is continuous on \(\mathbb{R}^{n}\) for any \(y\in \mathcal {Y}\). Take \({\Delta }=\mathcal {D}\) (notice that \(\mathcal {D}= \mathcal {D}_{\ell }= \mathcal {D}_{r}\)). Then, the following claims are true:
  1. ℓ is r-calibrated on Δ if and only if it is uniformly r-calibrated on Δ.
  2. If ℓ is order-preserving w.r.t. u on Δ, it is uniformly r-calibrated on Δ.
  3. If ϕ(i)>ϕ(i+1) for all 0<i<n, then ℓ is order-preserving w.r.t. u on Δ if and only if it is uniformly r-calibrated on Δ.

Proof

Since \(\mathcal {Y}=\left \{ y_{1}, \ldots, y_{K} \right \}\) is finite (\(K=|\mathcal {Y}|\)), we already showed that both conditions 1 and 2 of Corollary 4.3 are satisfied, identifying \(\mathcal {D}\) with the K-simplex. Then, for any scoring loss, we have \({ L\left ({\scriptstyle \triangle },{\bf s}\right )} = \sum_{k=1}^{K} {\scriptstyle \triangle }(\{y_{k}\}) \ell (y_{k}, {\bf s})\), which satisfies condition 3 of Corollary 4.3 by Lemma 4.4. Thus, using Corollary 4.3, we know that for any (u,ϕ)-GPPM r, ℓ is r-calibrated if and only if it is uniformly r-calibrated, giving us the first claim of the corollary.

The second claim comes from the fact that an order-preserving loss is calibrated with any GPPM. The third claim comes from the fact that only order-preserving losses are calibrated w.r.t. (u,ϕ)-GPPMs with ϕ(i)>ϕ(i+1) for all 0<i<n, together with the equivalence of r-calibration and uniform r-calibration when the supervision space is finite. □

This result shows that when the supervision space is finite, any surrogate loss calibrated with respect to a GPPM admits a regret bound; in other words, any such loss has non-asymptotic guarantees. Since the exact form of the regret bound depends on the loss at hand, Corollary 4.5 is the strongest result we can obtain for arbitrary losses. We refer to Appendix A for a similar result concerning template losses in a special case where the supervision space is infinite. In the next section, we provide more quantitative surrogate regret bounds for specific template losses.

5 Surrogate regret bounds

The previous section dealt with the existence of surrogate regret bounds through the study of uniform calibration. We now derive explicit surrogate regret bounds for commonly used template surrogate losses: pointwise losses, pairwise losses, and losses that can be written as a Bregman divergence. As in classification, the main idea is to find a convex lower bound of the surrogate regret as a function of the performance regret. However, contrary to classification, computing the calibration function as in Steinwart (2007) or the Ψ-transform as in Bartlett and Jordan (2006) is not feasible here. If one tries to find the function δ of Theorem 4.2, the resulting bound is worse than the ones we reach in this section, because it does not exploit the non-overlapping pairs of indexes.

In this section, we first present regret bounds for specific template losses in Table 2, then we describe three methods of proof for achieving these bounds, for pointwise losses (3), Bregman divergences (5) and pairwise losses (4). Before starting the analysis, we introduce new notation for the (inner) risks associated to template losses. We first recall that given a template scoring loss \(\ell :\varGamma \times \mathbb{R}^{n} \rightarrow \mathbb{R}_{+}\) with \(\varGamma\subset \mathbb{R}_{+}^{n}\), its u-instance is denoted by \(\ell ^{\boldsymbol {u}}\). Using a similar notation with a superscript u, the scoring risk and inner risk of \(\ell ^{\boldsymbol {u}}\) are respectively denoted by:
  • For any distribution P on \(\mathcal {X}\times \mathcal {Y}\) and prediction function \({\bf f}\),
    $$\mathcal {L}^{\boldsymbol {u}}(P, {\bf f}) = \int_{\mathcal {X}} \int _{\mathcal {Y}} { \ell ^{\boldsymbol {u}}\left (y, {\bf f}(x)\right )} \mathrm{dP}( y,x) = \int_{\mathcal {X}} \int_{\mathcal {Y}} { \ell \left (\boldsymbol {u}(y), {\bf f}(x)\right )} \mathrm{dP}(y,x), $$
  • For any \({\scriptstyle \triangle }\in \mathcal {D}\), and \({\bf s}\in \mathbb{R}^{n}\),
    $${ L^{\boldsymbol {u}}\left ({\scriptstyle \triangle },{\bf s}\right )} = \int_{\mathcal {Y}} { \ell ^{\boldsymbol {u}}\left (y, {\bf s}\right )} {\rm d}{\scriptstyle \triangle }(y) = \int _{\mathcal {Y}} { \ell \left (\boldsymbol {u}(y), {\bf s}\right )} {\rm d}{\scriptstyle \triangle }(y). $$
Moreover, \({\underline { \mathcal {L}^{\boldsymbol {u}}}}(P)\) and \({ {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )}\) refer to the respective optimal risks.
Table 2 Summary of surrogate regret bounds. Recalling that the v_i are upper-bounded, η can be chosen as \(\eta> \max_{i}U_{i}\left ({\scriptstyle \triangle }\right )\), and φ_α is a differentiable version of the hinge loss, where \(\alpha\in (0,\frac{ \eta }{ 2 } )\) is a parameter to choose: φ_α (x)=0 if x≤0, \(\varphi_{\alpha}(x) =\frac{x^{2}}{2\alpha}\) if x∈[0,α], and \(\varphi_{\alpha}(x) =x- \frac{\alpha}{2}\) otherwise

Pointwise losses (3): \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i=1}^{n} { \lambda (v_{i}, s_{i})} \)

| Name | \(\lambda(v_i, s_i)\) | c |
| --- | --- | --- |
| Squared Error | \((v_{i} - s_{i})^2\) | \(\sqrt{2}\) |
| Logistic | \(v_{i} \log(1 + e^{-s_{i}}) + (\eta - v_{i}) \log(1 + e^{s_{i}})\) | \(\sqrt{\eta }\) |
| Exponential | \(v_{i} e^{- s_{i}} + (\eta - v_{i}) e^{s_{i}}\) | \(\sqrt{\eta }\) |
| Square Hinge | \(v_{i} \max(0,t-s_{i})^2+(\eta - v_{i})\max(0,s_{i})^2\) | \(\frac{\sqrt{2\eta }}{t}\) |
| Differentiable Hinge | \(v_{i} \varphi_{\alpha}(1-s_{i})+(\eta - v_{i})\varphi_{\alpha}(s_{i})\) | \(4\sqrt{\frac{\eta }{\alpha}}\) |

Pairwise losses (4): \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i<j} \lambda (v_{i}, v_{j}, s_{i} - s_{j}) \)

| Name | \(\lambda(v_i, v_j, d_{ij})\) | c |
| --- | --- | --- |
| Squared Error | \((v_{i} - v_{j} - d_{ij})^2\) | 1 |
| Logistic | \(v_{i} \log(1 + e^{-d_{ij}}) + v_{j} \log(1 + e^{d_{ij}})\) | \(2\sqrt{\lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}}\) |
| Exponential | \(v_{i} e^{-d_{ij}} + v_{j} e^{d_{ij}}\) | \(2\sqrt{\lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}}\) |

Bregman divergence (5): \({ \ell \left ({\bf v}, {\bf s}\right )} = B_{\psi}({\bf v}\| {\bf g}({\bf s}) )\)

| \(\psi(\cdot)\) | c |
| --- | --- |
| μ-strongly convex (12) | \(\frac{2}{\sqrt {\mu}}\) |
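
To make the template losses of Table 2 concrete, here is a minimal numerical sketch (ours, not from the paper) of the pointwise Squared Error and Differentiable Hinge, including the φ_α of the caption; the vectors v and s and the parameters η and α below are illustrative choices.

```python
import numpy as np

def phi_alpha(x, alpha):
    """Differentiable hinge from Table 2's caption: 0 for x <= 0, quadratic
    x^2/(2*alpha) on [0, alpha], then linear x - alpha/2 (pieces join at x = alpha)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0, 0.0,
                    np.where(x <= alpha, x ** 2 / (2 * alpha), x - alpha / 2))

def pointwise_squared_error(v, s):
    """Pointwise Squared Error row of Table 2: sum_i (v_i - s_i)^2."""
    v, s = np.asarray(v, dtype=float), np.asarray(s, dtype=float)
    return np.sum((v - s) ** 2)

def pointwise_differentiable_hinge(v, s, eta, alpha):
    """Pointwise Differentiable Hinge row of Table 2:
    sum_i v_i phi_alpha(1 - s_i) + (eta - v_i) phi_alpha(s_i)."""
    v, s = np.asarray(v, dtype=float), np.asarray(s, dtype=float)
    return np.sum(v * phi_alpha(1 - s, alpha) + (eta - v) * phi_alpha(s, alpha))

v = np.array([3.0, 1.0, 0.0])    # illustrative utilities u(y), upper-bounded by eta
s = np.array([2.5, 0.8, -0.2])   # illustrative predicted scores
print(pointwise_squared_error(v, s))                               # 0.33
print(pointwise_differentiable_hinge(v, s, eta=4.0, alpha=1.0))    # 2.98
```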

5.1 Regret bounds for common surrogate losses

We first summarize the different bounds obtained in the remainder of the section for pointwise losses, Bregman divergences and pairwise losses, and then present the three methods used on these families of losses to achieve them.

Given a (u,ϕ)-GPPM, we obtain for these families of surrogate scoring losses the same regret bound up to a constant factor c, which intuitively corresponds to a rescaling with respect to the surrogate loss:
$$ {\overline {\mathcal {R}}}(P)- \mathcal {R}(P, {\bf f}) \leq c C_{\phi }(2) \sqrt{ \mathcal {L}^{\boldsymbol {u}}(P, {\bf f})- {\underline { \mathcal {L}^{ \boldsymbol {u}}}}(P)} $$
(9)
with
$$\displaystyle C_{\phi }(p) = \Biggl( \sum_{i=1}^{ \lfloor\frac{n}{2} \rfloor} \bigl({\phi }(i) - {\phi }(n-i+1) \bigr)^p \Biggr)^{\frac{1}{p}} ,$$
for any positive integer p.
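
For intuition, the constant \(C_{\phi}(p)\) is easy to compute. The sketch below (ours, not from the paper) evaluates \(C_{\phi}(2)\) for the standard DCG discount ϕ(i)=1/log₂(i+1), which is an assumed example, and compares it to \(\lVert {\phi }\rVert _{2}\), anticipating the discussion of Cossock and Zhang (2008) below.

```python
import numpy as np

def C_phi(phi, n, p):
    """C_phi(p) of Eq. (10): l^p norm of the gaps phi(i) - phi(n - i + 1)
    over the first floor(n/2) ranks (ranks are 1-based)."""
    i = np.arange(1, n // 2 + 1)
    gaps = phi(i) - phi(n - i + 1)
    return np.sum(gaps ** p) ** (1.0 / p)

# DCG-style discount phi(i) = 1 / log2(i + 1), an illustrative GPPM choice.
dcg_discount = lambda i: 1.0 / np.log2(i + 1)

n = 10
print(C_phi(dcg_discount, n, p=2))                          # ~ 0.81
print(np.sum(dcg_discount(np.arange(1, n + 1)) ** 2) ** 0.5)  # ||phi||_2 ~ 1.58
```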

Table 2 details the different examples of Bregman divergences, pointwise losses and pairwise losses satisfying the surrogate regret bound (9), by giving the constant c. The methods for achieving such bounds are detailed in the remainder of the section: Theorem 5.2 for pointwise losses, Theorem 5.3 for Bregman divergences and Theorem 5.4 for pairwise losses. The proofs ensuring that the surrogate losses given in Table 2 satisfy the assumptions of the corresponding theorems are given in Appendix B.

The differences in the constant factor c come from the fact that it represents a scaling factor between the surrogate loss and the expected utilities; the magnitude of the loss may indeed vary considerably from one loss to another. Furthermore, the bounds for the pointwise Square Hinge and the pointwise Differentiable Hinge depend respectively on t and α: these parameters control the range within which the optimal scores vary, and hence the scaling between the optimal scores and the expected utilities.

Notice that \(C_{\phi }(p)\) is generally strictly lower than \(\lVert {\phi }\rVert _{p}\); thus, for the pointwise Squared Error, our approach allows us to obtain a slightly better bound than in Cossock and Zhang (2008, Theorem 2). The regret bound for the pointwise Squared Error is a crucial result, since it helps to obtain the regret bounds on Bregman divergences. This explains why, applying the method of Ravikumar et al. (2011, Theorem 10) for Bregman divergences in our Theorem 5.3, we also reach a slightly better bound than theirs. Finally, for pairwise losses, to the best of our knowledge, no bound had previously been proposed.

5.2 General results to derive regret bounds

We now describe the methods used to establish the results of Table 2. The main argument is to combine a lower bound on the surrogate regret with the upper bound on the performance regret given by Lemma 4.1. We always use the same upper bound on the performance regret deduced from Lemma 4.1, so we state it explicitly in the following lemma; the remaining work then consists in lower-bounding the surrogate regret.

Lemma 5.1

Let r be a (u,ϕ)-GPPM, \({\scriptstyle \triangle }\in {\Delta }\subset \mathcal {D}_{r}\), and \(C_{{\bf s}}\subset \left \{ 1..n \right \}^{2}\) given by Lemma 4.1. For p,q>1 such that \(\frac{1}{p} + \frac{1}{q} = 1\), if we denote
$$ C_{\phi }(p) = \Biggl(\sum _{i=1}^{ \lfloor\frac {n}{2} \rfloor} \bigl({\phi }(i) - {\phi }(n-i+1) \bigr)^p \Biggr)^{\frac{1}{p}} $$
(10)
then for any \({\bf s}\in \mathbb{R}^{n}\), we have
$$ { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .( {\scriptstyle \triangle },{\bf s}\right .)} \le C_{\phi }(p) \biggl(\sum_{(i,j) \in C_{\bf s}} \bigl(U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\! \bigr)^q \biggr)^{\frac{1}{q}} $$

The proof can be found in Appendix B.3.

We first treat the case of pointwise losses, then Bregman divergences and finally pairwise losses.

Specific order-preserving pointwise losses

The case of the pointwise template loss (3) is clearly the easiest: in a pointwise loss, the dimensions are independent of each other. Lemma 4.1 breaks the dependencies into a set of non-overlapping pairs of items, which allows us to link the performance regret and the regret of a pointwise loss more easily; we can then study the surrogate regret over independent pairs of indexes only. We first define the optimal value of a surrogate loss w.r.t. a pair of indexes (i,j), and the near-optimal value given that the corresponding items are misordered, as follows:
$$ H_{ij}(\boldsymbol {u}, {\scriptstyle \triangle }) = { { {\underline {\varLambda ^{u_{i}}}}}\left ({\scriptstyle \triangle }\right )} + { { {\underline {\varLambda ^{u_{j}}}}}\left ({\scriptstyle \triangle }\right )} $$
$$ H^-_{ij}(\boldsymbol {u}, {\scriptstyle \triangle }) = \inf _{\substack{{\bf s}_i, {\bf s}_j \in \mathbb{R}\\ ({\bf s}_i - {\bf s}_j)(U_{i}\left ({\scriptscriptstyle \triangle }\right ) - U_{j}\left ({\scriptscriptstyle \triangle }\right )) \le0}} { \varLambda ^{u_{i}}({\scriptstyle \triangle }, s_i)} + { \varLambda ^{u_{j}}({\scriptstyle \triangle }, s_j)} $$
where \({ \varLambda ^{u_{i}}({\scriptstyle \triangle }, s)} = \int_{\mathcal {Y}}{ \lambda (u_{i}({\bf y}), s)}d{\scriptstyle \triangle }({\bf y})\) with λ defined in (3) and \({ { {\underline {\varLambda ^{u_{i}}}}}\left ({\scriptstyle \triangle }\right )}\) is its infimum over s, i.e. \({ { {\underline {\varLambda ^{u_{i}}}}}\left ({\scriptstyle \triangle }\right )} = \inf_{s \in \mathbb{R}}{ \varLambda ^{u_{i}}({\scriptstyle \triangle }, s)}\). In order to link \(H^{-}_{ij}\) and H ij with the bound of the performance regret, we will use the assumption given by (11) below and verify that this assumption is met for natural instances of λ.
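
As a sanity check (an illustration of ours, not in the paper), for the pointwise Squared Error the gap \(H^{-}_{ij} - H_{ij}\) can be computed by brute force: it equals \((U_{i}-U_{j})^2/2\), which matches the constant c=√2 of Table 2 with q=2 in assumption (11) below.

```python
import numpy as np

# For the Squared Error, Lambda^{u_i}(tri, s) minus its infimum is (U_i - s)^2,
# so H^-_{ij} - H_{ij} is the infimum of (U_i - s_i)^2 + (U_j - s_j)^2 over
# misordered scores (s_i <= s_j when U_i > U_j).
Ui, Uj = 3.0, 1.0                        # illustrative expected utilities, Ui > Uj
grid = np.linspace(-5, 5, 1001)
si, sj = np.meshgrid(grid, grid, indexing="ij")
penalty = (Ui - si) ** 2 + (Uj - sj) ** 2
misordered = si <= sj                    # scores contradicting U_i > U_j
h_gap = penalty[misordered].min()

print(h_gap)                                 # 2.0, i.e. (Ui - Uj)^2 / 2
print(abs(Ui - Uj) ** 2 / np.sqrt(2) ** 2)   # RHS of (11) with c = sqrt(2), q = 2
```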

Theorem 5.2

Let r be a (u,ϕ)-GPPM and ℓ a pointwise template loss. If there exist c>0 and q≥1 such that for any \({\scriptstyle \triangle }\in \mathcal {D}_{r, \ell }\),

$$ c^q \bigl (H^-_{ij}(\boldsymbol {u}, {\scriptstyle \triangle }) - H_{ij}(\boldsymbol {u}, {\scriptstyle \triangle })\bigr ) \ge \bigl |U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\bigr |^q $$
(11)
Then, for any distribution P on \(\mathcal {X}\times \mathcal {Y}\) of type \(\mathcal {D}_{\ell ,r}\) such that \({\overline {\mathcal {R}}}(P)<+\infty\) and \({\underline {\mathcal {L}}}(P)<\infty\), we have, for any measurable scoring function \({\bf f}\):
$$ {\overline {\mathcal {R}}}(P)- \mathcal {R}(P, {\bf f}) \leq c C_{\phi }(p) \bigl( \mathcal {L}^{\boldsymbol {u}}(P, {\bf f})- {\underline {\mathcal {L}^{\boldsymbol {u}}}}(P) \bigr)^{\frac{1}{q}} $$
where p≥1 satisfies \(\frac{1}{p} + \frac{1}{q}= 1\) and C ϕ (p) is defined in (10).

Proof

We consider \(C_{{\bf s}}\subset \left \{ 1..n \right \}^{2}\) as defined in Lemma 4.1 with ν=id: by symmetry of the problem, we can assume without loss of generality that the expected utilities are already ordered. Consequently, for any \((i,j) \in C_{{\bf s}}\) we have i<j. Lemma 5.1 and (11) give
$$\begin{aligned} { {\overline {R}}\left ({\scriptstyle \triangle }\right )} - { R\left .({\scriptstyle \triangle },{\bf s}\right .)} \le & C_{\phi }(p) \biggl(\sum_{(i,j) \in C_{\bf s}} \bigl| U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\bigr|^q \biggr)^{\frac{1}{q}} \\ \le & C_{\phi }(p) \biggl(c^q\sum _{(i,j) \in C_{\bf s}} H^-_{ij}(\boldsymbol {u}, {\scriptstyle \triangle }) - H_{ij}(\boldsymbol {u}, {\scriptstyle \triangle }) \biggr)^{\frac{1}{q}} \end{aligned}$$
Now, we denote \(\overline{C_{{\bf s}}} = \left \{ i \in \left \{ 1..n \right \} \mid \exists j / (i,j) \in C_{{\bf s}}\mbox{ or } (j,i) \in C_{{\bf s}}\right \}\) and \(S_{\boldsymbol {u}}(C_{{\bf s}}) =\{{\bf s}^{{\scriptscriptstyle \prime }} \in \mathbb{R}^{n} \mid\forall(i,j) \in C_{{\bf s}}, s^{{\scriptscriptstyle \prime }}_{i} \le s^{{\scriptscriptstyle \prime }}_{j}\}\). Since \({\bf s}\in S_{\boldsymbol {u}}(C_{{\bf s}})\), we have
$$\begin{aligned} { L^{\boldsymbol {u}}\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline { L^{\boldsymbol {u}}}}\left ( {\scriptstyle \triangle }\right )} \ge& \inf_{{\bf s}^{'} \in S_{\boldsymbol {u}}(C_{\bf s})} { L^{\boldsymbol {u}}\left ({\scriptstyle \triangle }, {\bf s}^{{\scriptscriptstyle \prime }}\right )} - { {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} \\ = & \inf_{{\bf s}^{'} \in S_{\boldsymbol {u}}(C_{\bf s})} \sum_{i=1}^n \bigl({ \varLambda ^{u_{i}}({\scriptstyle \triangle }, s^{{\scriptscriptstyle \prime }}_i)} - { { {\underline {\varLambda ^{u_{i}}}}}\left ({\scriptstyle \triangle }\right )} \bigr) \\ = & \inf_{{\bf s}^{'} \in S_{\boldsymbol {u}}(C_{\bf s})} \biggl[ \sum_{(i,j) \in C_{\bf s}} \bigl({ \varLambda ^{u_{i}}({\scriptstyle \triangle }, s^{{\scriptscriptstyle \prime }}_i)} + { \varLambda ^{u_{j}}({\scriptstyle \triangle }, s^{{\scriptscriptstyle \prime }}_j)} \bigr) + \sum_{k \notin\overline{C_{\bf s}}} { \varLambda ^{u_{k}}({\scriptstyle \triangle }, s^{{\scriptscriptstyle \prime }}_k)} \biggr] \\ & - \sum_{i = 1}^n { { {\underline {\varLambda ^{ u_{i}}}}}\left ({\scriptstyle \triangle }\right )} \\ = & \sum_{(i,j) \in C_{\bf s}} \Bigl[ \inf_{s'_i \le s'_j} \bigl({ \varLambda ^{u_{i}}({\scriptstyle \triangle }, s^{{\scriptscriptstyle \prime }}_i)} + { \varLambda ^{u_{j}}({\scriptstyle \triangle }, s^{{\scriptscriptstyle \prime }}_j)} \bigr) - { { {\underline {\varLambda ^{u_{i}}}}}\left ({\scriptstyle \triangle }\right )} - { { {\underline {\varLambda ^{u_{j}}}}}\left ({\scriptstyle \triangle }\right )} \Bigr] \\ & + \sum_{k \notin\overline{C_{\bf s}}} { { {\underline {\varLambda ^{u_{k}}}}}\left ( {\scriptstyle \triangle }\right )} - \sum_{k \notin\overline{C_{\bf s}}} { { {\underline {\varLambda ^{u_{k}}}}}\left ( {\scriptstyle \triangle }\right )} \\ = & \sum_{(i,j) \in C_{\bf s}} \bigl( H^-_{ij}(\boldsymbol {u}, {\scriptstyle \triangle }) - H_{ij}(\boldsymbol {u}, {\scriptstyle \triangle }) \bigr) \end{aligned}$$
The inversion of the infimum and the sum is possible because of the independence of the pairs in \(C_{{\bf s}}\). Combining both inequalities gives the bound on the inner regret. The bound on the regret is then deduced using (Steinwart 2007, Theorems 3.2 and 2.13). □

Bregman divergence

Since we have a bound for the pointwise Squared Error, we can apply a method similar to Ravikumar et al. (2011, Theorem 10) to obtain regret bounds for losses deriving from a Bregman divergence, like those of (5). Moreover, the pointwise Logistic and Exponential losses can be rewritten as Bregman divergences, which gives another way to obtain their corresponding bounds. We thus use Lemma 5.1 to extend this theorem to the case of (u,ϕ)-GPPMs, under almost the same conditions, such as strong convexity of the function ψ that generates the Bregman divergence. A function f is called μ-strongly convex if and only if for any x,y in its domain and t∈[0,1]
$$ f\bigl(tx + (1-t)y\bigr) \le tf(x) + (1-t)f(y) - \frac{\mu}{2}t(1-t)\lVert x-y\rVert _2^2 $$
(12)
So, if ψ is μ-strongly convex, we have \(B_{\psi}(u \| v) \ge \frac{\mu}{2} \lVert u - v\rVert _{2}^{2}\).
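
A quick numerical check of this inequality (our illustration, not from the paper; the choice \(\psi(x)=\frac{1}{2}\lVert x\rVert^2 + \sum_k x_k^4\), which is 1-strongly convex, is an assumption made for the example):

```python
import numpy as np

# Check the claim below Eq. (12): if psi is mu-strongly convex, then
# B_psi(u || v) >= (mu/2) ||u - v||_2^2.
# Illustrative psi(x) = 0.5 ||x||^2 + sum(x^4): its Hessian dominates the
# identity, so psi is 1-strongly convex (mu = 1).
psi = lambda x: 0.5 * np.dot(x, x) + np.sum(x ** 4)
grad_psi = lambda x: x + 4 * x ** 3

def bregman(u, v):
    """B_psi(u || v) = psi(u) - psi(v) - <grad psi(v), u - v>, as in (5)."""
    return psi(u) - psi(v) - np.dot(grad_psi(v), u - v)

rng = np.random.default_rng(0)
mu = 1.0
for _ in range(5):
    u, v = rng.normal(size=4), rng.normal(size=4)
    print(bregman(u, v) >= 0.5 * mu * np.dot(u - v, u - v))  # always True
```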

Theorem 5.3

Let r be a (u,ϕ)-GPPM, \(\psi: \varGamma_{\psi}\rightarrow \mathbb{R}\) a \(\mathcal{C}^{1}\), μ-strongly convex function, and \(g : \mathbb{R}^{n} \rightarrow \mathbb{R}^{n}\) an invertible map such that \(\varGamma_{\psi}= g(\mathbb{R}^{n})\) and, for any i,j, \(s_i<s_j \Rightarrow g_i({\bf s})<g_j({\bf s})\). For a scoring loss ℓ defined as in (5), we have
$$ {\overline {\mathcal {R}}}(P)- \mathcal {R}(P, {\bf f}) \leq \frac{2C_{\phi }(2)}{\sqrt{\mu}} \sqrt{ \mathcal {L}^{\boldsymbol {u}}(P, {\bf f})- {\underline { \mathcal {L}^{ \boldsymbol {u}}}}(P)} $$

Proof

We consider \(C_{{\bf s}}\subset \left \{ 1..n \right \}^{2}\) as defined in Lemma 4.1 with ν=id, without loss of generality. We first derive a bound on the suboptimality of \(\ell^{\boldsymbol{u}}\) from strong convexity. Since \({ {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} = 0\), we have:
$$\begin{aligned} { L^{\boldsymbol {u}}\left ({\scriptstyle \triangle },{\bf s}'\right )} - { {\underline {L^{ \boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} = & \int_\mathcal {Y}B_\psi\bigl( \boldsymbol {u}({\bf y})\| g\bigl({\bf s}'\bigr)\bigr) {\rm d}{\scriptstyle \triangle }({\bf y}) \\ \ge& \frac{\mu}{2} \int_\mathcal {Y}\bigl\Vert \boldsymbol {u}({\bf y}) - g \bigl({\bf s}'\bigr)\bigr\Vert_2^2 \,{\rm d}{\scriptstyle \triangle }({\bf y}) \\ \ge& \frac{\mu}{2} \bigl\Vert \boldsymbol {U}({\scriptstyle \triangle }) - g\bigl( {\bf s}'\bigr)\bigr\Vert_2^2 \end{aligned}$$
The first inequality comes from the strong convexity of ψ, while the second comes from the convexity of the squared 2-norm (Jensen's inequality). Now, let us denote \(S_{\boldsymbol {u}}(C_{{\bf s}}) = \{{\bf s}^{{\scriptscriptstyle \prime }} \in \mathbb{R}^{n} \mid\forall (i,j) \in C_{{\bf s}}, s^{{\scriptscriptstyle \prime }}_{i} \le s^{{\scriptscriptstyle \prime }}_{j}\}\). Since \({\bf s}\in S_{\boldsymbol {u}}(C_{{\bf s}})\), using the above inequality we obtain
$$\begin{aligned} { L^{\boldsymbol {u}}\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline {L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} \ge& \inf_{{\bf s}' \in S_{\boldsymbol {u}}(C_{\bf s})}{ L^{\boldsymbol {u}}\left ({\scriptstyle \triangle }, {\bf s}'\right )} - { {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} \\ \ge& \frac{\mu}{2}\inf_{{\bf s}' \in S_{\boldsymbol {u}}(C_{\bf s})} \bigl\Vert \boldsymbol {U}( {\scriptstyle \triangle }) - g\bigl({\bf s}'\bigr)\bigr\Vert_2^2 \end{aligned}$$
which is actually equal to the Squared Error regret evaluated at \(g({\bf s}')\). Combining with the regret bound for the Squared Error (see Table 2) gives the result. □

Specific order-preserving pairwise losses

In this section, we study the popular family of pairwise losses (see (4)) through two sub-families. We propose the first one to overcome the inconsistency of the classic pairwise hinge loss brought to light by Duchi et al. (2010). The second one is simply a mean squared error on the pairs of indexes.

When optimized, pairwise surrogate losses couple the different dimensions of the predicted score vector in complex ways, so it is not immediate to benefit from the independence given by the bound of Lemma 4.1. For pairwise surrogate losses, the main idea of the method is to treat them as pointwise losses on pairs of items with some additional constraints, and then to compare the optima of the loss with and without the constraints.

The notations \(\varLambda ^{u_{i},u_{j}}\) and \({ {\underline {\varLambda ^{u_{i},u_{j}}}}}\) are defined similarly to those for pointwise losses. We introduce the following set of constraints, which imposes that a solution is equivalent to a score for each item to rank:
$$ D= \left \{ d \in \mathbb{R}^n \times \mathbb{R}^n / \forall i,j,k \in \left \{ 1..n \right \}, d_{ij} = d_{ik} + d_{kj} \right \} $$
With this set of constraints D, the following theorem reduces the conditions on the pairwise surrogate loss to a condition on the Bayes risk and a pointwise condition w.r.t. the pairs (i,j) of items.
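
As a small sanity check (an illustration of ours, not from the paper), any d built from score differences \(d_{ij} = s_i - s_j\) satisfies the additivity constraint defining D:

```python
import numpy as np

# d is "score-induced" iff d_ij = d_ik + d_kj for all triples (i, j, k);
# differences of a single score vector satisfy this by construction.
s = np.array([1.3, -0.4, 2.1, 0.0])     # illustrative scores
d = s[:, None] - s[None, :]             # d[i, j] = s_i - s_j

n = len(s)
ok = all(np.isclose(d[i, j], d[i, k] + d[k, j])
         for i in range(n) for j in range(n) for k in range(n))
print(ok)  # True: d lies in D
```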

Theorem 5.4

Let r be a (u,ϕ)-GPPM and ℓ a template pairwise scoring loss as described in (4). For any \({\scriptstyle \triangle }\in \mathcal {D}_{r, \ell }\) and \({\bf s}\in \mathbb{R}^{n}\), if \(\ell^{\boldsymbol{u}}\) satisfies:
  1. \({ {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} = \inf_{{\bf d}\in D} \sum_{i<j} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} = \inf_{{\bf d}\in \mathbb{R}^{n} \times \mathbb{R}^{n}} \sum_{i<j} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})}\)
  2. There exist c>0 and q≥1 such that
    $$ \inf_{d_{ij} \le0} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} - \inf_{d_{ij} \in \mathbb{R}} { \varLambda ^{u_{i}, u_{j}}({\scriptstyle \triangle }, d_{ij})} \ge \frac{1}{c^q} \lvert U_{i}\left ( {\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\rvert ^q $$
Then, for any distribution P on \(\mathcal {X}\times \mathcal {Y}\) of type \(\mathcal {D}_{\ell ,r}\) such that \({\overline {\mathcal {R}}}(P)<+\infty\) and \({\underline {\mathcal {L}}}(P)<\infty\), we have, for any measurable scoring function \({\bf f}\):
$$ {\overline {\mathcal {R}}}(P)- \mathcal {R}(P, {\bf f}) \leq c C_{\phi }(p) ( \mathcal {L}^{\boldsymbol {u}}(P, {\bf f})-{\underline {\mathcal {L}^{ \boldsymbol {u}}}}(P) )^{\frac{1}{q}} $$
(13)
where p≥1 satisfies \(\frac{1}{p} + \frac{1}{q} = 1\) and C ϕ (p) is defined in (10).

Proof

For \(C_{{\bf s}}\) defined as in Lemma 4.1, we denote
$$ S_{\boldsymbol {u}}(C_{\bf s}) = \left \{ {\bf s}^{{\scriptscriptstyle \prime }} \in \mathbb{R}^n \mid\forall (i,j) \in C_{\bf s}, s_i \le s_j \right \} , $$
and
$$ \displaystyle {\varGamma (C_{\bf s})} = \left \{ d \in \mathbb{R}^n \times \mathbb{R}^n | \forall(i,j) \in C_{\bf s}, d_{ij} \le0 \right \} . $$
Since \({\bf s}\in S_{\boldsymbol {u}}(C_{{\bf s}})\) then we have
$$\begin{aligned} { L^{\boldsymbol {u}}\left ({\scriptstyle \triangle },{\bf s}\right )} - { {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} \ge& \inf_{{\bf s}^{'} \in S_{\boldsymbol {u}}(C_{\bf s})} { L^{\boldsymbol {u}}\left ({\scriptstyle \triangle }, {\bf s}^{{\scriptscriptstyle \prime }}\right )} - { {\underline { L^{\boldsymbol {u}}}}\left ({\scriptstyle \triangle }\right )} \\ \ge& \inf_{{\bf d}\in {\varGamma (C_{\bf s})}} \sum_{i<j} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} - \inf _{{\bf d}\in \mathbb{R}^n \times \mathbb{R}^n} \sum_{i<j} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} \\ \ge& \inf_{{\bf d}\in {\varGamma (C_{\bf s})}} \sum_{\substack{i<j\\(i,j) \in C_{\bf s}}} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} + \sum _{\substack{i<j\\(i,j) \notin C_{\bf s}}} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} \\ & - \inf_{{\bf d}\in \mathbb{R}^n \times \mathbb{R}^n} \sum_{i<j} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} \\ \ge& \sum_{\substack{i<j\\(i,j) \in C_{\bf s}}} \inf_{d_{ij} \le0} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} - \inf _{d_{ij} \in \mathbb{R}} { \varLambda ^{u_{i},u_{j}}({\scriptstyle \triangle }, d_{ij})} \\ \ge& \frac{1}{c^q} \sum_{\substack{i<j\\(i,j) \in C_{\bf s}}} \lvert U_{i}\left ({\scriptstyle \triangle }\right ) - U_{j}\left ({\scriptstyle \triangle }\right )\rvert ^q \end{aligned}$$
Then, applying Lemma 5.1 as in the proof of Theorem 5.2 and plugging this inequality into the bound on the performance inner regret yields the bound on the inner regret. The bound on the regret is deduced using (Steinwart 2007, Theorems 3.2 and 2.13). □

6 Discussion and related work

In this section, we discuss the most closely related works, and then summarize our results and discuss some of their practical implications.

Surrogate regret bounds for learning to rank

Calibration and uniform calibration have been extensively studied in (cost-sensitive) binary classification (see e.g. Bartlett and Jordan 2006; Zhang 2004; Steinwart 2007; Scott 2011) and multiclass classification (Zhang 2004; Tewari and Bartlett 2007). In the context of learning to rank, the calibration of surrogate losses has previously been studied by Cossock and Zhang (2008), Duchi et al. (2010), and Ravikumar et al. (2011). Cossock and Zhang (2008) proved the calibration of some variants of regression losses based on the mean squared error with respect to the DCG, and proved the first surrogate regret bound for ranking. In Ravikumar et al. (2011), the authors generalize this work to obtain the calibration of losses based on Bregman divergences (which include the squared error loss) with respect to the (N)DCG, and provide surrogate regret bounds for this class of surrogate losses. In this paper, we extend the work of Ravikumar et al. (2011) in several ways. First, we consider a wider class of ranking performance measures, the GPPMs, essentially by noticing that it is not necessary to restrict the supervision to relevance judgments. Second, we consider a much larger class of surrogate losses (the order-preserving ones), which, in particular, are not constrained to have a unique minimizer. Relaxing these two assumptions, we obtain a new and general result on the existence of surrogate regret bounds for any loss calibrated with respect to a GPPM when the supervision space is finite, through the equivalence of calibration and uniform calibration for GPPMs (Corollary 4.3). Furthermore, our deeper study of the performance measures (Lemma 4.1) allows us to prove both slightly better regret bounds than Cossock and Zhang (2008) and Ravikumar et al. (2011) for mean squared regression and for Bregman divergences, as well as new regret bounds for other forms of surrogate losses such as pairwise losses or pointwise losses that do not have a unique minimizer (Sect. 5). While all these works studied the DCG, Dembczynski et al. (2012) proved regret bounds for pointwise losses for the special case of the AUC metric. The pointwise losses they consider are similar to the ones we consider in Sect. 5 (the difference is that in their work, the value of η in these losses is not constant). While our proof technique could be adapted to their specific loss, the bounds we prove are more general since they apply to a larger variety of losses and different performance measures.

We may note here that surrogate regret bounds have also been studied in another context of learning to rank, namely instance ranking (Clemençon et al. 2005; Kotlowski et al. 2011). Instance ranking, of which bipartite ranking is the best-known example (the case with binary relevance judgments), is a framework where the prediction task is to order a single set (the sample space itself), and learning is carried out based on an i.i.d. sample from this set. In contrast, in the task we consider here, the goal is to predict the ordering of a finite set for each instance, and learning is carried out using an i.i.d. sample of such instances with a supervision that indicates how to rank the finite set given this instance. The evaluation measures for instance ranking are usually the Area Under the ROC Curve, or more generally linear rank statistics (Clémençon and Vayatis 2007), which are similar in nature to what we call GPPMs. However, since the underlying sampling assumptions differ between instance ranking and the framework we consider here, all the notions of inner risks are different, and the analyses carried out in one framework do not apply to the other.

Fitting utility values

When the supervision takes the form of relevance scores on a discrete scale (as is usual in search engine applications), it may be natural to simply try to fit them, for instance using classification or ordinal regression approaches. In the presence of noise, however, our results show that one should not try to predict the value of the label, but rather its corresponding utility. More precisely, one should learn to rank according to the expected value of the utility; fitting the expected value of the utilities, for instance by minimizing the squared error, leads to a calibrated formulation, but it is only a special case of what one can do: in general, applying any order-preserving template loss is valid. Considering that many performance measures are GPPMs (for instance the (N)DCG, the precision-at-k, the recall-at-k, the AUC, or Spearman's rank correlation coefficient; see Table 1), our results allow us to provide template calibrated surrogate losses that can be easily instantiated for each of these measures (Sects. 2 and 3), as sketched below.
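
As a toy illustration of this score-and-sort recipe (ours, not from the paper; the binary utility u(y)=y and the probabilities below are illustrative assumptions), ranking by expected utility reduces to a sort:

```python
import numpy as np

# With binary relevance y in {0, 1} and utility u(y) = y, the calibrated
# target score for item i under label noise is its expected utility
# U_i = P(y_i = 1); the predicted ranking is then obtained by sorting.
p_relevant = np.array([0.9, 0.2, 0.55, 0.7])  # per-item P(y = 1), assumed known
expected_utility = p_relevant                  # U_i = E[u(y_i)] = P(y_i = 1)

ranking = np.argsort(-expected_utility)        # best item first
print(ranking)                                 # [0 3 2 1]
```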

Another important result we obtain in the paper is the non-calibration, for the AP and the ERR, of any surrogate loss that tries to reproduce the order given by the relevance judgments (Theorem 3.3). Importantly, the non-calibration holds for any utility function that one can associate to these metrics. Despite the importance of these measures in search engine evaluations, our result thus proves that many common surrogate losses used in learning to rank algorithms are not AP- or ERR-calibrated.

Consequently, the exact form of the supervision we have for the problem at hand (which may be relevance judgments, a preference relation, or total orders) does not dictate the kind of algorithm we should use. Spearman's correlation coefficient (see Table 1), which takes total orders as supervision, is actually a GPPM, and thus any template order-preserving loss can be calibrated with respect to it. This contrasts with the case of the ERR or the AP, with respect to which no order-preserving loss is calibrated, even though these performance measures take real-valued relevance judgments as supervision.

Pairwise losses

As already mentioned in Remark 2, a traditional approach to learning to rank is to use pairwise-comparison-based losses, as in Ranking SVMs or RankBoost (Joachims 2002; Freund et al. 2003; Cao et al. 2006). To take a concrete example, consider the case where the supervision is a vector of relevance judgments. The idea of pairwise-comparison-based losses is then to take a loss of the form \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i,j} {\bf 1}_{\left \{ v_{i}>v_{j} \right \}}\varphi(s_{i} - s_{j})\) (\(\bf v\) here takes the place of the supervision, or any monotonic transform of it). The motivation of these approaches is that only the relative ordering between any two items matters for ranking, and thus it is somewhat natural to only consider the relative ordering given by the supervision for learning. However, such losses are not order-preserving when φ is convex (see Remark 2; this result is actually a direct consequence of the non-calibration result of Duchi et al. (2010)), and they are consequently not calibrated with respect to any GPPM. This is why in this work we propose the alternative formulation \({ \ell \left ({\bf v}, {\bf s}\right )} = \sum_{i<j} ( v_{i}\varphi(s_{i} - s_{j}) + v_{j}\varphi(s_{j}-s_{i}) )\), which is convex when the values of v_i are non-negative and φ is convex, and which, as we show in Sect. 5, is also order-preserving. Consequently, this alternative formulation provides a template loss whose instances are calibrated with respect to any GPPM. Notice that from a computational perspective, the two losses (the initial formulation and the alternative we propose here) are comparable, and we thus strongly encourage practitioners to consider the alternative formulation in practice; a small sketch contrasting the two follows.
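
A minimal sketch (ours, not the authors' code) of the two pairwise templates discussed above, with the convex logistic \(\varphi(t)=\log(1+e^{-t})\); the toy vectors v and s are illustrative:

```python
import numpy as np

# classic:  sum over ordered pairs of 1{v_i > v_j} phi(s_i - s_j)
#           (not order-preserving for convex phi, per Duchi et al. 2010)
# proposed: sum_{i<j} v_i phi(s_i - s_j) + v_j phi(s_j - s_i)
#           (order-preserving when v >= 0, hence GPPM-calibrated)
phi = lambda t: np.log1p(np.exp(-t))   # logistic surrogate

def classic_pairwise(v, s):
    n = len(v)
    return sum(phi(s[i] - s[j]) for i in range(n) for j in range(n) if v[i] > v[j])

def proposed_pairwise(v, s):
    n = len(v)
    return sum(v[i] * phi(s[i] - s[j]) + v[j] * phi(s[j] - s[i])
               for i in range(n) for j in range(i + 1, n))

v = np.array([2.0, 1.0, 0.0])    # non-negative utilities (illustrative)
s = np.array([1.5, 0.3, -0.9])   # predicted scores (illustrative)
print(classic_pairwise(v, s), proposed_pairwise(v, s))
```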

Limitations of scoring approaches for ranking?

The difficulty of designing (convex) surrogate formulations for the score-and-sort approach to ranking has previously been addressed in Duchi et al. (2010), where the authors show that a number of existing surrogate losses are not calibrated with respect to the pairwise disagreement, a performance measure used when the supervision contains arbitrary pairwise preferences and which counts the number of pairs of items for which the predicted ordering does not match the supervision. Duchi et al. (2010) also conjecture that no convex loss of the scores can be calibrated with respect to the pairwise disagreement. In this work, we prove additional results concerning the possible limitations of scoring approaches: no order-preserving loss can be calibrated with respect to the AP or the ERR in general (Theorem 3.3). While this suggests that approaches other than scoring may be useful for these evaluation measures, it also gives new insights into the intrinsic limitations of scoring approaches (in particular regression approaches) for information retrieval.

7 Conclusion

Calibration, uniform calibration and surrogate regret bounds are crucial tools for assessing the quality of surrogate losses. We have proposed an in-depth study of the calibration of order-preserving losses with respect to GPPMs.

A large body of work remains to be done in learning to rank. As Duchi et al. (2010) pointed out, learning from pairwise preferences is still an open issue unless strong assumptions are made on the preference relations that we may have as supervision. Closer to our work, designing losses with better regret bounds for GPPMs with a cutoff (i.e. ϕ(i)=0 for i>k and k≪n), as in Cossock and Zhang (2008), but without any strong prior knowledge on which items should be ranked first, while keeping easy-to-optimize surrogate losses, remains critical in many applications and mostly an open problem.

Footnotes

  1. The rearrangement inequality states that for any real numbers x 1≥⋯≥x n ≥0 and y 1≥⋯≥y n , and for any permutation \(\sigma\in {\mathfrak {S}_{n}}\), we have \(x_1 y_{\sigma(1)}+\cdots+x_n y_{\sigma(n)} \leq x_1 y_1+\cdots+x_n y_n\) (the dot product is maximized by pairing greater x i s with greater y i s). Moreover, if the x i s are strictly decreasing, then equality holds if and only if \(y_{\sigma(1)}\geq\cdots\geq y_{\sigma(n)}\).

  2. There is no loss in generality here. If u does not satisfy this assumption, we can simply enlarge \(\mathcal {Y}\) and extend u so that the assumption is satisfied. We then show that the equivalence between calibration and uniform calibration holds on a larger space of distributions.

  3. We simply have to identify the set of △ that have the same expected value under u and assimilate this set to the real value. The only thing to verify is that \(\Delta=\left \{ {\scriptstyle \triangle }\in \mathcal {D}: \lVert \boldsymbol {U}({\scriptstyle \triangle })\rVert _{\infty}\leq B \right \}\) is closed, which is the case under our assumption on u since \(\left \{ \boldsymbol {U}({\scriptstyle \triangle }), {\scriptstyle \triangle }\in\Delta \right \} = [\min K, \min(\sup K, B) ]^{n}\).

  4. Note that if the regret bound for some loss does not depend on B, as for squared-loss-based scoring losses, then we can directly obtain a regret bound using Jensen's inequality.

Acknowledgements

This work was partially funded by the French DGA, as well as the French Government and Région île de France through the FUI project OpenWay III. The authors thank the anonymous reviewers for their helpful comments and suggestions.

References

  1. Banerjee, A., Guo, X., & Wang, H. (2005). On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7), 2664–2669.
  2. Bartlett, P., & Jordan, M. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 138–156.
  3. Buffoni, D., Calauzènes, C., Gallinari, P., & Usunier, N. (2011). Learning scoring functions with order-preserving losses and standardized supervision. In Proceedings of the International Conference on Machine Learning (pp. 825–832).
  4. Burges, C. J. C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. N. (2005). Learning to rank using gradient descent. In Proceedings of the International Conference on Machine Learning (pp. 89–96).
  5. Cambazoglu, B. B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., & Degenhardt, J. (2010). Early exit optimizations for additive machine learned ranking systems. In Proceedings of the ACM International Conference on Web Search and Data Mining (pp. 411–420).
  6. Cao, Y., Xu, J., Liu, T. Y., Li, H., Huang, Y., & Hon, H. W. (2006). Adapting ranking SVM to document retrieval. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 186–193).
  7. Cao, Z., & Liu, T. Y. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the International Conference on Machine Learning (pp. 129–136).
  8. Chakrabarti, S., Khanna, R., Sawant, U., & Bhattacharyya, C. (2008). Structured learning for non-smooth ranking losses. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 88–96).
  9. Chapelle, O., & Chang, Y. (2011). Yahoo! learning to rank challenge overview. Journal of Machine Learning Research, 14, 1–24.
  10. Chapelle, O., Metlzer, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. In Proceedings of the ACM Conference on Information and Knowledge Management (pp. 621–630).
  11. Clemençon, S., Lugosi, G., & Vayatis, N. (2005). Ranking and scoring using empirical risk minimization. In Proceedings of the Conference on Learning Theory (pp. 783–800).
  12. Clémençon, S., & Vayatis, N. (2007). Ranking the best instances. Journal of Machine Learning Research, 8, 2671–2699.
  13. Cohen, W. W., Schapire, R. E., & Singer, Y. (1997). Learning to order things. In Proceedings of Advances in Neural Information Processing Systems (pp. 243–270).
  14. Cossock, D., & Zhang, T. (2008). Statistical analysis of Bayes optimal subset ranking. IEEE Transactions on Information Theory, 54(11), 5140–5154.
  15. Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292.
  16. Dekel, O., Manning, C. D., & Singer, Y. (2003). Log-linear models for label ranking. In Proceedings of Advances in Neural Information Processing Systems.
  17. Dembczynski, K., Kotlowski, W., & Huellermeier, E. (2012). Consistent multilabel ranking through univariate losses. In Proceedings of the International Conference on Machine Learning (pp. 1319–1326).
  18. Duchi, J., Mackey, L. W., & Jordan, M. I. (2010). On the consistency of ranking algorithms. In Proceedings of the International Conference on Machine Learning (pp. 327–334).
  19. Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
  20. Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20, 422–446.
  21. Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 133–142).
  22. Kotlowski, W., Dembczynski, K., & Huellermeier, E. (2011). Bipartite ranking through minimization of univariate loss. In Proceedings of the International Conference on Machine Learning (pp. 1113–1120).
  23. Le, Q. V., & Smola, A. J. (2007). Direct optimization of ranking measures. Technical report, NICTA.
  24. Lee, J. (2003). Introduction to smooth manifolds. Graduate Texts in Mathematics. Berlin: Springer.
  25. Liu, T. Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3, 225–331.
  26. Ravikumar, P. D., Tewari, A., & Yang, E. (2011). On NDCG consistency of listwise ranking methods. Journal of Machine Learning Research - Proceedings Track, 15, 618–626.
  27. Scott, C. (2011). Surrogate losses and regret bounds for cost-sensitive classification with example-dependent costs. In Proceedings of the International Conference on Machine Learning (pp. 153–160).
  28. Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26(2), 225–287.
  29. Tewari, A., & Bartlett, P. (2007). On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 1007–1025.
  30. Vembu, S., & Gärtner, T. (2009). Label ranking algorithms: a survey. In Preference Learning (pp. 1530–1537). Berlin: Springer.
  31. Voorhees, E., & Harman, D. (2005). TREC: experiment and evaluation in information retrieval. Cambridge: MIT Press.
  32. Weston, J., & Watkins, C. (1999). Support vector machines for multi-class pattern recognition. In Proceedings of the European Symposium on Artificial Neural Networks (pp. 219–224).
  33. Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval (pp. 271–278).
  34. Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225–1251.

Copyright information

© The Author(s) 2013

Authors and Affiliations

  • Clément Calauzènes (1), corresponding author
  • Nicolas Usunier (1, 2)
  • Patrick Gallinari (1)

  1. Department of Computer Science (Laboratoire d’Informatique de Paris 6), University Pierre et Marie Curie, Paris, France
  2. Heudiasyc, Université Technologique de Compiègne, Compiègne, France
