Machine Learning, Volume 103, Issue 2, pp 261–283

V-shaped interval insensitive loss for ordinal classification

  • Kostiantyn Antoniuk
  • Vojtěch Franc
  • Václav Hlaváč

Abstract

We address the problem of learning ordinal classifiers from partially annotated examples. We introduce a V-shaped interval-insensitive loss function to measure the discrepancy between predictions of an ordinal classifier and a partial annotation provided in the form of intervals of candidate labels. We show that under reasonable assumptions on the annotation process the Bayes risk of the ordinal classifier can be bounded by the expectation of an associated interval-insensitive loss. We propose several convex surrogates of the interval-insensitive loss which are used to formulate convex learning problems. We describe a variant of the cutting plane method which can solve large instances of the learning problems. Experiments on a real-life application of human age estimation show that an ordinal classifier learned from cheap partially annotated examples can achieve accuracy matching the results of previously used supervised methods which require expensive precisely annotated examples.

Keywords

Ordinal classification · Partially annotated examples · Risk minimization

1 Introduction

The ordinal classification model (also ordinal regression) is used in problems where the set of labels is fully ordered; for example, the label can be an age category (0–9, 10–19, \(\ldots \), 90–99) or a respondent's answer to a certain question (from strongly agree to strongly disagree). Ordinal classifiers are routinely used in social sciences, epidemiology, information retrieval, and computer vision.

Recently, many supervised algorithms have been proposed for discriminative learning of ordinal classifiers. The discriminative methods learn parameters of an ordinal classifier by minimizing a regularized convex proxy of the empirical risk. A Perceptron-like on-line algorithm, PRank, was proposed in Crammer and Singer (2001). A large-margin principle was applied to learning ordinal classifiers in Shashua and Levin (2002). The paper (Chu and Keerthi 2005) proposed the Support Vector Ordinal Regression algorithm with explicit constraints (SVOR-EXP) and the SVOR algorithm with implicit constraints (SVOR-IMC). Unlike Shashua and Levin (2002), the SVOR-EXP and SVOR-IMC guarantee that the learned ordinal classifier is statistically plausible. The same approach was proposed independently by Rennie and Srebro (2005), who introduced the so-called immediate-threshold and all-thresholds loss functions. Minimization of a quadratically regularized immediate-threshold loss and of the all-threshold loss is equivalent to the SVOR-EXP and the SVOR-IMC formulation, respectively. A generic framework proposed in Li and Lin (2006), of which SVOR-EXP and SVOR-IMC are special instances, allows converting learning of the ordinal classifier into learning of a two-class SVM classifier with weighted examples.

Estimating parameters of a probabilistic model by the Maximum Likelihood (ML) method is another paradigm that can be used to learn ordinal classifiers. A plug-in ordinal classifier can then be constructed by substituting the estimated model into the optimal decision rule derived for a particular loss function (see e.g. Dembczyński et al. 2008 for a list of losses and corresponding decision functions suitable for ordinal classification). Parametric probability distributions suitable for modeling ordinal labels have been proposed in McCullagh (1980), Fu and Simpson (2002), Rennie and Srebro (2005). Besides the parametric methods, non-parametric probabilistic approaches like Gaussian processes have also been proposed (Chu and Ghahramani 2005).

Properties of the discriminative and the ML based methods are complementary. The ML approach can be directly applied in the presence of incomplete annotation (e.g. the setting considered in this paper, when a label interval is given instead of a single label) by using expectation–maximization algorithms (Schlesinger 1968; Dempster et al. 1977). However, the ML methods are sensitive to model mis-specification, which complicates their application to modeling complex high-dimensional data. In contrast, the discriminative methods are known to be robust against model mis-specification, while their extension to learning from partial annotations is not trivial. To the best of our knowledge, the existing discriminative approaches for ordinal classification assume precise annotation only, that is, each training instance is annotated by exactly one label.

In this paper, we consider learning of ordinal classifiers from partially annotated examples. We assume that each training input is annotated by an interval of candidate labels rather than a single label. This setting is common in practice. For example, consider the computer vision problem of learning an ordinal classifier predicting age from a facial image (e.g. Ramanathan et al. 2009; Chang et al. 2011). In this case, examples of face images are typically downloaded from the Internet and the age of the depicted persons is estimated by a human annotator. Providing a reliable year-exact age just from a face image is virtually impossible. For humans it is more natural and easier to provide an interval estimate of the age. The interval annotation can also be obtained in an automated way, e.g. by the method of Kotlowski et al. (2008) for removing inconsistencies in imprecisely annotated data.

To deal with the interval annotations, we propose an interval-insensitive loss function which extends an arbitrary (supervised) V-shaped loss to the interval setting. The interval-insensitive loss measures the discrepancy between the interval of candidate labels given in the annotation and a label predicted by the classifier. Our interval-insensitive loss can be seen as the ordinal regression counterpart of the \(\epsilon \)-insensitive loss used in support vector regression (Vapnik 1998). We prove that under reasonable assumptions on the annotation process, the Bayes risk of the ordinal classifier can be bounded by the expectation of the interval-insensitive loss. This bound justifies learning of the ordinal classifier via minimization of an empirical estimate of the interval-insensitive loss. Tightness of the bound depends on two intuitive parameters characterizing the annotation process. We show how to control these parameters in practice by properly designing the annotation process. We propose a convex surrogate of an arbitrary V-shaped interval-insensitive loss which is then used to formulate a convex learning problem. We also show how to modify the existing supervised methods, the SVOR-EXP and the SVOR-IMC algorithms, in order to minimize a convex surrogate of the interval-insensitive loss associated with the 0/1-loss and the mean absolute error (MAE) loss. Finally, we design a variant of a cutting plane algorithm which enables solving large instances of the learning problems efficiently.

Discriminative learning from partially annotated examples has recently been studied in the context of generic multi-class classifiers (Cour et al. 2011), Hidden Markov Chain based classifiers (Do and Artières 2009), generic structured output models (Lou and Hamprecht 2012), multi-instance learning (Jie and Orabona 2010), etc. All these methods translate learning to minimization of a partial loss evaluating the discrepancy between the classifier predictions and partial annotations. The partial loss is defined as the minimal value of a supervised loss (defined on a pair of labels, e.g. the 0/1-loss) over all candidate labels consistent with the partial annotation. Our interval-insensitive loss can be seen as an application of this type of partial loss in the context of ordinal classification. In particular, we analyze partial annotation in the form of intervals of candidate labels and the mean absolute error, which is the most typical target loss in ordinal classification. Bounds on the Bayes risk via the expectation of the partial loss have been studied in Cour et al. (2011), but only for the 0/1-loss, which is much less suitable for ordinal classification. It is worth mentioning that the ordinal classification model allows for tight convex approximations of the partial loss, in contrast to previously considered classification models which often require hard-to-optimize non-convex surrogates (Do and Artières 2009; Lou and Hamprecht 2012; Jie and Orabona 2010).

The paper is organized as follows. Formulation of the learning problem and its solution via minimization of the interval insensitive loss is presented in Sect. 2. Algorithms approximating minimization of the interval-insensitive loss by convex optimization problems are proposed in Sect. 3. A cutting plane based method solving the convex programs is described in Sect. 4. Section 5 presents experimental evaluation and Sect. 6 concludes the paper.

2 Learning ordinal classifier from weakly annotated examples

2.1 Learning from completely annotated examples

Let \(\mathcal{X}\subset \mathbb {R}^n\) be a space of input observations and \(\mathcal{Y}=\{1,\ldots ,Y\}\) a set of hidden labels endowed with a natural order. We consider learning of an ordinal classifier \(h:\mathcal{X}\rightarrow \mathcal{Y}\) of the form
$$\begin{aligned} h({\varvec{x}};\varvec{w},\varvec{\theta }) = 1 + \sum _{k=1}^{Y-1} \llbracket \langle \varvec{x}, \varvec{w}\rangle > \theta _k \rrbracket \, \end{aligned}$$
(1)
where \(\varvec{w}\in \mathbb {R}^n\) and \(\varvec{\theta }\in {\varTheta }=\{\varvec{\theta }'\in \mathbb {R}^{Y-1}\mid \theta _y' \le \theta '_{y+1},\;y=1,\ldots ,Y-2\}\) are admissible parameters. The brackets \(\langle \cdot ,\cdot \rangle \) denote the dot product and the operator \(\llbracket A\rrbracket \) evaluates to 1 if A holds and to 0 otherwise. The classifier (1) splits the real line of projections \(\langle \varvec{x},\varvec{w}\rangle \) into Y consecutive intervals defined by the thresholds \(\theta _1 \le \theta _2 \le \dots \le \theta _{Y-1}\). The observation \(\varvec{x}\) is assigned the label corresponding to the interval into which the projection \(\langle \varvec{w},\varvec{x}\rangle \) falls. The classifier (1) is a suitable model if the label can be thought of as a rough measurement of a continuous random variable \(\xi (\varvec{x})=\langle \varvec{x},\varvec{w}\rangle +\text{ noise }\) (McCullagh 1980). An example of the ordinal classifier applied to a toy 2D problem is depicted in Fig. 1.
Fig. 1

The figure visualizes the division of the 2-dimensional feature space into four classes realized by an instance of the ordinal classifier (1)
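As a concrete illustration, the prediction rule (1) can be sketched in a few lines of Python (NumPy assumed; the function name `predict_ordinal` and the toy parameters are ours, not from the paper):

```python
import numpy as np

def predict_ordinal(x, w, theta):
    """Prediction rule (1): the label is 1 plus the number of
    thresholds strictly exceeded by the projection <x, w>.
    `theta` holds theta_1 <= ... <= theta_{Y-1}."""
    score = float(np.dot(x, w))
    return 1 + int(np.sum(score > np.asarray(theta)))

# Toy 2D problem with Y = 4 classes (three thresholds).
w = np.array([1.0, 0.5])
theta = [0.0, 1.0, 2.0]
```

The score line \(\langle \varvec{x},\varvec{w}\rangle \) is cut into four bins by the three thresholds, so, for instance, a score of 1.5 lands in the third bin.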

There exist several discriminative methods for learning the parameters \((\varvec{w},\varvec{\theta })\) of the classifier (1) from examples, e.g. Crammer and Singer (2001), Shashua and Levin (2002), Chu and Keerthi (2005), Li and Lin (2006). To the best of our knowledge, all the existing methods are fully supervised algorithms requiring a set of completely annotated training examples
$$\begin{aligned} \mathcal{T}^m = \{(\varvec{x}^1,y^1),\ldots ,(\varvec{x}^m,y^m)\}\in (\mathcal{X}\times \mathcal{Y})^m \end{aligned}$$
(2)
typically assumed to be drawn from i.i.d. random variables with some unknown distribution \(p(\varvec{x},y)\). The goal of the supervised learning algorithm is formulated as follows. Given a loss function \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) and the training examples (2), the task is to learn the ordinal classifier h whose Bayes risk
$$\begin{aligned} R(h) = \mathbb E_{(\varvec{x},y)\sim p(\varvec{x},y)} {\varDelta }(y,h(\varvec{x};\varvec{w},\varvec{\theta })) \end{aligned}$$
(3)
is as small as possible. The loss functions most commonly used in practice are the mean absolute error (MAE) \({\varDelta }(y,y')=|y-y'|\) and the 0/1-loss \({\varDelta }(y,y')=\llbracket y\ne y'\rrbracket \). The MAE and the 0/1-loss are instances of so-called V-shaped losses.

Definition 1

(V-shaped loss) A loss \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) is V-shaped if \({\varDelta }(y,y)=0\) and \({\varDelta }(y'',y)\ge {\varDelta }(y',y)\) holds for all triplets \((y,y',y'')\in \mathcal{Y}^3\) such that \(|y''-y| \ge |y'-y|\).

That is, the value of a V-shaped loss grows monotonically with the distance between the predicted and the true label. In this paper we constrain our analysis to the V-shaped losses.
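For small label sets, Definition 1 can be checked mechanically. The following Python sketch (function names are ours) verifies the V-shape property by brute force for the MAE and the 0/1-loss:

```python
def mae(y_pred, y_true):
    """Mean absolute error loss."""
    return abs(y_pred - y_true)

def zero_one(y_pred, y_true):
    """0/1-loss."""
    return 0 if y_pred == y_true else 1

def is_v_shaped(loss, Y):
    """Brute-force check of Definition 1 on labels {1, ..., Y}:
    loss(y, y) = 0 and the loss grows monotonically with the
    distance of the prediction from the true label y."""
    labels = range(1, Y + 1)
    if any(loss(y, y) != 0 for y in labels):
        return False
    return all(loss(y2, y) >= loss(y1, y)
               for y in labels for y1 in labels for y2 in labels
               if abs(y2 - y) >= abs(y1 - y))
```

A loss that assigns a larger penalty at distance 1 than at distance 2 fails the check, as expected.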

Because the expected risk R(h) is not accessible directly due to the unknown distribution \(p(\varvec{x},y)\), the discriminative methods like Shashua and Levin (2002), Chu and Keerthi (2005), Li and Lin (2006) minimize a convex surrogate of the empirical risk augmented by a quadratic regularizer. We follow the same framework but with novel surrogate loss functions suitable for learning from partially annotated examples.

2.2 Learning from partially annotated examples

Analogously to the supervised setting, we assume that the observation \(\varvec{x}\in \mathcal{X}\) and the corresponding hidden label \(y\in \mathcal{Y}\) are generated from some unknown distribution \(p(\varvec{x},y)\). However, in contrast to the supervised setting, the training set does not contain a single label for each instance. Instead, we assume that an annotator provided with the observation \(\varvec{x}\), and possibly with the label y, returns a partial annotation in the form of an interval of candidate labels \([y_l,y_r]\in \mathcal{P}\). The symbol \(\mathcal{P}=\{ [y_l,y_r]\in \mathcal{Y}^2 \mid y_l\le y_r\}\) denotes the set of all possible partial annotations. The partial annotation \([y_l,y_r]\) means that the true label y is from the interval \([y_l,y_r]=\{y\in \mathcal{Y}\mid y_l\le y \le y_r\}\). We shall assume that the annotator can be modeled by a stochastic process determined by a distribution \(p(y_l,y_r\mid \varvec{x}, y)\). That is, we are given a set of partially annotated examples
$$\begin{aligned} \mathcal{T}^m_I =\{(\varvec{x}^1,[y^1_l, y^1_r]),\ldots ,(\varvec{x}^m,[y^m_l, y^m_r])\}\in (\mathcal{X}\times \mathcal{P})^m \end{aligned}$$
(4)
assumed to be generated from i.i.d. random variables with the distribution
$$\begin{aligned} p(\varvec{x},y_l,y_r) = \sum _{y\in \mathcal{Y}} p(y_l,y_r\mid \varvec{x},y)\, p(\varvec{x},y) \end{aligned}$$
defined over \(\mathcal{X}\times \mathcal{P}\). The learning algorithms described below do not require the knowledge of \(p(\varvec{x},y)\) and \(p(y_l,y_r\mid \varvec{x}, y)\). However, it is clear that the annotation process given by \(p(y_l,y_r\mid \varvec{x}, y)\) cannot be arbitrary in order to make learning possible. For example, in the case when \(p(y_l,y_r\mid \varvec{x},y) = p(y_l,y_r)\) the annotation would carry no information about the true label. Therefore we will later assume that the annotation is consistent in the sense that \(y\notin [y_l,y_r]\) implies \(p(y_l,y_r\mid \varvec{x},y)=0\). The consistency of the annotation process is a standard assumption used e.g. in Cour et al. (2011).

The goal of learning from the partially annotated examples is formulated as follows. Given a (supervised) loss function \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) and partially annotated examples (4), the task is to learn the ordinal classifier (1) whose Bayes risk R(h) defined by (3) is as small as possible. Note that the objective remains the same as in the supervised setting but the information about the labels contained in the training set is reduced to intervals.

2.3 Interval insensitive loss

We define an interval-insensitive loss function in order to measure discrepancy between the interval annotation \([y_l,y_r]\in \mathcal{P}\) and the predictions made by the classifier \(h(\varvec{x};\varvec{w},\varvec{\theta })\in \mathcal{Y}\).

Definition 2

(Interval insensitive loss) Let \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) be a supervised V-shaped loss. The interval insensitive loss \({\varDelta }_I:\mathcal{P}\times \mathcal{Y}\rightarrow \mathbb {R}\) associated with \({\varDelta }\) is defined as
$$\begin{aligned} {\varDelta }_I(y_l,y_r,y) = \min _{y'\in [y_l,y_r]} {\varDelta }(y',y) = \left\{ \begin{array}{lll} 0 &{} \text{ if } &{} y\in [y_l,y_r] \;,\\ {\varDelta }(y,y_l) &{} \text{ if } &{} y\le y_l \;,\\ {\varDelta }(y,y_r) &{} \text{ if } &{} y\ge y_r \;.\\ \end{array} \right. \end{aligned}$$
(5)

The interval-insensitive loss \({\varDelta }_I(y_l,y_r,y)\) does not penalize predictions which are in the interval \([y_l,y_r]\). Otherwise the penalty is either \({\varDelta }(y,y_l)\) or \({\varDelta }(y,y_r)\), depending on which border of the interval \([y_l,y_r]\) is closer to the prediction y. In the special case of the mean absolute error (MAE) \({\varDelta }(y,y')=|y-y'|\), one can think of the associated interval-insensitive loss \({\varDelta }_I(y_l,y_r,y)\) as the discrete counterpart of the \(\epsilon \)-insensitive loss used in Support Vector Regression (Vapnik 1998).
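For intuition, the case analysis in (5) agrees with taking the minimum directly; a small Python sketch (our naming) makes this concrete:

```python
def interval_loss(base_loss, y_l, y_r, y):
    """Interval-insensitive loss (5): minimum of the supervised
    loss over the candidate labels y' in [y_l, y_r]; y plays the
    role of the predicted label."""
    return min(base_loss(c, y) for c in range(y_l, y_r + 1))

def mae(y_pred, y_true):
    """MAE base loss."""
    return abs(y_pred - y_true)
```

With the MAE base loss, a prediction of 8 against the annotation interval [4, 6] costs |8 − 6| = 2, while any prediction inside the interval costs 0.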

Having defined the interval-insensitive loss, we can approximate minimization of the Bayes risk R(h) defined in (3) by minimization of the expectation of the interval-insensitive loss
$$\begin{aligned} R_I(h) = {\mathbb E}_{ (\varvec{x},y_l,y_r) \sim p(\varvec{x},y_l,y_r)}{\varDelta }_I(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta })) \,. \end{aligned}$$
(6)
In the sequel we refer to \(R_I(h)\) as the partial risk. The question is how well the partial risk \(R_I(h)\) approximates the Bayes risk R(h), the target quantity to be minimized. In the rest of this section we first analyze this question for the 0/1-loss, adapting results of Cour et al. (2011), and then we present a novel bound for the MAE loss. In particular, we show that for both losses the Bayes risk R(h) can be upper bounded by a linear function of the partial risk \(R_I(h)\).

In the sequel we will assume that the annotation process governed by the distribution \(p(y_l,y_r\mid \varvec{x},y)\) is consistent in the following sense.

Definition 3

(Consistent annotation process) Let \(p(y_l,y_r\mid \varvec{x},y)\) be a properly defined distribution over \(\mathcal{P}\) for any \((\varvec{x},y)\in \mathcal{X}\times \mathcal{Y}\). The annotation process governed by \(p(y_l,y_r\mid \varvec{x},y)\) is consistent if for any \(y\in \mathcal{Y}\) and \([y_l,y_r]\in \mathcal{P}\), \(y\notin [y_l,y_r]\) implies \(p(y_l,y_r\mid \varvec{x},y) = 0\).

The consistent annotation process guarantees that the true label is always contained among the candidate labels in the annotation.

We first apply the excess bound for the 0/1-loss function which has been studied in Cour et al. (2011) for generic partial annotations where \(\mathcal{P}\) is not constrained to be a set of label intervals. The tightness of the resulting bound depends on the annotation process \(p(y_l,y_r\mid \varvec{x}, y)\) characterized by the so-called ambiguity degree \(\varepsilon \) which, adapted to our interval setting, is defined as
$$\begin{aligned} \varepsilon =\max _{\varvec{x},y,z\ne y} p(z\in [y_l,y_r] \mid \varvec{x},y) = \max _{\varvec{x},y,z\ne y} \sum _{[y_l,y_r]\in \mathcal{P}} \llbracket y_l \le z \le y_r \rrbracket \;p(y_l,y_r\mid \varvec{x}, y) \,. \end{aligned}$$
(7)
In words, the ambiguity degree \(\varepsilon \) is the maximum probability of an extra label z co-occurring with the true label y in the annotation interval \([y_l,y_r]\), over all labels and observations.
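For a small, observation-independent annotator given by an explicit table \(p(y_l,y_r\mid y)\), the ambiguity degree (7) can be computed by direct enumeration. A sketch under these simplifying assumptions (the data structure and names are ours):

```python
def ambiguity_degree(annotator, Y):
    """Ambiguity degree (7) for an x-independent annotator:
    annotator[y] maps intervals (y_l, y_r) to probabilities.
    Returns the max over y and z != y of P(z in [y_l, y_r] | y)."""
    eps = 0.0
    for y in range(1, Y + 1):
        for z in range(1, Y + 1):
            if z == y:
                continue
            p = sum(prob for (y_l, y_r), prob in annotator[y].items()
                    if y_l <= z <= y_r)
            eps = max(eps, p)
    return eps

# Annotator that returns the exact label half of the time and a
# surrounding interval otherwise (Y = 3); here epsilon = 0.5.
annotator = {
    1: {(1, 1): 0.5, (1, 2): 0.5},
    2: {(2, 2): 0.5, (1, 3): 0.5},
    3: {(3, 3): 0.5, (2, 3): 0.5},
}
```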

Theorem 1

Let \(p(y_l,y_r\mid \varvec{x},y)\) be a distribution describing a consistent annotation process with the ambiguity degree \(\varepsilon \) defined by (7). Let \(R^{0/1}(h)\) be the Bayes risk (3) instantiated for the 0/1-loss and let \(R_I^{0/1}(h)\) be the partial risk (6) instantiated for the interval insensitive loss associated with the 0/1-loss. Then the upper bound
$$\begin{aligned} R^{0/1}(h) \le \frac{1}{1-\varepsilon } R_I^{0/1}(h)\, \end{aligned}$$
holds true for any \(h:\mathcal{X}\rightarrow \mathcal{Y}\).

Theorem 1 is a direct application of Proposition 1 from Cour et al. (2011).

Next we will introduce a novel upper bound for the MAE loss, which is more frequently used in applications of the ordinal classifier. We again consider consistent annotation processes. We will characterize the annotation process by two numbers describing the amount of uncertainty in the training data. First, we use \(\alpha \in [0,1]\) to denote a lower bound on the proportion of exactly annotated examples, that is, examples annotated by an interval containing just a single label (\([y_l,y_r]\) with \(y_l=y_r\)). Second, we use \(\beta \in \{0,\ldots ,Y-1\}\) to denote the maximal uncertainty in annotation, that is, \(\beta +1\) is the maximal number of labels in an annotation interval which can appear in the training data with non-zero probability.

Definition 4

(\(\alpha \beta \)-precise annotation process) Let \(p(y_l,y_r\mid \varvec{x},y)\) be a properly defined distribution over \(\mathcal{P}\) for any \((\varvec{x},y)\in \mathcal{X}\times \mathcal{Y}\). The annotation process governed by \(p(y_l,y_r\mid \varvec{x},y)\) is \(\alpha \beta \)-precise if
$$\begin{aligned} \alpha \le p(y,y\mid \varvec{x},y) \quad \text{ and }\quad \beta \ge \max _{[y_l,y_r]\in \mathcal{P}} \llbracket p(y_l,y_r\mid \varvec{x},y) > 0\rrbracket \; (y_r - y_l) \end{aligned}$$
hold for any \((\varvec{x},y)\in \mathcal{X}\times \mathcal{Y}\).

To illustrate the meaning of the parameters \(\alpha \) and \(\beta \), let us consider the extreme cases. If \(\alpha =1\) or \(\beta = 0\) then all examples are annotated exactly and we are back in the standard supervised setting. On the other hand, if \(\beta =Y-1\) then it may happen (but does not have to) that the annotation brings no information about the hidden label because the intervals contain all labels in \(\mathcal{Y}\). With the definition of the \(\alpha \beta \)-precise annotation process we can upper bound the Bayes risk in terms of the partial risk as follows:

Theorem 2

Let \(p(y_l,y_r\mid \varvec{x},y)\) be a distribution describing a consistent \(\alpha \beta \)-precise annotation process. Let \(R^{{\textit{MAE}}}(h)\) be the Bayes risk (3) instantiated for the MAE loss and let \(R_I^{{\textit{MAE}}}(h)\) be the partial risk (6) instantiated for the interval insensitive loss associated with the MAE loss. Then the upper bound
$$\begin{aligned} R^{{\textit{MAE}}}(h) \le R_I^{{\textit{MAE}}}(h) + (1-\alpha )\beta \end{aligned}$$
(8)
holds true for any \(h:\mathcal{X}\rightarrow \mathcal{Y}\).

Proof of Theorem 2 is deferred to Appendix 1.
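As a numeric sanity check of the bound (8), one can enumerate a toy, observation-free setting exactly. The distributions below (uniform true label, uniform prediction, a one-sided interval annotator) are our own illustrative choices, not taken from the paper:

```python
from itertools import product

def interval_mae(y_l, y_r, pred):
    """Interval-insensitive loss (5) for the MAE base loss."""
    if y_l <= pred <= y_r:
        return 0
    return y_l - pred if pred < y_l else pred - y_r

def risks(Y, alpha, beta):
    """Exact Bayes risk R and partial risk R_I for a toy model:
    true label and prediction both uniform on {1,...,Y}; the
    annotator returns [y, y] w.p. alpha and the interval
    [y, min(Y, y + beta)] w.p. 1 - alpha, which is consistent
    and alpha-beta-precise by construction."""
    R = R_I = 0.0
    n = Y * Y
    for y, pred in product(range(1, Y + 1), repeat=2):
        R += abs(y - pred) / n
        R_I += (alpha * interval_mae(y, y, pred)
                + (1 - alpha) * interval_mae(y, min(Y, y + beta), pred)) / n
    return R, R_I
```

For Y = 5, \(\alpha =0.3\), \(\beta =2\) the inequality \(R \le R_I + (1-\alpha )\beta \) of Theorem 2 holds with room to spare, and for \(\alpha =1\) the two risks coincide, as expected.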

The bound (8) is obtained by a worst case analysis, hence it may become trivial in some cases, for example, if all examples are annotated with wide intervals, because then \(\alpha =0\) and \(\beta \) is large. The experimental study presented in Sect. 5 nevertheless shows that the partial risk \(R_I\) is a good proxy even in cases when the upper bound is loose. This suggests that better bounds might be derived, for example, when additional information about \(p(y_l,y_r\mid \varvec{x},y)\) is available.

In order to improve the performance of the resulting classifier via the bound (8), one needs to control the parameters \(\alpha \) and \(\beta \). A possible way to set the parameters \((\alpha ,\beta )\) exactly is to control the annotation process. For example, given a set of unannotated randomly drawn input samples \(\{\varvec{x}_1,\ldots ,\varvec{x}_m\}\in \mathcal{X}^m\) we can proceed as follows:
  1. We generate a vector of binary variables \(\varvec{\pi }\in \{0,1\}^m\) according to a Bernoulli distribution with probability \(\alpha \) that each variable is 1.
  2. We instruct the annotator to provide just a single label for each input example with index from \(\{i\in \{1,\ldots ,m\}\mid \pi _i = 1\}\), while the remaining inputs (with \(\pi _i = 0\)) can be annotated by intervals of at most \(\beta +1\) labels. This means that approximately \(m\cdot \alpha \) inputs will be annotated exactly and \(m\cdot (1-\alpha )\) with intervals.
This simple procedure ensures that the annotation process is \(\alpha \beta \)-precise even though the distribution \(p(y_l,y_r\mid \varvec{x},y)\) itself is unknown and depends on the annotator.
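The two steps above can be sketched as a simulation. Only the Bernoulli mask and the width cap come from the procedure; how the simulated annotator places the interval around the true label (uniform offsets below) is our own illustrative assumption:

```python
import random

def annotate(labels, alpha, beta, Y, rng):
    """Simulate the alpha-beta-precise protocol: inputs drawn with
    pi_i = 1 get the exact label; the rest get an interval of at
    most beta + 1 labels that always contains the true label."""
    annotations = []
    for y in labels:
        if rng.random() < alpha:              # pi_i = 1: exact label
            annotations.append((y, y))
        else:                                 # pi_i = 0: interval
            width = rng.randint(0, beta)      # y_r - y_l <= beta
            y_l = max(1, min(y - rng.randint(0, width), Y - width))
            annotations.append((y_l, y_l + width))
    return annotations
```

The clamping of `y_l` keeps every interval inside \(\{1,\ldots ,Y\}\) while still containing the true label, so the simulated process is consistent by construction.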

3 Algorithms

In the previous section we argued that the partial risk, defined as an expectation of the interval insensitive loss, is a reasonable proxy of the target Bayes risk. In this section we design algorithms learning the ordinal classifier via minimization of the quadratically regularized empirical risk used as a proxy for the expected risk. As in the standard supervised case, we cannot minimize the empirical risk directly due to the discrete domain of the interval insensitive loss. For this reason we derive several convex surrogates which allow translating the risk minimization into tractable convex problems.

We first show how to modify two existing supervised methods in order to learn from partially annotated examples. Namely, we extend the Support Vector Ordinal Regression algorithm with explicit constraints (SVOR-EXP) and the Support Vector Ordinal Regression algorithm with implicit constraints (SVOR-IMC) (Chu and Keerthi 2005). The extended interval-insensitive variants are named II-SVOR-EXP (Sect. 3.1) and II-SVOR-IMC (Sect. 3.2), respectively. The II-SVOR-EXP method minimizes a convex surrogate of the interval-insensitive loss associated with the 0/1-loss, while the II-SVOR-IMC is designed for minimization of the MAE loss.

In Sect. 3.3, we show how to construct a generic convex surrogate of the interval-insensitive loss associated with an arbitrary V-shaped loss. We call the method minimizing this generic surrogate the V-shaped interval insensitive loss minimization algorithm (VILMA). We prove that VILMA subsumes the II-SVOR-IMC (as well as the SVOR-IMC) as a special case.

3.1 Interval insensitive SVOR-EXP algorithm for optimization of 0/1-loss

The original SVOR-EXP algorithm (Chu and Keerthi 2005) learns parameters of the ordinal classifier (1) from completely annotated examples \(\mathcal{T}^m\) by solving the following convex problem
$$\begin{aligned} (\varvec{w}^*,\varvec{\theta }^*) = \mathop {\mathrm{argmin}}_{\varvec{w}\in \mathbb {R}^n,\varvec{\theta }\in \hat{{\varTheta }}} \left[ \frac{{\uplambda }}{2} \Vert \varvec{w}\Vert ^2 + \sum _{i=1}^m \ell ^\mathrm{EXP}(\varvec{x}^i,y^i,\varvec{w},\varvec{\theta }) \right] \end{aligned}$$
(9)
where the optimized convex surrogate loss reads
$$\begin{aligned} \ell ^\mathrm{EXP}(\varvec{x}^i,y^i,\varvec{w},\varvec{\theta }) = \max (0, 1-\langle \varvec{x}^i,\varvec{w}\rangle + \theta _{y^i-1}) + \max (0, 1+\langle \varvec{x}^i,\varvec{w}\rangle - \theta _{y^i}) \end{aligned}$$
and \(\hat{{\varTheta }}=\{ \varvec{\theta }\in \mathbb {R}^{Y+1} \mid \theta _0 = -\infty , \theta _Y = \infty , \theta _y \le \theta _{y+1}, y=1,\ldots , Y-1\}\). Note that the set \(\hat{{\varTheta }}\) differs from the previously defined \({\varTheta }\) by adding the auxiliary constants \(\theta _0 = -\infty \) and \(\theta _{Y}=\infty \), used just for notational convenience. In the original paper (Chu and Keerthi 2005), the SVOR-EXP algorithm is formulated as an equivalent quadratic program which can be easily obtained from (9) by using slack variables to replace the \(\max (\cdot )\) functions. We rather use the formulation (9) because it shows the surrogate loss in its explicit form. The surrogate \(\ell ^\mathrm{EXP}(\varvec{x},y,\varvec{w},\varvec{\theta })\) is a convex upper bound of the 0/1-loss
$$\begin{aligned} {\varDelta }^{0/1}(y,h(\varvec{x};\varvec{w},\varvec{\theta })) = \llbracket y\ne h(\varvec{x};\varvec{w},\varvec{\theta }) \rrbracket = \llbracket \langle \varvec{x}, \varvec{w}\rangle < \theta _{y-1} \rrbracket + \llbracket \langle \varvec{x}, \varvec{w}\rangle \ge \theta _{y} \rrbracket \,, \end{aligned}$$
obtained by replacing the step function \(\llbracket t \le 0\rrbracket \) by the hinge loss \(\max (0,1-t)\).
Now we apply the same idea to approximate the interval insensitive loss \({\varDelta }_I^{0/1}(y_l,y_r,y)\) associated with the 0/1-loss, which according to (5) reads
$$\begin{aligned} \begin{array}{rcl} {\varDelta }_I^{0/1}(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta })) &{}= &{}\displaystyle \min _{y'\in [y_l,y_r]}\llbracket y'\ne h(\varvec{x};\varvec{w},\varvec{\theta })\rrbracket \\ &{}= &{}\displaystyle \llbracket \langle \varvec{x}, \varvec{w}\rangle < \theta _{y_l-1} \rrbracket + \llbracket \langle \varvec{x}, \varvec{w}\rangle \ge \theta _{y_r} \rrbracket \,. \end{array} \end{aligned}$$
Replacing the step functions by the hinge loss yields the surrogate
$$\begin{aligned} \ell _I^\mathrm{EXP}(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta }) = \max (0, 1-\langle \varvec{x},\varvec{w}\rangle + \theta _{y_l-1}) + \max (0, 1+\langle \varvec{x},\varvec{w}\rangle - \theta _{y_r}) \, \end{aligned}$$
which is clearly a convex upper bound of \({\varDelta }_I^{0/1}(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta }))\), as can also be seen in Fig. 2.
Fig. 2

The left figure shows the interval insensitive loss \({\varDelta }^{0/1}_I(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta }))\) associated with the 0/1-loss and its surrogate \(\ell ^\mathrm{EXP}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\). The right figure shows the interval insensitive loss \({\varDelta }^\mathrm{MAE}_I(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta }))\) associated with the MAE loss and its surrogate \(\ell ^\mathrm{IMC}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\). The losses are shown as a function of the score \(\langle \varvec{x},\varvec{w}\rangle \) evaluated for \(\theta _1=1, \theta _2=2,\ldots , \theta _{Y-1}=Y-1\) and \(y_l=4\), \(y_r=6\). Note that for this particular setting of \(\varvec{\theta }\) the surrogate \(\ell ^\mathrm{EXP}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\) also appears to upper bound \({\varDelta }^\mathrm{MAE}_I(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta }))\), but this does not hold in general

Given partially annotated examples \(\mathcal{T}^m_I\), we can learn parameters \((\varvec{w},\varvec{\theta })\) of the ordinal classifier (1) by solving (9) with the surrogate \(\ell _I^\mathrm{EXP}(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\) substituted for \(\ell ^\mathrm{EXP}(\varvec{x},y, \varvec{w},\varvec{\theta })\). We denote the modified variant as the II-SVOR-EXP algorithm.
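Numerically, the surrogate and the loss it bounds can be written down directly. In this sketch (function names ours) `theta_hat` stores the padded vector \((\theta _0=-\infty ,\theta _1,\ldots ,\theta _{Y-1},\theta _Y=+\infty )\) so that `theta_hat[k]` is \(\theta _k\):

```python
import numpy as np

def delta_01_I(x, y_l, y_r, w, theta_hat):
    """Interval-insensitive 0/1-loss in its threshold form."""
    s = float(np.dot(x, w))
    return int(s < theta_hat[y_l - 1]) + int(s >= theta_hat[y_r])

def ell_exp_I(x, y_l, y_r, w, theta_hat):
    """Convex surrogate ell^EXP_I: each step function replaced by a
    hinge, giving an upper bound on delta_01_I."""
    s = float(np.dot(x, w))
    return (max(0.0, 1.0 - s + theta_hat[y_l - 1])
            + max(0.0, 1.0 + s - theta_hat[y_r]))
```

The infinite padding makes both boundary terms vanish automatically when the interval touches the first or the last label.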

3.2 Interval insensitive SVOR-IMC algorithm for optimization of MAE loss

The original SVOR-IMC algorithm (Chu and Keerthi 2005) learns parameters of the ordinal classifier (1) from completely annotated examples \(\mathcal{T}^m\) by solving the following convex optimization problem
$$\begin{aligned} (\varvec{w}^*,\varvec{\theta }^*) = \mathop {\mathrm{argmin}}_{\varvec{w}\in \mathbb {R}^n,\varvec{\theta }\in \hat{{\varTheta }}} \left[ \frac{{\uplambda }}{2} \Vert \varvec{w}\Vert ^2 + \sum _{i=1}^m \ell ^\mathrm{IMC}(\varvec{x}^i,y^i,\varvec{w},\varvec{\theta }) \right] \end{aligned}$$
(10)
where the convex surrogate reads
$$\begin{aligned} \ell ^\mathrm{IMC}(\varvec{x}^i,y^i,\varvec{w},\varvec{\theta }) = \sum _{y=1}^{y^i-1} \max (0, 1-\langle \varvec{x}^i,\varvec{w}\rangle + \theta _{y}) + \sum _{y=y^i}^{Y-1} \max (0, 1+\langle \varvec{x}^i,\varvec{w}\rangle - \theta _{y}) \end{aligned}$$
using the convention \(\sum _{i=m}^n a_i = 0\) if \(m>n\). As in the previous case, the problem (10) is an equivalent reformulation of the quadratic program defining the SVOR-IMC algorithm in Chu and Keerthi (2005). It is seen that the surrogate \(\ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\) is a convex upper bound of the MAE loss
$$\begin{aligned} \begin{array}{rcl} {\varDelta }^\mathrm{MAE}(y,h(\varvec{x};\varvec{w},\varvec{\theta })) &{}= &{}\displaystyle |y-h(\varvec{x};\varvec{w},\varvec{\theta }) | \\ &{}= &{}\displaystyle \sum _{y'=1}^{y-1} \llbracket \langle \varvec{x},\varvec{w}\rangle < \theta _{y'}\rrbracket + \sum _{y'=y}^{Y-1}\llbracket \langle \varvec{x},\varvec{w}\rangle \ge \theta _{y'} \rrbracket \,.\\ \end{array} \end{aligned}$$
The surrogate is again obtained by replacing the step functions with the hinge loss. Analogously, we can derive a convex surrogate of the interval insensitive loss associated with the MAE, which according to the definition (5) reads
$$\begin{aligned} \begin{array}{rcl} {\varDelta }_I^\mathrm{MAE}(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta })) &{}= &{}\displaystyle \min _{y'\in [y_l,y_r]}| y' -h(\varvec{x};\varvec{w},\varvec{\theta })| \\ &{}= &{}\displaystyle \sum _{y'=1}^{y_l-1} \llbracket \langle \varvec{x},\varvec{w}\rangle < \theta _{y'-1}\rrbracket + \sum _{y'=y_r}^{Y-1}\llbracket \langle \varvec{x},\varvec{w}\rangle \ge \theta _{y'} \rrbracket \,. \end{array} \end{aligned}$$
Replacing the step functions by the hinge loss we obtain a convex surrogate
$$\begin{aligned} \ell ^\mathrm{IMC}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta }) = \sum _{y'=1}^{y_l-1} \max (0, 1-\langle \varvec{x},\varvec{w}\rangle + \theta _{y'-1}) + \sum _{y'=y_r}^{Y-1} \max (0, 1+\langle \varvec{x},\varvec{w}\rangle - \theta _{y'}) \,, \end{aligned}$$
which is obviously an upper bound of \({\varDelta }_I^\mathrm{MAE}(y_l,y_r,h(\varvec{x};\varvec{w},\varvec{\theta }))\) as can be also seen in Fig. 2.

Given the partially annotated examples \(\mathcal{T}^m_I\), we can learn parameters \((\varvec{w},\varvec{\theta })\) of the ordinal classifier (1) by solving (10) with the proposed surrogate \(\ell ^\mathrm{IMC}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\) substituted for \(\ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\). We denote the modified variant as the II-SVOR-IMC algorithm. Note that due to the equality \(\ell ^\mathrm{IMC}_I(\varvec{x},y,y,\varvec{w},\varvec{\theta })= \ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\) it is clear that the proposed II-SVOR-IMC subsumes the original supervised SVOR-IMC as a special case.

3.3 VILMA: V-shaped interval insensitive loss minimization algorithm

In this section we propose a generic method for learning the ordinal classifiers with arbitrary interval insensitive V-shaped loss. We start by introducing an equivalent parametrization of the ordinal classifier (1) originally proposed in Antoniuk et al. (2013). The ordinal classifier (1) can be re-parametrized as a multi-class linear classifier, in the sequel denoted as multi-class ordinal (MORD) classifier, which reads
$$\begin{aligned} h'(\varvec{x};\varvec{w},\varvec{b}) = \mathop {\mathrm{argmax}}_{y\in \mathcal{Y}} \Big ( \langle \varvec{x}, \varvec{w}\rangle \cdot y + b_y\Big ) \, \end{aligned}$$
(11)
where \(\varvec{w}\in \mathbb {R}^n\) and \(\varvec{b}=(b_1,\ldots ,b_Y)\in \mathbb {R}^Y\) are parameters. Note that the MORD classifier has \(n+Y\) parameters and any pair \((\varvec{w},\varvec{b})\in (\mathbb {R}^n\times \mathbb {R}^Y)\) is admissible. The standard parametrization of the ordinal classifier (1) has \(n+Y-1\) parameters; however, the admissible parameters must satisfy a set of linear constraints \((\varvec{w},\varvec{\theta })\in (\mathbb {R}^n\times {\varTheta })\). The following theorem states that both parametrizations are equivalent.

Theorem 3

The ordinal classifier (1) and the MORD classifier (11) are equivalent in the following sense. For any \(\varvec{w}\in \mathbb {R}^n\) and admissible \(\varvec{\theta }\in {\varTheta }\) there exists \(\varvec{b}\in \mathbb {R}^Y\) such that \(h(\varvec{x},\varvec{w},\varvec{\theta }) = h'(\varvec{x},\varvec{w},\varvec{b})\), \(\forall \varvec{x}\in \mathbb {R}^n\). Conversely, for any \(\varvec{w}\in \mathbb {R}^n\) and \(\varvec{b}\in \mathbb {R}^Y\), there exists admissible \(\varvec{\theta }\in {\varTheta }\) such that \(h(\varvec{x},\varvec{w},\varvec{\theta }) = h'(\varvec{x},\varvec{w},\varvec{b})\), \(\forall \varvec{x}\in \mathbb {R}^n\).

Proof of Theorem 3 is given in Antoniuk et al. (2013). The proof is constructive in that it provides analytical formulas for conversion between the two parametrizations. For the sake of completeness the conversion formulas are shown in Appendix 2.
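To make the re-parametrization concrete, the prediction rule (11) amounts to taking an argmax over \(Y\) affine scores. A minimal sketch in Python (the function name is ours; labels are \(1,\ldots ,Y\)):

```python
import numpy as np

def mord_predict(x, w, b):
    """MORD classifier (11): argmax over y of <x, w> * y + b_y, labels 1..Y."""
    s = float(np.dot(w, x))                       # linear score <x, w>
    scores = s * np.arange(1, len(b) + 1) + b     # one affine score per label
    return int(np.argmax(scores)) + 1             # back to 1-based labels
```

Unlike the threshold parametrization, any pair \((\varvec{w},\varvec{b})\) is admissible here, which is what makes the unconstrained formulation used later in (13) possible.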

The MORD parametrization allows us to adopt existing techniques for linear classification. Namely, we can replace the interval insensitive loss by a convex surrogate similar to the margin-rescaling loss used in structured output learning (Tsochantaridis et al. 2005). Given a V-shaped supervised loss \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\), we propose to approximate the value of the associated interval insensitive loss \({\varDelta }_I(y_l,y_r,h'(\varvec{x};\varvec{w},\varvec{b}))\) by a surrogate loss \(\ell _I:\mathcal{X}\times \mathcal{P}\times \mathbb {R}^n\times \mathbb {R}^Y\rightarrow \mathbb {R}\) defined as
$$\begin{aligned} \begin{array}{rcl} \ell _I(\varvec{x},y_l,y_r,\varvec{w},\varvec{b}) &{}= &{}\displaystyle \max _{y\le y_l}\Big [{\varDelta }(y,y_l) + \langle \varvec{x},\varvec{w}\rangle (y-y_l) + b_y - b_{y_l} \Big ] \\ &{}&{} \displaystyle + \max _{y\ge y_r} \Big [ {\varDelta }(y,y_r) + \langle \varvec{x},\varvec{w}\rangle (y-y_r) + b_y - b_{y_r} \Big ]\,. \end{array} \end{aligned}$$
(12)
It is seen that for fixed \((\varvec{x},y_l,y_r)\) the function \(\ell _I(\varvec{x},y_l,y_r,\varvec{w},\varvec{b})\) is a sum of two point-wise maxima over linear functions, hence it is convex in the parameters \((\varvec{w},\varvec{b})\). The following proposition states that the surrogate is, like the previous surrogates, an upper bound of the interval insensitive loss.
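Because the surrogate (12) is, for a fixed example, a maximum over finitely many labels, it can be evaluated by direct enumeration. A sketch with our naming, where `delta` is any V-shaped loss (passing `lambda a, c: abs(a - c)` gives the MAE instantiation (14)):

```python
import numpy as np

def surrogate_loss(x, y_l, y_r, w, b, delta):
    """Convex surrogate (12) of the interval insensitive loss.
    Labels are 1..Y and b[y-1] stores b_y."""
    s = float(np.dot(w, x))
    Y = len(b)
    left = max(delta(y, y_l) + s * (y - y_l) + b[y - 1] - b[y_l - 1]
               for y in range(1, y_l + 1))          # max over y <= y_l
    right = max(delta(y, y_r) + s * (y - y_r) + b[y - 1] - b[y_r - 1]
                for y in range(y_r, Y + 1))         # max over y >= y_r
    return left + right
```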

Proposition 1

For any \(\varvec{x}\in \mathbb {R}^n\), \([y_l,y_r]\in \mathcal{P}\), \(\varvec{w}\in \mathbb {R}^n\) and \(\varvec{b}\in \mathbb {R}^Y\) the inequality
$$\begin{aligned} {\varDelta }_I(y_l,y_r,h'(\varvec{x};\varvec{w},\varvec{b})) \le \ell _I(\varvec{x},y_l,y_r,\varvec{w},\varvec{b}) \end{aligned}$$
holds where \(h'(\varvec{x};\varvec{w},\varvec{b})\) denotes response of the MORD classifier (11).

Proof is deferred to Appendix 3.

Given partially annotated training examples \(\mathcal{T}^m_I\), we can learn parameters \((\varvec{w},\varvec{b})\) of the MORD classifier (11) by solving the following unconstrained convex problem
$$\begin{aligned} (\varvec{w}^*,\varvec{b}^*) = \mathop {\mathrm{argmin}}_{\varvec{w}\in \mathbb {R}^n,\varvec{b}\in \mathbb {R}^Y} \left[ \frac{{\uplambda }}{2}\Vert \varvec{w}\Vert ^2 + \frac{1}{m}\sum _{i=1}^m \ell _I(\varvec{x}^i,y_l^i,y_r^i,\varvec{w},\varvec{b}) \right] \end{aligned}$$
(13)
where \({\uplambda }\in \mathbb {R}_{++}\) is a regularization constant. A suitable value of the regularization constant is typically tuned on the validation set. In the sequel we denote the method based on solving (13) as the V-shape Interval insensitive Loss Minimization Algorithm (VILMA).
As an important example let us consider the surrogate (12) instantiated for the MAE loss. In this case, the surrogate becomes
$$\begin{aligned} \begin{array}{rcl} \ell _I^\mathrm{MAE}(\varvec{x},y_l,y_r,\varvec{w},\varvec{b}) &{}= &{}\displaystyle \max _{y\le y_l}\Big [ y_l-y + \langle \varvec{x},\varvec{w}\rangle (y-y_l) + b_y - b_{y_l} \Big ] \\ &{}&{} \displaystyle + \max _{y\ge y_r} \Big [ y - y_r + \langle \varvec{x},\varvec{w}\rangle (y-y_r) + b_y - b_{y_r} \Big ]\,. \end{array} \end{aligned}$$
(14)
It is interesting to compare the VILMA instantiated for the MAE loss with the II-SVOR-IMC algorithm which optimizes a different surrogate of the same loss. Note that the II-SVOR-IMC learns the parameters \((\varvec{w},\varvec{\theta })\) of the ordinal classifier (1) while the VILMA learns the parameters \((\varvec{w},\varvec{b})\) of the MORD rule (11). The following proposition states that the surrogates of both methods are equivalent.

Proposition 2

Let \(\varvec{w}\in \mathbb {R}^n, \varvec{\theta }\in {\varTheta }, \varvec{b}\in \mathbb {R}^{Y}\) be a triplet of vectors such that \(h(\varvec{x};\varvec{w},\varvec{\theta }) = h'(\varvec{x};\varvec{w},\varvec{b})\) holds for all \(\varvec{x}\in \mathcal{X}\) where \(h(\varvec{x};\varvec{w},\varvec{\theta })\) denotes the ordinal classifier (1) and \(h'(\varvec{x};\varvec{w},\varvec{b})\) the MORD classifier (11). Then the equality
$$\begin{aligned} \ell ^\mathrm{IMC}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta }) = \ell _I^\mathrm{MAE}(\varvec{x},y_l,y_r,\varvec{w},\varvec{b}) \end{aligned}$$
holds true for any \(\varvec{x}\in \mathcal{X}\) and \([y_l,y_r]\in \mathcal{P}\).

Proof is deferred to Appendix 4. Proposition 2 ensures that the II-SVOR-IMC algorithm and the VILMA with MAE loss both return the same classification rules although differently parametrized.

We finish by listing the core properties of the generic method, the VILMA, proposed in this section:
  • VILMA is applicable for an arbitrary V-shaped loss.

  • VILMA subsumes the II-SVOR-IMC algorithm optimizing the MAE loss as a special case.

  • VILMA converts learning into an unconstrained convex optimization. Note that the II-SVOR-EXP and the II-SVOR-IMC in contrast to VILMA maintain the set of linear constraints \(\varvec{\theta }\in \hat{{\varTheta }}\).

4 Double-loop cutting plane solver

The proposed method VILMA translates learning into a convex optimization problem (13) that can be re-written as
$$\begin{aligned} (\varvec{w}^*,\varvec{b}^*) = \mathop {\mathrm{argmin}}_{\varvec{w}\in \mathbb {R}^n,\varvec{b}\in \mathbb {R}^Y} \left[ \frac{{\uplambda }}{2}\Vert \varvec{w}\Vert ^2 + R_\mathrm{emp}(\varvec{w},\varvec{b}) \right] \, \end{aligned}$$
(15)
where \(R_\mathrm{emp}(\varvec{w},\varvec{b}) = \frac{1}{m}\sum _{i=1}^m \ell _I(\varvec{x}^i,y_l^i,y_r^i,\varvec{w},\varvec{b})\) is a non-differentiable convex function of the variables \(\varvec{w}\) and \(\varvec{b}\). The task (15) can be reformulated as a quadratic program with \(n+m+Y\) variables and \(Y\cdot m\) constraints. Generic off-the-shelf QP solvers are applicable only to small problems. In this section we derive an instance of the cutting plane algorithm tailored for the problem (15). The resulting CPA is applicable to large problems and provides a certificate of optimality.
More details on CPA based solvers applied to machine learning problems can be found, for example, in Teo et al. (2010) and Franc et al. (2012). The standard CPA is suitable for solving convex tasks of the form
$$\begin{aligned} \varvec{w}^* = \mathop {\mathrm{argmin}}_{\varvec{w}\in \mathbb {R}^n} F(\varvec{w})\quad \text{ where }\quad F(\varvec{w})= \frac{{\uplambda }}{2}\Vert \varvec{w}\Vert ^2 + G(\varvec{w}) \end{aligned}$$
(16)
and \(G:\mathbb {R}^n\rightarrow \mathbb {R}\) is a convex function. In contrast to our problem (15), the objective of (16) contains a quadratic regularization imposed on all variables. It is well known that the CPA applied directly to an un-regularized problem like (15) exhibits a strong zig-zag behavior leading to a large number of iterations. A frequently used ad-hoc solution is to impose an artificial regularization on \(\varvec{b}\), which may however significantly spoil the results as demonstrated in Sect. 5.2. In the rest of this section we first outline the CPA algorithm for the problem (16) and then show how it can be used to solve the problem (15).
The core idea of the CPA is to approximate the solution of the master problem (16) by solving a reduced problem
$$\begin{aligned} \varvec{w}_t \in \mathop {\mathrm{argmin}}_{\varvec{w}\in \mathbb {R}^n} F_t(\varvec{w}) \quad \text{ where }\quad F_t(\varvec{w})= \frac{{\uplambda }}{2}\Vert \varvec{w}\Vert ^2 + G_t(\varvec{w}) \,. \end{aligned}$$
(17)
The reduced problem (17) is obtained from (16) by substituting a cutting-plane model \(G_t(\varvec{w})\) for the convex function \(G(\varvec{w})\) while the regularizer remains unchanged. The cutting plane model of \(G(\varvec{w})\) reads
$$\begin{aligned} G_t(\varvec{w}) = \max _{i=0,\ldots ,t-1}\big [ G(\varvec{w}_i) + \langle G'(\varvec{w}_i),\varvec{w}-\varvec{w}_i\rangle \big ] \end{aligned}$$
(18)
where \(G'(\varvec{w})\in \mathbb {R}^n\) is a sub-gradient of G at the point \(\varvec{w}\). Thanks to the convexity of \(G(\varvec{w})\), \(G_t(\varvec{w})\) is a piece-wise linear underestimator of \(G(\varvec{w})\) which is tight at the points \(\varvec{w}_i\), \(i=0,\ldots ,t-1\). In turn, the reduced problem objective \(F_t(\varvec{w})\) is an underestimator of \(F(\varvec{w})\). The cutting plane model is built iteratively by the following simple procedure. Starting from \(\varvec{w}_0\in \mathbb {R}^n\), the CPA computes a new iterate \(\varvec{w}_t\) by solving the reduced problem (17). In each iteration t, the cutting-plane model (18) is updated by a new cutting plane computed at the intermediate solution \(\varvec{w}_t\), leading to a progressively tighter approximation of \(F(\varvec{w})\). The CPA halts if the gap between \(F(\varvec{w}_t)\) and \(F_t(\varvec{w}_t)\) gets below a prescribed \(\varepsilon >0\), meaning that \(F(\varvec{w}_t)\le F(\varvec{w}^*) + \varepsilon \) holds. The CPA is guaranteed to halt after \(\mathcal{O}(\frac{1}{{\uplambda }\varepsilon })\) iterations at most (Teo et al. 2010). The CPA is outlined in Algorithm 1.
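The loop just described can be sketched as follows. This is an illustrative implementation, not the solver of Teo et al. (2010): the reduced problem (17) is solved through its simplex-constrained dual, here with a plain Frank-Wolfe iteration instead of a dedicated QP library, and `oracle(w)` is assumed to return \(G(\varvec{w})\) together with a sub-gradient:

```python
import numpy as np

def solve_reduced_dual(Gm, a, lam, iters=200):
    """Frank-Wolfe on the simplex for the dual of the reduced problem (17):
    maximize a^T alpha - ||Gm^T alpha||^2 / (2 lam) over the simplex."""
    al = np.ones(len(a)) / len(a)
    for it in range(iters):
        grad = (Gm @ (Gm.T @ al)) / lam - a   # gradient of the negated dual
        j = int(np.argmin(grad))              # best simplex vertex
        step = 2.0 / (it + 2.0)
        al *= 1.0 - step
        al[j] += step
    return al

def cpa(oracle, n, lam, eps=1e-3, max_iter=100):
    """Cutting plane algorithm for (16): min_w lam/2 ||w||^2 + G(w)."""
    w = np.zeros(n)
    grads, offs = [], []      # planes: G_t(w) = max_i [offs[i] + <grads[i], w>]
    for t in range(max_iter):
        Gw, g = oracle(w)
        F = 0.5 * lam * (w @ w) + Gw            # master objective F(w_t)
        grads.append(g)
        offs.append(Gw - g @ w)
        Gm, a = np.array(grads), np.array(offs)
        al = solve_reduced_dual(Gm, a, lam)
        w = -(Gm.T @ al) / lam                  # primal minimizer of (17)
        Ft = 0.5 * lam * (w @ w) + np.max(a + Gm @ w)
        if F - Ft <= eps:     # F_t underestimates F: eps-optimality certificate
            break
    return w
```

For instance, with \(G(w)=|w-1|\) and \({\uplambda }=1\) the iterates converge to the minimizer \(w^*=1\) after a couple of cutting planes.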
We can convert our problem (15) to (16) by setting
$$\begin{aligned} G(\varvec{w}) = R_\mathrm{emp}(\varvec{w},\varvec{b}(\varvec{w}) ) \quad \text{ where }\quad \varvec{b}(\varvec{w}) = \mathop {\mathrm{argmin}}_{\varvec{b}\in \mathbb {R}^Y} R_\mathrm{emp}(\varvec{w},\varvec{b}) \,. \end{aligned}$$
(19)
It is clear that if \(\varvec{w}^*\) is a solution of the problem (16) with the function \(G(\varvec{w})\) defined by the Eq. (19) then \((\varvec{w}^*, \varvec{b}(\varvec{w}^*))\) must be a solution of (15). Because \(R_\mathrm{emp}(\varvec{w},\varvec{b})\) is jointly convex in \(\varvec{w}\) and \(\varvec{b}\), the function \(G(\varvec{w})\) in (19) is also convex in \(\varvec{w}\) (see for example Boyd and Vandenberghe 2004). Hence, application of Algorithm 1 to solve (15) will preserve all its convergence guarantees. To this end, we only need to provide a first-order oracle computing \(G(\varvec{w})\) and the sub-gradient \(G'(\varvec{w})\) required to build the cutting plane model. Given \(\varvec{b}(\varvec{w})\), the subgradient of \(G(\varvec{w})\) reads (Boyd and Vandenberghe 2004)
$$\begin{aligned} G'(\varvec{w}) = \frac{1}{m}\sum _{i=1}^m \varvec{x}^i (\hat{y}^i_l + \hat{y}^i_r- y_l^i-y_r^i) \end{aligned}$$
(20)
where
$$\begin{aligned} \begin{array}{rcl} \hat{y}^i_l &{}=&{} \displaystyle \mathop {\mathrm{argmax}}_{y\le y^i_l} \big [ {\varDelta }(y,y_l^i) + \langle \varvec{w},\varvec{x}^i\rangle y + b_y(\varvec{w}) \big ]\,, \\ \hat{y}^i_r &{}=&{} \displaystyle \mathop {\mathrm{argmax}}_{y\ge y^i_r} \big [ {\varDelta }(y,y_r^i) + \langle \varvec{w},\varvec{x}^i\rangle y + b_y(\varvec{w}) \big ] \,. \end{array} \end{aligned}$$
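The maximizers \(\hat{y}^i_l\) and \(\hat{y}^i_r\) are the labels attaining the two maxima in the surrogate (12) evaluated at \((\varvec{w},\varvec{b}(\varvec{w}))\) (additive constants not depending on y can be dropped). A per-example sketch of the oracle's sub-gradient computation (our naming; `delta` is the chosen V-shaped loss):

```python
import numpy as np

def subgradient(X, Yl, Yr, w, b, delta):
    """Sub-gradient (20) of G(w): each example contributes x^i scaled by the
    shift of the maximizers of the two linear terms of the surrogate (12)."""
    g = np.zeros(X.shape[1])
    for x, y_l, y_r in zip(X, Yl, Yr):
        s = float(np.dot(w, x))
        # maximizers of delta(y, .) + <x, w> y + b_y over y <= y_l (resp. y >= y_r)
        yl_hat = max(range(1, y_l + 1),
                     key=lambda y: delta(y, y_l) + s * y + b[y - 1])
        yr_hat = max(range(y_r, len(b) + 1),
                     key=lambda y: delta(y, y_r) + s * y + b[y - 1])
        g += x * (yl_hat + yr_hat - y_l - y_r)
    return g / len(X)
```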
The proposed CPA transforms solving the problem (15) into a sequence of two simpler problems:
  1. The reduced problem (17) solved in each iteration of the CPA. The problem (17) is a quadratic program that can be approached via its dual formulation (Teo et al. 2010), which has only t variables, where t is the number of iterations of the CPA. Since the CPA rarely needs more than a few hundred iterations, the dual of (17) can be solved by off-the-shelf QP libraries.

  2. The problem (19) providing \(\varvec{b}(\varvec{w})\), which is required to compute \(G(\varvec{w})=R_\mathrm{emp}(\varvec{w},\varvec{b}(\varvec{w}) )\) and the sub-gradient \(G'(\varvec{w})\) via Eq. (20). The problem (19) has only Y (the number of labels) variables. Hence it can be approached by generic convex solvers like the Analytic Center Cutting Plane algorithm (Gondzio et al. 1996).

Because we use another cutting plane method in the inner loop to implement the first-order oracle, we call the proposed solver the double-loop CPA.

Finally, we point out that the convex problems associated with the II-SVOR-EXP and the II-SVOR-IMC can be solved by a similar method. The only change is the additional constraints \(\varvec{\theta }\in \hat{{\varTheta }}\) in (15), which propagate to the problem (19).

5 Experiments

We evaluate the proposed methods on a real-life computer vision problem of estimating the age of a person from a facial image. Age estimation is a prototypical problem calling for ordinal classification as well as learning from interval annotations. The set of labels corresponds to individual ages which form an ordered set. Training examples of facial images are cheap; for example, they can be downloaded from the Internet. On the other hand, obtaining the ground truth age for a given facial image is often very complicated for obvious reasons. A typical solution used in practice is to annotate the age manually and use it as a replacement for the true age. Creating a year-precise annotation manually is, however, a tedious process. Moreover, manual annotations are often imprecise and inconsistent. Using the interval annotation instead of the year-precise one can significantly ease the mentioned problems. We demonstrate on real-life data that the proposed methods can effectively exploit cheap interval annotations for learning precise age estimators.

The experiments have two parts. First, in Sect. 5.2, we present results on precisely annotated examples. By conducting these experiments we (i) set a baseline for the later experiments on partially annotated examples, (ii) numerically verify that the VILMA subsumes the SVOR-IMC algorithm as a special case and (iii) justify the usage of the proposed double-loop CPA. Second, in Sect. 5.3, we thoroughly analyze the performance of the VILMA on partially annotated examples. We emphasize that all tested algorithms are designed to optimize the MAE loss, the standard evaluation metric of age estimation systems.

5.1 Databases and implementation details

We use two large face databases with year-precise annotation of the age:
  1. MORPH database (Ricanek and Tesafaye 2006) is the standard benchmark for age estimation. It contains 55,134 face images with exact age annotation ranging from 16 to 77 years. Because the age category 70+ is severely under-represented (only 9 examples in total), we removed faces with age higher than 70. The database contains frontal police mugshots taken under controlled conditions. The images have a resolution of 200\(\times \)240 pixels and most of them are of very good quality.

  2. WILD database is a collection of three public databases: Labeled Faces in the Wild (Huang et al. 2007), PubFig (Kumar et al. 2009) and PAL (Minear and Park 2004). The images are annotated by several independent annotators. We selected a subset of near-frontal images (yaw angle in \([-30^\circ ,30^\circ ]\)) containing 34,259 faces in total with the age from 1 to 80 years. The WILD database contains challenging “in-the-wild” images exhibiting a large variation in resolution, illumination, race and background clutter.
The faces were split randomly three times into training, validation and testing part in the ratio 60/20/20. We made sure that images of the same identity never appear in different parts simultaneously.

Preprocessing The feature representation of the facial images was computed as follows. We first localized the faces by a commercial face detector 2 and consequently applied a Deformable Part Model based detector (Uřičář et al. 2012) to find facial landmarks like the corners of the eyes, the mouth and the tip of the nose. The found landmarks were used to transform the input face by an affine transform into its canonical pose. Finally, the canonical face of size \(60\times 40\) pixels was described by a multi-scale LBP descriptor (Sonnenburg and Franc 2010), resulting in an \(n=159{,}488\)-dimensional binary sparse vector serving as the input of the ordinal classifier.

Implementation of the solver We implemented the double-loop CPA and the standard CPA in C++ by modifying the code from the Shogun machine learning library (Sonnenburg et al. 2010). To solve internal problem (19) we used the Oracle Based Optimization Engine (OBOE) implementation of the Analytic Center Cutting Plane algorithm being a part of COmputational INfrastructure for Operations Research project (COIN-OR) (Gondzio et al. 1996).

5.2 Supervised setting

The purpose of the experiments conducted on fully supervised data is threefold. First, to present results of the standard supervised setting which is later used as a baseline. Second, to numerically verify Proposition 2 which states that the VILMA instantiated for the MAE loss subsumes the SVOR-IMC algorithm. Third, to show that imposing an extra quadratic regularization on the biases \(\varvec{b}\) of the MORD rule (11) severely harms the results, which justifies the usage of the proposed double-loop CPA.

Here we used images with the year-precise age annotations from the MORPH database. We constructed a sequence of training sets with the number of examples m varying from \(m=3300\) to \(m=33{,}000\) (the total number of training examples in the MORPH). For each training set we learned the ordinal classifier with the regularization parameter set to \({\uplambda }\in \{1,0.1,0.01,0.001\}\). The classifier corresponding to the \({\uplambda }\) with the smallest validation MAE was applied on the testing examples. This process was repeated for the three random splits. We report the averages and the standard deviations of the MAE computed on the test examples over the three splits. The same evaluation procedure was used for the three compared algorithms: (i) the proposed method VILMA, (ii) the standard SVOR-IMC and (iii) the VILMA-REG which solves the problem (13) but uses the regularization term \(\frac{{\uplambda }}{2}(\Vert \varvec{w}\Vert ^2+\Vert \varvec{b}\Vert ^2)\) instead of \(\frac{{\uplambda }}{2}\Vert \varvec{w}\Vert ^2\). We used the double-loop CPA for the VILMA and the SVOR-IMC and the standard CPA for the VILMA-REG. Table 1 summarizes the results.
Table 1

The test MAE of the ordinal classifier learned from the precisely annotated examples by the VILMA, the standard SVOR-IMC and the VILMA-REG using the \(\frac{{\uplambda }}{2}(\Vert \varvec{w}\Vert ^2+\Vert \varvec{b}\Vert ^2)\) regularizer

|           | \(m=3300\)        | \(m=6600\)        | \(m=13{,}000\)    | \(m=23{,}000\)    | \(m=33{,}000\)    |
| VILMA     | \(5.56 \pm 0.02\) | \(5.12 \pm 0.02\) | \(4.83 \pm 0.02\) | \(4.66 \pm 0.01\) | \(4.55 \pm 0.02\) |
| SVOR-IMC  | \(5.56 \pm 0.03\) | \(5.14 \pm 0.02\) | \(4.83 \pm 0.01\) | \(4.68 \pm 0.03\) | \(4.54 \pm 0.01\) |
| VILMA-REG | \(9.57 \pm 0.03\) | \(9.21 \pm 0.06\) | \(9.07 \pm 0.05\) | \(9.04 \pm 0.05\) | \(9.06 \pm 0.02\) |

The results are shown for training sets generated from the MORPH database by randomly selecting different numbers m of training examples

We observe that the prediction error steeply decreases as new precisely annotated examples are added. The MAE for the largest training set is \(4.55 \pm 0.02\), which closely matches the state-of-the-art methods like Guo and Mu (2010) reporting MAE 4.45 on the same database. The next section shows that similar results can be obtained with cheaper partially annotated examples.

Although the VILMA and the SVOR-IMC learn different parametrizations of the ordinal classifier, the resulting rules are equivalent up to a numerical error, as predicted by Proposition 2. We repeated the same experiment applying the VILMA and the II-SVOR-IMC on the partially annotated examples as described in the next section. The results of both methods were the same up to a numerical error. Hence in the next section we only include the results for the VILMA.

The test MAE of the classifier learned by the VILMA-REG is almost doubled compared to the classifier learned by VILMA via the double-loop CPA. This shows that pushing the biases \(\varvec{b}\) towards zero by the quadratic regularizer, which is necessary if the standard CPA is to be used, has a detrimental effect on the accuracy.

5.3 Learning from partially annotated examples

The goal is to evaluate the VILMA when it is applied to learning from partially annotated examples. The MORPH and WILD databases contain the year-precise annotation. The partial annotation is generated in a way which simulates a practical setting:
  • \(m_P\) randomly selected examples were annotated precisely by taking the annotation from the databases.

  • \(m_I\) randomly selected examples were annotated by intervals. The admissible annotation intervals were chosen so that they partition the set of ages and have the same width (up to the border cases). The interval width u was varied over \(\{5,10,20\}\). The interval annotation was obtained by rounding the true age from the databases to the admissible intervals. For example, in the case of \((u=5)\)-years wide intervals the true ages \(y\in \{1,2,\dots ,5\}\) were transformed to the interval annotation [1, 5], the ages \(y\in \{6,7,\ldots ,10\}\) to [6, 10] and so on.

The described annotation process is approximately \(\alpha \beta \)-precise (c.f. Definition 4) with \(\alpha =\frac{m_P}{m_P+m_I}\) and \(\beta =u-1\in \{4,9,19\}\). We varied \(m_P\in \{ 3300, 6600\}\) and \(m_I\) from 0 to \(m_\mathrm{total}-m_P\) where \(m_\mathrm{total}\) is the total number of the training examples in the corresponding database.
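The rounding of a true age to its annotation interval described above can be sketched as follows (the function name is ours):

```python
def to_interval(y, u, Y):
    """Round a true label y in 1..Y to its annotation interval of width u:
    1..u -> [1, u], u+1..2u -> [u+1, 2u], ..., clipped at Y at the border."""
    k = (y - 1) // u                        # index of the interval containing y
    return (k * u + 1, min((k + 1) * u, Y))
```

For example, with \(u=5\) a true age of 7 yields the interval [6, 10], matching the scheme described above.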
For each training set we ran the VILMA with the regularization constant \({\uplambda }\) set to \(\{1,0.1,0.01,0.001\}\) and selected the best value according to the MAE computed on the validation examples. The best model was then evaluated on the test part. This process was repeated for the three random splits. The reported errors are the averages and the standard deviations of the MAE computed on the test examples. The results are summarized in Fig. 3 and Table 2.
Fig. 3

The figures show the test MAE for the ordinal classifiers learned by the VILMA from different training sets. The x-axis corresponds to the total number of examples in the training set. In the case of partial annotation, the x-axis corresponds to \(m_P+m_I\) where \(m_P\) is the number of precisely annotated and \(m_I\) the number of partially annotated examples, respectively. The figures a, c show results for \(m_P=3300\) and figures b, d for \(m_P=6600\), respectively. In the supervised case, the x-axis is just the number of precisely annotated examples. Each figure shows one curve for the supervised setting plus three curves corresponding to the partial setting with different widths \(u\in \{5,10,20\}\) of the annotation intervals. The results for the MORPH database are in figures a, b and the results for WILD in c, d. a MORPH—\(m_P=3300\) precisely annotated. b MORPH—\(m_P=6600\) precisely annotated. c WILD—\(m_P=3300\) precisely annotated. d WILD—\(m_P=6600\) precisely annotated

Table 2

The table summarizes test MAE of the ordinal classifier learned from the training set with m examples

MORPH database:

|            |    | \(m=3300\)        | \(m=6600\)        | \(m=13{,}000\)     | \(m=23{,}000\)     | \(m=33{,}000\)     |
| Supervised |    | \(5.56 \pm 0.02\) | \(5.12 \pm 0.02\) | \(4.83 \pm 0.02\)  | \(4.66 \pm 0.01\)  | \(4.55 \pm 0.02\)  |
| \(m_P\)    | u  | \(m_I = 0\)       | \(m_I = 3300\)    | \(m_I = 9700\)     | \(m_I = 19{,}700\) | \(m_I = 29{,}700\) |
| 3300       | 5  | \(5.56 \pm 0.02\) | \(5.21 \pm 0.04\) | \(4.89 \pm 0.03\)  | \(4.70 \pm 0.01\)  | \(4.62 \pm 0.01\)  |
|            | 10 | \(5.56 \pm 0.03\) | \(5.25 \pm 0.02\) | \(5.15 \pm 0.05\)  | \(4.97 \pm 0.01\)  | \(4.90 \pm 0.04\)  |
|            | 20 | \(5.56 \pm 0.03\) | \(5.32 \pm 0.03\) | \(5.26 \pm 0.06\)  | \(5.06 \pm 0.04\)  | \(4.97 \pm 0.01\)  |
| \(m_P\)    | u  |                   | \(m_I = 0\)       | \(m_I = 6400\)     | \(m_I = 16{,}400\) | \(m_I = 26{,}400\) |
| 6600       | 5  |                   | \(5.12 \pm 0.02\) | \(4.86 \pm 0.02\)  | \(4.69 \pm 0.00\)  | \(4.61 \pm 0.00\)  |
|            | 10 |                   | \(5.13 \pm 0.02\) | \(4.96 \pm 0.03\)  | \(4.81 \pm 0.01\)  | \(4.84 \pm 0.04\)  |
|            | 20 |                   | \(5.13 \pm 0.02\) | \(5.03 \pm 0.02\)  | \(4.86 \pm 0.04\)  | \(4.86 \pm 0.01\)  |

WILD database:

|            |    | \(m=3300\)         | \(m=6600\)        | \(m=11{,}000\)     | \(m=16{,}000\)     | \(m=21{,}000\)     |
| Supervised |    | \(10.40 \pm 0.03\) | \(9.60 \pm 0.03\) | \(9.14 \pm 0.02\)  | \(8.89 \pm 0.02\)  | \(8.68 \pm 0.02\)  |
| \(m_P\)    | u  | \(m_I = 0\)        | \(m_I = 3300\)    | \(m_I = 7700\)     | \(m_I = 12{,}700\) | \(m_I = 17{,}700\) |
| 3300       | 5  | \(10.40 \pm 0.03\) | \(9.69 \pm 0.02\) | \(9.23 \pm 0.05\)  | \(8.89 \pm 0.02\)  | \(8.71 \pm 0.02\)  |
|            | 10 | \(10.40 \pm 0.03\) | \(9.76 \pm 0.02\) | \(9.42 \pm 0.04\)  | \(9.09 \pm 0.02\)  | \(8.99 \pm 0.02\)  |
|            | 20 | \(10.40 \pm 0.03\) | \(9.88 \pm 0.03\) | \(9.67 \pm 0.04\)  | \(9.51 \pm 0.00\)  | \(9.40 \pm 0.01\)  |
| \(m_P\)    | u  |                    | \(m_I = 0\)       | \(m_I = 4400\)     | \(m_I = 9400\)     | \(m_I = 14{,}400\) |
| 6600       | 5  |                    | \(9.60 \pm 0.03\) | \(9.22 \pm 0.06\)  | \(8.89 \pm 0.02\)  | \(8.71 \pm 0.02\)  |
|            | 10 |                    | \(9.60 \pm 0.03\) | \(9.22 \pm 0.02\)  | \(9.04 \pm 0.03\)  | \(8.90 \pm 0.02\)  |
|            | 20 |                    | \(9.60 \pm 0.03\) | \(9.35 \pm 0.06\)  | \(9.14 \pm 0.03\)  | \(9.04 \pm 0.02\)  |

The upper row of each block shows results of the supervised setting when all m examples are precisely annotated. The bottom rows show results of learning from \(m_P\) precisely annotated examples and \(m_I=m-m_P\) examples annotated by intervals of width u

We observe that adding the partially annotated examples monotonically improves the accuracy. This holds true for all tested combinations of \(m_I\), \(m_P\), u and both databases, and is of great practical importance: it suggests that adding cheap partially annotated examples only improves and never worsens the accuracy of the ordinal classifier.

It is seen that the improvement caused by adding the partially annotated examples can be substantial. Not surprisingly, the best results are obtained for the annotation with the narrowest (5-year) intervals. In this case, the performance of the classifier learned from the partial annotations closely matches the supervised setting. In particular, the loss in accuracy resulting from using the partial annotation on the WILD database is on the level of the standard deviation. Even in the most challenging case, learning from 20-year wide intervals, the results are practically useful. For example, to get a classifier with \(\approx 9\) MAE on the WILD database one can either learn from \(\approx 12{,}000\) precisely annotated examples or instead from 6600 precisely annotated plus 14,400 partially annotated with 20-year wide intervals.

Finally, let us define a quantity \(\gamma (\alpha ,\beta )=\hat{R}^{{\textit{MAE}}}(h^{\alpha ,\beta })-\hat{R}^{{\textit{MAE}}}(h^*)\) where \(\hat{R}^{{\textit{MAE}}}(\cdot )\) denotes the test MAE, \(h^{\alpha ,\beta }\) is the classifier learned from partially annotated examples generated by the \(\alpha \beta \)-precise annotation process and \(h^*\) is the classifier learned from the precise annotations only. The quantity \(\gamma (\alpha ,\beta )\) thus measures the loss in test accuracy caused by using the imprecise annotation. The values of \(\gamma (\alpha ,\beta )\) observed on both databases are shown in Fig. 4. We see that the loss in accuracy grows proportionally with the interval width \(u=1+\beta \) and with the portion of partially annotated examples \(1-\alpha \). This observation complies with the theoretical upper bound \(\gamma (\alpha ,\beta ) \le (1-\alpha )\beta \) given in Theorem 2. Although the slope of the real curve \(\gamma (\alpha ,\beta )\), seen as a function of \(1-\alpha \), is considerably smaller than \(\beta \), the tendency is approximately linear, at least in the regime \(1-\alpha \in [0,0.5]\), as predicted by the theoretical bound.
Fig. 4

The figures show \(\gamma (\alpha ,\beta )=\hat{R}^{{\textit{MAE}}}(h^{\alpha ,\beta })-\hat{R}^{{\textit{MAE}}}(h^*)\), the loss in accuracy caused by training from partially annotated examples generated by the \(\alpha \beta \)-precise annotation process relative to the supervised case. The value of \(\gamma (\alpha ,\beta )\) is shown for different \(\beta \) (note that \(u=\beta +1\) is the interval width) as a function of the portion of the partially annotated examples \(1-\alpha \). Figures a and b contain the results obtained on the MORPH and the WILD database, respectively

6 Conclusions and future work

We have proposed a V-shaped interval-insensitive loss suitable for risk minimization based learning of ordinal classifiers from partially annotated examples. We proved that under reasonable assumptions on the annotation process the Bayes risk of the ordinal classifier can be bounded by the expectation of the associated interval-insensitive loss. We proposed a convex surrogate of the interval-insensitive loss associated with an arbitrary supervised V-shaped loss. We derived a generic V-shaped Interval insensitive Loss Minimization Algorithm (VILMA) which translates learning from interval annotations to a convex optimization problem. We also derived other convex surrogates of the interval insensitive loss by extending existing methods like the SVOR-EXP and SVOR-IMC algorithms. We have proposed a cutting plane method which can solve large instances of the resulting convex learning problems. The experiments conducted on a real-life problem of human age estimation from facial images show that the proposed method has practical potential. We demonstrated that a precise ordinal classifier with accuracy matching the state-of-the-art results can be obtained by learning from cheap partial annotations.

Our work is based on the interval-insensitive loss and its convex surrogates, which turned out to work well empirically. We showed that under certain assumptions the expectation of the interval-insensitive loss can be used to upper bound the expectation of the associated target loss. However, a deeper theoretical understanding is needed. For example, an open issue is whether there exists a distribution for which the upper bound is sharp. Another interesting question is how to weaken the assumptions on the annotation process, e.g., the requirement of consistency of the annotation. It is also unclear which of the introduced convex surrogates is theoretically better. We believe that this issue could be resolved by analyzing the statistical consistency of the surrogates as in Zhang (2004) and Tewari and Bartlett (2007). These issues are left for future work.

Footnotes

  1. The sequence \(1,\ldots , Y\) is used just for notational convenience; any other finite and fully ordered set can be used instead.

  2. Courtesy of Eydea Recognition Ltd, www.eyedea.cz.

  3. For simplicity we provide the proof in the non-degenerate case; it can, however, be adapted to the degenerate case as well.


Acknowledgments

The authors were supported by the Grant Agency of the Czech Republic under Project P202/12/2071, the project ERC-CZ 1303 and EU project FP7-ICT-609763 TRADR.

References

  1. Antoniuk, K., Franc, V., & Hlavac, V. (2013). MORD: Multi-class classifier for ordinal regression. In Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML/PKDD) (pp. 96–111).
  2. Boyd, S., & Vandenberghe, L. (2004). Convex optimization. New York, NY: Cambridge University Press.
  3. Chang, K., Chen, C., & Hung, Y. (2011). Ordinal hyperplane ranker with cost sensitivities for age estimation. In Proceedings of computer vision and pattern recognition (CVPR).
  4. Chu, W., & Ghahramani, Z. (2005). Preference learning with Gaussian processes. In Proceedings of the international conference on machine learning (ICML).
  5. Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. In Proceedings of the international conference on machine learning (ICML) (pp. 145–152).
  6. Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. Journal of Machine Learning Research, 12, 1225–1261.
  7. Crammer, K., & Singer, Y. (2001). Pranking with ranking. In Advances in neural information processing systems (NIPS) (pp. 641–647).
  8. Dembczyński, K., Kotlowski, W., & Slowinski, R. (2008). Ordinal classification with decision rules. In Mining complex data. Lecture Notes in Computer Science, 4944, 169–181.
  9. Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
  10. Do, T.-M.-T., & Artières, T. (2009). Large margin training for hidden Markov models with partially observed states. In Proceedings of the international conference on machine learning (ICML).
  11. Franc, V., Sonnenburg, S., & Werner, T. (2012). Cutting-plane methods in machine learning (chapter 7, pp. 185–218). Cambridge, MA: The MIT Press.
  12. Fu, L., & Simpson, D. G. (2002). Conditional risk models for ordinal response data: Simultaneous logistic regression analysis and generalized score test. Journal of Statistical Planning and Inference, 108(1–2), 201–217.
  13. Gondzio, J., du Merle, O., Sarkissian, R., & Vial, J.-P. (1996). ACCPM—A library for convex optimization based on an analytic center cutting plane method. European Journal of Operational Research, 94, 206–211.
  14. Guo, G., & Mu, G. (2010). Human age estimation: What is the influence across race and gender? In Proceedings of the conference on computer vision and pattern recognition workshops (CVPRW).
  15. Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report 07-49, University of Massachusetts, Amherst.
  16. Jie, L., & Orabona, F. (2010). Learning from candidate labeling sets. In Proceedings of advances in neural information processing systems (NIPS).
  17. Kotlowski, W., Dembczynski, K., Greco, S., & Slowinski, R. (2008). Stochastic dominance-based rough set model for ordinal classification. Journal of Information Sciences, 178(21), 4019–4037.
  18. Kumar, N., Berg, A. C., Belhumeur, P. N., & Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In Proceedings of the international conference on computer vision (ICCV).
  19. Li, L., & Lin, H.-T. (2006). Ordinal regression by extended binary classification. In Proceedings of advances in neural information processing systems (NIPS).
  20. Lou, X., & Hamprecht, F. A. (2012). Structured learning from partial annotations. In Proceedings of the international conference on machine learning (ICML).
  21. McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, 42(2), 109–142.
  22. Minear, M., & Park, D. (2004). A lifespan database of adult facial stimuli. Behavior Research Methods, Instruments, & Computers, 36, 630–633.
  23. Ramanathan, N., Chellappa, R., & Biswas, S. (2009). Computational methods for modeling facial aging: A survey. Journal of Visual Languages and Computing, 20, 131–144.
  24. Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling.
  25. Ricanek, K., & Tesafaye, T. (2006). MORPH: A longitudinal image database of normal adult age-progression. In Proceedings of automated face and gesture recognition.
  26. Schlesinger, M. (1968). A connection between learning and self-learning in the pattern recognition (in Russian). Kibernetika, 2, 81–88.
  27. Shashua, A., & Levin, A. (2002). Ranking with large margin principle: Two approaches. In Proceedings of advances in neural information processing systems (NIPS).
  28. Sonnenburg, S., & Franc, V. (2010). COFFIN: A computational framework for linear SVMs. In Proceedings of the international conference on machine learning (ICML).
  29. Sonnenburg, S., Rätsch, G., Henschel, S., Widmer, C., Behr, J., Zien, A., et al. (2010). The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11, 1799–1802.
  30. Teo, C. H., Vishwanthan, S., Smola, A. J., & Le, Q. V. (2010). Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11, 311–365.
  31. Tewari, A., & Bartlett, P. (2007). On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 1007–1025.
  32. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., & Singer, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
  33. Uřičář, M., Franc, V., & Hlaváč, V. (2012). Detector of facial landmarks learned by the structured output SVM. In Proceedings of the international conference on computer vision theory and applications (VISAPP) (Vol. 1, pp. 547–556).
  34. Vapnik, V. N. (1998). Statistical learning theory. New York, NY: Wiley.
  35. Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 56–85.

Copyright information

© The Author(s) 2016

Authors and Affiliations

  1. Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague 6, Czech Republic
