# V-shaped interval insensitive loss for ordinal classification

## Abstract

We address the problem of learning ordinal classifiers from partially annotated examples. We introduce a V-shaped interval-insensitive loss function to measure the discrepancy between the predictions of an ordinal classifier and a partial annotation provided in the form of intervals of candidate labels. We show that under reasonable assumptions on the annotation process the Bayes risk of the ordinal classifier can be bounded by the expectation of an associated interval-insensitive loss. We propose several convex surrogates of the interval-insensitive loss which are used to formulate convex learning problems. We describe a variant of the cutting plane method which can solve large instances of the learning problems. Experiments on a real-life application of human age estimation show that the ordinal classifier learned from cheap partially annotated examples can achieve accuracy matching the results of previously used supervised methods which require expensive precisely annotated examples.

### Keywords

Ordinal classification · Partially annotated examples · Risk minimization

## 1 Introduction

The ordinal classification model (also called ordinal regression) is used in problems where the set of labels is fully ordered; for example, the label can be an age category (0–9, 10–19, \(\ldots \), 90–99) or a respondent's answer to a question (from strongly agree to strongly disagree). Ordinal classifiers are routinely used in social sciences, epidemiology, information retrieval and computer vision.

Recently, many supervised algorithms have been proposed for discriminative learning of ordinal classifiers. The discriminative methods learn parameters of an ordinal classifier by minimizing a regularized convex proxy of the empirical risk. A Perceptron-like on-line algorithm, PRank, was proposed in Crammer and Singer (2001). A large-margin principle was applied to learning ordinal classifiers in Shashua and Levin (2002). The paper (Chu and Keerthi 2005) proposed the Support Vector Ordinal Regression algorithm with explicit constraints (SVOR-EXP) and the SVOR algorithm with implicit constraints (SVOR-IMC). Unlike Shashua and Levin (2002), the SVOR-EXP and SVOR-IMC guarantee that the learned ordinal classifier is statistically plausible. The same approach was proposed independently by Rennie and Srebro (2005), who introduced the so-called immediate-threshold and all-threshold loss functions. Minimization of the quadratically regularized immediate-threshold loss and the all-threshold loss is equivalent to the SVOR-EXP and the SVOR-IMC formulation, respectively. A generic framework proposed in Li and Lin (2006), of which the SVOR-EXP and SVOR-IMC are special instances, allows converting learning of an ordinal classifier into learning of a two-class SVM classifier with weighted examples.

Estimating parameters of a probabilistic model by the Maximum Likelihood (ML) method is another paradigm that can be used to learn ordinal classifiers. A plug-in ordinal classifier can then be constructed by substituting the estimated model into the optimal decision rule derived for a particular loss function (see e.g. Dembczyński et al. 2008 for a list of losses and corresponding decision functions suitable for ordinal classification). Parametric probability distributions suitable for modeling ordinal labels have been proposed in McCullagh (1980), Fu and Simpson (2002), Rennie and Srebro (2005). Besides the parametric methods, non-parametric probabilistic approaches like the Gaussian processes were also proposed (Chu and Ghahramani 2005).

Properties of the discriminative and the ML-based methods are complementary. The ML approach can be directly applied in the presence of incomplete annotation (e.g. the setting considered in this paper, when a label interval is given instead of a single label) by using expectation–maximization algorithms (Schlesinger 1968; Dempster et al. 1977). However, the ML methods are sensitive to model mis-specification, which complicates their application to modeling complex high-dimensional data. In contrast, the discriminative methods are known to be robust against model mis-specification, while their extension to learning from partial annotations is not trivial. To the best of our knowledge, the existing discriminative approaches to ordinal classification assume precise annotation only, that is, each training instance is annotated by exactly one label.

In this paper, we consider learning of ordinal classifiers from partially annotated examples. We assume that each training input is annotated by an interval of candidate labels rather than a single label. This setting is common in practice. For example, consider the computer vision problem of learning an ordinal classifier predicting age from a facial image (e.g. Ramanathan et al. 2009; Chang et al. 2011). In this case, examples of face images are typically downloaded from the Internet and the age of the depicted persons is estimated by a human annotator. Providing a reliable year-exact age just from a face image is virtually impossible. For humans it is more natural and easier to provide an interval estimate of the age. The interval annotation can also be obtained in an automated way, e.g. by the method of Kotlowski et al. (2008), which removes inconsistencies in imprecisely annotated data.

To deal with the interval annotations, we propose an interval-insensitive loss function which extends an arbitrary (supervised) V-shaped loss to the interval setting. The interval-insensitive loss measures the discrepancy between the interval of candidate labels given in the annotation and the label predicted by the classifier. Our interval-insensitive loss can be seen as the ordinal regression counterpart of the \(\epsilon \)-insensitive loss used in support vector regression (Vapnik 1998). We prove that under reasonable assumptions on the annotation process, the Bayes risk of the ordinal classifier can be bounded by the expectation of the interval-insensitive loss. This bound justifies learning of the ordinal classifier via minimization of an empirical estimate of the interval-insensitive loss. Tightness of the bound depends on two intuitive parameters characterizing the annotation process. We show how to control these parameters in practice by properly designing the annotation process. We propose a convex surrogate of an arbitrary V-shaped interval-insensitive loss which is then used to formulate a convex learning problem. We also show how to modify the existing supervised methods, the SVOR-EXP and the SVOR-IMC algorithms, in order to minimize a convex surrogate of the interval-insensitive loss associated with the 0/1-loss and the mean absolute error (MAE) loss. Finally, we design a variant of the cutting plane algorithm which enables solving large instances of the learning problems efficiently.

Discriminative learning from partially annotated examples has recently been studied in the context of generic multi-class classifiers (Cour et al. 2011), Hidden Markov Chain based classifiers (Do and Artières 2009), generic structured output models (Lou and Hamprecht 2012), multi-instance learning (Jie and Orabona 2010), etc. All these methods translate learning into minimization of a partial loss evaluating the discrepancy between the classifier predictions and partial annotations. The partial loss is defined as the minimal value of a supervised loss (defined on a pair of labels, e.g. the 0/1-loss) over all candidate labels consistent with the partial annotation. Our interval-insensitive loss can be seen as an application of this type of partial loss in the context of ordinal classification. In particular, we analyze the partial annotation in the form of intervals of candidate labels together with the mean absolute error, the most typical target loss in ordinal classification. Bounds of the Bayes risk via the expectation of the partial loss have been studied in Cour et al. (2011), but only for the 0/1-loss, which is much less suitable for ordinal classification. It is worth mentioning that the ordinal classification model allows for tight convex approximations of the partial loss, in contrast to previously considered classification models, which often require hard-to-optimize non-convex surrogates (Do and Artières 2009; Lou and Hamprecht 2012; Jie and Orabona 2010).

The paper is organized as follows. Formulation of the learning problem and its solution via minimization of the interval-insensitive loss is presented in Sect. 2. Algorithms approximating minimization of the interval-insensitive loss by convex optimization problems are proposed in Sect. 3. A cutting plane based method solving the convex programs is described in Sect. 4. Section 5 presents experimental evaluation and Sect. 6 concludes the paper.

## 2 Learning ordinal classifier from weakly annotated examples

### 2.1 Learning from completely annotated examples

We consider learning of an ordinal classifier \(h:\mathcal{X}\rightarrow \mathcal{Y}\) of the form

$$h(\varvec{x};\varvec{w},\varvec{\theta }) = 1 + \sum _{y=1}^{Y-1} [\![\langle \varvec{x},\varvec{w}\rangle > \theta _y]\!] \quad (1)$$

where \([\![A]\!]\) is 1 if the statement *A* holds, otherwise it is 0. The classifier (1) splits the real line of projections \(\langle \varvec{x},\varvec{w}\rangle \) into *Y* consecutive intervals defined by thresholds \(\theta _1 \le \theta _2 \le \dots \le \theta _{Y-1}\). The observation \(\varvec{x}\) is assigned the label corresponding to the interval into which the projection \(\langle \varvec{w},\varvec{x}\rangle \) falls. The classifier (1) is a suitable model if the label can be thought of as a rough measurement of a continuous random variable \(\xi (\varvec{x})=\langle \varvec{x},\varvec{w}\rangle +\text{ noise }\) (McCullagh 1980). An example of the ordinal classifier applied to a toy 2D problem is depicted in Fig. 1.
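For concreteness, the prediction rule (1) can be sketched in a few lines of Python (a minimal illustration; the weight, input and threshold values below are hypothetical toy data):

```python
import numpy as np

def ordinal_predict(x, w, theta):
    """Ordinal classifier (1): label = 1 + number of thresholds theta_y
    lying strictly below the projection <x, w>."""
    proj = float(np.dot(w, x))
    return 1 + int(np.sum(proj > np.asarray(theta)))

# Toy example with Y = 4 labels: projection 1.5 exceeds thresholds 0 and 1
label = ordinal_predict(x=[1.5], w=[1.0], theta=[0.0, 1.0, 2.0])
```

The monotone thresholds guarantee that neighboring regions of the projection line receive neighboring labels.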

In the supervised setting, the training examples \(\{(\varvec{x}^i,y^i)\in \mathcal{X}\times \mathcal{Y}\mid i=1,\ldots ,m\}\) are assumed to be drawn i.i.d. from an unknown distribution \(p(\varvec{x},y)\). Given a loss function \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\), the goal is to find a classifier *h* whose *Bayes risk*

$$R(h) = \mathbb {E}_{(\varvec{x},y)\sim p}\big [{\varDelta }(y,h(\varvec{x}))\big ] \quad (3)$$

is as small as possible.

**Definition 1**

(V-shaped loss). A loss \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) is V-shaped if \({\varDelta }(y,y)=0\) and \({\varDelta }(y'',y)\ge {\varDelta }(y',y)\) holds for all triplets \((y,y',y'')\in \mathcal{Y}^3\) such that \(|y''-y| \ge |y'-y|\).
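Definition 1 is easy to verify numerically for a loss given as a function on \(\{1,\ldots ,Y\}^2\). The following brute-force check is an illustration only, not part of the method:

```python
def is_v_shaped(delta, Y):
    """Brute-force check of Definition 1: delta(y, y) = 0 and the loss is
    non-decreasing in the distance of the prediction from the true label y."""
    for y in range(1, Y + 1):
        if delta(y, y) != 0:
            return False
        for yp in range(1, Y + 1):
            for ypp in range(1, Y + 1):
                if abs(ypp - y) >= abs(yp - y) and delta(ypp, y) < delta(yp, y):
                    return False
    return True

# The MAE loss is V-shaped
assert is_v_shaped(lambda y1, y2: abs(y1 - y2), Y=5)
```

The 0/1-loss and the MAE loss, both used later in the paper, pass this check.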

That is, the value of a V-shaped loss grows monotonically with the distance between the predicted and the true label. In this paper we constrain our analysis to the V-shaped losses.

Because the expected risk *R*(*h*) is not accessible directly due to the unknown distribution \(p(\varvec{x},y)\), the discriminative methods like Shashua and Levin (2002), Chu and Keerthi (2005), Li and Lin (2006) minimize a convex surrogate of the empirical risk augmented by a quadratic regularizer. We follow the same framework but with novel surrogate loss functions suitable for learning from partially annotated examples.

### 2.2 Learning from partially annotated examples

We assume that the annotator, instead of providing the true label *y*, returns a partial annotation in the form of an interval of candidate labels \([y_l,y_r]\in \mathcal{P}\). The symbol \(\mathcal{P}=\{ [y_l,y_r]\in \mathcal{Y}^2 \mid y_l\le y_r\}\) denotes the set of all possible partial annotations. The partial annotation \([y_l,y_r]\) means that the true label *y* is from the interval \([y_l,y_r]=\{y\in \mathcal{Y}\mid y_l\le y \le y_r\}\). We shall assume that the annotator can be modeled by a stochastic process determined by a distribution \(p(y_l,y_r\mid \varvec{x}, y)\). That is, we are given a set of partially annotated examples

$$\mathcal{T}^m_I = \big \{(\varvec{x}^i,y_l^i,y_r^i)\in \mathcal{X}\times \mathcal{P}\mid i=1,\ldots ,m\big \} \quad (4)$$

drawn from the distribution \(p(\varvec{x},y)\,p(y_l,y_r\mid \varvec{x},y)\).

The goal of learning from the partially annotated examples is formulated as follows. Given a (supervised) loss function \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) and partially annotated examples (4), the task is to learn the ordinal classifier (1) whose Bayes risk *R*(*h*) defined by (3) is as small as possible. Note that the objective remains the same as in the supervised setting but the information about the labels contained in the training set is reduced to intervals.

### 2.3 Interval insensitive loss

We define an interval-insensitive loss function in order to measure discrepancy between the interval annotation \([y_l,y_r]\in \mathcal{P}\) and the predictions made by the classifier \(h(\varvec{x};\varvec{w},\varvec{\theta })\in \mathcal{Y}\).

**Definition 2**

(Interval-insensitive loss) Let \({\varDelta }\) be a V-shaped loss. The associated interval-insensitive loss \({\varDelta }_I:\mathcal{P}\times \mathcal{Y}\rightarrow \mathbb {R}\) is defined as

$${\varDelta }_I(y_l,y_r,y) = \left\{ \begin{array}{ll} {\varDelta }(y,y_l) &{} \text{ if } y < y_l, \\ 0 &{} \text{ if } y_l\le y\le y_r, \\ {\varDelta }(y,y_r) &{} \text{ if } y > y_r. \end{array}\right. $$

The interval-insensitive loss \({\varDelta }_I(y_l,y_r,y)\) does not penalize predictions which are in the interval \([y_l,y_r]\). Otherwise the penalty is either \({\varDelta }(y,y_l)\) or \({\varDelta }(y,y_r)\), depending on which border of the interval \([y_l,y_r]\) is closer to the prediction *y*. In the special case of the mean absolute error (MAE) \({\varDelta }(y,y')=|y-y'|\), one can think of the associated interval-insensitive loss \({\varDelta }_I(y_l,y_r,y)\) as the discrete counterpart of the \(\epsilon \)-insensitive loss used in Support Vector Regression (Vapnik 1998).
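As a concrete illustration, the interval-insensitive loss of Definition 2 can be sketched as follows (with MAE as the default V-shaped loss):

```python
def mae(y, y_prime):
    """Mean absolute error, a V-shaped loss."""
    return abs(y - y_prime)

def interval_insensitive_loss(y_l, y_r, y, delta=mae):
    """Delta_I of Definition 2: zero inside [y_l, y_r], otherwise the
    V-shaped loss delta measured to the nearer interval border."""
    if y < y_l:
        return delta(y, y_l)
    if y > y_r:
        return delta(y, y_r)
    return 0
```

For example, with the annotation interval [3, 5] the predictions 3, 4 and 5 incur zero loss, while the prediction 8 is penalized by \({\varDelta }(8,5)=3\).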

We propose to approximate the Bayes risk *R*(*h*) defined in (3) by the expectation of the interval-insensitive loss

$$R_I(h) = \mathbb {E}_{(\varvec{x},y_l,y_r)}\big [{\varDelta }_I(y_l,y_r,h(\varvec{x}))\big ]$$

which we call the *partial risk*. The question is how well the partial risk \(R_I(h)\) approximates the Bayes risk *R*(*h*), the target quantity to be minimized. In the rest of this section we first analyze this question for the 0/1-loss, adapting results of Cour et al. (2011), and then we present a novel bound for the MAE loss. In particular, we show that the Bayes risk *R*(*h*) for both losses can be upper bounded by a linear function of the partial risk \(R_I(h)\).

In the sequel we will assume that the annotation process governed by the distribution \(p(y_l,y_r\mid \varvec{x},y)\) is consistent in the following sense.

**Definition 3**

(Consistent annotation process) Let \(p(y_l,y_r\mid \varvec{x},y)\) be a properly defined distribution over \(\mathcal{P}\) for any \((\varvec{x},y)\in \mathcal{X}\times \mathcal{Y}\). The annotation process governed by \(p(y_l,y_r\mid \varvec{x},y)\) is consistent if, for any \(y\in \mathcal{Y}\) and \([y_l,y_r]\in \mathcal{P}\), \(y\notin [y_l,y_r]\) implies \(p(y_l,y_r\mid \varvec{x},y) = 0\).

The consistent annotation process guarantees that the true label is always contained among the candidate labels in the annotation.

Cour et al. (2011) characterize the annotation process by the *ambiguity degree* \(\varepsilon \) which, adapted to our interval setting, is defined as

$$\varepsilon = \max _{\varvec{x},y,z\,:\,p(\varvec{x},y)>0,\, z\ne y} p\big (z\in [y_l,y_r]\mid \varvec{x},y\big ).$$

The ambiguity degree corresponds to the maximal probability of an extra label *z* co-occurring with the true label *y* in the annotation interval \([y_l,y_r]\), over all labels and observations.

**Theorem 1**

Let the annotation process be consistent with ambiguity degree \(\varepsilon < 1\) and let \({\varDelta }\) be the 0/1-loss. Then, for any classifier *h*,

$$R(h) \le \frac{1}{1-\varepsilon }\, R_I(h).$$

Theorem 1 is a direct application of Proposition 1 from Cour et al. (2011).

Next we introduce a novel upper bound for the MAE loss, which is more frequently used in applications of the ordinal classifier. We again consider consistent annotation processes. We characterize the annotation process by two numbers describing the amount of uncertainty in the training data. First, we use \(\alpha \in [0,1]\) to denote a lower bound on the portion of exactly annotated examples, that is, examples annotated by an interval containing just a single label, \([y_l,y_r]\) with \(y_l=y_r\). Second, we use \(\beta \in \{0,\ldots ,Y-1\}\) to denote the maximal uncertainty in the annotation, that is, \(\beta +1\) is the maximal width of an annotation interval which can appear in the training data with non-zero probability.

**Definition 4**

(\(\alpha \beta \)-precise annotation process) A consistent annotation process is \(\alpha \beta \)-precise if the probability of returning an exact annotation, \(y_l=y_r\), is at least \(\alpha \), and every annotation interval \([y_l,y_r]\) with non-zero probability satisfies \(y_r-y_l\le \beta \).

To illustrate the meaning of the parameters \(\alpha \) and \(\beta \), let us consider the extreme cases. If \(\alpha =1\) or \(\beta = 0\) then all examples are annotated exactly and we are back in the standard supervised setting. On the other hand, if \(\beta =Y-1\) then it may happen (but does not have to) that the annotation brings no information about the hidden label because the intervals contain all labels in \(\mathcal{Y}\). With the definition of the \(\alpha \beta \)-precise annotation we can upper bound the Bayes risk in terms of the partial risk as follows:

**Theorem 2**

Proof of Theorem 2 is deferred to Appendix 1.

The bound (8) is obtained by a worst-case analysis, hence it may become trivial in some cases, for example, if all examples are annotated with wide intervals, because then \(\alpha =0\) and \(\beta \) is large. The experimental study presented in Sect. 5 nevertheless shows that the partial risk \(R_I\) is a good proxy even in cases when the upper bound is loose. This suggests that better bounds might be derived, for example, when additional information about \(p(y_l,y_r\mid \varvec{x},y)\) is available.

In practice, the parameters \(\alpha \) and \(\beta \) can be controlled by designing the annotation process as follows:

- 1.
We generate a vector of binary variables \(\varvec{\pi }\in \{0,1\}^m\) according to a Bernoulli distribution with probability \(\alpha \) that a variable is 1.

- 2.
We instruct the annotator to provide a single label for each input example with index from \(\{i\in \{1,\ldots ,m\}\mid \pi _i = 1\}\), while the remaining inputs (with \(\pi _i = 0\)) can be annotated by intervals containing at most \(\beta +1\) labels. That means that approximately \(m\cdot \alpha \) inputs will be annotated exactly and \(m\cdot (1-\alpha )\) with intervals.
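The two-step protocol above can be simulated as follows (a sketch; the random placement of the interval around the true label is one possible choice and is not prescribed by the paper):

```python
import random

def annotate(y, alpha, beta, Y):
    """Simulate a consistent, alpha-beta-precise annotator: with probability
    alpha return the exact label, otherwise a random interval of at most
    beta + 1 labels that contains the true label y (consistency)."""
    if beta == 0 or random.random() < alpha:
        return (y, y)
    width = random.randint(2, beta + 1)              # number of candidate labels
    y_l = random.randint(max(1, y - width + 1), y)   # place the interval around y
    y_r = min(y_l + width - 1, Y)
    return (y_l, y_r)
```

By construction, every returned interval contains the true label and is never wider than \(\beta +1\) labels, so the simulated process is consistent and \(\alpha \beta \)-precise.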

## 3 Algorithms

In the previous section we argued that the partial risk, defined as the expectation of the interval-insensitive loss, is a reasonable proxy of the target Bayes risk. In this section we design algorithms learning the ordinal classifier via minimization of the quadratically regularized empirical risk used as a proxy for the expected risk. Similarly to the standard supervised case, we cannot minimize the empirical risk directly due to the discrete domain of the interval-insensitive loss. For this reason we derive several convex surrogates which allow translating the risk minimization into tractable convex problems.

We first show how to modify two existing supervised methods in order to learn from partially annotated examples. Namely, we extend the Support Vector Ordinal Regression algorithm with explicit constraints (SVOR-EXP) and the Support Vector Ordinal Regression algorithm with implicit constraints (SVOR-IMC) (Chu and Keerthi 2005). The extended interval-insensitive variants are named II-SVOR-EXP (Sect. 3.1) and II-SVOR-IMC (Sect. 3.2), respectively. The II-SVOR-EXP minimizes a convex surrogate of the interval-insensitive loss associated with the 0/1-loss, while the II-SVOR-IMC is designed for minimization of the MAE loss.

In Sect. 3.3, we show how to construct a generic convex surrogate of the interval-insensitive loss associated with an arbitrary V-shaped loss. We call the method minimizing this generic surrogate the V-shaped interval-insensitive loss minimization algorithm (VILMA). We prove that VILMA subsumes the II-SVOR-IMC (as well as the SVOR-IMC) as a special case.

### 3.1 Interval insensitive SVOR-EXP algorithm for optimization of 0/1-loss

Given partially annotated examples \(\mathcal{T}^m_I\), we can learn parameters \((\varvec{w},\varvec{\theta })\) of the ordinal classifier (1) by solving (9) with the surrogate \(\ell _I^\mathrm{EXP}(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\) substituted for \(\ell ^\mathrm{EXP}(\varvec{x},y, \varvec{w},\varvec{\theta })\). We denote the modified variant as the II-SVOR-EXP algorithm.

### 3.2 Interval insensitive SVOR-IMC algorithm for optimization of MAE loss

Given the partially annotated examples \(\mathcal{T}^m_I\), we can learn parameters \((\varvec{w},\varvec{\theta })\) of the ordinal classifier (1) by solving (10) with the proposed surrogate \(\ell ^\mathrm{IMC}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\) substituted for \(\ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\). We denote the modified variant as the II-SVOR-IMC algorithm. Note that due to the equality \(\ell ^\mathrm{IMC}_I(\varvec{x},y,y,\varvec{w},\varvec{\theta })= \ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\) it is clear that the proposed II-SVOR-IMC subsumes the original supervised SVOR-IMC as a special case.
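Equation (10) is not reproduced in this excerpt; the following sketch therefore assumes the standard all-thresholds (IMC-style) hinge form of the surrogate and only illustrates how the interval extension leaves thresholds falling inside the annotation interval unpenalized:

```python
import numpy as np

def ii_imc_surrogate(x, y_l, y_r, w, theta):
    """Sketch of an interval-insensitive all-thresholds hinge surrogate:
    thresholds below y_l push the projection up, thresholds at or above y_r
    push it down, and thresholds inside the interval are insensitive."""
    s = float(np.dot(w, x))
    loss = 0.0
    for k, th in enumerate(theta, start=1):   # thresholds theta_1, ..., theta_{Y-1}
        if k < y_l:
            loss += max(0.0, 1.0 - (s - th))  # want s >= theta_k + 1
        elif k >= y_r:
            loss += max(0.0, 1.0 + (s - th))  # want s <= theta_k - 1
    return loss
```

For an exact annotation \(y_l=y_r=y\) the sketch reduces to the supervised all-thresholds loss, mirroring the equality \(\ell ^\mathrm{IMC}_I(\varvec{x},y,y,\varvec{w},\varvec{\theta })= \ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\) noted above.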

### 3.3 VILMA: V-shaped interval insensitive loss minimization algorithm

**Theorem 3**

The ordinal classifier (1) and the MORD classifier (11) are equivalent in the following sense. For any \(\varvec{w}\in \mathbb {R}^n\) and admissible \(\varvec{\theta }\in {\varTheta }\) there exists \(\varvec{b}\in \mathbb {R}^Y\) such that \(h(\varvec{x};\varvec{w},\varvec{\theta }) = h'(\varvec{x};\varvec{w},\varvec{b})\), \(\forall \varvec{x}\in \mathbb {R}^n\). Conversely, for any \(\varvec{w}\in \mathbb {R}^n\) and \(\varvec{b}\in \mathbb {R}^Y\), there exists an admissible \(\varvec{\theta }\in {\varTheta }\) such that \(h(\varvec{x};\varvec{w},\varvec{\theta }) = h'(\varvec{x};\varvec{w},\varvec{b})\), \(\forall \varvec{x}\in \mathbb {R}^n\).

Proof of Theorem 3 is given in Antoniuk et al. (2013). The proof is constructive in that it provides analytical formulas for conversion between the two parametrizations. For the sake of completeness the conversion formulas are shown in Appendix 2.
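The exact conversion formulas are deferred to Appendix 2. A sketch of one direction, derived by comparing the MORD scores of consecutive labels (label \(y+1\) wins over \(y\) exactly when \(\langle \varvec{w},\varvec{x}\rangle > b_y - b_{y+1} = \theta _y\)), might look as follows; the normalization \(b_1 = 0\) is one possible choice, not necessarily the one used in the appendix:

```python
import numpy as np

def theta_to_b(theta):
    """Convert thresholds theta_1 <= ... <= theta_{Y-1} into MORD biases
    with b_1 = 0 and b_{y+1} = b_y - theta_y."""
    return np.concatenate(([0.0], -np.cumsum(theta)))

def mord_predict(x, w, b):
    """MORD rule (11): h'(x; w, b) = argmax over y of (y * <w, x> + b_y)."""
    s = float(np.dot(w, x))
    scores = [(y + 1) * s + b[y] for y in range(len(b))]
    return 1 + int(np.argmax(scores))
```

Usage: with theta = [0.0, 1.0, 2.0], w = [1.0] and x = [1.5], the threshold rule (1) and `mord_predict` with `theta_to_b(theta)` both assign label 3, illustrating the equivalence of Theorem 3.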

**Proposition 1**

Proof is deferred to Appendix 3.

**Proposition 2**

Proof is deferred to Appendix 4. Proposition 2 ensures that the II-SVOR-IMC algorithm and the VILMA with MAE loss both return the same classification rules although differently parametrized.

To summarize, the VILMA has the following appealing properties:

- 1.
VILMA is applicable for an arbitrary V-shaped loss.

- 2.
VILMA subsumes the II-SVOR-IMC algorithm optimizing the MAE loss as a special case.

- 3.
VILMA converts learning into an unconstrained convex optimization. Note that the II-SVOR-EXP and the II-SVOR-IMC, in contrast to VILMA, maintain the set of linear constraints \(\varvec{\theta }\in \hat{{\varTheta }}\).

## 4 Double-loop cutting plane solver

In each iteration, the CPA solves the *reduced problem* (17), obtained from the master problem by replacing \(G(\varvec{w})\) with its cutting-plane model \(G_t(\varvec{w})\) (18), i.e. the point-wise maximum of linear underestimators (cutting planes) of *G*, each derived from a sub-gradient of *G* at a previously visited point \(\varvec{w}\). Thanks to the convexity of \(G(\varvec{w})\), \(G_t(\varvec{w})\) is a piece-wise linear underestimator of \(G(\varvec{w})\) which is tight at the points \(\varvec{w}_i\), \(i=0,\ldots ,t-1\). In turn, the reduced problem objective \(F_t(\varvec{w})\) is an underestimator of \(F(\varvec{w})\). The cutting-plane model is built iteratively by the following simple procedure. Starting from \(\varvec{w}_0\in \mathbb {R}^n\), the CPA computes a new iterate \(\varvec{w}_t\) by solving the reduced problem (17). In each iteration *t*, the cutting-plane model (18) is updated by a new cutting plane computed at the intermediate solution \(\varvec{w}_t\), leading to a progressively tighter approximation of \(F(\varvec{w})\). The CPA halts once the gap between \(F(\varvec{w}_t)\) and \(F_t(\varvec{w}_t)\) drops below a prescribed \(\varepsilon >0\), which guarantees that \(F(\varvec{w}_t)\le F(\varvec{w}^*) + \varepsilon \) holds. The CPA is guaranteed to halt after at most \(\mathcal{O}(\frac{1}{{\uplambda }\varepsilon })\) iterations (Teo et al. 2010). The CPA is outlined in Algorithm 1.
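The iterative scheme can be illustrated on a one-dimensional toy problem (an illustration only: the grid search below stands in for the QP solver of the reduced problem, and G is a hypothetical piece-wise linear function, not the risk used in the paper):

```python
import numpy as np

def cpa_1d(G, subgrad, lam=0.1, w0=0.0, eps=1e-6, max_iter=50):
    """Cutting-plane algorithm for min_w lam/2 * w^2 + G(w), w scalar.
    Each cut a*w + b is tight at the current iterate and underestimates G."""
    grid = np.linspace(-20.0, 20.0, 40001)  # grid search replaces the QP solver
    cuts, w = [], w0
    for _ in range(max_iter):
        a = subgrad(w)
        cuts.append((a, G(w) - a * w))      # cut is tight at w
        model = np.max([a_i * grid + b_i for a_i, b_i in cuts], axis=0)
        Ft = 0.5 * lam * grid ** 2 + model  # reduced problem objective F_t
        w = float(grid[np.argmin(Ft)])      # solve the reduced problem
        gap = (0.5 * lam * w ** 2 + G(w)) - float(Ft.min())
        if gap <= eps:                      # stop when F(w_t) - F_t(w_t) <= eps
            break
    return w, gap
```

For G(w) = |w − 3| the algorithm reaches the minimizer w* = 3 after a few cuts, and the certified gap guarantees \(\varepsilon \)-optimality exactly as described above.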

The method requires solving two auxiliary convex problems:

- 1.
The reduced problem (17), solved in each iteration of the CPA. The problem (17) is a quadratic program that can be approached via its dual formulation (Teo et al. 2010) having only *t* variables, where *t* is the number of iterations of the CPA. Since the CPA rarely needs more than a few hundred iterations, the dual of (17) can be solved by off-the-shelf QP libraries.

- 2.
The problem (19) providing \(\varvec{b}(\varvec{w})\), which is required to compute \(G(\varvec{w})=R_\mathrm{emp}(\varvec{w},\varvec{b}(\varvec{w}) )\) and the sub-gradient \(G'(\varvec{w})\) via Eq. (20). The problem (19) has only *Y* (the number of labels) variables. Hence it can be approached by generic convex solvers like the Analytic Center Cutting Plane algorithm (Gondzio et al. 1996).

Since the resulting procedure nests an inner cutting-plane solver of the problem (19) inside each iteration of the outer CPA, we call it the *double-loop CPA*.

Finally, we point out that the convex problems associated with the II-SVOR-EXP and the II-SVOR-IMC can be solved by a similar method. The only change is the additional constraints \(\varvec{\theta }\in \hat{{\varTheta }}\) in (15), which propagate to the problem (19).

## 5 Experiments

We evaluate the proposed methods on a real-life computer vision problem: estimating the age of a person from a facial image. Age estimation is a prototypical problem calling for ordinal classification as well as learning from interval annotations. The set of labels corresponds to individual ages, which form an ordered set. Training examples of facial images are cheap; for example, they can be downloaded from the Internet. On the other hand, obtaining the ground-truth age for a given facial image is often very complicated, for obvious reasons. A typical solution used in practice is to annotate the age manually and use it as a replacement for the true age. Creating a year-precise annotation manually is, however, a tedious process. Moreover, the precision and consistency of manual annotations is often poor. Using interval annotation instead of the year-precise one can significantly ease the mentioned problems. We demonstrate on real-life data that the proposed methods can effectively exploit cheap interval annotations for learning precise age estimators.

The experiments have two parts. First, in Sect. 5.2, we present results on precisely annotated examples. By conducting these experiments we (i) set a baseline for the later experiments on partially annotated examples, (ii) numerically verify that the VILMA subsumes the SVOR-IMC algorithm as a special case and (iii) justify usage of the proposed double-loop CPA. Second, in Sect. 5.3, we thoroughly analyze performance of the VILMA on partially annotated examples. We emphasize that all tested algorithms are designed to optimize the MAE loss, the standard evaluation metric of age estimation systems.

### 5.1 Databases and implementation details

- 1.
MORPH database (Ricanek and Tesafaye 2006) is the standard benchmark for age estimation. It contains 55,134 face images with exact age annotation ranging from 16 to 77 years. Because the age category 70+ is severely under-represented (only 9 examples in total) we removed faces with age higher than 70. The database contains frontal police mugshots taken under controlled conditions. The images have resolution 200\(\times \)240 pixels and most of them are of very good quality.

- 2.
WILD database is a collection of three public databases: Labeled Faces in the Wild (Huang et al. 2007), PubFig (Kumar et al. 2009) and PAL (Minear and Park 2004). The images are annotated by several independent annotators. We selected a subset of near-frontal images (yaw angle in \([-30^\circ ,30^\circ ]\)) containing 34,259 faces in total with the age from 1 to 80 years. The WILD database contains challenging “in-the-wild” images exhibiting a large variation in the resolution, illumination changes, race and background clutter.

*Preprocessing* The feature representation of the facial images was computed as follows. We first localized the faces by a commercial face detector ^{2} and subsequently applied a Deformable Part Model based detector (Uřičář et al. 2012) to find facial landmarks like the corners of the eyes, the mouth and the tip of the nose. The found landmarks were used to transform the input face by an affine transform into its canonical pose. Finally, the canonical face of size \(60\times 40\) pixels was described by a multi-scale LBP descriptor (Sonnenburg and Franc 2010), resulting in an \(n=159{,}488\)-dimensional sparse binary vector serving as the input of the ordinal classifier.

*Implementation of the solver* We implemented the double-loop CPA and the standard CPA in C++ by modifying the code from the Shogun machine learning library (Sonnenburg et al. 2010). To solve the internal problem (19) we used the Oracle Based Optimization Engine (OBOE) implementation of the Analytic Center Cutting Plane algorithm, which is part of the COmputational INfrastructure for Operations Research (COIN-OR) project (Gondzio et al. 1996).

### 5.2 Supervised setting

The purpose of the experiments conducted on fully supervised data is threefold. First, to present results of the standard supervised setting which is later used as a baseline. Second, to numerically verify Proposition 2, which states that the VILMA instantiated for the MAE loss subsumes the SVOR-IMC algorithm. Third, to show that imposing an extra quadratic regularization on the biases \(\varvec{b}\) of the MORD rule (11) severely harms the results, which justifies the use of the proposed double-loop CPA.

The number of training examples *m* varied from \(m=3300\) to \(m=33{,}000\) (the total number of training examples in the MORPH database). For each training set we learned the ordinal classifier with the regularization parameter set to \({\uplambda }\in \{1,0.1,0.01,0.001\}\). The classifier corresponding to the \({\uplambda }\) with the smallest validation MAE was applied to the testing examples. This process was repeated for three random splits. We report the averages and the standard deviations of the MAE computed on the test examples over the three splits. The same evaluation procedure was used for the three compared algorithms: (i) the proposed method VILMA, (ii) the standard SVOR-IMC and (iii) the VILMA-REG, which solves the problem (13) but uses the regularization term \(\frac{{\uplambda }}{2}(\Vert \varvec{w}\Vert ^2+\Vert \varvec{b}\Vert ^2)\) instead of \(\frac{{\uplambda }}{2}\Vert \varvec{w}\Vert ^2\). We used the double-loop CPA for the VILMA and the SVOR-IMC, and the standard CPA for the VILMA-REG. Table 1 summarizes the results.

The test MAE of the ordinal classifier learned from the precisely annotated examples by the VILMA, the standard SVOR-IMC and the VILMA-REG using the \(\frac{{\uplambda }}{2}(\Vert \varvec{w}\Vert ^2+\Vert \varvec{b}\Vert ^2)\) regularizer

| | \(m=3300\) | \(m=6600\) | \(m=13{,}000\) | \(m=23{,}000\) | \(m=33{,}000\) |
|---|---|---|---|---|---|
| VILMA | \(5.56 \pm 0.02\) | \(5.12 \pm 0.02\) | \(4.83 \pm 0.02\) | \(4.66 \pm 0.01\) | \(4.55 \pm 0.02\) |
| SVOR-IMC | \(5.56 \pm 0.03\) | \(5.14 \pm 0.02\) | \(4.83 \pm 0.01\) | \(4.68 \pm 0.03\) | \(4.54 \pm 0.01\) |
| VILMA-REG | \(9.57\pm 0.03\) | \(9.21\pm 0.06\) | \(9.07 \pm 0.05\) | \(9.04 \pm 0.05\) | \(9.06 \pm 0.02\) |

We observe that the prediction error decreases steeply as new precisely annotated examples are added. The MAE for the largest training set is \(4.55 \pm 0.02\), which closely matches state-of-the-art methods like Guo and Mu (2010), reporting MAE 4.45 on the same database. The next section shows that similar results can be obtained with cheaper partially annotated examples.

Although the VILMA and the SVOR-IMC learn different parametrizations of the ordinal classifier, the resulting rules are equivalent up to a numerical error, as predicted by Proposition 2. We repeated the same experiment applying the VILMA and the II-SVOR-IMC to the partially annotated examples as described in the next section. The results of both methods were the same up to a numerical error. Hence in the next section we only include the results for the VILMA.

The test MAE of the classifier learned by the VILMA-REG is almost doubled compared to the classifier learned by VILMA via the double-loop CPA. This shows that pushing the biases \(\varvec{b}\) towards zero by the quadratic regularizer, which is necessary if the standard CPA is to be used, has a detrimental effect on the accuracy.

### 5.3 Learning from partially annotated examples

The training sets were created from the databases as follows:

- 1.
\(m_P\) randomly selected examples were annotated precisely by taking the annotation from the databases.

- 2.
\(m_I\) randomly selected examples were annotated by intervals. The admissible annotation intervals were chosen so that they partition the set of ages and have the same width (up to the border cases). The interval width *u* was varied over \(u\in \{5,10,20\}\). The interval annotation was obtained by rounding the true age from the databases to the admissible intervals. For example, in the case of \((u=5)\)-year-wide intervals the true ages \(y\in \{1,2,\dots ,5\}\) were transformed to the interval annotation [1, 5], the ages \(y\in \{6,7,\ldots ,10\}\) to [6, 10], and so on.
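The rounding of a true age to its admissible interval is a simple arithmetic operation (a sketch; the range bounds `y_min`, `y_max` are parameters, and the defaults below correspond to the 1–80 year range of the WILD database):

```python
def age_to_interval(y, u, y_min=1, y_max=80):
    """Round an exact age y to the admissible u-year-wide interval that
    contains it; the last interval may be narrower (border case)."""
    k = (y - y_min) // u                  # index of the admissible interval
    y_l = y_min + k * u
    y_r = min(y_l + u - 1, y_max)
    return y_l, y_r
```

For u = 5 this maps the ages 1–5 to [1, 5], the ages 6–10 to [6, 10], and so on, exactly as described above.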

The tables summarize the test MAE of the ordinal classifier learned from training sets with *m* examples, for the MORPH database (first table) and the WILD database (second table)

| | | \(m=3300\) | \(m=6600\) | \(m=13{,}000\) | \(m=23{,}000\) | \(m=33{,}000\) |
|---|---|---|---|---|---|---|
| Supervised | | \(5.56 \pm 0.02\) | \(5.12 \pm 0.02\) | \(4.83 \pm 0.02\) | \(4.66 \pm 0.01\) | \(4.55 \pm 0.02\) |
| \(m_P\) | *u* | \(m_I = 0\) | \(m_I = 3300\) | \(m_I = 9700\) | \(m_I = 19{,}700\) | \(m_I = 29{,}700\) |
| 3300 | 5 | \(5.56 \pm 0.02\) | \(5.21 \pm 0.04\) | \(4.89 \pm 0.03\) | \(4.70 \pm 0.01\) | \(4.62 \pm 0.01\) |
| | 10 | \(5.56 \pm 0.03\) | \(5.25 \pm 0.02\) | \(5.15 \pm 0.05\) | \(4.97 \pm 0.01\) | \(4.90 \pm 0.04\) |
| | 20 | \(5.56 \pm 0.03\) | \(5.32 \pm 0.03\) | \(5.26 \pm 0.06\) | \(5.06 \pm 0.04\) | \(4.97 \pm 0.01\) |
| \(m_P\) | *u* | \(m_I = 0\) | \(m_I = 0\) | \(m_I = 6400\) | \(m_I = 16{,}400\) | \(m_I = 26{,}400\) |
| 6600 | 5 | — | \(5.12 \pm 0.02\) | \(4.86 \pm 0.02\) | \(4.69 \pm 0.00\) | \(4.61 \pm 0.00\) |
| | 10 | — | \(5.13 \pm 0.02\) | \(4.96 \pm 0.03\) | \(4.81 \pm 0.01\) | \(4.84 \pm 0.04\) |
| | 20 | — | \(5.13 \pm 0.02\) | \(5.03 \pm 0.02\) | \(4.86 \pm 0.04\) | \(4.86 \pm 0.01\) |

\(m=3300\) | \(m=6600\) | \(m=11{,}000\) | \(m=16{,}000\) | \(m=21{,}000\) | ||
---|---|---|---|---|---|---|

Supervised | \(10.40 \pm 0.03\) | \(9.60 \pm 0.03\) | \(9.14 \pm 0.02\) | \(8.89 \pm 0.02\) | \(8.68 \pm 0.02\) | |

| ||||||

\(m_P \) |
| \(m_I = 0\) | \(m_I = 3300\) | \(m_I = 7700\) | \(m_I = 12{,}700\) | \(m_I = 17{,}700\) |

3300 | 5 | \(10.40 \pm 0.03\) | \(9.69 \pm 0.02\) | \(9.23 \pm 0.05\) | \(8.89 \pm 0.02\) | \(8.71 \pm 0.02\) |

10 | \(10.40 \pm 0.03\) | \(9.76 \pm 0.02\) | \(9.42 \pm 0.04\) | \(9.09 \pm 0.02\) | \(8.99 \pm 0.02\) | |

20 | \(10.40 \pm 0.03\) | \(9.88 \pm 0.03\) | \(9.67 \pm 0.04\) | \(9.51 \pm 0.00\) | \(9.40 \pm 0.01\) | |

\(m_P \) |
| \(m_I = 0\) | \(m_I = 0\) | \(m_I = 4400\) | \(m_I = 9400\) | \(m_I = 14{,}400\) |

6600 | 5 | — | \(9.60 \pm 0.03\) | \(9.22 \pm 0.06\) | \(8.89 \pm 0.02\) | \(8.71 \pm 0.02\) |

10 | — | \(9.60 \pm 0.03\) | \(9.22 \pm 0.02\) | \(9.04 \pm 0.03\) | \(8.90 \pm 0.02\) | |

20 | — | \(9.60 \pm 0.03\) | \(9.35 \pm 0.06\) | \(9.14 \pm 0.03\) | \(9.04 \pm 0.02\) |
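The tabulated numbers are mean absolute errors between predicted and true age labels, averaged over repeated random splits. A minimal sketch of the metric on toy data (not from the experiments):

```python
# Test MAE: mean absolute deviation of predicted ordinal labels from true
# labels, averaged over repeated trials. Toy illustration only.

def mae(predictions, truths):
    """Mean absolute error between predicted and true ordinal labels."""
    assert len(predictions) == len(truths)
    return sum(abs(p - t) for p, t in zip(predictions, truths)) / len(truths)

# Two toy trials (e.g. two random train/test splits).
trial_maes = [mae([3, 7, 10], [5, 7, 14]), mae([4, 8, 12], [5, 7, 14])]
mean_mae = sum(trial_maes) / len(trial_maes)
print(round(mean_mae, 2))   # -> 1.67, the average MAE over the two toy trials
```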

We observe that adding partially annotated examples monotonically improves the accuracy. This observation holds true for all tested combinations of \(m_I\), \(m_P\), *u* and both databases, and it is of great practical importance: it suggests that adding cheap partially annotated examples only improves, and never worsens, the accuracy of the ordinal classifier.

It is seen that the improvement gained by adding partially annotated examples can be substantial. Not surprisingly, the best results are obtained for annotation with the narrowest (5-year) intervals. In this case, the performance of the classifier learned from the partial annotations closely matches the supervised setting. In particular, the loss in accuracy resulting from using the partial annotation on the WILD database is on the level of the standard deviation. Even in the most challenging case, learning from 20-years-wide intervals, the results are practically useful. For example, to get a classifier with \(\approx 9\) MAE on the WILD database, one can either learn from \(\approx 12{,}000\) precisely annotated examples, or instead from 6,600 precisely annotated plus 14,400 partially annotated examples with 20-years-wide intervals.

## 6 Conclusions and future work

We have proposed a V-shaped interval-insensitive loss suitable for risk-minimization-based learning of ordinal classifiers from partially annotated examples. We proved that, under reasonable assumptions on the annotation process, the Bayes risk of the ordinal classifier can be bounded by the expectation of the associated interval-insensitive loss. We proposed a convex surrogate of the interval-insensitive loss associated with an arbitrary supervised V-shaped loss, and derived a generic V-shaped Interval-insensitive Loss Minimization Algorithm (VILMA) which translates learning from interval annotations into a convex optimization problem. We also derived other convex surrogates of the interval-insensitive loss by extending existing methods, namely the SVOR-EXP and SVOR-IMC algorithms. We have proposed a cutting plane method which can solve large instances of the resulting convex learning problems. The experiments conducted on a real-life problem of human age estimation from facial images show that the proposed method has practical potential: a precise ordinal classifier with accuracy matching the state-of-the-art results can be obtained by learning from cheap partial annotations.
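Schematically, the central guarantee reads as follows (a sketch only: \(\ell\) denotes the target V-shaped loss, \(\ell_I\) the associated interval-insensitive loss, \([y_l, y_r]\) the annotated interval, and \(h\) the ordinal classifier; the precise statement and the assumptions on the annotation process are given in the main text):

```latex
% Bayes risk of h bounded by the expected interval-insensitive loss
R(h) \;=\; \mathbb{E}_{(x,y)}\bigl[\ell\bigl(y, h(x)\bigr)\bigr]
\;\le\; \mathbb{E}_{(x,\,y_l,\,y_r)}\bigl[\ell_I\bigl(y_l, y_r, h(x)\bigr)\bigr]
```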

Our work is based on the interval-insensitive loss and its convex surrogates, which turned out to work well empirically. We showed that under certain assumptions the expectation of the interval-insensitive loss can be used to upper bound the expectation of the associated target loss. However, a deeper theoretical understanding is needed. For example, an open issue is whether there exists a distribution for which the upper bound is sharp. Another interesting question is how to weaken the assumptions on the annotation process, e.g. the requirement of consistency of the annotation. It is also unclear which of the introduced convex surrogates is theoretically better. We believe that this issue could be resolved by analyzing the statistical consistency of the surrogates as in (Zhang 2004; Tewari and Bartlett 2007). These issues are left for future work.

## Footnotes

- 1.
The sequence \(1,\ldots , Y\) is used just for a notational convenience, however, any other finite and fully ordered set can be used instead.

- 2.
Courtesy of Eyedea Recognition Ltd, www.eyedea.cz.

- 3.
Here, for simplicity, we provide the proof for the non-degenerate case; it can be adapted for the degenerate case as well.

## Notes

### Acknowledgments

The authors were supported by the Grant Agency of the Czech Republic under Project P202/12/2071, the project ERC-CZ 1303 and EU project FP7-ICT-609763 TRADR.

### References

- Antoniuk, K., Franc, V., & Hlavac, V. (2013). MORD: Multi-class classifier for ordinal regression. In *Proceedings of European conference on machine learning and principles and practice of knowledge discovery in databases (ECML/PKDD)* (pp. 96–111).
- Boyd, S., & Vandenberghe, L. (2004). *Convex optimization*. New York, NY: Cambridge University Press.
- Chang, K., Chen, C., & Hung, Y. (2011). Ordinal hyperplane ranker with cost sensitivities for age estimation. In *Proceedings of computer vision and pattern recognition (CVPR)*.
- Chu, W., & Ghahramani, Z. (2005). Preference learning with Gaussian processes. In *Proceedings of the international conference on machine learning (ICML)*.
- Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression. In *Proceedings of the international conference on machine learning (ICML)* (pp. 145–152).
- Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. *Journal of Machine Learning Research*, *12*, 1225–1261.
- Crammer, K., & Singer, Y. (2001). Pranking with ranking. In *Advances in neural information processing systems (NIPS)* (pp. 641–647).
- Dembczyński, K., Kotlowski, W., & Slowinski, R. (2008). Ordinal classification with decision rules. In *Mining complex data. Lecture notes in computer science*, *4944*, 169–181.
- Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. *Journal of the Royal Statistical Society*, *39*(1), 1–38.
- Do, T.-M.-T., & Artières, T. (2009). Large margin training for hidden Markov models with partially observed states. In *Proceedings of the international conference on machine learning (ICML)*.
- Franc, V., Sonnenburg, S., & Werner, T. (2012). *Cutting-plane methods in machine learning* (chapter 7, pp. 185–218). Cambridge, MA: The MIT Press.
- Fu, L., & Simpson, D. G. (2002). Conditional risk models for ordinal response data: Simultaneous logistic regression analysis and generalized score test. *Journal of Statistical Planning and Inference*, *108*(1–2), 201–217.
- Gondzio, J., du Merle, O., Sarkissian, R., & Vial, J.-P. (1996). ACCPM—A library for convex optimization based on an analytic center cutting plane method. *European Journal of Operational Research*, *94*, 206–211.
- Guo, G., & Mu, G. (2010). Human age estimation: What is the influence across race and gender? In *Proceedings of conference on computer vision and pattern recognition workshops (CVPRW)*.
- Huang, G. B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). *Labeled faces in the wild: A database for studying face recognition in unconstrained environments*. Technical report 07-49, University of Massachusetts, Amherst.
- Jie, L., & Orabona, F. (2010). Learning from candidate labeling sets. In *Proceedings of advances in neural information processing systems (NIPS)*.
- Kotlowski, W., Dembczynski, K., Greco, S., & Slowinski, R. (2008). Stochastic dominance-based rough set model for ordinal classification. *Journal of Information Sciences*, *178*(21), 4019–4037.
- Kumar, N., Berg, A. C., Belhumeur, P. N., & Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In *Proceedings of international conference on computer vision (ICCV)*.
- Li, L., & Lin, H.-T. (2006). Ordinal regression by extended binary classification. In *Proceedings of advances in neural information processing systems (NIPS)*.
- Lou, X., & Hamprecht, F. A. (2012). Structured learning from partial annotations. In *Proceedings of the international conference on machine learning (ICML)*.
- McCullagh, P. (1980). Regression models for ordinal data. *Journal of the Royal Statistical Society*, *42*(2), 109–142.
- Minear, M., & Park, D. (2004). A lifespan database of adult facial stimuli. *Behavior Research Methods, Instruments, & Computers*, *36*, 630–633.
- Ramanathan, N., Chellappa, R., & Biswas, S. (2009). Computational methods for modeling facial aging: A survey. *Journal of Visual Languages and Computing*, *20*, 131–144.
- Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In *Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling*.
- Ricanek, K., & Tesafaye, T. (2006). MORPH: A longitudinal image database of normal adult age-progression. In *Proceedings of automated face and gesture recognition*.
- Schlesinger, M. (1968). A connection between learning and self-learning in the pattern recognition (in Russian). *Kibernetika*, *2*, 81–88.
- Shashua, A., & Levin, A. (2002). Ranking with large margin principle: Two approaches. In *Proceedings of advances in neural information processing systems (NIPS)*.
- Sonnenburg, S., & Franc, V. (2010). COFFIN: A computational framework for linear SVMs. In *Proceedings of the international conference on machine learning (ICML)*.
- Sonnenburg, S., Rätsch, G., Henschel, S., Widmer, C., Behr, J., Zien, A., et al. (2010). The SHOGUN machine learning toolbox. *Journal of Machine Learning Research*, *11*, 1799–1802.
- Teo, C. H., Vishwanthan, S., Smola, A. J., & Le, Q. V. (2010). Bundle methods for regularized risk minimization. *Journal of Machine Learning Research*, *11*, 311–365.
- Tewari, A., & Bartlett, P. (2007). On the consistency of multiclass classification methods. *Journal of Machine Learning Research*, *8*, 1007–1025.
- Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., & Singer, Y. (2005). Large margin methods for structured and interdependent output variables. *Journal of Machine Learning Research*, *6*, 1453–1484.
- Uřičář, M., Franc, V., & Hlaváč, V. (2012). Detector of facial landmarks learned by the structured output SVM. In *Proceedings of the international conference on computer vision theory and applications (VISAPP)* (Vol. 1, pp. 547–556).
- Vapnik, V. N. (1998). *Statistical learning theory. Adaptive and learning systems*. New York, NY: Wiley.
- Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. *Annals of Statistics*, *32*(1), 56–134.