V-shaped interval insensitive loss for ordinal classification
Abstract
We address the problem of learning ordinal classifiers from partially annotated examples. We introduce a V-shaped interval-insensitive loss function to measure the discrepancy between the predictions of an ordinal classifier and a partial annotation provided in the form of intervals of candidate labels. We show that under reasonable assumptions on the annotation process the Bayes risk of the ordinal classifier can be bounded by the expectation of an associated interval-insensitive loss. We propose several convex surrogates of the interval-insensitive loss which are used to formulate convex learning problems. We describe a variant of the cutting plane method which can solve large instances of the learning problems. Experiments on a real-life application of human age estimation show that the ordinal classifier learned from cheap partially annotated examples can achieve accuracy matching the results of the so-far used supervised methods which require expensive precisely annotated examples.
Keywords
Ordinal classification · Partially annotated examples · Risk minimization
1 Introduction
The ordinal classification model (also called ordinal regression) is used in problems where the set of labels is fully ordered; for example, the label can be an age category (0–9, 10–19, \(\ldots \), 90–99) or a respondent's answer to a certain question (from strongly agree to strongly disagree). Ordinal classifiers are routinely used in social sciences, epidemiology, information retrieval and computer vision.
Recently, many supervised algorithms have been proposed for discriminative learning of ordinal classifiers. The discriminative methods learn parameters of an ordinal classifier by minimizing a regularized convex proxy of the empirical risk. A Perceptron-like online algorithm, PRank, was proposed in Crammer and Singer (2001). A large-margin principle was applied to learning ordinal classifiers in Shashua and Levin (2002). Chu and Keerthi (2005) proposed the Support Vector Ordinal Regression algorithm with explicit constraints (SVOR-EXP) and the SVOR algorithm with implicit constraints (SVOR-IMC). Unlike Shashua and Levin (2002), the SVOR-EXP and SVOR-IMC guarantee that the learned ordinal classifier is statistically plausible. The same approach was proposed independently by Rennie and Srebro (2005), who introduced the so-called immediate-threshold loss and all-thresholds loss functions. Minimization of a quadratically regularized immediate-threshold loss and all-thresholds loss is equivalent to the SVOR-EXP and the SVOR-IMC formulation, respectively. A generic framework proposed in Li and Lin (2006), of which the SVOR-EXP and SVOR-IMC are special instances, allows one to convert learning of the ordinal classifier into learning of a two-class SVM classifier with weighted examples.
Estimating parameters of a probabilistic model by the Maximum Likelihood (ML) method is another paradigm that can be used to learn ordinal classifiers. A plug-in ordinal classifier can then be constructed by substituting the estimated model into the optimal decision rule derived for a particular loss function (see e.g. Dembczyński et al. 2008 for a list of losses and corresponding decision functions suitable for ordinal classification). Parametric probability distributions suitable for modeling ordinal labels have been proposed in McCullagh (1980), Fu and Simpson (2002), Rennie and Srebro (2005). Besides the parametric methods, non-parametric probabilistic approaches such as Gaussian processes have also been proposed (Chu and Ghahramani 2005).
Properties of the discriminative and the ML based methods are complementary to each other. The ML approach can be directly applied in the presence of incomplete annotation (e.g. the setting considered in this paper, where a label interval is given instead of a single label) by using expectation–maximization algorithms (Schlesinger 1968; Dempster et al. 1997). However, the ML methods are sensitive to model misspecification, which complicates their application in modeling complex high-dimensional data. In contrast, the discriminative methods are known to be robust against model misspecification, while their extension to learning from partial annotations is not trivial. To the best of our knowledge, the existing discriminative approaches for ordinal classification assume precise annotation only, that is, each training instance is annotated by exactly one label.
In this paper, we consider learning of ordinal classifiers from partially annotated examples. We assume that each training input is annotated by an interval of candidate labels rather than a single label. This setting is common in practice. For example, consider the computer vision problem of learning an ordinal classifier predicting age from a facial image (e.g. Ramanathan et al. 2009, Chang et al. 2011). In this case, examples of face images are typically downloaded from the Internet and the age of the depicted persons is estimated by a human annotator. Providing a reliable year-exact age just from a face image is virtually impossible. For humans it is more natural and easier to provide an interval estimate of the age. The interval annotation can also be obtained in an automated way, e.g. by the method of Kotlowski et al. (2008), which removes inconsistencies in imprecisely annotated data.
To deal with the interval annotations, we propose an interval-insensitive loss function which extends an arbitrary (supervised) V-shaped loss to the interval setting. The interval-insensitive loss measures the discrepancy between the interval of candidate labels given in the annotation and a label predicted by the classifier. Our interval-insensitive loss can be seen as the ordinal regression counterpart of the \(\epsilon \)-insensitive loss used in support vector regression (Vapnik 1998). We prove that under reasonable assumptions on the annotation process, the Bayes risk of the ordinal classifier can be bounded by the expectation of the interval-insensitive loss. This bound justifies learning of the ordinal classifier via minimization of an empirical estimate of the interval-insensitive loss. Tightness of the bound depends on two intuitive parameters characterizing the annotation process. We show how to control these parameters in practice by properly designing the annotation process. We propose a convex surrogate of an arbitrary V-shaped interval-insensitive loss which is then used to formulate a convex learning problem. We also show how to modify the existing supervised methods, the SVOR-EXP and the SVOR-IMC algorithms, in order to minimize a convex surrogate of the interval-insensitive loss associated with the 0/1-loss and the mean absolute error (MAE) loss. Finally, we design a variant of a cutting plane algorithm which makes it possible to solve large instances of the learning problems efficiently.
Discriminative learning from partially annotated examples has recently been studied in the context of generic multi-class classifiers (Cour et al. 2011), Hidden Markov Chain based classifiers (Do and Artières 2009), generic structured output models (Lou and Hamprecht 2012), multi-instance learning (Jie and Orabona 2010), etc. All these methods translate learning to minimization of a partial loss evaluating the discrepancy between the classifier predictions and partial annotations. The partial loss is defined as the minimal value of a supervised loss (defined on a pair of labels, e.g. the 0/1-loss) over all candidate labels consistent with the partial annotation. Our interval-insensitive loss can be seen as an application of this type of partial loss in the context of ordinal classification. In particular, we analyze partial annotation in the form of intervals of candidate labels and the mean absolute error, which is the most typical target loss in ordinal classification. Bounds on the Bayes risk via the expectation of the partial loss have been studied in Cour et al. (2011), but only for the 0/1-loss, which is much less suitable for ordinal classification. It is worth mentioning that the ordinal classification model allows for tight convex approximations of the partial loss, in contrast to previously considered classification models, which often require hard-to-optimize non-convex surrogates (Do and Artières 2009; Lou and Hamprecht 2012; Jie and Orabona 2010).
The paper is organized as follows. Formulation of the learning problem and its solution via minimization of the interval-insensitive loss is presented in Sect. 2. Algorithms approximating minimization of the interval-insensitive loss by convex optimization problems are proposed in Sect. 3. A cutting plane based method solving the convex programs is described in Sect. 4. Section 5 presents experimental evaluation and Sect. 6 concludes the paper.
2 Learning ordinal classifier from weakly annotated examples
2.1 Learning from completely annotated examples
Definition 1
(V-shaped loss). A loss \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) is V-shaped if \({\varDelta }(y,y)=0\) and \({\varDelta }(y'',y)\ge {\varDelta }(y',y)\) holds for all triplets \((y,y',y'')\in \mathcal{Y}^3\) such that \(|y''-y'| \ge |y'-y|\).
That is, the value of a V-shaped loss grows monotonically with the distance between the predicted and the true label. In this paper we constrain our analysis to V-shaped losses.
Because the expected risk R(h) is not accessible directly due to the unknown distribution \(p(\varvec{x},y)\), the discriminative methods like Shashua and Levin (2002), Chu and Keerthi (2005), Li and Lin (2006) minimize a convex surrogate of the empirical risk augmented by a quadratic regularizer. We follow the same framework but with novel surrogate loss functions suitable for learning from partially annotated examples.
2.2 Learning from partially annotated examples
The goal of learning from the partially annotated examples is formulated as follows. Given a (supervised) loss function \({\varDelta }:\mathcal{Y}\times \mathcal{Y}\rightarrow \mathbb {R}\) and partially annotated examples (4), the task is to learn the ordinal classifier (1) whose Bayes risk R(h) defined by (3) is as small as possible. Note that the objective remains the same as in the supervised setting but the information about the labels contained in the training set is reduced to intervals.
2.3 Interval insensitive loss
We define an interval-insensitive loss function in order to measure the discrepancy between the interval annotation \([y_l,y_r]\in \mathcal{P}\) and the predictions made by the classifier \(h(\varvec{x};\varvec{w},\varvec{\theta })\in \mathcal{Y}\).
Definition 2
The interval-insensitive loss \({\varDelta }_I(y_l,y_r,y)\) does not penalize predictions which are in the interval \([y_l,y_r]\). Otherwise the penalty is either \({\varDelta }(y,y_l)\) or \({\varDelta }(y,y_r)\), depending on which border of the interval \([y_l,y_r]\) is closer to the prediction y. In the special case of the mean absolute error (MAE) \({\varDelta }(y,y')=|y-y'|\), one can think of the associated interval-insensitive loss \({\varDelta }_I(y_l,y_r,y)\) as the discrete counterpart of the \(\epsilon \)-insensitive loss used in Support Vector Regression (Vapnik 1998).
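The verbal description above can be turned into a few lines of code. The following is a minimal sketch, assuming integer labels and a supervised loss called as `loss(prediction, reference)`; the function names are illustrative and do not come from the paper. It also includes the generic partial loss (the minimum of the supervised loss over all candidate labels) so that the two can be compared.

```python
from typing import Callable

# A supervised loss takes (predicted label, reference label).
Loss = Callable[[int, int], float]


def mae(y_pred: int, y_ref: int) -> float:
    """Absolute error |y - y'| between two integer labels."""
    return float(abs(y_pred - y_ref))


def zero_one(y_pred: int, y_ref: int) -> float:
    """0/1-loss: 1 if the labels differ, 0 otherwise."""
    return float(y_pred != y_ref)


def interval_insensitive(loss: Loss, y_l: int, y_r: int, y: int) -> float:
    """No penalty inside [y_l, y_r]; otherwise the loss to the nearer border."""
    if y_l > y_r:
        raise ValueError("expected y_l <= y_r")
    if y < y_l:
        return loss(y, y_l)
    if y > y_r:
        return loss(y, y_r)
    return 0.0


def partial_loss(loss: Loss, y_l: int, y_r: int, y: int) -> float:
    """Minimum of the supervised loss over all candidate labels in the interval."""
    return min(loss(y, c) for c in range(y_l, y_r + 1))
```

For example, with the annotation [3, 6] and the MAE, a prediction of 4 costs nothing, a prediction of 1 costs 2 and a prediction of 9 costs 3. For the MAE and the 0/1-loss the two functions return the same values, which illustrates why the interval-insensitive loss can be read as a partial loss specialized to intervals.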
In the sequel we will assume that the annotation process governed by the distribution \(p(y_l,y_r\mid \varvec{x},y)\) is consistent in the following sense.
Definition 3
(Consistent annotation process) Let \(p(y_l,y_r\mid \varvec{x},y)\) be a properly defined distribution over \(\mathcal{P}\) for any \((\varvec{x},y)\in \mathcal{X}\times \mathcal{Y}\). The annotation process governed by \(p(y_l,y_r\mid \varvec{x},y)\) is consistent if \(p(y_l,y_r\mid \varvec{x},y) = 0\) for any \(y\in \mathcal{Y}\) and \([y_l,y_r]\in \mathcal{P}\) such that \(y\notin [y_l,y_r]\).
The consistent annotation process guarantees that the true label is always contained among the candidate labels in the annotation.
Theorem 1
Theorem 1 is a direct application of Proposition 1 from Cour et al. (2011).
Next we will introduce a novel upper bound for the MAE loss, which is more frequently used in applications of the ordinal classifier. We again consider consistent annotation processes. We will characterize the annotation process by two numbers describing the amount of uncertainty in the training data. First, we use \(\alpha \in [0,1]\) to denote a lower bound on the portion of exactly annotated examples, that is, examples annotated by an interval containing just a single label, \([y_l,y_r]\) with \(y_l=y_r\). Second, we use \(\beta \in \{0,\ldots ,Y-1\}\) to denote the maximal uncertainty in annotation, that is, \(\beta +1\) is the maximal width of the annotation interval which can appear in the training data with non-zero probability.
Definition 4
To illustrate the meaning of the parameters \(\alpha \) and \(\beta \) let us consider the extreme cases. If \(\alpha =1\) or \(\beta = 0\) then all examples are annotated exactly and we are back in the standard supervised setting. On the other hand, if \(\beta =Y-1\) then it may happen (but does not have to) that the annotation brings no information about the hidden label because the intervals contain all labels in \(\mathcal{Y}\). With the definition of \(\alpha \beta \)-precise annotation we can upper bound the Bayes risk in terms of the partial risk as follows:
Theorem 2
Proof of Theorem 2 is deferred to Appendix 1.
The bound (8) is obtained by a worst-case analysis, hence it may become trivial in some cases, for example, if all examples are annotated with wide intervals, because then \(\alpha =0\) and \(\beta \) is large. The experimental study presented in Sect. 5 nevertheless shows that the partial risk \(R_I\) is a good proxy even in cases when the upper bound is large. This suggests that better bounds might be derived, for example, when additional information about \(p(y_l,y_r\mid \varvec{x},y)\) is available.
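To get a feeling for where a given training set sits between the two extremes, one can compute empirical counterparts of the two parameters. The sketch below is only an illustration: the function name is hypothetical, the intervals are assumed to be pairs of integer labels, and the returned fraction is an estimate of the portion of exactly annotated examples rather than the guaranteed lower bound \(\alpha \) used in the analysis.

```python
from typing import Iterable, Tuple


def empirical_alpha_beta(intervals: Iterable[Tuple[int, int]]) -> Tuple[float, int]:
    """Return (fraction of single-label intervals, maximal interval width minus one)."""
    intervals = list(intervals)
    if not intervals:
        raise ValueError("at least one annotation is required")
    for y_l, y_r in intervals:
        if y_l > y_r:
            raise ValueError("expected y_l <= y_r in every interval")
    # Examples annotated exactly are those with y_l == y_r.
    n_exact = sum(1 for y_l, y_r in intervals if y_l == y_r)
    alpha_hat = n_exact / len(intervals)
    # beta + 1 is the largest number of labels in any interval.
    beta_hat = max(y_r - y_l for y_l, y_r in intervals)
    return alpha_hat, beta_hat
```

For the four annotations [3, 3], [1, 5], [6, 10] and [7, 7], half of the examples are exact and the widest interval holds five labels, so the function returns (0.5, 4).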
The parameters \(\alpha \) and \(\beta \) can be controlled in practice by organizing the annotation process as follows:
1. We generate a vector of binary variables \(\varvec{\pi }\in \{0,1\}^m\) according to a Bernoulli distribution with probability \(\alpha \) that the variable is 1.
2. We instruct the annotator to provide just a single label for each input example with index from \(\{i\in \{1,\ldots ,m\}\mid \pi _i = 1\}\), while the remaining inputs (with \(\pi _i = 0\)) can be annotated by intervals, but not wider than \(\beta +1\) labels. This means that approximately \(m\cdot \alpha \) inputs will be annotated exactly and \(m\cdot (1-\alpha )\) with intervals.
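The two steps above can be sketched in code. This is a minimal illustration under stated assumptions: the function names and the seeding are hypothetical, and the check simply verifies that a set of returned annotations respects the instructions given to the annotator.

```python
import random
from typing import List, Sequence, Tuple


def plan_annotation(m: int, alpha: float, seed: int = 0) -> List[int]:
    """Step 1: draw pi in {0,1}^m, each entry equal to 1 with probability alpha."""
    rng = random.Random(seed)
    return [1 if rng.random() < alpha else 0 for _ in range(m)]


def respects_plan(pi: Sequence[int],
                  intervals: Sequence[Tuple[int, int]],
                  beta: int) -> bool:
    """Step 2 check: exact labels where pi_i = 1, at most beta + 1 labels elsewhere."""
    if len(pi) != len(intervals):
        return False
    for flag, (y_l, y_r) in zip(pi, intervals):
        width = y_r - y_l + 1  # number of candidate labels in the interval
        if width < 1 or width > beta + 1:
            return False
        if flag == 1 and width != 1:
            return False
    return True
```

With `alpha = 0.3` and `m = 1000`, roughly 300 entries of the plan are 1, so roughly 300 inputs are sent to the annotator with a request for a single label.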
3 Algorithms
In the previous section we argued that the partial risk, defined as the expectation of the interval-insensitive loss, is a reasonable proxy for the target Bayes risk. In this section we design algorithms learning the ordinal classifier via minimization of the quadratically regularized empirical risk used as a proxy for the expected risk. Similarly to the standard supervised case, we cannot minimize the empirical risk directly due to the discrete domain of the interval-insensitive loss. For this reason we derive several convex surrogates which allow us to translate the risk minimization into tractable convex problems.
We first show how to modify two existing supervised methods in order to learn from partially annotated examples. Namely, we extend the Support Vector Ordinal Regression algorithm with explicit constraints (SVOR-EXP) and the Support Vector Ordinal Regression algorithm with implicit constraints (SVOR-IMC) (Chu and Keerthi 2005). The extended interval-insensitive variants are named II-SVOR-EXP (Sect. 3.1) and II-SVOR-IMC (Sect. 3.2), respectively. The II-SVOR-EXP is a method minimizing a convex surrogate of the interval-insensitive loss associated with the 0/1-loss, while the II-SVOR-IMC is designed for minimization of the MAE loss.
In Sect. 3.3, we show how to construct a generic convex surrogate of the interval-insensitive loss associated with an arbitrary V-shaped loss. We call the method minimizing this generic surrogate the V-shaped interval insensitive loss minimization algorithm (VILMA). We prove that the VILMA subsumes the II-SVOR-IMC (as well as the SVOR-IMC) as a special case.
3.1 Interval insensitive SVOR-EXP algorithm for optimization of 0/1-loss
Given partially annotated examples \(\mathcal{T}^m_I\), we can learn parameters \((\varvec{w},\varvec{\theta })\) of the ordinal classifier (1) by solving (9) with the surrogate \(\ell _I^\mathrm{EXP}(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\) substituted for \(\ell ^\mathrm{EXP}(\varvec{x},y, \varvec{w},\varvec{\theta })\). We denote the modified variant as the II-SVOR-EXP algorithm.
3.2 Interval insensitive SVOR-IMC algorithm for optimization of MAE loss
Given the partially annotated examples \(\mathcal{T}^m_I\), we can learn parameters \((\varvec{w},\varvec{\theta })\) of the ordinal classifier (1) by solving (10) with the proposed surrogate \(\ell ^\mathrm{IMC}_I(\varvec{x},y_l,y_r,\varvec{w},\varvec{\theta })\) substituted for \(\ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\). We denote the modified variant as the II-SVOR-IMC algorithm. Note that due to the equality \(\ell ^\mathrm{IMC}_I(\varvec{x},y,y,\varvec{w},\varvec{\theta })= \ell ^\mathrm{IMC}(\varvec{x},y,\varvec{w},\varvec{\theta })\) it is clear that the proposed II-SVOR-IMC subsumes the original supervised SVOR-IMC as a special case.
3.3 VILMA: V-shaped interval insensitive loss minimization algorithm
Theorem 3
The ordinal classifier (1) and the MORD classifier (11) are equivalent in the following sense. For any \(\varvec{w}\in \mathbb {R}^n\) and admissible \(\varvec{\theta }\in {\varTheta }\) there exists \(\varvec{b}\in \mathbb {R}^Y\) such that \(h(\varvec{x},\varvec{w},\varvec{\theta }) = h'(\varvec{x},\varvec{w},\varvec{b})\), \(\forall \varvec{x}\in \mathbb {R}^n\). For any \(\varvec{w}\in \mathbb {R}^n\) and \(\varvec{b}\in \mathbb {R}^Y\), there exists admissible \(\varvec{\theta }\in {\varTheta }\) such that \(h(\varvec{x},\varvec{w},\varvec{\theta }) = h'(\varvec{x},\varvec{w},\varvec{b})\), \(\forall \varvec{x}\in \mathbb {R}^n\).
Proof of Theorem 3 is given in Antoniuk et al. (2013). The proof is constructive in that it provides analytical formulas for conversion between the two parametrizations. For the sake of completeness the conversion formulas are shown in Appendix 2.
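One direction of this equivalence can be checked numerically. The sketch below rests on assumptions that are not spelled out in this section: it takes the ordinal classifier (1) to be the usual thresholded rule \(1+\sum _{k=1}^{Y-1}[\![\langle \varvec{x},\varvec{w}\rangle > \theta _k]\!]\) with non-decreasing thresholds, and the MORD classifier (11) to be \(\arg \max _y (\langle \varvec{x},\varvec{w}\rangle \cdot y + b_y)\). The conversion `theta_to_b` is one choice that works under these assumptions and is not necessarily the formula given in Appendix 2; all function names are hypothetical.

```python
import numpy as np


def ordinal_predict(score: float, theta: np.ndarray) -> int:
    """Thresholded rule: 1 + number of thresholds strictly below the score <x, w>."""
    return 1 + int(np.sum(score > theta))


def mord_predict(score: float, b: np.ndarray) -> int:
    """Multi-class form: label y in 1..Y maximizing score * y + b_y."""
    labels = np.arange(1, len(b) + 1)
    return 1 + int(np.argmax(score * labels + b))


def theta_to_b(theta: np.ndarray) -> np.ndarray:
    """Assumed conversion: b_1 = 0 and b_y = -(theta_1 + ... + theta_{y-1})."""
    return np.concatenate(([0.0], -np.cumsum(theta)))
```

The conversion works because the difference between the scores of labels \(y+1\) and \(y\) equals \(\langle \varvec{x},\varvec{w}\rangle - \theta _y\), which stays positive exactly as long as the projection exceeds the next threshold. The reverse direction, from arbitrary \(\varvec{b}\) to admissible \(\varvec{\theta }\), is more involved and is not covered by this sketch.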
Proposition 1
Proof is deferred to Appendix 3.
Proposition 2
Proof is deferred to Appendix 4. Proposition 2 ensures that the II-SVOR-IMC algorithm and the VILMA with MAE loss both return the same classification rules, although differently parametrized.

- VILMA is applicable for an arbitrary V-shaped loss.
- VILMA subsumes the II-SVOR-IMC algorithm optimizing the MAE loss as a special case.
- VILMA converts learning into an unconstrained convex optimization. Note that the II-SVOR-EXP and the II-SVOR-IMC, in contrast to VILMA, maintain the set of linear constraints \(\varvec{\theta }\in \hat{{\varTheta }}\).
4 Double-loop cutting plane solver
1. The reduced problem (17) solved in each iteration of the CPA. The problem (17) is a quadratic program that can be approached via its dual formulation (Teo et al. 2010) having only t variables, where t is the number of iterations of the CPA. Since the CPA rarely needs more than a few hundred iterations, the dual of (17) can be solved by off-the-shelf QP libraries.
2. The problem (19) providing \(\varvec{b}(\varvec{w})\), which is required to compute \(G(\varvec{w})=R_\mathrm{emp}(\varvec{w},\varvec{b}(\varvec{w}) )\) and the subgradient \(G'(\varvec{w})\) via Eq. (20). The problem (19) has only Y (the number of labels) variables. Hence it can be approached by generic convex solvers like the Analytic Center Cutting Plane algorithm (Gondzio et al. 1996).
Finally, we point out that the convex problems associated with the II-SVOR-EXP and the II-SVOR-IMC can be solved by a similar method. The only change is in using additional constraints \(\varvec{\theta }\in \hat{{\varTheta }}\) in (15) which propagate to the problem (19).
5 Experiments
We evaluate the proposed methods on a real-life computer vision problem of estimating the age of a person from his/her facial image. Age estimation is a prototypical problem calling for ordinal classification as well as learning from interval annotations. The set of labels corresponds to individual ages, which form an ordered set. Training examples of facial images are cheap; for example, they can be downloaded from the Internet. On the other hand, obtaining the ground truth age for a given facial image is often very complicated for obvious reasons. A typical solution used in practice is to annotate the age manually and use it as a replacement for the true age. Creating a year-precise annotation manually is, however, a tedious process. Moreover, the precision and consistency of manual annotations can be low. Using the interval annotation instead of the year-precise one can significantly ease these problems. We demonstrate on real-life data that the proposed methods can effectively exploit cheap interval annotations for learning precise age estimators.
The experiments have two parts. First, in Sect. 5.2, we present results on precisely annotated examples. By conducting these experiments we (i) set a baseline for the later experiments on partially annotated examples, (ii) numerically verify that the VILMA subsumes the SVOR-IMC algorithm as a special case and (iii) justify usage of the proposed double-loop CPA. Second, in Sect. 5.3, we thoroughly analyze the performance of the VILMA on partially annotated examples. We emphasize that all tested algorithms are designed to optimize the MAE loss, which is the standard evaluation metric of age estimation systems.
5.1 Databases and implementation details
1. MORPH database (Ricanek and Tesafaye 2006) is the standard benchmark for age estimation. It contains 55,134 face images with exact age annotation ranging from 16 to 77 years. Because the age category 70+ is severely underrepresented (only 9 examples in total) we removed faces with age higher than 70. The database contains frontal police mugshots taken under controlled conditions. The images have resolution 200\(\times \)240 pixels and most of them are of very good quality.
2. WILD database is a collection of three public databases: Labeled Faces in the Wild (Huang et al. 2007), PubFig (Kumar et al. 2009) and PAL (Minear and Park 2004). The images are annotated by several independent annotators. We selected a subset of near-frontal images (yaw angle in \([-30^\circ ,30^\circ ]\)) containing 34,259 faces in total with the age from 1 to 80 years. The WILD database contains challenging “in-the-wild” images exhibiting a large variation in the resolution, illumination changes, race and background clutter.
Preprocessing. The feature representation of the facial images was computed as follows. We first localized the faces by a commercial face detector^{2} and subsequently applied a Deformable Part Model based detector (Uřičář et al. 2012) to find facial landmarks like the corners of eyes, mouth and tip of the nose. The found landmarks were used to transform the input face by an affine transform into its canonical pose. Finally, the canonical face of size \(60\times 40\) pixels was described by a multi-scale LBP descriptor (Sonnenburg and Franc 2010), resulting in an \(n=159{,}488\)-dimensional binary sparse vector serving as an input of the ordinal classifier.
Implementation of the solver. We implemented the double-loop CPA and the standard CPA in C++ by modifying the code from the Shogun machine learning library (Sonnenburg et al. 2010). To solve the internal problem (19) we used the Oracle Based Optimization Engine (OBOE) implementation of the Analytic Center Cutting Plane algorithm, which is a part of the COmputational INfrastructure for Operations Research project (COIN-OR) (Gondzio et al. 1996).
5.2 Supervised setting
The purpose of the experiments conducted on fully supervised data is threefold. First, to present results of the standard supervised setting, which is later used as a baseline. Second, to numerically verify Proposition 2, which states that the VILMA instantiated for the MAE loss subsumes the SVOR-IMC algorithm. Third, to show that imposing an extra quadratic regularization on the biases \(\varvec{b}\) of the MORD rule (11) severely harms the results, which justifies the use of the proposed double-loop CPA.
The test MAE of the ordinal classifier learned from the precisely annotated examples by the VILMA, the standard SVOR-IMC and the VILMA-REG using the \(\frac{{\uplambda }}{2}(\Vert \varvec{w}\Vert ^2+\Vert \varvec{b}\Vert ^2)\) regularizer

| Method | \(m=3300\) | \(m=6600\) | \(m=13{,}000\) | \(m=23{,}000\) | \(m=33{,}000\) |
| --- | --- | --- | --- | --- | --- |
| VILMA | \(5.56 \pm 0.02\) | \(5.12 \pm 0.02\) | \(4.83 \pm 0.02\) | \(4.66 \pm 0.01\) | \(4.55 \pm 0.02\) |
| SVOR-IMC | \(5.56 \pm 0.03\) | \(5.14 \pm 0.02\) | \(4.83 \pm 0.01\) | \(4.68 \pm 0.03\) | \(4.54 \pm 0.01\) |
| VILMA-REG | \(9.57 \pm 0.03\) | \(9.21 \pm 0.06\) | \(9.07 \pm 0.05\) | \(9.04 \pm 0.05\) | \(9.06 \pm 0.02\) |

We observe that the prediction error steeply decreases with adding new precisely annotated examples. The MAE for the largest training set is \(4.55 \pm 0.02\), which closely matches state-of-the-art methods like Guo and Mu (2010), reporting MAE 4.45 on the same database. The next section shows that similar results can be obtained with cheaper partially annotated examples.
Although the VILMA and the SVOR-IMC learn different parametrizations of the ordinal classifier, the resulting rules are equivalent up to a numerical error, as predicted by Proposition 2. We repeated the same experiment applying the VILMA and the II-SVOR-IMC on the partially annotated examples as described in the next section. The results of both methods were the same up to a numerical error. Hence in the next section we only include the results for the VILMA.
The test MAE of the classifier learned by the VILMA-REG is almost doubled compared to the classifier learned by VILMA via the double-loop CPA. This shows that pushing the biases \(\varvec{b}\) towards zero by the quadratic regularizer, which is necessary if the standard CPA is to be used, has a detrimental effect on the accuracy.
5.3 Learning from partially annotated examples

- \(m_P\) randomly selected examples were annotated precisely by taking the annotation from the databases.
- \(m_I\) randomly selected examples were annotated by intervals. The admissible annotation intervals were chosen so that they partition the set of ages and have the same width (up to the border cases). The interval width was varied over \(u\in \{5,10,20\}\). The interval annotation was obtained by rounding the true age from the databases to the admissible intervals. For example, in the case of \((u=5)\)-years wide intervals the true ages \(y\in \{1,2,\dots ,5\}\) were transformed to the interval annotation [1, 5], the ages \(y\in \{6,7,\ldots ,10\}\) to [6, 10] and so on.
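The rounding of a true age to its admissible interval can be sketched as follows. This is an illustration only: the function name is hypothetical, the partition is assumed to start at `y_min = 1` as in the example above, and the optional `y_max` argument is one possible way of handling the border cases by clipping the last interval.

```python
from typing import Optional, Tuple


def age_to_interval(y: int, u: int, y_min: int = 1,
                    y_max: Optional[int] = None) -> Tuple[int, int]:
    """Map an exact age y to the u-years wide interval of the partition containing it."""
    if u < 1:
        raise ValueError("interval width u must be positive")
    if y < y_min or (y_max is not None and y > y_max):
        raise ValueError("age outside the label range")
    k = (y - y_min) // u          # index of the interval containing y
    y_l = y_min + k * u           # left border of that interval
    y_r = y_l + u - 1             # right border before clipping
    if y_max is not None:
        y_r = min(y_r, y_max)     # border case: shorten the last interval
    return y_l, y_r
```

With `u = 5` the ages 1 to 5 are mapped to (1, 5) and the ages 6 to 10 to (6, 10), matching the example in the text.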
The table summarizes test MAE of the ordinal classifier learned from the training set with m examples

MORPH

| | | \(m=3300\) | \(m=6600\) | \(m=13{,}000\) | \(m=23{,}000\) | \(m=33{,}000\) |
| --- | --- | --- | --- | --- | --- | --- |
| Supervised | | \(5.56 \pm 0.02\) | \(5.12 \pm 0.02\) | \(4.83 \pm 0.02\) | \(4.66 \pm 0.01\) | \(4.55 \pm 0.02\) |
| \(m_P\) | \(u\) | \(m_I = 0\) | \(m_I = 3300\) | \(m_I = 9700\) | \(m_I = 19{,}700\) | \(m_I = 29{,}700\) |
| 3300 | 5 | \(5.56 \pm 0.02\) | \(5.21 \pm 0.04\) | \(4.89 \pm 0.03\) | \(4.70 \pm 0.01\) | \(4.62 \pm 0.01\) |
| 3300 | 10 | \(5.56 \pm 0.03\) | \(5.25 \pm 0.02\) | \(5.15 \pm 0.05\) | \(4.97 \pm 0.01\) | \(4.90 \pm 0.04\) |
| 3300 | 20 | \(5.56 \pm 0.03\) | \(5.32 \pm 0.03\) | \(5.26 \pm 0.06\) | \(5.06 \pm 0.04\) | \(4.97 \pm 0.01\) |
| \(m_P\) | \(u\) | \(m_I = 0\) | \(m_I = 0\) | \(m_I = 6400\) | \(m_I = 16{,}400\) | \(m_I = 26{,}400\) |
| 6600 | 5 | n/a | \(5.12 \pm 0.02\) | \(4.86 \pm 0.02\) | \(4.69 \pm 0.00\) | \(4.61 \pm 0.00\) |
| 6600 | 10 | n/a | \(5.13 \pm 0.02\) | \(4.96 \pm 0.03\) | \(4.81 \pm 0.01\) | \(4.84 \pm 0.04\) |
| 6600 | 20 | n/a | \(5.13 \pm 0.02\) | \(5.03 \pm 0.02\) | \(4.86 \pm 0.04\) | \(4.86 \pm 0.01\) |

WILD

| | | \(m=3300\) | \(m=6600\) | \(m=11{,}000\) | \(m=16{,}000\) | \(m=21{,}000\) |
| --- | --- | --- | --- | --- | --- | --- |
| Supervised | | \(10.40 \pm 0.03\) | \(9.60 \pm 0.03\) | \(9.14 \pm 0.02\) | \(8.89 \pm 0.02\) | \(8.68 \pm 0.02\) |
| \(m_P\) | \(u\) | \(m_I = 0\) | \(m_I = 3300\) | \(m_I = 7700\) | \(m_I = 12{,}700\) | \(m_I = 17{,}700\) |
| 3300 | 5 | \(10.40 \pm 0.03\) | \(9.69 \pm 0.02\) | \(9.23 \pm 0.05\) | \(8.89 \pm 0.02\) | \(8.71 \pm 0.02\) |
| 3300 | 10 | \(10.40 \pm 0.03\) | \(9.76 \pm 0.02\) | \(9.42 \pm 0.04\) | \(9.09 \pm 0.02\) | \(8.99 \pm 0.02\) |
| 3300 | 20 | \(10.40 \pm 0.03\) | \(9.88 \pm 0.03\) | \(9.67 \pm 0.04\) | \(9.51 \pm 0.00\) | \(9.40 \pm 0.01\) |
| \(m_P\) | \(u\) | \(m_I = 0\) | \(m_I = 0\) | \(m_I = 4400\) | \(m_I = 9400\) | \(m_I = 14{,}400\) |
| 6600 | 5 | n/a | \(9.60 \pm 0.03\) | \(9.22 \pm 0.06\) | \(8.89 \pm 0.02\) | \(8.71 \pm 0.02\) |
| 6600 | 10 | n/a | \(9.60 \pm 0.03\) | \(9.22 \pm 0.02\) | \(9.04 \pm 0.03\) | \(8.90 \pm 0.02\) |
| 6600 | 20 | n/a | \(9.60 \pm 0.03\) | \(9.35 \pm 0.06\) | \(9.14 \pm 0.03\) | \(9.04 \pm 0.02\) |

We observe that adding the partially annotated examples monotonically improves the accuracy. This observation holds true for all tested combinations of \(m_I\), \(m_P\), u and both databases. This observation is of great practical importance. It suggests that adding cheap partially annotated examples only improves and never worsens the accuracy of the ordinal classifier.
The improvement caused by adding the partially annotated examples can be substantial. Not surprisingly, the best results are obtained for the annotation with the narrowest (5-years) intervals. In this case, the performance of the classifier learned from the partial annotations closely matches the supervised setting. In particular, the loss in accuracy resulting from using the partial annotation on the WILD database is on the level of the standard deviation. Even in the most challenging case, when learning from 20-years wide intervals, the results are practically useful. For example, to get a classifier with \(\approx 9\) MAE on the WILD database one can either learn from \(\approx 12{,}000\) precisely annotated examples or instead from 6,600 precisely annotated plus 14,400 partially annotated examples with 20-years wide intervals.
6 Conclusions and future work
We have proposed a V-shaped interval-insensitive loss suitable for risk minimization based learning of ordinal classifiers from partially annotated examples. We proved that under reasonable assumptions on the annotation process the Bayes risk of the ordinal classifier can be bounded by the expectation of the associated interval-insensitive loss. We proposed a convex surrogate of the interval-insensitive loss associated with an arbitrary supervised V-shaped loss. We derived a generic V-shaped Interval insensitive Loss Minimization Algorithm (VILMA) which translates learning from the interval annotations to a convex optimization problem. We also derived other convex surrogates of the interval-insensitive loss by extending existing methods, namely the SVOR-EXP and SVOR-IMC algorithms. We have proposed a cutting plane method which can solve large instances of the resulting convex learning problems. The experiments conducted on a real-life problem of human age estimation from facial images show that the proposed method has practical potential. We demonstrated that a precise ordinal classifier with accuracy matching the state-of-the-art results can be obtained by learning from cheap partial annotations.
Our work is based on the interval-insensitive loss and its convex surrogates, which turned out to work well empirically. We showed that under certain assumptions the expectation of the interval-insensitive loss can be used to upper bound the expectation of the associated target loss. However, a deeper theoretical understanding is needed. For example, an open issue is whether there exists a distribution for which the upper bound is sharp. Another interesting question is how to weaken the assumptions on the annotation process, e.g. the requirement of consistency of the annotation. It is also unclear which of the introduced convex surrogates is theoretically better. We believe that this issue could be resolved by analyzing statistical consistency of the surrogates as in (Zhang 2004; Tewari and Bartlett 2007). These issues are left for future work.
Footnotes
 1. The sequence \(1,\ldots , Y\) is used only for notational convenience; any other finite, fully ordered set can be used instead.
 2. Courtesy of Eyedea Recognition Ltd, www.eyedea.cz.
 3. For simplicity, we provide the proof here for the non-degenerate case; it can, however, be adapted to the degenerate case as well.
Acknowledgments
The authors were supported by the Grant Agency of the Czech Republic under Project P202/12/2071, the project ERC-CZ 1303 and the EU project FP7-ICT-609763 TRADR.
References
 Antoniuk, K., Franc, V., & Hlavac, V. (2013). Mord: Multiclass classifier for ordinal regression. In Proceedings of European conference on machine learning and principles and practice of knowledge discovery in databases (ECML/PKDD) (pp. 96–111).
 Boyd, S., & Vandenberghe, L. (2004). Convex optimization. New York, NY: Cambridge University Press.
 Chang, K., Chen, C., & Hung, Y. (2011). Ordinal hyperplane ranker with cost sensitivities for age estimation. In Proceedings of computer vision and pattern recognition (CVPR).
 Chu, W., & Ghahramani, Z. (2005). Preference learning with Gaussian processes. In Proceedings of the international conference on machine learning (ICML).
 Chu, W., & Keerthi, S.S. (2005). New approaches to support vector ordinal regression. In Proceedings of the international conference on machine learning (ICML) (pp. 145–152).
 Cour, T., Sapp, B., & Taskar, B. (2011). Learning from partial labels. Journal of Machine Learning Research, 12, 1225–1261.
 Crammer, K., & Singer, Y. (2001). Pranking with ranking. In Advances in neural information processing systems (NIPS) (pp. 641–647).
 Dembczyński, K., Kotlowski, W., & Slowinski, R. (2008). Ordinal classification with decision rules. In Mining complex data. Lecture notes in computer science, 4944, 169–181.
 Dempster, A., Laird, N., & Rubin, D. (1997). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 1(39), 1–38.
 Do, T.M.T., & Artières, T. (2009). Large margin training for hidden Markov models with partially observed states. In Proceedings of the international conference on machine learning (ICML).
 Franc, V., Sonnenburg, S., & Werner, T. (2012). Cutting-plane methods in machine learning (chapter 7, pp. 185–218). The MIT Press, Cambridge, USA.
 Fu, L., & Simpson, D.G. (2002). Conditional risk models for ordinal response data: Simultaneous logistic regression analysis and generalized score test. Journal of Statistical Planning and Inference, 108(1–2), 201–217.
 Gondzio, J., du Merle, O., Sarkissian, R., & Vial, J.P. (1996). ACCPM: A library for convex optimization based on an analytic center cutting plane method. European Journal of Operational Research, 94, 206–211.
 Guo, G., & Mu, G. (2010). Human age estimation: What is the influence across race and gender? In Proceedings of conference on computer vision and pattern recognition workshops (CVPRW).
 Huang, G.B., Ramesh, M., Berg, T., & Learned-Miller, E. (2007). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report 07-49, University of Massachusetts, Amherst.
 Jie, L., & Orabona, F. (2010). Learning from candidate labeling sets. In Proceedings of advances in neural information processing systems (NIPS).
 Kotlowski, W., Dembczynski, K., Greco, S., & Slowinski, R. (2008). Stochastic dominance-based rough set model for ordinal classification. Journal of Information Sciences, 178(21), 4019–4037.
 Kumar, N., Berg, A. C., Belhumeur, P. N., & Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In Proceedings of international conference on computer vision (ICCV).
 Li, L., & Lin, H.T. (2006). Ordinal regression by extended binary classification. In Proceedings of advances in neural information processing systems (NIPS).
 Lou, X., & Hamprecht, F.A. (2012). Structured learning from partial annotations. In Proceedings of the international conference on machine learning (ICML).
 McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, 42(2), 109–142.
 Minear, M., & Park, D. (2004). A lifespan database of adult facial stimuli. Behavior Research Methods, Instruments, & Computers: A Journal of the Psychonomic Society, 36, 630–633.
 Ramanathan, N., Chellappa, R., & Biswas, S. (2009). Computational methods for modeling facial aging: A survey. Journal of Visual Languages and Computing, 20, 131–144.
 Rennie, J. D., & Srebro, N. (2005). Loss functions for preference levels: Regression with discrete ordered labels. In Proceedings of the IJCAI multidisciplinary workshop on advances in preference handling.
 Ricanek, K., & Tesafaye, T. (2006). Morph: A longitudinal image database of normal adult age-progression. In Proceedings of automated face and gesture recognition.
 Schlesinger, M. (1968). A connection between learning and self-learning in the pattern recognition (in Russian). Kibernetika, 2, 81–88.
 Shashua, A., & Levin, A. (2002). Ranking with large margin principle: Two approaches. In Proceedings of advances in neural information processing systems (NIPS).
 Sonnenburg, S., & Franc, V. (2010). Coffin: A computational framework for linear SVMs. In Proceedings of the international conference on machine learning (ICML).
 Sonnenburg, S., Rätsch, G., Henschel, S., Widmer, C., Behr, J., Zien, A., et al. (2010). The shogun machine learning toolbox. Journal of Machine Learning Research, 11, 1799–1802.
 Teo, C. H., Vishwanathan, S., Smola, A. J., & Le, Q. V. (2010). Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11, 311–365.
 Tewari, A., & Bartlett, P. (2007). On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8, 1007–1025.
 Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., & Singer, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
 Uřičář, M., Franc, V., & Hlaváč, V. (2012). Detector of facial landmarks learned by the structured output SVM. In Proceedings of the international conference on computer vision theory and applications (VISAPP) (Vol. 1, pp. 547–556).
 Vapnik, V. N. (1998). Statistical learning theory. Adaptive and learning systems. New York, New York: Wiley.
 Zhang, T. (2004). Statistical behaviour and consistency of classification methods based on convex risk minimization. Annals of Statistics, 31(1), 56–134.