1 Introduction

Our objective, in this paper, is to develop efficient algorithms for max-margin, multi-label classification. Given a set of pre-specified labels and a data point, (binary) multi-label classification deals with the problem of predicting the subset of labels most relevant to the data point. This is in contrast to multi-class classification, where one has to predict just the single, most probable label. For instance, rather than simply saying that Fig. 1 is an image of a Babirusa, we might prefer to describe it as containing a brown, hairless, herbivorous, medium-sized quadruped with tusks growing out of its snout.

Fig. 1 Having never seen a Babirusa before, we can still describe it as a brown, hairless, herbivorous, medium-sized quadruped with tusks growing out of its snout

There are many advantages in generating such a description, and multi-label classification has found applications in areas ranging from computer vision to natural language processing to bio-informatics. We are specifically interested in the problem of image search on the web and in personal photo collections. In such applications, it is very difficult to get training data for every possible object in the world that someone might conceivably search for. In fact, we might not have any training images whatsoever for many object categories, such as the obscure Babirusa. Nevertheless, we cannot preclude the possibility of someone searching for one of these objects. A similar problem is encountered when trying to search videos on the basis of human body pose and motion, and in many other applications such as neural activity decoding (Palatucci et al. 2009).

One way of recognising object instances from previously unseen test categories (the zero-shot learning problem) is by leveraging knowledge about common attributes and shared parts. For instance, given adequately labelled training data, one can learn classifiers for the attributes occurring in the training object categories. These classifiers can then be used to recognise the same attributes in object instances from the novel test categories. Recognition can then proceed on the basis of these learnt attributes (Farhadi et al. 2009, 2010; Lampert et al. 2009).

The learning problem can therefore be posed as multi-label classification where there is a significant difference between attribute (label) correlations in the training categories and the previously unseen test categories. What adds to the complexity of the problem is the fact that these attributes are often densely correlated as they are shared across most categories. This makes optimising over the exponentially large output space, given by the power set of all labels, very difficult. The problem is acute not just during prediction but also during training as the number of training images might grow to be quite large over time in some applications.

Previously proposed solutions to the multi-label problem take one of two approaches—neither of which can be applied straightforwardly in our scenario. In the first, labels are a priori assumed not to be correlated so that a predictor can be trained for each label independently. This reduces training and prediction complexity from exponential in the number of labels to linear. Such methods can therefore scale efficiently to large problems, but at the cost of not being able to model label correlations. Furthermore, these methods typically do not minimise a multi-label loss. In the second, label correlations are explicitly taken into account by incorporating pairwise, or higher order, label interactions. However, exact inference is mostly intractable for densely correlated labels and in situations where the label correlation graph has loops. Most approaches therefore assume sparsely correlated labels, such as those arranged in a hierarchical tree structure.

In this paper, we follow a middle approach. We develop a max-margin multi-label classification formulation, referred to as M3L, where we do model prior label correlations but do not incorporate pairwise, or higher order, label interaction terms in the prediction function. This lets us generalise to the case where the training label correlations might differ significantly from the test label correlations. We can also efficiently handle densely correlated labels. In particular, we show that under fairly general assumptions of linearity, the M3L primal formulation can be reduced from having an exponential number of constraints to linear in the number of labels. Furthermore, if no prior information about label correlations is provided, M3L reduces directly to the 1-vs-All method. This lets us provide a principled interpretation of the 1-vs-All multi-label approach which has enjoyed the reputation of being a popular, effective but nevertheless, heuristic technique.

Much of the focus of this paper is on optimising the M3L formulation. It turns out that it is not enough to just reduce the primal to having only a linear number of constraints. A straightforward application of state-of-the-art decompositional optimisation methods, such as Sequential Minimal Optimisation (SMO), would lead to an algorithm that is super-quadratic in the number of labels. We therefore develop specialised optimisation algorithms that can be orders of magnitude faster than competing methods. In particular, for kernelised M3L, we show that by simple bookkeeping and delaying gradient updates, SMO can be adapted to yield a linear time algorithm. Furthermore, due to efficient kernel caching and jointly optimising all variables, we can sometimes be an order of magnitude faster than the 1-vs-All method. Thus our code, available from Hariharan et al. (2010a), should also be very useful for learning independent 1-vs-All classifiers. For linear M3L, we adopt a dual co-ordinate ascent strategy with shrinkage which lets us efficiently tackle large scale training data sets. In terms of prediction accuracy, we show that incorporating prior knowledge about label correlations using the M3L formulation can substantially boost performance over independent methods.

The rest of the paper is organised as follows. Related work is reviewed in Sect. 2. Section 3 develops the M3L primal formulation and shows how to reduce the number of primal constraints from exponential to linear. The 1-vs-All formulation is also shown to be a special case of the M3L formulation. The M3L dual is developed in Sect. 4 and optimised in Sects. 5 and 6, where we develop algorithms tuned to the kernelised and the linear case respectively. Experiments are carried out in Sect. 7 and it is demonstrated that the M3L formulation can lead to significant gains in terms of both optimisation and prediction accuracy. An earlier version of this paper appeared in Hariharan et al. (2010b).

2 Related Work

The multi-label problem has many facets including binary (Tsoumakas and Katakis 2007; Ueda and Saito 2003), multi-class (Dekel and Shamir 2010) and ordinal (Cheng et al. 2010) multi-label classification as well as semi-supervised learning, feature selection (Zhang and Wang 2009b), active learning (Li et al. 2004), multi-instance learning (Zhang and Wang 2009a), etc. Our focus, in this paper, is on binary multi-label classification, where most of the previous work can be categorised into one of two approaches depending on whether labels are assumed to be independent or not. We first review approaches that do assume label independence. Most of these methods try to reduce the multi-label problem to a more “canonical” one such as regression, ranking, multi-class or binary classification.

In regression methods (Hsu et al. 2009; Ji et al. 2008; Tsoumakas and Katakis 2007), the label space is mapped onto a vector space (which might sometimes be a shared subspace of the feature space) where regression techniques can be applied straightforwardly. The primary advantage of such methods is that they can be extremely efficient if the mapped label space has significantly lower dimensionality than the original label space (Hsu et al. 2009). The disadvantage of such approaches is that the choice of an appropriate mapping might be unclear. As a result, minimising regression loss functions, such as the square loss, in this space might be very efficient but might not correlate strongly with minimising the desired multi-label loss. Furthermore, classification involves inverting the map, which might not be straightforward, might result in multiple solutions and might involve heuristics.

A multi-label problem with L labels can be viewed as a classification problem with $2^L$ classes (McCallum 1999; Boutell et al. 2004) and standard multi-class techniques can be brought to bear. Such an approach was shown to give the best empirical results in the survey by Tsoumakas and Katakis (2007). However, such approaches have three major drawbacks. First, since not all $2^L$ label combinations can be present in the training data, many of the classes will have no positive examples. Thus, predictors cannot be learnt for these classes, implying that these label combinations cannot be recognised at run time. Second, the 0/1 multi-class loss optimised by such methods is a poor approximation to most multi-label losses. For instance, the 0/1 loss would charge the same penalty for predicting all but one of the labels correctly as it would for predicting all of the labels incorrectly. Finally, learning and predicting with such a large number of classifiers might be very computationally expensive.

Binary classification can be leveraged by replicating the feature vector of each data point L times. For copy number l, an extra dimension is added to the feature vector with value l, and the training label is +1 if label l is present in the label set of the original point and −1 otherwise. A binary classifier can be learnt from this expanded training set, and a novel point is classified by first replicating it as described above and then applying the binary classifier L times to determine which labels are selected. Due to the data replication, applying a binary classifier naively would be computationally costly and would require that complex decision boundaries be learnt. However, Schapire and Singer (2000) show that the problem can be solved efficiently using Boosting. A somewhat related technique is 1-vs-All (Rifkin and Klautau 2004), which independently learns a binary classifier for each label. As we'll show in Sect. 3, our formulation generalises 1-vs-All to handle prior label correlations.
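A minimal sketch of this data replication (our illustration of the construction, not Schapire and Singer's actual algorithm) might look as follows:

```python
import numpy as np

def replicate_for_binary(X, Y):
    """Reduce multi-label data to one binary problem by replication.

    X: (N, D) features; Y: (N, L) labels with +/-1 entries. Each point
    is copied L times; copy l carries an extra feature with value l and
    the binary target Y[i, l]."""
    N = X.shape[0]
    L = Y.shape[1]
    X_rep = np.repeat(X, L, axis=0)                # point 0 x L, point 1 x L, ...
    label_id = np.tile(np.arange(1, L + 1), N)     # the added dimension with value l
    X_bin = np.hstack([X_rep, label_id[:, None]])  # (N*L, D+1)
    y_bin = Y.reshape(-1)                          # matches the replication order
    return X_bin, y_bin
```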

A ranking-based solution was proposed in Elisseeff and Weston (2001). The objective was to ensure that, for every data point, all the relevant labels are ranked higher than any of the irrelevant ones. This approach has been influential but suffers from the drawback that the number of labels to select from the ranking cannot easily be determined. The solution proposed in Elisseeff and Weston (2001) was to find a threshold so that all labels scoring above the threshold were selected. The threshold was determined using a regressor trained subsequently on the ranker's output on the training set. Many variations have been proposed, such as using dummy labels to determine the threshold, but each has its own limitations and no clear choice has emerged. Furthermore, posing the problem as ranking induces a quadratic number of constraints per example, which leads to a harder optimisation problem. This is ameliorated in Crammer and Singer (2003), who reduce the space complexity to linear and the time complexity to sub-quadratic.

Most of the approaches mentioned above do not explicitly model label correlations—McCallum (1999) uses a generative model which can, in principle, handle correlations, but greedy heuristics are used to search over the exponential label space. In terms of discriminative methods, most work has focused on hierarchical tree, or forest, structured labels. Methods such as those of Cai and Hofmann (2007) and Cesa-Bianchi et al. (2006) optimise a hierarchical loss over the tree structure but do not incorporate pairwise, or higher order, label interaction terms. In both these methods, a label is predicted only if its parent has also been predicted in the hierarchy. For instance, Cesa-Bianchi et al. (2006) train a classifier for each node of the tree. The positive training data for the classifier is the set of data points marked with the node label, while the negative training points are selected from the sibling nodes. Classification starts at the root and all the children classifiers are tested to determine which path to take. This leads to a very efficient algorithm during both training and prediction as each classifier is trained on only a subset of the data. Alternatively, Cai and Hofmann (2007) classify at only the leaf nodes and use them as a proxy for the entire path starting from the root. A hierarchical loss is defined and optimised using the ranking method of Elisseeff and Weston (2001).

The M3N formulation of Taskar et al. (2003) was the first to suggest max-margin learning of label interactions. The original formulation starts off with an exponential number of constraints. These can be reduced to a quadratic number if the label interactions form a tree or forest. Approximate algorithms have also been developed for sparse, loopy graph structures. While the M3N formulation dealt with the Hamming loss, a more suitable hierarchical loss was introduced and efficiently optimised in Rousu et al. (2006) for the case of hierarchies. Note that even though these methods take label correlations explicitly into account, they are unsuitable for our purposes: they cannot handle densely correlated labels, and they learn training set label correlations which are not useful at test time since the statistics might have changed significantly.

Finally, Tsochantaridis et al. (2005) propose an iterative, cutting plane algorithm for learning in general structured output spaces. The algorithm adds the worst violating constraint to the active set in each iteration and is proved to take a maximum number of iterations independent of the size of the output space. While this algorithm can be used to learn pairwise label interactions, it too cannot handle a fully connected graph, as the worst violating constraint cannot in general be found in polynomial time. However, it can be used to optimise our proposed M3L formulation, though it is an order of magnitude slower than the specialised optimisation algorithms we develop.

Zero-shot learning deals with the problem of recognising instances from novel categories that were not present during training. It is a nascent research problem and most approaches tackle it by building an intermediate representation, leveraging attributes, features or classifier outputs which can be learnt from the available training data (Farhadi et al. 2009, 2010; Lampert et al. 2009; Palatucci et al. 2009). Novel instances are classified by first generating their intermediate representation and then mapping it onto the novel category representation (which can be generated using meta-data alone). The focus of research has mainly been on what makes a good intermediate representation and how the mapping should be carried out.

A popular choice of intermediate representation has been parts and attributes—whether semantic or discriminative. Since not all features are relevant to all attributes, Farhadi et al. (2009) explore feature selection so as to better predict a novel instance's attributes. Probabilistic techniques for mapping the list of predicted attributes to a novel category's list of attributes (known a priori) are developed in Lampert et al. (2009), while Palatucci et al. (2009) carry out a theoretical analysis and use the one-nearest-neighbour rule. An alternative approach to zero-shot learning is not to name the novel object, or explicitly recognise its attributes, but simply to say that it is "like" an object seen during training (Wang et al. 2010). For instance, the Babirusa in Fig. 1 could be declared to be like a pig. This is sufficient for some applications and works well if the training set has good category-level coverage.

3 M3L: the max-margin multi-label classification primal formulation

The objective in multi-label classification is to learn a function f which can be used to assign a set of labels to a point $\mathbf{x}$. We assume that N training data points have been provided of the form $(\mathbf{x}_i, \mathbf{y}_i) \in \mathbb{R}^D \times \{\pm1\}^L$ with $y_{il}$ being +1 if label l has been assigned to point i and −1 otherwise. Note that such an encoding allows us to learn from both the presence and absence of labels, since both can be informative when predicting test categories.

A principled way of formulating the problem would be to take the loss function Δ that one truly cares about and minimise it over the training set subject to regularisation or prior knowledge. Of course, since direct minimisation of most discrete loss functions is hard, we might end up minimising an upper bound on the loss, such as the hinge. The learning problem can then be formulated as the following primal

$$ P_1 \equiv \min_{\mathbf{w},\boldsymbol{\xi}} \frac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_{i=1}^{N} \xi_i $$
(1)
$$ \mbox{s.t.}\quad f(\mathbf{x}_i,\mathbf{y}_i) \geq f(\mathbf{x}_i,\mathbf{y}) + \varDelta(\mathbf{y}_i,\mathbf{y}) - \xi_i \quad \forall i,\ \forall \mathbf{y} \in \{\pm1\}^L $$
(2)
$$ \xi_i \geq 0 \quad \forall i $$
(3)

with a new point $\mathbf{x}$ being assigned labels according to $\mathbf{y}^* = \mathop{\mbox{argmax}}_{\mathbf{y}} f(\mathbf{x},\mathbf{y})$. The drawback of such a formulation is that there are $N2^L$ constraints, which makes direct optimisation very slow. Furthermore, classification of novel points might require $2^L$ function evaluations (one for each possible value of $\mathbf{y}$), which can be prohibitive at run time. In this section, we demonstrate that, under general assumptions of linearity, (P1) can be reformulated as the minimisation of L densely correlated sub-problems, each having only N constraints. At the same time, the prediction cost is reduced to a single function evaluation with complexity linear in the number of labels. The ideas underlying this decomposition were also used in Evgeniou et al. (2005) in a multi-task learning scenario. However, their objective is to combine multiple tasks into a single learning problem, while we are interested in decomposing (P1) into multiple subproblems.

We start by making the standard assumption that

$$ f(\mathbf{x},\mathbf{y}) = \mathbf{w}^t\bigl(\boldsymbol{\phi}(\mathbf{x}) \otimes \boldsymbol{\psi}(\mathbf{y})\bigr) $$
(4)

where $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$ are the feature and label space mappings respectively, ⊗ is the Kronecker product and $\mathbf{w}^t$ denotes the transpose of $\mathbf{w}$. Note that, for zero-shot learning, it is possible to show theoretically that, in the limit of infinite data, one does not need to model label correlations when the training and test distributions are the same (Palatucci et al. 2009). In practice, however, training sets are finite, often relatively small, and have label distributions that differ significantly from the test set. Therefore, to incorporate prior knowledge and correlate classifiers efficiently, we assume that labels have at most linear, possibly dense, correlation so that it is sufficient to choose $\boldsymbol{\psi}(\mathbf{y}) = \mathbf{P}\mathbf{y}$ where $\mathbf{P}$ is an invertible matrix encoding all our prior knowledge about the labels. If we assume f to be quadratic (or higher order) in $\mathbf{y}$, as is done in structured output prediction, then it would not be possible to reduce the number of constraints from exponential to linear while still modelling dense, possibly negative, label correlations. Furthermore, learning label correlations on the training set by incorporating quadratic terms in $\mathbf{y}$ might not be fruitful as the test categories will have very different correlation statistics. Thus, by sacrificing some expressive power, we hope to build much more efficient algorithms that can still give improved prediction accuracy in the zero-shot learning scenario.

We make another standard assumption that the chosen loss function should decompose over the individual labels (Taskar et al. 2003). Hence, we require that

$$ \varDelta(\mathbf{y}_i,\mathbf{y}) = \sum_{l=1}^{L} \varDelta_l(\mathbf{y}_i, y_l) $$
(5)

where $y_l \in \{\pm1\}$ corresponds to label l in the set of labels represented by $\mathbf{y}$. For instance, the popular Hamming loss, amongst others, satisfies this condition. We define the Hamming loss $\varDelta_H(\mathbf{y}_i,\mathbf{y})$, between a ground truth label $\mathbf{y}_i$ and a prediction $\mathbf{y}$, as

$$ \varDelta_H(\mathbf{y}_i,\mathbf{y}) = \mathbf{y}_i^t(\mathbf{y}_i - \mathbf{y}) $$
(6)

which is twice the number of individual labels mispredicted in $\mathbf{y}$. Note that the Hamming loss can be decomposed over the labels as $\varDelta_H(\mathbf{y}_i,\mathbf{y}) = \sum_l 1 - y_l y_{il}$. Of course, for Δ to represent a sensible loss we also require that $\varDelta(\mathbf{y}_i,\mathbf{y}) \geq \varDelta(\mathbf{y}_i,\mathbf{y}_i) = 0$.
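As a quick sanity check of this decomposition, one can verify the identity numerically (a hypothetical snippet, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 10
y_i = rng.choice([-1, 1], size=L)   # ground truth labels
y = rng.choice([-1, 1], size=L)     # predicted labels

vector_form = y_i @ (y_i - y)        # Delta_H as defined in (6)
decomposed = np.sum(1 - y * y_i)     # per-label decomposition
mislabelled = np.sum(y != y_i)       # number of mispredicted labels

assert vector_form == decomposed == 2 * mislabelled
```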

Under these assumptions, (P1) can be expressed as

$$ P_1 \equiv \min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_{i=1}^{N} \max_{\mathbf{y} \in \{\pm1\}^L} \bigl[\varDelta(\mathbf{y}_i,\mathbf{y}) + \mathbf{w}^t\bigl(\boldsymbol{\phi}(\mathbf{x}_i) \otimes \mathbf{P}(\mathbf{y} - \mathbf{y}_i)\bigr)\bigr] $$
(7)

where the constraints have been moved into the objective and $\xi_i \geq 0$ eliminated by including $\mathbf{y} = \mathbf{y}_i$ in the maximisation. To simplify notation, we express the vector $\mathbf{w}$ as a D×L matrix $\mathbf{W}$ so that

$$ P_1 \equiv \min_{\mathbf{W}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{W}^t\mathbf{W}\bigr) + C\sum_{i=1}^{N} \max_{\mathbf{y} \in \{\pm1\}^L} \bigl[\varDelta(\mathbf{y}_i,\mathbf{y}) + \boldsymbol{\phi}(\mathbf{x}_i)^t\mathbf{W}\mathbf{P}(\mathbf{y} - \mathbf{y}_i)\bigr] $$
(8)
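The equivalence of (7) and (8) rests on the identity $\mathbf{w}^t(\boldsymbol{\phi}(\mathbf{x}) \otimes \mathbf{P}\mathbf{y}) = \boldsymbol{\phi}(\mathbf{x})^t\mathbf{W}\mathbf{P}\mathbf{y}$. The following numpy snippet (our illustration, assuming a row-major reshaping convention) verifies it:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 5, 3
w = rng.standard_normal(D * L)      # joint weight vector
W = w.reshape(D, L)                 # the same weights as a D x L matrix
phi_x = rng.standard_normal(D)      # feature map of a point
P = rng.standard_normal((L, L))     # label-correlation map (invertible a.s.)
y = rng.choice([-1.0, 1.0], size=L)

lhs = w @ np.kron(phi_x, P @ y)     # f(x, y) in the Kronecker form of (4)
rhs = phi_x @ W @ (P @ y)           # the matrix form used in (8)
assert np.isclose(lhs, rhs)
```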

Substituting $\mathbf{Z} = \mathbf{W}\mathbf{P}$, $\mathbf{R} = \mathbf{P}^t\mathbf{P} \succ 0$ and using the identity $\mathop{\mbox{Trace}}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathop{\mbox{Trace}}(\mathbf{C}\mathbf{A}\mathbf{B})$ results in

$$ P_1 \equiv \min_{\mathbf{Z}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{R}^{-1}\mathbf{Z}^t\mathbf{Z}\bigr) + C\sum_{i=1}^{N} \max_{\mathbf{y} \in \{\pm1\}^L} \sum_{l=1}^{L} \bigl[\varDelta_l(\mathbf{y}_i,y_l) + (y_l - y_{il})\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i)\bigr] $$
(9)

where $\mathbf{z}_l$ is the lth column of $\mathbf{Z}$. Note that the terms inside the maximisation break up independently over the L components of $\mathbf{y}$. It is therefore possible to interchange the maximisation and summation to get

$$ P_1 \equiv \min_{\mathbf{Z}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{R}^{-1}\mathbf{Z}^t\mathbf{Z}\bigr) + C\sum_{i=1}^{N} \sum_{l=1}^{L} \max_{y_l \in \{\pm1\}} \bigl[\varDelta_l(\mathbf{y}_i,y_l) + (y_l - y_{il})\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i)\bigr] $$
(10)

This leads to an equivalent primal formulation, (P2), expressed as the sum of L correlated problems, each having only N constraints, which is significantly easier to optimise.

$$ P_2 \equiv \min_{\mathbf{Z},\boldsymbol{\xi}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{R}^{-1}\mathbf{Z}^t\mathbf{Z}\bigr) + C\sum_{l=1}^{L}\sum_{i=1}^{N} \xi_{il} $$
(11)
$$ \mbox{s.t.}\quad 2 y_{il}\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i) \geq \varDelta_{il}^{-} - \varDelta_{il}^{+} - \xi_{il} \quad \forall i,l $$
(12)
$$ \xi_{il} \geq 0 \quad \forall i,l $$
(13)
$$ \mbox{where}\quad \varDelta_{il}^{+} = \varDelta_l(\mathbf{y}_i, y_{il}) \quad \mbox{and} \quad \varDelta_{il}^{-} = \varDelta_l(\mathbf{y}_i, -y_{il}) $$
(14)

Furthermore, a novel point $\mathbf{x}$ can be assigned the set of labels for which the entries of $\mbox{sign}(\mathbf{Z}^t\boldsymbol{\phi}(\mathbf{x}))$ are +1. This corresponds to a single evaluation of f taking time linear in the number of labels.
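In code, prediction is just a matrix-vector product followed by thresholding (a hypothetical sketch, with `Z` holding the learnt classifiers):

```python
import numpy as np

def predict_labels(Z, phi_x):
    """Assign labels to a novel point: entry l is +1 iff z_l^t phi(x) > 0.

    Z: (D, L) matrix of learnt classifiers; phi_x: (D,) feature vector.
    Cost is O(DL), i.e. linear in the number of labels."""
    scores = Z.T @ phi_x               # one score per label
    return np.where(scores > 0, 1, -1)
```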

The L classifiers in $\mathbf{Z}$ are not independent but correlated by $\mathbf{R}$—a positive definite matrix encoding our prior knowledge about label correlations. One might naturally think of learning $\mathbf{R}$ from training data. For instance, one could learn $\mathbf{R}$ directly or express $\mathbf{R}^{-1}$ as a linear combination of predefined positive definite matrices with learnt coefficients. Such formulations have been developed in the Multiple Kernel Learning literature and we could leverage some of the proposed MKL optimisation techniques (Vishwanathan et al. 2010). However, in the zero-shot learning scenario, learning $\mathbf{R}$ from training data is not helpful as the correlations between labels during training might be significantly different from those during testing.

Instead, we rely on the standard zero-shot learning assumption that the test category attributes are known a priori (Farhadi et al. 2009, 2010; Lampert et al. 2009; Palatucci et al. 2009). Furthermore, if the prior distribution of test categories is known, then $\mathbf{R}$ can be set to approximate the average pairwise test label correlation (see Sect. 7.2.1 for details).

Note that, in the zero-shot learning scenario, $\mathbf{R}$ can be dense, as almost all the attributes might be shared across categories and correlated with each other, and can also have negative entries representing negative label correlations. We propose to improve prediction accuracy on the novel test categories by encoding prior knowledge about their label correlations in $\mathbf{R}$.

Note that we deliberately chose not to include bias terms $\mathbf{b}$ in f even though the reduction from (P1) to (P2) would still have gone through and the resulting kernelised optimisation would have been more or less the same (see Sect. 7.1). However, we would then have had to regularise $\mathbf{b}$ and correlate it using $\mathbf{R}$. Otherwise $\mathbf{b}$ would have been a free parameter capable of undoing the effects of $\mathbf{R}$ on $\mathbf{Z}$. Therefore, rather than explicitly having $\mathbf{b}$ and regularising it, we implicitly simulate $\mathbf{b}$ by adding an extra dimension to the feature vector. This has the same effect while keeping the optimisation simple.
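The trick is standard; a sketch (with a hypothetical helper name):

```python
import numpy as np

def add_bias_dimension(X, value=1.0):
    """Append a constant feature so that the last row of Z acts as a
    regularised, R-correlated bias term."""
    return np.hstack([X, np.full((X.shape[0], 1), value)])
```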

We briefly discuss two special cases before turning to the dual and its optimisation.

3.1 The special case of 1-vs-All

If label correlation information is not included, i.e. $\mathbf{R} = \mathbf{I}$, then (P2) decouples into L completely independent sub-problems, each of which can be tackled in isolation. In particular, for the Hamming loss we get

$$ P_2 \equiv \sum_{l=1}^{L} S_l $$
(15)
$$ S_l \equiv \min_{\mathbf{z}_l,\boldsymbol{\xi}_l} \frac{1}{2}\mathbf{z}_l^t\mathbf{z}_l + 2C\sum_{i=1}^{N} \xi_{il} $$
(16)
$$ \mbox{s.t.}\quad y_{il}\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i) \geq 1 - \xi_{il} \quad \forall i $$
(17)
$$ \xi_{il} \geq 0 \quad \forall i $$
(18)

Thus, $S_l$ reduces to an independent binary classification sub-problem where the positive class contains all training points tagged with label l and the negative class contains all other points. This is exactly the strategy used in the popular and effective 1-vs-All method, and we can therefore now make explicit the assumptions underlying this technique. The only difference is that one should charge a misclassification penalty of 2C to be consistent with the original primal formulation.
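As an illustration of this special case, the following sketch trains the L independent classifiers by plain sub-gradient descent on the linear primal of $S_l$ with the 2C penalty (our simplified stand-in, not the solver developed in Sect. 5):

```python
import numpy as np

def train_one_vs_all(X, Y, C=1.0, lr=1e-3, epochs=200):
    """Independently minimise, for each label l,
    0.5*||z_l||^2 + 2*C*sum_i max(0, 1 - Y[i,l] * z_l^t x_i)
    by sub-gradient descent. X: (N, D); Y: (N, L) with +/-1 entries."""
    D = X.shape[1]
    L = Y.shape[1]
    Z = np.zeros((D, L))
    for l in range(L):
        z = np.zeros(D)
        for _ in range(epochs):
            margins = Y[:, l] * (X @ z)
            viol = margins < 1                           # active hinges
            grad = z - 2.0 * C * (Y[viol, l] @ X[viol])  # regulariser + hinge
            z -= lr * grad
        Z[:, l] = z
    return Z
```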

3.2 Relating the kernel to the loss

In general, the kernel is chosen so as to ensure that the training data points become well separated in the feature space. This is true for both $K_x$, the kernel on $\mathbf{x}$, as well as $K_y$, the kernel on $\mathbf{y}$. However, one might also take the view that since the loss Δ induces a measure of dissimilarity in the label space, it must be related to the kernel on $\mathbf{y}$, which is a measure of similarity in the label space. This heavily constrains the choice of $K_y$ and therefore the label space mapping $\boldsymbol{\psi}$. For instance, if a linear relationship is assumed, we might choose $\varDelta(\mathbf{y}_i,\mathbf{y}) = K_y(\mathbf{y}_i,\mathbf{y}_i) - K_y(\mathbf{y}_i,\mathbf{y})$. Note that this allows Δ to be asymmetric even though $K_y$ is not and ensures the linearity of Δ if $\boldsymbol{\psi}$, the label space mapping, is linear.

In this case, label correlation information should be encoded directly into the loss. For example, the Hamming loss could be transformed to $\varDelta_H(\mathbf{y}_i,\mathbf{y}) = \mathbf{y}_i^t\mathbf{R}(\mathbf{y}_i - \mathbf{y})$. $\mathbf{R}$ is the same matrix as before except that the interpretation now is that the entries of $\mathbf{R}$ encode label correlations by specifying the penalties to be charged if a label in the set is misclassified. Of course, for Δ to be a valid loss, not only must $\mathbf{R}$ be positive definite as before but it must now also be diagonally dominant. As such, it can only encode "weak" correlations. Given the choice of Δ and the linear relationship with $K_y$, the label space mapping gets fixed to $\boldsymbol{\psi}(\mathbf{y}) = \mathbf{P}\mathbf{y}$ where $\mathbf{R} = \mathbf{P}^t\mathbf{P}$.
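A candidate $\mathbf{R}$ can be screened for these two conditions as follows (our illustration):

```python
import numpy as np

def is_valid_loss_correlation(R, tol=1e-10):
    """Check the two conditions on R for the transformed Hamming loss:
    positive definiteness and diagonal dominance."""
    R = np.asarray(R, dtype=float)
    symmetric = np.allclose(R, R.T)
    pos_def = symmetric and np.all(np.linalg.eigvalsh(R) > tol)
    # |R_ll| >= sum over k != l of |R_lk|, for every row l
    off_diag = np.sum(np.abs(R), axis=1) - np.abs(np.diag(R))
    diag_dom = np.all(np.abs(np.diag(R)) >= off_diag)
    return pos_def and diag_dom
```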

Under these assumptions, one can still go from (P1) to (P2) using the same steps as before. The main differences are that $\mathbf{R}$ is now more restricted and that $\varDelta_l(\mathbf{y}_i, y_l) = (1/L)\,\mathbf{y}_i^t\mathbf{R}\mathbf{y}_i - y_l\,\mathbf{y}_i^t\mathbf{R}_l$ where $\mathbf{R}_l$ is the lth column of $\mathbf{R}$. While this result is theoretically interesting, we do not explore it further in this paper.

4 The M3L dual formulation

The dual of (P2) has similar properties in that it can be viewed as the maximisation of L related problems which decouple into independent binary SVM classification problems when $\mathbf{R} = \mathbf{I}$. The dual is easily derived if we rewrite (P2) in vector notation. Defining

$$ \mathbf{Y}_l = \mathop{\mbox{diag}}(y_{1l},\ldots,y_{Nl}) $$
(19)
$$ \boldsymbol{\varDelta}_l^{\pm} = \bigl(\varDelta_{1l}^{\pm},\ldots,\varDelta_{Nl}^{\pm}\bigr)^t $$
(20)
$$ (\mathbf{K}_{\mathbf{x}})_{ij} = \boldsymbol{\phi}(\mathbf{x}_i)^t\boldsymbol{\phi}(\mathbf{x}_j) $$
(21)

we get the following Lagrangian

$$ \mathcal{L} = \frac{1}{2}\sum_{l,k=1}^{L} R^{-1}_{lk}\,\mathbf{z}_l^t\mathbf{z}_k + C\sum_{i,l}\xi_{il} - \sum_{i,l}\alpha_{il}\bigl(2y_{il}\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i) - \varDelta_{il}^{-} + \varDelta_{il}^{+} + \xi_{il}\bigr) - \sum_{i,l}\beta_{il}\xi_{il} $$
(22)

where $\alpha_{il}, \beta_{il} \geq 0$ are the dual variables, with the optimality conditions being

$$ \mathbf{z}_l = 2\sum_{k=1}^{L}\sum_{i=1}^{N} R_{lk}\,\alpha_{ik}\,y_{ik}\,\boldsymbol{\phi}(\mathbf{x}_i) $$
(23)
$$ C - \alpha_{il} - \beta_{il} = 0 \;\Rightarrow\; 0 \leq \alpha_{il} \leq C $$
(24)

Substituting these back into the Lagrangian, and stacking the dual variables for each label into vectors $\boldsymbol{\alpha}_l = (\alpha_{1l},\ldots,\alpha_{Nl})^t$, leads to the following dual

$$ D_2 = \max_{\mathbf{0} \leq \boldsymbol{\alpha} \leq C\mathbf{1}} \sum_{l=1}^{L} \boldsymbol{\alpha}_l^t\bigl(\boldsymbol{\varDelta}_l^{-} - \boldsymbol{\varDelta}_l^{+}\bigr) - 2\sum_{l=1}^{L}\sum_{k=1}^{L} R_{lk}\,\boldsymbol{\alpha}_l^t\mathbf{Y}_l\mathbf{K}_{\mathbf{x}}\mathbf{Y}_k\boldsymbol{\alpha}_k $$
(25)

Henceforth we will drop the subscript on the kernel matrix and write $\mathbf{K}_{\mathbf{x}}$ as $\mathbf{K}$.
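For reference, (25) and the gradient (29) derived from it can be evaluated vectorially in a few lines (our illustration; `alpha`, `Y` and the loss arrays are N×L, `K` is the symmetric N×N kernel matrix):

```python
import numpy as np

def dual_objective(alpha, Delta_minus, Delta_plus, Y, K, R):
    """Evaluate D_2 of (25)."""
    lin = np.sum(alpha * (Delta_minus - Delta_plus))
    YA = Y * alpha                          # entries y_il * alpha_il
    quad = np.sum(R * (YA.T @ K @ YA))      # sum_lk R_lk a_l^t Y_l K Y_k a_k
    return lin - 2.0 * quad

def dual_gradient(alpha, Delta_minus, Delta_plus, Y, K, R):
    """Evaluate g_pl of (29) for all (p, l) at once; uses K = K^t."""
    YA = Y * alpha
    return (Delta_minus - Delta_plus) - 4.0 * Y * (K @ YA @ R)
```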

5 Optimisation

The M3L dual is similar to the standard SVM dual. Existing optimisation techniques can therefore be brought to bear. However, the dense structure of $\mathbf{R}$ couples all NL dual variables and simply porting existing solutions leads to very inefficient code. We show that, with some bookkeeping, we can easily go from an $O(L^2)$ algorithm to an $O(L)$ algorithm. Furthermore, by re-utilising the kernel cache, our algorithms can be very efficient even for non-linear problems. We treat the kernelised and linear M3L cases separately.

5.1 Kernelised M3L

The dual (D2) is a convex quadratic programme with very simple box constraints. We can therefore use co-ordinate ascent algorithms (Platt 1999; Fan et al. 2005; Lin et al. 2009) to maximise the dual. The algorithms start by picking a feasible point—typically $\boldsymbol{\alpha} = \mathbf{0}$. Next, two variables are selected and optimised analytically. This step is repeated until the projected gradient magnitude falls below a threshold, at which point the algorithm can be shown to have converged to the global optimum (Lin et al. 2009). The three key components are therefore: (a) reduced variable optimisation; (b) working set selection; and (c) the stopping criterion and kernel caching. We now discuss each of these components. The pseudo-code of the algorithm and proof of convergence are given in the Appendix.

5.1.1 Reduced variable optimisation

If all but two of the dual variables are fixed, say $\alpha_{pl}$ and $\alpha_{ql}$ along label l, then the dual optimisation problem reduces to

$$ D_{2pql} \equiv \max_{\delta_{pl},\delta_{ql}} g_{pl}\delta_{pl} + g_{ql}\delta_{ql} - 2R_{ll}\bigl(K_{pp}\delta_{pl}^2 + 2y_{pl}y_{ql}K_{pq}\delta_{pl}\delta_{ql} + K_{qq}\delta_{ql}^2\bigr) $$
(26)
$$ \mbox{s.t.}\quad 0 \leq \alpha_{pl}^{\mathrm{old}} + \delta_{pl} \leq C $$
(27)
$$ 0 \leq \alpha_{ql}^{\mathrm{old}} + \delta_{ql} \leq C $$
(28)

where \(\delta_{pl}=\alpha_{pl}^{\mathrm{new}} - \alpha_{pl}^{\mathrm{old}}\) and \(\delta_{ql}=\alpha_{ql}^{\mathrm{new}} - \alpha_{ql}^{\mathrm{old}}\) and

$$ g_{pl} = \nabla_{\alpha_{pl}} D_{2} = \varDelta_{pl}^{-} - \varDelta_{pl}^+ - 4 \sum _{i=1}^N\sum_{k=1}^L R_{kl}K_{ip}y_{ik}y_{pl} \alpha_{ik} $$
(29)

Note that $D_{2pql}$ has a quadratic objective in two variables which can be maximised analytically due to the simple box constraints. We do not give the expressions for $\alpha^{\mathrm{new}}_{pl}$ and $\alpha^{\mathrm{new}}_{ql}$ which maximise $D_{2pql}$, as many special cases arise when the variables are at their bounds, but they can be found in Algorithm 2 of the pseudo-code in Appendix A.
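As a simple stand-in for that case analysis, the box-constrained two-variable problem (26)-(28) can also be maximised by alternating exact one-dimensional updates with clipping (our sketch, assuming $R_{ll} > 0$ and a positive definite kernel so that the objective is strictly concave):

```python
def two_var_update(alpha_p, alpha_q, g_p, g_q, K_pp, K_qq, K_pq,
                   y_p, y_q, R_ll, C, n_iter=50):
    """Maximise D_2pql over the box [0, C]^2 by alternating exact
    one-dimensional maximisations with clipping."""
    s = y_p * y_q
    d_p, d_q = 0.0, 0.0
    for _ in range(n_iter):
        # Exact maximiser in d_p for fixed d_q, then clip to the box.
        d_p = (g_p / (4.0 * R_ll) - s * K_pq * d_q) / K_pp
        d_p = min(max(d_p, -alpha_p), C - alpha_p)
        # Exact maximiser in d_q for fixed d_p, then clip to the box.
        d_q = (g_q / (4.0 * R_ll) - s * K_pq * d_p) / K_qq
        d_q = min(max(d_q, -alpha_q), C - alpha_q)
    return alpha_p + d_p, alpha_q + d_q
```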

5.1.2 Working set selection

Since the M3L formulation does not have a bias term, it can be optimised by picking a single variable at each iteration rather than a pair. This leads to a low cost per iteration but a large number of iterations. Selecting two variables per iteration increases the cost per iteration but significantly reduces the number of iterations, since second order information can be incorporated into the variable selection policy.

If we were to choose two variables to optimise along the same label l, say \(\alpha_{pl}\) and \(\alpha_{ql}\), then the maximum change that we could effect in the dual is given by

(30)

In terms of working set selection, it would be ideal to choose the two variables \(\alpha_{pl}\) and \(\alpha_{ql}\) which maximise the increase in the dual objective. However, this turns out to be too expensive in practice. A good approximation is to choose the first point \(\alpha_{pl}\) to be the one with the maximum projected gradient magnitude. The projected gradient is defined as

$$ \tilde{g}_{pl} = \left \{ \begin{array}{l@{\quad}l} g_{pl} & \text{if }\alpha_{pl} \in(0,C) \\ \min(0, g_{pl}) & \text{if }\alpha_{pl}=C \\ \max(0, g_{pl}) & \text{if }\alpha_{pl}=0 \end{array} \right . $$
(31)

and hence the first point is chosen as

$$ \bigl(p^*,l^*\bigr)=\mathop{\mbox{argmax}}_{p,l} |\tilde{g}_{pl}| $$
(32)

Having chosen the first point, the second point is chosen to be the one that maximises (30).

Working set selection can be made efficient by maintaining the set of gradients g. Every time a variable, say \(\alpha_{pl}\), is changed, the gradients need to be updated as

$$ g^{\mathrm{new}}_{jk} = g^{\mathrm{old}}_{jk} - 4 y_{pl} y_{jk} R_{kl}K_{pj}\bigl( \alpha_{pl}^{\mathrm{new}} - \alpha_{pl}^{\mathrm{old}}\bigr) $$
(33)

Note that, because of the dense structure of \(\mathbf{R}\), all \(NL\) gradients have to be updated even if a single variable is changed. Since there are \(NL\) variables, each of which presumably has to be updated at least once, we end up with an algorithm that takes time at least \(O(N^2L^2)\).

The algorithm can be made much more efficient if, with some bookkeeping, not all gradients have to be updated every time a variable is changed. For instance, if we fix a label l and modify L variables along the chosen label, then the gradient update equations can be written as

$$ u^{\mathrm{new}}_{j} = u^{\mathrm{old}}_{j} + K_{pj} y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr) $$
(34)
$$ g^{\mathrm{new}}_{jl} = g^{\mathrm{old}}_{jl} - 4 R_{ll} y_{jl} y_{pl} K_{pj} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr) $$
(35)
$$ g^{\mathrm{new}}_{jk} = g^{\mathrm{old}}_{jk} - 4 R_{kl} y_{jk} u_{j} \quad (k \neq l) $$
(36)

As long as we are changing variables along a particular label, the gradient updates can be accumulated in u, and only when we switch to a new label do all the gradients have to be updated. We therefore end up doing \(O(NL)\) work after changing L variables, resulting in an algorithm which takes time \(O(N^2L)\) rather than \(O(N^2L^2)\).
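The following sketch implements this bookkeeping in Python/NumPy. The class and method names are illustrative rather than taken from the paper, and it assumes the update equations (34)–(36) as given above.

```python
import numpy as np

class DelayedGradients:
    """Bookkeeping sketch for Sect. 5.1.2: while variables change along a
    fixed label l, only the N gradients of label l are kept current; the
    cross-label corrections are accumulated in u and applied in a single
    O(NL) flush when the solver moves on to a new label.
    """
    def __init__(self, g, K, Y, R):
        self.g, self.K, self.Y, self.R = g, K, Y, R
        self.u = np.zeros(g.shape[0])   # per-point accumulator, cf. (34)
        self.label = None               # label currently being swept

    def update(self, p, l, delta):
        """Record that alpha_pl changed by delta."""
        assert self.label in (None, l), "flush() before switching labels"
        self.label = l
        contrib = self.K[:, p] * (self.Y[p, l] * delta)
        self.u += contrib               # (34): accumulate
        # (35): keep label l's own gradients exact for variable selection.
        self.g[:, l] -= 4.0 * self.R[l, l] * self.Y[:, l] * contrib

    def flush(self):
        """Apply the delayed updates (36) before switching labels."""
        if self.label is None:
            return
        l = self.label
        for k in range(self.g.shape[1]):
            if k != l:
                self.g[:, k] -= 4.0 * self.R[k, l] * self.Y[:, k] * self.u
        self.u[:] = 0.0
        self.label = None
```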

5.1.3 Stopping criterion and kernel caching

We use the standard stopping criterion that the projected gradient magnitude for all NL dual variables should be less than a predetermined threshold.

We employ a standard Least Recently Used (LRU) kernel cache strategy implemented as a circular queue. Since we are optimising over all labels jointly, the kernel cache gets effectively re-utilised, particularly as compared to independent methods that optimise one label at a time. In the extreme case, independent methods will have to rebuild the cache for each label which can slow them down significantly.
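A minimal sketch of such a cache follows, assuming kernel rows are computed on demand. The paper implements the LRU policy as a circular queue; this sketch uses Python's OrderedDict for brevity, and the class name and RBF helper are illustrative only.

```python
from collections import OrderedDict
import numpy as np

class LRUKernelCache:
    """Sketch of an LRU cache of kernel matrix columns (Sect. 5.1.3)."""
    def __init__(self, X, kernel, capacity):
        self.X, self.kernel, self.capacity = X, kernel, capacity
        self.cols = OrderedDict()          # point index -> cached column

    def col(self, p):
        if p in self.cols:
            self.cols.move_to_end(p)       # mark as most recently used
            return self.cols[p]
        if len(self.cols) >= self.capacity:
            self.cols.popitem(last=False)  # evict least recently used
        c = self.kernel(self.X, self.X[p]) # compute K[:, p] on demand
        self.cols[p] = c
        return c

# Example usage with an RBF kernel. Because all labels are optimised
# jointly, the same cached columns serve every label, unlike 1-vs-All.
rbf = lambda X, x: np.exp(-0.5 * np.sum((X - x) ** 2, axis=1))
X = np.random.randn(1000, 50)
cache = LRUKernelCache(X, rbf, capacity=512)
column = cache.col(3)                      # computes and caches K[:, 3]
```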

6 Linear M3L

We build on top of the stochastic dual coordinate ascent with shrinkage algorithm of Hsieh et al. (2008). At each iteration, a single dual variable is chosen uniformly at random from the active set, and optimised analytically. The variable update equation is given by

$$ \alpha^{\mathrm{new}}_{pl} = \min\bigl(\max\bigl(\alpha^{\mathrm{old}}_{pl} + g_{pl} / \bigl(4 R_{ll}\, \mathbf{x}_{p}^{t}\mathbf{x}_{p}\bigr),\ 0\bigr),\ C\bigr) $$
(37)
$$ g_{pl} = \varDelta_{pl}^{-} - \varDelta_{pl}^{+} - 2 y_{pl}\, \mathbf{z}_{l}^{t}\mathbf{x}_{p} $$
(38)
$$ \mathbf{z}_{l} = 2 \sum_{k=1}^{L} R_{lk} \sum_{i=1}^{N} y_{ik} \alpha_{ik} \mathbf{x}_{i} $$
(39)

As can be seen, the dual variable update can be computed efficiently in terms of the primal variables \(\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_L]\), which must then be maintained every time a dual variable is modified. The update to \(\mathbf{Z}\) when \(\alpha_{pl}\) is modified is

$$ \mathbf{z}^{\mathrm{new}}_{k} = \mathbf{z}^{\mathrm{old}}_{k} + 2 R_{kl} y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr)\mathbf{x}_{p}, \quad k = 1, \ldots, L $$
(40)
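A short sketch of this maintenance step follows. The factor-of-two conventions match the equations above; the function names are illustrative, and the gradient helper assumes \(\Delta_{pl}^- - \Delta_{pl}^+ = 1\) as in the earlier sketches.

```python
import numpy as np

def update_primal(Z, R, X, Y, p, l, delta):
    """Apply the primal update (40) after alpha_pl changes by delta.

    Z: (L, D) array whose rows are the maintained primal vectors z_k.
    Because R is dense, a single dual update touches every row of Z.
    """
    Z += 2.0 * np.outer(R[:, l], Y[p, l] * delta * X[p])
    return Z

def dual_gradient(Z, X, Y, p, l, margin=1.0):
    """Gradient (38) at alpha_pl, with Delta^- - Delta^+ taken to be
    `margin` (an assumption of this sketch, not the paper's notation)."""
    return margin - 2.0 * Y[p, l] * (Z[l] @ X[p])
```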

Thus all the primal variables \(\mathbf{Z}\) need to be updated every time a single dual variable is modified. Again, as in the kernelised case, the algorithm can be made much more efficient by fixing a label l and modifying L dual variables along it while delaying the updates as

$$ \mathbf{u}^{\mathrm{new}} = \mathbf{u}^{\mathrm{old}} + 2 y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr)\mathbf{x}_{p} $$
(41)
$$ \mathbf{z}^{\mathrm{new}}_{l} = \mathbf{z}^{\mathrm{old}}_{l} + 2 R_{ll} y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr)\mathbf{x}_{p} $$
(42)
$$ \mathbf{z}^{\mathrm{new}}_{k} = \mathbf{z}^{\mathrm{old}}_{k} + R_{kl}\,\mathbf{u} \quad (k \neq l) $$
(43)

In practice, we observed that performing L stochastic updates along a chosen label right from the start could slow down convergence in some cases. We therefore initially use the more expensive strategy of choosing dual variables uniformly at random, and only once the projected gradient magnitudes fall below a pre-specified threshold do we switch to optimising L dual variables along a particular label before picking a new label uniformly at random.

The active set is initialised to contain all the training data points. Points at bound having gradient magnitude outside the range of currently maintained extremal gradients are discarded from the active set. Extremal gradients are re-estimated at the end of each pass and if they are too close to each other the active set is expanded to include all training points once again.

A straightforward implementation with globally maintained extremal gradients leads to inefficient code: if the classifier for a particular label has not yet converged, it can force a large active set even though most points would not be considered by the other classifiers. We therefore maintain separate active sets for each label but couple the maintained extremal gradients via \(\mathbf{R}\). The extremal gradients \(lb_l\) and \(ub_l\), for label l, are initially set to −∞ and +∞ respectively. After each pass through the active set, they are updated as

(44)
(45)

where \(A_k\) is the set of indices in the active set of label k. This choice was empirically found to decrease training time.

Once all the projected gradients in all the active sets have magnitude less than a threshold τ, we expand the active sets to include all the variables, and re-estimate the projected gradients. The algorithm stops when all projected gradients have magnitude less than τ.
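The following sketch shows the generic shrinking step in the spirit of Hsieh et al. (2008). It covers only the basic rule for removing bound variables from a single label's active set; the paper's coupled update of \(lb_l\) and \(ub_l\) across labels via \(\mathbf{R}\), (44)–(45), is not reproduced here.

```python
def shrink_pass(alpha, g, C, lb, ub, active):
    """One shrinking sweep over a single label's active set.

    alpha, g: per-point dual values and gradients for this label;
    lb, ub: the maintained extremal gradients; active: the current list
    of active point indices. A point stuck at a bound whose gradient
    lies outside [lb, ub] is removed from the active set.
    """
    keep = []
    for i in active:
        stuck_low = alpha[i] <= 0.0 and g[i] < lb   # cannot increase
        stuck_high = alpha[i] >= C and g[i] > ub    # cannot decrease
        if not (stuck_low or stuck_high):
            keep.append(i)
    return keep

# After a full pass, if lb and ub drift too close together, the active
# set is expanded back to all points and the bounds are re-estimated.
```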

7 Experiments

In this section we first compare the performance of our optimisation algorithms and then evaluate how prediction accuracy can be improved by incorporating prior knowledge about label correlations.

7.1 Optimisation experiments

The cutting plane algorithm in SVMStruct (Tsochantaridis et al. 2005) is a general purpose algorithm that can be used to optimise the original M3L formulation \((P_1)\). In each iteration, the approximately worst violating constraint is added to the active set, and the algorithm is guaranteed to terminate within a number of iterations independent of the size of the output space. The algorithm has a user-defined parameter ϵ controlling the amount of error tolerated in finding the worst violating constraint.

We compared the SVMStruct algorithm to our M3L implementation on an Intel Xeon 2.67 GHz machine with 8 GB RAM. It was observed that even on medium scale problems with linear kernels, our M3L implementation was nearly a hundred times faster than SVMStruct. For example, on the Media Mill data set (Snoek et al. 2006) with a hundred and one labels and ten, fifteen and twenty thousand training points, our M3L code took 19, 37 and 55 seconds while SVMStruct took 1995, 2998 and 7198 seconds respectively. On other data sets SVMStruct ran out of RAM or failed to converge in a reasonable amount of time (even after tuning ϵ). This demonstrates that explicitly reducing the number of constraints from exponential to linear and implementing a specialised solver can lead to a dramatic reduction in training time.

As the next best thing, we benchmark our performance against the 1-vs-All method, even though it cannot incorporate prior label correlations. In the linear case, we compare our linear M3L implementation to 1-vs-All trained by running LibLinear (Fan et al. 2008) and LibSVM (Chang and Lin 2001) independently over each label. For the non-linear case, we compare our kernelised M3L implementation to 1-vs-All trained using LibSVM. In each case, we set \(\mathbf{R} = \mathbf{I}\) so that M3L reaches exactly the same solution as LibSVM and LibLinear. We also avoided repeated disk I/O by reading the data into RAM and using LibLinear and LibSVM's APIs.

Table 1 lists the variation in training time with the number of training examples on the Animals with Attributes (Lampert et al. 2009), Media Mill (Snoek et al. 2006), Siam (SIAM 2007) and RCV1 (Lewis et al. 2004) data sets. The training times of linear M3L (LM3L) and LibLinear are comparable, with LibLinear being slightly faster. The training times of kernelised M3L (KM3L) are significantly lower than those of LibSVM, with KM3L sometimes being as much as 30 times faster. This is primarily because KM3L can efficiently leverage the kernel cache across all labels, while LibSVM has to build the cache from scratch for each label. Furthermore, leaving aside caching issues, it would appear that by optimising over all variables jointly, M3L reaches the vicinity of the global optimum much more quickly than 1-vs-All. Figure 2 plots dual progress against the number of iterations for all four data sets with ten thousand training points. As can be seen, kernelised M3L gets to within the vicinity of the global optimum much faster than 1-vs-All implemented using LibSVM. Figure 3 shows similar plots with respect to time; there the difference is even more significant due to kernel caching effects. In conclusion, even though M3L generalises 1-vs-All, its training time can be comparable and sometimes even significantly lower.

Fig. 2
figure 2

Dual progress versus number of iterations for the kernelised M3L algorithm and 1-vs-All implemented using LibSVM, for an RBF kernel and ten thousand training points. M3L appears to reach the vicinity of the global optimum much more quickly than 1-vs-All. The results are independent of kernel caching effects

Fig. 3
figure 3

Dual progress versus normalised time for the kernelised M3L algorithm and 1-vs-All implemented using LibSVM for an RBF kernel and ten thousand training points. The difference between M3L and 1-vs-All is even starker than in Fig. 2 due to kernel caching effects

Table 1 Comparison of training times for the linear M3L (LM3L) and kernelised M3L (KM3L) optimisation algorithms with 1-vs-All techniques implemented using LibLinear and LibSVM. Each data set has N training points, D features and L labels. See text for details

Finally, to demonstrate that our code scales to large problems, we train linear M3L on RCV1 with 781,265 points, 47,236 dimensional sparse features and 103 labels. Table 2 charts dual progress and train and test error over time. As can be seen, the model is nearly fully trained in under six minutes and converges in eighteen minutes.

Table 2 Linear M3L training on RCV1 with 781,265 points, 47,236 dimensional sparse features and 103 labels

7.2 Incorporating prior knowledge for zero-shot learning

In this section, we investigate whether the proposed M3L formulation can improve label prediction accuracy in a zero-shot learning scenario. Zero-shot learning has two major components, as mentioned earlier. The first deals with generating an intermediate level representation, generally based on attributes, for each data point. The second concerns how to map test points in the intermediate representation to points representing novel categories. Our focus is on the former: the more accurate prediction of multiple, intermediate attributes (labels) when their correlation statistics on the training and test sets differ significantly.

7.2.1 Animals with attributes

The Animals with Attributes data set (Lampert et al. 2009) has forty training animal categories, such as Dalmatian, Skunk, Tiger, Giraffe, Dolphin, etc. and the following ten disjoint test animal categories: Humpback Whale, Leopard, Chimpanzee, Hippopotamus, Raccoon, Persian Cat, Rat, Seal, Pig and Giant Panda. All categories share a common set of 85 attributes such as has yellow, has spots, is hairless, is big, has flippers, has buckteeth, etc. The attributes are densely correlated and form a fully connected graph. Each image in the database contains a dominant animal and is labelled with its 85 attributes. There are 24,292 training images and 6,180 test images. Some example images are shown in Fig. 4. We use 252 dimensional PHOG features that are provided by the authors. M3L training times for this data set are reported in Table 1(a).

Fig. 4
figure 4

Sample training (top) and test (bottom) images from the Animals with Attributes data set

We start by visualising the influence of \(\mathbf{R}\). We randomly sample 200 points from the training set and discard all but two of the attributes, "has black" and "is weak". These two attributes were selected as they are very weakly correlated on our training set, with a correlation coefficient of 0.2, but have a strong negative correlation of −0.76 on the test animals (Leopards, Giant Pandas, Humpback Whales and Chimpanzees all have black but are not weak). Figure 5 plots the Hamming loss on the test set as we set \(\mathbf{R} = \left[\begin{smallmatrix}1 & r\\ r & 1\end{smallmatrix}\right]\), plug it into the M3L formulation, and vary r from −1 to +1. Learning independent classifiers for the two attributes (r=0) can lead to a Hamming loss of 25 % because of the mismatch between training and test sets. This can be made even worse by choosing, or learning via structured output prediction techniques, a prior that incorrectly forces the two labels to be positively correlated. However, if our priors are generally correct, then negatively correlating the classifiers lowers prediction error.

Fig. 5
figure 5

Test Hamming loss versus classifier correlation

We now evaluate performance quantitatively on the same training set but with all 85 labels. We stress that in the zero-shot learning scenario no training samples from any of the test categories are provided. As is commonly assumed (Farhadi et al. 2009, 2010; Lampert et al. 2009; Palatucci et al. 2009), we only have access to \(\mathbf{y}_{c}\), the set of attributes for a given test category c. Furthermore, we require, as additional information, the prior distribution over test categories p(c). For the M3L formulation we set \(\mathbf{R} = \sum_{c=1}^{10} p(c)\mathbf{y}_{c}\mathbf{y}_{c}^{t}\). Under this setup, learning independent classifiers using 1-vs-All yields a Hamming loss of 29.38 %. The Hamming loss for M3L, with this specific choice of \(\mathbf{R}\), is 26.35 %. This decrease in error is very significant given that 1-vs-All, trained on all 24,292 training points, only manages to reduce the error to 28.64 %. Thus M3L, with extra knowledge in the form of just the test category distribution, can dramatically reduce test error. The results also compare favourably to other independent methods such as BoostTexter (Schapire and Singer 2000) (30.28 %), power set multi-class classification (32.70 %), 5 nearest neighbours (31.79 %), regression (Hsu et al. 2009) (29.38 %) and ranking (Crammer and Singer 2003) (34.84 %).
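To make the construction of this prior concrete, here is a small sketch; the function name and the toy attribute signatures are purely illustrative, and only the formula \(\mathbf{R} = \sum_c p(c)\mathbf{y}_c\mathbf{y}_c^t\) comes from the text above.

```python
import numpy as np

def correlation_prior(Y_cats, p):
    """Build R = sum_c p(c) * y_c y_c^t from test category signatures.

    Y_cats: (C, L) array of {-1, +1} attribute signatures y_c;
    p: (C,) prior distribution over the C test categories.
    Returns the (L, L) prior correlation matrix R.
    """
    return (Y_cats * p[:, None]).T @ Y_cats

# Toy usage with three hypothetical categories and two attributes,
# e.g. "has black" and "is weak":
y = np.array([[ 1, -1],
              [ 1, -1],
              [-1,  1]])
R = correlation_prior(y, np.array([0.4, 0.4, 0.2]))  # (2, 2) prior
```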

7.2.2 Benchmark data sets

We also present results on the fMRI-Words zero-shot learning data set of Mitchell et al. (2008). The data set has 60 categories, of which we use 48 for training and 12 for testing. Each category is described by 25 real valued attributes which we convert to binary labels by thresholding against the median attribute value. Prior information about which attributes occur in which novel test categories is provided in the form of a knowledge base. The experimental protocol is kept identical to the one used for Animals with Attributes. \(\mathbf{R}\) is set to \(\sum_{c=1}^{12} p(c)\mathbf{y}_{c}\mathbf{y}_{c}^{t}\), where the sum ranges over the 12 test categories, \(\mathbf{y}_{c}\) comes from the knowledge base and p(c) is required as additional prior information. We use 400 points for training and 648 points for testing. The test Hamming loss for M3L and various independent methods is given in Table 3. The M3L results are much better than 1-vs-All, with the test Hamming loss being reduced by nearly 7 %. This is noteworthy since even if 1-vs-All were trained on the full training set of 2592 points, it would decrease the Hamming loss by just over 5 %, to 48.79 %.

Table 3 Test Hamming loss (%) on benchmark data sets

Table 3 also presents results on some other data sets. Unfortunately, most of them were not designed for zero-shot learning. Siam (SIAM 2007), Media Mill (Snoek et al. 2006), RCV1 (Lewis et al. 2004) and Yeast (Elisseeff and Weston 2001) are traditional multi-label data sets with matching training and test set statistics. The a-PASCAL+a-Yahoo (Farhadi et al. 2009) data set has different training and test categories but does not include prior information about which attributes are relevant to which test categories. Thus, for all these data sets, we sample the original training set to create a new training subset whose label correlations differ from those of the provided test set. The remainder of the original training points are used only to estimate the \(\mathbf{R}\) matrix. As Table 3 indicates, by incorporating prior knowledge M3L can do better than all the other methods, which assume independence.

8 Conclusions

We developed the M3L formulation for learning a max-margin multi-label classifier with prior knowledge about densely correlated labels. We showed that the number of constraints could be reduced from exponential to linear and, in the process, generalised 1-vs-All multi-label classification. We also developed efficient optimisation algorithms that were orders of magnitude faster than the standard cutting plane method. Our kernelised algorithm was significantly faster than even the 1-vs-All technique implemented using LibSVM and hence our code, available from Hariharan et al. (2010a), can also be used for efficient independent learning. Finally, we demonstrated on multiple data sets that incorporating prior knowledge using M3L could improve prediction accuracy over independent methods. In particular, in zero-shot learning scenarios, M3L trained on 200 points could outperform 1-vs-All trained on nearly 25,000 points on the Animals with Attributes data set and the M3L test Hamming loss on the fMRI-Words data set was nearly 7 % lower than that of 1-vs-All.