1 Introduction

Our objective, in this paper, is to develop efficient algorithms for max-margin, multi-label classification. Given a set of pre-specified labels and a data point, (binary) multi-label classification deals with the problem of predicting the subset of labels most relevant to the data point. This is in contrast to multi-class classification, where one has to predict just the single, most probable label. For instance, rather than simply saying that Fig. 1 is an image of a Babirusa, we might prefer to describe it as containing a brown, hairless, herbivorous, medium-sized quadruped with tusks growing out of its snout.

Fig. 1 Having never seen a Babirusa before, we can still describe it as a brown, hairless, herbivorous, medium-sized quadruped with tusks growing out of its snout

There are many advantages in generating such a description, and multi-label classification has found applications in areas ranging from computer vision to natural language processing to bio-informatics. We are specifically interested in the problem of image search on the web and in personal photo collections. In such applications, it is very difficult to get training data for every possible object in the world that someone might conceivably search for. In fact, we might not have any training images whatsoever for many object categories, such as the obscure Babirusa. Nevertheless, we cannot preclude the possibility of someone searching for one of these objects. A similar problem is encountered when trying to search videos on the basis of human body pose and motion, and in many other applications such as neural activity decoding (Palatucci et al. 2009).

One way of recognising object instances from previously unseen test categories (the zero-shot learning problem) is by leveraging knowledge about common attributes and shared parts. For instance, given adequately labelled training data, one can learn classifiers for the attributes occurring in the training object categories. These classifiers can then be used to recognise the same attributes in object instances from the novel test categories. Recognition can then proceed on the basis of these learnt attributes (Farhadi et al. 2009, 2010; Lampert et al. 2009).

The learning problem can therefore be posed as multi-label classification where there is a significant difference between attribute (label) correlations in the training categories and the previously unseen test categories. What adds to the complexity of the problem is the fact that these attributes are often densely correlated as they are shared across most categories. This makes optimising over the exponentially large output space, given by the power set of all labels, very difficult. The problem is acute not just during prediction but also during training as the number of training images might grow to be quite large over time in some applications.

Previously proposed solutions to the multi-label problem take one of two approaches—neither of which can be applied straightforwardly in our scenario. In the first, labels are a priori assumed not to be correlated so that a predictor can be trained for each label independently. This reduces training and prediction complexity from exponential in the number of labels to linear. Such methods can therefore scale efficiently to large problems, but at the cost of not being able to model label correlations. Furthermore, these methods typically do not minimise a multi-label loss. In the second, label correlations are explicitly taken into account by incorporating pairwise, or higher order, label interactions. However, exact inference is mostly intractable for densely correlated labels and in situations where the label correlation graph has loops. Most approaches therefore assume sparsely correlated labels, such as those arranged in a hierarchical tree structure.

In this paper, we follow a middle approach. We develop a max-margin multi-label classification formulation, referred to as M3L, where we do model prior label correlations but do not incorporate pairwise, or higher order, label interaction terms in the prediction function. This lets us generalise to the case where the training label correlations might differ significantly from the test label correlations. We can also efficiently handle densely correlated labels. In particular, we show that under fairly general assumptions of linearity, the M3L primal formulation can be reduced from having an exponential number of constraints to linear in the number of labels. Furthermore, if no prior information about label correlations is provided, M3L reduces directly to the 1-vs-All method. This lets us provide a principled interpretation of the 1-vs-All multi-label approach which has enjoyed the reputation of being a popular, effective but nevertheless, heuristic technique.

Much of the focus of this paper is on optimising the M3L formulation. It turns out that it is not enough to just reduce the primal to having only a linear number of constraints. A straightforward application of state-of-the-art decompositional optimisation methods, such as Sequential Minimal Optimisation (SMO), would lead to an algorithm that is super-quadratic in the number of labels. We therefore develop specialised optimisation algorithms that can be orders of magnitude faster than competing methods. In particular, for kernelised M3L, we show that by simple bookkeeping and delaying gradient updates, SMO can be adapted to yield a linear time algorithm. Furthermore, due to efficient kernel caching and jointly optimising all variables, we can sometimes be an order of magnitude faster than the 1-vs-All method. Thus our code, available from Hariharan et al. (2010a), should also be very useful for learning independent 1-vs-All classifiers. For linear M3L, we adopt a dual co-ordinate ascent strategy with shrinkage which lets us efficiently tackle large scale training data sets. In terms of prediction accuracy, we show that incorporating prior knowledge about label correlations using the M3L formulation can substantially boost performance over independent methods.

The rest of the paper is organised as follows. Related work is reviewed in Sect. 2. Section 3 develops the M3L primal formulation and shows how to reduce the number of primal constraints from exponential to linear. The 1-vs-All formulation is also shown to be a special case of the M3L formulation. The M3L dual is developed in Sect. 4 and optimised in Sects. 5 and 6, where we develop algorithms tuned to the kernelised and the linear case respectively. Experiments are carried out in Sect. 7 and it is demonstrated that the M3L formulation can lead to significant gains in terms of both optimisation and prediction accuracy. An earlier version of this paper appeared in Hariharan et al. (2010b).

2 Related Work

The multi-label problem has many facets including binary (Tsoumakas and Katakis 2007; Ueda and Saito 2003), multi-class (Dekel and Shamir 2010) and ordinal (Cheng et al. 2010) multi-label classification as well as semi-supervised learning, feature selection (Zhang and Wang 2009b), active learning (Li et al. 2004), multi-instance learning (Zhang and Wang 2009a), etc. Our focus, in this paper, is on binary multi-label classification, where most of the previous work can be categorised into one of two approaches depending on whether labels are assumed to be independent or not. We first review approaches that do assume label independence. Most of these methods try to reduce the multi-label problem to a more “canonical” one such as regression, ranking, multi-class or binary classification.

In regression methods (Hsu et al. 2009; Ji et al. 2008; Tsoumakas and Katakis 2007), the label space is mapped onto a vector space (which might sometimes be a shared subspace of the feature space) where regression techniques can be applied straightforwardly. The primary advantage of such methods is that they can be extremely efficient if the mapped label space has significantly lower dimensionality than the original label space (Hsu et al. 2009). The disadvantage of such approaches is that the choice of an appropriate mapping might be unclear. As a result, minimising regression loss functions, such as the square loss, in this space might be very efficient but might not correlate strongly with minimising the desired multi-label loss. Furthermore, classification involves inverting the map, which might not be straightforward, might result in multiple solutions and might involve heuristics.

A multi-label problem with L labels can be viewed as a classification problem with $2^L$ classes (McCallum 1999; Boutell et al. 2004) and standard multi-class techniques can be brought to bear. Such an approach was shown to give the best empirical results in the survey by Tsoumakas and Katakis (2007). However, such approaches have three major drawbacks. First, since not all $2^L$ label combinations can be present in the training data, many of the classes will have no positive examples. Thus, predictors cannot be learnt for these classes, implying that these label combinations cannot be recognised at run time. Second, the 0/1 multi-class loss optimised by such methods is a poor approximation to most multi-label losses. For instance, the 0/1 loss would charge the same penalty for predicting all but one of the labels correctly as it would for predicting all of the labels incorrectly. Finally, learning and predicting with such a large number of classifiers might be very computationally expensive.

Binary classification can be leveraged by replicating the feature vector of each data point L times. For copy number l, an extra dimension is added to the feature vector with value l, and the training label is +1 if label l is present in the label set of the original point and −1 otherwise. A binary classifier can be learnt from this expanded training set, and a novel point is classified by first replicating it as described above and then applying the binary classifier L times to determine which labels are selected. Due to the data replication, applying a binary classifier naively would be computationally costly and would require that complex decision boundaries be learnt. However, Schapire and Singer (2000) show that the problem can be solved efficiently using Boosting. A somewhat related technique is 1-vs-All (Rifkin and Klautau 2004), which independently learns a binary classifier for each label. As we'll show in Sect. 3, our formulation generalises 1-vs-All to handle prior label correlations.
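A minimal sketch of this data replication (our illustration of the construction, not Schapire and Singer's actual algorithm) might look as follows:

```python
import numpy as np

def replicate_for_binary(X, Y):
    """Reduce multi-label data to one binary problem by replication.

    X: (N, D) features; Y: (N, L) labels with +/-1 entries. Each point
    is copied L times; copy l carries an extra feature with value l and
    the binary target Y[i, l]."""
    N = X.shape[0]
    L = Y.shape[1]
    X_rep = np.repeat(X, L, axis=0)                # point 0 x L, point 1 x L, ...
    label_id = np.tile(np.arange(1, L + 1), N)     # the added dimension with value l
    X_bin = np.hstack([X_rep, label_id[:, None]])  # (N*L, D+1)
    y_bin = Y.reshape(-1)                          # matches the replication order
    return X_bin, y_bin
```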

A ranking-based solution was proposed in Elisseeff and Weston (2001). The objective was to ensure that, for every data point, all the relevant labels are ranked higher than any of the irrelevant ones. This approach has been influential but suffers from the drawback that the number of labels to select from the ranking cannot easily be determined. The solution proposed in Elisseeff and Weston (2001) was to find a threshold so that all labels scoring above the threshold were selected. The threshold was determined using a regressor trained subsequently on the ranker's output on the training set. Many variations have been proposed, such as using dummy labels to determine the threshold, but each has its own limitations and no clear choice has emerged. Furthermore, posing the problem as ranking induces a quadratic number of constraints per example, which leads to a harder optimisation problem. This is ameliorated in Crammer and Singer (2003), who reduce the space complexity to linear and the time complexity to sub-quadratic.

Most of the approaches mentioned above do not explicitly model label correlations—McCallum (1999) uses a generative model which can, in principle, handle correlations, but greedy heuristics are used to search over the exponential label space. In terms of discriminative methods, most work has focused on hierarchical tree, or forest, structured labels. Methods such as those of Cai and Hofmann (2007) and Cesa-Bianchi et al. (2006) optimise a hierarchical loss over the tree structure but do not incorporate pairwise, or higher order, label interaction terms. In both these methods, a label is predicted only if its parent has also been predicted in the hierarchy. For instance, Cesa-Bianchi et al. (2006) train a classifier for each node of the tree. The positive training data for the classifier is the set of data points marked with the node label, while the negative training points are selected from the sibling nodes. Classification starts at the root and all the children classifiers are tested to determine which path to take. This leads to a very efficient algorithm during both training and prediction as each classifier is trained on only a subset of the data. Alternatively, Cai and Hofmann (2007) classify at only the leaf nodes and use them as a proxy for the entire path starting from the root. A hierarchical loss is defined and optimised using the ranking method of Elisseeff and Weston (2001).

The M3N formulation of Taskar et al. (2003) was the first to suggest max-margin learning of label interactions. The original formulation starts off with an exponential number of constraints. These can be reduced to a quadratic number if the label interactions form a tree or forest. Approximate algorithms have also been developed for sparse, loopy graph structures. While the M3N formulation dealt with the Hamming loss, a more suitable hierarchical loss was introduced and efficiently optimised in Rousu et al. (2006) for the case of hierarchies. Note that even though these methods take label correlations explicitly into account, they are unsuitable for our purposes: they cannot handle densely correlated labels, and they learn training set label correlations which are not useful at test time since the statistics might have changed significantly.

Finally, Tsochantaridis et al. (2005) propose an iterative, cutting plane algorithm for learning in general structured output spaces. The algorithm adds the worst violating constraint to the active set in each iteration and is proved to take a maximum number of iterations independent of the size of the output space. While this algorithm can be used to learn pairwise label interactions, it too cannot handle a fully connected graph, as the worst violating constraint cannot in general be found in polynomial time. However, it can be used to optimise our proposed M3L formulation, though it is an order of magnitude slower than the specialised optimisation algorithms we develop.

Zero-shot learning deals with the problem of recognising instances from novel categories that were not present during training. It is a nascent research problem and most approaches tackle it by building an intermediate representation, leveraging attributes, features or classifier outputs which can be learnt from the available training data (Farhadi et al. 2009, 2010; Lampert et al. 2009; Palatucci et al. 2009). Novel instances are classified by first generating their intermediate representation and then mapping it onto the novel category representation (which can be generated using meta-data alone). The focus of research has mainly been on what makes a good intermediate representation and how the mapping should be carried out.

A popular choice of intermediate representation has been parts and attributes—whether semantic or discriminative. Since not all features are relevant to all attributes, Farhadi et al. (2009) explore feature selection so as to better predict a novel instance's attributes. Probabilistic techniques for mapping the list of predicted attributes to a novel category's list of attributes (known a priori) are developed in Lampert et al. (2009), while Palatucci et al. (2009) carry out a theoretical analysis and use the one-nearest-neighbour rule. An alternative approach to zero-shot learning is not to name the novel object, or explicitly recognise its attributes, but simply to say that it is "like" an object seen during training (Wang et al. 2010). For instance, the Babirusa in Fig. 1 could be declared to be like a pig. This is sufficient for some applications and works well if the training set has good category-level coverage.

3 M3L: the max-margin multi-label classification primal formulation

The objective in multi-label classification is to learn a function f which can be used to assign a set of labels to a point $\mathbf{x}$. We assume that N training data points have been provided of the form $(\mathbf{x}_i, \mathbf{y}_i) \in \mathbb{R}^D \times \{\pm1\}^L$ with $y_{il}$ being +1 if label l has been assigned to point i and −1 otherwise. Note that such an encoding allows us to learn from both the presence and absence of labels, since both can be informative when predicting test categories.

A principled way of formulating the problem would be to take the loss function Δ that one truly cares about and minimise it over the training set subject to regularisation or prior knowledge. Of course, since direct minimisation of most discrete loss functions is hard, we might end up minimising an upper bound on the loss, such as the hinge. The learning problem can then be formulated as the following primal

$$ P_1 \equiv \min_{\mathbf{w},\boldsymbol{\xi}} \frac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_{i=1}^{N} \xi_i $$
(1)
$$ \mbox{s.t.}\quad f(\mathbf{x}_i,\mathbf{y}_i) \geq f(\mathbf{x}_i,\mathbf{y}) + \varDelta(\mathbf{y}_i,\mathbf{y}) - \xi_i \quad \forall i,\ \forall \mathbf{y} \in \{\pm1\}^L $$
(2)
$$ \xi_i \geq 0 \quad \forall i $$
(3)

with a new point $\mathbf{x}$ being assigned labels according to $\mathbf{y}^* = \mathop{\mbox{argmax}}_{\mathbf{y}} f(\mathbf{x},\mathbf{y})$. The drawback of such a formulation is that there are $N2^L$ constraints, which makes direct optimisation very slow. Furthermore, classification of novel points might require $2^L$ function evaluations (one for each possible value of $\mathbf{y}$), which can be prohibitive at run time. In this section, we demonstrate that, under general assumptions of linearity, (P1) can be reformulated as the minimisation of L densely correlated sub-problems, each having only N constraints. At the same time, the prediction cost is reduced to a single function evaluation with complexity linear in the number of labels. The ideas underlying this decomposition were also used in Evgeniou et al. (2005) in a multi-task learning scenario. However, their objective is to combine multiple tasks into a single learning problem, while we are interested in decomposing (P1) into multiple subproblems.

We start by making the standard assumption that

$$ f(\mathbf{x},\mathbf{y}) = \mathbf{w}^t\bigl(\boldsymbol{\phi}(\mathbf{x}) \otimes \boldsymbol{\psi}(\mathbf{y})\bigr) $$
(4)

where $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$ are the feature and label space mappings respectively, ⊗ is the Kronecker product and $\mathbf{w}^t$ denotes the transpose of $\mathbf{w}$. Note that, for zero-shot learning, it is possible to show theoretically that, in the limit of infinite data, one does not need to model label correlations when the training and test distributions are the same (Palatucci et al. 2009). In practice, however, training sets are finite, often relatively small, and have label distributions that differ significantly from the test set. Therefore, to incorporate prior knowledge and correlate classifiers efficiently, we assume that labels have at most linear, possibly dense, correlation so that it is sufficient to choose $\boldsymbol{\psi}(\mathbf{y}) = \mathbf{P}\mathbf{y}$ where $\mathbf{P}$ is an invertible matrix encoding all our prior knowledge about the labels. If we assume f to be quadratic (or higher order) in $\mathbf{y}$, as is done in structured output prediction, then it would not be possible to reduce the number of constraints from exponential to linear while still modelling dense, possibly negative, label correlations. Furthermore, learning label correlations on the training set by incorporating quadratic terms in $\mathbf{y}$ might not be fruitful as the test categories will have very different correlation statistics. Thus, by sacrificing some expressive power, we hope to build much more efficient algorithms that can still give improved prediction accuracy in the zero-shot learning scenario.

We make another standard assumption that the chosen loss function should decompose over the individual labels (Taskar et al. 2003). Hence, we require that

$$ \varDelta(\mathbf{y}_i,\mathbf{y}) = \sum_{l=1}^{L} \varDelta_l(\mathbf{y}_i, y_l) $$
(5)

where $y_l \in \{\pm1\}$ corresponds to label l in the set of labels represented by $\mathbf{y}$. For instance, the popular Hamming loss, amongst others, satisfies this condition. We define the Hamming loss $\varDelta_H(\mathbf{y}_i,\mathbf{y})$, between a ground truth label $\mathbf{y}_i$ and a prediction $\mathbf{y}$, as

$$ \varDelta_H(\mathbf{y}_i,\mathbf{y}) = \mathbf{y}_i^t(\mathbf{y}_i - \mathbf{y}) $$
(6)

which is twice the number of individual labels mispredicted in $\mathbf{y}$. Note that the Hamming loss can be decomposed over the labels as $\varDelta_H(\mathbf{y}_i,\mathbf{y}) = \sum_l 1 - y_l y_{il}$. Of course, for Δ to represent a sensible loss we also require that $\varDelta(\mathbf{y}_i,\mathbf{y}) \geq \varDelta(\mathbf{y}_i,\mathbf{y}_i) = 0$.
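As a quick sanity check of this decomposition, one can verify the identity numerically (a hypothetical snippet, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 10
y_i = rng.choice([-1, 1], size=L)   # ground truth labels
y = rng.choice([-1, 1], size=L)     # predicted labels

vector_form = y_i @ (y_i - y)        # Delta_H as defined in (6)
decomposed = np.sum(1 - y * y_i)     # per-label decomposition
mislabelled = np.sum(y != y_i)       # number of mispredicted labels

assert vector_form == decomposed == 2 * mislabelled
```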

Under these assumptions, (P1) can be expressed as

$$ P_1 \equiv \min_{\mathbf{w}} \frac{1}{2}\mathbf{w}^t\mathbf{w} + C\sum_{i=1}^{N} \max_{\mathbf{y} \in \{\pm1\}^L} \bigl[\varDelta(\mathbf{y}_i,\mathbf{y}) + \mathbf{w}^t\bigl(\boldsymbol{\phi}(\mathbf{x}_i) \otimes \mathbf{P}(\mathbf{y} - \mathbf{y}_i)\bigr)\bigr] $$
(7)

where the constraints have been moved into the objective and $\xi_i \geq 0$ eliminated by including $\mathbf{y} = \mathbf{y}_i$ in the maximisation. To simplify notation, we express the vector $\mathbf{w}$ as a D×L matrix $\mathbf{W}$ so that

$$ P_1 \equiv \min_{\mathbf{W}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{W}^t\mathbf{W}\bigr) + C\sum_{i=1}^{N} \max_{\mathbf{y} \in \{\pm1\}^L} \bigl[\varDelta(\mathbf{y}_i,\mathbf{y}) + \boldsymbol{\phi}(\mathbf{x}_i)^t\mathbf{W}\mathbf{P}(\mathbf{y} - \mathbf{y}_i)\bigr] $$
(8)
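The equivalence of (7) and (8) rests on the identity $\mathbf{w}^t(\boldsymbol{\phi}(\mathbf{x}) \otimes \mathbf{P}\mathbf{y}) = \boldsymbol{\phi}(\mathbf{x})^t\mathbf{W}\mathbf{P}\mathbf{y}$. The following numpy snippet (our illustration, assuming a row-major reshaping convention) verifies it:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 5, 3
w = rng.standard_normal(D * L)      # joint weight vector
W = w.reshape(D, L)                 # the same weights as a D x L matrix
phi_x = rng.standard_normal(D)      # feature map of a point
P = rng.standard_normal((L, L))     # label-correlation map (invertible a.s.)
y = rng.choice([-1.0, 1.0], size=L)

lhs = w @ np.kron(phi_x, P @ y)     # f(x, y) in the Kronecker form of (4)
rhs = phi_x @ W @ (P @ y)           # the matrix form used in (8)
assert np.isclose(lhs, rhs)
```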

Substituting $\mathbf{Z} = \mathbf{W}\mathbf{P}$, $\mathbf{R} = \mathbf{P}^t\mathbf{P} \succ 0$ and using the identity $\mathop{\mbox{Trace}}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathop{\mbox{Trace}}(\mathbf{C}\mathbf{A}\mathbf{B})$ results in

$$ P_1 \equiv \min_{\mathbf{Z}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{R}^{-1}\mathbf{Z}^t\mathbf{Z}\bigr) + C\sum_{i=1}^{N} \max_{\mathbf{y} \in \{\pm1\}^L} \sum_{l=1}^{L} \bigl[\varDelta_l(\mathbf{y}_i,y_l) + (y_l - y_{il})\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i)\bigr] $$
(9)

where $\mathbf{z}_l$ is the lth column of $\mathbf{Z}$. Note that the terms inside the maximisation break up independently over the L components of $\mathbf{y}$. It is therefore possible to interchange the maximisation and summation to get

$$ P_1 \equiv \min_{\mathbf{Z}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{R}^{-1}\mathbf{Z}^t\mathbf{Z}\bigr) + C\sum_{i=1}^{N} \sum_{l=1}^{L} \max_{y_l \in \{\pm1\}} \bigl[\varDelta_l(\mathbf{y}_i,y_l) + (y_l - y_{il})\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i)\bigr] $$
(10)

This leads to an equivalent primal formulation, (P2), expressed as the sum of L correlated problems, each having only N constraints, which is significantly easier to optimise.

$$ P_2 \equiv \min_{\mathbf{Z},\boldsymbol{\xi}} \frac{1}{2}\mathop{\mbox{Trace}}\bigl(\mathbf{R}^{-1}\mathbf{Z}^t\mathbf{Z}\bigr) + C\sum_{l=1}^{L}\sum_{i=1}^{N} \xi_{il} $$
(11)
$$ \mbox{s.t.}\quad 2 y_{il}\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i) \geq \varDelta_{il}^{-} - \varDelta_{il}^{+} - \xi_{il} \quad \forall i,l $$
(12)
$$ \xi_{il} \geq 0 \quad \forall i,l $$
(13)
$$ \mbox{where}\quad \varDelta_{il}^{+} = \varDelta_l(\mathbf{y}_i, y_{il}) \quad \mbox{and} \quad \varDelta_{il}^{-} = \varDelta_l(\mathbf{y}_i, -y_{il}) $$
(14)

Furthermore, a novel point $\mathbf{x}$ can be assigned the set of labels for which the entries of $\mbox{sign}(\mathbf{Z}^t\boldsymbol{\phi}(\mathbf{x}))$ are +1. This corresponds to a single evaluation of f taking time linear in the number of labels.
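In code, prediction is just a matrix-vector product followed by thresholding (a hypothetical sketch, with `Z` holding the learnt classifiers):

```python
import numpy as np

def predict_labels(Z, phi_x):
    """Assign labels to a novel point: entry l is +1 iff z_l^t phi(x) > 0.

    Z: (D, L) matrix of learnt classifiers; phi_x: (D,) feature vector.
    Cost is O(DL), i.e. linear in the number of labels."""
    scores = Z.T @ phi_x               # one score per label
    return np.where(scores > 0, 1, -1)
```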

The L classifiers in $\mathbf{Z}$ are not independent but correlated by $\mathbf{R}$—a positive definite matrix encoding our prior knowledge about label correlations. One might naturally think of learning $\mathbf{R}$ from training data. For instance, one could learn $\mathbf{R}$ directly or express $\mathbf{R}^{-1}$ as a linear combination of predefined positive definite matrices with learnt coefficients. Such formulations have been developed in the Multiple Kernel Learning literature and we could leverage some of the proposed MKL optimisation techniques (Vishwanathan et al. 2010). However, in the zero-shot learning scenario, learning $\mathbf{R}$ from training data is not helpful as the correlations between labels during training might be significantly different from those during testing.

Instead, we rely on the standard zero-shot learning assumption that the test category attributes are known a priori (Farhadi et al. 2009, 2010; Lampert et al. 2009; Palatucci et al. 2009). Furthermore, if the prior distribution of test categories is known, then $\mathbf{R}$ can be set to approximate the average pairwise test label correlation (see Sect. 7.2.1 for details).

Note that, in the zero-shot learning scenario, $\mathbf{R}$ can be dense, as almost all the attributes might be shared across categories and correlated with each other, and can also have negative entries representing negative label correlations. We propose to improve prediction accuracy on the novel test categories by encoding prior knowledge about their label correlations in $\mathbf{R}$.

Note that we deliberately chose not to include bias terms $\mathbf{b}$ in f even though the reduction from (P1) to (P2) would still have gone through and the resulting kernelised optimisation would have been more or less the same (see Sect. 7.1). However, we would then have had to regularise $\mathbf{b}$ and correlate it using $\mathbf{R}$. Otherwise $\mathbf{b}$ would have been a free parameter capable of undoing the effects of $\mathbf{R}$ on $\mathbf{Z}$. Therefore, rather than explicitly having $\mathbf{b}$ and regularising it, we implicitly simulate $\mathbf{b}$ by adding an extra dimension to the feature vector. This has the same effect while keeping the optimisation simple.
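The trick is standard; a sketch (with a hypothetical helper name):

```python
import numpy as np

def add_bias_dimension(X, value=1.0):
    """Append a constant feature so that the last row of Z acts as a
    regularised, R-correlated bias term."""
    return np.hstack([X, np.full((X.shape[0], 1), value)])
```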

We briefly discuss two special cases before turning to the dual and its optimisation.

3.1 The special case of 1-vs-All

If label correlation information is not included, i.e. $\mathbf{R} = \mathbf{I}$, then (P2) decouples into L completely independent sub-problems, each of which can be tackled in isolation. In particular, for the Hamming loss we get

$$ P_2 \equiv \sum_{l=1}^{L} S_l $$
(15)
$$ S_l \equiv \min_{\mathbf{z}_l,\boldsymbol{\xi}_l} \frac{1}{2}\mathbf{z}_l^t\mathbf{z}_l + 2C\sum_{i=1}^{N} \xi_{il} $$
(16)
$$ \mbox{s.t.}\quad y_{il}\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i) \geq 1 - \xi_{il} \quad \forall i $$
(17)
$$ \xi_{il} \geq 0 \quad \forall i $$
(18)

Thus, $S_l$ reduces to an independent binary classification sub-problem where the positive class contains all training points tagged with label l and the negative class contains all other points. This is exactly the strategy used in the popular and effective 1-vs-All method, and we can therefore now make explicit the assumptions underlying this technique. The only difference is that one should charge a misclassification penalty of 2C to be consistent with the original primal formulation.
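As an illustration of this special case, the following sketch trains the L independent classifiers by plain sub-gradient descent on the linear primal of $S_l$ with the 2C penalty (our simplified stand-in, not the solver developed in Sect. 5):

```python
import numpy as np

def train_one_vs_all(X, Y, C=1.0, lr=1e-3, epochs=200):
    """Independently minimise, for each label l,
    0.5*||z_l||^2 + 2*C*sum_i max(0, 1 - Y[i,l] * z_l^t x_i)
    by sub-gradient descent. X: (N, D); Y: (N, L) with +/-1 entries."""
    D = X.shape[1]
    L = Y.shape[1]
    Z = np.zeros((D, L))
    for l in range(L):
        z = np.zeros(D)
        for _ in range(epochs):
            margins = Y[:, l] * (X @ z)
            viol = margins < 1                           # active hinges
            grad = z - 2.0 * C * (Y[viol, l] @ X[viol])  # regulariser + hinge
            z -= lr * grad
        Z[:, l] = z
    return Z
```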

3.2 Relating the kernel to the loss

In general, the kernel is chosen so as to ensure that the training data points become well separated in the feature space. This is true for both $K_x$, the kernel on $\mathbf{x}$, as well as $K_y$, the kernel on $\mathbf{y}$. However, one might also take the view that since the loss Δ induces a measure of dissimilarity in the label space, it must be related to the kernel on $\mathbf{y}$, which is a measure of similarity in the label space. This heavily constrains the choice of $K_y$ and therefore the label space mapping $\boldsymbol{\psi}$. For instance, if a linear relationship is assumed, we might choose $\varDelta(\mathbf{y}_i,\mathbf{y}) = K_y(\mathbf{y}_i,\mathbf{y}_i) - K_y(\mathbf{y}_i,\mathbf{y})$. Note that this allows Δ to be asymmetric even though $K_y$ is not and ensures the linearity of Δ if $\boldsymbol{\psi}$, the label space mapping, is linear.

In this case, label correlation information should be encoded directly into the loss. For example, the Hamming loss could be transformed to $\varDelta_H(\mathbf{y}_i,\mathbf{y}) = \mathbf{y}_i^t\mathbf{R}(\mathbf{y}_i - \mathbf{y})$. $\mathbf{R}$ is the same matrix as before except that the interpretation now is that the entries of $\mathbf{R}$ encode label correlations by specifying the penalties to be charged if a label in the set is misclassified. Of course, for Δ to be a valid loss, not only must $\mathbf{R}$ be positive definite as before but it must now also be diagonally dominant. As such, it can only encode "weak" correlations. Given the choice of Δ and the linear relationship with $K_y$, the label space mapping gets fixed to $\boldsymbol{\psi}(\mathbf{y}) = \mathbf{P}\mathbf{y}$ where $\mathbf{R} = \mathbf{P}^t\mathbf{P}$.
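A candidate $\mathbf{R}$ can be screened for these two conditions as follows (our illustration):

```python
import numpy as np

def is_valid_loss_correlation(R, tol=1e-10):
    """Check the two conditions on R for the transformed Hamming loss:
    positive definiteness and diagonal dominance."""
    R = np.asarray(R, dtype=float)
    symmetric = np.allclose(R, R.T)
    pos_def = symmetric and np.all(np.linalg.eigvalsh(R) > tol)
    # |R_ll| >= sum over k != l of |R_lk|, for every row l
    off_diag = np.sum(np.abs(R), axis=1) - np.abs(np.diag(R))
    diag_dom = np.all(np.abs(np.diag(R)) >= off_diag)
    return pos_def and diag_dom
```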

Under these assumptions, one can still go from (P1) to (P2) using the same steps as before. The main differences are that $\mathbf{R}$ is now more restricted and that $\varDelta_l(\mathbf{y}_i, y_l) = (1/L)\,\mathbf{y}_i^t\mathbf{R}\mathbf{y}_i - y_l\,\mathbf{y}_i^t\mathbf{R}_l$ where $\mathbf{R}_l$ is the lth column of $\mathbf{R}$. While this result is theoretically interesting, we do not explore it further in this paper.

4 The M3L dual formulation

The dual of (P2) has similar properties in that it can be viewed as the maximisation of L related problems which decouple into independent binary SVM classification problems when $\mathbf{R} = \mathbf{I}$. The dual is easily derived if we rewrite (P2) in vector notation. Defining

$$ \mathbf{Y}_l = \mathop{\mbox{diag}}(y_{1l},\ldots,y_{Nl}) $$
(19)
$$ \boldsymbol{\varDelta}_l^{\pm} = \bigl(\varDelta_{1l}^{\pm},\ldots,\varDelta_{Nl}^{\pm}\bigr)^t $$
(20)
$$ (\mathbf{K}_{\mathbf{x}})_{ij} = \boldsymbol{\phi}(\mathbf{x}_i)^t\boldsymbol{\phi}(\mathbf{x}_j) $$
(21)

we get the following Lagrangian

$$ \mathcal{L} = \frac{1}{2}\sum_{l,k=1}^{L} R^{-1}_{lk}\,\mathbf{z}_l^t\mathbf{z}_k + C\sum_{i,l}\xi_{il} - \sum_{i,l}\alpha_{il}\bigl(2y_{il}\,\mathbf{z}_l^t\boldsymbol{\phi}(\mathbf{x}_i) - \varDelta_{il}^{-} + \varDelta_{il}^{+} + \xi_{il}\bigr) - \sum_{i,l}\beta_{il}\xi_{il} $$
(22)

where $\alpha_{il}, \beta_{il} \geq 0$ are the dual variables, with the optimality conditions being

$$ \mathbf{z}_l = 2\sum_{k=1}^{L}\sum_{i=1}^{N} R_{lk}\,\alpha_{ik}\,y_{ik}\,\boldsymbol{\phi}(\mathbf{x}_i) $$
(23)
$$ C - \alpha_{il} - \beta_{il} = 0 \;\Rightarrow\; 0 \leq \alpha_{il} \leq C $$
(24)

Substituting these back into the Lagrangian, and stacking the dual variables for each label into vectors $\boldsymbol{\alpha}_l = (\alpha_{1l},\ldots,\alpha_{Nl})^t$, leads to the following dual

$$ D_2 = \max_{\mathbf{0} \leq \boldsymbol{\alpha} \leq C\mathbf{1}} \sum_{l=1}^{L} \boldsymbol{\alpha}_l^t\bigl(\boldsymbol{\varDelta}_l^{-} - \boldsymbol{\varDelta}_l^{+}\bigr) - 2\sum_{l=1}^{L}\sum_{k=1}^{L} R_{lk}\,\boldsymbol{\alpha}_l^t\mathbf{Y}_l\mathbf{K}_{\mathbf{x}}\mathbf{Y}_k\boldsymbol{\alpha}_k $$
(25)

Henceforth we will drop the subscript on the kernel matrix and write $\mathbf{K}_{\mathbf{x}}$ as $\mathbf{K}$.
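For reference, (25) and the gradient (29) derived from it can be evaluated vectorially in a few lines (our illustration; `alpha`, `Y` and the loss arrays are N×L, `K` is the symmetric N×N kernel matrix):

```python
import numpy as np

def dual_objective(alpha, Delta_minus, Delta_plus, Y, K, R):
    """Evaluate D_2 of (25)."""
    lin = np.sum(alpha * (Delta_minus - Delta_plus))
    YA = Y * alpha                          # entries y_il * alpha_il
    quad = np.sum(R * (YA.T @ K @ YA))      # sum_lk R_lk a_l^t Y_l K Y_k a_k
    return lin - 2.0 * quad

def dual_gradient(alpha, Delta_minus, Delta_plus, Y, K, R):
    """Evaluate g_pl of (29) for all (p, l) at once; uses K = K^t."""
    YA = Y * alpha
    return (Delta_minus - Delta_plus) - 4.0 * Y * (K @ YA @ R)
```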

5 Optimisation

The M3L dual is similar to the standard SVM dual. Existing optimisation techniques can therefore be brought to bear. However, the dense structure of $\mathbf{R}$ couples all NL dual variables and simply porting existing solutions leads to very inefficient code. We show that, with some bookkeeping, we can easily go from an $O(L^2)$ algorithm to an $O(L)$ algorithm. Furthermore, by re-utilising the kernel cache, our algorithms can be very efficient even for non-linear problems. We treat the kernelised and linear M3L cases separately.

5.1 Kernelised M3L

The dual (D2) is a convex quadratic programme with very simple box constraints. We can therefore use co-ordinate ascent algorithms (Platt 1999; Fan et al. 2005; Lin et al. 2009) to maximise the dual. The algorithms start by picking a feasible point—typically $\boldsymbol{\alpha} = \mathbf{0}$. Next, two variables are selected and optimised analytically. This step is repeated until the projected gradient magnitude falls below a threshold, at which point the algorithm can be shown to have converged to the global optimum (Lin et al. 2009). The three key components are therefore: (a) reduced variable optimisation; (b) working set selection; and (c) the stopping criterion and kernel caching. We now discuss each of these components. The pseudo-code of the algorithm and proof of convergence are given in the Appendix.

5.1.1 Reduced variable optimisation

If all but two of the dual variables are fixed, say $\alpha_{pl}$ and $\alpha_{ql}$ along label l, then the dual optimisation problem reduces to

$$ D_{2pql} \equiv \max_{\delta_{pl},\delta_{ql}} g_{pl}\delta_{pl} + g_{ql}\delta_{ql} - 2R_{ll}\bigl(K_{pp}\delta_{pl}^2 + 2y_{pl}y_{ql}K_{pq}\delta_{pl}\delta_{ql} + K_{qq}\delta_{ql}^2\bigr) $$
(26)
$$ \mbox{s.t.}\quad 0 \leq \alpha_{pl}^{\mathrm{old}} + \delta_{pl} \leq C $$
(27)
$$ 0 \leq \alpha_{ql}^{\mathrm{old}} + \delta_{ql} \leq C $$
(28)

where \(\delta_{pl}=\alpha_{pl}^{\mathrm{new}} - \alpha_{pl}^{\mathrm{old}}\) and \(\delta_{ql}=\alpha_{ql}^{\mathrm{new}} - \alpha_{ql}^{\mathrm{old}}\) and

$$ g_{pl} = \nabla_{\alpha_{pl}} D_{2} = \varDelta_{pl}^{-} - \varDelta_{pl}^+ - 4 \sum _{i=1}^N\sum_{k=1}^L R_{kl}K_{ip}y_{ik}y_{pl} \alpha_{ik} $$
(29)

Note that $D_{2pql}$ has a quadratic objective in two variables which can be maximised analytically due to the simple box constraints. We do not give the expressions for $\alpha^{\mathrm{new}}_{pl}$ and $\alpha^{\mathrm{new}}_{ql}$ which maximise $D_{2pql}$, as many special cases arise when the variables are at their bounds, but they can be found in Algorithm 2 of the pseudo-code in Appendix A.
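As a simple stand-in for that case analysis, the box-constrained two-variable problem (26)-(28) can also be maximised by alternating exact one-dimensional updates with clipping (our sketch, assuming $R_{ll} > 0$ and a positive definite kernel so that the objective is strictly concave):

```python
def two_var_update(alpha_p, alpha_q, g_p, g_q, K_pp, K_qq, K_pq,
                   y_p, y_q, R_ll, C, n_iter=50):
    """Maximise D_2pql over the box [0, C]^2 by alternating exact
    one-dimensional maximisations with clipping."""
    s = y_p * y_q
    d_p, d_q = 0.0, 0.0
    for _ in range(n_iter):
        # Exact maximiser in d_p for fixed d_q, then clip to the box.
        d_p = (g_p / (4.0 * R_ll) - s * K_pq * d_q) / K_pp
        d_p = min(max(d_p, -alpha_p), C - alpha_p)
        # Exact maximiser in d_q for fixed d_p, then clip to the box.
        d_q = (g_q / (4.0 * R_ll) - s * K_pq * d_p) / K_qq
        d_q = min(max(d_q, -alpha_q), C - alpha_q)
    return alpha_p + d_p, alpha_q + d_q
```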

5.1.2 Working set selection

Since the M3L formulation does not have a bias term, it can be optimised by picking a single variable at each iteration rather than a pair. This leads to a low cost per iteration but a large number of iterations. Selecting two variables per iteration increases the cost per iteration but significantly reduces the number of iterations, since second order information can be incorporated into the variable selection policy.

If we were to choose two variables to optimise along the same label l, say \(\alpha_{pl}\) and \(\alpha_{ql}\), then the maximum change that we could effect in the dual is given by

(30)

In terms of working set selection, it would be ideal to choose the two variables \(\alpha_{pl}\) and \(\alpha_{ql}\) which maximise the increase in the dual objective. However, this turns out to be too expensive in practice. A good approximation is to choose the first point \(\alpha_{pl}\) to be the one with the maximum projected gradient magnitude. The projected gradient is defined as

$$ \tilde{g}_{pl} = \left \{ \begin{array}{l@{\quad}l} g_{pl} & \text{if }\alpha_{pl} \in(0,C) \\ \min(0, g_{pl}) & \text{if }\alpha_{pl}=C \\ \max(0, g_{pl}) & \text{if }\alpha_{pl}=0 \end{array} \right . $$
(31)

and hence the first point is chosen as

$$ \bigl(p^*,l^*\bigr)=\mathop{\mbox{argmax}}_{p,l} |\tilde{g}_{pl}| $$
(32)

Having chosen the first point, the second point is chosen to be the one that maximises (30).

Working set selection can be made efficient by maintaining the set of gradients g. Every time a variable, say \(\alpha_{pl}\), is changed, the gradients need to be updated as

$$ g^{\mathrm{new}}_{jk} = g^{\mathrm{old}}_{jk} - 4 y_{pl} y_{jk} R_{kl}K_{pj}\bigl( \alpha_{pl}^{\mathrm{new}} - \alpha_{pl}^{\mathrm{old}}\bigr) $$
(33)

Note that, because of the dense structure of \(\mathbf{R}\), all \(NL\) gradients have to be updated even if a single variable is changed. Since there are \(NL\) variables, each of which presumably has to be updated at least once, we end up with an algorithm that takes time at least \(O(N^2L^2)\).

The algorithm can be made much more efficient if, with some bookkeeping, not all gradients have to be updated every time a variable is changed. For instance, if we fix a label l and modify L variables along the chosen label, then the gradient update equations can be written as

$$ u^{\mathrm{new}}_{j} = u^{\mathrm{old}}_{j} + K_{pj} y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr) $$
(34)
$$ g^{\mathrm{new}}_{jl} = g^{\mathrm{old}}_{jl} - 4 R_{ll} y_{jl} y_{pl} K_{pj} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr) $$
(35)
$$ g^{\mathrm{new}}_{jk} = g^{\mathrm{old}}_{jk} - 4 R_{kl} y_{jk} u_{j} \quad (k \neq l) $$
(36)

As long as we are changing variables along a particular label, the gradient updates can be accumulated in u, and only when we switch to a new label do all the gradients have to be updated. We therefore end up doing \(O(NL)\) work after changing L variables, resulting in an algorithm which takes time \(O(N^2L)\) rather than \(O(N^2L^2)\).
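The following sketch implements this bookkeeping in Python/NumPy. The class and method names are illustrative rather than taken from the paper, and it assumes the update equations (34)–(36) as given above.

```python
import numpy as np

class DelayedGradients:
    """Bookkeeping sketch for Sect. 5.1.2: while variables change along a
    fixed label l, only the N gradients of label l are kept current; the
    cross-label corrections are accumulated in u and applied in a single
    O(NL) flush when the solver moves on to a new label.
    """
    def __init__(self, g, K, Y, R):
        self.g, self.K, self.Y, self.R = g, K, Y, R
        self.u = np.zeros(g.shape[0])   # per-point accumulator, cf. (34)
        self.label = None               # label currently being swept

    def update(self, p, l, delta):
        """Record that alpha_pl changed by delta."""
        assert self.label in (None, l), "flush() before switching labels"
        self.label = l
        contrib = self.K[:, p] * (self.Y[p, l] * delta)
        self.u += contrib               # (34): accumulate
        # (35): keep label l's own gradients exact for variable selection.
        self.g[:, l] -= 4.0 * self.R[l, l] * self.Y[:, l] * contrib

    def flush(self):
        """Apply the delayed updates (36) before switching labels."""
        if self.label is None:
            return
        l = self.label
        for k in range(self.g.shape[1]):
            if k != l:
                self.g[:, k] -= 4.0 * self.R[k, l] * self.Y[:, k] * self.u
        self.u[:] = 0.0
        self.label = None
```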

5.1.3 Stopping criterion and kernel caching

We use the standard stopping criterion that the projected gradient magnitude for all NL dual variables should be less than a predetermined threshold.

We employ a standard Least Recently Used (LRU) kernel cache strategy implemented as a circular queue. Since we are optimising over all labels jointly, the kernel cache gets effectively re-utilised, particularly as compared to independent methods that optimise one label at a time. In the extreme case, independent methods will have to rebuild the cache for each label which can slow them down significantly.
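A minimal sketch of such a cache follows, assuming kernel rows are computed on demand. The paper implements the LRU policy as a circular queue; this sketch uses Python's OrderedDict for brevity, and the class name and RBF helper are illustrative only.

```python
from collections import OrderedDict
import numpy as np

class LRUKernelCache:
    """Sketch of an LRU cache of kernel matrix columns (Sect. 5.1.3)."""
    def __init__(self, X, kernel, capacity):
        self.X, self.kernel, self.capacity = X, kernel, capacity
        self.cols = OrderedDict()          # point index -> cached column

    def col(self, p):
        if p in self.cols:
            self.cols.move_to_end(p)       # mark as most recently used
            return self.cols[p]
        if len(self.cols) >= self.capacity:
            self.cols.popitem(last=False)  # evict least recently used
        c = self.kernel(self.X, self.X[p]) # compute K[:, p] on demand
        self.cols[p] = c
        return c

# Example usage with an RBF kernel. Because all labels are optimised
# jointly, the same cached columns serve every label, unlike 1-vs-All.
rbf = lambda X, x: np.exp(-0.5 * np.sum((X - x) ** 2, axis=1))
X = np.random.randn(1000, 50)
cache = LRUKernelCache(X, rbf, capacity=512)
column = cache.col(3)                      # computes and caches K[:, 3]
```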

6 Linear M3L

We build on top of the stochastic dual coordinate ascent with shrinkage algorithm of Hsieh et al. (2008). At each iteration, a single dual variable is chosen uniformly at random from the active set, and optimised analytically. The variable update equation is given by

$$ \alpha^{\mathrm{new}}_{pl} = \min\bigl(\max\bigl(\alpha^{\mathrm{old}}_{pl} + g_{pl} / \bigl(4 R_{ll}\, \mathbf{x}_{p}^{t}\mathbf{x}_{p}\bigr),\ 0\bigr),\ C\bigr) $$
(37)
$$ g_{pl} = \varDelta_{pl}^{-} - \varDelta_{pl}^{+} - 2 y_{pl}\, \mathbf{z}_{l}^{t}\mathbf{x}_{p} $$
(38)
$$ \mathbf{z}_{l} = 2 \sum_{k=1}^{L} R_{lk} \sum_{i=1}^{N} y_{ik} \alpha_{ik} \mathbf{x}_{i} $$
(39)

As can be seen, the dual variable update can be computed efficiently in terms of the primal variables \(\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_L]\), which must then be maintained every time a dual variable is modified. The update to \(\mathbf{Z}\) when \(\alpha_{pl}\) is modified is

$$ \mathbf{z}^{\mathrm{new}}_{k} = \mathbf{z}^{\mathrm{old}}_{k} + 2 R_{kl} y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr)\mathbf{x}_{p}, \quad k = 1, \ldots, L $$
(40)
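A short sketch of this maintenance step follows. The factor-of-two conventions match the equations above; the function names are illustrative, and the gradient helper assumes \(\Delta_{pl}^- - \Delta_{pl}^+ = 1\) as in the earlier sketches.

```python
import numpy as np

def update_primal(Z, R, X, Y, p, l, delta):
    """Apply the primal update (40) after alpha_pl changes by delta.

    Z: (L, D) array whose rows are the maintained primal vectors z_k.
    Because R is dense, a single dual update touches every row of Z.
    """
    Z += 2.0 * np.outer(R[:, l], Y[p, l] * delta * X[p])
    return Z

def dual_gradient(Z, X, Y, p, l, margin=1.0):
    """Gradient (38) at alpha_pl, with Delta^- - Delta^+ taken to be
    `margin` (an assumption of this sketch, not the paper's notation)."""
    return margin - 2.0 * Y[p, l] * (Z[l] @ X[p])
```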

Thus all the primal variables \(\mathbf{Z}\) need to be updated every time a single dual variable is modified. Again, as in the kernelised case, the algorithm can be made much more efficient by fixing a label l and modifying L dual variables along it while delaying the updates as

$$ \mathbf{u}^{\mathrm{new}} = \mathbf{u}^{\mathrm{old}} + 2 y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr)\mathbf{x}_{p} $$
(41)
$$ \mathbf{z}^{\mathrm{new}}_{l} = \mathbf{z}^{\mathrm{old}}_{l} + 2 R_{ll} y_{pl} \bigl(\alpha^{\mathrm{new}}_{pl} - \alpha^{\mathrm{old}}_{pl}\bigr)\mathbf{x}_{p} $$
(42)
$$ \mathbf{z}^{\mathrm{new}}_{k} = \mathbf{z}^{\mathrm{old}}_{k} + R_{kl}\,\mathbf{u} \quad (k \neq l) $$
(43)

In practice, we observed that performing L stochastic updates along a chosen label right from the start could slow down convergence in some cases. We therefore initially use the more expensive strategy of choosing dual variables uniformly at random, and only once the projected gradient magnitudes fall below a pre-specified threshold do we switch to optimising L dual variables along a particular label before picking a new label uniformly at random.

The active set is initialised to contain all the training data points. Points at bound having gradient magnitude outside the range of currently maintained extremal gradients are discarded from the active set. Extremal gradients are re-estimated at the end of each pass and if they are too close to each other the active set is expanded to include all training points once again.

A straightforward implementation with globally maintained extremal gradients leads to inefficient code: if the classifier for a particular label has not yet converged, it can force a large active set even though most points would not be considered by the other classifiers. We therefore maintain separate active sets for each label but couple the maintained extremal gradients via \(\mathbf{R}\). The extremal gradients \(lb_l\) and \(ub_l\), for label l, are initially set to −∞ and +∞ respectively. After each pass through the active set, they are updated as

(44)
(45)

where \(A_k\) is the set of indices in the active set of label k. This choice was empirically found to decrease training time.

Once all the projected gradients in all the active sets have magnitude less than a threshold τ, we expand the active sets to include all the variables, and re-estimate the projected gradients. The algorithm stops when all projected gradients have magnitude less than τ.
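The following sketch shows the generic shrinking step in the spirit of Hsieh et al. (2008). It covers only the basic rule for removing bound variables from a single label's active set; the paper's coupled update of \(lb_l\) and \(ub_l\) across labels via \(\mathbf{R}\), (44)–(45), is not reproduced here.

```python
def shrink_pass(alpha, g, C, lb, ub, active):
    """One shrinking sweep over a single label's active set.

    alpha, g: per-point dual values and gradients for this label;
    lb, ub: the maintained extremal gradients; active: the current list
    of active point indices. A point stuck at a bound whose gradient
    lies outside [lb, ub] is removed from the active set.
    """
    keep = []
    for i in active:
        stuck_low = alpha[i] <= 0.0 and g[i] < lb   # cannot increase
        stuck_high = alpha[i] >= C and g[i] > ub    # cannot decrease
        if not (stuck_low or stuck_high):
            keep.append(i)
    return keep

# After a full pass, if lb and ub drift too close together, the active
# set is expanded back to all points and the bounds are re-estimated.
```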

7 Experiments

In this section we first compare the performance of our optimisation algorithms and then evaluate how prediction accuracy can be improved by incorporating prior knowledge about label correlations.

7.1 Optimisation experiments

The cutting plane algorithm in SVMStruct (Tsochantaridis et al. 2005) is a general purpose algorithm that can be used to optimise the original M3L formulation \((P_1)\). In each iteration, the approximately worst violating constraint is added to the active set, and the algorithm is guaranteed to terminate within a number of iterations independent of the size of the output space. The algorithm has a user-defined parameter ϵ controlling the amount of error tolerated in finding the worst violating constraint.

We compared the SVMStruct algorithm to our M3L implementation on an Intel Xeon 2.67 GHz machine with 8 GB RAM. It was observed that even on medium scale problems with linear kernels, our M3L implementation was nearly a hundred times faster than SVMStruct. For example, on the Media Mill data set (Snoek et al. 2006) with a hundred and one labels and ten, fifteen and twenty thousand training points, our M3L code took 19, 37 and 55 seconds while SVMStruct took 1995, 2998 and 7198 seconds respectively. On other data sets SVMStruct ran out of RAM or failed to converge in a reasonable amount of time (even after tuning ϵ). This demonstrates that explicitly reducing the number of constraints from exponential to linear and implementing a specialised solver can lead to a dramatic reduction in training time.

As the next best thing, we benchmark our performance against the 1-vs-All method, even though it cannot incorporate prior label correlations. In the linear case, we compare our linear M3L implementation to 1-vs-All trained by running LibLinear (Fan et al. 2008) and LibSVM (Chang and Lin 2001) independently over each label. For the non-linear case, we compare our kernelised M3L implementation to 1-vs-All trained using LibSVM. In each case, we set \(\mathbf{R} = \mathbf{I}\) so that M3L reaches exactly the same solution as LibSVM and LibLinear. We also avoided repeated disk I/O by reading the data into RAM and using LibLinear and LibSVM's APIs.

Table 1 lists the variation in training time with the number of training examples on the Animals with Attributes (Lampert et al. 2009), Media Mill (Snoek et al. 2006), Siam (SIAM 2007) and RCV1 (Lewis et al. 2004) data sets. The training times of linear M3L (LM3L) and LibLinear are comparable, with LibLinear being slightly faster. The training times of kernelised M3L (KM3L) are significantly lower than those of LibSVM, with KM3L sometimes being as much as 30 times faster. This is primarily because KM3L can efficiently leverage the kernel cache across all labels, while LibSVM has to build the cache from scratch for each label. Furthermore, leaving aside caching issues, it would appear that by optimising over all variables jointly, M3L reaches the vicinity of the global optimum much more quickly than 1-vs-All. Figure 2 plots dual progress against the number of iterations for all four data sets with ten thousand training points. As can be seen, kernelised M3L gets to within the vicinity of the global optimum much faster than 1-vs-All implemented using LibSVM. Figure 3 shows similar plots with respect to time; there the difference is even more significant due to kernel caching effects. In conclusion, even though M3L generalises 1-vs-All, its training time can be comparable and sometimes even significantly lower.

Fig. 2
figure 2

Dual progress versus number of iterations for the kernelised M3L algorithm and 1-vs-All implemented using LibSVM, for an RBF kernel and ten thousand training points. M3L appears to reach the vicinity of the global optimum much more quickly than 1-vs-All. The results are independent of kernel caching effects

Fig. 3
figure 3

Dual progress versus normalised time for the kernelised M3L algorithm and 1-vs-All implemented using LibSVM for an RBF kernel and ten thousand training points. The difference between M3L and 1-vs-All is even starker than in Fig. 2 due to kernel caching effects

Table 1 Comparison of training times for the linear M3L (LM3L) and kernelised M3L (KM3L) optimisation algorithms with 1-vs-All techniques implemented using LibLinear and LibSVM. Each data set has N training points, D features and L labels. See text for details

Finally, to demonstrate that our code scales to large problems, we train linear M3L on RCV1 with 781,265 points, 47,236 dimensional sparse features and 103 labels. Table 2 charts dual progress and train and test error over time. As can be seen, the model is nearly fully trained in under six minutes and converges in eighteen minutes.

Table 2 Linear M3L training on RCV1 with 781,265 points, 47,236 dimensional sparse features and 103 labels

7.2 Incorporating prior knowledge for zero-shot learning

In this section, we investigate whether the proposed M3L formulation can improve label prediction accuracy in a zero-shot learning scenario. Zero-shot learning has two major components, as mentioned earlier. The first deals with generating an intermediate level representation, generally based on attributes, for each data point. The second concerns how to map test points in the intermediate representation to points representing novel categories. Our focus is on the former: the more accurate prediction of multiple, intermediate attributes (labels) when their correlation statistics on the training and test sets differ significantly.

7.2.1 Animals with attributes

The Animals with Attributes data set (Lampert et al. 2009) has forty training animal categories, such as Dalmatian, Skunk, Tiger, Giraffe, Dolphin, etc. and the following ten disjoint test animal categories: Humpback Whale, Leopard, Chimpanzee, Hippopotamus, Raccoon, Persian Cat, Rat, Seal, Pig and Giant Panda. All categories share a common set of 85 attributes such as has yellow, has spots, is hairless, is big, has flippers, has buckteeth, etc. The attributes are densely correlated and form a fully connected graph. Each image in the database contains a dominant animal and is labelled with its 85 attributes. There are 24,292 training images and 6,180 test images. Some example images are shown in Fig. 4. We use 252 dimensional PHOG features that are provided by the authors. M3L training times for this data set are reported in Table 1(a).

Fig. 4
figure 4

Sample training (top) and test (bottom) images from the Animals with Attributes data set

We start by visualising the influence of \(\mathbf{R}\). We randomly sample 200 points from the training set and discard all but two of the attributes, "has black" and "is weak". These two attributes were selected as they are very weakly correlated on our training set, with a correlation coefficient of 0.2, but have a strong negative correlation of −0.76 on the test animals (Leopards, Giant Pandas, Humpback Whales and Chimpanzees all have black but are not weak). Figure 5 plots the Hamming loss on the test set as we set \(\mathbf{R} = \left[\begin{smallmatrix}1 & r\\ r & 1\end{smallmatrix}\right]\), plug it into the M3L formulation, and vary r from −1 to +1. Learning independent classifiers for the two attributes (r=0) can lead to a Hamming loss of 25 % because of the mismatch between training and test sets. This can be made even worse by choosing, or learning via structured output prediction techniques, a prior that incorrectly forces the two labels to be positively correlated. However, if our priors are generally correct, then negatively correlating the classifiers lowers prediction error.

Fig. 5
figure 5

Test Hamming loss versus classifier correlation

We now evaluate performance quantitatively on the same training set but with all 85 labels. We stress that in the zero-shot learning scenario no training samples from any of the test categories are provided. As is commonly assumed (Farhadi et al. 2009, 2010; Lampert et al. 2009; Palatucci et al. 2009), we only have access to \(\mathbf{y}_{c}\), the set of attributes for a given test category c. Furthermore, we require, as additional information, the prior distribution over test categories p(c). For the M3L formulation we set \(\mathbf{R} = \sum_{c=1}^{10} p(c)\mathbf{y}_{c}\mathbf{y}_{c}^{t}\). Under this setup, learning independent classifiers using 1-vs-All yields a Hamming loss of 29.38 %. The Hamming loss for M3L, with this specific choice of \(\mathbf{R}\), is 26.35 %. This decrease in error is very significant given that 1-vs-All, trained on all 24,292 training points, only manages to reduce the error to 28.64 %. Thus M3L, with extra knowledge in the form of just the test category distribution, can dramatically reduce test error. The results also compare favourably to other independent methods such as BoostTexter (Schapire and Singer 2000) (30.28 %), power set multi-class classification (32.70 %), 5 nearest neighbours (31.79 %), regression (Hsu et al. 2009) (29.38 %) and ranking (Crammer and Singer 2003) (34.84 %).
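To make the construction of this prior concrete, here is a small sketch; the function name and the toy attribute signatures are purely illustrative, and only the formula \(\mathbf{R} = \sum_c p(c)\mathbf{y}_c\mathbf{y}_c^t\) comes from the text above.

```python
import numpy as np

def correlation_prior(Y_cats, p):
    """Build R = sum_c p(c) * y_c y_c^t from test category signatures.

    Y_cats: (C, L) array of {-1, +1} attribute signatures y_c;
    p: (C,) prior distribution over the C test categories.
    Returns the (L, L) prior correlation matrix R.
    """
    return (Y_cats * p[:, None]).T @ Y_cats

# Toy usage with three hypothetical categories and two attributes,
# e.g. "has black" and "is weak":
y = np.array([[ 1, -1],
              [ 1, -1],
              [-1,  1]])
R = correlation_prior(y, np.array([0.4, 0.4, 0.2]))  # (2, 2) prior
```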

7.2.2 Benchmark data sets

We also present results on the fMRI-Words zero-shot learning data set of Mitchell et al. (2008). The data set has 60 categories, of which we use 48 for training and 12 for testing. Each category is described by 25 real valued attributes which we convert to binary labels by thresholding against the median attribute value. Prior information about which attributes occur in which novel test categories is provided in the form of a knowledge base. The experimental protocol is kept identical to the one used for Animals with Attributes. \(\mathbf{R}\) is set to \(\sum_{c=1}^{12} p(c)\mathbf{y}_{c}\mathbf{y}_{c}^{t}\), where the sum ranges over the 12 test categories, \(\mathbf{y}_{c}\) comes from the knowledge base and p(c) is required as additional prior information. We use 400 points for training and 648 points for testing. The test Hamming loss for M3L and various independent methods is given in Table 3. The M3L results are much better than 1-vs-All, with the test Hamming loss being reduced by nearly 7 %. This is noteworthy since even if 1-vs-All were trained on the full training set of 2592 points, it would decrease the Hamming loss by just over 5 %, to 48.79 %.

Table 3 Test Hamming loss (%) on benchmark data sets

Table 3 also presents results on some other data sets. Unfortunately, most of them were not designed for zero-shot learning. Siam (SIAM 2007), Media Mill (Snoek et al. 2006), RCV1 (Lewis et al. 2004) and Yeast (Elisseeff and Weston 2001) are traditional multi-label data sets with matching training and test set statistics. The a-PASCAL+a-Yahoo (Farhadi et al. 2009) data set has different training and test categories but does not include prior information about which attributes are relevant to which test categories. Thus, for all these data sets, we sample the original training set to create a new training subset whose label correlations differ from those of the provided test set. The remainder of the original training points are used only to estimate the \(\mathbf{R}\) matrix. As Table 3 indicates, by incorporating prior knowledge M3L can do better than all the other methods, which assume independence.

8 Conclusions

We developed the M3L formulation for learning a max-margin multi-label classifier with prior knowledge about densely correlated labels. We showed that the number of constraints could be reduced from exponential to linear and, in the process, generalised 1-vs-All multi-label classification. We also developed efficient optimisation algorithms that were orders of magnitude faster than the standard cutting plane method. Our kernelised algorithm was significantly faster than even the 1-vs-All technique implemented using LibSVM and hence our code, available from Hariharan et al. (2010a), can also be used for efficient independent learning. Finally, we demonstrated on multiple data sets that incorporating prior knowledge using M3L could improve prediction accuracy over independent methods. In particular, in zero-shot learning scenarios, M3L trained on 200 points could outperform 1-vs-All trained on nearly 25,000 points on the Animals with Attributes data set and the M3L test Hamming loss on the fMRI-Words data set was nearly 7 % lower than that of 1-vs-All.