Active zero-shot learning: a novel approach to extreme multilabeled classification
Abstract
Big data bring a huge volume of data at great speed and in many formats, with extremely many labels and concepts to be modeled and predicted, such as in text and image tagging, online advertisement placement, recommendation systems and NLP. This emerging issue of big data is termed “extreme multilabeled classification” (XMLC) and is challenging due to the time, space and sample complexity of predictive model training and testing. We first define general XMLC and then categorize and review recent methods based on two specific forms of XMLC. We propose a novel method called active zero-shot learning to reduce the above complexities. Since the performance of unseen class prediction largely depends on the seen classes that have labeled data, we challenge the critical and yet often overlooked assumption that the labeled data can only be passively acquired. We propose a new learning paradigm that aims at accurate predictions of a large number of unseen labels using labeled data from only a small, intelligently selected set of seed classes, with the help of external knowledge. We further demonstrate that the proposed strategy has desirable probabilistic properties that facilitate unseen class prediction. Experiments on four datasets demonstrate that the proposed algorithm is superior to a wide spectrum of baselines. Based on our findings, we point out several critical and promising future directions in XMLC.
Keywords
Extreme multilabeled classification, Active learning, Zero-shot learning
1 Introduction
It is not uncommon to have tens of thousands of classes to predict in real-world classification problems. For example, Amazon products can be classified into some of the many categories; a sizable Twitter dataset readily offers millions of hashtags that can be associated with the tweets; in smart and connected healthcare applications like senior home activity monitoring, many potentially useful human activities (classes) can be captured in data streams from wearable sensors. The common task in these applications is to predict a (usually small) set of relevant labels out of the many labels.^{1} This classification problem is termed “extreme multilabeled classification” (XMLC) and has been studied in recent work [3, 19, 28, 38, 40, 41].
XMLC differs from traditional multilabeled or multiclass classification in that the number of labels can be on the scale of millions. This distinction makes XMLC more challenging. In the training phase, if one-vs-all multilabeled classification [30] is adopted, sufficient labeled data have to be collected for each label to train a reasonably good predictive model. However, since there are so many labels, collecting annotated data for all the labels can be extremely time-consuming. Also, in the prediction phase, the time complexity grows linearly with the number of labels and becomes a concern in large-scale online systems like advertisement placement or webpage tagging, where millions of labels have to be predicted in real time.
In response to these challenges, two major lines of research have been proposed based on two different assumptions. The first category of methods assumes that a certain amount of labeled data is available for each label. Such an assumption holds if the data can be labeled by a large number of users, as with user-edited Wikipedia articles, user-tagged tweets, and click-through data. These methods focus on issues such as prediction time complexity and tail labels, which make up most of the labels but have scarce true positives [1, 4, 5, 6, 7, 8, 19, 28, 33, 39]. Methods in the second category make the weaker assumption that only a small number of classes have labeled data (the “seen” classes), while the vast remaining classes have zero labeled data (the “unseen” classes) and need to be predicted, resulting in the so-called zero-shot learning problem [12, 14, 15, 22, 23, 24, 25, 26, 32, 35]. Both lines of research share the common assumption that the labels lie in a low-dimensional space, so that the original labels can be expressed using a much smaller set of signals, and model training and prediction in this compressed space are thus more efficient.
2 XMLC problem definition and methodologies
We first formally define the XMLC problem. Let \(\ell \) be the total number of labels to be predicted, and p be the number of features of the instances. The training data are given by \(\{(\mathbf {x}_1,\mathbf {y}_1), \dots , (\mathbf {x}_n, \mathbf {y}_n)\}\), where \(\mathbf {x}_i\in \mathcal{X}=\mathbb {R}^p\) and \(\mathbf {y}_i\in \mathcal{Y}=\{0,1\}^{\ell }\). Here, for any \(l\in \{1,\dots , \ell \}\), \(y_i^{l}=1\) indicates that the ith instance has the lth label, and otherwise that the label is irrelevant to the instance. XMLC aims at training classification models for efficient and accurate predictions of all \(\ell \) labels with large \(\ell \) on the test dataset \(\{(\mathbf {x}_{n+1},\mathbf {y}_{n+1}), \dots , (\mathbf {x}_{n+m}, \mathbf {y}_{n+m})\}\) or in general any sample from \(\mathcal{X}\times \mathcal{Y}\).
Based on the above definition, two more specific problem settings of XMLC have been studied. In one problem setting (Sect. 2.1), there are labeled data for each label, that is, \(\forall l\in \{1,\ldots , \ell \}, \exists i\in \{1,\ldots , n\}\), such that \(y_i^{l}\) is known. The problem can thus be tackled by traditional multilabeled classification models, such as binary relevance [7, 8] and one-vs-all [30], which train a classifier for each label. However, a sufficiently large amount of labeled data has to be collected for each label, and the time complexity of training and prediction is linear in the number of labels. Given a large \(\ell \), even a linear time complexity can be prohibitive. Another problem setting, called zero-shot learning (Sect. 2.2), assumes that labeled data for a small number of labels can be used to train models to predict all labels. Let the set of d labels with labeled data be denoted by \(\mathcal{S}\) (the seen classes), and the set of the remaining k labels be denoted by \(\mathcal{U}\) (the unseen classes). Without loss of generality, assume that the seen classes are indexed by \(\{1,\ldots , d\}\), and the unseen classes are indexed by \(\{d+1,\ldots , d+k\}\). The space of seen labels is \(\mathcal{Y}=\{0,1\}^{d}\), and the space of unseen labels is \(\mathcal{Z}=\{0,1\}^{k}\), with the two spaces being orthogonal. Zero-shot learning predicts the k unseen classes \(\mathbf {z}\in \mathcal{Z}=\{0,1\}^{k}\) using two mappings: \(f\,{:}\,\mathcal{X}\rightarrow \mathcal{Y}^{\prime }\) and \(g\,{:}\,\mathcal{Y}^{\prime }\rightarrow \mathcal{Z}\) such that the composed predictive model \(g\circ f\,{:}\,\mathcal{X}\rightarrow \mathcal{Z}\) has good prediction performance on the unseen classes. Here \(\mathcal{Y}^{\prime }\) can be the same as \(\mathcal{Y}\), but is a compressed version of \(\mathcal{Y}\) when \(\ell \) and k are large.
2.1 Problem setting 1: XMLC with labeled data for all labels
We review representative methods based on label embeddings and trees in this problem setting. Embedding-based methods employ various dimension reduction approaches. In [18], compressed sensing is adopted and a random matrix \(A\in \mathbb {R}^{m\times \ell }\) compresses the large label space \(\mathcal{Y}\) to a much lower-dimensional space \(\mathcal{Y}^\prime \). Linear regression models are learned to map from \(\mathcal{X}\) to \(\mathcal{Y}^\prime \), and the predictions are then decompressed to \(\mathcal{Y}\) using various algorithms such as orthogonal matching pursuit (OMP), CoSaMP and FoBa. In [36], the authors proposed principal label space transformation to improve the above random projection. The left-singular vectors of the label matrix \(\mathbf {Y}\in \{0,1\}^{\ell \times n}\) are used to compress the label vectors into m-dimensional vectors. The decoding is simpler than in compressed sensing since the singular vectors are orthonormal. The above two methods only exploit the label matrix \(\mathbf {Y}\) and do not consider the discriminative information available in \(\mathcal{X}\times \mathcal{Y}\). In [42], the authors proposed to adopt canonical correlation analysis (CCA) to learn to map both \(\mathcal{X}\) and \(\mathcal{Y}\) to the same lower-dimensional space, such that for each training example \((\mathbf {x}, \mathbf {y})\), the image of \(\mathbf {x}\) is highly correlated with that of \(\mathbf {y}\) under the mapping. Besides, the authors argued for systematic codes [9] that include both the original labels and their images under the CCA mapping. Although such a coding scheme is infeasible for scenarios with a large number of labels, the CCA-based encoding merits an independent direction in XMLC research [38]. The Bloom filter approach [8] is a coding scheme that first partitions all labels into P clusters such that any label vector in the training data can only be contained in at most one cluster. Then, labels in each cluster are coded using a K-sparse vector of length Q, such that \(P<\binom{Q}{K}\). Note that the Bloom filter approach uses only label information.
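The counting condition \(P<\binom{Q}{K}\) can be made concrete with a small sketch. It assumes the simplest possible reading, in which each of the P clusters receives its own distinct K-sparse binary code of length Q; the function names `min_code_length` and `assign_sparse_codes` are hypothetical, and the sketch is not the robust construction of [8].

```python
from itertools import combinations, islice
from math import comb


def min_code_length(num_clusters: int, k: int) -> int:
    """Smallest Q with num_clusters < C(Q, k), i.e. enough distinct k-sparse codes."""
    q = k
    while comb(q, k) <= num_clusters:
        q += 1
    return q


def assign_sparse_codes(num_clusters: int, q: int, k: int) -> list:
    """Give each cluster a distinct binary code of length q with exactly k ones."""
    if comb(q, k) <= num_clusters:
        raise ValueError("need num_clusters < C(q, k)")
    codes = []
    # Enumerate the supports (positions of the ones) in lexicographic order.
    for support in islice(combinations(range(q), k), num_clusters):
        code = [0] * q
        for pos in support:
            code[pos] = 1
        codes.append(code)
    return codes


P, K = 1000, 3                      # toy values, chosen for illustration only
Q = min_code_length(P, K)           # code length needed for P clusters
codes = assign_sparse_codes(P, Q, K)
print(Q, len(codes))
```

With these toy values, 1000 clusters are represented by vectors of length 20, which illustrates why such codes compress the label space.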
Tree-based methods partition the label space into disjoint smaller regions. The complexities of encoding and decoding can be reduced from \(O(\ell )\) to the scale of \(O(\log (\ell ))\) given a balanced tree. Based on different splitting criteria, there are various encoding (tree construction) schemas. In [28], the authors use an ensemble of trees for encoding and decoding. Specifically, a tree is built by splitting the training instances in a top-down manner. The split of a node is determined by minimizing the ranking losses of the rankings in the two resulting child nodes, with splitting uncertainty taken into account. In [39], the authors proposed to partition the input space \(\mathcal{X}\) into subregions, each of which is assigned a small number of relevant labels. At test time, an instance is first mapped to a single region, and the predictive models for the labels in that region give the relevance scores of the labels. In this way, the cost of predicting all labels can be avoided. In [1], the authors proposed a multilabeled random forest approach similar to that in [39] to handle millions of labels. Besides these embedding-based and tree-based approaches, there is a special case where the two kinds of approaches meet. In [3], the authors further reduce prediction time complexity in tree-based approaches by traversing the learned tree in an embedded space, which has much lower dimensionality (\({<}\log (\ell )\)) than the original d.
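To give a feel for the gap between \(O(\ell )\) and \(O(\log (\ell ))\), the following sketch counts classifier evaluations per test instance. It assumes a perfectly balanced binary label tree with one label per leaf and one node classifier evaluated per level; the function name `classifier_evaluations` is hypothetical.

```python
from math import ceil, log2


def classifier_evaluations(num_labels: int) -> tuple:
    """Return (one-vs-all evaluations, balanced-binary-tree evaluations)."""
    one_vs_all = num_labels            # one classifier per label
    tree = ceil(log2(num_labels))      # one decision per level of the tree
    return one_vs_all, tree


flat_cost, tree_cost = classifier_evaluations(1_000_000)
print(flat_cost, tree_cost)
```

Under these assumptions, a million-label problem needs about twenty node decisions per instance rather than a million classifier evaluations.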
2.2 Problem setting 2: zero-shot learning
The above embedding-based and tree-based methods assume that labeled data can be collected for each label for model training. This assumption becomes prohibitive when there are many, possibly millions of, labels. For example, in the healthcare application where one wants to identify many potentially useful human activities (classes) in videos, it is markedly laborious to tag the videos for all possible activities. Another example is online advertisement bidding [1], where it is impossible for human annotators to go through millions of labels and training examples (webpages) to identify positive and negative instances for each label. Instead, the authors of [1] resorted to noisy and biased labels automatically inferred from the logs of a search engine. Indeed, their experiments showed that by preprocessing the harvested labels, the performance can be further improved.
Zero-shot learning approaches make an assumption at the other extreme: no labeled data can be collected for the majority of the labels (the “unseen” labels), while a certain amount of labeled data is available for a small number of labels (the “seen” labels). Additionally, external knowledge bases describing class relations, such as a large corpus (e.g., the Internet), domain knowledge (e.g., known attributes of the unseen classes) or an ontology (e.g., WordNet), are assumed to be available to turn the predicted seen labels into predictions of the unseen classes [22, 24, 26]. For example, one can first collect video segments for a small subset of primitive human activities, such as simple movements of body parts, to learn a video-to-movement mapping. Then, more complex composite activities can be identified using the predicted primitive movements via domain knowledge (a composite activity consists of multiple primitive movements).
Zero-shot learning approaches can similarly be categorized into two groups based on the knowledge bases (an embedding space or a tree of labels) adopted. For embedding-based approaches, direct attribute prediction (DAP) and indirect attribute prediction (IAP) [22] are the two fundamental paradigms. More sophisticated zero-shot models have also been proposed, such as max-margin semi-supervised learning for exploiting the unlabeled data [23], and multi-view zero-shot learning for utilizing multiple data sources [15]. Multiple knowledge bases such as Wikipedia [14, 26], web search logs [25] and human-annotated images [22] have been compared. The authors in [14, 26, 35] propose to learn the intermediate attributes using deep learning. For tree-based approaches, the authors in [32] proposed three similarity metrics on trees to predict unseen classes from seen class predictions. WordNet is such a tree of labels, with each word being a class, and the classes are connected via hyponym–hypernym relationships. The prediction of an unseen class on a leaf node can be obtained by averaging the predictions of the hypernyms of that unseen class, or by cost-sensitive averaging of the predictions from all leaf nodes of seen classes [10].
3 Further reduce the labeling cost via active zero-shot learning
Although the zero-shot learning literature has addressed some of the crucial issues, it assumes that the zero-shot models can only passively learn from labeled data collected for a predefined subset of seen labels [23, 26]. That is, labeled data are available for the given seen classes but not for the unseen ones, and a zero-shot learning algorithm has to predict unseen classes using the given labeled data and dependencies among labels. Due to the complex dependencies between seen classes and unseen classes, different seen classes provide varied predictive information for the unseen classes. When a good selection of seen classes is not provided, or does not provide sufficient information (e.g., too few seen classes), we need to decide for which classes labeled data should be collected to predict unseen classes well. In other words, the split of all classes into seen and unseen sets is a parameter to optimize in zero-shot learning, yet none of the previous zero-shot learning methods has addressed this problem.
We contribute to this class splitting problem and propose to actively and intelligently select a parsimonious set of core classes for which to collect labeled data, and to keep the large number of remaining classes unseen to save labeling efforts. Traditional multilabeled active learning algorithms are less relevant here, as they assume that for each and every label, certain labeled data have to be queried [29, 37]. We propose to select as seen labels those that can provide the most information regarding the unseen ones, and we characterize such informativeness of a candidate class via the entropy of inter-class similarities. We empirically show that the inter-class similarity follows a beta distribution, based on which we reveal the relationship between the entropy and the probability that an unseen class is sufficiently connected to the seen ones, thus justifying the proposed class selection criterion.
3.1 Problem formulation
Since we focus on the effects brought by a class split, we fix the following components of zero-shot learning. We adopt the DAP [22] paradigm and want to select d classes as seen classes to form the compressed space \(\mathcal{Y}^{\prime }\), such that d is small to minimize labeling efforts, and the prediction for the unseen k classes is optimized. Logistic regression is adopted to learn the mapping f from \(\mathcal{X} \) to \(\mathcal{Y}^{\prime }\). A class similarity matrix K is derived from a related corpus as the knowledge base (see Sect. 5). One can view K as the adjacency matrix of the graph \(\mathcal{G}=(\mathcal{V},\mathcal{E})\), where \(\mathcal{V}\) is the set of all classes and the edge weights are the class similarities. Given two index sets \(\mathcal{I}\) and \(\mathcal{J}\), let \(K^\mathcal{IJ}\) be the submatrix of K that consists of the rows indexed by \(\mathcal{I}\) and columns indexed by \(\mathcal{J}\). Then \(K^\mathcal{UU}\) is the similarity matrix for the unseen classes, and \(K^\mathcal{US}_{ij}\) is the similarity between the ith unseen class and the jth seen class. With \(\hat{\mathbf y}\in \mathbb {R}^{d}\) being the predicted seen classes for \(\mathbf {x}\), the mapping from \(\mathcal{Y}^{\prime }\) to \(\mathcal{Z}\) is \(g\,{:}\,\hat{\mathbf {y}}\mapsto K^\mathcal{US}\hat{\mathbf y}\).
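The mapping g can be sketched in a few lines. This is a minimal illustration on a made-up 4-class similarity matrix, assuming 0-based class indices; the helper names `submatrix` and `predict_unseen` are hypothetical, and the seen-class scores `y_hat` stand in for the output of the logistic regression models f.

```python
def submatrix(K, rows, cols):
    """Rows indexed by `rows` and columns indexed by `cols` of K (the matrix K^{IJ})."""
    return [[K[i][j] for j in cols] for i in rows]


def predict_unseen(K, unseen, seen, y_hat):
    """Score each unseen class as the corresponding row of K^{US} times y_hat."""
    K_us = submatrix(K, unseen, seen)
    return [sum(k_ij * y_j for k_ij, y_j in zip(row, y_hat)) for row in K_us]


# Toy symmetric similarity matrix over 4 classes (illustrative values only).
K = [
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.7],
    [0.8, 0.3, 1.0, 0.4],
    [0.1, 0.7, 0.4, 1.0],
]
seen, unseen = [0, 1], [2, 3]
y_hat = [0.9, 0.1]          # predicted scores of the two seen classes for one instance
scores = predict_unseen(K, unseen, seen, y_hat)
print(scores)               # one score per unseen class, in the order of `unseen`
```

In this toy case the first unseen class receives the higher score because it is most similar to the seen class that f predicts confidently.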
3.2 Methodology
We propose to iteratively add from the pool of unseen classes more labels that are informative about the remaining unseen classes. The connectivities between the classes can be indicators of information about one class carried by others. Specifically, the connectivity between the ith unseen class and the other unseen classes can be measured by various centrality metrics of the corresponding ith node on the subgraph of \(\mathcal{G}\) consisting of all unseen classes. For example, the degree centrality of the ith unseen class can be calculated as \(\sum _{j=1}^{k}K^\mathcal{UU}_{ij}\), where k is the current number of unseen classes. We call this strategy “maxdeguu” as it selects the unseen class with the maximal degree. This selection strategy does not consider the distribution of the class similarities between class i and others: class i can be strongly connected to only a few unseen classes with high weights, but barely so to the remaining majority classes. Such a class can still have a high degree, but does not add much information about the remaining unseen classes.
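The degree-based selection and its weakness can be illustrated as follows. The sketch implements maxdeguu as described above on a made-up similarity matrix; the function names are hypothetical. The `similarity_entropy` helper is an assumption introduced only to quantify how evenly a class's similarities are spread (Shannon entropy of the normalized similarities); it is not claimed to be the exact criterion used by the proposed method.

```python
from math import log


def select_max_deg_uu(K, unseen, num_to_select):
    """Iteratively move the unseen class with maximal degree on the unseen subgraph."""
    unseen = list(unseen)
    selected = []
    for _ in range(num_to_select):
        # Degree centrality: row sum of K^{UU} over the current unseen classes.
        best = max(unseen, key=lambda i: sum(K[i][j] for j in unseen))
        selected.append(best)
        unseen.remove(best)
    return selected, unseen


def similarity_entropy(K, unseen, i):
    """Illustrative spread measure: entropy of class i's normalized similarities."""
    weights = [K[i][j] for j in unseen if j != i]
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * log(p) for p in probs)


# Toy matrix: class 0 is tied strongly to class 1 only; classes 2-4 are
# moderately similar to each other (illustrative values only).
K = [
    [1.00, 0.90, 0.05, 0.05, 0.05],
    [0.90, 1.00, 0.00, 0.00, 0.00],
    [0.05, 0.00, 1.00, 0.30, 0.30],
    [0.05, 0.00, 0.30, 1.00, 0.30],
    [0.05, 0.00, 0.30, 0.30, 1.00],
]
selected, remaining = select_max_deg_uu(K, range(5), 1)
print(selected, remaining)
print(similarity_entropy(K, range(5), 0), similarity_entropy(K, range(5), 2))
```

Here class 0 has the highest degree and is picked first, although almost all of its similarity mass goes to a single neighbor; class 2 has a lower degree but a more even spread, which is the situation the paragraph above warns about.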
4 Theoretical justification
The seen–unseen class split has effects on the resulting mappings f and g. We study the effects on f in the experimental sections, and here we compare the effects that the proposed strategy and maxdeguu have on the linear mapping \(g\,{:}\,\mathbf {y}\mapsto K^\mathcal{US}\mathbf {y}\). The prediction of the ith unseen class is given by \(K^\mathcal{US}_{i:}\mathbf {y}\). We view the entries of \(K^\mathcal{US}\) as random variables and analyze the patterns in which the significant values in \(K^\mathcal{US}\) distribute. Suppose the ith unseen class is only associated with the seen ones through insignificant coefficients, then the unseen class predictions through the linear model \(K^\mathcal{US}_{i:}\) are less confident. Also, if the unseen class is only related to a few seen classes, even if the connections are strong, the resulting prediction can be misled due to the limited number of seen classes, whereas the unseen class can actually be related to more classes that are not selected into \(\mathcal{S}\). We would like to select seen classes such that they can sufficiently convey information for most of the unseen classes.
Definition 4.1
(\(\delta \)-covered unseen class) An unseen class, say the ith one, is \(\delta \)-covered by the selected seen classes if at least one entry in the row \(K^\mathcal{US}_{i:}\) has magnitude at least \(\delta \).
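Definition 4.1 translates directly into a check on the rows of \(K^\mathcal{US}\). The following minimal sketch uses a made-up matrix and hypothetical function names.

```python
def is_delta_covered(row, delta):
    """True if at least one entry of the row has magnitude at least delta."""
    return any(abs(v) >= delta for v in row)


def coverage_fraction(K_us, delta):
    """Fraction of unseen classes (rows of K^{US}) that are delta-covered."""
    return sum(is_delta_covered(row, delta) for row in K_us) / len(K_us)


# Rows: unseen classes; columns: seen classes (illustrative values only).
K_us = [
    [0.80, 0.30],
    [0.10, 0.20],
    [0.05, 0.60],
]
print(coverage_fraction(K_us, delta=0.5))
```

In this toy example the second unseen class has no entry of magnitude 0.5 or more, so two of the three unseen classes are covered.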
Regarding the distribution of \(\bar{K}^\mathcal{US}_{ij}\), we find that these coefficients can be fitted quite well by the beta distribution \(\textsc {Beta}(\alpha , \beta )\), where \(\alpha \) and \(\beta \) are shape parameters; see Fig. 1a for an example on the unix dataset. How does the entropy guide us to a more desirable beta distribution of the coefficients? We collect the empirical entropies defined by Eq. (2) for each seen class and estimate the shape parameters of the distribution defined by Eq. (1). We find that the entropies are correlated with the fitted shape parameter \(\alpha \): entropy grows as \(\alpha \) goes up, see Fig. 1b. Note that the entropy is less correlated with \(\beta \) (Fig. 1c). Similar observations are obtained on the other datasets. Therefore, we fix \(\beta \) and plot two beta distributions with \(\alpha =0.1\) and \(\alpha =50\) in Fig. 1d. We can see from the figure that, with a larger \(\alpha \) (the red solid line), the beta distribution has more mass at the upper end, and thus, more samples \(\bar{K}^\mathcal{US}_{ij}\) from that distribution will be significant values. As a result, the lower bound of the chance that an unseen class is not \(\delta \)-covered is small. On the other hand, with a smaller entropy and \(\alpha \), there can be a significant mass at the lower end (the blue dotted line).
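The effect of \(\alpha \) on the upper-tail mass can be checked numerically. The sketch below is a Monte Carlo estimate under assumptions that are not taken from the text: \(\beta \) is fixed at 5, the threshold is \(\delta =0.5\), and the entries of a row are treated as independent draws when computing the chance that a row is not \(\delta \)-covered. The function names are hypothetical.

```python
import random


def tail_mass(alpha, beta, delta, num_samples=20000, seed=0):
    """Monte Carlo estimate of P(X >= delta) for X ~ Beta(alpha, beta)."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha, beta) >= delta for _ in range(num_samples))
    return hits / num_samples


def prob_not_covered(p_tail, num_seen):
    """P(no entry in a row reaches delta), assuming independent entries."""
    return (1.0 - p_tail) ** num_seen


BETA, DELTA = 5.0, 0.5                      # assumed values, for illustration only
p_small = tail_mass(0.1, BETA, DELTA)       # small alpha: mass near zero
p_large = tail_mass(50.0, BETA, DELTA)      # large alpha: mass near one
print(p_small, prob_not_covered(p_small, num_seen=10))
print(p_large, prob_not_covered(p_large, num_seen=10))
```

Under these assumptions, a large \(\alpha \) puts nearly all samples above the threshold, so a row is almost never left uncovered, whereas a small \(\alpha \) leaves most rows uncovered even with ten seen classes.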
5 Experiments
5.1 Experimental settings
Datasets
Dataset       Askubuntu   Dba      Superuser   Unix
# Training    55,684      12,070   93,106      23,069
# Test        55,883      12,211   93,182      23,025
# Tags        1,003       345      1,895       775

maxdeguu: as mentioned in Sect. 3.2, this method labels data for the classes that have the highest degree centrality.

mindegus: We take the row sums of the matrix \(K^\mathcal{US}\), which captures the total similarity between the current unseen tags and seen tags. The unseen class with smallest row sum is picked. The rationale is that unseen tags that are farthest away from the current seen classes can provide complementary information.

uncertainty: This method queries training data for the top unseen classes whose predictions on the training data have the highest entropies, according to the current zero-shot prediction model. This baseline runs incrementally, as do maxentuu and mindegus.

matrix: in [16], the author proposed a matrix partition algorithm to split a set of instances into two, such that the mutual information between the distributions of the two sets is maximized. This method is considered a representativeness-based active learning method. We adapt this model and treat classes as instances. This algorithm runs in batch mode, and we only report its performance when 100 additional classes are selected.
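Two of the incremental baselines above can be sketched compactly. The mindegus rule follows the row-sum description directly. For uncertainty, the sketch assumes that the zero-shot scores have already been mapped to probabilities in [0, 1] and that the per-class uncertainty is the mean binary entropy over training instances; both points are assumptions, and all function names are hypothetical.

```python
from math import log


def select_min_deg_us(K, unseen, seen):
    """mindegus: pick the unseen class with the smallest row sum in K^{US}."""
    return min(unseen, key=lambda i: sum(K[i][j] for j in seen))


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) prediction; zero at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log(p) + (1.0 - p) * log(1.0 - p))


def select_most_uncertain(pred_probs):
    """pred_probs maps each unseen class to its predicted probabilities on training data."""
    def mean_entropy(c):
        return sum(binary_entropy(p) for p in pred_probs[c]) / len(pred_probs[c])
    return max(pred_probs, key=mean_entropy)


# Toy similarity matrix over 4 classes (illustrative values only).
K = [
    [1.0, 0.2, 0.8, 0.1],
    [0.2, 1.0, 0.3, 0.7],
    [0.8, 0.3, 1.0, 0.4],
    [0.1, 0.7, 0.4, 1.0],
]
farthest = select_min_deg_us(K, unseen=[1, 2, 3], seen=[0])
uncertain = select_most_uncertain({"a": [0.5, 0.5], "b": [0.9, 0.99]})
print(farthest, uncertain)
```

In the toy data, class 3 is the least similar to the single seen class, and class "a" has predictions closest to 0.5, so these are the classes the two rules would query next.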
5.2 Results
In Fig. 2, along with the performance of the batch-mode method matrix, we show how the zero-shot prediction performances of three iterative algorithms evolve as more labeled data are added. Each row in Fig. 2 consists of two subfigures showing the performance in Precision@5 and NDCG@5, respectively. In each subfigure, the performance of maxentuu (shown in green solid lines) is compared with those of the four baselines. From the figures, we can see that across all datasets and all metrics, maxentuu consistently outperforms all the baselines. In some cases, maxentuu ends up with performance twice that of the runner-up (see Fig. 2b, g). Interestingly, mindegus, uncertainty, and matrix consistently have medium performance, between maxentuu and maxdeguu, on all datasets and both metrics when the iterations finish.
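For reference, the two metrics can be computed per instance as sketched below. The sketch assumes binary relevance and the common XMLC variant of NDCG@k in which the ideal DCG counts min(k, number of relevant labels) hits; whether this exact variant was used in the experiments is an assumption. Function names are hypothetical.

```python
from math import log2


def top_k(scores, k):
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def precision_at_k(scores, relevant, k=5):
    """Fraction of the top-k predicted labels that are relevant."""
    return sum(1 for j in top_k(scores, k) if j in relevant) / k


def ndcg_at_k(scores, relevant, k=5):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / log2(rank + 2)
              for rank, j in enumerate(top_k(scores, k)) if j in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# One test instance with 8 candidate labels (illustrative values only).
scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.05]
relevant = {0, 4, 5}
p5 = precision_at_k(scores, relevant)
n5 = ndcg_at_k(scores, relevant)
print(p5, n5)
```

Dataset-level figures would then be averages of these per-instance values over the test set.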
5.3 Empirical analysis of maxentuu
In the active zero-shot setting, before one queries the labels of the data for a class, it is difficult to gauge the prior and posterior probability distributions of the class. The only information available is the class similarity from external knowledge bases. Below we empirically show, in two aspects, that even with such a lack of information, maxentuu is able to pick the informative unseen classes for label queries.
In Fig. 3a, we plot the CDFs of the frequencies with which the selected classes appear in the training instances (namely, document frequencies) on the superuser dataset (best viewed in color). We can see that among the five strategies, maxdeguu tends to select classes that have higher document frequencies than those selected by maxentuu, as the CDF of maxdeguu is shifted further to the right. It has been shown in text classification that the more frequently a word appears in the corpus, the less informative it is, as evidenced by the commonly used TF-IDF transformation [31]. A frequent seen class is likely to be predicted more often by predictive models that take the class prior distribution into account. Such a seen class becomes a less discriminative feature when used in the mapping g. This partly explains why the baseline maxdeguu has the worst performance in all cases in Fig. 2.
6 Future directions

Previous zero-shot learning algorithms utilize knowledge bases such as WordNet, web search engines, hierarchies and embedded attributes mainly through the semantic similarity information of the labels. One promising future direction is to exploit logical knowledge in the knowledge bases for better unseen label prediction. For example, WordNet contains part–whole relations between words (e.g., “tire” is part of a “car”), and we can use such relations to find the parts given the label of the whole. Contradiction relations can also prevent the model from including labels that contradict the reliably predicted labels.

A principled way to choose the right knowledge base for a specific prediction task is lacking. Although previous work like [32] did empirically compare the effectiveness of different knowledge bases on their tasks, there is no formal metric defined for knowledge base selection. One important question to answer in future XMLC research is how to define such measurements based on generalization error reduction, knowledge base coverage and knowledge base consistency.

Human-in-the-loop machine learning, such as crowdsourcing and active learning, has proven critical in improving machine learning models, and XMLC is such a case. For example, when the prediction of an XMLC model is uncertain, human experts or the crowd can help resolve the uncertainty. Another way is to continuously incorporate new knowledge from human beings into the knowledge bases for better future predictions.
7 Conclusions
We review recent XMLC literature and categorize the published methods based on the availability of labeled data and label space compression methods. We then study active learning in the zero-shot prediction setting for the purpose of finding a small number of informative seen classes to facilitate unseen class predictions. We propose an entropy-based selection method, which is demonstrated to be able to capture the desirable distribution and strength of seen–unseen similarities. We model the similarity between classes using a beta distribution to justify the proposed entropy-based selection method. Experiments show that the proposed method outperforms both representativeness-based and uncertainty-based active learning methods.
Footnotes
1. “Label” and “class” are used interchangeably in this article.
Notes
Acknowledgements
This work was supported in part by NSF Award III1526499, and NVIDIA Corporation with the donation of the Titan X GPU.
References
1. Agrawal, R. et al.: Multilabel learning with millions of labels: recommending advertiser bid phrases for web pages. In: WWW. Rio de Janeiro, Brazil (2013)
2. Balasubramanian, K., Lebanon, G.: The landmark selection method for multiple output prediction (2012)
3. Bengio, S., Weston, J., Grangier, D.: Label embedding trees for large multiclass tasks. In: NIPS (2010)
4. Bi, W., Kwok, J.T.: Multilabel classification on tree- and DAG-structured hierarchies. In: ICML. New York, NY (2011)
5. Bi, W., Kwok, J.: Efficient multilabel classification with many labels. In: ICML (2013)
6. Cesa-Bianchi, N., Gentile, C., Zaniboni, L.: Incremental algorithms for hierarchical classification. J. Mach. Learn. Res. 7, 31–54 (2006)
7. Chen, Y.N., Lin, H.T.: Feature-aware label space dimension reduction for multilabel classification. In: NIPS (2012)
8. Cisse, M.M. et al.: Robust bloom filters for large multilabel classification tasks. In: NIPS (2013)
9. Cover, T.M., Thomas, J.A.: Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing) (2006)
10. Deng, J. et al.: What does classifying more than 10,000 image categories tell us? In: ECCV (2010)
11. Deng, J. et al.: Fast and balanced: efficient label tree learning for large scale object recognition. In: NIPS (2011)
12. Dietterich, T.G., Bakiri, G.: Solving multiclass learning problems via error-correcting output codes. J. Artif. Int. Res. 2, 263–286 (1995)
13. Ferng, C.S., Lin, H.T.: Multilabel classification with error-correcting codes. J. Mach. Learn. Res. 20, 281–295 (2011)
14. Frome, A. et al.: DeViSE: a deep visual-semantic embedding model. In: NIPS (2013)
15. Fu, Y. et al.: Transductive multi-view embedding for zero-shot recognition and annotation. In: ECCV (2014)
16. Guo, Y.: Active instance sampling via matrix partition. In: NIPS (2010)
17. Gao, T., Koller, D.: Discriminative learning of relaxed hierarchy for large-scale visual recognition. In: ICCV (2011)
18. Hsu, D.J. et al.: Multilabel prediction via compressed sensing. In: NIPS (2009)
19. Huang, K.H., Lin, H.T.: Cost-sensitive label embedding for multilabel classification (2016)
20. Ji, S. et al.: Extracting shared subspace for multilabel classification. In: KDD, Las Vegas, NV (2008)
21. Kapoor, A., Viswanathan, R., Jain, P.: Multilabel classification using Bayesian compressed sensing. In: NIPS (2012)
22. Lampert, C.H., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: CVPR (2009)
23. Li, X., Guo, Y.: Max-margin zero-shot learning for multiclass classification. In: AISTATS (2015)
24. Li, X. et al.: Zero-shot image tagging by hierarchical semantic embedding. In: SIGIR (2015)
25. Mensink, T., Gavves, E., Snoek, C.G.M.: COSTA: co-occurrence statistics for zero-shot classification. In: CVPR (2014)
26. Norouzi, M. et al.: Zero-shot learning by convex combination of semantic embeddings. In: CoRR (2013)
27. Palatucci, M. et al.: Zero-shot learning with semantic output codes. In: NIPS (2009)
28. Prabhu, Y., Varma, M.: FastXML: a fast, accurate and stable tree-classifier for extreme multilabel learning. In: KDD (2014)
29. Qi, G.J. et al.: Two-dimensional active learning for image classification. In: CVPR (2008)
30. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004)
31. Robertson, S.: Understanding inverse document frequency: on theoretical arguments for IDF. J. Doc. 60 (2004)
32. Rohrbach, M., Stark, M., Schiele, B.: Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In: CVPR (2011)
33. Rousu, J. et al.: Kernel-based learning of hierarchical multilabel classification models. J. Mach. Learn. Res. 7, 1601–1626 (2006). ISSN: 1532-4435
34. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: ICML (2007)
35. Socher, R. et al.: Zero-shot learning through cross-modal transfer. In: NIPS (2013)
36. Tai, F., Lin, H.T.: Multilabel classification with principal label space transformation. Neural Comput. 24(9), 2508–2542 (2012)
37. Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2, 45–66 (2002)
38. Weston, J., Bengio, S., Usunier, N.: WSABIE: scaling up to large vocabulary image annotation. In: IJCAI (2011)
39. Weston, J., Makadia, A., Yee, H.: Label partitioning for sublinear ranking. In: ICML (2013)
40. Xu, C., Tao, D., Xu, C.: Robust extreme multilabel learning. In: KDD, San Francisco, CA (2016)
41. Yu, H.F. et al.: Large-scale multilabel learning with missing labels. In: ICML (2014)
42. Zhang, Y., Schneider, J.G.: Multilabel output codes using canonical correlation analysis. In: ICML (2011)