Data scarcity, robustness and extreme multilabel classification
Abstract
The goal in extreme multilabel classification (XMC) is to learn a classifier which can assign a small subset of relevant labels to an instance from an extremely large set of target labels. The distribution of training instances among labels in XMC exhibits a long tail, implying that a large fraction of labels have a very small number of positive training instances. Detecting tail-labels, which represent diversity of the label space and account for a large fraction (up to 80%) of all the labels, has been a significant research challenge in XMC. In this work, we pose the tail-label detection task in XMC as robust learning in the presence of worst-case perturbations. This viewpoint is motivated by a key observation that there is a significant change in the distribution of the feature composition of instances of these labels from the training set to the test set. For shallow classifiers, our robustness perspective to XMC naturally motivates the well-known \(\ell _1\)-regularized classification. Contrary to the popular belief that Hamming loss is unsuitable for tail-label detection in XMC, we show that minimizing (a convex upper bound on) Hamming loss with appropriate regularization surpasses many state-of-the-art methods. Furthermore, we also highlight the sub-optimality of the coordinate descent based solver in the LibLinear package, which, given its ubiquity, is interesting in its own right. We also investigate the spectral properties of label graphs to provide novel insights into the conditions governing the performance of the Hamming loss based one-vs-rest scheme vis-à-vis label embedding methods.
Keywords
Extreme multilabel classification, Large-scale classification, Robustness, Linear classification
1 Introduction
Extreme multilabel classification (XMC) refers to supervised learning with a large target label set, where each training/test instance is labeled with a small subset of relevant labels chosen from the large set of target labels. Machine learning problems consisting of hundreds of thousands of labels are common in various domains such as annotating web-scale encyclopedias (Prabhu and Varma 2014), hashtag suggestion in social media (Denton et al. 2015), and image classification (Deng et al. 2010). For instance, all Wikipedia pages are tagged with a small set of relevant labels which are chosen from more than a million possible tags in the collection. It has been demonstrated that, in addition to automatic labelling, the framework of XMC can be leveraged to effectively address learning problems arising in recommendation systems, ranking and web-advertizing (Agrawal et al. 2013; Prabhu and Varma 2014). In the context of recommendation systems, for example, by learning from similar users’ buying patterns in e-stores like Amazon and eBay, this framework can be used to recommend a small subset of relevant items from a large collection in the e-store. With applications in a diverse range, designing effective algorithms to solve XMC has become a key challenge for researchers in industry and academia alike.
In addition to a large number of target labels, typical datasets in XMC have a similar scale for the number of instances in the training data and for the dimensionality of the input feature space. For text datasets, each training instance is a sparse representation of a few hundred non-zero features from an input space whose dimensionality is of the order of hundreds of thousands. As an example, the benchmark WikiLSHTC325K dataset from the extreme classification repository (Bhatia et al. 2016) consists of 1.7 million training instances which are distributed among 325,000 labels, and each training instance sparsely spans a feature space of 1.6 million dimensions. The challenge posed by the sheer scale of the number of labels, training instances and features makes the setup of XMC quite different from that tackled in the classical literature on multilabel classification (Tsoumakas et al. 2009), and hence renders the direct, off-the-shelf application of some classical methods, such as Random Forests, Decision Trees and SVMs, inapplicable.
1.1 Tail labels
Tail labels exhibit diversity of the label space, and contain informative content not captured by the head or torso labels. Indeed, by predicting the head labels well, an algorithm can achieve high accuracy and yet omit most of the tail labels. Such behavior is not desirable in many real-world applications. For instance, in movie recommendation systems, the head labels correspond to popular blockbusters, which the user has most likely already watched. In contrast, the tail corresponds to less popular yet equally favored films, like independent movies (Shani and Gunawardana 2013). These are the movies that the recommendation system should ideally focus on. A similar discussion applies to search engine development (Radlinski et al. 2009), hashtag recommendation in social networks (Denton et al. 2015), and hierarchical classification (Babbar et al. 2014).
From a statistical perspective, it has been conjectured in recent works that Hamming loss is unsuitable for the detection of tail-labels in XMC (Jain et al. 2016; Bhatia et al. 2015; Prabhu and Varma 2014). In particular, the work in Jain et al. (2016) proposes new propensity-scored loss functions (discussed in Sect. 3) which are sensitive towards the tail-labels by weighing them higher than the head/torso labels. In this work, we refute the above conjecture by motivating XMC from a robustness perspective.
1.2 Our contributions

Statistically, we model XMC as learning in the presence of worst-case perturbations, and demonstrate the efficacy of Hamming loss for tail-label prediction in XMC. This novel perspective stems from the observation that, for tail labels, there is a significant variation in the feature composition of instances in the test set as compared to the training set. We thus frame the learning problem as a robust optimization objective which accounts for this feature variation by considering perturbations \(\tilde{\mathbf{x }}_i\) for each input training instance \(\mathbf x _i\).

Algorithmically, by exploiting the label-wise independence of Hamming loss, our algorithm is amenable to distributed training across labels. As a result, our forward–backward proximal gradient algorithm can scale up to hundreds of thousands of labels for benchmark datasets. Our investigation also shows that the corresponding solver in the LibLinear package (“-s 5” option) yields sub-optimal solutions because of severe under-fitting. Due to its widespread usage in machine learning packages such as scikit-learn, this finding is significant in its own right.

Empirically, our robust optimization formulation of XMC naturally motivates the well-known \(\ell _1\)-regularized SVM, which is shown to surpass around one dozen state-of-the-art methods on benchmark datasets. For a Wikipedia dataset with 325,000 labels, we show 20% relative improvement over PFastreXML and Parabel (leading tree-based approaches), and 60% over SLEEC (a leading label-embedding method).

Analytically, by drawing connections to spectral properties of label graphs, we also present novel insights to explain the conditions under which Hamming loss might be suited for XMC vis-à-vis label embedding methods. We show that the algebraic connectivity of the label graph can be used to explain the variation in the relative performance of various methods as one moves from small datasets consisting of a few hundred labels to the extreme regime consisting of hundreds of thousands of labels.
2 Robust optimization for tail labels
In the extreme classification scenario, the tail labels in the fat-tailed distribution have very few training instances that belong to them. Also, each training instance typically represents a document of a few hundred words from a total vocabulary of hundreds of thousands or even millions of words. The training instance is, therefore, a sparse representation with only 0.1% or even fewer non-zero features. Due to this sparsity, the non-zero features/words in one training instance differ significantly from those in another training instance of the same label. Furthermore, since there are only a few training instances in the tail-labels, the union of the feature composition of all the training instances for a particular label does not necessarily form a good representation of that label. As a result, the feature composition of a test instance may differ significantly from that of the training instances.
On the other hand, the head labels have a few tens or even hundreds of training instances. Therefore, the union of words/features which appear in all the training instances is a reasonably good representation of the feature composition of that label. This can also be viewed from the perspective of the density of the subspace spanned by the features for a label. For the head labels, this subspace is much more densely spanned than for the tail labels, where it is sparsely spanned.
The above observation on the change in the feature composition of tail labels motivates a robustness approach in order to account for the distribution shift. Robustness to adversarial perturbations of features has seen a resurgence of interest, particularly in the context of deep learning. In the white-box setting, i.e., with access to the deep network, its parameters and their gradients, it has been shown that one can make the deep network predict any desired label for a given image (Goodfellow et al. 2014; Shaham et al. 2015; Szegedy et al. 2013). In the context of image classification with deep learning, the benefit of taking a robustness perspective when training data is scarce has been demonstrated in recent work at ICLR 2019 (Tsipras et al. 2019). A game-theoretic approach for robustness to feature deletion or addition has also been studied in an earlier work (Globerson and Roweis 2006).
Table 1 Training and test instances for label 28503
Training instances  

1.  Vision computational investigation into the human representation and processing of visual information david marr late of the massachusetts institute of technology was the author of many seminal articles on visual information processing and artificial intelligence 
2.  Foundations of vision it has much to offer everyone who wonders how this most remarkable of all senses works karen de valois science 
Test instance  

1.  Vision science photons to phenomenology this is monumental work covering wide range of topics findings and recent approaches on the frontiers anne princeton university stephen palmer is professor of psychology and director of the institute of cognitive studies at the university of california berkeley 
2.1 Case scenarios
The distribution shift is demonstrated for two of the tail labels extracted from the raw data corresponding to the Amazon670K dataset [provided by the authors of Liu et al. (2017)]. The instances of the tail label in Table 1 are book titles and editor reviews for books on computer vision and neuroscience, while the instances of the label in Table 2 are similar descriptions for VHS tapes in the action and adventure genre. Note that, in both cases, there is a significant variation in the features/vocabulary within the training instances and also from training to test set instances. For instance, in the first example in Table 1, except for the word ‘vision’, there is not much commonality between the features of the training instances, nor between those of the training and test instances. Similarly, in the second example, except for ‘vhs’ and the years ‘1942’ and ‘1943’, there is substantial variation in the vocabulary of the instances. In other words, even though the underlying distribution generating the training and test sets is the same in principle, the typical assumption that the test set comes from the same distribution as the training set is violated substantially for tail-labels. Features which were active in the training set might not appear in the test set, and vice-versa.
2.2 Robust optimization
Let the training data, given by \({\mathcal {T}} = \left\{ (\mathbf x _1,\mathbf y _1), \ldots ,(\mathbf x _N,\mathbf y _N) \right\} \), consist of input feature vectors \(\mathbf x _i \in {\mathcal {X}} \subseteq {\mathbb {R}}^D \) and respective output vectors \(\mathbf y _i \in {\mathcal {Y}} \subseteq \{0,1\}^L\) such that \(\mathbf y _{i_{\ell }}=1\) iff the \(\ell \)-th label belongs to the training instance \(\mathbf x _i\). For each label \(\ell \), sign vectors \(\mathbf s ^{(\ell )} \in \{+1, -1\}^N\) can be constructed, such that \(\mathbf s ^{(\ell )}_{i} = +1 \) if and only if \(\mathbf y _{i_\ell } = 1\), and \(-1\) otherwise. The traditional goal in XMC is to learn a multilabel classifier in the form of a vector-valued output function \(f : {\mathbb {R}}^D \mapsto \{0,1\}^L\). This is typically achieved by minimizing an empirical estimate of \({\mathbb {E}}_{(\mathbf x ,\mathbf y ) \sim {\mathcal {D}}}[{\mathcal {L}}(\mathbf W ;(\mathbf x ,\mathbf y ))]\), where \({\mathcal {L}}\) is a loss function, samples \((\mathbf x ,\mathbf y )\) are drawn from some underlying distribution \({\mathcal {D}}\), and \(\mathbf W \) denotes the desired parameters.
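The construction of the per-label sign vectors described above can be sketched as follows. This is a minimal illustration only; the function name `sign_vectors` and the representation of each \(\mathbf y _i\) as a set of label indices are assumptions made for the example, not part of the original text.

```python
def sign_vectors(label_sets, num_labels):
    """Build one +1/-1 vector of length N per label.

    label_sets[i] is the set of label indices attached to training instance i
    (a hypothetical, sparse stand-in for the binary vector y_i).
    """
    n = len(label_sets)
    # Start with -1 everywhere: by default an instance is a negative for a label.
    signs = [[-1] * n for _ in range(num_labels)]
    for i, labels in enumerate(label_sets):
        for ell in labels:
            signs[ell][i] = +1  # instance i is a positive for label ell
    return signs
```

Each row of the result defines an independent binary problem for one label, which is what allows the labels to be trained separately.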
Table 2 Training and test instances for label 246910
Training instances  

1.  Manhunt in the african jungle vhs 1943 an american secret agent matches wits with nazi agents in casablanca contains 15 episodes 
2.  Men vs black dragon vhs 1943 fifteen episodes of 1942 serial showing government agents as they exposed the infamous black dragon society an axis spy ring intent on crippling the war effort 
Test instance  

1.  and the vhs 1942 this is classic movie from 1942 with action death defying stunts and breathless cliffhangers 
For XMC problems, where N, D, and L lie in the range \(10^4\) to \(10^6\), we restrict ourselves to Hamming loss for the loss function \({\mathcal {L}}\) and the linear function class for \({\mathcal {W}}\). These have the following statistical and computational advantages: (i) the min-max optimization problem in Eq. (1) can be solved exactly without resorting to a Taylor series approximation, (ii) the Hamming loss function decomposes over individual labels, thereby enabling parallel training and hence linear speedup with the number of cores, and (iii) linear methods are computationally efficient and statistically comparable to kernel methods for large D (Huang and Lin 2016).
From the equivalence between the formulations in (2) and (3), the choice of norm in the bound on perturbations (\(\sum _{i=1}^N \Vert \tilde{\mathbf{x }}_{i}\Vert < \lambda '\)) determines the regularizer in the equivalent formulation in Eq. (3). For instance, considering \(\ell _2\)-bounded perturbations leads to the \(\ell _2\)-regularized SVM, which is robust to spherical perturbations. It may be recalled that the duals of the \(\Vert \cdot \Vert _1\), \(\Vert \cdot \Vert _2\), and \(\Vert \cdot \Vert _\infty \) norms are \(\Vert \cdot \Vert _\infty \), \(\Vert \cdot \Vert _2\), and \(\Vert \cdot \Vert _1\), respectively.
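Formulations (2) and (3) are not reproduced in this text. As a hedged sketch of why a norm bound on the perturbations turns into a dual-norm regularizer, assume the robust objective is the hinge loss evaluated at the perturbed inputs \(\mathbf x _i + \tilde{\mathbf{x }}_i\), and write \(\Vert \cdot \Vert ^*\) for the dual of the norm used in the bound (the label superscript on \(\mathbf s \) is dropped). Since \(\max [a + b, 0] \le \max [a, 0] + |b|\) and \(|\langle \mathbf w , \tilde{\mathbf{x }}_i \rangle | \le \Vert \mathbf w \Vert ^* \Vert \tilde{\mathbf{x }}_i\Vert \), every admissible set of perturbations satisfies

\[
\sum _{i=1}^N \max [1 - \mathbf s _i \langle \mathbf w , \mathbf x _i + \tilde{\mathbf{x }}_i \rangle , 0]
\;\le \; \sum _{i=1}^N \max [1 - \mathbf s _i \langle \mathbf w , \mathbf x _i \rangle , 0] + \Vert \mathbf w \Vert ^* \sum _{i=1}^N \Vert \tilde{\mathbf{x }}_i\Vert
\;\le \; \lambda ' \Vert \mathbf w \Vert ^* + \sum _{i=1}^N \max [1 - \mathbf s _i \langle \mathbf w , \mathbf x _i \rangle , 0].
\]

The right-hand side is a regularized hinge-loss objective whose regularizer is the dual norm of \(\mathbf w \). Whether this upper bound is attained, so that the two problems are equivalent rather than merely ordered, depends on conditions (such as non-separability of the data) that this sketch does not establish.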
2.3 Choice of norm
As shown in the examples in Tables 1 and 2, there can be a significant variation in the distribution of features from the training set to the test set instances. We therefore consider the worst-case perturbations in the input, i.e., those bounded in the \(\Vert \cdot \Vert _\infty \) norm. This is given by \(\Vert \tilde{\mathbf{x }}_{i}\Vert _\infty := \max _{d=1\ldots D} |\tilde{\mathbf{x }}_{i_d}|\). This choice of norm can be motivated from two perspectives.
Firstly, it may be noted that changing the input \(\mathbf x \) by small perturbations along each dimension, such that \(\Vert \tilde{\mathbf{x }}\Vert _\infty \) remains the same, can change the inner product evaluation of the decision function \(\mathbf w ^T(\mathbf x +\tilde{\mathbf{x }})\) significantly. By accounting for such perturbations in the training data, the resulting weight vector is, therefore, robust to worst-case feature variations. This is not true for perturbations whose \(\Vert \cdot \Vert _1\) or \(\Vert \cdot \Vert _2\) norms are bounded.
Secondly, since the dual of the \(\Vert \cdot \Vert _\infty \) norm is the \(\Vert \cdot \Vert _1\) norm, this choice of norm for the perturbations in formulation (2) leads to the \(\ell _1\)-regularized SVM in the optimization problem in (3). As this is known to yield sparse solutions, the above choice of norm shows an equivalence between sparsity and robustness. Both of these are desirable properties in the context of XMC, since we want models which are robust to perturbations due to the tail-label effect, and which are also sparse, leading to compact models for fast prediction in real-time applications of XMC.
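The duality between the \(\Vert \cdot \Vert _\infty \) and \(\Vert \cdot \Vert _1\) norms used above can be checked numerically on a toy example. The sketch below is illustrative only (both function names are hypothetical): over all perturbations whose largest coordinate is at most `eps` in absolute value, the largest possible shift in \(\mathbf w ^T \mathbf x \) equals `eps` times the \(\ell _1\) norm of \(\mathbf w \).

```python
import itertools

def worst_case_shift(w, eps):
    """Closed form: the max of <w, delta> over ||delta||_inf <= eps is eps * ||w||_1."""
    return eps * sum(abs(v) for v in w)

def brute_force_shift(w, eps):
    """Same quantity by enumeration (only feasible for a handful of dimensions).

    A linear function over a box attains its maximum at a corner,
    so it suffices to try every sign pattern of size eps.
    """
    return max(
        sum(wi * di for wi, di in zip(w, corner))
        for corner in itertools.product((-eps, eps), repeat=len(w))
    )
```

The maximizing corner is `eps * sign(w)`: a change of size `eps` in every coordinate at once. Penalizing \(\Vert \mathbf w \Vert _1\) therefore directly limits how far such a perturbation can move the decision value.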
2.4 Optimization
Both the \(\ell _1\)-norm and the hinge loss \(\max [1 - \mathbf{s }_{i} \langle \mathbf{w },\mathbf x _i \rangle , 0 ]\) are non-smooth, which is undesirable from the viewpoint of efficient optimization. In the following theorem, we prove that one can replace the hinge loss by its squared version, given by \((\max [1 - \mathbf{s }_{i} \langle \mathbf{w },\mathbf{x }_i \rangle , 0 ])^2\), for a different choice of the regularization parameter, \(\lambda \) instead of \(\lambda '\). The statistically equivalent problem results in the objective function in Eq. (5), which is easier to optimize. The proof of the theorem follows the same technique as in Xu et al. (2010).
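The displayed objectives that the text numbers (4) and (5) are missing from this version. Judging from the expressions used in the proof of Theorem 1, they presumably read as follows; the equation numbers are attached on that assumption.

\[
\min _{\mathbf w } \; \lambda ' \Vert \mathbf w \Vert _1 + \sum _{i=1}^N \max [1 - \mathbf s _i \langle \mathbf w , \mathbf x _i \rangle , 0] \qquad (4)
\]

\[
\min _{\mathbf w } \; \lambda \Vert \mathbf w \Vert _1 + \sum _{i=1}^N \left( \max [1 - \mathbf s _i \langle \mathbf w , \mathbf x _i \rangle , 0] \right) ^2 \qquad (5)
\]

In (5) the loss term is differentiable everywhere, so only the \(\ell _1\) regularizer remains non-smooth.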
Theorem 1
Proof
We first start with a definition. Let \(g(.) : {\mathbb {R}}^D \mapsto {\mathbb {R}}\) and \(h(.) : {\mathbb {R}}^D \mapsto {\mathbb {R}}\) be two functions. Then \(\mathbf w ^*\) is called weakly efficient if at least one of the following holds: (i) \(\mathbf w ^* \in \arg \min _{\mathbf w \in {\mathbb {R}}^D} g(\mathbf w )\), (ii) \(\mathbf w ^* \in \arg \min _{\mathbf w \in {\mathbb {R}}^D} h(\mathbf w )\), or (iii) \(\mathbf w ^*\) is Pareto efficient, which means that \(\not \exists \, \mathbf w '\) such that \(g(\mathbf w ') \le g(\mathbf w ^*)\) and \(h(\mathbf w ') \le h(\mathbf w ^*)\) with at least one holding with strict inequality.
A standard result from convex analysis states that for convex functions \(g(\mathbf w )\) and \(h(\mathbf w )\), the set of optimal solutions for the weighted sum \(\min _{\mathbf w } (\lambda _1 g(\mathbf w ) + \lambda _2 h(\mathbf w ))\), where \(\lambda _1, \lambda _2 \in [0,+\infty )\) are not both zero, coincides with the set of weakly efficient solutions. This means that the set of optimal solutions of \(\min _{\mathbf w } (\lambda ' \Vert \mathbf w \Vert _1 + \sum _{i=1}^N \max [1 - \mathbf s _{i} \langle \mathbf w ,\mathbf x _i \rangle ,0])\), where \(\lambda '\) ranges over \([0,+\infty )\), is the set of weakly efficient solutions of \(\Vert \mathbf w \Vert _1\) and \(\sum _{i=1}^N \max [1 - \mathbf s _{i} \langle \mathbf w ,\mathbf x _i \rangle , 0]\). Along similar lines, the set of optimal solutions of \(\min _{\mathbf w } (\lambda \Vert \mathbf w \Vert _1 + \sum _{i=1}^N (\max [1 - \mathbf s _{i} \langle \mathbf w ,\mathbf x _i \rangle , 0 ])^2)\), where \(\lambda \) ranges over \([0,+\infty )\), is the set of weakly efficient solutions of \(\Vert \mathbf w \Vert _1\) and \(\sum _{i=1}^N (\max [1 - \mathbf s _{i} \langle \mathbf w ,\mathbf x _i \rangle , 0 ])^2\). Since squaring is a monotonic function on the non-negative reals, it implies that these two sets are identical, and hence the two formulations given in Eqs. (4) and (5) are statistically equivalent up to a change in the regularization parameter. \(\square \)
2.5 Suboptimality of Liblinear solver (Fan et al. 2008)
The formulation in Eq. (5) lends itself to easier optimization, and an efficient solution is implemented in the Liblinear package (as the -s 5 argument) using a cyclic coordinate descent (CCD) procedure. Liblinear has been included in machine learning packages such as scikit-learn and the CRAN package LiblineaR for solving large-scale linear classification and regression tasks. A natural question to ask is: why not use this solver directly, if the modeling of XMC under the worst-case perturbation setting and the resulting optimization problem are indeed correct?
We applied the CCD-based implementation in LibLinear and found that it gives a sub-optimal solution. In particular, the CCD solution (i) underfits the training data, and (ii) does not give good generalization performance. For concreteness, let \(\mathbf w _{CCD} \in {\mathbb {R}}^D\) be the solution returned by CCD for the objective function in Eq. (5), and let \(opt_{CCD} \in {\mathbb {R}}^+\) be the corresponding objective value attained. We demonstrate underfitting by producing a certificate \(\mathbf w _{Prox} \in {\mathbb {R}}^D\) with the corresponding objective function value \(opt_{Prox} \in {\mathbb {R}}^+\) such that \(opt_{Prox} < opt_{CCD}\). The certificate of sub-optimality is constructed by following the proximal gradient procedure in the next section. The inferior generalization performance of Liblinear is shown in Table 4, which, among other methods, compares on the test set the models learnt by CCD with those learnt by our proximal gradient procedure in Algorithm 1.
2.6 Certificate construction by proximal gradient
For the EURLex dataset, Fig. 2 compares the optimization objective attained by the weights \(\mathbf w _{CCD}\) learnt by the LibLinear CCD solver with that attained by the weights \(\mathbf w _{Prox}\) learnt by the proximal gradient solver. For approximately 90% of the labels, the objective value obtained by Algorithm 1 was lower than that obtained by LibLinear; in some cases, the value for \(\mathbf w _{Prox}\) was as low as half of that for \(\mathbf w _{CCD}\).
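Algorithm 1 itself is not reproduced in this text. As a rough illustration of the kind of procedure involved, the following is a generic forward-backward (proximal gradient) sketch for a single label, assuming the objective is the \(\ell _1\)-regularized squared hinge loss that appears in the proof of Theorem 1. It is not the authors' implementation; the function names, the fixed step size and the iteration count are choices made for this example.

```python
import numpy as np

def objective(w, X, s, lam):
    """lam * ||w||_1 + sum_i max(1 - s_i <w, x_i>, 0)^2 for one label."""
    margins = np.maximum(1.0 - s * (X @ w), 0.0)
    return lam * np.abs(w).sum() + np.sum(margins ** 2)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrinks every entry towards zero by t)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(X, s, lam, n_iter=500):
    """Forward-backward splitting from an all-zeros start with step 1/Lipschitz."""
    w = np.zeros(X.shape[1])
    # The gradient of the squared-hinge term is Lipschitz with constant
    # at most 2 * sigma_max(X)^2 (assumes X is not identically zero).
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        margins = np.maximum(1.0 - s * (X @ w), 0.0)
        grad = -2.0 * (X.T @ (s * margins))              # forward step: smooth part
        w = soft_threshold(w - step * grad, step * lam)  # backward step: l1 prox
    return w
```

Any weight vector whose `objective` value is below the value reached by another solver is, in the sense used above, a certificate that the other solver stopped at a sub-optimal point.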
3 Experimental analysis
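Table 3 Multilabel datasets from XMC repository. APpL and ALpP denote the average number of training points per label and the average number of labels per point, respectively
Dataset  # Training (N)  # Features (D)  # Labels (L)  APpL  ALpP  Algebraic connectivity, \(\lambda _2(G)\) 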

Mediamill  30,993  120  101  1902.1  4.4  0.46 
Bibtex  4,880  1,836  159  111.7  2.4  0.30 
EURLex  15,539  5,000  3,993  25.7  5.3  0.22 
WikiLSHTC325K  1,778,351  1,617,899  325,056  17.4  3.2  0.002 
Wiki500K  1,813,391  2,381,304  501,070  24.7  4.7  0.001 
Amazon670K  490,499  135,909  670,091  3.9  5.4  0.0001 
To match against the ground truth, as suggested in Jain et al. (2016), we use \(100*{\mathbb {G}}(\{\hat{\mathbf{y }}\})/{\mathbb {G}}(\{\mathbf{y }\})\) as the performance metric. For M test samples, \({\mathbb {G}}(\{\hat{\mathbf{y }}\}) = \frac{1}{M}\sum _{i=1}^{M}{\mathbb {L}}(\hat{\mathbf{y }}_i,\mathbf y _i)\), where \({\mathbb {G}}(.)\) and \({\mathbb {L}}(.,.)\) signify gain and loss respectively. The loss \({\mathbb {L}}(.,.)\) can take two forms: (i) \({\mathbb {L}}(\hat{\mathbf{y }}_i,\mathbf y _i) = -PSnDCG@k\), and (ii) \({\mathbb {L}}(\hat{\mathbf{y }}_i,\mathbf y _i) = -PSP@k\). This leads to the two metrics which are finally used in our comparison in Table 4 (denoted by N@k and P@k).
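The definition of PSP@k (Eq. 8) is not reproduced in this text. The sketch below assumes the usual form from Jain et al. (2016), in which each correctly predicted label among the top k contributes the inverse of its propensity, and it takes the propensities as given inputs rather than estimating them. All function names are hypothetical.

```python
def psp_at_k(scores, relevant, propensity, k):
    """Propensity-scored precision@k for one test instance.

    scores[l]     : predicted score of label l
    relevant      : set of ground-truth label indices
    propensity[l] : assumed propensity of label l (small for tail labels)
    """
    top_k = sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)[:k]
    return sum(1.0 / propensity[l] for l in top_k if l in relevant) / k

def best_psp_at_k(relevant, propensity, k):
    """Value obtained when the ground-truth labels themselves are ranked first."""
    gains = sorted((1.0 / propensity[l] for l in relevant), reverse=True)[:k]
    return sum(gains) / k

def normalized_psp_at_k(all_scores, all_relevant, propensity, k):
    """100 * G(predictions) / G(ground truth); the 1/M factors cancel."""
    got = sum(psp_at_k(s, r, propensity, k) for s, r in zip(all_scores, all_relevant))
    best = sum(best_psp_at_k(r, propensity, k) for r in all_relevant)
    return 100.0 * got / best
```

Because a hit on a label with propensity 0.1 counts ten times as much as a hit on a label with propensity 1, a method that only predicts head labels scores poorly on this metric even when its vanilla precision is high.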
3.1 Methods for comparison
We compare PRoXML against eight algorithms including the state-of-the-art label-embedding, tree-based and linear methods:
Label-embedding methods
 (I) LEML (Yu et al. 2014): It learns a low-rank embedding of the label space and is shown to work well for datasets with high label correlation and a moderate number of labels.
 (II) SLEEC (Bhatia et al. 2015): It learns sparse locally low-rank embeddings and captures non-linear correlation among the labels. This method has been shown to scale to XMC problems by applying data clustering as an initialization step.
Tree-based methods
 (I) FastXML (Prabhu and Varma 2014): This is a scalable tree-based method which partitions the feature space and optimizes the vanilla nDCG metric at each node (\(p_{\ell }=1\) in equation (8) above).
 (II) PFastreXML (Jain et al. 2016): This method is designed for better classification of tail-labels. It learns an ensemble of a PFastXML classifier, which optimizes the propensity-scored nDCG metric in Eq. 8, and a Rocchio classifier (https://en.wikipedia.org/wiki/Rocchio_algorithm) applied on the top 1,000 labels predicted by PFastXML. It is shown to outperform the production system used in Bing Search [cf. Section 7 in Jain et al. (2016)] and is reviewed in detail in Sect. 5.
 (III) Parabel (Prabhu et al. 2018): This is a recently proposed method which learns label partitions by a balanced 2-means algorithm, followed by learning a one-vs-rest classifier at each node.
Linear methods
 (I) PDSparse (Yen et al. 2016): It uses elastic net regularization with multi-class hinge loss and exploits the sparsity in the primal and dual problems for faster convergence.
 (II) DiSMEC (Babbar and Schölkopf 2017): This is a distributed one-vs-rest method which achieves state-of-the-art results on vanilla P@k and nDCG@k. It minimizes Hamming loss with \(\ell _2\) regularization and uses a weight pruning heuristic for model size reduction.
 (III) CCDL1: This is the in-built sparse solver in the LibLinear package, which optimizes Eq. 5 using coordinate descent.
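Table 4 Comparison of propensity-scored N@k and P@k for \(k=1\), 3 and 5
Column groups: proposed approach (ProXML); embedding methods (SLEEC, LEML); tree-based methods (FastXML, PFastreXML, Parabel); linear methods (PDSparse, CCDL1, DiSMEC)
Dataset / metric  ProXML  SLEEC  LEML  FastXML  PFastreXML  Parabel  PDSparse  CCDL1  DiSMEC  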
MediaMill  
N@1  64.3  70.1  66.3  66.1  66.8  66.5  62.2  63.9  66.5 
N@3  63.6  72.3  65.7  66.1  66.5  65.9  61.0  62.8  65.5 
N@5  62.8  73.1  64.7  65.2  65.7  65.2  57.2  62.0  65.2 
P@1  64.3  70.1  66.3  66.6  66.8  63.4  62.2  63.6  66.5 
P@3  61.3  72.7  65.1  65.4  66.5  62.8  59.8  60.2  65.1 
P@5  60.8  74.0  63.6  64.3  64.7  62.1  54.0  59.7  63.7 
BibTex  
N@1  50.1  51.1  47.9  48.5  52.2  50.8  48.3  49.9  50.2 
N@3  52.1  52.9  50.2  51.1  53.6  51.9  48.5  51.6  52.0 
N@5  55.1  56.0  53.5  54.3  56.9  54.5  50.7  54.9  55.7 
P@1  50.1  51.1  47.9  48.5  52.3  41.2  48.3  49.9  50.2 
P@3  52.0  53.9  51.4  52.3  54.3  45.8  48.7  52.1  52.2 
P@5  58.3  59.5  57.5  58.8  60.5  54.5  52.9  57.9  58.6 
EURLex4K  
N@1  45.2  35.4  24.1  26.6  43.8  37.7  38.2  37.8  41.2 
N@3  47.5  38.8  26.3  32.0  45.2  43.4  40.9  41.6  44.3 
N@5  49.1  40.3  27.6  35.2  46.0  46.1  42.8  44.1  46.9 
P@1  45.2  35.4  24.1  27.6  43.8  37.7  38.2  37.8  41.2 
P@3  48.5  39.8  27.2  35.3  46.4  44.7  42.7  41.6  45.4 
P@5  51.0  42.7  29.1  39.9  47.3  48.8  44.8  44.1  49.3 
Wiki325K  
N@1  34.8  20.2  3.4  16.3  30.6  28.7  28.3  27.8  29.1 
N@3  38.7  22.2  3.6  19.5  31.2  35.2  31.9  30.6  35.9 
N@5  41.5  23.3  3.9  21.0  32.0  38.1  33.6  33.9  39.4 
P@1  34.8  20.5  3.4  16.5  30.8  28.7  28.3  27.8  29.1 
P@3  37.7  23.3  3.7  21.2  31.5  35.0  33.5  30.6  35.6 
P@5  41.0  25.2  4.2  23.7  33.0  38.6  36.6  33.9  39.4 
Wiki500K  
N@1  33.1  21.1  3.2  22.5  29.2  28.8  –  29.8  31.2 
N@3  35.2  21.0  3.4  21.8  27.6  31.2  –  30.2  33.4 
N@5  39.0  20.8  3.5  22.4  27.7  35.5  –  33.1  37.0 
P@1  33.1  21.1  3.2  22.5  29.2  28.8  –  29.8  31.2 
P@3  35.0  21.0  3.4  21.8  27.6  31.9  –  30.2  33.4 
P@5  39.4  20.8  3.5  22.4  27.7  34.6  –  33.1  37.0 
Amazon670K  
N@1  30.8  20.6  2.07  19.3  29.3  28.0  –  19.4  27.8 
N@3  31.7  22.6  2.21  22.2  30.4  28.8  –  21.2  28.8 
N@5  32.6  24.4  2.35  24.6  31.4  29.4  –  22.7  30.7 
P@1  30.8  20.6  2.0  20.2  28.0  27.6  –  19.4  27.8 
P@3  32.8  23.3  2.2  23.8  29.5  31.0  –  21.2  30.6 
P@5  35.1  26.0  2.4  27.2  30.1  34.1  –  22.7  34.2 
3.2 Prediction accuracy
The relative performance of various methods on propensity-scored metrics for nDCG@k (denoted N@k) and precision@k (denoted P@k) is shown in Table 4. The important observations are summarized below:
(A) Larger datasets We first look at the results for the datasets falling in the extreme regime, such as Amazon670K, Wiki500K and WikiLSHTC325K, which have a large number of labels, a large fraction of which are tail-labels. Under this regime, PRoXML performs substantially better than both embedding schemes and tree-based methods such as PFastreXML. For instance, for WikiLSHTC325K, the improvement in P@5 and N@5 over SLEEC and PFastreXML is approximately 60% and 25%, respectively. It is important to note that our method works better on propensity-scored metrics than PFastreXML even though its training process optimizes another metric, namely, a convex upper bound on Hamming loss. On the other hand, PFastreXML optimizes the same metric on which the performance is evaluated. Due to its robustness properties, PRoXML also performs better than DiSMEC, which also minimizes Hamming loss but employs \(\ell _2\) regularization instead.
(B) Smaller datasets We now consider smaller datasets which contain no tail-labels. These include the Mediamill and Bibtex datasets, consisting of 101 and 159 labels respectively. Under this regime, the embedding-based methods SLEEC and LEML perform better than or on par with Hamming loss minimizing methods. As explained in Sect. 4, this is due to the high algebraic connectivity of label graphs in smaller datasets, leading to high correlation between labels. This behavior is in stark contrast to datasets in the extreme regime, such as WikiLSHTC325K and Amazon670K, in which Hamming loss minimizing methods significantly outperform label-embedding methods. The above differences in performance between small-scale and large-scale problems are indeed quite contrary to the remarks in recent works [cf. abstract of Jain et al. (2016)].
(D) Vanilla metrics The results for vanilla precision@k and nDCG@k (in which the label propensity \(p_{\ell }=1, \forall \ell \)) are shown in the appendix. For these metrics, ProXML performs slightly worse than DiSMEC. However, this is expected since these metrics give equal weight to all the labels. As a result, methods which are biased towards head-labels tend to perform better, but tend to yield less diverse predictions.
3.3 Model size, and training/prediction time
Due to the sparsity-inducing \(\ell _1\) regularization, the obtained models are quite sparse and light-weight. For instance, the model learnt by PRoXML is 3 GB in size for WikiLSHTC325K, compared to 30 GB for PFastreXML on this dataset. In terms of training time, PRoXML uses a distributed training framework, thereby exploiting as many cores as are available for computation. The training can be done offline on a distributed/cloud-based system for large datasets such as WikiLSHTC325K and Amazon670K. Faster convergence can be achieved by other methods such as sub-sampling negative examples, or by warm-starting the optimization with the weights learnt by the DiSMEC algorithm instead of initializing with an all-zeros solution in Algorithm 1.
Prediction speed is more critical for most applications of XMC, which demand low latency in domains such as recommendation systems and web-advertizing. The compact model learnt by PRoXML can be easily evaluated for prediction on streaming test instances. This is further aided by distributed model storage, which can exploit the parallel architecture for prediction. On WikiLSHTC325K, prediction takes 2 ms per test instance on average, which is three times as fast as SLEEC and on par with tree-based methods.
4 Discussion: what works, what doesn’t and why?
We now analyze the empirical results shown in the previous section by drawing connections to spectral properties of label graphs, and determine data-dependent conditions under which Hamming loss minimization is more suited than label embedding methods and vice-versa. This section also sheds light on qualitative differences between data properties when one moves from small-scale problems to the extreme regime.
4.1 Algebraic connectivity of label graphs
Why does Hamming loss work for extreme classification?
Contrary to the assertions in Jain et al. (2016), as shown in Table 4, the Hamming loss minimizing one-vs-rest scheme, which trains an independent classifier for every label, works well on datasets in the extreme regime such as WikiLSHTC325K and Amazon670K. In this regime, there is very little correlation between labels that could potentially be exploited in the first place by schemes such as LEML and SLEEC. The extremely weak correlation is indicated by crucial statistics shown in Table 3, which include: a lower value of the algebraic connectivity of the label graph \(\lambda _2(G)\), a fat-tailed distribution of instances among labels, and lower values of the average number of labels per instance. The virtual non-existence of correlation indicates that the presence of a given label does not really imply the presence of other labels. It may be noted that, though there may be underlying semantic similarity between labels, there is not enough data, especially for tail-labels, to support it. This inherent separation in the label graph for larger datasets leads to better performance of the one-vs-rest scheme.
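The precise construction of the label graph is not given in this text. As an illustrative sketch only, the code below assumes that two labels are joined by an unweighted edge whenever they co-occur in at least one training instance, and that \(\lambda _2(G)\) is the second-smallest eigenvalue of the normalized graph Laplacian. The function name and the dense-matrix representation are hypothetical and only practical for a small number of labels.

```python
import numpy as np

def algebraic_connectivity(Y):
    """Second-smallest eigenvalue of the normalized Laplacian of the label graph.

    Y is an (N x L) binary instance-label matrix (dense here for simplicity).
    """
    Y = np.asarray(Y, dtype=float)
    A = (Y.T @ Y > 0).astype(float)   # edge (i, j) if labels i and j ever co-occur
    np.fill_diagonal(A, 0.0)          # no self-loops
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    # Normalized Laplacian; isolated labels get an all-zero row and column.
    lap = np.diag((deg > 0).astype(float)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    return np.sort(np.linalg.eigvalsh(lap))[1]
```

Under these assumptions, a value near zero indicates a label graph that is disconnected or nearly so (labels rarely co-occur), while a densely connected label graph gives a value well away from zero.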
Why is label embedding suitable for small datasets?
For smaller datasets that consist of only a few hundred labels (such as MediaMill) and have a relatively large average number of labels per instance, the labels tend to co-occur more often than in datasets in the extreme regime. In this situation, label correlation is much higher and can be easily exploited by label-embedding approaches, leading to better performance than the one-vs-rest approach. Datasets of this scale, common in traditional multilabel problems, have been marked by the success of label-embedding methods. Therefore, conclusions drawn at this scale, such as on the applicability of learning algorithms or the suitability of loss functions for a given problem, may not necessarily apply to datasets in XMC.
What about PSP@k and PSPnDG@k?
Though PSP@k and PSPnDG@k are appropriate for performance evaluation, they may not be the right metrics to optimize during training. For instance, if a training instance has fifteen positive labels and we are optimizing PSP@5, then as soon as five of the fifteen labels are classified correctly, the training process will stop trying to change the decision hyperplane for this training instance. As a result, the information regarding the remaining ten labels is not captured while optimizing the PSP@5 metric. It is possible that, at test time, we get a similar instance which has some or all of the remaining ten labels, which were not optimized during training. On the other hand, one-vs-rest, which minimizes Hamming loss, would try to independently align the hyperplanes for all fifteen labels until they are separated from the rest. Overall, the model learnt by optimizing Hamming loss is richer than that learnt by optimizing PSP@k and PSPnDG@k. Therefore, when regularized properly, it leads to better performance on P@k and nDG@k as well as on PSP@k and PSPnDG@k.
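The saturation argument can be checked on a toy instance. The sketch below uses the commonly stated forms of P@k (fraction of relevant labels among the top k) and unnormalized PSP@k (each hit weighted by the inverse propensity); the helper names, the sign threshold at zero, and the uniform propensities are assumptions for illustration and are not taken from the evaluation code used in the experiments.

```python
import numpy as np

def precision_at_k(scores, y_true, k):
    """Fraction of the k top-scored labels that are relevant."""
    top = np.argsort(-scores)[:k]
    return float(y_true[top].sum() / k)

def psp_at_k(scores, y_true, propensity, k):
    """Propensity-scored precision: each hit is weighted by 1 / p_label."""
    top = np.argsort(-scores)[:k]
    return float((y_true[top] / propensity[top]).sum() / k)

def hamming_errors(scores, y_true):
    """Number of labels whose sign-thresholded prediction is wrong."""
    return int(((scores > 0).astype(int) != y_true).sum())

# Toy instance: 20 labels, the first 15 are relevant.
y_true = np.array([1] * 15 + [0] * 5)
# Five positives are ranked on top; the other ten sit below the negatives.
scores = np.array([2.0] * 5 + [-1.0] * 10 + [-0.5] * 5)
propensity = np.full(20, 0.5)   # hypothetical uniform propensities

p5 = precision_at_k(scores, y_true, 5)
psp5 = psp_at_k(scores, y_true, propensity, 5)
errors = hamming_errors(scores, y_true)
```

Here the top-5 metrics are already at their best possible value for this instance, so they provide no further training signal, while the Hamming-type count still reports ten misclassified positive labels.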
5 Related work
To handle the large scale of labels in XMC, most methods have focused on one of two main strands: (i) tree-based methods, and (ii) label-embedding based methods.
Label-embedding approaches assume that the output space has an inherently low-rank structure. These approaches have been at the forefront of multilabel classification with a few hundred labels (Bhatia et al. 2015; Hsu et al. 2009; Yu et al. 2014; Tai and Lin 2012; Bi and Kwok 2013; Zhang and Schneider 2011; Chen and Lin 2012; Bengio et al. 2010; Lin et al. 2014; Tagami 2017). However, this assumption can break down in the presence of large-scale power-law distributed category systems, leading to high prediction error (Babbar et al. 2013, 2016).
Tree-based approaches (Jain et al. 2016; Prabhu and Varma 2014; Si et al. 2017; Niculescu-Mizil and Abbasnejad 2017; Daume III et al. 2016; Jernite et al. 2016; Jasinska et al. 2016) aim at faster prediction, which is achieved by recursively dividing the space of labels or features. Due to the cascading effect, a prediction error made at a top level cannot be corrected at lower levels. Typically, such techniques trade off prediction accuracy for logarithmic prediction time, which might be desired in some applications.
5.1 PFastreXML (Jain et al. 2016)
 (I) Standalone PFastXML: Fig. 3 shows the variation of PSP@k of PFastreXML with \(\alpha \), including the two extremes (PFastXML, \(\alpha = 1\)) and (\(Rocchio_{1,000}\) classifier, \(\alpha = 0\)), on three datasets from Table 3. Clearly, the performance of PFastreXML depends heavily on the good performance of the \(Rocchio_{1,000}\) classifier. Recall that one of the main goals of propensity-based metrics and PFastXML was better coverage of tail labels. However, PFastXML itself needs to be supported by the additional \(Rocchio_{1,000}\) classifier for better tail-label coverage. In contrast, our method does not need any such auxiliary classifier.
 (II) Need for propensity estimation from metadata: To estimate the propensities \(p_\ell \) using \(p_\ell := 1/\left( 1+C e^{-A\log (N_\ell +B)}\right) \), one needs to compute the parameters A and B from some meta-information of the data source, such as Wikipedia or Amazon taxonomies. Furthermore, this side information might not even be available for some datasets, in which case the authors in Jain et al. (2016) set the parameters to the average of those for the Wikipedia and Amazon datasets, which is quite ad hoc. Our method does not need propensities for training and hence is also applicable to other metrics for tail-label coverage.
 (III) Large model sizes: PFastreXML leads to large models, such as 30 GB (for 50 trees) on WikiLSHTC325K and 70 GB (for 20 trees) on Wiki500K. Such large models can be difficult to evaluate for making real-time predictions in recommendation systems and web-advertizing. For larger datasets such as WikiLSHTC325K, the model learnt by PRoXML is around 3 GB, which is an order of magnitude smaller than that of PFastreXML.
 (IV) Lots of hyperparameters: PFastreXML has around half a dozen hyperparameters, such as \(\alpha \), the number of trees in the ensemble, and the number of instances in a leaf node. Also, there is no a priori reason to fix \(\alpha =0.8\), even though it gives better generalization performance as shown in Fig. 4. In contrast, our method has just one hyperparameter, namely the regularization parameter.
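For reference, the propensity model mentioned in item (II) can be evaluated as in the sketch below, written with the decaying exponent \(e^{-A\log (N_\ell +B)}\) so that labels with more positive instances receive a higher propensity. Two things are assumptions here rather than statements from the text above: the constant is taken as \(C = (\log N - 1)(B+1)^A\), with N the number of training instances, as recalled from Jain et al. (2016), and the values of A and B in the usage lines are purely illustrative.

```python
import math

def propensity(n_pos, n_train, A, B):
    """Propensity p_l = 1 / (1 + C * exp(-A * log(n_pos + B))).

    n_pos   : number of positive training instances of the label (N_l)
    n_train : total number of training instances (N)
    A, B    : dataset-dependent parameters estimated from meta-information
    """
    # Assumed form of the constant C; not stated in the surrounding text.
    C = (math.log(n_train) - 1.0) * (B + 1.0) ** A
    return 1.0 / (1.0 + C * math.exp(-A * math.log(n_pos + B)))

# Illustrative parameter values only.
p_tail = propensity(n_pos=1, n_train=10000, A=0.55, B=1.5)     # rare label
p_head = propensity(n_pos=1000, n_train=10000, A=0.55, B=1.5)  # frequent label
```

Under this assumed form of C, a label with a single positive instance gets propensity \(1/\log N\) independently of A and B, and the propensity grows towards 1 as the label becomes more frequent, which is why inverse-propensity weighting emphasizes tail labels.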
Comparison of P@k for \(k = 1\), 3 and 5

| Dataset | Metric | PRoXML (proposed) | SLEEC (embedding) | LEML (embedding) | FastXML (tree-based) | PFastreXML (tree-based) | Parabel (tree-based) | PDSparse (linear) | L1SVM (linear) | DiSMEC (linear) |
|---|---|---|---|---|---|---|---|---|---|---|
| MediaMill | P@1 | 86.5 | 87.2 | 84.0 | 84.2 | 84.2 | 83.4 | 81.8 | 85.8 | 87.2 |
| | P@3 | 68.4 | 73.4 | 67.2 | 67.3 | 67.3 | 66.3 | 62.5 | 67.4 | 69.3 |
| | P@5 | 53.2 | 59.1 | 52.8 | 53.0 | 53.0 | 51.7 | 45.1 | 52.5 | 54.1 |
| Bibtex | P@1 | 64.4 | 65.0 | 62.5 | 63.4 | 62.8 | 64.4 | 61.2 | 64.1 | 64.5 |
| | P@3 | 39.0 | 39.6 | 38.4 | 39.2 | 39.6 | 38.5 | 35.9 | 38.7 | 39.2 |
| | P@5 | 28.2 | 28.8 | 28.2 | 28.8 | 28.9 | 27.9 | 25.8 | 28.4 | 28.4 |
| EURLex4K | P@1 | 83.4 | 79.2 | 63.1 | 71.6 | 75.8 | 80.3 | 76.2 | 80.8 | 82.2 |
| | P@3 | 70.9 | 64.3 | 50.3 | 62.0 | 62.2 | 68.9 | 60.9 | 67.6 | 68.3 |
| | P@5 | 59.1 | 52.3 | 41.6 | 51.2 | 52.0 | 57.7 | 49.8 | 55.1 | 57.9 |
| Wiki325K | P@1 | 63.6 | 54.8 | 19.8 | 49.3 | 56.4 | 64.7 | 61.2 | 60.6 | 64.9 |
| | P@3 | 41.5 | 33.8 | 11.4 | 32.7 | 36.4 | 42.9 | 36.3 | 38.6 | 42.5 |
| | P@5 | 30.8 | 24.0 | 8.4 | 24.0 | 27.0 | 31.6 | 28.7 | 28.5 | 31.5 |
| Wiki500K | P@1 | 69.0 | 48.2 | 41.3 | 54.1 | 59.2 | 67.5 | – | 65.3 | 70.2 |
| | P@3 | 49.1 | 29.4 | 30.1 | 35.5 | 40.3 | 48.7 | – | 46.1 | 50.6 |
| | P@5 | 38.8 | 21.2 | 19.8 | 26.2 | 30.7 | 37.7 | – | 35.3 | 39.7 |
| Amazon670K | P@1 | 43.5 | 35.0 | 8.1 | 33.3 | 28.6 | 44.0 | – | 39.8 | 44.7 |
| | P@3 | 38.7 | 31.2 | 6.8 | 29.3 | 24.9 | 39.4 | – | 34.3 | 39.7 |
| | P@5 | 35.3 | 28.5 | 6.0 | 26.1 | 22.3 | 36.0 | – | 30.1 | 36.1 |
Comparison of N@k for \(k = 1\), 3 and 5

| Dataset | Metric | PRoXML (proposed) | SLEEC (embedding) | LEML (embedding) | FastXML (tree-based) | PFastreXML (tree-based) | Parabel (tree-based) | PDSparse (linear) | L1SVM (linear) | DiSMEC (linear) |
|---|---|---|---|---|---|---|---|---|---|---|
| MediaMill | N@1 | 86.5 | 87.2 | 84.0 | 84.2 | 84.2 | 83.4 | 81.8 | 85.8 | 87.2 |
| | N@3 | 77.3 | 81.5 | 75.2 | 75.4 | 75.6 | 74.4 | 70.2 | 76.4 | 78.5 |
| | N@5 | 75.6 | 79.2 | 71.8 | 72.3 | 72.4 | 70.9 | 63.7 | 74.7 | 76.5 |
| Bibtex | N@1 | 64.4 | 65.0 | 62.5 | 63.4 | 62.8 | 64.4 | 61.2 | 64.1 | 64.5 |
| | N@3 | 59.2 | 60.4 | 58.2 | 59.5 | 60.0 | 59.3 | 55.8 | 59.2 | 59.4 |
| | N@5 | 61.5 | 62.6 | 60.5 | 61.7 | 62.0 | 61.0 | 57.3 | 61.3 | 61.6 |
| EURLex4K | N@1 | 83.4 | 79.2 | 63.1 | 71.6 | 75.8 | 80.3 | 76.2 | 80.8 | 82.2 |
| | N@3 | 74.2 | 68.1 | 53.5 | 61.2 | 65.9 | 71.8 | 64.3 | 71.2 | 72.5 |
| | N@5 | 68.2 | 61.6 | 48.4 | 52.3 | 60.7 | 66.1 | 58.7 | 64.9 | 66.7 |
| Wiki325K | N@1 | 63.6 | 54.8 | 19.8 | 49.3 | 56.4 | 64.7 | 61.2 | 60.6 | 64.9 |
| | N@3 | 57.4 | 47.2 | 14.4 | 33.1 | 50.4 | 58.3 | 55.3 | 55.2 | 58.5 |
| | N@5 | 57.1 | 46.1 | 13.4 | 24.4 | 50.0 | 58.1 | 54.7 | 55.0 | 58.4 |
| Wiki500K | N@1 | 69.0 | 48.2 | 41.3 | 54.1 | 59.2 | 67.5 | – | 65.3 | 70.2 |
| | N@3 | 39.8 | 22.6 | 18.7 | 26.4 | 30.1 | 38.5 | – | 36.1 | 42.1 |
| | N@5 | 38.7 | 21.4 | 17.1 | 24.7 | 28.7 | 36.3 | – | 34.3 | 40.5 |
| Amazon670K | N@1 | 43.5 | 35.0 | 8.1 | 33.3 | 28.6 | 44.0 | – | 39.8 | 44.7 |
| | N@3 | 41.1 | 32.7 | 7.3 | 33.2 | 35.8 | 41.5 | – | 36.8 | 42.1 |
| | N@5 | 39.7 | 31.5 | 6.8 | 30.5 | 33.2 | 39.8 | – | 35.2 | 40.5 |
6 Conclusion
In this work, we motivated the importance of effective tail-label discovery in XMC. We approached this problem from the novel perspective of robustness, which has not been considered so far in the domain of XMC. We showed that this viewpoint naturally motivates the well-known \(\ell _1\)-regularized SVM, and demonstrated its surprising effectiveness over state-of-the-art methods while scaling to problems with millions of labels. To provide insights into these observations, we explained the performance gain of the one-vs-rest scheme vis-à-vis label-embedding methods using tools from graph theory. We hope that synergizing with recent progress on the robustness of deep learning methods will open new avenues for future research.
Acknowledgements
Open access funding provided by Aalto University. Funding was provided by Aalto Yliopisto (Grant No. TT Package). The authors wish to acknowledge CSC – IT Center for Science, Finland, for computational resources, and the Triton cluster team at Aalto University.
References
 Agrawal, R., Gupta, A., Prabhu, Y., & Varma, M. (2013). Multilabel learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW (pp. 13–24).
 Babbar, R., & Schölkopf, B. (2017). Dismec: Distributed sparse machines for extreme multilabel classification. In WSDM (pp. 721–729).
 Babbar, R., Partalas, I., Gaussier, E., & Amini, M.-R. (2013). On flat versus hierarchical classification in large-scale taxonomies. In Advances in Neural Information Processing Systems (pp. 1824–1832).
 Babbar, R., Metzig, C., Partalas, I., Gaussier, E., & Amini, M.-R. (2014). On power law distributions in large-scale taxonomies. ACM SIGKDD Explorations Newsletter, 16(1), 47–56.
 Babbar, R., Partalas, I., Gaussier, E., Amini, M.-R., & Amblard, C. (2016). Learning taxonomy adaptation in large-scale classification. The Journal of Machine Learning Research, 17(1), 3350–3386.
 Bach, F., Jenatton, R., Mairal, J., & Obozinski, G. (2011). Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, 5, 19–53.
 Bengio, S., Weston, J., & Grangier, D. (2010). Label embedding trees for large multiclass tasks. In Neural information processing systems (pp. 163–171).
 Bhatia, K., Jain, H., Kar, P., Varma, M., & Jain, P. (2015). Sparse local embeddings for extreme multilabel classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), NIPS (pp. 730–738).
 Bhatia, K., Dahiya, K., Jain, H., Prabhu, Y., & Varma, M. (2016). The extreme classification repository: Multilabel datasets and code. http://manikvarma.org/downloads/XC/XMLRepository.html.
 Bi, W., & Kwok, J. (2013). Efficient multilabel classification with many labels. In Proceedings of the 30th international conference on machine learning (pp. 405–413).
 Chen, Y.-N., & Lin, H.-T. (2012). Feature-aware label space dimension reduction for multilabel classification. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), NIPS (pp. 1529–1537).
 Chung, F. R. (1997). Spectral graph theory. Providence: American Mathematical Society.
 Combettes, P. L., & Pesquet, J.-C. (2007). A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4), 564.
 Daume III, H., Karampatziakis, N., Langford, J., & Mineiro, P. (2016). Logarithmic time one-against-some. arXiv preprint. arXiv:1606.04988.
 Deng, J., Berg, A. C., Li, K., & Fei-Fei, L. (2010). What does classifying more than 10,000 image categories tell us? In K. Daniilidis, P. Maragos, & N. Paragios (Eds.), ECCV (pp. 71–84). Berlin: Springer.
 Denton, E., Weston, J., Paluri, M., Bourdev, L., & Fergus, R. (2015). User conditional hashtag prediction for images. In KDD.
 Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). Liblinear: A library for large linear classification. The Journal of Machine Learning Research, 9, 1871–1874.
 Gaure, A., Gupta, A., Verma, V. K., & Rai, P. (2017). A probabilistic framework for zero-shot multilabel learning. In UAI.
 Globerson, A., & Roweis, S. (2006). Nightmare at test time: Robust learning by feature deletion. In W. W. Cohen & A. Moore (Eds.), Proceedings of the 23rd international conference on machine learning (pp. 353–360). ACM.
 Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint. arXiv:1412.6572.
 Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multilabel prediction via compressed sensing. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, & A. Culotta (Eds.), Advances in neural information processing systems (pp. 772–780).
 Huang, H.-Y., & Lin, C.-J. (2016). Linear and kernel classification: When to use which? In SDM (pp. 216–224). SIAM.
 Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme multilabel loss functions for recommendation, tagging, ranking and other missing label applications. In KDD.
 Jain, V., Modhe, N., & Rai, P. (2017). Scalable generative models for multilabel learning with missing labels. In ICML.
 Jasinska, K., Dembczynski, K., Busa-Fekete, R., Pfannschmidt, K., Klerx, T., & Hüllermeier, E. (2016). Extreme F-measure maximization using sparse probability estimates. In ICML.
 Jernite, Y., Choromanska, A., Sontag, D., & LeCun, Y. (2016). Simultaneous learning of trees and representations for extreme classification, with application to language modeling. arXiv preprint. arXiv:1610.04658.
 Lin, Z., Ding, G., Hu, M., & Wang, J. (2014). Multilabel classification via feature-aware implicit label space encoding. (pp. 325–333).
 Liu, J., Chang, W.-C., Wu, Y., & Yang, Y. (2017). Deep learning for extreme multilabel text classification. In SIGIR (pp. 115–124). ACM.
 McAuley, J., & Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys (pp. 165–172). ACM.
 Mencia, E. L., & Fürnkranz, J. (2008). Efficient pairwise multilabel classification for large-scale problems in the legal domain. In ECML PKDD.
 Nam, J., Mencía, E. L., Kim, H. J., & Fürnkranz, J. (2017). Maximizing subset accuracy with recurrent neural networks in multilabel classification. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems (pp. 5413–5423).
 Niculescu-Mizil, A., & Abbasnejad, E. (2017). Label filters for large scale multilabel classification. In AISTATS, Proceedings of Machine Learning Research (pp. 1448–1457). Fort Lauderdale.
 Papanikolaou, Y., & Tsoumakas, G. (2017). Subset labeled LDA for large-scale multilabel classification. arXiv preprint. arXiv:1709.05480.
 Prabhu, Y., & Varma, M. (2014). Fastxml: A fast, accurate and stable tree-classifier for extreme multilabel learning. In Proceedings of the 20th ACM SIGKDD (pp. 263–272). ACM.
 Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., & Varma, M. (2018). Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW.
 Radlinski, F., Bennett, P. N., Carterette, B., & Joachims, T. (2009). Redundancy, diversity and interdependent document relevance. In ACM SIGIR Forum.
 Shaham, U., Yamada, Y., & Negahban, S. (2015). Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint. arXiv:1511.05432.
 Shani, G., & Gunawardana, A. (2013). Tutorial on application-oriented evaluation of recommendation systems. AI Communications, 26(2), 225–236.
 Si, S., Zhang, H., Keerthi, S. S., Mahajan, D., Dhillon, I. S., & Hsieh, C.-J. (2017). Gradient boosted decision trees for high dimensional sparse output. In ICML.
 Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. arXiv preprint. arXiv:1312.6199.
 Tagami, Y. (2017). Annexml: Approximate nearest neighbor search for extreme multilabel classification. In KDD. ACM.
 Tai, F., & Lin, H.-T. (2012). Multilabel classification with principal label space transformation. Neural Computation, 24(9), 2508–2542.
 Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., & Madry, A. (2019). Robustness may be at odds with accuracy. In International conference on learning representations. https://openreview.net/forum?id=SyxAb30cY7.
 Tsoumakas, G., Katakis, I., & Vlahavas, I. (2009). Mining multilabel data. In Data mining and knowledge discovery handbook. Springer.
 Xu, H., Caramanis, C., & Mannor, S. (2009). Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10, 1485–1510.
 Xu, H., Caramanis, C., & Mannor, S. (2010). Robust regression and lasso. IEEE Transactions on Information Theory, 56(7), 3561–3574.
 Yen, I. E., Huang, X., Ravikumar, P., Zhong, K., & Dhillon, I. S. (2016). Pdsparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML.
 Yu, H.-F., Jain, P., Kar, P., & Dhillon, I. (2014). Large-scale multilabel learning with missing labels. In ICML (pp. 593–601).
 Zhang, Y., & Schneider, J. (2011). Multilabel output codes using canonical correlation analysis. (pp. 873–882).
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.