Abstract
Extreme multi-label classification (XMC) refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. In this paper, we develop a suite of algorithms, called Bonsai, which generalizes the notion of label representation in XMC, and partitions the labels in the representation space to learn shallow trees. We show three concrete realizations of this label representation space including: (i) the input space which is spanned by the input features, (ii) the output space spanned by label vectors based on their co-occurrence with other labels, and (iii) the joint space by combining the input and output representations. Furthermore, the constraint-free multi-way partitions learnt iteratively in these spaces lead to shallow trees. By combining the effect of shallow trees and generalized label representation, Bonsai achieves the best of both worlds—fast training which is comparable to state-of-the-art tree-based methods in XMC, and much better prediction accuracy, particularly on tail-labels. On the benchmark Amazon-3M dataset with 3 million labels, Bonsai outperforms a state-of-the-art one-vs-rest method in terms of prediction accuracy, while being approximately 200 times faster to train. The code for Bonsai is available at https://github.com/xmcaalto/bonsai.
Introduction
Extreme Multi-label Classification (XMC) refers to supervised learning of a classifier which can automatically label an instance with a small subset of relevant labels from an extremely large set of all possible target labels. Machine learning problems consisting of hundreds of thousands of labels are common in various domains such as product categorization for e-commerce (McAuley and Leskovec 2013; Shen et al. 2011; Bengio et al. 2010; Agrawal et al. 2013), hashtag suggestion in social media (Denton et al. 2015), annotating web-scale encyclopedias (Partalas et al. 2015), and image classification (Krizhevsky et al. 2012; Deng et al. 2010). It has been demonstrated that the framework of XMC can also be leveraged to effectively address ranking problems arising in bid-phrase suggestion in web advertising and suggestion of relevant items for recommendation systems (Prabhu and Varma 2014).
From the machine learning perspective, building effective extreme classifiers is faced with the computational challenge arising due to the large number of (i) output labels, (ii) input training instances, and (iii) input features. Another important statistical characteristic of the datasets in XMC is that a large fraction of labels are tail labels, i.e., those which have very few training instances that belong to them (also referred to as a power-law, fat-tailed distribution, or Zipf's law). Formally, let \(N_r\) denote the size of the rth ranked label, when ranked in decreasing order of the number of training instances that belong to that label; then:
$$\begin{aligned} N_r = N_1 \cdot r^{-\beta } \end{aligned}$$(1)
where \(N_1\) represents the size of the 1st ranked label and \(\beta >0\) denotes the exponent of the power-law distribution. This distribution is shown in Fig. 1 for a benchmark dataset, WikiLSHTC-325K, from the XMC repository (Bhatia et al. 2016). In this dataset, only \(\sim \)150,000 out of 325,000 labels have more than 5 training instances. Tail labels capture the diversity of the label space, and contain informative content not captured by the head or torso labels. Indeed, by predicting the head labels well, yet omitting most of the tail labels, an algorithm can achieve high accuracy (Wei and Li 2018). However, such behavior is not desirable in many real world applications, where a fit to the power-law distribution has been observed (Babbar et al. 2014).
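To make the fat-tailed behavior concrete, the following sketch estimates the power-law exponent \(\beta \) from a label-frequency curve via a least-squares fit on the log-log scale. The Zipf-distributed synthetic counts, generator seed, and fitting procedure are illustrative assumptions, not the procedure used in the paper.

```python
import numpy as np

# Toy label-frequency vector: number of training instances per label,
# sorted in decreasing order (counts here are synthetic, for illustration).
rng = np.random.default_rng(0)
counts = np.sort(rng.zipf(a=2.0, size=1000))[::-1].astype(float)

# Fit the exponent beta via least squares on the log-log curve:
# log N_r = log N_1 - beta * log r
ranks = np.arange(1, counts.size + 1)
slope, log_n1 = np.polyfit(np.log(ranks), np.log(counts), 1)
beta = -slope  # slope is -beta in the model above

print(f"estimated exponent beta: {beta:.2f}")
```

On real XMC datasets such as WikiLSHTC-325K the counts come from the label matrix rather than a synthetic generator, but the same log-log fit applies.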
Related work
Multi-label learning has long been a topic of interest, with early focus on relatively small-scale problems (Tsoumakas and Katakis 2007; Tsoumakas et al. 2008; Read et al. 2008; Vens et al. 2008; Madjarov et al. 2012). However, most works in the context of large-scale scenarios, which fall under the realm of XMC, can be broadly categorized into one of four strands:

1.
Tree-based Tree-based methods implement a divide-and-conquer paradigm and scale to large label sets in XMC by partitioning the label space. As a result, this class of methods has the computational advantage of enabling faster training and prediction (Prabhu and Varma 2014; Jain et al. 2016; Jasinska et al. 2016; Majzoubi and Choromanska 2019; Wydmuch et al. 2018). Approaches based on decision trees have also been proposed for multi-label classification, including those tailored to the XMC regime (Joly et al. 2019; Si et al. 2017). However, tree-based methods suffer from error propagation in the tree cascade, as also observed in hierarchical classification (Babbar et al. 2013, 2016). As a result, these methods tend to perform particularly poorly on metrics which are sensitive to tail-labels (Prabhu et al. 2018).

2.
Label embedding Label-embedding approaches assume that, despite the large number of labels, the label matrix is effectively low rank, and therefore project it onto a low-dimensional subspace. These approaches have been at the forefront in multi-label classification for small-scale problems with a few tens or hundreds of labels (Hsu et al. 2009; Tai and Lin 2012; Weston et al. 2011; Lin et al. 2014). For power-law distributed labels in XMC settings, the crucial low-rank assumption made by the embedding-based approaches breaks down (Xu et al. 2016a; Bhatia et al. 2015; Tagami 2017). Under this condition, embedding-based approaches lead to high prediction error.

3.
One-vs-rest Sometimes also referred to as binary relevance (Zhang et al. 2018), these methods learn a classifier per label which distinguishes it from the rest of the labels. In terms of prediction accuracy and label diversity, these methods have been shown to be among the best performing ones for XMC (Babbar and Schölkopf 2017; Yen et al. 2017; Babbar and Schölkopf 2019). However, due to their reliance on a distributed training framework, it remains challenging to employ them in resource-constrained environments.

4.
Deep learning Deeper architectures on top of word embeddings have also been explored in recent works (Liu et al. 2017; Joulin et al. 2017; Mikolov et al. 2013). However, their performance still remains suboptimal compared to the methods discussed above, which are based on bag-of-words feature representations. This is mainly due to the data scarcity in tail-labels, which is substantially below the sample complexity required for deep learning methods to reach their peak performance.
Therefore, a central challenge in XMC is to build classifiers which retain the accuracy of the one-vs-rest paradigm while being as efficiently trainable as the tree-based methods. Recently, there have been efforts to speed up the training of existing classifiers through better initialization and by exploiting the problem structure (Fang et al. 2019; Liang et al. 2018; Jalan et al. 2019). In a similar vein, a recently proposed tree-based method, Parabel (Prabhu et al. 2018), partitions the label space recursively into two child nodes using 2-means clustering. It also maintains a balance between these two label partitions in terms of the number of labels. Each intermediate node in the resulting binary label-tree acts as a meta-label which captures the generic properties of its constituent labels. The leaves of the tree consist of the actual labels from the training data. During training and prediction, each of these labels is distinguished from other labels under the same parent node through the application of a binary classifier at internal nodes and a one-vs-all classifier at the leaf nodes. By combining tree-based partitioning and one-vs-rest classifiers, it has been shown to give better performance than previous tree-based methods (Prabhu and Varma 2014; Jain et al. 2016; Jasinska et al. 2016) while simultaneously allowing efficient training.
However, in terms of prediction performance, Parabel remains suboptimal compared to one-vs-rest approaches. In addition to error propagation due to the cascading effect of deep trees, its performance is particularly poor on tail labels. This is the result of two strong constraints in its label partitioning process: (i) each parent node in the tree has only two child nodes, and (ii) at each node, the labels are partitioned into equal-sized parts, such that the number of labels under the two child nodes differs by at most one. As a result of the coarseness imposed by the binary partitioning of labels, the tail labels get subsumed by the head labels.
Bonsai overview
In this paper, we develop a family of algorithms, called Bonsai. At a high level, Bonsai follows a paradigm common to most tree-based approaches, i.e., label partitioning followed by learning classifiers at the internal nodes. However, it has two main features which distinguish it from state-of-the-art tree-based approaches. These are summarized below:

Generalized label representation In this work, we argue that the notion of representing the labels is quite general, and there are various meaningful manifestations of the label representation space. As three concrete examples, we show the applicability of the following representations of labels: (i) input space representation as a function of feature vectors, (ii) output space representation based on their co-occurrence with other labels, and (iii) a combination of the output and input representations. In this regard, our work generalizes the approach taken in many earlier works, which have represented labels only in the input space (Prabhu et al. 2018; Wydmuch et al. 2018), or only in the output space (Tsoumakas et al. 2008). We show that these representations, when combined with shallow trees (described in the next section), surpass existing methods, demonstrating the efficacy of the proposed generalized representation.

Shallow trees To avoid error propagation in the tree cascade, we propose to construct a shallow tree architecture. This is achieved by enabling (i) flexible clustering via K-means for \(K>2\), and (ii) relaxing the balancedness constraints in the clustering step. Multi-way partitioning initializes diverse subgroups of labels, and the unconstrained nature maintains this diversity during the entire process. This is in contrast to tree-based methods which impose such constraints for a balanced tree construction. As we demonstrate in our empirical findings, by relaxing the constraints, Bonsai leads to prediction diversity and significantly better tail-label coverage.
By synergizing the effect of a richer label representation and shallow trees, Bonsai achieves the best of both worlds—prediction diversity better than state-of-the-art tree-based methods with comparable training speed, and prediction accuracy at par with one-vs-rest methods. The code for Bonsai is available at https://github.com/xmcaalto/bonsai.
Formal description of Bonsai
We assume we are given a set of N training points \(\left\{ (\mathbf{x} _i, \mathbf{y} _i)\right\} _{i=1}^N\) with \(D\)-dimensional feature vectors \(\mathbf{x} _i \in \mathbb {R}^D\) and L-dimensional label vectors \(\mathbf{y} _i \in \{0,1\}^L\). Without loss of generality, let the set of labels be represented by \(\{1,\ldots , \ell , \ldots , L\}\). Our goal is to learn a multi-label classifier in the form of a vector-valued output function \(f: {\mathbb {R}}^D\mapsto \{0,1\}^L\). This is typically achieved by minimizing an empirical estimate of \({\mathbb {E}}_{(\mathbf{x} ,\mathbf{y} ) \sim {\mathcal {D}}}[\mathcal {L}(\mathbf{W} ;(\mathbf{x} ,\mathbf{y} ))]\), where \({\mathcal {L}}\) is a loss function, and samples \((\mathbf{x} ,\mathbf{y} )\) are drawn from some underlying distribution \({\mathcal {D}}\). The desired parameters \(\mathbf{W} \) can take one of a myriad of forms. In the simplest (and yet effective) of setups for XMC, such as linear classification, \(\mathbf{W} \) can be in the form of a matrix. In other cases, it can represent a deeper architecture or a cascade of classifiers in a tree-structured topology. Due to the scalability of tree-structured approaches to extremely large datasets, Bonsai follows a tree-structured partitioning of labels.
In this section, we next present in detail the two main components of Bonsai: (i) generalized label representation and (ii) shallow trees.
Label representation
In the extreme classification setting, labels can be represented in various ways. To motivate this, consider an analogy in terms of publications and their authors: one can think of labels as authors, the papers they write as their training instances, and the multiple co-authors of a paper as the multiple labels. Now, one can represent authors (labels) either solely based on the content of the papers they authored (input space representation), or based only on their co-authors (output space representation), or as a combination of the two.
Formally, let each label \(\ell \) be represented by an \(\eta \)-dimensional vector \(\mathbf{v} _{\ell } \in {\mathbb {R}}^{\eta }\). Now, \(\mathbf{v} _{\ell }\) can be represented as a function (i) only of input features via the input vectors in the training instances \(\left\{ \mathbf{x} _i\right\} _{i=1}^N\), (ii) only of output features via the label vectors in the training instances \(\left\{ \mathbf{y} _i\right\} _{i=1}^N\), or (iii) as a combination of both \(\left\{ (\mathbf{x} _i, \mathbf{y} _i)\right\} _{i=1}^N\). We now present three concrete realizations of the label representation \(\mathbf{v} _{\ell }\). We later show that these representations can be seamlessly combined with a shallow tree cascade of classifiers, and yield state-of-the-art performance on XMC tasks.

(a)
Input space label representation The label representation for label \(\ell \) can be arrived at by summing all the training examples for which it is active. Let \(\mathbf{V} _i\) be the label representation matrix given by
$$\begin{aligned} \mathbf{V} _i = \mathbf{Y} ^T\mathbf{X} = \begin{bmatrix} \mathbf{v} _1^T\\ \mathbf{v} _2^T\\ \vdots \\ \mathbf{v} _L^T \end{bmatrix}_{L \times D} \quad \mathrm{where \quad } \mathbf{X} = \begin{bmatrix} \mathbf{x} _1^T\\ \mathbf{x} _2^T\\ \vdots \\ \mathbf{x} _N^T \end{bmatrix}_{N \times D}\mathrm{,\quad } \mathbf{Y} = \begin{bmatrix} \mathbf{y} _1^T\\ \mathbf{y} _2^T\\ \vdots \\ \mathbf{y} _N^T \end{bmatrix}_{N \times L}. \end{aligned}$$(2)We follow the notation that each bold letter such as \(\mathbf{x} \) is a vector in column format and \(\mathbf{x} ^T\) represents the corresponding row vector. Hence, each row \(\mathbf{v} _{\ell }\) of the matrix \(\mathbf{V} _i\), which represents the label \(\ell \), is given by the sum of all the training instances for which label \(\ell \) is active. This can also be written as \(\mathbf{v} _{\ell } = \sum _{i=1}^N \mathbf{y} _{i{\ell }}\mathbf{x} _i\). Note that even though \(\mathbf{v} _{\ell }\) depends on the label vectors, it is still in the same space as the input instances and has dimensionality D. Furthermore, each \(\mathbf{v} _{\ell }\) can be normalized to unit length in Euclidean norm as follows: \(\mathbf{v} _{\ell } := \mathbf{v} _{\ell }/\Vert \mathbf{v} _{\ell }\Vert _2\).

(b)
Output space representation In the multi-label setting, another way to represent the labels is solely as a function of the degree of their co-occurrence with other labels. That is, if two labels co-occur with a similar set of labels, then they are likely to be related to each other, and hence should have similar representations. In this case, the label representation matrix \(\mathbf{V} _o\) is given by
$$\begin{aligned} \mathbf{V} _o = \mathbf{Y} ^T\mathbf{Y} = \begin{bmatrix} \mathbf{v} _1^T\\ \mathbf{v} _2^T\\ \vdots \\ \mathbf{v} _L^T \end{bmatrix}_{L \times L} \quad \mathrm{where} \quad \mathbf{Y} = \begin{bmatrix} \mathbf{y} _1^T\\ \mathbf{y} _2^T\\ \vdots \\ \mathbf{y} _N^T \end{bmatrix}_{N \times L}. \end{aligned}$$(3)Here \(\mathbf{V} _o\) is an \(L \times L\) symmetric matrix, where each row \(\mathbf{v} _{\ell }^T\) corresponds to the number of times the label \(\ell \) co-occurs with every other label. Hence these label co-occurrence vectors \(\mathbf{v} _{\ell }\) give us another way of representing the label \(\ell \). It may be noted that, in contrast to the previous case, being an output space representation, the label vector has the same dimensionality as the output space, i.e., \(\eta =L\).

(c)
Joint input–output representation Given the previous input and output space representations of labels, a natural extension is to combine these representations via concatenation. This is achieved as follows: for a training instance i with feature vector \(\mathbf{x} _i\) and corresponding label vector \(\mathbf{y} _i\), let \(\mathbf{z} _i\) be the concatenated vector given by \(\mathbf{z} _i = [\mathbf{x} _i \odot \mathbf{y} _i]\), where \(\odot \) denotes concatenation. Then, the joint representation can be computed in the matrix \(\mathbf{V} _j\) as follows
$$\begin{aligned} \mathbf{V} _j = \mathbf{Y} ^T\mathbf{Z} = \begin{bmatrix} \mathbf{v} _1^T\\ \mathbf{v} _2^T\\ \vdots \\ \mathbf{v} _L^T \end{bmatrix}_{L \times (D+L)} \mathrm{where\quad } \mathbf{Z} = \begin{bmatrix} \mathbf{z} _1^T\\ \mathbf{z} _2^T\\ \vdots \\ \mathbf{z} _N^T \end{bmatrix}_{N \times (D+L)} \mathbf{Y} = \begin{bmatrix} \mathbf{y} _1^T\\ \mathbf{y} _2^T\\ \vdots \\ \mathbf{y} _N^T \end{bmatrix}_{N \times L} \end{aligned}$$(4)Here each row \(\mathbf {v}_{\ell }\) of the label representation matrix \(\mathbf{V} _j\), which is the label representation in the joint space, is therefore a concatenation of the representations obtained from \(\mathbf{V} _i\) and \(\mathbf{V} _o\), and hence has length \((D+L)\). Since both the input vectors \(\mathbf{x} _i\) and output vectors \(\mathbf{y} _i\) are highly sparse, this does not lead to any major computational burden in training.
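All three representations reduce to sparse matrix products. The following sketch computes \(\mathbf{V} _i = \mathbf{Y} ^T\mathbf{X} \), \(\mathbf{V} _o = \mathbf{Y} ^T\mathbf{Y} \) and \(\mathbf{V} _j = \mathbf{Y} ^T\mathbf{Z} \) on a toy dataset; the matrices and the helper `normalize_rows` are illustrative and not taken from the Bonsai codebase.

```python
import numpy as np
from scipy import sparse

# Toy data: N=4 instances, D=5 features, L=3 labels (synthetic, for illustration).
X = sparse.csr_matrix(np.array([[1., 0., 2., 0., 0.],
                                [0., 1., 0., 1., 0.],
                                [1., 1., 0., 0., 1.],
                                [0., 0., 1., 0., 1.]]))
Y = sparse.csr_matrix(np.array([[1, 0, 1],
                                [0, 1, 0],
                                [1, 1, 0],
                                [0, 0, 1]], dtype=float))

V_i = Y.T @ X                     # input-space representation, L x D
V_o = Y.T @ Y                     # output-space (co-occurrence), L x L
Z = sparse.hstack([X, Y]).tocsr() # concatenated instance vectors, N x (D+L)
V_j = Y.T @ Z                     # joint representation, L x (D+L)

# Normalize each label vector to unit Euclidean norm, as in Eq. (2).
def normalize_rows(V):
    V = V.tocsr().astype(float)
    norms = np.sqrt(np.asarray(V.multiply(V).sum(axis=1))).ravel()
    norms[norms == 0] = 1.0
    return sparse.diags(1.0 / norms) @ V

V_i = normalize_rows(V_i)
print(V_i.shape, V_o.shape, V_j.shape)  # (3, 5) (3, 3) (3, 8)
```

Since X and Y stay in CSR format throughout, each product touches only the non-zero entries, which is what keeps these representations cheap at XMC scale.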
It may be noted that our notion of label representation generalizes similar approaches in recent works (i) which are based solely on the input space representation (Prabhu et al. 2018; Wydmuch et al. 2018), or (ii) which are based on the output space representation only (Tsoumakas et al. 2008). As also shown later in our empirical findings, in combination with the shallow tree cascade of classifiers, partitioning of the:

output space representation (\(\mathbf{V} _o\)) yields competitive results compared to state-of-the-art classifiers in XMC such as Parabel.

joint representation (\(\mathbf{V} _j\)) further surpasses the state-of-the-art methods in terms of prediction performance and label diversity.
Label partitioning via K-means clustering
Once we have obtained the representation \(\mathbf {v}_{\ell }\) for each label \(\ell \) in the set \(S= \left\{ 1, \ldots , L\right\} \), the next step is to iteratively partition S into disjoint subsets. This is achieved by K-means clustering, which presents several choices, such as the number of clusters and the degree of balancedness among the clusters. Our goal, in this work, is to avoid error propagation in a deep tree cascade. We therefore choose a relatively large value of K (e.g., \(\ge 100\)), which leads to shallow trees.
The clustering step in Bonsai first partitions \(S\) into \(K\) disjoint sets \(\left\{ S_1, \ldots , S_K\right\} \). Each element \(S_k\) of this set can be thought of as a meta-label which semantically groups actual labels together in one cluster. Then, \(K\) child nodes of the root are created, each containing one of the partitions \(\left\{ S_k\right\} _{k=1}^K\). The same process is repeated on each of the newly-created \(K\) child nodes in an iterative manner. In each subtree, the process terminates either when the node’s depth exceeds a predefined threshold \(d_{\text {max}} \) or when the number of associated labels is no larger than \(K\), i.e., \(|S_k| \le K\).
Formally, without loss of generality, we assume a non-leaf node has labels \(\left\{ 1, \ldots , L\right\} \). We aim to find \(K\) cluster centers \(\mathbf{c} _1, \ldots , \mathbf{c} _K\in {\mathbb {R}}^{\eta }\), i.e., in an appropriate space (input, output, or joint), by optimizing the following:
$$\begin{aligned} \min _{\mathbf{c} _1, \ldots , \mathbf{c} _K} \sum _{k=1}^{K} \sum _{\ell \in S_k} d(\mathbf{v} _{\ell }, \mathbf{c} _k) \end{aligned}$$(5)
where \(d(\cdot , \cdot )\) represents a distance function and \(\mathbf{v} _{\ell }\) represents the vector representation of the label \({\ell }\). The distance function is defined in terms of the dot product as follows: \(d(\mathbf{v} _{\ell }, \mathbf{c} _k) = 1 - \mathbf{v} _{\ell }^T \mathbf{c} _k\). The above problem is \(\mathbf{NP} \)-hard, and we use the standard K-means algorithm (also known as Lloyd’s algorithm) (Lloyd 1982)^{Footnote 1} to find an approximate solution to Eq. (5).
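The iterative partitioning above can be sketched as a short recursion. The sketch below uses scikit-learn's standard `KMeans` on unit-normalized label vectors as a stand-in approximation for clustering under the dot-product distance \(d(\mathbf{v} _{\ell }, \mathbf{c} _k) = 1 - \mathbf{v} _{\ell }^T \mathbf{c} _k\); the function name `partition_labels` and the toy data are our own, not the implementation in the linked repository.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def partition_labels(V, K, max_depth, depth=0, labels=None):
    """Recursively split label indices into at most K clusters per node,
    with no balancedness constraint: each label simply joins its nearest
    center, so cluster sizes are free to be unequal."""
    if labels is None:
        labels = np.arange(V.shape[0])
    if depth >= max_depth or len(labels) <= K:
        return labels.tolist()          # leaf: actual labels
    V_node = normalize(V[labels])       # unit rows approximate dot-product distance
    assign = KMeans(n_clusters=K, n_init=4, random_state=0).fit_predict(V_node)
    return [partition_labels(V, K, max_depth, depth + 1, labels[assign == k])
            for k in range(K)]

# Toy example: 20 labels with random 8-dim representations, K=3, depth 2.
rng = np.random.default_rng(0)
tree = partition_labels(rng.normal(size=(20, 8)), K=3, max_depth=2)
print(tree)
```

With the large K used by Bonsai (e.g., 100), two or three levels of this recursion already cover millions of labels, which is what keeps the tree shallow.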
The K-way unconstrained clustering in Bonsai has the following advantages over Parabel, which enforces binary and balanced partitioning:

1.
Initializing label diversity in partitioning By setting \(K> 2\), Bonsai allows a varied partitioning of the label space, rather than grouping all labels into two clusters. This facet of Bonsai is especially favorable for tail labels, allowing them to be part of separate clusters if they are indeed very different from the rest of the labels. Depending on its similarity to other labels, each label can choose to be part of one of the K clusters.

2.
Sustaining label diversity Bonsai sustains the diversity in the label space by not enforcing a balancedness constraint of the form \(\big | |S_k| - |S_{k'}| \big | \le 1, \forall \, 1 \le k,k' \le K\) (where the \(|\cdot |\) operator is overloaded to mean set cardinality for the inner occurrences and absolute value for the outer one) among the partitions. This makes the Bonsai partitions more data-dependent, since smaller partitions with diverse tail-labels are only moderately penalized under this framework.

3.
Shallow tree cascade Furthermore, K-way unconstrained partitioning leads to shallower trees, which are less prone to the error propagation affecting the deeper trees constructed by Parabel. As we will show in Sect. 3, the diverse partitioning reinforced by the shallower architecture leads to better prediction performance, with significant improvement achieved on tail labels.
A pictorial description of the partitioning scheme of Bonsai, and its difference compared to Parabel, is given in Fig. 2.
Learning node classifiers
Once the label space is partitioned into a diverse and shallow tree structure, we learn a \(K\)-way One-vs-All linear classifier at each node. These classifiers are trained independently, using only the training examples that have at least one of the node labels. We distinguish the leaf nodes and non-leaf nodes in the following way: (i) for non-leaf nodes, \(K\) linear classifiers are learnt separately, each mapping to one of the \(K\) children. During prediction, the output of each classifier determines whether the test point should traverse down the corresponding child. (ii) for leaf nodes, the classifier learns to predict the actual labels at the node.
Without loss of generality, given a node in the tree, denote by \(\{c_k\}_{k=1}^{K}\) the set of its children. For the special case of leaf nodes, the set of children represents the final labels. We learn \(K\) linear classifiers parameterized by \(\left\{ \mathbf{w} _1, \ldots , \mathbf{w} _K\right\} \), where \(\mathbf{w} _k \in \mathbb {R}^{D}\) for all \(k=1, \ldots , K\). The output of each classifier determines whether the corresponding child should be traversed or not.
For each child node \(c_k\), we define the training data as \(T_k = (\mathbf{X} _k, \mathbf{s} _k)\), where \(\mathbf{X} _k=\left\{ \mathbf{x} _i \mid \mathbf{y} _{ik} = 1, i=1,\ldots ,N\right\} \), and \(\mathbf{s} _k \in \left\{ +1, -1\right\} ^N\) represents the vector of signs, where \(\mathbf{y} _{ik} = 1\) corresponds to \(+1\) and \(\mathbf{y} _{ik} = 0\) to \(-1\). We consider the following optimization problem for learning a linear SVM with squared hinge loss and \(\ell _2\)-regularization:
$$\begin{aligned} \min _{\mathbf{w} _k} \Vert \mathbf{w} _k\Vert _2^2 + C\sum _{i=1}^{N} {\mathcal {L}}\left( s_{ik}\,\mathbf{w} _k^T\mathbf{x} _i\right) \end{aligned}$$(6)
where \({\mathcal {L}}(z) = (\max (0,1-z))^2\). This is solved using the Newton-method-based primal implementation in LIBLINEAR (Fan et al. 2008). To restrict the model size and remove spurious parameters, thresholding of small weights is performed as in Babbar and Schölkopf (2017). Similar to Parabel, one-vs-all classifiers are also learnt at the leaf nodes, which consist of the actual labels.
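A minimal sketch of one node's \(K\) child classifiers, using scikit-learn's `LinearSVC` (which wraps LIBLINEAR) with `penalty='l2'`, `loss='squared_hinge'`, `dual=False` to match the primal squared-hinge objective above. The toy data, the membership matrix `S`, and the weight-thresholding value `1e-2` are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy node: 200 training points routed here, 3 children (synthetic data).
rng = np.random.default_rng(0)
X_node = rng.normal(size=(200, 10))
S = rng.integers(0, 2, size=(200, 3))        # child membership: y_ik in {0, 1}

classifiers = []
for k in range(S.shape[1]):
    s_k = np.where(S[:, k] == 1, 1, -1)      # sign vector s_k in {+1, -1}
    clf = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, C=1.0)
    clf.fit(X_node, s_k)
    # Threshold small weights to keep the model sparse, in the spirit of
    # the weight pruning described above (1e-2 is an arbitrary cut-off).
    w = clf.coef_.ravel().copy()
    w[np.abs(w) < 1e-2] = 0.0
    clf.coef_ = w.reshape(1, -1)
    classifiers.append(clf)

# Scores of a test point against each child: higher means "traverse down".
scores = np.array([clf.decision_function(X_node[:1])[0] for clf in classifiers])
print(scores.shape)  # (3,)
```

In practice each child classifier is fit only on the examples routed to the node, so the K problems stay small even when N is in the millions.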
The tree-structured architecture of Bonsai is illustrated in Fig. 3. The details of Bonsai’s training procedure are shown in Algorithm 1. The partitioning process in Sect. 2.1 is described as the procedure GROW in the algorithm. The One-vs-All procedure is shown as onevsall in Algorithm 1.
Prediction error propagation in shallow versus deep trees
During prediction, a test point \(\mathbf{x} \) traverses down the tree. At each non-leaf node, the classifier narrows down the search space by deciding which subset of child nodes \(\mathbf{x} \) should further traverse. If the classifier decides not to traverse down some child node c, none of the descendants of c are traversed. Later, as \(\mathbf{x} \) reaches one or more leaf nodes, One-vs-All classifiers are evaluated to assign probabilities to each label. Bonsai uses beam search to avoid evaluating all nodes. At each level, the \(B\) most probable nodes, whose scores are calculated at the previous level, are further traversed down.
However, the above search space pruning strategy implies that errors made at non-leaf nodes propagate to their descendants. Bonsai sets relatively large values of the branching factor K (typically 100), resulting in much shallower trees compared to Parabel, and hence significantly reducing error propagation, particularly for tail-labels.
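The level-wise beam search can be sketched as follows. The dictionary-based node structure, the stand-in score functions, and the function name `beam_search` are all hypothetical simplifications of the trained node classifiers, chosen to keep the sketch self-contained.

```python
import numpy as np

def beam_search(root, x, beam_width):
    """Level-wise beam search down a label tree (illustrative sketch).
    A node is a dict with either 'children' (list of (score_fn, child) pairs)
    or 'leaf_labels' (label -> score_fn); score functions stand in for the
    trained node classifiers and return probabilities in (0, 1]."""
    frontier = [(0.0, root)]                       # (log-prob, node)
    leaf_scores = {}
    while frontier:
        next_frontier = []
        for logp, node in frontier:
            if "leaf_labels" in node:
                for label, f in node["leaf_labels"].items():
                    leaf_scores[label] = logp + np.log(f(x))
            else:
                for f, child in node["children"]:
                    next_frontier.append((logp + np.log(f(x)), child))
        # keep only the B most probable nodes at this level
        frontier = sorted(next_frontier, key=lambda t: -t[0])[:beam_width]
    return sorted(leaf_scores.items(), key=lambda t: -t[1])

# Tiny hand-built tree: root with two children, each a leaf with two labels.
hi = lambda x: 0.9
lo = lambda x: 0.2
tree = {"children": [
    (hi, {"leaf_labels": {"A": hi, "B": lo}}),
    (lo, {"leaf_labels": {"C": hi, "D": lo}}),
]}
ranking = beam_search(tree, x=None, beam_width=1)
print([lbl for lbl, _ in ranking])  # ['A', 'B']
```

Note how the low-scoring subtree (labels C and D) is never evaluated with `beam_width=1`; this is exactly the pruning that makes prediction fast but also the source of the propagated errors discussed next.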
More formally, given a data point \(\mathbf{x} \) and a label \(\ell \) that is relevant to \(\mathbf{x} \), we denote by \(e \) the leaf node that \(\ell \) belongs to, and by \(\mathcal {A} (e)\) the set of ancestor nodes of \(e \) together with \(e \) itself. Note that \(|\mathcal {A} (e)| \) is the path length from the root to \(e \). Denote the parent of node n by \(p(n) \). We define the binary indicator variable \(z_{n} \) to take value 1 if node n is visited during prediction and 0 otherwise. From the chain rule, the probability that \(\ell \) is predicted as relevant for \(\mathbf{x} \) is as follows:
$$\begin{aligned} \text {Pr} (\mathbf{y} _{\ell } = 1 \mid \mathbf{x} ) = \text {Pr} (\mathbf{y} _{\ell } = 1 \mid z_{e} = 1, \mathbf{x} ) \prod _{n \in \mathcal {A} (e)} \text {Pr} (z_{n} = 1 \mid z_{p(n)} = 1, \mathbf{x} ) \end{aligned}$$(7)
Consider the Amazon-3M dataset with \(L \approx 3 \times 10^6\); setting \(K=2\) produces a tree of depth 16. Assuming \(\text {Pr} (z_{n} = 1 \mid z_{p(n)} =1,\mathbf{x} ) =0.95\) for all \(n \in \mathcal {A} (e) \) and \(\text {Pr} (\mathbf{y} _{\ell } = 1 \mid z_{e} =1,\mathbf{x} ) =1\), this gives \(\text {Pr} (\mathbf{y} _{\ell }=1 \mid \mathbf{x} ) = (0.95)^{16} \approx 0.44\). That is, even if \(\text {Pr} (z_{n} = 1 \mid z_{p(n)} =1,\mathbf{x} ) \) is high (e.g., 0.95) at each \(n \in \mathcal {A} (e)\), multiplying these probabilities together can result in a small probability (e.g., 0.44) if the depth of the tree, i.e., \(|\mathcal {A} (e)| \), is large. We mitigate this issue by increasing \(K\), and hence limiting the propagation error.
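The depth dependence of this product is easy to tabulate; the per-edge probability 0.95 below is the same illustrative assumption as above.

```python
# Propagation error vs. tree depth: probability that a relevant label
# survives to the leaf if each edge is taken with probability 0.95.
p_edge = 0.95
for depth in (2, 4, 8, 16):
    print(depth, round(p_edge ** depth, 2))
# A shallow Bonsai tree (depth 2-3 with K ~ 100) keeps most of the mass,
# while a depth-16 binary tree retains only ~0.44 of it.
```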
Computational complexity
Training time analysis The training process can be decomposed into three steps: (i) learning the label representation, (ii) building the K-ary trees, and (iii) learning a one-vs-rest classifier at each node.
First, we assume only \(\log (L)\) labels are relevant for every data point on average. Also, let \(\tilde{D}\) be the average feature density, i.e., for dense features, \(\tilde{D} = D\). For the three variants of the label representation \(\mathbf{v} _{\ell }\) discussed in Sect. 2.1, learning the label representations requires a cost of \(\mathcal {O}(N\tilde{D}\log L)\), \(\mathcal {O}(N\log ^2L)\) and \(\mathcal {O}(N(\tilde{D}+\log L)\log L)\) for the input, output and the joint input–output space respectively.
When building the label tree, it takes \(\mathcal {O}(cKL\tilde{D})\) to cluster the labels at each level, where c is the number of iterations needed for K-means clustering to converge. Since Bonsai produces trees of small depth, which can be considered a small constant, the total time cost of this step is \(\mathcal {O}(cKL\tilde{D})\).
With the learnt label tree, K independent linear classifiers are learnt at each internal tree node which decide which child nodes a training point should traverse. Learning internal node classifiers at each level takes \(\mathcal {O}(KN\tilde{D}\log L)\). Therefore, the total time cost on learning internal node classifiers is \(\mathcal {O}(KN\tilde{D}\log L)\) since we omit the tree depth, which is a small constant.
Lastly, the one-vs-rest leaf node classifiers are trained at a cost of \(\mathcal {O}(MN\tilde{D}\log L)\), assuming that a leaf node can contain at most M labels. As we do not impose any balancedness constraints on the K-means clustering, M can be equal to L in the worst case. However, in practice M is found to be much smaller. So the overall complexity of Bonsai for training T trees is \(\mathcal {O}\left( (\frac{cKL}{N\log L} + K + M)TN\tilde{D}\log L\right) \).
Prediction time analysis The prediction process can be decomposed into two steps: (i) traversal down from the root through the intermediate nodes, (ii) label prediction at the leaf nodes.
For part (i), we use the fact that: at each level, at most \(B\) nodes are further traversed down. The time cost at each level is \(\mathcal {O}(B\tilde{D} K)\). Therefore, traversing down all levels takes \(\mathcal {O}(B\tilde{D} K)\) since tree depth is a small constant. For part (ii), at most \(B\) leaf nodes are evaluated. This step has complexity \(\mathcal {O}(B\tilde{D} M)\), assuming that a leaf node contains at most M labels. Therefore, if we predict using T trees, the total complexity is \(\mathcal {O}(T B\tilde{D} K + T B\tilde{D} M)\).
Comparison with Parabel We highlight the difference of complexity between Bonsai and Parabel.
For training, Parabel takes \(\mathcal {O}((\frac{cKL}{N} + K \log L + M) T N\tilde{D} \log L)\)^{Footnote 2} while Bonsai takes \(\mathcal {O}\left( (\frac{cKL}{N\log L} + K + M)TN\tilde{D}\log L\right) \). Bonsai differs from Parabel in three ways: (i) a factor of \(\log L\) (equal to the tree depth in Parabel) is absent from the first two terms for Bonsai, as tree depths are small constants in Bonsai; (ii) M is generally larger in Bonsai, since balancedness is not enforced in label partitioning; (iii) c is also larger in Bonsai, because Bonsai sets a larger K value during K-means clustering, which takes more iterations to converge.
For prediction, Parabel takes \(\mathcal {O}(TB\tilde{D} K \log L + T B\tilde{D} M)\) while Bonsai takes \(\mathcal {O}(T B\tilde{D} K + T B\tilde{D} M)\). The main differences are similar to the training case: (i) Bonsai gets rid of the \(\log L\) factor in the first term because Bonsai trees are shallow; (ii) meanwhile, M is generally larger in the case of Bonsai.
Though their training/prediction complexities are not directly comparable, we find that Parabel is faster in both training and prediction in practice. We therefore conclude that, in practice, the effect of the larger M and c values outweighs the absence of the \(\log L\) factor, making Bonsai slower than Parabel.
Experimental evaluation
In this section, we detail the dataset description and the setup for comparison of the proposed approach against state-of-the-art methods in XMC.
Dataset and evaluation metrics
We perform empirical evaluation on publicly available datasets from the XMC repository^{Footnote 3}, curated from sources such as Amazon for item-to-item recommendation tasks and Wikipedia for tagging tasks. Datasets of various scales in terms of the number of labels are used, ranging from EURLex-4K, consisting of approximately 4000 labels, to Amazon-3M, consisting of 3 million labels. The datasets also exhibit a wide range of properties in terms of the number of training instances, features, and labels. Though the overall feature dimensionality is high, each training instance is a tf-idf weighted sparse representation of features. The document length corresponding to each training sample can be further reduced by keeping only the highest scores, truncating at some predefined threshold (Khandagale and Babbar 2019). The detailed statistics of the datasets are shown in Table 1.
With applications in recommendation systems, ranking and web advertising, the objective of the machine learning system in XMC is to correctly recommend/rank/advertise among the top-k slots. We therefore use evaluation metrics which are standard and commonly used to compare various methods under the XMC setting—Precision@k (\(prec@k\)) and normalised Discounted Cumulative Gain (\(nDCG@k\)). Given a label space of dimensionality L, a predicted label vector \(\hat{\mathbf{y}} \in {\mathbb {R}}^L\) and a ground truth label vector \(\mathbf{y} \in \{0,1\}^L\):

$$prec@k = \frac{1}{k} \sum _{\ell \in rank_k(\hat{\mathbf{y}})} \mathbf{y}_\ell , \qquad nDCG@k = \frac{DCG@k}{\sum _{l=1}^{\min (k, \Vert \mathbf{y}\Vert _0)} \frac{1}{\log (l+1)}}$$

where \(DCG@k = \sum _{\ell \in rank_k(\hat{\mathbf{y}})} \frac{\mathbf{y}_\ell }{\log (\ell +1)}\), and \(rank_k(\hat{\mathbf{y}})\) returns the k largest indices of \(\hat{\mathbf{y}}\).
For better readability, we report the percentage versions of the above metrics (multiplying the original scores by 100), and consider \(k \in \left\{ 1,3,5\right\} \).
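As a reference, the two metrics can be sketched in a few lines (the log base cancels in the nDCG ratio, so log2 is used here; ties in scores are broken arbitrarily by the sort):

```python
import numpy as np

def prec_at_k(y_true, y_score, k):
    """prec@k: fraction of the k highest-scoring labels that are relevant."""
    topk = np.argsort(y_score)[::-1][:k]
    return float(y_true[topk].sum()) / k

def ndcg_at_k(y_true, y_score, k):
    """nDCG@k: DCG over the top-k predictions, normalised by the ideal DCG."""
    topk = np.argsort(y_score)[::-1][:k]
    dcg = sum(y_true[l] / np.log2(rank + 2) for rank, l in enumerate(topk))
    ideal = sum(1.0 / np.log2(rank + 2) for rank in range(min(k, int(y_true.sum()))))
    return dcg / ideal

y = np.array([1, 0, 1, 0, 0])             # ground truth labels
s = np.array([0.9, 0.8, 0.1, 0.05, 0.0])  # predicted scores
print(prec_at_k(y, s, 3))  # 2 of the top-3 labels are relevant -> ~0.667
```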
Methods for comparison
We consider three variants of the proposed family of algorithms, Bonsai, based on the generalized label representations (discussed in Sect. 2.1) combined with shallow tree cascades. We refer to the algorithms learnt by partitioning the input space, output space and joint space as Bonsai-i, Bonsai-o, and Bonsai-io respectively. These are compared against state-of-the-art algorithms from the three main strands of XMC, namely label-embedding, tree-based and one-vs-all methods:

Label-embedding methods Due to the fat-tailed distribution of instances among labels, SLEEC (Bhatia et al. 2015) makes a locally low-rank assumption on the label space. RobustXML (Xu et al. 2016b) decomposes the label matrix into tail and non-tail labels so as to enforce an embedding on the latter, preventing the tail labels from damaging the embedding. LEML (Yu et al. 2014) makes a global low-rank assumption on the label space and performs a linear embedding; as a result, it gives much worse results and is not compared explicitly in the interest of space.

Tree-based methods FastXML (Prabhu and Varma 2014) learns an ensemble of trees which partition the label space by directly optimizing an nDCG-based ranking loss function. PFastXML (Jain et al. 2016) replaces the nDCG loss in FastXML with its propensity-scored variant, which is unbiased and assigns higher rewards to accurate tail-label predictions. Parabel (Prabhu et al. 2018) has been described earlier in the paper.

One-vs-All methods PDSparse (Yen et al. 2016) enforces sparsity by exploiting the structure of a margin-maximizing loss with an L1 penalty. DiSMEC (Babbar and Schölkopf 2017) learns one-vs-rest classifiers for every label, with weight pruning to control model size. It may be noted that even though the actual number of true labels is unknown for the test set, evaluation metrics based on top-k predictions can still be computed for one-vs-all methods.
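The weight-pruning idea used to keep one-vs-rest models small can be sketched as follows (the threshold value is illustrative; DiSMEC's actual cutoff and storage format may differ):

```python
import numpy as np

def prune_weights(W, threshold=0.01):
    """Zero out near-zero entries to sparsify a one-vs-rest weight matrix."""
    W = W.copy()
    W[np.abs(W) < threshold] = 0.0
    return W

W = np.array([[0.53, 0.001], [-0.002, -0.27]])  # dense label-wise classifiers
print(prune_weights(W))  # only the two large discriminative weights survive
```

In practice the pruned matrix would be stored in a sparse format, which is what brings the model size down by orders of magnitude.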
Since we consider only bag-of-words representations across all datasets, we do not compare against deep learning methods explicitly. However, it may be noted that despite using raw text and the corresponding word embeddings, deep learning methods are still suboptimal in terms of prediction performance in XMC (Liu et al. 2017; Joulin et al. 2017; Kim 2014). More details on the performance of deep methods can be found in Wydmuch et al. (2018).
Bonsai is implemented in C++ on a 64-bit Linux system. For all the datasets, we set the branching factor \(K=100\) at every tree depth; we explore the effect of tree depth in detail later. This results in depth-1 trees (excluding the leaves, which represent the final labels) for smaller datasets such as EURLex-4K and Wikipedia-31K, and depth-2 trees for larger datasets such as WikiLSHTC-325K and Wikipedia-500K. Like Parabel, Bonsai learns an ensemble of three trees.
For all the other state-of-the-art approaches, we used the hyper-parameter values suggested in the respective papers in order to reproduce their reported results.
Experimental results
In this section, we report the main findings of our empirical evaluation.
Precision@k
The comparison of Bonsai against other baselines is shown in Table 2. The results are averaged over five runs with different initializations of the clustering algorithm. The important findings from these results are the following:

The competitive performance of the different variants of Bonsai shows the success and applicability of the notion of generalized label representation, and their concrete realization discussed in Sect. 2.1. It also highlights that it is possible to enrich these representations further, and achieve better partitioning.

The consistent improvement of Bonsai over Parabel on all datasets validates the choice of a higher fan-out and the advantages of shallow trees.

Another important insight from these results is that when the average number of labels per training point is higher, as in Wikipedia-31K, Amazon-670K and Amazon-3M, the joint-space label representation used in Bonsai-io leads to better partitioning and further improves on the strong performance of the input-only label representation in Bonsai-i. However, performance degrades when the average number of labels per point is low (\(\le 5\)), as in WikiLSHTC-325K and Wikipedia-500K, in which case the information captured by the input and output representations does not synergize well.

Even though DiSMEC performs slightly better on Wikipedia-500K and Wikipedia-31K, its computational cost of training and prediction is orders of magnitude higher than that of Bonsai. As a result, while Bonsai can be run in environments with limited computational resources, DiSMEC requires a distributed infrastructure for training and prediction.
Performance on tail labels
We also evaluate prediction performance on tail labels using propensity-scored variants of \(prec@k\) and \(nDCG@k\). For a label \(\ell \), its propensity \(p_{\ell }\) is related to the number of its positive training instances \(N_{\ell }\) by \(p_{\ell } = \frac{1}{1+ Ce^{-A\log (N_{\ell }+B)}}\), where A and B are application-specific parameters and \(C=(\log N - 1)(B+1)^{A}\). Here N is the total number of training samples; A and B vary across datasets and were chosen as suggested in Jain et al. (2016). With this formulation, \(p_{\ell } \approx 1\) for head labels and \(p_{\ell } \ll 1\) for tail labels.
Let \(\mathbf{y} \in \{0,1\}^L\) and \(\hat{\mathbf{y}} \in {\mathbb {R}}^L\) denote the true and predicted label vectors respectively. As detailed in Jain et al. (2016), the propensity-scored variants of \(P@k\) and \(nDCG@k\) are given by

$$PSP@k = \frac{1}{k} \sum _{\ell \in rank_k(\hat{\mathbf{y}})} \frac{\mathbf{y}_\ell }{p_\ell }, \qquad PSnDCG@k = \frac{PSDCG@k}{\sum _{l=1}^{\min (k, \Vert \mathbf{y}\Vert _0)} \frac{1}{\log (l+1)}}$$

where \(PSDCG@k := \sum _{\ell \in rank_k{(\hat{\mathbf{y}})}}{\left[ \frac{\mathbf{y}_\ell }{p_\ell \log (\ell +1)}\right] }\), and \(rank_k(\hat{\mathbf{y}})\) returns the k largest indices of \(\hat{\mathbf{y}}\). To match against the ground truth, as suggested in Jain et al. (2016), we use \(100 \cdot {\mathbb {G}}(\{\hat{\mathbf{y}}\})/{\mathbb {G}}(\{\mathbf{y}\})\) as the performance metric. For M test samples, \({\mathbb {G}}(\{\hat{\mathbf{y}}\}) = -\frac{1}{M}\sum _{i=1}^{M}{\mathbb {L}}(\hat{\mathbf{y}}_i,\mathbf{y}_i)\), where \({\mathbb {G}}(\cdot )\) and \({\mathbb {L}}(\cdot ,\cdot )\) signify gain and loss respectively. The loss \({\mathbb {L}}(\cdot ,\cdot )\) can take two forms: (i) \({\mathbb {L}}(\hat{\mathbf{y}}_i,\mathbf{y}_i) = -PSnDCG@k\), and (ii) \({\mathbb {L}}(\hat{\mathbf{y}}_i,\mathbf{y}_i) = -PSP@k\). This leads to two metrics which are sensitive to tail labels, denoted PSP\(@k\) and PSnDCG\(@k\).
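A sketch of the propensity and PSP@k computations is given below. The values \(A=0.55\), \(B=1.5\) are the defaults suggested by Jain et al. (2016) for Wikipedia-style datasets; treat them, and the function names, as assumptions:

```python
import numpy as np

def propensity(n_pos, N, A=0.55, B=1.5):
    """p_l = 1 / (1 + C * exp(-A * log(n_pos + B))), with C = (log N - 1)(B+1)^A."""
    C = (np.log(N) - 1.0) * (B + 1.0) ** A
    return 1.0 / (1.0 + C * np.exp(-A * np.log(n_pos + B)))

def psp_at_k(y_true, y_score, p, k):
    """Propensity-scored precision: average of y_l / p_l over the top-k labels."""
    topk = np.argsort(y_score)[::-1][:k]
    return float(np.sum(y_true[topk] / p[topk])) / k

# Head labels get propensity near 1, tail labels get much smaller values,
# so correct tail predictions contribute a larger gain 1/p_l:
print(propensity(10_000, N=1_000_000) > propensity(2, N=1_000_000))  # True
```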
Figure 4 shows the results w.r.t. PSP\(@k\) and PSnDCG\(@k\) among the tree-based approaches. Again, Bonsai-i shows consistent improvement over Parabel; for instance, on WikiLSHTC-325K, the relative improvement over Parabel is approximately 6.7% on PSP\(@5\). This further validates the shallow tree architecture resulting from the design choices of K-way partitioning and the flexibility to allow unbalanced partitions, which lets tail labels be assigned to partitions separate from the head labels.
Unique label coverage
We also evaluate coverage@k, denoted C@k, which is the percentage of unique ground-truth labels recovered in an algorithm's top-k predictions. Let \({\mathbf {P}} = P_1 \cup P_2 \cup \ldots \cup P_M\), where \(P_i = \{l_{i1}, l_{i2}, \ldots , l_{ik}\}\) is the set of top-k labels predicted by the algorithm for test point i and M is the number of test points. Also, let \({\mathbf {L}} = L_1 \cup L_2 \cup \ldots \cup L_M\), where \(L_i = \{g_{i1}, g_{i2}, \ldots , g_{ik}\}\) is the set of top-k propensity-scored ground truth labels for test point i. Then coverage@k is given by

$$C@k = \frac{|{\mathbf {P}} \cap {\mathbf {L}}|}{|{\mathbf {L}}|}$$
The comparison between Bonsai-i and Parabel on this metric across five datasets is shown in Table 3. It shows that the proposed method is more effective in discovering correct unique labels. These results further reinforce those of the previous section on the diversity-preserving nature of Bonsai.
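Coverage@k follows directly from the set definitions above; a minimal sketch (the function name is ours):

```python
def coverage_at_k(pred_topk, true_topk):
    """C@k: fraction of unique top-k ground-truth labels that appear in the
    union of the algorithm's top-k predictions across all test points."""
    P = set().union(*map(set, pred_topk))  # union of predicted top-k sets
    L = set().union(*map(set, true_topk))  # union of ground-truth top-k sets
    return len(P & L) / len(L)

preds = [[1, 2], [2, 3]]  # top-k predictions for two test points
truth = [[1, 4], [2, 5]]  # propensity-scored ground-truth top-k labels
print(coverage_at_k(preds, truth))  # 2 of the 4 unique true labels covered -> 0.5
```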
Impact of tree depth
We next evaluate the prediction performance of Bonsai trees with different depths. We set the fan-out parameter \(K\) accordingly to achieve the desired tree depth; for example, to partition 4000 labels into a hierarchy of depth two, we set \(K=64\).
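The fan-out implied by a target depth can be derived as the smallest K with \(K^{depth} \ge L\), assuming leaves hold roughly one label each; this reproduces the \(K=64\) example above:

```python
import math

def fanout_for_depth(L, depth):
    """Smallest integer fan-out K such that a K-ary tree of the given
    depth has at least L leaves (roughly one label per leaf)."""
    K = math.ceil(L ** (1.0 / depth))
    while K ** depth < L:  # guard against floating-point rounding of the root
        K += 1
    return K

print(fanout_for_depth(4000, 2))  # 64, since 64**2 = 4096 >= 4000
```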
In Fig. 5, we report results on three datasets, averaged over ten runs under each setting. The trend is consistent—as tree depth increases, prediction accuracy tends to drop, though the effect is less stark for Wikipedia-31K.
Furthermore, in Fig. 6, we show that the shallow architecture is integral to the success of the Bonsai family of algorithms. To demonstrate this, we plugged the label representation used in Bonsai-o into Parabel, denoted Parabel-o in the figure. As can be seen, Bonsai-o outperforms Parabel-o by a large margin, showing that shallow trees substantially reduce prediction error.
Training and prediction time
Growing shallower trees in Bonsai comes at a slight price in training time: Bonsai takes approximately 2–3× longer to train than Parabel. For instance, on a single core, Parabel takes 1 h to train on the WikiLSHTC-325K dataset, while Bonsai takes approximately 3 h. However, training can be performed offline. Unlike Parabel, Bonsai's prediction complexity does not depend logarithmically on the number of labels; nevertheless, its prediction time is typically in milliseconds, and it therefore remains practical for XMC applications with real-time constraints such as recommendation systems and advertising.
Conclusion
In this paper, we presented Bonsai, a class of algorithms for learning shallow trees for label partitioning in extreme multilabel classification. Compared to existing tree-based methods, it improves this process in two fundamental ways. First, it generalizes the notion of label representation beyond the input space, showing the efficacy of an output-space representation based on label co-occurrence, and of a joint representation combining the two. Second, it learns shallow trees which prevent error propagation in the tree cascade, thereby improving prediction accuracy and tail-label coverage. The synergy of these two ingredients enables Bonsai to retain training speed comparable to tree-based methods while achieving better prediction accuracy and significantly better tail-label coverage. As future work, the generalized label representation can be further enriched by combining it with embeddings of raw text, leading to an amalgamation of the methods studied in this paper with those based on deep learning.
Notes
 1.
We also tried \(K\)-means++ and observed that its faster convergence did not outweigh the extra computation time for seed initialization.
 2.
c is omitted in Parabel's complexity since \(K=2\) there and K-means takes only a few iterations to converge.
 3.
The extreme classification repository (Bhatia et al. 2016): http://manikvarma.org/downloads/XC/XMLRepository.html.
References
Agrawal, R., Gupta, A., Prabhu, Y., & Varma, M. (2013). Multilabel learning with millions of labels: Recommending advertiser bid phrases for web pages. In World Wide Web conference.
Babbar, R., & Schölkopf, B. (2017). Dismec: Distributed sparse machines for extreme multilabel classification. In International conference on web search and data mining (pp. 721–729).
Babbar, R., & Schölkopf, B. (2019). Data scarcity, robustness and extreme multilabel classification. Machine Learning, 108(8–9), 1329–1351.
Babbar, R., Partalas, I., Gaussier, E., & Amini, M.R. (2013). On flat versus hierarchical classification in largescale taxonomies. In Advances in neural information processing systems (pp. 1824–1832).
Babbar, R., Metzig, C., Partalas, I., Gaussier, E., & Amini, M.R. (2014). On power law distributions in largescale taxonomies. In ACM SIGKDD explorations newsletter (pp. 47–56).
Babbar, R., Partalas, I., Gaussier, E., Amini, M. R., & Amblard, C. (2016). Learning taxonomy adaptation in largescale classification. The Journal of Machine Learning Research, 17(1), 3350–3386.
Bengio, S., Weston, J., & Grangier, D. (2010). Label embedding trees for large multiclass tasks. In Neural information processing systems (pp. 163–171).
Bhatia, K., Jain, H., Kar, P., Varma, M., & Jain, P. (2015). Sparse local embeddings for extreme multilabel classification. In Neural information processing systems.
Bhatia, K., Dahiya, K., Jain, H., Prabhu, Y., & Varma, M. (2016). The extreme classification repository: Multilabel datasets and code. http://manikvarma.org/downloads/XC/XMLRepository.html.
Deng, J., Berg, A.C., Li, K., & FeiFei, L. (2010). What does classifying more than 10,000 image categories tell us? In European conference on computer vision.
Denton, E., Weston, J., Paluri, M., Bourdev, L., Fergus, R. (2015). User conditional hashtag prediction for images. In ACM SIGKDD international conference on knowledge discovery and data mining.
Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., & Lin, C. J. (2008). Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9(Aug), 1871–1874.
Fang, H., Cheng, M., Hsieh, C.J., & Friedlander, M. (2019) Fast training for largescale oneversusall linear classifiers using treestructured initialization. In Proceedings of the 2019 SIAM international conference on data mining, SIAM (pp. 280–288).
Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multilabel prediction via compressed sensing. In Advances in neural information processing systems.
Jain, H., Prabhu, Y., & Varma, M. (2016). Extreme multilabel loss functions for recommendation, tagging, ranking and other missing label applications. In ACM SIGKDD international conference on knowledge discovery and data mining.
Jalan, A., & Kar, P. (2019). Accelerating extreme classification via adaptive feature agglomeration. arXiv preprint arXiv:1905.11769.
Jasinska, K., Dembczynski, K., BusaFekete, R., Pfannschmidt, K., Klerx, T., & Hüllermeier, E. (2016). Extreme fmeasure maximization using sparse probability estimates. In International conference on machine learning.
Joly, A., Wehenkel, L., & Geurts, P. (2019). Gradient tree boosting with random output projections for multilabel classification and multioutput regression. arXiv preprint arXiv:1905.07558.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of tricks for efficient text classification. In Proceedings of the 15th conference of the European chapter of the association for computational linguistics (Vol. 2, pp. 427–431), Short Papers.
Khandagale, S., & Babbar, R. (2019) A simple and effective scheme for data preprocessing in extreme classification. In 27th European symposium on artificial neural networks, ESANN 2019, Bruges, Belgium, 24–26 April 2019.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1746–1751).
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). Imagenet classification with deep convolutional neural networks. In Neural information processing systems (pp. 1097–1105).
Liang, Y., Hsieh, C.J., & Lee, T. (2018). Blockwise partitioning for extreme multilabel classification. arXiv preprint arXiv:1811.01305.
Lin, Z., Ding, G., Hu, M., & Wang, J. (2014). Multilabel classification via featureaware implicit label space encoding. In International conference on machine learning (pp. 325–333).
Liu, J., Chang, W.C., Wu, Y., & Yang, Y. (2017). Deep learning for extreme multilabel text classification. In SIGIR, ACM (pp. 115–124).
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
Madjarov, G., Kocev, D., Gjorgjevikj, D., & Džeroski, S. (2012). An extensive experimental comparison of methods for multilabel learning. Pattern Recognition, 45(9), 3084–3104.
Majzoubi, M., & Choromanska, A. (2019). LdSM: Logarithm-depth streaming multilabel decision trees. arXiv preprint arXiv:1905.10428.
McAuley, J., & Leskovec, J. (2013). Hidden factors and hidden topics: Understanding rating dimensions with review text. In RecSys, ACM (pp. 165–172).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Neural information processing systems (pp. 3111–3119).
Partalas, I., Kosmopoulos, A., Baskiotis, N., Artieres, T., Paliouras, G., Gaussier, E., Androutsopoulos, I., Amini, M.R., & Galinari, P. (2015). Lshtc: A benchmark for largescale text classification. arXiv preprint arXiv:1503.08581.
Prabhu, Y., & Varma, M. (2014). Fastxml: A fast, accurate and stable treeclassifier for extreme multilabel learning. In ACM SIGKDD international conference on knowledge discovery and data mining, ACM (pp. 263–272).
Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., & Varma, M. (2018). Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web conference on World Wide Web (pp. 993–1002).
Read, J., Pfahringer, B., & Holmes, G. (2008). Multilabel classification using ensembles of pruned sets. In 8th IEEE international conference on data mining, IEEE (pp. 995–1000).
Shen, D., Ruvini, J.D., Somaiya, M., & Sundaresan, N. (2011). Item categorization in the ecommerce domain. In Proceedings of the 20th ACM international conference on information and knowledge management, ACM (pp. 1921–1924).
Si, S., Zhang, H., Keerthi, S.S., Mahajan, D., Dhillon, I.S., & Hsieh, C.J. (2017). Gradient boosted decision trees for high dimensional sparse output. In International conference on machine learning.
Tagami, Y. (2017). Annexml: Approximate nearest neighbor search for extreme multilabel classification. In ACM SIGKDD international conference on knowledge discovery and data mining, ACM.
Tai, F., & Lin, H. T. (2012). Multilabel classification with principal label space transformation. Neural Computation, 24(9), 2508–2542.
Tsoumakas, G., & Katakis, I. (2007). Multilabel classification: An overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3), 1–13.
Tsoumakas, G., Katakis, I., & Vlahavas, I. (2008). Effective and efficient multilabel classification in domains with large number of labels. In: Proceedings of the ECML/PKDD 2008 workshop on mining multidimensional data (MMD08).
Vens, C., Struyf, J., Schietgat, L., Džeroski, S., & Blockeel, H. (2008). Decision trees for hierarchical multilabel classification. Machine Learning, 73(2), 185.
Wei, T., & Li, Y.F. (2018). Does tail label help for largescale multilabel learning. In Proceedings of the 27th international joint conference on artificial intelligence (pp. 2847–2853), AAAI Press.
Weston, J., Bengio, S., & Usunier, N. (2011). Wsabie: Scaling up to large vocabulary image annotation. In IJCAI.
Wydmuch, M., Jasinska, K., Kuznetsov, M., BusaFekete, R., & Dembczynski, K. (2018). A noregret generalization of hierarchical softmax to extreme multilabel classification. In Advances in neural information processing systems (pp. 6355–6366).
Xu, C., Tao, D., & Xu, C. (2016a). Robust extreme multilabel learning. In Proceedings of the ACM SIGKDD international conference on Knowledge discovery and data mining.
Xu, C., Tao, D., & Xu, C. (2016b). Robust extreme multilabel learning. In ACM SIGKDD international conference on knowledge discovery and data mining, ACM (pp. 1275–1284).
Yen, I.E., Huang, X., Dai, W., Ravikumar, P., Dhillon, I., & Xing, E. (2017). Ppdsparse: A parallel primaldual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, ACM (pp. 545–553).
Yen, I.E.H., Huang, X., Ravikumar, P., Zhong, K., & Dhillon, I. (2016). Pdsparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In International conference on machine learning (pp. 3069–3077).
Yu, H.F., Jain, P., Kar, P., & Dhillon, I. (2014) Largescale multilabel learning with missing labels. In International conference on machine learning (pp. 593–601).
Zhang, M. L., Li, Y. K., Liu, X. Y., & Geng, X. (2018). Binary relevance for multilabel learning: An overview. Frontiers of Computer Science, 12(2), 191–202.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editors: Larisa Soldatova, Joaquin Vanschoren.
About this article
Cite this article
Khandagale, S., Xiao, H. & Babbar, R. Bonsai: diverse and shallow trees for extreme multilabel classification. Mach Learn 109, 2099–2119 (2020). https://doi.org/10.1007/s10994-020-05888-2
Keywords
 Large-scale multilabel classification
 Extreme multilabel classification
 Large label space