Bonsai - Diverse and Shallow Trees for Extreme Multi-label Classification

Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands, or even millions, of labels. In this paper, we develop a shallow tree-based algorithm, called Bonsai, which promotes diversity of the label space and easily scales to millions of labels. Bonsai relaxes the two main constraints of the recently proposed tree-based algorithm Parabel, which partitions labels at each tree node into exactly two child nodes and imposes label balanced-ness between these nodes. Instead, Bonsai encourages diversity in the partitioning process by (i) allowing a much larger fan-out at each node, and (ii) enabling potentially imbalanced partitioning, which further preserves the diversity of the label set. This flexibility achieves the best of both worlds: the fast training of tree-based methods, and prediction accuracy better than Parabel and on par with one-vs-rest methods. As a result, Bonsai outperforms state-of-the-art one-vs-rest methods such as DiSMEC in terms of prediction accuracy, while being orders of magnitude faster to train. The code for Bonsai is available at https://github.com/xmc-aalto/bonsai.


Introduction
Extreme Multi-label Classification (XMC) refers to supervised learning of a classifier which can automatically label an instance with a small subset of relevant labels from an extremely large set of all possible target labels. Machine learning problems consisting of hundreds of thousands of labels are common in various domains such as product categorization for e-commerce [19,23,8,1], hash-tag suggestion in social media [13], annotating web-scale encyclopedias [22,20], and image classification [17,12]. It has been demonstrated that, in addition to automatic labelling, the framework of XMC can be leveraged to effectively address learning problems arising in bid-phrase suggestion in web-advertising and recommendation systems [1,22]. In the scenario of ad-display, by treating each query as a label, automatic prediction of potentially monetizable bid-phrases can be made in response to an advertisement. The growing significance of XMC in web-scale data mining is further highlighted by dedicated workshops in premier machine learning and data mining conferences (cf. workshops on extreme classification at NIPS 2015-2017, WWW 2018 and Dagstuhl 2018 [7]). (Both Sujay and Han have equal contribution.)
From the machine learning perspective, building effective extreme classifiers faces computational challenges arising due to the large number of (i) output labels, (ii) input training instances, and (iii) input features. Another important statistical characteristic of the datasets in XMC is that a large fraction of labels are tail labels, i.e., those which have very few training instances belonging to them (also referred to as a power-law, fat-tailed distribution, or Zipf's law). This distribution is shown in Figure 1 for a benchmark dataset, WikiLSHTC-325K, from the XMC repository [9]. In this dataset, only ∼150,000 out of 325,000 labels have more than 5 training instances. Tail labels exhibit the diversity of the label space, and contain informative content not captured by the head or torso labels. Indeed, an algorithm can achieve high accuracy by predicting the head labels well while omitting most of the tail labels. However, such behavior is not desirable in many real-world applications where a fit to the power-law distribution has been observed [2].

Related work
Recently, and somewhat surprisingly, one-vs-all methods [5,6,26] have been shown to yield better performance compared to approaches based on label-embedding schemes [11,10,25,24] and tree-based methods [22,15,16,21]. Label-embedding methods usually assume that the intrinsic dimensionality of the label space is of much lower rank than the number of labels. Tree-based methods, by partitioning the label or feature space, have the computational advantage of enabling faster training and prediction. On the other hand, one-vs-all methods require distributed training infrastructure for scalable learning when the label set is of the order of millions. Therefore, a central challenge in XMC is to build classifiers which retain the accuracy of the one-vs-rest paradigm while being as efficiently trainable as tree-based methods.

Figure 1: Label frequency in the WikiLSHTC-325K dataset shows a power-law distribution. The X-axis shows the label IDs sorted by their frequency in training instances and the Y-axis gives the actual frequency (on log-scale). Note that more than half of the labels have fewer than 5 training instances.
To achieve the best of both worlds, a recent tree-based method, Parabel [21], partitions the label space recursively into two child nodes using 2-means clustering. It also maintains balance between these two label partitions in terms of the number of labels. Each intermediate node in the resulting binary label tree acts as a meta-label which captures the generic properties of its constituent labels. The leaves of the tree consist of the actual labels from the training data. During training and prediction, each of these labels is distinguished from the other labels under the same parent node through the application of binary classifiers at internal nodes and one-vs-all classifiers at the leaf nodes.
By combining tree-based partitioning with one-vs-rest classifiers, Parabel has been shown to give better performance than previous tree-based methods [22,15,16] while simultaneously allowing efficient training. However, in terms of prediction performance, Parabel remains sub-optimal compared to one-vs-rest approaches. In addition to error propagation due to the cascading effect of deep trees, its performance degenerates on tail labels, as also observed in hierarchical classification [3,4]. This is the result of two strong constraints in its label partitioning process: (i) each parent node in the tree has only two child nodes, and (ii) at each node, the labels are partitioned into equal-sized parts, such that the numbers of labels under the two child nodes differ by at most one. As a result of the coarseness imposed by the binary partitioning of labels, the tail labels get subsumed by the head labels.

Bonsai overview
In this work, we propose Bonsai for learning shallow trees which promote the diversity of tail labels in XMC. By relaxing the two main constraints in Parabel, the proposed algorithm, Bonsai: (i) allows a much larger fan-out when partitioning the label space at each intermediate node, and (ii) does not impose strict balanced-ness constraints on the partitions. Firstly, these relaxations enable the learning of shallower trees. For instance, the tree depths compare as 2 in Bonsai versus 7 in Parabel for the benchmark EURLex-4K dataset, and 3 versus 13 for the WikiLSHTC-325K dataset. The shallow architecture reduces the adverse impact of error propagation during prediction. Secondly, and more significantly, allowing a large number of partitions with flexible sizes tends to help the tail labels, since they can form a separate partition if their feature composition does not match that of the other labels. This is in contrast to Parabel, in which all the tail labels are forced to choose one of the two larger label partitions.
Our experiments demonstrate that Bonsai leads to consistent improvement over Parabel in terms of precision@k and nDCG@k metrics, and significantly outperforms it on metrics which are sensitive to tail labels. Furthermore, we show that Bonsai also outperforms DiSMEC [5], a state-of-the-art one-vs-rest method, on four out of six datasets, while being up to two orders of magnitude faster to train.
The motivation behind Bonsai can be likened to the democratic process in countries with a multi-party system, where parties have different ideologies and different sizes. In contrast, Parabel's partitioning resembles a two-party system with parties of almost equal size: all labels are attached to one of the two partitions, thereby undermining their individual identity.

Proposed Algorithm: Bonsai
In this section, we formally present our method Bonsai, which learns shallow trees, and elucidate its differences compared to the Parabel algorithm.

Figure 2: Illustration of the Bonsai architecture. During training, labels are partitioned hierarchically, resulting in a tree structure of label partitions (child node c holds the label set S_c). In order to obtain diverse and shallow trees, the branching factor K is set to ≥ 100 in Bonsai (shown as 3 for better pictorial illustration). This is in contrast to Parabel, where it is set to 2, leading to binary label partitions and hence much deeper trees. Inside non-leaf nodes, linear classifiers are trained to predict which child nodes to traverse down during prediction. Inside leaf nodes, linear classifiers are trained to predict the actual labels.
At the root, Bonsai partitions the label set {1, . . . , L} into K disjoint subsets {S_k}_{k=1}^{K}, where K is relatively large (e.g., ≥ 100). Then, K child nodes of the root are created, each containing one of the partitions S_k. The same process is repeated on the newly-created K child nodes recursively. The recursion terminates either when a node's depth exceeds a pre-defined threshold d_max or when the number of associated labels is no larger than K, i.e., |S_k| ≤ K. The tree-structured architecture of Bonsai is illustrated in Figure 2 and is detailed below.

Clustering step. Formally, without loss of generality, we assume a non-leaf node n has labels {1, . . . , L}. We aim at finding K cluster centers c_1, . . . , c_K by optimizing the following:

(2.1)    min_{c_1, . . . , c_K} Σ_{l=1}^{L} min_{k ∈ {1, . . . , K}} d(v_l, c_k),

where d(·, ·) represents a distance function and v_l represents the D-dimensional vector representation of the label l. The distance function is defined in terms of the dot product as follows:

d(v_l, c_k) = 1 − (v_l^T c_k) / (‖v_l‖ ‖c_k‖).

The above problem is NP-hard, and we use the standard K-means algorithm (also known as Lloyd's algorithm) [18] to find an approximate solution to equation (2.1).

Diversity promoting label partitioning. Bonsai learns the label partitions while promoting diversity among partitions by taking a more flexible approach. This is achieved by the following key ingredients:

1. K-way partitioning: By setting K > 2, Bonsai allows a varied partitioning of the label space, rather than grouping all labels into two clusters as in Parabel. This facet of Bonsai is especially favorable for tail labels, allowing them to form separate clusters if they are indeed very different from the rest of the labels. Depending on its similarity to other labels, each label can choose to be part of one of the K clusters.
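The clustering step above can be sketched in a few lines. The sketch below assumes (as is common for label trees) that each label's representation v_l is the unit-normalized sum of the feature vectors of the training instances tagged with that label; the farthest-first initialization is an illustrative choice to keep the sketch deterministic, not a claim about Bonsai's implementation.

```python
import numpy as np

def label_representations(X, Y):
    """v_l: sum of the feature vectors of training instances tagged with
    label l, normalized to unit length.  X: (N, D) features, Y: (N, L)
    binary label indicator matrix."""
    V = Y.T @ X                                        # (L, D)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V / np.maximum(norms, 1e-12)

def kmeans_partition(V, K, n_iter=20):
    """Lloyd's algorithm under the dot-product distance d(u, c) = 1 - u.c
    (rows of V are unit norm).  Returns a label -> cluster assignment."""
    # Deterministic farthest-first seeding of the K centers.
    centers = [V[0]]
    for _ in range(1, K):
        dist = np.min(1.0 - V @ np.array(centers).T, axis=1)
        centers.append(V[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        assign = np.argmax(V @ centers.T, axis=1)      # max dot = min distance
        for k in range(K):
            members = V[assign == k]
            if len(members):
                c = members.sum(axis=0)
                centers[k] = c / max(np.linalg.norm(c), 1e-12)
    return assign
```

With K > 2, labels whose representations point in an unusual direction (typically tail labels) can end up in their own cluster rather than being absorbed into one of two halves.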
2. Data-dependent partitioning: Unlike Parabel, Bonsai does not enforce a balanced-ness constraint of the form ||S_i| − |S_j|| ≤ 1 (where the | · | operator is overloaded to mean set cardinality for the inner terms and absolute value for the outer ones) among the partitions. This makes the Bonsai partitions more data-dependent, since small partitions with diverse tail labels are only very moderately penalized under this framework. Even though this approach can produce potentially unbalanced partitionings, it preserves the diverse nature of the partitioning.

Figure 3: Bonsai (K = 16, tree depth 2) versus Parabel (K = 2, tree depth 6). Each circle corresponds to one label partition (also a tree node); the size of the circle indicates the number of labels in that partition and lighter color indicates a deeper node level. The largest circle is the whole label space. Note that Bonsai produces label partitions of varying sizes, while Parabel gives perfectly balanced partitioning.
As can be observed, the first ingredient initiates a diverse K-way partitioning, and the second maintains the diversity. As we will show in Section 3, the diverse partitioning of Bonsai leads to shallow trees with better prediction performance, and significant improvement is achieved on tail labels. These differences are also illustrated pictorially in Figure 3.

Error propagation in Parabel. Parabel also adopts the "label tree" paradigm (as illustrated in Figure 2). However, it makes the following choices: (i) the branching factor K is fixed to two, i.e., the label set at each internal node is partitioned into two subsets; (ii) an additional constraint ||S_1| − |S_2|| ≤ 1 is imposed. This constraint implies that the label hierarchy is balanced.
The first consequence is that tail labels are forced to join one of the two partitions, even though they may be quite different from the rest of the labels in either partition. As a result, these tail labels are subsumed by the head labels, which harms prediction performance. We validate this effect later in the experimental evaluation.
Secondly, setting K = 2 results in trees of large depth, making Parabel more prone to error propagation, which we also validate in the experimental evaluation. To see the intuition behind this argument, we use the theoretical result from [21]. Given a data point x and a label l that is relevant to x, we denote by e the leaf node that l belongs to and by A(e) the set consisting of the ancestor nodes of e and e itself. Note that |A(e)| is the path length from the root to e. Denote the parent of n by p(n). We define the binary indicator variable z_n to take value 1 if node n is visited during prediction and 0 otherwise. Then, Theorem 3.2 in [21] states that the probability that l is predicted as relevant for x is:

Pr(y_l = 1 | x) = Pr(y_l = 1 | z_e = 1, x) · Π_{n ∈ A(e)} Pr(z_n = 1 | z_{p(n)} = 1, x).

Consider the Amazon-3M dataset with L ≈ 3 × 10^6, where setting K = 2 produces a tree of depth 16. Assuming Pr(z_n = 1 | z_{p(n)} = 1, x) = 0.95 for all n ∈ A(e) and Pr(y_l = 1 | z_e = 1, x) = 1, we get Pr(y_l = 1 | x) = (0.95)^16 ≈ 0.44. That is, even if Pr(z_n = 1 | z_{p(n)} = 1, x) is high (e.g., 0.95) at each n ∈ A(e), multiplying these probabilities together can result in a small probability (e.g., 0.44) if |A(e)| is large. We choose to mitigate this issue by increasing K.

Algorithm 1 Training algorithm: grow(n, K, d_max) partitions the label space recursively and returns the K children nodes of n. K-means(L, K) partitions the label set L into K disjoint sets using the standard K-means algorithm. Label features are derived from the training data I. one-vs-all(I, {l_1, . . . , l_K}) learns K one-vs-rest linear classifiers.
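The depth argument above can be made concrete with a back-of-the-envelope calculation; the helpers below are illustrative, and the depth formula is one common counting convention (conventions vary by ±1 depending on whether the leaf level is counted).

```python
import math

def path_success_prob(p_edge, depth):
    """Pr(y_l = 1 | x) when every edge decision on the root-to-leaf path
    succeeds independently with probability p_edge."""
    return p_edge ** depth

def tree_depth(num_labels, fanout, leaf_size):
    """Number of internal levels needed so that fanout**depth leaves of
    size leaf_size can hold all labels."""
    return math.ceil(math.log(num_labels / leaf_size, fanout))
```

For Amazon-3M-scale data with leaves of ~100 labels, a binary tree needs about 15 internal levels (0.95^15 ≈ 0.46), while a K = 100 tree needs about 3 (0.95^3 ≈ 0.86), which is why increasing K curbs error propagation.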

Learning node classifiers
Once the label space is partitioned into a diverse and shallow tree structure, we learn a K-way one-vs-all linear classifier at each node. These classifiers are trained independently, using only the training examples that have at least one of the node's labels. We distinguish leaf nodes and non-leaf nodes as follows: (i) for non-leaf nodes, K linear classifiers are learned separately, each mapping to one of the K children; during prediction, the output of each classifier determines whether the test point should traverse down the corresponding child; (ii) for leaf nodes, the classifiers learn to predict the actual labels of the node. Without loss of generality, given a node n, denote by {l_1, . . . , l_K} the set of its children. For the special case of leaf nodes, the children represent the final labels. We learn K linear classifiers parameterized by {w_1, . . . , w_K}, where w_l ∈ R^D for l = 1, . . . , K. The output for each of the K children determines whether the corresponding child should be traversed or not.
For each label l ∈ {1, . . . , K}, we define the training data as T_l = (X_l, s_l), where X_l = {x_i | i = 1, . . . , N} and s_l ∈ {+1, −1}^N represents the vector of signs depending on whether y_il = 1 or not. We consider the following optimization problem for learning a linear SVM with squared hinge loss and ℓ2-regularization:

min_{w_l} ‖w_l‖_2^2 + C Σ_{i=1}^{N} L(s_il · w_l^T x_i),

where L(z) = (max(0, 1 − z))^2. This is solved using the Newton-method-based primal implementation in LIBLINEAR [14]. To restrict the model size and remove spurious parameters, thresholding of small weights is performed as in [5].
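A minimal sketch of the node-classifier objective is given below. It fits the same squared-hinge, ℓ2-regularized objective with plain gradient descent rather than LIBLINEAR's Newton solver, so it is a didactic stand-in, not the implementation; the (C/N) scaling and the pruning threshold are illustrative choices.

```python
import numpy as np

def train_node_classifier(X, s, C=1.0, lr=0.1, epochs=300):
    """Linear SVM with squared hinge loss L(z) = max(0, 1 - z)**2 and
    l2-regularization, fit with plain gradient descent.
    X: (N, D) features, s: (N,) signs in {+1, -1}."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        margin = s * (X @ w)
        slack = np.maximum(0.0, 1.0 - margin)
        # gradient of 0.5*||w||^2 + (C/N) * sum_i max(0, 1 - s_i w.x_i)^2
        grad = w - (2.0 * C / N) * (X.T @ (s * slack))
        w -= lr * grad
    return w

def prune_weights(w, threshold=0.01):
    """Zero out near-zero entries to shrink the stored model, in the
    spirit of the weight thresholding of [5]."""
    w = w.copy()
    w[np.abs(w) < threshold] = 0.0
    return w
```

Since most labels are irrelevant at any node, the pruned weight vectors are extremely sparse, which is what keeps the overall model size manageable.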
The details of Bonsai's training procedure are shown in Algorithm 1. The partitioning process of Section 2.1 is described by the procedure grow in the algorithm, and the one-vs-all training step is shown as one-vs-all.
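The recursive structure of Algorithm 1 can be sketched as follows. The clustering and classifier-training steps are abstracted behind a `partition` callable (a deterministic round-robin stand-in here, where Bonsai uses K-means), so this shows only the tree construction and its termination conditions.

```python
def grow(labels, K, depth, d_max, partition):
    """Sketch of Algorithm 1's tree construction: recursively split
    `labels` with `partition(labels, K)` until a node is deep enough
    (depth >= d_max) or small enough (<= K labels); leaves keep the
    actual labels.  The one-vs-all classifier training is omitted."""
    if depth >= d_max or len(labels) <= K:
        return {"leaf": True, "labels": labels}
    parts = partition(labels, K)
    return {"leaf": False,
            "children": [grow(p, K, depth + 1, d_max, partition) for p in parts]}

def round_robin(labels, K):
    """Deterministic stand-in for the K-means clustering step."""
    return [labels[i::K] for i in range(K) if labels[i::K]]
```

For example, 1,000 labels with K = 10 yield a tree of depth 2: the root splits into 10 nodes of 100 labels, each of which splits into 10 leaves of 10 labels.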

Prediction in shallow trees
During prediction, a test point x traverses down the tree. At each non-leaf node, the classifier narrows down the search space by deciding which subset of child nodes x should further traverse. If the classifier decides not to traverse down some child node c, none of the descendants of c are traversed. When x reaches one or more leaf nodes, one-vs-all classifiers are evaluated to assign probabilities to each label. To scale up, both Parabel and Bonsai use beam search to avoid evaluating all nodes; in our experiments, we set the beam size to 10. The above search-space pruning strategy implies that errors made at non-leaf nodes propagate to their descendants. As discussed in Section 2.1, deeper trees suffer more from this issue. Bonsai sets a relatively large value for the branching factor K (typically 100), resulting in much shallower trees compared to Parabel, and hence significantly reducing error propagation, particularly for tail labels, which represent the diversity of the label space.
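The beam-search traversal can be sketched as below. The tree representation and the per-node `scorer` callable (returning one probability per child at non-leaf nodes, and a {label: probability} dict at leaves) are hypothetical stand-ins for the trained linear classifiers.

```python
import heapq

def beam_search_predict(root, x, beam_size=10):
    """Level-by-level beam search over the label tree: keep only the
    beam_size highest-probability nodes per level; leaves contribute
    (label, path probability) scores for the test point x."""
    frontier = [(1.0, root)]
    scores = {}
    while frontier:
        candidates = []
        for prob, node in frontier:
            if node["leaf"]:
                for label, p in node["scorer"](x).items():
                    scores[label] = max(scores.get(label, 0.0), prob * p)
            else:
                for child, p in zip(node["children"], node["scorer"](x)):
                    candidates.append((prob * p, child))
        # prune: everything outside the beam (and its descendants) is skipped
        frontier = heapq.nlargest(beam_size, candidates, key=lambda t: t[0])
    return scores
```

Note how pruning a node silently discards all labels beneath it, which is exactly the error-propagation mechanism that shallow trees mitigate: fewer levels means fewer chances for a relevant subtree to fall out of the beam.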

Experimental Evaluation
In this section, we describe the datasets and the setup for comparing the proposed approach against state-of-the-art methods in XMC.

Dataset and evaluation metrics
We perform empirical evaluation on publicly available datasets from the XMC repository, curated from sources such as Amazon for item-to-item recommendation tasks and Wikipedia for tagging tasks. Datasets of various scales in terms of the number of labels are used, ranging from EURLex-4K, consisting of approximately 4,000 labels, to Amazon-3M, consisting of 3 million labels. The datasets also exhibit a wide range of properties in terms of the number of training instances, features, and labels. The detailed statistics of the datasets are shown in Table 1.
With applications in recommendation systems, ranking, and web-advertising, the objective of the machine learning system in XMC is to correctly recommend/rank/advertise among the top-k slots. We therefore use evaluation metrics which are standard and commonly used to compare methods under the XMC setting: Precision@k (prec@k) and normalised Discounted Cumulative Gain (nDCG@k). Given a label space of dimensionality L, a predicted label vector ŷ ∈ R^L, and a ground truth label vector y ∈ {0, 1}^L:

prec@k := (1/k) Σ_{l ∈ rank_k(ŷ)} y_l,

nDCG@k := DCG@k / Σ_{l=1}^{min(k, ‖y‖_0)} 1/log(l + 1),  where DCG@k := Σ_{l ∈ rank_k(ŷ)} y_l / log(l + 1),

and rank_k(ŷ) returns the k largest indices of ŷ. (The XMC repository is available at http://manikvarma.org/downloads/XC/XMLRepository.html.)
For better readability, we report the percentage versions of the above metrics (multiplying the original scores by 100). In addition, we consider k ∈ {1, 3, 5}.
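The two metrics above can be computed directly. The sketch below uses log base 2 for the discount; the base is a convention, and the repository's own evaluation scripts are authoritative for reported numbers.

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """prec@k: fraction of the k highest-scored labels that are relevant."""
    topk = np.argsort(-y_score)[:k]
    return float(y_true[topk].sum()) / k

def ndcg_at_k(y_true, y_score, k):
    """nDCG@k: DCG over the top-k ranked labels, with gain y_l / log2(r + 1)
    at 1-based rank r, normalized by the best achievable DCG@k."""
    topk = np.argsort(-y_score)[:k]
    dcg = sum(y_true[l] / np.log2(r + 2) for r, l in enumerate(topk))
    ideal = sum(1.0 / np.log2(r + 2)
                for r in range(min(k, int(y_true.sum()))))
    return dcg / ideal
```

Note that prec@k only counts hits in the top k, while nDCG@k also rewards placing hits higher within those k slots.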

Methods for comparison
We compare Bonsai against nine algorithms, which include three state-of-the-art methods from each of the three main strands of XMC, namely label-embedding, tree-based, and one-vs-all methods:

• Label-embedding methods: LEML [28] makes a global low-rank assumption on the label space and performs a linear embedding of the label space; SLEEC [10] makes a locally low-rank assumption on the label space; RobustXML [25] decomposes the label matrix into tail labels and non-tail labels so as to enforce an embedding on the latter without the tail labels damaging the embedding.
• Tree-based methods: FastXML [22] learns an ensemble of trees which partition the label space by directly optimizing an nDCG-based ranking loss function; PFastXML [15] replaces the nDCG loss in FastXML by its propensity-scored variant, which is unbiased and assigns higher rewards for accurate tail-label predictions; Parabel [21] has been described earlier in the paper.

Table 2: prec@k (P@k) on benchmark datasets for k = 1, 3 and 5. For each case of P@k and dataset, the best score is highlighted in bold. Entries marked "-" imply the corresponding method could not scale to the particular dataset, thus the scores are unavailable.
• One-vs-All methods: PD-Sparse [27] enforces sparsity by exploiting the structure of a margin-maximizing loss with L1-penalty; DiSMEC [5] learns one-vs-rest classifiers for every label with weight pruning to control model size; CCD-L1 is an in-built sparse solver in the LIBLINEAR package.
Bonsai is implemented in C++ on a 64-bit Linux system. For all the datasets, we set the branching factor K = 100 at every tree depth. We explore the effect of tree depth in detail later. This results in depth-2 trees for smaller datasets such as EURLex-4K and Wikipedia-31K, and depth-3 trees for larger datasets such as WikiLSHTC-325K and Wikipedia-500K. Bonsai learns an ensemble of three trees, similar to Parabel. The code for Bonsai is available at https://github.com/xmc-aalto/bonsai. For the other approaches, the results were reproduced as suggested in the respective papers.

Experimental results
In this section, we report the main findings of our empirical evaluation, presenting results on the performance metrics described in the following sections.

Precision@k
The comparison of Bonsai against the other baselines is shown in Table 2. The important findings from these results are the following:

• Bonsai shows consistent improvement over Parabel on all datasets, validating Bonsai's choice of relaxing the constraints made by Parabel.
• Bonsai achieves the best prec@k on 4 out of 6 datasets, thereby presenting a very competitive baseline with improved performance.
• Even though DiSMEC performs somewhat better on Wikipedia-500K and Wikipedia-31K, its computational complexity of training and prediction is orders of magnitude higher than Bonsai's. As a result, while Bonsai can be run in environments with limited computational resources, such as a desktop machine with a few cores, DiSMEC requires a distributed infrastructure for training and prediction.

Performance on tail labels
We also evaluate prediction performance on tail labels using propensity-scored variants of prec@k and nDCG@k. For a label l, its propensity p_l is related to the number of its positive training instances N_l by p_l = 1/(1 + C e^{−A log(N_l + B)}), where A, B and C are constants set as in [15]. With this formulation, p_l ≈ 1 for head labels and p_l ≪ 1 for tail labels. Let y ∈ {0, 1}^L and ŷ ∈ R^L denote the true and predicted label vectors respectively. As detailed in [15], the propensity-scored variants of P@k and nDCG@k are given by

PSP@k := (1/k) Σ_{l ∈ rank_k(ŷ)} y_l / p_l,

PSnDCG@k := PSDCG@k / Σ_{l=1}^{k} 1/log(l + 1),

where PSDCG@k := Σ_{l ∈ rank_k(ŷ)} y_l / (p_l log(l + 1)), and rank_k(ŷ) returns the k largest indices of ŷ.

Figure 4: Comparison of prec_wt@k (top row) and nDCG_wt@k (bottom row) over tree-based methods on WikiLSHTC-325K, Wikipedia-500K and Amazon-3M. The reported metrics capture prediction performance over tail labels. Linear methods such as ProXML [6] and DiSMEC [5] still remain the best on this metric.
To compare against the ground truth, as suggested in [15], we use 100 · G({ŷ})/G({y}) as the performance metric. For M test samples, G({ŷ}) = −(1/M) Σ_{i=1}^{M} L(ŷ_i, y_i), where G(·) and L(·, ·) signify gain and loss respectively. The loss L(·, ·) can take two forms: (i) L(ŷ_i, y_i) = −PSnDCG@k, and (ii) L(ŷ_i, y_i) = −PSP@k. This leads to two metrics which are sensitive to tail labels, denoted prec_wt@k and nDCG_wt@k. Figure 4 shows the results w.r.t. prec_wt@k and nDCG_wt@k. Again, Bonsai shows consistent improvement over Parabel. For instance, on WikiLSHTC-325K, the relative improvement over Parabel is approximately 6.7% on prec_wt@5. This further validates the design choices in Bonsai, which allow tail labels to be assigned to different partitions than the head ones. Linear methods such as ProXML [6] and DiSMEC [5] still remain the best on this metric.
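The effect of propensity weighting is easy to see in code: each correct top-k prediction is divided by its label's propensity, so a tail-label hit (small p_l) counts for much more than a head-label hit. The sketch below assumes propensities are precomputed, and omits the 100 · G({ŷ})/G({y}) normalization against the best achievable score.

```python
import numpy as np

def psp_at_k(y_true, y_score, propensities, k):
    """PSP@k = (1/k) * sum over the top-k predicted labels of y_l / p_l.
    y_true: binary ground truth, y_score: predicted scores,
    propensities: per-label p_l in (0, 1]."""
    topk = np.argsort(-y_score)[:k]
    return float(np.sum(y_true[topk] / propensities[topk])) / k
```

A correct prediction on a label with p_l = 0.2 contributes 5 to the sum, versus 1 for a head label with p_l = 1, which is why these metrics expose an algorithm's treatment of tail labels.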

Unique label coverage
We also evaluate coverage@k, denoted C@k, which is the normalized fraction of unique labels covered by an algorithm's top-k predictions. Let P = P_1 ∪ P_2 ∪ ... ∪ P_M, where P_i = {l_i1, l_i2, ..., l_ik} is the set of top-k labels predicted by the algorithm for test point i and M is the number of test points. Also, let L = L_1 ∪ L_2 ∪ ... ∪ L_M, where L_i = {g_i1, g_i2, ..., g_ik} is the set of top-k propensity-scored ground truth labels for test point i. Then coverage@k is given by

C@k := |P ∩ L| / |L|.

The comparison on five different datasets shows that Bonsai outperforms Parabel in discovering correct unique labels. These results confirm the diversity-preserving property of Bonsai, which results from its flexible label partitioning, as opposed to the strictly binary label partitions induced by Parabel.
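The set operations defining C@k translate directly into code:

```python
def coverage_at_k(predicted_topk, ground_truth_topk):
    """C@k = |P ∩ L| / |L|, where P and L are the unions of each test
    point's predicted and (propensity-scored) ground-truth top-k labels.
    Both arguments are lists of per-test-point label sets."""
    P = set().union(*predicted_topk)
    L = set().union(*ground_truth_topk)
    return len(P & L) / len(L)
```

Because the unions collapse duplicates, an algorithm that keeps predicting the same few head labels scores poorly here even if its prec@k is high, which is what makes C@k a diversity measure.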

Impact of tree depth
We next evaluate prediction performance produced by Bonsai trees with different depth values. We set the fan-out parameter K appropriately to achieve the desired tree depth. For example, to partition 4000 labels into a hierarchy of depth 2, we set K = 64.
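The fan-out needed for a target depth follows from K^depth ≥ L; the helper below is a hypothetical illustration of this calculation, using one counting convention (leaves holding single labels).

```python
import math

def fanout_for_depth(num_labels, depth):
    """Smallest branching factor K with K**depth >= num_labels,
    i.e. K = ceil(num_labels ** (1/depth))."""
    k = math.ceil(num_labels ** (1.0 / depth))
    while k ** depth < num_labels:  # guard against floating-point rounding
        k += 1
    return k
```

This reproduces the example in the text: for 4,000 labels and depth 2, K = 64 (since 64² = 4096 ≥ 4000).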
In Figure 5, we report the results on 3 datasets, averaged over 10 runs under each setting. The trend is consistent: as tree depth increases, prediction accuracy tends to drop, though the effect is less pronounced for Wikipedia-31K.

Training and prediction time
Growing shallower trees in Bonsai comes at a slight price in terms of training time: we observed that Bonsai leads to approximately a 3x increase in training time compared to Parabel. For instance, on three cores, Parabel takes one hour to train on the WikiLSHTC-325K dataset, while Bonsai takes approximately three hours. However, the training process can be performed offline. Unlike Parabel, Bonsai's prediction complexity does not depend logarithmically on the number of labels; nevertheless, its prediction time is typically in milliseconds, and hence it remains practical in applications with real-time constraints such as recommendation systems and advertising.

Conclusion
In this paper, we presented Bonsai, a novel algorithm for extreme multi-label classification which learns diversity-promoting shallow trees. We highlighted the two main constraints imposed by Parabel, which learns deep and balanced trees, and showed that trees learnt by Parabel are prone to degradation in prediction performance due to the forceful aggregation of head and tail labels into generic partitions and longer decision paths. Bonsai relaxes these two constraints and achieves the best of both worlds: while maintaining fast training comparable to tree-based methods, it achieves comparable or better accuracy than state-of-the-art one-vs-all methods.