1 Introduction

Extreme multi-label text classification (XMTC) is the task of annotating unseen text with relevant labels drawn from an extremely large label set. XMTC is widely applied to recommendation systems [1], patent classification [2] and search engines [3]. Different from traditional multi-label classification tasks, both the number of labels and the number of instances in the XMTC task are extremely large. Therefore, the XMTC task suffers from two challenges: (1) because of the extremely large output space, the training time and memory consumption of the model are excessive; and (2) a high proportion of tail labels (labels with few relevant instances) leads to severe label sparsity.

Traditional methods for solving the XMTC task can be roughly divided into three main categories: one-versus-all methods, embedding-based methods and tree-based methods. Similar to the straightforward strategy for multi-label classification, one-versus-all methods learn a subclassifier for each label. However, the computational complexity of such methods is very high due to the large-scale label sets. Embedding-based methods assume that the label matrix is low-rank and project the original label space into a low-dimensional subspace to reduce the complexity of the problem. Due to the large proportion of tail labels in the XMTC setting, the low-rank assumption is violated, leading to lower accuracy. Tree-based methods decompose the original problem into multiple subproblems by partitioning the label space. Existing tree-based methods exploit only the semantic label information, which easily leads to error propagation.

Deep learning methods are currently the most effective techniques for solving XMTC tasks. These methods integrate feature extraction and classification into an end-to-end framework, thereby achieving superior performance. However, powerful text embeddings require deep network architectures with a large number of parameters. In XMTC scenarios, many labels have only a few positive instances, so training such massive models is a formidable task. Pre-trained Transformer models [4,5,6], despite their large number of parameters, are pre-trained on large-scale corpora in an unsupervised manner and therefore provide better parameter initialization. X-Transformer [7] first applied the Transformer model to the XMTC task and achieved excellent performance, and many variants [8,9,10] have been proposed in recent years. Correlations between labels are ubiquitous in the XMTC task. Xun et al. [11] introduced the CorNet block, which uses label correlation to enhance label predictions and can be easily integrated with various other XMTC methods. However, the CorNet framework ignores the co-occurrence information between labels. BGNN-XML [12] employs co-occurrence correlations between labels to partition the label space, but fails to utilize label correlation information in the classification model.

Existing deep learning methods have the following shortcomings. First, these methods fail to fully consider the combination of semantic and co-occurrence correlations between labels. Second, these methods ignore the rich co-occurrence information between label clusters. In order to fully consider the correlations between labels and to exploit these label correlations in both the label space partitioning and the classification model, we propose TLC-XML, which comprises three modules: partitioning the label space (Partition), matching related clusters (Matcher) and ranking candidate labels (Ranker). In Partition, we use label semantics and co-occurrence information to extract correlations between labels. Then, the label correlation graph is constructed by using labels as nodes and correlations between labels as edges. Furthermore, we propose the label graph partition (LGP) algorithm to partition strongly correlated labels into the same cluster. In Matcher, we propose the cluster correlation learning (CCL) algorithm, which uses a graph convolutional network (GCN) to extract the correlation between clusters. Then, these valuable correlations are introduced into the classifier to match related clusters. In Ranker, we propose the label interaction learning (LIL) algorithm, which aggregates the raw label prediction with the information of the neighboring labels. In addition, we use residual mapping to alleviate the over-smoothing problem.

We summarize the three main contributions of this paper as follows:

  1. We propose a novel TLC-XML model based on a pre-trained Transformer for the XMTC task. TLC-XML accounts for the semantic and co-occurrence correlations between labels and uses the label correlations in the label space partitioning and classification model.

  2. The CCL and LIL algorithms are proposed to extract different levels of correlation between labels, and this valuable information is integrated with the Transformer-based feature extraction network in an end-to-end training framework.

  3. We conduct extensive experiments on five benchmark datasets of XMTC, and the experimental results show that TLC-XML outperforms state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 reviews related work on extreme multi-label text classification. In Sect. 3, we present TLC-XML in detail, including the Partition, Matcher and Ranker modules. Section 4 presents the experimental configuration, experimental results and ablation studies. Finally, the conclusion is given in Sect. 5.

2 Related Works

Approaches to the XMTC task fall into two broad categories: traditional methods and deep learning methods. Their advantages and drawbacks are summarized in Table 1.

Table 1 Comparison of traditional and deep learning methods

2.1 Traditional Methods

One-versus-all methods, which train a subclassifier for each label, are a classical strategy for solving the XMTC task. However, these methods suffer from large model sizes and excessive training time. To reduce the complexity, PD-Sparse [13] exploits both primal and dual sparsity for sublinear time costs. DiSMEC [14] uses a double layer of parallelization to improve training and prediction speed. In addition, DiSMEC prunes model weight coefficients to reduce the model size, thus requiring fewer computational resources. To further improve the DiSMEC training speed, Schultheis and Babbar [15] proposed a novel weight initialization strategy that significantly speeds up classifier training by setting the initial vector.

Embedding-based methods assume that the label matrix is low-rank and train a classifier in a low-dimensional embedding subspace. Due to the high proportion of tail labels in the XMTC setting, the low-rank assumption is broken. To address this drawback, SLEEC [16] partitions the training samples into multiple clusters and learns local embeddings that preserve the distances to the nearest label vectors. SLEEC then uses a k-nearest neighbor classifier for prediction in each subspace. However, SLEEC partitions the training samples without using label information. AnnexML [17] extends SLEEC by constructing a k-nearest neighbor graph in the label embedding subspace and uses an approximate nearest neighbor search algorithm for efficient prediction.

Tree-based methods, which partition the extreme problem into multiple subproblems, are an effective strategy for solving the XMTC task. Parabel [18] uses the k-means clustering algorithm to recursively partition the label space into two label clusters, which constructs a deep and balanced label tree. To prevent error propagation in deep trees, Bonsai [19] combines the advantages of shallow trees and unbalanced partitioning, where tail labels are assigned to different partitions. Therefore, Bonsai achieves better prediction performance on tail labels. To accelerate inference for tree-based methods, XR-LINEAR [20] uses the masked sparse chunk multiplication (MSCM) algorithm to avoid unnecessary traversal and optimize memory locality.

2.2 Deep Learning Methods

Deep learning methods use various networks to learn semantic context representations from the original text and achieve superior performance. XML-CNN [21] first applied deep learning to the XMTC task; it extracts the text representation with dynamic max-pooling and 1D convolutions [22] and introduces a low-dimensional hidden layer for efficient computation. AttentionXML [23] builds a wide and shallow label tree and trains a separate classifier for each layer. In addition, AttentionXML uses bidirectional long short-term memory (BiLSTM) with an attention mechanism to capture a specific text representation for each label. X-Transformer [7] employs Transformer encoders to match relevant clusters and then uses sparse TF-IDF features and neural embeddings to rank the labels. Since X-Transformer suffers from a large model size and excessive text truncation, LightXML [9] improves the text embedding and input sequence length. In addition, LightXML combines the Transformer model with a generative cooperative network, enabling the Transformer model to learn a better text representation.

3 TLC-XML

In this section, we propose the TLC-XML model to solve the XMTC task. The proposed model comprises three modules: Partition, Matcher and Ranker. As shown in Fig. 1, in Partition, we construct the label correlation graph according to the label co-occurrence matrix and the label embedding matrix and partition the strongly correlated labels into the same cluster. In Matcher, the correlation between clusters extracted from the CCL algorithm is combined with the text representation extracted from the Transformer model, which is used to match related label clusters from the partitioned label clusters. In Ranker, we use the LIL algorithm to aggregate the label prediction with the information of the neighboring labels. The final scores of the candidate labels are determined by combining the outputs of the CCL and LIL algorithms.

Fig. 1 The framework of the proposed TLC-XML model for XMTC

3.1 Problem Formulation

Formally, we assume that \(\left\{ \left( \text {x}_i,\text {y}_i \right) \right\} _{i=1}^{N}\) is a training set, where \(\text {x}_i\) denotes the raw text of the ith instance and \(\text {y}_i\in \left\{ 0,1 \right\} ^L\) denotes its label vector. \(\text {y}_{il}=1\) if instance \(\text {x}_i\) is related to the lth label and 0 otherwise. The XMTC task aims to learn a mapping f that assigns a score to each label for a given instance, such that the score \(\hat{\text {y}}_{il}\) produced by \(f(\text {x}_i)\) is higher when label l is related to instance \(\text {x}_i\). The main mathematical symbols used in this paper are summarized in Table 2.

Table 2 Main mathematical symbols used in this paper

3.2 Partition

To solve the extremely large-scale label space problem, we partition the original label space \(\mathcal {Y}=\left\{ 1, \ldots ,l, \ldots ,L \right\} \) into K label clusters \(\left\{ \mathcal {S}_k \right\} _{k=1}^{K}\), where \(\mathcal {S}_k\) represents the set of labels on the kth cluster.

3.2.1 Label Correlation Graph

The label co-occurrence information reflects dependencies between labels. Therefore, we first use conditional probabilities to extract the co-occurrence between labels. Since the conditional probability is directional, we employ a symmetrized conditional probability matrix \( A^{p} = \left\{ a_{ij}^{p} \right\} \) to represent the co-occurrence strength between labels.

$$\begin{aligned} a_{ij}^{p} = \frac{1}{2}\left[{P\left( {j \mid i} \right) + P\left( {i \mid j} \right) } \right], \end{aligned}$$
(1)

where \(P\left( {j \mid i} \right) \) is the probability that the jth label occurs given that the ith label appears in an instance. The large fraction of tail labels in the XMTC task results in sparse label co-occurrence information. We therefore further introduce the semantic information of labels to enhance the correlation between labels. The correlation matrix \(\hat{A}^{p} = \left\{ \hat{a}_{ij}^{p} \right\} \) combines the co-occurrence and semantic correlations.

$$\begin{aligned} {\hat{a}}_{ij}^{p} = a_{ij}^{p} + \lambda \cdot \sigma \left( {\cos \left( {z_{i},z_{j}} \right) } \right) , \end{aligned}$$
(2)

where cos\((\cdot ,\cdot )\) returns the cosine similarity between two vectors, \(\sigma (\cdot )\) is the sigmoid function, \(\lambda \) is a trade-off parameter between the semantic and co-occurrence correlations, and \(z_i\) and \(z_j\) are the word embeddings of the label text. We then obtain the label correlation graph \(G=(V^p,\hat{A}^{p})\), where \(V^p\) is the set of label nodes and \(\hat{A}^{p}\) is the adjacency matrix storing the edge weights.
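
To make the construction concrete, the following sketch (our own code, not from the paper) computes Eqs. (1) and (2) from a binary label matrix and pre-trained label embeddings; the guard against labels with no instances and the removal of self-loops are assumptions on our part.

```python
import numpy as np

def build_label_correlation_graph(Y, Z, lam=0.5):
    """Y: N x L binary label matrix; Z: L x d label word embeddings; lam: trade-off lambda."""
    counts = Y.sum(axis=0)                                # number of instances per label
    co = Y.T @ Y                                          # co-occurrence counts between labels
    P = co / np.maximum(counts[:, None], 1)               # P(j | i), guarding empty labels
    A_p = 0.5 * (P + P.T)                                 # Eq. (1): symmetric co-occurrence strength
    Zn = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    cos = Zn @ Zn.T                                       # cosine similarity of label embeddings
    A_hat = A_p + lam * (1.0 / (1.0 + np.exp(-cos)))      # Eq. (2): add sigmoid(cos) weighted by lambda
    np.fill_diagonal(A_hat, 0.0)                          # drop self-loops (assumption)
    return A_hat
```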

3.2.2 Partition the Label Graph

In this subsection, to find clusters of strongly correlated labels in G, we propose the label graph partition (LGP) algorithm to partition the label graph. The quality of the label graph partition is measured by modularity Q [23].

$$\begin{aligned} Q = \frac{1}{2m}{\sum _{i,j = 1}^{L}{\left[{{\tilde{a}}_{ij}^{p} - \frac{k_{i}k_{j}}{2m}} \right]\delta \left( {i,j} \right) }} , \end{aligned}$$
(3)

where \( \tilde{a}_{ij}^{p}={\left\{ \begin{array}{ll} 1&{} \text {if} \,\,\hat{a}_{ij}^{p}>\tau _p\\ 0&{} \text {otherwise}\\ \end{array}\right. } \), \( \delta \left( i,j \right) ={\left\{ \begin{array}{ll} 1&{} \text {if}~ i ~ \text {and} ~ j ~ \text {are in the same cluster}\\ 0&{} \text {otherwise}\\ \end{array}\right. } \), \(k_{i} = {\sum _{j = 1}^{L}{\tilde{a}}_{ij}^{p}}\) is the degree of node i, \(m = {1/2}{\sum _{i,j = 1}^{L}{\tilde{a}}_{ij}^{p}}\) is the number of edges, and \(\tau _{p}\) is a noise threshold used to convert the weighted graph into an unweighted graph.

Algorithm 1 Label Graph Partition

The LGP algorithm is shown in Algorithm 1. The L labels in the original label space are partitioned into K label clusters. Then, a three-level label tree is constructed, where the root node is the entire label set, the nodes of the second level are the label clusters, and the leaf nodes are the original labels; each label appears in exactly one cluster.
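
Because the pseudocode of Algorithm 1 is not reproduced here, the sketch below is only a stand-in for LGP under our own assumptions: the weighted label graph is binarized with the noise threshold \(\tau _p\) and then partitioned with an off-the-shelf greedy modularity-maximization routine.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def partition_label_graph(A_hat, tau_p=0.1):
    """A_hat: L x L label correlation matrix from Eq. (2); returns a list of label clusters."""
    L = A_hat.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(L))
    for i in range(L):
        for j in range(i + 1, L):
            if A_hat[i, j] > tau_p:          # keep only edges above the noise threshold (Eq. (3))
                G.add_edge(i, j)
    # greedy modularity maximization as a proxy for LGP; isolated labels become singleton clusters
    communities = greedy_modularity_communities(G)
    return [sorted(c) for c in communities]
```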

3.3 Matcher

After partitioning the label space, the original set of labels is partitioned into K label clusters. In Matcher, we aim to match the relevant \(K^{'}\) clusters \(\left\{ \mathcal {S}_{k}^{'} \right\} _{k = 1}^{K^{'}}\) from the label clusters \(\left\{ \mathcal {S}_{k} \right\} _{k = 1}^{K}\) for a given instance \(x_{i}\).

Fig. 2 The co-occurrence of labels and label clusters on EURLex-4K

Previous works [24,25,26,27,28,29] have shown that exploiting the correlation between labels in multi-label classification can significantly enhance the classification performance. The GCN is an effective algorithm for extracting the correlations between labels. However, the sparse label co-occurrence matrix in the XMTC task results in many label nodes without correlated edges, which further limits the information propagation of GCN. After partitioning the label space, the co-occurrence information between label clusters is remarkably rich, as shown in Fig. 2. Therefore, classification performance can be improved by exploiting the correlation between label clusters.

3.3.1 Cluster Correlation Graph

The cluster correlation graph can be represented as \(\text {H} = \left( V^{m},E^{m} \right) \), where \(V^{m}\) is the set of cluster nodes and \(E^{m}\) is the set of edges connecting them. The adjacency matrix \(A^{m} = \left\{ a_{ij}^{m} \right\} \) denotes the co-occurrence correlation between label clusters.

We extract the cluster representations \(V^{m} = \left\{ {v_{1}^{m},\ldots ,v_{k}^{m},\ldots ,v_{K}^{m}} \right\} \) by aggregating the label embeddings.

$$\begin{aligned} v_{k}^{m} = \frac{\sum _{l \in \mathcal {S}_{k}}z_{l}}{\left| \mathcal {S}_{k} \right| } , \end{aligned}$$
(4)

where \(z_l\) is the embedding of label l and \(|\mathcal {S}_k |\) denotes the number of labels in label cluster \(\mathcal {S}_k\). To represent the co-occurrence relationship between clusters, we use the conditional probability \(a_{ij}^m\) to calculate the weight matrix between clusters, i.e.,

$$\begin{aligned} {a}_{ij}^{m}={\left\{ \begin{array}{ll} P\left( \mathcal {S}_i|\mathcal {S}_j \right) &{} \text {if} \ i\ne j\\ 1&{} \text {otherwise}\\ \end{array}\right. }. \end{aligned}$$
(5)

To reduce the computational cost, we filter out low-value noise in \(a_{ij}^m\). In addition, to alleviate excessive information aggregation from neighboring nodes, we further apply the re-weighted scheme [24] to \(\hat{a}_{ij}^m\) to balance the relationship between a node and its neighborhood.

$$\begin{aligned} \hat{a}_{ij}^{m}= & {} {\left\{ \begin{array}{ll} {a}_{ij}^{m}&{} \text {if} \ {a}_{ij}^{m}>\tau _m\\ 0&{} \text {otherwise}\\ \end{array}\right. }, \end{aligned}$$
(6)
$$\begin{aligned} \tilde{a}_{ij}^{m}= & {} {\left\{ \begin{array}{ll} \frac{p\,\hat{a}_{ij}^{m}}{\sum \limits _{j'\ne i}{\hat{a}_{ij'}^{m}}}&{} \text {if} \ i\ne j\\ 1-p&{} \text {otherwise}\\ \end{array}\right. }, \end{aligned}$$
(7)

where \(\tau _m\) is the noise threshold and p is a trade-off parameter: when p is close to 1, nodes tend to aggregate information from their neighbors; otherwise, they focus on their own information. Therefore, \(\tilde{A}^{m} = \left\{ {\tilde{a}}_{ij}^{m} \right\} \) is the enhanced adjacency matrix, which stores the correlation strength between clusters.
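
The following sketch (function and variable names are ours) builds the cluster correlation graph of Eqs. (4)–(7); the re-weighting follows our reading of Eq. (7), i.e. each retained off-diagonal entry is scaled so that the off-diagonal mass of a row sums to p.

```python
import numpy as np

def build_cluster_graph(Y, Z, clusters, tau_m=0.05, p=0.5):
    """Y: N x L label matrix; Z: L x d label embeddings; clusters: list of label-index lists."""
    V_m = np.stack([Z[list(S)].mean(axis=0) for S in clusters])                  # Eq. (4)
    # instance-to-cluster indicator: 1 if the instance has at least one label in the cluster
    Yc = np.stack([(Y[:, list(S)].sum(axis=1) > 0) for S in clusters], axis=1).astype(float)
    counts = Yc.sum(axis=0)
    co = Yc.T @ Yc                                                                # cluster co-occurrence counts
    A_m = co / np.maximum(counts[None, :], 1.0)                                   # Eq. (5): P(S_i | S_j)
    np.fill_diagonal(A_m, 1.0)
    A_hat = np.where(A_m > tau_m, A_m, 0.0)                                       # Eq. (6): noise filtering
    off = A_hat.copy()
    np.fill_diagonal(off, 0.0)
    A_tilde = p * off / np.maximum(off.sum(axis=1, keepdims=True), 1e-12)         # Eq. (7): re-weighting
    np.fill_diagonal(A_tilde, 1.0 - p)
    return V_m, A_tilde
```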

3.3.2 Cluster Correlation Learning

In this subsection, we propose cluster correlation learning (CCL) to capture the correlations between cluster nodes in the cluster correlation graph. CCL is based on an extended variant of the graph convolutional network (GCN), an effective neural network model for node-level prediction tasks. A GCN learns node embeddings by aggregating information from neighboring nodes through convolutional operations. For the node representation matrix \(H^{(t)} \in \mathbb {R}^{K \times \mathcal {D}}\) at the tth layer, we use the convolution operation of the GCN [30], and the node representations of the next layer are updated by

$$\begin{aligned} H^{({t + 1})} = h\left( {{\tilde{A}}^{m}H^{(t)}W_{t}} \right) , \end{aligned}$$
(8)

where \(H^{(0)} = V^{m}\) is the initial embedding matrix of the cluster nodes, \(W_{t} \in \mathbb {R}^{\mathcal {D} \times \mathcal {D}}\) is a matrix of learnable parameters, and \(h(\cdot )\) is a nonlinear activation function. After stacking T convolution layers, we take \(H^{(T)} \in \mathbb {R}^{K \times \mathcal {D}}\) as the final cluster embedding matrix, where each node integrates information from its T-order neighbors in the cluster correlation graph.
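
A minimal PyTorch sketch of the CCL stack in Eq. (8) is given below; we assume the label-embedding dimension already equals the Transformer feature dimension \(\mathcal {D}\) (otherwise \(V^m\) would first need a linear projection), and the class name is ours.

```python
import torch
import torch.nn as nn

class CCL(nn.Module):
    """Stack of T graph-convolution layers over the cluster correlation graph (Eq. (8))."""
    def __init__(self, A_tilde, V_m, dim, T=2):
        super().__init__()
        self.register_buffer("A", torch.as_tensor(A_tilde, dtype=torch.float))   # K x K adjacency
        self.register_buffer("H0", torch.as_tensor(V_m, dtype=torch.float))      # K x D initial embeddings
        self.layers = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(T)])
        self.act = nn.ReLU()

    def forward(self):
        H = self.H0
        for layer in self.layers:
            H = self.act(self.A @ layer(H))    # H^(t+1) = h(A~ H^(t) W_t)
        return H                               # K x D final cluster embeddings H^(T)
```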

We employ the classical multi-label classification loss \(\mathcal {L}_{matcher}\) to train an end-to-end classifier that considers dependencies between label clusters.

$$\begin{aligned} \mathcal {L}_{matcher} = -\frac{1}{N}{\sum \limits _{i = 1}^{N}{{\sum \limits _{k = 1}^{K}\left[{\text {y}'_{ik}{\log \left( {\hat{\text {y}}'}_{ik} \right) } + \left( {1 - \text {y}'_{ik}} \right) {\log \left( {1 - {\hat{\text {y}}'}_{ik}} \right) }} \right]},}} \end{aligned}$$
(9)

where \( \text {y}'_{ik}={\left\{ \begin{array}{ll} 1&{} \text {if}\ \text {y}_{il}=1\ \text {for some} \ l\in \mathcal {S}_k\\ 0&{} \text {otherwise}\\ \end{array}\right. } \) is the instance-to-cluster assignment, \({\hat{\text {y}}'}_{ik} = \sigma \left( H_{k}^{(T)}\Phi \left( \text {x}_{i} \right) \right) \) is the predicted value, \(\Phi \left( \cdot \right) \) is the Transformer encoder that maps the raw text to a feature vector, and \(\sigma \left( \cdot \right) \) is the sigmoid function applied at the last layer to perform binary classification for each cluster.

For an instance \(\text {x}_i\), the relevant candidate clusters \(\left\{ \mathcal {S}_{k}^{'} \right\} _{k = 1}^{K^{'}}\) can be obtained by the Matcher module.

$$\begin{aligned} \left\{ \mathcal {S}_{k}^{'} \right\} _{k = 1}^{K^{'}} = \left\{ {\mathcal {S}_{k} \mid {k \in {rank}_{K^{'}}\left( {\hat{\text {y}}}_{i}^{'} \right) }} \right\} , \end{aligned}$$
(10)

where \({rank}_{k}( \cdot )\) returns the indices of the k largest entries.
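
For illustration, a single Matcher step combining Eqs. (9) and (10) might look as follows (the function name and the logit form of the cross-entropy are our choices):

```python
import torch
import torch.nn.functional as F

def matcher_step(text_feat, H_T, Yc, K_prime=5):
    """text_feat: B x D features Phi(x); H_T: K x D cluster embeddings; Yc: B x K cluster targets."""
    logits = text_feat @ H_T.t()                                  # \hat{y}'_{ik} before the sigmoid
    loss = F.binary_cross_entropy_with_logits(logits, Yc)         # Eq. (9)
    candidate_clusters = logits.topk(K_prime, dim=1).indices      # Eq. (10): top-K' clusters per instance
    return loss, candidate_clusters
```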

3.4 Ranker

3.4.1 Label Interaction Learning

A common approach to the XMTC task, adopted by XML-CNN [21], AttentionXML [22] and LightXML [9], uses a fully connected layer as the last layer of the network and a sigmoid function to obtain the label predictions. However, such methods ignore the correlation information between labels, which can be used to improve prediction accuracy. To exploit this correlation information, we propose label interaction learning (LIL), an effective network block that learns the importance of different labels and adaptively adjusts the label predictions. However, simply stacking deeper aggregation layers could cause performance degradation and over-smoothing. Specifically, we first employ a fully connected layer to predict the raw output LIL(0) and then use residual mapping [31] to aggregate the raw output with the information of the neighboring labels.

$$\begin{aligned} \left\{ \begin{array}{l} LIL\left( 0 \right) =W_{r}^{\top }\cdot \Phi \left( \text {x}_i \right) \\ LIL\left( c \right) =LIL\left( c-1 \right) +\Phi \left( \text {x}_i \right) W_c A^r\\ \end{array} \right. , \end{aligned}$$
(11)

where \(W_{r} \in \mathbb {R}^{\mathcal {D} \times L}\) is the matrix of learnable parameters of the fully connected layer, c is the number of aggregated layers, \(W_{c} \in \mathbb {R}^{\mathcal {D} \times L}\) is a matrix of learnable parameters that adjusts the neighboring label information, and \(A^{r} \in \mathbb {R}^{L \times L}\) is obtained from the label correlation matrix \(A^p\) according to Eqs. (6) and (7). We employ the loss \(\mathcal {L}_{ranker}\) to train the LIL block.

$$\begin{aligned} \mathcal {L}_{ranker}=-\frac{1}{N}\sum _{i=1}^N{\sum _{l=1}^L{\left[ \text {y}_{il}\log \left( \hat{\text {y}}_{il} \right) +\left( 1-\text {y}_{il} \right) \log \left( 1-\hat{\text {y}}_{il} \right) \right] }}, \end{aligned}$$
(12)

where \(\hat{\text {y}}_{il}\) is the lth entry of \(\sigma \left( LIL\left( c \right) \right) \), the vector of predicted label scores for instance \(\text {x}_i\).
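
A hedged PyTorch sketch of the LIL block (Eq. (11)) follows. The text does not state whether each aggregation step has its own \(W_c\), so we give each step separate weights here; the class name is ours, and training minimizes the binary cross-entropy of Eq. (12) over the resulting logits (e.g. with F.binary_cross_entropy_with_logits).

```python
import torch
import torch.nn as nn

class LIL(nn.Module):
    """Raw fully connected prediction plus c residual aggregation steps over A^r (Eq. (11))."""
    def __init__(self, A_r, dim, num_labels, c=1):
        super().__init__()
        self.register_buffer("A_r", torch.as_tensor(A_r, dtype=torch.float))              # L x L
        self.fc = nn.Linear(dim, num_labels, bias=False)                                  # W_r
        self.W_c = nn.ModuleList([nn.Linear(dim, num_labels, bias=False) for _ in range(c)])

    def forward(self, text_feat):
        out = self.fc(text_feat)                     # LIL(0) = W_r^T Phi(x)
        for W in self.W_c:
            out = out + W(text_feat) @ self.A_r      # LIL(c) = LIL(c-1) + Phi(x) W_c A^r
        return out                                   # logits; apply the sigmoid for \hat{y}_{il}
```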

3.4.2 Ranking Candidate Labels

In this subsection, we aim to rank each candidate label. The scores of the candidate labels are determined by combining the outputs of the CCL and LIL algorithms: the LIL predictions are kept only for labels in the clusters matched by the Matcher. For an instance \(\text {x}_i\), the candidate label scores are defined as \(score_{i} = \left\{ {\hat{\text {y}}}_{il} \mid l \in \bigcup \nolimits _{k=1}^{K^{'}}\mathcal {S}_{k}^{'} \right\} \).
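
A small usage sketch (names are ours): the final scores simply restrict the sigmoid of the LIL logits to the labels of the clusters returned by the Matcher.

```python
import torch

def rank_candidates(lil_logits, candidate_clusters, clusters):
    """lil_logits: length-L logits for one instance; clusters: list of label-index lists."""
    scores = {}
    for k in candidate_clusters:                     # matched cluster indices from Eq. (10)
        for l in clusters[int(k)]:
            scores[l] = torch.sigmoid(lil_logits[l]).item()
    return sorted(scores.items(), key=lambda kv: -kv[1])   # labels ranked by score
```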

3.5 Training Algorithm

The TLC-XML training algorithm is provided in Algorithm 2.

Algorithm 2 TLC-XML training algorithm

4 Experiment

4.1 Experimental Configuration

4.1.1 Datasets

We evaluate TLC-XML on five publicly available XMTC benchmark datasets: RCV1 [32], EURLex-4K [33], AAPD [34], Wiki10-31K and AmazonCat-13K [35]. The characteristics of the benchmark datasets are summarized in Table 3, where N denotes the total number of training instances, M denotes the total number of test instances, L denotes the total number of labels, and \(\overline{{L}}\) and \(\widehat{{L}}\) denote the mean number of labels per instance and the mean number of instances per label, respectively.

Table 3 Dataset statistics

4.1.2 Evaluation Metrics

Ranking-based evaluation metrics, such as P@k and nDCG@k with k = 1, 3 and 5, are widely used to compare the performance of XMTC models. P@k is the fraction of the top-k predicted labels that are among the ground-truth labels. The predicted label vector and the ground-truth label vector are denoted by \(\hat{\text {y}}\) and \(\text {y}\), respectively.

$$\begin{aligned} \text {P@}k=\frac{1}{k}\sum _{l\in {rank}_k\left( \hat{\text {y}} \right) }{\text {y}_l}, \end{aligned}$$
(13)

where \({rank}_k\left( \hat{\text {y}} \right) \) returns the indices of the k largest entries of \(\hat{\text {y}}\). The normalized discounted cumulative gain (nDCG) [36] is another widely used evaluation metric in the XMTC task, which measures both the relevance and the ranking of the predicted labels. nDCG@k is defined as

$$\begin{aligned} \text {nDCG@}k = \frac{\text {DCG@}k}{\sum _{l = 1}^{\min (k,\Vert \text {y}\Vert _{0})}{1/{\log \left( {l + 1} \right) }}}, \end{aligned}$$
(14)

where \(\Vert \text {y} \Vert _0\) returns the number of nonzero values in \(\text {y}\), DCG@\(k = {\sum _{l = 1}^{k}\frac{\text {y}_{top(rank_{k}{(\hat{\text {y}})},l)}}{\log (l + 1)}}\) is a cumulative gain value based on the relevance and ranking of the predicted labels, and \({top}\left( \cdot ,l \right) \) returns the lth element, i.e., the index of the label with the lth largest predicted score.
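
Both metrics can be computed directly (a sketch with our own function names; we use the common base-2 logarithm for the discount):

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """P@k (Eq. (13)). y_true: N x L binary matrix; y_score: N x L predicted scores."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, topk, axis=1)
    return hits.sum(axis=1).mean() / k

def ndcg_at_k(y_true, y_score, k):
    """nDCG@k (Eq. (14))."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    gains = np.take_along_axis(y_true, topk, axis=1)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))                 # 1 / log2(l + 1), l = 1..k
    dcg = (gains * discounts).sum(axis=1)
    n_rel = np.minimum(y_true.sum(axis=1), k).astype(int)          # min(k, ||y||_0)
    idcg = np.array([discounts[:n].sum() if n > 0 else 1.0 for n in n_rel])
    return (dcg / idcg).mean()
```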

4.1.3 Comparing Methods and Implementation Details

To evaluate the TLC-XML model, we compare it with eight state-of-the-art methods for solving the XMTC task: the embedding-based method SLEEC [16]; the tree-based methods Parabel [18] and XR-LINEAR [20]; the Transformer-based methods X-Transformer [7] and LightXML [9]; and other neural network variants XML-CNN [21], AttentionXML [22] and CorNetAttentionXML [11]. For a fair comparison, all methods are run on our machine using the released code, with hyperparameters following the settings given in their papers. Traditional methods use bag-of-words (BOW) features to train classifiers, Transformer-based methods uniformly use a single Transformer model to extract the features of the original text, and RNN-based methods use a single label tree for prediction.

Table 4 Dataset-specific hyperparameters

TLC-XML uses the pre-trained BERT model [4] as the text encoder and combines the [CLS] tokens of the final five hidden layers to represent the text, with a dropout rate of 0.2 on the text representation. The label semantic embedding matrix Z is built from FastText embeddings of the raw label text. The noise thresholds \(\tau _p\), \(\tau _m\) and \(\tau _r\) are set to 0.1, 0.05 and 0.005, respectively; the trade-off parameter \(\lambda \) is set to 0.5; and \(h\left( \cdot \right) \) is the ReLU activation function. We use AdamW with 100 warm-up steps as the optimizer, where the learning rate is 3e−5 for the fine-tuned Transformer and 6e−5 for the remaining layers, and the weight decay is set to 0.01. In addition, dataset-specific hyperparameters are shown in Table 4, where EP denotes the number of training epochs, B the training batch size, IL the input token length of the Transformer model, p the trade-off parameter, and c the number of aggregated layers. All experiments are implemented with the PyTorch framework on a single Nvidia 3080 Ti GPU.

4.2 Experimental Results

Table 5 Comparison results on RCV1
Table 6 Comparison results on EURLex-4K
Table 7 Comparison results on AAPD
Table 8 Comparison results on Wiki10-31K
Table 9 Comparison results on AmazonCat-13K

We compare TLC-XML with eight state-of-the-art XMTC methods on the five benchmark datasets, and the detailed results are presented in Tables 5, 6, 7, 8 and 9, where the best performance is shown in boldface. TLC-XML outperforms the other methods on all metrics except P@1: its P@1 results are slightly lower than those of CorNetAttentionXML on EURLex-4K and LightXML on AmazonCat-13K. Since TLC-XML exploits the correlation between labels, the label co-occurrence in the training set can significantly affect the prediction performance. As shown in Table 3, \(\overline{{L}}\) or \(\widehat{{L}}\) is larger for Wiki10-31K and AAPD, so TLC-XML achieves better performance on these datasets than on the others. In addition, the prediction performance of TLC-XML is noticeably better on the top-3 and top-5 metrics than on the top-1 metrics due to the over-smoothing problem; we discuss the optimal number of aggregation layers in Sect. 4.3.3. To systematically compare the algorithms, the Friedman test [37] is used to evaluate whether there are statistically significant performance gaps. For each evaluation metric, the average rank of the jth algorithm is computed by \(R_j=\frac{1}{T}\sum \nolimits _{i=1}^T{r_{i}^{j}}\), where \(T=5\) is the number of datasets and \(r_i^j\) denotes the rank of the jth algorithm on the ith benchmark dataset. The Friedman statistic \(F_F\) follows the F-distribution and can be computed by:

$$\begin{aligned} F_{F} = \frac{\left( {T - 1} \right) \mathcal {X}_{F}^{2}}{T\left( {P - 1} \right) - \mathcal {X}_{F}^{2}}, \end{aligned}$$
(15)

where \(\mathcal {X}_{F}^{2} = \frac{12T}{P\left( {P + 1} \right) }\left[{\sum \nolimits _{j = 1}^{P}{R_{j}^{2} - \frac{P\left( {P + 1} \right) ^{2}}{4}}} \right]\), and \(P=9\) is the number of comparison algorithms in our experiment.
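For reference, Eq. (15) can be transcribed directly (function and variable names are ours; ranks holds the per-dataset ranks of all compared algorithms for one metric):

```python
import numpy as np

def friedman_statistic(ranks, T=5, P=9):
    """ranks: T x P matrix, ranks[i, j] = rank of algorithm j on dataset i."""
    R = ranks.mean(axis=0)                                            # average ranks R_j
    chi2_F = 12 * T / (P * (P + 1)) * (np.sum(R ** 2) - P * (P + 1) ** 2 / 4)
    F_F = (T - 1) * chi2_F / (T * (P - 1) - chi2_F)
    return F_F
```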

Table 10 Summary of the Friedman statistics \(F_F\) and the critical value in terms of five evaluation metrics

Table 10 summarizes the Friedman statistic \(F_F\) and the corresponding critical value for each evaluation metric at a significance level of \(\alpha = 0.05\). The \(F_F\) value for each evaluation metric is higher than the critical value, so the performance differences between the algorithms are statistically significant.

To further validate the classification performance of TLC-XML against the other methods, we employ the Nemenyi test [37] to analyze the performance gaps among the compared methods. The critical difference (CD) is introduced to compare the differences in average ranks among the algorithms, where TLC-XML is treated as the control algorithm and \( CD = q_{\alpha }\sqrt{{P(P + 1)}/{6T}}\) (\(q_{\alpha }=3.102\) at the significance level \(\alpha =0.05\) with \(P=9\) comparison algorithms). Figure 3 shows the CD diagrams for all evaluation metrics, where an algorithm not connected to TLC-XML is considered to have significantly different performance at the significance level \(\alpha = 0.05\). As shown in Fig. 3, TLC-XML achieves the best average rank for every evaluation metric across all datasets. In addition, TLC-XML significantly outperforms XML-CNN, SLEEC, Parabel and XR-LINEAR at the significance level \(\alpha = 0.05\).
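
For concreteness, the critical difference under these settings works out as follows:

```python
import math

# Nemenyi critical difference with q_alpha = 3.102, P = 9 algorithms and T = 5 datasets
CD = 3.102 * math.sqrt(9 * (9 + 1) / (6 * 5))   # = 3.102 * sqrt(3) ≈ 5.37
```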

Fig. 3 Comparison of TLC-XML with eight comparison algorithms using the Nemenyi test

On the RCV1 and Wiki10-31K datasets, we compare the training time and performance of XML-CNN, AttentionXML, LightXML, CorNetAttentionXML and TLC-XML; Fig. 4 shows the results. The following conclusions can be drawn: (1) TLC-XML achieves the best performance with less training time; (2) by utilizing label correlation, TLC-XML and CorNetAttentionXML outperform the other methods in average rank, as shown in Fig. 3; and (3) TLC-XML outperforms CorNetAttentionXML in both classification performance and training time.

Fig. 4 Training time and classification performance on RCV1 and Wiki10-31K

4.3 Ablation Studies

4.3.1 Effect of Partitioning the Label Space

To further explore the effect of the label space partitioning method on classification performance, we compare the proposed LGP algorithm with random partitioning (Random) and cluster-based partitioning (Cluster) using 512 clusters on the Wiki10-31K dataset. In addition, we implement two variants of the LGP algorithm, LGP-Se and LGP-Co: LGP-Se exploits only the label semantic information in Partition, and LGP-Co exploits only the label co-occurrence information. Figure 5 shows that LGP outperforms the other partitioning methods, which verifies that the proposed LGP algorithm partitions the label space effectively for the XMTC task.

Fig. 5 Effect of the partitioning method on the classification performance

4.3.2 Effect of Label Correlation

We further investigate the effectiveness of utilizing label correlation information. On the Wiki10-31K dataset, we compare TLC-XML with three other classification models (FC, CorNet [11] and MaCor), where FC uses a fully connected layer, CorNet considers label correlation through a CorNet block, and MaCor uses only the proposed CCL algorithm to exploit the correlation between clusters. Figure 6 shows that utilizing label correlation information significantly improves classification performance. Different from CorNet, TLC-XML extracts different levels of correlation between labels and integrates this valuable information with the feature extraction network. Therefore, TLC-XML significantly outperforms the other classification models on the top-3 and top-5 evaluation metrics.

Fig. 6 Effect of utilizing label correlation on classification performance

4.3.3 Effect of Aggregated Layers

To further verify the effect of the aggregated layers in the LIL algorithm, we evaluate the model with different numbers of aggregated layers on the RCV1 and Wiki10-31K datasets. Figures 7 and 8 show that deeper aggregation achieves better performance on the top-3 and top-5 evaluation metrics, mainly because deeper aggregated layers capture richer correlation information. However, with more layers, labels over-aggregate information from their neighbors and neglect their own information, which particularly hurts the top-1 predictions. For the evaluation metric P@1, the best results are obtained with a single aggregated layer. Based on the above results, we set the number of aggregated layers to 1 for the smaller datasets and 2 for the larger datasets.

Fig. 7 Effect of the number of aggregated layers in RCV1

Fig. 8 Effect of the number of aggregated layers in Wiki10-31K

5 Conclusion

In this paper, we propose TLC-XML, a Transformer-based model for the XMTC task that exploits label correlations in both label space partitioning and the classification model. TLC-XML comprises three modules: Partition, Matcher and Ranker. In Partition, we utilize semantic and co-occurrence correlations between labels to partition the label space. In Matcher, we combine the correlation between clusters with the text representation to match related label clusters. In Ranker, we aggregate the raw label prediction with neighboring label information and further use residual mapping to avoid over-smoothing. The experimental results demonstrate that TLC-XML is significantly superior to state-of-the-art XMTC methods. In many practical scenarios, it is expensive and time-consuming to annotate all labels of a sample, especially when the number of possible labels is extremely large. Therefore, it is valuable to handle the missing-label problem with a robust and efficient strategy. In future work, we plan to exploit the correlation between labels for datasets with missing labels.