1 Introduction

Extreme multi-label text classification (XMTC) is the task of annotating unseen text with relevant labels drawn from an extremely large label set. XMTC is widely applied to recommendation systems [1], patent classification [2] and search engines [3]. Different from traditional multi-label classification tasks, both the number of labels and the number of instances in the XMTC task are extremely large. Therefore, the XMTC task suffers from two challenges: (1) because of the extremely large output space, the training time and memory consumption of the model are excessive; and (2) a high proportion of tail labels (labels with few relevant instances) leads to severe label sparsity.

Traditional methods for solving the XMTC task can be roughly divided into three main categories: one-versus-all methods, embedding-based methods and tree-based methods. Similar to the straightforward strategy for multi-label classification, one-versus-all methods learn a subclassifier for each label. However, the computational complexity of such methods is very high due to the large-scale label sets. Embedding-based methods assume that the label matrix is low-rank and project the original label space into a low-dimensional subspace to reduce the complexity of the problem. Due to the large proportion of tail labels in the XMTC setting, the low-rank assumption is violated, leading to lower accuracy. Tree-based methods decompose the original problem into multiple subproblems by partitioning the label space. Existing tree-based methods exploit only the semantic label information, which easily leads to error propagation.

Deep learning methods are currently the most effective techniques for solving XMTC tasks. These methods integrate feature extraction and classification into an end-to-end framework, thereby achieving superior performance. However, powerful text embeddings require deep network architectures with a large number of parameters. In XMTC scenarios, many labels have only a few positive instances, so training such massive models is a formidable task. Pre-trained Transformer models [4,5,6], despite their large number of parameters, are pre-trained on large-scale corpora in an unsupervised manner and therefore provide better parameter initialization. X-Transformer [7] first applied the Transformer model to the XMTC task and achieved excellent performance, and many variants [8,9,10] have been proposed in recent years. Correlations between labels are ubiquitous in the XMTC task. Xun et al. [11] introduced the CorNet block, which uses label correlation to enhance label predictions and can be easily integrated with various other XMTC methods. However, the CorNet framework ignores the co-occurrence information between labels. BGNN-XML [12] employs co-occurrence correlations between labels to partition the label space, but fails to utilize label correlation information in the classification model.

Existing deep learning methods have the following shortcomings. First, these methods fail to fully consider the combination of semantic and co-occurrence correlations between labels. Second, these methods ignore the rich co-occurrence information between label clusters. In order to fully consider the correlations between labels and to exploit these label correlations in both the label space partitioning and the classification model, we propose TLC-XML, which comprises three modules: partitioning the label space (Partition), matching related clusters (Matcher) and ranking candidate labels (Ranker). In Partition, we use label semantics and co-occurrence information to extract correlations between labels. Then, the label correlation graph is constructed by using labels as nodes and correlations between labels as edges. Furthermore, we propose the label graph partition (LGP) algorithm to partition strongly correlated labels into the same cluster. In Matcher, we propose the cluster correlation learning (CCL) algorithm, which uses a graph convolutional network (GCN) to extract the correlation between clusters. Then, these valuable correlations are introduced into the classifier to match related clusters. In Ranker, we propose the label interaction learning (LIL) algorithm, which aggregates the raw label prediction with the information of the neighboring labels. In addition, we use residual mapping to alleviate the over-smoothing problem.

We summarize the three main contributions of this paper as follows:

  1. We propose a novel TLC-XML model based on a pre-trained Transformer for the XMTC task. TLC-XML accounts for the semantic and co-occurrence correlations between labels and uses the label correlations in the label space partitioning and classification model.

  2. The CCL and LIL algorithms are proposed to extract different levels of correlation between labels, and this valuable information is integrated with the Transformer-based feature extraction network in an end-to-end training framework.

  3. We conduct extensive experiments on five benchmark datasets of XMTC, and the experimental results show that TLC-XML outperforms state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 reviews related work on extreme multi-label text classification. In Sect. 3, we present TLC-XML in detail, including the Partition, Matcher and Ranker modules. Section 4 presents the experimental configuration, experimental results and ablation studies. Finally, the conclusion is given in Sect. 5.

2 Related Works

Approaches to the XMTC task fall into two broad categories: traditional methods and deep learning methods. Their advantages and drawbacks are summarized in Table 1.

Table 1 Comparison of traditional and deep learning methods

2.1 Traditional Methods

One-versus-all methods, which train a subclassifier for each label, are a classical strategy for solving the XMTC task. However, these methods suffer from large model sizes and excessive training time. To reduce the complexity, PD-Sparse [13] exploits both primal and dual sparsity for sublinear time costs. DiSMEC [14] uses a double layer of parallelization to improve training and prediction speed. In addition, DiSMEC prunes model weight coefficients to reduce the model size, thus requiring fewer computational resources. To further improve the DiSMEC training speed, Schultheis and Babbar [15] proposed a novel weight initialization strategy that significantly speeds up classifier training by setting the initial vector.

Embedding-based methods assume that the label matrix is low-rank and train a classifier in a low-dimensional embedding subspace. Due to the high proportion of tail labels in the XMTC setting, the low-rank assumption is broken. To address this drawback, SLEEC [16] partitions the training samples into multiple clusters and learns local embeddings that preserve the distances to the nearest label vectors. SLEEC then uses a k-nearest neighbor classifier for prediction in each subspace. However, SLEEC partitions the training samples without using label information. AnnexML [17] extends SLEEC by constructing a k-nearest neighbor graph in the label embedding subspace and uses an approximate nearest neighbor search algorithm for efficient prediction.

Tree-based methods, which partition the extreme problem into multiple subproblems, are an effective strategy for solving the XMTC task. Parabel [18] uses the k-means clustering algorithm to recursively partition the label space into two label clusters, which constructs a deep and balanced label tree. To prevent error propagation in deep trees, Bonsai [19] combines the advantages of shallow trees and unbalanced partitioning, where tail labels are assigned to different partitions. Therefore, Bonsai achieves better prediction performance on tail labels. To accelerate inference for tree-based methods, XR-LINEAR [20] uses the masked sparse chunk multiplication (MSCM) algorithm to avoid unnecessary traversal and optimize memory locality.

2.2 Deep Learning Methods

Deep learning methods use various networks to learn semantic context representations from the original text and achieve superior performance. XML-CNN [21] first applied deep learning to the XMTC task; it extracts the text representation with dynamic max-pooling and 1D convolutions [22] and introduces a low-dimensional hidden layer for efficient computation. AttentionXML [23] builds a wide and shallow label tree and trains a separate classifier for each layer. In addition, AttentionXML uses bidirectional long short-term memory (BiLSTM) with an attention mechanism to capture a specific text representation for each label. X-Transformer [7] employs Transformer encoders to match relevant clusters and then uses sparse TF-IDF features and neural embeddings to rank the labels. Since X-Transformer suffers from a large model size and excessive text truncation, LightXML [9] improves the text embedding and input sequence length. In addition, LightXML combines the Transformer model with a generative cooperative network, enabling the Transformer model to learn a better text representation.

3 TLC-XML

In this section, we propose the TLC-XML model to solve the XMTC task. The proposed model comprises three modules: Partition, Matcher and Ranker. As shown in Fig. 1, in Partition, we construct the label correlation graph according to the label co-occurrence matrix and the label embedding matrix and partition the strongly correlated labels into the same cluster. In Matcher, the correlation between clusters extracted from the CCL algorithm is combined with the text representation extracted from the Transformer model, which is used to match related label clusters from the partitioned label clusters. In Ranker, we use the LIL algorithm to aggregate the label prediction with the information of the neighboring labels. The final scores of the candidate labels are determined by combining the outputs of the CCL and LIL algorithms.

Fig. 1 The framework of the proposed TLC-XML model for XMTC

3.1 Problem Formulation

Formally, we assume that \(\left\{ \left( \text {x}_i,\text {y}_i \right) \right\} _{i=1}^{N}\) is a training set, where \(\text {x}_i\) denotes the raw text of the ith instance and \(\text {y}_i\in \left\{ 0,1 \right\} ^L\) denotes its label vector. \(\text {y}_{il}=1\) if instance \(\text {x}_i\) is related to the lth label and 0 otherwise. The XMTC task aims to learn a mapping f that assigns a score to each label for a given instance, such that the score \(\hat{\text {y}}_{il}\) produced by \(f(\text {x}_i)\) is higher when label l is related to instance \(\text {x}_i\). The main mathematical symbols used in this paper are summarized in Table 2.

Table 2 Main mathematical symbols used in this paper

3.2 Partition

To solve the extremely large-scale label space problem, we partition the original label space \(\mathcal {Y}=\left\{ 1, \ldots ,l, \ldots ,L \right\} \) into K label clusters \(\left\{ \mathcal {S}_k \right\} _{k=1}^{K}\), where \(\mathcal {S}_k\) represents the set of labels on the kth cluster.

3.2.1 Label Correlation Graph

The label co-occurrence information reflects dependencies between labels. Therefore, we first use conditional probabilities to extract the co-occurrence between labels. Since the conditional probability is directional, we employ a symmetrized conditional probability matrix \( A^{p} = \left\{ a_{ij}^{p} \right\} \) to represent the co-occurrence strength between labels.

$$\begin{aligned} a_{ij}^{p} = \frac{1}{2}\left[{P\left( {j \mid i} \right) + P\left( {i \mid j} \right) } \right], \end{aligned}$$
(1)

where \(P\left( {j \mid i} \right) \) is the probability that the jth label occurs given that the ith label appears in an instance. The large fraction of tail labels in the XMTC task results in sparse label co-occurrence information. We therefore further introduce the semantic information of labels to enhance the correlation between labels. The correlation matrix \(\hat{A}^{p} = \left\{ \hat{a}_{ij}^{p} \right\} \) combines the co-occurrence and semantic correlations.

$$\begin{aligned} {\hat{a}}_{ij}^{p} = a_{ij}^{p} + \lambda \cdot \sigma \left( {\cos \left( {z_{i},z_{j}} \right) } \right) , \end{aligned}$$
(2)

where cos\((\cdot ,\cdot )\) returns the cosine similarity between two vectors, \(\sigma (\cdot )\) is the sigmoid function, \(\lambda \) is a trade-off parameter between the semantic and co-occurrence correlations, and \(z_i\) and \(z_j\) are the word embeddings of the label text. We then obtain the label correlation graph \(G=(V^p,\hat{A}^{p})\), where \(V^p\) is the set of label nodes and \(\hat{A}^{p}\) is the adjacency matrix storing the edge weights.
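
To make the construction concrete, the following sketch (our own code, not from the paper) computes Eqs. (1) and (2) from a binary label matrix and pre-trained label embeddings; the guard against labels with no instances and the removal of self-loops are assumptions on our part.

```python
import numpy as np

def build_label_correlation_graph(Y, Z, lam=0.5):
    """Y: N x L binary label matrix; Z: L x d label word embeddings; lam: trade-off lambda."""
    counts = Y.sum(axis=0)                                # number of instances per label
    co = Y.T @ Y                                          # co-occurrence counts between labels
    P = co / np.maximum(counts[:, None], 1)               # P(j | i), guarding empty labels
    A_p = 0.5 * (P + P.T)                                 # Eq. (1): symmetric co-occurrence strength
    Zn = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    cos = Zn @ Zn.T                                       # cosine similarity of label embeddings
    A_hat = A_p + lam * (1.0 / (1.0 + np.exp(-cos)))      # Eq. (2): add sigmoid(cos) weighted by lambda
    np.fill_diagonal(A_hat, 0.0)                          # drop self-loops (assumption)
    return A_hat
```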

3.2.2 Partition the Label Graph

In this subsection, to find clusters of strongly correlated labels in G, we propose the label graph partition (LGP) algorithm to partition the label graph. The quality of the label graph partition is measured by modularity Q [23].

$$\begin{aligned} Q = \frac{1}{2m}{\sum _{i,j = 1}^{L}{\left[{{\tilde{a}}_{ij}^{p} - \frac{k_{i}k_{j}}{2m}} \right]\delta \left( {i,j} \right) }} , \end{aligned}$$
(3)

where \( \tilde{a}_{ij}^{p}={\left\{ \begin{array}{ll} 1&{} \text {if} \,\,\hat{a}_{ij}^{p}>\tau _p\\ 0&{} \text {otherwise}\\ \end{array}\right. } \), \( \delta \left( i,j \right) ={\left\{ \begin{array}{ll} 1&{} \text {if}~ i ~ \text {and} ~ j ~ \text {are in the same cluster}\\ 0&{} \text {otherwise}\\ \end{array}\right. } \), \(k_{i} = {\sum _{j = 1}^{L}{\tilde{a}}_{ij}^{p}}\) is the degree of node i, \(m = {1/2}{\sum _{i,j = 1}^{L}{\tilde{a}}_{ij}^{p}}\) is the number of edges, and \(\tau _{p}\) is a noise threshold used to convert the weighted graph into an unweighted graph.

Algorithm 1 Label Graph Partition

The LGP algorithm is shown in Algorithm 1. The L labels in the original label space are partitioned into K label clusters. Then, a three-level label tree is constructed, where the root node is the entire label set, the nodes of the second level are the label clusters, and the leaf nodes are the original labels; each label appears in exactly one cluster.
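
Because the pseudocode of Algorithm 1 is not reproduced here, the sketch below is only a stand-in for LGP under our own assumptions: the weighted label graph is binarized with the noise threshold \(\tau _p\) and then partitioned with an off-the-shelf greedy modularity-maximization routine.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def partition_label_graph(A_hat, tau_p=0.1):
    """A_hat: L x L label correlation matrix from Eq. (2); returns a list of label clusters."""
    L = A_hat.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(L))
    for i in range(L):
        for j in range(i + 1, L):
            if A_hat[i, j] > tau_p:          # keep only edges above the noise threshold (Eq. (3))
                G.add_edge(i, j)
    # greedy modularity maximization as a proxy for LGP; isolated labels become singleton clusters
    communities = greedy_modularity_communities(G)
    return [sorted(c) for c in communities]
```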

3.3 Matcher

After partitioning the label space, the original set of labels is partitioned into K label clusters. In Matcher, we aim to match the relevant \(K^{'}\) clusters \(\left\{ \mathcal {S}_{k}^{'} \right\} _{k = 1}^{K^{'}}\) from the label clusters \(\left\{ \mathcal {S}_{k} \right\} _{k = 1}^{K}\) for a given instance \(x_{i}\).

Fig. 2 The co-occurrence of labels and label clusters on EURLex-4K

Previous works [24,25,26,27,28,29] have shown that exploiting the correlation between labels in multi-label classification can significantly enhance the classification performance. The GCN is an effective algorithm for extracting the correlations between labels. However, the sparse label co-occurrence matrix in the XMTC task results in many label nodes without correlated edges, which further limits the information propagation of GCN. After partitioning the label space, the co-occurrence information between label clusters is remarkably rich, as shown in Fig. 2. Therefore, classification performance can be improved by exploiting the correlation between label clusters.

3.3.1 Cluster Correlation Graph

The cluster correlation graph can be represented as \(\text {H} = \left( V^{m},E^{m} \right) \), where \(V^{m}\) is the set of cluster nodes and \(E^{m}\) is the set of edges connecting them. The adjacency matrix \(A^{m} = \left\{ a_{ij}^{m} \right\} \) denotes the co-occurrence correlation between label clusters.

We extract the cluster representations \(V^{m} = \left\{ {v_{1}^{m},\ldots ,v_{k}^{m},\ldots ,v_{K}^{m}} \right\} \) by aggregating the label embeddings.

$$\begin{aligned} v_{k}^{m} = \frac{\sum _{l \in \mathcal {S}_{k}}z_{l}}{\left| \mathcal {S}_{k} \right| } , \end{aligned}$$
(4)

where \(z_l\) is the embedding of label l and \(|\mathcal {S}_k |\) denotes the number of labels in label cluster \(\mathcal {S}_k\). To represent the co-occurrence relationship between clusters, we use the conditional probability \(a_{ij}^m\) to calculate the weight matrix between clusters, i.e.,

$$\begin{aligned} {a}_{ij}^{m}={\left\{ \begin{array}{ll} P\left( \mathcal {S}_i|\mathcal {S}_j \right) &{} \text {if} \ i\ne j\\ 1&{} \text {otherwise}\\ \end{array}\right. }. \end{aligned}$$
(5)

To reduce the computational cost, we filter out low-value noise in \(a_{ij}^m\). In addition, to alleviate excessive information aggregation from neighboring nodes, we further apply the re-weighted scheme [24] to \(\hat{a}_{ij}^m\) to balance the relationship between a node and its neighborhood.

$$\begin{aligned} \hat{a}_{ij}^{m}= & {} {\left\{ \begin{array}{ll} {a}_{ij}^{m}&{} \text {if} \ {a}_{ij}^{m}>\tau _m\\ 0&{} \text {otherwise}\\ \end{array}\right. }, \end{aligned}$$
(6)
$$\begin{aligned} \tilde{a}_{ij}^{m}= & {} {\left\{ \begin{array}{ll} \frac{p\,\hat{a}_{ij}^{m}}{\sum \limits _{j'\ne i}{\hat{a}_{ij'}^{m}}}&{} \text {if} \ i\ne j\\ 1-p&{} \text {otherwise}\\ \end{array}\right. }, \end{aligned}$$
(7)

where \(\tau _m\) is the noise threshold and p is a trade-off parameter: when p is close to 1, nodes tend to aggregate information from their neighbors; otherwise, they focus on their own information. Therefore, \(\tilde{A}^{m} = \left\{ {\tilde{a}}_{ij}^{m} \right\} \) is the enhanced adjacency matrix, which stores the correlation strength between clusters.
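
The following sketch (function and variable names are ours) builds the cluster correlation graph of Eqs. (4)–(7); the re-weighting follows our reading of Eq. (7), i.e. each retained off-diagonal entry is scaled so that the off-diagonal mass of a row sums to p.

```python
import numpy as np

def build_cluster_graph(Y, Z, clusters, tau_m=0.05, p=0.5):
    """Y: N x L label matrix; Z: L x d label embeddings; clusters: list of label-index lists."""
    V_m = np.stack([Z[list(S)].mean(axis=0) for S in clusters])                  # Eq. (4)
    # instance-to-cluster indicator: 1 if the instance has at least one label in the cluster
    Yc = np.stack([(Y[:, list(S)].sum(axis=1) > 0) for S in clusters], axis=1).astype(float)
    counts = Yc.sum(axis=0)
    co = Yc.T @ Yc                                                                # cluster co-occurrence counts
    A_m = co / np.maximum(counts[None, :], 1.0)                                   # Eq. (5): P(S_i | S_j)
    np.fill_diagonal(A_m, 1.0)
    A_hat = np.where(A_m > tau_m, A_m, 0.0)                                       # Eq. (6): noise filtering
    off = A_hat.copy()
    np.fill_diagonal(off, 0.0)
    A_tilde = p * off / np.maximum(off.sum(axis=1, keepdims=True), 1e-12)         # Eq. (7): re-weighting
    np.fill_diagonal(A_tilde, 1.0 - p)
    return V_m, A_tilde
```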

3.3.2 Cluster Correlation Learning

In this subsection, we propose cluster correlation learning (CCL) to capture the correlations between cluster nodes in the cluster correlation graph. CCL is based on an extended variant of the graph convolutional network (GCN), an effective neural network model for node-level prediction tasks. A GCN learns node embeddings by aggregating information from neighboring nodes through convolutional operations. For the node representation matrix \(H^{(t)} \in \mathbb {R}^{K \times \mathcal {D}}\) at the tth layer, we use the convolution operation of the GCN [30], and the node representations of the next layer are updated by

$$\begin{aligned} H^{({t + 1})} = h\left( {{\tilde{A}}^{m}H^{(t)}W_{t}} \right) , \end{aligned}$$
(8)

where \(H^{(0)} = V^{m}\) is the initial embedding matrix of the cluster nodes, \(W_{t} \in \mathbb {R}^{\mathcal {D} \times \mathcal {D}}\) is a matrix of learnable parameters, and \(h(\cdot )\) is a nonlinear activation function. After stacking T convolution layers, we take \(H^{(T)} \in \mathbb {R}^{K \times \mathcal {D}}\) as the final cluster embedding matrix, where each node integrates information from its T-order neighbors in the cluster correlation graph.
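
A minimal PyTorch sketch of the CCL stack in Eq. (8) is given below; we assume the label-embedding dimension already equals the Transformer feature dimension \(\mathcal {D}\) (otherwise \(V^m\) would first need a linear projection), and the class name is ours.

```python
import torch
import torch.nn as nn

class CCL(nn.Module):
    """Stack of T graph-convolution layers over the cluster correlation graph (Eq. (8))."""
    def __init__(self, A_tilde, V_m, dim, T=2):
        super().__init__()
        self.register_buffer("A", torch.as_tensor(A_tilde, dtype=torch.float))   # K x K adjacency
        self.register_buffer("H0", torch.as_tensor(V_m, dtype=torch.float))      # K x D initial embeddings
        self.layers = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(T)])
        self.act = nn.ReLU()

    def forward(self):
        H = self.H0
        for layer in self.layers:
            H = self.act(self.A @ layer(H))    # H^(t+1) = h(A~ H^(t) W_t)
        return H                               # K x D final cluster embeddings H^(T)
```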

We employ the classical multi-label classification loss \(\mathcal {L}_{matcher}\) to train an end-to-end classifier that considers dependencies between label clusters.

$$\begin{aligned} \mathcal {L}_{matcher} = -\frac{1}{N}{\sum \limits _{i = 1}^{N}{{\sum \limits _{k = 1}^{K}\left[{\text {y}'_{ik}{\log \left( {\hat{\text {y}}'}_{ik} \right) } + \left( {1 - \text {y}'_{ik}} \right) {\log \left( {1 - {\hat{\text {y}}'}_{ik}} \right) }} \right]},}} \end{aligned}$$
(9)

where \( \text {y}'_{ik}={\left\{ \begin{array}{ll} 1&{} \text {if}\ \text {y}_{il}=1\ \text {for some} \ l\in \mathcal {S}_k\\ 0&{} \text {otherwise}\\ \end{array}\right. } \) is the instance-to-cluster assignment, \({\hat{\text {y}}'}_{ik} = \sigma \left( H_{k}^{(T)}\Phi \left( \text {x}_{i} \right) \right) \) is the predicted value, \(\Phi \left( \cdot \right) \) is the Transformer encoder that maps the raw text to a feature vector, and \(\sigma \left( \cdot \right) \) is the sigmoid function applied at the last layer to perform binary classification for each cluster.

For an instance \(\text {x}_i\), the relevant candidate clusters \(\left\{ \mathcal {S}_{k}^{'} \right\} _{k = 1}^{K^{'}}\) can be obtained by the Matcher module.

$$\begin{aligned} \left\{ \mathcal {S}_{k}^{'} \right\} _{k = 1}^{K^{'}} = \left\{ {\mathcal {S}_{k} \mid {k \in {rank}_{K^{'}}\left( {\hat{\text {y}}}_{i}^{'} \right) }} \right\} , \end{aligned}$$
(10)

where \({rank}_{k}( \cdot )\) returns the indices of the k largest entries.
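
For illustration, a single Matcher step combining Eqs. (9) and (10) might look as follows (the function name and the logit form of the cross-entropy are our choices):

```python
import torch
import torch.nn.functional as F

def matcher_step(text_feat, H_T, Yc, K_prime=5):
    """text_feat: B x D features Phi(x); H_T: K x D cluster embeddings; Yc: B x K cluster targets."""
    logits = text_feat @ H_T.t()                                  # \hat{y}'_{ik} before the sigmoid
    loss = F.binary_cross_entropy_with_logits(logits, Yc)         # Eq. (9)
    candidate_clusters = logits.topk(K_prime, dim=1).indices      # Eq. (10): top-K' clusters per instance
    return loss, candidate_clusters
```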

3.4 Ranker

3.4.1 Label Interaction Learning

A common approach to the XMTC task, adopted by XML-CNN [21], AttentionXML [22] and LightXML [9], uses a fully connected layer as the last layer of the network and a sigmoid function to obtain the label predictions. However, such methods ignore the correlation information between labels, which can be used to improve prediction accuracy. To exploit this correlation information, we propose label interaction learning (LIL), an effective network block that learns the importance of different labels and adaptively adjusts the label predictions. However, simply stacking deeper aggregation layers could cause performance degradation and over-smoothing. Specifically, we first employ a fully connected layer to predict the raw output LIL(0) and then use residual mapping [31] to aggregate the raw output with the information of the neighboring labels.

$$\begin{aligned} \left\{ \begin{array}{l} LIL\left( 0 \right) =W_{r}^{\top }\cdot \Phi \left( \text {x}_i \right) \\ LIL\left( c \right) =LIL\left( c-1 \right) +\Phi \left( \text {x}_i \right) W_c A^r\\ \end{array} \right. , \end{aligned}$$
(11)

where \(W_{r} \in \mathbb {R}^{\mathcal {D} \times L}\) is the matrix of learnable parameters of the fully connected layer, c is the number of aggregated layers, \(W_{c} \in \mathbb {R}^{\mathcal {D} \times L}\) is a matrix of learnable parameters that adjusts the neighboring label information, and \(A^{r} \in \mathbb {R}^{L \times L}\) is obtained from the label correlation matrix \(A^p\) according to Eqs. (6) and (7). We employ the loss \(\mathcal {L}_{ranker}\) to train the LIL block.

$$\begin{aligned} \mathcal {L}_{ranker}=-\frac{1}{N}\sum _{i=1}^N{\sum _{l=1}^L{\left[ \text {y}_{il}\log \left( \hat{\text {y}}_{il} \right) +\left( 1-\text {y}_{il} \right) \log \left( 1-\hat{\text {y}}_{il} \right) \right] }}, \end{aligned}$$
(12)

where \(\hat{\text {y}}_{il}\) is the lth entry of \(\sigma \left( LIL\left( c \right) \right) \), the vector of predicted label scores for instance \(\text {x}_i\).
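
A hedged PyTorch sketch of the LIL block (Eq. (11)) follows. The text does not state whether each aggregation step has its own \(W_c\), so we give each step separate weights here; the class name is ours, and training minimizes the binary cross-entropy of Eq. (12) over the resulting logits (e.g. with F.binary_cross_entropy_with_logits).

```python
import torch
import torch.nn as nn

class LIL(nn.Module):
    """Raw fully connected prediction plus c residual aggregation steps over A^r (Eq. (11))."""
    def __init__(self, A_r, dim, num_labels, c=1):
        super().__init__()
        self.register_buffer("A_r", torch.as_tensor(A_r, dtype=torch.float))              # L x L
        self.fc = nn.Linear(dim, num_labels, bias=False)                                  # W_r
        self.W_c = nn.ModuleList([nn.Linear(dim, num_labels, bias=False) for _ in range(c)])

    def forward(self, text_feat):
        out = self.fc(text_feat)                     # LIL(0) = W_r^T Phi(x)
        for W in self.W_c:
            out = out + W(text_feat) @ self.A_r      # LIL(c) = LIL(c-1) + Phi(x) W_c A^r
        return out                                   # logits; apply the sigmoid for \hat{y}_{il}
```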

3.4.2 Ranking Candidate Labels

In this subsection, we aim to rank each candidate label. The scores of the candidate labels are determined by combining the outputs of the CCL and LIL algorithms: the LIL predictions are kept only for labels in the clusters matched by the Matcher. For an instance \(\text {x}_i\), the candidate label scores are defined as \(score_{i} = \left\{ {\hat{\text {y}}}_{il} \mid l \in \bigcup \nolimits _{k=1}^{K^{'}}\mathcal {S}_{k}^{'} \right\} \).
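
A small usage sketch (names are ours): the final scores simply restrict the sigmoid of the LIL logits to the labels of the clusters returned by the Matcher.

```python
import torch

def rank_candidates(lil_logits, candidate_clusters, clusters):
    """lil_logits: length-L logits for one instance; clusters: list of label-index lists."""
    scores = {}
    for k in candidate_clusters:                     # matched cluster indices from Eq. (10)
        for l in clusters[int(k)]:
            scores[l] = torch.sigmoid(lil_logits[l]).item()
    return sorted(scores.items(), key=lambda kv: -kv[1])   # labels ranked by score
```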

3.5 Training Algorithm

The TLC-XML training algorithm is provided in Algorithm 2.

Algorithm 2 TLC-XML training algorithm

4 Experiment

4.1 Experimental Configuration

4.1.1 Datasets

We evaluate TLC-XML on five publicly available XMTC benchmark datasets: RCV1 [32], EURLex-4K [33], AAPD [34], Wiki10-31K and AmazonCat-13K [35]. The characteristics of the benchmark datasets are summarized in Table 3, where N denotes the total number of training instances, M denotes the total number of test instances, L denotes the total number of labels, and \(\overline{{L}}\) and \(\widehat{{L}}\) denote the mean number of labels per instance and the mean number of instances per label, respectively.

Table 3 Dataset statistics

4.1.2 Evaluation Metrics

Ranking-based evaluation metrics, such as P@k and nDCG@k with k = 1, 3 and 5, are widely used to compare the performance of XMTC models. P@k is the fraction of the top-k predicted labels that are among the ground-truth labels. The predicted label vector and the ground-truth label vector are denoted by \(\hat{\text {y}}\) and \(\text {y}\), respectively.

$$\begin{aligned} \text {P@}k=\frac{1}{k}\sum _{l\in {rank}_k\left( \hat{\text {y}} \right) }{\text {y}_l}, \end{aligned}$$
(13)

where \({rank}_k\left( \hat{\text {y}} \right) \) returns the indices of the k largest entries of \(\hat{\text {y}}\). The normalized discounted cumulative gain (nDCG) [36] is another widely used evaluation metric in the XMTC task, which measures both the relevance and the ranking of the predicted labels. nDCG@k is defined as

$$\begin{aligned} \text {nDCG@}k = \frac{\text {DCG@}k}{\sum _{l = 1}^{\min (k,\Vert \text {y}\Vert _{0})}{1/{\log \left( {l + 1} \right) }}}, \end{aligned}$$
(14)

where \(\Vert \text {y} \Vert _0\) returns the number of nonzero values in \(\text {y}\), DCG@\(k = {\sum _{l = 1}^{k}\frac{\text {y}_{top(rank_{k}{(\hat{\text {y}})},l)}}{\log (l + 1)}}\) is a cumulative gain value based on the relevance and ranking of the predicted labels, and \({top}\left( \cdot ,l \right) \) returns the lth element, i.e., the index of the label with the lth largest predicted score.
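
Both metrics can be computed directly (a sketch with our own function names; we use the common base-2 logarithm for the discount):

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """P@k (Eq. (13)). y_true: N x L binary matrix; y_score: N x L predicted scores."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, topk, axis=1)
    return hits.sum(axis=1).mean() / k

def ndcg_at_k(y_true, y_score, k):
    """nDCG@k (Eq. (14))."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    gains = np.take_along_axis(y_true, topk, axis=1)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))                 # 1 / log2(l + 1), l = 1..k
    dcg = (gains * discounts).sum(axis=1)
    n_rel = np.minimum(y_true.sum(axis=1), k).astype(int)          # min(k, ||y||_0)
    idcg = np.array([discounts[:n].sum() if n > 0 else 1.0 for n in n_rel])
    return (dcg / idcg).mean()
```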

4.1.3 Comparing Methods and Implementation Details

To evaluate the TLC-XML model, we compare it with eight state-of-the-art methods for solving the XMTC task: the embedding-based method SLEEC [16]; the tree-based methods Parabel [18] and XR-LINEAR [20]; the Transformer-based methods X-Transformer [7] and LightXML [9]; and other neural network variants XML-CNN [21], AttentionXML [22] and CorNetAttentionXML [11]. For a fair comparison, all methods are run on our machine using the released code, with hyperparameters following the settings given in their papers. Traditional methods use bag-of-words (BOW) features to train classifiers, Transformer-based methods uniformly use a single Transformer model to extract the features of the original text, and RNN-based methods use a single label tree for prediction.

Table 4 Dataset-specific hyperparameters

TLC-XML uses the pre-trained BERT model [4] as the text encoder and combines the [CLS] tokens of the final five hidden layers to represent the text, with a dropout rate of 0.2 on the text representation. The label semantic embedding matrix Z is built from FastText embeddings of the raw label text. The noise thresholds \(\tau _p\), \(\tau _m\) and \(\tau _r\) are set to 0.1, 0.05 and 0.005, respectively; the trade-off parameter \(\lambda \) is set to 0.5; and \(h\left( \cdot \right) \) is the ReLU activation function. We use AdamW with 100 warm-up steps as the optimizer, where the learning rate is 3e−5 for the fine-tuned Transformer and 6e−5 for the remaining layers, and the weight decay is set to 0.01. In addition, dataset-specific hyperparameters are shown in Table 4, where EP denotes the number of training epochs, B the training batch size, IL the input token length of the Transformer model, p the trade-off parameter, and c the number of aggregated layers. All experiments are implemented with the PyTorch framework on a single Nvidia 3080 Ti GPU.

4.2 Experimental Results

Table 5 Comparison results on RCV1
Table 6 Comparison results on EURLex-4K
Table 7 Comparison results on AAPD
Table 8 Comparison results on Wiki10-31K
Table 9 Comparison results on AmazonCat-13K

We compare TLC-XML with eight state-of-the-art XMTC methods on the five benchmark datasets, and the detailed results are presented in Tables 5, 6, 7, 8 and 9, where the best performance is shown in boldface. TLC-XML outperforms the other methods on all metrics except P@1: its P@1 results are slightly lower than those of CorNetAttentionXML on EURLex-4K and LightXML on AmazonCat-13K. Since TLC-XML exploits the correlation between labels, the label co-occurrence in the training set can significantly affect the prediction performance. As shown in Table 3, \(\overline{{L}}\) or \(\widehat{{L}}\) is larger for Wiki10-31K and AAPD, so TLC-XML achieves better performance on these datasets than on the others. In addition, the prediction performance of TLC-XML is noticeably better on the top-3 and top-5 metrics than on the top-1 metrics due to the over-smoothing problem; we discuss the optimal number of aggregation layers in Sect. 4.3.3. To systematically compare the algorithms, the Friedman test [37] is used to evaluate whether there are statistically significant performance gaps. For each evaluation metric, the average rank of the jth algorithm is computed by \(R_j=\frac{1}{T}\sum \nolimits _{i=1}^T{r_{i}^{j}}\), where \(T=5\) is the number of datasets and \(r_i^j\) denotes the rank of the jth algorithm on the ith benchmark dataset. The Friedman statistic \(F_F\) follows the F-distribution and can be computed by:

$$\begin{aligned} F_{F} = \frac{\left( {T - 1} \right) \mathcal {X}_{F}^{2}}{T\left( {P - 1} \right) - \mathcal {X}_{F}^{2}}, \end{aligned}$$
(15)

where \(\mathcal {X}_{F}^{2} = \frac{12T}{P\left( {P + 1} \right) }\left[{\sum \nolimits _{j = 1}^{P}{R_{j}^{2} - \frac{P\left( {P + 1} \right) ^{2}}{4}}} \right]\), and \(P=9\) is the number of comparison algorithms in our experiment.
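For reference, Eq. (15) can be transcribed directly (function and variable names are ours; ranks holds the per-dataset ranks of all compared algorithms for one metric):

```python
import numpy as np

def friedman_statistic(ranks, T=5, P=9):
    """ranks: T x P matrix, ranks[i, j] = rank of algorithm j on dataset i."""
    R = ranks.mean(axis=0)                                            # average ranks R_j
    chi2_F = 12 * T / (P * (P + 1)) * (np.sum(R ** 2) - P * (P + 1) ** 2 / 4)
    F_F = (T - 1) * chi2_F / (T * (P - 1) - chi2_F)
    return F_F
```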

Table 10 Summary of the Friedman statistics \(F_F\) and the critical value in terms of five evaluation metrics

Table 10 summarizes the Friedman statistic \(F_F\) and the corresponding critical value for each evaluation metric at a significance level of \(\alpha = 0.05\). The \(F_F\) value for each evaluation metric is higher than the critical value, so the performance differences between the algorithms are statistically significant.

To further validate the classification performance of TLC-XML against the other methods, we employ the Nemenyi test [37] to analyze the performance gaps among the compared methods. The critical difference (CD) is introduced to compare the differences in average ranks among the algorithms, where TLC-XML is treated as the control algorithm and \( CD = q_{\alpha }\sqrt{{P(P + 1)}/{6T}}\) (\(q_{\alpha }=3.102\) at the significance level \(\alpha =0.05\) with \(P=9\) comparison algorithms). Figure 3 shows the CD diagrams for all evaluation metrics, where an algorithm not connected to TLC-XML is considered to have significantly different performance at the significance level \(\alpha = 0.05\). As shown in Fig. 3, TLC-XML achieves the best average rank for every evaluation metric across all datasets. In addition, TLC-XML significantly outperforms XML-CNN, SLEEC, Parabel and XR-LINEAR at the significance level \(\alpha = 0.05\).
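
For concreteness, the critical difference under these settings works out as follows:

```python
import math

# Nemenyi critical difference with q_alpha = 3.102, P = 9 algorithms and T = 5 datasets
CD = 3.102 * math.sqrt(9 * (9 + 1) / (6 * 5))   # = 3.102 * sqrt(3) ≈ 5.37
```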

Fig. 3 Comparison of TLC-XML with eight comparison algorithms using the Nemenyi test

On the RCV1 and Wiki10-31K datasets, we compare the training time and performance of XML-CNN, AttentionXML, LightXML, CorNetAttentionXML and TLC-XML; Fig. 4 shows the results. The following conclusions can be drawn: (1) TLC-XML achieves the best performance with less training time; (2) by utilizing label correlation, TLC-XML and CorNetAttentionXML outperform the other methods in average rank, as shown in Fig. 3; and (3) TLC-XML outperforms CorNetAttentionXML in both classification performance and training time.

Fig. 4 Training time and classification performance on RCV1 and Wiki10-31K

4.3 Ablation Studies

4.3.1 Effect of Partitioning the Label Space

To further explore the effect of the label space partitioning method on classification performance, we compare the proposed LGP algorithm with random partitioning (Random) and cluster-based partitioning (Cluster) using 512 clusters on the Wiki10-31K dataset. In addition, we implement two variants of the LGP algorithm, LGP-Se and LGP-Co: LGP-Se exploits only the label semantic information in Partition, and LGP-Co exploits only the label co-occurrence information. Figure 5 shows that LGP outperforms the other partitioning methods, which verifies that the proposed LGP algorithm partitions the label space effectively for the XMTC task.

Fig. 5 Effect of the partitioning method on the classification performance

4.3.2 Effect of Label Correlation

We further investigate the effectiveness of utilizing label correlation information. On the Wiki10-31K dataset, we compare TLC-XML with three other classification models (FC, CorNet [11] and MaCor), where FC uses a fully connected layer, CorNet considers label correlation through a CorNet block, and MaCor uses only the proposed CCL algorithm to exploit the correlation between clusters. Figure 6 shows that utilizing label correlation information significantly improves classification performance. Different from CorNet, TLC-XML extracts different levels of correlation between labels and integrates this valuable information with the feature extraction network. Therefore, TLC-XML significantly outperforms the other classification models on the top-3 and top-5 evaluation metrics.

Fig. 6 Effect of utilizing label correlation on classification performance

4.3.3 Effect of Aggregated Layers

To further verify the effect of the aggregated layers in the LIL algorithm, we evaluate the model with different numbers of aggregated layers on the RCV1 and Wiki10-31K datasets. Figures 7 and 8 show that deeper aggregation achieves better performance on the top-3 and top-5 evaluation metrics, mainly because deeper aggregated layers capture richer correlation information. However, with more layers, labels over-aggregate information from their neighbors and neglect their own information, which particularly hurts the top-1 predictions. For the evaluation metric P@1, the best results are obtained with a single aggregated layer. Based on the above results, we set the number of aggregated layers to 1 for the smaller datasets and 2 for the larger datasets.

Fig. 7 Effect of the number of aggregated layers in RCV1

Fig. 8 Effect of the number of aggregated layers in Wiki10-31K

5 Conclusion

In this paper, we propose TLC-XML, a Transformer-based model for the XMTC task that exploits label correlations in both label space partitioning and the classification model. TLC-XML comprises three modules: Partition, Matcher and Ranker. In Partition, we utilize semantic and co-occurrence correlations between labels to partition the label space. In Matcher, we combine the correlation between clusters with the text representation to match related label clusters. In Ranker, we aggregate the raw label prediction with neighboring label information and further use residual mapping to avoid over-smoothing. The experimental results demonstrate that TLC-XML is significantly superior to state-of-the-art XMTC methods. In many practical scenarios, it is expensive and time-consuming to annotate all labels of a sample, especially when the number of possible labels is extremely large. Therefore, it is valuable to handle the missing-label problem with a robust and efficient strategy. In future work, we plan to exploit the correlation between labels for datasets with missing labels.