1 Introduction

Text classification is a fundamental task in artificial intelligence with a wide range of applications, including sentiment analysis, news classification, topic analysis, and question answering [1,2,3,4]. Recently, deep learning-based text classification methods have made significant progress. Deep learning methods automatically extract text features and perform classification in an end-to-end manner. Compared with traditional text classification methods, deep learning methods need only a certain amount of labeled data for supervised training to learn a high-performing classifier, without manual feature engineering.

Deep learning-based text classification usually converts labels into one-hot vectors. Fitting one-hot label distributions may raise two problems. First, the model's generalization ability cannot be guaranteed, and it tends to overfit. Second, fitting one-hot vectors encourages the model to make the gap between the true category and the other categories as large as possible, which can lead to model overconfidence and arbitrary predictions [5,6,7,8]. The underlying cause is that the one-hot representation assumes a binary relationship between an instance and each label and assumes that the labels are independent of each other. However, labels are rarely completely independent, and an instance can be associated with multiple labels [8,9,10,11]. Therefore, the one-hot representation cannot adequately describe the relationship between an instance and the labels. This situation is more pronounced on confusing datasets with similar labels, which we refer to as the label confusion problem (LCP). LCP further impairs the classification performance of a model. For example, the Newsgroup corpus contains many internal groups, and the labels within each group are very similar, such as “rec.autos” versus “rec.motorcycles” and “talk.politics.misc” versus “talk.religion.misc”. Existing classification methods frequently misclassify such similar labels.

Fig. 1 A case study of the label–concept relationship graph shows the example’s instance, concepts, and the label on the left and the corresponding labels, concepts, and their relationships on the right

Recently, a label distribution learning paradigm was proposed by Geng et al. [12]. By measuring the similarity between an instance and the labels, a new label distribution is generated to replace the original one-hot vector. In this paper, we follow Geng et al. and calculate the similarity score between the instance embedding and the label embeddings during training; we then normalize these scores to obtain a new label distribution that assists model training, thereby addressing the LCP. However, simply representing the instances and labels with word embeddings and then calculating the semantic similarity may result in a surface mismatching problem. For example, given the instance “Beyonce named People’s most beautiful woman” and the label “music”, their similarity is rather low when both are represented by word embeddings, because the instance and the label share no common word. However, once we know that “Beyonce” belongs to the concept singer, we can determine that this instance is strongly correlated with the label “music”. Moreover, there is a natural correlation between concepts and labels, while the conventional label embedding approach ignores this structural information. For instance, “Ford Ranger” is an instance of small truck, as shown in Fig. 1, and this concept is strongly correlated with the label “autos”. Therefore, a relationship graph can be constructed between labels and concepts.

In this paper, conceptual information from a knowledge base is incorporated into the text and label representations to assist the similarity comparison between an instance and the labels. Through this process, a more accurate simulated label distribution can be obtained to better address the LCP. In particular, we generate a new instance that incorporates conceptual information and feed it into a BiLSTM [13] to obtain an instance representation enriched with the relevant concept semantics. Meanwhile, to introduce label-related conceptual information into the label representation, we construct a relationship graph between labels and concepts by counting the co-occurrences of labels and concepts in the training set. The relationship graph is then fed into a graph neural network to obtain a label representation that incorporates the semantics of the relevant concepts. In addition, we set up a multi-loss function, a dynamic combination of cross-entropy loss and KL divergence, which uses the simulated label distribution and the original one-hot label vectors to jointly supervise model training and obtain optimal classification results.

The main contributions of this paper are summarized below:

We generate a simulated label distribution by calculating the correlation between an instance and the labels. This method can capture the overlapping relationships between labels and thus solve the LCP.

To solve the surface mismatching problem when computing the similarity between labels and instances, we incorporate conceptual information from the knowledge base into the instance and label representations. In particular, we feed the label–concept relationship graph into the graph attention network to obtain a label representation that incorporates conceptual information and interrelationships between labels.

To make full use of the generated simulated label distribution as well as the original label vector, we set up a multi-loss function to supervise the model training. Additionally, a soft switch is set to dynamically adjust the importance of the two loss functions in the multi-loss to obtain the optimal text classification performance.

We evaluated SLDC on five complex text classification datasets and a highly confusing subset. The experimental results show that SLDC outperforms competitive baselines by a large margin.

The remainder of this paper is organized as follows.

In Sect. 2, work related to this study is presented. Section 3 describes in detail the method for generating the SLDC and the text classification method. Section 4 presents the experimental setup and result analysis. Finally, the conclusion is drawn in Sect. 5.

2 Related Work

Using a hard one-hot label representation may cause model overconfidence, leading to arbitrary predictions and making it difficult to distinguish confusing labels [5,6,7,8,9,10]. Recently, researchers have focused on soft label representations to solve these problems, using techniques such as label smoothing, label embedding, and label distribution learning.

2.1 Label Smoothing

Label smoothing is an effective regularization tool for deep learning that generates a soft label representation by taking a weighted average of a uniform distribution and a one-hot distribution; it is often used to reduce overfitting when training deep learning models [5, 6]. Szegedy et al. [5] applied label smoothing as a regularization tool for image classification. Vaswani et al. [14] applied it to machine translation tasks. Chorowski and Jaitly [15] used label smoothing to reduce the word error rate on speech recognition tasks.
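The generation of such soft labels is straightforward. A minimal sketch of this weighted average, assuming a smoothing factor of 0.1 (the value also used for our label smoothing baseline in Sect. 4.2):

```python
import torch

def smooth_labels(one_hot: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Weighted average of a one-hot vector and a uniform distribution."""
    num_classes = one_hot.size(-1)
    return (1.0 - eps) * one_hot + eps / num_classes

# A 4-class one-hot label [0, 1, 0, 0] becomes [0.025, 0.925, 0.025, 0.025].
```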

Although label smoothing can alleviate the overfitting problem, the generated soft label distribution simply adds uniformly distributed noise to the hard one-hot vector. Such a soft label distribution does not reflect the semantic relevance between an instance and the labels. A true label distribution should reflect the importance of each label to the instance; in turn, similar instances should have similar label distributions. At the same time, label smoothing can cause the model to learn too little from the labels and thus risk underfitting.

2.2 Label Embedding

Label embedding is the representation of labels in the embedding space to enable the label information to be utilized in the training process. Wang et al. [16] trained labels and text in the same embedding space. Zhang et al. [17] designed a metric module to convert the classification problem into a vector matching task between the label and the text embedding. Du et al. [18] focused on matching words in the text with labels. The classification results are obtained by computing the interaction between each word embedding and label embedding. This paper also uses joint label embedding learning to capture the semantic relations between instances and labels.

2.3 Label Distribution Learning

Label distribution learning is a new machine learning paradigm that obtains the ground-truth label distribution by calculating the importance of each label to the instance. The model parameters are then updated by minimizing the distance between the model output and the ground-truth label distribution [19,20,21]. Wang et al. [22] and Geng et al. [12] proposed several label distribution learning techniques. Using the topological information of the feature space and the correlation among the labels, Xu et al. [23] produced label distributions from the logical labels in the training set. In this study, we construct a simulated label distribution based on label distribution learning by calculating the correlation between instances and labels to assist model training. We then use the original label representation and the generated simulated label distribution to jointly supervise the training process. Meanwhile, we dynamically combine the two loss functions to achieve the best classification results.

2.4 Conceptualization

From the perspective of computer information processing, machine conceptual cognition means producing concepts as output from some form of data input [24]. Conceptualization maps the entities within a text to the corresponding concepts in a knowledge base. Wang et al. [25] found the most likely concept for each entity using multiple rounds of random walks on a semantic map. Song et al. [4] used naïve Bayes to conceptualize entities and short texts while using homogeneous instances to eliminate the semantic ambiguity of entities. Kim et al. [26] used LDA to capture the semantic correlation between entities and eliminate ambiguity based on related entities. Hua et al. [27] designed a concept voting algorithm over a semantic map to find the optimal concept for each entity. Chen et al. [1] incorporated conceptual information into deep neural networks as a form of knowledge. Jiang et al. [28] used conceptualization information for short text classification. Xie et al. [29] explained the relationship between entities by conceptualizing entity pairs. In this paper, we enrich the text representation with text-related concepts. In addition, we build a graph between concepts and labels and feed it into a graph neural network to obtain label representations with relevant conceptual information. This addresses the surface mismatching problem when comparing an instance with the labels for similarity.

3 Methodology

The structure of the SLDC model is shown in Fig. 2. In this section, we first introduce the knowledge base we use and discuss how to extract the concepts most relevant to the entities. Based on the extracted concepts, we elaborate on how to generate a text representation that fuses the concepts. We also describe how to establish structural relationships between labels and concepts and generate label representations that fuse the concepts. Finally, we present the method for generating the simulated label distribution and introduce the setting of the loss function.

Fig. 2 The structure of the SLDC model. MCG means the Microsoft Concept Graph

3.1 Microsoft Concept Graph and Typicality Scores

The Microsoft Concept Graph is a hierarchy of concepts (a taxonomy) constructed from web data and search logs, with entities and concepts as vertices and ‘is-A’ relationships as edges. For example, in the ‘is-A’ relationship “apple isA fruit”, “apple” is an entity and fruit is a concept. The ‘is-A’ relation can be subdivided into an ‘instanceOf’ relation between entities and concepts and a ‘subclassOf’ relation between concepts. The former, such as “bird isA animal”, expresses the relationship between entities and concepts; the latter, such as “fruit isA food”, expresses the relationship between concepts, where fruit is a subconcept and food is a parent concept. Since any entity or concept has to be expressed through language, a lexical concept hierarchy is usually used in practice, whose basic relationship is the hypernymy–hyponymy relationship between words. For example, “apple isA fruit” means that apple is a hyponym of fruit and fruit is a hypernym of apple. The Microsoft Concept Graph contains over 12 million entities, over 8 million “is-A” relationships, and over 5 million unique concepts. It is constructed automatically from 1.6 billion web pages using Hearst patterns [30]. For example, the pattern “NP such as NP” can retrieve “Apple isA technology company” and “Tesla isA technology company” from “technology companies such as Apple and Tesla”.

An important feature of the Microsoft Concept Graph is that the relationship between entities and concepts is not described as a 0-or-1 relation but as a possibility, which is consistent with how humans perceive concepts. For example, given the entity “platypus”, one is more likely to conceptualize it as animal than as music or food. The Microsoft Concept Graph describes this possibility by the frequency with which an object is identified as belonging to a category. Using Hearst patterns, we find that in the corpus the entity chicken co-occurs with the concept bird 130 times, while robin co-occurs 279 times, which is consistent with robin being a more typical bird than chicken. The Microsoft Concept Graph thus defines the possible relationship between entities and concepts by counting their co-occurrences, and we call this possibility the typicality score. Formally, the typicality scores are defined as follows:

$$\begin{aligned} P(c \mid e)=\frac{n(e, c)}{\sum _{e \in c_{i}} n\left( e, c_{i}\right) }, \end{aligned}$$
(1)
$$\begin{aligned} P(e \mid c)=\frac{n(e, c)}{\sum _{e_{i} \in c} n\left( e_{i}, c\right) }. \end{aligned}$$
(2)

where \(n\left( e,c\right)\) is the number of times the entity and concept co-occur in the web corpus according to Hearst patterns. Typicality scores enable more precise and trustworthy knowledge representation and give users more freedom to query and manipulate the knowledge base [31]. Nevertheless, “extreme concepts” can easily be derived from these two conceptualizations. For example, when e = Microsoft, the typicality \(P\left( c_1 \mid e\right)\) is high for \(c_{1}=\) company and \(P\left( e\mid c_{2}\right)\) is high for \(c_{2}=\) largest OS vendor. However, company is a very abstract concept for Microsoft, and the range of entities covered by such an abstract concept is too large, whereas largest OS vendor is a very concrete concept whose coverage is too limited. To solve the “extreme concepts” problem, we conceptualize entities using the improved method proposed by Wang et al. [32], as shown in Eq. (3).

$$\begin{aligned} {\text {Rep}}(e, c)=P(c \mid e) \cdot P(e \mid c)_{k\text {-smooth}} \end{aligned}$$
(3)
$$\begin{aligned} P(e\mid c)_{k\text {-smooth}}=\frac{n(e, c)+k}{\sum _{e_{i} \in c} n\left( e_{i}, c\right) +k N_{e}}, \end{aligned}$$
(4)

where \(P(e\mid c)_{k\text {-smooth}}\) is a smoothed estimate used to avoid extreme values. \(N_{e}\) is the total number of entities, and k is a tiny constant that assumes each concept–entity pair co-occurs a limited number of times in the real world, regardless of whether these co-occurrences have been observed.
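As an illustration, the typicality scores and Rep(e, c) of Eqs. (1)–(4) could be computed from raw co-occurrence counts as sketched below; the count table, the value of k, and the entity total are hypothetical placeholders:

```python
# Hypothetical co-occurrence counts n(e, c) extracted via Hearst patterns.
counts = {("robin", "bird"): 279, ("chicken", "bird"): 130, ("chicken", "food"): 80}

def rep_score(e: str, c: str, k: float = 0.001, n_entities: int = 12_000_000) -> float:
    """Rep(e, c) = P(c|e) * P(e|c)_k-smooth, following Eqs. (1)-(4)."""
    n_ec = counts.get((e, c), 0)
    # P(c|e): normalize over all concepts that entity e belongs to.
    p_c_given_e = n_ec / sum(n for (ei, _), n in counts.items() if ei == e)
    # P(e|c) with k-smoothing: normalize over all entities of concept c.
    denom = sum(n for (_, ci), n in counts.items() if ci == c) + k * n_entities
    p_e_given_c_smooth = (n_ec + k) / denom
    return p_c_given_e * p_e_given_c_smooth

print(rep_score("robin", "bird") > rep_score("chicken", "bird"))   # True
```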

3.2 Concept Retrieval

The goal of this module is to extract concepts related to the given text, i.e., to obtain the set of text-related concepts. To achieve this goal, we use a two-step strategy: entity linking and conceptualization [1]. Entity linking is an important NLP task used to identify the entities mentioned in a text [33]. We use an existing entity linking method [34] to obtain the entities in the text. Then, for each entity, we compute the relevance weights of its candidate concepts and select the most relevant concept based on the weight ranking. For example, from the text “Ford Bronco Test Mule Spotted Flexing Its Muscles in Australia”, we can extract the entities “Ford Bronco” and “Australia”. From the Microsoft Concept Graph, we can then extract their most relevant concepts, suvs and developed countries. If no concept can be found for an entity in the Microsoft Concept Graph, we treat the entity itself as a concept.
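A minimal sketch of this retrieval step follows. The entity list is assumed to come from an off-the-shelf entity linker, `concept_index` is a hypothetical in-memory lookup over the Microsoft Concept Graph, and the scoring callable plays the role of Rep(e, c) from the previous sketch:

```python
def retrieve_concepts(entities, concept_index, score, top_n=1):
    """For each linked entity, keep its highest-scoring concept(s)."""
    result = {}
    for e in entities:
        candidates = concept_index.get(e, [])
        if not candidates:          # entity absent from the knowledge base:
            result[e] = [e]         # treat the entity itself as a concept
            continue
        ranked = sorted(candidates, key=lambda c: score(e, c), reverse=True)
        result[e] = ranked[:top_n]
    return result

# Hypothetical index and scores for the example sentence.
index = {"Ford Bronco": ["suvs", "vehicles"], "Australia": ["developed countries", "countries"]}
scores = {("Ford Bronco", "suvs"): 0.9, ("Ford Bronco", "vehicles"): 0.4,
          ("Australia", "developed countries"): 0.8, ("Australia", "countries"): 0.6}
print(retrieve_concepts(["Ford Bronco", "Australia"], index, lambda e, c: scores[(e, c)]))
# {'Ford Bronco': ['suvs'], 'Australia': ['developed countries']}
```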

3.3 Text Encoding

For each input text, we first tokenize the text and identify the entities. The concept corresponding to each entity is then placed after it, generating a new text that joins the concepts. We then encode the text using RoBERTa [35]. RoBERTa improves the BERT [36] model at the structure and data levels, using more training resources, a larger batch size, and a longer training time. RoBERTa removes Next Sentence Prediction (NSP) and trains on longer sentences. It also employs dynamic masking, randomly masking different tokens so that the model gradually adapts to alternative masking strategies and thereby acquires more diverse language representations.

To fully use the semantic information gained by the RoBERTa model in a massive corpus, the input encoding part requires preprocessing the text into the same format as that used in pretraining. RoBERTa uses byte-pair encoding (BPE) [37] to preprocess the data by splitting the words into subwords and using a 50k-sized subword dictionary. For a word that is not in the dictionary, the last character is removed sequentially until a subword that exists in the dictionary can be found.

For the token sequence obtained after tokenization, we add [CLS] at the beginning of the sentence and [SEP] at the end of the sentence and between two sentences. Similar to the BERT input form, the input of RoBERTa is obtained by summing the token embeddings, segment embeddings, and position embeddings element-wise. Afterward, multi-layer transformer encoders extract features from the input vectors, and the last transformer layer produces each token’s representation. For the input text, a collection of token vectors can thus be generated.
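A possible way to obtain these token vectors, sketched with the Hugging Face transformers library (an implementation choice not specified here; the RoBERTa tokenizer inserts its own <s>/</s> markers, which play the role of [CLS]/[SEP]):

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

# Concept-augmented text: each concept is placed right after its entity.
text = "Ford Bronco suvs Test Mule Spotted Flexing Its Muscles in Australia developed countries"

batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():                                     # in training, the encoder is fine-tuned
    token_vectors = encoder(**batch).last_hidden_state    # shape: (1, seq_len, 768)
```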

We add a bidirectional long short-term memory network (BiLSTM) on top of the token vectors produced by the underlying encoder. The LSTM model captures longer-distance dependencies because, during training, it learns which information to remember and which to forget. The BiLSTM further captures semantic dependencies in both the forward and backward directions:

$$\begin{aligned} \overrightarrow{q_{t}} =\overrightarrow{{\text {LSTM}}}\left( x_{t}, \overrightarrow{q_{t-1}}\right) \end{aligned}$$
(5)
$$\begin{aligned} \overleftarrow{q_{t}} =\overleftarrow{{\text {LSTM}}}\left( x_{t}, \overleftarrow{q_{t-1}}\right) \end{aligned}$$
(6)

We concatenate each \(\overrightarrow{q_{t}}\) and \(\overleftarrow{q_t}\) to obtain the hidden state at step t. Letting the number of hidden units of each one-way LSTM be u, we take the last hidden state as the representation of instance i and denote it as \(q^{i} \in {\mathbb {R}}^{2 u}\). We then feed \(q^i\) into a linear layer and calculate the predicted label distribution with the softmax function:

$$\begin{aligned} f^p=\mathrm {softmax}(q^iW_1+b_1), \end{aligned}$$
(7)

where \(f^p\) is the predicted label distribution. \(W_1\in {\mathbb {R}}^{2u\times k}\) and \(b_1\) are trainable parameters, and k is the number of labels.
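A minimal PyTorch sketch of Eqs. (5)–(7); the token-vector dimension, the number of hidden units u, and the number of labels k below are placeholders:

```python
import torch
import torch.nn as nn

class InstanceEncoder(nn.Module):
    """BiLSTM over token vectors, followed by a linear + softmax head (Eqs. 5-7)."""

    def __init__(self, input_dim=768, hidden_units=256, num_labels=20):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_units,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_units, num_labels)   # W1, b1

    def forward(self, token_vectors):
        outputs, _ = self.bilstm(token_vectors)              # (batch, seq_len, 2u)
        q_i = outputs[:, -1, :]                              # last hidden state as instance vector
        f_p = torch.softmax(self.classifier(q_i), dim=-1)    # predicted label distribution
        return q_i, f_p
```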

3.4 Label Encoding

As shown in Fig. 1, we construct a graph of the relationships between labels and concepts, whose nodes include both labels and concepts; the concept nodes are those retrieved for the instances in the training set. The right half of Fig. 2 shows the process of label encoding. We first obtain the initial vector representation of each node using a RoBERTa encoder. To incorporate conceptual information into the label representation, we feed the constructed relationship graph into a graph attention network (GAT) [38], chosen for its excellent performance in node representation.

Through its attention mechanism, GAT aggregates neighboring node features to produce a node representation. The input to a graph attention layer is a set of node features \(\{\mathbf {h}_{1},\mathbf {h}_{2},\ldots ,\mathbf {h}_{N}\}\), \(\mathbf {h}_{i}\in {\mathbb {R}}^{d}\), where N is the number of nodes and d is the dimension of the node features. After passing through the graph attention layer, a new set of node features incorporating conceptual information is generated.

Fig. 3 Attention mechanism of node \(v_i\) to node \(v_j\)

As shown in Fig. 3, for node \(v_i\), we calculate the attention weight of a neighboring node \(v_j\) with respect to \(v_i\) as follows. First, we multiply the two input vectors \(\mathbf {h}_{i}, \mathbf {h}_{j} \in {\mathbb {R}}^d\) by a weight matrix \(W\in {\mathbb {R}}^{d^{'}\times d}\) to obtain two \({\mathbb {R}}^{d^{'}}\)-dimensional vectors. The two transformed vectors are then concatenated and passed through a single-layer feedforward neural network parameterized by a weight vector \(\mathbf {a}\in {\mathbb {R}}^{2d^{'}}\), followed by a LeakyReLU nonlinearity. The attention score of node \(v_j\) with respect to node \(v_i\) is calculated as:

$$\begin{aligned} e_{i j}={\text {Leaky}} {\text {ReLU}}\left( \mathbf {a}^{T}\left[ W \mathbf {h}_{i}\Vert W \mathbf {h}_{j}\right] \right) . \end{aligned}$$
(8)

\(e_{ij}\) indicates the importance of node \(v_j\) to node \(v_i\), where \(v_j\in N_i\) and \(N_i\) denotes the first-order neighbors of \(v_i\) (including \(v_i\) itself). To make the attention scores comparable across different nodes, we normalize them with the softmax function:

$$\begin{aligned} \alpha _{ij}=\mathrm {softmax}_j\left( e_{i j}\right) =\frac{\exp \left( e_{i j}\right) }{\sum _{k \in N_{i}} \exp \left( e_{i k}\right) }. \end{aligned}$$
(9)
Fig. 4 An example of computing multi-head attention (with K = 3 heads) for the central node (\(h_1\)) and its neighbor nodes (\(h_2\)–\(h_6\))

As shown in Fig. 4, we add a multi-head attention mechanism to further increase the expressiveness of the attention layer. This mechanism allows attention to be distributed over several different aspects of the relationship between the central node and its neighbors. We average the results of K independent attention heads as follows:

$$\begin{aligned} \mathbf {h}_{i}^{\prime }=\sigma \left( \dfrac{1}{K}\sum \limits _{k=1}^{K}\sum \limits _{j\in N_i}\alpha ^k_{ij}W^k\mathbf {h}_j\right) , \end{aligned}$$
(10)

where \(\mathbf {h}_{i}^{\prime }\) represents node \(v_i\) after the graph attention layer with multi-head attention and \(\sigma\) denotes the nonlinear activation function.

The label vectors are passed through GAT to obtain a set of label representations incorporating relevant conceptual information \(h^l=[\mathbf {h}_{l1}^{\prime },\mathbf {h}_{l2}^{\prime },\ldots ,\mathbf {h}_{li}^{\prime }]\).
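A minimal sketch of such a graph attention layer (Eqs. 8–10), assuming a dense adjacency matrix with self-loops and using ELU for the nonlinearity σ (the particular activation is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Multi-head graph attention (Eqs. 8-10), averaging the K heads."""

    def __init__(self, in_dim, out_dim, num_heads=3):
        super().__init__()
        self.W = nn.Parameter(torch.empty(num_heads, in_dim, out_dim))
        self.a = nn.Parameter(torch.empty(num_heads, 2 * out_dim))
        nn.init.xavier_uniform_(self.W)
        nn.init.xavier_uniform_(self.a)

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) adjacency with self-loops.
        wh = torch.einsum("nd,kdo->kno", h, self.W)                  # W^k h, per head
        d = wh.size(-1)
        src = torch.einsum("kno,ko->kn", wh, self.a[:, :d])          # a_left . W h_i
        dst = torch.einsum("kno,ko->kn", wh, self.a[:, d:])          # a_right . W h_j
        e = F.leaky_relu(src.unsqueeze(2) + dst.unsqueeze(1))        # Eq. (8), shape (K, N, N)
        e = e.masked_fill(adj.unsqueeze(0) == 0, float("-inf"))      # keep first-order neighbours only
        alpha = torch.softmax(e, dim=-1)                             # Eq. (9)
        out = torch.einsum("knm,kmo->kno", alpha, wh)                # aggregate neighbours
        return F.elu(out.mean(dim=0))                                # Eq. (10): average the K heads

# Usage sketch: node features from RoBERTa (dim 768) and the label-concept graph.
# gat = GraphAttentionLayer(in_dim=768, out_dim=256, num_heads=3)
# h_prime = gat(node_features, adj)        # the label rows of h_prime give h^l
```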

3.5 Simulated Label Distribution and Loss Function

Fig. 5 Attention mechanism for generating the simulated label distribution

To measure the similarity between the labels and the instance, we feed the instance representation (\(q^i\)) and the label representations (\(h^l\)) into a similarity layer. As shown in Fig. 5, the similarity layer consists of an additive attention mechanism. The text embedding and label embedding are added, passed through a tanh activation, and then fed into a linear layer to obtain the attention scores. Next, a softmax function normalizes the attention scores to yield the simulated label distribution:

$$\begin{aligned} f^{s}=\mathrm {softmax}\left( v^{T} \tanh \left( W_2 q^{i}+U h^{l}\right) \right) , \end{aligned}$$
(11)

where \(f^s\) represents the concept-based simulated label distribution, and \(v\), \(W_2\), and \(U\) are all learnable parameters.
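A sketch of this similarity layer (Eq. 11); the attention dimension and the batching convention are assumptions:

```python
import torch
import torch.nn as nn

class SimilarityLayer(nn.Module):
    """Additive attention between an instance vector and the label matrix (Eq. 11)."""

    def __init__(self, instance_dim, label_dim, attn_dim=256):
        super().__init__()
        self.W2 = nn.Linear(instance_dim, attn_dim, bias=False)
        self.U = nn.Linear(label_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, q_i, h_l):
        # q_i: (batch, instance_dim); h_l: (num_labels, label_dim)
        scores = self.v(torch.tanh(self.W2(q_i).unsqueeze(1) + self.U(h_l))).squeeze(-1)
        return torch.softmax(scores, dim=-1)     # simulated label distribution f^s
```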

Once the concept-based simulated label distribution is obtained, the simulated label distribution can be used as a new training target to assist in model training. We combine the cross-entropy loss and the KL divergence [39] to jointly supervise the model training in order to fully exploit the simulated label distribution as well as the original label vectors:

$$\begin{aligned} \text {multi-loss}&= \gamma \, loss_{KL}(f^s,f^p)+(1-\gamma )\, loss_{CE}(f^t,f^p)\nonumber \\&=\gamma \sum \limits _{[1,k]}f^s\log \left( \dfrac{f^s}{f^p}\right) -(1-\gamma )\sum \limits _{[1,k]}f^t\log (f^p), \end{aligned}$$
(12)

where \(f^t\) represents the original one-hot label distribution. \(\gamma \in [0,1]\) is a soft switch that dynamically adjusts the relative importance of the KL divergence and the cross-entropy loss. The simplest approach is to treat \(\gamma\) as a hyperparameter and tune it manually to obtain the best label distribution. However, manual tuning requires repeating the experiment several times, which is time-consuming. Alternatively, \(\gamma\) can be learned automatically by a neural network, computed as follows:

$$\begin{aligned} \gamma =\sigma (w^Tq^i+b_2), \end{aligned}$$
(13)

where w and \(b_2\) are learnable parameters and \(\sigma\) is the sigmoid function.

It is worth noting that when \(\gamma\) is adjusted automatically, we add a penalty term \(\dfrac{1}{\gamma (1-\gamma )}\) to the multi-loss to prevent \(\gamma\) from converging to extreme values:

$$\begin{aligned} \text {multi-loss}_{auto}&= \gamma \, loss_{KL}(f^s,f^p)+(1-\gamma )\, loss_{CE}(f^t,f^p)+\dfrac{1}{\gamma (1-\gamma )}\nonumber \\&=\gamma \sum \limits _{[1,k]}f^s\log \left( \dfrac{f^s}{f^p}\right) -(1-\gamma )\sum \limits _{[1,k]}f^t\log (f^p)+\dfrac{1}{\gamma (1-\gamma )}. \end{aligned}$$
(14)
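Putting Eqs. (12)–(14) together, a minimal sketch of the multi-loss with the soft switch; tensor shapes are simplified and numerical-stability details (e.g., clamping the logarithms) are omitted:

```python
import torch

def multi_loss(f_s, f_p, f_t, q_i, w, b2, auto_gamma=True, gamma=0.7):
    """Dynamic combination of KL divergence and cross-entropy (Eqs. 12-14).

    f_s: simulated label distribution, f_p: predicted distribution,
    f_t: one-hot labels, q_i: instance representation; w, b2 parametrize Eq. (13).
    """
    if auto_gamma:
        gamma = torch.sigmoid(q_i @ w + b2)                 # Eq. (13), one value per sample
    kl = torch.sum(f_s * torch.log(f_s / f_p), dim=-1)      # KL(f^s || f^p)
    ce = -torch.sum(f_t * torch.log(f_p), dim=-1)           # cross-entropy with one-hot labels
    loss = gamma * kl + (1 - gamma) * ce
    if auto_gamma:
        loss = loss + 1.0 / (gamma * (1 - gamma))           # penalty keeping gamma off the extremes
    return loss.mean()
```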

4 Experiments

4.1 Dataset

To evaluate the effectiveness of SLDC, we used five complex text classification datasets: 20-Newsgroups (20NG), Ohsumed, R52 and R8 of Reuters 21578, and Movie Review (MR). The datasets’ descriptions are as follows:

20NG dataset: 20NG is a text classification dataset that contains roughly 20,000 newsgroup documents. Some newsgroups contain similar topics (e.g., rec.autos, rec.motorcycles).

R8 dataset: The R8 dataset is a subset of the Reuters 21578 dataset. It has 8 categories, with 5,485 training texts and 2,189 test texts.

R52 dataset: The R52 dataset is a subset of the Reuters 21578 dataset. It has 52 categories, with 6,532 training texts and 2,568 test texts.

Ohsumed: The Ohsumed corpus is based on the National Library of Medicine’s MEDLINE collection, a major bibliographic database of medical literature. From a total of 20,000 abstracts from 1991, we used the 13,929 unique abstracts on cardiovascular disorders. Each text in this collection is assigned one or more of 23 disease categories. In the present work, we focus only on single-label texts, totaling 7,400; the training set contains 3,357 texts and the test set 4,043 texts.

MR: MR is a binary sentiment classification dataset in which each review contains only one sentence. The corpus has 5,331 positive and 5,331 negative reviews; the training set contains 7,108 texts and the test set 3,554 texts.

We first preprocessed 20NG, R8, R52, and Ohsumed by removing stop words and low-frequency terms. We did not clean the MR dataset because its texts are very short. The statistical information for each dataset is shown in Table 1. The labels of the three topic classification datasets are shown in Fig. 6.

Table 1 Statistical information of the five text classification datasets
Fig. 6 Labels from the 20NG, R52, and R8 datasets

4.2 Baseline Models

We compared SLDC with a large number of competitive methods. Each method is described in detail as follows.

PV-DBoW [40]: PV-DBoW is a paragraph vector model that ignores word order. We use logistic regression as the classifier.

PV-DM [40]: A paragraph vector model that takes word order into account. We again use logistic regression as the classifier.

PTE [41]: PTE averages the word embeddings to produce document embeddings for text classification by learning word embeddings from a heterogeneous text network that includes terms, documents, and labels as nodes.

FastText [42]: FastText averages word/n-gram embeddings to obtain text embeddings; it is simple and efficient. The text is then classified using a linear layer. We compare the results with and without bigrams.

CNN [43]: A convolutional neural network applied to text classification, using multiple kernels of various sizes to extract salient textual information (similar to multi-window n-grams).

LSTM [44]: Long short-term memory (LSTM) is a recurrent neural network built from chained repeating modules, developed to solve the long-term dependency problem of generic recurrent neural networks (RNNs).

BiLSTM: BiLSTM is a bi-directional LSTM that learns the forward and backward semantics of the text.

BERT [36]: We use BERT-base and fine-tune its parameters with some empirical improvements; BERT has further improved performance on many text classification tasks.

CogLTX [45]: CogLTX identifies key sentences by training a discriminative model, then concatenates the key sentences and feeds them into a classifier to determine the category. We use RoBERTa as the classifier.

Label Smoothing [8]: We use the same text encoder as SLDC but employ label smoothing to generate the label distribution for training. The smoothing hyperparameter is set to 0.1.

Graph-CNN-C [46]: Graph-CNN-C applies a CNN to a graph, obtaining a representation of the text by convolving over the word similarity graph. It uses a Chebyshev filter.

Graph-CNN-S [47]: Similar to Graph-CNN-C but utilizing the Spline filter.

Graph-CNN-F [48]: Similar to Graph-CNN-C but utilizing the Fourier filter.

4.3 Experimental Setting

For the input encoder, we use the RoBERTa-base model with a hidden size of 768. As a comparison, we also test 300-dimensional GloVe pre-trained word embeddings in the ablation experiment. We use Stanza [34] as the entity extractor because of its excellent performance. We then find the basic-level concepts associated with the entities using the Microsoft Concept Graph. Rather than simply concatenating a concept vector with the text vector, we preserve the sequential information of the text by generating a new text that places each concept word after its related entity. We then encode this text to obtain a text representation with conceptual information. Furthermore, dropout with p = 0.15 is applied between the RoBERTa encoder and the BiLSTM. For the 20NG and Ohsumed datasets, we set the maximum input sequence length to 512 tokens to encode as much information as possible; for the R8, R52, and MR datasets, the maximum input sequence length is set to 256. We also construct a label–concept relationship graph from the training data by counting the co-occurrences of all concepts and labels; edges whose normalized scores are too small are removed. With an initial learning rate of 3e-5 and a batch size of 64, the Adam [49] optimizer is used to train our model, which is implemented with PyTorch. The evaluation measures in this paper are accuracy and the F1 score. Accuracy is the ratio of correctly predicted samples to the total number of samples, and the F1 score evaluates the model by balancing recall and precision.
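A sketch of how the label–concept co-occurrence graph could be assembled from the training data; the data layout, the per-label normalization, and the pruning threshold are assumptions:

```python
from collections import Counter

def build_label_concept_graph(training_set, min_score=0.01):
    """Count label-concept co-occurrences and keep edges above a threshold.

    training_set yields (concepts, label) pairs, where `concepts` is the set of
    concepts retrieved for one training instance.
    """
    cooc, label_totals = Counter(), Counter()
    for concepts, label in training_set:
        for concept in concepts:
            cooc[(label, concept)] += 1
            label_totals[label] += 1
    edges = {}
    for (label, concept), n in cooc.items():
        score = n / label_totals[label]       # normalized co-occurrence score
        if score >= min_score:                # drop edges with tiny normalized scores
            edges[(label, concept)] = score
    return edges
```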

Table 2 Accuracy of compared models on different datasets. The best results have been bolded

4.4 Experimental Results

In Table 2, the accuracy of our model is compared with that of the other models on the five text classification datasets described above. Our model achieves the best performance on these complex datasets. Looking more closely, PV-DBoW achieves results comparable to the strong baselines on 20NG and Ohsumed but much lower results on shorter texts, since it ignores word order information, which is essential for classifying short texts. PV-DM is less effective than PV-DBoW and is comparable only on shorter datasets such as MR. The results of PV-DBoW and PV-DM indicate that unsupervised text embeddings are not particularly discriminative for text classification. PTE and FastText train text embeddings in a supervised manner, so the label information can be used to learn more discriminative embeddings; they are therefore significantly more effective than PV-DBoW and PV-DM. CNN is more effective on shorter texts because its convolutions capture n-gram features and thus model continuous, short-range semantics well. BERT extracts text features using a multi-layer bidirectional transformer, which extracts features much more effectively than CNN and LSTM. Moreover, BERT has been pre-trained on a large corpus, so we only need to fine-tune it to achieve competitive results on different datasets. CogLTX uses a discriminator to extract the key sentences in an instance, splices them into a new text, and feeds it into RoBERTa for classification. CogLTX works well on the 20NG and Ohsumed datasets, which shows that the discriminator can extract key information that benefits classification. Graph-CNN also performs well, indicating that similarity graphs built from pre-trained word embeddings can preserve the syntactic and semantic links between words and thus support the classification of long texts.

Most of the above approaches use various complex models to extract text features for classification but ignore the importance of the label representation during training. They still use one-hot vectors as the label representation, which makes the models overconfident and results in arbitrary predictions, especially on confusing datasets (e.g., 20NG, R52). Label Smoothing obtains good performance by using a soft label representation, which, compared to the hard one-hot representation, prevents models from becoming overconfident. However, label smoothing only adds noise to the label representation; it reflects neither the correlation between instances and labels nor the relationships between labels.

There are two main reasons why SLDC works well: (i) SLDC generates a new simulated label distribution by computing the correlation between an instance and the labels, instead of simply using one-hot vectors as the label representation. This approach not only makes full use of the rich semantic information contained in the labels but also effectively captures the overlapping relationships between them, which is very helpful for solving the LCP. (ii) SLDC introduces concept knowledge from the knowledge base into the representations of instances and labels, which not only improves the model's ability to solve the LCP by alleviating the surface mismatching that arises when comparing instances and labels for similarity, but also enriches the instance representation and improves text classification performance.

4.5 Hyperparameter Tuning

Fig. 7 SLDC hyperparameter analysis

\(\gamma\) is a hyperparameter that adjusts the proportions of the KL divergence and the cross-entropy loss in the multi-loss function. Manual adjustment is time-consuming, whereas automatic tuning by a neural network can effectively reduce this time. Figure 7 compares the accuracy obtained by manually setting different \(\gamma\) values with that obtained by letting the neural network adjust \(\gamma\) automatically. \(\gamma = 0\) means that only the cross-entropy loss is involved in model training, i.e., the model is trained using only the original one-hot labels. However, the one-hot label representation assumes that the labels are independent of each other, ignoring the overlapping relationships between labels and losing much of the semantic information they contain. As \(\gamma\) increases, the KL divergence and the simulated label distribution become more involved in model training. The model reaches its highest accuracy after 30 iterations with \(\gamma = 0.7\), which shows that adjusting the ratio of the two loss functions can improve the model's performance. In addition, although the accuracy achieved by automatic adjustment is slightly lower than the best manually tuned value, it still outperforms the other settings, demonstrating the effectiveness of automatic adjustment by a neural network.

4.6 Optimizer and Dropout Tuning

Optimizers are essential to deep learning models: they compute and update the network parameters toward an optimum. We tried different optimizers to determine their effect on the model's accuracy. In particular, we selected three optimizers commonly used for text classification: SGD+Momentum [50], Adam, and AdamW [51]. For SGD, we report the results obtained with its best parameters for this task: learning rate = 3e-2 and momentum factor = 0.9. Adam and AdamW are set to the same learning rate of 3e-5. In addition, we tested the accuracy of the model under different dropout rates.

Fig. 8 Accuracy comparison of different optimizers with dropout rate = 0

Fig. 9 Accuracy comparison of different optimizers with dropout rate = 0.15

Fig. 10 Accuracy comparison of different optimizers with dropout rate = 0.3

Fig. 11 Accuracy comparison of different optimizers with dropout rate = 0.5

Figures 8, 9, 10, and 11 show that the best accuracy is obtained when using Adam with a dropout rate of 0.15. We apply the dropout between the RoBERTa encoder and the BiLSTM. Neither omitting dropout nor using a large dropout rate achieves the highest accuracy. Compared with SGD+Momentum, the adaptive gradient optimizers (Adam and AdamW) automatically adjust the learning rate for each parameter instead of using a fixed value and are better suited to sparse data, so the model converges more quickly and efficiently. We find that using Adam with an appropriately tuned dropout rate achieves the highest accuracy, whereas AdamW does not. This may be because the weight decay of AdamW does not help and even hurts performance on our task.

4.7 Ablation Experiments

In this paper, we construct simulated label distributions by measuring the similarity between instances and labels to assist supervised model training. In particular, we incorporate relevant conceptual information into both the instance representation and the label representation to address the surface mismatching problem in the similarity comparison. In addition, we dynamically combine the two loss functions into a multi-loss function to compute the model loss for training. We use the 20NG dataset to test the effectiveness of these methods.

Fig. 12 Accuracy of ablation experiments on the 20NG dataset

Fig. 13 F1 score of ablation experiments on the 20NG dataset

As shown in Figs. 12 and 13, r1 denotes the model trained with only the basic predictor (BiLSTM), r2 denotes training with the multi-loss, and r3 denotes additionally adding conceptual information to the instances and labels. We also contrast the models' effectiveness with two different word embeddings (GloVe and RoBERTa). The experimental results show, first, that with the same method, RoBERTa embeddings yield higher accuracy than GloVe embeddings. This is because GloVe represents a token with a fixed vector and hence cannot handle words with multiple meanings, whereas RoBERTa produces contextualized embeddings, so a word can have different embeddings in different contexts. Second, the accuracy and F1 score on the 20NG dataset increase significantly when the multi-loss and the simulated label distribution are employed. The 20NG dataset contains many groups with highly similar labels within each group, and the basic predictor struggles to distinguish such similar labels. In contrast, the simulated label distribution captures the complex relationships between labels by computing correlations between instances and labels, and the dynamic combination of the cross-entropy loss and the KL divergence further improves the model, because the original label vectors and the generated simulated label distribution jointly supervise training. Third, we find that adding conceptual information improves model performance regardless of the word embedding method. This may be due to two reasons: (i) conceptual information is incorporated into the text representation as high-level semantics, which assists the model in text classification and thus improves accuracy; (ii) the label representation obtained by the GAT carries the semantics of the related concepts. More specifically, GAT's message propagation mechanism updates the central node's embedding based on the neighboring nodes' embeddings, so the resulting label representation encodes the semantic relationships of the related concepts. When instance and label embeddings that carry concept semantics are compared for similarity, the surface mismatching problem is effectively avoided and the semantic correlation between instances and labels is better captured; in turn, a simulated label distribution that better matches the instance–label relationship is generated.

4.8 Testing on Highly Confused Dataset

To further explore the ability of our model to solve the LCP, we sampled a highly confusing subset of the 20NG dataset for testing. The label composition of this subset is shown in Table 3. We chose RoBERTa-embedding+BiLSTM as the basic predictor. A comparison of the confusion matrices obtained with each method is shown in Figs. 14, 15, and 16. Figure 14 shows the classification result using only the basic predictor; the confusion level of the classification results is relatively high. Figure 15 shows the classification results after introducing the simulated label distribution and the multi-loss into model training; the degree of confusion is reduced, indicating that the simulated label distribution and the multi-loss can alleviate the LCP to some extent. Figure 16 shows the classification results after further introducing conceptual information; the degree of confusion is reduced even more, because the concept information avoids the surface mismatching problem and thus yields a more accurate simulated label distribution.

Table 3 Labels of the 20NG subset
Fig. 14 Testing on the basic predictor

Fig. 15 Testing with the multi-loss function

Fig. 16 Testing with conceptual information

5 Conclusion and Future Work

In this paper, a simulated label distribution based on concepts (SLDC) is proposed to effectively address the LCP in text classification. SLDC captures the semantic overlap between labels by computing the correlation between an instance and the labels, generating a new simulated label distribution to assist in supervising model training. In particular, conceptual information is extracted from a knowledge base and incorporated into the instance and label representations. In this way, the surface mismatching problem in instance–label similarity calculations can be resolved, producing a more accurate simulated label distribution. In addition, multiple loss functions are dynamically combined to jointly supervise model training. We conducted experiments on five complex text classification datasets, and the results show that our method outperforms competitive baselines by a large margin, demonstrating its effectiveness. Further experiments also verify that our method is especially helpful for confusing datasets.

In the future, we will investigate how to use conceptual information to assist in multi-label text classification. There are dependencies between the labels of multi-label text classification, e.g., text belonging to “artificial intelligence” is often related to “deep learning”. Modeling dependencies between labels helps to classify text more accurately. In the future, we will use the is-A relationship in the concept knowledge base to help construct the hierarchical relationship between labels.