
1 Introduction

Traditional single-label classification assumes that each sample has a unique category label drawn from a mutually exclusive label set L (|L| > 1). In practical applications, however, a sample often belongs to multiple categories at the same time; we call such data multi-label data [1]. For example, a news report could be classified as both “entertainment” and “technologies”, and a movie can be both an “action movie” and a “thriller”. Multi-label classification differs significantly from traditional single-label classification: the correlation and co-occurrence between categories mean that existing single-label classification methods cannot be applied directly to the multi-label classification problem. As a result, multi-label classification has gradually become a research hotspot and a challenging problem, especially in fields such as text classification, gene function classification, and image semantic annotation.

Researchers have been seeking optimal classification algorithms to improve the classification accuracy on multi-label data. There are two common strategies for multi-label classification [2]. The first is to convert the multi-label dataset into single-label datasets and then apply traditional classification algorithms to them (problem transformation, abbreviated as PT). Binary Relevance (BR) [3] is a typical PT method: it treats the prediction of each label as an independent binary classification problem, designs an independent classifier for each label, and trains each classifier on all training data. However, it ignores the interrelationships between labels and often fails to achieve satisfying classification performance. Guo [4] proposed an improved binary relevance algorithm that uses two layers to decompose the multi-label classification problem into L independent binary classification problems. Liu [5] proposed a classifier chain algorithm based on dynamic programming and a greedy classifier chain algorithm to search for the globally optimal labels, compensating for the sensitivity of the Classifier Chain (CC) algorithm to label selection [6]. Label Powerset (LP) [7] encodes every label combination as a binary number and treats it as a new label. The second strategy is to modify an existing single-label learning algorithm to solve the multi-label learning problem. For example, the MLkNN algorithm estimates the prior probability of each label from the label sets of the training data, together with the conditional probabilities of a sample having or not having a label, and then predicts whether a new sample carries that label [8]. Tsoumakas [9] proposed the Random k-Labelsets method, which decomposes the initial label set into several small random subsets and uses the Label Powerset algorithm to train a classifier on each subset. Other researchers have also explored a variety of methods for multi-label classification [10,11,12,13]. During training and prediction, existing multi-label classification algorithms either ignore the interdependence between category labels, or ignore the important influence of the initial features on the predicted values, or simply append the labels to the original features as additional attributes, which makes an already high-dimensional feature set even more complicated. Even when the dependency relationships between category labels are fully exploited, these algorithms neglect the initial prediction values between the label set and the training samples, which makes the multi-label classification inaccurate.

We propose a multi-label short-text classification algorithm that combines a similarity graph with a random walk with restart model (abbreviated as SGaRW). On the one hand, the similarity graph is used to compute the original relationship between a text and the labels; on the other hand, the random walk with restart model is used to compute the potential semantic relationships among the labels. Finally, the two are fused to make the multi-label classification result more accurate.

2 Preliminary and Background

We review the existing basic concepts and define the problem of multi-label classification in this section.

2.1 Multi-label Classification

Fundamentally, multi-label classification can be viewed as a label ranking problem [14, 15]: a relevance score is computed between the test sample and each category label, and the labels to which the sample belongs are then determined from these scores. Assume that \( X = \{x_1, x_2, \ldots, x_n\} \) denotes the sample set, \( Y = \{y_1, y_2, \ldots, y_m\} \) the label set, and \( D = \{(x_i, Y_i) \mid 1 \le i \le n\} \) the dataset, where \( Y_i \subseteq Y \) is the label set of sample xi. The label prediction for a sample x can then be expressed as the following vector H(x).

$$ H(x) = (h_{1} (x), \cdots ,h_{i} (x), \cdots ,h_{m} (x)) $$
(1)

In this vector, \( h_i(x) \in [0,1] \) describes the relevancy between sample x and label yi. The goal of multi-label classification is to learn a classifier \( h: X \to 2^Y \) from training data; given a new sample x, the classifier predicts the label set that x subsumes. Multi-label classification therefore amounts to seeking an optimal classification algorithm that constructs a high-precision score vector H(x) so as to classify accurately.
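As a minimal sketch of this formulation in Python: the 0.5 threshold used to turn the score vector into a label set is our own illustrative assumption (the paper only requires scores in [0, 1]), and the label names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical label set Y; any scoring function h_i(x) in [0, 1] can be plugged in.
labels: List[str] = ["action", "thriller", "comedy"]

def predict_label_set(score: Callable[[str, str], float],
                      x: str,
                      threshold: float = 0.5) -> Dict[str, float]:
    """Build the score vector H(x) and keep labels whose score reaches the threshold.
    The threshold is an illustrative assumption, not part of the paper."""
    H = {y: score(x, y) for y in labels}          # H(x) = (h_1(x), ..., h_m(x))
    return {y: s for y, s in H.items() if s >= threshold}
```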

2.2 Similarity Graph

A similarity graph [16, 17], built on WordNet, is a directed weighted graph G = (V, E) used to compute the semantic similarity among its nodes, where V = {itemset, senseset}: itemset is the collection of nodes (items) that represent words, and senseset is the collection of nodes (senses) that represent word meanings. According to the correspondence between them, a directed edge <vi, vj> is added between two sense nodes, between an item and a sense, or between two items. The weight on an edge is denoted wij and represents the probability of thinking of node vj upon seeing the current node vi; the weight wij therefore reflects a conditional probability, and the similarity graph can be regarded as a probability graph.

Take the similarity graph shown in Fig. 1, where itemset = {adventure, thrilling, action} and senseset = {“a wild and exciting undertaking lawful”, “take a risk in the hope of …”, “something that people do or cause to happen”}. In WordNet, the word “adventure” has two senses: the usage frequency of the first sense is 0.92 and that of the second is 0.08, so the weight from the item node “adventure” to the first sense node “a wild and exciting undertaking lawful” is 0.92, meaning that the probability of someone being interested in the first sense after seeing the word “adventure” is 0.92. Conversely, the weight from the sense node “a wild and exciting undertaking lawful” to the item node “adventure” is 1, meaning that someone is certain to think of the word “adventure” when seeing either “a wild and exciting undertaking lawful” or “take a risk in the hope of …”.

Fig. 1. Similarity graph
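As an illustration only, the fragment of the Fig. 1 graph discussed above can be encoded with networkx as follows; edges and weights not mentioned in the text are omitted.

```python
import networkx as nx

# A minimal sketch of part of the Fig. 1 similarity graph.
# Only the edges discussed in the running example are included.
G1 = nx.DiGraph()
G1.add_edge("adventure", "a wild and exciting undertaking lawful", weight=0.92)
G1.add_edge("adventure", "take a risk in the hope of ...", weight=0.08)
G1.add_edge("a wild and exciting undertaking lawful", "adventure", weight=1.0)
G1.add_edge("take a risk in the hope of ...", "adventure", weight=1.0)
```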

3 Implement Multi-label Classification of Short Text

In this paper, multi-label classification of text is implemented in two stages. In the first stage, we create a similarity graph based on the text content and calculate the original relationship between the text and the label set, which gives the initial predicted value H(x). In the second stage, a label dependency graph is constructed and the random walk with restart algorithm is run on this graph to mine the potential semantic relationships among the labels. When the algorithm converges, we obtain a vector consisting of the probability that the text belongs to each label, from which the labels of the text are determined.

3.1 Calculate Initial Association Between Sample and Labels

We consider short texts as sense nodes, map labels to item nodes, and create the similarity graph G1 = (V1, E1). The affinity score between a text and a label along a directed path is then defined as the product of the weights of all adjacent edges between the text node and the label node on that path [18], as shown in formula (2):

$$ affinity_{pt} (v_{doc} |v_{label} ) = \prod\limits_{\begin{subarray}{l} v_{i} ,v_{j} \in pt \\ (v_{i} ,v_{j} ) \in E1 \end{subarray} } {P_{pt} (v_{i} |v_{j} )} $$
(2)

Here, affinitypt(vdoc|vlabel) is the affinity score from vdoc to vlabel, the node sequence pt = <vdoc, … vi, vj, … vlabel> is a directed path from vdoc to vlabel, and ppt(vi|vj) is the weight of the edge between vi and vj in G1, which supports calculating affinity scores between two nodes. According to the Markov model, the value of the conditional probability decreases as the path length increases: the longer the path, the weaker the evidence of an intimate relationship between the two nodes.

There is usually more than one directed path from vdoc to vlabel in the similarity graph G1, so the text-to-label affinity score over the entire graph G1 is expressed as the sum of the affinity scores over all directed paths between the two nodes. Let Aff′(vdoc, vlabel) denote this affinity score, as in formula (3).

$$ Aff^{\prime}(v_{doc} ,v_{label} ) = \sum {affinity_{pt} (v_{doc} |v_{label} )} $$
(3)

Due to the asymmetric nature of the affinity scores, the final affinity scores between two nodes can be obtained by formula (4).

$$ Aff\left( {v_{doc} ,v_{label} } \right) = \frac{{Aff^{\prime}(v_{doc} ,v_{label} ) + Aff^{\prime}(v_{label} ,v_{doc} )}}{2} $$
(4)

We treat the affinity score between vdoc and vlabel as the correlation score hi(x) between sample x and label yi, that is:

$$ h_{i} (x) = Aff\left( {x,y_{i} } \right) $$
(5)

Taking all labels into account, we obtain the correlation scores between sample x and all labels in the label set Y, as shown in formula (6).

$$ \varvec{H}(x) = [Aff\left( {x,y_{1} } \right), \ldots ,Aff\left( {x,y_{i} } \right), \ldots ,Aff\left( {x,y_{m} } \right)]^{\text{T}} $$
(6)
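As a rough illustration of formulas (2)–(6), the sketch below enumerates directed paths in a networkx version of G1 and multiplies the edge weights along each path; the path-length cutoff and the function names are our own assumptions, since the paper does not specify how paths are enumerated.

```python
import networkx as nx
from math import prod

def affinity_one_way(G1: nx.DiGraph, src, dst, cutoff: int = 4) -> float:
    """Aff'(src, dst): sum over directed paths of the product of edge weights (formulas 2-3).
    The path-length cutoff is our own assumption to keep path enumeration tractable."""
    total = 0.0
    for path in nx.all_simple_paths(G1, src, dst, cutoff=cutoff):
        total += prod(G1[u][v]["weight"] for u, v in zip(path, path[1:]))
    return total

def affinity(G1: nx.DiGraph, v_doc, v_label, cutoff: int = 4) -> float:
    """Aff(v_doc, v_label): symmetrised affinity, formula (4)."""
    return (affinity_one_way(G1, v_doc, v_label, cutoff)
            + affinity_one_way(G1, v_label, v_doc, cutoff)) / 2

def initial_scores(G1: nx.DiGraph, v_doc, labels) -> list:
    """H(x) = [Aff(x, y_1), ..., Aff(x, y_m)]^T, formulas (5)-(6)."""
    return [affinity(G1, v_doc, y) for y in labels]
```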

3.2 Random Walk on Label Dependency Graph

3.2.1 Obtain Dependency Among Labels

We construct a graph G2 = (V2, E2) to encode the dependencies among labels. The vertices of G2 represent the labels in Y. If labels yi and yj mark the same text x, an edge is added between yi and yj, and its weight wij is defined as the number of samples labeled by both yi and yj:

$$ w_{ij} = \left| {\left\{ {x_{k} |y_{i} \in Y_{k} \wedge y_{j} \in Y_{k} } \right\}} \right|\quad if \, i \ne j $$
(7)

An adjacency matrix is used to store graph G2, yielding an m × m symmetric matrix. This matrix is then made asymmetric using Eq. (8); the resulting matrix is denoted S, where the element sij represents the jump probability from label yi to label yj and mj is the number of non-zero elements in the j-th column.

$$ s_{ij} = \frac{{w_{ij} }}{{m_{j} }} $$
(8)
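A minimal numpy sketch of formulas (7) and (8) is given below; the container names Y_list and sample_labels are hypothetical, standing for the label vocabulary and the per-sample label sets of the training data.

```python
import numpy as np

def build_transition_matrix(Y_list, sample_labels):
    """Formulas (7)-(8): co-occurrence counts w_ij, then divide each column j
    by its number of non-zero entries m_j to obtain the transition matrix S."""
    m = len(Y_list)
    index = {y: i for i, y in enumerate(Y_list)}
    W = np.zeros((m, m))
    for labels in sample_labels:                 # each sample's label set Y_k
        for yi in labels:
            for yj in labels:
                if yi != yj:
                    W[index[yi], index[yj]] += 1
    m_j = np.count_nonzero(W, axis=0)            # non-zero entries per column
    S = np.divide(W, m_j, out=np.zeros_like(W), where=m_j > 0)
    return S
```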

3.2.2 Restart Random Walk

Random walk with restart [19] is defined in Eq. (9). It starts from a node and traverses the graph: at each step the walker either moves to a neighbor with probability proportional to the edge weights, or returns to the starting point with some probability α, until a steady state is reached.

$$ {\varvec{P}}_{i} = a{\varvec{SP}}_{i} + (1 - a){\varvec{H}} $$
(9)

Since the prediction of each label can be propagated to other labels to some extent, the label predictions for a sample are not only determined by the sample itself but can also be strengthened by the other labels. We use a random walk model to predict the multiple labels of a sample. The initial probability between sample x and each label is defined as 1/m.

$$ \left\{ {\begin{array}{*{20}c} {P(Y)_{x}^{(0)} = [\frac{1}{m},\ldots,\frac{1}{m}]_{m}^{\text{T}} } \\ {P(Y)_{x}^{(t + 1)} = a{\varvec{SP}}(Y)_{x}^{(t)} + (1 - a){\varvec{H}}(x)} \\ \end{array} } \right. $$
(10)

\( P(Y)_{x}^{(t)} \) is the probability distribution vector representing the relationship between the sample and each label at time t, S is the probability transition matrix, and H(x) is the aforementioned vector of initial label predictions for sample x. The process continues until P(Y)x converges. Because the label predictions are updated repeatedly, the dependencies among labels are exploited sufficiently.
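The iteration in formula (10) can be sketched as follows; the convergence tolerance and the iteration cap are illustrative choices of ours, not values given in the paper.

```python
import numpy as np

def restart_random_walk(S: np.ndarray, H_x: np.ndarray,
                        alpha: float, tol: float = 1e-8,
                        max_iter: int = 1000) -> np.ndarray:
    """Iterate P <- alpha * S @ P + (1 - alpha) * H(x) until convergence (formula 10).
    The tolerance and iteration cap are illustrative assumptions."""
    m = len(H_x)
    P = np.full(m, 1.0 / m)                      # P(Y)_x^(0) = [1/m, ..., 1/m]^T
    for _ in range(max_iter):
        P_next = alpha * S @ P + (1 - alpha) * H_x
        if np.linalg.norm(P_next - P, ord=1) < tol:
            return P_next
        P = P_next
    return P
```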

4 Experimental Result and Analysis

In this section, we explain the means by which similarity graph and restart random walk model are evaluated, whilst providing a description of the multi-label dataset and other settings used in the experimental study. Finally, the experimental results on the dataset and the statistical analysis are discussed.

4.1 Dataset

The data used in the experiments consists of English movie titles and overviews collected manually; we call it the Movies dataset. The dataset statistics are shown in Table 1, where the label density equals the size of the label set q divided by the cardinality of the label set c, indicating the probability that a label appears.

Table 1. Several statistical values of the dataset

4.2 Evaluation Metrics

Traditional single-label classification performance metrics, such as recall and accuracy, cannot be used directly to evaluate multi-label classification performance. Therefore, we use the following three metrics to measure the performance of our method.

4.2.1 Hamming Loss

Hamming Loss [20] measures the classification error at the level of individual labels: labels that belong to the sample but are missing from the predicted label set, and labels that the sample does not have but appear in it. A smaller value means better performance of a classification model; the best value is 0. It is defined as:

$$ Hamming\text{-}loss(x_{i} ,y_{i} ) = \frac{1}{\left| D \right|}\sum\nolimits_{i = 1}^{\left| D \right|} {\frac{{xor(x_{i} ,y_{i} )}}{\left| L \right|}} $$
(11)

|D| represents the total number of samples and |L| the total number of labels; xi and yi represent the prediction result and the true label set of the i-th sample, respectively.

4.2.2 Jaccard Index

The Jaccard index [21] measures how similar two sets are; it is defined as the size of their intersection divided by the size of their union. A larger value means better performance of a classification model. It is defined as:

$$ Jaccard(A,B) = \frac{{\left| {A \cap B} \right|}}{{\left| {A \cup B} \right|}} $$
(12)

4.2.3 Accuracy-Score

Accuracy-score [22] computes the accuracy of predictions. In multi-label classification, this function returns subset accuracy: the accuracy for a sample is 1 only if its entire predicted label set is identical to the true label set, which is the best case; otherwise it is 0. It is defined as follows:

$$ accuracy(y,\hat{y}) = \frac{1}{\left| D \right|}\sum\nolimits_{i = 1}^{\left| D \right|} {1(\hat{y}_{i} = y_{i} )} $$
(13)

\( \hat{y}_{i} \) is the predicted label set of the i-th sample and yi is the corresponding true label set.
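For reference, these three metrics are commonly computed on binary label indicator matrices, e.g. with scikit-learn; the toy data below is purely illustrative.

```python
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score, accuracy_score

# Toy indicator matrices (rows = samples, columns = labels); the values are illustrative.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

print(hamming_loss(y_true, y_pred))                      # fraction of wrongly predicted labels
print(jaccard_score(y_true, y_pred, average="samples"))  # mean per-sample Jaccard index
print(accuracy_score(y_true, y_pred))                    # subset accuracy (exact-match ratio)
```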

4.3 Experimental Result and Analysis

Three experiments are designed to evaluate the performance of the proposed algorithm on multi-label text classification: (1) analyze the influence of different α on the algorithm, (2) compare and analyze the results when changing the sizes of the training set v and the test set t, (3) compare our algorithm with other algorithms.

Experiment 1. Analyze the influence of different α on our method. We run the experiment with α = 0.0001, 0.00007, 0.00004, and 0.00001, respectively. Table 2 shows that all three metrics are best when α = 0.00007, so the result is optimal at this point. When α is larger or smaller than 0.00007, the performance of the algorithm tends to degrade. Generally speaking, the influence of α on performance is limited (not exceeding ±0.36%).

Table 2. Experimental results for different α values

Experiment 2. Use test sets t and training sets v of different sizes and analyze the experimental results. The training set sizes are 300, 600, 900, 1200, and 1500, and the test set sizes are 100, 300, and 500. As Table 3 shows, when the test set size |t| = 100 and the training set size |v| = 300, the performance outperforms the others; when |t| = 300 and |v| = 1200, the result is also very good. In summary, the performance obtained by our method is optimal when |t| = 500 and |v| = 1500.

Table 3. Experimental results for different training set v and test set t sizes

Next, we select the group of data with the best classification performance for further comparative analysis; specifically, |t| = 500 and the training set v takes different sizes. It can be observed from Fig. 2 that the Hamming loss gradually decreases and the Accuracy-score continues to increase as the size of the training set grows, which means that the classification performance of the algorithm improves as the ratio of training data to test data increases. When |v| = 1500, both classification scores reach their optimum.

Fig. 2. When |t| = 500, changes in Hamming loss and Accuracy-score with different |v|

Experiment 3. To demonstrate how our method improves multi-label text classification performance, we compare it with existing similar methods, namely BR, LP, CC, and MLkNN. Note that the parameter of the MLkNN algorithm is set to k = 20, while the parameters of the other algorithms use their default values. BR, LP, and CC use Naive Bayes as the base classifier.

Figure 3 shows that the SGaRW algorithm achieves a larger Accuracy-score than MLkNN, which indicates that the labels predicted for a text by our method are more accurate. The Jaccard index of our method is greater than that of MLkNN, while its Hamming loss is smaller. In other words, the SGaRW algorithm makes labels that do not belong to the text appear in the predicted label set as rarely as possible, which reduces the error rate considerably. The comparison with the BR, LP, CC, and MLkNN algorithms shows that the SGaRW algorithm has a clear advantage over the other algorithms.

Fig. 3. Comparison of the different algorithms

5 Conclusion

We introduce a novel method, the SGaRW algorithm, which combines a similarity graph with a random walk model and can solve multi-label text classification problems efficiently. Prior information from WordNet is used to build the similarity graph, on which the initial match values between labels and texts are computed. Then a label dependency graph is constructed, and random walk with restart is run on it. Finally, the labels of the text are determined. The core of future work is to expand the dataset, introduce short-text semantic understanding to improve the performance of short-text multi-label classification, and further optimize the effectiveness of the algorithm.