Abstract
A multi-label classification method for short text based on a similarity graph and a restart random walk model is proposed. First, a similarity graph is created with samples and labels as nodes, and the edge weights are calculated from external knowledge, yielding an initial matching degree between each sample and the label set. Next, we build a label dependency graph with labels as vertices and, using the previous matching degree as the initial prediction value, iteratively compute the relationship between the sample and each node until the probability distribution becomes stable. The resulting relationship vector is the label probability distribution predicted for the sample. Experimental results show that our method provides an efficient and reliable multi-label short-text classification algorithm.
1 Introduction
Traditional single-label classification assumes that each sample has a unique category label drawn from a mutually exclusive label set L (|L| > 1). In practical applications, however, a sample often belongs to multiple categories at the same time; we call such data multi-label data [1]. For example, a news report can be classified as both “entertainment” and “technology”, and a movie can be both an “action movie” and a “thriller”. Multi-label classification differs significantly from traditional single-label classification: the correlation and co-occurrence between categories mean that existing single-label classification methods cannot be applied directly to the multi-label problem. As a result, multi-label classification has gradually become a research hotspot and a recognized challenge, especially in fields such as text classification, gene function classification, and image semantic annotation.
Researchers have sought optimal classification algorithms to improve accuracy on multi-label data. There are two common strategies for multi-label classification [2]. The first is to convert the multi-label dataset into single-label datasets and then apply traditional classification algorithms (the problem transformation, or PT, approach). Binary Relevance (BR) [3] is a typical PT method: it treats the prediction of each label as an independent binary classification problem, designs an independent classifier for each label, and trains each classifier on all training data. However, it ignores the interrelationships between labels and often fails to achieve satisfying classification performance. Guo [4] proposed an improved binary relevance algorithm that uses two layers to decompose the multi-label classification problem into L independent binary classification problems. Liu [5] proposed classifier chain algorithms based on dynamic programming and a greedy strategy to search for globally optimal labels, compensating for the sensitivity of the Classifier Chain (CC) algorithm to label ordering [6]. Label Powerset (LP) [7] encodes every label combination as a binary number and treats it as a new label. The second strategy is to modify an existing single-label learning algorithm to solve the multi-label learning problem. For example, the MLkNN algorithm computes the prior probability of each label from the label sets of the training data, along with the probabilities of a sample having or not having each label, and then predicts whether a sample carries that label [8]. Tsoumakas [9] proposed the Random k-Labelsets method, which decomposes the initial label set into several small random subsets and uses the Label Powerset algorithm to train a classifier on each. Other researchers have also explored various methods for multi-label classification [10,11,12,13].
In the prediction and training process, existing multi-label classification algorithms either ignore the interdependence between category labels or ignore the important influence of the initial features on the predicted values; some even append the labels to the original features as additional dimensions, which makes an already high-dimensional feature set even more complicated. Moreover, even when the dependency between category labels is fully exploited, an algorithm that ignores the initial prediction values between the label set and the training set yields inaccurate multi-label classification.
We propose a multi-label short-text classification algorithm that combines a similarity graph with the restart random walk model (abbreviated SGaRW). On the one hand, the similarity graph is used to calculate the original relationship between the text and the labels; on the other hand, the restart random walk model is used to exploit the potential semantic relationships among the labels. Finally, the two are fused to make the multi-label classification result more accurate.
2 Preliminary and Background
We review the existing basic concepts and define the problem of multi-label classification in this section.
2.1 Multi-label Classification
Fundamentally, multi-label classification can be viewed as a label ranking problem [14, 15]: a relevance score is computed between the test sample and each category label, and the labels the sample belongs to are determined from these scores. Assume that \( X = \{x_1, x_2, \ldots, x_n\} \) denotes the sample set, \( Y = \{y_1, y_2, \ldots, y_m\} \) the label set, and \( D = \{(x_i, Y_i) \mid 1 \le i \le n\} \) the dataset, where \( Y_i \subseteq Y \) is the label set of sample \( x_i \). The label prediction for a sample x can then be expressed as the vector \( H(x) = (h_1(x), h_2(x), \ldots, h_m(x)) \).
In this vector, \( h_i(x) \in [0,1] \) describes the relevancy between sample x and label \( y_i \). Multi-label classification learns a classifier h: X → 2Y from the training data; given a new sample x, the classifier predicts the label set that x subsumes. The task is therefore to find an optimal classification algorithm that constructs a high-precision score vector H(x) for accurate classification.
2.2 Similarity Graph
The similarity graph [16, 17], built from WordNet, is a directed weighted graph G = (V, E) used to calculate semantic similarity among nodes, where V = {itemset, senseset}: itemset is a collection of nodes (items) that represent words, and senseset is a set of nodes (senses) that represent word senses. According to the correspondence between them, a directed edge <vi, vj> is added between two sense nodes, between an item and a sense, or between two items. The weight on the edge, denoted wij, represents the probability of thinking of node vj upon seeing node vi; the weight wij thus reflects a conditional probability, so the similarity graph can also be called a probability graph.
Consider the similarity graph shown in Fig. 1, with itemset = {adventure, thrilling, action} and senseset = {“a wild and exciting undertaking lawful”, “take a risk in the hope of …”, “something that people do or cause to happen”}. In WordNet, the word “adventure” has two senses; the usage frequency of the first is 0.92 and that of the second is 0.08, so the weight from the item node “adventure” to the first sense node “a wild and exciting undertaking lawful” is 0.92, meaning that the probability that someone thinks of the first sense after seeing the word “adventure” is 0.92. Conversely, the weight from either sense node back to the item node “adventure” is 1, meaning that someone who sees “a wild and exciting undertaking lawful” or “take a risk in the hope of …” is certain to think of the word “adventure”.
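The Fig. 1 example can be sketched as a weighted adjacency map. This is a toy illustration of only the “adventure” portion of the graph, not the full WordNet-derived structure:

```python
# Similarity graph fragment from Fig. 1 as a weighted adjacency map.
# An edge weight w_ij is the conditional probability of thinking of
# node v_j upon seeing node v_i, so each node's outgoing weights sum to 1.
similarity_graph = {
    "adventure": {
        "a wild and exciting undertaking lawful": 0.92,  # first sense
        "take a risk in the hope of ...": 0.08,          # second sense
    },
    # Each sense points back to its word with probability 1.
    "a wild and exciting undertaking lawful": {"adventure": 1.0},
    "take a risk in the hope of ...": {"adventure": 1.0},
}

def outgoing_mass(graph, node):
    """Total outgoing probability mass of a node (should be 1.0)."""
    return sum(graph[node].values())
```

Because every weight is a conditional probability, checking that `outgoing_mass` returns 1 for each node is a quick sanity test when building such a graph.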
3 Implement Multi-label Classification of Short Text
In this paper, implementing multi-label classification of text is divided into two stages. In the first stage, we create a similarity graph from the text content and calculate the original relationship between the text and the label set, which serves as the initial predicted value H(x). In the second stage, a label dependency graph is constructed and the restart random walk algorithm is run on it to mine the potential semantic relationships between labels. When the algorithm converges, we obtain a vector of the probabilities that the text belongs to each label, and thus the labels that belong to the text.
3.1 Calculate Initial Association Between Sample and Labels
We treat short texts as sense nodes, map labels to item nodes, and create the similarity graph G1 = (V1, E1). The affinity score between a text and a label along a directed path is then defined as the product of the weights of all adjacent edges between the text node and the label node on that path [18], as shown in formula (2):
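The equation itself is not shown here; from the surrounding description, a plausible reconstruction of formula (2) is the product of edge weights over the adjacent node pairs of the path pt:

```latex
\mathrm{affinity}_{pt}(v_{doc} \mid v_{label})
  \;=\; \prod_{\langle v_i,\, v_j \rangle \in pt} p_{pt}(v_i \mid v_j)
```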
Here, affinitypt(vdoc|vlabel) is the affinity score from vdoc to vlabel along the node sequence pt = <vdoc, … vi, vj, … vlabel>, a directed path from vdoc to vlabel, and ppt(vi|vj) is the weight of the corresponding edge in G1, from which the affinity score between the two nodes is computed. By the Markov property, the conditional probability decreases as the path length increases: the longer the path, the weaker the evidence of an intimate relationship between the two nodes.
There may be more than one directed path from vdoc to vlabel in the similarity graph G1, so the affinity score of the text-label pair over the entire graph G1 is expressed as the sum of the affinity scores over all directed paths between the two nodes. Aff′(vdoc, vlabel) denotes this score in formula (3).
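The equation is again not shown; based on the description, formula (3) can presumably be reconstructed as a sum over the set P(vdoc, vlabel) of all directed paths between the two nodes:

```latex
\mathrm{Aff}'(v_{doc}, v_{label})
  \;=\; \sum_{pt \,\in\, P(v_{doc},\, v_{label})}
        \mathrm{affinity}_{pt}(v_{doc} \mid v_{label})
```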
Because the affinity scores are asymmetric, the final affinity score between two nodes is obtained by formula (4).
We treat the affinity score between vdoc and vlabel as the correlation score \( h_i(x) \) between sample x and label \( y_i \), that is:
Taking all labels into account, we obtain the correlation scores between sample x and every label in the label set Y, as shown in formula (6).
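A minimal sketch of this text-to-label scoring, assuming a dictionary-based graph representation. Since formula (4) is not shown, the symmetrization below simply averages the two directions, which is an assumption:

```python
def path_affinities(graph, src, dst, max_len=6):
    """Sum of edge-weight products over all simple directed paths src -> dst.
    max_len caps the path length, since long paths contribute little."""
    total = 0.0

    def dfs(node, prob, visited):
        nonlocal total
        if node == dst:
            total += prob          # one complete path: add its weight product
            return
        if len(visited) >= max_len:
            return
        for nxt, w in graph.get(node, {}).items():
            if nxt not in visited:
                dfs(nxt, prob * w, visited | {nxt})

    dfs(src, 1.0, {src})
    return total

def affinity(graph, a, b):
    """Symmetrized affinity score. Averaging the two directions is an
    assumption; the exact symmetrization of formula (4) is not shown."""
    return 0.5 * (path_affinities(graph, a, b) + path_affinities(graph, b, a))
```

For instance, on a toy graph where a document node reaches a label node through two senses with weights 0.9·1.0 and 0.1·1.0, the forward score is 1.0; if no reverse path exists, the averaged affinity is 0.5.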
3.2 Random Walk on Label Dependency Graph
3.2.1 Obtain Dependency Among Labels
We construct a graph G2 = (V2, E2) to encode the dependencies among labels. Vertices of G2 represent the labels in Y. If labels yi and yj both mark some text x, an edge is added between yi and yj, and its weight wij is defined as the number of samples labeled by both yi and yj:
An adjacency matrix is used to store G2, giving an m × m symmetric matrix. Applying Eq. (8) makes this matrix asymmetric; the result is denoted S, whose element sij represents the jump probability from label yi to label yj, where mj is the number of non-zero elements in the j-th column.
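The construction of W, together with one plausible reading of Eq. (8), can be sketched as follows. The exact equation is not shown, so the column normalization used here is an assumption:

```python
def cooccurrence_matrix(label_sets, m):
    """W[i][j] = number of samples annotated with both label i and label j.
    label_sets: one set of label indices per sample; m: number of labels."""
    W = [[0] * m for _ in range(m)]
    for labels in label_sets:
        for i in labels:
            for j in labels:
                if i != j:
                    W[i][j] += 1
    return W

def column_normalize(W):
    """One plausible reading of Eq. (8): scale each column of the symmetric
    co-occurrence matrix to sum to 1, yielding an asymmetric
    jump-probability matrix S."""
    m = len(W)
    S = [[0.0] * m for _ in range(m)]
    for j in range(m):
        col_sum = sum(W[i][j] for i in range(m))
        if col_sum:
            for i in range(m):
                S[i][j] = W[i][j] / col_sum
    return S
```

Column normalization breaks the symmetry of W exactly as described: two labels that co-occur equally often can still have different jump probabilities if their total co-occurrence counts differ.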
3.2.2 Restart Random Walk
Random walk with restart [19], defined in Eq. (9), starts from a node and traverses the graph: the walker iteratively moves to a neighbor with probability proportional to the edge weights, or returns to the starting point with probability α, until a steady state is reached.
Since the prediction of each label can be propagated to other labels to some extent, the label predictions for a sample are determined not only by the sample itself but are also reinforced by the other labels. We use the random walk model to predict the multiple labels of a sample, and the initial probability between sample x and each label is set to 1/m.
\( {\text{P(Y)}}_{\text{x}}^{{ ( {\text{t)}}}} \) is the probability distribution vector representing the relationship between the sample and each label at time t, S is the probability transition matrix, and H(x) is the aforementioned initial prediction vector for the labels of sample x. The process continues until P(Y)x converges. Because the label predictions are updated repeatedly, the dependencies among labels are exploited sufficiently.
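The iteration can be sketched as follows, assuming the standard RWR update P ← (1 − α)·S·P + α·H(x); the exact form of Eq. (9) is not shown, so this update is an assumption:

```python
def restart_random_walk(S, H, alpha, tol=1e-10, max_iter=1000):
    """Random walk with restart on the label dependency graph.
    S: m x m jump-probability matrix (list of lists),
    H: initial prediction vector H(x),
    alpha: restart probability.
    Iterates P <- (1 - alpha) * S @ P + alpha * H until convergence;
    this standard RWR update is assumed, as Eq. (9) is not reproduced."""
    m = len(H)
    P = [1.0 / m] * m  # initial probability 1/m per label (Sect. 3.2.2)
    for _ in range(max_iter):
        P_next = [
            (1 - alpha) * sum(S[i][j] * P[j] for j in range(m)) + alpha * H[i]
            for i in range(m)
        ]
        if max(abs(a - b) for a, b in zip(P_next, P)) < tol:
            return P_next
        P = P_next
    return P
```

Because the update is a contraction for 0 < α ≤ 1 and a stochastic S, the vector converges to a unique fixed point regardless of the 1/m initialization.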
4 Experimental Result and Analysis
In this section, we explain the means by which similarity graph and restart random walk model are evaluated, whilst providing a description of the multi-label dataset and other settings used in the experimental study. Finally, the experimental results on the dataset and the statistical analysis are discussed.
4.1 Dataset
The data used in the experiments are English movie titles and overviews collected manually, referred to as the Movies dataset. Its statistics are shown in Table 1, where the label density equals the size of the label set q divided by the cardinality of the label set c, indicating the probability that a label appears.
4.2 Evaluation Metrics
Traditional single-label classification evaluation metrics, such as recall and accuracy, cannot be used directly to evaluate multi-label classification performance. Therefore, we use the following three metrics to measure the performance of our method.
4.2.1 Hamming Loss
Hamming Loss [20] measures the classification error at the level of individual labels: labels that belong to the sample are missing from the predicted label set, or labels that the sample does not have appear in it. A smaller value means better performance of a classification model, with 0 being the best. It is defined as:
|D| is the total number of samples and |L| the total number of labels; xi and yi represent the prediction result and the true labels of the i-th sample, respectively.
4.2.2 Jaccard Index
The Jaccard Index [21] measures how similar two sets are: it is the size of their intersection divided by the size of their union. A bigger value means better performance of a classification model. It is defined as:
4.2.3 Accuracy-Score
Accuracy-score [22] computes the accuracy of predictions. In multi-label classification, this function returns the subset accuracy: the score for a sample is 1 only if its entire predicted label set matches the true label set exactly, the best case, and 0 otherwise. It is defined as follows:
\( \hat{y}_{i} \) is the predicted value for the i-th sample and yi is the corresponding true value.
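The three metrics above can be implemented directly from their definitions; a minimal pure-Python sketch, representing each sample's labels as a set of label indices:

```python
def hamming_loss(true_sets, pred_sets, num_labels):
    """Fraction of label slots predicted incorrectly, i.e. mean size of the
    symmetric difference between true and predicted label sets."""
    errors = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return errors / (len(true_sets) * num_labels)

def jaccard_index(true_sets, pred_sets):
    """Mean |intersection| / |union| over samples (1.0 for two empty sets)."""
    scores = [len(t & p) / len(t | p) if (t | p) else 1.0
              for t, p in zip(true_sets, pred_sets)]
    return sum(scores) / len(scores)

def subset_accuracy(true_sets, pred_sets):
    """Fraction of samples whose predicted label set matches exactly."""
    return sum(t == p for t, p in zip(true_sets, pred_sets)) / len(true_sets)
```

For example, with true label sets {0, 1} and {2} and predictions {0, 1} and {1} over three labels, the Hamming loss is 2/6, the Jaccard index is 0.5, and the subset accuracy is 0.5.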
4.3 Experimental Result and Analysis
Three experiments are designed to evaluate the performance of our algorithm on multi-label text classification: (1) analyze the influence of different α values on the algorithm, (2) compare and analyze results while changing the sizes of the training set v and the test set t, (3) compare our algorithm with other algorithms.
Experiment 1. We analyze the influence of different α values on our method, running the experiment with α = 0.0001, 0.00007, 0.00004, and 0.00001. Table 2 shows that all three metrics reach their best values at α = 0.00007, where the result is optimal. When α is larger or smaller than 0.00007, performance tends to degrade. Generally speaking, the influence of α on performance is limited (not exceeding ±0.36%).
Experiment 2. We use training sets v and test sets t of different sizes and analyze the results: training sets of size 300, 600, 900, 1200, and 1500, and test sets of size 100, 300, and 500. As Table 3 shows, when the test set size |t| = 100, the best performance is obtained with training set size |v| = 300; when |t| = 300, the best result is obtained with |v| = 1200. Overall, the performance of our method is optimal when |t| = 500 and |v| = 1500.
Next, we select the group with the best classification performance for further comparative analysis; specifically, |t| = 500 while the training set v takes different sizes. As Fig. 2 shows, the Hamming loss gradually decreases and the Accuracy-score continues to increase as the training set grows, meaning that classification performance improves as the ratio of training data to test data increases. When |v| = 1500, both scores reach their optimum.
Experiment 3. To demonstrate how our method improves multi-label text classification performance, we compare it with existing similar methods: BR, LP, CC, and MLkNN. Note that the parameter of the MLkNN algorithm is set to k = 20, and the parameters of the other algorithms use their default values. BR, LP, and CC use the Naive Bayes classifier as their base classifier.
Figure 3 shows that the SGaRW algorithm achieves a larger Accuracy-score than MLkNN, indicating that the labels predicted for a text by our method are more accurate. The Jaccard index of our method is also greater than that of MLkNN, while its Hamming loss is lower. In other words, the SGaRW algorithm keeps labels that do not belong to the text out of the predicted label set as much as possible, substantially reducing the error rate. The comparison with the BR, LP, CC, and MLkNN algorithms shows that SGaRW has a clear advantage over the other algorithms.
5 Conclusion
We introduce SGaRW, a novel algorithm combining a similarity graph with the restart random walk model that solves multi-label text classification problems efficiently. Prior information from WordNet is used to build the similarity graph, on which initial match values between labels and texts are computed; a label dependency graph is then constructed and random walk with restart is run on it, after which the labels of the text are determined. Future work will focus on expanding the dataset, introducing short-text semantic understanding to improve the performance of short-text multi-label classification, and further optimizing the effectiveness of the algorithm.
References
Tsoumakas, G., Katakis, I., Vlahavas, I.: Mining multi-label data. In: Maimon, O., Rokach, L. (eds.) Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer, Boston (2009). https://doi.org/10.1007/978-0-387-09823-4_34
Zhang, M.L., Zhou, Z.H.: Multi-label neural networks with applications to functional genomics and text categorization. IEEE Trans. Knowl. Data Eng. 18(10), 1338–1351 (2006)
Trohidis, K., Tsoumakas, G., Kalliris, G.: Multi-label classification of music by emotion. EURASIP J. Audio Speech Music. Process. 2011(1), 4 (2011)
Guo, T., Li, G.Y.: An improved binary relevance algorithm for multi-label classification. Appl. Mech. Mater. 536–537, 394–398 (2014)
Liu, W., Tsang, I.W.: On the optimality of classifier chain for multi-label classification. In: International Conference on Neural Information Processing Systems. MIT Press (2015)
Read, J., Pfahringer, B., Holmes, G., et al.: Classifier chains for multi-label classification. Mach. Learn. 85(3), 333 (2011). https://doi.org/10.1007/s10994-011-5256-5
Zhang, M.L., Zhou, Z.H.: ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognit. 40(7), 2038–2048 (2007)
Tsoumakas, G., Katakis, I., Vlahavas, I.: Random k-labelsets for multi-label classification. IEEE Trans. Knowl. Data Eng. 23(7), 1079–1089 (2011)
Read, J.: A pruned problem transformation method for multi-label classification. In: New Zealand Computer Science Research Student Conference (NZCSRS 2008), vol. 143150, p. 41 (2008)
Jizhao, Q., Hua, J.I., Huaxiang, Z.: Modified algorithm with label-specific features for multi-label learning. Comput. Eng. Appl. 49(22), 163–166 (2013)
Huang, J., Li, G., Wang, S., Zhang, W., Huang, Q.: Group sensitive classifier chains for multi-label classification. In: IEEE International Conference on Multimedia and Expo (ICME), Turin, pp. 1–6 (2015)
Huang, J., Li, G., Huang, Q., et al.: Learning label-specific features and class-dependent labels for multi-label classification. IEEE Trans. Knowl. Data Eng. 28(12), 3309–3323 (2016)
Qiao, L., Zhang, L., Sun, Z., et al.: Selecting label-dependent features for multi-label classification. Neurocomputing 259, 112–118 (2017)
Li, X., Ouyang, J., Zhou, X.: Supervised Topic Models for Multi-Label Classification. Elsevier Science Publishers B.V., Amsterdam (2015)
Soleimani, H., Miller, D.J.: Semi-supervised multi-label topic models for document classification and sentence labeling. In: ACM International on Conference on Information & Knowledge Management. ACM (2016)
Stanchev, L.: Creating a similarity graph from WordNet. In: 4th International Conference on Web Intelligence, Mining and Semantics (WIMS14), pp. 1–11. Association for Computing Machinery, New York (2014). Article no. 36
Stanchev, L.: Semantic document clustering using a similarity graph. In: IEEE Tenth International Conference on Semantic Computing, pp. 1–8. IEEE (2016)
Stanchev, L.: Creating a probabilistic graph for WordNet using Markov logic network. In: 6th International Conference on Web Intelligence, Mining and Semantics, pp. 1–12 (2016)
Tong, H., Faloutsos, C., Pan, J.Y.: Fast random walk with restart and its applications. In: 6th International Conference on Data Mining (ICDM 2006), pp. 613–622. IEEE (2006)
Díez, J., Luaces, O., del Coz, J.J., et al.: Optimizing different loss functions in multi-label classifications. Prog. Artif. Intell. 3(2), 107–118 (2015). https://doi.org/10.1007/s13748-014-0060-7
Hamers, L., Hemeryck, Y., Herweyers, G., et al.: Similarity measures in scientometric research: the Jaccard index versus Salton’s cosine formula. Inf. Process. Manag. 25(3), 315–318 (1989)
Hubley, A.M.: Using the Rey-Osterrieth and modified Taylor complex figures with older adults: a preliminary examination of accuracy score comparability. Arch. Clin. Neuropsychol. Off. J. Natl. Acad. Neuropsychol. 25(3), 197 (2010)
Acknowledgments
This work was supported in part by National Natural Science Foundation of China (No. 61762078, 61862058, 61967013), Youth Teacher Scientific Capability Promoting Project of NWNU (No. NWNU-LKQN-16-20).
© 2020 IFIP International Federation for Information Processing
Li, X., Yang, F., Ma, Y., Ma, H. (2020). Multi-label Classification of Short Text Based on Similarity Graph and Restart Random Walk Model. In: Shi, Z., Vadera, S., Chang, E. (eds) Intelligent Information Processing X. IIP 2020. IFIP Advances in Information and Communication Technology, vol 581. Springer, Cham. https://doi.org/10.1007/978-3-030-46931-3_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46930-6
Online ISBN: 978-3-030-46931-3