
1 Introduction

Traditional single-label classification assumes that each sample has a unique category label drawn from a mutually exclusive label set L (|L| > 1). In practical applications, however, a sample often belongs to multiple categories at the same time; we call such data multi-label data [1]. For example, a news report could be classified as both “entertainment” and “technologies”, and a movie can be both an “action movie” and a “thriller”. Multi-label classification differs significantly from traditional single-label classification: the correlation and co-occurrence between categories mean that existing single-label classification methods cannot be applied directly to the multi-label classification problem. As a result, multi-label classification has gradually become a research hotspot and a challenging problem, especially in fields such as text classification, gene function classification, and image semantic annotation.

Researchers have been seeking optimal classification algorithms to improve the classification accuracy on multi-label data. There are two common strategies for multi-label classification [2]. The first is to convert the multi-label dataset into single-label datasets and then apply traditional classification algorithms to them (problem transformation, abbreviated as PT). Binary Relevance (BR) [3] is a typical PT method: it treats the prediction of each label as an independent binary classification problem, designs an independent classifier for each label, and trains each classifier on all training data. However, it ignores the interrelationships between labels and often fails to achieve satisfying classification performance. Guo [4] proposed an improved binary relevance algorithm that uses two layers to decompose the multi-label classification problem into L independent binary classification problems. Liu [5] proposed a classifier chain algorithm based on dynamic programming and a greedy classifier chain algorithm to search for the globally optimal labels, compensating for the sensitivity of the Classifier Chain (CC) algorithm to label selection [6]. Label Powerset (LP) [7] encodes every label combination as a binary number and treats it as a new label. The second strategy is to modify an existing single-label learning algorithm to solve the multi-label learning problem. For example, the MLkNN algorithm estimates the prior probability of each label from the label sets of the training data, together with the conditional probabilities of a sample having or not having a label, and then predicts whether a new sample carries that label [8]. Tsoumakas [9] proposed the Random k-Labelsets method, which decomposes the initial label set into several small random subsets and uses the Label Powerset algorithm to train a classifier on each subset. Other researchers have also explored a variety of methods for multi-label classification [10,11,12,13]. During training and prediction, existing multi-label classification algorithms either ignore the interdependence between category labels, or ignore the important influence of the initial features on the predicted values, or simply append the labels to the original features as additional attributes, which makes an already high-dimensional feature set even more complicated. Even when the dependency relationships between category labels are fully exploited, these algorithms neglect the initial prediction values between the label set and the training samples, which makes the multi-label classification inaccurate.

We propose a multi-label short-text classification algorithm that combines a similarity graph with a random walk with restart model (abbreviated as SGaRW). On the one hand, the similarity graph is used to compute the original relationship between a text and the labels; on the other hand, the random walk with restart model is used to compute the potential semantic relationships among the labels. Finally, the two are fused to make the multi-label classification result more accurate.

2 Preliminary and Background

We review the existing basic concepts and define the problem of multi-label classification in this section.

2.1 Multi-label Classification

Fundamentally, multi-label classification can be viewed as a label ranking problem [14, 15]: a relevance score is computed between the test sample and each category label, and the labels to which the sample belongs are then determined from these scores. Assume that \( X = \{x_1, x_2, \ldots, x_n\} \) denotes the sample set, \( Y = \{y_1, y_2, \ldots, y_m\} \) the label set, and \( D = \{(x_i, Y_i) \mid 1 \le i \le n\} \) the dataset, where \( Y_i \subseteq Y \) is the label set of sample xi. The label prediction for a sample x can then be expressed as the following vector H(x).

$$ H(x) = (h_{1} (x), \cdots ,h_{i} (x), \cdots ,h_{m} (x)) $$
(1)

In this vector, \( h_i(x) \in [0,1] \) describes the relevancy between sample x and label yi. The goal of multi-label classification is to learn a classifier \( h: X \to 2^Y \) from training data; given a new sample x, the classifier predicts the label set that x subsumes. Multi-label classification therefore amounts to seeking an optimal classification algorithm that constructs a high-precision score vector H(x) so as to classify accurately.
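As a minimal sketch of this formulation in Python: the 0.5 threshold used to turn the score vector into a label set is our own illustrative assumption (the paper only requires scores in [0, 1]), and the label names are hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical label set Y; any scoring function h_i(x) in [0, 1] can be plugged in.
labels: List[str] = ["action", "thriller", "comedy"]

def predict_label_set(score: Callable[[str, str], float],
                      x: str,
                      threshold: float = 0.5) -> Dict[str, float]:
    """Build the score vector H(x) and keep labels whose score reaches the threshold.
    The threshold is an illustrative assumption, not part of the paper."""
    H = {y: score(x, y) for y in labels}          # H(x) = (h_1(x), ..., h_m(x))
    return {y: s for y, s in H.items() if s >= threshold}
```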

2.2 Similarity Graph

A similarity graph [16, 17], built on WordNet, is a directed weighted graph G = (V, E) used to compute the semantic similarity among its nodes, where V = {itemset, senseset}: itemset is the collection of nodes (items) that represent words, and senseset is the collection of nodes (senses) that represent word meanings. According to the correspondence between them, a directed edge <vi, vj> is added between two sense nodes, between an item and a sense, or between two items. The weight on an edge is denoted wij and represents the probability of thinking of node vj upon seeing the current node vi; the weight wij therefore reflects a conditional probability, and the similarity graph can be regarded as a probability graph.

Take the similarity graph shown in Fig. 1, where itemset = {adventure, thrilling, action} and senseset = {“a wild and exciting undertaking lawful”, “take a risk in the hope of …”, “something that people do or cause to happen”}. In WordNet, the word “adventure” has two senses: the usage frequency of the first sense is 0.92 and that of the second is 0.08, so the weight from the item node “adventure” to the first sense node “a wild and exciting undertaking lawful” is 0.92, meaning that the probability of someone being interested in the first sense after seeing the word “adventure” is 0.92. Conversely, the weight from the sense node “a wild and exciting undertaking lawful” to the item node “adventure” is 1, meaning that someone is certain to think of the word “adventure” when seeing either “a wild and exciting undertaking lawful” or “take a risk in the hope of …”.

Fig. 1. Similarity graph
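As an illustration only, the fragment of the Fig. 1 graph discussed above can be encoded with networkx as follows; edges and weights not mentioned in the text are omitted.

```python
import networkx as nx

# A minimal sketch of part of the Fig. 1 similarity graph.
# Only the edges discussed in the running example are included.
G1 = nx.DiGraph()
G1.add_edge("adventure", "a wild and exciting undertaking lawful", weight=0.92)
G1.add_edge("adventure", "take a risk in the hope of ...", weight=0.08)
G1.add_edge("a wild and exciting undertaking lawful", "adventure", weight=1.0)
G1.add_edge("take a risk in the hope of ...", "adventure", weight=1.0)
```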

3 Implement Multi-label Classification of Short Text

In this paper, multi-label classification of text is implemented in two stages. In the first stage, we create a similarity graph based on the text content and calculate the original relationship between the text and the label set, which gives the initial predicted value H(x). In the second stage, a label dependency graph is constructed and the random walk with restart algorithm is run on this graph to mine the potential semantic relationships among the labels. When the algorithm converges, we obtain a vector consisting of the probability that the text belongs to each label, from which the labels of the text are determined.

3.1 Calculate Initial Association Between Sample and Labels

We consider short texts as sense nodes, map labels to item nodes, and create the similarity graph G1 = (V1, E1). The affinity score between a text and a label along a directed path is then defined as the product of the weights of all adjacent edges between the text node and the label node on that path [18], as shown in formula (2):

$$ affinity_{pt} (v_{doc} |v_{label} ) = \prod\limits_{\begin{subarray}{l} v_{i} ,v_{j} \in pt \\ (v_{i} ,v_{j} ) \in E1 \end{subarray} } {P_{pt} (v_{i} |v_{j} )} $$
(2)

Here, affinitypt(vdoc|vlabel) is the affinity score from vdoc to vlabel, the node sequence pt = <vdoc, … vi, vj, … vlabel> is a directed path from vdoc to vlabel, and ppt(vi|vj) is the weight of the edge between vi and vj in G1, which supports calculating affinity scores between two nodes. According to the Markov model, the value of the conditional probability decreases as the path length increases: the longer the path, the weaker the evidence of an intimate relationship between the two nodes.

There is usually more than one directed path from vdoc to vlabel in the similarity graph G1, so the text-to-label affinity score over the entire graph G1 is expressed as the sum of the affinity scores over all directed paths between the two nodes. Let Aff′(vdoc, vlabel) denote this affinity score, as in formula (3).

$$ Aff^{\prime}(v_{doc} ,v_{label} ) = \sum {affinity_{pt} (v_{doc} |v_{label} )} $$
(3)

Due to the asymmetric nature of the affinity scores, the final affinity scores between two nodes can be obtained by formula (4).

$$ Aff\left( {v_{doc} ,v_{label} } \right) = \frac{{Aff^{\prime}(v_{doc} ,v_{label} ) + Aff^{\prime}(v_{label} ,v_{doc} )}}{2} $$
(4)

We treat the affinity score between vdoc and vlabel as the correlation score hi(x) between sample x and label yi, that is:

$$ h_{i} (x) = Aff\left( {x,y_{i} } \right) $$
(5)

Taking all labels into account, we obtain the correlation scores between sample x and all labels in the label set Y, as shown in formula (6).

$$ \varvec{H}(x) = [Aff\left( {x,y_{1} } \right), \ldots ,Aff\left( {x,y_{i} } \right), \ldots ,Aff\left( {x,y_{m} } \right)]^{\text{T}} $$
(6)
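As a rough illustration of formulas (2)–(6), the sketch below enumerates directed paths in a networkx version of G1 and multiplies the edge weights along each path; the path-length cutoff and the function names are our own assumptions, since the paper does not specify how paths are enumerated.

```python
import networkx as nx
from math import prod

def affinity_one_way(G1: nx.DiGraph, src, dst, cutoff: int = 4) -> float:
    """Aff'(src, dst): sum over directed paths of the product of edge weights (formulas 2-3).
    The path-length cutoff is our own assumption to keep path enumeration tractable."""
    total = 0.0
    for path in nx.all_simple_paths(G1, src, dst, cutoff=cutoff):
        total += prod(G1[u][v]["weight"] for u, v in zip(path, path[1:]))
    return total

def affinity(G1: nx.DiGraph, v_doc, v_label, cutoff: int = 4) -> float:
    """Aff(v_doc, v_label): symmetrised affinity, formula (4)."""
    return (affinity_one_way(G1, v_doc, v_label, cutoff)
            + affinity_one_way(G1, v_label, v_doc, cutoff)) / 2

def initial_scores(G1: nx.DiGraph, v_doc, labels) -> list:
    """H(x) = [Aff(x, y_1), ..., Aff(x, y_m)]^T, formulas (5)-(6)."""
    return [affinity(G1, v_doc, y) for y in labels]
```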

3.2 Random Walk on Label Dependency Graph

3.2.1 Obtain Dependency Among Labels

We construct a graph G2 = (V2, E2) to encode the dependencies among labels. The vertices of G2 represent the labels in Y. If labels yi and yj mark the same text x, an edge is added between yi and yj, and its weight wij is defined as the number of samples labeled by both yi and yj:

$$ w_{ij} = \left| {\left\{ {x_{k} |y_{i} \in Y_{k} \wedge y_{j} \in Y_{k} } \right\}} \right|\quad if \, i \ne j $$
(7)

An adjacency matrix is used to store graph G2, yielding an m × m symmetric matrix. This matrix is then made asymmetric using Eq. (8); the resulting matrix is denoted S, where the element sij represents the jump probability from label yi to label yj and mj is the number of non-zero elements in the j-th column.

$$ s_{ij} = \frac{{w_{ij} }}{{m_{j} }} $$
(8)
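A minimal numpy sketch of formulas (7) and (8) is given below; the container names Y_list and sample_labels are hypothetical, standing for the label vocabulary and the per-sample label sets of the training data.

```python
import numpy as np

def build_transition_matrix(Y_list, sample_labels):
    """Formulas (7)-(8): co-occurrence counts w_ij, then divide each column j
    by its number of non-zero entries m_j to obtain the transition matrix S."""
    m = len(Y_list)
    index = {y: i for i, y in enumerate(Y_list)}
    W = np.zeros((m, m))
    for labels in sample_labels:                 # each sample's label set Y_k
        for yi in labels:
            for yj in labels:
                if yi != yj:
                    W[index[yi], index[yj]] += 1
    m_j = np.count_nonzero(W, axis=0)            # non-zero entries per column
    S = np.divide(W, m_j, out=np.zeros_like(W), where=m_j > 0)
    return S
```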

3.2.2 Restart Random Walk

Random walk with restart [19] is defined in Eq. (9). It starts from a node and traverses the graph: at each step the walker either moves to a neighbor with probability proportional to the edge weights, or returns to the starting point with some probability α, until a steady state is reached.

$$ {\varvec{P}}_{i} = a{\varvec{SP}}_{i} + (1 - a){\varvec{H}} $$
(9)

Since the prediction of each label can be propagated to other labels to some extent, the label predictions for a sample are not only determined by the sample itself but can also be strengthened by the other labels. We use a random walk model to predict the multiple labels of a sample. The initial probability between sample x and each label is defined as 1/m.

$$ \left\{ {\begin{array}{*{20}c} {P(Y)_{x}^{(0)} = [\frac{1}{m},\ldots,\frac{1}{m}]_{m}^{\text{T}} } \\ {P(Y)_{x}^{(t + 1)} = a{\varvec{SP}}(Y)_{x}^{(t)} + (1 - a){\varvec{H}}(x)} \\ \end{array} } \right. $$
(10)

\( P(Y)_{x}^{(t)} \) is the probability distribution vector representing the relationship between the sample and each label at time t, S is the probability transition matrix, and H(x) is the aforementioned vector of initial label predictions for sample x. The process continues until P(Y)x converges. Because the label predictions are updated repeatedly, the dependencies among labels are exploited sufficiently.
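The iteration in formula (10) can be sketched as follows; the convergence tolerance and the iteration cap are illustrative choices of ours, not values given in the paper.

```python
import numpy as np

def restart_random_walk(S: np.ndarray, H_x: np.ndarray,
                        alpha: float, tol: float = 1e-8,
                        max_iter: int = 1000) -> np.ndarray:
    """Iterate P <- alpha * S @ P + (1 - alpha) * H(x) until convergence (formula 10).
    The tolerance and iteration cap are illustrative assumptions."""
    m = len(H_x)
    P = np.full(m, 1.0 / m)                      # P(Y)_x^(0) = [1/m, ..., 1/m]^T
    for _ in range(max_iter):
        P_next = alpha * S @ P + (1 - alpha) * H_x
        if np.linalg.norm(P_next - P, ord=1) < tol:
            return P_next
        P = P_next
    return P
```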

4 Experimental Result and Analysis

In this section, we explain the means by which similarity graph and restart random walk model are evaluated, whilst providing a description of the multi-label dataset and other settings used in the experimental study. Finally, the experimental results on the dataset and the statistical analysis are discussed.

4.1 Dataset

The data used in the experiments consists of English movie titles and overviews collected manually; we call it the Movies dataset. The dataset statistics are shown in Table 1, where the label density equals the size of the label set q divided by the cardinality of the label set c, indicating the probability that a label appears.

Table 1. Several statistical values of the dataset

4.2 Evaluation Metrics

Traditional single-label classification performance metrics, such as recall and accuracy, cannot be used directly to evaluate multi-label classification performance. Therefore, we use the following three metrics to measure the performance of our method.

4.2.1 Hamming Loss

Hamming Loss [20] measures the classification error at the level of individual labels: labels that belong to the sample but are missing from the predicted label set, and labels that the sample does not have but appear in it. A smaller value means better performance of a classification model; the best value is 0. It is defined as:

$$ Hamming\text{-}loss(x_{i} ,y_{i} ) = \frac{1}{\left| D \right|}\sum\nolimits_{i = 1}^{\left| D \right|} {\frac{{xor(x_{i} ,y_{i} )}}{\left| L \right|}} $$
(11)

|D| represents the total number of samples and |L| the total number of labels; xi and yi represent the prediction result and the true label set of the i-th sample, respectively.

4.2.2 Jaccard Index

The Jaccard index [21] measures how similar two sets are; it is defined as the size of their intersection divided by the size of their union. A larger value means better performance of a classification model. It is defined as:

$$ Jaccard(A,B) = \frac{{\left| {A \cap B} \right|}}{{\left| {A \cup B} \right|}} $$
(12)

4.2.3 Accuracy-Score

Accuracy-score [22] computes the accuracy of predictions. In multi-label classification, this function returns subset accuracy: the accuracy for a sample is 1 only if its entire predicted label set is identical to the true label set, which is the best case; otherwise it is 0. It is defined as follows:

$$ accuracy(y,\hat{y}) = \frac{1}{\left| D \right|}\sum\nolimits_{i = 1}^{\left| D \right|} {1(\hat{y}_{i} = y_{i} )} $$
(13)

\( \hat{y}_{i} \) is the predicted label set of the i-th sample and yi is the corresponding true label set.
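For reference, these three metrics are commonly computed on binary label indicator matrices, e.g. with scikit-learn; the toy data below is purely illustrative.

```python
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score, accuracy_score

# Toy indicator matrices (rows = samples, columns = labels); the values are illustrative.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

print(hamming_loss(y_true, y_pred))                      # fraction of wrongly predicted labels
print(jaccard_score(y_true, y_pred, average="samples"))  # mean per-sample Jaccard index
print(accuracy_score(y_true, y_pred))                    # subset accuracy (exact-match ratio)
```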

4.3 Experimental Result and Analysis

Three experiments are designed to evaluate the performance of the proposed algorithm on multi-label text classification: (1) analyze the influence of different α on the algorithm, (2) compare and analyze the results when changing the sizes of the training set v and the test set t, (3) compare our algorithm with other algorithms.

Experiment 1. Analyze the influence of different α on our method. We run the experiment with α = 0.0001, 0.00007, 0.00004, and 0.00001, respectively. Table 2 shows that all three metrics are best when α = 0.00007, so the result is optimal at this point. When α is larger or smaller than 0.00007, the performance of the algorithm tends to degrade. Generally speaking, the influence of α on performance is limited (not exceeding ±0.36%).

Table 2. Experimental results for different α values

Experiment 2. Use test sets t and training sets v of different sizes and analyze the experimental results. The training set sizes are 300, 600, 900, 1200, and 1500, and the test set sizes are 100, 300, and 500. As Table 3 shows, when the test set size |t| = 100 and the training set size |v| = 300, the performance outperforms the others; when |t| = 300 and |v| = 1200, the result is also very good. In summary, the performance obtained by our method is optimal when |t| = 500 and |v| = 1500.

Table 3. Experimental results for different training set v and test set t sizes

Next, we select the group of data with the best classification performance for further comparative analysis; specifically, |t| = 500 and the training set v takes different sizes. It can be observed from Fig. 2 that the Hamming loss gradually decreases and the Accuracy-score continues to increase as the size of the training set grows, which means that the classification performance of the algorithm improves as the ratio of training data to test data increases. When |v| = 1500, both classification scores reach their optimum.

Fig. 2. When |t| = 500, changes in Hamming loss and Accuracy-score with different |v|

Experiment 3. To demonstrate how our method improves multi-label text classification performance, we compare it with existing similar methods, namely BR, LP, CC, and MLkNN. Note that the parameter of the MLkNN algorithm is set to k = 20, while the parameters of the other algorithms use their default values. BR, LP, and CC use Naive Bayes as the base classifier.

Figure 3 shows that the SGaRW algorithm achieves a larger Accuracy-score than MLkNN, which indicates that the labels predicted for a text by our method are more accurate. The Jaccard index of our method is greater than that of MLkNN, while its Hamming loss is smaller. In other words, the SGaRW algorithm makes labels that do not belong to the text appear in the predicted label set as rarely as possible, which reduces the error rate considerably. The comparison with the BR, LP, CC, and MLkNN algorithms shows that the SGaRW algorithm has a clear advantage over the other algorithms.

Fig. 3. Comparison of the different algorithms

5 Conclusion

We introduce a novel method, the SGaRW algorithm, which combines a similarity graph with a random walk model and can solve multi-label text classification problems efficiently. Prior information from WordNet is used to build the similarity graph, on which the initial match values between labels and texts are computed. Then a label dependency graph is constructed, and random walk with restart is run on it. Finally, the labels of the text are determined. The core of future work is to expand the dataset, introduce short-text semantic understanding to improve the performance of short-text multi-label classification, and further optimize the effectiveness of the algorithm.