
1 Introduction

With an Internet penetration rate of 73.0% as of December 2021 [1], the Internet has become an essential part of people's lives. It gives the public more channels to express their ideas, and Internet buzzwords are a concentrated product of that expression. Buzzwords can be positive or negative, however, and while they voice the opinions of Internet users they may also steer public opinion in harmful directions. Accurate identification of Internet buzzwords therefore plays an important role in guiding online opinion correctly.

The system applies deep learning techniques to recognize Internet buzzwords. Deep learning extracts, transforms, and combines features from the raw text to obtain a set of feature representations, which are then fed to a prediction function to obtain the recognition result [2]. A deep learning model is built around three functional components: an embedding layer, which converts words into feature vectors; an encoding layer, which captures the contextual features of the text; and an output layer, which learns the regularities between sequences and classifies the output [3]. Although RNN structures are widely used to process sequential, time-stream data [4,5,6], they suffer from structural problems such as serial computation, vanishing gradients [7], and a one-directional architecture. The contributions of applying the Transformer model to Internet buzzword feature recognition are as follows: (1) A real-time crawling module is added to the data collection, which obtains Internet buzzword data more accurately and mitigates the slow updates of traditional crawling. (2) Because existing Internet buzzword datasets are scattered and sparse, a dynamic Internet buzzword corpus is built in-house from the data collected by web crawling. (3) Traditional recurrent models suffer from vanishing and exploding gradients; the Transformer model, with its parallel computation and self-attention mechanism, avoids these problems, and its bidirectional connections allow the contextual parameters to be updated uniformly, enabling better information aggregation and resolving the dispersion of contextual information. (4) The position encoding of the Transformer model is improved by converting the absolute position-encoding vector to a relative position representation (RPR) [8], compensating for the need to introduce explicit position information in the position code.

2 Related Work

The existing literature on identifying Internet buzzwords and Internet neologisms falls into three categories: rule-based methods, statistics-based methods, and methods combining statistics and rules.

The rule-based approach focuses on formulating rules for the features shared between characters and words based on linguistic theory and knowledge, or on observing word-formation rules and patterns through long-term study of the language, and then summarizing their properties in combination with grammar. Since the core of rule-based new-word discovery is constructing a knowledge base for the domain, a fairly specialized rule base must be created, and new words are discovered according to their degree of similarity to entries in that rule base when identifying Internet buzzwords. The statistics-based approach improves on the main drawback of the rule-based approach, its extensive manual annotation, saving significant time and labor costs. Yet even though the statistics-based approach compensates for many shortcomings of the rule-based approach, experiments in the literature show that it has a low recognition rate and cannot recognize words well, whereas a fusion of the two can raise the recognition rate of Internet buzzwords. The literature [9] proposes a kth-order algorithm for pointwise mutual information (PMI); experiments show that its accuracy improves by about 28.79% over plain PMI, and that when the parameter k takes a value greater than or equal to 3 it overcomes the defects of the PMI method. The Transformer model used here likewise builds on a combination of statistical and rule-based methods and is applied to Internet buzzwords to improve the recognition rate.
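
For concreteness, the following is a minimal Python sketch of plain (first-order) PMI scoring for a two-part candidate word; the counts are illustrative placeholders and this is not the kth-order variant of [9]:

# Minimal sketch of PMI-based scoring for a two-part candidate word.
# A candidate is more likely a real word when its parts co-occur far
# more often than chance, i.e. when the PMI value is large.
import math

def pmi(count_xy, count_x, count_y, total):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Illustrative counts only, not experimental data from [9].
score = pmi(count_xy=120, count_x=400, count_y=300, total=100000)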

3 Overall System Architecture

The Transformer deep learning model is applied to identify the features of Internet buzzwords; the overall process of the system is shown in Fig. 1.

Fig. 1. Overall system flow chart

First, the user logs in to the Internet buzzword recognition system and enters the text to be analyzed on the text analysis page. The background Internet buzzword database then judges the entered text: if it already exists in the corpus, it is directly identified as an Internet buzzword; if not, the input is passed to the Transformer model to determine whether it is an Internet buzzword.

The Transformer-based Internet buzzword recognition solution is implemented in the following steps:

Step1, crawl the existing Internet buzzword corpus on Weibo. To achieve real-time incremental crawling of Internet buzzwords on top of the original crawler technology, each URL is marked with an identifier, its data fingerprint. The fingerprint is a hash value, so only hash values need to be compared to decide whether the crawled content must be updated.

Step2, pre-process the crawled Internet buzzwords: first de-duplicate the data, then segment the longer phrases using search-engine mode, and finally filter out stop words using Baidu's stop word list.

Step3, use the matplotlib, jieba, and wordcloud libraries to display the processed Internet buzzwords visually and draw an Internet buzzword word cloud.

Step4, represent the pre-processed data as text vectors using the Skip-gram method of the word2vec model.

Step5, perform position encoding on the feature vectors obtained in the previous step: a position vector representing position information is added to the word embedding to obtain the final vector with position information (a minimal sketch follows this list).

Step6, feed the vector with position information into the Transformer model and determine whether the input is an Internet buzzword.
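
The following minimal NumPy sketch illustrates Step5; the sinusoidal position encoding of the original Transformer is used here as one common way to build the position vector, the sequence length and dimension are illustrative assumptions, and Sect. 4.2 replaces this absolute scheme with a relative position representation:

# Minimal sketch of Step5: adding a (sinusoidal) position vector to the
# word embeddings before they enter the Transformer. Sizes are assumptions.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model)[None, :]                   # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even dims: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd dims: cosine
    return pe

embeddings = np.random.randn(20, 100)                 # 20 tokens, dim 100
model_input = embeddings + positional_encoding(20, 100)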

4 System Implementation

4.1 Data Processing and Feature Construction

4.1.1 Data Acquisition

Real-time incremental crawling of Internet buzzwords is achieved by tagging each URL with a data-fingerprint identifier. The fingerprint is a hash value, a unique fixed-length string generated from the input, and the hash values are compared to determine whether a crawl needs updating. Two set operations on the Redis database are used: an insert, which adds a piece of data to the set and returns 1 on success and 0 on failure; and a membership query, which returns 1 if an element exists in the set and 0 if it does not. As shown in Fig. 2, when the Spider module receives a URL to process, a spider middleware checks whether the URL's fingerprint exists in the Redis database; if so, the URL is discarded, otherwise the new URL is fetched and crawled.
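
A hedged Python sketch of this fingerprint check follows; the local Redis server, the MD5 hash, and the key name url_fingerprints are assumptions, and redis-py's sadd() is used because it inserts and reports membership in a single call:

# Hedged sketch of the URL-fingerprint check. sadd() returns 1 when the
# fingerprint is new (crawl the URL) and 0 when it is already in the set.
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis server

def url_fingerprint(url: str) -> str:
    # Generate a unique fixed-length string from the input URL.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def should_crawl(url: str) -> bool:
    return r.sadd("url_fingerprints", url_fingerprint(url)) == 1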

Fig. 2. Real-time web crawling flow chart

4.1.2 Data Pre-processing

Counting the content crawled under the keyword "Internet buzzwords" yielded tens of thousands of high-frequency Internet buzzwords in total. These were first de-duplicated by applying the duplicated() function of pandas, a data analysis tool in Python, which marks every repeat of a row after its first occurrence as True; the rows marked True are then removed with the drop_duplicates() function.
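
A minimal sketch of this de-duplication, assuming the crawled data sits in a column named content:

# duplicated() marks every repeat after the first occurrence as True;
# drop_duplicates() keeps only the first occurrence of each row.
import pandas as pd

df = pd.DataFrame({"content": ["yyds", "yyds", "内卷", "破防"]})
mask = df.duplicated()        # [False, True, False, False]
df = df.drop_duplicates()     # the second "yyds" row is removed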

Next, jieba, Python's third-party Chinese word segmentation library, is applied to the longer phrases among the crawled Internet buzzwords. Given the granularity of Internet buzzwords, the more fine-grained search-engine mode mentioned above is used for segmentation; long phrases are cut with the commands jieba.cut_for_search() and jieba.lcut_for_search().
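
For illustration, segmenting a long phrase in search-engine mode looks as follows (the phrase is a placeholder, not crawled data):

# lcut_for_search() returns a list; cut_for_search() returns a generator.
import jieba

phrase = "觉醒年代真的太好看了"              # illustrative long phrase
tokens = jieba.lcut_for_search(phrase)       # list of fine-grained tokens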

The crawled data is then filtered for English characters, numbers, mathematical symbols, punctuation marks, very frequent single Chinese characters, auxiliary words, adverbs, prepositions, conjunctions, and so on. This paper uses the Baidu stop word list for filtering.
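
A hedged sketch of the filtering step; the file name baidu_stopwords.txt is an assumption about where the Baidu stop word list is stored locally:

# Load the stop word list and drop any token that appears in it.
with open("baidu_stopwords.txt", encoding="utf-8") as f:   # assumed path
    stopwords = set(line.strip() for line in f)

tokens = ["的", "yyds", "了", "破防"]                       # illustrative tokens
filtered = [t for t in tokens if t not in stopwords]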

4.1.3 Constructing an Online Buzzword Feature Vector

The pre-processed data is transformed into character vectors using the Word2vec model. The Word2vec module is imported from the gensim package and offers two methods for vectorizing text, CBOW and Skip-gram. During training, sg = 1 is set, selecting the Skip-gram algorithm. The sliding-window size is set to 5, the word-vector dimension is set to 100, and min_count is used for filtering: words with a frequency below the set value, 5 in this paper, are discarded. Skip-gram predicts the surrounding words from the central word; each central word yields K context words as output, so there are K predictions per word and K * V predictions in total over a vocabulary of size V.
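
A minimal gensim sketch with these settings, assuming a toy tokenized corpus; note that the dimension parameter is named vector_size in gensim >= 4 and size in older versions:

# sg=1 selects Skip-gram; window=5, vector_size=100, min_count=5 follow
# the settings above. The toy corpus is repeated so words survive min_count.
from gensim.models import Word2Vec

corpus = [["觉醒", "年代", "yyds"], ["yyds", "破防"]] * 5
model = Word2Vec(corpus, sg=1, window=5, vector_size=100, min_count=5)

vec = model.wv["yyds"]                # the 100-dimensional word vector
sims = model.wv.most_similar("yyds")  # nearest words by cosine similarity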

The model training process is as follows: (1) Use the center_words V to look up W0 and the target_words T to look up W1, obtaining two tensors of shape [batch_size, embedding_size], denoted H1 and H2. (2) Take the dot product of the two tensors. (3) Apply a sigmoid function to the result of (2), normalizing the dot product to a probability value in 0-1 as the predicted probability; the model can then be trained against the label information L. After training, W0 is generally taken as the final word-vector matrix, with each word represented by its row of W0. Using the vector dot product, the similarity between different words can be calculated.
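
This step can be illustrated with a hedged NumPy sketch; the shapes and names (W0, W1, H1, H2, L) follow the text above, while all values are random placeholders rather than trained weights:

# One forward pass of the skip-gram-with-labels step described above.
import numpy as np

vocab_size, embedding_size, batch_size = 1000, 100, 8
W0 = np.random.randn(vocab_size, embedding_size)   # center-word embeddings
W1 = np.random.randn(vocab_size, embedding_size)   # target-word embeddings

center_words = np.random.randint(0, vocab_size, batch_size)   # V
target_words = np.random.randint(0, vocab_size, batch_size)   # T
L = np.random.randint(0, 2, batch_size)            # 1 = true context pair

H1 = W0[center_words]                  # (batch_size, embedding_size)
H2 = W1[target_words]                  # (batch_size, embedding_size)
logits = np.sum(H1 * H2, axis=1)       # row-wise dot product
prob = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> probability in (0, 1)
loss = -np.mean(L * np.log(prob) + (1 - L) * np.log(1 - prob))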

4.2 Transformer Model

The Transformer model was proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need" [10]; its general structure is shown in Fig. 3.

Fig. 3. Transformer structure diagram

Consider the position encoding in the traditional Transformer model: because the Transformer has no recurrent iteration and thus no inherent access to position information, the position of each word must be supplied to it explicitly. The position-encoding parts of the Encoder and Decoder of the Transformer model are therefore transformed into a relative position representation (RPR), compensating for the model's inability to obtain relative position information.

Two position-encoding vectors need to be learned by the model, one used in computing \(z_i\) and one in computing \(e_{ij}\). If the clipping distance is k, there are 2k + 1 relative position-encoding vectors to learn: k for positions to the left, k for positions to the right, and one for the word itself. The traditional Transformer does not use relative positional encoding when computing, after SoftMax, the degree of attention word i pays to word j. Comparing the two calculation methods, the RPR calculation locates positions more accurately, so the model uses RPR for the position encoding of both the Encoder and the Decoder.
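
A minimal NumPy sketch of how the 2k + 1 vectors can be indexed, following the relative position scheme of [8]; the random table stands in for learned weights and all sizes are illustrative assumptions:

# One embedding per relative offset in [-k, k]; row k is "itself" (offset 0).
import numpy as np

def relative_position_vectors(seq_len, k, d):
    table = np.random.randn(2 * k + 1, d)          # stand-in for learned weights
    offsets = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    clipped = np.clip(offsets, -k, k) + k          # map offset j - i to a row
    return table[clipped]                          # (seq_len, seq_len, d)

rpr = relative_position_vectors(seq_len=10, k=4, d=64)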

5 Analysis and Visualization of Experimental Results

5.1 Experimental Parameters

Table 1. Transformer model parameters.

The number of layers is set to 2 by default and the hidden size to 128; BIDIRECTIONAL is set to True so that the sequence is analyzed both from front to back and from back to front. Table 1 lists the parameters of the Transformer model and their corresponding optimal values.

5.2 Comparative Experiments

The recognition rate of the network structure for Internet buzzwords was evaluated using the precision Pre, recall Rec, and F1 values. To verify the performance of the Transformer model proposed in this paper, the feature vectors of Internet buzzwords were used as the model's input vectors, and comparison experiments were run against the commonly used single models CRF, LSTM, BILSTM, and CNN [12]; the recognition results are shown in Table 2.
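
For reference, the three metrics follow their standard definitions, with TP, FP, and FN denoting true positives, false positives, and false negatives respectively: \(Pre = \frac{TP}{TP + FP}\), \(Rec = \frac{TP}{TP + FN}\), and \(F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}\).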

Table 2. Recognition performance of the models.

The experimental results show that the Transformer-based Internet buzzword recognition model outperforms the common single models CRF, LSTM, BILSTM, and CNN. The LSTM model has the lowest recognition rate for irregular words such as Internet buzzwords because it can only extract information from the preceding context, not the following one; its F1 value is only 22.3%, which is ineffective for recognizing Internet buzzwords. The CNN model has an F1 value of 65.38%, an average performance compared with the other models. The BILSTM model reaches an F1 value of 87.53%, a 6.57% improvement over the CRF model, and still performs relatively well in buzzword recognition. The Transformer model performs best in terms of precision Pre, recall Rec, and F1, at 90.1%, 92.13%, and 91.16% respectively.

The evolution of the evaluation metric precision P is shown in Fig. 4, that of recall R in Fig. 5, and that of F1 in Fig. 6.

Fig. 4. Comparison of accuracy P (%) across models

Fig. 5. Comparison of recall R (%) across models

Fig. 6. Comparison of F1 (%) across models

The line graphs of the experimental results show that the Transformer-based model achieves the highest precision, recall, and F1 score, its curve lying above the others at 90.1%, 92.13%, and 91.16% respectively; the experimental data confirm that the model in this paper improves the recognition rate of Internet buzzwords.

5.3 Visualization of Internet Buzzword Recognition

The Internet buzzword recognition system implements its visual interface with Python's lightweight Flask web framework. After logging in, users of the Internet buzzword visualization platform can access data query, real-time analysis, and hot-topic options in the sidebar of the home page. The data query page is shown in Fig. 7: it covers all data, Internet buzzwords, and non-Internet buzzwords, and displays information such as user name, posting time and content, device information, numbers of likes, retweets, and comments, and whether the entry is an Internet buzzword.
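
As a hedged illustration of how such a Flask interface can be wired up, the following sketch defines a single analysis route; the route name, form field, and predict() stub are illustrative assumptions, not the paper's actual code:

# Minimal Flask route sketch for the real-time analysis page.
from flask import Flask, request

app = Flask(__name__)

def predict(text):
    # Placeholder: the real system calls the trained Transformer model here.
    return {"score": 0.50, "is_buzzword": False}

@app.route("/analyze", methods=["POST"])
def analyze():
    text = request.form["content"]   # the "content" input field
    result = predict(text)
    return result                    # Flask serializes the dict to JSON

if __name__ == "__main__":
    app.run()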

Fig. 7. Visualization of data enquiry pages

The real-time analysis page is shown in Fig. 8: the words to be discriminated are entered in the content input box, the predicted probability score is displayed in the sentiment score field, and whether they are suspected Internet buzzwords is shown in the sentiment evaluation column.

Fig. 8. Example of real-time analysis page visualization

6 Conclusion

To improve the recognition rate of Internet buzzwords, Transformer-based Internet buzzword feature recognition is proposed. A real-time crawling module is added to the data collection, obtaining Internet buzzword data more accurately and mitigating the slow updates of traditional crawling. As buzzword datasets on the web are scattered and sparse, a dynamic corpus of Internet buzzwords is constructed in-house from the data collected by web crawling. Traditional recurrent models suffer from vanishing and exploding gradients; the Transformer model, with its parallel computation and self-attention mechanism, avoids these problems, and its bidirectional connections allow the parameters of the context to be updated uniformly, enabling better information aggregation and resolving the dispersion of contextual information. Finally, the position encoding of the Transformer model is improved by converting the absolute position-encoding vector to a relative position representation (RPR), compensating for the need to introduce explicit position information in the position code.