
1 Introduction

With an Internet penetration rate of 73.0% as of December 2021 [1], the Internet has become an essential part of people's lives. It gives the public more channels to express their ideas, and Internet buzzwords are a concentrated product of that expression. Buzzwords can be positive or negative, however, and while they voice the opinions of Internet users they may also steer public opinion in harmful directions. Accurate identification of Internet buzzwords therefore plays an important role in guiding online opinion correctly.

The system applies deep learning techniques to recognize Internet buzzwords. Deep learning extracts, transforms, and combines features from the raw text to obtain a set of feature representations, which are then fed to a prediction function to obtain the recognition result [2]. A deep learning model is built around three functional components: an embedding layer, which converts words into feature vectors; an encoding layer, which captures the contextual features of the text; and an output layer, which learns the regularities between sequences and classifies the output [3]. Although RNN structures are widely used to process sequential, time-stream data [4,5,6], they suffer from structural problems such as serial computation, vanishing gradients [7], and a one-directional architecture. The contributions of applying the Transformer model to Internet buzzword feature recognition are as follows: (1) A real-time crawling module is added to the data collection, which obtains Internet buzzword data more accurately and mitigates the slow updates of traditional crawling. (2) Because existing Internet buzzword datasets are scattered and sparse, a dynamic Internet buzzword corpus is built in-house from the data collected by web crawling. (3) Traditional recurrent models suffer from vanishing and exploding gradients; the Transformer model, with its parallel computation and self-attention mechanism, avoids these problems, and its bidirectional connections allow the contextual parameters to be updated uniformly, enabling better information aggregation and resolving the dispersion of contextual information. (4) The position encoding of the Transformer model is improved by converting the absolute position-encoding vector to a relative position representation (RPR) [8], compensating for the need to introduce explicit position information in the position code.

2 Related Work

The existing literature on identifying Internet buzzwords and Internet neologisms falls into three categories: rule-based methods, statistics-based methods, and methods combining statistics and rules.

The rule-based approach focuses on formulating rules for the features shared between characters and words based on linguistic theory and knowledge, or on observing word-formation rules and patterns through long-term study of the language, and then summarizing their properties in combination with grammar. Since the core of rule-based new-word discovery is constructing a knowledge base for the domain, a fairly specialized rule base must be created, and new words are discovered according to their degree of similarity to entries in that rule base when identifying Internet buzzwords. The statistics-based approach improves on the main drawback of the rule-based approach, its extensive manual annotation, saving significant time and labor costs. Yet even though the statistics-based approach compensates for many shortcomings of the rule-based approach, experiments in the literature show that it has a low recognition rate and cannot recognize words well, whereas a fusion of the two can raise the recognition rate of Internet buzzwords. The literature [9] proposes a kth-order algorithm for pointwise mutual information (PMI); experiments show that its accuracy improves by about 28.79% over plain PMI, and that when the parameter k takes a value greater than or equal to 3 it overcomes the defects of the PMI method. The Transformer model used here likewise builds on a combination of statistical and rule-based methods and is applied to Internet buzzwords to improve the recognition rate.
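
For concreteness, the following is a minimal Python sketch of plain (first-order) PMI scoring for a two-part candidate word; the counts are illustrative placeholders and this is not the kth-order variant of [9]:

# Minimal sketch of PMI-based scoring for a two-part candidate word.
# A candidate is more likely a real word when its parts co-occur far
# more often than chance, i.e. when the PMI value is large.
import math

def pmi(count_xy, count_x, count_y, total):
    # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))

# Illustrative counts only, not experimental data from [9].
score = pmi(count_xy=120, count_x=400, count_y=300, total=100000)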

3 Overall System Architecture

The Transformer deep learning model is applied to identify the features of Internet buzzwords; the overall process of the system is shown in Fig. 1.

Fig. 1. Overall system flow chart

First, the user logs in to the Internet buzzword recognition system and enters the text to be analyzed on the text analysis page. The background Internet buzzword database then judges the entered text: if it already exists in the corpus, it is directly identified as an Internet buzzword; if not, the input is passed to the Transformer model to determine whether it is an Internet buzzword.

The Transformer-based Internet buzzword recognition solution is implemented in the following steps:

Step1, crawl the existing Internet buzzword corpus on Weibo. To achieve real-time incremental crawling of Internet buzzwords on top of the original crawler technology, each URL is marked with an identifier, its data fingerprint. The fingerprint is a hash value, so only hash values need to be compared to decide whether the crawled content must be updated.

Step2, pre-process the crawled Internet buzzwords: first de-duplicate the data, then segment the longer phrases using search-engine mode, and finally filter out stop words using Baidu's stop word list.

Step3, use the matplotlib, jieba, and wordcloud libraries to display the processed Internet buzzwords visually and draw an Internet buzzword word cloud.

Step4, represent the pre-processed data as text vectors using the Skip-gram method of the word2vec model.

Step5, perform position encoding on the feature vectors obtained in the previous step: a position vector representing position information is added to the word embedding to obtain the final vector with position information (a minimal sketch follows this list).

Step6, feed the vector with position information into the Transformer model and determine whether the input is an Internet buzzword.
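
The following minimal NumPy sketch illustrates Step5; the sinusoidal position encoding of the original Transformer is used here as one common way to build the position vector, the sequence length and dimension are illustrative assumptions, and Sect. 4.2 replaces this absolute scheme with a relative position representation:

# Minimal sketch of Step5: adding a (sinusoidal) position vector to the
# word embeddings before they enter the Transformer. Sizes are assumptions.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model)[None, :]                   # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])              # even dims: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])              # odd dims: cosine
    return pe

embeddings = np.random.randn(20, 100)                 # 20 tokens, dim 100
model_input = embeddings + positional_encoding(20, 100)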

4 System Implementation

4.1 Data Processing and Feature Construction

4.1.1 Data Acquisition

Real-time incremental crawling of Internet buzzwords is achieved by tagging each URL with a data-fingerprint identifier. The fingerprint is a hash value, a unique fixed-length string generated from the input, and the hash values are compared to determine whether a crawl needs updating. Two set operations on the Redis database are used: an insert, which adds a piece of data to the set and returns 1 on success and 0 on failure; and a membership query, which returns 1 if an element exists in the set and 0 if it does not. As shown in Fig. 2, when the Spider module receives a URL to process, a spider middleware checks whether the URL's fingerprint exists in the Redis database; if so, the URL is discarded, otherwise the new URL is fetched and crawled.
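
A hedged Python sketch of this fingerprint check follows; the local Redis server, the MD5 hash, and the key name url_fingerprints are assumptions, and redis-py's sadd() is used because it inserts and reports membership in a single call:

# Hedged sketch of the URL-fingerprint check. sadd() returns 1 when the
# fingerprint is new (crawl the URL) and 0 when it is already in the set.
import hashlib
import redis

r = redis.Redis(host="localhost", port=6379)   # assumes a local Redis server

def url_fingerprint(url: str) -> str:
    # Generate a unique fixed-length string from the input URL.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def should_crawl(url: str) -> bool:
    return r.sadd("url_fingerprints", url_fingerprint(url)) == 1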

Fig. 2. Real-time web crawling flow chart

4.1.2 Data Pre-processing

Counting the content crawled under the keyword "Internet buzzwords" yielded tens of thousands of high-frequency Internet buzzwords in total. These were first de-duplicated by applying the duplicated() function of pandas, a data analysis tool in Python, which marks every repeat of a row after its first occurrence as True; the rows marked True are then removed with the drop_duplicates() function.
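
A minimal sketch of this de-duplication, assuming the crawled data sits in a column named content:

# duplicated() marks every repeat after the first occurrence as True;
# drop_duplicates() keeps only the first occurrence of each row.
import pandas as pd

df = pd.DataFrame({"content": ["yyds", "yyds", "内卷", "破防"]})
mask = df.duplicated()        # [False, True, False, False]
df = df.drop_duplicates()     # the second "yyds" row is removed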

Next, jieba, Python's third-party Chinese word segmentation library, is applied to the longer phrases among the crawled Internet buzzwords. Given the granularity of Internet buzzwords, the more fine-grained search-engine mode mentioned above is used for segmentation; long phrases are cut with the commands jieba.cut_for_search() and jieba.lcut_for_search().
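
For illustration, segmenting a long phrase in search-engine mode looks as follows (the phrase is a placeholder, not crawled data):

# lcut_for_search() returns a list; cut_for_search() returns a generator.
import jieba

phrase = "觉醒年代真的太好看了"              # illustrative long phrase
tokens = jieba.lcut_for_search(phrase)       # list of fine-grained tokens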

The crawled data is then filtered for English characters, numbers, mathematical symbols, punctuation marks, very frequent single Chinese characters, auxiliary words, adverbs, prepositions, conjunctions, and so on. This paper uses the Baidu stop word list for filtering.
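
A hedged sketch of the filtering step; the file name baidu_stopwords.txt is an assumption about where the Baidu stop word list is stored locally:

# Load the stop word list and drop any token that appears in it.
with open("baidu_stopwords.txt", encoding="utf-8") as f:   # assumed path
    stopwords = set(line.strip() for line in f)

tokens = ["的", "yyds", "了", "破防"]                       # illustrative tokens
filtered = [t for t in tokens if t not in stopwords]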

4.1.3 Constructing an Online Buzzword Feature Vector

The pre-processed data is transformed into character vectors using the Word2vec model. The Word2vec module is imported from the gensim package and offers two methods for vectorizing text, CBOW and Skip-gram. During training, sg = 1 is set, selecting the Skip-gram algorithm. The sliding-window size is set to 5, the word-vector dimension is set to 100, and min_count is used for filtering: words with a frequency below the set value, 5 in this paper, are discarded. Skip-gram predicts the surrounding words from the central word; each central word yields K context words as output, so there are K predictions per word and K * V predictions in total over a vocabulary of size V.
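
A minimal gensim sketch with these settings, assuming a toy tokenized corpus; note that the dimension parameter is named vector_size in gensim >= 4 and size in older versions:

# sg=1 selects Skip-gram; window=5, vector_size=100, min_count=5 follow
# the settings above. The toy corpus is repeated so words survive min_count.
from gensim.models import Word2Vec

corpus = [["觉醒", "年代", "yyds"], ["yyds", "破防"]] * 5
model = Word2Vec(corpus, sg=1, window=5, vector_size=100, min_count=5)

vec = model.wv["yyds"]                # the 100-dimensional word vector
sims = model.wv.most_similar("yyds")  # nearest words by cosine similarity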

The model training process is as follows: (1) Use the center_words V to look up W0 and the target_words T to look up W1, obtaining two tensors of shape [batch_size, embedding_size], denoted H1 and H2. (2) Take the dot product of the two tensors. (3) Apply a sigmoid function to the result of (2), normalizing the dot product to a probability value in 0-1 as the predicted probability; the model can then be trained against the label information L. After training, W0 is generally taken as the final word-vector matrix, with each word represented by its row of W0. Using the vector dot product, the similarity between different words can be calculated.
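
This step can be illustrated with a hedged NumPy sketch; the shapes and names (W0, W1, H1, H2, L) follow the text above, while all values are random placeholders rather than trained weights:

# One forward pass of the skip-gram-with-labels step described above.
import numpy as np

vocab_size, embedding_size, batch_size = 1000, 100, 8
W0 = np.random.randn(vocab_size, embedding_size)   # center-word embeddings
W1 = np.random.randn(vocab_size, embedding_size)   # target-word embeddings

center_words = np.random.randint(0, vocab_size, batch_size)   # V
target_words = np.random.randint(0, vocab_size, batch_size)   # T
L = np.random.randint(0, 2, batch_size)            # 1 = true context pair

H1 = W0[center_words]                  # (batch_size, embedding_size)
H2 = W1[target_words]                  # (batch_size, embedding_size)
logits = np.sum(H1 * H2, axis=1)       # row-wise dot product
prob = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> probability in (0, 1)
loss = -np.mean(L * np.log(prob) + (1 - L) * np.log(1 - prob))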

4.2 Transformer Model

The Transformer model was proposed by Vaswani et al. in the 2017 paper "Attention Is All You Need" [10]; its general structure is shown in Fig. 3.

Fig. 3. Transformer structure diagram

Consider the position encoding in the traditional Transformer model: because the Transformer has no recurrent iteration and thus no inherent access to position information, the position of each word must be supplied to it explicitly. The position-encoding parts of the Encoder and Decoder of the Transformer model are therefore transformed into a relative position representation (RPR), compensating for the model's inability to obtain relative position information.

Two position-encoding vectors need to be learned by the model, one used in computing \(z_i\) and one in computing \(e_{ij}\). If the clipping distance is k, there are 2k + 1 relative position-encoding vectors to learn: k for positions to the left, k for positions to the right, and one for the word itself. The traditional Transformer does not use relative positional encoding when computing, after SoftMax, the degree of attention word i pays to word j. Comparing the two calculation methods, the RPR calculation locates positions more accurately, so the model uses RPR for the position encoding of both the Encoder and the Decoder.
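
A minimal NumPy sketch of how the 2k + 1 vectors can be indexed, following the relative position scheme of [8]; the random table stands in for learned weights and all sizes are illustrative assumptions:

# One embedding per relative offset in [-k, k]; row k is "itself" (offset 0).
import numpy as np

def relative_position_vectors(seq_len, k, d):
    table = np.random.randn(2 * k + 1, d)          # stand-in for learned weights
    offsets = np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None]
    clipped = np.clip(offsets, -k, k) + k          # map offset j - i to a row
    return table[clipped]                          # (seq_len, seq_len, d)

rpr = relative_position_vectors(seq_len=10, k=4, d=64)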

5 Analysis and Visualization of Experimental Results

5.1 Experimental Parameters

Table 1. Transformer model parameters.

The number of layers is set to 2 by default and the hidden size to 128; BIDIRECTIONAL is set to True so that the sequence is analyzed both from front to back and from back to front. Table 1 lists the parameters of the Transformer model and their corresponding optimal values.

5.2 Comparative Experiments

The recognition rate of the network structure for Internet buzzwords was evaluated using the precision Pre, recall Rec, and F1 values. To verify the performance of the Transformer model proposed in this paper, the feature vectors of Internet buzzwords were used as the model's input vectors, and comparison experiments were run against the commonly used single models CRF, LSTM, BILSTM, and CNN [12]; the recognition results are shown in Table 2.
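
For reference, the three metrics follow their standard definitions, with TP, FP, and FN denoting true positives, false positives, and false negatives respectively: \(Pre = \frac{TP}{TP + FP}\), \(Rec = \frac{TP}{TP + FN}\), and \(F1 = \frac{2 \cdot Pre \cdot Rec}{Pre + Rec}\).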

Table 2. Recognition performance of the models.

The experimental results show that the Transformer-based Internet buzzword recognition model outperforms the common single models CRF, LSTM, BILSTM, and CNN. The LSTM model has the lowest recognition rate for irregular words such as Internet buzzwords because it can only extract information from the preceding context, not the following one; its F1 value is only 22.3%, which is ineffective for recognizing Internet buzzwords. The CNN model has an F1 value of 65.38%, an average performance compared with the other models. The BILSTM model reaches an F1 value of 87.53%, a 6.57% improvement over the CRF model, and still performs relatively well in buzzword recognition. The Transformer model performs best in terms of precision Pre, recall Rec, and F1, at 90.1%, 92.13%, and 91.16% respectively.

The evolution of the evaluation metric precision P is shown in Fig. 4, that of recall R in Fig. 5, and that of F1 in Fig. 6.

Fig. 4. Comparison of accuracy P (%) across models

Fig. 5. Comparison of recall R (%) across models

Fig. 6. Comparison of F1 (%) across models

The line graphs of the experimental results show that the Transformer-based model achieves the highest precision, recall, and F1 score, its curve lying above the others at 90.1%, 92.13%, and 91.16% respectively; the experimental data confirm that the model in this paper improves the recognition rate of Internet buzzwords.

5.3 Visualization of Internet Buzzword Recognition

The Internet buzzword recognition system implements its visual interface with Python's lightweight Flask web framework. After logging in, users of the Internet buzzword visualization platform can access data query, real-time analysis, and hot-topic options in the sidebar of the home page. The data query page is shown in Fig. 7: it covers all data, Internet buzzwords, and non-Internet buzzwords, and displays information such as user name, posting time and content, device information, numbers of likes, retweets, and comments, and whether the entry is an Internet buzzword.
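
As a hedged illustration of how such a Flask interface can be wired up, the following sketch defines a single analysis route; the route name, form field, and predict() stub are illustrative assumptions, not the paper's actual code:

# Minimal Flask route sketch for the real-time analysis page.
from flask import Flask, request

app = Flask(__name__)

def predict(text):
    # Placeholder: the real system calls the trained Transformer model here.
    return {"score": 0.50, "is_buzzword": False}

@app.route("/analyze", methods=["POST"])
def analyze():
    text = request.form["content"]   # the "content" input field
    result = predict(text)
    return result                    # Flask serializes the dict to JSON

if __name__ == "__main__":
    app.run()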

Fig. 7. Visualization of data enquiry pages

The real-time analysis page is shown in Fig. 8: the words to be discriminated are entered in the content input box, the predicted probability score is displayed in the sentiment score field, and whether they are suspected Internet buzzwords is shown in the sentiment evaluation column.

Fig. 8. Example of real-time analysis page visualization

6 Conclusion

To improve the recognition rate of Internet buzzwords, Transformer-based Internet buzzword feature recognition is proposed. A real-time crawling module is added to the data collection, obtaining Internet buzzword data more accurately and mitigating the slow updates of traditional crawling. As buzzword datasets on the web are scattered and sparse, a dynamic corpus of Internet buzzwords is constructed in-house from the data collected by web crawling. Traditional recurrent models suffer from vanishing and exploding gradients; the Transformer model, with its parallel computation and self-attention mechanism, avoids these problems, and its bidirectional connections allow the parameters of the context to be updated uniformly, enabling better information aggregation and resolving the dispersion of contextual information. Finally, the position encoding of the Transformer model is improved by converting the absolute position-encoding vector to a relative position representation (RPR), compensating for the need to introduce explicit position information in the position code.