Microblog sentiment analysis based on deep memory network with structural attention

Microblog sentiment analysis has important applications in many fields, such as social media analysis and online product reviews. However, traditional methods struggle to capture long-range dependencies within a microblog and easily lose semantic information because microblog text and emojis are poorly standardized. In this paper, we propose a novel deep memory network with structural self-attention, which stores long-term contextual information and extracts richer information from the text and emojis in microblogs in order to improve the performance of sentiment analysis. Specifically, the model first uses a bidirectional long short-term memory network to extract the semantic information in the microblogs and treats the extraction results as the memory component of the deep memory network, which stores the long dependencies and requires no syntactic parser, sentiment lexicon, or feature engineering. Then, multi-step structural self-attention operations serve as the generalization and output components. Furthermore, a penalty mechanism is added to the loss function to promote diversity across the different hops of attention in the model. We conducted extensive experiments against eight baseline methods on real datasets. The results show that our model outperforms these state-of-the-art models, which validates its effectiveness.


Introduction
Microblogs have become an essential channel in online social media, playing an indispensable role in information dissemination and acquisition. Users' emotions play an important role in the dissemination of microblog information. Users whose information flow has positive content removed become increasingly depressed on social media, whereas those whose flow has negative content removed become increasingly positive [1]. In addition, users' emotions in social media are highly contagious, as images, videos, and even words themselves can change users' emotions. Therefore, analyzing whether microblogs contain users' subjective emotions, and which polarity those emotions have, is important for studying the mechanism and dynamics of information dissemination in microblogs, predicting trends of unexpected events [2], and even forecasting the stock market [3].
The existing microblog sentiment analysis methods can be mainly classified into knowledge-based methods, machine learning methods based on feature classification, and deep learning methods. The knowledge-based method first constructs a knowledge base for microblog sentiment analysis, including a sentiment lexicon [4], a phrase lexicon and an emoji lexicon [5], a syntactic dependency rule base, and a domain ontology base, and then uses the knowledge base to aggregate and calculate the sentiment of microblogs. The knowledge-based method is simple and suitable for large-scale and multi-domain application scenarios. Still, it relies heavily on human experience, and constructing a good knowledge base is complicated and costly. Therefore, knowledge-based methods are often combined with machine learning methods, with the knowledge bases used to extract sentiment features from microblogs. Feature classification-based machine learning methods start with feature engineering to construct feature sets for microblog sentiment classification and then use supervised machine learning methods to classify microblog sentiments. The feature sets include n-gram features, lexical features, syntactic dependency features, TF-IDF (term frequency-inverse document frequency) features, and knowledge-based features. Commonly used machine learning methods, including Naive Bayes, SVM (support vector machines), CRF (conditional random fields), and ensemble learning methods [2], have been used in microblog sentiment analysis research.
There are two main drawbacks of feature classification-based machine learning methods. First, supervised learning methods require a large amount of manually labeled data, and unsupervised learning methods are currently not mature enough, so weakly supervised learning methods [6] have recently received more attention and are used in research on microblog sentiment analysis [7]. Second, these methods rely heavily on feature sets and have low domain applicability, and they often require much effort to be spent on feature engineering. Therefore, with the rapid development of deep learning, more and more scholars use deep learning-based methods to study microblog sentiment analysis.
The deep learning-based approach first segments the microblog text into words, represents the words as word vectors, builds a deep neural network model to extract the semantics of the microblog text, constructs a representation vector of the microblog sentiment, and finally performs sentiment classification. Commonly used deep learning models include recurrent neural networks and convolutional neural networks. In [8], a bidirectional long short-term memory network model was used to classify negative emotions of microblog users into anger, sadness and fear. Sun et al. [9] used a convolutional neural network model to analyze the sentiment tendency of microblogs. Ke et al. [10] constructed a multi-channel convolutional neural network for microblog sentiment classification by combining different feature information to form different network inputs. To improve the sentiment semantic extraction ability of the model, many deep learning techniques have been proposed for sentiment analysis tasks in recent years, such as the attention mechanism [11] and deep memory networks [12,13]. Li et al. [14] constructed a dual attention model with microblog text and sentiment symbols to classify microblog sentiment, and the accuracy of sentiment classification was improved by using sentiment symbols. Nevertheless, microblogs are characterized by diverse forms, low standardization of linguistic expressions, and abundant online vocabulary and emojis, which makes it difficult for the above-mentioned approaches to extract semantic features and user sentiments. In addition, the memory capacity of these approaches is too limited to store much information, so they easily lose semantic information, cannot retain long-range dependencies, and consequently perform worse when classifying users' polarity and subjective emotions.
To address the above limitations, this paper develops a deep memory network with structural attention for sentiment classification, which captures semantic dependencies between context words and emojis through a structural self-attention mechanism and requires no syntactic parser, sentiment lexicon, or feature engineering. This study then conducts extensive experiments on two real datasets from NLPCC 2013 and NLPCC 2014. The results show that our approach outperforms feature-based SVM, LSTM and other baseline models in the field of sentiment analysis. The main contributions of this work are summarized as follows:
• To the best of our knowledge, this study is the first to introduce a structural self-attention mechanism into a deep memory model. The model combines the storage mechanism for long-term contextual information of deep memory networks with the ability of structural self-attention, in order to find the user subjectivity and sentiment tendency hidden in the texts and emojis of online social media.
• This study adds a penalty mechanism to the loss function of our deep learning model. Each attention vector in the attention matrix of the last computation step then focuses on different sentiment features, which helps improve performance on the subjectivity recognition and sentiment classification tasks.
• The model achieves state-of-the-art performance compared with the baseline models in the field of sentiment analysis.
The rest of the paper is organized as follows. "Related works" reviews related work on deep memory networks, attention mechanisms and sentiment analysis. "Deep memory network with structural attention for microblog sentiment analysis" details the structural deep memory model. "Experimental Results" and "Results and Discussions" present the datasets and the model performance. Conclusions and future work are presented in "Conclusions".

Sentiment analysis
Sentiment analysis (SA) is one of the most important tasks in the field of natural language processing in online social network platforms. Recent studies of sentiment classification can be mainly grouped into three categories: aspect-level SA, sequence-level SA, and document-level SA.
(1) Most current aspect-level SA recognizes the entities or aspects in a sentence and makes a positive, neutral, or negative sentiment judgment on each aspect. For example, Gan et al. [15] proposed a self-attention-based hierarchical dilated convolutional neural network for multi-aspect sentiment analysis. Ding et al. [16] designed an aspect-level sentiment analysis tool consisting of sentiment classification and aspect recognition. However, these aspect-based sentiment analysis methods mainly depend on explicit aspect words in each sentence and include word segmentation, semantic recognition, aspect extraction, and other complex processes.
(2) Research on sequence-level sentiment analysis mainly focuses on discriminating implicit sentiment in sentence sequences, with the aim of classifying the sentiment polarity (positive or negative opinion) of those sentences. The long short-term memory model (LSTM) can effectively capture the semantics of sequences with long dependencies and, with the introduction of an attention mechanism, can automatically assign different weights to the words in a sentence [17]. Chen et al. [18] classify sentences into types and perform sentiment analysis separately on the sentences of each type, using a BiLSTM-conditional random field (CRF) model and convolutional neural networks (CNN), to improve sequence-level sentiment classification.
(3) Document-level SA aims to classify the sentiment polarity of document-level reviews posted by online users about products and services, which contain a large number of sentences [19]. In general, the task is referred to as document-level analysis because it considers each document as a whole and does not study the entities or aspects inside the document or determine the sentiments expressed about them; it is typically addressed with supervised artificial neural network methods [20,21].
In online social network platforms, the content published by users usually consists of short texts of fewer than 140 words. Furthermore, microblog texts may not contain explicit aspect or sentiment words, so conventional aspect-level and document-level sentiment analysis methods are not directly suitable. Therefore, we regard each microblog as a whole, without studying the entities or aspects inside it, and follow the document-level SA setting to determine its overall sentiment via the deep memory network with structural attention.

Deep memory network
Memory networks are a special kind of neural network learning framework [11]. The main idea is to store contextual information as long-term memory and to generate representations of the input data by reading, writing or updating these long-term memories; the representations are then used for generative or predictive tasks.
The memory network consists of five components: storage (M), input (I), generalization (G), output (O), and response (R) [12]. The storage component M stores the long-term memory of contextual information. Unlike in an RNN, this information is stored in a separate component, which enables fairly long-term contextual information to be used in subsequent computations. The input component I transforms the input data into a feature representation inside the model and thus acts as a feature extractor. The generalization component G updates the memory stored in the storage component based on new inputs; it can compress or generalize the long-term memory as needed for prediction or generation. The output component O generates an output in the feature representation space based on the current input and the state of the long-term memory. The response component R gives a specific prediction or generation result based on this output.
Theoretically, the components of a memory network can be implemented by any appropriate machine learning model. Still, an end-to-end implementation based on deep neural networks is one of the most convenient and best-performing options. Sukhbaatar et al. [22] proposed an end-to-end memory network model incorporating attention mechanisms and multi-step (hops) computation to implement multi-level output components. Deep memory networks have been used in a variety of fields, such as attribute-level sentiment classification [23] and question answering systems [24], with good results, and their ideas have been widely adopted for building more complex deep learning models [25].

Attention mechanism
The idea of attention mechanism originated from the mechanism of selective attention in human vision and was initially used mainly in visual images. The attention mechanism allows different parts of the sentence to play different roles in completing different tasks, thus avoiding encoding the information of the whole sentence as a fixed vector for all tasks.
The attention mechanism has been used extensively in various areas of text sequence processing and has become one of the required components of deep neural network models [13]. Dot-product attention [26] and additive attention [27] are the two most commonly used attention functions. Dot-product attention calculates the alignment scores from the hidden states of the encoder and the hidden states of the decoder. Additive attention serves the same purpose but uses a feed-forward network with one hidden layer to calculate the alignment scores. In addition, multi-head attention allows the model to stack several attention functions and run them in parallel [28]; the individual heads commonly use the dot-product attention function. Multi-head contextual attention is a variant of the multi-head attention mechanism that applies attention to a fixed window of context words in order to learn the different semantics of subspaces at other locations [28]. Memory-based attention borrows the idea of memory networks: it stores the contextual information and updates it to calculate the attention scores [29].
Compared with the above-mentioned attention mechanisms, the structural self-attention mechanism extracts different aspects of the sentence into multiple vector representations, which makes global semantic information available to each word at each position and facilitates the modeling of long dependencies. Therefore, this study combines the structural self-attention mechanism with deep memory networks for sentiment analysis.
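The two scoring functions described above can be sketched in NumPy as follows. This is a minimal illustration and not the implementation used in this study: the function names, the parameter names `W_q`, `W_h` and `v`, and the array shapes are assumptions chosen for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def dot_product_attention(query, states):
    """Score each hidden state by its dot product with the query.

    query:  (d,)   query vector (e.g. a decoder hidden state)
    states: (n, d) hidden states to attend over (e.g. encoder outputs)
    Returns the attention weights (n,) and the weighted context vector (d,).
    """
    scores = states @ query
    weights = softmax(scores)
    return weights, weights @ states


def additive_attention(query, states, W_q, W_h, v):
    """Score each hidden state with a one-hidden-layer feed-forward network.

    score_i = v^T tanh(W_q q + W_h h_i)
    W_q: (a, d_q), W_h: (a, d), v: (a,) are hypothetical trainable parameters.
    """
    hidden = np.tanh(states @ W_h.T + W_q @ query)  # (n, a)
    scores = hidden @ v                             # (n,)
    weights = softmax(scores)
    return weights, weights @ states
```

In both cases the weights form a probability distribution over the positions of the sequence; only the way the alignment score is computed differs.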

Deep memory network with structural attention for microblog sentiment analysis
In this section, we present a structural deep memory network model based on deep memory networks and structural self-attention mechanism for microblog sentiment analysis, which combines the storage mechanism of long-term contextual information of deep memory networks with the ability of structural self-attention.

Microblog sentiment analysis task
Microblog sentiment analysis is to determine the user sentiment $e_i$ of a given microblog document $s_i \in D$ $(i = 1, \cdots, |D|)$. The document $s_i = \{w_0, w_1, \cdots, w_n\}$ consists of an ordered sequence of words and emojis, while $D$ is a collection of documents. Microblog sentiment analysis includes subjectivity classification and sentiment classification:
Subjectivity classification classifies microblog documents into those that contain user sentiment and those that do not, denoted as $e_i \in \{o_0, o_1\}$.
Sentiment classification classifies microblog documents $s_i \in \{D \mid e_i = o_1\}$ into those containing positive sentiment and those containing negative sentiment, denoted as $e_i \in \{o_p, o_n\}$.
The reason for dividing the microblog sentiment analysis process into these two parts is the imbalance among sentiment categories in the microblog data. Most of the content in the microblog datasets only states facts about users and does not express any sentiment. Splitting the task into two parts reduces the impact of this data imbalance on model performance.

Deep memory network with structural self-attention model
In this section, we present the details of our structural attention deep memory network, which is divided into five components.
(1) Input component
The input component of our model encodes the sentence sequence and extracts the semantic information via a vector representation layer and a BiLSTM (bidirectional long short-term memory) layer. Following previous work [13], and because emojis also play an important role in expressing users' emotions, we consider a microblog $s$ to consist of a sequence of words and emojis $\{w_0, w_1, \cdots, w_n\}$. The vector representation layer converts the sentence sequence into a sequence of word and emoji vectors through a vector matrix $E \in \mathbb{R}^{N \times e}$, where $N$ is the number of words and emojis in the sentence and $e$ is the embedding dimension. The microblog document $s$ is thus converted into the vector sequence $V = \{v_1, v_2, \cdots, v_n\}$.
A plain RNN suffers from a serious vanishing gradient problem, which makes it difficult for the model to handle long sequences. Since the LSTM (long short-term memory) network is one of the most widely used deep learning models for sentence learning tasks, we select it to extract text features from the microblog sequence in our model. The architecture of the LSTM model is presented in Fig. 1.
More specifically, each LSTM memory cell contains three gates: (1) the input gate $i_t$; (2) the output gate $o_t$; (3) the forget gate $f_t$. At every timestep $t$, each of the three gates is presented with the input $x_t$ and the previous hidden state $h_{t-1}$. The input gate $i_t$ specifies which information is added to the cell state. The output gate $o_t$ specifies which information from the cell state is used as output. The forget gate $f_t$ specifies which information is removed from the cell state. The equations below describe the update of the memory cells from timestep $t-1$ to $t$.
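In the standard LSTM formulation, written with the gate, weight and bias symbols used in this section, the update can be stated as below. The symbols $W_{cx}$ (input-to-cell weight), $c_t$ (cell state) and $\tilde{c}_t$ (candidate cell state) are assumed names, and peephole connections are assumed to be absent.

```latex
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) \\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```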
where $h_t$ denotes the output vector of each LSTM layer, $W_{ix}$, $W_{ih}$, $W_{fx}$, $W_{fh}$, $W_{ox}$, $W_{oh}$ and $W_{ch}$ are the weight matrices, $b_i$, $b_f$, $b_o$ and $b_c$ are the bias vectors, $\odot$ is element-wise multiplication, and $\sigma$ is the element-wise sigmoid function. However, a standard LSTM can only use the forward semantic information of the sequence during information extraction and cannot use the backward semantic information. Therefore, a BiLSTM model is used to extract semantic information from the word sequences of microblog texts. Its output is $H = [H_f ; H_b]$, where $H_f$ and $H_b$ are the hidden layer outputs of the forward LSTM and the backward LSTM, respectively, and the $i$-th hidden layer output vector of the bidirectional LSTM is the concatenation of the hidden layer output vectors at the corresponding positions of the forward LSTM and the backward LSTM.

Fig. 1 The architecture of LSTM

(2) Storage component
In the proposed model, the memory information is not modified during the operation, and the storage component only preserves the memory information. Therefore, the storage component of the model is the output matrix of the input component, that is, the hidden output matrix $H$ of the BiLSTM. During the operation, the model retrieves information from the storage component several times, which gives it direct access to the LSTM information.

(3) Generalization and output components
The core of the generalization and output functions of the model is the semantic feature extraction process based on the structural self-attention mechanism. The idea of the attention mechanism is to assign importance weights of different levels to different parts of the output sequence of the hidden layer. The weight is the core of the attention mechanism and is obtained by normalizing the matching scores, $\alpha_i = \exp(f(q, h_i)) / \sum_j \exp(f(q, h_j))$,
where $q$ is the query vector (when $q$ comes from $h$, the mechanism is called self-attention) and $f$ is the matching function for $q$ and $h_i$, which can be scored by additive models, dot-product models, bilinear models and other methods. Our study adopts the idea of multi-step (hops) computation. The model is illustrated in Fig. 2. Assuming that the sequence length is $n$ and the dimension of the BiLSTM output is $d$, i.e., the memory component is $R \in \mathbb{R}^{n \times d}$, the structural self-attention is
where $Q_t \in \mathbb{R}^{d \times q}$ is used to construct the query for self-attention and $K_t \in \mathbb{R}^{q \times k}$ is used to construct the key values for the structural self-attention computation. The width $k$ of the matrix $K_t$ is the number of different aspects of the text sequence that the structural self-attention attends to. After the semantic extraction of structural self-attention, the feature representation matrix of the microblog is
Equations (5) and (6) constitute one step in the calculation of the structural self-attention of the model. The difference between steps is that, in step $t$, the key value construction matrix of the structural self-attention is reconstructed using the feature representation matrix of step $t-1$,
where $\gamma \in [0, 1]$ denotes the amount of information in the key value construction matrix of step $t-1$ that is retained in the calculation of step $t$. In the training process, $\gamma$ is a trainable variable that is trained together with the other variables of the model. The matrix $L_t$ is a shape-matching parameter that ensures the matrix $M_t$ and the matrix $K_t$ have the same dimensions.
(4) Response component
The response component is a fully connected layer with a softmax activation applied to the extracted feature representation, where $W$ and $b$ are the weight matrix and the bias vector of the fully connected layer, respectively.
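The multi-hop computation described above can be sketched in NumPy as follows. The exact forms of the attention, feature and key-update equations are not reproduced here, so this sketch rests on stated assumptions: the attention matrix is taken as a softmax over the sequence axis of $R Q K$ (no intermediate nonlinearity), the feature matrix as $M = A^\top R$, and the key update as $K \leftarrow \gamma K + (1 - \gamma) L M^\top$ with $L \in \mathbb{R}^{q \times d}$ and parameters shared across hops. The function and argument names are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)


def structural_attention_hops(R, Q, K0, L, gamma, hops):
    """Run several hops of structural self-attention over a fixed memory.

    R:     (n, d) memory matrix (BiLSTM outputs), never modified
    Q:     (d, q) query construction matrix (assumed shared across hops)
    K0:    (q, k) initial key construction matrix, k = number of aspects
    L:     (q, d) shape-matching matrix (assumed form)
    gamma: scalar in [0, 1], share of the previous key matrix that is kept
    Returns the last attention matrix A (n, k) and feature matrix M (k, d).
    """
    if hops < 1:
        raise ValueError("hops must be at least 1")
    K = K0
    for _ in range(hops):
        # Each column of A is a distribution over the n sequence positions.
        A = softmax(R @ Q @ K, axis=0)              # (n, k)
        # Each row of M summarizes the memory for one aspect.
        M = A.T @ R                                 # (k, d)
        # Rebuild the key matrix from the features of this hop (assumed rule).
        K = gamma * K + (1.0 - gamma) * (L @ M.T)   # (q, k)
    return A, M
```

Because the memory `R` is read again at every hop and never overwritten, each hop has direct access to the full BiLSTM output, which matches the role of the storage component.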

Model training
(1) Parameter sharing
The parameters to be trained in the structural deep memory network model fall into three groups: the parameters of the BiLSTM, the parameters of each computation step in the generalization and output components, and the parameters of the fully connected layer in the response component. Among them, the parameters to be trained in the generalization and output components include the parameter matrices $Q_t$, $K_t$, $L_t$ and $\gamma$ in each computation step. To reduce the number of parameters and speed up the training of the model, these parameters are shared across computation steps.
(2) Loss function
The fully connected layer in the response component of the model uses a softmax activation function, whose output can be viewed as a probability distribution over the microblog sentiment categories. Therefore, the model uses the cross-entropy loss function.
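For a softmax output, the cross-entropy loss takes the following standard form, written with the symbols defined in this section. $\mathcal{L}$ is an assumed name for the loss, and any normalization by $|D|$ is omitted.

```latex
\mathcal{L} = - \sum_{s \in D} \sum_{e \in E} I_e(s) \, \log P_e(s)
```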
where $D$ is the set of microblog documents, $E$ is the set of three sentiment categories, and $I_e(\cdot)$ is the indicator function: $I_e(s)$ is 1 when the sentiment category of microblog $s$ in the training data is $e$, and 0 otherwise. $P_e(s)$ is the probability, given by the structural self-attention deep memory model, that the sentiment of microblog $s$ is $e$.

(3) Attention penalty factor
In the training process of the model, the number of microblog sentiment features (aspects) that can be attended to is $k$; in other words, the matrix $K_t$ is composed of $k$ attention vectors. During training, different attention vectors in $K_t$ may focus on the same microblog sentiment feature, which causes redundancy of information and harms model performance. To force different attention vectors to focus on different features, a penalty mechanism has been proposed in the literature that makes different attention vectors have different distributions, with their non-zero elements concentrated in different dimensions. Here, the penalty mechanism is used to make each attention vector in the attention matrix of the last computation step focus on different microblog sentiment features. The penalty term is minimized together with the loss function as part of the training objective. Hence, the objective function for model training is
where $p \in [0, 1]$ is the penalty factor, $\| \cdot \|_F$ is the Frobenius norm of a matrix, and $I$ is the identity matrix.
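A penalty of this kind can be sketched in NumPy as follows. The sketch assumes the commonly used squared-Frobenius form $\|A^\top A - I\|_F^2$, with the attention matrix stored as $A \in \mathbb{R}^{n \times k}$ (one attention vector per column), and an objective equal to the cross-entropy loss plus $p$ times the penalty. The exact form used in this study may differ, and the function names are hypothetical.

```python
import numpy as np


def attention_penalty(A):
    """Squared Frobenius norm of (A^T A - I) for an (n, k) attention matrix.

    The off-diagonal entries of A^T A measure the overlap between two
    attention vectors, so the penalty is small when the k vectors attend
    to different positions.
    """
    k = A.shape[1]
    gram = A.T @ A
    return float(np.linalg.norm(gram - np.eye(k), ord="fro") ** 2)


def training_objective(cross_entropy, A_last, p):
    """Cross-entropy loss plus the penalty on the last hop's attention."""
    return cross_entropy + p * attention_penalty(A_last)


# Two attention vectors on different positions: no overlap, zero penalty.
distinct = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
# Two attention vectors on the same position: fully redundant.
redundant = np.array([[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
print(attention_penalty(distinct), attention_penalty(redundant))
```

The redundant case is penalized while the distinct case is not, which is the behavior the penalty term is meant to encourage.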

Dataset
The experiments used the NLPCC 2013 (Natural Language Processing and Chinese Computing, 2013) and NLPCC 2014 microblog sentiment evaluation task datasets. The data were first pre-processed to remove links, microblog usernames, and some punctuation marks from the microblog text. Microblog users often publish emoticons, repeated punctuation, etc. to express specific emotions, so such punctuation and emoticons were retained. Emojis were vectorized with Emoji2vec, which is similar to the word2vec embedding model and widely used for NLP tasks [30-33].
More specifically, there are seven types of emotion tags in the NLPCC2013 and NLPCC2014 datasets: happiness, like, anger, sadness, fear, disgust, and surprise. In this paper, like and happiness are considered positive emotions, while sadness, anger, fear and disgust are considered negative emotions. The polarity of the surprise emotion is uncertain, and the data show that both positive and negative surprise emotions exist, so the microblogs marked as surprise are excluded from the sentiment classification. Furthermore, because some microblog texts in NLPCC2014 contain two emotions that may reflect opposing feelings, we eliminated these data from our experiments. For example, for the sentence in Table 1, "Should I consider reducing the frequency of coming home, once I come back to argue with my dad, I'm so annoyed!", the emotion is disgust and the sentiment is '-1', reflecting the negative emotion of the sentence.
For microblog text processing, the HanLP (http://hanlp.linrunsoft.com) language processing toolkit was used for word segmentation. In model training, Chinese word vectors pre-trained on a large-scale microblog corpus were used [34]. A deep learning toolkit was used for the experiments, and the code was run on a GPU workstation with 8 Titan XP graphics cards.
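The link and username removal step can be sketched as follows. This is a minimal sketch under stated assumptions: the regular expressions are hypothetical stand-ins for the actual cleaning rules, usernames are assumed to follow an "@" sign and to end at whitespace or punctuation, and punctuation removal is left out so that repeated punctuation and emoticons are preserved.

```python
import re

# Hypothetical patterns; the cleaning rules of the original pipeline are not specified.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[\w\-]+")  # note: \w also matches Chinese characters
SPACE_RE = re.compile(r"\s+")


def clean_microblog(text):
    """Remove links and @usernames, keep punctuation and emoticon text."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


print(clean_microblog("@user_1 so happy!!! http://t.cn/abc"))
```

Because `\w` matches Chinese characters, a username that is directly followed by text without a separator would be removed together with that text; a production pipeline would need a stricter rule for such cases.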
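The label mapping described above can be sketched as follows. The negative label -1 follows the example in the text; the positive label 1, the use of `None` for a microblog without an emotion tag, and the function names are assumptions made for illustration. Whether surprise counts as subjective in the subjectivity task is also an assumption here.

```python
from typing import Optional

POSITIVE_EMOTIONS = {"happiness", "like"}
NEGATIVE_EMOTIONS = {"sadness", "anger", "fear", "disgust"}


def subjectivity_label(emotion: Optional[str]) -> int:
    """Return 1 if the microblog carries any emotion tag, 0 otherwise."""
    return 0 if emotion is None else 1


def polarity_label(emotion: Optional[str]) -> Optional[int]:
    """Return 1 (positive), -1 (negative), or None if the item is excluded.

    Surprise is excluded because its polarity is uncertain, and items
    without an emotion tag do not take part in sentiment classification.
    """
    if emotion in POSITIVE_EMOTIONS:
        return 1
    if emotion in NEGATIVE_EMOTIONS:
        return -1
    return None


tags = ["like", "disgust", "surprise", None]
print([(subjectivity_label(t), polarity_label(t)) for t in tags])
```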

Baseline models
To verify the effectiveness of our model, this section presents the following baseline models:
Feature-Naive Bayes: this approach uses the word2vec model to obtain text embeddings and a Naive Bayes classifier to conduct the subjectivity classification and sentiment classification experiments [12].
Feature-Random Forest: Similar to Feature-Naive Bayes, the approach combines word2vec model and Random Forest to conduct opinion and sentiment classification [34].
Feature-SVM: the support vector machine (SVM) is a representative traditional machine learning classifier and performs very well on aspect-level sentiment classification. This model feeds text features to an SVM to conduct the subjectivity classification and sentiment classification experiments [35].
LSTM: a recurrent neural network based on long short-term memory units. This model takes a sequence of word vectors as input, uses the last output vector of the hidden layer as the representation vector of the text sequence, and then uses a fully connected layer for classification [36].
BiLSTM: includes a forward LSTM and a backward LSTM; the output sequences of both are combined into the representation vector of the text sequence, and classification is again performed by a fully connected layer [37].
BiLSTM + Attention: an attention mechanism is added on top of the BiLSTM, so that different outputs of the hidden layer play different roles in the final sentence representation [38].
Structural Attention: structural self-attention model, which uses a structural attention mechanism that makes different attention vectors focus on different aspects of features in a text sequence [11].
BERT: a transformer-based deep learning model that has achieved state-of-the-art performance in a wide variety of NLP tasks and has become a ubiquitous baseline for text classification tasks [39].

Parameter setting
The hyper-parameter settings of the microblog sentiment classification experiments are shown in Table 4. More specifically, we use the Adam optimizer with a learning rate of 0.001 and a dropout rate of 0.5. The evaluation metrics are accuracy and F-score.
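The two evaluation metrics can be computed as in the sketch below. It assumes a binary task and reports the F-score of one chosen positive class as the harmonic mean of precision and recall; whether the reported F-scores are computed per class or averaged over classes is not stated, so this choice and the function names are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f_score(y_true, y_pred, positive=1):
    """F1 score (harmonic mean of precision and recall) for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
print(accuracy(gold, pred), f_score(gold, pred))
```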

Overall Performance
The overall performance of all competing methods on the datasets is shown in Tables 5 and 6. In both accuracy and F-score, our model outperforms the other benchmark models on the tasks of subjectivity classification and sentiment classification, which verifies the effectiveness of the proposed model. As shown in Table 5 and Fig. 3, for the NLPCC2013 and NLPCC2014 datasets, Feature-Naive Bayes, Feature-Random Forest and Feature-SVM, which combine a deep learning embedding model with classical classifiers, achieve 0.6502, 0.7175 and 0.7101 in the subjectivity classification task and 0.5920, 0.6848 and 0.7321 in the sentiment classification task, respectively. We notice that the SVM classifier outperforms the Naive Bayes and Random Forest classifiers overall across the subjectivity classification and sentiment classification tasks.
rate, while the BiLSTM model has much better results due to its ability to extract both forward and backward semantic information. The structural attention model does not have an absolute advantage over the general attention model in microblog text analysis, probably because the number of emotional features expressed in microblogs is limited, and focusing on too many features may instead harm the results. The developed model suffers from the same problem as the structural attention model, but the effect of too many features is eliminated to some extent after multiple steps. Finally, BERT achieves 0.7409 and 0.7805 in the subjectivity classification task and 0.796 and 0.8243 in the sentiment classification task. To our surprise, even though BERT has achieved state-of-the-art performance in many NLP tasks, it did not obtain the best performance on our datasets. One of the main reasons is that the BERT model cannot handle the emojis in the datasets directly, so it loses much important information in the process of sentence embedding [40,41].
In summary, the comparative experiments verify that the proposed model outperforms the baseline models on the NLPCC2013 and NLPCC2014 datasets.
This study also compares the F-score of our model with those of the baseline models. Table 6 and Fig. 4 present the comparative results on the NLPCC2013 and NLPCC2014 datasets. Our method achieves a better F-score than the baseline models: 0.7609 and 0.8756 in the subjectivity classification task and 0.8028 and 0.8364 in the sentiment classification task, respectively.

Ablation analysis
As shown in Figs. 5 and 6, we further investigate the effect of each part of our model. The DMNSA (deep memory network with structural attention) model outperforms the variants that remove the penalty mechanism, the self-attention mechanism, or the deep memory network on the NLPCC 2013 and NLPCC 2014 datasets. Specifically, the DMNSA model without the deep memory network will directly take the BiLSTM
The architecture of the deep memory network brings 8.33% and 8.06% absolute accuracy improvements in the above-mentioned tasks on NLPCC 2013, respectively. Also, the models with the self-attention mechanism or the penalty mechanism achieve better performance than directly using the BiLSTM model for sentiment and subjectivity classification. Besides, our proposed DMNSA model surpasses the variants lacking any of the three core components in terms of F-score on the different datasets.
The experimental results validate that the memory network, self-attention mechanism and penalty mechanism are effective and necessary in the task of sentiment classification and subjectivity classification on different datasets.

Error analysis
Error analysis is an important part of building our sentiment classification framework based on the deep memory network with structural attention. In this section, we carry out an error analysis of our model on the NLPCC 2013 and NLPCC 2014 datasets and find that most of the errors can be summarized as follows. First, microblogs have a wide array of informal ways of expressing the same token. For example, given "It's soooo del" and "It's so delicious", our model may learn different sentence embeddings for two sentences that have the same semantics and the same sentiment but different expressions. In addition, the model treats a single contextual word as the basic unit, so it cannot handle semantic phrases. For example, "Die for" is a feeling phrase whose meaning cannot be deduced from the words "Die" and "for". Finally, some sentences contain comparative opinions or emotions, such as "compare to crying, we should try to smile to eat more". It is difficult to infer the sentiment of such sentences.

Conclusion
In this study, a novel microblog sentiment analysis model called the structural deep memory network is proposed, which combines a deep memory network with a structural attention mechanism. The model employs a bidirectional LSTM to extract the semantic information contained in the microblog text, uses the extraction results as the memory component of the deep memory network, and uses multi-step (hops) structural attention operations as the generalization and output components. The model was used to classify both the subjectivity of microblog texts and the sentiment they contain, and the experimental results show that it performs well on both tasks.
The model can also be further extended to perform more fine-grained microblog sentiment analysis using a graph-based approach [42]. In addition, with the development of information technology, online social media may include not only texts and emojis but also audio and video [43], which may attract more attention and become a problem worthy of further research.