1 Introduction

Text emotion classification is an important branch of Natural Language Processing (NLP) research that aims to identify the prominent emotion in a short text by predicting a label from a set of pre-defined emotions. Based on the emotions identified in user-generated texts (e.g., comments, reviews, blogs, and news reports), user attitudes and opinions can be retrieved and analyzed, and hence the task has great potential for applications in various aspects of daily life [17, 22, 43]. Existing works commonly extract semantic representations from texts via various deep modules in order to capture the semantic meaning for neural decision making. Considering the domain-specific nature of the task, researchers have made continuous efforts to improve classifier robustness by enriching word-level sentiment features, e.g., by creating and exploiting different types of hand-crafted lexicons [24, 27, 37].

In existing works employing lexicons in sentiment analysis [15, 39], the lexicon information is adopted as additional knowledge in the time-series context. Meanwhile, such works mainly investigate emotions from a coarse-grained perspective [1] and do not consider the interconnection between emotions. Under this view, only the most prominent emotion is recognized from a sentence. Nevertheless, due to the complexity of text expressions and the fuzzy nature of emotions, multiple types of emotions can easily be spotted in a single sentence, and they are usually interconnected [9]. To address this issue, a fine-grained emotion perspective can help bridge different fine-grained emotions with each other and profile a general emotion composition. Therefore, instead of only predicting a coarse-grained emotion label for the input text, we propose to study the intensity and distribution of various fine-grained emotions expressed by the sentence, which together constitute a fuzzy-style emotion construction of the text. With such an emotion construction, we can shed light on the dependency relationships and interactions within a wide spectrum of emotions for the classification task.

To examine the emotion construction, we need to identify the implicit connections between text and emotion. To this end, we utilize emotion lexicons to bridge words with emotions. Li et al. [24] categorize commonly used lexicons as categorical lexicons and dimensional lexicons based on the emotion theories they follow [8]. In categorical lexicons, each word is labeled with at least one tag from an identified collection of emotions, e.g., the six basic emotions identified by Ekman [9]. Although such lexicons directly associate words with emotions, they only work with a restricted scope of emotions and provide no clue about the intensity. On the other hand, dimensional lexicons, like NRC-VAD [26], follow dimensional emotion theories that conceptualize emotions with measurable variables [29, 32, 33], and annotate each word with intensity values on two dimensions (i.e., valence and arousal) or three dimensions (i.e., valence, arousal, and dominance (VAD)).

To model the relationship between text and emotions, we incorporate psychological domain knowledge together with dimensional lexicons. Russell and Mehrabian [33] recognized the sufficiency of VAD factors on the definition of emotional states and identified 151 emotions in terms of valence, arousal, and dominance with mean and standard deviation in the Three-Factor Theory.

We consider a probabilistic model to combine the Knowledge of Emotion (KoE) with the NRC-VAD lexicon as the Knowledge of Words (KoW), so as to quantitatively measure the relationship between a word and an emotion in the VAD space. Such a relationship can be modeled as the emotion intensity. Given a sentence, a vector of sentence length is generated for each emotion by traversing the sentence and computing the intensity expressed by each word. The generated vector contains the intensity values and their distribution across the sentence for a specific emotion, and hence we name such a vector an EmoChannel. In Figure 1, the visualizations demonstrate the compatibility of KoE and KoW in the VAD space, and Figure 2 provides an example of how to generate EmoChannels. In this way, we obtain 151 channel vectors for 151 fine-grained emotions as the emotion construction of a sentence.

Fig. 1

Visualization of the six basic emotions from Ekman [9] in the VAD space with the nearest words from NRC-VAD lexicon

Fig. 2

A toy example demonstrates the process of generating EmoChannels for three emotions given a sentence of five words. “CHL” is an abbreviation of “Channel”

To further extract the dependency relationships among fine-grained emotions, we propose a self-attention-based model, EmoChannel-SA Network, which employs self-attention blocks over the constructed EmoChannels. The intermediate output of the self-attention block is exploited as the sentence-level emotion representation to enhance the decision-making process. Furthermore, we visualize the attention weights of multiple selected samples via heat maps and provide in-depth discussions about how the emotion dependency contributes to the learning process. Our main contributions are summarized as follows:

  • We propose a self-attention-based model to enhance classification performance by exploiting the dependency among fine-grained emotions;

  • Extensive experiments on multi-class classification and sentiment analysis datasets of different topics show that our method achieves competitive performance against the baseline models.

An early version of this work was published in [23]. We have since expanded the work in both technical and experimental content. The main enhancements over the conference version include: (1) instead of using additive attention, we employ a self-attention module to extract more informative dependencies among fine-grained emotions; (2) we experiment on both multi-class classification and sentiment analysis tasks to test the generality of the proposed model; (3) we provide insightful discussions by visualizing the attention weights via heat maps.

2 Related work

In this section, we first introduce the major emotion theories in psychological domain, and then briefly review the recent development of emotion classification approaches.

The commonly adopted emotion theories are of two types [8]: categorical and dimensional. Categorical theories hold that emotions are discrete and distinctly constructed, and that a set of basic emotions can be understood across different cultures. The most popular emotion taxonomy is Ekman’s Six Basic Emotions theory [9], which identifies a primary emotion set with six emotions: anger, disgust, fear, happiness (joy), sadness, and surprise. In contrast, dimensional models define emotion categories with measurable variables [29] that can be used to conceptualize all emotion states. Most dimensional theories incorporate valence and arousal or intensity dimensions. As one of the most prominent two-dimensional models following dimensional theory, the Circumplex model developed by [32] maps emotions into a circular two-dimensional space, where the vertical axis and the horizontal axis represent activation-deactivation and pleasant-unpleasant, respectively. As a representative three-dimensional model, Russell and Mehrabian [33] suggest that a three-dimensional system is sufficient to define emotional states. The system consists of three independent and bipolar dimensions, which are pleasure (pleasure-displeasure), arousal (degree of arousal), and dominance (dominance-submissiveness). Furthermore, their work identifies 151 terms denoting emotions in terms of the three factors with mean and standard deviation. In this work, we incorporate the Three-Factor Theory and dimensional emotion lexicon knowledge to construct the EmoChannel vector for a sentence.

Recently, exploiting deep neural networks for supervised text classification has become a mainstream approach and has achieved remarkable progress. In the following, we review several widely adopted semantic feature extractors. Kim [18] proposed a classic TextCNN model with a max-over-time pooling mechanism, which shows superior ability to extract local and position-invariant features. Another popular deep learning model is the Recurrent Neural Network (RNN), which can extract sequential features from a sequence. Hochreiter and Schmidhuber [14] and Socher et al. [35] used recursive networks to explicitly exploit time-series features. Several variants based on RNN have been proposed, for example, BiLSTM [13] and GRU [5] with more complex gate mechanisms. However, these methods fail to give adequate weights to some discriminative words. To address this problem, Bahdanau et al. [3] introduced the attention mechanism and applied it to machine translation. Afterwards, the attention mechanism has been widely employed in various NLP tasks. Vaswani et al. [40] employed stacked self-attention blocks to learn the global dependency of a sentence as a more robust sentence-level representation. Devlin et al. [6] revisited the language model, using a Transformer-based architecture to pretrain a language model over large-scale textual resources, which has been proven effective in improving downstream tasks.

Next, we introduce recent developments in the emotion classification field, where researchers have made continuous efforts. It is a common approach in the deep learning framework to exploit CNN-based, RNN-based, and self-attention-based models to extract semantic features from word embedding vectors. For example, Chen et al. [4] employed a stacked BiLSTM model to obtain both forward and backward sequence information to identify the emotion categories and their corresponding causes. Feng et al. [11] addressed group-based emotion detection by exploiting topic exploration. Lai et al. [20] adopted graph convolutional networks to enhance fine-grained emotion classification performance. However, semantic features alone are not sufficient in a domain-specific task due to the data sparsity issue [19]. Several works attempt to construct dedicated word representations, which encode sentiment information into low-dimensional embedding vectors. Several works on effective representation learning rely on training embeddings from scratch on large tweet datasets [10, 38]. Besides, numerous attempts have been made to improve classifier performance using additional knowledge, such as emotion lexicons, syntactic structure, and causality relationships. Qian et al. [30] adopted a BiLSTM with linguistically inspired regularizers considering sentiment lexicons to predict text sentiment. Teng et al. [39] proposed a context-sensitive lexicon-based method based on a weighted-sum model to calculate sentiment aggregation using an RNN architecture. Li et al. [21] presented the Adaptive Gate Network to incorporate statistical features into the classification task.

3 Methodology

In this section, we will elaborate on the process of constructing the fine-grained emotion intensity vector EmoChannel and our proposed EmoChannel-SA model framework.

3.1 EmoChannel: Emotion Distribution over Sentence

We first present the formal definition of EmoChannel as the intensity distribution of the emotion over a sentence.

Definition 1

Given a sentence and an emotion, the EmoChannel is an emotion-based sentence-level representation, depicting the emotion intensity variation over time across the sentence. The representations of different emotions are independent [23].

We model the emotion intensity of each word as the probability that the word belongs to the emotion. For an emotion \(E_{k}\), we retrieve the VAD mean \(\boldsymbol{\mu}_{E_{k}} = [V^{m}_{E_{k}}, A^{m}_{E_{k}}, D^{m}_{E_{k}}]\) and standard deviation \(\boldsymbol{\sigma}_{E_{k}} = [V^{sd}_{E_{k}}, A^{sd}_{E_{k}}, D^{sd}_{E_{k}}]\) from the Three-Factor Theory. Given a sentence of M words, \(\{w_{1}, w_{2}, \dots, w_{M}\}\), for each word \(w_{i}\) that is included in the NRC-VAD dictionary, we can retrieve the VAD tuple \(\mathbf{w_{i}} = [V^{w_{i}}, A^{w_{i}}, D^{w_{i}}]\). We consider the three emotion factors to follow a multivariate Gaussian distribution and compute \(\mathbf{d}_{w_{i}}^{E_{k}}\) as the intensity of emotion \(E_{k}\):

$$ \mathbf{d}_{w_{i}}^{E_{k}} = \frac{\exp{\left( -\frac{1}{2} (\boldsymbol{\mu}_{E_{k}} - \mathbf{w_{i}}) \boldsymbol{{\varSigma}}^{-1} (\boldsymbol{\mu}_{E_{k}} - \mathbf{w_{i}})^{\intercal} \right)}}{\sqrt{(2 \pi)^{3} | \boldsymbol{{\varSigma}} |}}, $$

where \(\boldsymbol{{\varSigma}} = diag(\boldsymbol{\sigma}_{E_{k}})\). In this way, we obtain a sentence-level affective representation by modeling how strongly each word in the sentence belongs to the emotion. For the M words, the EmoChannel of emotion \(E_{k}\) over the sentence is \(\mathbf{C}_{k} = [d^{E_{k}}_{w_{1}}, d^{E_{k}}_{w_{2}}, \dots, d^{E_{k}}_{w_{M}}]\). In total, 151 emotions were identified in the Three-Factor Theory, so 151 independent EmoChannel representations are constructed for each short text. We concatenate all the channels to form the EmoChannel matrix as the emotion construction of the sentence.
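The intensity computation above can be sketched in a few lines of pure Python. This is only an illustrative sketch: the function names are hypothetical, the VAD numbers are made-up placeholders rather than entries from NRC-VAD or the Three-Factor Theory table, and the covariance follows the text in taking the diagonal matrix built from the standard-deviation vector.

```python
import math

def emotion_intensity(word_vad, mu, sigma):
    """Gaussian density of a word's VAD tuple under one emotion.

    Follows the text: Sigma = diag(sigma), so Sigma^{-1} = diag(1 / sigma)
    and |Sigma| is the product of the entries of sigma.
    """
    quad = sum((m - w) ** 2 / s for w, m, s in zip(word_vad, mu, sigma))
    det = math.prod(sigma)
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** 3 * det)

def emo_channel(sentence_vads, mu, sigma):
    """One EmoChannel: the intensity of a single emotion at every word position."""
    return [emotion_intensity(w, mu, sigma) for w in sentence_vads]

# Placeholder (mean, standard deviation) pairs, assumed for illustration only.
emotions = {
    "joy":  ([0.8, 0.5, 0.4], [0.2, 0.3, 0.3]),
    "fear": ([-0.6, 0.6, -0.4], [0.2, 0.3, 0.3]),
}
# Placeholder VAD tuples for a three-word sentence.
sentence = [[0.7, 0.4, 0.3], [0.0, 0.0, 0.0], [-0.5, 0.5, -0.3]]

# EmoChannel matrix: one row per emotion, one column per word position.
emo_matrix = [emo_channel(sentence, mu, sd) for mu, sd in emotions.values()]
```

With these placeholder values, the first word lies closest to the "joy" mean and the third word closest to the "fear" mean, so each channel peaks at a different position in the sentence.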

Because the large number of out-of-lexicon (OOL) tokens leads to an undesirable sparsity issue, we randomly initialize vectors for the OOL tokens, in a similar way to initializing a word embedding model. Specifically, to differentiate the affective words covered by the lexicon from the OOL tokens, the values are sampled from a distribution with a much smaller standard deviation, say 0.001. Meanwhile, all weights in the emotion construction matrix are trainable during the training process, making the decision-making more adaptive.
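A minimal sketch of the OOL handling follows. The text does not name the sampling distribution, so a zero-mean Gaussian is assumed here; the function names and the toy lexicon are hypothetical, and the later fine-tuning of these values during training is not shown.

```python
import random

def ool_vector(num_emotions, rng, std=0.001):
    """Small random intensities for a token missing from the lexicon.

    Assumption: zero-mean Gaussian; the text only specifies a small
    standard deviation (about 0.001).
    """
    return [rng.gauss(0.0, std) for _ in range(num_emotions)]

def build_columns(tokens, lexicon_intensities, num_emotions, seed=0):
    """Return one intensity vector (length num_emotions) per token."""
    rng = random.Random(seed)
    columns = []
    for tok in tokens:
        if tok in lexicon_intensities:
            # Word covered by the lexicon: use its computed intensities.
            columns.append(list(lexicon_intensities[tok]))
        else:
            # OOL token: near-zero random values keep it distinguishable
            # from affective words while avoiding an all-zero column.
            columns.append(ool_vector(num_emotions, rng))
    return columns

# Toy lexicon with two emotions (placeholder numbers).
toy_lexicon = {"happy": [0.9, 0.1]}
columns = build_columns(["happy", "asdfgh"], toy_lexicon, num_emotions=2)
```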

3.2 The model framework

In this section, we introduce the proposed model framework component by component. In general, we first build a classifier to extract semantic features from the textual input and then employ the emotion construction of the sentence to enhance the decision making. The generic framework is illustrated in Figure 3.

Fig. 3

The generic framework of the proposed EmoChannel-SA Network

3.2.1 Input layer

The input of our model consists of a sentence s with fixed length M and the EmoChannel vectors \(\mathbf{C} = [\mathbf{C}_{1}, \mathbf{C}_{2}, \cdots, \mathbf{C}_{k}]\) of s. For non-Bert models, we first map each word into a D-dimensional continuous space and obtain the word embedding vector \(\mathbf{x}_{i} \in \mathbb{R}^{D}\). Then we concatenate all word vectors to form a D × M matrix as the model input:

$$ \mathbf{x} = [\mathbf{x}_{1}, \mathbf{x}_{2}, \dots, \mathbf{x}_{M}] $$

Following the same way as in [18], we pad the sentences to maintain a uniform length for all sentences. For Bert-based models, we tokenize the textual input with Bert tokenizer.

3.2.2 Semantic feature extraction layer

For a non-Bert model, we employ TextCNN [18] as the extractor to produce a semantic representation from the text. We apply a filter \(\mathbf{W}_{f} \in \mathbb{R}^{h \times D}\) with window size h to slide through the embedding matrix. A new feature \(z_{i}\) is generated from a window of word vectors \(\mathbf{x}_{i:i-h+1}\):

$$ z_{i} = \boldsymbol{f}(\mathbf{W}_{f} \circledast \mathbf{x}_{i:i-h+1} + b), $$

where \(b \in \mathbb{R}\) is the bias term and f(⋅) is a non-linear function. With padding, each filter produces a feature vector \(\mathbf{z} = [z_{1}, z_{2}, \dots, z_{M}]^{\intercal}\). We employ d filters to produce a feature map \(\mathbf{Z} \in \mathbb{R}^{d \times M}\) in the semantic space. Afterwards, we apply a max-over-time pooling operation over each feature vector and capture its maximum value \(\hat{z} = \max_{i} z_{i}\). By doing this, we obtain a latent semantic vector \(\mathbf{z}^{s} = [\hat{z}_{1}, \hat{z}_{2}, \cdots, \hat{z}_{d}]\). For the Bert model, we take the representation of the [CLS] token as the sentence representation for further processing.
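The convolution and max-over-time pooling steps can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: ReLU stands in for the unspecified non-linearity f, no padding is applied (so each feature vector has M - h + 1 entries rather than M), the sentence is stored as a list of M word vectors, and all names and numbers are hypothetical.

```python
def conv_feature_vector(x, w, b):
    """Slide one h x D filter over a sentence stored as M word vectors of size D.

    Assumptions: ReLU non-linearity and no padding, so the result has
    M - h + 1 entries.
    """
    h, dim = len(w), len(w[0])
    z = []
    for start in range(len(x) - h + 1):
        window = x[start:start + h]
        s = sum(w[r][c] * window[r][c] for r in range(h) for c in range(dim))
        z.append(max(0.0, s + b))
    return z

def max_over_time(z):
    """Keep only the strongest response of a filter across all positions."""
    return max(z)

def semantic_vector(x, filters):
    """One pooled value per (weights, bias) filter, i.e. the latent vector z^s."""
    return [max_over_time(conv_feature_vector(x, w, b)) for w, b in filters]

# Toy sentence: M = 4 words, D = 2 embedding dimensions.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    ([[1.0, 0.0], [0.0, 1.0]], 0.0),      # window size h = 2
    ([[-1.0, -1.0], [-1.0, -1.0]], 0.5),  # window size h = 2
]
z_s = semantic_vector(x, filters)  # [2.0, 0.0]
```

The first filter responds most strongly to the first window, while the second filter is suppressed everywhere by the ReLU, so pooling yields one value per filter regardless of sentence length.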

3.2.3 Self-attention block

We apply a scaled dot-product attention block [40], or self-attention, over EmoChannel vectors to extract emotional dependency,

$$ \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^{T}}{\sqrt{d_{k}}} \right) V, $$

where Q, K, and V are all set to the same EmoChannel matrix, and \(\sqrt{d_{k}}\), the square root of the dimension of K, serves as a scaling factor.
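As a rough illustration, scaled dot-product attention in the sense of [40] applies a row-wise softmax to the scaled scores and then uses the resulting weights to mix the rows of V. The sketch below is pure Python with hypothetical helper names and a toy EmoChannel matrix (3 emotions over 4 word positions); no learned projections are included.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights); weights[i][j] is how much row i attends to row j."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy EmoChannel matrix: rows are emotions, columns are word positions.
C = [
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.1],
    [0.0, 0.1, 0.7, 0.6],
]
output, weights = scaled_dot_product_attention(C, C, C)
```

Because the rows are emotions, `weights` is an emotion-by-emotion matrix, which is the kind of quantity a heat map of emotion dependencies would display. In this toy case the first two channels have similar shapes, so the first emotion attends more to the second than to the third.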

The intensity variation within a channel reveals the changes of an emotion at different positions, which is helpful to understand the semantic compositionality and sentiment deviation. Therefore, we employ a multi-head self-attention module to analyze emotional dependency in different subspaces of the EmoChannel matrix. We first linearly project the queries Q, keys K, and values V into \(d_{k}\), \(d_{k}\), and \(d_{v}\) dimensions h times, respectively. Then, we compute the self-attention matrix as follows:

$$ \mathbf{z}^{a} = \text{MultiHead}(Q, K, V) = \text{Dense}(\text{Concat}(\mathrm{head_{1}}, {\cdots}, \mathrm{head_{h}})), $$

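where \(\mathrm{head}_{i} = \text{Attention}(\text{Dense}_{i}^{Q}(Q), \text{Dense}_{i}^{K}(K), \text{Dense}_{i}^{V}(V))\), and each Dense denotes a separate linear projection.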

The latent semantic representation \(\mathbf{z}^{s}\) and the output of the attention layer \(\mathbf{z}^{a}\) are concatenated into an enhanced feature vector of the sentence, which is fed into the classifier.

3.2.4 Output layer & loss function

The obtained representation a is mapped into the label space by passing through the fully-connected layers and a softmax layer for predicting the labels. To optimize the classification model, we maximize the probability of the correct label y by minimizing the cross-entropy loss L, which is defined as:

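$$ L(\mathbf{a}, y) = - \frac{1}{N} \sum\limits_{i=1}^{N} \sum\limits_{j=1}^{c} \mathbf{1}(y_{i} = j) \ln \mathbf{a}_{i,j}, $$

where N is the number of samples, c is the number of classes, \(\mathbf{a}_{i,j}\) is the predicted probability of class j for the i-th sample, and 1(yi = j) = 1 if yi = j and 0 otherwise.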
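For concreteness, the cross-entropy loss can be computed as in the sketch below. Because of the indicator function, only the predicted probability of the correct class contributes for each sample. The function name and the toy probabilities are hypothetical.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class.

    probs[i][j] is the softmax output for sample i and class j;
    labels[i] is the index of the correct class for sample i.
    """
    n = len(labels)
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n

# Two samples, two classes (placeholder softmax outputs).
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = cross_entropy(probs, labels)  # -(ln 0.5 + ln 0.75) / 2
```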

4 Experiments

4.1 Datasets

We conduct experiments on different classification tasks to test the generality of the proposed method. The summary statistics of datasets are listed in Table 1.

Table 1 Statistics summary for the datasets

4.1.1 Multi-class classification task

We experimented on three multi-class datasets.

ISEAR contains descriptions of personal experiences, collected from respondents with and without a psychology background, of situations in which they had experienced one of seven emotions, i.e., “Anger”, “Disgust”, “Fear”, “Joy”, “Sadness”, “Shame”, “Guilt” [34].

TEC includes emotional tweets with prespecified hashtags. Each tweet is labeled by one of the six emotions, i.e., “Anger”, “Disgust”, “Fear”, “Joy”, “Sadness”, “Surprise” [25].

CrowdFlower (CF) includes tweets annotated by crowd-sourcing with 12 emotions. A subset of CF with emotions of happiness, worry, surprise, love, and sadness is adopted due to limited samples of the remaining emotions.

4.1.2 Sentiment analysis task

The above datasets with coarse-grained emotional labels are used to test the performance on the emotion classification task. We further demonstrate the generality of the proposed method on the sentiment analysis task with sentiment polarity labels. We conducted experiments on SST-1 with five sentiment labels [36], i.e., very negative, negative, neutral, positive, and very positive, and on SST-2 with binary labels.

4.2 Baselines

For the multi-class classification task, we compare our model against the following recent baselines on ISEAR, TEC, and CF:

DPCNN [16] exploits word-level pyramid convolutional model employing downsampling and shortcut connections.

ASEDS [31] is a sentiment representation learning model based on Facebook posts and reactions.

DERNN [42] is a representation learning model encoding both sentence syntactic dependency and document topical information.

WLTM [28] addresses the data sparsity issue of sentiment mining using topic modeling.

Word Rep. [2] examines multiple representation-based models applied to affective systems and investigates their effectiveness.

TESAN [41] is a deep topic model encoding topical information into topic embedding. A self-attention module and a fusion gate are proposed to predict the emotion label.

DACNN [44] applies a multi-channel convolutional network with attention to improve emotion categorization performance.

ESTeR [12] is an unsupervised emotion detection model incorporating word co-occurrences and word associations based on a word graph.

WED [24] adopts domain knowledge and existing lexicons to generate an affective representation of a word using fine-grained emotion concepts. Distribution learning is adopted to facilitate the classification task.

AGN [21] employs a variational autoencoder to encode the corpus-level word-to-label frequency and proposes an Adaptive Gate Network to consolidate semantic representation with statistical features selectively.

We also compared with the widely-adopted text classification baselines, i.e., TextCNN [18], Bi-LSTM [13], C-LSTM [45], Transformer [40], and Bert [7], on multi-class datasets and sentiment analysis datasets.

4.3 Evaluation metrics

We adopt accuracy and the Macro F1 score to evaluate the model performance. Furthermore, we conduct a t-test and report p-values on datasets without a standard train/test split to reveal how significant the improvements are against the baseline models.
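Both metrics can be written out directly. In the sketch below, Macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The set of classes is taken as the union of true and predicted labels, which is an assumption of this sketch, and the toy labels are placeholders.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

y_true = ["joy", "joy", "fear", "anger"]
y_pred = ["joy", "fear", "fear", "anger"]
acc = accuracy(y_true, y_pred)  # 0.75
f1 = macro_f1(y_true, y_pred)   # (2/3 + 2/3 + 1) / 3 = 7/9
```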

4.4 Word embedding and hyperparameter setting

To focus on the effect of the proposed framework, we randomly initialize word embedding vectors (with a size of 300, except for Bert) to eliminate the influence of using different pretrained language models. The preprocessing of all datasets follows the procedures reported in [18].

The hyperparameters are set as follows. The CNN-based models use filter window sizes of 3, 4, and 5 with 100 filters each, and the RNN-based models have a hidden dimension of 128. For the Transformer, we use an encoder with 8 heads and 3 blocks. The employed Bert model is Bert-base Uncased, which has 12 layers, 768 hidden units, and 110M parameters. We adopt the Adam optimizer with a batch size of 64 for non-Bert models and 16 for Bert models. The model dropout rate is set to 0.5. For the leaky dropout layer, the rate is 0.1, and the leaky parameter c is set to 200 according to our results on a validation trial. We select three fine-grained concepts for each emotion label because we can only identify three connections for some coarse-grained labels.
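For reference, the settings above can be gathered into a single configuration object. The values come from the text; the key names and the helper function are hypothetical, and settings the text does not state (such as the learning rate) are deliberately left out.

```python
# Values are taken from the text; key names are illustrative assumptions.
HYPERPARAMS = {
    "embedding_dim": 300,          # randomly initialized, non-Bert models
    "cnn_window_sizes": [3, 4, 5],
    "cnn_filters_per_size": 100,
    "rnn_hidden_dim": 128,
    "transformer_heads": 8,
    "transformer_blocks": 3,
    "bert_variant": "Bert-base Uncased",
    "optimizer": "Adam",
    "batch_size": {"non_bert": 64, "bert": 16},
    "dropout": 0.5,
    "leaky_dropout_rate": 0.1,
    "leaky_parameter_c": 200,
}

def batch_size_for(model_name):
    """Pick the batch size by model family (hypothetical helper)."""
    family = "bert" if "bert" in model_name.lower() else "non_bert"
    return HYPERPARAMS["batch_size"][family]

# If the pooled outputs of all filters are concatenated, as is usual for
# TextCNN, the semantic vector has one entry per filter.
cnn_feature_dim = (len(HYPERPARAMS["cnn_window_sizes"])
                   * HYPERPARAMS["cnn_filters_per_size"])
```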

4.5 Experiment results

The results of our proposed EmoChannel-SA model against the baseline methods are reported in Tables 2 and 3. In general, our proposed method achieves the best results in most of the comparisons (except accuracy on TEC and F1 score on SST-1), which means that it is competitive with the state-of-the-art baselines on both multi-class datasets and sentiment analysis datasets. For the Bert-based method, our proposed method also yields substantial improvement over the Bert baseline.

Table 2 Results on multi-class classification task
Table 3 Results on sentiment analysis task.

Besides the aforementioned baseline models, we conducted experiments with two variants on the ISEAR dataset as an ablation study. We compare different attention mechanisms, i.e., additive attention (AA) from [23] and self-attention (SA) of the proposed model, to investigate the effects of incorporating different dependency relations. We also tested the performance of the models with self-attention blocks only or additive attention only to evaluate how informative the EmoChannel vectors are. The results of the ablation study are reported in Table 4.

Table 4 Results of ablation study

5 Discussion

In what follows, we first elaborate on how our proposed approach improves the prediction performance by learning the dependencies among fine-grained emotions via the self-attention module, and then we illustrate the specific effect of the self-attention module through a case study from multiple perspectives.

5.1 Effect of adopting self-attention module

According to the results in Tables 2 and 3, EmoChannel with the self-attention module yields improvements over TextCNN on all datasets. In particular, compared with TextCNN, the proposed method using CNN as the feature extractor increases accuracy and F1 score by 1.48% and 1.61% on ISEAR, and by 2.44% and 4.63% on TEC, respectively, indicating that the proposed method is effective. Moreover, using only self-attention and the EmoChannel matrix achieves 51.31% accuracy and 52.72% F1 according to the ablation results in Table 4. Therefore, we conclude that it is beneficial to exploit the dependency within fine-grained emotions to understand sentence polarity.

Moreover, we noticed that the proposed model performs differently across datasets. In particular, the improvements over baseline models on tweet datasets, such as TEC and CrowdFlower, were less pronounced than those on ISEAR. We speculate that this difference is caused by the special language profile of tweet data. The informal language of tweets makes affective words harder to identify, which in turn makes it hard for the self-attention module to extract meaningful emotion dependencies. Furthermore, the large quantity of typos and slang in Twitter posts leaves many vacancies in the EmoChannel vector, which can bias the emotion construction of a sentence. In contrast, the AGN baseline, which is a self-attention-based model with statistical features, produced competitive results on tweet datasets. Therefore, statistical information has the potential to benefit the EmoChannel framework on such challenging datasets, where emotion dependency is difficult to extract.

We also observe that the proposed method does not outperform the baseline models in terms of F1 score on SST-1. We speculate that the reason is that the dependency among different EmoChannels cannot explicitly present the sentiment orientation of the sentence. In contrast, Li et al. [24] indicate that incorporating the sequential feature of the emotional word embedding, which exploits the emotion distribution from another perspective, can yield significant improvements on the sentiment analysis task. We therefore do not take this result to mean that the EmoChannel method is uninformative. Instead, it would be beneficial to refine the EmoChannel information to fit the sentiment analysis task, as in [24].

5.2 Comparison between additive attention and self-attention

We compare the models with different attention mechanisms and report the results in Table 4. The model with self-attention shows substantial improvements over the model with additive attention, which indicates that self-attention can better extract and exploit the dependency relationships within EmoChannels than additive attention for the classification task. In [23], additive attention calculates a weighted arithmetic mean of aligned EmoChannels, and the weights are chosen according to the relevance of each EmoChannel to the final emotional latent feature vector. The additive attention module is expected to highlight the fine-grained emotions that contribute more to the label emotion. However, such an approach cannot utilize the dependency within fine-grained emotions or reveal how different emotion constructions interact. Furthermore, the EmoChannel vector cannot reflect a global sentiment deviation at the sentence level, and thus the aggregated EmoChannel vector can be less informative, which can explain why the model employing additive attention without word embedding presents low performance. In contrast, the model employing self-attention only achieves satisfactory results, indicating that the self-attention module is superior in extracting valid features from EmoChannel vectors.

5.3 Case study

To better illustrate how the self-attention contributes to the classification task and how it can be affected by the other components, we visualize the resultant self-attention weights among the fine-grained emotions of several testing data samples from the ISEAR dataset, and analyze them in the following section. Specifically, the self-attention weights are visualized via a heat map, where a brighter entry denotes a larger weight and a darker entry denotes a smaller weight.

As described in Section 3.1, we have in total 151 fine-grained emotions, and a 151 × 151 heat map may not be straightforward for us to analyze the dependencies among these fine-grained emotions. Therefore, we select several representatives for each emotion type contained in the ISEAR dataset to construct the heat map. Since there are relatively more fine-grained emotions related to joy than the other labeled emotion types of the ISEAR dataset, we select 6 fine-grained emotions for joy and 3 fine-grained emotions for the other emotion types, the details of which can be found in Table 5. With these selected fine-grained emotions, we construct a heat map with size 24 × 24 for each selected testing sample, and show them together with their corresponding sentence content and emotion labels.
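The reduction from the full 151 × 151 attention matrix to a 24 × 24 heat map amounts to selecting the same index subset along both axes (6 indices for joy plus 3 for each of the six other labels). The sketch below shows this step only; the index values and the toy weight matrix are placeholders, since the actual selections are those listed in Table 5.

```python
NUM_EMOTIONS = 151

# Hypothetical positions of the selected fine-grained emotions in the
# 151-emotion list; placeholders, not the actual choices in Table 5.
selected = {
    "anger":   [3, 40, 77],
    "disgust": [5, 41, 90],
    "fear":    [8, 52, 101],
    "joy":     [12, 30, 64, 88, 120, 150],
    "sadness": [15, 66, 110],
    "shame":   [20, 71, 130],
    "guilt":   [25, 80, 140],
}

def flatten(groups):
    """Concatenate the index groups in label order."""
    return [i for indices in groups.values() for i in indices]

def sub_heatmap(weights, indices):
    """Keep only the chosen rows and columns of a square weight matrix."""
    return [[weights[r][c] for c in indices] for r in indices]

# Toy stand-in for learned self-attention weights.
weights = [[((r * NUM_EMOTIONS + c) % 7) / 7.0 for c in range(NUM_EMOTIONS)]
           for r in range(NUM_EMOTIONS)]

indices = flatten(selected)              # 6 + 6 * 3 = 24 indices
heatmap = sub_heatmap(weights, indices)  # 24 x 24
```

With the labels ordered as above, the joy group occupies the 10th to 15th rows and columns of the reduced matrix.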

Table 5 The selected fine-grained emotion tags for each coarse-grained emotion label in Dataset ISEAR

Next, we analyze the selected testing samples from multiple perspectives. To be more specific, we focus on the following three questions concerning the effect of the proposed self-attention module: (1) what self-attention learns during training; (2) the effect of semantic features on the self-attention module; (3) the effect of pre-trained word embedding vectors on the self-attention module. For the first question, we aim to illustrate that the proposed self-attention module has learned the dependencies among the fine-grained emotions and hence improved the performance on the emotion classification tasks. As for the other two questions, we focus on studying the working mechanism of the proposed self-attention module, that is, whether the proposed self-attention module can still perform as expected without semantic features or pre-trained word embeddings. These questions are addressed one by one in the following.

5.3.1 What self-attention learns

To begin with, we use two examples to visualize what our SA module has learned. Figure 4 (a) contains a case about breaking a promise that is labeled with the guilt emotion. In the heat map, for each fine-grained emotion, the attention weights of the fine-grained emotions related to joy (i.e., from the 10th one to the 15th one) are relatively smaller, as indicated by their darker entries. This makes sense, since joy is the least likely emotion type that this sentence can convey.

Fig. 4

Example sentences with their heat maps, content, and their emotion labels. (a) “breaking an implicit promise” is likely to convey a guilty emotion, and hence the fine-grained emotions related to guilt and also other similar coarse-grained emotions all receive high attention values; (b) “treatment to become pregnant with a negative result” can be deduced to express a sad emotion, and therefore only the fine-grained emotions related to joy are suppressed

Similar phenomena can also be observed in Figure 4 (b), which describes a situation with sad emotion that someone finds out that her pregnancy test turns out to be negative. Under such a situation, the fine-grained emotions that are the least related are still the ones of joy. Hence the entries from the 10th column to the 15th one are relatively darker than those of the others.

Such observations reveal that our self-attention module can suppress the contributions from the fine-grained emotions of the emotion type that is the least likely to occur, and promote the contributions of the fine-grained emotions related to those emotion types that are similar to the labeled one.

Interestingly, in some specific cases, we can make completely different observations. Let us take a look at Figure 5. Figure 5 (a) presents a case about being ashamed when praised, which is labeled with the guilt emotion. However, the heat map shows that the largest attention weights lie on the fine-grained emotions related to joy instead of those related to guilt or shame. We think the reason is as follows. The input of the self-attention module is the EmoChannel vectors aggregating the word-level emotion information. We notice that there are two explicit emotive words in the sentence, “ashamed” and “praised”, which convey shame and joy, respectively, both different from the labeled guilt type. Hence the self-attention module gets confused by the conflict between the labeled emotion and the word-level emotion information, and fails to learn informative dependencies among fine-grained emotions.

Fig. 5

Example sentences with their heat maps, content, and emotion labels. (a) “I feel ashamed when I am praised” is an example with two explicit emotive words, “ashamed” and “praised”, and hence the model gets confused by such a conflict and fails to learn informative dependencies among fine-grained emotions; (b) “I lost my way on a trip in the mountains” is another example expressing two explicit types of emotions conveyed by “lost” and “trip”, and hence the model gets confused, incorrectly highlighting the fine-grained emotions of joy instead of fear

A similar conclusion can be drawn from Figure 5 (b). The example of Figure 5 (b) is about getting lost on a trip and the labeled emotion is fear, but the heat map shows that the fine-grained emotions of joy receive relatively larger attention weights than the others. There are again two explicit emotive words in the sentence, “lost” and “trip”, expressing sadness and joy, respectively. Therefore, the self-attention module is confused by this inconsistency and cannot learn the expected dependencies among emotions.

To sum up, the self-attention module can in general learn to assign more weight to the fine-grained emotions that are similar to the labeled emotion type, and to suppress the attention from those of the least likely emotion type. However, when the input sentence itself contains explicit emotive words expressing emotion types different from the labeled one, the self-attention module may not learn informative dependencies among fine-grained emotions, in which case the final prediction relies more on the semantic feature vectors.

5.3.2 Effect of semantic features on self-attention

To validate whether the self-attention module can still learn meaningful dependencies among fine-grained emotions without the semantic feature vectors (Section 3.2.2), we select two test samples and compare the heat maps obtained by the proposed approach with those obtained after removing the word embedding vectors.

Figure 6 (a) describes a case in which someone has to walk through a field with wild bulls because his/her car has broken down, which may lead to fear or sadness. The labeled emotion is fear, and the heat map with semantic features taken into consideration shows that, for each fine-grained emotion, the contributions from the fine-grained emotions related to joy are all suppressed, while the other fine-grained emotions are assigned larger attention weights. However, if we rely only on the self-attention module without concatenating the semantic feature vectors, the learnt attention weights among fine-grained emotions can be biased towards a single fine-grained emotion and are not credible, as shown by the dramatically large attention on the 15th fine-grained emotion in the right sub-figure of Figure 6 (a).

Fig. 6

Example sentences with their content, emotion labels, and the heat maps resulting from including or removing semantic features. (a) is an example about walking over a field with wild bulls because of a car breaking down, which naturally conveys a fear emotion. (b) is another example about receiving a letter from a friend, which conveys joy. For both examples, the heat maps obtained with semantic features illustrate the correctly captured fine-grained emotions, while the ones without semantic features highlight incorrect fine-grained emotions

Another selected example is Figure 6 (b), which describes someone receiving a letter from a friend. The labeled emotion type is joy, and the heat map with semantic features involved is consistent with this label: the largest attention weights are assigned to the 13th fine-grained emotion, which is related to joy. However, with the self-attention module alone, the learnt self-attention weights indicate that the 7th and the 8th fine-grained emotions contribute the most to the final prediction, which does not make sense since these two fine-grained emotions are related to fear.

Therefore, we conclude that our self-attention module needs to work together with the semantic feature vectors to learn informative dependencies among fine-grained emotions for promoting the emotion classification performance. Without the word embedding vectors, the self-attention module may be biased towards some unrelated fine-grained emotions and fail to benefit the whole model.

5.3.3 Effect of pre-trained word embedding vectors

After showing that semantic feature vectors are indispensable for our self-attention module, we further explore the effect of using pre-trained word embedding vectors when computing semantic feature vectors. Specifically, we select two test samples and compare the heat maps resulting from pre-trained word embedding vectors with those resulting from random word embedding vectors.

The first example is Figure 7 (a), which is about visiting a friend and is labeled with the joy emotion. The heat map obtained with random word embedding vectors shows that the fine-grained emotions related to joy, especially the 13th one, contribute the most to all the fine-grained emotions. However, the heat map obtained with pre-trained word embedding vectors looks completely different: the contributions of the fine-grained emotions related to joy are suppressed.

Fig. 7

Example sentences with their content, emotion labels, and the heat maps obtained using random or pre-trained word embedding vectors. (a) is an example about going to visit a friend, which conveys a joy emotion, while (b) is another example about someone feeling unhappy about waiting for his girlfriend. For both examples, the heat maps obtained with pre-trained word embeddings seem to focus on fine-grained emotions that conflict with the labeled coarse-grained emotions, while the ones using random word embeddings show more reasonable attention weights

Similarly, Figure 7 (b) describes a case in which someone’s girlfriend keeps him waiting when he tries to take her out. The labeled emotion is disgust, and the heat map obtained with random embedding vectors shows that the fine-grained emotions related to joy are assigned smaller attention weights. In contrast, when using pre-trained word embedding vectors, the fine-grained emotions related to joy are assigned larger weights than the others, which conflicts with the observation made with random word embedding vectors.

We believe these conflicts arise because the pre-trained word embedding vectors already provide strong and informative semantic features. With these pre-trained word embedding vectors, the model learns to rely more on the semantic feature vectors. Hence the self-attention module is not trained sufficiently and fails to perform as expected. Therefore, to ensure that the self-attention module receives enough training, using random word embedding vectors is a better choice.

5.3.4 Emotion distribution

Since the self-attention block is applied directly to the EmoChannel matrix, the attention weights reflect an integration of all the affective words in a sentence, thanks to the inner product between queries and keys, as shown in Figure 4. As analyzed in Section 5.3.1, the self-attention module can learn to assign less weight to the fine-grained emotions that are the least likely to occur, and to increase the weights of the other fine-grained emotions. Moreover, according to our discussion in Section 5.3.3, training with randomly initialized word embedding vectors encourages the self-attention module to encode semantic information, so that the attention map is able to present the real sentiment composition to some extent. Therefore, the learnt self-attention weights can be viewed as a modification or adaptation of the inherent word-level emotion information guided by the sentence’s emotion label. The resulting modified emotion information could be a natural fit for an emotion distribution. In the next step of our research, we plan to investigate the potential connection between the attention map and the sentence-level emotion distribution by applying a dimensionality reduction method to the attention map, which could be a feasible approach to constructing a distribution label that facilitates the classification task.
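As a purely illustrative sketch of what reducing an attention map to a sentence-level emotion distribution could look like, the snippet below averages a row-normalized attention matrix over its rows, yielding one value per fine-grained emotion. This is only one simple candidate reduction and an assumption on our part, not a method evaluated in this work; the function name `column_mean_distribution` and the toy matrix are hypothetical.

```python
def column_mean_distribution(attn):
    """Average a row-normalized attention map over its rows.

    attn: square list of lists whose rows each sum to one.
    Returns one non-negative value per column (fine-grained emotion).
    Because every row sums to one, the returned values also sum to one.
    """
    n_rows = len(attn)
    n_cols = len(attn[0])
    return [sum(row[j] for row in attn) / n_rows for j in range(n_cols)]


# Made-up attention map over three hypothetical fine-grained emotions.
toy_attn = [
    [0.5, 0.4, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
]
dist = column_mean_distribution(toy_attn)
```

Averaging keeps the result a valid probability distribution without any further normalization, which is why it is used here as the simplest possible placeholder for a learned or more principled reduction.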

6 Conclusion and future work

In this work, we proposed an EmoChannel-SA Network to enhance emotion classification performance by exploiting the dependency relationships within the emotion construction of the text. We examined 151 fine-grained emotions incorporating domain knowledge and a dimensional lexicon dictionary. Our experimental results show that the proposed method leads to a more robust classifier on datasets of various topics. To examine the generality of our proposed model, we conducted experiments on multi-class classification and sentiment analysis tasks. The experimental results indicate that our model achieves competitive performance against state-of-the-art baselines. Furthermore, we provided in-depth discussions of how the interaction among fine-grained emotions affects decision-making by visualizing self-attention weights. We conclude that the self-attention module can learn to assign more weight to the fine-grained emotions that are similar to the labeled emotion type, and to suppress the attention from those of the least likely emotion type. Meanwhile, the learnt self-attention weights can be regarded as a modification of the inherent word-level emotion information guided by the sentence’s emotion label. Therefore, the attention map has the potential to shed light on a new approach to emotion distribution learning. In future work, we plan to investigate the possible connection between self-attention weights and emotion distribution, which would help enrich the information carried by the one-hot label.