1 Introduction

Social media platforms are ubiquitous today, and people spend a noticeable amount of their time communicating with each other through them, whether by text, audio, or video, and expressing their emotions through these channels. Emotions cannot be ignored or set aside in human life, because we humans use them to communicate with each other and to make decisions [1]. Emotions can be expressed in different ways, for example, through facial or body gestures, voice, or text. Detecting a person's emotion from their facial and body gestures or from their voice is comparatively easy, but detecting emotion from text is far harder, even for humans themselves [2].

However, social media platforms such as Instagram, Facebook, and Twitter hold an enormous amount of valuable textual data that reflects crowd behavior and emotion [3], so automating the hard task of detecting emotions has gained popularity in the past few years. For example, during the COVID-19 pandemic [4], people shared their experiences and opinions on the issue; analyzing these comments can help us understand what they really feel and whether they are dealing with depression, so that further action can be taken. Another example is empathetic chatbots, which need to understand the emotions of their users in order to respond accordingly [5].

Emotion detection (ED) is now a subfield of natural language processing (NLP) that tries to detect the emotion lying behind a text, such as joy, love, or sadness. Various studies have employed different models for textual emotion detection, such as LSTM [6], BiLSTM, and GRU [7]. Although these models have made promising contributions to the field, they have limitations: they are slow, require vast computational resources, and need a large amount of training data, and such labeled data and computational resources are not always available.

Therefore, our objective is to show the benefit of transfer learning and how this problem can be addressed using pre-trained language models. In this paper, we used EmotionalBERT, which is based on pre-trained BERT [8]. The knowledge of the BERT model is transferred to train a standard feed-forward neural network with a softmax layer built on top of it, in order to classify tweets based on their emotions. Our results show that EmotionalBERT not only outperforms the RNN-based models considered in this experiment while using only 36% of the dataset, but also reaches this accuracy within only a few training epochs. We also test the model on a new, small dataset and compare the results.

In the next section, we review related literature on emotion detection in text. Section 3 introduces EmotionalBERT, the pre-trained language model used in this experiment. Section 4 details the data preparation, the baseline models, and the experimental results. Section 5 concludes the paper and points out future work.

2 Related work

There have been many attempts at facial emotion recognition [9, 10] and audio-visual emotion recognition [11, 12], but less attention has been paid to detecting emotions from textual data, as it is a relatively new area in NLP. Some work has been done using traditional machine learning techniques [1, 9, 13, 14]. A very important aspect of textual data is its sequential pattern: the meaning of a single word depends on the rest of the words in the sentence or paragraph, so context helps determine what a single word really means and which emotion the text carries. Unfortunately, traditional machine learning techniques cannot capture this sequential nature of text [15] and, as a result, fail to consider the context when classifying texts by emotion.

This limitation of traditional machine learning models made deep learning models such as recurrent neural networks (RNNs) and their variants, long short-term memory (LSTM) [6] and gated recurrent units (GRU) [7], more prominent in textual emotion detection [16,17,18]. Although recurrent models consider the sequential nature of text [19] and have achieved state-of-the-art results on different NLP tasks, they are slow and need to be trained from scratch, they are limited in how well they can capture long-term dependencies in text [20], and they require a large amount of labeled data. Preparing such a large amount of labeled data is a time-consuming and tedious procedure [13], so it is not always available.

This is where transfer learning, transferring knowledge from a general-purpose task to a more specialized target task, comes into play. Using transfer learning, we can achieve better results than traditional deep learning models with much less training material. Pre-trained language models, such as bidirectional encoder representations from transformers (BERT) [8] and its variants, OpenAI GPT [21], and Transformer-XL [22], have been widely used in various NLP tasks and have shown promising performance. These models have been trained on huge amounts of data, and the knowledge they gained can be reused for other, similar tasks without the need for a huge amount of data and training time.

Some work has been done using pre-trained language models (LMs) to classify emotions or sentiments in text. [23] uses BERT as an embedding layer whose output is passed through CNN and BiLSTM layers to perform Bangla sentiment analysis. They also compare BERT embeddings with various word embedding techniques, such as Word2Vec, GloVe, and fastText, and their results show that BERT significantly outperforms all of them. [24] compares the performance of the BERT, RoBERTa, DistilBERT, and XLNet [25] pre-trained transformer models in recognizing emotions from text; the models are fine-tuned on the ISEAR data to classify it into seven emotion classes, and the results show that RoBERTa had the highest accuracy. [26] studies the effectiveness of DeepEmotex models, which are fine-tuned USE [27] and BERT pre-trained models, for classifying text by emotion. They also studied the effect of varying the amount of data and found that using more data for fine-tuning the pre-trained models improves their performance.

But, as discussed earlier, although more training data can enhance a model's performance, preparing a large amount of labeled data is a time-consuming and tedious task. In this paper, our objective is to demonstrate the benefit of transfer learning and to show how such pre-trained models maintain their accuracy with a small amount of labeled data compared to traditional deep learning models such as RNNs.

3 EmotionalBERT model

In this paper, we adopted EmotionalBERT, which is based on pre-trained BERT. The knowledge of the BERT model is transferred to train a standard feed-forward neural network with a softmax layer built on top of it, in order to classify tweets based on their emotions. Bidirectional encoder representations from transformers (BERT) [8] is a transformer-based language model that uses only the encoder part of the transformer. It has been trained on a huge amount of data (books, Wikipedia, etc.) and can be used as a pre-trained model for different NLP tasks such as sentiment analysis (SA), question answering (QA), and text summarization (TS). There are several variants of BERT; here, we use the BERT-base model and fine-tune it for the target task. The model has 12 encoder layers, or, as the authors call them, transformer blocks. Each transformer block contains a 768-dimensional hidden layer and a 12-head self-attention layer.

Fig. 1 The EmotionalBERT model architecture

The input sequence is prepended with a special token called [CLS], which stands for classification. Its final hidden state represents the whole input sequence and can therefore be used for classification tasks. BERT has been trained with two different techniques. The first is masked language modeling, in which 15% of the input tokens are replaced by the [MASK] token and the model tries to predict the masked tokens. The second is next-sentence prediction: the model receives two sentences as input, separated by the [SEP] token, and has to decide whether the second sentence follows the first.
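For illustration, the short sketch below (using the Hugging Face transformers library, which is our assumption; the paper does not name its tooling) shows how the tokenizer wraps a single sequence or a sentence pair with these special tokens.

```python
from transformers import BertTokenizer

# A single sequence is wrapped as [CLS] ... [SEP]; a sentence pair
# (as used in next-sentence prediction) receives a second [SEP].
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("the weekend is finally here", "i am so happy about it")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'weekend', 'is', 'finally', 'here', '[SEP]',
#  'i', 'am', 'so', 'happy', 'about', 'it', '[SEP]']
```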

We take the final output of the first token, [CLS], and feed it to a classifier. The classifier consists of a feed-forward neural network layer followed by a softmax function that produces the class probabilities. The architecture of the model is shown in Fig. 1.
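As a concrete illustration, the following minimal PyTorch sketch builds this architecture with the Hugging Face transformers library. The dropout rate and the single linear layer in the head are our assumptions; the paper only specifies a feed-forward network with a softmax on top of the [CLS] representation.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class EmotionalBERT(nn.Module):
    """Pre-trained BERT encoder with a feed-forward classification head on [CLS]."""

    def __init__(self, num_classes: int, dropout: float = 0.1):
        super().__init__()
        # BERT-base: 12 transformer blocks, 768-dimensional hidden states, 12 attention heads
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_repr = outputs.last_hidden_state[:, 0, :]      # final hidden state of [CLS]
        logits = self.classifier(self.dropout(cls_repr))
        return logits  # softmax over these logits yields the class probabilities
```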

4 EmotionalBERT evaluation

In this section, we discuss the datasets we used and how we prepared them for the experiments. We also describe the baseline models against which we compare our results and present our experimental findings.

4.1 Data preparation

We conduct our experiments on two different datasets, Wang and MELD, which are described in the following subsections.

4.1.1 Wang dataset

For our first experiment, we adopt the dataset created by Wang et al. [1]. This dataset contains around 2.5 million tweets, of which over 1.3 million were available to download using their IDs. The tweets were labeled using their hashtags with seven emotion classes, six of which come from [28]: joy, sadness, anger, love, fear, thankfulness, and surprise. The number of tweets for each emotion class is given in Table 1. As can be seen, the dataset is imbalanced: some classes have far fewer examples than others. Surprise has the least data and joy the most, which is not surprising, as emotions like joy or sadness are expressed more often than surprise.

Table 1 Number of tweets for each emotion class in Wang dataset

4.1.2 MELD dataset

To further test the model's performance, we ran another experiment on a publicly available dataset called the Multimodal EmotionLines Dataset (MELD) [29], which contains about 16,000 utterances from the TV series Friends.

Table 2 Number of utterances for each emotion class in MELD dataset

This dataset contains video, audio, and text for each utterance; we only used the text data in our experiment. The number of utterances for each emotion class is given in Table 2. As can be seen, this dataset is relatively small compared to the first one, and it is also imbalanced: the neutral class has the highest number of utterances and fear the lowest. We adopted this dataset to see how the models perform on a much smaller dataset.
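A possible way to load only the textual modality is sketched below; the file name and column names follow the CSV files distributed with MELD and should be treated as assumptions, since they may differ between releases.

```python
import pandas as pd

# Load only the text modality of MELD; "train_sent_emo.csv", "Utterance",
# and "Emotion" are assumed names taken from the dataset's public release.
train = pd.read_csv("train_sent_emo.csv")
texts = train["Utterance"].tolist()
labels = train["Emotion"].tolist()
```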

To prepare the data, we took some simple preprocessing steps: we removed stopwords and punctuation, lowercased the text, and expanded contracted words to their original form. Tokenization was done with the BERT WordPiece tokenizer. We limited the length of each sentence to 160 tokens, so any sentence longer than that is truncated.
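A minimal sketch of these preprocessing steps follows, assuming NLTK's English stopword list and an illustrative (deliberately incomplete) contraction map; the exact resources used in the paper are not specified.

```python
import string

import nltk
from nltk.corpus import stopwords
from transformers import BertTokenizer

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))
# Illustrative contraction map; the full list used in the paper is not specified.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am", "it's": "it is"}
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def clean(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)    # drop stopwords

def encode(text: str):
    # WordPiece tokenization, truncated and padded to 160 tokens as in our experiments.
    return tokenizer(clean(text), max_length=160, truncation=True,
                     padding="max_length", return_tensors="pt")
```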

4.2 Baseline models

For comparison, we consider two RNN-based models. The first is the bidirectional GRU model proposed by Seyeditabari et al. They used seven identical binary emotion classifiers, one per emotion class. For the embedding layer, they tried different models and found no significant difference in their performance; they published their results based on two embedding models, ConceptNet Numberbatch [30] and fastText [31], both with 300 dimensions. The architecture of their model consists of an embedding layer, a bidirectional GRU layer, max-pooling and average-pooling layers, and a dense neural network layer.

After the embedding layer, a bidirectional GRU captures the sequential nature of the tweets. To extract the most important features and an average representation from the output of this GRU layer, the max-pooled and average-pooled outputs are concatenated. The result is then fed to a dense classification layer with a dropout rate of 50%, and finally a sigmoid function produces the probability of each emotion class. For further details, refer to the original article.
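The sketch below outlines this baseline in PyTorch for a single binary classifier (one emotion vs. the rest); the GRU hidden size is an assumption, and the original article should be consulted for the exact configuration.

```python
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Embedding -> BiGRU -> [max-pool ; avg-pool] -> dense -> sigmoid."""

    def __init__(self, embedding_weights: torch.Tensor, hidden_size: int = 128):
        super().__init__()
        # 300-dimensional pre-trained embeddings (ConceptNet Numberbatch or fastText)
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=True)
        self.gru = nn.GRU(embedding_weights.size(1), hidden_size,
                          batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)                       # 50% dropout before the dense layer
        self.fc = nn.Linear(4 * hidden_size, 1)              # max-pool and avg-pool, each 2*hidden

    def forward(self, token_ids):
        out, _ = self.gru(self.embedding(token_ids))         # (batch, seq_len, 2*hidden)
        pooled = torch.cat([out.max(dim=1).values, out.mean(dim=1)], dim=1)
        return torch.sigmoid(self.fc(self.dropout(pooled)))  # probability of one emotion
```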

The second is an LSTM-based model. It consists of a unidirectional LSTM layer with 300 hidden units, followed by a fully connected output layer. Pre-trained fastText embedding weights are used to initialize the embedding layer and are kept fixed during training. The model uses dropout with a rate of 0.5 to prevent overfitting, and the AdamW optimizer minimizes the cross-entropy loss. The weights are initialized using Xavier initialization for the linear layer and orthogonal initialization for the LSTM layer. The model was trained on a GPU provided by the Google Colaboratory service to speed up training.
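A minimal PyTorch sketch of this baseline is given below; the embedding matrix is assumed to contain the pre-trained fastText weights, and the optimizer and loss are indicated only in a comment.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, embedding_weights: torch.Tensor, num_classes: int):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(embedding_weights, freeze=True)  # fixed fastText weights
        self.lstm = nn.LSTM(embedding_weights.size(1), 300, batch_first=True)           # 300 hidden units
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(300, num_classes)
        self._init_weights()

    def _init_weights(self):
        nn.init.xavier_uniform_(self.fc.weight)          # Xavier for the linear layer
        for name, param in self.lstm.named_parameters():
            if "weight" in name:
                nn.init.orthogonal_(param)               # orthogonal for the LSTM weights

    def forward(self, token_ids):
        _, (h_n, _) = self.lstm(self.embedding(token_ids))
        return self.fc(self.dropout(h_n[-1]))            # logits; cross-entropy is applied outside

# Training uses torch.optim.AdamW(model.parameters()) and nn.CrossEntropyLoss().
```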

4.3 First experiment

In the first experiment, we ran the EmotionalBERT and LSTM-based models on the Wang dataset and compared the results with the bidirectional GRU model. The learning rate and batch size for the EmotionalBERT and LSTM models were 2e-5 and 16, respectively. The EmotionalBERT and LSTM models were trained for 3 and 5 epochs, respectively.

For EmotionalBERT, we chose three increasing amounts of training data; the optimum F1 level was reached with 500K tweets. We did not feed the whole dataset to the model, in order to show that, although more data improves the model's performance, EmotionalBERT can achieve better results than the RNN models even without a huge amount of training material. Also, due to limited resources, we were not able to train the LSTM model from scratch on the whole dataset, so we trained it on 500K tweets as well.

4.3.1 Analyzing each phase

The first experiment was performed on the Wang dataset in three phases, feeding the EmotionalBERT model 100,000, 250,000, and 500,000 tweets, respectively.

To fine-tune the model, we chose a batch size of 16, as the amount of input data per step has a noticeable impact on the model's performance. Considering how much the learning rate affects the learning and convergence of the model, we kept it small and chose 2e-5. We trained the model for three epochs: in fine-tuning, the model has already learned many high-level and low-level features of the text, so a large number of epochs is not needed. Moreover, after three epochs the training and validation accuracy no longer changed, so there was no point in training the model further.
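A sketch of this fine-tuning loop, with the hyperparameters above (batch size 16, learning rate 2e-5, three epochs), is shown below; it assumes the EmotionalBERT module sketched in Section 3 and a dataset yielding input_ids, attention_mask, and labels tensors.

```python
import torch
from torch.utils.data import DataLoader

def fine_tune(model, train_dataset):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device)
    model.train()
    for _ in range(3):                                   # three epochs are enough for fine-tuning
        for batch in loader:
            optimizer.zero_grad()
            logits = model(batch["input_ids"].to(device),
                           batch["attention_mask"].to(device))
            loss = loss_fn(logits, batch["labels"].to(device))
            loss.backward()
            optimizer.step()
    return model
```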

Table 3 F1-score results for phase 1, with 100,000 training data
Table 4 F1-score results for phase 2, with 250,000 training data

We used the exact same BERT-based model for each emotion class, predicting whether a tweet expresses that specific emotion or any of the others. The results of the three phases are shown in Tables 3, 4, and 5, respectively. The reported numbers are F1-scores; we also report precision and recall to give a more complete picture.
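The per-class scores can be computed as in the sketch below, assuming binary ground-truth and predicted labels (1 for the target emotion, 0 for any other) and scikit-learn's metric implementation.

```python
from sklearn.metrics import precision_recall_fscore_support

def report(y_true, y_pred):
    # One-vs-rest evaluation: each binary classifier predicts "this emotion" (1)
    # vs. "any other emotion" (0); precision, recall, and F1 are reported per class.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0)
    return {"precision": precision, "recall": recall, "f1": f1}
```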

We can see in Table 3 that joy and anger have the highest F1-scores, both equal to 81%, while surprise gets the lowest (63%). The overall F1-score in phase 1 is 73%. Here, we used only 100,000 tweets, a small fraction of the dataset, so it is normal not to get the best results. In phase 2, we trained the model with 250,000 tweets. Again, anger and joy have the highest F1-scores, and surprise and fear the lowest (Table 4). The overall F1-score improved by 2% to 75%. This was still not optimal compared to the baseline models, so we kept feeding more data to the model, doubling the amount of data in phase 3 and repeating the experiment.

Table 5 F1-score results for phase 3, with 500,000 training data

In phase 3, we used 500,000 tweets to train the model. The results are given in Table 5. The most notable point in this table is that the F1-score is 100% for four classes, meaning the model classified all of their test examples correctly. As this was the optimal result, we ended the experiment after three phases. Although the model performs very well on four classes, it did not perform well on surprise and fear, whose F1-scores declined throughout the experiment.

This is likely due to two reasons. First, as mentioned earlier, the dataset is not balanced, and surprise and fear have the least data compared to the other classes. As given in Table 1, there are only around 13,000 and 73,000 tweets labeled with surprise and fear, respectively. Moreover, we used only 36% of the dataset, so the amounts the model actually saw are even smaller. The distribution of data for each class in this 36% subset is shown in Table 6.

Table 6 Number of tweets for each emotion class in 36% of the dataset
Table 7 Top 30 common words for classes surprise and fear

According to Table 6, among the 500,000 training tweets there are 4,962 labeled surprise and 26,944 labeled fear. Surprise thus has far fewer examples than classes such as joy or sadness. This imbalance can significantly affect the model's performance on this class, so the reason the model does not perform well on surprise is simply the lack of data.

Fear also does not have a large amount of data (26,944 tweets), but it is almost as large as thankfulness (29,152), so there must be another reason besides the insufficient amount of data. To investigate, we checked the top 30 most common words for each class in the dataset and found that, unlike the other classes, fear has few representative words (Table 7): its most common words are mostly irrelevant and not representative of this emotion.

The only word that can be considered related to fear is the hashtag “nervous,” with a frequency of only 512. The same issue holds for the surprise class: the word “surprise,” with a frequency of 165, and the hashtag “surprised” are the only representative words for this class. This makes it difficult for the model to classify these tweets correctly, and it may misclassify them into other classes.
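The frequency check described above can be reproduced with a short sketch like the following, assuming the tweets of a class have already been preprocessed (lowercased, stopwords removed) as in Section 4.1.

```python
from collections import Counter

def top_words(tweets, k=30):
    # Count word frequencies over all tweets of one class and return the k most common.
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return counts.most_common(k)
```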

4.3.2 First experiment results

Although the model did not perform well on the fear and surprise classes, it achieved a far better overall F1-score than the RNNs. The comparison of the results is shown in Table 8. The bidirectional GRU outperformed EmotionalBERT in classifying fear and surprise, but EmotionalBERT performed better on the five other classes. As we can see, the EmotionalBERT model achieved an 86% F1-score, while the bidirectional GRU and the LSTM achieved 80% and 33%, respectively. Our model shows a significant improvement in F1-score despite using only 36% of the dataset.

Table 8 The comparison between EmotionalBERT, LSTM, and bidirectional GRU on the Wang dataset

4.4 Second experiment

In the second experiment, we adopted the MELD dataset. Since this dataset is small, we conducted the experiment in a single phase, using all of the data at once. The batch size and learning rate for both models were 16 and 2e-5, respectively. EmotionalBERT and the LSTM were trained for 5 and 20 epochs, respectively. Table 9 reports the F1-scores of EmotionalBERT and the LSTM model. Seyeditabari et al. did not test their bidirectional GRU model on the MELD dataset.

Table 9 The comparison between EmotionalBERT and LSTM performance on the MELD dataset

As the results show, EmotionalBERT outperforms the LSTM model in all classes. The LSTM model failed to capture three classes: disgust, fear, and anger. This is not surprising, since the amount of data for these classes was not enough for the model to learn them. Moreover, this dataset consists of utterances that are more meaningful when the surrounding dialog is considered: the utterances before and after could help capture the context better, whereas here the models assess only a single utterance, which can affect the results. Table 10 gives some examples from the dataset where, without considering the running dialog, the utterance could be classified into other classes. Even though the training data is very small (only 16K utterances), and despite the similarity between classes discussed above, EmotionalBERT still achieved far better results than the LSTM model.

Table 10 The utterance instances in the MELD dataset that can be classified into other classes without considering the running dialog

5 Conclusion

In this paper, we relied on transfer learning techniques, now widely used in natural language processing, and developed an architecture based on the pre-trained BERT model, called EmotionalBERT, that detects emotions in textual data with higher accuracy than RNN-based models while using considerably less training material [32, 33]. We demonstrated the benefit of transfer learning for classifying text by emotion when only small datasets are available.

For future work, the reasons the model performed poorly on some classes can be explored using explainable AI. We can also work on dialog-based emotion detection to give the model richer context and see how much it improves performance. In addition, the effectiveness of other pre-trained models, such as TinyBERT [32], DistilBERT, XLNet, and MobileBERT [33], on small datasets can be compared; explainable AI can then be used to analyze the superior models, to find out why they perform better on small datasets and how new models can be designed with this aspect in mind.