1 Introduction

Human language understanding and human language generation are the two aspects of natural language processing (NLP). In NLP applications, different types of classifiers, such as Support Vector Machines (SVM), Naive Bayes (NB), decision trees (DT), and k-nearest neighbors (KNN), are used to obtain results in different domains (Itani et al. 2017). NLP is employed in speech recognition, document summarization, question answering, speech synthesis, machine translation, and other applications (Itani et al. 2017). Itani et al. (2017) also noted that these classifiers are designed and used for different purposes, such as question answering, movie sales or reviews, and many other applications, which means that the entities or corpora they depend on need to be domain specific. The authors also mentioned that in the field of sentiment analysis (SA), the classifier aims to identify whether a specific sentence or social media post is positive, negative, neutral, happy, sad, etc. SA and emotion recognition are two critical areas of NLP (Nandwani and Verma 2021). The two names are often used interchangeably, but they differ in a few respects. The authors claim that, due to ambiguities in natural language, human language understanding is the more difficult aspect. Emotion is a conscious mental state subjectively experienced as a strong feeling, usually directed towards a specific object and typically accompanied by physiological changes and changes in body language. Textual emotion detection is devoted to automatically identifying emotional states, such as happy, sad, or angry, in textual expressions. For instance, a person experiences joy when they receive good news and fear when they are threatened. Emotions have a strong influence on our lives: we make decisions based on whether we are happy, angry, bored, sad, or frustrated.

Since social media generates an enormous amount of data, known as big data, SA, also known as opinion mining, was introduced to examine the data steadily and constructively (Nandwani and Verma 2021). There are three types of sentiment and emotion analysis techniques, described in Sect. 3: the machine learning-based approach, the deep learning-based approach, and the lexicon-based approach. There are also three levels of analysis in SA: word-level, sentence-level, and document-level analysis. Nandwani and Verma (2021) describe SA as a method for detecting an author's viewpoint on a subject: SA assesses whether the data is negative, positive, or neutral, which helps businesses understand their customers' views better so that they can make the necessary changes to their products and services. The authors (Nandwani and Verma 2021) thus define sentiment analysis as the process of finding the writer's perception or attitude, which may be negative, positive, or neutral. The motivation of SA is to categorize opinionated text and determine its polarity, i.e., whether it is positive or negative; beyond positive and negative, it may also cover categories such as agreed or disagreed and good or bad (Nandwani and Verma 2021). Basically, the range can be quantified on a 5-point scale: agree, strongly agree, disagree, strongly disagree, or neutral. Nandwani and Verma (2021) also took the stock market as an example, where investors' decisions are based on investor sentiment, which can rapidly spread and amplify over the network. They mention the applications of NLP in the healthcare sector, where Twitter has become a crucial source of information provided by citizens and healthcare professionals; for example, during the COVID-19 pandemic, people used social media posts to share their thoughts, opinions, and feelings on COVID-19. They also refer to applications of NLP in the education sector, noting that educational institutions nowadays often depend on social media for marketing and advertisement purposes. Guardians and students conduct online research to learn more about an institution, the courses offered, and the professors. Thus, the authors claim that by applying SA, students can select the best institution or professor during the admission or registration process.

In 2019, 2.4 billion people used Facebook, the biggest social media site on earth, and other platforms such as WhatsApp, Instagram, Twitter, and YouTube each had over a billion users. In 2019, there were at least 3.5 billion people online out of 7.7 billion in the world (Ortiz-Ospina and Roser 2023). This indicates that more than two-thirds of all internet users, and one in three people worldwide, use social media platforms. Social media has altered the world: our ability to find partners, access news and information, and organize to demand political change is changing as a result of these technologies' quick and widespread adoption (Ortiz-Ospina and Roser 2023). As Internet services have improved, people increasingly use social media to communicate their feelings. On platforms such as Facebook, Twitter, and Instagram, people freely express their feelings, arguments, and opinions on various topics. In addition, many users give feedback and reviews of various products and services on e-commerce sites. User ratings and reviews on multiple platforms encourage vendors and service providers to enhance their current systems, goods, or services. Nowadays almost every industry is undergoing some digital transformation, resulting in a vast amount of structured and unstructured data. The enormous task for companies is to transform this unstructured data into informative data that can help them in decision-making (Nandwani and Verma 2021; Ahmad et al. 2020).

Emotions are an inseparable component of human life (Nandwani and Verma 2021). The word "emotion" is derived from the French word émotion, meaning a physical disturbance, and came into use in the seventeenth century; in the nineteenth century it came to be considered a psychological term. Emotions influence human decision-making and help us communicate with the world in a better way. Emotion detection, also known as emotion recognition, is the process of identifying a person's various feelings or emotions (for example, joy, sadness, or anger). Researchers have been working hard to automate emotion recognition for the past few years. Physical signals such as heart rate, shivering of hands, sweating, and voice pitch also convey a person's emotional state (Itani et al. 2017; Kratzwald et al. 2018), but emotion detection from text is quite hard. Emotion analysis plays a critical role in the education sector for both teachers and students: the efficiency of a teacher is determined not only by their academic credentials but also by their enthusiasm, talent, and dedication to their work, and taking timely feedback from students is the most effective technique for a teacher to improve teaching approaches (Sangeetha and Prabha 2021). Darwin and Prodger (1998) mentioned that emotion detection from text is comparatively more complex than SA. According to the Oxford Dictionary, "emotion" is a strong feeling arising from a person's mood or circumstances; according to the Cambridge Dictionary, "emotion" is a strong feeling such as love or anger, or strong feelings in general. Darwin and Prodger (1998) also mentioned different applications of emotion recognition, such as e-learning systems that adapt to students' emotions, human-computer interaction, mental health analysis, improving business blueprints, analyzing political events, criminal psychology or the terrorist mindset after a crime or terrorist attack, the performance of chatbots, and so on.

This paper is organized into 11 sections. In Sect. 1, we discuss the evolution of emotion, the research questions and their motivation, terms and terminologies, the contribution of this paper, the SLR diagram (Fig. 5), and the inclusion and exclusion criteria with data extraction. In Sect. 2, we present the different emotion models and their classifications. In Sect. 3, we discuss related work on text-based emotion analysis and key aspects of the machine learning and deep learning approaches. In Sect. 4, we discuss the existing surveys on contents and topics, surveys on methods of emotion detection, and the overall methodologies of text-based emotion detection. In Sect. 5, we discuss the different document types, whether collected from journals, conference papers, workshop papers, or dissertations. In Sect. 6, we discuss the datasets, data preprocessing, feature extraction, and a prospective methodology for emotion detection using deep learning methods and algorithms. Word-level, sentence-level, and document-level analysis are discussed in Sect. 7. The different deep learning techniques and their methodologies are reviewed in Sect. 8. In Sect. 9, we discuss the different evaluation techniques and metrics. Challenges identified in text-based emotion detection (TBED) are discussed in Sect. 10. The results and conclusions, practical implications, future directions of related studies, and limitations of this paper are discussed in Sect. 11.

1.1 Evolution of emotion

After conducting psychological experiments on facial expressions captured in different circumstances in both humans and animals, Charles Darwin claimed in 1872 that humans and other animals express emotions with similar expressions and behavior under similar circumstances (Darwin and Prodger 1998). His theory of emotion was also related to the era and its associated circumstances; according to his claims, emotions in humans and other animals were realized over time. He discussed the general principles of emotions in detail; the means of expressing emotions in both humans and animals; the causes and effects of all possible emotions such as anxiety, grief, dejection, despair, joy, love, and devotion; and explanations of emotions with images showing the expressions for particular emotions. Some philosophical and spiritual categorizations of emotions existed before that, as described in Bell (1824) and Sailunaz et al. (2018). Figure 1 shows the evolution of emotion in different fields of study.

Fig. 1 Evolution of emotion (Sailunaz et al. 2018)

1.2 PICOC criteria (Widyassari et al. 2022)

See Table 1.

Table 1 PICOC criteria

1.3 Research question and motivation

See Table 2.

Table 2 Research question and motivation

1.4 Terms and terminologies

The terms used in emotion detection are defined as follows (Tyng et al. 2017):

  • Emotions Emotions are described as a complex set of interactions between subjective and objective variables, mediated by neural and hormonal systems. They are the psychoneural processes involved in controlling the strength and pattern of actions in the dynamic flow of intense behavioral exchange between animals, as well as with specific objects important for survival. The term "emotion" represents an "overarching" concept that encompasses affective, cognitive, behavioral, expressive, and physiological changes; emotions are triggered by external stimuli and are associated with a combination of feeling and motivation.

  • Moods Moods last longer than feelings and are likewise characterized as positive or negative. In contrast, feelings refer to mental experiences that necessarily have good or bad valence and are accompanied by internal physiological changes in the body, especially in the viscera, including the heart, lungs, and intestines, to maintain or restore homeostatic balance. Feelings are generally not caused by emotions.

  • Drive Drive is an inherent program of action necessary for the satisfaction of basic and instinctive (biologically predetermined) physiological needs, e.g., hunger, thirst, libido, exploration, play, and bonding with a partner (Tyng et al. 2017); this is sometimes referred to as "homeostatic drive". In short, a crucial common feature of emotions, moods, feelings, affects, and drives is their inherent valence, which lies on the spectrum of positive and negative valence (pleasure-unpleasure/good-bad).

1.5 Contribution

The contributions of this paper are highlighted as follows:

  • Better accuracy in TBED Deep learning and transformer-based models give better performance in text-based emotion recognition than machine learning models. This paper presents a comprehensive survey of the study and application domains of TBED, while the SLR (Systematic Literature Review) diagram provides a comprehensive and transparent overview of the article.

  • Multimodal Incorporation Deep learning techniques allow textual data to be combined with video, audio, and various physiological signals for a better TBED system. With a particular focus on techniques, datasets, applications, key aspects of the machine learning and deep learning approaches, and potential research, we try to give an in-depth overview of TBED.

  • Existing pre-trained models and transfer learning Different pre-trained models, such as BERT and GPT, are used to fine-tune the emotion recognition task on large datasets. This paper therefore discusses the different pre-trained models in Sect. 11 and also reviews the different evaluation metrics, techniques, methods, approaches, and levels of analysis in TBED. We also summarize the key research findings and trends in the field, which will make it easier for readers to follow the track. Additionally, we provide a comprehensive description and summary of the different publicly available datasets that can support research in this domain.

  • Accountable and illustrating Deep learning models can be tailored to a specific domain, which gives more accurate results and captures subtle differences across languages and expressions of emotion. We also concentrate on exploring the role of TBED in different application areas and examining the function of TBED.

  • Circumstantial understanding Text-based emotion recognition depends on contextual information found in the text or sentences, and deep learning techniques can capture this context more precisely for better system results. The Transformer model, in particular, is capable of modeling the context in which a phrase or word is used. We have also discussed various challenges identified in existing studies and future directions for overcoming these challenges and limitations.

1.6 Inclusion and exclusion criteria

See Table 3.

Table 3 Inclusion and exclusion criteria

1.7 Data extraction

The data extracted on the basis of the above research questions are presented in Table 4.

Table 4 Data extraction

2 Emotion models

The existing emotion models can be divided into two classes: categorical and dimensional (Sailunaz et al. 2018; Calvo and Mac Kim 2013). Table 5 summarizes some basic emotion models frequently used in the literature, which together express almost all possible human emotions, as proposed by Ekman (1992), Shaver et al. (1987), Oatley and Johnson-Laird (1987), Plutchik (1980), and Lövheim (2012). The categorical emotion model focuses on distinct emotion labels and assumes that there are discrete emotion categories. Plutchik (1980) defines a set of eight basic bipolar emotions, a superset of Ekman's with two additions, TRUST and ANTICIPATION; these eight emotions are organized into four bipolar pairs: joy vs. sadness, anger vs. fear, trust vs. disgust, and surprise vs. anticipation. Ekman et al. (1999) concluded that the six basic emotions are anger, disgust, fear, happiness, sadness, and surprise. The dimensional model represents affects in dimensional form. According to Canales and Martínez-Barco (2014), emotions are distributed in a two-dimensional circular space defined by a valence dimension and an arousal dimension, as shown in Fig. 2. The valence dimension indicates how PLEASANT or UNPLEASANT an emotion is, while the arousal dimension differentiates ACTIVATION and DEACTIVATION states (Figs. 3, 4).
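
To make the distinction concrete, the sketch below contrasts the two model classes in Python: an emotion as a discrete label from Ekman's basic set versus an emotion as a point in valence-arousal space. The coordinates are illustrative placements on the circumplex, not values from any published lexicon.

```python
# Minimal sketch contrasting categorical and dimensional emotion models.
# The circumplex coordinates below are illustrative, not from a lexicon.

EKMAN_LABELS = {"anger", "disgust", "fear", "happiness", "sadness", "surprise"}

# Illustrative circumplex placements: valence and arousal in [-1, 1].
CIRCUMPLEX = {
    "happiness": (0.8, 0.5),    # pleasant, moderately activated
    "anger":     (-0.6, 0.8),   # unpleasant, highly activated
    "sadness":   (-0.7, -0.5),  # unpleasant, deactivated
    "fear":      (-0.8, 0.7),
}

def categorical(label: str) -> str:
    """Categorical model: an emotion is one discrete label."""
    if label not in EKMAN_LABELS:
        raise ValueError(f"{label!r} is not one of Ekman's six basic emotions")
    return label

def dimensional(label: str) -> tuple:
    """Dimensional model: an emotion is a (valence, arousal) coordinate."""
    return CIRCUMPLEX[label]

print(categorical("anger"))   # -> 'anger'
print(dimensional("anger"))   # -> (-0.6, 0.8)
```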

Table 5 Emotion models (Sailunaz et al. 2018)
Fig. 2 Circumplex model of affect (Sailunaz et al. 2018)

Fig. 3 Lövheim's emotion cube (Sailunaz et al. 2018)

Fig. 4 Plutchik's emotion wheel (Sailunaz et al. 2018)

3 Literature review

This section discusses various studies related to text-based emotion detection approaches and multiple studies on text mining techniques. Abdullah et al. (2018) focus on Arabic, noting that international politics and the global economy have motivated the investigation of emotion detection in Arabic text. The authors described SEDAT, which applies a CNN-LSTM model to Arabic tweets; SEDAT determines both the existence and the intensity of emotions in Arabic tweets, producing a real-valued score between 0 and 1. In their evaluation, they tried to gain insight into how SEDAT works in each language category by splitting the testing dataset into MSA (Modern Standard Arabic), Egyptian, and Gulf subsets; they found that the system performs best on MSA for the anger and fear emotions, on the Gulf dialect for sadness, and on the Egyptian dialect for joy. Jamal et al. (2021) proposed an IoT-based framework for emotion detection from tweets with a hybrid approach based on Term Frequency-Inverse Document Frequency (TF-IDF). In their work, they collect raw tweets and filter them using tokenization to retain important features without noise. Next, TF-IDF is applied to estimate the importance of features both locally and globally, and adaptive synthetic sampling (ADASYN) is applied to solve the problem of unbalanced classes among the different emotion classes. Finally, they developed a deep learning model to predict the emotions, with dynamic epoch curves. Murthy and Kumar (2021) discussed the different emotion models, their approaches (keyword-based, corpus-based, rule-based, machine learning, deep learning, and hybrid approaches), and the datasets used in emotion detection. They also discussed the different evaluation metrics used in emotion detection from text.
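
The staged pipeline described by Jamal et al. (2021) — tokenization, TF-IDF weighting, ADASYN rebalancing, then a classifier — can be sketched as follows. This is a minimal illustration using scikit-learn and imbalanced-learn, not the authors' implementation; the toy corpus and the synthetic feature matrix standing in for TF-IDF vectors are placeholders.

```python
# Minimal sketch of a tokenize -> TF-IDF -> ADASYN -> classifier pipeline,
# loosely following the stages described by Jamal et al. (2021).
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import ADASYN

# Stage 1: tokenization and TF-IDF weighting (local tf, global idf).
corpus = ["I am so happy today", "this is wonderful news",
          "I am furious about the delay", "what a sad day"]
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).shape)  # (4 documents, vocabulary size)

# Stage 2: ADASYN oversampling, demonstrated on synthetic numeric features
# that stand in for TF-IDF vectors, so the class imbalance is large enough
# for the resampler to have something to balance.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (90, 5)),   # 90 "joy" samples
               rng.normal(0.5, 1.0, (10, 5))])  # 10 "anger" samples
y = np.array(["joy"] * 90 + ["anger"] * 10)
X_res, y_res = ADASYN(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))         # classes roughly balanced

# Stage 3: train any classifier on the rebalanced features.
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```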

Kusal et al. (2022) discussed advances in TBED that provide automated solutions for various fields and presented a systematic review of the literature published between 2005 and 2021. The authors discussed the different classifiers used in TBED, namely machine learning classifiers (Decision Tree (DT), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Random Forest (RF), Naive Bayes (NB), Linear Regression (LR), and Multinomial Naive Bayes (MNB)) and deep learning classifiers (convolutional neural networks, recurrent neural networks, BERT, Bi-LSTM, GRU, and pre-trained models). Their discussion covers the introduction, evolution, significance, techniques, applications, datasets, research challenges, and future scope of TBED.

The study by Alswaidan and Menai (2020) shows that Chinese is the most dominant language after English in terms of emotion recognition in text. In addition, there are newly published research results for other languages, including Arabic, Hindi, Indonesian, and Spanish. For emotion recognition in English, some researchers used one or more existing corpora with emotion annotations—Alm, Aman, ISEAR, SemEval-2007, SemEval-2018, and SemEval-2019—to measure the performance of their models. They concluded their study by showing that hybrid approaches and learning-based approaches combining a traditional text representation with a distributed word representation outperform the other approaches on benchmark corpora. The paper also identifies the sets of features that lead to the best-performing approaches, highlights the impact of simple NLP tasks such as part-of-speech tagging and parsing on the performance of these approaches, and identifies some open issues.

Liu et al. (2022) organize their paper into an introduction and research objective (Sect. 1), recent research on emotion recognition and the various methods used (Sect. 2), the methods and evaluation results of their automatic emotion recognition system (Sects. 3 and 4), and the conclusions drawn from the research results together with future feature research (Sect. 5). El Hammoumi et al. (2018) discussed the important role of emotion detection in e-learning systems using a convolutional neural network (CNN); the authors proposed a system and tested it with students aged between 8 and 12 years.

Ullah et al. (2019) proposed a methodology named PDGDL (Program Dependence Graph with Deep Learning) to identify different programmers' source code by extracting each programmer's coding-style features. The authors collected data from Google Code Jam (GCJ) and analyzed the submissions of 1000 programmers in languages including C++ and Java. They conclude that their proposed model outperformed existing techniques in terms of classification accuracy, recall, precision, and F-measure. Bansal et al. (2021), Zhao and Wong (2023), and Wang et al. (2023) also discussed SA and generative modeling, as well as transformer-based and lexicon-based SA.

Since many researchers are studying emotion detection in text, Table 6 summarizes related research on text-based emotion detection using deep learning models. Feelings expressed on social media are measured for a variety of purposes, including diagnosing neurological problems and anxiety in individuals, or assessing public sentiment. Changing boundaries, language, and interpretation are major obstacles to automatic emotion recognition. To quantify the inclination of a message and assign it an emotional class, a soft classification approach has been proposed: a supervised emotional framework is created and tested for streaming text communication. Its two key activities are offline training and interactive classification. The first task is to create templates for describing emotions in text responses; the second is to implement a two-stage framework for monitoring emotions in live text transmissions. In addition, DL-EM uses an online mechanism to assess collective emotions and identify incidents in live text sources. A deep learning model includes various stages, such as preprocessing the text data, feature extraction from the text, feature selection from the extracted emotion feature set, and finally a deep classifier that recognizes the different emotions. In the following, Sect. 3.1 discusses the research gap, Sect. 3.2.1 discusses the key aspects of machine learning in emotion detection, including supervised, unsupervised, semi-supervised, and hybrid approaches, and Sect. 3.2.2 discusses the key aspects of deep learning techniques in emotion detection, including existing methods used in different research areas.

3.1 Research gap

  • Studies found that existing research aims mainly at emotion recognition in English text, which makes it hard to generalize across cultural factors and to low-resource languages such as Assamese. Deep learning techniques in NLP research need to be designed with linguistic differences, cultural norms and expressions, and body language in mind, so that emotions can be recognized precisely in cross-cultural and multilingual environments.

  • Human emotions are heterogeneous, complex, and span a wide range of different emotional states. In the existing literature, many emotion recognition models aim to detect emotion based only on Ekman's six basic emotion labels. Emotion recognition research needs to explore more emotion labels, such as envy, pride, fun, depression, shame, and gratitude. This also calls for models that can accurately distinguish such fine-grained differences in textual data for text-based emotion recognition.

  • A person's emotional state can be influenced over time by different experiences, surroundings, events, and interactions. Existing research on text-based emotion recognition pays little attention to the temporal evolution of emotions, focusing only on static representations of textual data. We therefore see a need for deep learning models that can accurately capture dynamic emotions in textual data.

  • A person's emotional state depends on various factors, such as their surroundings and circumstances, so it is very difficult to detect and elucidate emotions without a large amount of data; in other words, deep learning models are data-dependent. Research therefore needs deep learning models that can resolve ambiguity and grasp contextual expressions in text-based emotion recognition.

  • Assamese is a subject-object-verb (SOV) language. The existing literature contains some research work on SA in Assamese, and much work has been done on speech-based emotion recognition, but text-based emotion recognition is still at an early stage. The research therefore needs a deep learning approach/model to identify emotion from Assamese textual data accurately (Fig. 5).

    Fig. 5 Systematic literature review on emotion detection in text using deep learning techniques

    Table 6 Summary of related research on text-based emotion detection using deep learning models

3.2 Key aspects of machine learning and deep learning approaches in emotion detection

3.2.1 Machine learning

Machine learning is an important automatic classification approach in emotion detection and SA, performed using text features (Kohavi and Provost 1998). The authors noted that this approach deals with the construction of algorithms that can learn from data. In the machine learning approach, the whole dataset is divided into two parts: a training dataset and a testing dataset. The training dataset is the information used to train the model by supplying the characteristics of different instances of an item, and the testing dataset is then used to check how successfully the model has been trained (Nandwani and Verma 2021). The machine learning approach is generally known as data-driven, since it works on labeled datasets (Anantrasirichai and Bull 2022). Dangi et al. (2021) discussed various machine learning techniques on COVID-19 tweets; out of ten popular machine learning techniques, they achieved the highest accuracy of 96.06% with LSVM (Linear Support Vector Machine). The authors also reported that SGD (Stochastic Gradient Descent) obtained better results on all computational parameters.
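
A minimal sketch of this train/test protocol is given below, assuming scikit-learn and a hypothetical six-sentence corpus; it illustrates the general machine learning approach rather than the pipeline of any paper surveyed here.

```python
# Minimal sketch: split a labeled corpus into training and testing portions,
# learn from the training data, and measure generalization on held-out data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["I love this", "what a great day", "this is awful",
         "I hate waiting", "so happy right now", "terribly disappointed"]
labels = ["positive", "positive", "negative",
          "negative", "positive", "negative"]

# Training dataset teaches the model; testing dataset checks how well it
# learned, on examples it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels)

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)          # learn from the training portion
print(model.score(X_test, y_test))   # accuracy on the unseen test portion
```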

Supervised, unsupervised, semi-supervised, and reinforcement learning are the four different types of machine learning. Dangi et al. (2022b) proposed a model named SATD (Sentiment Analysis of Twitter social media Data) based on five ML models: Random Forest, Multinomial NB classifier, Logistic Regression, Decision Tree, and Support Vector Machine. After implementing the model, the authors achieved accuracies of 0.96, 0.97, 0.97, 0.98, and 0.98 with Logistic Regression, SVM, the Random Forest classifier, Decision Tree, and the Multinomial NB classifier, respectively. Dangi et al. (2023) proposed a framework, named ARO-RRVFLN (Artificial Rabbits Optimized Robust Random Vector Functional Link Network), for sentiment evaluation of Twitter text, detecting positive and negative sentiments in different real-time datasets. They used the news headline sentiment analysis, Sentiment140, and 4000 short stories datasets for their experiments.

  1. Supervised approach In the supervised approach, labeled datasets are used for training and testing the supervised classifier. The supervised learning algorithm examines the training data and infers a function, which can then be used for mapping new examples (Mohri et al. 2018). Since a labeled dataset with emotion tags is necessary, the annotation process is considered one of the disadvantages, as it is a time-consuming and monotonous task (Canales and Martínez-Barco 2014). To check the polarity of emotions, i.e., whether they are positive, negative, or neutral, root words are initially used to classify the emotions/sentiments of a text. Since texts may come from different domains at different levels, each domain may require a different classifier with different features. There are many learning algorithms in the supervised approach:

    • Linear classifier A linear classifier performs its classification based on a linear combination of the feature values. Let W = (w1, w2, w3, ...) be the vector of word frequencies, C = (c1, c2, c3, ...) the vector of linear coefficients, and S a scalar; the output of the linear predictor is then LP = W · C + S, which defines a hyperplane separating the two classes. SVMs and neural networks fall under the linear classifier category. The Support Vector Machine is an important supervised learning method used for classification tasks: it is a non-probabilistic classifier that focuses on determining the best linear separator for classification (Kaur et al. 2017; Rui et al. 2013). For a given set of training data in which each example is labeled as belonging to one of the classes, the SVM algorithm builds a model that allocates new data to one of the classes, using a hyperplane to separate them. The neuron is the basic element of a neural network, which has three layers: an input layer, a hidden layer, and an output layer. Bhatti et al. (2004) discussed a neural network approach for human emotion recognition in which they introduced a feature selection method applying SFS (Sequential Forward Selection) with a GRNN (General Regression Neural Network) together with a consistency-based selection method. Dangi et al. (2022a) proposed a model known as COA (Chaotic Coyote Optimization Algorithm), based on a time-weighted AdaBoost support vector machine, which was used to perform the classification precisely.

    • Decision tree classifier The decision tree classifier divides the data using conditions (Allouch et al. 2018). It has a tree-like structure made up of nodes and branches, in which each node denotes an attribute, each branch denotes a rule or test on an attribute, and each leaf node denotes a class label or output (Hasan et al. 2019). There are three steps in DT learning: feature selection, decision tree generation, and pruning (Ghanbari-Adivi and Mosleh 2019). The crucial factors in using a decision tree are deciding which attributes to employ, what criteria to use for splitting, and when to terminate; during this process, all attributes and different split points are examined and explored. Pruning is a process by which a tree can boost its performance: less important elements are removed to reduce complexity, increasing predictive power by reducing overfitting (Kusal et al. 2023). Kaur and Vashisht (2012) discussed emotional variation in adolescence through data mining using the decision tree technique.

    • Rule-based classifier In a rule-based classifier, the data space is modeled by a set of rules: the left-hand side presents a condition on the feature set expressed in disjunctive normal form, and the right-hand side presents the class label. Pradhan et al. (2016) and Gao et al. (2015) proposed rule-based methods to detect emotion-cause components in Chinese micro-blogs, in which the emotion lexicon can be constructed from the corpus automatically or manually.

    • Probabilistic classifier A probabilistic classifier predicts a probability distribution over a set of classes for the given input data: not only is the most likely class returned, but also the probability for every class (Kaur et al. 2017). The Naive Bayes classifier, Bayesian network, and Maximum Entropy classifier fall under this category.

      (a) Naive Bayes classifier Based on Bayes' theorem, the Naive Bayes classifier predicts the probability that a given set of features belongs to a particular label; a bag-of-words (BOW) representation, which ignores the position of words in the document, is used to extract the features (Kaur et al. 2017; Kang et al. 2012). All features are assumed independent in this model. Kang et al. (2012) discussed senti-lexicons and an improved Naive Bayes classifier for SA of restaurant reviews; their experiment showed that when unigram + bigram features were used with this algorithm, the gap between the positive and negative accuracy was reduced to 3.6% compared to the original Naive Bayes. Jeyapriya and Selvi (2015) proposed a system that extracts aspects from product customer reviews, extracting the nouns and noun phrases from each review sentence; to determine whether a sentence expressed a positive or negative opinion, and to count such opinions, their NB classifier used a supervised term-counting-based approach.

      (b) Bayesian network The Bayesian network presumes that the features are fully dependent, resulting in a directed acyclic graph whose edges represent conditional dependencies and whose nodes represent random variables. In text classification this model is considered very expensive, and therefore it is not frequently used (Kaur et al. 2017; Pradhan et al. 2016).

      (c) Maximum entropy Maximum entropy is a probabilistic technique widely used in text segmentation, POS tagging, and language modeling; it prefers uniform distributions among the models that satisfy the given constraints (Goyal and Tiwari 2017). Maximum entropy converts the labeled feature sets to vectors using encoding; the Maximum Entropy classifier is also known as the conditional exponential classifier. Duric and Song (2012) explain that the encoded vector is used to calculate feature weights, which can then be used to predict the label for each feature set.

  2. Unsupervised approach As the name suggests, the data used in the unsupervised approach are not labeled with classes. This kind of classifier works with several root words for each emotion label, which can be cross-referenced against the sentences so that the sentences are classified into the corresponding emotions. Sailunaz et al. (2018) claimed that supervised classification achieves higher accuracy than unsupervised classification, but unsupervised learning generalizes better. Agrawal and An (2012) proposed a novel unsupervised context-based approach to detecting emotions from text at the sentence level; they considered Ekman's basic emotions, and their methodology does not depend on any existing manually crafted affect lexicon such as WordNet-Affect. Goyal and Tiwari (2017) discussed different unsupervised algorithms, such as the Pointwise Mutual Information (PMI) algorithm, the Latent Dirichlet Allocation algorithm, and the Random Walk algorithm. They mentioned that the PMI algorithm was used to automate a system that finds consecutive words and their semantic polarity, using emoticons such as "thumbs up" to represent positive opinions and "thumbs down" to represent negative ones (a minimal PMI sketch appears after this list). The Latent Dirichlet Allocation algorithm is domain-independent and lexicon-based, offering a way of automatically discovering the topics that sentences contain. The Random Walk algorithm automatically constructs a domain-oriented sentiment lexicon by simulating a random walk on a graph that reflects four kinds of relationships: relationships between words, relationships between documents, relationships from words to documents, and relationships from documents to words.

  3. Semi-supervised approach Since the supervised approach needs a trained classifier and incurs high domain cost, the semi-supervised approach was introduced (Go et al. 2009; Goyal and Tiwari 2017). Go et al. (2009) proposed a distant-supervision approach that makes use of auto-generated training datasets, where emoticons were used to tag tweets as negative or positive.

  4. Hybrid approach The lexicon-based approach has a problem of low recall, and ML techniques have a problem of domain dependence; to avoid both problems, the hybrid approach was introduced (Goyal and Tiwari 2017). The combination of machine learning and rule construction into a unified model is called a hybrid model, and the hybrid approach has a higher probability of exceeding the two approaches separately. Acheampong et al. (2020) observed in their survey that, in recent advancements, hybrid models implementing unsupervised deep learning methods together with standard rules performed better.

3.2.2 Deep learning

In recent NLP research, deep learning algorithms have proven more influential than other traditional approaches for emotion detection and SA. Deep learning algorithms can extract emotions or sentiments from text streams without feature engineering. There are many deep learning architectures, namely RNN, CNN, LSTM, BiLSTM, GRU, and Transformer models (Nandwani and Verma 2021). Since deep learning models derive features and patterns by themselves, they spare people from creating features manually from text streams.
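
As a concrete illustration of these architectures, the sketch below defines a small BiLSTM emotion classifier in PyTorch: token ids are embedded, passed through a bidirectional LSTM, and the final forward and backward hidden states are concatenated and mapped to class logits. The vocabulary size, dimensions, and six-class label set are placeholder assumptions, not settings from any surveyed paper.

```python
# Minimal BiLSTM emotion classifier sketch (placeholder hyperparameters).
import torch
import torch.nn as nn

class BiLSTMEmotion(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100,
                 hidden_dim=128, num_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):             # (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # hidden: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        sentence = torch.cat([hidden[0], hidden[1]], dim=-1)
        return self.classifier(sentence)      # (batch, num_classes)

model = BiLSTMEmotion()
fake_batch = torch.randint(1, 10_000, (4, 20))  # 4 sentences of 20 tokens
print(model(fake_batch).shape)                  # torch.Size([4, 6])
```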

Jian et al. (2010) stated that the goal of SA is to recognize automatically whether a text is positive or negative. The authors used an individual model (i-model) based on an artificial neural network to identify sentiment in text, in which the model consists of sentiment features, feature weights, and prior knowledge. For their experiment, the authors used the Cornell movie review data. They concluded that the accuracy of the i-model was remarkable compared to SVM and hidden Markov models.

Arora and Kansal (2019) discussed text normalization of unstructured data for Twitter sentiment analysis using a deep CNN. They proposed a model named "Conv-char-Emb" that can handle noisy data while using less memory space for embedding; the deep CNN is used for embedding since it needs fewer parameters for feature representation. Pasupa and Ayutthaya (2019) implemented and compared different deep learning models (CNN, LSTM, and Bi-LSTM) on a Thai children's tales dataset. The authors noted that word embedding helps a model understand the semantics of each word, sentic features help it understand the sentiment of a word, and POS-tagging helps it identify the parts of speech in a sentence. The CNN model with all three features (POS-tagging, Thai2Vec, and Sentic) achieved an F1 score of 0.817 at p < 0.01, which was comparatively the best result.

Dashtipour et al. (2020) proposed a hybrid model to perform sentiment analysis in the Persian language. Alqaryouti et al. (2024) proposed a hybrid model combining a rule-based approach and domain lexicons for SA on government review data; they concluded that their model exceeded other lexicon-based models by 5%, and their results showed that aspect-extraction accuracy improved remarkably when implicit aspects were considered. Li et al. (2020b) discussed a Transformer-based context- and speaker-sensitive model for emotion detection in conversations. Their proposed model, HiTrans, consists of two hierarchical transformers, in which low-level transformers (BERT) generate representations that are fed into a high-level transformer. After experimenting on three benchmark datasets (EmoryNLP, MELD, and IEMOCAP), the authors showed that HiTrans outperforms previous state-of-the-art models.
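
For the transformer-based line of work, a minimal fine-tuning skeleton using the Hugging Face transformers API is sketched below. It shows the generic recipe (a pre-trained encoder plus a classification head, one gradient step with AdamW); the label indices and training details are placeholders and do not reproduce HiTrans or any other surveyed model.

```python
# Minimal BERT fine-tuning skeleton for emotion classification (placeholders).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6)   # e.g., six basic emotion classes

texts = ["I can't believe we won!", "Leave me alone."]
labels = torch.tensor([1, 0])            # hypothetical indices: 1=joy, 0=anger

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # forward pass returns loss and logits

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs.loss.backward()                  # one fine-tuning gradient step
optimizer.step()
print(outputs.logits.shape)              # torch.Size([2, 6])
```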

Potamias et al. (2020) proposed a neural methodology, a recently proposed transformer-based network architecture amplified with a recurrent CNN (RCNN-RoBERTa); their methodology achieves state-of-the-art performance on all benchmark datasets. Bansal et al. (2021) proposed a framework that used a deep learning-based classification model to separate situational tweets from others; the authors noted that non-situational tweets contain grief, sorrow, anger, etc. They also solved an informative-tweet selection task and sentiment classification concurrently using multi-task learning (MTL) in a deep learning framework (Fig. 6).

Fig. 6 Distribution of private and public datasets in TBED

4 Existing surveys on text-based emotion detection

Emotion detection is a concept encompassing many tasks, such as emotion extraction, emotion classification, emotion summarization, review analysis, depression detection, and anger detection. In the existing surveys, researchers mainly conducted specific analyses of the tasks, technologies, methods, approaches, and application fields involved in the TBED process.

4.1 Surveys on contents and topics of emotion detection

When research on emotion detection was still in its infancy, the contents and topics of surveys mainly focused on emotion detection tasks, review analysis, and application areas.

Murthy and Kumar (2021) reviewed the emotion models of the categorical and dimensional approaches, discussing Russell's circumplex model, Plutchik's wheel of emotion, and the emotion space (valence, arousal, and power). From Parrott's emotion taxonomy, they considered the primary emotions love, joy, surprise, sadness, and anger; the secondary emotions affection, lust, cheerfulness, zest, contentment, pride, optimism, surprise, irritation, sadness, disappointment, shame, exasperation, rage, disgust, and suffering; and the tertiary emotions as well. The authors also collected corpora and lexicons used in research works related to the detection of emotion from text. In their review paper, they discussed the different approaches proposed in the literature for identifying emotions in text: the keyword-based, corpus-based, rule-based, machine learning, and deep learning approaches. They also covered the different evaluation metrics used to compare models statistically.

Sailunaz et al. (2018) focus their review on emotion detection from both text and speech. They discuss keyword-based, lexicon-based, machine learning, and hybrid methods. They also discuss the datasets used in text-based emotion detection, but mention only five datasets, of which only three are public (ISEAR, EmoBank, and the Emotion in Text dataset) and the others are private. They also cover some of the classification models used in emotion detection: SVM (Support Vector Machine), HMM (Hidden Markov Model), GMM (Gaussian Mixture Model), hybrid systems, and ensemble systems. The authors conclude their study by summarizing the basic achievements in the field and highlighting possible extensions for better outcomes.

Goyal and Tiwari (2017) addressed emotion recognition from structured and unstructured text. The authors discuss the tasks useful for recognizing emotions, namely the extraction of syntactic and semantic features, feature selection, and the linguistic resources available. They also discuss the dictionary-based, corpus-based, machine learning, supervised, semi-supervised, unsupervised, and hybrid approaches.

Acheampong et al. (2020) discussed some recent state-of-the-art work in the emotion detection field, where the proposals are examined in relation to their major contributions, approaches employed, datasets used, results obtained, strengths, and weaknesses. The authors also discussed the datasets and their features; some of the datasets mentioned are ISEAR, SemEval, EmoBank, the WASSA-2017 Emotion Intensities (EmoInt) data, Cecilia Ovesdotter Alm's affect data, DailyDialog, Aman's emotion dataset, the grounded-emotions data, the Emotion Stimulus data, a crowdsourced dataset, the MELD dataset, EmotionLines, and the SMILE dataset. Along with strengths and weaknesses, the three main methods used in the design of TBED systems are clarified. They concluded their study by outlining current issues and potential research directions for text-based emotion detection.

Deng and Ren (2021) reviewed the field at the word embedding, architecture, and training levels respectively. The authors also discussed challenges and opportunities from four angles: the shortage of large-scale, high-quality datasets; fuzzy emotion boundaries; incomplete extractable emotional information in texts; and textual emotion recognition in dialogue. Their review gives a systematic and in-depth overview of deep textual emotion recognition technologies, providing sufficient knowledge and new insights for researchers to better understand the state of research, the remaining challenges, and future directions in textual emotion recognition.

Rabeya et al. (2017) presented the results of an experimental study on emotion detection from text-based data. The authors observe that, in lexicon-based analysis, the emotion lexicons determine the detected emotional state. The main goal of their study is to extract emotions from Bengali text at the sentence level, and they present an emotion detection model that detects emotion on the basis of the sentiment of each connected sentence. To exploit how frequently people express their emotion in the last or final sentence, a lexicon-based backtracking approach was developed, achieving 77.16% accuracy. Nandwani and Verma (2021) explain the various emotion models and the process of TBED; the difficulties encountered during emotion analysis are also discussed in their paper.
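
A minimal sketch of sentence-level, lexicon-based emotion detection in the spirit of such approaches is shown below. The tiny lexicon is hypothetical; real systems use resources such as WordNet-Affect or the NRC emotion lexicon, and Rabeya et al.'s backtracking step over connected sentences is not reproduced here.

```python
# Minimal sketch of lexicon-based, sentence-level emotion detection.
from collections import Counter

# Hypothetical toy lexicon mapping words to emotion labels.
LEXICON = {
    "happy": "joy", "glad": "joy", "smile": "joy",
    "cry": "sadness", "lonely": "sadness",
    "furious": "anger", "hate": "anger",
}

def detect_emotion(sentence: str) -> str:
    """Label a sentence by the most frequent emotion found in the lexicon."""
    hits = Counter(LEXICON[w] for w in sentence.lower().split()
                   if w in LEXICON)
    return hits.most_common(1)[0][0] if hits else "neutral"

print(detect_emotion("I am so happy I could smile all day"))  # joy
print(detect_emotion("He was furious and full of hate"))      # anger
```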

4.2 Surveys on methods of emotion detection

Goyal and Tiwari (2017) reviewed the work done on emotion detection in text and argued that, although many techniques, methodologies, and models have been created to detect emotions in text, various factors make these methods insufficient. The authors reviewed the resources used for detecting emotions in text, which are labeled text, emotion lexicons, and word embeddings, and discussed both supervised and unsupervised methodologies. Among the open problems they discussed are the complex nature of emotion detection, the shortage of quality data, and the inefficiency of current models. They concluded that the large amount of textual data produced with the rise of social media in the past couple of years, and hence the availability of vast self-expression text about any major or minor event, idea, or product, points to great potential for changing how entities and organizations shape their future decision-making processes.

Sailunaz et al. (2018) examine the features, drawbacks, and potential future directions of current emotion detection research initiatives, emotion models, datasets, and techniques. They also discuss emotion intensity detection, sarcasm detection, multiple-emotion detection, emotion cause detection, personality or mood detection, and emotion versus individual or social parameters. The authors concentrate on reviewing studies of text-based emotion detection, looking into the different feature sets employed in pre-existing methodologies. They also review the fundamental accomplishments in the area and point out potential improvements.

Al-Omari et al. (2020) proposed a system to detect emotions using deep learning approaches, whose main input is a combination of GloVe word embeddings, BERT embeddings, and a set of psycholinguistic features (e.g., from the AffectiveTweets Weka package). Their approach processes English textual dialogue and classifies it into four categories: Happy, Sad, Angry, and Other. They used the public data of SemEval-2019, pre-processed the data, and in a second step extracted the feature vectors. The network architecture was built by ensembling different sub-models, namely EmoDense, EmoDet-BERT-BiLSTM, and EmoDet-BiLSTM. The proposed system combined a fully connected neural network architecture and a BiLSTM neural network, achieving results (F1 score 0.748) that show substantial improvements over the baseline model provided by the SemEval-2019 Task 3 organizers (F1 score 0.58).

4.3 Overall survey methodology

With the increase in the popularity of emotion detection research, more and more research results have accumulated. To perform literature reviews, researchers need to systematically organize and analyze these results, narrowing a large body of publications down to a small number of papers, and they have used different survey methodologies to do so.

Cui et al. (2023) divided the survey process into six stages: research question definition, search strategy formulation, inclusion and exclusion criteria definition, quality assessment, data extraction, and data synthesis. Using this standard process, the authors can eliminate a large number of retrieved papers and finally conduct further analysis and research on a small number of papers.

Yamin et al. (2023) presented a systematic literature review (SLR) of emotion recognition systems in Malaysia, where the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) process consists of three stages: identification, screening, and retrieval. The authors analyzed the results by document type, year of publication, field of research, and citation count.

Kusal et al. (2023) methodically reviewed the literature on TBED (Text-Based Emotion Detection) published between 2005 and 2021. Four main research questions were addressed in their review, which carefully examined 63 research papers from the IEEE, ScienceDirect, Scopus, and Web of Science databases. It also examines and highlights the various TBED applications in different research fields, and summarizes the various emotion models, methodologies, feature extraction techniques, datasets, and potential research directions.

4.4 Summary of advantages and disadvantages of the existing surveys

In the following, we discuss the advantages and disadvantages of the existing surveys from a number of different points of view.

4.4.1 From the point of view of the contents and topics of emotion detection

As summarized in Table 7, researchers have organized the literature and conducted in-depth investigations of the contents and topics of emotion detection. They reviewed the tasks of emotion detection analysis (e.g., emotion extraction, emotion detection from suicide notes, emotion detection from COVID-19-related tweets, depression detection, anger detection, and IMDb review analysis), the application areas of emotion detection (e.g., medical diagnosis, psychology, education, and employee safety), and emotion detection in different languages, such as Chinese, Arabic, Roman Urdu, German, Spanish, Tamil, Hinglish (Hindi + English), and Bangla (Nandwani and Verma 2021; Majeed et al. 2022; Tejaswini et al. 2022; Rabeya et al. 2017; Alwehaibi et al. 2022; Seyeditabari et al. 2018; Mao et al. 2022; Del Vigna et al. 2017; Xu et al. 2019a; Saffar et al. 2023; Sosea and Caragea 2020; Sadeghi et al. 2021; Bashir et al. 2023; Buechel et al. 2020). Since emotion detection is an emerging topic, knowledge is gradually being integrated into it. Because we cover a short time period of existing surveys, there have been few survey works analyzing the connection between topics and methods (Rabeya et al. 2017).

Table 7 Summary of existing deep learning methods in the context of emotion detection, including methods, learning algorithm, datasets, key features, feature generation, model parameters, results, limitations, future work and application domain

4.4.2 From the point of view of the methods of emotion detection

Some researchers reviewed different techniques and methods of emotion detection across different tasks and areas. They analyzed and discussed emotion detection methods based on lexicons, rules, and supervised, semi-supervised, and unsupervised machine learning, as well as deep learning methods such as LSTM, Bi-LSTM, BERT, ALBERT, RoBERTa, MLP, and GRU, and other hybrid approaches (Sailunaz et al. 2018; Karna et al. 2020; Goyal and Tiwari 2017; Acheampong et al. 2020; Al-Omari et al. 2020; Yenter and Verma 2017; Cahyani et al. 2022; Sasidhar et al. 2020; Nugroho and Bachtiar 2021; Hassan et al. 2021; Samy et al. 2018; Majeed et al. 2020; Eckel 2009; Polignano et al. 2019). These researchers also mention the advantages and limitations of the methods. As summarized in Table 7, the existing surveys analyze the techniques and methods of emotion detection, which provides good insights into TBED.

4.4.3 From the point of view of the overall survey methodology

The survey methods used mainly include the content analysis method and the systematic literature review (SLR) method. Cui et al. (2023), Yamin et al. (2023), and Kusal et al. (2023) used the SLR method to analyze content effectively. However, informatics methods have not been used to study the evolution of research methodologies and emotion detection over time, and few surveys use community detection and keyword co-occurrence analysis to examine the relationship between research methods and topics and how they have changed over time. To fill these gaps, this paper surveys the research methods and topics and their evolution over time, tracing the techniques, subjects, and development of emotion detection in this field from 2017 to 2023 by combining keyword co-occurrence analysis with informatics analysis tools (Fig. 7).

Fig. 7 Three stages of text data preprocessing

5 Document type

The data were classified based on three categories: journal articles, conference proceedings articles, and dissertations. These kinds of research works undergo a quality-control process that normally occurs prior to publication, in which subject-matter experts evaluate the work (Alwehaibi et al. 2022). We collected papers from both journals and conference proceedings, where journal papers tend to have a better structure than conference papers (Yamin et al. 2023; Eckel 2009). Figure 8 shows the document types, numbers of publications, and percentages. The majority of the publications are journal articles (117 publications, or 51.76%), while 95 publications (42.03%) are conference proceedings articles, 13 (5.75%) are workshop articles, and 1 publication (0.44%) is from the project category, suggesting potential for postgraduate research in this area (Table 8).

Table 8 Surveys with their advantages and disadvantages
Fig. 8 Document type for selected studies (journal, conference, workshop, and project)

5.1 Paper and publications

We collected many papers that discuss emotion detection from text. Among them, we selected 226 papers published between 2013 and 2023. Figure 13 graphs the development of papers and publications from year to year over these 10 years; the graph shows that research on emotion detection is still relevant.

6 Methodology

The process of emotion recognition involves various stages, such as collecting data, preprocessing text, developing feature extraction models, and evaluating. Data collection is the first step: a labeled dataset of emotional expressions is collected. This dataset should include examples of different emotions such as joy, sadness, and anger, and can consist of pictures, videos, or audio recordings, depending on the modality of the input.

  • Preprocessing The collected data must be pre-processed to ensure consistency and compatibility with deep learning models. Preprocessing steps may include resizing images, normalizing pixel values, extracting audio features, or other relevant steps specific to the data type.

  • Model architecture The next step is to design the architecture of the deep learning model. Convolutional Neural Networks (CNNs) are commonly used for image-based emotion recognition, while Recurrent Neural Networks (RNNs) or variants such as Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU) are suitable for sequential data such as audio or video. These models can capture spatial and temporal dependencies in the data.

  • Training Once the model architecture is defined, the next step is to train the model on the labeled dataset. This involves presenting the input data to the model and adjusting its internal parameters (weights and biases) to minimize the difference between the predicted emotions and the actual labels. Training typically uses optimization algorithms such as stochastic gradient descent (SGD) or its variants.

  • Data augmentation Data augmentation techniques can be used to improve the generalization ability of the model and reduce overfitting. In these techniques, additional training samples are created by applying transformations such as rotations, translations, or adding noise to the input data.

  • Evaluation of the model After training, the model must be evaluated on a separate validation dataset. Common evaluation metrics for emotion recognition include accuracy, precision, recall, and F1 score. These metrics measure how well the model performs in classifying emotions.

  • Hyperparameter tuning Deep learning models often have several hyperparameters that must be tuned to achieve optimal performance, including the learning rate, batch size, number of layers, and number of hidden units. Hyperparameter tuning is usually done using techniques such as grid search or random search.

  • Deployment Once the model is trained and evaluated, it can be used for emotion recognition tasks by integrating it into a larger system or application, such as a real-time emotion recognition system in video surveillance or a chatbot that responds based on user emotions (Fig. 9); a minimal end-to-end sketch of these stages follows the figure.

Fig. 9 Basic steps to perform emotion detection in text
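The stages above can be condensed into a short script. The following is a minimal sketch in Keras, not the method of any surveyed paper; the tiny inline dataset, label mapping, and all hyperparameters are illustrative placeholders.

```python
import numpy as np
import tensorflow as tf

# Data collection: a toy placeholder corpus with integer emotion labels
texts = np.array(["i am so happy today", "this is terrifying", "i feel very sad"])
labels = np.array([0, 1, 2])  # 0 = joy, 1 = fear, 2 = sadness

# Preprocessing: build a vocabulary and turn each text into padded indices
vectorizer = tf.keras.layers.TextVectorization(max_tokens=5000,
                                               output_sequence_length=20)
vectorizer.adapt(texts)
x = vectorizer(texts)

# Model architecture: embedding + LSTM for sequential text data
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=32),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(3, activation="softmax"),  # one unit per emotion
])

# Training with an SGD-family optimizer, as described above
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, labels, epochs=5, verbose=0)

# Evaluation (on the training data here, purely for illustration)
print(model.evaluate(x, labels, verbose=0))
```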

6.1 Dataset

Different existing and customized datasets are used in TBED depending on the type of experiment (Sailunaz and Alhajj 2019). Text-based emotion recognition with deep learning is data-dependent and relies on large amounts of data. Datasets available to the general public, for example through Google Cloud's public dataset program, are called public datasets, while customized datasets are called private datasets. This section discusses publicly available datasets used in emotion detection and sentiment analysis, such as ISEAR, WASSA, and Emotion-Stimulus (having text and emotion as attributes), SemEval, and the Stanford Sentiment Treebank (SST). Researchers collect data mainly from tweets, Facebook posts, Instagram, reviews, feedback, etc. Data collected from social media or e-commerce sites are usually unstructured, so they must be processed into a structured form. The EmoBank dataset, collected from news, blogs, letters, etc., uses the dimensional valence-arousal-dominance (VAD) model (Nandwani and Verma 2021). Table 10 lists some of the datasets, their sizes, emotion labels, and domains used in emotion detection from text. The distribution of private and public datasets in text-based emotion detection is shown in Fig. 6.

6.1.1 Datasets that are used in existing studies on text-based emotion recognition

  1. EmoBank The EmoBank dataset is collected from essays, news headlines, blogs, fiction, letters, and travel guides and has six emotion labels: anger, disgust, fear, joy, sadness, and surprise. It contains 10,548 sentences (Buechel et al. 2020).

  2. SemEval-17 This dataset is collected from Twitter, news headlines, Google News, and other major newspapers and has six emotion labels: anger, sadness, disgust, joy, fear, and surprise. It contains a total of 1250 sentences (Ahmad et al. 2020).

  3. CrowdFlower This dataset was collected from Twitter posts and covers various emotions: boredom, rage, anger, empty, fun, enthusiasm, happiness, love, hate, neutral, sadness, relief, worry, and surprise. It contains 40,000 sentences (Seyeditabari et al. 2019).

  4. MELD This dataset is collected from television show dialogues and covers anger, disgust, fear, joy, sadness, surprise, and neutral. It contains 13,000 sentences (Poria et al. 2018; Sharma et al. 2023; Ghosh et al. 2022; Kharat et al. 2021; Chudasama et al. 2022; Mozhdehi and Moghadam 2023).

  5. IMDB This dataset is collected from movie reviews and has two emotion labels, positive and negative. It contains a total of 25,000 sentences (Sikhi et al. 2022).

  6. GoEmotions This dataset is collected from Reddit comments and has emotion labels such as admiration, amusement, approval, annoyance, anger, curiosity, confusion, caring, desire, disgust, disapproval, disappointment, embarrassment, excitement, fear, joy, grief, and gratitude. It contains a total of 58,000 sentences (Demszky et al. 2020; Alvarez-Gonzalez et al. 2021; Ngo et al. 2022; Kane et al. 2022; Kamath et al. 2022; Mahima et al. 2021).

  7. ISEAR The ISEAR (International Survey on Emotion Antecedents and Reactions) dataset was constructed by the Swiss National Centre of Competence in Research, led by Scherer and Wallbott. It has seven emotion labels (joy, sadness, fear, anger, guilt, disgust, and shame), was collected from cross-cultural studies, and contains a total of 7665 sentences (Seal et al. 2020; Mac Kim et al. 2010; Adoma et al. 2020b; Poria et al. 2013; Cortiz 2021; Danisman and Alpkocak 2008; Acheampong et al. 2021a).

  8. Cecilia Ovesdotter Alm's Affect data This dataset is collected from stories and covers fear, anger, disgust, sadness, happiness, and surprise. It contains 15,000 sentences (Alm et al. 2005; Agrawal and An 2012).

  9. Aman emotion dataset This dataset is collected from blogs and has the labels no emotion, mixed emotion, anger, disgust, fear, joy, sadness, and surprise. It contains a total of 1466 sentences (Aman and Szpakowicz 2007; Hosseini 2017).

  10. SMILE This dataset is collected from Twitter and has the labels anger, disgust, happiness, surprise, and sadness. It contains a total of 3085 sentences (Ebling et al. 2018).

  11. DailyDialog This dataset is collected from everyday conversations and has the labels neutral, anger, disgust, fear, happiness, sadness, and surprise. It contains 13,118 sentences (Li et al. 2017; Dave and Khare 2021; Lee et al. 2023; Liu et al. 2023).

  12. NLPCC This dataset is collected from Zhihu, a Chinese question-answering forum, and has the labels anger, fear, sadness, surprise, and happiness. It contains a total of 7928 sentences (Qiu et al. 2016).

  13. EmoInt The EmoInt (WASSA-2017 Emotion Intensities) dataset is collected from Twitter and has the labels anger, fear, joy, and sadness. It contains a total of 7100 sentences (Köper et al. 2017; Lakomkin et al. 2018; Madisetty and Desarkar 2017; Kulshreshtha et al. 2018).

  14. Ren-CECps This dataset is collected from blog posts and has the labels anger, joy, surprise, anxiety, hate, love, sorrow, and expect, with sentiment intensity (Quan and Ren 2010b, 2009). It contains a total of 1487 sentences.

  15. EEC This dataset has the labels anger, fear, sadness, surprise, happiness, disgust, love, optimism, pessimism, trust, and anticipation, with sentiment intensity. It contains a total of 8640 sentences (Ayub and Wagner 2021).

  16. Emotic The Emotic (Emotions in Context) dataset (Kosti et al. 2017) is collected from people in real environments and is annotated with continuous dimensions such as arousal and dominance. It contains 18,313 images with 23,788 annotated people.

  17. EmoryNLP This dataset is collected from 97 episodes, 897 scenes, and their utterances, and has the labels joy, sad, neutral, mad, scared, powerful, and peaceful. It contains a total of 12,600 sentences (He et al. 2021).

  18. SEMAINE This dataset (Poria et al. 2019; McKeown et al. 2010) is collected from dialogues and uses the character labels Poppy (happy), Obadiah (gloomy), Spike (angry), and Prudence (pragmatic). It contains a total of 240 sentences.

  19. IEMOCAP This dataset (Tripathi et al. 2018; Yu and Kim 2020; Tripathi and Beigi 2018) is collected from different speakers and has the labels anger, disgust, sadness, surprise, happiness, excitement, frustration, other, and neutral, along with dimensional annotations. It contains a total of 5500 sentences.

  20. SST The SST (Stanford Sentiment Treebank) dataset (Nandwani and Verma 2021; Jia et al. 2020) is collected from movie reviews and has the labels positive, very positive, negative, very negative, and neutral. It contains a total of 11,855 sentences.

  21. SemEval This dataset is collected from Twitter, news headlines, Google News, and other major newspapers, and has the labels anger, sadness, disgust, joy, fear, and surprise. It contains a total of 1250 sentences (Chatterjee et al. 2019b; Abdullah and Shaikh 2018; George et al. 2018).

  22. Thai fairy tales This dataset (Charoensuk 2018) is collected from children's tales and has three labels: positive, negative, and neutral. It contains a total of 1964 sentences.

  23. SS-Tweet This dataset (Ahuja et al. 2019; Zhang et al. 2018b; Lynch et al. 2020) is collected from Twitter and has two labels, positive strength and negative strength. It contains a total of 4242 sentences.

  24. EmoTex This dataset (Hasan et al. 2014, 2019) is collected from Twitter and contains 134,100 sentences.

  25. Text Affect This dataset is collected from Google News and has the labels anger, sadness, joy, disgust, surprise, and fear. It contains a total of 1250 sentences (Canales and Martínez-Barco 2014; Acheampong et al. 2020; Seyeditabari et al. 2018).

  26. Neviarouskaya dataset This dataset (Neviarouskaya et al. 2010; Neviarouskaya and Aono 2013) is collected from blogs and stories and has the labels interest, joy, surprise, sadness, anger, disgust, contempt, fear, shame, and guilt. Dataset 1 contains 1000 sentences and Dataset 2 contains 700 sentences.

  27. DENS The DENS (Dataset on Emotions using Naturalistic Stimuli) dataset (Mishraa et al. 2021a, b) is collected from online narratives and projects, and has the labels joy, sadness, anger, fear, anticipation, surprise, love, disgust, and neutral.

  28. Grounded Emotions This dataset (Varshney et al. 2024) is collected from Twitter and has two labels, happy and sad. It contains a total of 2557 sentences.

  29. Amazon Alexa This dataset (Pandey et al. 2019) is collected from Alexa reviews and contains a total of 2970 sentences (Table 9).

    Table 9 Summary of some datasets that are used in emotion detection

6.2 Data preprocessing

Data preprocessing is a critical step in data cleaning. Data collected from social media posts, reviews, blogs, etc. are likely to be unstructured. In text preprocessing, unwanted information is removed from the original text in three steps: removal of punctuation marks, conversion to lowercase characters, and word slicing. Tokenization is the process of breaking a paragraph into words, a document into paragraphs, or a sentence into words called tokens (Chaffar and Inkpen 2011). Consider the sentence "Assam is a beautiful place." After tokenization, it becomes "Assam", "is", "a", "beautiful", "place". The words "is" and "a" are stop words, so they are removed from the text document. Removing such unnecessary words helps to find the various aspects of a sentence, which are generally described by nouns or noun phrases, while emotions are conveyed by adjectives (Sun et al. 2017). Stemming and lemmatization are two further crucial preprocessing steps. In stemming, words are converted to their root form by shortening suffixes; for example, the word "argues" becomes "argue". This reduces unnecessary computation over sentences (Kratzwald et al. 2018; Akilandeswari and Jothi 2018). Lemmatization is a method of grouping the different inflected forms of the same word; it is used in computational linguistics, natural language processing (NLP), and chatbots. Lemmatization involves morphological analysis to remove inflectional endings from a token and transform it into its base word, the lemma (Ghanbari-Adivi and Mosleh 2019). The authors concluded that removing numbers and applying lemmatization improved accuracy, while removing punctuation did not affect accuracy. In Fig. 7 we take a sentence through data preprocessing: in the first step the punctuation marks are removed, then all uppercase characters are converted to lowercase, and lastly the sentence is sliced into words (word slicing). The sentence is then ready for feature extraction. Table 10 contains the word counts from Fig. 7 after word slicing, ready for feature extraction.
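As a concrete illustration of these steps, the sketch below runs the example sentence through tokenization, stop-word removal, stemming, and lemmatization with NLTK. This is an assumed toy pipeline, not the exact preprocessing of any surveyed paper.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads (newer NLTK versions may also need "punkt_tab")
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

sentence = "Assam is a beautiful place."

# Tokenization after lowercasing, as in the three-step scheme above
tokens = nltk.word_tokenize(sentence.lower())

# Drop punctuation and stop words such as "is" and "a"
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop_words]

# Stemming shortens suffixes; lemmatization maps tokens to dictionary lemmas
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in content])          # ['assam', 'beauti', 'place']
print([lemmatizer.lemmatize(t) for t in content])  # ['assam', 'beautiful', 'place']
```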

6.3 Feature extraction

Before the data are analyzed, feature extraction is a major step. The TF-IDF method is widely used in various natural language processing tasks, including emotion classification. By extracting features using TF-IDF, the model can identify words that carry more weight and contribute significantly to the overall sentiment or emotion expressed in the text. In emotion classification, this method helps identify keywords or phrases that are indicative of specific emotions; by assigning higher TF-IDF values to these words, the model can better understand and classify the emotions present in the text (Liu et al. 2022). In the process of word vectorization, a document is first broken down into sentences and each sentence is then further broken down into individual words. This step is crucial in capturing the semantic meaning of the text. Once the text is divided into words, the next step is to build a feature map or matrix, which represents the numerical vector for each word in the document. The feature map is constructed with various algorithms and methods, such as the popular Word2Vec or GloVe models. It allows the machine to understand the relationships and similarities between words, so that similar words are represented by vectors that are close to each other in the feature space. This helps the machine perform tasks such as emotion detection and information retrieval. In conclusion, word vectorization or word embedding is a fundamental technique used by machines to understand and process text data: by converting text into numerical vectors, computers can effectively analyze and interpret textual information, leading to various applications in natural language processing and machine learning (Nandwani and Verma 2021).
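A minimal TF-IDF sketch with scikit-learn is shown below; the three-sentence corpus is a toy placeholder, not one of the surveyed datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "i am thrilled and happy with this product",
    "the delay made me really angry",
    "such sad news today",
]
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# Words that are rare across documents but frequent within one document
# receive the highest weights, which is what the classifier exploits.
print(vectorizer.get_feature_names_out())
print(features.toarray().round(2))
```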

6.4 Prospective methodology for emotion detection using deep learning, including methods and algorithms

To address emotion detection in texts, this section discusses the deep learning methodology. Deep learning methods can be supervised, semi-supervised, or unsupervised (Kusal et al. 2023). The remainder of this section is organized as follows: Sect. 6.4.1 presents CNN, Sect. 6.4.2 RNN, Sect. 6.4.3 LSTM, Sect. 6.4.4 GRU, and Sect. 6.4.5 transformer-based models, including BERT, RoBERTa, ALBERT, DistilBERT, XLNet, ULMFiT, and GPT.

6.4.1 Convolutional neural network (CNN)

CNN is a widely used method in computer vision and NLP applications. Poria et al. (2016) proposed a deep CNN with multiple convolutional layers, applied to small chunks of text corresponding to a word in its given context. Zhang et al. (2018a) describe it as a convolutional deep feed-forward neural network whose convolutional layers feed data into a fully connected neural network layer. Kusal et al. (2023) mentioned that the convolution in a CNN extracts features by filtering the input data, and multiple filters are combined to produce the outputs. Through sub-sampling or pooling layers, the feature resolution is reduced to enhance the CNN's ability to handle noise and distortion. Different pooling algorithms are used to decrease the number of outputs while preserving important characteristics; the most commonly used pooling technique is max pooling, which selects the largest element in the pooling window. Fully connected layers are responsible for carrying out the classification task (Kusal et al. 2023). Jaiswal and Nandi (2020) built a Convolutional Neural Network model that predicts human emotion from real-time images with 74% accuracy, tested robustly on 8 different datasets (the Chicago Face Database, Fer2013, the JAFFE dataset, the FEI face dataset, TFEID, CK and CK+, IMFDB, and a custom dataset) with different angles, faces, backgrounds, and age groups.
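The convolution-pooling-classification pattern described above can be sketched in a few lines of Keras. This is an illustrative text-CNN under assumed vocabulary and class sizes, not the exact architecture of any cited study.

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=64),  # token vectors
    tf.keras.layers.Conv1D(filters=128, kernel_size=3,
                           activation="relu"),                 # 3-gram feature filters
    tf.keras.layers.GlobalMaxPooling1D(),                      # max pooling condenses features
    tf.keras.layers.Dense(6, activation="softmax"),            # fully connected classifier
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# A padded batch of token indices, shape (batch, sequence_length)
dummy = np.random.randint(0, 5000, size=(2, 100))
print(model(dummy).shape)  # (2, 6): one probability per emotion class
```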

6.4.2 Recurrent neural network (RNN)

RNNs are widely used by researchers for text mining and text classification (Sutskever et al. 2011). An RNN is a type of neural network particularly suited to modeling sequential information: it is configured to recognize sequential data properties and then anticipate scenarios using the learned patterns. Abdelgwad et al. (2022) mentioned that the main advantage of RNNs over feed-forward neural networks is their ability to process inputs of any length and retain information over time, which is helpful in any time-series prediction. The activation at each time step is based on the previous step and uses recurrent hidden units (Bengio et al. 1994; Pascanu et al. 2013). The RNN (Britz 2015) is composed of feedback loops in which the connections between neurons establish a directed cycle. When the gradient descent error is backpropagated across the network, RNNs are prone to vanishing and exploding gradient problems; for this reason GRU and LSTM units are primarily favored in RNNs.

6.4.3 Long short-term memory (LSTM)

Graves and Schmidhuber (2005) mentioned in their study that Hochreiter and Schmidhuber (1997) and Gers et al. (2002) invented LSTM, motivated by an analysis of error flow. Robinson (1991) discussed that an LSTM layer consists of many recurrently connected blocks, known as memory blocks, each containing one or more recurrently connected memory cells. Kusal et al. (2023) mentioned that the LSTM input is preprocessed and sorted for the embedding matrix; there are 200 cells in the next LSTM layer and, in the final layer, 128 fully connected text-classification cells. When only positive and negative classes are to be forecast, the final layer reduces the 128-dimensional vector to a one-dimensional output using the sigmoid activation function. Riza and Charibaldi (2021) discussed TBED using LSTM and FastText on datasets collected from Twitter, considering 6 emotion classes (happiness, sadness, fear, disgust, anger, surprise); they compared the word embeddings Word2Vec, GloVe, and FastText. Awais et al. (2020) discussed LSTM-based emotion detection from physiological signals, describing an integrated framework in which physiological signals are wirelessly communicated to data processing nodes where LSTM-based emotion detection is performed. Bi-LSTM has two hidden layers (Kusal et al. 2023). Rayhan et al. (2020) mentioned that since Bi-LSTM processes information in two directions, the model forms context for every character within a given text; they combined CNN with Bi-LSTM to identify emotions from Bangla text. Polignano et al. (2019) proposed a classification method based on Bi-LSTM, CNN, and self-attention, demonstrating its effectiveness on different datasets (ISEAR, SemEval 2018, SemEval 2019). Sumanathilaka et al. (2021) also discussed TBED using Bi-LSTM, using three datasets and evaluating the models on classification accuracy.
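A minimal Bi-LSTM classifier in Keras might look as follows; the sizes are illustrative and only loosely follow the layer sketch from Kusal et al. (2023) described above.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=64),
    # Two hidden passes, forward and backward, give each token
    # context from both directions of the sequence.
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    # Six emotion classes here; for a purely positive/negative task a
    # single sigmoid unit would suffice, as described above.
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```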

6.4.4 Gated recurrent unit (GRU)

Abdelgwad et al. (2022) and Hochreiter and Schmidhuber (1997) mentioned that a powerful and simple alternative to LSTM is the GRU, a member of the RNN family proposed to solve the vanishing gradient problem (Kusal et al. 2023). Cho et al. (2014) mentioned that, to solve the vanishing gradient problem, the GRU uses reset and update gates: the update gate decides how much information from previous time steps should be carried into future steps, and the reset gate decides how much information to discard. Liu et al. (2020) discussed TBED using a GRU with an attention mechanism and emoticon emotion. Aslam et al. (2022) discussed SA and emotion detection on cryptocurrency-related tweets using an ensemble LSTM-GRU model that combines the two recurrent neural networks GRU and LSTM. Yan et al. (2021) proposed a deep learning model named CNN-BiGRU-AT for identifying the emotional distribution of student text, combining a convolutional bidirectional gated recurrent unit with an attention mechanism. Lu and Zhang (2021) proposed a model named AT-BiGRU, a combination of BiGRU and an attention mechanism, which overcomes the problems of complex calculation and high space cost in most SA models.
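A plain GRU classifier differs from the LSTM sketch above only in its recurrent layer; the reset and update gates just described live inside the layer. Again, all sizes are illustrative.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=64),
    # The GRU's internal update gate keeps useful past state and its
    # reset gate discards the rest, mitigating vanishing gradients.
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(6, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```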

6.4.5 Transformer based model

Recent developments in NLP have shown that transformer-based architectures are more advanced at solving several classification problems irrespective of language discrepancy (Puranik et al. 2021; Li et al. 2021; Sharif et al. 2021). Transformer-based deep learning models use self-attention, weighing the importance of each portion of the input data differently (Kusal et al. 2023). Acheampong et al. (2021b) discussed transformer models for TBED and other NLP tasks, including GPT and its variants, the cross-lingual language model (XLM), Transformer-XL, and Bidirectional Encoder Representations from Transformers (BERT). Al-Rfou et al. (2019) mentioned that transformer models were originally designed for machine translation and are used for language modeling and for NLP research areas such as document summarization, question answering, and text classification. They used 64 transformer layers and causal attention to predict the next character in fixed-length input; the model segments the input data, learns from each segment separately, and is limited by its fixed input length of 512 characters.

Following are the different transformer-based models:

  • Transformer-XL Al-Rfou et al. (2019) proposed the vanilla transformer; Transformer-XL was proposed to resolve its limitations. Dai et al. (2019) proposed the Transformer-XL architecture for fixed-length context, accomplishing relative positional encoding and segment-level recurrence. Acheampong et al. (2021b) mentioned that Transformer-XL handles longer-term dependencies better than RNNs and the vanilla transformer, while remaining efficient for short-term dependencies. The model achieved SOTA results in language modeling and audio analysis tasks and can also be used in application areas such as machine translation.

  • BERT (Bidirectional Encoder Representations from Transformers) Devlin et al. (2018) describe BERT as a language representation model introduced by Google. BERT obtains a substantial representation of text by considering left and right context jointly, and is used to solve different NLP problems and to train general-purpose language models. BERT involves two steps, pre-training and fine-tuning: in the pre-training phase, BERT is trained on unlabeled data; after being initialized with the pre-trained parameters, the BERT model is fine-tuned for specific NLP tasks. The cost of fine-tuning BERT is considerably lower in terms of system resources (Kusal et al. 2023). Li et al. (2019c) discussed aspect-based SA exploiting BERT; they used BERT as an embedding layer that takes a sentence as input and computes token-level representations using information from the whole sentence. The authors experimented with BERT + Linear, BERT + GRU, BERT + SAN, BERT + TFM, and BERT + CRF, achieving the highest F1 of 61.12 with BERT + GRU. Karimi et al. (2021) proposed a novel architecture named BERT Adversarial Training (BAT) for aspect extraction and aspect sentiment classification. Xu et al. (2019b) created a dataset named RRC (ReviewRC) based on a popular SA benchmark, and examined a novel post-training approach on the BERT model to enhance the fine-tuning of BERT for RRC. A fine-tuning sketch is given after this list.

  • RoBERTa (Robustly optimized BERT approach) RoBERTa was introduced by Liu et al. (2019) as a careful replica of BERT pre-trained on a larger dataset with tuned hyperparameters (Kusal et al. 2023). The pre-training masking method was changed from static to dynamic: static masking is performed only once during preprocessing, whereas dynamic masking is performed many times during the preprocessing step. The model was trained on five English datasets: BookCorpus, the Stories dataset, Wikipedia, the OpenWebText data, and the CC-News data. Liao et al. (2021) discussed aspect-category SA on text streams using RoBERTa; they proposed a multi-task aspect-category SA model based on RoBERTa and showed that it exceeded the comparison models in aspect-category sentiment analysis. Tan et al. (2022) proposed a hybrid deep learning model that combines the strengths of sequence models and transformer models while suppressing the drawbacks of the sequence model; it achieved F1-scores of 93%, 90%, and 91% on the IMDb, Sentiment140, and Twitter US Airline Sentiment datasets, respectively.

  • DistilBERT Tang et al. (2019) proposed distillation in neural networks as an effort to increase the speed of models. DistilBERT uses the early form of BERT, reducing the layers by a factor of 2 and using dynamic masking for better inference. The model was trained on the same two datasets as BERT, the BooksCorpus dataset and English Wikipedia. Sanh et al. (2019) introduced DistilBERT as a general-purpose pre-trained version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT's language understanding ability. Dogra et al. (2021) discussed DistilBERT for sentiment classification of banking financial news, finding that DistilBERT can transfer basic semantic understanding to further domains, leading to better accuracy than a TF-IDF baseline; Random Forest with DistilBERT gave 78% accuracy.

  • XLNet XLNet uses both a permutation language model and Transformer-XL, yielding a language model that is both auto-encoding and auto-regressive (Yang et al. 2019). Before applying positional encoding and recurrence, the model learns bidirectional context using the permutation language model, so that it is trained on many different token orderings within a sentence. XLNet is trained in an unsupervised manner and addresses the limitations of BERT and GPT (Topal et al. 2021; Adoma et al. 2020a). The authors discussed the effectiveness of BERT, RoBERTa, DistilBERT, and XLNet in identifying emotions from text streams. XLNet is pretrained and fine-tuned on English Wikipedia, ClueWeb 2012-13, Common Crawl, BooksCorpus, and Giga5.

  • ULMFiT (Universal Language Model Fine-tuning for Text Classification) ULMFiT was developed in 2018 and is the first "pure" transfer learning application of its kind. Howard and Ruder (2018) claimed that it can be applied to any NLP task and introduced techniques that are key for fine-tuning a language model. ULMFiT consists of three steps: general-domain language model pretraining, target-task language model fine-tuning, and target-task classifier fine-tuning. The model was evaluated on the TREC-6, Yelp-bi, IMDb, DBpedia, and AG's News datasets (Kusal et al. 2023). Nithya et al. (2022) discussed deep learning-based analysis of code-mixed Tamil text for sentiment classification with pre-trained ULMFiT; focusing on finding sentiment in Tamil code-mixed words, they found that a deep learning-based Bi-LSTM model with ULMFiT embeddings gives better results on code-mixed language than other existing models.

  • GPT (Generative Pre-trained Transformer series) Radford et al. (2019) designed GPT at OpenAI as a purely attention-based model without recurrent layers. For standard language modeling, pre-training is achieved by integrating learned position embeddings with byte-pair-encoded tokens and feeding these embeddings into a multi-layer transformer decoder framework. GPT is formed by a transformer decoder with 12 transformer layers and 12 attention heads, primarily used for representing text; it is pre-trained on large unlabeled datasets such as the BookCorpus dataset and then fine-tuned on restricted supervised datasets. In 2019, an enhanced version, OpenAI GPT-2, was released, followed by GPT-3 (Brown et al. 2020). Liu et al. (2021) and Topal et al. (2021) mentioned that GPT-3, an autoregressive language model, was trained with 175 billion parameters to produce human-like text, making it by far the largest language model at the time; on various NLG tasks, GPT-3 performs well without fine-tuning or gradient updates. Kheiri and Karimi (2023) evaluated various natural language understanding and generation benchmarks, where a retrieval-based prompt selection method usually beats a random baseline; they observed notable gains such as 41.9% on the ToTTo dataset for table-to-text generation and 45.5% on the NQ dataset for open-domain question answering. Ouyang et al. (2023) examined different GPT methodologies in sentiment analysis and compared them with other current, intuitive models previously used on the same datasets; they concluded that GPT models have expansive benefits in sentiment analysis, which is vitally important for taking action on and making sense of enormous amounts of user-generated data. Suhaeni and Yong (2023) discussed GPT-based sentiment analysis for adversarial review data generation and detection on the Amazon.com review dataset, implementing a fine-tuned GPT model; additionally, they used sentiment analysis to relate data novelty, discussed the model's behaviors, and developed some variants of DSA. Rahaman et al. (2023) took two initiatives on class imbalance in SA through GPT-3-generated synthetic sentences: (1) fine-tuning the Davinci base model from GPT-3 for synthetic review generation and (2) sentiment classification on balanced and imbalanced datasets using nine models, achieving the highest accuracy with Multinomial Naïve Bayes at 75.12% on the balanced dataset. Conneau and Lample (2019) discussed how the latest versions, GPT-4 and GPT-5, can bring significant advances in AI-driven NLP tools; they claimed that GPT-4 improves on its predecessor's training data and speed, and that GPT-4 was better than GPT-3 at translating languages, answering questions, and resolving what people think about things.

  • XLM (Cross-lingual Language Model) Acheampong et al. (2021b) mentioned that BERT-based models achieved excellent results in NLP research areas with monolingual data. Lample et al. (2018) mentioned that XLMs, or cross-lingual language models, enhance the BERT-based model for translation and classification by pre-training XLMs for NLP, extending BERT's MLM objective with the translation method proposed by Kao et al. (2009). Additionally, Acheampong et al. (2021b) mentioned the significance of XLM: parallel text input is possible with XLM, the model handles bilingual classification tasks, and pre-trained XLMs give SOTA results.
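As noted in the BERT bullet above, the following is a hedged sketch of one fine-tuning step with the Hugging Face transformers library; the checkpoint name, label count, and toy batch are placeholders, not the setup of any surveyed experiment.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=6)  # fresh classification head, six classes

texts = ["I can't believe we won!", "This is so unfair."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])  # toy labels, e.g. 0 = joy, 1 = anger

# One fine-tuning step: the pre-trained encoder and the new head are
# updated together, which is the cheap second phase described above.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```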

7 Level of analysis

There are various levels of analysis in emotion detection, i.e. word-level, sentence-level, and document-level analysis. Keyword-level analysis is the first step of TBED, where the emotions are extracted from keywords; emotion labels and their related words can be studied in depth to determine whether a sentence contains any feelings (Ashish et al. 2016). The keyword spotting technique for emotion recognition takes a text document as input and generates an emotion class (sad, happy, anger, surprise, etc.) as output (Shirsat et al. 2019). In sentence-level sentiment analysis, documents or paragraphs are broken down into sentences and each sentence's polarity is identified (Shivhare and Khethawat 2012). In document-level analysis, the emotions are detected from the entire document. Sailunaz and Alhajj (2019) discussed how the keyword spotting technique works for a given set of strings and substrings; they described lexical similarity methods, learning-based methods, and hybrid methods, and considered various limitations, including ambiguity in keyword definitions, inability to recognize non-keyword phrases, lack of linguistic information, and difficulty in determining sentiment indices. Murthy and Kumar (2021) reviewed the keyword-based approach, which exploits knowledge of key features combined with emotion labels using a lexicon such as WordNet-Affect or SentiWordNet, where rules are applied and sentence structures are exploited; additionally, keyword spotting and emotion intensity are evaluated, including negation checks, so that the emotion label for each sentence can be determined. Asghar et al. (2017) created generalized and tailored datasets from Twitter user reviews based on user actions; their text preprocessing is based on the keyword-based approach. Seal et al. (2020) proposed a sentence-level emotion detection method, created 25 emotion classes based on keyword analysis, emoticons, short words, keyword negation, a set of proverbs, etc., and achieved 80% accuracy. They also discussed sentence-level emotion recognition from text based on semantic rules, proposing an efficient method that searches for direct emotion words in a predefined emotion keyword database; this method analyzes emotional and phrasal verbs, takes negative words into account, and outperforms earlier approaches. In their experiments, they implemented the emotion detection algorithm in the C programming language and Python and achieved 65% accuracy. Cheng et al. (2020) used a sentence-level sentiment detection framework with rule-based classification. Their research focuses on integrating cognitive emotion theory and sentiment-analysis-based computational techniques to detect and classify emotions from text, and they seek to improve the performance of state-of-the-art techniques by incorporating additional emotion-related signals. Lu et al. (2021) created a constraint diagram that encodes the relative polarity between unstructured features to improve the parameter estimation process and proposed a regularization-based framework to achieve better classification performance. Their approach can reveal unique semantic information from unstructured text data; it also justifies an underlying equivalence with the standard Bayesian classifier, which is ideally optimal when the sampling distribution is known.
The authors then show that their new method offers better generalization power due to its reduced solution search space and attractive asymptotic consistency. Figure 12 includes an analysis of the different levels of text-based emotion recognition in studies collected from 2017 to 2023, and Table 10 contains the deep learning methods, level of analysis, and respective domain of the dataset used.

8 Deep learning methods

Various kinds of techniques are used for emotion detection in research. The distribution of methods from 2018 to 2023 for this research is described in Fig. 10. Based on the literature, the most widely used methods in emotion detection over the past 6 years are LSTM and CNN. CNN and LSTM are the favorites because, in the last 6 years of research, both have often been used to extract or determine the final emotion from words or sentences. CNN has been widely used and has recently proven to be a powerful tool for emotion detection. It consists basically of a convolution layer that extracts features, followed by a max pooling layer that condenses the features output from the convolutional layer; finally, a fully connected layer applies an activation and assigns a prediction (Alwehaibi et al. 2022). Kim (2014) used a convolutional neural network in which the convolution layer outputs a two-dimensional array through a cross-correlation operation between a two-dimensional input array and a two-dimensional convolutional kernel array. After experimenting with several rounds of convolution and pooling layers, they were able to extract emotional information from the text.

Yenter and Verma (2017) used a novel approach that combines kernels from multiple CNN branches with LSTM layers, producing a model with the highest reported accuracy on the Internet Movie Database (IMDb) review dataset. The proposed method applies a CNN and an LSTM to word-level classification of the IMDb review dataset. It combines versions of the networks from Rahman et al. (2016) and Wadhawan and Aggarwal (2021); the novelty of the proposed network is that kernels are combined through several branches that accept the data and perform convolution. The output of the CNN branches is fed into an LSTM before being concatenated and sent to a fully connected layer to produce a single, final output. The network is trained and tested in mini-batches of between 16 and 128 embeddings. The first layer of the network takes the input as a sequence of word indices and embeds each word into a vector of a given size e, so that a vector of 500 word indices embedded with size 32 becomes 500 vectors of length 32. The embedding layer consists of a matrix of trainable weights that generates the vectors for each word index by matrix multiplication; during training, the embedding layer improves the embeddings of each word. During one-dimensional convolution, the kernel has the kernel size c specific to that branch; for example, when c = 3, the layer views 3 words at a time and therefore captures the sense of 3-word combinations. In the activation stage, a rectified linear unit (ReLU) activation is applied to the output of each CNN branch; this layer replaces any negative value with zero (0). The max pooling layer converts each kernel-sized window of input into a single output, the maximum observed number; the result is a reduced, scaled-down version of the input, which reduces overfitting while allowing further processing. Similar to the CNN layer, the one-dimensional max-pooling kernel shape is adjusted to the width of the data, so that the kernel size parameter p implies a kernel shape of p times the data width. After max pooling, each branch goes through a dropout layer that randomly sets a portion d of the inputs to 0. The next level for each branch is batch normalization, which simply normalizes the distribution of each batch after dropout; its purpose is to reduce the internal covariate shift and thus achieve faster convergence. This layer helps prevent overfitting and generalizes the network so that it does not focus on certain portions of the inputs; the shape of its output corresponds to the shape of its input.
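The branch structure just described can be sketched with the Keras functional API. The kernel sizes, filter counts, and dropout rate below are illustrative stand-ins, not the exact values of Yenter and Verma (2017).

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(500,), dtype="int32")   # 500 word indices
embedded = layers.Embedding(10000, 32)(inputs)         # each word -> length-32 vector

branches = []
for c in (3, 5, 7):                                    # kernel size = n-gram width
    x = layers.Conv1D(64, kernel_size=c, activation="relu")(embedded)
    x = layers.MaxPooling1D(pool_size=2)(x)            # scaled-down feature map
    x = layers.Dropout(0.3)(x)                         # randomly zero a portion of inputs
    x = layers.BatchNormalization()(x)                 # reduce internal covariate shift
    branches.append(layers.LSTM(64)(x))                # LSTM per branch before merging

merged = layers.concatenate(branches)
outputs = layers.Dense(1, activation="sigmoid")(merged)  # binary IMDb sentiment
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```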

In Rayhan et al. (2020), word embeddings are given as input, from which features are extracted and the final classification is performed. The embedding layer serves as the first layer and transfers the word vector representations of the selected words of the considered tweet into the model structure. Four convolutional layers connected in parallel receive the output of the embedding layer, followed by a global max-pooling layer to which dropout is applied. Three dense fully connected layers follow, with the last layer responsible for classification. The application of dropout resulted in better convergence and a smaller difference between training and validation accuracies. One of the most widely used methods in emotion detection is Long Short-Term Memory (LSTM).

Sharma et al. (2023) used LSTM approaches for Hindi text emotion recognition. First, they preprocessed the emotion dataset, including tokenization, stop-word removal, and lemmatization. An embedding layer is then built and fed into one or more LSTM layers, and the output is fed into a dense neural network (DNN) with as many units as emotion labels and a sigmoid activation function to perform the classification. Karna et al. (2020) investigated the effectiveness of the deep learning-based Long Short-Term Memory mechanism for identifying textual emotions, carried out on the "Emotion Classification" dataset with six emotional groups. Their experimental results showed that LSTM-based text emotion classification provides relatively higher accuracy than existing methods. In their experiment, the labels are coded as 1 for rage, 2 for worry, 3 for happiness, 4 for affection, 5 for sadness, 6 for astonished, and 7 for kindness; the LSTM achieves an overall accuracy of 94.15%. They also implemented a nested LSTM, which reaches an overall accuracy of 92.34%. Jamal et al. (2021) proposed a hybrid approach based on the deep learning models TF-IDF and S-LSTM. Raw tweets are pre-processed into two different datasets using tokenization techniques to clean up noisy data. TF-IDF is used to obtain local and global weight values for each feature, indicating how important each feature is for the next step. The length of tweets varies depending on the words used, which can lead to class imbalance issues when training on different categories, so the ADASYN method is used to balance each class by oversampling some classes. The LSTM-based deep learning model takes the weighted features as input and classifies them into emotion classes such as anger, sadness, happiness, surprise, etc. They configured stacked LSTM layers that interact with each other while processing tweets. Epoch curves are visualized to show each data point's accuracy, effectiveness, and loss, and precision and loss values are extracted to demonstrate the effectiveness of the proposed method. The proposed approach outperformed most state-of-the-art methods with a classification accuracy of 97.5%.

In Rayhan et al. (2020), the authors used bidirectional long short-term memory with attention layers for emotion recognition to improve prediction accuracy. They used the ISEAR dataset with KNN, CNN, and LSTM-with-CNN phases; the bidirectional learning of Bi-LSTM, with its forward and backward passes, was found to be more effective than unidirectional LSTM. Accuracy was 59.81, 65.8, and 65.9 for KNN, CNN, and LSTM with the CNN phase, respectively. Cahyani et al. (2022) studied two experimental setups for text-based emotion detection, Word2Vec-CNN-BiLSTM and GloVe-CNN-BiLSTM. In both scenarios, they classified emotion into five classes: happiness, anger, sadness, fear, and surprise. The accuracy of Word2Vec-CNN-BiLSTM on the commuter line, TransJakarta, and commuter line + TransJakarta data was 84.34%, 83.73%, and 83.88%, respectively. In the second scenario, Word2Vec-CNN-BiLSTM had better accuracy than GloVe-CNN-BiLSTM, with accuracies of 90.78%, 87.51%, and 90.31%, respectively, on the same three datasets.

Park et al. (2020) developed two models to predict the emotion of Bangla texts. Their two predictive models predict six sentiment labels (joy, sadness, fear, anger, love, and surprise). To achieve this, they relied on a bidirectional gated recurrent unit (BiGRU) and a combination of classical deep learning algorithms, CNN and bidirectional long short-term memory (CNN-BiLSTM). They selected 7,214 sentences containing the six emotions from the dataset for the experiment. CNN-BiLSTM performed slightly better, with an accuracy of 66.62% compared to BiGRU's 64.96%, and they concluded that CNN-BiLSTM shows promising results in emotion detection for long Bangla phrases. Sasidhar et al. (2020) created a dataset of 12,000 Hindi-English code-mixed texts from various sources, annotated with the emotions 'happy', 'sad', and 'angry'. They used a pre-trained bilingual model to generate feature vectors and employed a deep neural network as the classification model, finding that CNN-BiLSTM outperformed the other experimental models with a classification accuracy of 83.21%. Methods used for emotion detection in the last 6 years include BERT, ALBERT, RoBERTa, GMFCC, DMFCC, NB, SVM, CRF, DT, MLP, random forest, graph-based methods, GRU, GRNN, DLSTA, RNN, CGRU, mBERT, etc. The important points of emotion detection from the text of the selected studies over 8 years can be seen in Fig. 13. Table 10 summarizes the methods, level of analysis, and domain, and Table 11 summarizes related research on text-based emotion detection using deep learning models.

Table 10 Summary of method, level of analysis, and domain
Fig. 10 Distribution of deep learning methods used in TBED

Table 11 Summary of different evaluation metrics, problem domain, used datasets, dataset size, methods/techniques used, accuracy

9 Evaluation

Evaluation is the process of measuring the performance of a specific method. The most common metrics used to evaluate emotion detection methods are the Kappa coefficient, Jaccard accuracy, precision, recall, F-score, Pearson correlation, 10-fold cross-validation, and Chi-square.
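These metrics can be computed directly with scikit-learn, as in the toy sketch below; y_true and y_pred are placeholder labels, not results from any surveyed study.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score)

y_true = ["joy", "anger", "sadness", "joy", "fear"]
y_pred = ["joy", "anger", "joy",     "joy", "fear"]

print(accuracy_score(y_true, y_pred))     # overall accuracy
print(cohen_kappa_score(y_true, y_pred))  # Kappa coefficient
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
```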

Ilyas et al. (2023) developed a large RU-EN (Roman Urdu-English) emotion detection corpus in which 20,000 sentences are annotated with emotions. In their deep learning experiments, they used precision, recall, and F1 score as evaluation metrics, and the classifiers were CNN, RNN, LSTM, Bi-LSTM, At-Bi-LSTM (attention-based bidirectional LSTM), and BERT. From the literature of the last 6 years, the most common evaluation metric in emotion detection is accuracy, as shown in Fig. 11; researchers also often use the F1 measure, recall, and precision to evaluate a method.

Rachman et al. (2016) used F1, recall, accuracy, and precision for their experiment, taking four emotion classes: happy, angry, disgusted, and sad. Filipe et al. (2020) used categorical emotion labels based on Ekman's six basic emotions. Using F1, recall, and precision, they found that Corpus-Based Emotion (CBE) is able to improve accuracy in the detection of emotions: the F-measure using WNA+ANEW was 0.50, while the F-measure using CBE with expansion was 0.61.

Majeed et al. (2020) discussed different machine learning algorithms, such as KNN, Decision Tree, SVM, and Random Forest, for their corpus. They used three evaluation measures (F1 score, precision, and recall) and performed 20 experiments to find the best machine learning algorithm, measuring the performance of each classifier. The results show that SVM gives higher accuracy, precision, and recall than Decision Tree, KNN, and Random Forest.

Strapparava and Mihalcea (2007) tested six different classification approaches that combine adjacent pairs of the basic emotions joy, trust, fear, surprise, sadness, disgust, anger, and anticipation for emotion identification in Portuguese tweets. They showed that better performance can be obtained both by improving a lexicon and by directly translating tweets into English and then applying an existing English lexicon. Their classifier succeeded in assigning at least one emotion to 118,046 tweets from Brazil, 6435 tweets from Portugal, 148 tweets from PALOP countries, and 3459 tweets from other countries. The Portuguese and Brazilian tweets show the same tendency for all emotions: anger and disgust are the least predominant emotions in all cases, and sadness is the most dominant emotion for Portugal and Brazil, although it is surpassed by joy in PALOP countries and by anticipation in other countries (Fig. 12).

Fig. 11 Distribution of evaluation methods used in TBED

Fig. 12 Level of analysis in different papers and publications that we have collected

10 Challenges that were identified in existing studies in text-based emotion detection

Text-based emotion recognition presents some open challenges for NLP research, owing in part to the complex nature of emotional expression. In this section, existing challenges are discussed from different aspects. Deng and Ren (2021) mentioned four aspects of challenges and potential opportunities for text-based emotion recognition:

  1. Shortage of high-quality datasets: Dimensional and discrete emotion models are broadly used in TBED, as mentioned above. Deng and Ren (2021) claimed that different databases are incompatible because there are no uniform annotation standards. Mohammad and Bravo-Marquez (2017) mentioned that SemEval-2007 was annotated with Ekman's six basic emotions (anger, disgust, happiness, sadness, fear, and surprise), WASSA-2017 with four emotions (anger, fear, joy, sadness) (Quan and Ren 2010a), and Ren-CECps with eight emotions (anger, joy, surprise, anxiety, hate, love, sorrow, expect) (Krawczyk 2016). Since the datasets have different emotion labels, cross-corpus emotion resources may not be compatible. The authors also mentioned that linking the annotation schemes of different emotion corpora and building an emotion detection benchmark would integrate existing corpora into a more substantial corpus covering all emotion categories. Buda et al. (2018) noted that, even with large amounts of data, imbalanced emotion labels are practically unavoidable in TBED tasks. Deng and Ren (2021) suggest eliminating data imbalance by adjusting the data distribution with two sampling techniques, over-sampling and under-sampling: over-sampling reproduces the data in minority classes to achieve a relatively balanced overall distribution, while under-sampling removes data from majority classes to balance the distribution across emotion classes. Chawla (2010) and Liew et al. (2016) mentioned that over-sampling can reduce the problem of data imbalance in TBED tasks (a resampling sketch follows this list). When a large amount of training data is present, the model may already generalize well and over-sampling may add little; with a small amount of training data, over-sampling may lead to model overfitting and increases training time and computation cost. To achieve high-quality performance, high-quality datasets are necessary (Deng and Ren 2021; Lingren et al. 2014): there must be uniform emotional annotation to give compatibility between different corpora, and an imbalanced distribution of data across emotion labels also affects model performance. In short, text-based emotion recognition models are data-dependent. Deng and Ren (2020) provide a large-scale emotional database to counteract and reduce the challenges of emotion annotation.

  2. Fuzzy emotional boundaries: The nature of human emotional expression is not easy to understand. Even symbols carry emotion; for example, the exclamation mark in "Whoa!" may express shock or surprise. Deng and Ren (2021) mentioned that, owing to people's current feelings and their own experiences, the emotions conveyed can differ even for the same expression. Liu and Chen (2015) mentioned that fuzzy emotional boundaries and human subjective feelings are challenges for TBED. Buitinck et al. (2015) defined multi-label emotions over the different basic emotions, which can represent complex emotions more comprehensively and reduce these limitations. Nowadays, multi-label emotion classification (ML-EC) has attracted more attention and can recognize all possible emotions in a textual expression. Tsoumakas and Vlahavas (2007), Jabreel and Moreno (2019), and Swain et al. (2018) claimed that multi-label classification is used to identify different emotion labels, and that the MLC task can be transformed into more sophisticated learning scenarios through commonly used problem transformation algorithms and methods. Swain et al. (2018) transformed the ML-EC task into a single binary classification problem by regenerating the input training data; additionally, they mentioned that classifier chains (CC) transform the MLC into a chain of binary classification problems to capture label correlations.

  3. Incomplete emotional information in text: Deng and Ren (2021) mentioned that recovering a person's true emotion from textual information alone is a challenging task. For example, a person may express joy in text, but this may be forced laughter while the inner emotion is sadness. For the manually labeled SemEval-2007, Mohammad and Bravo-Marquez (2017) reported Pearson correlations ranging from 0.36 for surprise to 0.68 for sadness, with other emotions at approximately 0.50, from which they concluded that the emotional information in a text stream is incomplete and that text-stream emotion recognition is difficult even for humans. Textual data are generally associated with other modalities such as audio (Ko 2018), visual signals (García-Martínez et al. 2019; Schuller 2018), and body language and physiological signals (Xu et al. 2018; Cabrera-Quiros et al. 2019), and multi-modal information improves the overall performance of emotion recognition. The popularity of wearable devices is increasing, and as a result people are more aware of recognizing emotions through physiological signals (Li et al. 2019a). Hence, Huang et al. (2019b) mentioned that the activity of emotions is directly proportional to the activity level of physiological signals (Li et al. 2019b; Chatterjee et al. 2019b).

  4.

    Textual emotion recognition in dialogue: As open dialogue data grows with the number of social media users, most text-based emotion recognition still assumes that a sentence is separated from its context and expresses a static emotion, whereas emotion in text is highly correlated with contextual information and dynamic in nature, which makes the task more challenging (Deng and Ren 2021). Directly applying sentence-level deep text-based emotion recognition to dialogue yields inadequate performance, since each utterance depends heavily on the contextual signal. Both the contextual utterances and the current utterance are essential for determining the dialogue's overall emotional tendency. In flattened context modeling, the context utterances and the current utterance are merged and all tokens are compressed into a single word sequence, which is then fed into a neural network for the final prediction (Agrawal and Suri 2019; Yang et al. 2011); a minimal sketch follows. Dynamic emotion modeling, in contrast, aims to track the overall tone and the contextual information in a dialogue; this pattern is non-deterministic. In traditional methods, the emotion state transition matrix is the basis for emotional interactive learning (Teoh and Cho 2011), and a Markov model is commonly used to represent the dynamic emotions in continuous dialogue (Adikari et al. 2019).
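A minimal sketch of flattened context modeling, assuming a toy LSTM classifier; the separator token and dimensions are illustrative choices, not the cited authors' architecture:

```python
# Flatten context + current utterance into one sequence, then classify.
import torch
import torch.nn as nn

class FlattenedContextClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256,
                 num_emotions=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq, embed)
        _, (hidden, _) = self.encoder(embedded)   # final hidden state
        return self.classifier(hidden[-1])        # (batch, num_emotions)

# Flattening step: context utterances + current utterance -> one sequence.
context = ["How was the interview?", "I thought it went badly."]
current = "But they just offered me the job!"
flattened = " [SEP] ".join(context + [current])
# `flattened` would then be tokenized into ids and passed to the model.
print(flattened)
```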

11 Conclusion

Emotion detection is an interesting research topic within sentiment analysis that helps produce concise information. The aim of this paper is to present the latest research and progress made in this field with deep learning methods. The paper introduces the concept of emotion detection, emotion models, their evolution, some important datasets, some deep learning methods, and the limitations of TBED. The first step in recognizing emotions is to remove any content that cannot be used to identify them, such as hashtagged content, URLs, and email addresses; a minimal cleaning sketch follows this section. Next, misspelled words, acronyms, and other unintentional message content are normalized. A classifier then assigns the text to a group, after which polarity, valence, and similar signals are examined to understand the feelings expressed. Negation words, modal verbs, adjectives, and emoticons are useful cues for identifying emotions. We also found that, in addition to the words themselves, the semantic underpinnings of each context are what distinguish an emotion. This deep-learning-oriented review provides a structured, broad, and diverse account of the datasets, pre-processing, features, approaches, open problems, and evaluation methods available as a guide for future work. An interesting outcome of the analysis is that deep learning techniques remain the dominant current trend, because many challenges in this area are still open to researchers. The machine learning approach also remains popular because of its automatic learning performance. We hope that through this review paper the reader gains a better understanding of the research done on this topic, the features extracted, the methodologies used, and the results reported by various researchers.
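As an illustration of the cleaning step described above, a minimal sketch with invented regular-expression patterns (not an exhaustive pre-processing pipeline):

```python
# Strip content that carries no emotional signal (URLs, email
# addresses, hashtags) before normalization and classification.
import re

def clean_for_emotion_detection(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)           # email addresses
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    return re.sub(r"\s+", " ", text).strip()            # collapse spaces

print(clean_for_emotion_detection(
    "So excited!! #blessed check https://example.com or mail me@example.com"))
# -> "So excited!! check or mail"
```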

11.1 Practical implications and technical directions of emotion detection

Applications of emotion detection span a wide range of platforms, including e-commerce sites, social media, public opinion monitoring, customer service, the medical field, and psychology. Emotion detection encompasses many related tasks, such as emotion identification, depression analysis, anger detection, and opinion mining (Fig. 13).

Fig. 13 Distribution over the past 10 years of the selected studies in text-based emotion detection

This analysis can be used to organize user reviews of a product or brand, gauge public opinion on a specific topic from social media, and investigate customer satisfaction with products. Wherever user-generated content is involved, emotion detection can extract the emotions of the humans behind it. Emotion detection can help machines become more powerful and intelligent at comparing thoughts and opinions, which benefits business people, psychologists, policymakers, and others. For better performance, emotion detection draws on emotion dictionaries, statistical machine learning models, neural-network-based deep learning models, and pre-trained models. Emotion detection is an important component of natural language understanding as well as sentiment analysis, and both business and academia pay close attention to it. Currently, machines provide much less accurate sentiment analysis than humans do. Emotion detection will be enhanced by these techniques, along with rules for affective reasoning that supplement interpretable information (Table 12).

Table 12 A summary of some challenges from the existing literature (Kusal et al. 2023)

11.2 Future directions of the related studies for overcoming the limitations and challenges

When considering future directions, we think domain portability is probably the most important research direction (Michelucci 2022). The methods used in text-based emotion detection have evolved considerably over the past few years. The major difficulties in TBED have been addressed by AI-based techniques such as autoencoders, different learning styles, data augmentation (including GANs), transfer learning, explainable AI, and adversarial machine learning (AML).

  • 1. Autoencoders: Autoencoders are a type of neural network, typically trained without labels, in which feature vectors are mapped into a lower-dimensional (or occasionally higher-dimensional) space during the encoding stage. The purpose of the learning is an “informative” representation of the data, obtained by reconstructing the input observations well enough, that may be used in different applications (Sagha et al. 2017). Autoencoders can deal with missing data, improve system performance in noisy environments, and can be trained unsupervised; a minimal sketch follows. Denoising autoencoders, a variant, have been employed in domain adaptation, training a model on one labeled corpus with a different context (Kusal et al. 2023). Becker et al. (2022) used an autoencoder to build an LSTM model trained on non-aggressive comments; they collected bilingual (English and Hindi) sentences from Facebook and Twitter and developed a deep learning model to identify different levels of aggression (direct, indirect, and no aggression). Kusal et al. (2023) and Deldjoo et al. (2021) used stacked denoising autoencoders (SDAs) to perform sentiment analysis across a multitude of domains and languages.
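A minimal sketch of such an autoencoder, assuming bag-of-words inputs and illustrative dimensions:

```python
# Encoder compresses a bag-of-words vector into a low-dimensional
# representation; the decoder reconstructs the input from it.
import torch
import torch.nn as nn

class TextAutoencoder(nn.Module):
    def __init__(self, vocab_size=5000, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, 512), nn.ReLU(),
            nn.Linear(512, latent_dim))      # "informative" representation
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, vocab_size))      # reconstruction

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = TextAutoencoder()
x = torch.rand(8, 5000)                      # a batch of bag-of-words vectors
reconstruction, latent = model(x)
loss = nn.MSELoss()(reconstruction, x)       # train on reconstruction error
```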

  • 2. Adversarial Machine Learning (AML): NLP is an emerging research area in which deep learning and machine learning models achieve new successes day by day. AML is another emerging research area that brings together best practices from machine learning, computer security, and robust statistics. The goal of the survey by Alsmadi (2021) was two-fold: first, to present recent advances in AML for the security of recommender systems (i.e., attacking and defending recommendation models), and second, to show another successful application of AML in GANs (Generative Adversarial Networks) for generative applications. Alshemali and Kalita (2020) discussed the different types of adversarial attacks, focusing on research trends in AML specifically for text generation and text analysis. Zhang et al. (2020) and Bayer et al. (2023) discussed different adversarial attacks on deep learning techniques in NLP; a toy attack is sketched below.
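A toy sketch of a synonym-substitution attack of the kind these surveys discuss; the classifier and synonym table are hypothetical stand-ins, not any cited system:

```python
# Greedily swap words for synonyms until the model's prediction flips.
SYNONYMS = {"terrible": "dreadful", "awful": "unpleasant", "hate": "dislike"}

def attack(text: str, predict) -> str:
    """Return an adversarial variant of `text`, or `text` if none found."""
    original_label = predict(text)
    words = text.split()
    for i, word in enumerate(words):
        if word.lower() in SYNONYMS:
            words[i] = SYNONYMS[word.lower()]
            candidate = " ".join(words)
            if predict(candidate) != original_label:
                return candidate              # adversarial example found
    return text                               # attack failed

# Toy classifier: "anger" if a strongly negative keyword appears.
toy_predict = lambda t: "anger" if "terrible" in t or "hate" in t else "neutral"
print(attack("the service was terrible", toy_predict))
# -> "the service was dreadful" (prediction flips to "neutral")
```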

  • 3. Data Augmentation: NLP models require vast training data to perform well on tasks such as text classification, and generating a large amount of training data is time-consuming and expensive. Data augmentation is one solution to this problem: a machine learning technique that artificially generates additional synthetic training data through label-preserving transformations. Shi et al. (2020) mentioned that data augmentation can be achieved through thesauri, back-translation, word embeddings, synonym replacement, text generation, and so on (a synonym-replacement sketch appears at the end of this item). The authors proposed a novel text-generation method that improves classifiers with artificially created training data, using the 355-million-parameter GPT-2 model, and achieved promising results when evaluating both short and long texts. Stanton and Irissappane (2019) developed an effective data augmentation algorithm called AUG-BERT for text classification on labeled sentences.

    GANs are a trending data augmentation technique for dealing with imbalanced classes. Kusal et al. (2023) define a GAN as a framework that sets two neural networks against each other to create new, synthetic data that can look like real data. Croce et al. (2020) proposed spamGAN, a generative adversarial network that works with a limited set of labeled data as well as unlabeled data for opinion spam detection. The authors claimed that spamGAN improves on state-of-the-art GAN-based techniques for text classification in experiments on the TripAdvisor dataset; spamGAN not only detects spam opinions but also generates reviews with reasonable perplexity. Li et al. (2020a) proposed a GAN-BERT algorithm that fine-tunes BERT-based architectures with unlabeled data in a GAN setting; their experimental results show good performance on sentence classification tasks even after the number of annotated examples was reduced to 50–100. Fang and Zhan (2015) proposed a deep multi-modal generative and fusion framework for multi-modal classification on imbalanced data. Their work consists of two modules, DMGAN (Deep Multi-modal Generative Adversarial Network) and DMHFN (Deep Multi-modal Hybrid Fusion Network); DMGAN addresses the class-imbalance problem, while DMHFN identifies the integration of different information sources and fine-grained interactions for multi-modal classification.
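A minimal sketch of label-preserving augmentation by WordNet synonym replacement, one of the techniques listed above; it assumes nltk with the WordNet corpus installed and is not the cited authors' method:

```python
# Replace a few words with WordNet synonyms to create a new training
# example with the same emotion label.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def augment(sentence: str, n_replacements: int = 2, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    for i in rng.sample(candidates, min(n_replacements, len(candidates))):
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = rng.choice(sorted(lemmas))  # sorted for determinism
    return " ".join(words)

print(augment("I am very happy with this amazing product"))
```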

  • 4. Transfer learning: Kusal et al. (2023) discussed transfer learning in text-based emotion recognition. They noted that collecting huge amounts of data and labeling each emotion is time-consuming; transfer learning reduces the amount of labeled data needed and exploits previously trained models to achieve better performance (a minimal fine-tuning sketch follows). Knowledge from a source domain can be transferred to separate but related target domains, improving learning on the target task using useful information from the source (Ahmad et al. 2020; Kusal et al. 2023). Omara et al. (2019) proposed a system using transfer learning for emotion recognition via cross-lingual embeddings. Yasaswini et al. (2021) proposed a deep CNN model for Arabic sentiment analysis based on character-level representations; their model applies transfer learning between the emotion recognition and sentiment analysis domains in Arabic. Hazarika et al. (2021) discussed transfer learning for offensive language detection in Dravidian languages, including the challenges of using various transfer-learning-based models to classify sentences in Malayalam, Tamil, and Kannada across six emotion labels. Zucco et al. (2018) discussed conversational transfer learning for emotion recognition and proposed an approach named TL-ERC (Transfer Learning for Emotion Recognition in Conversations), which pre-trains a hierarchical dialogue model on multi-turn conversations and then transfers its parameters to a targeted conversational emotion classifier.
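A minimal fine-tuning sketch using the Hugging Face transformers API; the checkpoint and emotion label set are illustrative choices:

```python
# Fine-tune a pre-trained transformer on a small labeled emotion set:
# the classification head is new, the encoder body is pre-trained.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

EMOTIONS = ["anger", "fear", "joy", "sadness"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(EMOTIONS))

# One fine-tuning step on a toy labeled example.
batch = tokenizer(["I am thrilled about the results!"], return_tensors="pt")
labels = torch.tensor([EMOTIONS.index("joy")])
loss = model(**batch, labels=labels).loss
loss.backward()  # gradients flow into the pre-trained weights
```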

  • 5. Explainable Artificial Intelligence (XAI): Kusal et al. (2023) point out that machine learning and deep learning models are often not humanly interpretable; this lack of transparency is a major problem for both. Relatively little work has been done in this research area. Turcan et al. (2021) reviewed sentiment analysis approaches and existing explainable systems, surveyed the critical area of explainable sentiment analysis, and discussed insights from using explainable sentiment analysis in the medical field. Farruque et al. (2021) and Yin (2020) used an explainable zero-shot model to detect depression symptoms from text streams. A minimal post-hoc explanation sketch follows.
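A minimal post-hoc explanation sketch using LIME; the toy probability function stands in for a trained emotion classifier:

```python
# LIME highlights which words drive a classifier's prediction.
import numpy as np
from lime.lime_text import LimeTextExplainer

CLASSES = ["sadness", "joy"]

def predict_proba(texts):
    """Toy model: probability of 'joy' grows with positive keywords."""
    scores = np.array([[0.2 + 0.6 * ("happy" in t or "great" in t)]
                       for t in texts])
    return np.hstack([1 - scores, scores])  # shape (n_samples, 2)

explainer = LimeTextExplainer(class_names=CLASSES)
explanation = explainer.explain_instance(
    "what a great and happy day", predict_proba, num_features=4)
print(explanation.as_list())  # words with their contribution weights
```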

  • 6. Different learning styles: Learning styles differ in how algorithms use multiple layers to extract increasingly higher-level features from raw data. The following learning styles are emerging trends in NLP research.

    • Meta-learning: To solve a novel problem, meta-learning collects a large number of tasks, treats each task as a training example, and trains a model to adapt to all of those training tasks. Kusal et al. (2023) describe meta-learning as crucial for both task-specific and task-agnostic scenarios; the major objectives in their work were to speed up the learning process and improve quantitative model performance. Wu and Guo (2020) studied the text classification task using meta-learning, testing a total of 73 meta-features on 81 datasets related to five types of tasks. A first-order meta-learning sketch follows.
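A first-order meta-learning sketch in the style of Reptile (an illustrative choice, not the cited authors' method); the toy task sampler stands in for a distribution of real tasks:

```python
# Reptile: adapt a copy of the model to each task, then move the
# meta-parameters toward the adapted weights.
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                       # toy classifier
meta_lr, inner_lr, inner_steps = 0.1, 0.01, 5

def sample_task():
    """Toy task: a random binary-labeled batch (stand-in for one dataset)."""
    return torch.randn(16, 10), torch.randint(0, 2, (16,))

for _ in range(100):                           # meta-training iterations
    task_model = copy.deepcopy(model)
    opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
    X, y = sample_task()
    for _ in range(inner_steps):               # inner loop: adapt to task
        opt.zero_grad()
        nn.functional.cross_entropy(task_model(X), y).backward()
        opt.step()
    with torch.no_grad():                       # outer loop: meta-update
        for p, q in zip(model.parameters(), task_model.parameters()):
            p += meta_lr * (q - p)
```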

    • Co-learning: According to Kusal et al. (2023), co-learning is based on co-training, first introduced by Blum and Mitchell in 1998, and proposes that multiple classifiers can learn from different feature views of the data. When labeled data are unavailable or training data are scarce, co-learning improves the generalization capacity of classifiers through the learned features, yielding a better text classification model. First, all classifiers are trained on a small labeled dataset in a fully supervised manner, and then each trained classifier labels the unlabeled data (Kusal et al. 2023). The data automatically labeled by each classifier are then used to retrain the other classifiers in a semi-supervised fashion, as sketched below. Helwe et al. (2019) proposed a technique for multi-domain text categorization named dual adversarial co-learning. De and Guo (2015) assessed Arabic weblog credibility via deep co-learning, showing that despite the lack of training data in the respective weblogs, the co-learning approach performed well against other baseline models.
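A minimal co-training sketch following the procedure described above; the two feature views and the random data are hypothetical:

```python
# Two classifiers on two feature views: train on a small labeled set,
# then let each add pseudo-labels for its most confident unlabeled
# examples, which retrain both classifiers in the next round.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

def co_train(X_view1, X_view2, y, labeled_idx, unlabeled_idx, rounds=5, k=10):
    clf1, clf2 = GaussianNB(), LogisticRegression()
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    y = y.copy()
    for _ in range(rounds):
        clf1.fit(X_view1[labeled], y[labeled])   # fully supervised step
        clf2.fit(X_view2[labeled], y[labeled])
        for clf, X in ((clf1, X_view1), (clf2, X_view2)):
            if not unlabeled:
                return clf1, clf2
            conf = clf.predict_proba(X[unlabeled]).max(axis=1)
            picked = [unlabeled[i] for i in np.argsort(conf)[-k:]]
            y[picked] = clf.predict(X[picked])   # confident pseudo-labels
            labeled += picked
            unlabeled = [i for i in unlabeled if i not in picked]
    return clf1, clf2

np.random.seed(0)
X1, X2 = np.random.rand(100, 5), np.random.rand(100, 5)  # two views
y = np.random.randint(0, 2, 100)
co_train(X1, X2, y, labeled_idx=range(10), unlabeled_idx=range(10, 100))
```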

    • Multi-task learning: As the name suggests, a multi-task model is trained simultaneously on many tasks rather than separately on each. The goal is a generalized model that can handle a variety of tasks at the same time (Kusal et al. 2023), learning several linked tasks through a shared representation. There are two mechanisms in multi-task learning (Kusal et al. 2023):

      • A soft-sharing mechanism that applies task-specific layers to different tasks.

      • Hard-sharing mechanism that utilizes a shared feature space to extract the features for different tasks.

      Zhang et al. (2021) created a framework for classifying and finding aspect elements in a unified model, using a Bi-LSTM and a CNN in cascade for collaborative learning; with it they extract aspect-related terms and classify aspect sentiment in a multitasking framework. Hu et al. (2021) proposed a multi-task learning framework for sentiment analysis across several application domains based on a hard-sharing mechanism, sketched below.
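A minimal hard-parameter-sharing sketch; the task pairing and dimensions are illustrative:

```python
# One shared encoder extracts features; each task has its own head.
import torch
import torch.nn as nn

class HardSharedMultiTask(nn.Module):
    def __init__(self, vocab_size=5000, hidden=128, n_emotions=6, n_polarity=3):
        super().__init__()
        self.shared = nn.Sequential(                 # shared feature space
            nn.Linear(vocab_size, hidden), nn.ReLU())
        self.emotion_head = nn.Linear(hidden, n_emotions)    # task 1
        self.polarity_head = nn.Linear(hidden, n_polarity)   # task 2

    def forward(self, x):
        h = self.shared(x)
        return self.emotion_head(h), self.polarity_head(h)

model = HardSharedMultiTask()
x = torch.rand(4, 5000)
emo_logits, pol_logits = model(x)
# The total loss would sum the per-task losses, so gradients update
# both the task heads and the shared encoder.
```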

    • Zero-shot learning: Zero-shot learning is closely related to few-shot learning; both are widely used machine learning classification settings. Kusal et al. (2023) distinguish them as follows: when no labeled example is available at test time, it is known as zero-shot learning; when there is exactly one labeled example, it is called one-shot learning; and when a few labeled instances are available at test time, it is called few-shot learning. Thakkar et al. (2022) proposed a multi-label few-shot learning system for Aspect Category Detection (ACD) in sentiment analysis. Kusal et al. (2023) and Chen et al. (2019) used zero-shot and few-shot learning on news articles to perform cross-lingual sentiment analysis, in single-task and multi-task settings on Slovene and Croatian datasets. A zero-shot sketch follows.
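A minimal zero-shot classification sketch via the Hugging Face pipeline API; the NLI checkpoint and candidate emotion labels are illustrative:

```python
# Zero-shot emotion classification: no labeled emotion examples are
# needed; an NLI model scores each candidate label against the text.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
result = classifier(
    "I just got the best news of my life!",
    candidate_labels=["anger", "fear", "joy", "sadness"])
print(result["labels"][0], result["scores"][0])  # top emotion and its score
```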

    • Reinforcement learning: Kusal et al. (2023) define reinforcement learning as a machine learning technique in which an agent is trained in an interactive environment through reward and punishment mechanisms, using feedback from its own actions and experiences: the agent is punished for wrong actions and rewarded for correct ones. They also note that most uses of reinforcement learning in NLP are in question answering, text summarization, machine translation, dialogue systems, and so on. Chai et al. (2020) proposed a framework named WS-LSTM (Word-level Sentiment Analysis) that tries to learn the sentiment tendency of each word in a sentence. Su et al. (2021) proposed a model that selects the most salient texts with respect to the emotion label, which can be regarded as a hard version of attention, and observed significant performance boosts over a wide range of text classification tasks, including single-label, multi-label, and multi-aspect sentiment classification.

    • Self-supervised learning: As the name suggests, models supervise themselves: they learn by using one part of the data to predict another, creating their own annotations through so-called pretext tasks. Self-supervised learning is used when a dataset is too large for manual labeling to be practical; labeling becomes an automated process and human intervention is eliminated. Su et al. (2021) proposed an approach that automatically learns attention supervision information for neural aspect-based sentiment analysis. Su et al. (2021) also proposed a framework named SSentiA (Self-supervised Sentiment Analyzer), a hybrid self-supervised method for sentiment classification from unlabeled data that combines a lexicon-based strategy with a machine learning classifier.

11.3 Limitations of our work

There are some limitations of our work:

  • We only collected papers written in English, Persian, Indonesian, Arabic, Urdu, and Bangla, searched across different journals and publications. We believe there are papers in other languages and other databases that also address emotion detection from text.

  • The keywords we chose to search on the internet were mainly “emotion detection”, “emotion detection from text”, “anger detection”, and “emotion extraction”. Many papers related to our research topic may not contain these keywords. To investigate developments in emotion detection research, future studies could replicate this work using more precise keywords and different literature databases (Itani et al. 2017).

  • We selected the main high-frequency keywords for emotion detection analysis, and some low-frequency keywords may have been ignored.

  • Another limitation of this article is its largely theoretical treatment; moreover, there is scope in the literature to build a bridge between “sentiment analysis” and “emotion recognition”, so that hybrid definitions can be developed from the best of the two domains.

  • The normalization procedure deployed in this article, particularly for the deep learning algorithms, is a conventional one. In future work, we can analyze the changes in each keyword in detail over time and produce more comprehensive analyses.