1 Introduction

Emotion is an essential part of our daily lives. Many scholars, particularly in human–computer interaction (HCI), aim to quantify various aspects of emotion and use these measurements to improve designs, human interaction, decision making, and more [151], e.g., using speech to assess how learners in an online course respond to the material [5]. HCI is one such domain: with emotion identification, computers can make better decisions to help users [5].

Moreover, brainwave signals can be used to detect players’ emotions during gameplay and predict their affective states, but such methods are intrusive and distract the person from normal activities [26, 66]. Whether emotional states can be quantitatively assessed at all remains controversial [151], and the precise definition of emotion has been disputed within psychology since early theories first attempted a concise answer to the question “What is/are emotions?” [147].

Because they are unable to quantify emotions directly, researchers investigating emotions have chosen to measure everything around them [147, 151]. Affective computing, as defined in 1995 (see Wang et al. [169]), is “computing that relates to, arises from, or influences emotions,” or, put another way, any form of computing that has something to do with emotions. The accurate automatic identification of emotions is the cornerstone of affective computing and the subject of this study.

Emotion detection is, at its foundation, an automatic classifier that assigns human emotions to different categories [50, 105]. Creating such a classifier proceeds as follows: gather data, identify the features relevant to the goal, and then train a model to detect and classify specific patterns [59, 69]. The resulting model is then used to categorize new data. For example, to build a model that detects happiness and sadness from facial expressions, researchers feed in photos of people smiling and people frowning, labeled “happy” and “sad.” These images are used to build the classifier; afterwards, when the classifier receives an image of a person smiling, it recognizes the corresponding emotion [59, 69]. Building a model in real life is not that simple. Not only is there a lot of data to train and evaluate, but there is also an effort of interpretation to be made, as we will see later. In addition, humans express their emotions in various ways, including facial expressions, voice and speech, body gestures, movements, writing, and others. Even our bodies respond to emotions with visible physical reactions (breathing and heart rate, pupil size, and so on). Recently, it has also been shown that the environment can affect physiological reactions and emotions [27, 59, 75, 112, 174].
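As a purely illustrative sketch of this gather, label, train, and predict loop (hypothetical features and invented values, not taken from any reviewed study), a few lines of Python with scikit-learn suffice:

```python
# Illustrative only: the two "features" stand in for facial measurements
# (e.g., mouth curvature, eyebrow position); the values are made up.
from sklearn.linear_model import LogisticRegression

# Labeled training data: each row is one face, labeled "happy" or "sad".
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = ["happy", "happy", "sad", "sad"]

clf = LogisticRegression().fit(X_train, y_train)

# A new, unseen (smiling) face is categorized by the trained model.
print(clf.predict([[0.85, 0.15]]))  # -> ['happy']
```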

Emotion detection technology has evolved especially in the business sector because of the massive potential of knowing and predicting how a consumer feels. This review differs from others by presenting recent publications that define and assess various modalities of emotion. We focus on research that attempted to connect empirically assessed components of emotional experience to identifiable emotional states. This review analyzes and evaluates the techniques used in these studies and summarizes their findings. It also gives an overview of each type of emotional information source in the following sections, and it examines methodologies covering both uni-modal (single-assessment) and multi-modal studies in order to survey the various ways emotions can be measured or recognized.

We organize the rest of the review as follows: the second section describes emotion analysis, the evolution of emotion, and emotion models. The third section presents various machine learning algorithms used in emotion recognition. The fourth section examines emotion detection and analysis using various inputs and models. Emotions derived from speech and physiological states, emotions derived from text, facial expressions, body gestures, and combining environmental and physiological factors are all covered in this section. The fifth section expands on the preceding sections’ findings, emphasizing the strengths and weaknesses of the reviewed studies. The last section presents the conclusions and recommendations for further research.

2 Emotion analysis

Humans have many ways to express their feelings. They may express themselves through writing, voice tone, facial expressions, body gestures and postures, and physiological reactions and signals. Many emotion models can be used to categorize these emotions. A suitable emotion model needs to be adopted to recognize and interpret emotions from any modality; it should provide a set of permissible emotions for a specific scenario [124, 169].

2.1 Evolution of emotion research

In 1872, after conducting psychological observations of facial expressions in both humans and animals under various circumstances, Charles Darwin argued that humans and other animals convey emotions with similar expressions and behavior in similar situations, as described in Ali et al. [4]. The period and the events of the time shaped his view of emotion: in his account, the emotions of humans and other animals developed gradually over long spans of time. He covered general principles of emotion and how humans and animals express emotional states, the causes and effects of a wide range of emotions such as anxiety, grief, depression, despair, joy, love, and devotion, and he illustrated specific emotional expressions with images.

Some emotional expressions, according to Darwin, are universal for individuals all across the world. He also argued that animals of similar species, and humans, react similarly to the same circumstances. His research revealed that even in species that are not closely related, some emotions can have similar expressions. As noted in Ali et al. [4], some philosophical and spiritual categorizations of emotions existed before that.

Emotion research began as a sub-field of philosophical and psychological theory. Emotions and their expressions, according to Darwin, are likewise linked to biological causes; later researchers described emotions as brain mechanisms, outputs of the neural system’s functional features [4, 124, 169]. Figure 1 depicts the progression of emotion research across various fields of study. According to evolutionary theory, different human emotions emerged at different stages of human development. Over time, psychologists, sociologists, neuroscientists, biologists, and researchers from many other domains defined, described, categorized, and evaluated human emotions; as a result, many emotion models arose to cover all potential human emotions [136].

Fig. 1 Evolution of emotion in various fields. Adapted from Ali et al. [4]

2.2 Emotion models

Human emotions can be classified and organized psychologically based on emotion type, intensity, and various other factors, all of which can be integrated into emotion models. Emotion models organize different human emotions using scores, ranks, or dimensions. Existing emotion models categorize emotions based on their duration, intensity, synchronization, velocity of change, event focus, appraisal elicitation, and behavioral influence [20, 124, 169].

Authors divided existing emotion models into two categories based on distinct emotion theories: categorical and dimensional [6, 24]. Classification or categorical emotion models define a set of unique emotional categories. The most commonly used model in emotion recognition research is that of Paul Ekman which involves six basic emotions: happiness, sadness, anger, fear, surprise, and disgust [6].

On the other hand, dimensional emotion models describe a few dimensions with specific parameters and then locate feelings along those dimensions. Most dimensional emotion models have two or three dimensions: valence (which indicates whether an emotion is positive or negative), arousal (which indicates how excited or activated the feeling is), and dominance (which indicates how much control an emotion exerts) [2, 6, 25, 117].
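To make the dimensional view concrete, the following minimal Python function (a deliberately simplified illustration, not a model taken from the cited works) maps a hypothetical valence–arousal pair onto a coarse quadrant of a circumplex-style space; the quadrant labels are common textbook examples.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair in [-1, 1]^2 to a coarse emotion quadrant.

    Deliberately simplified: real dimensional models place many emotions at
    finer positions and may add a dominance axis.
    """
    if valence >= 0 and arousal >= 0:
        return "excited / happy"      # positive valence, high arousal
    if valence >= 0:
        return "calm / content"       # positive valence, low arousal
    if arousal >= 0:
        return "angry / afraid"       # negative valence, high arousal
    return "sad / bored"              # negative valence, low arousal

print(circumplex_quadrant(0.7, 0.6))    # excited / happy
print(circumplex_quadrant(-0.5, -0.4))  # sad / bored
```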

Table 1 outlines a few basic classification emotion models (categorical and dimensional) that are often referenced in the literature and together cover nearly all human emotions. Briefly, Table 1 lists the basic models used in emotion recognition, the emotions covered by each model, the type of each (its scenario or approach), and its structure (cube, tree, or valence–arousal space).

In addition, Figs. 2, 3, and 4 depict some of the most often used emotion models in emotion-based research.

Fig. 2 Circumplex emotional model. Acheampong et al. [2]

Fig. 3 Lovheim’s emotional cube. Acheampong et al. [2]

Fig. 4 Plutchik’s emotion wheel. Montero Quispe et al. [117]

Table 1 Emotion models Peng et al. [124]

3 Machine learning algorithms

Machine learning (ML) is the systematic study of the algorithms and statistical models that computer systems use to perform tasks by relying on patterns and inference rather than explicit instructions. ML algorithms construct a mathematical model from sample data, referred to as “training data,” to make predictions or decisions without being explicitly programmed Zhang [178]. ML is a sub-field of artificial intelligence (AI) and is currently the dominant technology for automating complex problems. Its basic idea is to create models that learn the relevant features of a dataset in order to make accurate predictions. It is used by popular applications, such as Facebook and Netflix, to predict which advertisements to show and which TV episodes a user will enjoy [96].

ML techniques can be broadly classified into shallow learning and deep learning, as shown in Fig. 5. Shallow learning methods require developers to manually choose the most suitable features for building models; Fig. 6 depicts the most common shallow techniques used in ML architectures for emotion recognition. Deep learning methods, in contrast, automate feature extraction and selection, but they require a significant amount of data to achieve acceptable performance, making them more complex to implement, although the resulting model may be better, as shown in Fig. 7. ML methods can also be classified by the availability of labeled data into supervised and unsupervised learning. Supervised learning uses labeled/classified data: each data point x has a known outcome y. Unsupervised models instead use algorithms to detect similar patterns in the data. Classification and regression are the two types of supervised learning, each with linear and nonlinear methods/algorithms, as shown in Fig. 6; a short illustrative sketch contrasting the supervised and unsupervised settings is given after Fig. 7 below.

Fig. 5 Machine learning techniques explanation. Liu and Lang [96]

Fig. 6 Shallow learning techniques explanation. Liu and Lang [96]

Fig. 7 Deep learning techniques explanation. Liu and Lang [96]
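The following minimal scikit-learn sketch (illustrative only, with synthetic data) contrasts the two settings: the supervised classifier is fitted on pairs (x, y), whereas the unsupervised clusterer is fitted on x alone.

```python
# Supervised vs. unsupervised learning on tiny synthetic data (illustrative).
import numpy as np
from sklearn.svm import SVC
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(2.0, 0.3, (20, 2))])
y = np.array([0] * 20 + [1] * 20)            # labels exist only in the supervised case

clf = SVC().fit(X, y)                        # supervised: learns from (X, y)
print(clf.predict([[0.1, 0.0], [2.1, 1.9]]))

km = KMeans(n_clusters=2, n_init=10).fit(X)  # unsupervised: sees only X
print(km.labels_[:5], km.labels_[-5:])       # discovered groupings, no label names
```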

3.1 Classification algorithms

Fig. 8 Flow diagram of the GBM algorithm. Zhang et al. [177]

Fig. 9 Flow diagram of a decision tree. Maji and Arora [106]

Fig. 10 Representation of an ANN supervised ML model. Zheng and Nguyen [181]

Classification is a supervised learning task. Its algorithms can be divided into linear and nonlinear families; a short sketch comparing several of the classifiers below is given after this list.

  1. 1.

    Linear classification algorithms include the following:

    1. (a)

      Linear discriminant analysis (LDA): LDA is a supervised method used in creating ML models. It is a technique for reducing the dimensionality of data and is utilized in many ML and pattern-classification applications. The purpose of LDA is to project the features from a higher-dimensional space onto a lower-dimensional one to avoid dimensionality issues while simultaneously saving resources and reducing training costs. This type of reduction is employed in applications such as marketing, image recognition, and predictive analytics.

  2. 2.

    Nonlinear classification includes the following algorithms

    1. (a)

      Gradient boosted machine (GBM): GBM (of which XGBoost is a popular implementation) is an ML technique that can be used for regression and classification in a variety of contexts. It combines many weak prediction models, most commonly decision trees, to form an improved prediction model [74, 126]. Gradient-boosted trees are used when a single decision tree is a weak learner, and they typically outperform random forests [74, 103, 126]. A gradient-boosted trees model is built in the same stage-wise fashion as other boosting methods, with the additional ability to optimize an arbitrary differentiable loss function, as shown in Fig. 8 [177].

    2. (b)

      Decision tree (DT): the decision tree is one of the most widely used predictive modeling approaches in practice. It is a supervised ML technique for building a decision tree from training data. A decision tree is a prediction model for regression and classification (also known as a classification tree or a regression tree). It maps observations of an item to conclusions about the item’s target value. Leaves (also known as labels) indicate classifications, non-leaf nodes represent features, and branches represent feature combinations that lead to categories [87]. In other words, the input sample is split into two or more homogeneous sets (sub-trees) based on the most significant differentiating features. DT implementations use various criteria to decide how to split a node into two or more sub-nodes; each split increases the homogeneity of the resulting sub-nodes, as shown in Fig. 9 [106].

    3. (c)

      Neural network (NN): Artificial neural networks (ANNs), commonly known as neural networks (NNs), are ML models inspired by the working mechanism of the biological neural networks in brains. The NN is primarily a supervised learning technique used for classification, but it can also be used for regression and clustering.

      An ANN, commonly implemented as a multilayer perceptron (MLP), is a network of interconnected units or nodes designed to resemble biological neurons. Each link can pass a message to other neurons, similar to synapses in the brain. An artificial neuron receives a signal, processes it, and sends it on to the neurons to which it is connected. A nonlinear activation function is applied to a neuron’s inputs to determine its output. The connections are called edges. The weights of neurons and edges are adjusted repeatedly as learning proceeds; the weight controls the signal strength at a connection. Neurons may have a signaling threshold such that a signal is transmitted only when the aggregated input exceeds the threshold. Neurons are most commonly grouped into layers, and the system’s inputs can be transformed in a variety of ways as signals move from the first (input) layer to the last (output) layer, possibly after passing through intermediate layers several times [65, 181]. Figure 10 represents an artificial neural network (ANN), a node-based network inspired by the brain’s simplicity: each circular node is a neuron, and each arrow represents a connection between one artificial neuron’s output and another artificial neuron’s input. Zheng and Nguyen [181].

    4. (d)

      Naive Bayes (NB): The naive Bayes technique is a “probabilistic classifier” based on Bayes’ theorem. It relies on strong (naive) independence assumptions between the features (see Bayes classifier). Naive Bayes classifiers are among the most fundamental Bayesian network models [110], yet when combined with kernel density estimation they can achieve higher levels of accuracy [127, 161]. Using Bayes’ theorem, the NB posterior is calculated as in Eq. 1 [15]

      $$\begin{aligned} p(C_k \mid x) = \frac{p(C_k)\,p(x \mid C_k)}{p(x)} \end{aligned}$$
      (1)

      where \(x = (x_1, x_2, \ldots , x_n)\) is the feature vector and \(p(C_k \mid x_1, x_2, \ldots , x_n)\) is the posterior probability for each of the K possible outcomes or classes \(C_k\).

    5. (e)

      Flexible discriminant analysis (FDA): Flexible discriminant analysis is a classification approach based on a mixture of linear regression models; it uses optimal scoring to transform the response variable into a form better suited to linear separation and multivariate adaptive regression splines to build the discriminant surface.

    6. (f)

      Support vector machine (SVM) is a supervised ML method that can be used for both classification and regression problems. It is mainly applied for classification. It works as follows:

      Each data item is plotted as a point in m-dimensional space (where m is the total number of features), with the value of each feature being the value of a particular coordinate. Then, to perform classification, the algorithm selects the hyperplane that best separates the classes (see Fig. 11) [72, 146]. SVM can employ a kernel function, a mechanism for mapping a low-dimensional input space into a higher-dimensional one. As shown in Fig. 11, the support vectors are the coordinates of the individual observations closest to the separating boundary, and the SVM classifier is the hyperplane/line that most effectively distinguishes between the two classes.

    7. (g)

      Random forest (RF): is a classification and regression supervised learning technique. A random forest is a system that combines many decision trees into an ensemble [47] as shown in Fig. 12 [39].

    8. (h)

      K-nearest neighbor (KNN): The K-nearest neighbor method is one of the most fundamental machine learning algorithms and is based on the supervised learning technique. (i) The KNN algorithm categorizes new data points based on how similar they are to stored examples; this means that as new data arrives, the KNN algorithm can quickly sort it into the appropriate category. (ii) The method can be used for both regression and classification; however, classification is the most common use. (iii) Because this technique is nonparametric, it makes no assumptions about the underlying data. (iv) It is also known as a lazy learner algorithm. Figure 13 depicts a flow diagram of the KNN algorithm [149].

    9. (i)

      Bagging CART: The ensemble method of bootstrap aggregating, or bagging, increases the accuracy of unstable models by averaging a set of instances of the same model fitted to bootstrapped samples of the feature space. Consider the following setting: the data consist of predictors \(X_1, \ldots , X_n\) with a response vector \(Y = Y_1, \ldots , Y_n\), and some base procedure \(\hat{g}(\cdot)\) is used to model the relationship between X and Y. A bagged model, \(\hat{g}_{bag}(\cdot)\), is a linear combination of several \(\hat{g}(\cdot)\) fitted to bootstrapped samples of X. Bagging acts as a smoothing operator for hard loss functions (consider a single split in a decision tree); smoothing decisions reduces the variance of the model, ultimately improving the prediction error. Heuristically, the variance of the bagged estimator \(\hat{g}_{bag}(\cdot)\) should be equal to or smaller than the variance of the original estimator \(\hat{g}(\cdot)\), and the reduction in variance is greater when the initial estimator is unstable [22, 23, 184]. To create a model with lower variance, this method “averages” the predictions from several different models after they have been fitted. Fitting the models takes the following steps. First, several bootstrap samples are created, each of which functions as an (almost) independent dataset drawn from the true distribution. Then, a weak learner is fitted to each of these samples, and the results are combined into an ensemble model that has less variance than its components. Just as the bootstrap samples are approximately independent and identically distributed (i.i.d.), so are the learned base models. Finally, by “averaging” the outputs of the weak learners, the variance is decreased without changing the expected result. In other words, bagging consists of fitting several base models to different bootstrap samples and building an ensemble model that averages their outputs, as shown in Fig. 14 [174].

    10. (j)

      Stacking method: The ensemble learning technique of stacking combines different classification models using a meta-classifier. Each classification model in the ensemble is first trained individually on the whole training set; its outputs (meta-features) are then used to fit the meta-classifier. Either the predicted class labels or the ensemble probabilities can be used to train the meta-classifier. As shown in Fig. 15, stacking is also known as stacked generalization, an ensemble method.

      Stacking has been applied effectively to regression, density estimation, distance learning, and classification. It differs from the other ensemble approaches in two ways. First, stacking frequently uses heterogeneous weak learners (different learning methods are combined), whereas bagging and boosting generally use homogeneous weak learners. Second, while bagging and boosting use deterministic rules to combine weak learners, stacking uses a meta-model to combine the underlying models [174].
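As a minimal, purely illustrative sketch of several of the shallow classifiers above (not code from the reviewed papers), the following scikit-learn snippet trains them with default parameters on the library's built-in Iris dataset and reports hold-out accuracy:

```python
# Illustrative comparison of shallow classifiers on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              BaggingClassifier, StackingClassifier)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "GBM": GradientBoostingClassifier(),
    "Decision tree": DecisionTreeClassifier(),
    "ANN (MLP)": MLPClassifier(max_iter=2000),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Random forest": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "Bagged trees": BaggingClassifier(),        # bags decision trees by default
    "Stacking": StackingClassifier(
        estimators=[("svm", SVC()), ("rf", RandomForestClassifier())],
        final_estimator=LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)   # train, then evaluate on held-out data
    print(f"{name:13s} accuracy = {acc:.3f}")
```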

Fig. 11 Representation of data items in the SVM model in n-dimensional space. Singh et al. [146]

Fig. 12 Random forest architecture. Dimitriadis et al. [39]

Fig. 13 K-nearest neighbor architecture. Srivastava [149]

Fig. 14 Explanation of the bootstrap aggregating (bagging) method. Younis et al. [174]

Fig. 15 Flowchart of a stacking classification ensemble. Younis et al. [174]

3.2 Regression algorithms

Regression is a supervised learning task in which the target variable must be numeric (a short sketch comparing several of the regression methods below is given after this list). It has the following variants:

  1. 1.

    Linear regression: includes the following algorithms:

    1. (a)

      Linear regression (LR): Simple LR has only one input variable and one output variable, whereas multiple linear regression has one output variable but many input variables. The goal of an LR algorithm is to identify a linear equation relating the input and output variables. In linear regression, the relationship between the input and output variables is given by the following formulae:

      - Simple linear regression: \(y = b_0 + b_1x\)

      - Multiple linear regression: \(y = b_0+b_1x_1+\cdots +b_nx_n\)

      Here, the ‘x’ variables are the input features and ‘y’ is the output variable; \(b_0, b_1,\ldots ,b_n\) represent the coefficients to be estimated by the linear regression algorithm.

    2. (b)

      Stepwise regression SR: is a technique used for selecting the best features for multiple linear regressions. There are three types of SR: backward elimination, forward selection, and bidirectional elimination.

    3. (c)

      Ridge regression: In situations where the independent variables are highly correlated, ridge regression is a technique for estimating the coefficients of multiple-regression models. It has been applied in a variety of disciplines, including engineering, chemistry, and econometrics. It is also referred to as Tikhonov regularization, after Andrey Tikhonov, and is a method for regularizing ill-posed problems. It is particularly helpful in mitigating the multicollinearity problem in linear regression, which frequently arises in models with many parameters. In return for a tolerable degree of bias, the approach generally offers improved efficiency in parameter estimation problems.

    4. (d)

      Lasso regression: Least absolute shrinkage and selection operator, often known as lasso or LASSO, is a regression analysis technique used in statistics and machine learning that performs both variable selection and regularization to improve the predictive accuracy and interpretability of the resulting statistical model. Lasso was initially developed for linear regression models.

    5. (e)

      Elastic net regression: The two most widely used regularized linear regression methods, lasso and ridge, are combined to create the elastic net. Ridge employs an L2 penalty, while lasso employs an L1 penalty; elastic net regression therefore regularizes models by applying both penalties. By accounting for their respective drawbacks, the strategy combines the lasso and ridge methods to enhance the regularization of statistical models. The elastic net improves on the lasso’s shortcomings, notably that for high-dimensional data the lasso can select only a small number of variables; the elastic net method allows the incorporation of “n” variables up to saturation. When variables form highly correlated groups, the lasso tends to select one variable from each group and ignore the others, whereas the elastic net does not. The elastic net technique is best suited when the dimensionality of the data exceeds the number of samples.

    6. (f)

      Principal component regression (PCR): PCR is a regression analysis tool based on principal component analysis. It is used to estimate the unknown regression coefficients in a standard linear regression model. Instead of regressing the dependent variable directly on the explanatory variables, PCR uses the principal components of the explanatory variables as regressors. Because only a subset of all the principal components is typically employed for regression, PCR acts as both a regularized procedure and a shrinkage estimator.

      One of the most common applications of PCR is to address the multicollinearity problem, which occurs when two or more input variables are nearly collinear. By discarding some of the low-variance principal components in the regression step, PCR can effectively cope with such situations. Furthermore, because PCR usually regresses on only a fraction of all the principal components, it can significantly reduce the number of parameters that characterize the underlying model, resulting in dimension reduction. This is especially beneficial for high-dimensional data [104].

    7. (g)

      Logistic regression (LogR): For a given collection of features (or inputs) X, the target variable (or output) y can take only discrete values. Contrary to popular assumption, logistic regression is a regression model: it builds a regression model to forecast the probability that a particular data entry belongs to a specific category. LogR models the data with a sigmoid function, just as LR assumes the data follow a linear model, as in Eq. 2 and shown in Fig. 16 [182].

      $$\begin{aligned} g(x) = \frac{1}{1 + e^{-x}} \end{aligned}$$
      (2)

      Logistic regression can be categorized as follows: 1. Binomial: the target variable has just two possible classes, “0” or “1,” which can indicate “win” vs. “loss,” “pass” vs. “fail,” “dead” vs. “alive,” and so on. 2. Multinomial: the target variable can have three or more unordered classes (i.e., classes with no quantitative ordering), such as “disease A” vs. “disease B” vs. “disease C.” 3. Ordinal: the target variable has ordered categories; a test score, for example, can be classified as “very poor,” “poor,” “good,” or “very good,” and each category can be given a score of 0, 1, 2, or 3.

  2. 2.

    Nonlinear regression: the corresponding nonlinear algorithms were already described under nonlinear classification in Sect. 3.1.
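As a minimal illustration of several of the regression methods above (synthetic data, default or hand-picked parameters; not code from the reviewed studies), the following scikit-learn sketch fits them and prints their training scores:

```python
# Illustrative comparison of regression methods on synthetic data.
import numpy as np
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
                                  ElasticNet, LogisticRegression)
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

for name, model in [
    ("Linear", LinearRegression()),
    ("Ridge (L2 penalty)", Ridge(alpha=1.0)),
    ("Lasso (L1 penalty)", Lasso(alpha=0.1)),
    ("Elastic net (L1 + L2)", ElasticNet(alpha=0.1, l1_ratio=0.5)),
    ("PCR (PCA + linear)", make_pipeline(PCA(n_components=3), LinearRegression())),
]:
    print(name, round(model.fit(X, y).score(X, y), 3))   # R^2 on the training data

# Logistic regression targets a discrete outcome, despite its name.
y_class = (y > 0).astype(int)
print("LogR accuracy:", LogisticRegression(max_iter=1000).fit(X, y_class).score(X, y_class))
```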

Fig. 16 Sigmoid function of logistic regression. Zhu et al. [182]

3.3 Unsupervised learning

Unlike supervised learning, unsupervised learning has only x values and no labels for the data points. This approach is useful for grouping data points that share similar qualities. Clustering is an example of unsupervised learning; a short K-means sketch is given after the list below.

  • Clustering in unsupervised learning consists of the following algorithms:

    • K-means: K-means is one of the most basic and frequently used unsupervised ML techniques. It sorts similar data into groups, or clusters: observations within a given cluster are more similar to one another than to observations outside the cluster. In practice, the K-means method computes k centroids and then assigns each data point to the cluster whose centroid is nearest. The algorithm aims to find and group objects into K groups [113].

    • Self-organizing map (SOM): The Kohonen SOM is an unsupervised ANN able to handle nonlinear problems; it can be used for exploratory data analysis, pattern recognition, and assessing relationships between variables, and it is often used to cluster high-dimensional data. It consists of only three layers.

      • The input layer: consisting of n-dimensional inputs.

      • Weight layer: weight vectors that are customized and represent the network’s processing units.

      • Kohonen layer: a computational layer made up of processing units that are arranged in a 2D lattice-like pattern (or 1D string-like structure). SOMs have the unique ability to map high-dimensional input features into spaces with fewer dimensions [92].
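The following minimal scikit-learn sketch (synthetic, unlabeled data; illustrative only) shows K-means discovering three groups; a SOM would additionally arrange its units on a 2D lattice and typically requires a dedicated implementation.

```python
# Unsupervised clustering with K-means on unlabeled synthetic data (illustrative).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.4, (50, 2)) for loc in (0.0, 3.0, 6.0)])  # no labels

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Centroids:\n", km.cluster_centers_)
print("First five cluster assignments:", km.labels_[:5])
```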

3.4 Deep learning (DL) algorithms

Deep learning is divided into supervised and unsupervised learning.

3.4.1 Supervised DL

Deep learning is an ML approach that mimics human learning behavior. Unlike typical machine learning algorithms, deep learning algorithms build representations of the data in a hierarchy of increasing complexity and abstraction. The supervised deep learning methods are listed below; a minimal sketch of one such network follows the list.

  • Deep belief network (DBN): the DBN is a sophisticated generative model that employs a deep architecture. DBNs are deep neural networks made up of layers of restricted Boltzmann machines (RBMs) stacked on top of each other. An RBM is a type of generative stochastic ANN that can learn a probability distribution over its inputs. The DBN is a hybrid generative graphical model. It may be used for unsupervised learning challenges, such as reducing feature dimensionality, as well as for supervised learning tasks, such as building classification or regression models. The two steps for training a DBN are layer-by-layer training and fine-tuning. As indicated in Fig. 17 [108], the connections between the top two layers are undirected, while the upper layers have directed links to the layers below them. Layer-by-layer training refers to the unsupervised training of each RBM, whereas fine-tuning refers to the use of error backpropagation to adjust the parameters of the DBN after the unsupervised training.

  • Deep neural networks (DNNs): DNNs are feed-forward networks (FFNNs) in which data flow from the input layer to the output layer without traveling backward, and the links between the layers go only one way, forward. Training is performed via backpropagation, a supervised learning procedure that uses labeled datasets reflecting what we wish to predict [1, 156]. Figure 18 depicts a representation of the deep neural network method. Each layer is followed by a nonlinear activation function such as sigmoid, ReLU, or tanh [48].

  • Convolutional neural network (CNN, or ConvNet): A CNN is a type of deep neural network that is most often used to analyze visual data. Video understanding, audio recognition, and natural language processing (NLP) are among its numerous applications. In addition, combining LSTMs with convolutional neural networks has enhanced automatic image captioning, as demonstrated at Facebook. Broadly, an RNN aids data processing by predicting the next step in a sequence, whereas a CNN aids visual analysis and automatic feature extraction [143].

  • Recurrent neural network (RNN): A recurrent neural network is essentially an FFNN with a time twist, introduced to handle sequential data. The RNN is a discriminative model that mainly processes serial and time-series data: in several tasks the prediction relies on many previous samples when evaluating a sequence of inputs, rather than on the classification of individual samples alone [86]. This neural network is not stateless, because it has connections between passes and interconnections over time. RNNs are a type of artificial neural network in which the node connections form a directed graph and a series of links from one layer to the next allows information to flow back into earlier levels of the network and to persist. RNNs use their internal state (memory) to process input sequences, which is why they can recognize connected, unsegmented handwriting and speech. They act not only on the information you feed them but also on related information from the past, so what you feed and train the network with matters Karpathy et al. [78], as shown in Fig. 19.

  • Long short-term memory (LSTM): The LSTM is a discriminative method that can work on time-stamped, sequential, and long-term-dependent data [86]. LSTMs are a form of RNN that can learn long-term dependencies, which allows them to recall what happened in the past and identify patterns over time so that their next predictions are sensible. Because of LSTMs, machine translation, language modeling, and multilingual language processing have all advanced Staudemeyer and Morris [150].

  • Gated recurrent unit (GRU): The GRU is a simplification of the LSTM. It has two gates, dispensing with the output gate present in the LSTM model: an update gate and a reset gate. The update gate indicates how much of the previous cell contents to keep, while the reset gate defines how to combine the new input with the previous cell contents. A GRU can emulate a standard RNN simply by setting the reset gate to 1 and the update gate to 0. It is simpler than the LSTM, can be trained more quickly, and can be more efficient to run; however, the LSTM is more expressive and, with more data, can lead to better results [86].
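The following minimal Keras sketch (random, purely illustrative data) shows a small supervised deep network of the kind described above: an LSTM classifying fixed-length sequences; swapping the recurrent layer for a GRU or a Conv1D-plus-pooling stack yields the GRU or CNN variants.

```python
# Minimal supervised deep model: an LSTM sequence classifier (illustrative).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_samples, timesteps, features, num_classes = 256, 20, 8, 4
X = np.random.rand(num_samples, timesteps, features).astype("float32")
y = np.random.randint(0, num_classes, size=num_samples)   # random labels, for illustration

model = keras.Sequential([
    keras.Input(shape=(timesteps, features)),
    layers.LSTM(32),                         # or layers.GRU(32); for a CNN use Conv1D + pooling
    layers.Dense(32, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=32, verbose=0)        # backpropagation on labeled data
print(model.predict(X[:3], verbose=0).argmax(axis=1))      # predicted class indices
```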

Fig. 17 A deep belief network architecture. Malik et al. [108]

Fig. 18 A deep neural network architecture. Feng et al. [48]

Fig. 19 A recurrent neural network model. Lakshmanna et al. [86]

3.4.2 Unsupervised deep learning methods

Unsupervised learning must be used as a complement to traditional learning methods in order to deal with massive amounts of unlabeled data. Training can be performed with generative models, including autoencoders (stacked or denoising), VAEs, or GANs, which can be used to initialize networks, reconstruct their inputs via backpropagation, and fine-tune them globally.

Fig. 20 Discriminative and generative models of handwritten digits. Bernardo et al. [14]

Fig. 21 Autoencoder model architecture. Lakshmanna et al. [86]

Fig. 22 Stacked autoencoder architecture. Shastry et al. [143]

Fig. 23 Denoising autoencoder architecture. Majtner et al. [107]

Fig. 24 Generative adversarial networks architecture. Madani et al. [102]

  • Generative models: In statistical classification, the two main approaches are the generative approach and the discriminative approach. They compute classifiers with different degrees of statistical modeling: generative models can produce new data instances, whereas discriminative models only distinguish between kinds of data instances [180]. In other words, given a set of data examples X and a set of labels Y, generative models capture the joint probability p(X, Y), or just p(X) if there are no labels, while models that capture the conditional probability \(p(Y \mid X)\) [121] are discriminative. Classifiers computed without using a probability model are also referred to, loosely, as “discriminative”.

    A generative model includes the distribution of the data itself and indicates how likely a given example is. For instance, models that assign a probability to a sequence of words are typically generative (and considerably simpler than GANs) and can predict the next word in a sequence. A discriminative model ignores the question of whether a given instance is likely and only tells you how likely a label is to apply to the instance; examples include the k-nearest neighbors algorithm, logistic regression, support vector machines, decision tree learning, random forests, maximum-entropy Markov models, and conditional random fields.

    With the development of deep learning, a new family of techniques called deep generative models (DGMs) Tomczak [157] has been established: methods that train deep neural networks to model the distribution of training samples. Generative methods more broadly include the Gaussian mixture model (and other mixture models), the hidden Markov model, probabilistic context-free grammars, Bayesian networks (e.g., naive Bayes, autoregressive models), averaged one-dependence estimators, latent Dirichlet allocation, and Boltzmann machines (e.g., the restricted Boltzmann machine and the deep belief network), but the most popular methods are variational autoencoders (VAEs), generative adversarial networks (GANs), autoregressive models, flow-based methods, and diffusion models, in addition to numerous hybrid approaches. These techniques are compared and contrasted below, explaining the premises behind each and how they are interrelated, while reviewing current state-of-the-art advances and applications [19, 157].

    These models have their roots in the 1980s and aim to learn about data without supervision, potentially providing benefits for standard classification tasks: gathering training data for unsupervised learning is much easier and less expensive than collecting labeled data, and a great deal of such data is available, indicating that generative models can be helpful for a wide range of applications. Generative modeling has thus been applied to many tasks, including emotion recognition; image synthesis (super-resolution, text-to-image and image-to-image conversion, inpainting, attribute manipulation, pose estimation); video synthesis and retargeting; audio (speech and music synthesis); text (summarization and translation); reinforcement learning; computer graphics (rendering, texture generation, character movement, liquid simulation); medicine (drug synthesis, modality conversion); and out-of-distribution detection [19].

    In comparison, discriminative models handle a simpler task than generative models, which require more modeling effort. While generative models attempt to represent how the data are distributed throughout the space, discriminative models attempt to draw boundaries in the data space. Figure 20 shows discriminative and generative models of handwritten digits [14].

    Note that, by constructing a line in the data space, the discriminative model aims to distinguish between handwritten 0s and 1s. If the line is drawn correctly, it can discriminate between 0s and 1s without ever having to represent the precise placement of the instances on either side of the line. The generative model, on the other hand, attempts to generate convincing 1s and 0s by producing digits that closely resemble their real counterparts in the data space; it must therefore model the distribution throughout the entire data space.

    1. 1.

      Autoencoder: The AE is a generative method well suited to feature extraction and dimensionality reduction, with the same number of input and output units; these input and output layers are connected through one or more hidden layers. An autoencoder neural network is an unsupervised learning method that uses backpropagation with the target values set equal to the inputs. In other words, it enforces \(y^{(i)} = x^{(i)}\), where \(y^{(i)}\) represents the output nodes and \(x^{(i)}\) represents the input nodes. The autoencoder tries to learn the function \(h_{W,b}(x)\approx {x}\), i.e., a close approximation to the identity function, so that the output \(\hat{x}\) resembles x. The identity function appears to be a very simple function to learn, but by imposing constraints on the network, such as restricting the number of hidden units, we can uncover interesting structure in the data [12]. Figure 21 shows the basic architecture of an autoencoder, and a minimal code sketch of a simple autoencoder is given after this list. Finally, an autoencoder is composed of two parts, the encoder and the decoder. The encoder compresses and encodes the data, converting the original data into another representation space; this transformation is called the encoding phase. The encoder extracts features and uses the meaningful information acquired to represent the data, while the purpose of the decoder is to reconstruct the data produced by the encoder back into the original representation space [90].

      • Stacked autoencoder: The stacking of autoencoders is an unsupervised deep learning technique, first proposed in Liu et al. [95], to improve the performance of deep networks. Unsupervised pre-training is performed layer by layer as the input is fed through. Once the first layer (the neurons of the first hidden layer of the encoder in Fig. 21) has been pre-trained, it can be used as input to the next autoencoder. The last layer can handle traditional supervised classification, and the pre-trained neural network can be fine-tuned via backpropagation. Figure 22 [143, 160] depicts a stacked autoencoder.

      • Denoising autoencoder: Denoising autoencoders are a stochastic variant of the standard autoencoder. They mitigate the identity-function risk by randomly corrupting the input (i.e., adding noise), which the autoencoder then has to reconstruct, or denoise. During the denoising operation, the model samples an observation x from the training set and generates a corresponding corrupted \(x'\) according to the corruption process \(P(x' \mid x)\). Then, as shown in Fig. 23 [107], \(x'\) is encoded to provide a hidden representation h of the data.

    2. 2.

      GAN: Using generative adversarial networks, it is possible to train detailed models that mimic a real data distribution. GANs frame the problem as a supervised learning problem with two sub-models: the generator model, which we train to generate new examples, and the discriminator model, which attempts to classify examples as either real (from the domain) or fake (generated) [19, 54]. In other words, the generator model is used to generate new, credible examples from the problem domain, while the discriminator model is used to classify examples as real or fake. These two models compete against each other in a zero-sum game. Samples are created directly by the generator; the discriminator, its rival, tries to distinguish between samples drawn from the generator and samples drawn from the training data. The game continues until the generator produces believable examples and the discriminator is fooled roughly half the time [54]. Figure 24 depicts the structure of generative adversarial networks, and a minimal GAN training-loop sketch is given after this list.

      “Zero-sum” implies that when the discriminator correctly distinguishes between real and fake samples, its parameters are left unchanged while the generator is penalized with parameter updates, and vice versa. Ideally, the generator eventually produces convincing examples from the input domain, and the discriminator can no longer tell the difference, consistently predicting “unsure” (e.g., 50% real, 50% fake). Essentially, this is an actor-critic setup. It is crucial to remember that either model can completely overpower the other: the generator will struggle to read the gradient if the discriminator is too effective, since the discriminator will produce values too close to 0 or 1, while false negatives can result if the generator is too strong, since it will exploit the discriminator’s weaknesses. The “skill level” of both neural networks must be kept comparable, as governed by their respective learning rates [102].

      Generator model: the generator creates a sample in the domain from an input random vector of fixed length, drawn at random from a Gaussian distribution. After training, points in this multidimensional vector space correspond to points in the problem domain, forming a compressed representation of the data distribution. Such a vector space of latent variables is referred to as a latent space; in the case of GANs, the generator assigns meaning to points in a predetermined latent space. Points selected from the latent space can be given to the generator model as input to produce new and distinct output examples. After training, the generator model is kept and used to generate new samples.

      The discriminator model: the discriminator model predicts a binary class label of real or fake based on an input example (real from the training dataset or generated by the generator model). The discriminator is a standard, well-understood classification model. Since we are interested in a reliable generator, the discriminator is discarded after training.

      Deep convolutional generative adversarial network, also known as DCGAN, was one of the first GAN models to use convolutional neural networks. This network receives 100 numbers drawn from a uniform distribution as input and produces an image with the desired shape. The network contains numerous convolutional, deconvolutional, and fully connected layers; it employs several deconvolutional layers to map the input noise to the desired output image, and batch normalization stabilizes training. All layers in the generator employ ReLU activation except the output layer, which uses tanh, and all layers in the discriminator use leaky ReLU. The network was trained with mini-batch stochastic gradient descent, and the Adam optimizer with tuned hyperparameters was employed to speed up training. The paper’s findings were intriguing: the authors demonstrated how the generator’s vector-arithmetic properties could be used to modify images in chosen ways [19, 54]. Figures 25 and 26 depict the structure of the generator and discriminator of the deep convolutional generative adversarial network.

      Also, conditional GANs (cGANs) are one of the most frequently used GAN variants. They are created simply by adding a conditional vector to the noise vector; Fig. 27 depicts the structure of a conditional GAN. Before cGANs, images were produced at random from noise samples z. What if we want to create an image with certain desired attributes? Is there any way to give the model additional information about the kind of image we want to produce? Yes, and the conditional GAN is the approach for achieving that: by conditioning the model on extra data supplied to both the generator and the discriminator, it is feasible to control the data-generation process. Conditional GANs are used in a variety of tasks such as text-to-image generation, image-to-image translation, and automated image tagging. A unified structure of both networks is shown in Fig. 27 [54, 102].

      Finally, GANs can produce viable samples and have stimulated a lot of interesting research and writing. The goal of GAN is to place the generator and discriminator into equilibrium. However, there are downsides to using a GAN in its plain version:

      • Images are generated from arbitrary noise. When generating a picture with specific features, you cannot determine which initial noise values would produce that picture; instead you must search over the entire distribution.

      • A GAN only distinguishes between “real” and “fake” images, and there is no constraint that, say, a picture of a cat must actually look like a cat. The result may therefore contain no actual object even though its style resembles real pictures.

      • GANs take a long time to train: training might take hours on a single GPU and more than a day on a single CPU.

    3. 3.

      VAE: We are aware that an autoencoder can transform an input image into a much lower-dimensional representation that holds latent information about the input data distribution. In a standard autoencoder, however, the encoded vector can only be mapped back to its corresponding input by the decoder; it certainly cannot be used to produce variations on similar images [41, 53].

      To accomplish this, the model must capture the probability distribution of the training data. The VAE is one of the most popular approaches for unsupervised learning of complex data distributions, such as images, using neural networks. It is a probabilistic graphical model rooted in Bayesian inference that seeks to learn the underlying probability distribution of the training data so that new data can easily be sampled from that distribution. In other words, the variational autoencoder (VAE) is a generative model that “provides probabilistic descriptions of observations in latent spaces”; put simply, latent attributes are stored in VAEs as probability distributions [98]. Like GANs, VAEs are generative models built on neural-network autoencoders, composed of two independent networks, an encoder and a decoder. The encoder takes an input and converts it into a smaller representation, which the decoder can use to reconstruct the original input. In a plain autoencoder, however, the latent space into which the inputs are mapped, and the locations of the encoded vectors within it, may not be continuous or allow simple interpolation; this is a problem for generative use, where we want to produce variations on an input image from a continuous latent space or to sample from the latent space at random [98].

      Latent spaces in variational autoencoders are continuous by design, making random sampling and interpolation straightforward. To achieve this, the encoder’s hidden nodes output two vectors of the same size instead of a single encoding vector: a vector of means and a vector of standard deviations, which together parameterize a set of Gaussian distributions. A latent vector of random variables is formed from these: the ith elements of the mean and standard deviation vectors give the mean and standard deviation of the ith random variable. We sample from this vector to obtain the encoding passed to the decoder, which can then draw random samples from the input vector’s probability distributions. It is a stochastic generating process: even for the same input, the actual encoding will vary slightly on each run because of the sampling, even though the mean and standard deviation remain the same [98].

      The goal of the variational autoencoder is to minimize both its reconstruction loss, which measures how closely the output resembles the input, and its latent loss, which measures how close the hidden nodes’ distributions are to a standard normal distribution. With a smaller latent loss, less information can be encoded, which increases the reconstruction loss; as a result, the VAE must trade off the latent loss against the reconstruction loss. When the latent loss is small, the generated images resemble the training images too closely but look poor; when the reconstruction loss is small, the reconstructed training images look good but novel generated images differ greatly from them. We want both, so it is critical to strike a good balance. Finally, VAEs are extremely effective generative tools because they can deal with a remarkable variety of data types, including sequential, nonsequential, continuous, discrete, and even labeled or entirely unlabeled data. Anomaly detection for predictive maintenance, signal processing, and security analytics are common applications [53]. The overall objective of the VAE is to maximize a lower bound on the data log-likelihood.

      The best feature of the VAE is that it learns both a generative model and an inference model. Although VAEs and GANs are both very interesting approaches to learning the underlying data distribution with unsupervised learning, GANs generally produce better outputs than VAEs. In a VAE we optimize a variational lower bound, whereas no such assumption is made in a GAN; in fact, GANs do not deal with explicit probability density estimation at all. The blurry outputs that VAEs produce are one of their main drawbacks, caused by the way data distributions are recovered and loss functions are calculated. The authors of Biedebach et al. [17] have suggested modifications to VAEs that avoid the variational Bayes method in order to improve output quality.

    4. 4.

      Flow-based models: Flow-based generative models are exact log-likelihood models with tractable sampling and latent-variable inference. To calculate the exact log-likelihood of observations, flow-based models apply a stack of invertible transformations to a sample from a prior. Because the model directly learns the data distribution, in contrast to the previous two families, the loss function is the negative log-likelihood [19]. Figure 28 depicts the architecture of flow-based generative models.

      As in nonlinear independent component analysis, a flow model f is typically built as an invertible transformation that maps a high-dimensional random variable x to a standard Gaussian latent variable \(z=f(x)\). A flow model’s fundamental design principle is that it can be any bijective function built by stacking simple, individually invertible transformations. Specifically, the flow model f is constructed by composing several invertible flows as \(f(x) = f_1 \circ \cdots \circ f_L(x)\), where each \(f_i\) has a tractable inverse and a tractable Jacobian determinant [19].

      Flow-based models have two main categories: models with normalizing flows and models with autoregressive flows that aim to improve the performance of the basic model.

      • Normalizing flows: Accurate density estimation is crucial for many machine learning problems, but it is inherently difficult: the modeled probability distribution must be simple enough that its derivative can be computed efficiently, as required for backpropagation in deep learning models. The standard strategy is therefore to use a Gaussian distribution in latent-variable generative models, even though most real-world distributions are far more complex. Normalizing flow (NF) models, such as RealNVP or Glow, offer a more faithful approximation of the distribution: they convert a simple distribution into a complicated one through a sequence of invertible transformation functions. Following the change-of-variables theorem, the variable is repeatedly replaced by a new one as it flows through the chain of transformations, and we eventually obtain the probability distribution of the final target variable [19, 64].

      • Models with autoregressive flows: When the flow transformation in a normalizing flow is framed as an autoregressive model, in which each dimension of a vector variable is conditioned on the previous dimensions, the model is known as an autoregressive flow; it is an advance over plain normalizing flows. Popular autoregressive flow models include WaveNet for 1D audio signals and PixelCNN for image synthesis. Both are built from stacks of causal convolutions, i.e., convolutions that respect the ordering of the data, so that the prediction at a given timestamp uses only data observed in the past. In PixelCNN the causal convolution is handled by a masked convolution kernel, while WaveNet shifts the output by several timestamps so that it aligns with the final input element [40].

      In comparison with autoregressive models, flow-based models are conceptually promising for modeling complicated distributions but are constrained by performance issues in density estimation. In addition, although flow models initially appear able to match GANs in producing respectable output, there is a considerable difference in the computational cost of training: flow-based models require far more time than GANs to produce images of the same resolution. As a result, each algorithm family (GAN, VAE, flow-based models) has advantages and disadvantages in terms of efficiency and accuracy. Flow-based models and GANs typically produce more accurate or realistic images than VAEs, while VAEs are more time- and parameter-efficient. Table 2 compares GAN, VAE, and flow-based generative models. Broadly, GANs are efficient and parallel but not reversible; VAEs are reversible and efficient but not parallel; and flow models are parallel and reversible but not efficient. In practice, this means that output quality, the learning process, and efficiency are constantly traded off against one another.
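As a minimal, purely illustrative sketch of the autoencoder described above (random data and a plain dense architecture, not taken from any cited work), the following Keras snippet builds an encoder and a decoder and trains them with the target set equal to the input; training on (noisy input, clean input) pairs instead would give the denoising variant, and pre-training additional encoder layers one at a time would give a stacked variant.

```python
# Minimal dense autoencoder in Keras (illustrative, trained on random data).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 64, 8
X = np.random.rand(512, input_dim).astype("float32")

encoder = keras.Sequential([keras.Input(shape=(input_dim,)),
                            layers.Dense(32, activation="relu"),
                            layers.Dense(latent_dim, activation="relu")])
decoder = keras.Sequential([keras.Input(shape=(latent_dim,)),
                            layers.Dense(32, activation="relu"),
                            layers.Dense(input_dim, activation="sigmoid")])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)   # target equals input: y = x

codes = encoder.predict(X[:4], verbose=0)             # compressed latent representation
reconstructions = decoder.predict(codes, verbose=0)   # mapped back to the input space
print(codes.shape, reconstructions.shape)             # (4, 8) (4, 64)
```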
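Similarly, the following minimal GAN training loop (an illustrative sketch on a toy 2-D Gaussian rather than images; the architecture and hyperparameters are arbitrary) shows the generator/discriminator game described above in TensorFlow/Keras:

```python
# Minimal GAN training loop on toy 2-D data (illustrative only).
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

latent_dim, data_dim, batch = 8, 2, 64
real_data = np.random.normal(3.0, 0.5, size=(1024, data_dim)).astype("float32")

generator = keras.Sequential([keras.Input(shape=(latent_dim,)),
                              layers.Dense(16, activation="relu"),
                              layers.Dense(data_dim)])
discriminator = keras.Sequential([keras.Input(shape=(data_dim,)),
                                  layers.Dense(16, activation="relu"),
                                  layers.Dense(1, activation="sigmoid")])
g_opt, d_opt = keras.optimizers.Adam(1e-3), keras.optimizers.Adam(1e-3)
bce = keras.losses.BinaryCrossentropy()

for step in range(200):
    # Train the discriminator to separate real samples from generated ("fake") ones.
    idx = np.random.randint(0, len(real_data), batch)
    fake = generator(tf.random.normal((batch, latent_dim)))
    with tf.GradientTape() as tape:
        d_real, d_fake = discriminator(real_data[idx]), discriminator(fake)
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    # Train the generator to fool the discriminator.
    with tf.GradientTape() as tape:
        d_out = discriminator(generator(tf.random.normal((batch, latent_dim))))
        g_loss = bce(tf.ones_like(d_out), d_out)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))

print(generator(tf.random.normal((3, latent_dim))).numpy())  # samples mimicking the real data
```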

Fig. 25 Generator of DCGAN (Creswell et al. [32])

Fig. 26 Discriminator of DCGAN (Creswell et al. [32])

Fig. 27 A basic example of cGAN with y as the conditioning vector (Creswell et al. [32])

Fig. 28 Flow-based models architecture (Madani et al. [102]; Bond-Taylor et al. [19])

Table 2 A comparison between GAN, VAE, and flow-based generative models
  5.

    Diffusion models: Diffusion models are a group of probabilistic generative models that transform noise into a representative data sample and can produce a variety of high-resolution images. They have attracted considerable interest since OpenAI, Nvidia, and Google succeeded in training massive models; examples of diffusion-based architectures include the fully open-source Stable Diffusion, GLIDE, DALL-E 2, and Imagen [100]. Many other diffusion-based architectures exist. Most authors concentrate on the denoising diffusion probabilistic model (DDPM), proposed by Ho et al. [62] building on the framework initiated by Sohl-Dickstein et al. [148]; they also explored several additional methods, including stable diffusion and score-based models. Diffusion models are fundamentally distinct from all the previously mentioned generative techniques: intuitively, they break the sampling-based image-generation process into many small “denoising” steps. The idea is that the model can self-correct over these minor adjustments and progressively produce a high-quality sample. Models such as AlphaFold have previously employed this concept of iterative representation refinement. But nothing comes at zero cost: this iterative process makes diffusion models slow at sampling, at least compared to GANs [100]. Several generation tasks, including image, speech, 3D shape, and graph synthesis, have already used diffusion models.

    Forward diffusion and a parametrized reverse process are the two processes that form a diffusion model. The diffusion method is thus described as follows: “destroy a data distribution’s structure methodically and gradually using an iterative forward diffusion approach. A highly adaptable and manageable generative model of the data is produced when we train a reverse diffusion process that restores structure to the data. With this method, we may rapidly learn, sample from, and assess probabilities in deep generative models” [148]. The authors in Sohl-Dickstein et al. [148] built a generative Markov chain that converts a simple known distribution (e.g., a Gaussian) into the target (data) distribution using a diffusion process; being a Markov chain means that the state of an entity at any point in the chain depends solely on the previous state. Both processes (forward and reverse diffusion) can be applied to a specific image (Fig. 29) using the following scenario: “The original image’s structure (distribution) is gradually destroyed by adding noise, and a neural network model is then used to reconstruct the image, i.e., to remove the noise at each step. By repeating this process often enough with high-quality data, the model finally learns to estimate the underlying (original) data distribution. The trained neural network can then be used to create a new image representative of the original training dataset, starting from pure noise.”

    (a)

      Forward diffusion: 1- The original image (\(x_0\)) is slowly corrupted iteratively (a Markov chain) by adding (scaled Gaussian) noise.

      2- This process is repeated for T time steps, ending with \(x_T\).

      3- Image at timestep t is created by: \(x_{t-1} + \epsilon_{t-1}\,(\text{noise}) \rightarrow x_{t}\)

      4- No model is involved at this stage.

      5- Due to the iterative addition of noise, at the end of the forward diffusion stage \(x_T\) we are left with a (pure) noisy image that represents an “isotropic Gaussian.” This is simply a mathematical way of saying that the distribution is a standard normal whose variance is the same across all dimensions; the data distribution has been transformed into a Gaussian distribution.

    (b)

      Backward/Reverse diffusion: 1- Here the forward procedure is reversed. The aim is to iteratively remove the noise that was added during the forward process (again as a Markov chain). An artificial neural network model is used for this.

      2- The model is tasked with the following: Given a timestep t and the noisy image \(x_{\text {t}}\), predict the noise (\(\epsilon ^{'}\)) added to the image at step \(t-1\).

      3- \(x_t \rightarrow\) Model \(\rightarrow \epsilon'\) (predicted noise). The noise added to \(x_{t-1}\) during the forward pass is predicted (approximated) by the model (a toy training-step sketch in code is given after this list).

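    The following is a minimal PyTorch sketch of these two ideas as they appear in DDPM-style training: the closed-form forward noising step and the simple noise-prediction loss of Ho et al. [62]. It assumes flattened feature vectors rather than full images, and `TinyEpsNet` is only a stand-in for the U-Net used in practice; the linear noise schedule is one common choice, not the only one.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (a common choice)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # cumulative product \bar{alpha}_t

def forward_diffuse(x0, t, eps):
    """Sample x_t directly from x_0 using the closed form of the forward process."""
    a_bar = alpha_bars[t].view(-1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

class TinyEpsNet(nn.Module):
    """Stand-in noise predictor: any network mapping (x_t, t) -> predicted noise."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))
    def forward(self, xt, t):
        t_feat = (t.float() / T).view(-1, 1)   # crude timestep embedding
        return self.net(torch.cat([xt, t_feat], dim=1))

def ddpm_training_step(model, x0):
    """One training step: predict the noise injected at a random timestep."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    xt = forward_diffuse(x0, t, eps)
    return ((eps - model(xt, t)) ** 2).mean()  # "simple" loss of Ho et al. [62]

loss = ddpm_training_step(TinyEpsNet(32), torch.randn(16, 32))
```

    At sampling time the learned noise predictor is applied iteratively, starting from pure Gaussian noise, which is exactly the slow multi-step procedure noted above.
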
    As previously mentioned, the enormous class of machine learning (ML) tasks known as natural image synthesis presents several design issues and has numerous applications. Image super-resolution is one instance, in which a model is trained to convert a low-resolution image into a detailed high-resolution image (e.g., RAISR). There are many uses for super-resolution, from enhancing medical imaging systems to recovering old family portraits. Class-conditional image generation is another image synthesis task, in which a model is trained to produce a sample image given an input class label; the resulting generated images can be used to improve downstream models for image segmentation, classification, and other tasks [63]. Deep generative models, including GANs and VAEs, typically handle such image synthesis problems. However, when trained to create high-quality samples from challenging, high-resolution datasets, each of these generative models has drawbacks: mode collapse and unstable training are common problems for GANs, while VAEs suffer from blurry results. As an alternative, diffusion models, first proposed in Sohl-Dickstein et al. [148], have recently regained interest due to their training stability and their encouraging sample quality for image and audio generation. As a result, they may offer more favorable trade-offs than other types of deep generative models. Diffusion models corrupt the training data by gradually introducing Gaussian noise, removing detail until the data are pure noise, and a neural network is then trained to reverse this corruption. Running this reversed corruption procedure produces a clean sample by gradually denoising pure noise into data. This synthesis process can be thought of as an optimization algorithm that generates likely samples by following the gradient of the data density [63].

    SR3 (super-resolution via repeated refinements) and CDM (cascaded diffusion models), a model for class-conditional synthesis, are two connected techniques proposed to push the limits of image synthesis quality for diffusion models. The authors demonstrated that these diffusion models outperform previous methods (GANs, VAEs) by scaling up diffusion models and using well-chosen data augmentation strategies. In particular, SR3 achieves strong image super-resolution results that outperform GANs in human assessments, and the high-fidelity ImageNet samples produced by CDM outperform BigGAN-deep and VQ-VAE-2 by a significant margin on the FID score and the Classification Accuracy Score [63].

    Thus, diffusion models have been shown to outperform VAEs and, especially, GANs. Both diffusion models and GANs are widely used for image, video, and voice generation: generative adversarial networks (GANs) have been a major research focus in recent years because of the quality of their output, while diffusion models have become increasingly popular as they provide training stability as well as high-quality results on image and audio generation [132]. Even though GANs provide the foundation for image synthesis in a wide range of models, they do have several drawbacks that researchers are actively attempting to solve.

    • Vanishing gradients: If the discriminator is too good, the generator’s gradients vanish and generator training can fail.

    • Mode collapse: If a generator produces an unusually believable output, it may learn to produce only that output. The discriminator’s optimal strategy is then to learn to always reject that output. As Google’s documentation continues, “But if the next generation of discriminator becomes trapped in a local minimum and doesn’t find the best strategy, then it’s too simple for the next generator iteration to find the ideal output for the current discriminator” [132].

    • Failure to converge: GANs frequently experience this problem as well.

    In response to these issues, OpenAI researchers showed that diffusion models can achieve image sample quality superior to that of state-of-the-art generative models, though with some limitations [38]. Dhariwal and Nichol [38] reported achieving this on unconditional image synthesis by finding a better architecture through a series of ablations; for conditional image synthesis, they further improved sample quality with classifier guidance. The researchers added that they believe two factors contribute to the gap between diffusion models and GANs:

    First, the model architectures employed in current GAN work have been explored extensively. Second, as Dhariwal and Nichol [38] put it, “GANs can trade off variety for fidelity, resulting in high-quality samples but not covering the entire distribution”.

    Also, Google AI introduced two connected diffusion-based approaches, super-resolution via repeated refinements (SR3) and cascaded diffusion models (CDM), and showed that these approaches produce higher image synthesis quality than GANs [38].

    The DiffWave diffusion model generates high-fidelity audio for a variety of waveform generation tasks, including class-conditional generation, unconditional generation, and neural vocoding conditioned on the Mel spectrogram. Results demonstrated that, in the unconditional generation task, it greatly outperformed autoregressive and GAN-based waveform models in terms of audio quality and sample diversity according to several automatic and human evaluations [38]. Finally, the differences between these four generative methods are summarized in Table 3 and Fig. 30.

Fig. 29 Illustration of the forward and backward/reverse diffusion processes [148]

Table 3 A comparison between four generative models

Fig. 30 A comparison between VAE, GAN, flow-based, and diffusion generative models (Ho et al. [63]; Rombach et al. [132])

Many papers have used generative models to predict emotional labels. The authors in Zhao et al. [179] proposed the semisupervised generative adversarial network (SSGAN) for speech emotion recognition (SER), which aims to identify emotional states from speech signals by capturing underlying knowledge from both labeled and unlabeled data. The SSGAN is derived from a GAN, but its discriminator can both categorize input samples as real or fake and, if they are real, determine their emotional class. As a result, the distribution of real inputs can be learned in a way that encourages label information transfer between labeled and unlabeled data. The article proposed two advanced methods, the smoothed SSGAN (SSSGAN) and the virtual smoothed SSGAN (VSSSGAN), which use adversarial training (AT) and virtual adversarial training (VAT), respectively, to smooth the SSGAN’s data distribution: using labeled instances as inputs, the SSSGAN smooths the conditional label distribution, whereas the VSSSGAN smooths it without label information (using “virtual” labels) [179].
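
A key ingredient of this setup is a discriminator whose output covers the K emotion classes plus one extra “generated” class. The sketch below illustrates that idea in PyTorch; it is a generic illustration of a semisupervised-GAN discriminator head, not the actual architecture or hyperparameters of Zhao et al. [179].

```python
import torch.nn as nn

class SSGANDiscriminator(nn.Module):
    """Outputs K emotion logits plus one 'generated/fake' logit, so a single
    network performs both real/fake discrimination and emotion classification."""
    def __init__(self, feat_dim, num_emotions=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 128), nn.LeakyReLU(0.2),
        )
        self.head = nn.Linear(128, num_emotions + 1)   # index num_emotions = "generated"

    def forward(self, x):
        return self.head(self.body(x))

# Labeled real features get a cross-entropy loss over the K emotion classes,
# unlabeled real features are pushed away from the "generated" class, and
# generator outputs are pushed toward it.
```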

The results showed that the suggested strategies outperformed the latest methods. The distributional smoothness of the SSSGAN and VSSSGAN makes them more robust than the SSGAN in experimental settings with mismatched and semi-mismatched unlabeled training sets. To assess the performance of the suggested approaches in intradomain and interdomain scenarios, several tests were run on the IEMOCAP dataset and three other publicly accessible corpora [179]. The IEMOCAP dataset consists of five dyadic sessions corresponding to almost 10 h of recording time. Male and female speakers interact in pairs during each session in both scripted and unscripted spoken communication scenarios: in the unscripted scenarios the participants’ emotions were evoked in hypothetical situations, whereas in the scripted scenarios the actors were asked to convey the appropriate semantic and emotional content. Three evaluators divided each session into turns and labeled each turn with an emotion (such as neutral, happy, sad, angry, surprised, fear, disgust, frustration, excitement, and others); each turn is given an emotional label based on majority agreement. In line with previous research, only four emotion categories (neutral, sadness, happiness, and anger) are considered in the experiments, and the excitement and happiness examples are merged. In total, 5531 turns (1708 neutral, 1084 sadness, 1636 happiness, and 1103 anger) were used [179].

To evaluate the performance of semisupervised SER with 300, 600, 1200, and 2400 labeled examples, labeled instances are randomly chosen from the training set, with an equal number of examples per category; the remaining instances in the training set are treated as unlabeled examples. Table 4 lists several comparison techniques that perform well on the IEMOCAP dataset, including two supervised approaches and four semisupervised learning methods. SVM and DNN are chosen as the baseline supervised techniques: the DNN has a framework similar to the SSGAN but without the unsupervised loss, and the SVM is the linear SVM used in the INTERSPEECH 2009 emotion challenge, trained with a relatively limited amount of labeled data. Additionally, four semisupervised learning approaches, namely self-training and a denoising autoencoder (DAE) combined with an SVM, the SSAE, and the semisupervised ladder autoencoder (SS-LAE), are compared with the suggested methods. To ensure a fair comparison, the decoder structures of these baselines match the generator of the proposed methods, and the same validation procedure is used. Table 4 displays the outcomes of the compared methods.

Table 4 Averages of UARs [%] with standard deviations over ten experimental runs with 300, 600, 1200, and 2400 labeled data. Several baseline supervised and semisupervised learning methods are selected for comparison [179]

The experimental results in Table 4 demonstrate that, in terms of the average UAR, the suggested methods outperformed the two supervised methods and the four semisupervised learning methods for the various amounts of labeled data; the SSSGAN and VSSSGAN significantly outperformed the alternative techniques at \(p < 0.05\). Given 2400 labeled data points, the SSGAN performs as well as the SSSGAN and VSSSGAN. However, when AT and VAT are used to smooth the output conditional label distribution, the authors of Zhao et al. [179] observed improvements of 1.5% and 0.9%, respectively. One explanation is that AT and VAT explore the adversarial direction of the input, enhancing the robustness of the suggested approaches.
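
For reference, the UAR reported in Table 4 is simply the unweighted mean of the per-class recalls (macro-averaged recall). A minimal scikit-learn computation is sketched below; the toy labels are invented for illustration.

```python
from sklearn.metrics import recall_score

# Toy labels for the four IEMOCAP classes (0 = neutral, 1 = sad, 2 = happy, 3 = angry).
y_true = [0, 0, 1, 2, 3, 3, 2, 1]
y_pred = [0, 1, 1, 2, 3, 2, 2, 1]

# UAR = mean of per-class recalls, so every emotion class counts equally
# despite the class imbalance in IEMOCAP.
uar = recall_score(y_true, y_pred, average="macro")
```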

They also examined the effect of the number of labeled data points and compared the performance of the offered methods with modern techniques. The effectiveness of the suggested procedures with 300, 600, 1200, and 2400 labeled data points is shown in Fig. 31.

Fig. 31 Performance of the proposed methods with 300, 600, 1200, and 2400 labeled data points in terms of the UAR (%) (Zhao et al. [179])

Figure 31 illustrates how the performance of various approaches grows with the amount of labeled data. Notably, performance grows gradually as the amount of labeled data doubles.

These findings imply that more labeled data are not necessarily advantageous for the suggested strategies. Additionally, the VSSSGAN achieves a 1.2% improvement in the UAR over the SSGAN with 600 labeled data points, while the relative improvement is 0.7%, 0.4%, and 0.9% for 300, 1200, and 2400 labeled data points, respectively. These findings imply that the quantity of labeled data affects how much performance is improved. Additionally, the VSSSGAN performs better than the SSSGAN when fewer labeled data are available, whereas the SSSGAN outperforms the VSSSGAN when more labeled data are available. This suggests that adding more labeled data may help smooth the adversarial direction of the model [179].

Using a generative adversarial network (GAN) in a multiple-discriminator setting and jointly minimizing the losses provided by each attribute-specific discriminator (a knowledge discriminator and an emotion discriminator), the authors in Varshney et al. [162] presented a technique called EmoKbGAN for automatic response generation. According to experimental results on two benchmark datasets, the Topical Chat and Document Grounded Conversations datasets, the model produces sentences that flow naturally with better control over emotion and content quality, and the proposed method significantly outperformed baseline models in terms of both automated and human evaluation metrics.

In other words, this research introduces EmoKbGAN, a knowledge-grounded neural conversation model that uses both an underlying knowledge base and emotion labels to produce more in-depth and interesting responses. Expanding on the framework provided in Varshney et al. [162], the authors propose replacing the MLE training objective with multi-attribute discriminator training. The approach primarily uses two kinds of models: a transformer-based language model, which aims to produce relevant responses with the support of attribute features provided as input, and the two discriminators, which guide the generation process by estimating the likelihood that sampled sentences satisfy the given constraints.

The authors in Varshney et al. [162] evaluated the proposed model on the knowledge-grounded Topical Chat dataset, which contains around 11K human–human conversations. Each conversation is grounded in one of eight major categories: fashion, politics, books, sports, popular culture, music, science & technology, and movies. Given that the annotators sometimes relied only on their common-sense knowledge when writing utterances, some utterances may not have any knowledge attached to them. The emotion conveyed by each utterance in a dialogue is annotated (anger, disgust, fear, sadness, happiness, surprise, curious to dive deeper, and neutral). Five separate splits, namely Train, Valid Frequent, Valid Rare, Test Frequent, and Test Rare, are provided: the frequent sets contain conversations about entities that appear often in the training set, whereas the rare sets contain conversations about entities that were only occasionally seen in training. The reported experiments use the frequent set. The authors also performed experiments on the Document Grounded Conversations (CMU-DoG) dataset, in which utterances are grounded in information about a film’s cast, plot, introduction, reviews, and a few scenes; the typical document contains 200 words or less. The target utterances of the CMU-DoG dataset are labeled using a BERT-based emotion classifier trained on the utterances of the Topical Chat dataset; 200 sentences from the test set were used to evaluate this classifier, which obtained an overall accuracy of 0.74.

Researchers in Varshney et al. [162] also conducted ablation studies for the multi-source generator and the attribute-specific discriminators to demonstrate the effectiveness of each EmoKbGAN module. The ablation models are KbGAN (EmoKbG with only a knowledge discriminator), EmoGAN (EmoKbG with only an emotion discriminator), and EmoKbG (an incremental transformer with twin decoders). To illustrate the twin decoder’s effectiveness, they compare the outcomes of primary and secondary decoding: EmoKbGAN-SD and EmoKbGAN-PD denote EmoKbGAN variants that lack the secondary and primary decoders, respectively. The last three utterances and the associated text-based knowledge serve as input. The hidden size is set to 512 for all models. A three-layer bidirectional LSTM with dot-product attention is employed for the Seq2Seq-based generator, and the number of encoder and decoder layers for transformer-based models is set to 3, with eight attention heads and 2048 filters in multi-head attention. Shared vocabulary and embeddings are used for the utterances, knowledge, and generated answers, and a word embedding dimension of 512 is selected empirically. The discriminator and generator networks are trained alternately for around 200 epochs, using the Adam optimizer for the generator with a learning rate of 0.0001.

The authors in Varshney et al. [162] also used some of the most popular metrics for evaluating generated sequences, namely BLEU, perplexity (PPL), and n-gram diversity (Div.), to automatically evaluate the quality of the generated responses. Figures 32 and 33 depict evaluation results using automatic and human evaluation metrics for the baselines, the ablations, and the proposed model on the Topical Chat and CMU-DoG datasets.
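
For reference, BLEU can be computed with NLTK, and Div.(n) is the ratio of unique to total n-grams over the generated responses. The sketch below is a generic illustration of both metrics, not the evaluation code of Varshney et al. [162]; the reference and hypothesis sentences are invented.

```python
from nltk.translate.bleu_score import corpus_bleu

def distinct_n(sentences, n):
    """Div.(n): unique n-grams divided by total n-grams across generated responses."""
    seen, total = set(), 0
    for toks in sentences:
        grams = list(zip(*[toks[i:] for i in range(n)]))
        seen.update(grams)
        total += len(grams)
    return len(seen) / max(total, 1)

references = [[["the", "movie", "was", "really", "great"]]]  # one reference list per hypothesis
hypotheses = [["the", "film", "was", "really", "great"]]
bleu = corpus_bleu(references, hypotheses, weights=(0.5, 0.5))  # BLEU-2 for this tiny example
div1, div2 = distinct_n(hypotheses, 1), distinct_n(hypotheses, 2)
```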

Fig. 32 Evaluation results using automatic and human evaluation metrics for baselines, ablation, and the proposed model on the Topical Chat Frequent dataset [162]

Fig. 33 Evaluation results using automatic and human evaluation metrics for baselines, ablation, and the proposed model on the CMU-DoG dataset (Varshney et al. [162])

To measure the quality of the generated text from a human perspective, they randomly sampled 100 conversations from each model and, with the help of ten well-trained experts with postgraduate exposure, evaluated the predicted responses using the following metrics: fluency, adequacy, knowledge relevance, and emotional content. Responses were rated for fluency, adequacy, and knowledge relevance on a scale from 0 to 2, with 0 denoting an incomplete or unfinished response, 1 a satisfactory response, and 2 an accurate response; emotional content was rated on a scale of 0 to 1, where 0 denotes the incorrect emotion and 1 the proper emotion [162]. They calculated Fleiss’ kappa to assess the level of agreement among the annotators and obtained “high agreement,” with kappa scores of 0.80, 0.86, 0.81, and 0.72 for fluency, adequacy, emotional content, and knowledge relevance, respectively.
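
Fleiss’ kappa is a standard chance-corrected agreement measure for more than two raters. A generic computation with statsmodels is sketched below; the ratings matrix is made up for illustration and is not the paper’s data.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per evaluated response, one column per annotator; entries are the
# category each annotator assigned (e.g., the 0/1/2 fluency scores).
ratings = np.array([
    [2, 2, 2, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [2, 2, 1, 2],
])
table, _ = aggregate_raters(ratings)   # item x category counts
kappa = fleiss_kappa(table)            # 1.0 = perfect agreement, 0 = chance level
```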

According to the automatic evaluation results, the baselines ITDD, EmpTransfo, and ECM learn to decode lexically relevant replies with substantial diversity on both datasets, and Figs. 32 and 33 show that the proposed model has stronger unigram and bigram diversities than the baseline models. Owing to strong Div. (n = 1) and Div. (n = 2) scores, substantially fewer repetitive segments appear in the answers produced by the suggested EmoKbGAN model. In terms of BLEU on the Topical Chat dataset, the findings are comparable to those of the baseline models, which may be explained by the way BLEU matches n-gram tokens between the predicted and target answers [162].

In some cases, the response may be factually and contextually correct yet employ synonyms that do not exactly match the reference response. The suggested EmoKbGAN outperforms the baseline models on BLEU for CMU-DoG; in particular, it significantly outperforms EmoKb-Seq2SeqGAN and EmoKb-TransformerGAN. This finding suggests that EmoKbGAN effectively integrates the context and the pertinent knowledge base, leading to more varied responses. When only the generator portion of the model is used, PPL increases and BLEU falls, illustrating the contribution of the attribute-specific discriminators to the design. Although the suggested EmoKbGAN performs similarly to KbGAN and EmoGAN on the distinct (diversity) metric, it significantly outperforms them on the BLEU and PPL measures, showing that the model provides more linguistically correct responses when the attribute-specific discriminators are used jointly. Compared with the full EmoKbGAN model, the scores of the EmoKbGAN-SD and EmoKbGAN-PD variants also decline, demonstrating the effect of the decoder’s twin decoding function [162].

According to the human evaluation results, models that integrate knowledge tend to produce replies that are more understandable than models that do not. Figures 32 and 33 show that EmoKbGAN performs better than the other baseline models on both datasets in terms of fluency, adequacy, emotion quality, and knowledge relevance. The improvement in fluency and adequacy scores compared to the baselines demonstrates that the suggested model produces responses that are more relevant and fluent, and the emotional content score indicates that the elicited answers are more in line with the emotional sensitivity of the statements. The knowledge relevance score also improves, indicating a general improvement in extracting relevant information from the linked knowledge base. As mentioned in the Baselines section, the authors also performed experiments with pre-trained language models such as GPT, DialoGPT, BERT, and BART apart from ITDD, ECM, and EmpTransfo; on human evaluation, too, they observed that even though these models exhibit competitive performance, the suggested EmoKbGAN approach exceeds them by a significant margin [162].

When comparing EmoKbGAN with the ablation models, the authors of Varshney et al. [162] found that their model can appropriately consume knowledge and emotion while producing remarkably consistent answers. They noted that performance is similar when the discriminators are used independently, but when the discriminators are combined they observed convergence and a boost in the overall effectiveness of the suggested approach. The attribute-specific discriminators outperformed the KbGAN and EmoGAN models in terms of adequacy, emotional content, and knowledge relevance scores, validating their use. The authors also noted that their approach, EmoKbGAN, outperformed the EmoKbGAN-SD and EmoKbGAN-PD models, which shows the effect of the decoder’s twin decoding feature.

Finally, we presented a comprehensive overview of the generative models, their types, disadvantages, and advantages in terms of improving image quality, as well as their application in emotion recognition.

4 Emotion recognition

Emotions play a critical role in our decision-making, planning, reasoning, and other mental processes. Advanced driver assistance systems (ADAS), for example, aim to recognize certain emotions: drivers who monitor their emotions while driving receive crucial feedback that helps them avoid accidents. The significance derives from the fact that aggressive driving leads to traffic accidents.

’Emotion identification’ can be done utilizing facial expressions, voice, and text, as well as biosignals such as the electroencephalograph (EEG), blood volume pulse (BVP), electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), and respiration (RSP), or a combination of more than one signal. This section presents a detailed overview of each of the state-of-the-art strategies for emotion recognition using the ML approaches indicated above [9].

Methods for recognizing emotions can, in general, be divided into two groups:

  • One methodology is to use a single modality (uni-modal) for recognizing human emotions, such as facial expressions [115], speech signals, written text, body gestures, posture, and so on, which are easy to collect and have been researched for years [145]. However, reliability cannot be guaranteed, as it is relatively easy for people to control physical signals such as facial expressions or speech, especially during social interactions; people may smile in a formal social setting even if they are experiencing negative emotions.

    In contrast, signals such as the electroencephalogram (EEG), body temperature (T), electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), respiration (RSP), and other internal data [56] are not easy to control. However, some of these signals are very intrusive, such as EEG, and interrupt normal activities, while others, such as signals collected from smartwatches and wristbands, are non-intrusive.

  • In the second category (multi-modal), researchers employed more than one modality of the above-mentioned signals to identify emotions.

    Recent developments in wearables have different types of embedded sensors capable of measuring many physiological signals simultaneously and in a non-intrusive way. This enabled the creation of multi-modal datasets and consequently multi-modal emotion recognition models [114, 145].

4.1 Emotion recognition using physical signals

  • Using facial expressions to predict emotions.

    Facial expressions are vital in the understanding of emotions and non-verbal communication. They are significant for everyday emotional communication [119].

    They also serve as an indicator of feelings, allowing a person to express his or her emotional state [138].

    People can instantly detect a person’s emotional state from facial expressions. As a result, researchers frequently employed information on facial expressions in automatic emotion identification systems [138].

    Because of its considerable academic and commercial potential, FER is a hot topic in computer vision and artificial intelligence disciplines. This category concentrates on research that mainly uses facial images, as visual expressions are one of the most important information channels in interpersonal communication [28].

    The computer vision community considers detecting human emotion from facial expressions a challenging problem due to numerous difficulties such as differences in face shape from person to person, difficulty in recognizing dynamic facial features, low image quality, and so on (Said and Barr [135]).

    The main problem when using facial expressions for identifying emotions is that they are prone to masking: a person can hide or conceal his or her real emotions behind a facial expression. Face detection and emotion recognition remain research topics that need improvement, and researchers have presented several ways to address this problem using deep learning approaches to advance the state of the art and push past the boundaries of traditional handcrafted techniques [135]. To conclude, faces are considered an intrusive channel for emotion identification because the person must be facing the camera for an image of the face to be captured. Neural networks have achieved great success in recognizing emotions from facial expressions.

  • Emotion Detection from text

    Nowadays, writings come in various formats, including social media posts, microblogs, news pieces, and more. With the development of Web 2.0, people are now able to express their opinions and feelings in writing, and researchers use the content of these postings for text mining and sentiment analysis.

    Sentiment analysis is the extraction of emotions from these messages and it is a massive and challenging task. Academics from several domains are attempting to develop methods for more precise detection of human emotions from various sources, including text [183].

    Researchers have applied many word-based and sentence-based strategies, machine learning, natural language processing methods, and other approaches to obtain improved accuracy. Emotion analysis can be beneficial in a variety of situations.

    The Oxford Dictionary defines ’emotion’ as “a powerful feeling arising from one’s circumstances, mood, or interactions with others,” while ’sentiment’ is “a view or opinion being held or expressed.” According to the Cambridge Dictionary, emotion is a powerful sensation such as love or rage, or strong feelings in general, whereas sentiment is a notion, opinion, or idea based on a feeling about a circumstance or a way of thinking about something [6].

    Sentiments are classified as ’positive,’ ’negative,’ or ’neutral.’ Sentiment analysis extracts meaningful information from text to determine people’s attitudes toward various things such as a product, service, or event; sentiment analysis can thus be seen as a type of emotion detection [6, 16].

    Due to the real-time and pervasive nature of smartphones and social networking platforms, many people share their feelings, opinions, and other information using visual and textual means. Most people still use text to communicate their ideas and feelings in their daily routine on social media.

    There are many challenges for sentiment analysis. In some cases, a single piece of text may include mixed emotions, and some documents contain ambiguous emotions and words. Some words have many meanings, and multiple phrases might refer to the same feeling. Some text is sarcastic or includes slang, and Internet texts feature multilingual content, misspellings, acronyms, and grammatically incorrect sentences. Emotion extraction from text is therefore a hot research topic, and researchers around the world are interested in modifications, improvements, and new approaches to handle these challenges (Alswaidan and Menai [6]; Nandwani and Verma [120]; Bharti et al. [16]). This approach is also a uni-modal technique and can be implemented using either lexicon-based or ML methods.

  • Emotion Recognition from gesture and posture

    Over the last decade, there has been a boost in interest in emotion recognition algorithms that utilize facial expressions, body posture, and gestures. Emotion recognition methods based on facial expressions, body postures, and gestures depend on the same hypothesis as EMG [10], which claims that body postures and gestures are also involved in emotional responses [91, 139] and are suitable for recognizing basic emotions.

    It is commonly believed that body language is just another way to convey the same fundamental emotions as those shown through facial expressions. Furthermore, according to Atanassov et al. [10], the same muscles are involved in expressing these emotions across cultures.

4.2 Emotion recognition using physiological and speech signals

Physiological signals present several challenges. First, they are often collected while people are moving, which makes them prone to noise. Second, there are large variations among people in the measurements of these signals. Third, some of these signals are intrusive, such as EEG, which requires the person to wear a headset and is impractical in real-life applications. An overview of the recent state of emotion recognition approaches using speech and different physiological signals is presented in Saxena et al. [141], Wang et al. [167], and Ali et al. [4]. This category is usually used in a multi-modal approach, combining multiple signals.

The ability to perceive and interpret driver emotions while driving and perform appropriate actions is one of the primary priority areas listed by international research groups for advancing intelligent transportation systems [175].

However, recognizing the mental state of an individual and responding while driving is a challenging task that remains a scientific problem. One of the main difficulties is that emotion-related signal patterns can vary widely from person to person or from one setting to the other. Furthermore, due to the difficulty in precisely defining emotions and their meanings, it is difficult to determine a perfect association between the classes (patterns) [44].

Using suitable sensors, however, a driver’s emotion and reaction can be caught and measured. Most emotion identification researchers have concentrated on analyzing a particular sensor data type, such as audio (speech) or video (facial expression) data [73]. Many recent studies in the emotion recognition field have begun to incorporate different sensor data to construct a powerful emotion identification system.

The main goal of combining many sensors is to simulate human thinking. Humans always use a variety of modalities to portray emotions during interactions. Researchers classified human modalities into audiovisual (facial expression, voice, gesture, posture, etc.) and physiological (respiration, skin temperature, etc.) [159].

The general methods for recognizing a person’s emotion are speech, facial expression, or gesture. The speech signal can reveal the emotional state of the speaker [159]: when the sympathetic nervous system is activated by feelings such as anger, fear, or joy, speech becomes loud and fast [97], whereas when a person feels sad, the parasympathetic nervous system is active and speech becomes slower. The problem with speech signals is similar to that of facial expressions: the possibility of concealing one’s emotions by pretending the opposite.

Detecting a subject’s physiological pattern, on the other hand, can provide information about emotions because when a participant is positively or adversely excited, the sympathetic nerves of the autonomic nervous system are activated [97]. Sympathetic activation increases blood pressure, boosts respiration rate, and raises the heart rate [97]. The most common physiological signals used for emotion recognition include the following:

  • Electromyography (EMG) This term refers to a muscle’s activity or the frequency with which it is tensed. EMG detects the electrical potential created by muscle cells when they are electrically or neurologically activated [109]. Stress typically produces considerable muscle tension, and measuring muscle activity can also distinguish between negative and positive emotions.

  • Electrodermal activity (EDA) or GSR

    Skin conductivity (SC) is a measure of the conductivity of the skin, which increases when the skin sweats. This signal is a good and sensitive indicator of stress and other stimuli, and a tool for distinguishing conflict-free scenarios from anger or fear scenarios. One issue with this signal is that external factors such as temperature can influence it; as a result, it requires reference measurements and calibration (Stržinar et al. [152]). The skin conductance response (SCR) in an EDA signal occurs in response to a stimulus [165]. Figure 34 depicts the skin conductance response in an EDA signal [45].

  • Heart rate (HR) or ECG

    The sinoatrial node, which generates an electrical impulse, initiates an orderly progression of depolarization in each healthy heartbeat. This impulse travels into the heart muscle, causing the heart to contract. The electrical variations are associated with the buildup of action potentials moving along the heart muscle, and electrodes placed on the skin’s surface can measure the electrical impulses generated by the heart over time [125]. This kind of recording is known as an ECG.

    Innovative and resilient technology for collecting emotion-related physiological data over a long period has been proposed [134].

    This system does not restrict users’ behavior (non-invasive) and can extract reliable physiological data in real-world environments using wireless transmission technology. The Emotion Check [125] is a wearable gadget that can detect users’ heart rate and help control their anxiety.

    In ECG technology, an electrocardiograph is used to capture the variations in the heart’s electrical activity as it occurs on the skin throughout each cardiac cycle. A physiological signal generated by the heart’s contraction and recuperation is observed by an ECG. ECG data with a physiological foundation are directly related to a person and are regularly used to assess a person’s psychological state [93].

  • Electroencephalogram (EEG)

    The electroencephalography signal is the measurement of brain waves and the assessment of brain activity. Brain waves result from currents that flow during synaptic excitation of the dendrites of numerous pyramidal neurons in the cerebral cortex [77]. EEG signals can be measured using small, flat metal disks (electrodes) attached to the scalp. The five primary brain waves are identified by their different frequency ranges; these frequency bands, from low to high, are as follows (Karaca et al. [77]):

Fig. 34 Ideal skin conductance response (SCR) in the EDA signal [45]

  1. Delta (\(\delta\)): waves have a frequency range of 0.5–4 Hz. They frequently appear during deep sleep and may also be present in the waking state.

  2. Theta (\(\theta\)): waves occur in a frequency range of 4–7.5 Hz. They appear during sleep and are related to enhanced learning, creativity, deep meditation, and access to unconscious material. Theta waves also appear to be associated with arousal levels.

  3. Alpha (\(\alpha\)): waves lie within the range of 8–13 Hz. In general, the alpha wave appears as a rounded or sinusoidal-shaped signal and is associated with relaxation and super-learning.

  4. Beta (\(\beta\)): waves occur between 14 and 26 Hz. They are related to active thinking, active attention, and problem-solving.

  5. Gamma (\(\gamma\)): waves correspond to frequencies above 30 Hz. They can be used to detect the presence of certain brain disorders.

Figure 35 illustrates the four brain waves with their usual amplitude levels.
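
In practice, these bands are often summarized as band-power features before classification. The snippet below is a minimal illustration using Welch's PSD estimate from SciPy; the sampling rate, band edges, and synthetic signal are assumptions for the example, not tied to any specific study cited here.

```python
import numpy as np
from scipy.signal import welch

# Canonical EEG bands (Hz) as listed above; exact boundaries vary across papers.
BANDS = {"delta": (0.5, 4), "theta": (4, 7.5), "alpha": (8, 13),
         "beta": (14, 26), "gamma": (30, 45)}

def band_powers(eeg, fs=128):
    """Average power in each band for one EEG channel via Welch's PSD estimate."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)
    return {name: psd[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in BANDS.items()}

# Example: 10 s of synthetic single-channel signal sampled at 128 Hz.
features = band_powers(np.random.randn(10 * 128))
```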

Fig. 35 Four typical brain waves, from high to low frequencies [77]

4.3 Emotion recognition using information fusion and physiological measurements

Recently, smartphones and a variety of wearable devices, such as smartwatches and wristbands, have been equipped with various sensors that continually monitor human physiological signals (such as heart rate, movement, EDA, and body temperature), as well as data from the surrounding environment (e.g., noise, brightness). As a result, massive databases have emerged in different areas of research, including healthcare and smart cities. This surge of on-body and environmental data is a good opportunity for healthcare research, necessitating the development of new tools and methodologies for dealing with enormous multidimensional datasets [76]. It has also boosted research in multi-modal emotion recognition.

The study by Kanjo et al. [75] constructed a user-dependent emotion prediction model based on sensor data collected from participants walking around Nottingham city center with a smartphone and a wristband, incorporating physiological (HR, EDA, body temperature, motion) and environmental (UV, noise, air pressure) factors. The researchers built this model as follows:

First, they determined the relationship between on-body and environmental elements, reviewing various studies in this area to assess the link between on-body and environmental reactions: noise, air pollution, traffic, and even crowded areas can cause serious health problems, such as headaches, sleep problems, and heart disease. Taking into account the impact of environmental and physiological factors on emotion recognition, the overall accuracy (86%) reported in Kanjo et al. [75] is based on a combination of multi-modal classifiers (SVM, RF, and KNN).

In Kanjo et al. [76], they used a deep learning approach for emotion categorization through an iterative process of adding and removing sensor signals from various modalities in a real-world study employing smartphones and wearable devices. The approach incorporated the local interactions of three sensor modalities (on-body, environmental, and location) into a global model that reflects signal dynamics and the temporal links correlating them. Various learning algorithms were applied to the raw sensor data, including a hybrid approach that integrates convolutional neural networks and long short-term memory recurrent neural networks (CNN-LSTM); a generic sketch of this hybrid idea is given below.
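
The hybrid idea, convolutions extracting local patterns from each sensor window followed by an LSTM modeling their temporal dynamics, can be sketched as follows in PyTorch. This is a generic illustration of a CNN-LSTM over multichannel sensor windows, not the exact architecture, channel count, or window length of Kanjo et al. [76].

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Hybrid CNN-LSTM over windows of multichannel sensor data."""
    def __init__(self, n_channels, n_classes, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, channels, time)
        h = self.conv(x)                  # (batch, 64, time/4)
        h = h.transpose(1, 2)             # LSTM expects (batch, time, features)
        _, (hn, _) = self.lstm(h)
        return self.fc(hn[-1])            # emotion logits

# e.g., 10-second windows at 8 Hz over HR, EDA, body temp, motion, noise, UV, air pressure
model = CNNLSTM(n_channels=7, n_classes=5)
logits = model(torch.randn(4, 7, 80))
```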

The results revealed that deep learning approaches were effective for human emotion categorization (average accuracy 95% and F-measure 95%), and the hybrid models beat standard fully connected deep neural networks (average accuracy 73% and F-measure 73%) when a wide range of sensors was used. The hybrid models also outperformed previously developed ensemble approaches that relied on feature engineering to train the model (average accuracy 83% and F-measure 82%) [21, 76].

By allowing robots to understand emotions and body movements and react accordingly, emotion recognition technology could improve human–machine interaction and enhance the user experience.

Table 5 (Ilyas et al. [70]) is an example of fusing or combining more than one modality to achieve higher accuracy levels and reports the accuracy obtained with each methodology. It can be noticed that fusing multiple modalities improves the accuracy of the model.

Table 5 Results of different evaluation metrics for each frame-based emotion recognition method [70]

The model presented in this research [70] detects emotions (anger, disgust, happiness, fear, sadness, surprise, neutral) using upper body movements (hand and head movements) and facial expressions. Once this correlation has been mapped, tasks like mood and gesture recognition can easily be performed using facial features and movement vectors. The method employs a deep CNN trained on benchmark datasets displaying diverse emotions and body movements.

Features obtained from facial movements and body motion are fused to improve emotion recognition performance. A variety of fusion approaches (feature-level fusion and decision-level fusion) are used to combine multi-modal signals for non-verbal emotion recognition. The algorithm achieved 76.8% emotion recognition accuracy using solely upper body movements, outperforming the previous 73.1% on the FABO dataset.

Furthermore, using the FABO dataset, multi-modal compact bilinear pooling with temporal information outperformed the state-of-the-art method with an accuracy of 94.41%.

Liisi Kööts researched the influence of weather on affective experience (the link between negative and positive emotions and weather variables like temperature, relative humidity, barometric pressure, and brightness) [82]. Similarly, other studies have looked at reactions and their links to wellbeing and physiological changes; however, only one of these has looked at merging physiological and wellbeing sensors with ecological sensors to forecast and model emotion [35, 57, 71, 75, 81, 84, 118, 123, 137].

The above-mentioned research encourages the inclusion of environmental measurements in emotion recognition models.

Information fusion (which includes merging multiple data sources to provide consistent and accurate information) has three levels:

  (a)

    Data-level fusion (low-level) tries to combine various data components from many sensors to complement one another. During data collection, it is possible to incorporate other data sources, such as user self-reported emotions [31, 49, 70, 167].

  (b)

    The feature level (intermediate-level data fusion) is used to pick the best set of characteristics for categorization during data analysis. Using feature-level fusion, the best combination of features, such as EMG, Respiration, Skin Conductance, and ECG, has been obtained [57, 70, 167].

  (c)

    High-level data fusion (decision-level) aims to improve decision-making by combining the outcomes of different methodologies. See Field et al. [49] and Ilyas et al. [70] for more information on data fusion algorithms and applications in body sensor networks (a toy feature- and decision-level sketch follows this list).

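As a toy illustration of the feature-level and decision-level ideas just listed, the snippet below concatenates per-sensor feature vectors and majority-votes per-modality predictions; the feature dimensions and modality names are placeholders, not drawn from any cited study.

```python
import numpy as np

# Feature-level fusion: concatenate feature vectors extracted from each sensor.
hr_feats = np.random.rand(100, 4)        # e.g., heart-rate statistics per window
eda_feats = np.random.rand(100, 6)       # e.g., EDA/SCR features per window
env_feats = np.random.rand(100, 3)       # e.g., noise, UV, air-pressure summaries
fused_features = np.hstack([hr_feats, eda_feats, env_feats])   # shape (100, 13)

# Decision-level fusion: majority vote over per-modality emotion predictions.
def majority_vote(pred_matrix):
    """Column-wise majority vote; rows are modalities, columns are samples."""
    return np.array([np.bincount(col).argmax() for col in pred_matrix.T])

preds = np.vstack([
    np.random.randint(0, 5, 100),   # classifier trained on HR features
    np.random.randint(0, 5, 100),   # classifier trained on EDA features
    np.random.randint(0, 5, 100),   # classifier trained on environmental features
])
fused_decision = majority_vote(preds)    # one emotion label per sample
```
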
Information fusion has also been achieved in Younis et al. [174]. This study built a user-independent predictive emotion model by integrating/fusing various modalities from heterogeneous sensors (environmental and physiological) using ensemble learning methods (bagging, boosting, and stacking) with a series of ML algorithms (SVM, DT, NB, and RF) as base classifiers to classify five distinct emotional states ranging from very negative to very positive. The results showed that the stacking ensemble method achieved a higher accuracy of 98.2% compared with the other ensemble methods.
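
A hedged sketch of such a stacking ensemble with scikit-learn is shown below; the base classifiers follow the list above (SVM, DT, NB, RF), while the meta-learner and cross-validation settings are assumptions, not taken from Younis et al. [174].

```python
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Base classifiers as named in the study; the logistic-regression meta-learner is an assumption.
stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),
        ("dt", DecisionTreeClassifier()),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
# stack.fit(X_train, y_train)              # X: fused sensor features, y: five emotion classes
# accuracy = stack.score(X_test, y_test)
```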

The study by Wang et al. [167] also applied information fusion to predict emotional states. The authors created the Multi-modal Emotion Database with Four Modalities (MED4) as the first step of the multi-modal emotion database construction process. MED4 is a collection of synchronously recorded speech, facial images, photoplethysmography, and EEG signals from participants as they responded to happy, sad, angry, and neutral emotion-inducing video stimuli. Thirty-two volunteers participated in the study, which was conducted in an anechoic chamber and in a research lab with background noise. Four baseline algorithms, namely identity vector + probabilistic linear discriminant analysis (I-vector + PLDA), temporal convolutional network (TCN), extreme learning machine (ELM), and multi-layer perceptron (MLP), were created to test the database and the effectiveness of AER approaches. Additionally, two fusion algorithms were developed to use both internal and external data on the human state, at the feature level and the decision level, respectively. The findings demonstrated that EEG signals are more accurate in identifying emotions than speech signals (88.92% in an acoustically quiet environment and 89.70% in one with naturally occurring noise, vs. 64.67% and 58.92%, respectively). When speech and EEG signals are combined, fusion procedures increase total emotion detection accuracy by 25.92% compared to speech alone and 1.67% compared to EEG in acoustically quiet conditions, and by 31.74% and 0.96% in naturally noisy conditions. Fusion techniques also improve the robustness of AER in noisy environments.

Table 6 gives another example of information fusion: physiological sensors used to monitor human health are combined with environmental sensors (UV, EnvNoise, and AirPressure) to predict emotional states.

Table 6 List of some on-body sensors that have been used for emotion detection

4.4 Emotion elicitation methods

There are many stimuli used to elicit emotions in the literature. They can be classified as follows:

  • Film clips With this technique, participants are shown the entire short film or selected parts. The fundamental benefit of this approach is that it provides access to a wide range of emotions, including love, rage, fear, joy, and others. On the other hand, its disadvantages include the necessity of separating specific interesting portions of the shown film. Additionally, because emotions are transitory, delaying the evaluation of that emotion may result in bias in labeling [83].

  • Pictures This method works by displaying a sequence of pictures to participants. Its advantages are that it is easy to use and allows self-reporting; its drawback is the lack of standardization [4, 83].

  • Music This method is implemented by playing music for participants. Its advantages are that it is easy and simple, highly standardized, and emotions develop over time (15–20 min). Its drawbacks are that musical taste might influence the experienced emotions and that the method yields only positive or negative moods rather than discrete emotions [51, 71, 81, 170].

  • Emotional behaviors as emotional stimuli In this situation, the goal is to modify the target person’s feelings by influencing his/her interpretation of a behavior. This technique has the advantage of eliciting emotional responses through a wide range of channels (posture, eye gaze, tone of voice, breathing, and emotional actions). On the other hand, while certain occurrences are straightforward to control, others (such as making someone angry) are more complex [4].

  • Dyadic interaction tasks In this setting, emotion is elicited through interaction between several types of pairs (friends, romantic partners, family members, etc.). This method has the advantage of producing a wide range of emotional responses while also allowing researchers to investigate emotion in social contexts. Its disadvantages are as follows: (1) it requires a large time and resource commitment; for example, dyadic interaction procedures can take 2–4 h to complete; (2) some interactions might be insufficient (the participant switches topics to avoid an emotional outburst); and finally, (3) it may simply show emotion by example (Ali et al. [4]).

  • In the wild Experiments The previously mentioned methods are called lab experiments. Recent research tends to use real-world ’in the wild’ experiments. These methods rely on real experiments in real-life settings such as people performing their daily activities like shopping [75, 174].

In general, the choice of elicitation scenario or stimuli depends on the target emotions and the available sensors. The music and picture scenarios, for example, will not be useful if the researcher needs to extract a speech signal from the subject; emotional behaviors as emotional stimuli and dyadic interaction tasks are both relevant scenarios in that case (Ali et al. [4]).

5 Summary of previous research on machine learning for emotion recognition

In this section, we give an overview of the authors’ contributions and findings for each modality mentioned above.

Regarding the facial expression recognition modality, some researchers used facial expressions as a uni-modal source to predict emotional states. The following results are examples of research using facial expressions for emotion identification.

The work presented in Tarnowski et al. [155] offered a method for identifying seven primary emotional states based on facial expressions: neutral, joy, surprise, anger, sadness, fear, and disgust. Because the face is the most visible area of the body, computer vision systems (often cameras) can analyze the image of the face to detect emotions. They employed Microsoft Kinect for 3D face modeling in this experiment due to its low cost and ease of use; the Kinect has a low scanning resolution but a fast image registration rate (30 frames per second) [155] and contains two cameras and an infrared emitter. Six participants between the ages of 26 and 50 took part in the study. Each participant sat at a distance of 2 m from the Kinect device, and the participant’s task was to mimic facial expressions according to instructions on a computer screen. Researchers used photographs from the KDEF database [79] to create the instructions, which include the name of the emotional state and a picture of an actor performing the relevant expression [155]. This experiment produced emotion classification accuracies of 96% (3-NN) and 90% (MLP) for a random division of the data. For the “natural” partition of the data, the classification accuracy across all users was 73% for the MLP classifier, while the 3-NN classifier was 10% lower in the same setting, which demonstrates that neural networks are capable of generalization [155].

To create an algorithm for real-time emotion recognition using virtual markers and an optical flow algorithm that works well in unstable conditions, the authors in Hassouneh et al. [61] used convolutional neural network (CNN) and long short-term memory (LSTM) classifiers to classify the emotional expressions of physically disabled people (deaf, mute, and bedridden) and children with autism based on facial landmarks and electroencephalograph (EEG) signals. They employed ten virtual markers to gather data on six facial emotions (happiness, sadness, anger, fear, disgust, and surprise). Fifty-five college students with a mean age of 22.9 years (35 male and 25 female) voluntarily participated in the facial emotion identification experiment, and 19 undergraduate students volunteered for the EEG signal collection. Haar-like features are employed for the first stage of face and eye detection; virtual markers are then placed at specific locations on the subject’s face based on a facial action coding system, and the Lucas-Kanade optical flow method is used to track them. The distance between each marker point and the center of the subject’s face is used as a feature for classifying facial expressions, while features for emotion classification from EEG are derived from the fourteen channels of the EEG headset (EPOC+). The features are then fed to the LSTM and CNN classifiers with fivefold cross-validation. With CNN, they detected emotions from facial landmarks with a maximum recognition rate of 99.81%, whereas the maximum recognition rate for emotion identification from EEG signals using the LSTM classifier was 87.25%.
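
The face-detection and marker-tracking steps described here rely on standard OpenCV building blocks; a minimal sketch is given below. The video source and marker positions are placeholders, and this is only an illustration of a Haar-cascade plus Lucas-Kanade pipeline, not the authors’ implementation.

```python
import cv2
import numpy as np

# Haar-cascade face detection on a grayscale frame.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

cap = cv2.VideoCapture(0)                      # placeholder video source
ok, prev_frame = cap.read()
prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(prev_gray, scaleFactor=1.1, minNeighbors=5)

# "Virtual markers": here simply seeded at fixed relative positions inside the face box.
x, y, w, h = faces[0]
markers = np.float32([[x + w * fx, y + h * fy]
                      for fx, fy in [(0.3, 0.4), (0.7, 0.4), (0.5, 0.7)]]).reshape(-1, 1, 2)

# Lucas-Kanade optical flow tracks the markers into the next frame.
ok, frame = cap.read()
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
new_markers, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, markers, None)
# Distances from each tracked marker to the face centre could then serve as features.
```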

Chowdary et al. [29] achieved an average emotion recognition accuracy of 96% on the CK+ database using SVM and CNN classifiers. Umer et al. [158] used a CNN to predict seven emotion classes (happiness, sadness, fear, disgust, surprise, anger, and neutral) and achieved average accuracies of 77.8% on the KDEF dataset, 87.2% on the GENKI dataset, and 92.8% on the CK+ dataset.

Bargal et al. [13] introduced an emotion recognition algorithm for videos. First, they cropped frames to the required size, converted them to grayscale, and applied histogram equalization. Second, they trained three well-known CNN models using the AFEW dataset (Dhall et al. [37]) and another dataset as additional training data. Third, the outputs of the three CNN models were concatenated and encoded to create a set of feature vectors. Fourth, these feature vectors were passed to an SVM classifier to classify emotions. The suggested method surpassed the then state-of-the-art methods with an accuracy of 59.42% on the AFEW dataset.

Dandil et al. [33] presented a convolutional neural network (CNN)-based facial emotion classifier. The proposed CNN consists of three convolution layers, a max-pooling layer after the first convolution layer, two average-pooling layers after the second and third convolution layers, two fully connected layers, and a softmax layer. The Viola-Jones face detection algorithm [163] was used to detect faces in images. The authors created a dataset of 3600 images to train and evaluate the proposed approach and used 240 images as test data. The method obtained a maximum accuracy of 72% in the evaluation.
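
The following Keras sketch follows the layer ordering described above (three convolutions, max pooling after the first, average pooling after the second and third, two dense layers, and a softmax). The filter counts, kernel sizes, and the 48x48 grayscale input are assumptions for illustration, not the values used in [33].

```python
# Sketch of a CNN with the layer ordering described for [33].
# Filter counts, kernel sizes, and the 48x48 input shape are assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu"),
    layers.AveragePooling2D(2),
    layers.Conv2D(128, 3, activation="relu"),
    layers.AveragePooling2D(2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(7, activation="softmax"),   # seven emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```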

Regarding the text emotion recognition modality, the following examples summarize the results of previous papers that used written text to predict emotional states with ML and DL techniques.

The study by Acheampong et al. [2] provided a comprehensive overview of techniques for identifying emotional states from written text. The authors used SVM, KNN, MLP, NB, and DT as base classifiers to classify emotional states (joy, happiness, sadness, fear, anger, surprise, disgust, neutral, fun, worry, love, hate, enthusiasm, boredom, relief, empty, and scared) and obtained the following results: KNN achieved an average accuracy of 83%, SVM 77%, MLP 77%, NB 74%, and DT 74%.

In Nandwani and Verma [120], the authors used NB, SVM, RF, and CNN as base classifiers to classify emotional states such as furious, cheerful, and depressed, as well as positive, negative, and neutral sentiment. They showed that Naive Bayes achieved an F1 score above 90% in binary classification and above 60% in three-class sentiment classification. They also showed that random forest (RF), with an accuracy of 95.6%, performed better than the NB classifier, while the SVM classifier achieved an average accuracy of 85.47% and the CNN achieved 80% accuracy.

The study by Bharti et al. [16] addressed the limitations of sentiment analysis that includes emotion detection. Several natural language processing (NLP) approaches have previously been applied to extract emotions from text, including keyword-based, lexicon-based, and machine learning approaches. However, because they focus on semantic relations, keyword- and lexicon-based techniques have drawbacks. To improve results, the authors of Bharti et al. [16] proposed a hybrid (machine learning + deep learning) model for identifying emotions in text. The deep learning techniques used were Bi-GRU and convolutional neural networks (CNNs); the machine learning approaches were support vector machine, random forest, naive Bayes, and decision tree. Three types of datasets (sentences, tweets, and dialogues) were used to assess the performance of the hybrid approach. Among the benefits illustrated for the suggested approach are the ability to work with multi-sentence texts, tweets, dialogues, keywords, and vocabulary words of easily detectable emotions. They reported the following outcomes: among the ML classifiers, SVM provided the maximum accuracy of 78.97% compared to RF, NB, and DT. With the DL approach, the CNN model had the highest F1 score (80.76%) and the Bi-GRU model had the highest accuracy (79.46%). The hybrid model, which combines CNN, Bi-GRU, and SVM, achieved an F1 score of 81.27%, a precision of 82.39%, a recall of 80.40%, and an accuracy of 80.11%.
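
As a minimal sketch of the classical-ML branch of such a text emotion pipeline, the following example feeds TF-IDF features to a linear SVM. The example sentences and labels are purely illustrative and do not come from the datasets used in [16].

```python
# Minimal sketch: TF-IDF features + linear SVM for text emotion classification.
# The texts and labels below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["I am so happy today!",
         "This is terrifying, I can't watch.",
         "I feel completely miserable.",
         "What a wonderful surprise!"]
labels = ["joy", "fear", "sadness", "joy"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)

print(clf.predict(["Today was a great day"]))
```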

Related to body gesture, posture, and facial expression emotion recognition, the study by Mittal et al. [114] used CNN and LSTM models to classify the emotional states anger, happiness, neutrality, sadness, disgust, fear, and surprise based on body gestures and postures together with facial expressions. The authors used two datasets: IEMOCAP, which contains four emotion labels (angry, happy, neutral, and sad) and on which they achieved an average accuracy of 78.2% using LSTM and CNN, and the CMU-MOSEI dataset, which contains six emotion labels (angry, disgust, fear, happy, sad, and surprise) and on which they achieved a mean classification accuracy of 85.0%.

The authors in Dzedzickis et al. [44] provided a summary of the primary relationships between body postures and emotions as in Table 7. They used computer vision systems and analysis algorithms that can follow the motions of selected reference points to measure facial expressions, body posture, and gestures.

In the realm of emotion recognition, such a measurement approach has advantages since it allows for non-contact (non-invasive) measurements and delivers reliable results [44].

There are some limitations or drawbacks of the presented methods: (i) they only recognize strong emotions that persist for a certain amount of time; weak emotions or extremely brief, non-intense stimuli do not produce visible facial movements or detectable changes in body posture. (ii) When tracking body posture, it is hard to determine the exact position of a reference point covered by clothes, so special markers for vision systems should be used in such cases [44].

Despite the indicated drawbacks, facial expressions, body posture, and gesture tracking are still promising tools in the emotion recognition domain. Tables 9, 10 provide a summary of studies, including the analysis of facial expressions, body posture, and implemented emotions [44].

Table 9 shows that in the majority of studies, facial expression, body posture, and gesture analysis methods were used together and complemented by other techniques to improve recognition accuracy [44]. When compared with the methods discussed previously, these approaches are among the most promising for future applications, particularly practical applications that do not require great precision and sensitivity, owing to their broad applicability [44].

Table 7 Relations between emotions and body posture and gestures Metri et al. [111], Lee et al. [88]

The study by Ilyas et al. [70] used a combination of ML and DL methods (CNN, SVM, RNN, RNN-LSTM) as base classifiers to identify the emotional states happiness, sadness, anger, and fear from a combination of facial expressions and body gestures and postures. Using the CNN, they obtained the following results across all emotions: 77.7% for facial expression features, 76.8% for upper body movement (hand and head movement) features, 85.7% for bimodal average fusion, 86.6% for bimodal product fusion, and 87.2% for bimodal bilinear pooling.

Also, the authors in Raman et al. [130] used random forest (RF), logistic regression (LR), gradient boosting classifier (GBR), and ridge classifier (RC) to classify 12 emotional states (happy, angry, disagree, disgust, fear, hello, namaste, okay, sad, shock, surprise, and victorious). They reported that random forest, logistic regression, and the ridge classifier each achieved an average accuracy of 1.00 (100%), while the gradient boosting classifier achieved an average accuracy of 96%.

Regarding the physiological signals modality, many papers have used physiological signals to predict emotional states, as shown in Table 11.

To provide an accurate approach to emotion recognition using wearable technology, the authors in Domínguez-Jiménez et al. [43] proposed a model for recognizing three emotions (amusement, sadness, and neutral) from physiological signals. Using video clips to elicit the target emotions, 37 volunteers were monitored through two biosignals: galvanic skin response and photoplethysmography, which measures heart rate. These signals were examined in the time and frequency domains to derive a set of features, and several classifiers and feature selection strategies were assessed. The best model used random forest recursive feature elimination for feature selection and a support vector machine for classification. The findings demonstrate that all three emotions can be identified from simple galvanic skin response features; the authors recognized the three target emotions with an accuracy of up to 100% on the test dataset.
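
The following sketch illustrates the kind of feature-selection and classification scheme described above: recursive feature elimination driven by a random forest, followed by an SVM. The feature matrix and labels are random placeholders standing in for GSR/PPG features, and the number of selected features is an assumption.

```python
# Sketch: random-forest-driven recursive feature elimination + SVM,
# in the spirit of the scheme described for [43]. Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))        # placeholder GSR/PPG features
y = rng.integers(0, 3, size=120)      # amusement / sadness / neutral

model = make_pipeline(
    RFE(estimator=RandomForestClassifier(n_estimators=200, random_state=0),
        n_features_to_select=10),     # assumed number of retained features
    SVC(kernel="rbf"),
)
print(cross_val_score(model, X, y, cv=5).mean())
```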

With the aid of CNN-based classification of multi-spectral topological images derived from EEG signals, the authors in Ozdemir et al. [122] suggested a novel method for estimating emotional states. By transforming EEG data into a series of multi-spectral topological images, the temporal, spectral, and spatial information of the EEG signals is preserved, in contrast to the majority of EEG-based techniques, which discard spatial information. A deep recurrent convolutional network is trained on sequences of three-channel topographical images to learn significant representations. The reported test accuracies were 90.62% for negative versus positive valence, 86.13% for high versus low arousal, 88.48% for high versus low dominance, and 86.23% for like versus dislike.

The authors of Yin et al. [173] combined individuals' EDA features with musical features and proposed a residual temporal and channel attention network. By applying a channel-temporal attention mechanism to EDA-based emotion identification in order to investigate dynamic and steady temporal and channel-wise information, they demonstrated the efficiency of the proposed network for mining EDA features.

The goal of the study by Romaniszyn-Kania et al. [131] was to develop a tool and propose a physiological dataset to complement psychological data. The study group consisted of 41 students aged 19 to 26. The research protocol was built on acquiring the electrodermal activity signal with the Empatica E4 device during three exercises carried out in a prototype Disc4Spine system, combined with psychological research techniques. Various data clustering and optimization techniques (hierarchical and non-hierarchical) were examined in the context of the emotions experienced. The k-means classifier performed best during Exercise 3 (80.49%) and when the EDA signal was combined with negative emotions (80.48%). A comparison of the k-means classification with an independent division made by a psychologist again showed the best results for negative emotions (78.05%).

Sepulveda et al. [142] applied a wavelet scattering algorithm to extract characteristics of ECG signals from the AMIGOS database as inputs to various classifiers and evaluated their performance, reporting accuracies of 88.8%, 90.2%, and 95.3% for valence, arousal, and two-dimensional classification, respectively.

To distinguish between a driver's calmness and anxiety, Wang et al. [168] used ECG data such as time-frequency domain, waveform, and nonlinear properties along with their previously described emotion detection model. Accuracy values of 91.34% for calmness and 92.89% for anxiety were attained.

Li [67] collected 140 ECG signal samples triggered by Self-Assessment Manikin emotion self-assessment experiments with the International Affective Picture System and used a Wasserstein generative adversarial network with gradient penalty to generate additional samples for the various classes. The outcomes demonstrated that increasing the amount of data improved the accuracy and weighted F1 scores of all three classifiers.

Zhang et al. [176] first measured the EEG signals and extracted features from them before processing the EEG data with a modified radial basis function neural network algorithm. They then compared and discussed the experimental results of various classification models. The results demonstrated that the improved algorithm outperformed competing algorithms.

Wagh and Vasanth [164] divided the EEG data into three emotional states and used the discrete wavelet transform to decompose the EEG signal into its component frequency bands. They also extracted time-domain characteristics from the EEG signal to distinguish between emotions. The results showed that the highest frequency band performed well in emotion recognition, with maximum classification rates of 71.52% and 60.19% when the decision tree and k-nearest neighbor classifiers were used, respectively.
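
For illustration, the sketch below extracts simple sub-band features from EEG epochs with a discrete wavelet transform (PyWavelets) and classifies them with a decision tree, in the spirit of the approach described above. The EEG epochs, wavelet choice, and decomposition level are assumptions, and the data are synthetic.

```python
# Sketch: DWT sub-band features from EEG epochs + decision tree classifier.
# Wavelet, level, and epoch length are assumptions; the data are synthetic.
import numpy as np
import pywt
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def dwt_features(signal, wavelet="db4", level=4):
    """Energy and standard deviation of each sub-band coefficient vector."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for c in coeffs:
        feats.extend([np.sum(c ** 2), np.std(c)])
    return feats

rng = np.random.default_rng(0)
epochs = rng.normal(size=(90, 512))   # 90 synthetic EEG epochs (e.g., 4 s @ 128 Hz)
y = rng.integers(0, 3, size=90)       # three hypothetical emotional states

X = np.array([dwt_features(e) for e in epochs])
print(cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean())
```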

Priyasad et al. [128] proposed a novel method based on a deep neural network multi-task classifier to determine dimensional emotional states (low/high) from raw EEG signals. In comparison to state-of-the-art techniques, their model achieved accuracy levels of 88.24%, 88.80%, and 88.22% for arousal, valence, and dominance, respectively, using 10-fold cross-validation; 63.71%, 64.98%, and 61.81% with leave-one-subject-out (LOSO) cross-validation on the DREAMER dataset; and 69.72%, 69.43%, and 70.72% for a LOSO evaluation on the DEAP dataset.

Mohsen et al. [116] presented a long short-term memory model for classifying positive, neutral, and negative emotions, tested on a dataset comprising three emotion classes and a total of 2100 EEG samples from two participants. According to the experimental findings, the model reached a testing accuracy of 98.13% and a macro-average precision of 98.14%.

In the study by Doma and Pirouz [42], epoch data from EEG sensor channels are analyzed and multiple machine learning techniques, including support vector machine (SVM), k-nearest neighbor, linear discriminant analysis, logistic regression, and decision trees, are compared. Each of these models is tested both with and without principal component analysis (PCA) for dimensionality reduction. Grid search over a Spark cluster was also used for hyperparameter tuning, reducing execution time for each of the evaluated machine learning models. The study used the DEAP dataset, a multi-modal dataset for the analysis of human affective states. The predictions were based on the participants' labels for each of the 40 one-minute music video clips; each clip was scored by participants for arousal, valence, like or dislike, dominance, and familiarity. For each of the four classes, binary classifiers were trained on separate sets of time-segmented, 15-s intervals of epoch data. The best segmentation result was achieved using PCA with SVM, which provided an F1 score of 84.73% with 98.01% recall in the 30th to 45th segmentation interval. Different classification models converge to higher accuracy and recall than others for each of the time segments and binary training classes; the findings demonstrate the need for several classification methods to categorize various emotional states.
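
A minimal sketch of the PCA + SVM setup with hyperparameter grid search described above is given below. The feature matrix stands in for time-segmented EEG epoch features, and the number of PCA components and the parameter grid are assumptions, not the settings of [42].

```python
# Sketch: PCA for dimensionality reduction + SVM with grid search,
# as a stand-in for the setup described in [42]. Data are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 160))     # placeholder epoch features
y = rng.integers(0, 2, size=200)    # binary class, e.g., high/low valence

pipe = Pipeline([("pca", PCA(n_components=30)), ("svm", SVC())])
grid = GridSearchCV(pipe,
                    param_grid={"svm__C": [0.1, 1, 10],
                                "svm__kernel": ["rbf", "linear"]},
                    cv=5, scoring="f1")
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```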

Lee et al. [89] used a combination of physiological signals and achieved performances of 80.18% and 75.86% for arousal and valence using deep learning autoencoders. In addition, Bizzego et al. [18] achieved accuracies of 0.93 (train) and 0.94 (test) using a DNN, and 0.64 (train) and 0.61 (test) using an SVM, for basic emotion classes.

To classify emotions, researchers have used EMG, respiration (RSP), skin temperature (SKT), heart rate (HR), skin conductance (SKC), and blood volume pulse (BVP) as input signals. The features extracted from the EMG are temporal and frequency parameters. Temporal parameters include the mean, standard deviation, mean of the absolute values of the first and second differences (MAFD, MASD), distance, and so on; the frequency parameters are the mean and standard deviation of the spectral coherence function. This approach obtained an 85% recognition rate for various emotions (Gouizi et al. [55]).
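
The temporal features named above are straightforward to compute; a small sketch for one physiological channel follows. The input array is a synthetic placeholder for an EMG (or similar) recording.

```python
# Sketch: temporal features (mean, std, MAFD, MASD) for one physiological channel.
import numpy as np

def temporal_features(x):
    x = np.asarray(x, dtype=float)
    return {
        "mean": x.mean(),
        "std": x.std(),
        "mafd": np.mean(np.abs(np.diff(x, n=1))),   # mean abs. first difference
        "masd": np.mean(np.abs(np.diff(x, n=2))),   # mean abs. second difference
    }

emg = np.random.default_rng(0).normal(size=1000)    # placeholder EMG samples
print(temporal_features(emg))
```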

In the reference work AlZoubi et al. [7], ECG, EMG, and GSR signals were employed to categorize eight emotions. The 21 features extracted from the facial EMG and the other signals included the mean, median, standard deviation, maxima, minima, the first and second derivatives of the preprocessed signal, and their transformations. In the study by Yang and Yang [172], the mean, median, standard deviation, minimum, maximum, minimum rate, and maximum rate of the preprocessed signals were used to classify four emotions, with a recognition rate of 85% achieved by a support vector machine (SVM).

In the study of Xu et al. [171], the authors collected EMG, EDA, ECG, and other signals from eight participants using the Biosignalplux research kit, a wireless real-time biosignal acquisition unit with a series of physiological sensors. They employed SVM, naive Bayes (NB), KNN, and decision tree (DT) classifiers, with DT providing the best accuracy using the skin temperature (ST), EDA, and EMG signals, reaching an average recognition accuracy of over 81%.

The authors of [34] used ECG and GSR signals to distinguish among three emotions, namely happy, sad, and neutral. Features were extracted from the ECG and GSR signals, and the SVM classifier achieved high accuracies of 93.32%, 91.42%, and 90.12% for the three emotional states, respectively.

Hao et al. [60] used a CNN to predict arousal and valence emotion classes from visual-audio stimuli and achieved accuracies of 81.36% and 78.42% in speaker-independent and speaker-dependent experiments, respectively.

Regarding the physiological and speech stimuli modalities, the following results are examples of studies that used these modalities to identify emotional states.

The study by Garg et al. [51] trained multiple ML algorithms, including Lasso regression, elastic net regression, ridge regression, kNN, SVR (RBF), SVR (polynomial), SVR (linear), DT, RF, MLP, and AdaBoost, on different datasets to predict emotion (arousal and valence) values from music stimuli and physiological signals. On the PMEmo and DEAM datasets they reported average RMSE values of 0.34 (Lasso), 0.34 (elastic net), 3.91 (ridge regression), 0.27 (KNN), 0.23 (SVR-RBF), 1250201.29 (SVR-poly), 5.31 (SVR-linear), 0.28 (DT), 0.30 (RF), 50.49 (MLP), and 0.26 (AdaBoost) in the arousal dimension, and 0.30 (Lasso), 0.30 (elastic net), 5.16 (ridge), 0.25 (KNN), 0.22 (SVR-RBF), 741397.72 (SVR-poly), 3.65 (SVR-linear), 0.26 (DT), 0.26 (RF), 30.49 (MLP), and 0.22 (AdaBoost) in the valence dimension.
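
The following sketch shows how such a comparison of regressors by RMSE can be set up; the features and continuous valence targets are synthetic placeholders, and the model hyperparameters are assumptions rather than those of [51].

```python
# Sketch: comparing several regressors by cross-validated RMSE for a
# continuous arousal/valence target. Data and hyperparameters are placeholders.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))     # placeholder audio + physiological features
y = rng.uniform(-1, 1, size=300)   # placeholder valence annotations

models = {
    "Lasso": Lasso(alpha=0.01),
    "Ridge": Ridge(),
    "ElasticNet": ElasticNet(alpha=0.01),
    "KNN": KNeighborsRegressor(),
    "SVR-rbf": SVR(kernel="rbf"),
}
for name, model in models.items():
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE = {rmse:.3f}")
```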

The study by Kose et al. [83] investigated the potential of EOG and EMG signals for emotion recognition, considering four emotions: happy, relaxed, angry, and sad. The authors provided an improved method for emotion recognition using horizontal electrooculogram, vertical electrooculogram, zygomaticus major electromyogram, and trapezius electromyogram signals, with emotions elicited by audio-visual songs. Time-domain, frequency-domain, and entropy-based features are extracted for emotion classification, and support vector machines, naive Bayes, and artificial neural networks are used as classifiers, compared in terms of accuracy, average precision, and average recall. The key contribution of Kose et al. [83] is the identification of time-domain features as the optimal characteristics for EOG and EMG data, with the ANN classifier achieving the maximum classification accuracy. The study reported the following results: combining the ANN classifier with time-domain features yields the highest classification accuracy (99%); compared to the SVM and NB classifiers, the ANN proved to be the most accurate classifier overall, with an accuracy of 98%; and the overall accuracy for time-domain features was 92.75%, compared to entropy- and frequency-domain features.

The study by Hu et al. [68] used skin conductance and subjective pleasure-arousal-dominance emotion evaluations to analyze variations in people's assessments of the tactile sensation of beech surfaces of varying shapes and roughness. They discovered that beech with arc shapes could help participants preserve some mental stability even under conditions of relatively high emotional reactivity. Regarding how beech was perceived, men exhibited a wider range of emotional arousal and a slower rate of emotional arousal than women.

Wu and Chang [170] conducted an experimental investigation on the effects of music on emotions using ECG. The findings indicated that the autonomic sympathetic nervous system was strengthened, repressed, and remained unaffected by fast, moderate, and slow music, respectively. Additionally, they proposed the usage of music as a stress reliever.

Andreu-Perez et al. [8] recorded films of the players’ faces while they were playing the video game “League of Legends” and used functional near-infrared spectroscopy to image the players’ brain activity. This information was used to decode the players’ skill level in a multi-modal framework, marking the first time this has been done using non-restrictive brain imaging technologies. The best tri-class classification precision, according to them, was 91.44%.

Several experiments on emotion recognition using voice and physiological cues have been undertaken in the past. Early research used subject-dependent techniques, in which the emotion recognition system serves only one user and must be retrained or re-calibrated before being used by another. The emphasis has now shifted to subject-independent approaches, in which the emotion identification system is generic (usable by any user). Table 8 presents a short review of previous work on emotion recognition using speech and physiological signals.

The table illustrates which signals were evaluated, what emotion-eliciting stimuli were used, which emotions were recognized, the number of people in the study, which features were extracted, and which classification algorithms were used.

The table also includes the accuracy of the methods. The maximum accuracy achieved in the case of subject-dependent techniques was 96.58% for recognizing three arousal levels. An accuracy of 95% was achieved for four emotions. Moreover, a 91.7% accuracy level was obtained for six emotions. Thus, the accuracy levels depend on the number of target emotions and the type of model.

Subject-independent techniques, on the other hand, reached a maximum accuracy of 99.5% for recognizing one emotion (stress), 86% for two emotions, and 70% for detecting four emotional states. For physiological signals, we can also notice that, besides the feature extraction and classification approaches, the type of emotion stimuli affects the accuracy of the model.

In general, the sensors used, the number of subjects, the emotional states, the stimuli, and the feature extraction and classification methods are the parameters required to build a robust and reliable emotion recognition system (Ali et al. [4]).

Most of the work presented in this category relies on subject-dependent (personalized) models. Moreover, all the presented studies were conducted in laboratory settings. It can also be noted that the highest performance, 99.5% accuracy, was achieved by combining EDA and HR signals.

Table 8 Previous work on emotion recognition using physiological and speech signals

Regarding the information fusion and physiological signals modalities, many papers used these modalities and achieved higher results, as shown in Sect. 4.3.

Table 9 lists previous works that used facial expressions, body gestures and postures, and physiological signals either as single modalities or in combination. For each study it gives the aim, the emotions considered, the modalities used, and the experimental hardware devices used to extract the features for identifying emotional states.

Table 9 Review of scientific research work focused on emotion recognition and evaluation by the analysis of facial expressions, body posture, and gestures

Finally, we summarize the ML algorithms most commonly used in the emotion recognition modalities described in detail in Sects. 3 and 4, according to the results discussed in this section, in Tables 10 and 11. These tables list previous works from 2020 to 2022 that used different ML classifiers, either with a single modality (facial expressions, text, body gestures and postures, physiological signals, or speech) or with a combination of them, to predict emotional labels. Note that, as shown in Table 11, for the physiological and environmental modality there is only one recent study in this period (2020-2022).

Table 10 The most common ML algorithms used in Emotion Recognition
Table 11 The most common ML algorithms used in physiological and speech stimuli, physiological and environmental factors for emotion recognition

6 Challenges and open research avenues

There are many research challenges for emotion recognition from the various modalities. These challenges can be classified as follows. First, regarding datasets, many datasets are available for emotion recognition, but most of them are either uni-modal (using only one measurement, such as HR) or collected in the laboratory. What is missing are real-world multi-modal datasets collected in the wild, to be used as benchmarks for research experiments and for comparing algorithms. In addition, collecting real data is always a challenge, as the data collection devices are sometimes invasive, such as those used for brain signals. Many sensors are now built into smartwatches and wristbands, so it can be beneficial to find correlations between these signals.

Second, regarding classification models, there are many opportunities to improve their performance in terms of increasing accuracy and decreasing error rates. This can be done by using hybrid models, such as a CNN combined with other classification algorithms, and by using ensemble methods to avoid the drawbacks of individual algorithms, such as over-fitting or under-fitting.
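
As one simple illustration of the ensemble idea, the sketch below combines heterogeneous classifiers in a soft-voting ensemble. The feature matrix and emotion labels are synthetic placeholders, and the choice of base classifiers is an assumption for illustration.

```python
# Sketch: soft-voting ensemble of heterogeneous classifiers, one simple way
# to reduce the over-fitting risk of any single model. Data are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))    # placeholder emotion features (any modality)
y = rng.integers(0, 4, size=200)  # placeholder labels for four emotions

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=200)),
                ("svc", SVC(probability=True))],
    voting="soft")
print(cross_val_score(ensemble, X, y, cv=5).mean())
```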

Third, concerning model generalization, most of the work presented here consists of personalized models (a model for each user). However, generalized (generic, subject-independent) models that can be used for any user are needed. Such models can reduce the time and effort required to create models, and they can be adapted to individual users by fine-tuning, i.e., using transfer learning to derive personalized models.
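
The sketch below illustrates the fine-tuning idea: a generic (subject-independent) network is personalized by freezing its shared layers and retraining only the output head on a small amount of user-specific data. The architecture, input shape, and the user data referenced in the commented line are assumptions.

```python
# Sketch: personalizing a generic emotion model via transfer learning /
# fine-tuning. Architecture and shapes are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, models

# A previously trained, generic (subject-independent) classifier (assumed).
generic = models.Sequential([
    layers.Input(shape=(64,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(7, activation="softmax"),
])
generic.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Personalization: freeze the shared representation, retrain only the head
# on a small, user-specific dataset with a low learning rate.
for layer in generic.layers[:-1]:
    layer.trainable = False
generic.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# generic.fit(x_user, y_user, epochs=10)   # hypothetical user-specific data
```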

Finally, regarding model transparency and trust, there is a tendency in ML and AI research toward open, transparent models, which is essential for gaining user trust. Explainable artificial intelligence (XAI) is a sub-field of AI that provides methods for explaining such models, and it is a promising research area for emotion recognition systems.
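
As a very simple example of model inspection in this spirit, the sketch below uses permutation importance to estimate which input features an emotion classifier relies on; it is only one basic interpretability technique, and the features and labels are synthetic placeholders.

```python
# Sketch: permutation importance as a basic inspection technique for an
# otherwise black-box emotion classifier. Data are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))      # e.g., HR, EDA, and facial features
y = rng.integers(0, 2, size=300)    # e.g., high vs. low arousal

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=20, random_state=0)
for i in np.argsort(result.importances_mean)[::-1][:5]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f}")
```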

7 Conclusions and future work

In this paper, we reviewed about 140 research papers in the field of emotion recognition, the core of Affective Computing. AHER is a powerful and effective approach for assessing human emotional states and forecasting human behavior in order to deliver the most appropriate marketing or educational strategies. It is also beneficial in various human–machine interaction systems.

This review presented various ML algorithms as well as the physical and physiological sources of emotional information: facial expressions, text messages, body gestures and postures, physiological signals, speech, and environmental sensors. These are the most commonly used modalities in AHER, which rely on measurements of various parameters and on ML methods for emotion recognition.

Some studies used uni-modal methods (only one modality) and others used multi-modal methods (combining more than one modality). The results showed that using multi-modal methods can have a positive impact on the performance of emotion recognition models.

Selecting among these methods depends on the nature of the problem and the available data. In this work, we highlighted various studies that contributed to the debate over what constitutes emotion and whether we can experimentally quantify emotions, in addition to presenting various ML algorithms.

Given the subjective nature of emotions, developing an efficient method for recognizing different emotional states remains a significant challenge. The majority of cutting-edge existing research depends on subject-dependent techniques or personalized models. To create a generic AHER system, multi-modal datasets and suitable algorithms for emotion identification are needed.

Challenges related to AHER can be summarized as challenges concerning data, methods, and models. Collecting real data is always a challenge, as the data collection devices are sometimes invasive, such as those used for brain signals, and developing accurate models requires huge amounts of data collected in the wild. Concerning methods, recent research suggests that hybrid deep learning and ensemble learning methods are promising in terms of model accuracy, but in terms of explainability the presented models are black boxes, so more work is needed to make them explainable (understandable by humans).

Future research should address the following points:

  • It should concentrate more on deploying multi-modal data and approaches to emotion recognition, as combining more than one modality with ML and data analysis will lead to advances in practical applications in a variety of disciplines, ranging from advertising and marketing to education.

  • It should concentrate on ML approaches accompanied by explainable artificial intelligence (XAI), since models that are transparent and easily understood by humans will boost the adoption of these methods in real-life applications.

  • It should also focus on subject-independent approaches, in which the emotion identification system is generic (user independent).

  • In addition, it should concentrate on deep learning methods, which can also automate the feature engineering process.