A novel driver emotion recognition system based on deep ensemble classification

Driver emotion classification is an important topic that can raise awareness of driving habits, because many drivers are overconfident and unaware of their bad driving habits. If their behavior is identified automatically, drivers gain insight into their poor driving behaviors and are better able to avoid future accidents. In this paper, we combine convolutional neural network, recurrent neural network, and multi-layer perceptron classifiers to construct an ensemble convolutional neural network-based driver facial expression recognition model. First, the drivers' faces are detected using an improved faster region-based convolutional neural network (Faster R-CNN) model, which recognizes faces reliably and efficiently in both real-time and offline video. A feature-fusion technique integrates the features extracted by three CNN models, and the fused features are then used to train the proposed ensemble classifier. To increase the accuracy and efficiency of face detection, a new convolutional neural network block (InceptionV3) replaces the feature-learning block of the improved Faster R-CNN. On the face detection and driver facial expression recognition (DFER) benchmarks, the proposed approach achieved accuracies of 98.01%, 99.53%, 99.27%, 96.81%, and 99.90% on the JAFFE, CK+, FER-2013, AffectNet, and custom-developed datasets, respectively; the custom-developed dataset yielded the best result under the simulation environment.


Introduction
In computer vision and artificial intelligence, facial expression is a significant and promising field of research and one of the primary channels through which intentions are expressed nonverbally. When interacting with other people, it is impossible to avoid experiencing emotions [1]. They may or may not be visible to the naked eye. Therefore, trained professionals can detect and recognize any indication [2] before or after it is expressed if they have the appropriate tools at their disposal, regardless of when it occurs [3]. R-CNN and deep learning classifier techniques are used for emotion recognition. Application areas touched on in this study include medicine [4,5], human-machine interfaces [6], urban sound perception [7], and animation [8]. Several fields, including the diagnosis of autism spectrum disorder in children [9] and security [10,11], are seeing increasing use of emotion recognition technology. To recognize emotions, various features such as EEG signals [9], facial expressions (FE) [5,12,13], text [14], and speech [15,16] are used.
Moreover, FE features are one of the most widely used cues for human emotion recognition, for several reasons: (1) they are noticeable and visible; (2) they allow large face datasets to be collected quickly and easily; and (3) they carry a large number of features for emotion recognition [17,18]. Through deep learning, specifically CNN-based learnable image features [15], it is also possible to compute, learn, and extract good facial expression representations [19,20]. Experts predict that FE will become increasingly significant due to advances in artificial intelligence and the rising demand for applications in the era of big data. To be effective in complex environments, such as those characterized by occlusion, multiple views, and multiple targets, facial emotion recognition (FER) solutions must be proposed in novel and innovative ways. When training an FE classifier, it is highly desirable to collect the most relevant data under the most favorable conditions so that facial expressions can be classified accurately [21]; this is sometimes called the "golden rule." Traditionally, a DFER system first preprocesses the input image to accomplish this. Face detection is a preprocessing step included in most peer-reviewed papers and is described in detail here. The nose and mouth are the most frequently used facial expression cues, even though numerous regions of the human face can cue facial expressions. The cheeks, forehead, and eyes are just a few of the other parts of the face that can cue different types [22] of FE. According to a recent study, only a small amount of information comes from the ears and hair when detecting FE [23].
Accordingly, since the mouth and eyes convey more FE information than the rest of the face, a computer vision deep learning model should place the greatest emphasis on these regions and de-emphasize the others. A CNN framework for DFER is proposed in this manuscript [24], based on the findings of this study and incorporating the observations above. Attentional mechanisms, in particular, are employed [25] to draw attention to the essential features of the face. Attentional convolutional networks can achieve very high accuracy even with relatively few layers (i.e., no more than 50).
Numerous feature extraction techniques have been used in the literature for recognizing driver emotions, but these techniques limit how well emotions can be extracted and recognized. In addition, most emotion recognition systems rely on handcrafted features (such as grayscale statistics, RGB histograms, RGB statistical features, and geometrical features) and traditional machine learning classifiers (e.g., SVM, K-NN, and Naive Bayes). Most feature extraction work in recognition systems employs standard methods such as LBP, HOG, and GLCM. However, handcrafted features are considered less robust due to their non-invariant nature, and they produce too many features, which can negatively affect model training and validation. Some invariant descriptors, such as SURF, SIFT, and ORB, operate on low-level features such as edges and corners. Nevertheless, given the vast number of images in a dataset, these descriptors may not achieve high accuracy. Image data contain various forms of noise, blurriness, oversharpening, and unbalanced contrast, which may impair model training and result in low accuracy.
This paper develops a deep learning model based on the improved Faster R-CNN and a deep ensemble classifier for emotion recognition, with the ability to observe the driver's emotions while driving a vehicle. The Faster R-CNN model has been improved for detecting the driver's face region: its learning block has been replaced with an improved CNN block (InceptionV3), which improves the accuracy and efficiency of face detection. The proposed approach can work at full potential in environments where older approaches fall short. The main contributions of this paper are as follows:
• A detailed systematic study is carried out on driver emotion recognition in videos and images.
• A custom driver emotion recognition dataset is developed, in which the emotions of 30 drivers were recorded in challenging environments (illumination and obstacles).
• Face detection is performed using the improved Faster R-CNN.
• Transfer learning is performed using state-of-the-art CNN models such as DenseNet (201 layers).
• The proposed models are validated on various driver emotion benchmark datasets, including JAFFE, CK+, FER-2013, AffectNet, and the custom-developed dataset (CDD). To improve model accuracy, data augmentation is used to expand the datasets.
The paper is organized as follows: the first section introduces the manuscript; the next section highlights the related work.

Related work
The six primary emotions (pleasure, fear, hate, sorrow, disgust, and surprise, excluding neutral) are identified in [26]. Ekman utilized this concept to create the facial action coding system (FACS) [27], which became the gold standard for emotion recognition research. Neutral was later added to most human emotion recognition datasets as a seventh fundamental emotion. Figure 1 shows sample pictures of various emotions from the four benchmark datasets (FER-2013, JAFFE, CK+, and AffectNet). The primary emotions are the happy, angry, disgusted, fearful, sad, surprised, and contemptuous faces.
Initial studies on emotion recognition employed a two-step machine learning methodology. In the first step, the image's attributes are extracted; in the second, classifiers are used for emotion detection. Gabor wavelets [28], Haar features [29], texture features such as the local binary pattern (LBP) [30], and edge histogram descriptors [31,32] are some of the most frequently exploited handcrafted features for FE detection. A classifier then assigns the image the most appropriate sentiment. These techniques tend to be effective on more specialized datasets but show significant limitations when applied to challenging datasets (with greater intra-class variation). To help the reader better understand some of the difficulties images can present, the first row of Fig. 1 includes an image in which only the subject's eyes are visible or the face is partially covered by a hand.
Numerous groups have made significant advancements in neural networks, deep learning, image categorization, and vision challenges. In [33], Khorrami demonstrated that CNNs can achieve a better accuracy level for emotion recognition.
Moreover, a zero-bias CNN attained state-of-the-art results on the Toronto Face Dataset (TFD) and the extended Cohn-Kanade dataset (CK+) when applied to modeling human facial expressions. To construct an FE model for stylized animation characters, the authors in [34] trained a network using deep learning and translated human images to animated faces. A FER neural network with a top pooling layer, two convolution layers, and four inception layers or subnetworks was proposed by Mollahosseini [35]. The authors in [36] integrate feature extraction and classification in a single recurrent network, highlighting the importance of input from both components. The BDBN network was utilized to achieve state-of-the-art CK+ and JAFFE accuracy.
The authors in [37] trained a deep CNN on noisily labeled authentic images acquired through crowdsourcing. They deployed ten taggers to re-annotate each image to obtain the required precision, using ten tags in their dataset and several cost functions for their deep convolutional neural network (DCNN). To improve the spontaneous recognition of facial expressions, the authors in [38] used more discriminative neurons in an Incremental Boosting CNN (IB-CNN) that outperformed prior work. The authors of [39] created an identity-aware CNN (IA-CNN) that uses identity- and expression-sensitive contrastive losses to reduce variation from expression-related information during identity learning. Similarly, an end-to-end network architecture with a focal model has been developed [40]. To minimize uncertainty and prevent ambiguous face images (caused by labeling noise) from overfitting the deeper network, the authors in [41] devised a fast and efficient self-cure network (SCN). SCN reduces uncertainty in two ways: (1) a self-attention mechanism weights each training sample in small batches with rank regularization; and (2) the samples in the lowest-rank set are carefully relabeled by an algorithm. In the real world, robustness to occlusion and pose variation is also required [42]; the authors created a new network, the region attention network (RAN), to adequately capture the importance of facial regions for pose-variant FER and FER under occlusion. Deep attention networks for facial emotion recognition [43], multi-attention networks for facial expression recognition [44], and a recent review of emotion recognition from facial appearance [45] are some of the related works on FE recognition. All the works mentioned above [46] have improved emotion recognition significantly over previous work. Still, none of them contains a simple method for identifying the essential face regions for detecting emotions. This study [47] suggests a new framework based on an attentional convolutional neural network to focus on critical facial areas [48,49].
The authors of [50] proposed ENCASE, which combines expert features and deep neural networks (DNNs) for electrocardiogram (ECG) classification.

The proposed methodology
The proposed ensemble learning framework involves several levels of merging different classifiers trained on different feature sets. Figure 2 illustrates the entire process of the proposed framework. Specifically, as shown in the feature preparation layer of Fig. 2, various feature sets are prepared using different feature extraction techniques. Furthermore, the feature set obtained with a given method is further processed by various feature selection methods to obtain multiple feature subsets. Three state-of-the-art CNN models, DenseNet, InceptionV3, and ResNet-50, are used to extract high-dimensional features from the training and validation set images. The extracted features are concatenated to create a fused feature vector. An ensemble classifier is built from three deep learning classifiers: a CNN, a gated recurrent unit (GRU), and a multi-layer perceptron (MLP). The final output is selected by a voting scheme and assigned the class label predicted by the majority of the classifiers. In a real-world scenario, ensemble learning can be implemented in a more flexible configuration than that described in Fig. 2.
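As a minimal sketch of the fusion-and-voting idea described above (the feature dimensions, random features, and class labels below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Hypothetical per-image feature vectors; in the paper these come from
# DenseNet, InceptionV3, and ResNet-50 backbones (sizes assumed here).
rng = np.random.default_rng(0)
f_densenet = rng.random(1920)
f_inception = rng.random(2048)
f_resnet = rng.random(2048)

# Feature fusion by concatenation into a single fused feature vector.
fused = np.concatenate([f_densenet, f_inception, f_resnet])

def majority_vote(predictions):
    """Return the class label predicted by most base classifiers."""
    values, counts = np.unique(predictions, return_counts=True)
    return values[np.argmax(counts)]

# Hard voting over hypothetical CNN, GRU, and MLP predictions.
label = majority_vote(np.array([3, 3, 5]))
```

With two of the three base classifiers agreeing, the majority label wins; ties would fall to the smallest label under `np.unique`'s sorted output.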
For example, multiple base classifiers trained on the same feature set can be combined into a primary integration and then combined with the remaining base classifiers to form a secondary integration. In this way, a secondary integration can be created from each feature set, and then some or all of the secondary integrations can be combined to create a top-level, final integration. The proposed framework can be viewed as a philosophical strategy for structural thinking; granular computing, in particular, can also be applied to the problem of driver emotion recognition. Each integration can be conceptualized as a single model in an ensemble learning framework because it contains multiple classifiers. Images can be of different sizes, which led to the notion of granularity: image size helps improve the proposed model's performance, as it changes the proportion corresponding to the model size, and the length and width of images and models can vary considerably. Each level of classifier fusion can be interpreted as a different level of granularity in the proposed learning framework.
Three models are used in this stage: a GRU, an MLP, and a CNN. The MLP is frequently used for problems requiring supervised learning and for research in computational neuroscience and parallel distributed processing. The GRU is a type of RNN that uses less memory than long short-term memory (LSTM) and is considered more efficient; however, LSTM is more accurate on datasets with longer sequences, so the GRU is only sometimes advantageous. These models run on the fused and concatenated features. Each model also has a distinct architecture in which different tasks are carried out. The next step is to predict the data in tabular form so that a confusion matrix can be generated once these models have been fully processed. The final stage produces curve graphs in the simulation environment based on the confusion matrix.
Several benchmark datasets, including AffectNet, CK+, FER-2013, JAFFE, and the custom-developed dataset, are used to train the proposed ensemble CNN classifier. Note that we trained a separate model for each dataset used in this study. Separate validation and test sets are used for model parameter tuning and performance evaluation. The confusion matrix is an important tool consisting of the counts of true positives, false positives, false negatives, and true negatives. These counts are used to validate model performance via the accuracy, precision, recall, and F1-score metrics.
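The four metrics named above reduce to a few formulas over the confusion-matrix counts; a small illustrative helper (the counts in the usage line are made up):

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts
    (one class treated one-vs-rest)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # of predicted positives, how many correct
    recall = tp / (tp + fn)             # of actual positives, how many found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for one emotion class.
acc, prec, rec, f1 = metrics_from_counts(tp=90, fp=10, fn=10, tn=90)
```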

Face detection system
The face detection system allows images to pass through a specific facial recognition process and be detected using the existing dataset. It combines R-CNN and deep learning methods for images with rectangular regions and schemes that utilize CNN features. The Faster R-CNN model detects objects in two steps: the first identifies a subset of regions in the target image where an object may be present, while the second classifies the objects in those regions, as shown in Fig. 3.
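Two-stage detectors of this kind score overlapping region proposals against each other and against ground-truth boxes; intersection-over-union (IoU) is the standard overlap measure used for this. A minimal sketch of IoU, not the paper's implementation (the example boxes are made up):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned [x1, y1, x2, y2] boxes."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

overlap = iou([0, 0, 2, 2], [1, 1, 3, 3])  # two partially overlapping boxes
```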
The proposed face detector, named "Improved Faster R-CNN," uses this two-step scheme for emotion recognition. In the first step, fully connected layers are applied to the image in the detector, with multiple layers up to the actual image and the segmented image. The image is placed in the system, where it is checked against the dataset using the corresponding regions and other extracted features. The position and expression in the image are matched with the dataset, and the detector then processes it further for analysis and recognition. In the second module, the improved Faster R-CNN is applied to the proposed regions of the images. Figure 3 shows an example of the face detection system, in which the two images show the actual image, denoted the test image.

Regional convolutional neural network (R-CNN)
In the image processing stage, the improved Faster R-CNN algorithm uses edge boxes to generate regions of interest. The proposed scheme then resizes and crops the image to the appropriate size. The resized region is then processed by a CNN, which uses features trained with an SVM to identify the size and shape of the region, as shown in Fig. 5.

Data augmentation
In computer vision, data augmentation uses different approaches to increase the number of images in a given dataset by organizing and analyzing the existing data. Using image processing techniques, a single image is replicated to increase the quantity of image data, which is effective for computer vision and deep learning models when the original dataset is small. There are many ways to augment data, such as changing the red, green, blue (RGB) channels, applying affine transformations, translating, rotating, adjusting contrast, adding or removing noise, shifting color balance, sharpening, flipping, cropping, and scaling, as shown in Fig. 6.
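A few of the operations listed above can be sketched with plain NumPy; the 48×48 grayscale size, noise level, and contrast factor are illustrative assumptions, not values from the paper:

```python
import numpy as np

def augment(image, rng):
    """Yield simple augmented copies of a grayscale image with values in [0, 1]."""
    yield np.fliplr(image)                                           # horizontal flip
    yield np.rot90(image)                                            # 90-degree rotation
    yield np.clip(image + rng.normal(0.0, 0.05, image.shape), 0, 1)  # additive noise
    yield np.clip(image * 1.2, 0, 1)                                 # contrast-style scaling

rng = np.random.default_rng(0)
img = rng.random((48, 48))            # a hypothetical face crop
augmented = list(augment(img, rng))   # one image yields four extra samples
```

Each original image thus contributes several variants to the training set, which is how the paper's small benchmark datasets are expanded.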

Transfer learning-based on driver emotion recognition (DER)
Deep learning models employ transfer learning techniques to improve their performance when applying methods learned on one task to many challenges. This method has been used almost exclusively for recognition applications such as speech recognition and image recognition in computer vision [49]. Here, it is used to analyze and evaluate the images placed in the detector for later evaluation. This approach has been applied to training on the benchmark datasets using both test data and augmented data. Whereas a novel model must be trained on the entire dataset from scratch, transfer learning does not require the model to be trained for many epochs, which decreases the computational burden.

Transfer learning
Transfer learning is most commonly carried out with the following processes and steps. Figure 7 depicts the overall approach and steps of transfer learning, with each step elaborated separately.
In the initial step, the data are loaded into the pre-trained network for further analysis. For the new task, the weights depend on the available data. After that, the final layers are replaced with new final layers, in which fewer parameters are used so that learning is faster. Then comes the train-network phase, in which hundreds of images across seven classes are used. The network then proceeds to the prediction and assessment stage.
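The replace-the-final-layers step can be illustrated with a toy stand-in for a pre-trained network; a real pipeline would load a framework's pre-trained model, and every name, shape, and flag below is hypothetical:

```python
import numpy as np

# Toy stand-in for a pre-trained network: each "layer" is a weight matrix
# plus a trainable flag. The frozen layers keep their pre-trained weights.
rng = np.random.default_rng(0)
network = [{"w": rng.random((64, 64)), "trainable": False} for _ in range(4)]
network.append({"w": rng.random((64, 1000)), "trainable": True})  # old 1000-class head

# Replace the final layer with a new 7-class head for the seven facial
# expressions; only the new head starts out trainable, so training is fast.
network.pop()
network.append({"w": rng.random((64, 7)), "trainable": True})

n_trainable = sum(layer["trainable"] for layer in network)
```

Only one layer's weights are updated at first, which is why transfer learning needs far fewer epochs than training from scratch.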

DenseNet
The ResNet structure is given in Fig. 8. In this scenario, the convolution layers of the CNN are trained and delivered to the next CNN stage [48] as the values of X_i and H_i grow. In a traditional CNN, the layers are connected in sequence, as given in Eq. (1):

X_i = H_i(X_{i-1})    (1)

where i stands for the layer index, H_i denotes the non-linear operation, and X_i expresses the features of the i-th layer. Such a network can become deep and hard to train, and gradients may vanish or explode. To address this, ResNet introduces shortcut connections that skip at least two layers: the output of the convolution block is added to its input through the shortcut, so the summed output is as illustrated in Eq. (2):

X_i = H_i(X_{i-1}) + X_{i-1}    (2)

DenseNet then revises this model by concatenating all the previous feature maps instead of summing the output, as given in Eq. (3):

X_i = H_i([X_0, X_1, ..., X_{i-1}])    (3)

where [X_0, X_1, ..., X_{i-1}] denotes the concatenation of the feature maps from all preceding layers.
The block diagram of the DenseNet is given in Fig. 9. Following Eq. (3), DenseNet concatenates the previous feature maps and passes them to subsequent layers; the feature maps are gathered and combined into one newly generated feature map. The newly designed DenseNet offers advantages such as mitigating the vanishing and exploding gradient problems and enabling feature reuse. However, for the DenseNet structure to be feasible, some changes are needed, such as downsampling to make concatenation possible. The overall steps are given in Fig. 9, in which, from left to right, the number of accumulated maps increases step by step (S + 1, S + 2, and so on), where S corresponds to the first module and the later values are the concatenation maps.
Each layer is linked with every other layer, which in a five-layer block makes a total of 10 connections. Figure 10 shows that each layer generating S feature maps applies one H operation. Across the 5 layers, the number of input feature maps increases per layer up to a maximum of S0 + 4S, where S0 denotes the number of feature maps from the previous layer. In this study, 32 is kept as the default value of S. Nevertheless, because the number of inputs to later layers becomes massive, a "bottleneck" layer was also introduced for the DenseNet: a convolution layer placed before the main convolution layer. That layer helps decrease the number of image features and thus the computational cost. After that, to preserve the model's accuracy, a "transition" layer is used to reduce the number of feature maps. Suppose S feature maps are generated by a dense block under an assumed compression factor; the feature maps are then reduced accordingly, and if the compression factor is 1, the number of feature maps stays the same. Figure 10 shows the relation between dense blocks and transition layers.
The overall structure and procedure of DenseNet are illustrated in Fig. 10, in which the input layers, GAP layer, dense blocks, and transition layers are shown with transitions from one layer to another. The transition layers consist of a batch normalization layer, a convolution layer, and a middle pooling layer applied with a stride of two. The GAP (global average pooling) layer is similar to traditional pooling methods but performs a more powerful reduction: it reduces each feature map to a single value over the whole slice.
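The concatenation behavior of a dense block, Eq. (3) above, can be sketched on flat feature vectors; the stand-in for H_i below is purely illustrative, not the batch-norm/ReLU/convolution composite used in the real network:

```python
import numpy as np

def dense_block(x0, num_layers, growth):
    """Toy dense block on flat feature vectors: each layer sees the
    concatenation [X0, X1, ..., X_{i-1}] and appends `growth` new features."""
    features = [x0]
    for _ in range(num_layers):
        inputs = np.concatenate(features)   # all earlier feature maps
        out = np.tanh(inputs[:growth])      # illustrative stand-in for H_i
        features.append(out)
    return np.concatenate(features)

x0 = np.ones(64)                              # S0 = 64 input features (assumed)
y = dense_block(x0, num_layers=5, growth=32)  # S = 32, as in the text
```

The output width grows linearly, S0 + num_layers × S, which is exactly why the bottleneck and transition layers described above are needed to keep the network affordable.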

Features weights optimization
To optimize the model, we fine-tuned pre-trained CNNs on our dataset and evaluated each one separately. Based on the results, we used the pre-trained DenseNet CNN. A distinctive aspect of the proposed study is image augmentation: an additional training dataset was generated using augmentation techniques. To mitigate overfitting, data augmentation is applied during the training of the CNN models. We applied random vertical and horizontal shifts of up to 10% of the original dimensions. Further, random rotations of up to 20% were applied to the training images, together with a small random zoom.
In addition, we flip the images horizontally to enlarge the dataset. We removed all fully connected layers to optimize each network and used only the convolutional part of each model's architecture. After the final convolutional layer, we add a global average pooling layer, followed by a classification layer with SoftMax nonlinearity. Using a learning rate of 0.0001 and a momentum of 0.90, 50 iterations of stochastic gradient descent (SGD) optimization are used to refine the network. In all cases, the loss function is the cross-entropy, and the validation set is used to tune the hyperparameters. Note that each network's input has a distinct shape, so the initial stage of data preparation entails resizing all photos according to the model input and saving them in files of various formats. The same initialization and learning rate rules are used to train all models.
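The SGD-with-momentum update used here follows the standard rule; a minimal sketch with the learning rate and momentum stated above (the gradient values are made up):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.0001, momentum=0.90):
    """One SGD-with-momentum update: the velocity accumulates a decaying
    average of past gradients, and the weights move along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.zeros(3)
v = np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])   # hypothetical gradient for one weight vector
w, v = sgd_momentum_step(w, grad, v)
```

Repeating this step over mini-batches for 50 iterations is the refinement procedure described above.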

Results and discussion
The proposed driver facial expression recognition (DFER) method has been tested and validated on various standard datasets for this section. In this study, we compared the proposed FER method with state-of-the-art methods, and the collected data also encompass qualitative and quantitative evaluation results. The proposed system uses five benchmark datasets. Each dataset is randomly divided into two parts: a training set (70% of the dataset) and a validation set (30%). All simulations are performed in a simulation environment on the MATLAB R2021a platform, on the workstation described in Table 2.
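The random 70/30 hold-out split described above can be sketched as follows (the dataset size in the usage line is JAFFE's 213 images; the seed and function name are assumptions):

```python
import numpy as np

def holdout_split(n_samples, train_frac=0.70, seed=0):
    """Random hold-out split returning disjoint train/validation index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)      # shuffle all sample indices
    n_train = int(round(train_frac * n_samples))
    return idx[:n_train], idx[n_train:]

train_idx, val_idx = holdout_split(213)   # JAFFE has 213 images
```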

Datasets
Due to funding, workload, and time constraints and the requirements of algorithm performance evaluation, most FER researchers rely on benchmark datasets. The most commonly used normative datasets range from sentiment inquiry to evaluation; among them are the extended Cohn-Kanade (CK+) dataset, the Japanese Female Facial Expressions (JAFFE) dataset, and FER-2013. In this work, we used the FER-2013 facial expression dataset [5], CK+ [49], AffectNet [49], JAFFE [14], and a custom-developed dataset for DFER. This section briefly overviews the benchmark and custom datasets used in this work, then reports the performance of our models on the benchmark datasets and the custom-developed dataset and compares the results with recent related work.

CK + dataset
The CK+ (extended Cohn-Kanade) dataset is a facial expression image dataset [45]. It is publicly available for recognizing the driver's facial expressions as active units. It contains both posed and non-posed expressions, which makes analysis easier. The dataset comprises 593 image sequences across 123 subjects. The last frame of each sequence was taken, as in existing works, to form the FER image base. Samples of seven images from this dataset are given in Fig. 12.

JAFFE dataset
The JAFFE dataset contains basic FE posed by Japanese female models. It comprises two hundred and thirteen (213) images of seven different FE. Like CK+ with its added contempt class, each dataset uses the seven most common basic FE. Each image was rated on the 7 FE by 60 Japanese subjects [46]. Figure 13 shows seven images from the dataset.

AffectNet dataset
AffectNet is one of the largest freely available datasets in FER research [47]. It is a real-life FE dataset consisting of facial images and annotations: over 1 million facial images collected from the internet by querying three major search engines with over 1250 emotion-related keywords in six different languages. About 440,000 of the images were manually annotated for discrete expressions (the categorical model) as well as for valence and arousal (the dimensional model). AffectNet is the largest publicly available dataset of FE, valence, and arousal, enabling investigators to conduct automated FER research in two distinct emotion models. A deep neural network is used to classify the images and predict the valence and arousal of each face. Accuracy is reported over eight categories (happy, surprised, sad, fear, disgust, anger, contempt, and neutral), shown in Fig. 14.

Custom developed dataset (CDD)
For the custom dataset, our own dataset was created and used for analysis and evaluation, combining the proposed and existing datasets for driver FE recognition [48], as shown in Fig. 15. The deep learning approach filters and extracts features from the custom and state-of-the-art datasets. The images capture the driver's expressions in a moving vehicle over extended periods in real-time scenarios, so as to capture the right moment. Each image of each subject is tested against the extracted features. Multiple images were taken from moving vehicles such as a Toyota Land Cruiser, a Honda Civic, and a Toyota Prius. In these scenarios, up to 10 minutes of data were recorded for each subject. Every subject in this study is a male driver aged 25 to 40; some have beards and some do not, and some wear caps and some do not. Videos were also recorded and evaluated for real-time recognition, in which obstacles appear and the lighting changes as the vehicle moves forward. The proposed dataset and the benchmark datasets are used for the analysis and evaluation of the DFER system.

Results
The DFER method described above was tested on several standard datasets and found effective. This section includes a quantitative and qualitative evaluation of the outcomes obtained from the collected data and a comparison of the proposed technique with current FER techniques. Concretely, the proposed system uses five reference datasets. Each dataset is randomly split into a training set and a test set, with the training set much larger than the test set.
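The random hold-out split used throughout the experiments can be sketched as follows; this is a minimal illustration using the JAFFE image count of 213, and a real pipeline might instead use scikit-learn's `train_test_split`.

```python
import random

def hold_out_split(samples, train_frac=0.7, seed=42):
    """Randomly split a list of samples into train/test (hold-out) sets."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# e.g. 213 JAFFE-style image ids split 70/30
train, test = hold_out_split(list(range(213)))
```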

Experiments on the JAFFE dataset
Experiments on the JAFFE dataset are conducted using a random hold-out splitting strategy, which yields the most accurate results. In the first stage, 189 images (70%) are used for training and 24 images (30%) for validation. In the second stage, the JAFFE-augmented dataset contributes 6237 training images, and 792 images are used for model validation.
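The augmented JAFFE set is produced by applying image transformations to each original image. Below is a minimal sketch using only horizontal flips and 90° rotations on a toy pixel grid; the actual augmentation pipeline (Fig. 6) uses a wider range of image processing techniques.

```python
def flip_h(img):
    """Horizontal flip of an image given as a list of pixel rows."""
    return [row[::-1] for row in img]

def rotate_90(img):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Produce simple augmented variants: the original, flips, and rotations."""
    variants = [img, flip_h(img)]
    r = img
    for _ in range(3):
        r = rotate_90(r)
        variants.append(r)
        variants.append(flip_h(r))
    return variants

img = [[1, 2], [3, 4]]       # toy 2x2 "image"
aug = augment(img)           # 8 variants per source image
```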
Figure 16a shows an accuracy of 89.2% on the original JAFFE dataset, while Fig. 16b shows the training and validation loss plots of the DenseNet CNN model for the original JAFFE dataset. The confusion matrices of the original and augmented JAFFE datasets are shown in Fig. 18a and b, while the detailed classification test accuracy of the original JAFFE dataset is shown in Table 3.
Using the JAFFE-augmented dataset, the DenseNet CNN model reaches an accuracy of 98.01%, as shown in Fig. 17a, while Fig. 17b illustrates the corresponding training and validation loss plots. Figure 17a and b shows how the accuracy and loss of the proposed DenseNet CNN model on the augmented dataset evolve with the number of epochs, while the detailed test accuracy by class for the JAFFE-augmented dataset is shown in Table 4 (Fig. 18).

Experiments on the CK + dataset
Figure 19a and b shows the training and validation accuracy and loss of the original CK+ dataset. These studies were carried out using the random hold-out splitting method in two phases. In the first phase, for the original CK+ dataset, 444 images (70% of the total) are used for the training set and 192 images (30%) for the validation set. In the second phase, for the augmented CK+ dataset, a total of 14,652 training images and 6336 validation images are used. The detailed test accuracy by class for the augmented CK+ dataset is shown in Table 6 (Fig. 21).

Experiments on the FER-2013 dataset
An accuracy of 99.27% is achieved using the FER-2013 augmented dataset, as shown in Fig. 23a, while Fig. 23b shows the validation loss plot of the augmented FER-2013 dataset using the DenseNet CNN model; the detailed test accuracy by class of the FER-2013 augmented dataset is shown in Table 8 (Fig. 24).

Experiments on the AffectNet dataset
Figure 25a and b illustrates the training and validation accuracy and loss of the original AffectNet dataset. These studies were conducted using a random hold-out split. The first phase uses 187,807 images (70% of the total) for training and 87,346 images (30%) for validation of the original AffectNet dataset. In the second phase, 2,817,105 training images and 1,310,190 validation images are used for the augmented AffectNet dataset. Figure 25a demonstrates an accuracy of 87.37%, while Fig. 25b shows the training and validation loss plot for the original AffectNet dataset with the DenseNet CNN model. The confusion matrices of the original and augmented AffectNet datasets are shown in Fig. 27a and b, while the detailed test accuracy by class of the original AffectNet dataset is shown in Table 9.
The accuracy achieved on the augmented AffectNet dataset is 96.81%, as shown in Fig. 26a, while Fig. 26b shows the training and validation loss plot of the DenseNet CNN model; the detailed test accuracy by class of the augmented AffectNet dataset is shown in Table 10 (Fig. 27).

Experiments on the custom-developed dataset (CDD)
These studies were carried out using the randomized hold-out splitting method. Figure 28a and b illustrates the training and validation accuracy and loss of the original CDD, respectively. In the first phase, 763,880 images (70% of the total) are used for training, while the remaining 329,926 images (30% of the original CDD) are used for validation. In the second phase, 5,347,160 training images are used for the augmented CDD; the detailed test accuracy by class is shown in Table 11.
The DenseNet CNN model achieves 99.90% accuracy on the augmented CDD; Fig. 29a and b shows the corresponding training and validation accuracy and loss plots, while the confusion matrices are shown in Fig. 30.

Statistical tests of the DenseNet model based on the aforementioned datasets
Descriptive statistics are used to summarize a set of observations in order to communicate the largest amount of information as simply as possible. Statisticians commonly describe observations with the following measures:
1. a measure of location, such as the arithmetic mean.
2. a measure of statistical dispersion.
3. a measure of the shape of the distribution, such as skewness or kurtosis.

The ANOVA test of the DenseNet model is shown in Fig. 31, while multiple comparison results are presented in Tables 13 and 14. The DenseNet model using the original datasets is shown in Fig. 32, while the augmented datasets are shown in Fig. 33.
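The three descriptive measures listed above can be computed directly. As an illustrative sketch, here they are applied to the per-dataset accuracies reported in this paper (the specific formulas, population standard deviation and excess kurtosis, are our choice, not stated in the paper):

```python
import math

def describe(xs):
    """Mean, population standard deviation, skewness and excess kurtosis."""
    n = len(xs)
    mean = sum(xs) / n                                   # measure of location
    var = sum((x - mean) ** 2 for x in xs) / n           # dispersion
    sd = math.sqrt(var)
    skew = sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)   # shape
    kurt = sum((x - mean) ** 4 for x in xs) / (n * sd ** 4) - 3
    return {"mean": mean, "std": sd, "skewness": skew, "kurtosis": kurt}

# Accuracies on JAFFE, CK+, FER-2013, AffectNet, and the CDD (augmented).
stats = describe([98.01, 99.53, 99.27, 96.81, 99.90])
```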
Each cell of the matrix is weighted according to its distance from the cell in that row containing the strictly compatible item. This function can use linear or quadratic weights, computed for the original and augmented datasets of the DenseNet model as shown in Tables 15 and 16.
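The weighting scheme described above corresponds to Cohen's weighted kappa. A minimal pure-Python sketch is given below (the paper does not give its exact implementation; scikit-learn's `cohen_kappa_score` provides the same statistic from label vectors):

```python
def weighted_kappa(confusion, weights="linear"):
    """Cohen's weighted kappa from a square confusion matrix.
    Off-diagonal cells are penalized by their distance from the
    diagonal, either linearly or quadratically."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_m = [sum(row) for row in confusion]                        # row marginals
    col_m = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            d = abs(i - j)
            w = d if weights == "linear" else d * d
            num += w * confusion[i][j]                  # observed disagreement
            den += w * row_m[i] * col_m[j] / total      # expected disagreement
    return 1.0 - num / den

# Perfect agreement on a toy 3-class matrix -> kappa = 1.
cm = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]
kappa = weighted_kappa(cm, weights="quadratic")
```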

Comparison of experimental analysis
This study used five datasets: JAFFE, CK+, FER-2013, AffectNet, and the custom dataset. The model has been trained and tested, and the validation accuracy and loss have been illustrated for each. On the datasets mentioned above we report the performance of our proposed DFER model in terms of accuracy and loss. We briefly discuss our training approach before moving to the evaluation process. We have tried to keep the architecture and hyperparameters used for training and testing consistent across datasets. The network weights are initialized from zero-mean Gaussian random variables with a standard deviation of 0.05, and the Adam optimizer is used with a learning rate of 0.005. Other optimizers were also evaluated, including stochastic gradient descent, which performed better when the weight decay was reduced to 0.001; L2 regularization was used together with data augmentation to enlarge the datasets. On FER-2013, JAFFE, CK+, and AffectNet, together with our proposed custom dataset, the model achieved the best performance, and training took 15 days.
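The stated initialization and update rule can be sketched in plain Python as below, using the hyperparameters given in the text (zero-mean Gaussian weights with standard deviation 0.05, learning rate 0.005, weight decay 0.001). This is a conceptual sketch only; the actual model is trained with a deep learning framework.

```python
import random

def init_weights(n, std=0.05, seed=0):
    """Zero-mean Gaussian weight initialization (std 0.05, as in the text)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]

def sgd_step(w, grads, lr=0.005, weight_decay=0.001):
    """One SGD update with L2 weight decay (lr and decay values from the text)."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grads)]

w = init_weights(4)
w = sgd_step(w, grads=[0.1, -0.2, 0.0, 0.3])
```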
On the other hand, our custom dataset took only 4 days. We used minor distortions, small rotations, and flips to augment the data, and oversampling when training on classes with fewer images. Using the original FER-2013 testing set, an accuracy of 99.01% was achieved, while on the FER-2013 augmented dataset the DenseNet model achieved an average accuracy of 99.27% under the simulation environment, compared with the benchmark datasets. A comparison of the outcomes of our proposed model with existing work on FER-2013 is illustrated in the accuracy graph in Fig. 34.
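The random oversampling used for classes with fewer images can be sketched as follows; the class labels and counts here are hypothetical, for illustration only.

```python
import random

def oversample(images_by_class, seed=7):
    """Duplicate random samples from minority classes until every class
    matches the size of the largest class (random oversampling)."""
    rng = random.Random(seed)
    target = max(len(v) for v in images_by_class.values())
    balanced = {}
    for label, imgs in images_by_class.items():
        extra = [rng.choice(imgs) for _ in range(target - len(imgs))]
        balanced[label] = imgs + extra
    return balanced

# Toy example: "disgust" is the minority class and gets duplicated.
data = {"happy": ["h1", "h2", "h3", "h4"], "disgust": ["d1"]}
balanced = oversample(data)
```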
We used 120 images for training on the JAFFE dataset, 23 images for validation, and 70 for testing. The overall accuracy on the original JAFFE dataset is 89.2%. Compared with benchmark datasets, the model achieves an accuracy of 98.01% on the augmented JAFFE dataset, as shown in Fig. 35.

Conclusions
In this paper, a new framework based on CNN is proposed to recognize the emotional state of the driver. Putting aside what we have successfully achieved, several useful extensions can be addressed for further improvement. Only the frontal faces of drivers are used for training and implementation, without considering the influence of head pose variations. Therefore, faces from several views could be taken from the images or videos, which may help to improve recognition accuracy. Deep learning techniques also lack sufficient data to be as effective as they can be; it may therefore be useful to pre-train a deep CNN on other large datasets before applying a fine-tuning process. In future work, a hybrid method can also be developed.

Fig. 1
Fig. 1 Facial appearance and texture feature-based robust DFER framework for sentiment knowledge discovery

Fig. 2
Fig. 2 Proposed ensemble classification model for driver emotions recognition

Fig. 3 Fig. 4
Fig. 3 DFER and emotion detection using an improved faster R-CNN model

Fig. 6
Fig. 6 Image augmentation using various image processing techniques

Fig. 8
Fig. 8 Structure of the ResNet block

Fig. 10
Fig. 10 Dense Blocks with relation to the Transition Layer

Fig. 11 Fig. 12
Fig. 11 Some random images from the FER-2013 dataset. FER-2013 (ICML 2013) was the first dataset used to represent the data for emotion recognition based on an existing dataset [5]. The dataset consists of 35,887 images at a resolution of 48 × 48 pixels, used to analyze and evaluate the proposed work; the majority were taken in real-life scenarios. Of these, 28,709 images form the training set and 3589 images form the test set. The images were captured automatically from Google Image Search using the Google Application Programming Interface (API). The essential aspect for facial expression recognition is that the six basic expressions, plus neutral, are applied to the faces. FER-2013 is a challenging face recognition dataset that shows low contrast and more facial occlusion than other datasets. Some pictures from this dataset are given in Fig. 11.

Fig. 17
Fig. 17 DenseNet CNN model based on training and validation, a accuracy and b loss plot for JAFFE-augmented dataset

Fig. 18
Fig. 18 Confusion matrix of JAFFE a original and b augmented dataset

Fig. 19
DenseNet CNN model based on training and validation, a accuracy and b loss plot for CK+ original dataset

Fig. 20
Fig. 20 DenseNet CNN model based on training and validation, a accuracy and b loss plot for CK + augmented dataset

Fig. 21
Fig. 21 Confusion matrix of CK+ a original and b augmented dataset
Figure 19b illustrates the model training and validation loss plot for the CK+ original dataset, on which an accuracy of 97.20% is obtained. The confusion matrices of the CK+ original and augmented datasets are shown in Fig. 21a and b, while the detailed test accuracy by class of the CK+ original dataset is shown in Table 5.

Fig. 22
Fig. 22 DenseNet CNN model based on training and validation, a accuracy and b loss plot for FER-2013 original dataset

Fig. 23
Fig. 23 DenseNet CNN model based on training and validation, a accuracy and b loss plot for FER-2013 augmented dataset

Figure
Figure 22a and b shows the training and validation accuracy and loss of the original FER-2013 dataset. These studies were conducted using a random hold-out split in two phases: the first phase uses the original FER-2013 dataset and the second phase uses the augmented FER-2013 dataset.

Fig. 24
Fig. 24 Confusion matrix of FER-2013 a original and b augmented dataset

Fig. 25
DenseNet CNN model based on training and validation, a accuracy and b loss plot for AffectNet dataset

Fig. 26
Fig. 26 DenseNet CNN model based on training and validation, a accuracy and b loss plot for AffectNet augmented dataset

Fig. 27
Fig. 27 Confusion matrix of AffectNet a original and b augmented dataset

Fig. 28
DenseNet CNN model based on training and validation, a accuracy and b loss plot for CDD

Fig. 29
Fig. 29 DenseNet CNN model based on training and validation, a accuracy and b loss plot for augmented CDD dataset
Augmentation solves the data imbalance problem and leads to model generalization, since all classes can be brought to the same order of magnitude for training large models. With the use of FER-2013, the other DFER recognition datasets become more accessible. Apart from this, in addition to the within-class variation of the datasets, a further research challenge in FER is the unbalanced nature of the data.

Fig. 30
Fig. 30 Confusion matrix of CDD a original and b augmented dataset

Table 1
Performance of deep learning multi-layer feature-fusion methods

Table 2
Specification of GPU used for model training

Table 3
Detailed test accuracy by class of JAFFE original dataset

Table 5
Detailed test accuracy by class of CK+ original dataset

Table 7
Detailed test accuracy by class of FER-2013 original dataset

Table 8
Detailed test accuracy by class of augmented FER-2013 dataset

Table 9
Detailed test accuracy by class of AffectNet original dataset