1 Introduction

The World Health Organization (WHO) has listed road traffic accidents as the leading cause of death among young people. In addition, between 20 and 50 million people worldwide suffer non-fatal injuries due to accidents, many of which result in disability [1]. Despite these numbers, road traffic fatalities have decreased in recent years. This reduction is due to large-scale awareness campaigns conducted by different organizations and, to a greater extent, to the new technologies included in vehicles to improve road safety.

Examples of these driving assistance technologies are the electronic stability program (ESP) and the anti-lock braking system (ABS). Today, new technologies are being investigated to avoid accidents, and brain–computer interfaces (BCIs) are among them. BCIs have been used primarily as driver support mechanisms, since they provide direct feedback from the brain, and have been applied in various scenarios [2]. For example, some of the existing solutions aim to mark the direction to be followed by an autonomous car [3] or to control a vehicle’s multimedia system [4]. Nevertheless, these systems must consider existing cybersecurity challenges, such as the impact of adverse external stimuli [5] or data disruption based on malicious signals [6]. Other works, for instance, try to detect when the driver is distracted by using electroencephalography (EEG) [7].

In recent years, the relationship between brain waves and a subject’s mood has been studied to detect stress, identify deception, or determine emotional state. EEG has gained great relevance in this context because it is a simple, cheap, portable, and easy-to-use solution for identifying emotions. Thanks to these advances, EEG has been introduced in driving scenarios as an assistive technology [8]. One of these works is proposed in [9] and focuses on detecting the emotional state of the driver in order to improve it with music. However, most of the existing work in the literature focuses on emotion detection in a calm state or, when applied to a driving scenario, does not target emotion detection but other parameters such as drowsiness or distraction. Because of this, the main research question of this work is how a concurrent driving task affects the elicitation and recognition of different emotions. Furthermore, the model must have reduced complexity, since it is intended to be applied in a real scenario. In addition, this work studies how increasing the number and type of detected emotions affects recognition performance, both when emotion elicitation is the primary task (in a calm state) and when it becomes a secondary task (while using a driving simulator).

To address the previous challenges, the main contributions of this paper are the following:

  • The design and implementation of a framework for detecting emotional states based on machine learning (ML) and deep learning (DL) algorithms in driving scenarios. This framework is composed of different layers, directly associated with the BCI cycle. It starts with an acquisition layer that obtains the EEG signals of a subject while driving. After that, a preprocessing layer applies bandpass and notch filters, and independent component analysis (ICA) is then performed to remove possible noise. Next, features related to brain rhythms and entropy are extracted; a total of 280 features have been obtained, thus avoiding possible loss of information. These features are computed over four-second data intervals using a sliding-window model. To improve the performance of the framework so that it can be applied in a real use case, the dimensionality of these features is reduced: features correlated by more than 95% are eliminated, and a selection is then performed using the principal component analysis (PCA) algorithm. Once the data are ready for classification, supervised ML algorithms are applied. In addition, given their increasing use in recent studies, DL algorithms based on different neural networks have also been evaluated to measure their performance.

  • The creation of a scenario composed of (i) a BCI to collect the user’s EEG signals, (ii) a driving simulator, and (iii) a sound stimulus generator. A series of use cases directly related to the driving scenario has been designed to measure the framework’s performance. In particular, the use cases are divided into two phases. On the one hand, the first phase focuses on presenting only auditory stimuli to the subjects while they remain in a calm state. On the other hand, the second phase presents auditory stimuli while the subjects are using the driving simulator. This makes it possible to answer the question of how the execution of the main task affects emotion recognition. Each phase is divided into four sub-phases classified by the type of stimuli presented to the subject: no stimulus, neutral stimulus, positive stimulus, and negative stimulus.

  • The validation of the overall framework and of each individual model by measuring the performance of each algorithm, so that the results can be compared with the literature. The ML algorithms selected were K-nearest neighbors (KNN), random forest (RF), and XGBoost. For DL, neural networks based on long short-term memory (LSTM) and convolutional neural network (CNN) nodes were chosen [10]. The results obtained by this framework show an accuracy of up to 99% for the detection of two emotions, 93% for three emotions, and 75% for four emotions. These results are better than those reported in the literature. Regarding the research question, the results when using a simulator are better because the simulator provokes a more significant impact of the sound stimulus on the subject and, therefore, better separates the different emotional states. In all cases, the best-performing algorithm was RF.

The remainder of the paper is structured as follows. Section 2 reviews the state of the art in emotion recognition and its implementation in driving scenarios. After that, Sect. 3 introduces the elements that compose the created scenario and the interaction between them. Section 4 presents the protocol followed for each of the experiments performed. Section 5 describes the results obtained for each experiment, comparing the results between experiments. Section 6 discusses these results, the computational costs of the framework, and its limitations. Finally, Sect. 7 presents conclusions and future work.

2 Related work

Thanks to advances in BCIs and research in neurology, it has been possible to develop systems for detecting emotions through BCIs and EEG. When studying EEG, some repetitive features can be identified. These characteristics are known as rhythms and are classified into different frequency bands [11]. Each of these bands is associated with a different mental state. Once the assignment of each band to the different states of the subject is known, different ML algorithms can be applied to identify them. In this sense, Elfaramawy et al. [12] proposed the detection of six emotions: anger, fear, happiness, neutral, sadness, and surprise. The unsupervised Gamma-GWR algorithm was used for emotion classification with 30 subjects. To measure accuracy, subjects were asked to identify the mood of a different group of subjects from a series of images. This identification was 90.2% accurate, while the classifier was 88.8% accurate. In summary, the authors claimed that the use of human body language to identify different emotions is very effective.

Zheng and Lu [13] conducted an experiment in which they sought to detect positive, neutral, and negative emotions from the EEG. For this, they created a dataset with 15 subjects and aimed to modify the emotional state of the subjects using film clips. For the classification, they applied different ML and DL algorithms. For deep learning, they used the DBN algorithm, obtaining an accuracy of 86.08%. The machine learning algorithms applied were SVM, LR, and kNN, achieving accuracies of 83.99%, 82.70%, and 72.60%, respectively. Similar to this work, Joshi and Ghongade [14] used the SEED-VIG and DEAP datasets for their experiments to detect positive, negative, and neutral states; the authors used features based on signal power and on signal entropy. Another work that uses film clips to modify the emotions of the subjects is the one conducted by Kaur et al. [15], where the emotions to be predicted vary: happy, calm, and angry. The classification algorithm was SVM, obtaining 60% accuracy in this case.

Bhatti et al. [16] designed an experiment in which they intended to modify the mood of 30 subjects using music tracks. Four emotions were to be detected (happy, sad, love, and anger), and different musical styles such as rap, metal, or jazz were selected. The MLP, kNN, and SVM algorithms were used to classify these emotions, obtaining accuracies of 78.11%, 72.80%, and 75.52%, respectively. Iacoviello et al. [17] created an experiment where they intended to detect users’ emotions, but in this case the emotions were self-induced. To do this, ten subjects were selected and, depending on a symbol displayed on a screen, they had to try to become upset or relaxed. The classification was conducted using SVM and PCA, obtaining an average of 90% accuracy. Other authors have studied how to solve this problem with DL techniques based on CNN and LSTM, as well as combinations of these. Sheykhivand et al. [18] sought to predict positive and negative emotional states from a sound stimulus. Using a network based on LSTM+CNN, they obtained about 96% accuracy.

There is a variety of work linking BCIs to driving scenarios, but aimed at detecting conditions other than emotions. Khaliliardali et al. [19] intended to anticipate acceleration and braking actions by the user, reaching 83% accuracy for braking and 79% for acceleration. Other types of works try to detect when the subject is distracted. This is the case of Izquierdo et al. [20], where EEG is used to detect when the subject is absentminded and to issue a series of alerts. To test this system, ten experiments were conducted in which the subjects were presented with a series of obstacles such as pedestrians, signs, and other traffic objects. These experiments showed that the beta and theta band potentials increased upon receiving a distraction. Something similar was sought by Parasuram and Jagadeesh [21], who tried to identify distractions such as cell phone use or drowsiness using EEG, obtaining 87% accuracy.

Finally, there is work seeking to help the driver by detecting emotions. Fan et al. [22] aimed to detect the emotional state of the subject while facing certain traffic situations. This study used Bayesian networks (BNs) to achieve a 78% accuracy rate. Something similar was done by Bankar et al. [9], detecting the user’s state and applying music therapy to try to improve it. Using SVM, they obtained up to 81.46% accuracy. In addition, it was concluded that music has a great power to influence human emotions, making it a powerful mechanism to regulate them. On the other hand, Yan et al. [23] sought to detect a negative or angry emotional state in drivers when confronted with driving situations that usually elicit these moods, for example, a red traffic light. For this purpose, they used a Hidden Naïve Bayes classifier, obtaining 85% accuracy. These experiments and some other relevant ones are summarized in Table 1.

Table 1 Summary of articles on the detection of human emotions using EEG

After studying the existing literature on this topic, it can be seen that detecting emotions using BCIs, and specifically through EEG, is at a very advanced stage. However, very few studies apply this methodology to a driving scenario. Most of the works applied to this use case focus their efforts on detecting aspects other than emotions, such as the cognitive state associated with distraction or the user’s intentions when driving the vehicle. Due to this scarcity of work, some unknowns remain to be solved, such as how the concentration required for driving affects EEG-based emotion recognition or how many emotions can be detected at most while performing a secondary task.

3 Driving scenario and framework description

This section describes each of the elements that define the proposed driving scenario and the framework able to classify emotions. These elements, listed below, and their relationships are shown in Fig. 1, and each component is discussed in detail in the following subsections.

  • A BCI responsible for capturing the EEG.

  • A software for driving simulation.

  • A sound generator to present auditory stimuli related to different emotions to subjects.

  • A framework able to orchestrate all the previous elements.

Fig. 1
figure 1

Conceptual diagram of the solution designed, including all relevant actors

3.1 BCI headset

This work uses the eight-channel Versatile EEG Semi-Dry brain–computer interface designed by Bitbrain for EEG signal acquisition. This interface is based on semi-dry EEG, which means that it offers a capture quality similar to that of interfaces using gel on the electrodes but with better usability. Moreover, the arrangement of the electrodes can be easily interchanged following the 10–20 system, an internationally recognized method to describe and apply the location of scalp electrodes in the context of an EEG exam. For our use case, the electrodes have been positioned mainly in the frontal area of the scalp, since this is where the frontal lobe, responsible for emotions, is located. Thus, the eight electrodes have been placed in the positions Fp1, Fp2, F1, F2, F7, F8, F5, and F6 [31].
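To make the montage concrete, the following minimal sketch shows how these eight positions could be declared and visualized with the MNE-Python library. The 256 Hz sampling rate is taken from the acquisition settings reported later in the paper; the use of MNE itself and the standard montage name are illustrative assumptions rather than part of the original implementation.

```python
import mne

# Eight frontal positions used in this work, following the 10-20/10-10 naming.
channels = ["Fp1", "Fp2", "F1", "F2", "F7", "F8", "F5", "F6"]

# Create an Info object; 256 Hz matches the sampling rate mentioned in Sect. 6.2.
info = mne.create_info(ch_names=channels, sfreq=256, ch_types="eeg")

# Attach a standard montage so each electrode receives its scalp coordinates.
info.set_montage(mne.channels.make_standard_montage("standard_1005"))

# Visual check that all electrodes sit over the frontal area of the scalp.
mne.viz.plot_sensors(info, show_names=True)
```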

3.2 Driving simulator

The software selected for the driving simulation was City Car Driving [32]. This software has been chosen because it offers a wide range of usage modes. On the one hand, it is possible to create circuits for users to drive, moving from point A to point B, and to add different random parameters to each circuit. These parameters include other vehicles, pedestrians crossing the road, trams, and dynamic traffic signals such as traffic lights. All these elements force the subject to stay alert to meet the objective of the circuit. On the other hand, a free circulation mode allows simulating driving in a big city; within this mode, the simulator can define random routes for the subject. In addition to all this Artificial Intelligence (AI)-driven behavior, there is a high level of customization of the scenario, such as the weather or the simulation time, with excellent graphics that allow greater subject immersion. It is also possible to add peripherals such as steering wheels, pedals, and even augmented reality glasses. As a drawback, since this software is distributed as a compiled commercial product under a private license, it is not possible to modify the code to add custom features.

3.3 Sound generator

Different alternatives have been considered for sound generation. The first of these options is the use of wireless over-ear headphones. This option would be the most interesting in terms of user comfort and isolation from outside sound. However, the shape of these headphones causes direct contact between all the wiring and the electrodes, which would introduce noise into our EEG signals.

The second option was to use in-ear headphones, in both wired and wireless versions. These would provide lower isolation than the previous option, but their shape would keep the scalp electrodes from being in contact with them. However, they present two problems. First, the wireless ones can interfere with the Bluetooth signals of our interface. Second, both the wireless and wired versions make direct contact with the clip placed on the subject’s ear, which the interface uses as a reference, so they would again introduce noise into the signals.

The last option, and the one used in this work, is to use speakers with stereo sound. This option was chosen because it is the most comfortable for users and does not cause interference with the signals captured by the electrodes of the interface.

3.4 Framework

All the elements described above are orchestrated by a framework that contains all the design logic. The general structure of this design is described in Fig. 2. In general, the framework is composed of five phases that can be associated with the phases of the BCI cycle. The first phase focuses on communicating with the external agents that form the scenario. The second phase is responsible for data collection and processing, where various techniques are applied to transform raw EEG data into relevant information. The third phase extracts the relevant information using feature extraction techniques. The fourth phase reduces the dimensionality of the data through feature selection. Finally, DL and ML algorithms are used to predict the emotional states of the subjects in the fifth phase. The implementation of each of the phases is available on GitHub [33].

Fig. 2
figure 2

Architecture of the proposed framework

The first phase is responsible for establishing the necessary connections with the various external agents. In this case, a sound control module is required, which selects the sounds to be played at each moment, depending on the emotions intended to be provoked. The time control module is mainly in charge of detecting the initial and final instants of the simulations in order to label the data. Finally, the EEG acquisition component connects to the BCI to obtain the EEG signals and store them for use in later phases.

The second phase of the proposed framework consists of obtaining the data captured by the BCI. For this purpose, this work uses the “Bitbrain Viewer” software. From the data processing perspective, one of the techniques used to remove artifacts is signal filtering. In this direction, two types of signals can interfere with EEG signals: external signals and biological signals. External signals are produced by electromagnetic interference, such as noise at the power-line frequency. A notch filter has been applied to eliminate these signals, which removes a specific frequency, in this case 50 Hz. Figure 3 shows that, after applying the filters, the changes in the signal can be appreciated more clearly. In the same way, frequencies above or below the target bands are removed, reducing noise for the classifier.
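A minimal sketch of such a 50 Hz notch filter, built with SciPy, is shown below; the function name, quality factor, and synthetic data are illustrative assumptions and do not reproduce the exact implementation of the framework.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

FS = 256.0         # sampling frequency of the BCI (Hz)
MAINS_FREQ = 50.0  # power-line frequency targeted by the notch filter

def remove_mains_noise(eeg, quality=30.0):
    """Apply a 50 Hz notch filter to every EEG channel.

    eeg: array of shape (n_channels, n_samples) with the raw signals.
    quality: notch quality factor; higher values give a narrower notch.
    """
    b, a = iirnotch(MAINS_FREQ, quality, fs=FS)
    # filtfilt runs the filter forward and backward, avoiding phase distortion.
    return filtfilt(b, a, eeg, axis=-1)

# Illustrative call with synthetic data: 8 channels, 10 seconds.
raw_eeg = np.random.randn(8, int(10 * FS))
clean_eeg = remove_mains_noise(raw_eeg)
```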

Fig. 3
figure 3

Comparison between non-filtered and filtered EEG signals

The second type of signal affecting the EEG, biological signals, is produced by muscle activity such as blinks or finger movements. Two different techniques are used to eliminate this noise: a bandpass filter on the one hand and the ICA algorithm on the other. The bandpass filter is applied between 4 and 60 Hz, which preserves the Theta, Alpha, Beta, and Gamma frequency bands of interest for emotion classification. However, after applying these filters, some noise still contaminates the original signals. To remove this residual noise, the ICA algorithm is used, which allows selecting the noisy components by studying an electrooculogram (EOG).
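A possible implementation of this bandpass-plus-ICA step is sketched below with MNE-Python, assuming the recording is available as an mne.io.Raw object named raw; the use of Fp1 as a frontal proxy for the EOG and the specific ICA settings are assumptions made for illustration.

```python
import mne
from mne.preprocessing import ICA

# 'raw' is assumed to be an mne.io.Raw object holding the eight-channel recording.
# Band-pass between 4 and 60 Hz, preserving the Theta-Gamma range used later.
raw_filtered = raw.copy().filter(l_freq=4.0, h_freq=60.0)

# Fit ICA to separate independent sources (at most as many as there are channels).
ica = ICA(n_components=8, random_state=42)
ica.fit(raw_filtered)

# Find components correlated with ocular activity, using a frontal channel as an
# EOG surrogate, and exclude them before reconstructing the signals.
eog_indices, eog_scores = ica.find_bads_eog(raw_filtered, ch_name="Fp1")
ica.exclude = eog_indices
raw_clean = ica.apply(raw_filtered.copy())
```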

For this case study, the most commonly used features are those related to brain rhythms. The short-time Fourier transform (STFT) algorithm has been implemented to obtain these features. Once this algorithm has been applied, the intervals corresponding to each of the frequency bands are extracted: Theta (5–8 Hz), Alpha (8–12 Hz), Beta (12–30 Hz), and Gamma (30–60 Hz). Representative statistics, such as the mean or standard deviation, have been extracted for these frequency intervals. These statistics describe the distribution of the data within each band, making it possible to characterize the trend that the data follow for each emotional state; the distinct data distribution of each state allows the model to learn these trends and uniquely identify each class to be predicted. Moreover, characteristics related to signal entropy, which measures the uncertainty of an information source, are extracted. These algorithms have been applied in four-second sliding windows to obtain better performance. The features and how to calculate them are defined in Table 2.
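As an illustration of this stage, the sketch below computes per-band statistics and a spectral-entropy value over four-second windows of a single channel using SciPy's STFT. The one-second stride, the choice of mean and standard deviation, and the entropy formula are assumptions; the paper only specifies four-second sliding windows and band-related statistical and entropy features (see Table 2).

```python
import numpy as np
from scipy.signal import stft

FS = 256          # sampling frequency (Hz)
WINDOW_S = 4      # sliding-window length in seconds
BANDS = {"theta": (5, 8), "alpha": (8, 12), "beta": (12, 30), "gamma": (30, 60)}

def window_features(segment):
    """Band statistics and spectral entropy for one 4 s, single-channel segment."""
    freqs, _, Z = stft(segment, fs=FS, nperseg=FS)  # 1 s STFT sub-segments
    psd = np.abs(Z) ** 2                            # power per frequency and time bin
    feats = {}
    for name, (lo, hi) in BANDS.items():
        band = psd[(freqs >= lo) & (freqs < hi)]
        feats[f"{name}_mean"] = band.mean()
        feats[f"{name}_std"] = band.std()
    # Spectral entropy: uncertainty of the normalized power distribution.
    p = psd.mean(axis=1)
    p = p / p.sum()
    feats["spectral_entropy"] = float(-np.sum(p * np.log2(p + 1e-12)))
    return feats

# Slide the 4 s window over a channel with a 1 s stride (illustrative values).
channel = np.random.randn(60 * FS)
rows = [window_features(channel[s:s + WINDOW_S * FS])
        for s in range(0, len(channel) - WINDOW_S * FS + 1, FS)]
```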

Table 2 Features used for classification

In this case study, 288 features are available, which is quite a large number. Because the features may be highly correlated, a feature selection phase is performed beforehand (see Fig. 4). Two methods were applied for feature selection. The first consists of calculating the correlation between the variables; Fig. 4b shows the correlation matrix after eliminating the features correlated by more than 90%. Once these features have been dropped, 113 remain, which is still a significant number for the classification task. The second method applied to reduce the dimensionality is principal component analysis (PCA). With this method, we seek to obtain new features that preserve 95% of the variance of the initial dataset. Using PCA, the feature set was reduced to 55 uncorrelated components, as can be seen in Fig. 4c.
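A minimal sketch of this two-step reduction, assuming the windowed features are stored in a pandas DataFrame, could look as follows; the helper name, the synthetic data, and the 90% threshold (matching the value used in this section) are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def drop_correlated(df, threshold=0.90):
    """Drop one feature from every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Synthetic stand-in for the 288 windowed features (one row per 4 s window).
features = pd.DataFrame(np.random.randn(1000, 288))
reduced = drop_correlated(features, threshold=0.90)

# PCA keeping enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
components = pca.fit_transform(reduced)
print(reduced.shape[1], "features after the correlation filter,",
      components.shape[1], "components after PCA")
```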

Fig. 4
figure 4

Correlation matrix in each of the feature selection steps

Once the data are available, the framework proceeds to the learning phase. In this phase, different algorithms are applied to recognize patterns in the data and learn from them. There is a wide variety of algorithms applied in the literature, but ML and DL algorithms are used for our case study. Focusing on ML, the algorithms selected for this task are the most widely used in the literature: RF, kNN, and XGBoost. The Sklearn [37] and XGBoost [38] libraries have been used to implement these models. A hyperparameter search was not performed for the RF algorithm in the quaternary and ternary models due to its time cost. Once the problem is reduced to only two emotions, a hyperparameter search using RandomSearch can be applied, which shortens the search time at the cost of a minimal loss in classification accuracy. In the case of kNN, a hyperparameter search can always be applied since its computational cost is lower. Finally, XGBoost, which follows a boosting methodology, requires less training and testing time while reaching acceptable accuracy in most models.
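For illustration, the following sketch trains the three ML algorithms on synthetic binary-labeled data, applying RandomizedSearchCV only to RF as described above; the hyperparameter grid, the train/test split, and the scoring choice are assumptions, not the exact settings used in this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in: 55 PCA components per window, binary labels (0 = non-stimuli, 1 = angry).
X = np.random.randn(500, 55)
y = np.random.randint(0, 2, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Random search over a small RF grid, as done for the binary models.
rf_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions={"n_estimators": [100, 200, 400], "max_depth": [None, 10, 20]},
    n_iter=5, cv=3, scoring="f1_weighted")
rf_search.fit(X_train, y_train)

# kNN and XGBoost with default settings for comparison.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
xgb = XGBClassifier().fit(X_train, y_train)

for name, model in [("RF", rf_search.best_estimator_), ("kNN", knn), ("XGBoost", xgb)]:
    print(name, model.score(X_test, y_test))
```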

For classification using DL, three types of neural networks have been implemented using TensorFlow with the Keras library, with nodes of type CNN, LSTM, and a combination of these. These structures are the ones with the best results reported in the literature. The layer configuration followed for the LSTM model was two layers of 32 neurons with ReLU activation. Between them, there is a Dropout layer with a 20% rate, and at the output a Flatten layer to flatten the result. The architecture of the CNN network is composed of two 32-neuron Conv1D layers, with ReLU activation and a kernel size of \(3\times 3\). Like the LSTM-based architecture, it adds a Dropout layer between the CNN layers and a Flatten layer at the output. CNNs are good at extracting spatially local relevant features from data, but they struggle to capture long-term dependencies in sequence data, which the LSTM can improve [39]. For this reason, a hybrid model has been proposed, composed of two layers with 32 nodes each. For the output, a perceptron layer was designed with softmax activation for the multiclass models and sigmoid activation for the binary classification. Finally, an early-stopping mechanism has been applied to all DL algorithms in order to avoid overfitting.
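The architectures described above could be expressed in Keras roughly as sketched below; the builder function, the input shape, and the training call are illustrative assumptions (note that Conv1D layers take a one-dimensional kernel, so the kernel size is written as 3).

```python
from tensorflow.keras import layers, models, callbacks

def build_model(kind, n_timesteps, n_features, n_classes):
    """Illustrative builder for the LSTM, CNN, and hybrid architectures described above."""
    model = models.Sequential()
    model.add(layers.Input(shape=(n_timesteps, n_features)))
    if kind == "lstm":
        model.add(layers.LSTM(32, activation="relu", return_sequences=True))
        model.add(layers.Dropout(0.2))
        model.add(layers.LSTM(32, activation="relu", return_sequences=True))
    elif kind == "cnn":
        model.add(layers.Conv1D(32, kernel_size=3, activation="relu"))
        model.add(layers.Dropout(0.2))
        model.add(layers.Conv1D(32, kernel_size=3, activation="relu"))
    else:  # hybrid CNN + LSTM
        model.add(layers.Conv1D(32, kernel_size=3, activation="relu"))
        model.add(layers.LSTM(32, activation="relu", return_sequences=True))
    model.add(layers.Flatten())
    if n_classes == 2:  # sigmoid output for the binary models
        model.add(layers.Dense(1, activation="sigmoid"))
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    else:               # softmax output for the multiclass models
        model.add(layers.Dense(n_classes, activation="softmax"))
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
    return model

# Early stopping on the validation loss to avoid overfitting.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
model = build_model("hybrid", n_timesteps=128, n_features=55, n_classes=4)
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])
```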

Fig. 5
figure 5

Flow of data obtained by the framework

The relationship of each of the described steps is shown in Fig. 5. This figure shows how the data used by the framework, both sound and EEG, go through the different phases of the framework until the creation of a model that allows the identification of the various states.

4 Experimental protocol

The protocol followed for conducting the experiments is an essential factor, as it directly influences the subjects and, consequently, their emotions. Thus, an incorrect protocol can lead to inconsistent or erroneous results. The experiments involved a total of three subjects of different sexes and ages, none of them with diagnosed mental illnesses: two males and one female, aged 22, 23, and 29.

Fig. 6
figure 6

Structure of the protocol followed for the experiments

For this use case, the general structure of the protocol is divided into two equal phases, detailed below. The objective is, on the one hand, to measure the performance of the framework in two different situations and, on the other hand, to study how the primary task of driving affects the detection of emotions through a BCI. However, certain parameters of the scenario must be considered. One of these is the subject’s posture, which must be comfortable and perpendicular to the screen. An adjustable chair has been used, and the loudspeakers have been positioned on both sides of the subject to obtain a more enveloping sound. During the experiments, subjects were asked to avoid all movements except those necessary to use the simulator.

As for the experiments themselves, they are divided into two equal phases (see Fig. 6), which reduces the fatigue produced by prolonged use of the BCI. Each of these phases is composed of two different configurations of the scenario, in which the main task that the subject has to perform varies. In the first configuration, the subject is calm and only has to listen to auditory stimuli. In the second configuration, the task of listening to sounds takes a back seat, the main task being the use of the driving simulator. These two configurations make it possible to determine whether emotions can be detected using a BCI while the subject is performing an external task.

In each of these configurations, four different types of sounds are applied. In the first experiment, no sound is introduced, i.e., the subject is not exposed to any auditory stimulus. The second experiment is based on a series of stimuli classified as neutral; this type of stimulus corresponds to environmental sounds, such as the sound of rain. For the third experiment, positive stimuli are applied, mainly popular musical hits categorized as optimistic songs. Finally, the fourth experiment presents a series of negative stimuli, where the subject is exposed to uncomfortable and loud sounds, such as heavy traffic or drilling.

The experiment where negative stimuli are applied has been conducted last in each sub-phase to prevent it from affecting subsequent experiments. In addition, since it is the last experiment, the subject’s accumulated fatigue induces a more irritable mood, which yields better results. A rest time of 15 min is established between the different phases and sub-phases to prevent them from influencing each other.

5 Results

This section starts by measuring the detection performance of the framework following the most common experiments in the literature, such as the binary experiments. Subsequently, the results obtained for the three-class and four-class classifications are presented, and a comparison between the different experiments is conducted. To determine how good the classification of an algorithm is, the F1-score metric has been used, as it is the most robust for this purpose. Finally, a comparison is made with the results reported by similar works in the literature.

Fig. 7
figure 7

Binary classification of emotions

Fig. 8
figure 8

Three emotion classification

Fig. 9
figure 9

Classification of four emotions

5.1 Binary classification

The results obtained without the simulator and then with the simulator are reviewed, after which a comparison between them is conducted. Finally, we study how the experiments have affected the different subjects. The classes studied in this experiment are the “angry” and “non-stimuli” classes, since they are the most frequently studied in the literature and are also the most distinct classes.

The results without the simulator are presented in Fig. 7a. This figure shows that the algorithm with the best results is XGBoost, with an 86.5% F1-score on average for all subjects. However, the results are quite similar among the different algorithms, with a negligible difference except for kNN, whose results are quite mediocre. Likewise, the difference between ML and DL algorithms is quite small.

The results of the experiments using the simulator are shown in Fig. 7b. In this case, the best-performing algorithm is RF, with a 97% F1-score on average. Again, the difference with DL algorithms, such as LSTM networks, is minimal, with only a 3% gap. Finally, it can be observed that the results improve when the simulator is used. This is mainly because the use of the simulator causes the stimuli to have a more significant impact and, therefore, the emotions are more recognizable.

Figure 11 shows the respective confusion matrices for the RF algorithm and Subject I, since this combination provided the best results in this case and the effect can therefore be seen more clearly. The labels representing the classes are zero for “non-stimuli” and one for “angry.” In this figure, it can be seen that the classes are more identifiable in the classifications with the simulator, which is why the classifier obtains results with a higher F1-score. This improvement in classification when using a simulator could be attributed to the fact that the subject reacts better to stimuli when concentrating. In experiments in which the subject only has to listen to the stimuli, it may be easier to control the reactions to them and thus produce a weaker response. We will try to confirm or disprove this hypothesis with the results of the following experiments. Finally, the results obtained for each of the subjects appear to be quite similar for each of the algorithms tested.

5.2 Classification with three emotions

The results of the experiments without the simulator are shown in Fig. 8a. The best-performing algorithm is RF, with an average F1-score of 66.1% across all subjects, reaching a maximum of 74% for the best subject. In this case, the gap between ML and DL algorithms has increased, with the ML algorithms performing better, except for kNN.

For the simulator experiments, the results are shown in Fig. 8b. Again, the best-performing algorithm is RF, with an average of 91.8% and a maximum individual score of 94.1% for Subject I. In this case, the difference between the ML and DL algorithms is relatively high, reaching up to 40% in F1-score between the best algorithms of each type.

When compared to the binary classification conducted in the previous experiment, it can be observed that a much lower score is obtained when no simulator is used and a similar one when the simulator is used. This is due to the confusion generated between the previous classes and the newly added neutral class. This new class has a high confusion rate with the “non-stimuli” class, as can be seen in Fig. 11, where the confusion matrix for these classes is presented. In the experiments without the simulator, the “non-stimuli” class is classified as “neutral” almost 32% of the time. Theoretically, this makes sense since both are represented by lower-frequency brain waves, such as Theta or Alpha, corresponding to calm states of mind. The following section, which presents the classification of four emotions, details the reason for this effect more clearly. Unlike the previous results, in this case, the DL algorithms obtain the lowest accuracies, even below kNN. This is mainly due to the uncertainty in the biosignal data, which are labeled with the subjective states of the subject. In addition, an increase in the complexity of the biosignals can reduce the usefulness of the temporal dimension considered by algorithms such as LSTM.

5.3 Classification with four emotions

The results for the four-emotion study follow the same trend as in the previous two experiments. As shown in Fig. 9a, the best algorithm in this case is the LSTM network, with up to a 54% F1-score. However, the maximum individual score was obtained by RF, with a 55.2% F1-score for the first subject. Again, the differences between the ML and DL algorithms are practically negligible.

Fig. 10
figure 10

Summary of best accuracies obtained in each experiment

For the simulator experiments, shown in Fig. 9b, the best-performing algorithm is RF with an average of 80.7% for all subjects, reaching a maximum of 84.2% F1-score for the second subject. As in the previous experiments, the difference between the ML and DL algorithms is increased.

In this case, introducing a new emotion increases the problem of emotion confusion described above. The complete confusion matrix is taken into account for this section. For the experiments without the simulator, the emotions of joy and neutral show very high confusion. However, this confusion is reduced in the experiments where the simulator is used.

6 Discussion

6.1 Results analysis

The results obtained in this work show that ML algorithms, and in particular RF, obtain the best results in terms of accuracy. Figures 7, 8 and 9 show the accuracy obtained for each of the subjects and each applied algorithm. Figure 10 summarizes the best results obtained for each subject participating in the experiments without and with the simulator. It can be seen that the higher the number of classes to predict, the lower the accuracy. The results for classifications without the simulator are 95–92% for the binary model, 77–63% for the ternary models, and 62–56% for the quaternary models. Similarly, using a driving simulator increases the model accuracy in all cases, obtaining 99–98% for binary, 97–92% for ternary, and 92–83% for quaternary classification.

Fig. 11
figure 11

Confusion matrix for RF and subject 1 when classifying four emotions

As can be seen in Fig. 11, in the case of the classifications with simulator, the classes are more identifiable and the classifier obtains results with a higher F1-score. This improvement in classification when using a simulator could be attributed to the fact that the subject reacts better to stimuli when in concentration. For experiments in which the subject is in a situation where they only have to listen to the stimuli, it may be easier to control the reactions to the stimuli, and thus to obtain a lower response.

To explain this, plots of the statistical values of the resulting FFT data have been obtained. Figure 12 shows these statistical values in the form of a box plot for each of the brain bands. Starting with the emotion “Joy,” Fig. 12a shows the results obtained without the simulator, while Fig. 12e shows the results obtained with it. It can be observed that, in the experiments both with and without the simulator, the predominant bands are Beta and Gamma. These bands correspond to states of concentration and stress, so the use of the simulator provokes in the subjects a higher state of attention than when it is not used.
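A sketch of how such per-band box plots could be generated is shown below; the synthetic DataFrame and the column names stand in for the actual windowed features and are purely illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the windowed band-power features with an emotion label.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "theta_mean": rng.gamma(2.0, 1.0, 300),
    "alpha_mean": rng.gamma(2.0, 1.2, 300),
    "beta_mean": rng.gamma(2.0, 1.5, 300),
    "gamma_mean": rng.gamma(2.0, 1.8, 300),
    "emotion": rng.choice(["joy", "neutral", "angry", "non-stimuli"], 300),
})

def band_boxplot(df, emotion):
    """Box plot of the mean band power per brain rhythm for one emotional state."""
    subset = df[df["emotion"] == emotion]
    bands = ["theta_mean", "alpha_mean", "beta_mean", "gamma_mean"]
    plt.boxplot([subset[b] for b in bands], labels=["Theta", "Alpha", "Beta", "Gamma"])
    plt.title(f"Band power distribution: {emotion}")
    plt.ylabel("Mean STFT power")
    plt.show()

band_boxplot(features, "joy")
```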

In the case of the neutral emotion, the results are shown in Fig. 12c, f. The results obtained for this emotion show that the predominant wave is Alpha, related to a neutral state. This is directly related to the type of stimulus used since the stimuli are classified as relaxing, provoking in the subject a state without stress.

The angry class is the most recognizable since it has the highest voltage values of all the categories. In Fig. 12c, it can be seen how the values of the Gamma band become very high due to the stress that this type of stimuli provokes in the subjects. However, when using the simulator, this stress is channeled into concentration to meet the objective proposed by the simulator, while the subject is subjected to a series of irritating stimuli. Because of this, the predominant range of frequencies is the Beta wave, closely related to a state of high concentration.

Finally, for the non-stimuli class, disparate results are obtained. In Fig. 12f, it is shown that the predominant wave is Gamma, which could mean that the subject is under some level of stress. Since no stimulus is applied when conducting this experiment, it depends directly on the subject’s mood during the experimentation. On the other hand, these stress levels are reduced when the simulator is used, and Alpha waves predominate. This may be due to the concentration generated by using the simulator, which causes the subject to focus the attention on it. Since the simulator has a simple objective and no external stimuli, the level of concentration required to conduct the task is not too high, so the predominant wave is the Alpha wave and not the Gamma wave.

Fig. 12
figure 12

Statistical values of FFT for different emotions

Once the results of the experiments have been obtained, they have been compared with those reported in the literature. One of the main drawbacks of the literature is that the accuracy metric is used, although other metrics, such as the F1-score, work better for this type of task. The F1-score provides more robust results since it considers both recall and precision, whereas accuracy only considers the number of hits, which can lead to erroneous conclusions. To make a fair comparison, this section reports the accuracy of our solution (the F1-score is also available in Sect. 5).
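A small example makes this point concrete: with imbalanced classes, a classifier that always predicts the majority class still reaches a high accuracy, while the F1-score exposes the failure. The class counts below are purely illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: 90 windows of one class, 10 of the other.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a classifier that always predicts the majority class

print("accuracy:", accuracy_score(y_true, y_pred))             # 0.90, looks strong
print("F1-score:", f1_score(y_true, y_pred, zero_division=0))  # 0.0, reveals the failure
```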

Table 3 shows a comparison with several works reported in the literature focusing on EEG emotion recognition. On the one hand, Zheng and Lu [13] and Kaur et al. [15] aimed to detect three emotions, provoking them with film clips. This type of stimulus is, a priori, more effective than music, since it is both auditory and visual and the subject pays full attention to it. The results obtained for these experiments are 86% and 60% accuracy, respectively. Comparing them with our best results for the classification of three emotions, 93% accuracy, our framework improves on these results.

Table 3 Comparison of this work with the literature

Nevertheless, this type of stimulus does not apply to our use case, as it would be a dangerous distraction for the driver. Because of this, our work focuses on just auditory stimuli. In this direction, Bhatti et al. [16] experimented with auditory stimuli, classifying four different classes: happy, love, sad, and anger. For this work, the authors obtained 78% accuracy, which is slightly below our results (79% accuracy).

However, none of these works uses a simulator or another secondary task while the stimuli are received. In particular, for the driving scenario, the reported results are quite limited, and a work that can be directly compared is Halim et al. [43]. This work sought to detect when the driver is under stress by implementing a binary classification, obtaining 97% accuracy, while our work obtained 99%. This slight improvement may be due to the addition of entropy-derived features or the application of feature extraction through time windows. The novelty that this work brings to the literature is to gain insight into how a secondary task affects the detection of emotional states. In this case, using a driving simulator leads to a more significant differentiation of each emotion, so the classifier can identify each category more accurately.

The most recent works follow the same methodology defined by previous works. In the case of Zeng et al. [44], the authors aim to predict when drivers are fatigued. To provoke this state, they conducted the experiments at the end of the day, when users were most tired. After collecting the data, they applied classification algorithms such as kNN, SVM, or PSO-HELM, obtaining a maximum of 83% accuracy. Another of the works carried out in 2022 is the one developed by Halin et al. [45]. This work is a proposal for a system based on virtual reality and a driving simulator. In this proposal, virtual reality glasses were used to simulate events that happen in real driving environments, which can modify the users’ attitudes. This work does not offer results or experimentation, so its performance cannot be measured.

On the other hand, this framework is designed so that the computation time is as short as possible and can be applied in a real environment. Therefore, dimensionality reduction techniques such as correlation or PCA have been used to increase the time performance of the model. In addition, different ML and DL algorithms have been tested to study which ones offered the best results. In this way, it was found that RF obtained the best results both in terms of prediction accuracy and prediction time.

6.2 Framework computational costs

This framework is designed to be extrapolated to a real use case, so execution times and resource consumption must be limited. The tests have been executed on a computer with a six-core Ryzen 5 3600X processor at 3.6 GHz, 16 GB of DDR4 RAM, and an Nvidia RTX 2060 graphics card with 6 GB of VRAM. However, the graphics card is used only to accelerate DL model training.

Fig. 13
figure 13

Hardware resources and time consumed in each phase of the framework

Figure 13 shows the hardware and time resources consumed for training and prediction in each phase of the framework. The creation of the system is divided into two stages: on the one hand, the training of the model, where a considerable amount of data is needed (the optimal model in this work is trained with about 300,000 vectors per class); on the other hand, the evaluation of the model, where 128 vectors are sufficient to recognize an emotional state.

The first phase of the framework is data acquisition. The consumption of this phase in terms of hardware resources depends directly on the library offered by the BCI. With the library offered by Bitbrain, resource consumption is limited, around 40% of CPU, caused by the management of the Bluetooth connections and data storage. The time required for this phase depends on the frequency at which the BCI sends data, in this case 256 Hz, so it takes at least 20 min per class to create a model, 80 min in total. In a real-time evaluation, 128 vectors are needed, which requires 500 ms. Once the data are available, the second phase applies filtering that removes possible noise introduced during the capture. This procedure is light on hardware resources, consuming only 25% of CPU and 6% of RAM. It takes 2 min to process all the training data and only 50 ms for evaluation.

The next phase of the framework is feature extraction, which is the most expensive phase in terms of hardware resources. Applying the STFT algorithm and the procedures to obtain the signal entropy (see Table 2) is costly. These algorithms are applied in windows of 128 vectors, reaching up to 65% of CPU consumption and 25% of RAM. RAM usage is configurable by limiting the amount of data that can be stored before writing it to disk. As for the execution time of this phase, it is high, reaching up to 1 h due to the sliding window. The evaluation of a single epoch takes only 400 ms, an adequate time for a system operating in real time. After this phase, 280 features are obtained for each vector, so the amount of data is considerable, especially for training a model. Therefore, a feature selection phase is applied in which features correlated by more than 95% are eliminated and the PCA algorithm is applied. The consumption of this phase is limited in terms of hardware and time, and it improves the training time of new models.

Finally, the model training and prediction phase depends highly on the selected algorithm and its configuration. For algorithms such as RF, the resources used are limited, around 35% of CPU. However, when using neural networks, the complexity increases, especially during training. For this reason, we have used the GPU optimization offered by the TensorFlow library, which moves the computation to the GPU, requiring up to 65% of the available GPU capacity. In exchange, the training time of a DL model is reduced by up to 50% when training on the GPU instead of the CPU.

6.3 Limitations

The limitations of this work concern the data captured and the training of the models. On the one hand, the data quality depends on the BCI used. Since it is not a medical-grade device, the accuracy of the data is reduced, and the signals are very vulnerable to external noise. The BCI also has a limited sampling frequency and a reduced number of electrodes, decreasing the spatio-temporal resolution. On the other hand, when training the model, the labeling depends on the states actually evoked in the users. In general, the selected sounds evoke the states they have been chosen for; however, they may fail to evoke the target emotional state in specific situations. This labeling failure confuses the models, reducing the accuracy when evaluating new data. Another limitation of this system is the generation of models for new users. The models are individualized for each subject, which is why the system has to train a new model for each user. In addition, these models need a large amount of data to generalize correctly, so the entry of a new user into the system can take a significant amount of time (around 1.5 h).

7 Conclusion

This work studied and analyzed the use of BCI and EEG to detect emotions in driving environments. To achieve that goal, a realistic scenario has been created with the following elements: (i) a driving simulator allowing good immersion for the user, using the City Car Driving simulator; (ii) a sound reproduction system, in this case stereo loudspeakers; (iii) a sound stimulus generator module, enabling the presentation of different types of sounds directly related to the emotions intended to be provoked, among them intense traffic or irritating sounds to cause anger, environmental sounds to produce neutrality or tranquility, and music to induce a happy mood; and finally (iv) a BCI headset (the eight-channel Bitbrain Versatile BCI) to capture the EEG of the subjects.

Once the previous scenario was correctly deployed, a framework was created to recognize emotions in driving environments. This framework is composed of different steps associated with the BCI cycle. First, it acquires the EEG signals, processes them to eliminate potential noise, conducts a feature extraction process, and applies different classification algorithms to study their effectiveness. The algorithms tested have been, on the one hand, supervised ML algorithms such as RF, KNN, and XGBoost and, on the other hand, a set of DL algorithms based on networks with LSTM and CNN nodes.

After this, a use case related to a driving environment has been designed, where a series of experiments has been performed both with and without the simulator. In this way, it is possible to study how emotion recognition is affected when listening to auditory stimuli is the primary task and when it becomes a secondary task, with the driving simulator coming to the forefront. The results obtained for these experiments show that the lower the number of emotions, the better the classification. An average of 99% accuracy was obtained when using a binary classifier to detect two emotions, 97% for three, and 75% for four emotions. In addition, better accuracy is obtained when using the simulator. This is mainly because the different emotional states take more disjoint, i.e., more disparate, values, so the classifier can recognize them more accurately as they lie in different ranges. These results improve those already reported in the literature and provide insight into whether it is possible to employ BCIs as a method for emotion identification in driving environments. Moreover, since the model works with reduced dimensionality, it can be applied in a real use case where a relatively fast response is needed.

In future work, it is proposed to increase the number of subjects to obtain a more general view of the results and to check with more certainty whether the response is homogeneous across genders. In addition, it is intended to create a more immersive scenario, where the simulator is controlled using a steering wheel and pedals. Moreover, the classification accuracy, especially for four emotions, can be improved by testing different techniques and a more extensive set of algorithms. Features related to the sound waves themselves could also be taken into account to obtain more accurate information about the stimuli, so that classification algorithms can relate sound features to the states they evoke. Finally, different types of stimuli will be added, such as the sound of an incoming call or different weather sounds.