1 Introduction

Emotions (Nezlek and Kuppens 2008) are present in people’s daily lives, both interpersonally and intrapersonally, and they have a direct impact on our functioning. For this reason, they must be identified and regulated constantly to meet one’s own and society’s demands. Emotional intelligence (EI) is the ability to identify and regulate one’s emotions and understand the emotions of others (Goleman 1998). High EI helps people build relationships, reduce stress, predict mood, defuse conflict and improve satisfaction (Lam and Kirby 2002).

According to Goleman (1998), there are five key elements to EI: self-awareness, self-regulation, motivation, empathy and social skills. All these elements are important in many different contexts, such as our classrooms (Aritzeta et al. 2016), enterprises (Côté 2014) and society in general (Lopes et al. 2004; Fernández-Abascal and Martín-Díaz 2015). Moreover, EI is an important quality factor in all phases of our lives, from early school age to late adulthood. For elderly people, EI can help reduce isolation and loneliness through the development of group activities. This improves quality of life, contributes to mental and physical health, and enhances the individual perception of the aging process and of how the elderly may adapt to changing circumstances.

People with highly developed EI can understand what is happening around them in real time and become more helpful because they possess the skills to achieve specific goals. Anyone, regardless of age and generation (baby boomer, gen X, gen Y and gen Z), can have high EI, but their driving forces differ greatly (Yüksekbilgili et al. 2015).

Traditionally, EI and Emotional Quotient (EQ) have mainly been evaluated using questionnaires. In this paper we propose complementing these evaluation techniques with new technology. In particular, ElectroEncephaloGraphy (EEG) and PhotoPlethysmoGraphy (PPG) based technologies are considered in order to support, first, emotion detection and, later, EI or EQ estimation. In this paper, we focus especially on the first step, emotion detection, and our methodology consists of several stages: recording, preprocessing, analysis and feedback. Low-cost neurofeedback technologies are employed for gathering and recording brain activity data. Moreover, different supervised learning algorithms are evaluated for detecting human emotions within our proposed methodology.

This paper is organized as follows. Section 2 analyses the concept of emotional intelligence and discusses its evaluation. In Sect. 3, our proposal is put into practice by identifying devices and establishing a process for, first, emotion detection and, second, emotional intelligence testing. Different supervised learning algorithms are implemented and compared for emotion detection, together with feedback that was implemented in order to allow emotional intelligence management and testing. Afterwards, Sect. 4 presents the results gathered. Finally, conclusions and future work are presented in Sect. 5.

2 Emotional intelligence: elements and evaluation

There is plenty of literature, including literature reviews, related to emotional intelligence (Jensen et al. 2007; Smith et al. 2009; Gayathri 2013; Arora et al. 2010; Laborde et al. 2016; Kotsou et al. 2019; Winardi et al. 2021). EI is an interdisciplinary concept that is linked to different research fields, disciplines, professionals, and skills.

In general, emotionally intelligent people have the ability to control their own emotional impulses (Goleman 1996). As Fig. 1 shows there are four constructs, which are self and social awareness, and self and relationship management, that outline the necessary abilities to control the positive or negative motivation. With the four constructs, we can evaluate how a person feels at a certain moment in time, because there are multiple emotional states defined that vary constantly. Goleman (1996) provides a full set of emotions that determine the real state of people in a specific moment and that is what should be considered to achieve good results.

Fig. 1 Emotional Intelligence dimensions (Goleman 1996)

2.1 Traditional EI evaluation by using questionnaires

In recent times, many resources covering EI have been published, and they provide an informative picture of this topic. The book by Stein and Book (2015) is a clear representation of the interest in the emotional area. Tests are usually meant to be completed by anyone interested in a simple yet adequate way to measure EI. Several literature reviews provide an overview of the different evaluation methods that have been used since the inception of the EI concept (O’Connor et al. 2019; Kotsou et al. 2019; McEnrue and Groves 2006; Conte 2005). They all have in common the use of questionnaires and the subsequent analysis by experts following the guidelines indicated for each method. Even though this approach for measuring EI is accurate, it has several disadvantages, which are outlined below.

In addition to academic contexts, many self-assessment companies have adapted the Emotional Quotient Inventory (EQ-i) (Bar-On 1997) in their tests. The evaluation (López Zafra et al. 2014) consists of a report with 133 statements that are answered on Likert scales. At the end of the test, a general score and a specific score for each dimension are obtained. A higher score indicates a better result, meaning that the person has the abilities needed to meet daily demands and challenges.

Although those tests are well founded and give rigorous results, they require subsequent analysis by experts and have some flaws that can affect the results. In addition, self-assessment does not always require that step, and many online resources offer fast feedback. The following tests use Likert scales for all questions, as Bar-On (1997) specified in his research. One of the simplest options is Mind Tools (2020), which contains 15 questions whose answers are summed into a final score that can then be interpreted with the provided legend. Another online test is that of the Institute for Health and Human Potential (2020), which has 12 questions. These ask the subject to identify common situations and to determine to what degree the reaction is positive or negative. Note that the company that created it is dedicated to helping organizations distribute the science behind EI. The test by Psychology Today (2020) contains 146 questions, making it the longest in the list. Completing it can take up to 45 min, and it evaluates several aspects of life. It is one of the most complete options for receiving a comprehensive EI report. A particularity of this test is that it asks multiple similar questions to ensure the accuracy of the answers. It requires a fee to get the full results; otherwise, only limited results are sent. Another paid self-assessment test is TalentSmart (2020). It has 28 questions, takes 10 min to complete, and delivers scores for the key components of EI together with advice for improvement. The questionnaire by Filippi and Barattin (2019) differs from those described previously. Its authors propose combining the irMMs-based method (Filippi and Barattin 2018) and the CUE model (Minge and Thüring 2018) to create a questionnaire that evaluates UX involving emotions to determine subjective feelings and psychological reactions, among others. The survey contains Likert scale questions and a selectable value between -5 and 5.

Based on the research and the references gathered in this section, several proposals exist for the evaluation of EI. Nevertheless, those approaches are mainly based on quizzes. The main issue with this approach is that users must interrupt their tasks or activities to answer the corresponding questions, which distracts them and influences the rest of the evaluation. To address this defect, it should be possible to measure EI while the exercises are being done. Additionally, emotions can change during experiments due to a wide variety of factors, so applying the concepts explained in this section may produce data of little use. There should be continuous tracking, based on IT solutions, to detect drifts in emotions. There are three interrelated concepts: emotional intelligence and self-awareness measured with specific surveys that help run experiments. Researchers, though, treat them as independent notions.

2.2 EI technology-based evaluation

The set of emotions identified by Goleman and their characteristics can be measured with neurological signals, such as EEG and MagnetoEncephaloGraphy (MEG). Emotion recognition techniques allow researchers to use that information in other contexts. This is the case in Mendoza-Palechor et al. (2019), where the authors use a small EEG device to demonstrate its ability to recognize emotions. Additionally, Zhang et al. (2020) provide a solution for detecting emotions with an improved radial basis function neural network algorithm. In Table 1, we provide a set of available articles that demonstrate how technology can measure and analyze Goleman’s dimensions of emotional intelligence. The referenced articles were selected from the papers available at different indexing websites.

Table 1 Available research that measures and analyzes Goleman’s dimensions

Our proposal takes advantage of the emotional intelligence dimensions to determine which human emotions should be analyzed to obtain the current state of people while they perform tasks, answer questions or carry out any other activity. That allows measuring the success rate, achievements, benefits, etc., which are usually quantified with questionnaires. Afterwards, situational awareness provides the tools to recognize and self-regulate the emotions, defined by Goleman, that were captured before. Then, we can interrelate all concepts to gather constant sentiment feedback from users without interrupting their tasks. We provide an IT solution for the set of emotions defined and give feedback, which falls within a topic called neurofeedback (NF) training and complements EI training. NF is defined by Hammond (2011) as the EEG biofeedback received through the use of specific devices designed to capture brainwave activity. Even though EEG and NF are fundamentally different (Jeunet et al. 2018), they share a common goal, so similar cognitive and neurophysiological processes are likely to be involved. The available literature endorses the capability of this concept to improve emotion regulation. A literature review (Linhartová et al. 2019) shows that real-time NF training seems to be a promising tool for that purpose. Several studies (Dehghani et al. 2020; Herwig et al. 2019; Zaehringer et al. 2019; Zotev et al. 2013; Johnston et al. 2010) report the same positive results.

3 Our methodology: devices, stages and algorithms

Our proposal combines recording neurological and physiological signals, processing those signals, detecting the sources of brain activity, determining emotions, and providing awareness of the emotional state. The final stage is where the study provides valuable insights to people, whereas the prior ones are the foundations of the structure. A scheme of these phases is provided in Fig. 2. This can be considered a procedure for obtaining the outcomes needed to help people with or without disabilities deal with different situations that require intensive cognitive skills. The implementation of the solution need not focus only on specific moments in time where a stimulus invokes a reaction in humans. Instead, it can be applied to a continuous activity that does not require a specific event to happen. In fact, the use of technology is a means to an end, which is self-detecting the EI and improving performance when doing activities. The technologies indicated in the figure are the ones we used for testing, so we expect other researchers to use their own tools and position them where appropriate in each stage. Further explanation of the complete methodology is provided below.

Apart from the presented literature that is compatible with the proposal in terms of emotion detection, there is an interest in testing whether it is actually feasible, considering the evaluation of emotional intelligence and its relation with self-awareness through constant feedback. For our tests, we followed the recommendations of the psychologists that the NeUX project had access to. The people under test were recurrent visitors of two senior centers: Albacete I and Albacete II. There were 25 participants aged between 65 and 74 years, of whom 16 were women and 9 were men. This demographic is adequate for our experiments because older adults tend to have more difficulty wearing this type of device, so low cost, less intrusive and comfortable devices are preferable. The experiments were conducted by two psychologists specialized in cognitive stimulation therapy for the elderly. The psychologists involved in this study work in centers for elderly people in the city of Albacete, where they are responsible for the cognitive programs. In addition, the objective was to detect differences in human response while watching videos that evoke stress and nervousness. Following a response based approach, stress is used to denote a person’s physiological response to a difficult environment or distressing life event (Burnard 1991). Cognitive training and exercises related to emotions have many advantages for the elderly over time. It has been shown (Rebok et al. 2014) that they significantly enhance reasoning and response speed.

Fig. 2 EI test process

3.1 Devices

In this research, Muse 2 was chosen among other viable alternatives compared in Table 2. During the study, the connection with the device was established wirelessly from a regular laptop. In order to understand why this equipment was selected, the use case must be studied first. The research aims at creating a comfortable low cost proposal or set of stages for elderly people, so devices must be unobtrusive to the person wearing them. EEG equipment is characterized by its complexity and the difficulty of wearing it correctly on the head. Additionally, it usually includes wet sensors, which add quality to the recordings but require more preparation time. Muse also has some weaknesses that must be outlined: the small number of sensors limits the noise reduction quality of post-processing, which may end up with useless recordings; according to our tests, the PPG sensor does not measure heart rate correctly and heartbeats do not show proper values; and the Bluetooth of the receiving adapter must be able to work with the sampling rate without saturating, which is more prone to happen on mobile devices. Essentially, the target of the study requires specific devices that allow comfort and precision. Given that Muse fulfills the requirements, validation is necessary to confirm that results are correct. The validation in Krigolson et al. (2017) confirms that Muse can be used to conduct Event-Related Potential (ERP) studies from a computer, comparing the results with an advanced device. Additionally, Arsalan et al. (2019) use Muse for classifying mental stress using different algorithms with high accuracy. Another article using Muse (Seo et al. 2019) compares machine learning methods for robust boredom classification, which is inside the scope of this study. Muse is designed for meditation but, as has been demonstrated (Svetlov et al. 2019), there is no benefit in using it for mindfulness and short-term stress reduction while using it as a feedback tool to know whether the activity was performed well. That article did not analyze brain signals in any way, but used a mobile application that executed some black box algorithms. Knowing that, this study does not have that objective and does not perform automatic analysis.

Table 2 Comparison of low-cost EEG devices

As stated previously, the heart rate sensor of Muse does not provide useful data so, in order to keep gathering that data, another device must be used. ElectroCardioGraphy (ECG) devices are precise and have proven their versatility in different environments. Nevertheless, their use requires placing them directly over the thoracic cavity. This downside clashes with the need for easy to wear devices. The alternative is again PPG, but with a tested device that has proven its precision. Polar OH1 can be worn on the upper arm or the forearm over the skin, which makes it compatible with the study. The validation study (Schubert et al. 2018) confirms that the device is accurate during yoga when doing moderate and vigorous exercises. A more thorough study by Hettiarachchi et al. (2019) shows high agreement between Polar OH1 and the measurements of an ECG device under moderate to high intensity physical activities. In our study, low to moderate activity is required, so this device satisfies the requirements.

The use of Heart-Rate Variability (HRV) has been studied to discover how events make the heart deviate from its normal interbeat interval. Appelhans and Luecken (2006) show the use of HRV as a noninvasive way to know how the brain is able to organize emotional responses. Brosschot and Thayer (2003) describe a correlation between emotions with negative valence and a longer time taken for heart rate to stabilize, regardless of initial activity. The opposite happens when valence is positive, in which case cardiovascular activity takes less time to return to normal. This has some relation with stress as a somatic disease. A similar situation is reported in Anttonen and Surakka (2005), as valence directly affects how the heart rate changes after the stimuli and returns to normal, following the same tendency described above. Extrapolating those results to this study, HRV is relevant as a support for the neural analysis and helps gain deeper knowledge of the body’s response to similar states.

3.2 Recording stage

There are different environments in which the methodology designed can provide valuable data. The purpose of the trial must be established prior to designing the tasks to be performed. Some examples that may need the introduction of self-awareness are daily life tasks, concrete activities with intense brain activation, and mental tracking for arousal and valence wellness. In addition, specific invoked events can be considered as long as the analysis stage is correctly adapted to this case, in which several trials may be needed for better identification of emotions. This is because the more previously classified data is available, the better the expected results. These concerns must be taken into account when deciding the duration of the experiment, which can run from a few seconds up to some hours of continuous recording. The advantage of long lasting tests is that subsequent analysis may be more precise, as long as they are divided into epochs appropriately. Epochs are time windows extracted from a continuous signal with reference to a concrete event, which can be short or long in time and may be classified. The disadvantage is that filtering and cleaning artifacts can be a more complex task, due to the higher probability of capturing noise.

Another important aspect is movement. As a rule of thumb, it can be stated that more head or body movement induces higher noise rates, no matter the activity. For this reason, the preprocessing stage is easier and more accurate with low movement recordings.

While EEG devices are being worn, the real time graph of raw signals should be checked constantly to ensure the quality of the recordings, making sure noise is suppressed as much as possible. Cleaning the sensors can improve skin contact, providing better results. For ECG devices, HRV is given as the RR interval, which can be used in the analysis. For PPG devices, such as the Polar OH1 used in this study, the data collected comes in PP interval form, which is similar to RR.
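As an illustration of how interbeat intervals can be summarized, the following minimal sketch computes three common time-domain HRV descriptors from a list of RR (or PP) intervals. The function name, the choice of metrics and the sample values are assumptions made for this example; they are not taken from the study.

```python
import numpy as np

def hrv_summary(intervals_ms):
    """Basic time-domain HRV metrics from RR (or PP) intervals in milliseconds."""
    rr = np.asarray(intervals_ms, dtype=float)
    diffs = np.diff(rr)                              # successive interval differences
    return {
        "mean_hr_bpm": 60000.0 / rr.mean(),          # average heart rate
        "sdnn_ms": rr.std(ddof=1),                   # overall variability
        "rmssd_ms": np.sqrt(np.mean(diffs ** 2)),    # beat-to-beat variability
    }

# Hypothetical PP intervals (ms); not data from the study
pp_intervals = [800, 810, 790, 805, 795]
summary = hrv_summary(pp_intervals)
print(summary)
```

Because PP intervals from a PPG sensor play the same role as RR intervals from ECG, the same function can be applied to either source.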

Our tests were carried out in a silent room with a comfortable chair and a monitor for showing the videos. The experiment is individual and is based on videos selected by the psychologists, some of which evoke a feeling of stress and others relaxation. The recent study (Gilman et al. 2017) provided the videos for the experiment. Specifically, those under the fear section were selected. Fear videos were displayed because available literature (Horowitz and Wilner 1976) shows a relation between fear films and stress emotion. For relaxation, neutral videos were chosen from the same source. This accomplishes the imposed objective of cognitive stimulation and, in this case, through passive activities. The procedure was the following: explain the purpose of the experiment to the participant and the devices he/she will wear; put the devices on the body with help; watch a relaxing video with a duration of 5 min; watch an interleaved sequence of five stressful and five relaxing videos of 2 min duration each. The approximate duration of the whole test was 35 min per participant.
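The viewing part of the procedure can be written down as a labelled timeline, which is convenient later when recordings need to be segmented and classified. The sketch below is only an assumption about how such a schedule could be encoded: the function name, the label strings and the order within each stressful/relaxing pair are hypothetical.

```python
def build_schedule(baseline_s=300, clip_s=120, n_pairs=5):
    """Return (onset_s, duration_s, label) tuples for one viewing session."""
    schedule = [(0, baseline_s, "relax")]        # initial 5 min relaxing video
    onset = baseline_s
    for _ in range(n_pairs):
        for label in ("stress", "relax"):        # assumed order within each pair
            schedule.append((onset, clip_s, label))
            onset += clip_s
    return schedule

schedule = build_schedule()
viewing_min = sum(duration for _, duration, _ in schedule) / 60
print(len(schedule), viewing_min)
```

Under these assumptions the videos account for 25 min, which is compatible with the reported total of about 35 min once the explanation and the fitting of the devices are included.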

3.3 Preprocessing stage

After recording the experiment, some preprocessing should be executed over the files generated. In this stage, an analysis of the results should not take place. Instead, preprocessing refers to the manipulation of the data necessary to highlight the details that include potentially relevant information about the brain’s activity in the experiment domain. This section explains the preprocessing options for a device like Muse, which only has four electrodes. This does not limit the task, but retrieving the important input may be more difficult, or even impossible. Nevertheless, having standard recordings with good skin contact should not generate any problem.

There are two main software products designed for processing EEG signals from Brain-Computer Interface devices. EEGLAB (Delorme and Makeig 2004) is a Matlab toolbox that contains several features for signal analysis. MNE (Gramfort et al. 2013) is a Python library that helps with exploring, visualizing, and analyzing human neurophysiological data. Each has advantages and downsides. The former is GUI based, which facilitates its use by people without extensive programming knowledge. However, it is very slow at making calculations compared to the latter, and contains fewer features. The main advantages of MNE are its speed, a programming interface for developers, and its large number of features. Due to the maturity of MNE, the following contributions use it for brain signal preparation. Additionally, the MuseStudio library (Sánchez-Cifo et al. 2021a) described in Sánchez-Cifo et al. (2021b) facilitates this stage with a set of tools designed for data management of Muse devices.

This schema shows an overview of the different elements that should be considered while preprocessing brain data:

  1. File import ensuring consistency among data collected depending on the record format.

  2. Data import to MNE library.

  3. Visualize recorded brain signal to check the existence of bad channels.

  4. Filter data to remove noise and improve feature extraction.

  5. Epoch imported data by dividing it into several segments.

  6. Extract relevant features through the use of different methods, such as ERP and ICA.

  7. Calculate the most relevant features for the study.

Following the schema, the first step is data import. Depending on the file format, the library may not be able to import data with the built in methods. If that is the case, which is expected to happen with Muse due to the imposed limitations to researchers, some rearrangement of the channels is necessary with Python. Essentially, another specific library for importing a concrete file format is required. Afterwards, data can be inserted in MNE with the included functions for raw data.
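The channel rearrangement mentioned above can be sketched as follows. This is a minimal example under stated assumptions: the CSV layout, the column names and the microvolt unit are hypothetical and depend on the recording tool used, and the function name is ours. Only the four electrode positions of the Muse headband (TP9, AF7, AF8, TP10) are taken as given.

```python
import csv
import io
import numpy as np

CHANNELS = ["TP9", "AF7", "AF8", "TP10"]     # Muse electrode positions

def load_muse_csv(text, channels=CHANNELS):
    """Parse CSV text into a (n_channels, n_samples) array in volts."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Select columns by name so that the column order in the file does not matter
    data_uv = np.array([[float(row[ch]) for row in rows] for ch in channels])
    return data_uv * 1e-6                    # assumed microvolts, converted to volts

# Hypothetical export whose columns are in a different order
sample = (
    "timestamps,AF7,TP9,TP10,AF8\n"
    "0.000,1.0,2.0,3.0,4.0\n"
    "0.004,5.0,6.0,7.0,8.0\n"
)
data = load_muse_csv(sample)
print(data.shape)
```

An array shaped like this (channels by samples, in volts) is the kind of input that MNE's `mne.io.RawArray` accepts together with an `mne.create_info` description of channel names and sampling rate.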

Once data is loaded in the library, visualizing the raw signal is highly advisable. It can provide the necessary information to know if the recording is appropriate or needs to be redone. This step allows checking if there are bad channels. With four electrode devices, this is critical because there are only four data sources. Together with raw signals, observing Power Spectral Density (PSD) can help deciding if channels are fine.
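Visual inspection can be complemented with a simple numerical check. The sketch below computes a basic periodogram and flags channels whose amplitude spread is far from that of the median channel. The threshold, the function names and the synthetic data are assumptions made for illustration; this is a heuristic, not the criterion used in the study.

```python
import numpy as np

def periodogram(x, sfreq):
    """Simple periodogram of a 1-D signal (scaling kept deliberately simple)."""
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2 / (sfreq * x.size)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sfreq)
    return freqs, power

def flag_bad_channels(data, ratio=5.0):
    """Flag channels whose standard deviation is far from the median channel."""
    sd = data.std(axis=1)
    ref = np.median(sd)
    return [i for i, s in enumerate(sd) if s > ratio * ref or s < ref / ratio]

# Synthetic four-channel recording: a 10 Hz rhythm, with channel 2 flat
sfreq = 256
t = np.arange(2 * sfreq) / sfreq
data = np.tile(np.sin(2 * np.pi * 10 * t), (4, 1))
data[2] = 0.0                                  # simulates a sensor without contact

bad = flag_bad_channels(data)
freqs, power = periodogram(data[0], sfreq)
peak_hz = freqs[np.argmax(power)]
print(bad, peak_hz)
```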

Once the recording has been properly checked, filtering removes noise and improves subsequent feature extraction. There are two main ways to remove powerline noise, whose frequency depends on the location of the recordings. The issue can be avoided if the recordings are made in a magnetically shielded room. Otherwise, as Fig. 3 shows, at 50 Hz (Europe based test) there is a spike (on the left) that must be neutralized with a notch filter (on the right). Additionally, as long as the specific experiment allows it, frequencies above 45 Hz can be filtered out to avoid this noise entirely. Another aspect to consider is adding a high-pass filter with a 1 Hz cutoff. This is necessary to avoid slow drifts (Gramfort et al. 2013), which can reduce the independence of the sources when applying Independent Component Analysis (ICA) in the feature extraction phase.
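The effect of these filters can be illustrated with a deliberately crude stand-in: masking bins in the frequency domain. This is not how MNE filters data (its `Raw.notch_filter` and `Raw.filter` methods design proper FIR or IIR filters); the function below, its name and its parameters are assumptions chosen only to make the idea concrete.

```python
import numpy as np

def fft_filter(x, sfreq, l_freq=1.0, h_freq=None, notch=50.0, notch_width=1.0):
    """Zero FFT bins below l_freq, around the powerline frequency, and above h_freq."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sfreq)
    keep = freqs >= l_freq                            # high-pass: removes slow drifts
    keep &= np.abs(freqs - notch) > notch_width       # notch: removes powerline noise
    if h_freq is not None:
        keep &= freqs <= h_freq                       # optional low-pass
    return np.fft.irfft(spectrum * keep, n=x.size)

# Synthetic signal: 10 Hz rhythm + 50 Hz powerline + offset and slow drift
sfreq = 256
t = np.arange(2 * sfreq) / sfreq
alpha = np.sin(2 * np.pi * 10 * t)
noisy = alpha + 0.5 * np.sin(2 * np.pi * 50 * t) + 2.0 + np.sin(2 * np.pi * 0.5 * t)

clean = fft_filter(noisy, sfreq)
print(np.abs(clean - alpha).max())
```

Passing `h_freq=45.0` additionally discards everything above 45 Hz, which makes the separate notch redundant for a 50 Hz powerline.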

Fig. 3 Example of signal using FFT before (on the left) and after (on the right) applying a notch filter

Given a filtered signal, epoching makes analyzing experiments an easier task. It consists of dividing data into several segments depending on a time frame. The separation of segments is determined by the induced events. This is useful in experiments in which some brain activity is invoked at certain moments in time.
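A minimal sketch of epoching on a plain array is given next. MNE provides this through its `mne.Epochs` class; the function below is a simplified illustration whose name, arguments and sample data are assumptions.

```python
import numpy as np

def make_epochs(data, event_samples, sfreq, tmin=0.0, tmax=2.0):
    """Cut (n_channels, n_samples) data into (n_events, n_channels, n_times)."""
    start_off = int(round(tmin * sfreq))
    stop_off = int(round(tmax * sfreq))
    epochs = []
    for event in event_samples:
        start, stop = event + start_off, event + stop_off
        # Skip windows that would run past the edges of the recording
        if start >= 0 and stop <= data.shape[1]:
            epochs.append(data[:, start:stop])
    return np.stack(epochs)

# Ten seconds of four-channel dummy data sampled at 256 Hz
sfreq = 256
data = np.arange(4 * 2560, dtype=float).reshape(4, 2560)
events = [0, 512, 2400]                       # hypothetical event onsets (samples)

epochs = make_epochs(data, events, sfreq, tmin=0.0, tmax=1.0)
print(epochs.shape)
```

The last event is dropped because a one-second window starting at sample 2400 would exceed the recording.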

Working with raw data can be very complex in terms of interpretation. Even though experts are able to get an overview of how the activation of the brain is affected when invoking a reaction, gathering and classifying all the events recorded is a hard task. For this reason, feature extraction helps discover and obtain meaningful data. Another task in this stage is continuing the noise reduction that was started previously. There are several methods to extract the features of a recording. In the literature, many researchers are working on finding the optimal way of determining the features of EEG recordings (Sun et al. 2019; Cheng et al. 2020; Alyasseri et al. 2018; Jaiswal and Banka 2017). In general, they extract the features that best suit the experiment run. Thus, there is no clear methodology for feature extraction. Nevertheless, some methods from data science are useful for almost any experiment, such as ERP and ICA. When there are fast changes in brain activity due to an evoked stimulus, ERP obtains the average power changes right after that event. This helps determine how neurons change in response to the stimulus. In order to simplify this task, data should be correctly epoched. Trials can benefit from this method if the time window to analyze is short and identified. The other method is ICA, which is widely used in any kind of experiment because it usually delivers useful data, and it is included in EEGLAB and MNE. Raw EEG signals are a mixture of different sources captured together from different locations. This data science technique helps by separating those sources into different data streams, as Fig. 4 depicts. Notice that the maximum number of sources detected is limited by the number of sensors of the device. ICA also facilitates noise reduction of signals coming from outside the brain, including heartbeats and eye blinks. In Fig. 5, we show the raw filtered signal before applying ICA (on the left) and after removing the sources captured from heartbeats and eye blinks (on the right), which correspond to the first and third stream. The result is similar to the original recording except for the inappropriate deviations.
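The averaging idea behind ERP can be illustrated on synthetic data. In the sketch below, every trial contains the same stimulus-locked response buried in noise, and averaging the baseline-corrected epochs recovers it. The function name, the shape of the simulated response and the noise level are assumptions for this example only.

```python
import numpy as np

def erp(epochs, n_baseline):
    """Average (n_epochs, n_times) epochs after removing each pre-stimulus mean."""
    baseline = epochs[:, :n_baseline].mean(axis=1, keepdims=True)
    return (epochs - baseline).mean(axis=0)

rng = np.random.default_rng(0)
n_trials, n_times, n_baseline = 100, 256, 64

# Hypothetical evoked response: a smooth bump starting after the baseline period
template = np.zeros(n_times)
template[96:160] = 2.0 * np.hanning(64)

# Each trial is the same response plus independent noise
epochs = template + rng.normal(0.0, 1.0, size=(n_trials, n_times))
evoked = erp(epochs, n_baseline)

single_trial_error = np.abs(epochs[0] - template).max()
average_error = np.abs(evoked - template).max()
print(single_trial_error, average_error)
```

Averaging N trials reduces the noise amplitude roughly by a factor of the square root of N, which is why several repetitions of the same event are helpful.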

Fig. 4 Sources of data captured applying ICA to a raw EEG signal

Fig. 5 Example of filtered signal before (on the left) and after (on the right) removing conflicting sources detected by ICA

Usually, in EEG and ECG (or PPG), the inputs for the different models are separated into batches or sliding windows (Wu et al. 2017; Kobayashi et al. 1999; Shahid et al. 2013; Cososchi et al. 2006). In our case, the signal was split into one second batches (256 samples per second). This configuration is the result of several tests with different options, including overlapping sliding windows and larger batches. On the one hand, larger batches make predictions less accurate due to the large amount of information to process at a time, and the non-linearity of the signal complicates the subsequent analysis. On the other hand, overlapping sliding windows foster overfitting in our problem, which happens when the learned model fits the training data well but generalizes (predicts) worse with new data.
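The difference between non-overlapping batches and overlapping sliding windows can be sketched as follows. The function name and the test signal are assumptions; the window size of 256 samples corresponds to the one second batches described above.

```python
import numpy as np

def sliding_windows(signal, size=256, step=256):
    """Split a 1-D signal into windows; step == size gives non-overlapping batches."""
    n_windows = (signal.size - size) // step + 1
    return np.stack([signal[i * step : i * step + size] for i in range(n_windows)])

signal = np.arange(1024, dtype=float)            # four seconds at 256 Hz

batches = sliding_windows(signal)                # one second batches, no overlap
overlapped = sliding_windows(signal, step=128)   # 50% overlap, shown for comparison
print(batches.shape, overlapped.shape)
```

With overlap, neighbouring windows share half of their samples, so nearly identical examples can end up in both training and evaluation data, which is one plausible reason for the overfitting mentioned above.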

Once the signal is clean and some generic features are extracted, such as the ones previously described, this research requires the selection of other more advanced features for the algorithms to learn correctly. The nonlinear behavior of physiological data makes it very difficult to process and capture the main characteristics that distinguish one class from the others. The features presented here are valid for our experiment, but the proposal is open to any others that adapt to the particular problem. These are the features chosen for this experiment:

  • The wavelet coefficients were calculated using the FFT method explained previously. Those are separated into five bands depending on the frequency range. The ranges are these: delta 0–4 Hz, theta 4–8 Hz, alpha 7.5–13 Hz, beta 13–30 Hz and gamma 30–44 Hz. The last frequency finishes at 44 Hz because higher frequencies were filtered out.

  • With the coefficients calculated, the dimensionality can be further reduced with the mean of absolute value, average power, variance and standard deviation of each frequency band.

  • The decorrelation time is the first zero crossing of the autocorrelation function. This indicates the data periodicity, which simplifies the data series by comparing each element with another element some samples away. The definition of the autocorrelation function for a time series is:

    $$\begin{aligned} c_{xx_{k}} = \dfrac{\sum _{i=1}^{N-k}x_{i}x_{i+k}}{(N-1)\sigma ^{2}} \end{aligned}$$
    (1)

    where \(\sigma ^{2}\) is the variance, k is the number of elements to jump in the series, and N is the number of elements.

  • The cross-correlation provides the similarity between two time series (Chandaka et al. 2009). It is defined as:

    $$\begin{aligned} CC(S,T,\Phi ) = {\left\{ \begin{array}{ll} \displaystyle \sum _{i=0}^{n-\Phi -1} S_{i+\Phi }T_{i} \quad \Phi \ge 0 \\ CC(T,S,-\Phi ) \qquad \Phi <0 \end{array}\right. } \end{aligned}$$
    (2)

    where S and T are signals, n is the length of the signal and \(\Phi\) is the time shift, with values \(\Phi = \{-n+1,...,0,...,n-1\}\). From this calculation, some features can be extracted: skewness, kurtosis, equivalent width, mean square abscissa, and centroid.
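The features listed above can be sketched in a few lines. The code below follows Eq. 1 and Eq. 2 as written, under some assumptions that are ours: the series is mean-centered before computing the autocorrelation, \(\sigma ^{2}\) is taken as the sample variance, band statistics are computed on FFT magnitudes, and the beta range (13–30 Hz) is assumed to fill the gap between the alpha and gamma ranges. All function names are hypothetical.

```python
import numpy as np

BANDS = {"delta": (0, 4), "theta": (4, 8), "alpha": (7.5, 13),
         "beta": (13, 30),            # assumed range
         "gamma": (30, 44)}

def band_stats(x, sfreq, bands=BANDS):
    """Mean absolute value, average power, variance and std of FFT magnitudes per band."""
    coeffs = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sfreq)
    stats = {}
    for name, (lo, hi) in bands.items():
        c = coeffs[(freqs >= lo) & (freqs < hi)]
        stats[name] = {"mean_abs": c.mean(), "avg_power": np.mean(c ** 2),
                       "variance": c.var(), "std": c.std()}
    return stats

def autocorrelation(x, k):
    """Eq. 1 applied to the mean-centered series (centering is an assumption)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    return np.sum(x[: n - k] * x[k:]) / ((n - 1) * x.var(ddof=1))

def decorrelation_time(x):
    """First lag at which the autocorrelation reaches or crosses zero."""
    for k in range(1, len(x)):
        if autocorrelation(x, k) <= 0:
            return k
    return None

def cross_correlation(s, t, phi):
    """Eq. 2: sum of S[i + phi] * T[i], with negative shifts swapping the signals."""
    if phi < 0:
        return cross_correlation(t, s, -phi)
    s, t = np.asarray(s, dtype=float), np.asarray(t, dtype=float)
    return float(np.dot(s[phi:], t[: s.size - phi]))

# Small worked examples with made-up data
square = [1, 1, -1, -1, 1, 1, -1, -1]
lag = decorrelation_time(square)
cc = [cross_correlation([1, 2, 3], [4, 5, 6], phi) for phi in (-1, 0, 1)]
t = np.arange(256) / 256
stats = band_stats(np.sin(2 * np.pi * 10 * t), sfreq=256)
strongest = max(stats, key=lambda band: stats[band]["avg_power"])
print(lag, cc, strongest)
```

The summary features named in the text (skewness, kurtosis, and so on) would then be computed over the sequence of cross-correlation values across all shifts.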

3.4 Analysis stage

The outcomes of the feature extraction stage are the inputs for the analysis. Once granted that the signals contain relevant data from the brain, they must be classified accordingly. For this study, we are using supervised learning, which requires that all recordings are properly classified. This classification depends on the objectives of the research and the specific dimension covered in Fig. 1. A researcher can expect different number of classes depending on the characteristics of the study, with ranges that start from as low as two classes up to an indefinite number of them.

There are several supervised learning models that can be tested with the suggested configuration of EEG and PPG. As Alarcao and Fonseca (2019) show, many papers cover different ways of detecting emotions with several methods. In this article, we compare three common algorithms to make sure that the proposal can be implemented in a real context. The purpose of this test is to provide a solution that can be extrapolated to larger environments. Because we focus on levels of stress, the problem presented here is limited to binary classification, where recordings are segmented into calm and stressed (within a confidence interval) depending on the real state of the person at the moment of the recording. The algorithms chosen for the analysis were Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM). They were selected for the following reasons. The decision tree is tested to check whether the divide-and-conquer approach is useful in neural and heart signal processing. Random Forest follows the same approach but acts as an ensemble that joins several unrelated trees to improve learning. The Support Vector Machine is the most widely used method for this kind of problem due to its high accuracy (Alarcao and Fonseca 2019), but that comes at the cost of being the slowest to train. Even though we consider these algorithms for a binary problem, adding more classes may require a multi-class approach for classic supervised learning algorithms (Saffari et al. 2009). Other solutions include deep learning (Schirrmeister et al. 2017), which can fit the training data better when complexity increases.

The Decision Tree algorithm (Brodley and Friedl 1997) is a classification procedure that recursively partitions a dataset into smaller subdivisions on the basis of a set of tests defined at each branch in the tree. The tree contains a root node, a set of internal nodes and a set of terminal nodes. Each node has only one parent node and two or more descendant nodes depending on the number of classes. A dataset is classified by sequentially dividing it according to the class label assigned to each observation. An advantage of decision trees is that they do not require assumptions regarding the distributions of the input data.

The Random Forest algorithm (Strobl et al. 2007) is an aggregation of decision trees and belongs to the group of ensemble methods. It is considered a standard non-parametric classification and regression tool for constructing prediction rules (Qi 2012). Random forest (Probst et al. 2019) can handle missing values, different types of variables, and high-dimensional data modeling. In contrast to decision trees, there is no need to prune the trees to avoid overfitting because the bootstrapping schemes help to overcome the issue.

The Support Vector Machine algorithm (Direito et al. 2017) is focused on reducing the structural risk. It works by defining a hyperplane and a surrounding margin that maximize the shortest distance between the classes and, subsequently, minimize the generalization error. Generally, the points are not easy to separate, so they are usually mapped into a higher-dimensional feature space. This transformation is performed by the kernel specified as a parameter, which can be linear, radial basis function or polynomial, among others.
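To illustrate what the kernel does, the following minimal sketch checks that a homogeneous polynomial kernel of degree 2 on two-dimensional inputs gives the same value as an ordinary inner product in an explicit three-dimensional feature space. The function names are hypothetical, and the scaling factor and constant term that libraries usually expose are omitted, so this is not the configuration used in the study.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: K(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2


def explicit_map(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, -1.0)

# Kernel evaluated directly in the original 2-D space.
kernel_value = poly2_kernel(x, y)

# Same quantity computed as an inner product in the 3-D feature space.
mapped_value = sum(a * b for a, b in zip(explicit_map(x), explicit_map(y)))
```

Both values agree up to rounding, which is why the classifier can work in the higher-dimensional space without ever building the mapped vectors.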

The data were split following this stratified distribution: 70% train, 15% validation, 15% test. Stratification ensures that all the subsets preserve the same class proportions. The parameters that drive the learning process must be set properly so that the model best predicts new recordings. The miss rate for a classification problem can be calculated with these equations:

$$\begin{aligned} MISS(c_{\theta }, X) = \frac{1}{m} \sum _{i=1}^{m} error\left( c_{\theta }\left( x^{(i)}\right) , y^{(i)}\right) \end{aligned}$$
(3)
$$\begin{aligned} error\left( c_{\theta }\left( x^{(i)}\right) , y^{(i)}\right) = {\left\{ \begin{array}{ll} 0 & \text {if } c_{\theta }\left( x^{(i)}\right) = y^{(i)} \\ 1 & \text {if } c_{\theta }\left( x^{(i)}\right) \ne y^{(i)} \end{array}\right. } \end{aligned}$$
(4)

where X is a dataset of size m, \(c_{\theta }\) is the classification model, \(\theta\) represents the parameters of the model, \(x^{(i)}\) and \(y^{(i)}\) are the data to predict and the real class respectively.
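As a minimal sketch of the two ideas in this passage, the code below performs a stratified 70/15/15 split and evaluates the miss rate of Eqs. (3) and (4). The helper names, the fixed random seed and the rounding of the per-class subset sizes are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, validation, test) index lists with class proportions kept."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, validation, test


def miss_rate(classifier, xs, ys):
    """Eq. (3): fraction of samples whose prediction differs from the real class."""
    errors = sum(1 for x, y in zip(xs, ys) if classifier(x) != y)  # Eq. (4)
    return errors / len(xs)


# Toy labels: 60 calm and 40 stressed recordings (illustrative values only).
labels = ["calm"] * 60 + ["stressed"] * 40
train, validation, test = stratified_split(labels)

# Toy classifier: predicts class 1 for positive inputs.
toy_rate = miss_rate(lambda x: int(x > 0), [-1, 2, 3, -4], [0, 1, 0, 0])
```

With these toy labels every subset keeps the 60/40 class ratio, and the toy classifier misclassifies one of four samples.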

We ran a grid search with 5-fold cross-validation to find the best configuration of the parameters for each algorithm. This approach tests all the possible configurations for a given set of parameters and returns the best combination. The procedure of a grid search is the following: (1) choose a set of parameters from the list; (2) learn a model using train data; (3) test the model with validation data; (4) store the combination of parameters and its score; (5) repeat from step 1 until no combinations remain; (6) return the best configuration and its score. This methodology must be repeated for the three algorithms. Once we have the best combination, we evaluate it on the test data, which must not be used at any other moment during the learning process. These are the best settings:

  • Decision Tree maximum depth 25, criterion entropy, best split strategy, maximum features log2, class weight balanced;

  • Random Forest maximum depth 35, number of estimators 70, criterion gini, best split strategy, maximum features square root, class weight balanced;

  • Support Vector Machine kernel polynomial, gamma 1/n_features, degree 2, regularization 0.1, tolerance 0.1.
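The six steps of the grid search can be sketched as follows. The learner is a hypothetical one-dimensional threshold classifier chosen only to keep the example self-contained, the parameter names `centre` and `offset` are invented for illustration, and a single validation set is used instead of 5-fold cross-validation.

```python
from itertools import product
from statistics import mean, median


def fit(params, xs, ys):
    """Toy learner (hypothetical): threshold halfway between the class centres."""
    centre = median if params["centre"] == "median" else mean
    low = centre([x for x, y in zip(xs, ys) if y == 0])
    high = centre([x for x, y in zip(xs, ys) if y == 1])
    threshold = (low + high) / 2 + params["offset"]
    return lambda x: int(x > threshold)


def score(model, xs, ys):
    """Fraction of correct predictions."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


def grid_search(grid, train, validation):
    """Try every parameter combination and return (best score, best parameters)."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):  # steps 1 and 5
        params = dict(zip(names, values))
        model = fit(params, *train)                          # step 2
        results.append((score(model, *validation), params))  # steps 3 and 4
    return max(results, key=lambda item: item[0])            # step 6


grid = {"offset": [-3, 0, 3], "centre": ["mean", "median"]}
train = ([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])
validation = ([2, 4, 6, 8], [0, 0, 1, 1])

best_score, best_params = grid_search(grid, train, validation)
```

The cost grows with the product of the list lengths, so the six combinations tried here become thousands when several parameters with many candidate values are searched.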

The models were evaluated by calculating accuracy, precision, recall, F1 score and Area Under Curve (AUC), which are common ways of testing machine learning models. The accuracy is the ratio of correctly predicted observations to the total observations. This metric is most informative when false positives and false negatives are similar, so we have to look at other metrics as well. The equation for accuracy is the following:

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \end{aligned}$$
(5)

The precision refers to the ratio of correctly predicted positive observations compared to the total positive observations predicted. This is the equation for calculating the precision:

$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
(6)

The recall is the sensitivity, which refers to the ratio of detection of positive cases and is defined as follows:

$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
(7)

The F1 score is the harmonic mean of precision and recall, which penalizes the final score if one of them is low. It is defined as:

$$\begin{aligned} F1 = 2\frac{Precision*Recall}{Precision + Recall} \end{aligned}$$
(8)

Finally, the AUC requires calculating the Receiver Operating Characteristic (ROC) curve and helps to identify good and bad classifiers quickly. It corresponds to the relation between the true positive rate and the false positive rate after normalizing the confusion matrix to one. The objective is to maximize AUC. In the equations presented, TP, TN, FP and FN stand for True Positives, True Negatives, False Positives and False Negatives, respectively.
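A compact sketch of Eqs. (5) to (8) and of the AUC is given below. The function names and the toy labels and scores are illustrative only. The AUC is computed through its rank interpretation, that is, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. Divisions by zero, for example when no positive is predicted, are not handled.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 as in Eqs. (5) to (8)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example (illustrative values, not results from the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

toy_metrics = classification_metrics(y_true, y_pred)
toy_auc = auc(y_true, scores)
```

In this toy example TP = 3, TN = 3, FP = 1 and FN = 1, so all four metrics equal 0.75, while 15 of the 16 positive/negative pairs are ranked correctly, giving an AUC of 0.9375.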

3.5 Feedback stage

This is the last phase of our proposal and the one that contributes the most valuable information to the person using it. Users see the proposal as a black box: the recording devices are attached to their bodies and they receive the results of the analysis in an understandable form. This is achieved by applying the already trained model to real-time data, which is processed internally in one-second batches to match the training process and preserve the validation score. Generally, the whole process remains unchanged for the final application in real environments, except that the training stage is no longer needed.
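The one-second batching can be sketched as below. The sampling rate, the feature extractor and the model are placeholders passed in by the caller, since they are hypothetical here, and any incomplete batch at the end of the buffer is dropped.

```python
def one_second_batches(samples, sampling_rate_hz):
    """Yield consecutive, non-overlapping windows that each span one second."""
    last_start = len(samples) - sampling_rate_hz
    for start in range(0, last_start + 1, sampling_rate_hz):
        yield samples[start:start + sampling_rate_hz]


def classify_stream(samples, sampling_rate_hz, extract_features, model):
    """Apply an already trained model to every one-second batch."""
    return [
        model(extract_features(batch))
        for batch in one_second_batches(samples, sampling_rate_hz)
    ]


# Toy usage: 10 samples at a hypothetical rate of 4 Hz give two full batches.
toy_samples = list(range(10))
toy_predictions = classify_stream(
    toy_samples,
    sampling_rate_hz=4,
    extract_features=lambda batch: sum(batch) / len(batch),  # placeholder feature
    model=lambda feature: int(feature > 3),                  # placeholder model
)
```

Using the same window length at training time and at inference time keeps the features comparable, which is the reason given above for matching the training process.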

Subjects who receive constant feedback should improve their EI by perceiving the states shown on screen. Subjects are then expected to learn from that feeling and associate it with the detected emotion. Thanks to this connection, people are expected to improve their self-awareness, so the final objective of the research is achieved. The aggregated success rate can be calculated with the help of an expert who interprets the reaction of a specific subject to an increasing emotion and ensures that it is coherent with the cognitive exercises done previously. That way it is possible to determine whether the proposal is useful and, if so, whether the subject is improving his/her cognitive skills.

The information about the emotions is depicted with a scale that is easy to interpret for the subject and the expert. Figure 6 shows the five possibilities for calm and stress levels, which are the emotions tested in this article to ensure the validity of the solution provided. However, researchers should add similar scales for the other emotions captured, such as motivation, optimism, initiative, and others in Goleman’s dimensions. We constantly displayed the level of stress captured by our model and asked the expert to check whether the subject was acting correctly in the presence of that emotion. The expert evaluates the reaction in order to note whether the participant is overacting or underacting. In either of these cases, the subject did not respond well to the emotion according to the activities learned in sessions with the expert, which include good practices for these scenarios.

Fig. 6 Scale designed to communicate the stress levels of subjects
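One possible way to turn a model output into a five-level scale is sketched below. It assumes that the model provides an estimate between 0 and 1 of how stressed the subject is and that the five levels are equal-width bins. Both are assumptions of this example and are not taken from the design of the scale in Fig. 6.

```python
def stress_level(p_stress, levels=5):
    """Map a stress estimate in [0, 1] to an integer level from 1 to `levels`."""
    if not 0.0 <= p_stress <= 1.0:
        raise ValueError("p_stress must lie between 0 and 1")
    # Equal-width bins (hypothetical); an estimate of exactly 1.0 is capped
    # at the top level instead of overflowing to levels + 1.
    return min(levels, int(p_stress * levels) + 1)


calm_end = stress_level(0.0)      # lowest level
middle = stress_level(0.5)        # middle level
stressed_end = stress_level(1.0)  # highest level
```

A real deployment would probably smooth the estimate over several one-second batches before display, so that the level does not flicker between neighbouring values.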

4 Results

4.1 Analysis

With the described evaluation methods in Sect. 3, we provide the performance of our solution and, more importantly, the success rate when applying our proposal. Table 3 provides the scores for the three trained models, while Fig. 7 shows the joint ROC curve of all the models to make the comparison easier.

Table 3 Scores of the supervised learning algorithms implemented

Fig. 7 ROC curve of the algorithms tested

After observing the scores of the best configuration for each algorithm, we must choose the one that performs best and provides the most valuable data. As this is a binary problem, a score of 0.5 (50%) or less is not acceptable and discards the model. Starting with accuracy, the best score is provided by SVM, whereas the worst is given by DT. For SVM, precision stays close to its accuracy, but DT surpasses RF by 7.6%, which emphasizes the importance of checking different metrics. Recall is again best for SVM, followed by RF with a difference of 10.4%, which should be noticeable in a real environment. DT, however, had a subpar performance with a positive detection rate of only 22.4%, which means that 77.6% of the positive cases went undetected. Thus, we can already conclude that DT is not useful in this context. SVM is also the best according to AUC, and the ROC curve clearly shows how the models perform with test data. As AUC is calculated from the ROC curve, SVM is the best model (bold values in Table 3), followed by RF and, lastly, DT. Finally, the F1 score provides the last judgment on which of the three is best. Because F1 is calculated from precision and recall, the best model is predictably SVM, with a difference over RF of 12.9%. For DT, the F1 score is unacceptable. These results are in line with the current literature, which has repeatedly shown that SVMs are highly accurate for EEG and ECG processing (Alarcao and Fonseca 2019).

Applying our methodology, we found that SVM is the best-performing of the three algorithms on our signals. With these results, we are ready to deploy it without the training process and check how it performs with the feedback in the format presented in Sect. 3.5, instead of calculating the score.

4.2 Feedback

Participants in our case study were recruited from a Cognitive Stimulation and Emotional Intelligence Therapy program for older people with no cognitive impairment. These courses are led by several occupational therapists and psychologists, and offer attendees activities that aim to stimulate thinking, meditation, mindfulness and concentration.

A group of 25 of these older people was randomly selected for this research. The same participants who had watched the videos used to train the algorithms were needed for the final test. The subjects wore the devices individually once again in a similar environment. This time, depending on their own stress level, we displayed the corresponding level on the scale presented in Fig. 6. Table 4 shows the statistically calculated success rate of participants trying to control themselves. Each test was scored 1 or 0 according to the appropriateness of the mitigating action, with 1 being right and 0 wrong. This score was assigned by a supervisor, in our case a psychologist.

Table 4 Statistics of participants controlling their emotions. \(\mu\) is the population mean of the raw results

The results look promising according to the statistics. The mean is 76%, which is good but not great. However, we must consider that not all participants had the same level of emotional intelligence, so an increase is expected when the subjects complete more cognitive learning activities. In general, the use of our proposal helped the psychologists to recognize the participants’ emotion awareness according to their brain activity. Moreover, participants can demonstrate their skills in emotion management. We provide a valuable tool for people who feel some kind of emotion evoked by a concrete situation but cannot identify the feelings elicited or act on them. In such cases, experts can determine which emotion is really occurring in the brain, so that they can provide meaningful help to each participant. In contrast, people with no diagnosed disease other than age-related brain degradation who do not feel any emotion when watching the videos cannot benefit from our solution, because no difference in brain activity is observable. During the cognitive stimulation programs, participants acquire emotion management capabilities through concrete training, but that does not ensure awareness of a particular emotion. Generally, the majority saw an improvement in their self-awareness because it was easier for the subjects to relate what they were feeling to a number on the scale.
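As a worked illustration of how binary scores lead to such a mean, the sketch below uses hypothetical values, not the raw data of the study. It assumes one score per participant, in which case a mean of 0.76 over 25 participants would correspond to 19 appropriate responses.

```python
import math
from statistics import mean, pstdev

# Hypothetical scores: 1 = appropriate mitigating action, 0 = inappropriate.
# Chosen only so that the mean matches the reported 76%.
scores = [1] * 19 + [0] * 6

mu = mean(scores)       # population mean of the raw 0/1 results
sigma = pstdev(scores)  # population standard deviation

# For 0/1 data the standard deviation follows directly from the mean.
sigma_from_mu = math.sqrt(mu * (1 - mu))
```

Because the scores are binary, the mean alone fixes the spread, so the mean is the main quantity to report for this kind of test.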

5 Conclusions and future work

It is widely accepted that Emotional Intelligence is the capacity to understand and manage our own emotions and feelings. Our EI affects the quality of our lives because it influences our behavior and relationships. There are several ways in which EI can be cultivated and increased, but the common basis for improving it is to recognize and understand our own emotions and to react adequately to the related feelings. Other ways of achieving high EI include making the effort to understand other points of view and communicating and empathizing effectively with relatives, among others. For these reasons, people with high EI can act effectively in response to emotions and feelings.

Traditional EI evaluation requires the use of questionnaires, which provide rigorous results. However, the participants of a study must interrupt their activities in order to fill in a questionnaire. The questions may be well suited to the research, but it is still an unnatural and intrusive manner of measuring EI.

In this paper we introduced a methodology, that is, a set of devices and a process, for evaluating self-awareness under recreated stress conditions, knowing that self-awareness is an important element of EI (Goleman 1998). In our proposal, EEG- and PPG-based devices are used to identify the stress awareness level of a set of older people (baby boomers). These participants were recruited among people enrolled in cognitive stimulation programs aimed at older people.

The heuristic validation of EI required the generation, identification and evaluation of emotions and feelings in elderly people. In order to elicit emotions in the participants, we put them under stress conditions. Stress is a relevant cause of negative emotions and feelings. As a result, it is a suitable way to raise emotions such as sadness, fear, anger or aggressiveness.

After analyzing the participants’ behavior, we concluded that our proposal is feasible (76% of participants became aware of their own emotions and put EI techniques into practice) and can be used by researchers with the proper adaptation to each particular context. Moreover, a prior emotion detection task was needed to support EI testing. In this paper, three different algorithms were analyzed to check which one performs best, with the Support Vector Machine providing the most accurate predictions (83.5% according to the F1 score). Therefore, in the stress context previously outlined, our solution is helpful.

Although we provided a case study with real participants, there could be environments or specific activities that do not allow its use or do not provide valuable data. As a result, we have identified and, where possible, mitigated the threats to the validity of this research (Wohlin et al. 2012).

In this sense, internal and external validity concepts are discussed. The first task was analyzing the current literature about EI and its evaluation. We decided to target low-cost devices, namely Muse (EEG) and Polar (PPG). Muse is especially affected by this constraint because of its limited number of electrodes. That makes noise more difficult to remove because we receive a smaller representation of the brain in a given time window. Thus, making fast movements with the device introduces a lot of useless information that does not emerge from the brain. We took that into account, which is why we ran the experiments with participants seated and still. We expect people using our proposal to apply it to brain-intensive activities, so this constraint should not influence the outcomes.

Moreover, all participants in the study were elderly and 64% were women, so the results are biased. Even with this limited representation of the population, the algorithms applied and the feedback provided were useful to determine stress levels. However, extending the age ranges, testing other algorithms, and covering a wider spectrum of emotions are definitely of interest for our future work.

There are still some tasks that must be pointed out for future work. The proposed case study was adequate for elderly people, but the methodology must also be examined with other devices that have more electrodes and more extensive development tools. Moreover, the age range is limited to elderly people, so we still have to put the solution into practice with a wider range that considers other generations, including Generation X, Generation Y and Generation Z, together with a higher number of participants. We consider the technology-based replication of studies an interesting challenge, for instance that of Fernández-Aguilar et al. (2018), in which the authors used questionnaires to conclude that older adults experienced negative emotions more intensely than young adults, especially in response to disgust and fear clips. They also reported higher arousal than young adults, especially in the case of sadness, anger and tenderness videos. Additionally, we are considering testing how well the solution supports the identification and management of positive emotions, including happiness, gratitude and calmness.

Emotional intelligence enables people to manage their emotions and feelings, and to act in response to internal and external events. However, an effective response relies on self-awareness, which is the ability to know one’s own emotions. There are several ways to measure emotional intelligence, but all of them require the subject to interrupt the task at hand and focus on the evaluation. In this paper, a solution was introduced and put into practice to allow emotional intelligence testing. In this solution, emotion detection and sending feedback were also tackled. These activities were supervised by a specialist using EEG and PPG devices. Our solution includes all the steps required to perform a complete analysis and help subjects of the experiment with their activities.