1 Introduction

Affective computing (AC) is computing that relates to or influences emotions and focuses on understanding the psychophysiological mechanisms underlying the way humans recognize emotions (Bota et al. 2019). Traditional methodologies to recognize emotions in AC usually rely on facial expressions and speech. However, these methods lack feasibility during in-field experiments (Chanel et al. 2007).

Physiological signals are an alternative method to assess emotions, using a wide variety of wearable sensors in an unobtrusive approach (Bota et al. 2019).

AC has been used to recognize emotional states through physiological signals, and it has been applied to patients with cognitive disorders and depression to improve their mental state.

Research in AC has been mostly performed by presenting emotional stimuli in 2D environments. However, virtual reality (VR) is becoming more popular in the field of emotion recognition (Marín-Morales et al. 2020). VR is used to study human emotions in controlled conditions that simulate real-life situations and environments, providing high levels of presence, which relates to the feeling of being in the virtual environment (Rose et al. 2018). Previous studies have demonstrated the effectiveness of VR in eliciting specific emotional states in controlled settings by comparing the emotional response between virtual and real environments while assessing the physiological response through physiological signals (Marín-Morales et al. 2020).

Therefore, the goal of this study was to automatically detect and recognize emotional states using machine learning and VR. To achieve this goal, the following objectives were proposed: (1) to evaluate the effect of the degree of immersion in the presentation of the emotional stimuli; (2) to classify emotional stimuli from self-reported ratings and physiological signals using user-dependent and user-independent classification models. Thus, we formulated the following research questions: (1) How does the degree of immersion of the emotional stimuli affect the emotional response? (2) How accurately can the emotional stimuli be classified using machine learning from physiological signals and self-reported ratings?

The developed experimental procedure consists of presenting emotional film clips (using the EMDB database (Carvalho et al. 2012)) in two degrees of immersion and monitoring users’ physiological signals. The participants’ emotional states were monitored by measuring the following signals: electrocardiogram (ECG), electrodermal activity (EDA), electroencephalogram (EEG), skin temperature (SKT), respiration (RESP), and functional near-infrared spectroscopy (fNIRS). Then, we applied specific signal processing techniques to extract a combination of features from these signals. Additionally, the participants reported subjective ratings for each presented film clip using the self-assessment manikin (SAM) (Bradley and Lang 1994) and an ad-hoc self-reported appraisal instrument developed for this study (Scherer et al. 2001; Smith 1989; Reisenzein et al. 2019; Aue et al. 2007; Denson et al. 2009).

Machine learning algorithms and models were applied to the features extracted from the physiological signals to classify the emotional category of the film clip accurately. The classification models were applied separately for the subjective ratings and the physiological signals, using user-dependent and user-independent models to detect the emotional category of the corresponding film clip.

This paper is divided into the following sections: Sect. 2 presents the state of the art; Sect. 3 describes the software, hardware, experimental procedure, emotional database, physiological signals, and machine learning algorithms used in the study; Sect. 4 shows the results for the two degrees of immersion and the classification results for the subjective ratings and physiological signals; Sect. 5 discusses the results obtained; Sect. 6 presents the limitations and future work; and Sect. 7 presents the conclusions.

2 State of the art

2.1 Virtual reality

Virtual reality (VR) has been used extensively in several rehabilitation areas, from motor to cognitive rehabilitation. It is an alternative to conventional rehabilitation therapies, which face potential limitations such as the transfer of gains to everyday activities. Thus, VR can be a means to deliver personalized therapy, as it can induce specific emotions, leading to a better prognosis (Stasieñko and Sarzyńska-Długosz 2016).

In this work, we used head-mounted displays (HMDs), i.e., special glasses or helmets worn on the head, as the presentation modality (Trojan et al. 2014). Juvrud et al. (2018) developed the Virtual Lab, a large-scale in-home testing platform that combines physiological measures and VR. Meehan et al. (2002) studied the influence of the level of presence on the physiological response to a stressful environment.

Another important concept in VR is the degree of immersion. Immersion is the extent to which the VR system replaces the sensations of the real world with those of a virtual world (Rose et al. 2018). Previous studies (Borrego et al. 2016; Goncalves et al. 2021) compared the sense of immersion between an HMD and the CAVE system, a projection of 2D information in a 3D physical space. They concluded that participants reported a higher level of immersion in the HMD system compared to the CAVE.

Thus, previous studies support the use of VR to study the effect of immersion on physiological responses, since the higher levels of immersion reported with the HMD have been shown to elicit stronger physiological responses (Meehan et al. 2002). Moreover, it is worth evaluating how the degree of immersion of the presentation mode of the emotional stimuli, i.e., the screen size and viewing angle, affects the emotional response of the participants.

2.2 Emotion regulation

Emotion regulation can be defined as "the process of modulating one or more aspects of an emotional experience or response" (Rey et al. 2014). Individuals differ in how they use emotion regulation strategies, and deficits in this regulation affect mental health (Inwood and Ferrari 2018). Therefore, it is an important factor in maintaining social relationships and adapting to new activities of daily living (Rey et al. 2014).

It is a major research topic in several mental health disorders and in psychotherapy research, for example in borderline personality disorder (BPD), depression, anxiety disorders, eating disorders, and somatoform disorders (Berking and Wupperman 2012). People with BPD show less emotional awareness and clarity, as well as a tendency to use self-injury as an emotion regulation strategy. Individuals with depression report difficulties in identifying and accepting negative emotions, and people with anxiety disorders show stronger reactions to fearful situations (Berking and Wupperman 2012).

Studies usually differ in how they conceptualize and assess emotion regulation, which limits the comparison of results across studies (Berking and Wupperman 2012). Before emotions can be regulated, they must first be detected properly. Emotion detection is often performed by analysing facial expressions, speech, gestures, and physiological states (Garcia-Garcia et al. 2017). Facial expressions reflect the emotions a person is feeling; systems detect a face, identify facial points that reveal emotions, and process those points to decide which emotion is being expressed (Garcia-Garcia et al. 2017). Speech is another method to detect emotions, gathering emotional information from the user’s voice through its paralinguistic content (Garcia-Garcia et al. 2017). Body gestures also reveal what a person is feeling, although it is still not clear how to build systems that recognize emotions from body gestures. Finally, emotions can be detected by monitoring physiological states and creating a classification system to predict the bodily response associated with emotions (Garcia-Garcia et al. 2017).

Most of these approaches have the limitation that facial expressions, speech, and gestures can be controlled to hide a specific emotional state (Chen et al. 2020). To overcome this problem, it is relevant to classify and distinguish between different emotion dimensions, using standardized self-report scales and physiological signals to identify different emotional states.

2.3 Emotional models

AC is a growing field due to its importance in real-world applications that recognize and interpret emotions (Sharma and Mathew 2020). Throughout the years, many emotional models have appeared to categorize emotional states. These can be organized into two approaches: a discrete approach and a continuous (dimensional) approach (Scherer et al. 2001). In the discrete approach, emotions are categorized into specific states, such as fear and anger (Cacioppo and Tassinary 1990). Ekman (1992) stated that emotions are universal and shared between different cultures. Thus, people can easily recognize expressions in people from different cultures. From these facial expressions, Ekman enumerated the following basic emotions: happiness, sadness, anger, disgust, fear, and surprise. Plutchik (1982) created the wheel of emotions to classify eight basic emotions: joy, trust, fear, surprise, sadness, anticipation, anger, and disgust. The wheel places low-intensity emotions on its outer edges; moving towards the centre, the primary emotions appear, and the centre of the wheel holds the most intense emotions one can experience. In this model, every emotion has its own opposite (e.g., sadness vs joy, anger vs fear). On the other hand, regarding the continuous dimension, Russell (1980) developed the circumplex model of affect (Fig. 1), a two-dimensional circular space where the vertical axis represents arousal and the horizontal axis represents valence. In this model, an emotional state can be represented by any combination of valence and arousal levels. Finally, Mehrabian and Russell (1974) added a third dimension, dominance, to characterize the control one has over one's emotions.

Fig. 1
figure 1

Circumplex model of affect. Based on Liapis and Xenos (2013)

2.4 Emotional induction datasets

Inducing emotional states in experimental settings has been widely studied in terms of psychological research, using methods like hypnosis, mental imagery, music and facial expressions. These methods are usually based on deception and cannot be standardized (Uhrig et al. 2016).

An alternative method to induce specific emotional states, one that can be easily standardized and does not involve deception, is the use of pictures and films to generate these emotional states.

The international affective picture system (IAPS) (Lang et al. 2005) is a standardized dataset of emotional pictures divided into several emotional categories, shown for 6 s. It was developed to assess emotional states using the dimensional approach of emotions to classify them into valence, arousal, and dominance, using the Self-Assessment Manikin (SAM) scale (Bradley and Lang 1994).

The Geneva affective picture database (GAPED) (Dan-Glauser and Scherer 2011) is a database of 730 pictures, created to increase the number of emotional stimuli available beyond the IAPS. Its content is divided into negative pictures (spiders, snakes, and scenes that induce emotions related to the violation of moral and legal norms), positive pictures (human and animal babies), and neutral pictures (inanimate objects). The pictures were rated according to arousal, valence, and congruence with internal and external norms.

The ArtEmis dataset (Achlioptas et al. 2021) focuses on the affective experience triggered by visual artworks. It contains thousands of human emotional attributions and explanations for paintings from WikiArt. The emotional classification is subjective and personal, and each picture is accompanied by a caption. Despite the subjectivity of the responses, namely the captions, this dataset uses the following eight categorical emotional states: anger, fear, disgust, sadness, amusement, awe, contentment, and excitement.

The FilmStim database is a set of emotional film excerpts developed by Schaefer et al. (2010). The film excerpts were rated in multiple dimensions, and each film clip lasted 1–7 min. The films were rated in terms of emotional discreteness using self-reported levels of arousal, positive and negative affect (PANAS), positive and negative affect scores from the differential emotions scale (DES) and six emotional scores for anger, disgust, sadness, fear, amusement, and tenderness.

The database for emotion analysis using physiological signals (DEAP) (Koelstra et al. 2012) is a multimodal dataset composed of 40 music videos with a 1-min duration. These videos were rated in arousal, valence, dominance, like/dislike, and familiarity. This database also recorded multiple physiological signals (EEG, EDA, BVP, RESP, SKT, EOG) and the face recording video of the participants.

The emotional movie database (EMDB) (Carvalho et al. 2012) is a validated database composed of 52 film clips without auditory content, divided into the following categories: erotic, horror, social positive, social negative, scenery and objects. The film clips were rated in arousal, valence, and dominance, and the physiological signals (HR and EDA) were collected.

The MAHNOB-HCI (Soleymani et al. 2012) is a multimodal database composed of 20 emotional film clips with durations of 35–117 s, rated in terms of arousal, valence, dominance, emotional label (neutral, anxiety, amusement, sadness, joy, disgust, anger, surprise, and fear), and predictability. It also contains recordings of facial video, audio, eye gaze, and multiple physiological signals (EDA, ECG, RESP, SKT, EEG).

Several works have used pictures and film clips to elicit emotional states. Both methods can elicit emotions in most subjects, although films have increasingly been preferred over pictures since they are more dynamic, more ecologically valid, and closer to real-life experiences than static images (Uhrig et al. 2016).

2.5 Physiological signals

Physiological signals can be objective sources of information about emotions and may be produced by the activity of the autonomic nervous system (ANS). Since natural bodily reactions are harder to control, physiological signals are more adequate for emotion recognition (Wioleta 2013). For example, changes in heart rate, respiration rate, and blood pressure are responses to the activation of the sympathetic nerves of the ANS. Hence, when someone is positively or negatively excited, such changes are inevitable and detectable through physiological signals (Jerritta et al. 2011). Moreover, some emotions need more than one physiological signal to be properly characterized; thus, a multi-modal approach needs to be taken into account (Egger et al. 2019).

Electrocardiography (ECG) measures the heart’s electrical activity, resulting in the contraction and relaxation of the cardiac muscle, which allows obtaining information regarding the Heart Rate (HR) and Heart Rate Variability (HRV). HR is directly related to the activity of the ANS, as the parasympathetic nervous system (PNS) decreases HR, and the sympathetic nervous system (SNS) increases HR. This relationship between the HR and the ANS may inform about different emotions, as HRV is a useful measure of stress and mental effort (Jerritta et al. 2011).

The human skin is innervated by numerous efferent fibres, including sympathetic fibres. The eccrine sweat glands in the skin produce sweat that changes the skin’s electrical properties, thus making it possible to measure the Electrodermal Activity (EDA). EDA directly reflects the activity of the SNS because there is no innervation of the sweat glands by the PNS (Boucsein 2012). Thus it enables us to infer the level of arousal and the activation of the SNS (Bota et al. 2019).

Electromyography (EMG) measures the muscles’ electrical activity, with the signal amplitude increasing during muscle contraction (Bota et al. 2019). Recognizing emotions through facial expressions can be achieved by placing EMG electrodes on the face to measure the activity of muscles such as the Zygomaticus major and Corrugator supercilii, as the muscle activity increases during emotions with negative valence (Gouizi et al. 2011).

Respiration (RESP) signals measure thoracic displacement to monitor the respiration pattern. A respiration cycle is composed of an inhalation period followed by an exhalation period and can be monitored to measure how deeply and how fast a person breathes (Schmidt et al. 2019). For example, slower respiration frequencies are associated with relaxed or depressive states, while faster frequencies may indicate happiness or anger (Egger et al. 2019).

Skin temperature (SKT) is related to the constriction and dilation of peripheral blood vessels modulated by the activity of the SNS (Gouizi et al. 2011). This sensor can assess whether someone is relaxed (higher SKT) or stressed (lower SKT).

The limbic system is responsible for the emotional response. This system comprises deep brain structures such as the amygdala and the hypothalamus (Lindquist et al. 2015). Although it is not possible to measure deep brain signals through electroencephalography and functional near-infrared spectroscopy, these two sensors allow us to detect brain activity at the scalp.

Electroencephalography (EEG) measures the summed electrical currents generated by the excitation of neurons in the cerebral cortex during their activation (Bota et al. 2019). These signals are a direct response to the activity of the central nervous system (CNS) (Jerritta et al. 2011). Despite being a time-consuming assessment to perform (Egger et al. 2019), many EEG studies have been conducted over the years to study the relationship between emotions, brain regions, and EEG frequency bands. Previous studies have shown that emotions are related to increased brain activity in the frontal, parietal, and central regions of the brain (Crawford et al. 1996). For example, Hu et al. (2017) showed that high-arousal stimuli decreased the alpha power in the parietal-occipital region, while high-valence stimuli exhibited high alpha power in the frontal region.

Functional near-infrared spectroscopy (fNIRS) measures the concentration changes of oxygenated haemoglobin (HbO) and deoxygenated haemoglobin (HbR) through the blood flow changes that accompany brain activation caused by neuronal firing (Naseer and Hong 2015). The brain’s prefrontal region is known to be involved in emotion processing; thus, fNIRS provides information regarding valence and arousal (Moghimi et al. 2012).

2.6 Emotion detection

In recent years, several research studies have addressed emotion detection and recognition using physiological signals (Jerritta et al. 2011).

Most of these works have focused on a user-dependent approach to recognizing emotions, which depends highly on each subject; more recently, user-independent approaches have emerged, in which unseen data from different users is used as testing data to obtain the accuracy of the models (Jerritta et al. 2011).

These works also vary in terms of the emotional elicitation stimuli used, as well as the physiological signals acquired to recognize emotional states. Thus, in this section, previous works are reviewed from a user-dependent and a user-independent point of view.

Kim and André (2008) acquired four physiological signals (EMG, ECG, EDA and RESP) to classify musical emotions (positive/high arousal, negative/high arousal, negative/low arousal, and positive/low arousal) using an extended linear discriminant analysis (pLDA), without any laboratory setting. The recognition accuracy obtained was 95% for user-dependent classification and 70% for user-independent classification. Liu et al. (2005) measured ECG and EMG to classify the following emotional states: anxiety, boredom, engagement, frustration and anger. These emotional states were elicited using computer-based cognitive tasks, like anagrams and pong tasks. The average correct classification across all participants (user-dependent) and all affective states was 75.12% with K-nearest neighbor, 83.5% with regression trees, and 74.03% with Bayesian networks.

Several studies used the IAPS database to elicit emotional states. Rigas et al. (2007) classified happiness, disgust and fear using EMG, ECG, EDA and RESP. The accuracy obtained was 62.70% with K-nearest neighbor and 62.41% with random forest, both for user-independent models. Haag et al. (2004) used a combination of multiple physiological sensors to measure valence and arousal. Using a neural network classifier, the classification accuracy was 89.7% for valence and 63.76% for arousal, with user-dependent models. Maaoui and Pruski (2010) used blood volume pulse (BVP), EMG, SKT, EDA and RESP sensors to classify amusement, contentment, disgust, fear, sadness and neutral states. They obtained an accuracy of 90% with support vector machines and 92% with Fisher linear discriminant analysis, both from user-dependent models. Finally, Gu et al. (2010) used ECG, BVP, EDA, EMG and RESP to classify several discrete emotions using the K-nearest neighbor classifier. The accuracy was 50% for user-independent and 90.7% for user-dependent models.

On the other hand, film clips have more recently been used to elicit emotions instead of pictures. Li and Chen (2006) acquired ECG, SKT, EDA and RESP while eliciting the emotional states of fear, joy and neutral, using one film clip for each of these emotional states. A canonical correlation analysis then yielded an accuracy of 85.3% from a user-dependent point of view. Nasoz et al. (2004) also used film clips to elicit several discrete emotions, combined with EDA and heart rate. The accuracies obtained were 71% with K-nearest neighbor, 74% with discriminant function analysis and 83% with Marquardt backpropagation, all user-dependent. Ramzan et al. (2016) also used audio-video film clips to classify arousal and valence. A combination of ECG and EEG signals was used, achieving average accuracies of 71.6% for valence and 54.0% for arousal.

From these studies, we can conclude that combining physiological signals leads to higher classification accuracy of emotional states. Furthermore, comparing user-dependent and user-independent models, higher accuracy is obtained from a user-dependent point of view.

3 Materials and methodology

3.1 Sample

Participants were recruited from a convenience sample of volunteer subjects, all university students and workers, fluent in English. Fourteen healthy adults (9 females, 5 males) volunteered to participate in this study, with an average age of 31 ± 7.53 years (range 21 to 50 years). All volunteers were considered healthy, as none reported suffering from any psychological disorder or taking regular medication. Written informed consent was provided before collecting any type of data. The study and all the ethical procedures were approved by the ethics committee of the University of Madeira.

3.2 Materials

3.2.1 Hardware

The hardware used for this study comprises the following elements: An HTC Vive HMD, a Desktop, a Laptop, an EEG headset and a wearable device.

The HTC Vive is a widely used HMD, with a resolution of 1080x1200 pixels per eye and a \(110^\circ\) field of view (FOV). It provides head orientation and position tracking through the laser tracking system of its two base stations.

The Vive was connected to a computer with Windows 10, an Intel i7-6700 3.40 GHz processor with 16 GB of RAM, and an NVIDIA Quadro P6000 graphics card.

The laptop was used to acquire all the physiological signals from the EEG headset and the wearable device. It ran Windows 10 with an Intel i7-1165G7 2.80 GHz processor, 16 GB of RAM, and an Intel Iris Xe graphics card.

The wearable device was a biosignalsplux 8-channel device developed by PLUX (PLUX Wireless Biosignals, S.A.). This device was developed for research purposes, allowing wireless and high-quality signal acquisition. It enables the acquisition and recording of eight physiological signals simultaneously, up to a sampling rate of 4000 Hz. For this study the following signals were acquired with a sampling rate of 1000 Hz: ECG, EDA, RESP, SKT and fNIRS. The fNIRS was acquired with a 24-bit resolution, while the remaining signals were acquired with a 16-bit resolution.

Finally, the EEG headset used was the g.Nautilus wearable EEG headset (g.tec medical engineering, GmbH Austria). This headset was designed for research applications and comes with flexible cables to configure the electrodes’ position using gelled electrodes with 32 EEG channels.

3.2.2 Software

The emotional stimuli experiment and the emotional self-report scales were designed using the PsychoPy2 software (Peirce et al. 2019). PsychoPy is a free, cross-platform package for creating and running experiments in the behavioural sciences, such as neuroscience and psychology.

The Virtual Desktop software was used to cast the desktop running the PsychoPy experiment to the HMD. It allows duplicating the desktop with different personalized options, such as the screen size, screen distance and screen curve. In the immersive setup, the following values were set: screen size = 230°, screen distance = 1.50 m, screen curve = \(100\%\); in the non-immersive setup, the values used were: screen size = 70°, screen distance = 1.50 m, screen curve = \(0\%\).

The acquisition of the physiological signals was performed using the OpenSignals software and the OpenViBE software (Renard et al. 2010). OpenSignals allows real-time visualization of the physiological signals acquired with PLUX devices, such as the biosignalsplux wearable device, while OpenViBE allows acquiring and visualizing the EEG data.

All the signals acquired during this experiment from the different devices were synchronized using the lab streaming layer (LSL) software (Christian et al. 2019). This synchronization includes all the physiological signals from the biosignalsplux, the EEG acquired with OpenViBE, and the onset and offset timestamps of the emotional film clips from the PsychoPy software.

These synchronized signals were saved in a .xdf file, and then only the segments corresponding to the visualization of a film clip (the time between onset and offset) were processed using Python.

3.3 Emotional database

The emotional database used for this study was the EMDB film clip database, which is validated for Portuguese subjects, our target population (Carvalho et al. 2012). The EMDB comprises the following emotional categories: erotic, horror, social positive, social negative, scenery and objects. For this experiment, the category "objects" was removed since it contains only two film clips, which are neutral in terms of valence and arousal. Each of the remaining categories contains ten film clips, resulting in 50 film clips used for this experiment, each with an approximate duration of 40 s.

3.4 Emotional self-report scales

3.4.1 Self-assessment manikin (SAM)

The self-assessment manikin (SAM) scale (Bradley and Lang 1994) was implemented using the PsychoPy software to provide a digital version of the original scale (see Figs. 2, 3 and 4). The subjective rating scores for arousal, valence, and dominance were obtained for every film clip shown. All the scales were presented as 9-point Likert scales, with the arousal scale ranging from 1 = Calm to 9 = Excited (Fig. 2), the valence scale from 1 = Unpleasant to 9 = Pleasant (Fig. 3), and the dominance scale from 1 = Controlled to 9 = In Control (Fig. 4).

Fig. 2
figure 2

Arousal scale (1 = calm, 9 = excited)

Fig. 3
figure 3

Valence scale (1 = unpleasant, 9 = pleasant)

Fig. 4
figure 4

Dominance scale (1 = controlled, 9 = in control)

3.4.2 Ad-hoc appraisal scale

The ad-hoc self-reported appraisal scale was designed to study additional emotional dimensions that could complement the arousal, valence, and dominance dimensions of emotion.

Physiological reactions related to emotions are organized around the personal meaning attributed to the situation through an appraisal process (Scherer et al. 2001). An association between the physiological responses of the ANS and appraisals has been demonstrated repeatedly. Thus, the opposite direction can be studied: physiological signals can serve as indicators of specific appraisals.

For this experiment, we presented the following appraisal items on a 7-point Likert scale (1 = Not at all to 7 = Completely agree): anticipatory effort, perceived obstacles, novelty, unexpectedness, intrinsic goal, perceived control, and control/stress.

The anticipatory effort item is related to the appraisal of a situation as objectively relaxing or as requiring effort; the higher the effort, the higher the activation of the Corrugator muscle (EMG) (Smith 1989). The perceived obstacles item relates to the presence vs absence of resistance between the person and their goal, leading to a higher heart rate (ECG) (Smith and Ellsworth 1985). The novelty and unexpectedness items relate to how new and unexpected the event is to the user, increasing skin conductance levels (EDA), decreasing heart rate (ECG) and pausing the respiration rhythm (Reisenzein et al. 2019). The intrinsic goal item is the appraisal of a stimulus as goal congruent or incongruent (the relevance of the stimulus to the user); it influences skin conductance and forehead and finger temperature (SKT) (Aue et al. 2007). Finally, perceived control and control/stress reflect the extent to which one feels in control of an event, since situations appraised as challenges lead to an increase in heart rate and a decrease in skin conductance levels (Denson et al. 2009). All the items presented to the users for the appraisal scale are shown in Figs. 5 and 6.

Fig. 5
figure 5

Ad-hoc appraisal scale

Fig. 6
figure 6

Continuation of the ad-hoc appraisal scale

3.5 Experimental procedure

This experiment was divided into two sessions on consecutive days to present the EMDB database in the two degrees of immersion (immersive and non-immersive), one setup per session. Thus, the EMDB database was split in half (25 film clips, 5 per category), and the order of the emotional categories was randomized for every session and participant. The film clip order within each category was also randomized; e.g., if the first selected category was "Horror", the user would be presented with five consecutive horror film clips in random order, and then the following category would be presented. Each session had an approximate duration of 1 h 30 min due to the time spent configuring the gelled electrodes of the EEG wearable device.

Participants were randomly assigned to the two sets resulting from the split of the database and to the immersive and non-immersive setups configured in the Virtual Desktop software.

In each session, participants were asked to rest in a chair as the physiological sensors were placed. Cardiac signals were acquired by placing three gelled electrodes on the participant’s chest in the V2 configuration of Einthoven’s triangle system. Two gelled electrodes were placed on the middle phalanxes of the second and third fingers of the participant’s non-dominant hand to measure EDA. The RESP signal was acquired using a piezoelectric sensor that measures displacement variations caused by the volume changes in the abdomen or thorax during a respiration cycle. The wearable chest belt was placed on the thorax of the user. SKT was acquired with a sensor placed on the distal phalanx of the \(4^{\textrm{th}}\) finger of the non-dominant hand to measure the peripheral body temperature. The fNIRS sensor was placed approximately near the AF7-AF9 region in the 10–20 EEG system to measure the activity of the prefrontal cortex of the brain (blue mark in Fig. 7). The EEG configuration map is shown in red circles in Fig. 7. First, the impedance of all 32 electrodes was checked using the g.NEEDaccess software (g.tec medical engineering, GmbH Austria). Then, the HTC Vive headset strap was placed over the EEG cap after all the electrodes had been placed without moving the EEG cap (Fig. 8). Finally, the EEG signals were visually inspected to assess their quality.

Fig. 7
figure 7

EEG 32-lead electrode configuration (red circles) and fNIRS location (blue circle) used in this experiment

Fig. 8
figure 8

An example of a participant wearing all the physiological sensors and the HTC Vive headset

Then, participants looked at a fixation cross at the centre of the screen for 3 min to record the baseline physiological signals. After that, a practice trial was performed to provide instructions on completing the subjective rating scales. Finally, the actual experiment started with the visualization of the emotional film clips. Participants answered the SAM and the ad-hoc appraisal scale after the end of each film clip, followed by an additional interval of 10 s before the next affective trial. Although the minimum spacing between film clips was 10 s, this spacing plus the time participants took to answer the SAM and appraisal scales was enough for the emotional response to fully dissipate before the following emotional stimulus was presented. This was performed for every film clip, regardless of the emotional category, with no pause between consecutive emotional categories. Figure 9 shows a schematic of this procedure.

Fig. 9
figure 9

Experimental procedure of this experiment for each session

3.6 Signals processing

3.6.1 EEG

The EEG signals were acquired at a sampling frequency of 250 Hz. For every participant, the impedance level of each electrode was checked using the g.NEEDaccess software to reduce the noise in the input signals.

The raw EEG signals were first cleaned using independent component analysis (ICA) to remove artifact components. Then, an \(8^{\textrm{th}}\) order Butterworth band-pass filter between 1 and 50 Hz was applied to each electrode to remove remaining noise artifacts. Finally, the power of the following EEG bands was computed for each electrode: alpha (8–14 Hz), beta (14–30 Hz), gamma (30–49 Hz) and theta (4–8 Hz).
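As an illustration of this per-electrode band-power computation, a minimal sketch is given below. It assumes the ICA-based artifact removal has already been performed (e.g., with a dedicated EEG toolbox) and uses Welch's periodogram as one possible power estimate; the helper function and the use of SciPy are illustrative choices, not the exact original implementation.

import numpy as np
from scipy.signal import butter, sosfiltfilt, welch

FS_EEG = 250  # EEG sampling rate (Hz)
BANDS = {"theta": (4, 8), "alpha": (8, 14), "beta": (14, 30), "gamma": (30, 49)}

def eeg_band_powers(channel, fs=FS_EEG):
    """Band power of one (already ICA-cleaned) EEG channel; hypothetical helper."""
    # 8th order Butterworth band-pass between 1 and 50 Hz
    sos = butter(8, [1, 50], btype="bandpass", fs=fs, output="sos")
    filtered = sosfiltfilt(sos, channel)
    # Welch periodogram, then integrate the PSD over each frequency band
    freqs, psd = welch(filtered, fs=fs, nperseg=2 * fs)
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = np.trapz(psd[mask], freqs[mask])
    return powers

# Example: 40 s of synthetic data standing in for one electrode of a film-clip segment
powers = eeg_band_powers(np.random.randn(40 * FS_EEG))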

3.6.2 ECG

The ECG signal (Fig. 10) was acquired to obtain information regarding the user’s heart rate (HR) and heart rate variability (HRV). The Pan and Tompkins algorithm was implemented to detect the R peak in the QRS complex of the ECG signal (Pan and Tompkins 1985). This algorithm starts by filtering the raw signal with a 5 to 15 Hz band-pass \(2^{\textrm{nd}}\) order Butterworth filter. It then computes the derivative of the filtered ECG signal, followed by squaring and integrating the signal with an 80 ms time window. Finally, the R peaks are detected using an adaptive threshold. After obtaining the R peaks, it is possible to compute the peak-to-peak interval series, denominated the RR-interval time series, which gives the participant’s heart rate (\(HR=60/\Delta RR\)). Then, from the RR-interval time series, we computed a total of 23 ultra-short-term HRV metrics comprising time-domain, frequency-domain, and non-linear features. According to the task force of the European Society of Cardiology and the North American Society of Pacing Electrophysiology (Task Force of the European Society of Cardiology 1996), a minimum time window of 2 to 5 min is necessary to obtain reliable HRV spectral components, as opposed to the 40-s time window used in this work. However, according to recent studies, it is possible to compute ultra-short-term HRV measurements in mobile settings (Salahuddin et al. 2007; Schaaff and Adam 2013). Salahuddin et al. (2007) found significant differences between baseline and a Stroop test using HRV measurements within a 50-s window. Similarly, Schaaff and Adam (2013) used a 30-s time window for frequency-domain features and a 15-s window for time-domain features to measure emotional arousal. Therefore, while they might not be reliable from a clinical standpoint, ultra-short-term HRV metrics may still contain valuable information, as studies have demonstrated their value in the context of emotion classification. All 23 features calculated for the ECG are enumerated in Table 1.
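A hedged sketch of this step is shown below, reproducing the R-peak detection and ultra-short-term HRV computation with the NeuroKit2 toolkit (the same library used here for the RESP and EDA signals); the Pan–Tompkins method selection and the simulated 40-s segment are illustrative assumptions rather than the exact original code.

import numpy as np
import neurokit2 as nk

FS = 1000  # biosignalsplux sampling rate (Hz)
ecg_raw = nk.ecg_simulate(duration=40, sampling_rate=FS)  # stand-in for one film-clip segment

# Pan–Tompkins style cleaning and R-peak detection
ecg_clean = nk.ecg_clean(ecg_raw, sampling_rate=FS, method="pantompkins1985")
peaks, info = nk.ecg_peaks(ecg_clean, sampling_rate=FS, method="pantompkins1985")

# RR-interval series (s) and instantaneous heart rate (bpm)
rr = np.diff(info["ECG_R_Peaks"]) / FS
hr = 60.0 / rr

# Ultra-short-term HRV indices (time-domain, frequency-domain, non-linear)
hrv_features = nk.hrv(peaks, sampling_rate=FS)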

3.6.3 RESP

The RESP signal (Fig. 10) was processed using the neurokit2 toolkit (Makowski et al. 2021) and analyzed in the time domain (including respiration rate variability metrics), in the frequency domain, and with non-linear features. A total of 23 features were computed (see Table 1).
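A minimal sketch of this processing with NeuroKit2 is shown below; the simulated signal is a stand-in for a 40-s film-clip segment, and the interval-related summary only approximates the richer feature set listed in Table 1.

import neurokit2 as nk

FS = 1000
resp_raw = nk.rsp_simulate(duration=40, sampling_rate=FS)  # stand-in for a 40-s film segment

# Clean the signal, detect breathing cycles and compute rate/amplitude traces
signals, info = nk.rsp_process(resp_raw, sampling_rate=FS)

# Interval-related summary (rate, amplitude and rate-variability indices)
resp_features = nk.rsp_intervalrelated(signals, sampling_rate=FS)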

3.6.4 fNIRS

The fNIRS (Fig. 10) works by taking advantage of the different optical properties of oxygenated haemoglobin (HbO) and deoxygenated haemoglobin (HbR), which have different absorption coefficients in the red and near-infrared spectrum (wavelengths of 650 and 900 nm). The data was filtered with a \(2^{\textrm{nd}}\) order 0.05–0.4 Hz band-pass Butterworth filter. The data was then processed using the modified Beer–Lambert law to convert the radiation intensity into changes in oxygen concentration (HbO and HbR), following the algorithms presented by Varandas et al. (2022). A total of 12 features were extracted (see Table 1).
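For reference, a common formulation of the modified Beer–Lambert law (added here for context, not taken from Varandas et al. (2022)) relates the change in optical density \(\Delta OD\) at each wavelength \(\lambda\) to the concentration changes, where \(\varepsilon\) denotes the extinction coefficients, d the source–detector distance, and DPF the differential pathlength factor; measuring at two wavelengths yields a 2 × 2 linear system that is solved for \(\Delta [HbO]\) and \(\Delta [HbR]\):

$$\begin{aligned} \Delta OD(\lambda ) = \left( \varepsilon _{HbO}(\lambda )\,\Delta [HbO] + \varepsilon _{HbR}(\lambda )\,\Delta [HbR]\right) \cdot d \cdot DPF(\lambda ) \end{aligned}$$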

Fig. 10
figure 10

Raw ECG, RESP and fNIRS signals acquired, respectively, on channel 1 in millivolts (mV), channel 3 in volts (V) and on the digital channel in micro-Amperes (\(\mu\)A). The green vertical line represents the onset time of the film, and the red line represents the offset time of the film

3.6.5 EDA

The EDA signal (Fig. 11) was acquired to obtain information regarding the skin conductance levels to measure participants’ arousal. First, the raw signal was filtered using a \(2^{\textrm{nd}}\) order Butterworth low-pass filter at 1 Hz. Then, using the neurokit2 toolkit (Makowski et al. 2021), the phasic component (skin conductance response (SCR)) and the tonic component (skin conductance level (SCL)) were computed. Frequency-domain features were also obtained after performing a 3-step down-sampling with a factor of 10. Then, the frequency components corresponding to the frequency bands of the HRV were computed. A total of 26 features were extracted (see Table 1).
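The fragment below sketches this decomposition and down-sampling with NeuroKit2 and SciPy, under the assumption that the three down-sampling steps each use a factor of 10 (1000 Hz to 1 Hz); the simulated signal and the variable names are illustrative only.

import neurokit2 as nk
from scipy.signal import butter, sosfiltfilt, decimate

FS = 1000
eda_raw = nk.eda_simulate(duration=40, sampling_rate=FS)  # stand-in for a 40-s film segment

# 2nd order Butterworth low-pass filter at 1 Hz
sos = butter(2, 1, btype="lowpass", fs=FS, output="sos")
eda_filtered = sosfiltfilt(sos, eda_raw)

# Tonic (SCL) and phasic (SCR) components
signals, info = nk.eda_process(eda_filtered, sampling_rate=FS)
scl, scr = signals["EDA_Tonic"], signals["EDA_Phasic"]

# Three-step down-sampling (assumed factor of 10 per step) before spectral analysis
eda_downsampled = eda_filtered
for _ in range(3):
    eda_downsampled = decimate(eda_downsampled, 10)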

Table 1 Physiological features extracted from the ECG, EDA, RESP, SKT, fNIRS and EEG. A total of 216 features were computed

3.6.6 SKT

The SKT signal (Fig. 11) was filtered using a \(2^{\textrm{nd}}\) order Butterworth low-pass filter at 0.1 Hz, and then 4 time-domain features were computed (see Table 1).

Fig. 11
figure 11

Raw EDA and SKT signals acquired, respectively, on channels 2 in micro-Siemens (\(\mu\)S) and 4 in Celsius degrees (\({}^{\circ }\)C). The green vertical line represents the onset time of the film, and the red line represents the offset time of the film

3.7 Machine learning

In this study, only supervised machine learning methods were applied; thus, the focus will be on these methods. Supervised learning is a technique in which the algorithm learns from previously labelled data to predict never-before-seen data. It allows the models to correctly classify the input data into the target label faster than unsupervised learning algorithms. Thus, we considered the following supervised classifiers: k-nearest neighbor (kNN) (Pedregosa et al. 2011), support vector machines (SVM) (Vapnik 1998), random forest (RF) (Donges 2018), and artificial neural networks (NN) (Jain et al. 1996). These classification models have been frequently used in previous studies on emotion recognition, as enumerated in Sect. 2.6; thus, it is possible to compare our results with those of other studies.

Machine learning methods are usually expensive to compute with a large set of features, and the data is associated with a high noise level. To overcome this problem, it is necessary to perform feature selection algorithms on the input data of the classification models to select a smaller subset of features that maximizes the relevance to the target output, minimizing redundancy and leading to a better learning performance of the classification models (Chen et al. 2020). Recursive feature elimination (RFE) is an example of a feature selection method in which the final set of features is selected by recursively considering smaller sets of features and eliminating the least informative feature in each iteration (Guyon et al. 2002).
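A minimal RFE sketch with scikit-learn is shown below, using a random-forest estimator to rank the features; the toy data and the number of selected features are placeholders.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Toy data standing in for the extracted feature matrix (placeholder)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The least important feature is removed at each iteration until the requested subset remains
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
               n_features_to_select=5, step=1).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)  # fitted on training data only, to avoid leakage
X_test_sel = selector.transform(X_test)    # the same feature mask is applied to test data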

Training machine learning models is a systematic process that makes the most of the available data to obtain the best-performing model. Before starting the training phase, it is common to normalize the input data. This normalization is performed using standardization (Shanker et al. 1996), which subtracts the statistical mean from each value and divides the result by the standard deviation. Equation 1 shows this normalization, where X is the original sample of values, \(\overline{X}\) is the statistical mean of the values, STD(X) is the standard deviation of X, and \(X'\) is the standardized new sample of values.

$$\begin{aligned} X' = \frac{X-\overline{X}}{STD(X)} \end{aligned}$$
(1)

Splitting the original dataset into training and testing sets is necessary to start the training phase. This split is fundamental to obtaining an accurate estimate of the model’s performance on new data it has never seen before. The most common approach to splitting the data is cross-validation, using methods such as K-fold and leave-one-out cross-validation (Refaeilzadeh et al. 2016; Wong 2015).

The model’s performance was assessed using metrics commonly applied to machine learning classification algorithms: accuracy, precision, recall, and F1-score (Eqs. 2–5), where TP is true positive, TN is true negative, FP is false positive, and FN is false negative (Mohammad and Nasir 2015).

$$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
(2)
$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(3)
$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(4)
$$\begin{aligned} F1\text {-}score = 2 \times \frac{Precision \times Recall}{Precision+Recall} \end{aligned}$$
(5)

The area under the receiver operating characteristic curve (AUC-ROC) is also a performance metric for classification algorithms. It measures how much a model can distinguish between different classes. Thus, the higher the value of AUC, the better the model is at distinguishing between different classes (Hanley and McNeil 1982).

4 Results

In this experiment, we presented the emotional film clips in two different degrees of immersion (immersive vs non-immersive) to compare the effect of the immersion of the film clips shown on the HMD on the subjective ratings of the SAM scale, i.e., to test whether it had any effect on the reported levels of arousal, valence and dominance, using non-parametric statistical tests. Then, we developed classification models to predict the emotional category of the EMDB film clips from the physiological signal features and the self-reported ratings.

4.1 Immersion

Regarding the two degrees of immersion (immersive and non-immersive), we obtained the data from the self-reported levels of arousal, valence and dominance and tested whether there were differences in these reported levels between the two degrees of immersion.

Since the data in the SAM scale is obtained through a 9-point Likert scale, the data is ordinal. Thus, non-parametric statistical tests need to be used. As mentioned in Sect. 3.5, the EMDB database was split in half. This resulted in two groups: Group A—the first five film clips of each category, and Group B—the remaining five film clips of each category. Then, participants were randomly assigned to view the film clips from Group A in one degree of immersion and clips from Group B in the other degree of immersion, and vice-versa. Thus, some participants viewed the film clips from Group A in the first session in the immersive setup and the film clips from Group B in the second session in the non-immersive setup. The remaining participants viewed the film clips from Group A in the non-immersive setup and film clips from group B in the immersive setup. Therefore, for each group, the data regarding arousal, valence and dominance was evaluated using the Mann–Whitney U statistical test to compare the respective mean ranks.

The results from the Mann–Whitney U test for the median of both groups are reported in Table 2. From these results, we only found a significant difference regarding "Dominance" in the category "Horror" for Groups A and B (U = 8.00; p-value = 0.043; r = 0.55; power = 0.15) and in the category "social negative" for Group B (U = 8.00; p-value = 0.043; r = 0.55; power = 0.15). No significant difference was found in the levels of arousal and valence in either group.

Table 2 Results from the Mann–Whitney U test to compare the effect of the degrees of immersion in terms of arousal, valence and dominance, for the median of both groups

4.2 Classification

Classification models were trained to classify the subjective ratings from the SAM (arousal, valence and dominance) and the ad-hoc appraisal scale (anticipatory effort, perceived obstacles, novelty, unexpectedness, intrinsic goal, perceived control, and control/stress), as well as the features extracted from the physiological signals, into the five target emotional categories of the EMDB (erotic, horror, social positive, social negative and scenery), according to the corresponding labels of the EMDB database. Since no significant difference was found for the levels of arousal and valence in either group in the previous section (Sect. 4.1), all the film clips were used as input data to the classification models, regardless of the degree of immersion in which they were presented to the participants. The results are reported from a user-independent and a user-dependent point-of-view.

4.2.1 Subjective ratings

From a user-independent point-of-view, the following classifiers were evaluated to determine which had the best performance: k-nearest neighbor using the five closest neighbors (k = 5), SVM (C = 1, a radial basis function (RBF) kernel, and a random seed of 42), a random forest classifier with 100 decision trees (Gini criterion and a random seed of 42), and a multi-layer perceptron classifier with a random seed of 42.
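For reference, these configurations map to the following scikit-learn estimators (our reading of the reported hyperparameters; the probability=True flag is an added assumption, included only so that class probabilities are available for the ROC-AUC later):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(C=1, kernel="rbf", random_state=42, probability=True),
    "RF": RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42),
    "MLP": MLPClassifier(random_state=42),
}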

The data was split into training and testing sets using the leave-one-subject-out (LOSO) cross-validation method from scikit-learn (Pedregosa et al. 2011). Thus, the samples from one user are used as the testing set, and this process is repeated for every available user, resulting in 14 different subsets of data. The final accuracy score is the average accuracy across all users.

Then, a recursive feature elimination (RFE) algorithm was applied as a feature selection method to reduce the number of input features provided to the classification models. This feature selection was applied only to the training set to avoid data leakage and an overestimation of the model performance, using an RF classifier as the estimator of feature importance, removing the least important feature on each iteration and mitigating wrapper bias (Samala et al. 2020). The final number of input features used to evaluate the model’s performance was reduced to half (5 features).
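A sketch of this leave-one-subject-out evaluation, with RFE fitted only on the training folds, could look as follows; the toy arrays stand in for the real feature matrix, labels and participant IDs, and a random forest is used throughout for brevity (the same loop applies to the other classifiers).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(14 * 50, 10))       # toy stand-in for the subjective-rating features
y = rng.integers(0, 5, size=14 * 50)     # five emotional categories
groups = np.repeat(np.arange(14), 50)    # participant ID for each sample

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # Feature selection fitted on the training folds only (no leakage)
    selector = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
                   n_features_to_select=X.shape[1] // 2).fit(X[train_idx], y[train_idx])
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(selector.transform(X[train_idx]), y[train_idx])
    scores.append(accuracy_score(y[test_idx], clf.predict(selector.transform(X[test_idx]))))

print(f"LOSO accuracy: {np.mean(scores):.3f} ± {np.std(scores):.3f}")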

The results obtained for the user-independent classifiers are shown in Table 3. The metrics used are accuracy, precision, F1-score and ROC-AUC with weighted averaging. The ROC-AUC was computed using a one-vs-rest configuration, which computes the AUC of each class against the rest.
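The weighted averaging and the one-vs-rest ROC-AUC mentioned above correspond to the following scikit-learn calls; the toy predictions are included only to make the snippet self-contained.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, f1_score, roc_auc_score

# Toy ground truth and class probabilities for a 5-class problem (placeholders)
y_true = np.array([0, 1, 2, 3, 4] * 10)
rng = np.random.default_rng(0)
y_proba = rng.dirichlet(np.ones(5), size=len(y_true))  # rows sum to 1
y_pred = y_proba.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
f1 = f1_score(y_true, y_pred, average="weighted")
auc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="weighted")  # one-vs-rest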

All the results are in percentage. The classifier with the best performance in classifying the emotional categories using the subjective ratings is the SVM, with an average accuracy of 54.0 ± 17.2%.

Table 3 Results for the user-independent classification of the subjective ratings

From a user-dependent point-of-view, the classifier with the best performance was the RF classifier with 100 decision trees (Gini criterion, random seed of 42). This classifier was used for each individual to classify the emotional categories.

The data of each user was split into training and testing sets using the stratified K-fold method from scikit-learn (Pedregosa et al. 2011) with ten folds (k = 10), preserving the percentage of samples of each target class in each fold. This results in 10 different subsets of data for that specific user. For each user, the model’s accuracy is the average accuracy across the 10 folds.
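A per-participant variant of the previous evaluation, using the stratified 10-fold split described above, might look like this (placeholder arrays for a single user's data; the feature selection step, described next, is omitted for brevity):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X_user = rng.normal(size=(50, 10))   # toy features for one participant (50 film clips)
y_user = np.tile(np.arange(5), 10)   # five emotional categories, balanced

user_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X_user, y_user):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_user[train_idx], y_user[train_idx])
    user_scores.append(accuracy_score(y_user[test_idx], clf.predict(X_user[test_idx])))

print(f"User accuracy: {np.mean(user_scores):.3f}")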

Then, the RFE algorithm was also applied to the training set as a feature selection method, using the same classifier (random forest) as an estimator, mitigating wrapper bias (Samala et al. 2020). This process entails 140 subsets of data for our sample (14 x 10), with different features being selected as the most important for each user. Table 4 shows the results obtained for each user.

The average accuracy of all the users was \(69.3 \pm 11.4 \%\), with a minimum and maximum accuracy of \(56\%\) and \(90\%\), respectively.

Table 4 Results for the user-dependent classification of the subjective ratings

4.2.2 Physiological signals

Regarding the physiological signals, 216 features were extracted from the following sensors: EEG, ECG, EDA, RESP, fNIRS and SKT.

From a user-independent point-of-view, the same procedure performed in Sect. 4.2.1 was applied to these features, but using a random seed of 0, instead of a random seed of 42. After completing the RFE feature selection algorithm, 108 features were used as input features for the classification models.

The results are shown in Table 5. The classifier with the best performance in classifying the emotional categories using the physiological features is the RF classifier, with an average accuracy of 24.9 ± 4.0%.

Table 5 Results for the user-independent classification of the physiological signals

From a user-dependent point-of-view, the classifier with the best performance was also the RF classifier with 100 decision trees (Gini criterion, random seed of 0). This classifier was used for each individual to classify the emotional categories.

The data was split into training and testing sets using the same procedure as in Sect. 4.2.1. Then, the RFE algorithm was also applied as a feature selection method to reduce the number of input features to half (108 input features). The results obtained for each user are shown in Table 6.

The average accuracy of all the users was \(71.0 \pm 8.0 \%\), with a minimum and maximum accuracy of \(58\%\) and \(86\%\), respectively.

Table 6 Results for the user-dependent classification of the physiological signals

5 Discussion

The main goal of this work was to study and validate a machine learning pipeline (data collection, data pre-processing, feature extraction and feature selection) to predict emotional states from a combination of physiological signals while participants were subjected to validated emotional stimuli. This pipeline will later be transferred to a system that predicts emotional states in real time. Thus, selecting the appropriate emotional stimuli database and the presentation format of these stimuli was necessary to elicit an emotional response strong enough to be measurable through physiological signals.

Regarding the degree of immersion of the emotional stimuli, we only found a significant difference for "Dominance" in the categories "horror" and "social negative", with participants reporting lower levels of control (being controlled by the stimulus) in the immersive setup compared to the non-immersive one. However, previous studies have shown that dominance does not provide consistent results across studies and that arousal and valence are considered the most important dimensions for measuring basic emotions (Betella and Verschure 2016). Thus, we assumed that the degree of immersion (immersive vs non-immersive) had no relevant effect, and all the film clips were used as input data to the classification models.

In this study, we compared the performance of user-independent and user-dependent classification models for the subjective ratings and the physiological signals in classifying the five emotional classes of the EMDB database. Most previous work in emotion recognition states that emotions are highly subjective, so generalized models for the classification of emotions remain a challenge. Thus, classification models should be developed as user-dependent models rather than generalized models for all subjects (Varandas et al. 2022).

Therefore, in this work, we did not focus on the utility of individual features and models. Since our procedure will be applied in a real-time experiment, the participant’s emotional state needs to be monitored continuously. Thus, we had to consider computing ultra-short frequency features for the HRV, EDA and RESP signals. Although there are no clear guidelines on which features are comparable to, and as reliable as, those from the standard 5-min recordings (Pecchia et al. 2018), several studies apply ultra-short features.

For instance, regarding HRV spectral features, Salahuddin et al. (2007) showed that it was possible to monitor mental stress with wearable sensors using a 30-s window. Similarly, Schaaff and Adam (2013) also computed HRV frequency features using a 30-s window to classify emotional arousal. Bernardes et al. (2022) also found that the LF and HF bands computed in a 40-s window had a correlation higher than 50% with those from a 2-min window, indicating a moderate to strong positive correlation.

Regarding the frequency analysis performed for EDA, we based our approach on the one proposed by Posada-Quintero et al. (2016) to assess the dynamics of the autonomic nervous system based on the spectral analysis and frequency bands of HRV. Although the skin conductance level (SCL) is a slow-changing signal and may not be captured within the 40-s window, the skin conductance response (SCR), which occurs as the sympathetic response about 1–5 s after the presentation of a stimulus (Posada-Quintero and Chon 2020), should be measurable. Indeed, Posada-Quintero et al. found an increase in the spectral power of the EDA signals between 0.08 and 0.24 Hz for several stressors. Therefore, we believe that the spectral analysis of EDA signals is an indicator of the sympathetic-parasympathetic balance (Posada-Quintero and Chon 2020), so we included these features in our pipeline.

Acknowledging the challenges of using these ultra-short features, we analyzed the number of occurrences of the spectral features selected for the classification models. These results are shown in Tables 7 and 8. From these tables, it is possible to conclude that the VLF power band for HRV and EDA was never selected, for either the user-dependent or the user-independent pipelines. Additionally, regarding the LF and HF bands, the EDA features had a high selection rate, evidencing that features in this frequency range for EDA are reliable and associated with the activation of the sympathetic tone. For HRV, these features had a lower selection rate than the EDA frequency features. Both these results are similar to those obtained by Finseth et al. (2023) and suggest that our feature selection process only selected reliable features.

Table 7 Number of frequency feature occurrences for the user-independent models
Table 8 Number of frequency feature occurrences for the user-dependent models

Regarding related work using film clips for emotion recognition, the classification accuracy ranges from 71.0 to 85.3% using user-dependent models (Jerritta et al. 2011). Thus, these accuracy values were the reference values to compare the performance of our models.

In Sect. 4.2.1, the subjective ratings from the SAM and the ad-hoc Appraisal scale were used to predict the emotional categories of the EMDB. Using user-independent classification models (Table 3), we obtained an accuracy of 54.0 ± 17.2% using an SVM classifier. On the other hand, using a user-dependent classifier for each participant (Table 4), the average accuracy was 69.3 ± 11.4%, ranging from 56% to 90% accuracy using a Random Forest classifier. The average ROC-AUC obtained was 91.1 ± 5.2%, which indicates that our models have a high ability to distinguish between classes.

In Sect. 4.2.2, the features extracted from the physiological signals were used with the same goal: to predict the emotional category of the EMDB database. Again, the accuracy was lower for user-independent classification than for user-dependent classification. The accuracy of the user-independent classification was 24.9 ± 4.0% (Table 5) for a random forest classifier. On the other hand, the average accuracy for user-dependent classification was 71.0 ± 8.0%, ranging from 58 to 86% (Table 6). The average ROC-AUC was 91.0 ± 3.9%, which indicates that our models have a high ability to distinguish between classes.

Compared to previous works that used film databases in non-immersive environments, our user-independent models achieved lower accuracy scores, while the user-dependent results are similar. This points to the hypothesis that immersive environments lead to higher variability of emotional responses, making generalization across subjects harder.

Regarding the number of classified emotions, it is also clear that the higher the number of target classes, the lower the accuracy scores. Namely, the work of Li and Chen (2006) achieved an accuracy score of 85.3% classifying 2 emotions, while Nasoz et al. (2004) achieved an accuracy score of 71% for 6 target emotions. Thus, our work, which classifies 5 different target classes, achieved results similar to those in the literature that classify a higher number of emotions using user-dependent models.

On the other hand, we see that different stimuli lead to different results in emotion detection. For example, previous works that used the IAPS, a database composed of pictures, achieved results as high as 92% in user-dependent scenarios, which is higher than our work and all works that employed film-based databases. Despite these higher results, since pictures are less realistic than films, their use may be less adequate for real-world applications.

The results for subjective ratings and physiological signals support the hypothesis that the recognition of emotions is subjective for each individual. The accuracy scores for subjective ratings and physiological signals were lower for user-independent classification models compared to user-dependent ones, as expected from Sect. 2.6. Specifically, the ROC-AUC for user-independent models using physiological signals showed that the model could not distinguish and separate the data into the different emotional categories, while user-dependent models could clearly distinguish the data between the different emotional classes. Thus, individuals react differently to the same emotional stimuli.

Moreover, it is possible to conclude that the subjective rating responses generalize better across participants, since the difference in accuracy between user-independent models (54.0 ± 17.2%) and user-dependent models (69.3 ± 11.4%) is considerably smaller than the corresponding difference for the physiological signals. This could be explained by the fact that participants were aware of the emotional content presented to them. Thus, the responses to the subjective scales for each emotional film clip are similar across subjects, while the physiological response to these stimuli is distinct due to the way each participant reacts to the emotional stimuli presented.

6 Limitations and future work

Some limitations can be pointed out in this work. Firstly, this work is a first exploratory step towards developing a machine learning pipeline using physiological data to be applied in subsequent real-time systems to predict emotional content. Hence, this work aimed to validate the proposed pipeline, centering our approach on model creation and quantification through cross-validation. Given that our research intends to develop systems for real-time processing, we could not disregard the potential of ultra-short features. The choice of the 40-s window relied on the length of the EMDB film clips, which limited the maximum window length for data acquisition. Opting for a larger window to obtain these features would mean incorporating information about the transitions between emotional stimuli into the classification model.

Acknowledging that some of these ultra-short features might be unreliable in such a short window, we used recursive feature elimination to select the relevant features, expecting that only reliable ultra-short frequency features would be retained by the models. Nevertheless, to understand to what extent these features were used in the models, we report the number of occurrences of each frequency feature feeding the models (see Tables 7 and 8). In this analysis, we showed that the VLF features were never selected for any model and identified the features that were considered relevant to distinguishing the emotional categories.

In future work, it is necessary to develop models that adapt the virtual environment in real time according to the user’s emotional state. One of the main limitations of the real-time system is the inability to retrain the model according to changes in the physiological signals over time and to repeated exposure to emotional stimuli (Finseth et al. 2023). Thus, this limitation needs to be addressed in future studies. Moreover, these models were trained on healthy individuals, who may not represent individuals with health disorders. However, since the outcome of this work points us towards user-dependent models, a new model will be developed for each user, regardless of whether they are healthy or not. This implies that the most informative sensors may not generalize to clinical populations. Thus, in future work, we will validate whether these sensors and this methodology apply to individuals with health disorders, improving ecological validity in out-of-the-lab applications for healthy and clinical populations.

7 Conclusion

This work aimed to develop machine learning classification models to detect and recognize emotional states elicited by emotional videos, using a combination of physiological sensors in wearable devices.

The proposed methodology effectively elicited the target emotional states by presenting emotional film clips from the EMDB dataset in an immersive and non-immersive virtual environment.

The results showed that the degree of immersion of the emotional stimuli was not relevant, since there were no significant differences in the reported levels of arousal and valence for any emotional category. In terms of the classification models, it was clear that the emotional response is highly subjective and depends on each user. Thus, we obtained higher accuracy using user-dependent models compared to user-independent models. The accuracies for the user-dependent models were 69.3 ± 11.4% and 71.0 ± 8.0% using subjective ratings and physiological signals, respectively.