1 Introduction

Many psychological dimensions, such as the user's emotions or sense of presence [1, 2], have been considered in the evaluation of Virtual Reality (VR) systems. Many studies have shown that the immersion associated with VR is an essential feature in generating both a sense of presence and emotional states [3,4,5,6,7]. Typically, emotion-based studies engage one or two human senses [8, 9], most commonly vision and hearing [10]. As Gall, Roth, Stauffert, et al. [11] stated: "multisensory feedback provides means to manipulate the strength of illusion". The introduction of multisensory features improves users' affective responses [12,13,14], sense of presence, and realism [15,16,17], enhancing the fidelity of the virtual world [10], creating a sensorially richer virtual world [10, 18, 19] and an increased emotional connection with it [20], in contrast with unisensory interfaces [12]. All this is possible thanks to specific hardware that helps to isolate the user from the real world, such as head-mounted displays (HMDs) [16], haptic [21], olfactory [22], and gustatory interfaces [23,24,25], and devices that simulate wind or warmth [13, 26], among others. Unisensory interfaces, in turn, are more likely to produce experiences that feel hollow, fictitious, and unrealistic [27].

Despite these positive aspects, studies examining the ability of immersive VR to induce emotions are still scarce [4, 10, 19, 28, 29], due to the high complexity of developing such systems [30, 31] and, consequently, the difficulty of evaluating the user experience [32]. The combination of different sensory channels implies that they can no longer be seen as separate [33]. Instead, they interplay and affect each other, which hinders research in the multisensory VR context [28]. These obstacles become even more evident when a larger number of senses is explored [9, 31]. Although studies involving just two stimuli (typically visual and auditory) are easier to implement and very common in the literature, multisensory virtual environments are an increasingly popular topic and transversal to numerous areas (games, therapy, treatment, training, education, and tourism, to name a few) [7, 10, 18, 28, 29]. Moreover, multisensory stimulation in VR has been pointed out as more likely to increase immersion and, consequently, presence [2]. Compared to less immersive environments, immersive environments present greater potential to elicit different emotional states in users [34,35,36] and greater emotional arousal [2].

This paper explicitly explores studies that address, either as a primary or secondary objective, users' emotional responses in multisensory immersive VR, particularly those that engage at least three human senses, typically composed of visual and auditory cues complemented by other(s). This criterion excludes studies relying on audiovisual stimulation alone, which are more likely to restrict the results related to the investigation of users' emotional responses in VR. Moreover, one of the most important properties of the emotional aspects of VR is the "multisensory synergetic stimuli", which can be achieved by the use of "a great variety of elements, colors, sounds, objects and smells" [37]. Based on this, we consider that studies with more than two stimuli provide more accurate results in emotional experiments. To the best of our knowledge, this investigation represents a novelty, as no other study reviews the relationship between the input stimuli and the obtained emotions while also considering the research area and the methodology used, as previously discussed. This is therefore one of the main research goals we intend to achieve, vital to understanding how these factors can affect the qualitative and effective design of multisensory VR experiences, as mentioned by Kruijff, Marquardt, Trepkowski, et al. [10]. The obtained results will serve as guidelines for designing richer multisensory virtual experiences. They also contribute relevant topics for a research agenda, which should be taken into account by investigators in this field to sustain the evolution of studies involving users' psychological dimensions in VR.

1.1 The assessment of human emotional responses

Emotions are complex to understand and explore due to their abstract and subjective nature [38], making their measurement challenging [39]. Although there is still no consensus in the literature on an exact definition of "emotion", Cabanac [40] defines it as "any mental experience with high intensity and high hedonic content (pleasure and displeasure)". Emotions are fundamental to understanding the human mind and its progress and evolution. As they are not voluntary expressions [38], they can provide relevant information about one's unconscious state [10, 39]. Involuntary changes in the Autonomic Nervous System regulate one's physiological processes, so measurements of physiological responses can act as indicators of emotional variation [39, 41]. Such measurements can be obtained by monitoring the heart rate (HR) or heart rate variability (HRV) and the respiration rate (RR), via electrodermal activity (EDA), also known as galvanic skin response (GSR), which measures the skin conductance level (SCL), or by examining the finger pulse volume (FPV) or blood pressure (BP), for example. Emotional variation can also be expressed through behavior: verbally, through facial expressions, and through body movements [39], to name a few.
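To make these indicators more concrete, the sketch below shows, purely as an illustration and not as a procedure taken from the reviewed studies, how two common time-domain HRV indices and a tonic skin conductance level could be derived from hypothetical RR-interval and EDA recordings:

```python
import numpy as np

def hrv_features(rr_intervals_ms: np.ndarray) -> dict:
    """Common time-domain HRV indices computed from successive RR intervals (ms)."""
    diffs = np.diff(rr_intervals_ms)
    return {
        "mean_hr_bpm": 60_000.0 / rr_intervals_ms.mean(),  # average heart rate
        "sdnn_ms": rr_intervals_ms.std(ddof=1),            # overall variability
        "rmssd_ms": float(np.sqrt(np.mean(diffs ** 2))),   # beat-to-beat variability
    }

def mean_scl(eda_microsiemens: np.ndarray) -> float:
    """Tonic skin conductance level approximated as the mean of the EDA signal."""
    return float(eda_microsiemens.mean())

# Hypothetical recordings, for illustration only.
rr = np.array([812.0, 790.0, 845.0, 830.0, 801.0, 818.0])  # ms between heartbeats
eda = np.array([2.1, 2.3, 2.2, 2.6, 2.8, 2.7])             # microsiemens
print(hrv_features(rr), mean_scl(eda))
```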

Two general models are used to study emotions: the categorical approach and the dimensional approach. In the first, emotions are represented as discrete labels [42, 43]. In the second, the concepts of valence (pleasure vs. displeasure) and arousal (high vs. low activation) determine the individual's emotional experience, which is represented in a two-dimensional model – the circumplex model of affect. Each emotion can thus be defined as a combination of these two dimensions, i.e., by the position it occupies in the valence-arousal model [44]. More informally and broadly, emotions can also be divided into positive, such as happiness, and negative, such as anxiety, fear, sadness, and disgust [41].
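As a minimal illustration of the dimensional view, the sketch below places a few emotion labels in the valence-arousal plane and reports their circumplex quadrant; the coordinates are hypothetical and only serve to show how an emotion can be encoded as a (valence, arousal) pair:

```python
# Hypothetical (valence, arousal) coordinates on a -1..1 scale, for illustration only.
CIRCUMPLEX = {
    "happiness": (0.8, 0.5),    # pleasant, moderately activated
    "fear":      (-0.7, 0.8),   # unpleasant, highly activated
    "sadness":   (-0.6, -0.4),  # unpleasant, deactivated
    "calm":      (0.5, -0.6),   # pleasant, deactivated
}

def quadrant(valence: float, arousal: float) -> str:
    """Return the circumplex quadrant of a (valence, arousal) pair."""
    v = "positive" if valence >= 0 else "negative"
    a = "high" if arousal >= 0 else "low"
    return f"{v} valence / {a} arousal"

for emotion, (v, a) in CIRCUMPLEX.items():
    print(f"{emotion:9s} -> {quadrant(v, a)}")
```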

Several performance tasks and scenarios in VR can provide insights into the relationship between emotions and cognitive-behavioral responses. Among them, we highlight stress-inducing tasks, cognitive load tasks, attention and vigilance tasks, motor skills and coordination tasks, decision-making and risk assessment tasks, and social interaction tasks. By combining such performance tasks with subjective and objective measures, researchers can comprehensively understand how emotions and stress impact cognitive and behavioral responses in VR [45].

1.1.1 Objective and subjective methods and instruments

Emotions can be studied at the conscious level, with subjective methods and instruments [29], or at the unconscious level, resorting to objective methods and instruments. A person's emotional state can also be recognized by combining these two methods – a mixed approach [39].

Objective measures must be used if the aim is to assess the most unconscious level of the individual, which can be done by resorting to psychophysiological measures such as EDA, HR, or electroencephalography (EEG) [10], as previously discussed. They allow the collection of physiological and biometric data, which provide essential information about how a person feels, even when the person is not conscious of it [39]. However, they also present some drawbacks, mainly their obtrusiveness and susceptibility to noise [46].

In turn, subjective measures should be used if the objective is to evaluate the emotional experience from the individual's subjective point of view. These include established user scales, e.g., the "Visual Analogue Scale" (VAS) or the "Self-Assessment Manikin" (SAM), interviews, thinking aloud, and questionnaires (e.g., the "Check-All-That-Apply" (CATA) procedure, the "Positive and Negative Affect Schedule" (PANAS), or the "State-Trait Anxiety Inventory" (STAI)). Such measures can be pictorial, such as SAM, created by Bradley and Lang [47], which presents representative drawings of the human figure for respondents to express their emotions. Regarding interviews, structured, semi-structured, or unstructured formats might be used: structured interviews follow a predetermined set of questions or a standardized interview protocol; semi-structured interviews combine a protocol with the flexibility for the interviewer to adapt the conversation according to the participants' responses; and unstructured interviews have no predetermined set of questions. Focus groups, also a type of interview, are another method of obtaining subjective reports. They involve a group of participants (typically 6 to 12) who engage in a facilitated discussion led by a moderator. Participants are invited to share their opinions, experiences, and perspectives regarding a specific topic, allowing the interaction and exchange of ideas [48]. For understanding human emotions in VR, a phenomenological interview can be beneficial. It is frequently semi-structured or unstructured, as the emphasis is the first-person perspective of a specific experience, considering the individual's subjective meanings, emotions, bodily sensations, and intentions related to the phenomenon under investigation [49].

Questionnaires can be presented as a list of a limited number of adjectives, requiring respondents to express how they feel at that moment on a scale. Some examples are the PANAS, composed of twenty adjectives (ten positive and ten negative) rated on a 0-5 scale [50]; the STAI, which focuses on the respondents' level of anxiety, expressed on a 0-3 scale [51]; the VAS, initially created in 1921 by Hayes and Patterson to assess patients' pain in clinical research [52] and nowadays used more broadly to indicate the level of agreement on a continuous scale; and CATA questionnaires, which are widely used to investigate users' perceptions of a variety of attributes [53]. Given the low complexity of their application [54], subjective measures are frequently used, despite the number of biases [55] and measurement issues [56], for instance, the interference of individual and socio-demographic aspects. As Prescott [57] pointed out, age, education, culture, socio-economic status, and personality can bias the results.
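As an illustration of how such self-report instruments are typically scored, the sketch below sums hypothetical PANAS item responses into separate positive-affect and negative-affect scores and rescales a VAS mark to a 0-1 agreement level; the item values are invented, and a 100 mm VAS line is assumed:

```python
# Hypothetical PANAS responses (ten positive and ten negative adjectives).
positive_items = [3, 4, 2, 3, 5, 4, 3, 2, 4, 3]
negative_items = [1, 0, 2, 1, 0, 1, 2, 0, 1, 1]

panas_positive = sum(positive_items)   # overall positive affect score
panas_negative = sum(negative_items)   # overall negative affect score

# A VAS mark measured in millimetres on an assumed 100 mm line, rescaled to 0-1.
vas_mark_mm = 73.0
vas_agreement = vas_mark_mm / 100.0

print(f"PA={panas_positive}, NA={panas_negative}, VAS={vas_agreement:.2f}")
```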

2 Methodology

This systematic literature review was inspired by the "Preferred Reporting Items for Systematic Reviews and Meta-Analyses" (PRISMA) [58] to minimize bias and maximize its contribution to science.

2.1 Research objectives

We propose the following research objectives:

  1. To survey the current methods and instruments found in the literature related to the study of users' emotions during multisensory VR experiences;

  2. To explore the reported limitations regarding the study of users' emotional responses in multisensory VR experiences;

  3. To assess the impact of the stimuli on emotional responses, to help understand the relationship between a specific stimulus and the obtained emotional response(s);

  4. To investigate whether there is an association between the used method and the research area;

  5. To identify the main gaps and challenges in this context.

2.2 Research questions

To meet the above-mentioned research objectives, we propose the following research questions (RQ):

  • RQ 1. What are the most common methods and instruments used to measure users' emotional responses during multisensory VR experiences?

  • RQ 2. What are the main limitations regarding the study of users' emotional responses in multisensory VR experiences reported by the author(s), associated with:

    • RQ 2.1 the used VR equipment?

    • RQ 2.2 the instruments for emotional responses collection?

    • RQ 2.3 the experimental design?

  • RQ 3. Is there any evidence of an association between the stimulus provided to the users and their emotional response(s)?

  • RQ 4. Is there any evidence of an association between the used method and the research area?

  • RQ 5. What gaps and challenges remain in the literature for future research?

2.3 Search strategy

We extensively searched four electronic data sources (Web of Science, Scopus, ACM Digital Library, and IEEE Xplore). To develop this Systematic Literature Review (SLR), a selection of data sources had to be made, as it would be impracticable to search all existing ones. We chose three of them (Web of Science, Scopus, and ACM Digital Library) from among the principal search systems meeting the necessary performance requirements to conduct systematic reviews, according to Gusenbauer and Haddaway [59]. Although not on this list, IEEE Xplore was also considered, as it provides high-quality technical literature in engineering and technology, this paper's general field of research; according to its website [60, 61], it claims to offer the "highest quality technical literature in electrical engineering, computer science, electronics, and related disciplines".

To summarize, two multidisciplinary databases (Web of Science and Scopus) and two computer science specialized databases (ACM Digital Library and IEEE Xplore) were considered to ensure relevant results. This combination allows a wide variety of papers pertinent to this paper's subject to be obtained. All four are comprehensive data sources, comprising peer-reviewed journal articles and conference proceedings, ensuring the quality of the source materials.

We considered all the available results (with the inclusion criteria applied), disregarding the year of publication and the subject area. To restrict the results, two filters were applied: (1) language (papers written in Portuguese or English), and (2) document type (conference proceedings paper or journal paper).

The primary search terms consisted of "virtual reality", "emotions", and "stimulus", whether in the title, abstract, or keywords. However, due to inconsistent terminology, some synonyms or similar terms were used, in some cases with wildcards, as follows:

  • To designate "virtual reality", we used the following terms: "virtual reality", "VR", "virtual environment*", "immersive environment*", "immersive technolog*" and "virtual scenario*";

  • For "emotions", we used only: "emotion*";

  • Finally, for "stimulus", we considered the terms "*sensory OR stimul*".

We also used Booleans ("AND" and "OR") in our search, resulting in the following query: ("virtual reality" OR VR OR "virtual environment*" OR "immersive environment*" OR "immersive technolog*" OR "virtual scenario*") AND emotion* AND (*sensory OR stimul*).
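The query can also be assembled programmatically from the synonym lists above, which helps keep it consistent when adapting it to each database's advanced-search syntax; the sketch below is only an illustration of how the Boolean structure is built:

```python
# Synonym lists taken from the search strategy above.
vr_terms = ['"virtual reality"', 'VR', '"virtual environment*"',
            '"immersive environment*"', '"immersive technolog*"',
            '"virtual scenario*"']
emotion_terms = ['emotion*']
stimulus_terms = ['*sensory', 'stimul*']

def or_group(terms):
    """Join synonyms into a parenthesised OR group."""
    return "(" + " OR ".join(terms) + ")"

query = " AND ".join([or_group(vr_terms), " OR ".join(emotion_terms),
                      or_group(stimulus_terms)])
print(query)
# ("virtual reality" OR VR OR ...) AND emotion* AND (*sensory OR stimul*)
```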

We performed the last search on all databases on January 24th, 2023.

2.4 Inclusion criteria

The inclusion criteria were as follows:

  1. The paper is published in one of the mentioned data sources;

  2. The paper is written in English or Portuguese (the authors' native language);

  3. The publication type is a research article or proceedings paper, published in a refereed journal or conference;

  4. The search terms include, whether in the title, abstract, or keywords, the following query: ("virtual reality" OR VR OR "virtual environment*" OR "immersive environment*" OR "immersive technolog*" OR "virtual scenario*") AND emotion* AND (*sensory OR stimul*).

2.5 Exclusion criteria

The exclusion criteria taken into consideration were:

  1. The paper's full text is not available;

  2. The paper's methodology does not include the assessment of users' emotional responses in virtual reality environments, whether through subjective or objective measures or a mixed approach;

  3. The paper's methodology does not focus on a fully-immersive system. The criterion adopted was based on Costello [60], who proposes a classification of VR systems according to the level of immersion provided: non-immersive (desktop-based VR), semi-immersive (projection systems), or fully-immersive (HMD systems). For the author, VR systems are more immersive the less the user can perceive (see, hear, touch) of the outside world [60], which is achieved nowadays by using high-resolution 360-degree HMDs [62]. Furthermore, as previously discussed, fully immersive systems have greater potential to elicit users' different emotional states [2, 34,35,36] compared to non-immersive ones;

  4. The paper does not consider a multisensory virtual environment (the addition of at least one more sense to the base pair of vision and audio).

2.5.1 Criteria used to identify the stimuli provided

User isolation from the real world is achieved with the help of specialized hardware, such as HMDs [16], haptic [21], smell [22], and gustatory interfaces [23,24,25]. This section overviews the hardware available for immersing users in VR with respect to the five human senses. Considering that this work focuses on fully-immersive VR setups, the only requirement regarding the visual stimulus is the use of advanced visualization systems in the methodology, specifically HMDs, independently of the brand or its specifications. No specific criterion was applied regarding the auditory stimulus: any audio configuration was considered, for instance, headphones, earphones (integrated or not in the HMD), or external devices.

For the haptic stimulus, we used the list of haptic interfaces distinguished by Wee, Yap and Lim [63], which consists of "handhelds", "wearables", "encountered-types", "physical props", and "mid-air". By handhelds, the authors mean controllers held by the user, or attachments and add-ons that enhance the haptic feedback of default devices, such as the controllers provided with the Oculus Rift, HTC Vive, and Sony PlayStation VR. Wearable devices are worn anywhere on the body (fingers, wrists, hands, etc.), for instance, the Emogle [64]. Encountered-types include devices that provide on-demand haptic feedback, typically robotic arms, drones, or specialized devices with an end effector attached. Physical props consist of tangible objects placed in the physical space, aligned with the virtual object, to deliver haptic feedback. Mid-air devices provide ultrasonic vibrations through the air to deliver haptic sensations [63, 65]. Besides this classification, whole-body tactile stimulation has also been explored in VR, such as floor vibration [21, 66], which is also considered haptic or tactile feedback. New ways of providing haptic feedback have been developed and implemented recently, including the use of heat to give users the sensation of a warm environment [13], wind [10, 13, 28], pain [67], and fire or a ghostly breeze produced by a fan [19]. Such techniques are frequently considered haptic feedback [67, 68], which can be divided into two categories: "active haptic feedback", when computer-controlled actuators exert forces on the user, and "passive haptic feedback", which corresponds to the interaction with tangible objects [69]. Together, active and passive haptic feedback integrate the so-called "Smart Substitutional Reality" (SSR) [19]. This paper includes both active and passive haptics as haptic stimuli.

Finally, regarding smell and taste, no specific criterion was applied. Despite having an essential role in evoking human emotion [27] and enhancing the user experience, especially regarding the perception of realism and sense of presence [15,16,17], these two senses are much less explored, especially when compared to auditory and visual stimuli [30].

2.6 Data collection process

For data collection, the procedure was conducted using piloted forms. We used the reference manager software Endnote™ 20 to upload all the studies from the four databases (Web of Science, Scopus, ACM Digital Library, and IEEE Xplore). After removing duplicates, a primary analysis of the papers' titles and abstracts was performed to screen the results for full-text analysis, following the PRISMA flow diagram [58]. At this stage, the results were divided according to the exclusion criteria (Subsection 2.5); the number of discarded papers is presented in Table 1. The papers selected for the final analysis were thoroughly read, and the information regarding the proposed objectives and research questions was retrieved and synthesized into Table 2.

Table 1 Number of discarded papers after the title and abstract analysis, based on the exclusion criteria

2.7 Assessing studies’ relevance

After gathering the final results for qualitative analysis (N=37), the studies' relevance was assessed by a scoring process consisting of an individual evaluation of each paper. The scoring system adopted was similar to the approach previously used by other authors [30, 45, 70, 71]. As defined by them, the score ranges from 1 (lowest) to 3 (highest) per question, and the total score is the sum of the scores of the created questions. Considering that three questions were developed, the total score ranges between three (lowest relevance) and nine points (highest relevance). Thus, papers with a score of 3 were considered low-relevance papers; papers scoring between 4 and 6 (both included) were addressed as medium-relevance papers; and a score equal to or greater than 7 determined high-relevance papers. The first author completed the scoring process for all the studies, which the coauthors double-checked; the final scores depended on the consensus between the authors. The three questions for assessing the relevance of the studies were as follows (a small sketch of the scoring scheme is given after the list):

  1. Is the paper technically sound? For this question, we created guidelines to assess the papers based on the work by Spezi, Wakeling, Pinfield, et al. [72]. The authors define "technical soundness", also named "scientific soundness", as the papers' "methodological precision, coherence, and integrity", which includes the quality of argumentation, the logic of research, and the interpretation of data. Based on this, we assigned 1 point to papers that present several issues regarding the quality of argumentation, the logic of research, and the interpretation of data; 2 points to papers that present few issues regarding the same topics; and 3 points to papers that are technically accurate regarding the same three topics.

  2. Is the sample size (number of participants) suitable, according to the recommendations? According to Macefield [73], although specifying the participant group size remains challenging, a group size of 8-25 participants is a good baseline for comparative studies aiming to obtain valid results; however, increasing the sample size allows more statistically significant results to be obtained. Accordingly, in this literature review, papers that included fewer than 8 participants (Sample < 8) were assigned 1 point; 2 points were attributed if the sample was between 8 and 25 participants (8 ≤ Sample ≤ 25); and 3 points were assigned to studies with a sample greater than 25 (Sample > 25).

  3. How relevant is the study for drawing conclusions about the correlations between the sensory stimulus and users' emotions? This question was assessed based on each paper's conclusions regarding the correlation between the stimulus and the users' emotional responses, as this is one of the main objectives of this literature review. If no correlations could be found ("Inconclusive"), the paper was assigned 1 point – low relevance; if only non-concrete correlations were established, 2 points were assigned – medium relevance; and 3 points were attributed when objective correlations had been found by the authors – high relevance.

It is worth noting that this score, particularly the third question, concerns the studies' relevance only for this paper's purposes. This means that the score would need to be reassessed for other scientific purposes.
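A minimal sketch of the scoring scheme described above, assuming the three question scores are assigned as in the list (the example paper is hypothetical):

```python
def sample_size_score(n_participants: int) -> int:
    """Question 2: score the sample size using the thresholds listed above."""
    if n_participants < 8:
        return 1
    return 2 if n_participants <= 25 else 3

def relevance_category(q1: int, q2: int, q3: int) -> str:
    """Map the total score (3-9) to the relevance bands defined above."""
    total = q1 + q2 + q3
    if total == 3:
        return "low relevance"
    return "medium relevance" if total <= 6 else "high relevance"

# Hypothetical paper: technically sound (3), 20 participants, clear correlations (3).
print(relevance_category(3, sample_size_score(20), 3))  # -> high relevance
```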

3 Results

3.1 Study selection

The data collection process was conducted in four steps (“Identification”, “Screening”, “Eligibility”, and “Included”), according to the PRISMA flow diagram [58] (Fig. 1), to gather relevant information to answer all the research questions.

Fig. 1 PRISMA search flow guidelines

Initially, in the "Identification" step, we ran a search in the four electronic database sources (Web of Science, Scopus, ACM Digital Library, and IEEE Xplore) for the search terms, applying only language filters (Portuguese or English) and document type filters (conference proceedings paper or journal paper).

From this initial search, we obtained 1413 results. Additionally, we applied the so-called "snowballing method", whereby we could locate other relevant studies by checking the references in the already selected articles. Two more results were obtained, resulting in a total of 1415 results. The "Screening" stage, the second step of the process, consisted of removing the duplicate results (n=460), resulting in 955 studies. Then, these results were analyzed to check whether the paper addressed a VR experiment (16 results removed). In the third step, "Eligibility", the full text of the remaining papers (n=939) was assessed. In this process, we discarded 902 results, registering the reason according to the exclusion criteria, as previously noted (Table 1). Finally, we obtained 37 results for the qualitative and quantitative analysis, all written in English, which were included in the final stage (“Included”).
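The selection-flow arithmetic reported above can be summarized as follows (counts reproduced from the text):

```python
# PRISMA flow counts from the text above.
identified = 1413 + 2                 # database search results + snowballing
after_duplicates = identified - 460   # duplicate records removed
screened_in = after_duplicates - 16   # records without a VR experiment removed
included = screened_in - 902          # full-text exclusions (Table 1)
assert (identified, after_duplicates, screened_in, included) == (1415, 955, 939, 37)
```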

3.2 Qualitative analysis

Relevant data were extracted from these results for in-depth review. This information was organized and synthesized into Table 2, created based on the research questions we would like to answer. First, studies were segmented according to the research’s context area. This classification was based on the title, keywords, and abstract analysis. Six categories were created accordingly (“Therapy”, “Food research”, “General VR Research”, “Games”, “Tourism”, and “Marketing”). The third category was developed to comprise the papers that do not fit into any particular category, given their wide range of applications.

Table 2 Data obtained from all the articles selected for final analysis

For each selected study, the table contains the following variables: 1) identification of the study; 2) research's context area; 3) the sample size used for emotional responses collection, considering that some studies only investigate it with a sub-sample ("N="); 4) the methods and instruments used in the immersive VR experience, distinguished into “VR System” and “Measurement of the users’ emotions”; 5) the stimuli provided by the immersive VR experience to the user, that can be Visual (V), Auditory (A), Haptic (H), Smell (S) and Taste (T); 6) the “Context and main findings” of the study that might be relevant for conclusions; 7) the relationship found between the stimulus and the obtained users’ emotional responses (if applicable). Whenever the study’s main findings (column 7) were related to the relationship between the stimulus and the users' emotional responses (column 8), both columns were merged.

Little work has been done regarding the emotions triggered by each specific stimulus. It is still challenging to pinpoint the effect of specific cues on users' emotional responses, as attested by some authors [10, 28, 33, 74]. Also, given the subjective nature of emotions, establishing this relationship is difficult due to the multiple factors that may affect users' perception, as addressed by Kruijff, Marquardt, Trepkowski, et al. [10]. We can already identify this as a gap, which will be further discussed.

3.3 Studies’ relevance

As explained in Section 2.7, to assess the relevance of the studies, three topics were examined (technical soundness, sample size, and general relevance). The mean rating for the fully analyzed papers (N=37) was 6.92, and the mode was 7. According to Fig. 2, the majority of the selected studies (N=26, approximately 70%) are high-relevance papers, with a score greater than or equal to 7. The remaining studies (N=11, approximately 30%) are medium-relevance papers. No low-relevance papers were found. These results indicate good methodological robustness.

Fig. 2 Relevance assessment (Scores Histogram)

3.4 Quantitative analysis

3.4.1 Yearly scientific production

A chart was created to understand the distribution of studies according to the research area. By looking at Fig. 3, it is possible to notice an increase in the number of studies regarding users' emotional responses in multisensory immersive VR (with at least 3 stimuli provided) starting in 2015, which is in line with conclusions from Brooks, Lopes, Amores, et al. [98]. Since then, at least two studies have been published every year. This result can be explained by the democratization of VR, driven mainly by the gaming industry, with the release of the first more efficient and affordable VR headsets in 2014 [99]. We highlight the only study published before 2015, conducted by Saladin, Brady, Graap, et al. [85] in 2006, which does not match this reasoning. Also, the years 2019 and 2021 are not in line with the increasing number of results since 2015. Although we cannot find any reason for 2019, 2021 might have been atypical due to the severe restrictions caused by the COVID-19 pandemic that year and the year before (considering that some of the papers written in 2020 would be published in 2021).

Fig. 3 Distribution of studies according to the research area

Finally, Fig. 3 also shows that therapy (N=11, approximately 30%) and general VR research (N=10, approximately 27%) are the two main research areas explored over time. Less emphasis was given to food research (N=7), games (N=6), tourism (N=2), and marketing (N=1).

3.4.2 Methods and instruments for measuring users' emotional responses

To understand the most common methods and instruments to measure users' emotional responses during multisensory immersive VR experiences, we defined variable sets for subjective and objective measures using SPSS®, to analyze them individually or as a group. Subjective measures were composed of 13 variables, i.e., the analyzed studies used 13 subjective instruments for users' self-report of their emotional responses (Table 3). Objective measures comprised 7 variables, meaning the analyzed studies used 7 objective measures to assess the users' emotional responses (Table 4).

Table 3 Frequency of the instruments used as subjective measures
Table 4 Frequency of the instruments used as objective measures

A crosstabulation between the research area and the used methods was built to analyze the most common methods for studying users' emotional responses in immersive multisensory VR (Table 5). This table shows that the majority of the studies (21 studies, equivalent to 56.7%) used a subjective measure, either exclusively (9 studies, corresponding to 24.3%) or as part of a mixed approach (12 studies, corresponding to 32.4%). In turn, purely objective measures were the least used (4 studies, equivalent to 10.8%). In total, 16 studies (equivalent to 43.2%) included objective instruments in their methodologies, either exclusively (4 studies, corresponding to 10.8%) or as part of a mixed approach (12 studies, corresponding to 32.4%). According to the combined analysis of Tables 3 and 5, the most common method used to measure users' emotional responses in multisensory immersive VR experiences is subjective, independently of the research area, mainly resorting to specifically designed questionnaires (17.4%), i.e., custom questionnaires developed by the authors, and interviews (17.4%). From the simultaneous analysis of Tables 4 and 5, we can observe that objective measures are the least used, and that the most recurrent instruments were GSR (38.7%) and HR (35.5%), whether measured through heartbeats (HR), its monitoring (HRM), or its variability (HRV).

Table 5 Association between the research area and the used methods for measuring users’ emotional responses

3.4.3 Association between the used method and the research area

Goodman and Kruskal's λ was run to determine whether the used method for measuring the users' emotions (dependent variable) could be better predicted by the research area (independent variable) – Table 6. Goodman and Kruskal's λ was very close to 0 and not statistically significant (p=0.063), indicating no evidence of an association between the variables [100].

Table 6 The association between the used method and the research area (Goodman and Kruskal’s λ)
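For reference, Goodman and Kruskal's λ measures the proportional reduction in prediction error obtained when the independent variable (here, the research area) is used to predict the dependent variable (the measurement method). A minimal sketch of the computation is given below; the crosstabulation values are hypothetical, not the ones behind Table 5 or Table 6:

```python
import numpy as np

def goodman_kruskal_lambda(table: np.ndarray) -> float:
    """Asymmetric Goodman-Kruskal lambda (rows: independent variable,
    columns: dependent variable)."""
    n = table.sum()
    # Prediction error when ignoring the rows: always guess the modal column.
    e_without = n - table.sum(axis=0).max()
    # Prediction error when using the rows: guess each row's modal column.
    e_with = n - table.max(axis=1).sum()
    return 0.0 if e_without == 0 else (e_without - e_with) / e_without

# Hypothetical research-area x method crosstab, for illustration only.
crosstab = np.array([[5, 3, 4],
                     [2, 4, 3],
                     [2, 5, 9]])
print(goodman_kruskal_lambda(crosstab))
```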

3.4.4 Limitations on the study of emotional responses in multisensory VR

To answer research question 2, we analyzed the main limitations pointed out by the author(s) whenever they mentioned them. These data were summarized and classified into Tables 7, 8 and 9, respectively, according to the limitations regarding the used VR equipment (RQ 2.1), the used instruments for emotional responses collection (RQ 2.2), and the experimental design (RQ 2.3). The Discussion (Section 4.2) will address them in more detail. For each limitation, the studies reporting it are identified, and the context or justification given by the authors is indicated when necessary.

Table 7 The main limitations regarding the used VR equipment
Table 8 The main limitations regarding the used instruments for emotional responses collection
Table 9 The main limitations regarding the experimental design

4 Discussion

This work aimed to overview the methods and instruments used to assess users' emotional responses during multisensory virtual experiences that add at least one more sense to the base pair of vision and audio. A total of 37 articles matched the defined criteria and were thoroughly analyzed. Our main results were presented in Section 3 (Results) and constitute the core information for the following discussion and conclusions. For clarity, this section is organized into topics that answer the proposed research questions.

4.1 Methods and instruments for measuring users' emotions during multisensory VR experiences

To answer RQ 1, an analysis of the most used instruments was conducted (Section 3.4.2). As revealed, subjective measures are the most common for assessing users' emotional responses in multisensory immersive VR experiences, representing 56.7% of the methods used in the analyzed studies. Specifically, researchers tend to assess subjective emotional responses by resorting to questionnaires (17.4%) and by interviewing the participants (17.4%). We believe one possible reason for this result is the more accessible and less expensive implementation process compared to objective measures, as previously discussed. Also, when obtaining subjective reports, researchers gain a more comprehensive understanding of the relationship between emotions and performance in VR, which might be another explanation for this result.

Among the objective instruments, GSR remains the most popular in emotional experiments, according to our results, which is also supported by recent statements from Hosany, Martin and Woodside [54]. However, there is still controversy in the literature regarding its power and accuracy in measuring emotional responses, considering, for instance, the reported mismatch between EDA and the subjective method SAM [90]. This is contradicted by the results of Kaminska, Smolka, Zwolinski, et al. [77], who found SCL to have the highest correlation with the participants' subjective reports regarding the assessment of stress level and mood [77]. Also, Felnhofer, Kothgassner, Schmidt, et al. [4] concluded that, because it does not discriminate between different affective states, SCL might not be the best instrument to measure emotions, which is in line with IJsselsteijn [101], who believes HR may be a better indicator.

Despite our results showing a trend toward subjective measures, using a mixed approach to assess users' emotional responses in multisensory immersive VR experiences has been recommended by several authors [10, 19, 28, 31, 55, 79, 81, 82, 85]. This idea is also supported by Johnson and Onwuegbuzie [102], who stated that a mixed approach might result in better validity and more robust conclusions [103]. However, our literature review does not allow us to predict a clear advantage of resorting to it, as it is still unclear whether its usage represents a benefit or a disadvantage. On the one hand, if the data from the two methods present similar results, the mixed approach will understandably bring more robustness to the results. On the other hand, if there is a large discrepancy between objective and subjective data, using a mixed approach is less beneficial than using only one objective or subjective measure. So, researchers should conduct a preliminary analysis to define which method and instruments are more advantageous for achieving the investigation goals. Nevertheless, none of the studies discussed why the researchers had chosen a particular instrument or measure; that is, the pros and cons are not adequately balanced when choosing the method to assess users' emotional responses. Our conclusion concerning this topic is that using a mixed approach does not necessarily bring more reliable results. Instead, researchers must weigh the pros and cons of each approach (subjective, objective, or mixed) according to the specific VR experiment. Further discussion on this topic can be found in the Conclusions (Section 5).

4.2 Limitations found regarding the study of users' emotional responses in multisensory VR experiences

According to the authors' reports, we concluded that several pointed out the same limitations, although in different research areas and contexts, which we believe can serve as an alert for improving future research. These results, corresponding to RQ 2, were synthesized into Tables 7, 8 and 9, and will be discussed in Section 4.2.1 (the used VR equipment), Section 4.2.2 (the instruments for emotional responses collection), and Section 4.2.3 (the experimental design).

4.2.1 Limitations regarding the used VR equipment

The limitations regarding the VR equipment (RQ 2.1) were classified according to its intrusiveness and social interaction inhibition, independently of the used method (subjective or objective). In general, our results demonstrate that the VR headset is mainly intrusive for tasting experiences. From a researcher's perspective, VR equipment might hamper the assessment of users' emotional responses, mainly because it covers a significant part of the face, which prevents the assessment of behavioral reactions such as facial expressions or gestures [77]. From a user's point of view, the fact that participants cannot precisely assess the appearance of the tasting sample [75, 78, 94] will affect their experience and, consequently, their reported feelings. The fact that users had to remove the VR headset after each part of the experience to answer the questionnaire was also pointed out as a limitation [78, 94]. We highlight the use of in-VR questionnaires, which have become more relevant in VR experiments, as users can answer any questionnaire without taking off the headset and without breaking their sense of presence [104, 105], e.g., the VR Questionnaire Toolkit from Feick, Kleer, Tang, et al. [106]. This alleviates the transition from VR to the real world, which, in turn, helps maintain the user's immersion and presence [104, 107] and reduces disorientation [105].

Finally, VR equipment was pointed out by Sinesio, Moneta, Porcherot, et al. [78] as inhibiting social interaction, which the authors consider fundamental to simulating real-life scenarios, despite the possible presence of avatars.

4.2.2 Limitations regarding the used instruments for emotional responses collection

The primary limitations regarding the instruments for collecting users' emotional responses (RQ 2.2) were distinguished by their subjectiveness, objectiveness, or the use of a mixed approach. Objective instruments are subject to bias caused by external factors and by the hardware's intrusiveness and sensitivity. Worch, Sinesio, Moneta, et al. [80] commented on the negative impact of external factors such as the test condition, the technological material, or the presentation order on obtaining more accurate results. Another limitation worth noting is how intrusive hardware can be, such as EEG [77], which can be overcome by resorting to other objective methods or to less invasive EEG hardware. Finally, regarding the hardware's sensitivity, the interference of room temperature in the GSR data collection process was pointed out by Kaminska, Smolka, Zwolinski, et al. [77]; physical movements can affect HR [29] and EDA [55], which limits the exploratory interaction allowed between the user and the VR scenario [29]. Specifically, the "Empatica E4" wearable used to collect EDA data was reported to be unreliable [55].

Subjective instruments rely on self-report, which is considered a limitation per se. Perceiving one's own emotional responses is an arduous task, highly subject to bias [55, 94], for instance, due to social desirability and ideas of role models [55]. So, although using more than one subjective measure can be a good way to validate emotional responses, it can also result in a mismatch between the collected data [55]. The same occurs when using a mixed approach, i.e., a mismatch between subjective and objective data, as reported by Salgado, Flynn, Naves, et al. [90].

4.2.3 Limitations regarding the experimental design

The most reported limitation regarding the experimental design (RQ 2.3) was the absence of a mixed approach, regardless of whether a subjective or an objective approach was used [10, 19, 28, 31, 55, 79, 81, 82, 85]. This means that when researchers used an objective approach, a subjective method would have been necessary to complement the results, and vice-versa. Although rarely, some researchers suggest using additional instruments of the same method to validate the results [7, 83]. However, as discussed in Section 4.1, our results do not allow us to conclude that a mixed approach will necessarily result in more solid, consistent, or rigorous outcomes, considering the potential risk of inconsistency between the subjective and the objective data.

Concerning the investigation of multisensory stimuli, as we will address later, some authors claimed that the stimulus effects should be explored individually, both to compare their unimodal vs. multimodal effectiveness [76] and to understand which levels of manipulation are most effective in inducing specific emotions [29].

Although sporadically, the absence of intense emotional images (excluded for ethical reasons) was also pointed out as a limitation, since such images could have induced stronger emotional responses in the user [11]. For instance, this limitation could be reduced by using an affective picture system, such as the International Affective Picture System (IAPS), which allows more control over the emotional responses elicited in the user. Such a system includes a great variety of emotion-eliciting photographs, such as nature, erotic scenes, weapons, animals, etc. [108]. Also, the "Velten Mood Induction Procedure", one of the most cited and frequently used [4], could be a good option, as it combines photo material with music [109].

It was also commented that filling out the questionnaires long after the experience led users to report what they remembered rather than what they actually felt [65]. This limitation highlights the importance of applying questionnaires as soon as possible after the experiment and, ideally, of creating short VR experiences to avoid a long period between the beginning of the experiment and the filling out of the questionnaire. As previously discussed, this issue could also be addressed by resorting to in-VR questionnaires.

Another considerable limitation reported by some authors [31, 74, 94] regarding the experimental design of multisensory VR systems is the incongruency between the stimuli. This problem has been increasingly highlighted in the literature. Many authors have shown that VR users subjected to incongruent stimuli tend to report several negative feelings, such as a disturbance in presence [74], a sense of disownership [110, 111], loss of agency, and the sensation of being out of one's own body [111].

Finally, Brengman, Willems and De Gauquier [31] commented that multisensory VR systems are, in general, hard to implement and evaluate, and even harder the more stimuli they include. Indeed, considering the essential requirement of stimulus congruency, the several obstacles to implementing and assessing certain human senses such as taste and smell, and all the other previously discussed limitations, designing multisensory VR systems requires significant effort and expertise. All these limitations contribute to understanding why this topic has been so poorly explored.

4.3 The relationship between the stimulus provided to the users and their emotional response(s)

One of this research's main goals was to find a relationship between the stimulus provided to the users and their emotional response (RQ 3). Unfortunately, only a few studies explored such an association, and those that did explored it only generically. As Dey, Chen, Billinghurst, et al. [29] and Chin, Thompson and Ziat [76] noted, stimuli should be explored individually to understand their isolated effects on users' emotions. Regarding the individual effect of the visual stimulus, results from Wu, Weng and Xue [41] confirm that its correlation with emotions is greater than that of haptic or auditory stimuli. In turn, the haptic stimulus had a greater influence on users' emotions than the auditory one. Yet, none of these conclusions provide concrete information, such as which emotions are triggered by each specific stimulus. An attempt to reach such a clear conclusion was made, for instance, by Torrico, Han, Sharma, et al. [75], who, considering different light conditions in taste perception, found that bright-VR was associated with "free", "glad", "aggressive", and "enthusiastic" emotions, while dark-VR was associated with "nostalgic" and "daring". Similarly, Cornelio, Dawes, Maggioni, et al. [93] found a significant effect of blue and red lighting in predicting lower valence and dominance when the users tasted neutral samples. When the users tasted sweet samples, red lighting was found to reduce valence, and blue lighting predicted low arousal levels. Dey, Chen, Billinghurst, et al. [29] demonstrated that visually manipulated HR feedback could significantly enhance "interest", "excitement", "scariness", "nervousness", and "fear" in VR.

Regarding the auditory stimuli, Tamtama, Santoso, Wang, et al. [95] demonstrated that their presence in a dark tourism context triggered the negative feelings of "pity", "empathy", "sadness", "nervousness", "fear", "self-awareness", and "horror". Work from Kruijff, Marquardt, Trepkowski, et al. [10] revealed significant correlations between a "bee swarm" (normal sound condition) and "happiness", while the "bad weather situation" (low-frequency sound) and the "zombie swarm" (low-frequency and normal sound) correlated with "surprise". No other correlations were found between auditory stimuli and users' emotional responses. Regarding smell, the same authors [10] found that the smell provided in the "sea view" situation elicited happiness in the user. Still regarding smell, unpleasant and intense odors demonstrated a positive association with anxiety [92].

Regarding taste, only one study was found, relating the tasting of milk, white, and dark chocolate samples to several positive and negative emotions [62].

For the haptic stimulus, Kruijff, Marquardt, Trepkowski, et al. [10] found a correlation with "surprise" (in the "spider behind the back" situation). Also, the back vibration during the "bad weather situation" and the "zombie swarm" situation caused "surprise". Kono, Miyaki and Rekimoto [18] found that haptic feedback from EMS or solenoid actuators induces "fear" and "pain". Considering thermal feedback, the warm condition triggered higher arousal and valence than the cold condition, which, in turn, was perceived as more relaxing and less arousing [95].

To summarize, the obtained results do not allow us to identify a pattern of which specific stimulus triggers each emotion, as they are all vague or dispersed and unrelated to each other. This limitation will be further addressed in Sections 4.5 and 5.

4.4 Evidence of an association between the used method and the research area

Goodman and Kruskal's λ was run to verify the existence of an association between the used method and the research area (RQ 4). The result showed no association between these two variables. Still, it should be considered that little research has been published so far on the assessment of users' emotional responses in multisensory immersive VR experiences, which is reflected in the low number of studies analyzed in this literature review.

4.5 Gaps and challenges

In this paper, one of our goals was to address the main gaps and challenges remaining for future research (RQ 5). First, as previously stated, our results demonstrate a significant gap in the study of users' emotions in multisensory immersive VR experiences, specifically in understanding which stimuli evoke certain emotional responses, regardless of the research context. As Kruijff, Marquardt, Trepkowski, et al. [10] highlighted, "there is hardly any evidence that supports which combinations of stimuli can trigger specific emotional responses", which our paper corroborates six years later. Indeed, it is still impossible to derive a pattern or a set of guidelines regarding the immersive design aspects within an immersive virtual environment (IVE) that stimulate specific emotional responses in a certain context (positive, negative, or neutral). We believe that, in the next few years, with the constant development of technology, specifically VR, a significant number of studies on this subject will appear, which is in line with our results regarding the distribution of studies by year of production (Section 3.4.1) and the findings from Brooks, Lopes, Amores, et al. [98]. This will provide the chance to obtain more concrete results in this field. Although some exploratory experiments have been conducted to understand the correlation between a specific stimulus and the triggered emotional responses, for instance, according to light conditions [75, 93], auditory stimuli [10], or haptic thermal feedback [95], the studies we analyzed were, in general, recent, revealing that there is still a long way to go in this regard.

For future research, we recommend comparing our results with those from less immersive VR environments, i.e., including unisensory and two-stimulus environments, which our work does not encompass. Such an investigation could provide more robustness and consistency regarding the association between the input stimuli and the obtained emotional responses. Finally, it is also worth noting that other results could have been obtained using a different classification of VR systems (e.g., Slater's classification of immersion in terms of sensorimotor contingencies [112]), so it could be pertinent for future work to consider such dimensions.

5 Conclusions

This paper reviewed the relationship between multisensory information in VR and users' emotional responses, considering papers exploring at least three different stimuli, i.e., the addition of at least one more sense to the base pair of vision and audio. Although this exclusion criterion might be seen as reductive, as unimodal and bimodal VR experiences can also provide valid information regarding emotions, this study contributed to revealing the significant lack of knowledge remaining on this subject. It also acts as an alert to the urgency of more research stimulating more than just two human senses. We consider that the scarcity of information in this regard so far derives from the complexity, for researchers, of simulating and evaluating stimuli other than audiovisual ones, as previously addressed. Also, the recent focus of VR research has been on portable mass-market VR, which is less compatible with extra hardware and makes it harder to explore other senses, such as smell and taste. Lastly, due to restrictions caused by the COVID-19 pandemic, fewer VR studies than expected were conducted in general, which indirectly limited this paper's results and conclusions.

Indeed, if the fourth exclusion criterion had not been included, at least 251 more articles would have been added to this study. Although such an investigation could have provided more consistent results, we believe that the future of VR lies in multisensory experiences providing as many sensory inputs as possible. As previously discussed, the more senses stimulated, the more immersive and emotionally powerful the IVE becomes, which justifies the importance of adding more than just two stimuli to a VR experience. The rapid evolution of technology, and of VR in particular, evidenced, for instance, by the noticeable increase in publication rates regarding the stimuli least explored so far (smell and taste) [98], also reinforces this conviction. When such studies become more frequent, it will be possible to clearly outline the association between specific stimuli and the triggered emotions.

Finally, this literature review allowed us to conclude, contrary to what some authors advocate, that the use of a mixed approach to measuring users' emotional responses in VR does not clearly guarantee more reliable results, mainly considering the potential risk of biased results caused by a mismatch between the objective and subjective data. So, the researcher should ponder in advance the inherent risk of each approach. Many aspects of a VR system can cause users to feel more cybersickness: for instance, as enumerated by Bockelman and Lingum [113], technical aspects (e.g., the type of display and the comfort of its design), aspects of the visual experience (e.g., navigation speed, scene oscillation, the 3D experience), and the hardware interaction all contribute to cybersickness symptoms. In an investigation that acknowledges certain risks contributing to a greater feeling of cybersickness (for example, the use of a low-resolution HMD or delays in hardware interaction), and taking into account that individuals tend to respond according to what they believe researchers expect (the so-called "good-subject effect" [114], which, in this case, means reporting no cybersickness), the objective data is likely to contradict the users' subjective responses. That is, objective data is likely to reveal a negative emotional experience caused by cybersickness symptoms, contrary to the user's subjective report, which might be biased by the good-subject effect and therefore yield a different result. These are examples of situations in which only objective data should be used, as there is a high risk of mismatch between objective and subjective measures. On the other hand, if these conditions are not expected and low levels of cybersickness are predictable (which can be anticipated, for instance, through pre-tests), a mixed approach should be considered. Finally, if the VR experience per se is complex and intrusive, for example, due to the equipment the participant is required to use (e.g., haptic, smell, or taste interfaces), it might be beneficial not to resort to objective methods to assess users' emotional responses, as they would add even more intrusiveness and, consequently, affect the data collection.