1 Introduction

Test anxiety is an emotional state characterized by subjective feelings of discomfort, fear, and worry about the evaluation made by an external authority [1]. Anxiety can have a considerable negative impact on students’ academic performance [2, 3], hence motivating the study of techniques that could help students in reducing it.

In vivo and imaginal exposure approaches have traditionally been used in psychology to treat various types of anxiety. Both approaches expose the individual to his or her feared situations over time, by either using the real situation (in vivo) or asking the individual to imagine it (imaginal). Repeated exposure to the feared situation causes a desensitization that gradually lowers the elicited anxiety, making the situation less fearful. Over the past two decades, Virtual Reality Exposure (VRE) has been increasingly used as an alternative to in vivo and imaginal exposure in the treatment of anxiety [4,5,6,7]. VRE involves repeatedly exposing people to the feared stimulus in a virtual environment (VE), i.e. a simulated environment created through computer-generated images. VRE systems for test anxiety should thus reproduce different exam situations [8]. However, the simulated exams should be able to produce anxiety responses in the student to support the desensitization process.

Unfortunately, the literature on VRE systems for test anxiety is very limited, with only three papers available on this specific topic [8,9,10]. Moreover, they suffer from three major limitations. First, not all of them assess the capability of their system to produce anxiety responses, which is a prerequisite for possible VRE use. Second, all of them concern a written test. Since oral tests elicit higher levels of test anxiety than written tests [11], the availability of VRE systems for oral tests would be valuable to a large population of students worldwide. Third, existing VRE systems for test anxiety require the availability of a head-mounted display, posing a barrier to widespread use, especially at home. This paper aims to address all three limitations by proposing a VRE system for test anxiety that focuses on oral tests reproduced on common computer displays, and by assessing the feasibility of the system for VRE with a study. In the proposed system, a virtual examiner (VX) conducts the oral exam. The VX can perform three different sets of behavior, creating different oral exam scenarios aimed at eliciting three levels of increasing anxiety for VRE purposes.

This paper presents the following contributions: (1) it proposes the first VRE system for test anxiety that deals with oral exams; (2) it conducts the first feasibility study of a VRE system for oral exams, showing its capability to induce different levels of anxiety; (3) while VRE systems for test anxiety have so far been based on immersive VR, the system proposed and studied in this paper employs desktop VR; (4) unlike other studies of VRE systems for test anxiety, the study also exploited a visual annotation tool, proposed and described in this paper, that makes it possible to identify the instants in which participants remember having felt positive or negative affect. The information collected through the tool is also useful in guiding further development of the system, as described in the paper.

2 Related work

The literature has proposed several VRE systems [12,13,14,15,16,17,18,19,20,21,22,23,24] aimed at fear of public speaking that allow the user to give a speech to a virtual audience. Studies [18,19,20,21,22, 24] have shown that some of these systems are able to increase the level of anxiety and distress in participants, confirming feasibility for VRE in public speaking. Other studies [13,14,15,16,17, 23] have used the systems for treatment, showing that the VRE approach is successful in reducing the level of public speaking anxiety. However, such scenarios are different from those needed in test anxiety, which should instead reproduce the situations experienced by students in school exams.

To the best of our knowledge, there are only three proposals of VRE systems for test anxiety in the literature [8,9,10]. Table 1 summarizes the main differences between those works and the present paper.

Table 1 Comparison of the main differences of the present paper with studies conducted on VRE systems for test anxiety

TAVE (Test Anxiety Virtual Environments) [8] is an immersive VRE system that provides users with three VEs, to be experienced in chronological order: student’s home, representing the day before and then the morning of the exam; followed by subway journey to the exam location; and finally school corridor and classroom where the exam takes place. A feasibility study was conducted to assess whether the different VEs of TAVE could elicit significantly different responses in students. The system was tested on 11 students with high levels of test anxiety and ten students with low levels of test anxiety. After experiencing each VE, students’ level of anxiety was measured with the State Anxiety Inventory and Subjective Units of Discomfort Scale questionnaires. Results showed that students with high test anxiety experienced higher levels of anxiety than students with low test anxiety in all the VEs. However, the level of anxiety in the two groups did not progressively increase with the sequence of VEs but peaked during the subway journey.

Kwon et al. [9] proposed an immersive VRE system that includes two VEs: house on the day before the exam and school on the day of the exam. They assessed the feasibility of the VRE system in inducing different levels of test anxiety in adolescents. However, in addition to experiencing the VEs in an exposure session, participants also performed a meditation session to regulate anxiety immediately after the exposure session. The system was tested on 21 adolescents whose general anxiety and test anxiety were measured with two self-report questionnaires at the beginning of the study, before they were exposed to the first VE. The study also measured participants’ heart rate variability with a physiological sensor during exposure to the VEs. After each session, adolescents’ level of anxiety was measured with the visual analogue scale. The main results of the study showed a significant difference in anxiety only between exposure and meditation sessions. However, the study does not specify whether there were significant differences in perceived anxiety between the exposure sessions.

Luo et al. [10] developed a VRE system for test anxiety containing three VEs aimed at eliciting increasing anxiety levels: home on the day before the exam, school entrance on the day of the exam, and classroom where the exam takes place. The system was assessed on middle school adolescents: unfortunately, while the paper states that participants experienced varying degrees of anxiety during VE exposures, it does not illustrate a study that might provide evidence about the system’s ability to elicit increasing levels of anxiety.

The VRE systems for test anxiety described above have three aspects in common.

First, they concern a written test. In TAVE [8], students are seated in a classroom and take a written test with multiple-choice, general knowledge questions. In [9], adolescents are seated in a classroom where a written test is about to start, but the test itself is not simulated. In [10], students are seated in a classroom and, after selecting their preferred subject from a list of five options, proceed to take a written test with questions from a college entrance test. In some educational systems, students take oral tests as an adjunct or an alternative to written tests, and several studies have highlighted the benefits of oral tests as an assessment method [25,26,27]. The reliance on oral tests varies from country to country: for example, they are central to the Italian educational system [28, 29], while the US educational system prefers written tests [26, 27, 30, 31]. As already mentioned, oral tests elicit higher levels of anxiety than written tests [11], and the availability of VRE systems for oral tests could thus benefit a large population of students worldwide. Moreover, the need to interact with the professor in oral exams introduces an additional challenge for students who also suffer from social anxiety. Therefore, it is crucial to provide students with scenarios in which they are exposed not only to an exam situation but also to the interaction with the professor. For these reasons, we focused our proposed system on oral exams.

Second, current VRE systems for test anxiety require using a head-mounted display. Although this can increase users’ sense of immersion, it restricts the use of the system to laboratory settings and to people who own a head-mounted display. Conversely, the use of widely available hardware would allow a much larger number of students to benefit from the VRE system, also at home. Moreover, although modern head-mounted displays have made significant progress in limiting symptoms of motion sickness, there is still the risk of negative side effects, such as nausea, headaches, and dizziness [32, 33]. On the other hand, the use of a VRE system using commonly available hardware would help minimize health risks. Furthermore, existing studies on effects of immersive VR on users show that VR can lead to more visual fatigue than traditional screens [34]. For these reasons, the VRE system we propose is meant for common computer monitors.

Third, the studies on VRE systems for test anxiety have exposed participants to the different VEs in a fixed order. In [8], both the group with high test anxiety and the group with low test anxiety were exposed to the three VEs in chronological order: first the home, then the subway, then the classroom VE. In [9], all participants were exposed to the two VEs in the same order: first the house, then the school VE. Exposing all participants to the VEs in a fixed order could lead to order effects, i.e., the participants’ responses to the VEs could be affected by the participants’ responses elicited in the previously experienced VEs. On the contrary, to prevent order effects, the order of conditions was counterbalanced in our study, i.e., all six possible orders of conditions were created, and then different participants were assigned a different order as shown in Fig. 1.

Fig. 1 Participant flow diagram
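The counterbalancing described above can be illustrated with a minimal sketch. It enumerates the six possible orders of conditions A, B, and C and assigns them to participants in rotation. The round-robin assignment, the function names, and the participant identifiers are assumptions made for illustration; the actual allocation used in the study is the one reported in Fig. 1.

```python
from itertools import permutations

CONDITIONS = ("A", "B", "C")


def counterbalanced_orders(conditions=CONDITIONS):
    """Return every possible presentation order of the conditions."""
    return list(permutations(conditions))


def assign_orders(participant_ids, conditions=CONDITIONS):
    """Assign orders in rotation (an assumed scheme, not the study's)."""
    orders = counterbalanced_orders(conditions)
    return {
        pid: orders[index % len(orders)]
        for index, pid in enumerate(participant_ids)
    }


# Hypothetical participant identifiers 1..36
assignment = assign_orders(range(1, 37))
```

With a number of participants that is a multiple of six, this rotation uses each order equally often.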

3 Materials and methods

3.1 Participants

The study was approved by the Institutional Review Board of the University of Udine. A sample of 37 (32 male, 5 female) participants was recruited for the study. Participants were volunteers who received no compensation. Their age ranged between 20 and 35 (M=22.28, SD=2.78), and they were recruited through direct contact among undergraduate Computer Science students of the University of Udine. Since the proposed VRE system concerns the simulation of an oral test, we looked for participants who were likely to take an oral test in the immediate future as they could be more representative of the intended users of the system. For this reason, students who had completed all the exams in their curriculum were not considered eligible. One participant was excluded from the analysis because he did not follow the instructions in completing some of the questionnaires. Figure 1 shows the participant flow diagram.

Although the recruited sample was predominantly male, it must be noted that there is substantial evidence from several studies that females tend to report greater anxiety [35] and higher intensity of emotional experiences [36, 37] than males. We thus reasoned that, if male predominance in our sample was possibly going to influence the results, the influence would have been in the direction of obtaining a smaller, not a greater, response in the intensity of anxiety.

3.2 The proposed VE

The system was developed in Unity version 2019.4.18f1. As a 3D model of the VX, we used the “Business_Female_03” character from Microsoft’s Rocketbox library [38]. The VX spoke with the “Elsa (Neural)” voice of the Azure Cognitive Services text-to-speech, with pitchDelta parameter set to -12.

In the VE, the VX sat behind a desk in an office. We defined three sets of VX behaviors (A, B, C), intended to elicit three increasing levels of anxiety. The three sets aimed at making the VX appear respectively friendly (set A), only partially friendly (set B), and unfriendly (set C). Figure 2 shows an example of a behavior from each set, and Table 2 describes all behaviors. Behaviors were chosen based on the indications about their likely effect on users, available in the literature on non-verbal communication [39,40,41,42,43] and virtual agents [44,45,46].

Fig. 2 Examples of different VX behaviors: nodding to the student in set A (a); one of the distracted positions in set B (b); head scratching in set C (c)

Table 2 Description of behaviors in A, B, C. The micro-movements mentioned in the IDLE BEHAVIOR row were obtained through head motion capture of a human actor

In set A, the VX maintained a smiling expression to convey agreement [42] and a positive attitude [43,44,45] (IDLE BEHAVIOR in Table 2). The VX performed all other behaviors in set A with a more pronounced smile than IDLE BEHAVIOR to make the VX’s facial expression more recognizable, as recommended in [45]. In addition, to further represent agreement and a positive attitude, set A included nodding [42,43,44,45] (BEHAVIOR 1) and tilting the head [39] (BEHAVIOR 2 and BEHAVIOR 5). Placing one hand on the opposite hip and the other hand on the opposite shoulder [46] (BEHAVIOR 4) was introduced to represent interest.

In set B, the VX kept a neutral expression on its face, not showing any signs of positive or negative attitude (IDLE BEHAVIOR in Table 2). Behaviors performed by the VX in set B to convey a neutral or slightly negative attitude, boredom, or disengagement included shaking its head [44, 45] (BEHAVIOR 1 in Table 2), turning its head away from the participant [39, 44] (BEHAVIOR 2 and BEHAVIOR 6), and raising its eyebrows [44] while raising its arms (BEHAVIOR 4).

In set C, the VX maintained a frowning expression to convey a negative attitude [45] (IDLE BEHAVIOR in Table 2). The VX performed all other behaviors in set C with a more pronounced frown than IDLE BEHAVIOR to make the VX’s facial expression more recognizable, as recommended in [45]. Behaviors performed by the VX in set C to convey a negative attitude or discomfort included showing a frowning and thoughtful expression and then shaking the head [45] (BEHAVIOR 2 in Table 2), scratching the head (BEHAVIOR 4) [47], rubbing the hands on the thighs (BEHAVIOR 3) [47], facing down [45] to use a smartphone (BEHAVIOR 5), and checking the time on the wristwatch [43] (BEHAVIOR 6).

Each set also included neutral behaviors (i.e., BEHAVIOR 3 and BEHAVIOR 6 in set A; BEHAVIOR 3 and BEHAVIOR 5 in set B; BEHAVIOR 1 in set C). In these behaviors, the VX changed head position and orientation several times while maintaining a smiling (A), neutral (B) or frowning (C) expression.

The VE was displayed from the viewpoint of a student sitting in front of the VX. The VX first greeted the student, then asked him or her three test questions, and finally informed him or her that the exam was over.

The proposed system offers the possibility of customizing the questions (users can define a pool of questions from which the VX randomly selects its questions), but we chose not to use this feature in the study in order to ask exactly the same questions to each participant. In choosing the subject of the questions, we reasoned that different familiarity of participants with the topic could affect the level of anxiety they could experience. To prevent this confounding factor, the questions in the study concerned a topic (Basics of International Law) that was unrelated to the participants’ degree curriculum. In addition, we checked that participants were not possibly familiar with the topic for other reasons. Each question was followed by 30 seconds of silence, during which the VX performed three behaviors from the assigned set of behaviors, following the order illustrated in Table 3. The timing of events was the same in the three conditions, and is described in detail by Table 3.

Table 3 Script of the oral exam experience. The VX spoke in Italian, all sentences have been translated here into English for reader’s convenience

3.3 Visual Annotation Tool (VAT)

One of the goals of this study was to better understand which aspects of the experience elicit affective responses in participants. To do so, we wanted to identify the exact time instants that had elicited an affective response in participants during exposure, distinguishing between positive responses (i.e., time instants in which participants had felt at ease) and negative responses (i.e., time instants in which participants had felt distressed). Given the central role of the VX in the experience, we were particularly interested in identifying which VX-related factors elicited such responses in participants. To support the achievement of these goals, we developed a Visual Annotation Tool (VAT) as a software-based measuring instrument. The VAT allows one to replay the exposure experience, and to mark any time instant as positive or negative by pressing two different keys on the keyboard, made easier to recognize by a green sticker and a red sticker, respectively. Once the replay of the experience is complete, the VAT displays the marked time instants on a timeline, representing positive and negative instants as green and red dots, respectively. Then, each time instant can be selected to replay the corresponding part of the experience and to assign it one or more categories from a pre-defined list.

In this study, we used the VAT as follows. After experiencing the VRE system, participants used the VAT to replay the three conditions in the same order they had experienced. Participants were instructed to mark the time instants that made them feel at ease or distressed during the first exposure by pressing the green or red key on the keyboard, respectively. Once the replay of all conditions was complete, the VAT displayed green and red dots on the timeline for each condition, allowing the experimenter to identify the specific instants during which the participants respectively felt at ease or distressed. Figure 3 shows the timeline of condition C with the positive and negative dots identified by one of the participants.

Fig. 3 The annotation tool, as seen by the experimenter. In this screenshot, there is only one time instant in which the participant felt at ease (represented by the first dot from the left in the timeline data visualization, colored green) while all other dots (colored red) identify time instants in which the participant felt distressed. The dot selected by the experimenter is highlighted by a white line. The items listed in the upper right part of the figure allow the experimenter to assign one or more categories to the selected dot after interviewing the participant about that time instant

Then, the experimenter used the three timelines to conduct an interview with participants to investigate the factors that had made them feel distressed or at ease during exposure. To do so, we used as selectable categories in the VAT three VX factors (i.e., facial expression, gaze, and posture), and an additional item “other” that allowed for the inclusion of free descriptive text. The experimenter selected each marked time instant to replay the corresponding part of the experience and asked the participant to indicate which factors had caused the elicited feeling. Based on the participant’s answer, the experimenter categorized the time instant in the VAT. In Fig. 3, the experimenter is examining the replay of the second time instant from the left in the timeline (the VAT highlights the selected time instant with a white line below it).
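The kind of data the VAT collects can be sketched as a small data structure: a timeline per condition, holding time-stamped positive or negative marks to which categories are later assigned. This is only an illustrative model, not the tool's actual implementation; the class names, field names, and the use of seconds as time unit are assumptions.

```python
from dataclasses import dataclass, field

POSITIVE, NEGATIVE = "positive", "negative"
# Categories named in the text: three VX factors plus a free-text item
CATEGORIES = ("facial expression", "gaze", "posture", "other")


@dataclass
class Mark:
    time_s: float                      # instant within the replay (assumed unit)
    valence: str                       # POSITIVE (green) or NEGATIVE (red)
    categories: set = field(default_factory=set)
    note: str = ""                     # free text, used with "other"


@dataclass
class Timeline:
    condition: str                     # "A", "B", or "C"
    marks: list = field(default_factory=list)

    def add_mark(self, time_s, valence):
        """Record a key press made by the participant during the replay."""
        if valence not in (POSITIVE, NEGATIVE):
            raise ValueError("valence must be positive or negative")
        mark = Mark(time_s, valence)
        self.marks.append(mark)
        self.marks.sort(key=lambda m: m.time_s)  # keep timeline order
        return mark

    def categorize(self, mark, category, note=""):
        """Attach a category chosen by the experimenter after the interview."""
        if category not in CATEGORIES:
            raise ValueError("unknown category: " + category)
        mark.categories.add(category)
        if note:
            mark.note = note

    def counts(self):
        """Number of positive and negative marks in this condition."""
        positive = sum(m.valence == POSITIVE for m in self.marks)
        return {POSITIVE: positive, NEGATIVE: len(self.marks) - positive}
```

Because a mark stores a set of categories, one time instant can carry more than one factor, as the tool allows.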

3.4 Measures and hypotheses

3.4.1 Self-reported social interaction anxiety

We administered the Italian adaptation of the Social Interaction Anxiety Scale (SIAS) [48] to measure the level of anxiety among participants in general social interactions. The SIAS is a 19-item self-report questionnaire which describes anxious reactions that can occur during social interactions [49]. For each item, respondents rate how true the statement is for them on a 5-point Likert-type scale (0=“not at all”, 4=“extremely”). Total scores range from 0 to 76. Higher scores indicate higher social interaction anxiety. Consistent with the literature on VRE systems for social anxiety [50,51,52,53,54,55,56], we chose to use the SIAS as an instrument to measure participants’ trait anxiety in the specific context of social interaction.

3.4.2 Self-reported anxiety

We used the Visual Analogue Scale for Anxiety (VAS-A) as an instrument to assess participants’ state anxiety [57, 58] during the exposure to each condition. The scale was a 10 cm long line, with “not at all anxious” and “very anxious” printed at its left and right ends, respectively. Participants reported their score by drawing a vertical mark on the scale.

3.4.3 VX attitude

We administered a 4-item questionnaire (Table 4) that asked participants to rate aspects of the VX’s attitude they perceived on a 7-point Likert-type scale (1=“not at all”, 7=“very”). To calculate the score, the scale of the second and fourth items was inverted, and the answers were averaged. Higher scores indicate a more negative attitude. Cronbach’s alpha in the three conditions was respectively 0.70 (A), 0.60 (B), 0.86 (C). We performed an exploratory factor analysis with principal component extraction and Oblimin rotation to evaluate factorial validity. Bartlett’s test was significant (p<.001 in all conditions) and KMO was greater than .60 in set A (0.70) and set C (0.75), while it was 0.54 in set B. The analysis confirmed the intended one-factor structure that explained respectively 57.27% of variance in set A, 48.64% in set B, 72.63% in set C.

Table 4 VX attitude questionnaire items
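The scoring rule described above (invert the second and fourth items on the 1 to 7 scale, then average) and the reliability index can be sketched as follows. The function names are hypothetical, and the Cronbach's alpha function uses the standard formula based on item variances and the variance of the summed score; it is an illustration, not the SPSS procedure used in the study.

```python
from statistics import mean, variance

SCALE_MIN, SCALE_MAX = 1, 7
REVERSED_ITEMS = (1, 3)  # zero-based positions of the second and fourth items


def reverse_score(answers):
    """Invert the second and fourth answers on the 1-7 scale."""
    return [
        (SCALE_MIN + SCALE_MAX - value) if index in REVERSED_ITEMS else value
        for index, value in enumerate(answers)
    ]


def attitude_score(answers):
    """Average of the four answers after inversion; higher = more negative."""
    if len(answers) != 4:
        raise ValueError("expected four answers")
    return mean(reverse_score(answers))


def cronbach_alpha(rows):
    """Cronbach's alpha for per-participant lists of reverse-scored items."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

For example, `attitude_score([7, 1, 7, 1])` yields the maximum score, since the two inverted items become 7 as well.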

3.4.4 Elicited positive and negative responses

We measured the number of times participants reported a feeling of ease (positive responses) or distress (negative responses) with the VAT. For each condition, the two counts were used as an indication of positive and negative responses elicited.

3.4.5 Factors associated to positive and negative responses

As described in Section 3.3, factors associated with positive and negative responses were collected through the participant’s interview and stored in the VAT. The aim was to identify the VX factors (and other possible factors) that had a greater impact in eliciting responses in participants.

3.4.6 Qualitative interview

After conducting the first part of the interview with the VAT as described in Section 3.3, we further interviewed participants to gather comments about how to improve the system. The experimenter asked a sequence of structured open-ended questions, following the diagram in Fig. 4. The data collected were analyzed by employing thematic analysis.

Fig. 4 Flow diagram of the structured qualitative interview. The interview was conducted in Italian, all sentences have been translated here into English for reader's convenience

3.4.7 Hypotheses

We formulated the following hypotheses:

H1. The three conditions would produce three different, increasing values of anxiety because the three sets of behavior of the VX aim at making the VX appear friendly in condition A, only partially friendly in condition B, and unfriendly in condition C. We expect that the increasingly less friendly behavior of the VX elicits a correspondingly higher level of anxiety in participants.

H2. The three conditions would produce three different, increasingly negative perceptions of VX attitude because the VX has been designed to behave in a friendly way when it performs set A, and to increasingly reduce its friendliness when it performs sets B and C.

H3. The three conditions would produce three different, decreasing (resp. increasing) counts of positive (resp. negative) responses elicited because the VX’s behavior is more friendly when it performs set A, less friendly when it performs set B, and unfriendly when it performs set C. We thus expect that sets A, B, and C elicit a progressively decreasing (resp. increasing) number of positive (resp. negative) responses in participants.

3.5 Procedure

We aimed at maintaining study design consistency with previous studies that assessed the feasibility of VRE systems for social anxiety [52, 55], fear of public speaking [21, 22] and test anxiety [8, 9]. Participants were tested individually in a 50-minute session that started with filling in the SIAS. To prevent the spillover of anxiety elicited by one condition to the next, each condition was preceded by a two-minute period during which participants sat in a comfortable position while listening to calm music and watching a series of relaxing images of natural scenarios (such as forests, pools and hills) on a 24-inch desktop monitor positioned in front of them. Participants were asked to imagine being the student in the oral test, and were told that they did not need to answer the questions audibly. After exposure to each condition (A, B, C) via the desktop monitor, participants filled in the VAS-A and VX attitude questionnaires. The same sequence (relax, exposure, questionnaire) was followed for each condition. Participants were exposed to the three conditions in counterbalanced order as illustrated in detail by Fig. 1. After experiencing all the conditions, participants replayed the conditions in the same order in which they had been presented, using the previously described VAT to mark the time instants of the experience in which they remembered having felt at ease or distressed. Participants were then interviewed: for each time instant marked by the participant with the VAT, the experimenter watched the corresponding part of the experience in the VAT together with the participant, and discussed with him or her the reasons why that experience had caused a positive or negative reaction. Then, the experimenter categorized the participant’s answer in the VAT. Finally, the experimenter asked a sequence of structured open-ended questions, following the diagram in Fig. 4.

4 Quantitative results

All analyses were conducted using SPSS version 28.0.1.0. A repeated-measures ANOVA was used to compare the effects of the VX’s behaviors (A, B, C) on VAS-A, VX attitude, and counts of elicited positive and negative responses. If Mauchly’s test indicated a violated assumption of sphericity, degrees of freedom were corrected using Greenhouse-Geisser estimates of sphericity. ANOVA results are shown in Table 5.

Table 5 ANOVA results
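For readers who want to see what the analysis computes, the following is a minimal pure-Python sketch of a one-way repeated-measures ANOVA and of the Greenhouse-Geisser epsilon that rescales its degrees of freedom. It is not the SPSS procedure used in the study: the function names are hypothetical, the input layout (one row per participant, one column per condition) is an assumption, and the p-value, which would come from the F distribution with the corrected degrees of freedom, is omitted.

```python
from statistics import mean


def rm_anova(data):
    """One-way repeated-measures ANOVA.

    data: one row per participant, one score per condition.
    Returns (F, df_condition, df_error) without sphericity correction.
    """
    n, k = len(data), len(data[0])
    grand = mean([x for row in data for x in row])
    cond_means = [mean([row[j] for row in data]) for j in range(k)]
    subj_means = [mean(row) for row in data]
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_cond - ss_subj  # condition-by-subject residual
    df_cond, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_cond / df_cond) / (ss_error / df_error)
    return f_value, df_cond, df_error


def greenhouse_geisser_epsilon(data):
    """Greenhouse-Geisser estimate of sphericity from the covariance matrix."""
    n, k = len(data), len(data[0])
    cond_means = [mean([row[j] for row in data]) for j in range(k)]
    # Sample covariance matrix of the k conditions
    cov = [
        [
            sum((row[i] - cond_means[i]) * (row[j] - cond_means[j])
                for row in data) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]
    diag_mean = mean([cov[i][i] for i in range(k)])
    grand_mean = mean([c for row in cov for c in row])
    row_means = [mean(row) for row in cov]
    numerator = (k * (diag_mean - grand_mean)) ** 2
    denominator = (k - 1) * (
        sum(c * c for row in cov for c in row)
        - 2 * k * sum(m * m for m in row_means)
        + (k * grand_mean) ** 2
    )
    return numerator / denominator
```

The corrected degrees of freedom are obtained by multiplying both `df_condition` and `df_error` by epsilon, which lies between 1/(k-1) and 1 for k conditions.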

ANOVA revealed a main effect of condition on state anxiety measured with VAS-A, and Bonferroni post hoc comparisons found a significant difference for all pairs of conditions (A vs. B, p<0.01; B vs. C, p<0.01; A vs. C, p<0.001).

ANOVA revealed a main effect of condition on perceived VX attitude, and Bonferroni post hoc comparisons found a significant difference for all pairs of conditions (A vs. B, p<0.001; A vs. C, p<0.001; B vs. C, p=0.001).

ANOVA revealed a main effect of condition on the counts of both positive and negative elicited responses. Bonferroni post hoc comparisons found a significant difference for all pairs, both with positive responses (A vs. B, p<0.001; A vs. C, p<0.001; B vs. C, p<0.05) and negative responses (all three pairs, p<0.001).

To explore if participants with different levels of social anxiety were affected differently by the proposed system, for each dependent variable we contrasted the group of participants with an above-median score (n=18) with the group of remaining participants (n=18). The median score in the SIAS was 30.50 (M=30.64, SD=12.20, range 10-60). A 3x2 mixed design ANOVA (between-subjects: above-median vs. below-median SIAS scores, within-subjects: A vs. B vs. C) revealed no significant differences between the two groups for any measure, and no interaction between the two variables.
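The median split used for this exploratory comparison can be sketched as follows. The function name and the example scores in the usage note are made up for illustration; they are not the study's data.

```python
from statistics import median


def median_split(scores):
    """Split participants into above-median and remaining groups.

    scores: mapping from participant id to SIAS total score.
    Returns (median, ids above the median, ids of remaining participants).
    """
    cut = median(scores.values())
    above = [pid for pid, score in scores.items() if score > cut]
    remaining = [pid for pid, score in scores.items() if score <= cut]
    return cut, above, remaining
```

With an even number of participants and no scores tied at the median, as in `median_split({1: 10, 2: 20, 3: 41, 4: 60})`, the two groups have equal size.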

5 Qualitative results

To conduct the qualitative analysis of the participants’ interviews, we initially transcribed the audio recordings verbatim. Then, following Braun and Clarke’s method [59], we performed a thematic analysis on the transcripts, identifying and categorizing common and prominent themes. The analysis included the following:

  1. Reading and re-reading transcripts to become familiar with the data;

  2. Coding features of interest in the dataset and collating relevant data for each code;

  3. Grouping all codes into potential themes, collecting all pertinent data for each theme, and organizing themes into levels (i.e., major themes or sub-themes within them). When a subject was particularly large or complex, it was split into one or more sub-themes;

  4. Determining the significance of the themes and sub-themes with respect to the coded extracts and the whole dataset;

  5. Refining each theme and sub-theme, generating clear definitions and names.

The corresponding author performed the steps described above and coded the data. However, since the process of defining the codes and applying them to the dataset can be biased by subjective interpretation [60], the validity and reliability of the themes must be confirmed through additional coding. Following [61], the data were coded also by an independent external coder, who was not involved in our research, and used a codebook we provided. The codebook listed the themes and sub-themes identified by the thematic analysis. For each code, it provided a label and a complete description with inclusion and exclusion criteria. We also explained to the external coder that he could use multiple codes on the same text fragment. Both coders used Taguette [62], an open-source web-based CAQDAS (Computer Assisted Qualitative Data Analysis Software), to code the data. SPSS version 28.0.1.0 was used to compute Cohen’s Kappa to assess inter-rater reliability [63, 64]. The overall kappa coefficient was 0.80, showing a strong level of agreement [65].
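As an illustration of the agreement index, Cohen's kappa compares the observed proportion of agreement between two coders with the agreement expected by chance from each coder's label frequencies. The sketch below is a simplified version, not the SPSS computation used in the study: it assumes exactly one code per text fragment (whereas the coders here could apply multiple codes), and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who each assign one label per fragment."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty fragments")
    n = len(labels_a)
    # Proportion of fragments on which the two coders agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from the marginal label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A value of 1 indicates perfect agreement, while 0 indicates agreement no better than chance.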

Themes fell into three topic areas:

  • Suggestions: themes capturing suggestions for enhancing the system;

  • System: themes related to system features;

  • Voice: themes related to the VX’s voice.

In the following, we summarize the themes and sub-themes of the three topic areas. More details can be found in Appendix 1.

Tables 6, 7, and 8 summarize the themes and sub-themes belonging to the topic areas of Suggestions, System, and Voice, respectively.

Table 6 Themes and sub-themes of the Suggestions topic area
Table 7 Themes and sub-themes of the System topic area
Table 8 Themes and sub-themes of the Voice topic area

6 Discussion

6.1 Elicited anxiety

Results confirmed the feasibility of a desktop VR system for potential use in VRE for test anxiety. Indeed, they confirmed our hypothesis on anxiety (H1): conditions A, B, and C elicited different, increasing anxiety levels in participants. The ability of the VRE system to elicit different levels of anxiety extends the findings of other studies of VRE systems for test anxiety [8, 9]. Those systems simulated written exams, used immersive VR, and exposed participants to different conditions in a fixed order. In contrast, our study focused on the simulation of oral exams using desktop VR and exposed participants to different conditions in a counterbalanced order to prevent order effects. Moreover, no significant differences were found between participants with above-median SIAS scores and the remaining participants. This result supports the feasibility of the proposed system in eliciting different levels of anxiety regardless of participants’ trait social anxiety.

6.2 Perceived VX’s attitude

Results confirmed our hypothesis on perceived VX’s attitude (H2): the three conditions produced increasing negative perceptions. The ability to elicit increasing levels of anxiety by changing the attitude displayed by the VX extends the results of the literature. Studies of VRE systems for fear of public speaking [20, 22] showed that exposure to a group or crowd of agents with different attitudes can elicit different levels of anxiety in participants. Our study showed that similar results can be obtained with a single virtual agent. A previous study showed that the attitudes of a virtual job interviewer, controlled by the experimenter with a Wizard-of-Oz technique (i.e., the user perceives a direct interaction with the virtual job interviewer, while in fact the experimenter decides and commands the responses of the virtual interviewer in real-time), can elicit different levels of anxiety [52]. Our study showed how similar effects can be obtained in a virtual oral exam administered by an agent that is controlled by sets of predefined behaviors.

6.3 Participants’ positive and negative responses

Unlike previous virtual agent and VRE studies, we created a visual annotation tool (VAT) to identify the time instants in which participants felt positive or negative affect. The VAT also supported an interview to collect factors that contributed to participants’ positive or negative responses. It is worth noting that information collected through the VAT is also useful for further guiding the development of the system because it helps in identifying the sources of participants’ ease or distress. Knowing which aspects of the VE make participants feel more at ease or more distressed makes it possible to further improve the VRE system by inserting the right stressors in the different scenarios of the system. For these reasons, its use could be considered for other VRE studies. The VAT allowed us to obtain information that would not have been collected using the questionnaires alone. Results collected through the VAT confirmed our hypothesis on elicited positive and negative responses (H3): the VX’s behaviors produced three different, decreasing counts of positive responses and three different, increasing counts of negative responses. Counts of positive and negative responses elicited in participants were consistent with the increasing anxiety elicited by the three conditions (A, B, C), and were also consistent with the increasingly negative perception of the VX’s attitude. Results obtained by using the VAT in interviewing participants indicated expression and posture of the VX as particularly influential factors, consistent with previous studies that highlighted the relevance of virtual agent posture and facial expressions in eliciting users’ emotions [40]. Several participants remembered that they felt a negative response when the VX suspended eye contact with them, consistent with studies that highlighted the importance of eye contact in positive interpersonal interaction [41, 42].

Additional participants’ comments revealed that positive responses were often elicited when the VX appeared interested and listening, or calm and relaxed. On the contrary, negative responses were often elicited when the VX appeared distracted, uninterested, judgmental or impatient. Some participants mentioned that their negative responses originated at times from a negative perception of their performance during the simulation rather than from specific VX’s behavior.

In summary, participants’ feedback indicated that the facial expressions, posture, eye contact, level of interest, and general attitude of the VX played a central role in influencing participants’ emotional responses.

6.4 Participants’ feedback

In the final interview, the system received a high level of consensus: 25 participants stated that they would use the system; a few others would use it after some improvement (n=3). Twenty-three participants would use it to train for the oral exam (as a study method, n=6; or to learn how to manage emotional aspects, n=18). The fact that most participants would use the system, and the current unavailability of VRE systems for oral exams, motivate the importance of continuing the research and of further developing the system. The thematic analysis described in Section 5 and Appendix 1 allowed us to identify several comments and suggestions that can be useful to inform the design of VRE systems for test anxiety. It is interesting to note that the aspect of the system on which participants reflected most frequently was the VX’s voice. Five participants suggested changing the VX’s voice to better reflect the VX’s attitude. They suggested improving the VX’s voice to make it more realistic, less flat, and more expressive. In particular, one participant suggested introducing variations in the tone of the VX’s voice so that when the VX does not put him at ease, its tone of voice also reflects this discomfort. Studies of virtual agents have shown that the emotions they convey through facial expressions, head movements, and voice elicit a greater response in the user while the agent is speaking rather than listening [66]. For this reason, changing the VX’s voice to reflect the VX’s attitude may enhance anxiety elicitation in participants. It is worth noting that 5 participants believed the VX’s tone of voice changed between conditions, while it actually did not. The changes they perceived were consistent with the perceived attitude of the VX.
This can be an interesting aspect for further research because it suggests that VX behavior and facial expressions may cause illusory voice changes that are consistent with participants' expectations derived from behavioral cues.

During the interview, participants identified the three conditions as three different levels of difficulty, where set A was the easiest level, set B was the intermediate level while set C was the most difficult level. The order of difficulty perceived by participants was consistent with the increasingly negative perception of the VX’s attitude and the level of anxiety increasingly elicited.

During the interview, most participants’ suggestions to improve the system focused on increasing the level of realism in different aspects, i.e., quality or variety of animations, appearance of the VX or VE, voice or range of expressions of the VX. In particular, four participants suggested that the VX should be customizable not only in terms of behavior but also in appearance, in order to closely resemble the real professor they feared. Allowing students to personalize the appearance of the VX should be considered for future improvements of VRE systems focused on oral exams, because a previous study showed that individuals who gave an oral presentation before virtual agents resembling real people known to them experienced higher anxiety than when the presentation was given in front of unfamiliar virtual agents [67]. Likewise, a VX resembling the student’s real professor in appearance could potentially elicit heightened levels of anxiety, similar to what he or she might experience during an actual oral exam.

6.5 Limitations and future research

To the best of our knowledge, our paper is the first proposal and feasibility study of a VRE system for test anxiety that deals with oral exams. While existing VRE systems for test anxiety simulate predefined written exams whose questions cannot be customized, our system allows users to customize questions and can thus better adapt to students’ needs by offering an experience closer to the exam they are preparing for. However, there are some limitations that should be taken into account. First, the study was conducted on a predominantly male sample. As mentioned earlier (Section 3.1), the possible consequences of a predominantly male sample on anxiety measures have been examined in the literature: several studies show that females tend to report greater anxiety [35] and higher intensities of emotional experience [36, 37] than males. Moreover, females exhibit higher levels of test anxiety than males [68, 69]. Therefore, if male predominance in our sample influenced the results, it might have attenuated the intensity of anxiety. Replicating the experiment with a gender-balanced sample might thus produce higher anxiety values and provide a more accurate representation of anxiety experienced across genders. Second, all participants in the study shared a common academic background in Computer Science and therefore possessed a similar set of skills, academic experiences, and perspectives related to this discipline. This common background might not have fully captured the variability and nuances that could have emerged from participants with different academic backgrounds. As a result, the conclusions of this study might be limited in their generalizability to other fields. Third, we asked participants to use the VAT to mark the time instants of the experience in which they remembered feeling at ease or distressed during the first exposure. This may introduce a potential recall bias.
We could have asked participants to mark their affective responses during the exposure itself, but we ruled out this possibility to maintain their focus on the task and enable them to identify more closely with the role of the student in an oral exam. In this way, they should be more likely to feel emotions similar to those they would experience during a real oral exam. Furthermore, we opted to introduce the VAT only after participants had experienced all three conditions, so that during exposure they would not think about which emotions they should later mark with the VAT and their attention would remain primarily on the task. Fourth, the evaluation of the system focused on assessing its feasibility in eliciting anxiety in students, but the system was not used in an intervention. Therefore, our next research step will be to carry out a study of the system in a home program for reducing test anxiety using a gender-balanced sample of participants. We plan to conduct a study in which participants will be able to define a pool of questions from which the virtual examiner will randomly select its questions. Furthermore, during the oral exam simulation, participants will be able to answer the questions aloud while the virtual examiner randomly performs the behaviors of the selected set. Finally, participants will be able to self-assess their performance to track their progress over time. The study will also explore the efficacy of the VRE system in reducing anxiety in participants by comparing their anxiety before and after the treatment period.

7 Conclusions

In this paper, we proposed a VRE system for test anxiety that deals with oral exams. In the three scenarios offered by the system, a virtual examiner conducts the oral exam, performing behaviors from one of three predefined sets, which differ in how friendly the examiner behaves. To the best of our knowledge, this is the first feasibility study of a VRE system for test anxiety that deals with oral exams. Results of the quantitative study showed that the three scenarios of the proposed system are able to elicit three different levels of increasing anxiety. The thematic analysis of the interview conducted with the participants provided further insights into the aspects of the system that contributed to eliciting positive or negative responses in them, which can help in improving the design of VRE systems for test anxiety. While this paper showed the feasibility of the proposed system for VRE, our future work will focus on using and evaluating it as a treatment for test anxiety.