1 Introduction

1.1 Virtual reality for psychology research

Virtual reality (VR) systems involve the use of some combination of computer-generated images, audio, and haptics designed to give the user the feeling that the virtual environment is real (Park et al. 2019). In recent years, virtual reality systems have become readily available in research, business, and commercial contexts as a result of improvements in technologies and affordability (Pan et al. 2018). One burgeoning field of application for VR is in the study of human conversations, one of the primary topics of study in psychology research (Yoon and Brown-Schmidt 2019).

Compared to some traditional conversation research paradigms, VR methods offer several notable advantages (Park et al. 2019). In theory, VR offers an enticing solution to the historical tension in the field of psychology between the desire for experimental control and ecological validity, or realism (Parsons 2015). VR allows for the chance to replace the use of static, abstract stimuli with responsive, multimodal, and contextually embedded scenarios, while allowing for near full control of what is presented, along with detailed recording of the behaviours of participants within the tool and potential for measures to be administered within VR. The requirement of developing specific VR tools for research can also improve the reproducibility of findings, as other researchers can more simply make use of the same VR tools.

In terms of tracking behaviours, the specific information recorded depends on the VR system employed, with popular HMD (Head Mounted Display) systems such as the Oculus Quest 2 tracking the head and hands of users (Carnevale et al. 2022). It is also possible to track the full body of a VR user by adding additional tracking devices or employing motion capture systems (Caserman et al. 2020). Eye tracking has also been implemented in HMD systems, with devices such as the HTC Vive Pro Eye having reliable eye tracking directly embedded into the device (Sipatchin et al. 2021). Neural recording methods such as fMRI, MEG, and EEG have also been used in VR studies (Lenormand and Piolino 2022; Li et al. 2021; Roberts et al. 2019; Tehrani et al. 2021), though the degree to which users can freely move while wearing these devices varies based on their portability.

The field of proxemics is a notable example of how the unique affordances of VR methodologies can be applied to psychology research. Proxemics, originally introduced by Hall (1968), refers to people’s perception of, and behaviours related to, the space around them. Today, the tracking capabilities of VR, along with the ability to precisely manipulate stimuli even when moving, have been used to explore the proxemics of topics such as social settings (Duverné et al. 2020), crowds (Dickinson et al. 2019), and conversations (Kolkmeier et al. 2016). On a similar topic, distance perception has also been a popular avenue of study in VR (e.g. Ebrahimi et al. 2018; Ries et al. 2008; Vienne et al. 2020).

1.2 Social VR

The study of conversation is a field of psychology where VR presents real promise for addressing some of the historical issues. Traditional methods for studying elements of social psychology involve the use of trained actors or reducing the number and complexity of stimuli. For example, the Reading the Mind in the Eyes test presents a set of images of eyes and asks the participant to identify the emotion shown (Baron-Cohen and Wheelwright 2001). The inclusion of actors in a study leads to issues around replication, as the specific characteristics and behaviours of the actors influence outcomes (Baumeister 2016). Virtual characters in VR methods provide a compromise to address these problems by allowing for greater consistency and control of the conversational stimuli without fully abstracting from real world scenarios. Virtual characters are categorised as either avatars or agents. Avatars are human-controlled characters in a virtual space, such as the character you control in a videogame. Agents, in comparison, are non-user-controlled virtual characters that are generally human-like in visual design. The sophistication of agents varies in their visual style and fidelity, as well as in their interactivity and responsiveness.

The term social VR is used in different ways to broadly refer to different social aspects of VR. For example, it is often used to refer to commercially available online spaces/programmes designed for multiple users to interact with each other using VR devices (Handley et al. 2022). In research, it is more commonly used to refer to the tools for teaching (Bermejo et al. 2023), training (Howard and Gutworth 2020), therapy (Anderson et al. 2013), and exploring interaction with agents, with single or multiple users (Pan et al. 2018). This review is specifically examining the latter form of social VR, though some of the findings may have relevance to the design considerations of online space social VR systems. For more information on design principles and social experiences of online commercial forms of social VR see reviews by McVeigh-Schultz et al. (2018), Kolesnichenko et al. (2019), Jonas et al. (2019), and Cao et al. (2023), as well as the works of Guo Freeman (e.g. Maloney et al. 2020; Freeman and Maloney 2021; Freeman and Acena 2021).

While the application of VR to the study of conversation has many promising features, the use of VR in psychology is still in its infancy. There are also significant risks involved in treating VR methods as a magic solution that by default provides benefits such as greater ecological validity. Indeed, it is the manner in which VR tools are implemented that dictates their value (Christophers et al., in press). An unintentionally unnerving social interaction in VR with a robotic conversation partner could scarcely be argued to be more ecologically valid than traditional methods. On the methodological side, while no review has been conducted to date on VR conversation studies, the results of reviews of general psychology studies using VR have highlighted notable weaknesses. For example, poor reporting and a lack of open availability of tools were observed in a review conducted by Lanier et al. (2019). In addition, the methodologies employed in VR studies are heterogeneous, making comparison between studies a difficult task (Vasser and Aru 2020).

The lack of standards has led some researchers to refer to the field as the “Wild West” (Birckhead et al. 2019). Given the current state of the field, it is imperative that a shared understanding be established. The use of varied methods and designs of conversational agents and scenarios is not inherently problematic, but for the results of these studies to be meaningful we must be able to determine what characteristics of the tool are responsible for the observed psychological outcomes. To achieve this, the designs of these experiences must be intentionally crafted, with extensive reporting of the specific methodology employed.

One of the primary causes of these disparities is a lack of theory for the experience of conversing with a virtual agent. While there may be theories informing individual elements such as nonverbal gestures (Wang and Ruiz 2021), there is no unified theory of the complete experience. Following Whetten’s (1989) proposal of what constitutes a theory, a model must outline the key factors involved, the relations between these factors, explain these relations, and outline the circumstances in which the model applies. Through this, theories allow for clear communication between researchers and provide a structure for the collection, analysis, and interpretation of data, allowing for cross-comparisons for building a body of research on a topic (Hayes 2023). Toward that aim, it is necessary to first establish the relevant factors and their relations, as well as identify relationships that are as yet untested.

1.3 Contribution

In this paper, we conducted a narrative review and developed a relationship map for the impact of key design features of dyadic conversations with VR agents on experiential outcomes. We use the term relationship map here to refer to a visual representation of the existing findings on the relations between factors, along with our expectations for how those features would interact in cases where they have yet to be directly examined. This relationship map is aimed both at providing the basis for the development of a theoretical model and at providing VR conversation researchers with an overview of the current state of the field and a baseline set of expectations for the impact of design choices.

The specific dyadic conversation format looked at in this paper involves one human interacting with one agent. The selected outcomes are experiential in nature from the perspective of the user and are also based on the common aims of current VR tools, such as wishing to prompt feelings of social stress (Zimmer et al. 2019). Our relationship map was based on the findings of a narrative review of the current literature we conducted.

The narrative review drew from diverse disciplines, including social psychology, conversation analysis, HCI, environmental study, as well as social VR, both general and conversation specific. This decision was motivated by several factors, namely the limited quantity and varied quality of VR conversation studies to date, the multidisciplinary nature of the area, and the aim of identifying knowledge gaps in the current field of VR conversation research.

Using the findings of the review of the state of research for social VR, we developed a relationship map. This map defines expectations for how design features impact qualitative outcomes for dyadic agent conversations. In the narrative review results section, we explore and define the key design features of dyadic VR agent conversations, the relevant experiential outcomes they influence, as well as their relationships with other features.

2 Narrative review results

2.1 User experiential outcomes

The scope of this review in terms of outcomes is purely concerned with the conscious, self-reported outcomes of the user during and following the VR conversation experience. Because of this, no behavioural or physiological outcomes are included. The specific outcomes were chosen based on their relevance to dyadic agent conversations, as well as their popularity as a topic of examination in the field of social VR. One outcome that was originally included in this list was simulator sickness: unintended side effects of VR experiences including dizziness, nausea, and blurred vision. While this is a popular topic of study in VR and an important consideration when designing VR experiences, we consider it to play a more minor role in dyadic conversation experiences due to the typically stationary format of these conversations (Pan et al. 2018). Participants typically sit or stand opposite the agent without significant movement or virtual locomotion, which typically pose the greatest risk of cybersickness (Saredakis et al. 2020). For more information on the topic of cybersickness see the recent review conducted by Tian et al. (2022).

Below, we first present the major categories of user experiential outcomes. Further below, we present the key features of relevant VR experiences and research findings on how they impact (or do not impact) the listed outcomes.

2.1.1 Presence and engagement

Presence refers to the degree to which users feel they are “really there” in a virtual environment, and that elements of that world are “real” (Piccione et al. 2019). Engagement here refers to a temporal set of affective and motivational experiences of the user during conversation (Lohse et al. 2016; Wiebe et al. 2014), with levels of presence being significantly related to levels of engagement (Deriu et al. 2021). Presence in particular is regularly assessed in social VR research, due to its theorised role as a mechanism in the effectiveness of VR therapy (Price et al. 2011) and the emotional impact of VR experiences (Diemer et al. 2015). While results are somewhat mixed, the general findings show that heightened presence can lead to greater emotional responses (e.g. Jicol et al. 2021), particularly for emotions related to arousal such as fear (Diemer et al. 2015).

2.1.2 Social presence

One important element of interaction with social agents is the concept of social presence, the experience that a character is “real” and that you can perceive their thoughts and emotions (Biocca 1997). This is often viewed as a way of assessing how successful a communication system is at emulating face-to-face interaction with a human and is regularly theorised by researchers to lead to greater positive social outcomes (Oh et al. 2018), such as positive emotional experiences. High levels of social presence can, however, also increase psychological discomfort for some individuals, particularly for those who are generally uncomfortable with social interactions (Allmendinger 2010; Cortese and Seo 2012). Findings are somewhat limited and mixed on whether social presence has a direct relationship to general presence, with the overall results suggesting that social presence has a positive correlation with general presence (Bulu 2012; Thie and van Wijk 1998; Zhang and Zigurs 2009). It should be noted that the nature of this relationship, as well as the manner in which presence and its related components should be understood and operationalised are ongoing points of debate in the field (e.g. Latoschik and Wienrich 2022; Skarbez et al. 2018; Slater et al. 2022).

2.1.3 Psychological discomfort

One of the most frequently examined outcomes in psychology studies using VR, psychological discomfort refers to the degree of stress, fear, and/or anxiety that results from the conversation (Somarathna et al. 2022). Psychological discomfort appears to have a bidirectional relationship with presence/engagement, with greater levels of fear leading to increased presence/engagement and vice versa (Diemer et al. 2015; Jicol et al. 2021). Examples of how this is studied in relation to conversation include social stress induction paradigms (Zimmer et al. 2019), public speaking phobia (Jinga et al. 2021), and social anxiety (Kerous et al. 2020). Psychological discomfort is naturally interlinked with other forms of emotional experience.

A distinction is drawn in this relationship map between psychological discomfort and other emotional experiences. This distinction is not intended to suggest that these are perfectly discrete categories of outcomes, but is rather motivated by the resounding popularity of phobic research and therapy in the VR space. In addition, the results of studies examining the relationships of core elements of VR such as presence with nonphobic emotions are distinct when compared to similar work on fear-related emotions (Diemer et al. 2015).

2.1.4 Non-fear-related emotional experience

This outcome encompasses the emotional experiences of users during the VR conversation that are not directly associated with fear, such as joy, sadness, and relaxation. Emotions are one of the core interests of psychological research, and prompting specific emotions has been a popular aim in both psychology and therapeutic research using VR (Somarathna et al. 2022). In comparison to the relationship between psychological discomfort and presence, the relationship between non-fear-related emotion and presence is less consistent and does not appear to have the same bidirectionality (Diemer et al. 2015).

2.2 Key design features

This set of key design features for dyadic agent conversations was selected based on a combination of their importance to the topic, along with how frequently they varied between and within studies in the area of social VR. The proposed relationships are not intended to be reductive, universal rules, but rather to inform the reader of current findings and expectations of the area and to prompt consideration on how best to apply those features when developing VR tools, based on the specific aims and qualities of the project.

Some features that were considered but ultimately omitted from the relationship map include the type of VR device used. This was removed as a standalone feature due to the wide array of devices currently available and the lack of research directly comparing the impact that the VR device type as a whole has on the identified outcomes for social VR experiences. In its stead, relevant components are included in other features below, such as the display fidelity, level of control agency, haptic-based interpersonal touch, and audio quality. Technology will continue to advance and shift regarding VR, making focusing on the affordances of those devices a potentially more valuable avenue of study.

For each feature in this section, a definition and background are provided, followed by a description of the feature’s relationship with the experiential outcomes and with other features (see Fig. 4 for the full relationship map).

2.2.1 Scenario agency

One element explored in several VR papers is the effect of the VR scenario being active or passive in nature. Here, we define agency in VR scenarios as being two-pronged, involving both the level of ability to freely move throughout the space, and the degree to which the user can impact the scenario. A VR scenario is active when the participant is involved and has control over the experience, such as being able to freely move around an area and talk to whomever they choose. In contrast, passive scenarios only allow users to look around as a predetermined experience plays out around them. Freedom of movement in VR can vary from three degrees of freedom (tracking head rotation, but not movement) at the most restrictive, to six degrees of freedom (translating the user’s movement through space) and the ability to move to other areas in the scene through some form of virtual locomotion/steering (movement in VR that exceeds the physical input, usually from a handheld controller) or teleportation.
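The difference between three and six degrees of freedom can be illustrated with a minimal sketch. The Python below is illustrative only: the `HeadPose` type, the `tracked_pose` function, and all field names are hypothetical and do not correspond to any particular VR device or SDK. It assumes a simplified pose made up of three rotation angles and three position coordinates, and shows that a 3DoF system would pass on head rotation while discarding positional movement, whereas a 6DoF system would pass on both.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HeadPose:
    """Hypothetical head pose: rotation in degrees, position in metres."""
    yaw: float = 0.0
    pitch: float = 0.0
    roll: float = 0.0
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0


def tracked_pose(physical: HeadPose, degrees_of_freedom: int) -> HeadPose:
    """Return the pose the virtual viewpoint would receive (illustrative).

    3DoF: rotation is tracked, translation is ignored.
    6DoF: rotation and translation are both tracked.
    """
    if degrees_of_freedom == 6:
        return physical
    if degrees_of_freedom == 3:
        # Keep yaw/pitch/roll, but pin the viewpoint to a fixed position.
        return replace(physical, x=0.0, y=0.0, z=0.0)
    raise ValueError("degrees_of_freedom must be 3 or 6")


# The user turns their head 30 degrees and leans 0.2 m forward.
moved = HeadPose(yaw=30.0, z=0.2)
print(tracked_pose(moved, 3))  # rotation kept, lean discarded
print(tracked_pose(moved, 6))  # rotation and lean both kept
```

In this simplified picture, a user leaning towards or away from an agent would see no change in viewpoint under 3DoF, which is one reason the degrees of freedom available may matter for how much agency a conversation scenario can offer.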

In the domain of pain management, a study conducted by Phelan et al. (2019) found that active scenarios extended the pain thresholds of participants and were rated as being more engaging and immersive as compared to the passive scenarios. While pain management has been the area that has most commonly examined the impact of active and passive scenarios (Boylan et al. 2018; Furness et al. 2019), similar findings of improved engagement and immersion for active scenarios were reported in studies looking at skills training (Piccione et al. 2019), and social anxiety (Sekhavat and Nomani 2017). For social presence specifically, greater agency is also associated with greater levels of social presence (Fortin and Dholakia 2005; Oh et al. 2018; Skalski and Tamborini 2007).

To achieve greater levels of scenario agency, VR device setups that include motion tracking of multiple body points provide greater opportunities for interaction. Active scenarios have been found to result in greater levels of presence/engagement compared to passive scenarios (Furness et al. 2019; Piccione et al. 2019). In terms of emotional outcomes, little work has been conducted directly examining the influence of scenario agency levels. The results of a pilot study looking at social anxiety suggest that active scenarios were more impactful (Sekhavat and Nomani 2017), while another study that compared traumatic 2D films and interactive VR scenes found no differences in terms of negative emotional impact (Dibbets and Schulte-Ostermann 2015). A more recent study conducted by Jicol et al. (2021) directly examined the relationships between emotions (happy vs fear), agency, and presence. They found that agency was positively correlated with presence and moderated the impact of emotion on presence.

Active scenarios can moderate the impact of an agent’s proximity on psychological discomfort, as users can freely move around the space and move to a more comfortable distance. Passive scenarios are likely to moderate the impact of a self-avatar on presence, as being unable to move and have the avatar move in line with you removes some of its key benefits (Makled et al. 2018).

In relation to conversations in VR, scenario agency can relate to a variety of aspects including whether the user can freely move around the area, whether they can initiate conversations or are forced into them, and the degree of responsiveness of the conversation agents. In general, active scenarios are more impactful, and we would argue that some level of interactivity should be included in VR conversation tools where possible to better immerse the user.

2.2.2 Visual fidelity

In terms of technical features of the VR scenario, the visual fidelity of the experience is a key consideration. Visual fidelity has a layered meaning in the field of VR, including both the technical features such as texture resolution, as well as the design which is considered in terms of realism (realistic vs. cartoon). As noted by Vasser and Aru (2020), two studies that purport to address the same topic can have wildly varying levels of visual fidelity, leading to potential reliability issues. While the general trend in the field has been a push towards achieving realistic visuals, there remains debate on whether this is important to the experiences and study outcomes (Slater et al. 2020). At the most basic level, more realistic visuals may lead to users feeling more present in the scene, and greater engagement (Riva et al. 2019; Rizzo and Koenig 2017).

As visual fidelity and realism increase so too do the risks of inducing the “uncanny valley” effect. This is the idea that virtual characters can cause feelings of eeriness and aversion the more realistic and human-like they are. Findings on the effect suggest this comes as a result of inconsistent realism between elements of the character, such as their visuals and behaviours. For example, the more realistic the character looks, the greater our expectations for them to be lifelike and natural feeling (Kätsyri et al. 2018). In order to avoid this problem, for research questions that are not contingent on realistic visuals we would argue it is often preferable to make use of moderate levels of fidelity.

On the topic of conversation, a study that looked at the influence of visual fidelity on anxiety in a job interview scenario found that the visuals had no significant impact on anxiety, with the level of anxiety appearing to be more related to the scenario and the sensitivity of the participant (Kwon et al. 2013). For simulation training programmes, the visual fidelity of the hardware was also non-significant as a moderating factor for their effectiveness (Appel et al. 2020). For an example of a tool with moderate fidelity visuals see Fig. 1. Using moderate fidelity visuals also has the added benefit of reducing the complexity and development time of a VR tool.

Fig. 1 Example of a moderate fidelity VR tool used by Christophers et al. (2023). This café scene includes a social agent sitting at the opposite end of a table from the position of the user, including background characters and objects

Along with degrees of realism, other design elements of agents have been investigated. In line with findings on self-similarity, where people tend to like and trust other people who look like them (Byrne 1971; Montoya et al. 2008), participants who had virtual conversations with other participants rated avatars who looked similar to them as being more likeable and less eerie compared to dissimilar avatars (Shih et al. 2023).

While this feature is multi-layered, general findings suggest that more realistic visuals inspire greater levels of presence and engagement (Vasser and Aru 2020). In line with the findings related to the uncanny valley, findings on the impact of the realism of character models on social presence suggest that greater visual realism only consistently leads to increased social presence when the level of behavioural realism is appropriate (Bailenson et al. 2005; Garau et al. 2003). The impact of the virtual reality device’s display, in terms of image definition and display size, has also been examined by a small number of studies, with two (Ahn et al. 2014; Bracken 2005) finding that better displays resulted in higher levels of social presence, and another two finding no effect of display fidelity (James et al. 2011; Skalski and Whitbred 2010). These display characteristics are partly or fully determined by the technical features of the VR device used.

The visual fidelity of self-avatars has been found to mediate their influence on presence and embodiment (Gorisse et al. 2019). Visual fidelity can also mediate the influence of nonverbal signals on both categories of emotional experiences as more realistic visuals can result in greater expectations for those motions to be lifelike and risk uncanny valley effects (Mori et al. 2012). With that said, combining higher levels of visual realism with appropriate behavioural realism has been found to lead to higher levels of positive affect towards the characters (Ferstl et al. 2021b; Zibrek et al. 2018; Zibrek and McDonnell 2019). Lastly, the visual fidelity of the environment mediates its impact on emotional experiences (both categories), with more realistic environments generally heightening the emotional impact (Newman et al. 2022).

2.2.3 Inclusion of a self-avatar

Another technical feature of VR tools that has received attention is the self-avatar. When using a head mounted display (HMD), a VR device worn on the head, users are no longer able to see their actual body as their field of view is covered by the device’s screen. To address this, virtual bodies that match the movements of the user, known as self-avatars, have been implemented in some tools (Pan et al. 2018). Research looking at the impact of self-avatars has found that they can lead to greater user engagement, presence, and sense of embodiment in the scene (Parmar et al. 2022; Wagnerberger et al. 2021; Young et al. 2015). On the more physical side of things, several studies have demonstrated that having a virtual self-avatar can aid in distance perception tasks within VR (Lin et al. 2015; Phillips et al. 2010; Ries et al. 2008). Recent results also suggest they could enhance performance or training effects in general (Birk and Mandryk 2019; Friehs et al. 2022; Ratan et al. 2022).

The specific design of avatars can also have an impact, with one study finding that having a self-avatar that is dissimilar to the user can reduce the amount of social anxiety experienced, when compared to having an avatar that looks like their real life self (Aymerich-Franch et al. 2014). Likely the most prominent finding from this area of study is the Proteus effect, which suggests that users generally act in line with how they would expect their avatar to behave (Praetorius and Görlich 2020). For example, participants who exercised with an obese avatar showed decreased physical activity (Peña et al. 2016). A review of this effect carried out by Ratan et al. (2020) indicates that effect sizes are relatively consistent, falling between small and medium (0.22–0.26).

Offering users the ability to customise their avatars can also influence a variety of outcomes. While most studies on avatar customisation have focused on non-VR contexts such as health intervention tools or video games, findings from these areas can help provide expectations for the effects of implementing them in VR tools. As an example, the creation of customised self-avatars appears to lead to increased motivation for participants using digital self-improvement programmes (Birk and Mandryk 2018, 2019; Darville et al. 2018). In line with the Proteus effect, prompting participants to make specific types of self-avatars can also impact behavioural and qualitative outcomes (Peña et al. 2022; Sah et al. 2017), such as students who created avatars that resembled their actual selves performing better than those who created ideal self or future self-avatars (Ratan et al. 2022). Additionally, participants who could personalise their avatars reported higher levels of presence during an experience where they performed various motions in front of a mirror (Waltemate et al. 2018), though this may be a more pointedly embodied experience than typical social VR experiences.

The inclusion of these self-representations is linked with greater feelings of presence and embodiment in the space (Caserman et al. 2020; Parmar et al. 2022). The nature of the avatar, such as its visual characteristics, can influence both categories of emotional experience, as the Proteus effect suggests that users will, to a moderate degree, match their behaviours and mindset to their expectations of the avatar (Praetorius and Görlich 2020; Vahle and Tomasik 2022). While the influence of self-avatars on social presence when in conversation with an agent has not been specifically investigated, results from virtual interactions between participants suggest that full self-avatar embodiment results in greater levels of social presence (Aseeri and Interrante 2021; Cho et al. 2020; Heidicker et al. 2017; Yoon et al. 2019). For a more in-depth examination of self-avatar design choices, see Weidner et al. (2023).

2.2.4 Non-verbal signals

Nonverbal forms of communication are an important element of conversation (Gatica-Perez 2009), leading researchers to regularly develop sets of these behaviours for virtual agents (Wang and Ruiz 2021). The nonverbal avenues of communication studied for social agents are primarily physical in nature, including facial expressions, gestures, and touch. One common application of reactive emotional expressions is in the domain of public speaking exposure therapy. El-Yamri et al. (2019) developed a system where the emotional expressions of the audience in a VR public speaking setting would react to the voice tone, speech content, and gaze behaviours of the user. The emotional valence of the audience in turn can influence the level of anxiety experienced by the speaker (Jinga et al. 2021).

Gestures are motions and poses typically made with hands and arms as part of communication, generally in combination with speech. Their form and function have previously been categorised into systems such as those described by McNeill (1992). These gestures range from small beat gestures that move in rhythm with words, to pointing-based deictic gestures that indicate a location, and emblem gestures that represent objects or concepts, sometimes in place of words. For social agents, gestures are one of the most commonly implemented forms of nonverbal signals (I. Wang and Ruiz 2021). With the aim of improving the believability and level of expressiveness of the agents, behaviours including nodding (Cassell et al. 1999), beat gestures (Mancini et al. 2011), and emblem gestures (Rickel and Johnson 1999) have been developed. One study compared two forms of agent nodding behaviour (Aburumman et al. 2022). While listening to participants, agents would either exhibit fast nodding, or nodding that mimicked the users’ nods with a short delay. Both implicit and explicit measures demonstrated that participants liked and trusted the agents that mimicked their nodding. This highlights the importance of nonverbal communication being delivered in an appropriate manner. This is reinforced by the findings of Conrad et al. (2015), who found that while agents who displayed more facial expressions prompted more acknowledgements and smiles from participants, they were rated as being less natural. To deliver nonverbal signals appropriately, systems have been developed with the aim of automatically generating nonverbal gestures based on the characteristics of vocal speech (e.g. Marsella et al. 2013; Yang et al. 2020). A recent model developed by Ferstl et al. (2021a) was rated as being significantly more appropriate when compared to randomly generated gestures (see Fig. 2 for an example gesture sequence).

Fig. 2 Example agent gesture sequence generated by the ExpressGesture system (Ferstl et al. 2021a)

The inclusion of social touch in conversation has also shown promise in the field of VR agent research. These studies generally make use of mixed reality methods, combining HMD VR systems with either physical props or haptic feedback synced up with the touch of a social agent. As an example, Hoppe et al. (2020) created an artificial hand that was used to apply social touch to participants, in this case, a tap on the shoulder. They found that the inclusion of this touch led to participants reporting greater presence for the agent they were interacting with, as well as greater uncertainty of the distinction between avatars and agents. In the domain of economic bargaining, social touch delivered through haptic vibrations was found to generally increase compliance with unfair offers (Harjunen et al. 2018). These findings demonstrate the persuasive and engaging power of touch in VR applications.

The posture held by agents is another avenue of nonverbal communication that has been studied, though less commonly compared to gestures and gaze behaviours (I. Wang and Ruiz 2021). Posture involves the orientation of one’s body, and plays a part in communicating emotion and intention (Dael et al. 2012). Typically implemented in agents intended for therapeutic use cases, agent postures are manipulated with the intent of improving rapport (e.g. Gratch et al. 2006; Kang et al. 2008; Huang et al. 2011; DeVault et al. 2014). There have been few studies to date examining the impact of agent postures on psychological outcomes. Results from a study examining postural mirroring during a job interview found no significant differences in terms of stress or presence, though the female agent was rated as warmer when it exhibited postural mirroring (Antonio Gómez Jáuregui et al. 2021). Comparing an agent displaying open and closed body language, Li et al. (2018) also found no impact on presence.

Lastly, one of the most developed and commonly studied sets of features in VR is gaze behaviour. Gaze behaviours can provide cues to coordinate the flow of conversation (Kendon 1990), and indicate attentiveness (Heylen 2006). In terms of avatars, gaze behaviours that match the speech of the user were viewed more positively and resulted in greater social presence ratings (Garau et al. 2003). In a study on the single and joint effects of agent gaze and proxemics during interaction, the gaze and proxemics responses of participants were strongest when agents manipulated both at the same time (Kolkmeier et al. 2016). The implementation of an algorithm that matched gaze patterns of virtual agents to categories of emotional states also resulted in increases in the sense of general presence in the scenario (Randhavane et al. 2019a, b).

The nature of the signals performed by the agent, such as their valence, frequency, and format (e.g. expression, gestures, touch) has been shown to influence the emotional experiences of users (El-Yamri et al. 2019). Poorly implemented or unnatural nonverbal communication from the agent can also lead to reduced presence, social presence and engagement (Conrad et al. 2015; Garau et al. 2003; Oh et al. 2018). For social presence in particular, a study was conducted that made use of a model to generate movements (gait, gesture, gaze) for agents with the intent of being friendly and likeable, based on their interactions with a user (Randhavane et al. 2019a, b). The application of their model increased the level of reported social presence, once again emphasising the value of nonverbal signals being responsive and based on the ongoing interaction. It is worth noting that personal conversation style preferences and contextual factors of the conversation may moderate the value of mirroring (Aneja et al. 2021; Wang and Ruiz 2021). The value of nonverbal signals in fostering a sense of social presence has been similarly supported by a range of studies, with higher levels of behavioural realism resulting in higher levels of social presence (Bailenson et al. 2005; Garau et al. 2005; Guadagno et al. 2007; Nowak and Biocca 2003). For a more in-depth review of agent nonverbal communication see Wang and Ruiz (2021).

2.2.5 Level of agent automation

Virtual social agents can be classified into two categories in terms of their level of autonomy. Fully autonomous agents have no human input and converse with users using some combination of their speech and body language to respond with appropriate communication cues in order to give the feeling of a natural conversation. Semi-autonomous agents, on the other hand, operate using a “Wizard of Oz” setup, where the behaviours and responses of the virtual character are operated out of sight by a researcher or therapist (Pan et al. 2018). These are far less complicated to implement than fully autonomous conversation agents.

One important consideration and potential drawback to Wizard of Oz systems is the added latency in responses and/or nonverbal signals depending on the degree to which the operator controls the agent. As an example, the operator may need time to figure out which of their predetermined responses is most appropriate, and simultaneously select the nonverbal behaviours to accompany the speech. For nonverbal signals in particular, timing is an important element contributing to whether they feel natural to the user (Aburumman et al. 2022; Ota et al. 2021), and for speech, rhythm is a key element of conversation quality (Borrie et al. 2019; Clark 1996), with larger delays potentially leading to more negative perceptions of the conversation partner (Schoenenberg et al. 2014). These problems can also apply to more automated systems depending on the speed at which they analyse and respond.

While several fully autonomous agents have been developed (DeVault et al. 2014; Kahl and Kopp 2018; R. Zhao et al. 2016), to date they have considerable limitations. The first is that they are highly specialised, designed for particular social circumstances and one or two pre-established topics of conversation. The behaviours both recognised and performed by the agent must also be clearly established beforehand by the developers. Lastly, most of these systems currently require the use of non-immersive VR displays in order to allow for the expressions of users to be clearly visible.

At the current stage of development, the expectation is that agents with high autonomy will be more limited, and as such prompt lower levels of presence and engagement (Pan et al. 2018). For social presence specifically, studies to date have typically studied the impact of perceived avatar/agent status rather than their actual level of automation (Oh et al. 2018), with the direct impact of degrees of agent automation being understudied. With that said, findings of a recent meta-analysis suggest that social presence ratings are typically higher for perceived avatars compared to perceived agents (Felnhofer et al. 2023). Additionally, highly autonomous agents run the risk of increasing feelings of psychological discomfort through uncanny valley effects, particularly if the topic of discussion is emotional in nature (Stein and Ohler 2017).

2.2.6 Agent proximity

The proximity of the agents should also be considered. Proximity in this case refers to the position and angle at which an agent is placed in the virtual space relative to the user. When talking, we generally have a preferred distance between us and our conversation partners. This can vary both across cultures (Sicorello et al. 2019) and in the moment in response to how the conversation is playing out (Bönsch et al. 2018). While this preferred personal space includes not wanting to be too close or too far from our conversation partner, recent findings suggest that being too close causes greater and more immediate discomfort when compared to being too far away (Welsch et al. 2019). As a result of this, agents that violate the personal space boundaries of participants will likely result in feelings of discomfort.

The angle at which an agent is standing can also serve as an additional source of nonverbal communication. As an example, a virtual agent developed by Pejsa et al. (2017) altered its body alignment in order to cue the next speaker in a triadic (two humans, one agent) conversation. While dyadic conversation generally takes place either looking directly at one another or at a 90° angle (Kendon 1990), direct orientation appears to make people feel more attended to (Nagels et al. 2015), though as with all nonverbal communication, this is dependent on factors such as culture and gender (Brugel et al. 2015). The characteristics of a user’s self-avatar, particularly their arm length, have also been found to influence interpersonal space expectations (Buck et al. 2022). Additionally, the form of VR used may influence personal space preferences, with one study finding that users preferred larger distances in CAVE systems compared to when in an HMD (Bönsch et al. 2020).
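As an illustration of how the two components of proximity described above (distance and angle) could be derived from tracked positions, the following is a minimal Python sketch. The `Pose` structure, the function names, the coordinate convention, and the distance thresholds are all assumptions introduced here for illustration; none of them are taken from the studies cited in this section, and appropriate thresholds would be expected to vary with culture, context, and display type.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """Hypothetical ground-plane pose of a user or agent."""
    x: float        # metres
    z: float        # metres
    yaw_deg: float  # facing direction; 0 degrees points along +z


def interpersonal_distance(agent: Pose, user: Pose) -> float:
    """Straight-line distance between agent and user on the ground plane."""
    return math.hypot(user.x - agent.x, user.z - agent.z)


def orientation_offset_deg(agent: Pose, user: Pose) -> float:
    """Absolute angle between the agent's facing direction and the user.

    0 means the agent faces the user directly; 90 means it stands side-on.
    """
    bearing = math.degrees(math.atan2(user.x - agent.x, user.z - agent.z))
    # Wrap the difference into [-180, 180) before taking its magnitude.
    diff = (bearing - agent.yaw_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def classify_distance(distance_m: float,
                      near_m: float = 0.8,
                      far_m: float = 2.0) -> str:
    """Label a distance relative to an assumed comfort band.

    near_m and far_m are arbitrary placeholder values, not empirical norms.
    """
    if distance_m < near_m:
        return "too close"
    if distance_m > far_m:
        return "too far"
    return "within band"


if __name__ == "__main__":
    agent = Pose(x=0.0, z=0.0, yaw_deg=0.0)
    user = Pose(x=0.0, z=1.5, yaw_deg=180.0)
    d = interpersonal_distance(agent, user)
    print(d, orientation_offset_deg(agent, user), classify_distance(d))
```

Logging these two values per frame would allow distance and orientation to be analysed separately, which matches the distinction drawn in this section between personal space violations and body alignment as a communicative cue.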

The agent being too far from the user can cause psychological discomfort, with more immediate discomfort potentially arising in cases where the agent intrudes inside the user’s personal space (Hecht et al. 2019; Welsch et al. 2019). Similarly, participants who were asked to walk through agents in augmented reality exhibited heightened physiological arousal and reported qualitatively that the experience felt unnatural and uncomfortable (Huang et al. 2022). The emotions displayed by agents have also been found to influence the size of the personal space of participants, with greater distance being kept when interacting with angry agents (Bönsch et al. 2018).

2.2.7 Environmental context

The environmental context of an immersive virtual environment includes its setting (e.g. an office space; see Fig. 3 for a set of example locations), layout, features, and visual design. This multifaceted set of features is a vital element of most virtual immersive experiences, particularly for building a sense of presence (Newman et al. 2022). The most commonly studied element of environmental context in terms of experiential outcomes is the impact of nature and urbanity.

Fig. 3 Example of a variety of virtual social settings including the outside of a building, a bar, and two locations in a concert hall from a study conducted by Llobera et al. (2021)

To start with nature and urbanity, work in the field of architecture has examined their experiential and behavioural outcomes. The general finding is that exposure to scenes of nature, even just through photographs, leads to stress reduction and physiological relaxation (Jo et al. 2019). This effect is reflected in the improved affect for people living in urban areas with nature fixtures such as parks when compared to those living in other urban areas (van den Bosch and Ode Sang 2017). These findings have motivated the development and study of the efficacy of a series of immersive nature environments, particularly for therapeutic applications (Appel et al. 2020; Blum et al. 2019; Kim et al. 2017). The results of these studies support the potential of nature scenes for relaxation and prompting positive affect.

Looking more specifically at the structure of the environment, a CAVE VR study was conducted looking at the influence of architectural design on physiological stress reactions (Fich et al. 2014). They asked participants to carry out a series of stressful tasks (such as giving a speech) in front of a panel. The layout of the virtual room in which these tasks took place was manipulated, finding that participants displayed significantly greater levels of stress in the closed room (no visible exits or windows) when compared to the open room (three large openings in the wall).

The relevance of these elements is linked with the architectural design concept “visual comfort”, which is the subjective perception of comfort drawn from an individual’s visual environment (Davis and Nutter 2010). Building on this concept, Cha et al. (2020) carried out a VR study looking at the effect of interior colour schemes on emotions, heart rate, and productivity. They found that the colour changes had a significant impact, with blue, white, and green scenarios leading to lower heart rates and red colour schemes being rated as more exciting, unpleasant, and tense. A later study also found that varying the colour schemes of a VR environment impacted both qualitative and physiological measures for participants (Li et al. 2021). While the influence of visual features on psychological factors is a relatively underexplored avenue compared to nature scenes, existing theory and findings highlight the importance of considering how they may influence the results of Social VR studies.

Other elements of the environmental context to bear in mind include the social setting and how populated with characters it is. The social setting refers to the social expectations of the location, as well as an individual’s subjective relationship with the setting. This sociological concept was explored in VR by Duverné et al. (2020), who found that proxemics (people’s perception and use of/movement through space) norms varied according to their subjective relationship with the social setting, with no main effect for the settings themselves. Lastly, a study that employed a VR crowd simulation in a university setting found that the denser the crowd, the greater the levels of negative affect reported, along with differences in proxemics behaviours within the space (Dickinson et al. 2019).

Based on these findings, we would argue that environmental context is a key factor to consider for agent conversations in VR. The social setting of the experience provides a set of expectations for the user in terms of how they should act, as well as how the agent should act. On the more implicit side of things, the layout, visual, and auditory qualities of the environment in which the conversation is taking place will also impact their experience of the interaction.

The setting employed has been found to moderate the impact of visual fidelity on presence and engagement in some cases, as strong sets of expectations for the situation can lead users to pay less attention to the visual features of the scene (Kwon et al. 2013). The environmental context also moderates the impact of an agent’s proximity on psychological discomfort, as different settings provide different sets of expectations for personal distance (Duverné et al. 2020).

2.2.8 Audio features

The audio features of an agent interaction, as part of this relationship map, include any agent vocal utterances, along with any background audio and the quality of the audio system. While voice is a critical element of typical conversation (Moore et al. 2016), it is an understudied element of social agent interactions (Hortensius et al. 2018). With that said, as with many of the outcomes discussed in this paper, findings from face-to-face conversations and purely voice-based agents can be used as a baseline set of expectations to be experimentally validated in the field of research.

Vocal characteristics have been found to influence perceptions of a speaker’s personality traits (Wang et al. 2021) and emotional state (Mehrabian 2008). For example, a higher vocal pitch has been found to lead to greater attribution of feminine traits and likeability (Ko et al. 2006; Krahé et al. 2021; Pisanski and Rendall 2011). These characteristics appear to interact with stereotypes individuals hold about the speaker, such as those related to their gender (Aung and Puts 2020; Jin and Park 2023), with pitch differences in some cases leading to different attributions for men compared to women. We would expect differences in perception of the character based on vocal characteristics of a social agent to impact emotional experience, and potentially psychological discomfort (e.g. feeling the agent is judgemental or cold), but this is an as yet unexplored avenue of research.

The soundscape of the environment is another important consideration. Consisting of the auditory stimuli related to the location, the soundscape includes both background noises and sounds that play in response to your movement and actions. The inclusion of an appropriate and reactive soundscape can enhance feelings of presence and realism (Kern and Ellermeier 2020; Zhao et al. 2021).

While pre-recorded vocal lines are often used for social agents (Pan et al. 2018), rapid advances have been observed in synthetic speech technology in recent years (Tan 2023). Synthetic speech involves output from computer systems that are designed to mimic human speech. The source of input for synthetic speech can range from vocal recordings which are then digitally altered (e.g. pitch frequency, speech rate, vocal tract length), to systems like IBM Watson that only require text (Cabral et al. 2017). At the current time, text-to-speech based systems typically lack elements of human expressiveness compared to systems making use of vocal recordings (Higgins et al. 2022). The applicability of these methods to social agents has also been examined, though primarily for purely audio-based agents. Recent findings suggest that while users accepted and liked the synthetic voices, they still considered them to be more eerie compared to human speech (Mckie et al. 2022). This has been proposed to result from uncanny valley effects (Do et al. 2022; Kühne et al. 2020). In recent studies conducted with synthetic speech combined with virtual agents, findings have largely supported this, with increased eeriness, particularly for perceived mismatches between social cues across modalities (Abdulrahman and Richards 2022; Higgins et al. 2022). For these studies, results were mixed on whether there was a difference in social presence between human and synthetic speech. As it stands, the application of text-to-speech may result in increased psychological discomfort compared to human speech, but based on the rate of advancement this effect may be reduced significantly, or eliminated, in coming years. Additionally, developments in the field of natural language generation are likely the next step for agent voice interaction, potentially allowing for agents to respond in naturalistic manners to users without the need for text supplied by an outside party (Foster 2019).

Regarding emotional impact, a study comparing voice types found that participants had stronger emotional reactions to recording-based synthetic voices and natural voices compared to text-to-speech, with the emotional state of the character also having a direct effect on emotional outcomes (Higgins et al. 2022).

On a more general level, while investigations of the relationship between audio quality and social presence are limited, findings suggest that higher audio quality leads to higher levels of social presence (Christie 1974; Dicke et al. 2010; Skalski and Whitbred 2010). An avenue of potential interest that has been understudied to date is the use of non-semantic verbal interjections (e.g. “hmm”, “argh”, “uhh”), the addition of which to a voice-based agent resulted in greater levels of rapport and enhanced learner motivation (Ceha and Law 2022). Another is the inclusion of dynamic speech directivity, where different types of vocal noises influence the direction of the sound in line with real life speech (Arai 2001). A series of recent studies implemented dynamic audio systems towards this end, though results have been unclear in terms of how much these systems affect ratings of naturalness, with participants generally not being able to identify the difference between static and dynamic systems (Ehret et al. 2020; Noufi et al. 2023; Sugimoto and Kinoshita 2023).

3 Discussion and conclusion

3.1 Limitations

One of the limitations of this paper’s model is its scope. Other conversation setups are present in the VR literature, such as group conversations with an increased number of agents (Novick et al. 2018), conversations with other human-controlled avatars (M. Wang 2020), and conversations with a combination of agents and avatars (Pejsa et al. 2017). While some relationships identified in our review will hold true across these scenarios, particularly those related to the environmental context or visual fidelity, the move from dyadic to group conversations involves a significant shift in the dynamics and complexity of the interaction (Cooney et al. 2020).

Another limitation comes from the current field of VR conversation literature. Due to the infancy of the field, we were required to broaden the scope of the narrative review to draw from wide-ranging areas of study. While this had advantages for the review in terms of allowing for a more layered perspective on the topic, the lack of direct research in the area required us to make assumptions about the transferability of in-person findings to VR scenarios.

3.2 Recommendations and conclusion

In terms of recommendations, we argue that the next step for developing the field of social VR is the development of a complete theoretical model of the experience of interacting with a social agent. The findings of this paper can serve as a basis for building this theory. While the field is still in its infancy and many areas of interest are as yet understudied, the development of a model could aid in the development of a shared understanding of the experience, particularly between disciplines (Hayes 2023; Whetten 1989). One of our other primary suggestions is to make research data and tools freely available to other researchers. This would improve the field in two ways. First, it would allow for improved replicability. Second, the availability of these tools would provide better opportunities for researchers to build upon existing paradigms without having to start from scratch each time.

Another recommendation is to take the factors and relationships identified in our relationship map into account during the development of tools that make use of dyadic agent conversations. See Fig. 4 for a summary of the relationships. As well as this, it is vital that future research continues to directly investigate the relationships between design features and user outcomes in VR in order to strengthen our understanding of them, as well as identify their relative importance.

Fig. 4 This table summarises previous findings on the relationships between design features and experiential outcomes as they relate to dyadic agent conversations. Relationships marked with a “?” indicate cases where the relationship has yet to be directly tested. Relationships marked with a “~” indicate that the feature has no direct, innate impacts on the outcome but alters the impact of other features on that outcome

In terms of other opportunities for research in the field, one underutilised feature of VR as a method is the potential for reactive stimuli. While some of the agents discussed in this paper make use of some combination of the user’s speech and movements to inform how they behave, the degree of responsivity could be improved in future. One avenue is the utilisation of physiological tracking data, which has previously been used for customising VR training (Uribe-Quevedo et al. 2021), assessing mental workload (Luong et al. 2020), and emotion recognition (Gupta 2022).

The key features identified in the model highlight the multidimensional nature of VR conversations, including visual, technical, behavioural, and contextual factors for the immersive environment and the design of the agent. Based on this, we would argue that there is considerable value in the inclusion of additional measures when conducting social VR research. For example, rather than simply assessing a participant’s level of anxiety following a job interview scenario, measurements should be taken to assess what specifically contributed to those feelings. Additionally, the diverse behavioural measures that VR affords should be taken into consideration as another source of insight into conversation.

In conclusion, the application of VR methods to the field of psychology, particularly for the study of conversation, is still in its early stages of maturation and is plagued by methodological and theoretical weaknesses. To address this failing, we conducted a narrative review and developed a conceptual model to aid future researchers in making informed design decisions when creating VR methods. Drawing on results from varied fields of research, this relationship map provides a set of expectations for how the design features of VR experiences impact psychological outcomes for the user for dyadic agent conversation scenarios. While exact guidelines cannot be given on the “optimal” levels of various design features at this stage, this model contributes to the field by providing initial expectations of the role these features play in experiential outcomes.

3.3 Interdisciplinary collaboration considerations

Given the multifaceted nature of social VR, interdisciplinary collaboration is a valuable avenue for the advancement of the field, both in terms of technical development and methodological guidance. This paper worked towards the aim of bringing together findings from disparate fields for a better shared understanding, with one of the next steps ideally being for researchers from those fields to collaborate. For example, strides towards photorealistic avatars/agents and, by connection, animation require joint efforts or at minimum an understanding of a variety of fields including 3D modelling, animation, psychology, and anatomy (e.g. Wheatland et al. 2015). On the methodological side of things, as conversational agents become more advanced and naturalistic, the introduction of techniques from domains such as conversation analysis or discourse analysis could be a way of gaining more insight into the conversation dynamics (for an overview see Rapley 2018). On the technology side of things, as social VR tools become more complicated and more commonly integrate multiple users, research on networking (Cheng et al. 2022a, b, c) and the creation of end-to-end systems (Friston et al. 2021) become more important.

Towards the aim of understanding and guiding best practices for interdisciplinary research, the European Commission-funded SHAPE-ID project carried out a review of interdisciplinary research and developed a toolkit to guide researchers on best practices (European Commission 2021). Their toolkit provides resources including case studies of successful collaborations, top tips, guides, and FAQs. We would recommend these resources as a starting point for any researchers interested in collaborating across disciplines on social VR research.

4 Image attributions