1 Introduction

The use of self-avatars, the virtual representation of the user's body seen from their own perspective, is becoming ubiquitous in virtual reality (VR) and mixed reality (MR). The possibility of seeing one's own body while immersed brings many benefits. First, it increases the user's sense of presence (SoP), as it shifts the user from being an observer to truly experiencing the virtual environment (VE) (Lok et al. 2003; Slater and Usoh 1993). It also enhances spatial perception and distance estimation (McManus et al. 2011; Ebrahimi et al. 2018), while having a positive impact on cognitive load (Steed et al. 2016), trust, and collaboration (Pan and Steed 2017). Apart from increasing the SoP, self-avatars also increase the sense of embodiment (SoE) (Argelaguet et al. 2016; Fribourg et al. 2020).

Fig. 1 Self-avatars can be described based on their visibility, fidelity, and dynamics

Figure 1 shows that an avatar can be described based on its visibility, fidelity, and dynamics. The minimal visibility configuration relies only on the users' hands (Dodds et al. 2011); however, the current trend is to go beyond hands and model a partial body (Gruosso et al. 2021) or even the entire human body (Ogawa et al. 2020b). Avatar fidelity accounts for shape and texture realism related to skin and clothes. Existing hand or full-body user representations range from abstract or minimal representations (Argelaguet et al. 2016), such as the stickman (Fribourg et al. 2020), to more realistic ones with gender-matched, skin-matched limbs (Dewez et al. 2019), or even realistic human models.Footnote 1 The latest trend is toward models that preserve the user's body shape, such as the skinned multi-person linear body (SMPL) model (Loper et al. 2015), or 3D scanning systems such as point-cloud representations for fully personalized avatars (Thaler et al. 2018).

Avatar dynamics involves the capture of the user's movements, which enables or facilitates interaction tasks in MR (Pan and Steed 2019). While head and hand tracking are provided by almost all standard HMDs or through external sensors such as the Leap Motion, less work has been done on full-body tracking. Some works proposed attaching two VIVE trackers to the feet to enable foot tracking and visibility (Pan and Steed 2019; Bonfert et al. 2022; Bozgeyikli and Bozgeyikli 2022). Although traditional approaches for full-body capture are based on inertial measurement unit (IMU) sensors placed on a suit that the user wears (Xu et al. 2019; Waltemate et al. 2018; Jayaraj et al. 2017), there are emerging solutions based on computer vision (Bazarevsky et al. 2020; Tome et al. 2020) or depth information such as the Azure Kinect v2 (Gonzalez-Franco et al. 2020b).

An alternative to computer-generated avatars is the use of video-based self-avatars, which create the user representation by segmenting the egocentric view of the user. The VR community has explored this idea over the last decade. Approaches based on color segmentation (Fiore and Interrante 2012; Bruder et al. 2009; Perez et al. 2019; Günther et al. 2015) can be deployed in real time, but they do not work well in uncontrolled scenarios or with different skin colors (Fiore and Interrante 2012). Alternatively, segmentation approaches based on depth information (Rauter et al. 2019; Lee et al. 2016; Alaee et al. 2018) are effective in some situations, but still lack the precision to provide a generic, realistic, and real-time immersive experience. In our previous work (Gonzalez-Sosa et al. 2022), we proposed a real-time egocentric body segmentation algorithm. This algorithm, inspired by Thundernet's architecture (Xiang et al. 2019), achieves a frame rate of 66 fps for an input resolution of \(640 \times 480\), thanks to its shallow design. To train it, we created a dataset of 10,000 images with high variability in terms of users, scenarios, illumination, and gender, among other factors. This paper extends our previous preliminary work (Gonzalez-Morin et al. 2022a) and further investigates how to integrate the real-time egocentric body segmentation algorithm into a full end-to-end (E2E) system. Our main contributions can be summarized as:

  • A detailed description of our full E2E system composed of three main modules: a video pass-through capable VR device, a real-time egocentric body segmentation algorithm, and an optimized offloading architecture.

  • A subjective evaluation of the video-based self-avatar technology integrated into a gamified immersive experience, conducted with 58 users. To the best of our knowledge, this is the first work that includes a user study using full-body video-based self-avatars.

The rest of this article is structured as follows: Sect. 2 covers relevant state-of-the-art regarding the use of segmentation algorithms in augmented virtuality (AV). Section 3 describes the implementation details of our egocentric body segmentation E2E system. Section 4 gives details about the gamified immersive experience designed to validate the video-based self-avatar technology. Section 5 presents the subjective results obtained with a set of 58 different subjects and describes the between-subjects and within-subjects experiments conducted. Finally, Sect. 6 further elaborates on the benefits and drawbacks of using video-based avatars in place of CGI avatars. Section 7 concludes the paper.

2 Related works

Fig. 2 Simplified representation of our custom video pass-through implementations

Augmented Virtuality is an MR subcategory of the reality continuum that aims to merge portions of the reality surrounding the user with a virtual environment. In particular, introducing the user's own body has attracted a lot of interest in the scientific community. For instance, Bruder et al. proposed a skin segmentation algorithm to incorporate users' hands while handling different skin colors (Bruder et al. 2009). In addition, a floor subtraction approach was developed to incorporate users' legs and feet in the VE: assuming a uniform floor appearance, the body was retained by simply filtering out all pixels not belonging to the floor (Bruder et al. 2009).

Chen et al., in the context of 360\(^{\circ }\) video cinematic experiences, explored depth-keying techniques to incorporate all objects below a predefined distance threshold (Chen et al. 2017). This distance threshold could be changed dynamically to control how much of the real world was shown in the VE. The user could also control the transitions between VE and the real world through head shaking and hand gestures. Some of the limitations that the authors pointed out were related to the reduced field of view of the depth sensor.

Pigny and Dominjon (2019) were the first to propose a deep learning algorithm to segment egocentric bodies. Their architecture was based on U-NET and was trained on a hybrid dataset composed of person images from the COCO dataset and a 1500-image custom dataset created following the automatic labeling procedure reported in Gonzalez-Sosa et al. (2020). The segmentation network has two decoders: one for segmenting egocentric users and the other for exocentric users. However, their segmentation speed could not fully keep up with the camera's update rate while maintaining decent quality: they reported 16 ms of inference time for \(256\times 256\) images, and observed false positives that degrade the AV experience.

Later, in Gruosso et al. (2021), the authors developed a system able to segment the user's hands and arms using a dataset of 43K images from TeGO (Lee and Kacorri 2019), EDSH (Li and Kitani 2013), and a custom EgoCam dataset, using a semantic segmentation deep learning network based on DeepLabv3+ (Chen et al. 2018). For the experiment, they used an HTC Vive HMD, a monocular RGB camera, and a Leap Motion controller to provide hands-free interaction. Inference took around 20 ms for a \(360\times 360\) RGB video on a workstation equipped with an Nvidia Titan Xp GPU (12 GB memory). They managed to run VR rendering and CNN segmentation on the same workstation.

3 System design and implementation

The generation of real-time, accurate, and effective video-based self-avatars for MR requires the following solutions to be integrated into a single MR end-to-end system:

  • Video pass-through capable VR device: captures the real scenario using a frontal stereo camera, accurately aligns it to the user's view and head movement, and renders it, in real time, within the immersive scene.

  • Real-time egocentric body segmentation algorithm: identifies the pixels corresponding to the user's body in the frames captured by the frontal camera.

  • Optimized offloading architecture: egocentric body segmentation algorithms require high-end hardware not available in current state-of-the-art HMDs. Consequently, we need a communication architecture that allows fast data exchange between the immersive client and the server running the segmentation algorithm.

3.1 Custom video pass-through implementation

There are several VR devices with video pass-through capabilities that are already commercially available, such as the Varjo XR-3Footnote 2 or the HTC VIVE Pro.Footnote 3 These devices are still wired, constraining the range of possible use cases of the proposed system. Consequently, we decided to build our own video pass-through solution, thoroughly described in (Morín et al. 2022), for the Meta Quest 2,Footnote 4 as it is a well-known commercially successful standalone VR device. For this purpose, we had to build our own hardware along with the necessary software to effectively integrate the captured video into the VR scene, ensuring a perfect alignment between the captured feed and head movement.

To comply with many MR video pass-through use cases,Footnote 5 we chose the ELP 960P HD OV9750 stereo camera,Footnote 6 which provides a maximum resolution of 2560 \(\times\) 960 at a 60 Hz update rate and a field of view of 90°. Most commercially available video pass-through MR devices incorporate stereo cameras that are aligned with the user's eyes. However, after some initial tests we realized that, for our particular use case, tilting the stereo camera 25° towards the ground provided a better experience and ergonomics for the user. We designed a custom 3D-printed attachment to fix the stereo camera to the VR device with the described offset, as shown in Fig. 2.

Our system, built using Unity 2019.3, has been designed to work in both wired mode, using Unity's built-in C\(\#\) API to access the stereo camera feed as a webcam, and wireless mode, using a custom native Android plugin based on the UVC libraryFootnote 7 to access the camera feed. While our goal is to implement a fully wireless system, the wired mode is necessary for development purposes and to have controlled network conditions during our subjective evaluation experiments.

Fig. 3 Schematized representation of the coordinate frames of the VR device and of the stereo camera's rendering planes. In orange, the VR device rendering camera coordinate frame. In blue, the rendering planes coordinate frame. In red, the parameters to be estimated using our proposed two-step calibration method. (Color figure online)

Fig. 4 Deep learning architecture implemented and used for the task of egocentric human body segmentation, as described in Gonzalez-Sosa et al. (2022)

The rendered stereo camera feed and the user's head movement must be perfectly aligned to avoid user discomfort or VR sickness. Misalignment can come from three non-exclusive sources: lens distortion effects, inaccurate placement of the planes where the camera feed is rendered, and motion-to-photon delays.Footnote 8 We need to apply the necessary calibration steps and adjustments to reduce the effect of these misalignment sources.

The captured video frames from the stereo camera are rendered onto two virtual planes (in blue in Fig. 3), which we refer to as rendering planes, one for each eye. We first need to estimate the camera intrinsic parameters and distortion coefficients, using a well-known camera calibration technique (Zhang 2000), to accurately correct the camera's lens distortion and properly scale these rendering planes.
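
The following is a minimal OpenCV sketch of this intrinsic calibration step (Zhang 2000). The checkerboard size, square size, and file names are illustrative assumptions, not the values we actually used.

```python
# Minimal sketch of the intrinsic calibration step (Zhang 2000) using OpenCV.
# Board geometry and file names are illustrative assumptions.
import glob
import cv2
import numpy as np

BOARD = (9, 6)          # inner corners of the checkerboard (assumption)
SQUARE_SIZE = 0.025     # square edge in meters (assumption)

# 3D coordinates of the board corners in the board's own frame
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for path in glob.glob("calib_left_*.png"):          # one camera of the stereo pair
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# K is the 3x3 intrinsic matrix, dist the distortion coefficients, used both to
# undistort the feed and to scale the rendering planes.
rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)
undistorted = cv2.undistort(gray, K, dist)
```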

In the second step, we aim to accurately place the rendering planes relative to the VR device's coordinate frame, as shown in Fig. 3. While this is generally achieved using a classic camera-IMU (inertial measurement unit) calibration method (Mirzaei and Roumeliotis 2008), commercial devices do not provide access to raw sensor data with accurate timestamps. Consequently, we designed our own workaround calibration method, which uses the VR device's controllers and the Aruco library (Garrido-Jurado et al. 2014) to estimate the geometric relationship between the device and the rendering planes. For more details, refer to Morín et al. (2022).
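
As an illustration of the building block used by this workaround (not the full controller-based procedure), the sketch below estimates the pose of a single Aruco marker in the camera frame. It assumes the classic cv2.aruco API from opencv-contrib-python (pre-4.7) and reuses the K and dist estimated above; the marker dictionary and size are assumptions.

```python
# Building block of the workaround calibration: pose of an Aruco marker relative
# to the stereo camera (classic cv2.aruco API, opencv-contrib-python < 4.7).
import cv2

MARKER_LENGTH = 0.10  # marker side in meters (assumption)

aruco_dict = cv2.aruco.Dictionary_get(cv2.aruco.DICT_4X4_50)
params = cv2.aruco.DetectorParameters_create()

def marker_pose(frame_bgr, K, dist):
    """Return (rvec, tvec) of the first detected marker, or None."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict, parameters=params)
    if ids is None:
        return None
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, MARKER_LENGTH, K, dist)
    return rvecs[0], tvecs[0]  # rotation and translation of marker 0
```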

Finally, we need a method to remove the misalignment produced by the motion-to-photon latency. This latency produces an unnatural and uncomfortable decoupling between the stereo feed, the head movement, and, consequently, the virtual scene. To overcome this issue, we add a delay alignment buffer that stores the position and orientation of the stereo camera's coordinate frame. The size of the buffer (\(t_\textrm{c}\)) corresponds to the empirically estimated motion-to-photon delay. We assume \(t_\textrm{c}\) to be constant for a given stereo camera and VR device model. By placing and rotating the stereo camera feed according to the first pose stored in the buffer, we achieve an accurate alignment between head movement and the stereo feed.
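
A minimal sketch of this delay alignment buffer is shown below (in our system this logic lives in the Unity client; the Python version here is purely illustrative, and the example value of \(t_\textrm{c}\) is an assumption).

```python
# Minimal sketch of the delay alignment buffer: camera poses are queued and
# consumed t_c seconds later, so the rendered feed is placed with the pose the
# head had when the frame was captured. Names and values are illustrative.
from collections import deque

class DelayAlignmentBuffer:
    def __init__(self, t_c_seconds, camera_rate_hz):
        # buffer length in frames for the empirically estimated delay t_c
        self.size = max(1, round(t_c_seconds * camera_rate_hz))
        self.poses = deque(maxlen=self.size)

    def push(self, position, rotation):
        self.poses.append((position, rotation))

    def delayed_pose(self):
        # oldest stored pose; used to place and rotate the rendering planes
        return self.poses[0] if self.poses else None

buf = DelayAlignmentBuffer(t_c_seconds=0.080, camera_rate_hz=60)  # t_c is an example value
```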

For more details regarding our custom video pass-through implementation please refer to Morín et al. (2022).

3.2 Egocentric segmentation

This is a crucial step in our setup: we need an accurate, real-time egocentric body segmentation algorithm that identifies which pixels of the stereo camera feed correspond to the user's body. Our algorithm is based on deep learning techniques, in particular on semantic segmentation networks (Guo et al. 2018). In this case, the segmentation is performed from a first-person point of view (egocentric vision).

The designed algorithm must meet two requirements: (1) real-time performance, achieving an update rate above 60 Hz, and (2) high-quality segmentation in uncontrolled conditions. Concerning the first requirement, we designed our architecture inspired by Thundernet (Xiang et al. 2019), a very shallow semantic segmentation network composed of: (1) an encoding subnetwork; (2) a pyramid pooling module (PPM) that extracts features at different scales; and (3) a decoding subnetwork (see Fig. 4). Several modifications were applied to the baseline architecture: (1) larger pooling factors of 6, 12, 18, and 24 were used in the PPM module to better extract features from larger input images, (2) three additional long skip connections were added between the encoding and decoding subnetworks to refine object boundaries, and (3) a weighted cross-entropy loss function was used to handle the class imbalance between the human body (foreground) and the background.
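
The PyTorch sketch below illustrates two of these modifications: a pyramid pooling module with the larger pooling factors {6, 12, 18, 24} and a weighted cross-entropy loss for the foreground/background imbalance. Channel sizes and class weights are illustrative, not the trained values of our network.

```python
# Sketch of the PPM with pooling factors {6, 12, 18, 24} and of a weighted
# cross-entropy loss (PyTorch). Channel sizes and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPoolingModule(nn.Module):
    def __init__(self, in_ch, out_ch, bins=(6, 12, 18, 24)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                          nn.BatchNorm2d(out_ch),
                          nn.ReLU(inplace=True))
            for b in bins])
        self.project = nn.Conv2d(in_ch + len(bins) * out_ch, out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[2:]
        # pool at several scales, upsample back, concatenate with the input features
        feats = [x] + [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                     align_corners=False) for s in self.stages]
        return self.project(torch.cat(feats, dim=1))

# class 0 = background, class 1 = body; weights compensate class imbalance (illustrative)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.3, 0.7]))
```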

To achieve high-quality segmentation, a data-centric approach was followed, putting a strong emphasis on the variability of the training data. We created a dataset of almost 10,000 images composed of three parts: (1) the Ego Human dataset, a semi-synthetic dataset composed of egocentric body parts (arms, lower limbs, torsos) merged with realistic backgrounds, which facilitates the process of groundtruth generation; (2) a subset of the THU-READ dataset (Tang et al. 2018), originally created for action recognition, whose segmentation groundtruth was created manually using Amazon Mechanical Turk; and (3) EgoOffices, an egocentric dataset that was manually captured with a custom portable setup. The dataset was captured by more than 25 users in multiple realistic scenarios, such as different apartments and office rooms. As the hardware setup uses a stereo camera to provide egocentric stereo vision to the user, we re-trained the Thundernet architecture on images composed of two copies of the same image, resulting in a \(640\times 480\) input that replicates stereo vision. For more details, please refer to Gonzalez-Sosa et al. (2022).
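
A sketch of this training-time preprocessing step is shown below: each monocular training image and its mask are duplicated side by side and resized to mimic the concatenated stereo input. Function and variable names are illustrative.

```python
# Pseudo-stereo preprocessing sketch: duplicate the image side by side and
# resize to the 640x480 network input (OpenCV/numpy; names are illustrative).
import cv2
import numpy as np

def to_pseudo_stereo(img_bgr, mask, size=(640, 480)):
    stereo_img = np.hstack([img_bgr, img_bgr])      # two copies of the same view
    stereo_mask = np.hstack([mask, mask])
    stereo_img = cv2.resize(stereo_img, size, interpolation=cv2.INTER_LINEAR)
    stereo_mask = cv2.resize(stereo_mask, size, interpolation=cv2.INTER_NEAREST)
    return stereo_img, stereo_mask
```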

3.3 ZMQ-based offloading architecture

Our egocentric body segmentation solution cannot fulfill the real-time requirements on mobile hardware, such as smartphones or wireless VR devices. Consequently, we decided to wirelessly offload the segmentation algorithm to a nearby server. The offloading architecture must be specifically designed for real-time processing, providing the highest throughput and lowest latency possible. For this reason, we used ZeroMQ,Footnote 9 a TCP-based messaging architecture that has shown outstanding performance in most throughput and latency benchmarks (Sommer et al. 2018). We developed our own ZMQ-based offloading architecture, which has already been validated in realistic MR offloading use cases (González Morín et al. 2022b). The architecture has been specifically designed to ensure high reliability, high throughput, and low-latency data transmission.

The architecture follows a publisher-subscriber scheme in which one or multiple nodes advertise services to which other nodes can publish or subscribe. These nodes can be defined as individual agents within the architecture, each of which can subscribe or publish to an available data channel or topic.

Each software module runs a communication library named Alga. Alga implements a PUB-SUB ZMQ messaging pattern, where each frame is encapsulated in a multi-part ZMQ message composed of three sub-messages: topic, header, and data (payload). This architecture provides the core functionality required for distributed computing: semantic routing (using topics), custom metadata (in the header), payload framing with arbitrary packet size, reliable transmission, and a queue size limit (using ZMQ's high-watermark functionality) to control latency. ZMQ also makes it possible to run the functionality either on the same or on different nodes without changing the code, supporting in-memory, inter-process (Unix socket), and TCP transport protocols. Even though using TCP instead of UDP transport imposes an overhead in terms of throughput and latency, we have found that the achieved performance is sufficient for the purposes of our application. For future versions of the architecture, we are already working on a version of Alga that supports the same functionality over UDP transport.
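
The pyzmq sketch below illustrates this messaging pattern: a three-part [topic, header, payload] message and the high-watermark option that bounds queue growth. The port numbers, topic names, and header layout are illustrative assumptions; the actual framing is handled internally by Alga.

```python
# Illustrative pyzmq PUB-SUB sketch of the Alga-style multi-part message
# [topic, header, payload]. Ports, topics, and header fields are assumptions.
import json
import time
import zmq

ctx = zmq.Context.instance()

pub = ctx.socket(zmq.PUB)
pub.setsockopt(zmq.SNDHWM, 2)          # bound the send queue to keep latency low
pub.bind("tcp://*:5555")

sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.RCVHWM, 2)
sub.connect("tcp://localhost:5555")
sub.setsockopt_string(zmq.SUBSCRIBE, "camera/stereo")   # semantic routing by topic

def publish_frame(jpeg_bytes, frame_id):
    header = json.dumps({"id": frame_id, "ts": time.time()}).encode()
    pub.send_multipart([b"camera/stereo", header, jpeg_bytes])

def receive_frame():
    topic, header, payload = sub.recv_multipart()
    return json.loads(header), payload
```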

The architecture implements several types of publishers/subscribers according to the kind of data to transmit. For our particular application, we use the Picture and Unsigned 8-bit Picture packets. The first is specifically designed to transmit individual JPEG-coded color frames; in our case, this packet type carries the stereo camera feed. The second is designed to transmit 8-bit single-channel frames, such as segmentation masks; consequently, this packet type is used to transmit the resulting segmentation mask back to the VR client.
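
A sketch of how these two payload types can be encoded and decoded is shown below (OpenCV/numpy; the JPEG quality value and function names are assumptions, not the Alga API).

```python
# Sketch of the two payload types: JPEG-coded color frames for the stereo feed
# and raw single-channel 8-bit buffers for segmentation masks.
import cv2
import numpy as np

def encode_picture(frame_bgr, quality=80):
    ok, buf = cv2.imencode(".jpg", frame_bgr, [cv2.IMWRITE_JPEG_QUALITY, quality])
    assert ok
    return buf.tobytes()                      # "Picture" payload

def decode_picture(payload):
    return cv2.imdecode(np.frombuffer(payload, np.uint8), cv2.IMREAD_COLOR)

def encode_mask(mask_u8):
    return mask_u8.astype(np.uint8).tobytes() # "Unsigned 8-bit Picture" payload

def decode_mask(payload, shape):
    return np.frombuffer(payload, np.uint8).reshape(shape)
```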

The architecture is implemented both in C++ and Python. On the server side, we use the Python version to align with the egocentric body segmentation algorithm. As the VR client is built in Unity, we created a C++ plugin that allows non-blocking, asynchronous transmission and reception. This is necessary to avoid extra latencies caused by Unity's update rate not being synchronized with the camera capture or the segmentation algorithm.

3.4 End-to-end system

Fig. 5 Final simplified E2E architecture, including the languages in which each agent was built

The final End-to-End (E2E) setup is depicted in Fig. 5. The principal data flow of the E2E architecture is as follows (a minimal server-side sketch of steps 2-4 is given after the list):

  1. Stereo camera frame capture: as already described, the VR client runs Unity and is in charge of obtaining individual frames from the stereo camera video feed. The individual frames are passed to the underlying C++ plugin running our offloading architecture.

  2. Transmit stereo color frame to the server: the offloading architecture, via Alga, packs each frame and transmits it via TCP on an arbitrary topic. The segmentation server runs another instance of Alga and receives the frames through the same topic.

  3. Egocentric body segmentation: the server infers the segmentation mask for each received frame.

  4. Transmit the result back to the VR client: the mask is transmitted back to the VR client via Alga, through another arbitrary topic.

  5. Shader application on the rendering planes: finally, the client applies the received masks to the custom shader attached to the rendering planes. The shader exclusively renders the stereo camera pixels corresponding to the user's body according to the received mask.
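
The minimal, self-contained sketch below illustrates the server side of steps 2-4: receive a JPEG-coded stereo frame, infer the body mask, and publish the 8-bit mask on a second topic. The `segment` function is a placeholder for the network of Sect. 3.2, and the hostnames, ports, and topic names are assumptions.

```python
# Server-side sketch of steps 2-4: receive frame, infer mask, send mask back.
# Hostnames, ports, topics, and the segment() placeholder are illustrative.
import cv2
import numpy as np
import zmq

ctx = zmq.Context.instance()
frames_in = ctx.socket(zmq.SUB)
frames_in.connect("tcp://vr-client:5555")
frames_in.setsockopt_string(zmq.SUBSCRIBE, "camera/stereo")
masks_out = ctx.socket(zmq.PUB)
masks_out.bind("tcp://*:5556")

def segment(frame_bgr):
    # placeholder for the egocentric segmentation network (returns HxW uint8 mask)
    return np.zeros(frame_bgr.shape[:2], np.uint8)

while True:
    topic, header, payload = frames_in.recv_multipart()
    frame = cv2.imdecode(np.frombuffer(payload, np.uint8), cv2.IMREAD_COLOR)
    mask = segment(frame)
    masks_out.send_multipart([b"segmentation/mask", header, mask.tobytes()])
```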

3.4.1 Hardware setup

The egocentric segmentation server ran on a PC with an Intel Xeon E5-2620 v4 @ 2.1 GHz, 32 GB of RAM, and two GTX 1080 Ti GPUs with 12 GB of RAM each. On the VR client side, we considered two modalities:

  • Wireless mode: the immersive scene runs directly on the Meta Quest 2 HMD, to which the stereo camera is directly connected.

  • Wired mode: the immersive scene runs on a workstation to which the Meta Quest 2 is connected via a USB-C cable using the "Link" mode. The stereo camera is connected to the workstation. In this case, the client workstation includes an Intel Core i7-11800H CPU and an Nvidia GeForce RTX 3070 GPU. The workstation is connected to the server via an Ethernet cable with a 1 Gbps interface.

We used Unity version 2019.3 on the client side, where the stereo frames are captured at a resolution of \(1280\times 480\). For the validation experiments, we used a Netgear R6400 router, providing symmetric wireless 200 Mbps when a single user is connected (González Morín et al. 2022b).

3.5 Performance evaluation

The proposed offloading architecture has been thoroughly benchmarked in terms of throughput and latency in multiple relevant scenarios and wireless technologies (González Morín et al. 2022b). We decided to extend the benchmark by testing the architecture on the final E2E system. To obtain the following results we used the exact hardware setup described in Sect. 3.4.1.

3.5.1 Offloading architecture and video pass-through

We first aim to evaluate the performance of our offloading architecture alone in both hardware setups, wired and wireless.

Table 1 Performance evaluation results for the wired and wireless scenarios along with the server processing times

In this first round of experiments, we removed the segmentation algorithm from the server, directly transmitting back to the client a single-channel 8-bit mask of the incoming frame. This mask is obtained using a chroma-key algorithm, which adds negligible overhead on the server side. For each scenario, we ran four experiments, each transmitting 1000 frames, with no other users connected to the router. We recorded the time each frame took from the moment it was loaded in memory on the client side to the moment the final mask was received back from the server. The initial results are shown in the first two columns of Table 1. Our E2E setup provides mean round-trip times of 7.37 and 10.30 ms in the wired and wireless scenarios, respectively, with confidence intervals below 20 ms in both cases.
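
As a minimal illustration of how the per-frame measurements of one run can be summarized into a mean and a 95% confidence interval, consider the sketch below; the synthetic values stand in for the 1000 recorded round-trip times.

```python
# Summarizing one run of per-frame round-trip times (numpy/scipy).
# The synthetic data below is a placeholder for the recorded measurements.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rtt_ms = rng.normal(7.4, 3.0, 1000)          # placeholder for the 1000 measured RTTs
mean = rtt_ms.mean()
ci95 = stats.t.interval(0.95, len(rtt_ms) - 1, loc=mean, scale=stats.sem(rtt_ms))
print(f"mean RTT: {mean:.2f} ms, 95% CI: [{ci95[0]:.2f}, {ci95[1]:.2f}] ms")
```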

3.5.2 Segmentation server

In this case, the goal is to measure the total time consumed by the server in performing all the necessary steps for a frame: reception, necessary transformations such as scaling, and inference. Similarly to the previous set of experiments, we carried out four experiments each processing 1000 frames. We obtained a mean time of 16.7 ms, as shown in Table 1, which guarantees an update rate of 60 Hz, complying with the MR applications’ real-time requirements.

3.5.3 End-to-end system

Finally, by aggregating the results obtained in the previous two experiments, we obtain E2E mean round-trip times (RTT) of 29.67 ms and 32.39 ms. This latency is low enough to allow the use of delay buffers to ensure a proper alignment of the color frames and masks, as described in Sect. 3.6.

3.6 Delay buffers

Fig. 6 Left: mask-color pixel misalignment observed when no delay buffer is used. Right: alignment achieved using the delay buffer

As the obtained RTT values are higher than the device's 16 ms update period, the hand pixels appear misaligned with respect to the segmentation mask. The misalignment comes from the fact that the segmentation mask arrives at the VR client on average 32.29 ms after its corresponding stereo pixels are loaded in memory and rendered. The effect is that the mask appears to trail the actual hand. This artifact, shown in Fig. 6, affects the overall experience and must be removed.

To overcome this issue, we add a frame buffer whose size is equivalent to the measured mean RTT (32.29 ms) expressed in frames at the camera's update rate. By doing this we remove the artifacts generated by the system's added delays, ensuring that the mask and its corresponding stereo pixels are rendered simultaneously. Note that the duration of this delay buffer must be added to the alignment buffer described in Sect. 3.1.

Fig. 7 Left: the physical rooms used for the experiments. Right: the distribution of the tiles within the volcano path

Fig. 8 Detail of the different sequences happening in the immersive experience. Tiles are placed following the design depicted in Fig. 7

4 Subjective evaluation

Our study aims to investigate the role of full-body video-based self-avatars in Mixed Reality and how our E2E implementation affects the overall experience in terms of presence, embodiment, visual quality, and acceptability. We compare our video-based avatar approach with a state-of-the-art hand-tracking solution, as well as with our previous implementation of video-based avatars using chroma-key segmentation (Perez et al. 2019).

4.1 Research questions

The following research questions were established:

  • RQ1: The inclusion of a stereo camera that allows video pass-through increases the sense of presence and embodiment with respect to a virtual representation of your hands. Seeing your full body, including your feet, is relevant for the experience.

  • RQ2: Deep learning based segmentation is a better solution for self-avatars in uncontrolled conditions than color-based segmentation.

4.2 Immersive experience: design and considerations

We decided to create an immersive scene that forces the users to be aware of and use their entire body, involving demanding visuomotor tasks. To this end, a volcano immersive environment was created. Users were asked to walk over a set of narrow, non-contiguous tiles forming a path along the crater of the volcano, forcing them to pay attention to their lower limbs while walking. To provide passive haptics and increase the feeling of being there, we incorporated 1-cm-high steps, as can be seen in Fig. 7. The main purpose of the steps is to elicit visuotactile stimuli. To align the real and virtual steps, we implemented a C++ plugin running a standard Aruco tracking library (Garrido-Jurado et al. 2014) together with the camera calibration parameters estimated in Sect. 3.1. We placed an Aruco marker as shown in Fig. 7.

The whole path of tiles is not visible to the user from the beginning. Instead, the tiles are activated, in order, when users match their hands with a neon-light representation of the hands shown on the floor.Footnote 10 The neon-light hands are placed at ground/tile level so that users need to touch the actual surface with their own hands, increasing the haptic sensations. The hands are tracked using the Meta Quest 2 hand-tracking solution. Finally, at the end of the tile path, we created a virtual portal that, once reached by the users, teleports them to a friendly scenario. The goal of this final friendly scene is to allow the users to explore the virtual environment as well as their own body representation without any stress or pressure. The entire setup and the described details are shown in Fig. 8.

Fig. 9 Conditions explored in the user study. From left to right: Local scenario, Condition 1—Virtual hands, Condition 2—Video-based Chroma, and Condition 3—Video-based Deep Learning

Fig. 10 Diagram of the protocol followed to conduct the between-subjects and within-subjects experiments

The total area occupied by the setup is 4 m \(\times\) 6 m, using 16 steps in total. The distribution used is shown in Fig. 7. Our E2E system has been specifically designed for wireless operation; however, it is out of the scope of this work to evaluate how the network quality of service affects the immersive experience. To achieve a fair comparison between the different algorithms and our E2E system, we need controlled network conditions. For this reason, we decided to use the wired setup, in which the Meta Quest 2 is connected to a workstation running the scene. Furthermore, the wired version considerably reduces battery consumption and allows longer experiment runs.

4.3 Conditions

The main goal of the experiment is to evaluate the different modalities available to create the self-avatar user representations (see Fig. 9). The following three conditions were explored:

  • Virtual Reality: the user is represented with the virtual hand models provided by the Meta Quest 2. We use it as a control condition, to ensure that our video-based self-avatar brings benefits that cannot be achieved with virtual hands alone.

  • Chroma: users are represented by the output of a simple algorithm that masks everything within a skin-color range, similar to Perez et al. (2019), as sketched after this list. This algorithm is implemented directly inside a Unity shader, thus running locally on the client side and taking less than 3 ms per frame. To ensure that users could see their feet as a reference for walking, we asked them to wear pink booties, which were reliably detected by the color-based filter, as can be seen in Fig. 9.

  • Deep learning: the user is represented by the output of our semantic segmentation algorithm reported in Sect. 3.2, using the described E2E architecture.
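
For illustration, the OpenCV sketch below shows the idea behind the Chroma condition: keeping every pixel whose color falls within a skin-like range. In our system this logic runs as a Unity shader on the client, and the HSV bounds below are assumptions rather than the values used in the shader.

```python
# Illustrative OpenCV equivalent of the chroma condition: pixels within a
# skin-like HSV range are kept as "body". HSV bounds are assumptions.
import cv2
import numpy as np

LOWER = np.array([0, 40, 60], np.uint8)      # illustrative lower HSV bound
UPPER = np.array([25, 180, 255], np.uint8)   # illustrative upper HSV bound

def chroma_mask(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER, UPPER)                     # 0 or 255 per pixel
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    return mask
```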

4.4 Protocol

At the beginning, participants were briefed about the experiment and its purpose and were asked to sign a consent form. Three modalities were set up: A, B, and C. To counterbalance order effects, users were evenly distributed among modalities. In each modality, users performed the experiment under two conditions, as depicted in Fig. 10. After each condition, users answered the questionnaire described in Sect. 4.5.

Experiments took place in Madrid for three consecutive days in March 2022 at a local university. On the first and third days, room A was used, while on the second day, room B was used (see Fig. 7).

It is worth noting that, for safety reasons, users under modality A did not use the 1 cm steps described above, since the Virtual Reality condition did not allow them to see their feet.

4.5 Questionnaire

After providing some demographic information about age, gender, and previous experience with VR, users were asked to fill out a 30-item subjective questionnaire, where 12 items were used to measure embodiment, 7 for presence, 6 for visual quality, and 4 for acceptability.Footnote 11 It is an adaptation of the questionnaire used in our previous work (Villegas et al. 2020), with a stronger focus on embodiment compared to presence. As in Villegas et al. (2020), it is based on well-established tools from the state-of-the-art (Gonzalez-Franco and Peck 2018; Witmer and Singer 1998) and ITU-T Recommendations, to favor the reproducibility of the results and their comparison with other works.

Embodiment We used the embodiment questionnaire (EQ) proposed by Gonzalez-Franco and Peck (2018). We chose to include a subset of 13 items, namely: Q1, Q2, and Q3 for ownership; Q6, Q7, Q8, and Q9 for agency; Q14 for location; Q17, Q19, and Q20 for external appearance; and Q25 for external stimuli (modified to "I had the feeling that if I fell off the bridge I was going to hurt myself"). Unlike in the original questionnaire, Q20 ("I felt like I was wearing different clothes from when I came to the laboratory") has been reversed, since now the users should be able to see their own clothes. Items from the tactile sensation factor were not included, as they were not addressed in our study. A total score has also been computed following Gonzalez-Franco and Peck (2018) as the weighted average of all items, where items Q1-Q14, which belong to the main embodiment components (ownership, agency, and location), have double the weight of the remaining items (Q17-Q25).
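
The short sketch below shows one reading of this weighted total, assuming "double weight" means a weighted average with weights 2 and 1; the item values are illustrative.

```python
# Sketch of the total EQ score as a weighted average: Q1-Q14 items count double
# the Q17-Q25 items. Item values are illustrative, not real responses.
import numpy as np

main_items = {"Q1": 6, "Q2": 5, "Q3": 6, "Q6": 6, "Q7": 7, "Q8": 4, "Q9": 6, "Q14": 6}
other_items = {"Q17": 5, "Q19": 5, "Q20": 6, "Q25": 4}

weights = np.array([2.0] * len(main_items) + [1.0] * len(other_items))
scores = np.array(list(main_items.values()) + list(other_items.values()), float)
total = float(np.average(scores, weights=weights))
print(f"EQ total: {total:.2f}")
```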

Presence We used Witmer and Singer's presence questionnaire version 3 (PQv3) (Witmer et al. 2005). PQv3 contains 32 items to explain the sense of presence through four sub-dimensions: involvement, interface quality, adaptation/immersion, and sensory fidelity. We used a seven-item subsample, as it shows a good correlation with the full questionnaire both in the overall score (computed as the mean of all items) and in the aforementioned sub-dimensions (Pérez et al. 2021).

Visual Quality To complete the evaluation of the experience, we included additional questions to assess the subjective perception of the visual quality of the different elements (Perez et al. 2019). The visual quality of the virtual Environment, the virtual Objects, and the user's own Body, Arms, and Legs was assessed using the ITU-T P.913 absolute category rating (ACR) scale. Since some of these elements may not appear under some conditions, an additional category was included to indicate that the element was not visible at all, which was rated as 0. Additionally, the annoyance of elements from the local Background that were misclassified as foreground by the segmentation algorithm (false positives) was rated using the ITU-T P.913 degradation category rating (DCR) scale. A Total score has also been computed as the average of all scores (including both ACR and DCR ratings).

Acceptability Four additional questions were included to explore acceptability. Cybersickness was assessed using the single-item vertigo score rating (VSR) scale, as recommended by ITU-T P.919; we use the ratio of users with severe symptoms (\(VSR \le 2\)) as the evaluation score. Global quality of experience (QoE) was asked using the ITU-T ACR scale, from which the proportion of good-or-better (GoB) ratings was extracted (Hoßfeld et al. 2016). Finally, the Net Promoter Score was computed in two ways: the traditional one, based on the self-reported probability of recommending the game to a friend ("NPS-R") (Reichheld 2003), and a variant based on the self-reported willingness to pay ("NPS-P") (Villegas et al. 2020).

Qualitative evaluation Users had two opportunities to provide free-form feedback: one explaining why they would recommend the experience, and a final free-text comment about the overall experience.

4.6 Statistical analysis

The design of the experiment allows for two types of analysis (see Fig. 10): a between-subject analysis of the first execution of the three conditions, aimed at comparing the Virtual Reality baseline with both video-based avatars (RQ1), and a within-subject analysis between the Chroma and Deep Learning implementations, to address RQ2.

Presence, Embodiment, and Visual Quality measures can be modeled as normal variables, and therefore we will analyze them using parametric statistics (Narwaria et al. 2018). One-way ANOVA will be used for the between-subject analysis, with Bonferroni-corrected pairwise Student t-tests as post-hoc analysis. One-way ANOVA with repeated measures will be used for the within-subject analysis. No post-hoc test is needed, as there are only two conditions.

Acceptability scores do not use a (weighted) average of the different ratings to provide a final score, but an analysis of the distribution of the ratings. Consequently, to test for the significance of the differences in the acceptability measures, Mann–Whitney U-tests will be applied to the distributions of ratings, with Bonferroni correction for the within-subject analysis.

A significance level of \(p < 0.05\) is established for all the tests.
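
The sketch below illustrates, with scipy and synthetic placeholder scores, the between-subject one-way ANOVA with Bonferroni-corrected post-hoc t-tests and the Mann-Whitney U test on acceptability ratings; the within-subject repeated-measures ANOVA is not shown (it could be run, for instance, with statsmodels' AnovaRM). Group sizes and values are assumptions.

```python
# Sketch of the statistical tests described above (scipy). Scores are synthetic
# placeholders; the real data come from the questionnaires.
import numpy as np
from scipy import stats

ALPHA = 0.05
rng = np.random.default_rng(0)
vr, chroma, deep = (rng.normal(5.0, 0.8, 19),
                    rng.normal(5.2, 0.8, 19),
                    rng.normal(5.3, 0.8, 20))

F, p = stats.f_oneway(vr, chroma, deep)              # between-subject one-way ANOVA
print(f"ANOVA: F={F:.2f}, p={p:.3f}, significant={p < ALPHA}")

pairs = {"VR-Chroma": (vr, chroma), "VR-Deep": (vr, deep), "Chroma-Deep": (chroma, deep)}
for name, (a, b) in pairs.items():                   # Bonferroni-corrected post-hoc t-tests
    _, p = stats.ttest_ind(a, b)
    print(f"{name}: corrected p = {min(1.0, p * len(pairs)):.3f}")

acc_chroma = rng.integers(1, 6, 39)                  # illustrative acceptability ratings
acc_deep = rng.integers(1, 6, 39)
U, p = stats.mannwhitneyu(acc_chroma, acc_deep, alternative="two-sided")
print(f"Mann-Whitney: U = {U:.0f}, p = {p:.3f}")
```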

Table 2 Distribution of users across modalities

Fig. 11 Results for Embodiment (top), Presence (middle), and Visual Quality (bottom) questionnaires, both for between-subject (left) and within-subject (right) analysis. Bars show mean values for each of the measures, and error bars show the standard error of the mean (SEM)

4.7 Participants

The game was evaluated by 58 volunteers (30 female and 28 male). The participants' ages ranged from 18 to 63 (M = 22.16, SD = 8.62); most of them were students from a local university and some were staff, with no involvement in the development of the project. All users were Caucasian except one woman of Black ethnicity. Table 2 shows a fairly homogeneous distribution of users and gender across modalities. 22 of the 58 participants performed the experiment in Room B, while the remaining 36 performed it in Room A (see Fig. 7). One woman assigned to modality C could only complete the first condition, as she reported being too afraid to go through the experience a second time.

5 Results

Figure 11 shows the main quantitative results for Embodiment, Presence, and Visual Quality.

5.1 Embodiment

The quantitative results of the different sub-scales of the Embodiment Questionnaire, as well as a total value aggregating all items, are shown in the top row in Fig. 11, for both between- and within-subject analysis. Results are shown on a 1–7 Likert scale. All scores are high, indicating that the sense of embodiment is high under all conditions.

Between-subject analysis shows a significant effect of the video-based algorithms in Location (\(F_{2,55} = 4.56\), \(p =.015\), \(\eta ^2 =.14\)), Appearance (\(F_{2,55} = 6.48\), \(p =.003\), \(\eta ^2 =.19\)), and Response (\(F_{2,55} = 3.26\), \(p =.046\), \(\eta ^2 =.11\)) scales. Pairwise T-Tests show significant differences between Virtual Reality and both Chroma and Deep Learning for Appearance and Location; and between Chroma and Deep Learning for Response.

Within-subject analysis shows a significant effect between Chroma and Deep Learning algorithms in Ownership (\(F_{1,38} = 4.37\), \(p =.043\), \(\eta ^2 =.02\)) and Appearance (\(F_{1,38} = 6.13\), \(p =.018\), \(\eta ^2 =.03\)).

In summary, we can see moderate improvements in some Embodiment components when replacing the conventional virtual hands with a full-body video-based avatar, especially in terms of Location and Appearance. We observe small, but significant, improvements when we compare both video-based avatar solutions.

5.2 Presence

The quantitative results of the different sub-scales of the Presence Questionnaire, as well as a total value aggregating all items, are shown in the middle row in Fig. 11, for both between- and within-subject analysis. Results are shown on a 1–7 Likert scale.

Presence mean values are high for all sub-scales: 5.1 for Involvement, 5.2 for Haptic sensory fidelity, and 6.1 for Adaptation. However, no significant differences are found between conditions in either between- or within-subject analysis.

5.3 Visual quality

Visual Quality scores are shown in the bottom row in Fig. 11, for both between- and within-subject analysis. A Total (average) score is also shown.

Environment and Objects are common for all the conditions, and they show a MOS value of about 4.0 (Good), without significant differences. The perception of Body and Arms has a moderately lower MOS, between 3.0 (Fair) and 3.5, and the differences between algorithms are not significant either.

There is a significant difference in the perception of the Legs. This is obvious for Virtual Reality, where no legs were shown in the virtual scene, but also between Chroma and Deep Learning algorithms, in both between- and within-subject comparisons. In the within-subject comparison, the MOS for Chroma and Deep Learning is 2.38 and 3.61, respectively, which means an increase of more than one quality level.

The same difference exists in the perception of the Background false positives. In this case, the highest quality is obtained in the Virtual Reality scene, where no false positives exist. The within-subject comparison between Chroma and Deep Learning shows significant MOS values differences, with results of 2.90 and 3.77, respectively.

Consequently, there is a significant difference in the Total score both for between-subject and within-subject comparisons.

5.4 Acceptability

Table 3 Acceptability results summary

Table 3 shows the acceptability results. Only two participants reported severe cybersickness effects, both in the first execution of a video-based avatar condition: one in Chroma and one in Deep Learning. The latter did not want to do the second condition and, therefore, her results have been excluded from the within-subject analysis altogether.

Quality of Experience ratings were extremely high, with above \(80\%\) of the participants reporting Good or Excellent QoE. Net Promoter Scores were also high, mostly in the range of 40–70\(\%\).

When comparing the acceptability of the three conditions, no strong differences can be observed, as different acceptability indicators show different results. Furthermore, the U-test analysis of the underlying distribution of the scores does not show any significant difference. Therefore, we can conclude that all three conditions have similar (and high) levels of acceptability.

5.5 Qualitative evaluation

Several insights could be also inferred from analyzing free text comments given by users. People, in general, were grateful to experience this new technology, stating in most cases that it was a very interesting/incredible/pleasant experience, reporting having fun while doing the experiment, and envisioning it as the future.

In all conditions, most users reported feeling teleported to the place ("I felt like I was really there") and also experiencing the threat of the experience: "It's interesting to see how your body reacts unconsciously even though you know nothing will happen to you" or "Even though it is a game it gives you the feeling that you are going to fall down [..] partially tricks your brain [..] there was a moment when I had to hold on because I felt like I was really going to fall". People in general reported a good perception of distance, especially related to touching the tiles, but showed some uncertainty with respect to their height perception. Regarding the game itself, people felt that it was too short and that adding more functionalities could make it more fun. Some people wearing prescription glasses reported a more comfortable experience when they took them off.

In the VR scenario, some users were very happy to see their own hands without using controllers, while at the same time reporting that it could be even better if more body parts were perceivable. Indeed, one user reported a "disconcerting experience of not knowing exactly where to walk". With respect to false positives, users perceived fewer of them in the deep learning condition than in the chroma condition: "the false positives were very low compared to the previous experience and the avatar looked much better than in the previous experience, it was quite good", or this comment after the chroma condition: "the second time I tried the game it looked worse, I saw many more parts of the class, which made me feel more out of the game environment".

In general, across the deep learning responses, people showed high acceptance of seeing their own body, reporting that it made them feel more immersed and that the experience felt more realistic. Users also appreciated being able to see accessories such as watches, jewelry, painted nails, etc. Especially in modalities A and B, users reported a strong difference with respect to the previous conditions, where only certain body parts could be seen. For instance: "It has been really impressive to see the change and be able to see my own body including my clothes and many details. It honestly blew me away. I've never had a VR experience where I could actually see my body and everything" or "It looks like you're the one going through the catwalks and, moreover, the avatar looks exactly like you". Moreover, some people reported the benefits of seeing their own legs: "Seeing my legs made me feel better in space".

In the chroma scenario, the feedback differed depending on whether users were wearing clothes within the skin-color range ("The avatar was very well done") or not ("the credibility is less because the body is not completely represented (I could not see my legs)").

Users also reported that there is still room for improvement concerning seeing the entire body with deep learning. One issue is that the segmentation boundaries are not perfect and tend to be slightly larger (thick edges) than the actual body: "The fact of showing one's own body helps a lot to the perception of space and movement of oneself, however, the noise that is produced around the body distracts me a lot from the scene, diverts the attention too much and took me out of the experience" or "I think it is a unique experience, however, I think the quality of the avatar image can be improved". In this case, users perceived the color-based segmentation as more precise, although it can only segment skin-colored parts, e.g. "in this experience, the hands looked more realistic because the cropping was better, however, I was not able to see my own clothes and I saw most of the furniture in the classroom". In general, tracking was perceived as very stable in all conditions, although the presence of long hair can block the device's sensors.

6 Discussion

6.1 Implementation

We have presented our fully functional video-based self-avatar E2E system for MR. Our implementation allows us to run the proposed segmentation algorithm in real time for a commercial VR device, with E2E latencies below 11 ms when no algorithm is running on the server. Consequently, the system makes it possible to offload other computationally demanding solutions for standalone VR devices, such as object tracking or camera-based full-body tracking, extending the current boundaries of MR solutions and applications.

The system still has some limitations, and we had to accept a set of trade-offs in terms of segmentation accuracy, frame rate, resolution, etc. Consequently, the presented solution cannot be fairly compared to much more mature solutions, such as the virtual hand representation based on the hand tracking provided by Meta and other vendors. However, we must remark that the presented technology is at the boundary of the state-of-the-art, which leaves room for improvement, and it can be a starting point for future distributed and novel MR solutions.

6.2 Evaluation

Regarding the first research question established in Sect. 4.1, our quantitative analysis shows that video-based self-avatars provide a marginal improvement only in some components of the sense of embodiment with respect to the Virtual Reality condition. No significant difference is observed in the presence factor.

Regarding the second research question, the Deep Learning algorithm offers better visual quality than the Chroma in uncontrolled environments, both for the legs and for the background perception. These differences result in small improvements in Ownership and Appearance factors of the sense of embodiment. No significant difference is observed in presence either.

We can extract several interpretations from this. First, all the conditions show high levels of sense of presence and embodiment. Since in all cases users are immersed in a virtual environment with which they interact with their own hands (either "real" or virtual), the degree of embodiment and presence achieved by the baseline condition is already high, so any observed improvement must necessarily be small. Second, virtual reality technology is much more mature than video-based avatars and, consequently, the VR implementation looks more like an actual finished product than the video avatars. The benefits of seeing one's own body may thus be limited by the limitations of the implementation (e.g. false positives). For example, VR only shows the hand (compared to the full arms of the other solutions), but it offers better resolution in the hand image. These two factors counterbalance each other and, as a result, the visual quality for the Arms is similar in all conditions.

A similar situation arises between Chroma and Deep Learning: the former is only able to detect the hands and some clothing colors (red or yellow); however, these elements are segmented with better pixel precision than with the Deep Learning algorithm. Besides, the difference in the number of body parts segmented by Chroma and Deep Learning is reduced when subjects wear clothes with colors similar to skin, which was the case for 18 of the 58 subjects.

These findings are also supported by the qualitative feedback given by users: they appreciated the possibility of seeing their own body completely and, at the same time, reported room for improvement. Finally, we also observed that some people tend to forget whether they have seen their own body limbs or not, which adds some uncertainty to the results.

Finally, it is also possible that the main benefits of using video-based avatars cannot be measured in terms of presence or embodiment. Other aspects of the overall Quality of Experience should be explored.

6.3 Methodology

First of all, we are evaluating presence, embodiment, and video quality using already existing questionnaires and methodologies, but none of them were designed for video-based avatars. As a consequence, there are some limitations.

The Presence Questionnaire (PQ) was developed for VR, and the Embodiment Questionnaire (EQ) was designed for computer-generated and animated avatars. When used in a different scenario, some items may not completely apply, or they could even be misunderstood by the users. For instance, we have seen that Q8 (I felt as if the movements of the virtual body were influencing my own movements) shows a range of values completely different from the other items, which might indicate that it is not being well understood by the users. Other items, such as Q2, Q15, Q17, or Q20, might create some confusion for the users because this technology shows the user's appearance as it is.

We have also observed some limitations in the questions related to visual quality, as we are using a single question to capture, at the same time, how much of your body you can see and how sharp, accurate, and stable this view is. These two aspects of quality should probably be rated separately. Regarding NPS-P, people reported that they would be willing to pay more if the game itself were longer.

6.4 Comparison

Previous works have also observed that a realistic virtual hand representation elicits a stronger sense of ownership (Argelaguet et al. 2016). Later, Fribourg et al. (2020) found that the appearance factor was less popular than control and point of view when assessed in combination (users preferred to first increase the degree of control before the degree of appearance). This latter conclusion might not be easily extrapolated to the case of a video-based self-avatar, as there is no need for avatar control when moving or interacting with real objects (only with virtual ones).

The improvements in presence and embodiment observed in this work are in line with our previous work, in which virtual-hands (using controllers) and real-hands conditions were compared (Villegas et al. 2020) and which showed a larger difference between the chroma-based and pure VR conditions. This is because, in the present study, all conditions share the same hand-tracking algorithm, while in Villegas et al. (2020) the tracking in the VR condition was done through the VR controllers.

Our results are not in line with the conclusions drawn in Gisbergen et al. (2020), where users had to walk on a 60-m-high broken pathway under two different conditions: (1) high-resemblance shoes or (2) generic shoes. The authors stated that under extreme situations that trigger psychological arousal, such as stress, high avatar-owner resemblance is not a requirement.

6.5 Benefits

The use of video-based self-avatars provides some important benefits compared with traditional avatars. First, there is no need to scale the avatar to match the user's body dimensions, as segmentation is done at a 1:1 scale, allowing the user to have an accurate spatial perception. Also, there are no problems related to misalignment between the virtual and real body (Ogawa et al. 2020a), the uncanny valley effect (Bhargava et al. 2021), or the self-avatar follower effect (Gonzalez-Franco et al. 2020a). Extending egocentric segmentation to other items beyond human body parts would allow the user to easily interact with real objects while immersed. Last but not least, users recognize the avatars as their own bodies, with high acceptance values.

6.6 Drawbacks

There are still certain factors that can be improved for a better integration of video-based self-avatars in MR applications. One clear issue is the discrepancy in illumination conditions between the video-based self-avatar and the immersive environment where it is integrated (e.g. 360° video, VR environment). It would also be desirable to further investigate how to discard false positives from the segmentation algorithm, e.g. by using distance information, so that they do not downgrade the user experience.

6.7 Applications

For evaluation purposes, the technology of video-based self-avatar has been presented in this work by being integrated into a gamified immersive experience. However, we believe it can play an important role in many emerging use cases, namely:

  • Seeing your own body and other peers with high fidelity (Joachimczak et al. 2022) in immersive telepresence systems (Arora et al. 2022; Izumihara et al. 2019) or other types of immersive communication systems (Perez et al. 2022) can further foster SoP, co-presence, and communication skills.

  • Seeing your own body and being able to seamlessly interact with real objects for immersive training purposes (Ipsita et al. 2022) or for other purposes such as seeing your notebook or smartphone while attending an immersive conference.

6.8 Limitations

One limitation of this study is the lack of a comparison with a full-body avatar condition, due to the lack of mature full-body commercial tracking solutions that do not require additional hardware. Also, the conclusions of our user study are drawn from questionnaires answered by a population subset which, although balanced in terms of gender, lacks diversity in other factors such as height, age, and skin color.

7 Conclusions

In this work, we proposed an E2E system that provides a full-body video-based self-avatar user representation for MR applications. This E2E system can be divided into three modules: the video pass-through solution for VR goggles, a real-time egocentric body segmentation algorithm, and an optimized offloading architecture. The E2E system has been evaluated using the hardware setup described in Sect. 3.4.1. The achieved results showed the implementation's capability to fulfill the real-time requirements, providing update rates above 60 Hz. Finally, we described the delay-correction buffers used to compensate for round-trip latencies larger than the 60 Hz update period, ensuring an accurate alignment between the color frames and the received masks.

To the best of our knowledge, this is the first user study that conducts a subjective evaluation of video-based self-avatars through a gamified experience, performed with 58 participants. The goal was to evaluate and compare three different user representation solutions: virtual hands, chroma-based full-body segmentation, and deep-learning-based full-body segmentation. The overall results allowed us to reach two main conclusions: (1) video-based avatars provide a marginal improvement only in certain components of the perceived embodiment compared to the virtual-hands solution, and (2) the deep learning solution offers better visual quality and a better perception of ownership and appearance than the chroma solution. The small differences observed between the conditions might be because all three conditions share the same robust and accurate hand-tracking algorithm provided by the Meta Quest 2. We also observed that the scores obtained for presence and embodiment are high for all conditions, making it difficult to discriminate among them. Besides, the proposed E2E video-based solution is at the edge of the state-of-the-art and, consequently, lacks the maturity of the commercial virtual-hands solution. However, the presented E2E system is already useful in its current state and can be considered a starting point for future distributed MR implementations and potential improvements. Finally, we discussed some advantages and disadvantages of video-based self-avatar solutions and potential applications. We believe this solution can be relevant in applications such as telepresence, training, and education.