The Vicarios Virtual Reality Interface for Remote Robotic Teleoperation

Intuitive interaction is the cornerstone of accurate and effective performance in remote robotic teleoperation. It requires high fidelity in control actions as well as in perception (vision, haptic, and other sensory feedback) of the remote environment. This paper presents Vicarios, a Virtual Reality (VR) based interface that aims to facilitate intuitive real-time remote teleoperation while utilizing the inherent benefits of VR, including immersive visualization, freedom of user viewpoint selection, and fluidity of interaction through natural action interfaces. Vicarios aims to enhance situational awareness, using the concept of viewpoint-independent mapping between the operator and the remote scene, thereby giving the operator better control in the perception-action loop. The article describes the overall system of Vicarios, with its software, hardware, and communication framework. A comparative user study quantifies the impact of the interface and its features, including immersion and instantaneous user viewpoint changes, termed “teleporting”, on users’ performance. The results show that users’ performance with the VR-based interface was either similar to or better than the baseline condition of traditional stereo video feedback, supporting the realistic nature of the Vicarios interface. Furthermore, including the teleporting feature in VR significantly improved participants’ performance and their appreciation for it, which was evident in the post-questionnaire results. Vicarios capitalizes on the intuitiveness and flexibility of VR to improve accuracy in remote teleoperation.

visual feedback, among others [5,10,20]. In contrast, modern virtual reality (VR) interfaces, owing to the video-gaming community, have seen important technological advances in graphics engines and devices [29] in terms of immersion, the development of natural physical control devices, high-fidelity graphical renderings, and native user viewpoint changes for near-natural visualizations (termed as "teleporting"), which overcome some of the limitations mentioned above.
During teleoperation, human operators are believed to build a mental model of the affordances of the task and the remote environment, based on the available visual feedback [14]. The features of VR can facilitate the building of such mental models as it can present the spatial information in 3D as compared to the traditional 2D displays. Presence (being there) and immersion are key factors of VR, characterized by the quality of the perceptual experience of the human operator [6,36,46,47]. Greater immersion typically elicits a greater sense of presence [22,51], and thereby improves performance in teleoperation [21]. Utilizing VR in robotics, therefore, seems logical, and has been recognized as such for its inherent benefits [3,25,40]. Milgram and Ballantyne [30] have highlighted the similarity between telerobotics and VR, sharing the common structure of projecting the user to the remote environment.
Building on these characteristics, this article presents a systematic approach for utilizing VR-based user interfaces in modern telerobotic applications in hazardous environments. The objective is to improve immersion and presence in remote teleoperation, and to provide the teleporting feature so that users can overcome visual occlusions in the VR scene. We hypothesize that these two factors significantly affect the performance and utility of the interface in the real world. We present Vicarios, a VR-based interface that facilitates real-time remote robotic teleoperation, while exploiting the immersiveness and high fidelity of VR for real-time feedback. Vicarios aims to keep the user's interaction natural and intuitive, by projecting the action to (and the perception from) the remote environment, as closely as possible. The central premise of Vicarios involves: (i) the stereoscopic virtual rendering of the real remote scene, including models of the remote robotic platforms, coupled with (ii) the real-time video streaming and depth information feedback from the remote environment; (iii) teleporting in the VR environment to have a better view of the teleoperation space and overcome occlusions; and (iv) mapping of the operator gestures/actions to the remote scene/devices relative to the actual user viewpoint. The integrated Vicarios interface, seen in Fig. 1, forms the basis of a framework that is extensible and adaptable to different teleoperation scenarios. The following sections outline the contribution of this research in detail (a shorter version of this paper was presented at ICAR 2019 [34]).

Related Work
The Vicarios interface draws on prior work in bridging VR with robotic systems and enhancing visual feedback from the remote environment. Researchers have long seen the advantages of using 3D graphics to overcome limitations of 2D displays, to better assist in robot motion planning and help with operator training [3,30]. More recent investigations in human-robot interaction have focused on using affordable VR devices (e.g., [25]), integrating VR graphics engine software (e.g., Unity3D, Unreal) with compatible robotic hardware. Probably the most commonly used robot software is the Robot Operating System (ROS) [42]. Below, the most recent works in this field are discussed.
Peppoloni et al. [40] were one of the first groups to propose the concept of combining ROS-compatible robotic platforms and high performance VR interfaces for teleoperation. They combined the platforms with the Oculus Rift and the Leap Motion controller at the operator station, and the Kinect sensor for 3D point-cloud feedback from the remote scene. Subsequently, several research groups have explored VR-based teleoperation interfaces. A similar approach was adopted by [24] for a VR-teleoperation interface, using Unity3D, ROS, and ROSbridge for intercommunication. Several robot models were imported into the VR environment and interaction was implemented through gesture-tracking using the Leap Motion controller, mapping the motion in the virtual and real worlds. Lipton et al. [25] proposed a similar approach for bimanual manipulation, based on the Baxter robot, Unity3D, Razer Hydra hand trackers, and ROS. The user uses the VR world as the primary interface for interaction, and uses the real-time video stream from the real-world scene (shown in a customized window) for guidance. Another VR interface was introduced by [49] to allow users to interact with industrial robots via gestures, for trajectory planning. They allowed the users to define the robot trajectory in VR, in offline mode, to be executed by the real robot later, in a more supervisory form of interface. A more integrated approach was recently presented by Whitney et al. [50], where the imported robot meshes are rendered in Unity3D, whereas the remote scene and objects are rendered as a point-cloud from the remote depth camera. As visual feedback, the user views the robot mesh, the manipulated object from the point-cloud, and the customized video-feed received from robot-mounted cameras. Whitney et al. [50] present their innovation in importing different robot meshes and different mappings of the visualization and control of the robots.
As noted, combining ROS-enabled robotic platforms with VR graphics engines is quickly becoming a standard approach. Our contribution builds on the principles adopted by [24,25,40,50], to advance the state-of-the-art in VR-based interfaces for teleoperation. A related aspect is that most of the above studies use cameras either mounted on the remote robot and/or fixed in the scene for a third-person view, which can be limited due to occlusions. To overcome this, Rakita et al. [44] propose the idea of multiple camera viewpoints in teleoperation using multiple cameras, drawing inspiration from the computer graphics domain, where moving the user viewpoint in a virtual scene is very useful in applications such as cinematic replays and people-tracking in crowd simulations [11,17]. Our approach in this article is to integrate the native teleporting feature of VR to allow multiple viewpoints to the operator during telemanipulation. Teleporting is a commonly used technique in VR for instantaneously changing one's position virtually [18], and VR users have shown improved performance with it [12,26].
Fig. 1 The Vicarios Interface: A user teleoperating a remote robotic arm using the HTC Vive Pro VR system. The Unreal VR graphics engine provides immersive visualization. Remote camera(s) provide video and depth feedback in real-time
In this work, the Vicarios VR interface for remote robotic teleoperation is proposed, with the following contributions: (i) One-to-one stereoscopic rendering of the remote environment, including models of the remote robotic platforms, in high quality VR for immersive visualization, (ii) Real-time video streaming feedback, mono and stereo, for first-person views of the scene, and real-time point-cloud visualization for improved depth perception of the remote scene, (iii) User viewpoint changes through teleporting in VR for intuitive visualization of the remote scene, and to overcome possible occlusions in the manipulation scene, (iv) Viewpoint-independent mapping between operator gestures and remote robot motion.
The proposed interface is evaluated through a comparative user study, quantifying the effects of introducing such an interface on participants' performance in representative remote teleoperation functions. Specifically, for the immersion and presence factor, we test Vicarios against the traditional real-world stereo visual feedback condition (baseline). Further, we tested two conditions, with and without the teleporting feature, to establish the utility and impact of the feature on user performance. We used pre-defined user viewpoint positions to keep the goals simple at this initial validation stage. Finally, we discuss the extension of the Vicarios interface, integrating the teleporting feature with a moving remote camera, adapting it in VR as well as in the real world. In this article, the terms describing the framework, experiments, and results (e.g., tasks and goals) are in line with the IEEE standards referenced in [15,37].

Vicarios Platform Description
The overall setup of the Vicarios interface, seen in Fig. 1, follows a general teleoperation setup, which includes an operator with a visualization interface and a remote environment. The proposed framework, as shown in Fig. 2, is divided into three major parts: the operator site, the remote environment, and a communication network between them. The gesture/motion controllers at the operator site convey the commands to the remote robots. The remote robots and the environment are rendered virtually in the VR-based visualization interface. The communication network allows real-time data exchange between the operator site and the remote environment, i.e., sending commands and receiving the remote robot status, along with the real-time video and point-cloud information.

Vicarios Platform Implementation
A practical implementation of Vicarios was realized, based on the schema detailed in Fig. 2. The overall components are seen in Fig. 1, and described in detail in the following subsections.

Operator Site
The operator site consisted of the following components: a) Visualization hardware: an HTC Vive Pro head-mounted display unit (hereinafter referred to as HTC), with a display resolution of 2880 x 1600, a 110-degree field-of-view, and a 90 Hz refresh rate. It includes its own proprietary Lighthouse tracking system for accurate head-motion tracking. b) Visualization software: the VR-based interface was based on the Unreal graphics engine on Windows 10. The VR scene in Unreal displayed the virtual 3D models of the remote robots, rendered the remote environment virtually at 1:1 scale, and provided the real-time video feed (mono and stereo). Unreal permits importation, display and animation of the robot models, complete with their links and joints hierarchy. Robot models are imported as URDF (Unified Robot Description Format) files, based on an open C++ library [2].
A dedicated robot message controller parses messages between the remote environment and the operator site, allowing real-time, 1:1 mapping between the motions of the real and the virtual robots. c) Command interface: The operator commanded the remote scene using the HTC motion controllers (HMC), giving them the flexibility to teleoperate the remote robots in real-time using an ungrounded motion controller interface. The HMC are tracked by a laser-based external tracking system for accurate 6D pose estimation, to within 2 mm of resolution, allowing the exploration of the visualized environment as well as natural interaction with the remote robots. There are three buttons on the HMC that the operator can use for: i) triggering the clutch action to engage/disengage the motion between the HMC and the remote robot, which overcomes the motion range limitation of the HMC as compared to that of the remote robot; ii) triggering the grasp action to open/close the remote robot hand end-effector; iii) triggering the user viewpoint changes in VR as desired by the operator for teleporting.

Remote Environment
The remote environment consisted of a Universal Robot UR5 manipulator, attached with a Pisa/IIT SoftHand [1]. The UR5 is a 6 degrees-of-freedom (DOFs) manipulator, able to move at up to 1 m/s and each joint at 180 deg/s. It has a payload of 5 kg and a repeatability of ±0.1 mm. The Pisa/IIT SoftHand is an underactuated soft robotic hand with 19 joints, but uses only one actuator to activate its adaptive synergy [4,9]. The actuator is driven by commanding a grasping variable g_r ∈ [0, 1], which drives the hand to open or close. For visual feedback, up to three cameras were mounted in the remote environment, streaming real-time video and point-cloud data, including: (i) one ZED-mini stereoscopic RGB-D camera (2560 x 720 @ 60 Hz; depth range 0.1-15 m) for 3D video feed and point-cloud data; and (ii) two Intel Realsense RGB-D cameras (1280 x 720 @ 90 Hz; depth range 0.2-10 m), one mounted on the UR5 wrist for the close-up video feed and one external for point-cloud data.
The remote environment also consisted of a mounting bench for the robot platform. As a use-case in this article, a sphere (tennis ball) and a cylindrical base are also introduced for pick-and-place tasks.

Communication Network
A direct Ethernet LAN connection was used between the remote environment and the operator site. Robot Operating System (ROS) [42] served as the framework between the operator VR interface and the remote robots, implemented in Ubuntu 16.04. Unreal on Windows communicated with the remote environment devices on Linux via UDP sockets and ROSbridge [13].
Within Unreal, starting from the open source C++ library [2], the inherent ROS architecture was encapsulated in dedicated classes and Unreal Blueprints, making it more convenient to connect to a ROS server, to add ROS topics' publishers or subscribers, and to add/modify ROS messages and types. The UDP sockets allowed the streaming of videofeed and point-cloud data from the remote environment. A distinct advantage of using VR as an interface is that it does not require more communication bandwidth than its traditional counterparts (e.g., 2D monitors, mouse / keyboard input, etc). The processing of the scene-rendering and meshes is done at the operator site itself, without relying on any communication channel.
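As a rough illustration of the message traffic that passes through ROSbridge, the sketch below builds the JSON operations defined by the rosbridge protocol (`advertise`, `subscribe`, `publish`). It is not the Unreal-side implementation described above: the topic names, the `std_msgs/Float64` message type for the grasp variable, and the helper names are hypothetical and chosen only for this example.

```python
import json

# Hypothetical topic names, used only for illustration.
JOINT_STATES_TOPIC = "/joint_states"
GRASP_TOPIC = "/softhand/grasp_cmd"


def advertise(topic, msg_type):
    """rosbridge 'advertise' operation: declare a topic we will publish on."""
    return json.dumps({"op": "advertise", "topic": topic, "type": msg_type})


def subscribe(topic, msg_type):
    """rosbridge 'subscribe' operation: ask the server to forward a topic."""
    return json.dumps({"op": "subscribe", "topic": topic, "type": msg_type})


def publish(topic, msg):
    """rosbridge 'publish' operation: send one message on a topic."""
    return json.dumps({"op": "publish", "topic": topic, "msg": msg})


def grasp_command(g_r):
    """Clamp the grasping variable to [0, 1] and wrap it as a Float64-style message."""
    g_r = min(1.0, max(0.0, float(g_r)))
    return publish(GRASP_TOPIC, {"data": g_r})


if __name__ == "__main__":
    print(subscribe(JOINT_STATES_TOPIC, "sensor_msgs/JointState"))
    print(advertise(GRASP_TOPIC, "std_msgs/Float64"))
    print(grasp_command(0.8))
```

Each string would be sent as one text frame over the connection to the ROSbridge server; clamping the grasp value on the client side keeps out-of-range commands from reaching the hand.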

Communication (re-)establishment
The communication between the client (operator site) and the server (remote environment) is established using a handshake protocol via ROSbridge. The protocol involves probing the accessibility of the server ROS topics at the client, thereby confirming whether the client and server are connected. As soon as the probe is successful, i.e., the connection is established, the user is notified (through a visual cue) and their commands can be sent to the remote robot via ROSbridge. In case of a loss of communication or an unsuccessful handshake, the user is notified via a warning cue and all command topics are set to default values (e.g., zeros) to avoid any hiccups in robot motion. The loss of communication could be due to a hardware problem (e.g., disconnected cables) or a software problem. The handshake protocol is re-triggered as above to re-establish communication. If the problem persists, the complete hardware setup is investigated.
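The behaviour described above (notify on connection, warn and fall back to default commands on loss) can be sketched as a small state holder. This is a minimal sketch under stated assumptions: the probe result is passed in as a boolean rather than obtained from ROSbridge, and the class and method names are hypothetical.

```python
class LinkMonitor:
    """Tracks the client-server link state and gates outgoing commands."""

    def __init__(self, n_joints=6):
        self.connected = False
        self.default_command = [0.0] * n_joints  # safe default: zero velocities
        self.notifications = []                  # stands in for visual cues

    def on_probe(self, reachable):
        """Update the state from one probe result (True if server topics were reachable)."""
        # Notify on every state change, and on the very first probe result.
        if reachable != self.connected or not self.notifications:
            self.notifications.append("connected" if reachable else "warning")
        self.connected = reachable

    def filter_command(self, joint_velocities):
        """Pass commands through when connected, otherwise return the defaults."""
        if self.connected:
            return list(joint_velocities)
        return list(self.default_command)


if __name__ == "__main__":
    link = LinkMonitor(n_joints=3)
    link.on_probe(True)
    print(link.filter_command([0.1, 0.2, 0.3]))  # forwarded unchanged
    link.on_probe(False)
    print(link.filter_command([0.1, 0.2, 0.3]))  # replaced by zeros
    print(link.notifications)
```

In a real client, `on_probe` would be called periodically, so re-triggering the handshake after a failure is simply the next probe.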
Data Streaming
Real-time, high resolution video and point-cloud data require a high bandwidth and can suffer from latency issues. To address this, we developed a real-time video and point-cloud streaming system that has five key components: (i) data acquisition using the software drivers of the respective cameras, (ii) an encoder that compresses the acquired data (point-cloud and video) from the cameras in the remote site, (iii) a streaming algorithm to send the data over the network, (iv) a decoder that decompresses the data received at the operator site, and (v) a rendering system that plays the incoming streams. The video streaming system was developed based on the FFmpeg multimedia framework, with H.264 video compression and the Real-time Transport Protocol (RTP) for streaming. The point-cloud streaming system included a state-of-the-art encoding algorithm [27], and used Boost ASIO over a TCP socket for streaming.
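Because TCP delivers a byte stream rather than discrete messages, a point-cloud stream of this kind needs some way to mark where each compressed cloud ends. The sketch below illustrates the encode, frame, decode steps with a 4-byte length prefix. It is only an assumption-laden illustration: `zlib` stands in for the point-cloud encoder of [27], the length-prefix framing is a common convention rather than the one used in Vicarios, and the function names are hypothetical.

```python
import struct
import zlib

HEADER = struct.Struct("!I")   # 4-byte big-endian payload length
POINT = struct.Struct("!3f")   # one point: x, y, z as 32-bit floats


def encode_frame(points):
    """Compress a list of (x, y, z) points and prepend the payload length."""
    raw = b"".join(POINT.pack(*p) for p in points)
    payload = zlib.compress(raw)  # stand-in for a dedicated point-cloud codec
    return HEADER.pack(len(payload)) + payload


def decode_frames(buffer):
    """Extract every complete frame from a receive buffer.

    Returns (clouds, leftover), where leftover holds the bytes of a frame
    that has not fully arrived yet.
    """
    clouds = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer)
        end = HEADER.size + length
        if len(buffer) < end:
            break  # wait for the rest of this frame
        raw = zlib.decompress(buffer[HEADER.size:end])
        clouds.append([POINT.unpack_from(raw, i) for i in range(0, len(raw), POINT.size)])
        buffer = buffer[end:]
    return clouds, buffer


if __name__ == "__main__":
    frame = encode_frame([(0.5, 1.0, -2.25), (1.5, 0.0, 3.0)])
    clouds, leftover = decode_frames(frame + frame[:3])
    print(clouds, len(leftover))
```

The receiver appends incoming socket data to `leftover` and calls `decode_frames` again, so partial frames are handled without blocking the renderer.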

Visualization and Viewpoint-independent Motion Mapping
The Vicarios interface aims to provide smooth visualization and intuitive motion mapping for seamless coordination between the operator and remote sites.

Visualization
The visualization in the HTC of the remote scene is shown in Fig. 1. The flexibility of the VR environment is evident in the arrangement of the remote scene and the placement of the camera-feed and the point-cloud data. The VR environment can closely replicate the remote scene based on the information available about the remote environment. Depending on the application, this knowledge can be obtained either through a priori scene layout information and/or real-time capture, reconstruction, and rendering in VR of the 3D remote environment. The scene needs to be rendered in a way that the operator feels immersed in it and can focus on performing the task.
Vicarios uses a combination of a priori scene knowledge and real-time scene capture of the remote environment for the creation of the VR environment. The a priori knowledge, including the known dimensions, kinematics, and locations of the remote robot platform, the poses of the RGB-D cameras, and the mounting bench (seen in Fig. 1), helps to locate the corresponding mesh models in the VR scene. The motion of the robot model in VR is animated using the real-time joint-states information from the real robot via ROSbridge (noted in Section 3.1.3). The dynamic elements, e.g., those involved in the pick-and-place task, the sphere and the cylindrical base, are not known a priori, and are obtained through real-time capture from the remote environment using the camera feeds and point-cloud data. The point-clouds are projected in an intuitive manner within the VR scene, registered to the a priori known pose of the RGB-D cameras. The camera intrinsic parameters are used for the registration, and hand-eye calibration is performed to overcome any misalignment due to inaccuracies in parameters. The video-feeds are displayed on embedded screens positioned within the VR scene. This mixture of virtual models and the real-world point-clouds is termed as augmented virtuality with the objective of allowing users to infer the relative pose of those objects in the remote environment with respect to the actual robot pose.
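The registration step described above can be illustrated with the standard pinhole camera model: each depth pixel is back-projected into the camera frame using the intrinsic parameters, and the resulting point is then placed in the scene using the known camera pose. This is a minimal sketch, not the Vicarios implementation; the function names and the numeric intrinsics in the usage example are hypothetical.

```python
def deproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame.

    fx, fy are the focal lengths in pixels; cx, cy is the principal point.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_scene(point, rotation, translation):
    """Express a camera-frame point in the scene frame.

    rotation is a 3x3 matrix (list of rows) and translation a 3-vector,
    together giving the a priori known pose of the camera in the scene.
    """
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    p_cam = deproject(620, 240, 2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    print(p_cam)
    print(to_scene(p_cam, identity, (0.0, 0.0, 1.0)))
```

Any residual offset between the projected cloud and the rendered robot model would then be absorbed by the hand-eye calibration mentioned in the text, which refines the camera pose passed to `to_scene`.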
Therefore, in its present use-case scenario, Vicarios renders the models of a priori known objects from the remote environment that are presumably structured and fixed in terms of location. The novel contribution here is the projection of the live point-cloud data within VR, which provides the 3D pose and dimensions of the manipulated objects. This facilitates the users in combining the information from the real and virtual worlds within the same visualization paradigm, thereby enhancing the sense of immersion and presence.

Motion Mapping
As mentioned earlier, the operator commands the remote scene, i.e., the robot motion, directly using the HMC. The conventional sense-plan-act control paradigm [8] is therefore based on a human-in-the-loop design, without any assumption of autonomy (or semi-autonomy) on the part of the robots. The sensed data (i.e., video, point-cloud, etc.) from the scene is communicated to the human operator in real-time, based on which the operator plans the next step and, in turn, sends the commands for action to the robot. At this stage, the remote robots are assumed to be passive, i.e., not acting autonomously. Based on the control algorithm of choice for the robots (e.g., Jacobian, Inverse Kinematics), the calculated joint angles, velocities, and accelerations are sent to the remote robots from the HMC interface. The joint-states of the remote robots are, in turn, communicated back to the interface to update the poses of the rendered robot models. This integration means that the operator actually visualizes and commands the motion of the virtual robot. The real robot's motion is mapped 1:1 to this virtual robot motion.
Since the kinematics of the remote robot and the operator controllers are non-homothetic, the 6-DOF pose of the robot is commanded using velocity control, i.e., the velocity of the HMC is mapped to the robot, calculated using the inverse Jacobian method. The robot pose is commanded by finding the position error $e \in SE(3)$ between the current and the desired robot pose. Then, the damped least-squares solution in equation (1) iteratively finds the change in joint angles $\Delta q$ that minimizes the error $e$ [7]:
$$\Delta q = J^{T} \left( J J^{T} + \lambda^{2} I \right)^{-1} e \qquad (1)$$
where $J$ is the 6-DOF Jacobian of the robot, and $\lambda \in \mathbb{R}$ is a nonzero damping constant. Finally, a proportional controller $\dot{q} = K_p \cdot \Delta q$ sets the joint velocities to reach the desired pose. The values $\lambda = 0.001$ and $K_p = 0.6$ were obtained empirically.
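The control step described above can be sketched in a few lines. The sketch assumes the standard damped least-squares form from the inverse kinematics literature, Δq = Jᵀ(JJᵀ + λ²I)⁻¹e, followed by the proportional gain; the function names are hypothetical, and a small Gaussian elimination routine is included only to keep the example dependency-free.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def dls_step(J, e, lam=0.001):
    """Damped least-squares update for an m x n Jacobian J (list of rows)."""
    m, n = len(J), len(J[0])
    # A = J J^T + lambda^2 I  (m x m)
    A = [
        [sum(J[i][k] * J[j][k] for k in range(n)) + (lam ** 2 if i == j else 0.0)
         for j in range(m)]
        for i in range(m)
    ]
    y = solve(A, e)
    # dq = J^T y  (n-vector)
    return [sum(J[i][k] * y[i] for i in range(m)) for k in range(n)]


def joint_velocity(J, e, lam=0.001, kp=0.6):
    """Proportional controller on the DLS update: q_dot = Kp * dq."""
    return [kp * dq for dq in dls_step(J, e, lam)]


if __name__ == "__main__":
    J = [[2.0, 0.0], [0.0, 4.0]]  # toy 2x2 Jacobian, for illustration only
    print(joint_velocity(J, [2.0, 4.0]))
```

The damping term keeps the matrix being inverted well conditioned near kinematic singularities, at the cost of a slightly smaller step than the undamped pseudo-inverse would give.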

Viewpoint-independent Motion Mapping
The aforementioned approach also grants viewpoint-independent mapping between the gestures of the HMC and the motion of the remote robot. This is an important feature facilitated by the Vicarios interface, letting the operator freely change the viewpoint within the VR interface without worrying about re-mapping their gestures. It is hypothesized (and tested later in the text) that the ability to observe the remote scene (virtually) from different perspectives improves the accuracy when doing complex remote telemanipulation.
To allow this mapping, the following method is adopted. When the clutch button of the HMC is pressed, i.e., the motion between the HMC and the remote robot is engaged, the pose of the HMC, $h(t_0)$, and that of the robot, $r(t_0)$, are saved. When the user moves their hand to a new pose $h(t_1)$, the difference between $h(t_0)$ and $h(t_1)$, i.e., $\Delta h$, is calculated. $\Delta h$ is then transformed to the robot base frame, scaled, and added to $r(t_0)$ to obtain the desired new robot pose (corresponding to $h(t_1)$). Each pose is composed of an orientation matrix $\{\cdot\}_o \in \mathbb{R}^{3 \times 3}$ and a position vector $\{\cdot\}_p \in \mathbb{R}^{3 \times 1}$. The mapping to the robot frame is done by adding the desired transformation ${}^{W}\Delta h$, expressed in the global frame $W$, to the robot's pose $r(t_0)$, obtaining the new pose $r(t_1)$; here ${}^{W}R_{R}$ is the rotation matrix from the world frame $W$ to the robot frame $R$. Figure 3 demonstrates the concept with sample user motions mapped to the robot according to the chosen user viewpoint, without requiring the user to reorient their gestures/hand movements. The viewpoints can have different positions and orientations in 3D space. The above-mentioned world and robot frames are both in the Unreal engine. Commanding the real robot implies that the robot coordinate frame in Unreal needs to be transformed to that of ROS. Due to the differences in the way the Unreal and ROS platforms have been developed, for positions, the right-handed coordinate frame of ROS needs to be transformed to a left-handed one for Unreal, with a scaling factor of 100 for dimensional conformity (meters in ROS, centimeters in Unreal). Similarly, for orientations, the x- and z-axes of ROS need to be inverted for Unreal.
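The clutch-based mapping and the frame conversion can be sketched as follows. This is a simplified illustration covering positions only for the clutch step: the class and function names are hypothetical, the choice of the y-axis as the flipped axis in the handedness change is an assumption (the text only states that the handedness changes), and quaternions are assumed to be ordered (x, y, z, w).

```python
def mat_vec(R, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


class ClutchMapper:
    """Position part of the mapping: r(t1) = r(t0) + scale * R * (h(t1) - h(t0))."""

    def __init__(self, world_to_robot, scale=1.0):
        self.R = world_to_robot  # rotation from world frame W to robot frame R
        self.scale = scale
        self.h0 = None           # HMC position saved when the clutch engages
        self.r0 = None           # robot position saved when the clutch engages

    def engage(self, hmc_pos, robot_pos):
        self.h0, self.r0 = list(hmc_pos), list(robot_pos)

    def release(self):
        self.h0 = self.r0 = None

    def target(self, hmc_pos):
        """Desired robot position for the current HMC position, or None if disengaged."""
        if self.h0 is None:
            return None
        delta_h = [a - b for a, b in zip(hmc_pos, self.h0)]
        delta_r = mat_vec(self.R, delta_h)
        return [r + self.scale * d for r, d in zip(self.r0, delta_r)]


def ros_to_unreal_position(p):
    """Meters, right-handed (ROS) to centimeters, left-handed (Unreal); y flip assumed."""
    x, y, z = p
    return (100.0 * x, -100.0 * y, 100.0 * z)


def ros_to_unreal_quaternion(q):
    """Invert the x and z components of an (x, y, z, w) quaternion."""
    x, y, z, w = q
    return (-x, y, -z, w)


if __name__ == "__main__":
    rz90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
    mapper = ClutchMapper(rz90, scale=0.5)
    mapper.engage([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
    print(mapper.target([2.0, 1.0, 1.0]))
    print(ros_to_unreal_position((1.0, 2.0, 0.5)))
```

Because only the displacement since the clutch was engaged is mapped, the operator can release the clutch, reposition the hand or teleport to another viewpoint, and re-engage without causing a jump in the commanded robot pose.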
Although velocity-based teleoperation allows real-time control between non-homothetic devices, it limits the range of motion to that of the motion control device, e.g., the HMC. This issue is easily solved by using the clutch-based system, mentioned earlier. The teleoperation control between the master device and the remote robots can be coupled and decoupled as required to change the workspace of the user. This allows the user to pause and resume the robot control as desired, avoiding complicated gestures/hand movements and providing greater user comfort.

Vicarios Evaluation - User Studies
The aim with Vicarios is to establish it as an immersive interface for remote teleoperation in unstructured environments. Towards this end, user studies were conducted to understand the performance and utility of the interface, and to test the effectiveness of the different features, including teleporting and viewpoint-independent motion mapping. In particular, we performed hypothesis testing to assess the impact on the user, positive or negative, of the Vicarios VR-based interface, against the traditional video-only interface. Below, we present the details of our experimental procedures, conditions, hypotheses, and analysis metrics.

Experiment Procedure
For the evaluation studies, as stated earlier, a tennis ball and a cylindrical base were used for pick-and-place tasks. The real-time video feed was displayed on virtual screens positioned within the VR environment. The point-cloud data was projected in the VR scene in the appropriate pose, referenced by the real-world camera poses.
The participants sat on a chair in a different room than the remote environment site and controlled the robot end-effector motions using the HMC. Participants viewed the remote scene through the HTC that was fixed (as seen in the supplementary video). This was done to have consistency across the different testing conditions, having no head movements in any of them. As stated earlier, we used a pick-and-place task to evaluate the interface. A fixed stereo camera, located behind the robot base served as the default user viewpoint, and allowed assessing our proposed teleporting feature in the presence of occlusions or obfuscations. At the "Go" signal from the experimenter, starting from point "A", the participant picked up the sphere (tennis ball) from point "B" and placed it inside a cylindrical base located at point "C" (Fig. 4c). Participants released the grasped sphere, based on their judgement of the end-effector location vis-a-vis the target location. At the end of each trial, the experimenter gave a "relax" signal to the participants, indicating to bring their head back from the HTC. At this stage, the robot arm was reset to its initial pose. The sphere and cylindrical base were replaced for the next trial; their locations were randomized across trials (Fig. 5c).

Experimental Conditions
For a comparative evaluation, the trials were performed in three different conditions:
1. Real-world stereo (RW-Stereo): participants performed the experiment viewing the remote setup through the real-time stereo video from the ZED-mini camera (seen in Figs. 4a, 5a) and the wrist-mounted Realsense camera, without any VR rendering. Participants were limited to the viewpoints based on those two cameras only, simulating the more traditional teleoperation interface. Participants used the HMC to command the robot, using the clutch and grasp trigger buttons only.
2. VR without teleporting (VR-No Teleport): participants were asked to perform the experiment, visualizing the remote scene through the Vicarios interface, but with a fixed user viewpoint (seen in Fig. 5b). The viewpoint for this condition was set the same as that used in the RW-Stereo condition. Only the point-cloud data from the ZED-mini camera was shown in the VR scene (Figs. 4b, 5b) to render the objects in the scene. Participants also viewed the video feed from the wrist-mounted Realsense, as in the baseline condition. Here too, only the clutch and grasp trigger buttons on the HMC were used.
3. VR with teleporting (VR-With Teleport): in this condition as well, participants performed the experiment, visualizing the scene through the Vicarios interface. Participants used all 3 buttons on the HMC, and had the ability to change their current viewpoint, using teleporting, among four different view poses (Figs. 4b, c, d, e, 5c). At the beginning of each trial, they were asked to teleport through all 4 poses in order to remind them of the different viewpoints possible with this feature. Moreover, they were free to choose any viewpoint location at any time within the trial to perform the experiment. To cover all 4 viewpoints, in addition to the point-cloud data from the ZED-mini, which served user viewpoints tp 1,2,4 (Figs. 4b, c, e, 5c), a second Realsense camera was added in the remote scene for tp 3 (Figs. 4d, 5c). This ensured the availability of the point-cloud data at every viewpoint during teleporting.
Fig. 4 In panel c, "A" refers to the starting location, "B" refers to the picking location, and "C" refers to the placing location
Each participant repeated the experiment ten times in only one assigned condition. At the end of the experiment, participants filled in a post-experiment questionnaire, rating certain pre-determined properties of the proposed interface on a 7-point Likert scale (1: bad and 7: good).

Participants
Twenty-four participants (Age: 30 ± 7, five females) took part in our evaluation study, divided into 3 groups of 8 each for the 3 conditions (a between-participants design). All participants were right-handed and were naive to the study, having no prior experience with the Vicarios platform.

Hypotheses
We have two hypotheses: H0: we predict that participant performance in the RW-Stereo condition would be better than in the VR-No Teleport condition, due to the differences in depth cues (e.g., perspectives, lighting, shading, etc.) between the real world and VR. It is usually difficult to reproduce exactly the same effect with the cues in VR as in the real world. Any mismatch in depth cues between the real world and VR alters participants' depth perception and reduces their performance in VR [35]. H1: we predict that participant performance would significantly improve in the VR-With Teleport condition, and would be better than in the VR-No Teleport condition, owing to the teleporting feature that allows participants to overcome occlusions and/or obfuscations.

Analysis Metrics
We quantified participants' experiment completion times and their success rate as dependent variables of the experiment in all conditions. The success rate is defined as a percentage, counting the number of trials where the participants successfully picked the sphere and placed it in the target location, inside the cylindrical base. Further, we also analysed participants' ratings from the post-experiment questionnaire for certain properties of Vicarios (defined in Fig. 7).

Performance and Completion Times
For the success rates, results of the user studies, as seen in Fig. 6a, revealed that participants recorded better values in the VR-With Teleport condition as compared to the other two conditions, i.e., RW-Stereo and VR-No Teleport. Specifically, the average success rate, calculated out of a total of 80 trials per condition, was 62.50 ± 14.90% (M ± SD) in RW-Stereo. It dropped to 51.25 ± 19.60% for VR-No Teleport, but reached 90.00 ± 5.30% for VR-With Teleport. Moreover, the one-way analysis of variance (ANOVA) on participants' scores revealed a significant difference in the means of the group conditions (F(2, 21) = 15.04; p < 0.001). A Tukey's test revealed significant differences for the pairs VR-With Teleport vs. VR-No Teleport (p < 0.001) and VR-With Teleport vs. RW-Stereo (p < 0.01), but no significant difference between RW-Stereo and VR-No Teleport (p = 0.29). In other words, our hypothesis H0 is rejected, while H1 is supported. The results indicate that the multiple user viewpoints included in the last condition contributed to participants achieving the pick-and-place task more accurately. Indeed, participants teleported at least twice in the third condition in order to get a better viewpoint, especially when the manipulation scene was occluded. A closer analysis revealed that participants used viewpoints 2 and 3 (tp 2,3 in Fig. 5c) the most.
Fig. 5 Schematic of the experimental scene in different conditions. Yellow circle represents the manipulated sphere (tennis ball) in its initial location. Blue dashed circle represents the target final location (cylindrical base). Positions 1-to-5 show the possible initial and target locations, which were randomized across trials for each participant. a RW-Stereo condition setup with the ZED-mini camera (Stereo) and UR5-wrist-mounted Realsense camera (rc) for video-feeds. b VR-No Teleport condition with pcl 1 point-cloud data coming from the ZED-mini camera, rc for video-feed, and tp 1 representing the user viewpoint location in the VR environment. c VR-With Teleport condition with pcl 1,2 locations of the ZED-mini and Realsense cameras in the remote scene used to project point-cloud data. tp 1,2,3,4 represent the teleport locations
For the completion times, seen in Fig. 6b, participants recorded 18.98 ± 8.60 sec. (Mean ± confidence intervals) during RW-Stereo condition, 16.75 ± 7.00 sec. in the VR-No Teleport condition, and 29.15 ± 14.31 sec. in the VR-With Teleport condition. Evidently, participants were slower in the third condition, requiring more time when using the teleporting feature.
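For readers who wish to reproduce the kind of one-way ANOVA reported above for the success rates, the sketch below computes the F statistic and its degrees of freedom in plain Python. The numbers in the example are toy values chosen for easy verification, not the study data; the p-value and the Tukey post-hoc test are not computed here and would normally come from a statistics package (e.g., `scipy.stats.f_oneway` for the ANOVA).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of sample groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(
        len(g) * (gm - grand_mean) ** 2 for g, gm in zip(groups, group_means)
    )
    # Variability of the samples around their own group mean.
    ss_within = sum(
        (x - gm) ** 2 for g, gm in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy example with three small groups (illustrative values only).
f_stat, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

With three conditions and eight scores per condition, the same function yields the degrees of freedom (2, 21) that appear in the reported F statistic.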

Post-Experiment Questionnaire Results
Results of the post-experiment questionnaire, seen in Fig. 7, revealed the following: (i) the VR-With Teleport condition received the highest ratings for the properties of 'helpfulness of visual information' and 'sufficiency of viewpoint(s)', while the RW-Stereo condition received the lowest. This result is in accordance with the reported results for the participant performance. (ii) Participants reported that the task was more physically demanding in the VR-With Teleport condition, while the RW-Stereo condition was rated the least demanding. (iii) For 'setup affordance' (ease-of-use) and 'satisfaction with the pace of task completion', participants rated the RW-Stereo condition better than the VR conditions. (iv) For the 'control intuitiveness (ease-of-control of remote robot)' item, the VR-No Teleport condition was rated lower than the other two conditions, while for 'task easiness', it was rated better than the other two. (v) As seen in Fig. 7, the rest of the questionnaire items were rated quite similarly across conditions, including 'confidence (with the setup/task)', 'robot pace', 'cyber-sickness', 'stress', and 'hurriedness'.

Discussion
In this paper, we investigated the features and performance of the Vicarios interface against two hypotheses, H 0 and H 1 .

H 0 Rejected
Testing the hypothesis for the participants' performance in the RW-Stereo and VR-No Teleport conditions, it was observed that they were not significantly different. The task completion times recorded in the two conditions were also not significantly different. This led to the rejection of H 0 . This result was unexpected, especially since previous studies have shown that users' performance using stereo camera views from the real world is much better than through a VR interface, due to the diminished quality of the depth cues and realism in VR [35]. The rejection of H 0 might indicate that the VR environment of the Vicarios interface is able to match the real-world scene viewed through the stereo camera. The authors of [48] proposed a similar immersive VR-based robot teleoperation system, and compared it to a purely video-based teleoperation system viewed on a large screen. User performance and the post-experiment questionnaire data favoured the VR-based approach over the video-feed condition. This result is along expected lines, attributed to the lack of immersion in the video-only condition. In our case though, it is probable that the scene layout reproduced in the VR-No Teleport condition is identical to that seen in RW-Stereo. This result indicates that the immersion variable in Vicarios was adequately satisfied in the VR condition. This relates to the concepts of immersion and presence, where immersion can be defined as the technological sophistication of a particular VR system, while presence can be defined as its perceptual counterpart. More immersive technologies typically elicit a greater sense of presence from the users [22,51]. Participants' sense of presence can be increased dramatically by increasing the realism of the VR depth cues, e.g., realistic lighting, perspectives, textures, shadowing, etc. [33,45]. In fact, it has been shown that participants' performance depends on the quality of immersion and presence in VR [6].
Therefore, it can be concluded that the depth cues reproduced in VR were realistic and close to those viewed in the real-world condition (RW-Stereo).

H 1 Approved
The hypothesis tested whether the ability to change one's viewpoint in the VR-With Teleport condition, using the teleporting feature, would improve the participants' performance compared to the RW-Stereo and VR-No Teleport conditions. The feature allowed teleporting to predefined positions within VR and provided relative motion mapping for intuitive user gestures. Predefined teleporting poses were used to keep the feature simple and easy to use. The overall results demonstrated that our hypothesis H 1 was approved, confirming that this added feature significantly improved participants' performance. The lower success rates recorded for the other conditions were mainly due to the occlusions caused by the moving robot in the manipulation scene. Indeed, in the VR-With Teleport condition, participants used poses 2 and 3 (tp 2,3 in Fig. 5c) intensively to overcome the occlusions when moving the robot arm and to better visualize the target scene. However, the task completion times recorded in the VR-With Teleport condition were longer by nearly 10 sec. as compared to the other two conditions. This was mainly due to the fact that participants teleported more than once across the four poses in the VR scene (tp 1,2,3,4 in Fig. 5c).

Motion Mapping Latency
To evaluate the latency in the motion mapping (Eqs. 2 and 3), the setup was controlled for a pre-defined period of time. A cross-correlation analysis was performed between the user's Cartesian command motions (recorded from the HMC) and the UR5 end-effector Cartesian motions. The robot was set to full speed (i.e., K p set to 1.0). The measured round-trip latency between the user commanding the motion and the user visualizing that motion on the robot in VR was 80 ms, implying a 40 ms one-way latency between user command and actual robot motion. This takes into account the VR command transmission time, the robot arm controller loop, the overall network communication between the remote site and the operator site (with ROSbridge), and the rendering time for the robot in VR.
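The cross-correlation analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the signals are synthetic, the 1 kHz sampling rate and the function names are hypothetical, and the 80-sample delay is inserted by construction simply to mirror the reported round-trip value.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def estimate_lag(command, response, max_lag):
    """Lag in samples (0..max_lag) at which the response best matches the command."""
    window = len(command) - max_lag  # same number of samples for every lag
    reference = command[:window]
    scores = [
        pearson(reference, response[lag:lag + window])
        for lag in range(max_lag + 1)
    ]
    return max(range(max_lag + 1), key=scores.__getitem__)


FS_HZ = 1000      # assumed sampling rate of both recordings
TRUE_DELAY = 80   # synthetic delay in samples (80 ms at 1 kHz)


def hand_motion(k):
    """Synthetic one-axis Cartesian motion at sample index k."""
    t = k / FS_HZ
    return math.sin(2 * math.pi * 3.0 * t) + 0.5 * math.sin(2 * math.pi * 7.0 * t)


command = [hand_motion(k) for k in range(2000)]
response = [hand_motion(k - TRUE_DELAY) for k in range(2000)]

lag = estimate_lag(command, response, max_lag=200)
round_trip_ms = 1000.0 * lag / FS_HZ
print(f"round trip: {round_trip_ms:.0f} ms, one way: {round_trip_ms / 2:.0f} ms")
```

Halving the round-trip estimate, as done in the text, assumes that the command path and the visualization path contribute equally to the delay.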

Data Streaming Latency
As noted earlier, bandwidth and latency are key issues when streaming point-cloud and video data over a network. We implemented a latency measurement tool to capture the latency of the data streams. This included three parts: (i) a Chrono high-resolution clock to measure the execution time of each of the five components involved in streaming, mentioned in Section 3.1.3, (ii) the network time protocol (NTP) to synchronize the computers at the remote site and the operator site, and (iii) a barcode mechanism with frame-id and time for each acquired video frame, and a timestamped header for the point-cloud frame. At the operator site, the barcode is used to synchronize the received/decoded stream and to measure latency over the network. Keeping an HD resolution (1280 x 720) at 30 fps, the following one-way, end-to-end (remote-to-operator) latency was measured:
- ZED-mini stereo video: 180 ms
- Realsense video: 110 ms
- Point-cloud (either camera, 2 m depth, 921600 encoded points): 600 ms
The values for all three data streams are within those observed in literature for state-of-the-art systems [19]. The questionnaire analysis also shows that these latency values do not have an adverse impact on participants in terms of cyber-sickness or control intuitiveness. For Vicarios, end-to-end latency involves processing times in different stages, e.g., image / point-cloud acquisition, data compression, streaming, data decoding, and display rendering. A deeper analysis of the above end-to-end latency values is important to isolate and improve the individual components in order to optimise the overall latency. This analysis is currently under investigation.
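The measurement principle described above (a timestamp attached to each frame at the remote site, compared against an NTP-synchronized clock at the operator site, plus per-stage execution timing) can be sketched as below. This is a simplified illustration, not the Vicarios tool: the `FrameHeader` type and function names are hypothetical, and Python's `time.perf_counter` stands in for the Chrono high-resolution clock mentioned in the text.

```python
import time
from dataclasses import dataclass


@dataclass
class FrameHeader:
    """Hypothetical per-frame header carried with the stream (e.g., in a barcode)."""
    frame_id: int
    capture_time_s: float  # remote-site clock, assumed NTP-synchronized


def one_way_latency_ms(header, receive_time_s):
    """End-to-end latency, valid only if both clocks are synchronized."""
    return (receive_time_s - header.capture_time_s) * 1000.0


def timed(stage, *args):
    """Run one pipeline stage and return (result, elapsed time in ms)."""
    start = time.perf_counter()
    result = stage(*args)
    return result, (time.perf_counter() - start) * 1000.0


# A frame captured at t = 100.000 s and displayed at t = 100.180 s.
header = FrameHeader(frame_id=1, capture_time_s=100.000)
print(f"{one_way_latency_ms(header, 100.180):.0f} ms")

# Timing a stand-in "compression" stage (here just a sum over dummy data).
_, stage_ms = timed(sum, range(100_000))
print(f"stage took {stage_ms:.3f} ms")

# One point per pixel of an HD frame gives the point count quoted in the text.
print(1280 * 720)
```

Per-stage timings obtained this way are the kind of breakdown needed to attribute the end-to-end values to acquisition, compression, streaming, decoding, and rendering.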

Questionnaire Analysis
For the post-experiment questionnaire, we discuss here the items for which notable differences were recorded among the conditions.
- Participants rated the VR-With Teleport condition highly for the questions concerning viewpoints and felt that the available user viewpoints were more helpful, as compared to the other two conditions. For sufficiency of viewpoint(s), they rated the RW-Stereo condition the lowest.
- Participants found the tasks to be more physically demanding (physically less demanding) and more difficult (task easiness) in the VR-With Teleport condition, as compared to the other two conditions, although the experiment was the same across conditions, except for the teleport feature added in the third condition. Recall that the only difference here is the addition of a button for changing the user viewpoint. This is explained in light of previous studies, which show that teleporting techniques can create cognitive overload on participants [38]. On the other hand, it is clear that teleporting motivated participants to be more careful during task execution, which resulted in higher success rates for the VR-With Teleport condition.
- In all 3 conditions, participants expressed their satisfaction with the affordances of the setup (ease-of-use) and felt confident in using the interface (setup affordance and confidence).
- The robot control was rated to be similarly intuitive (control intuitiveness) for the RW-Stereo and VR-With Teleport conditions, while the VR-No Teleport condition was rated lower. We recall that this is a 'between participants' study. This result therefore implies that the introduction of Vicarios, especially with teleporting, permits the participants to control the remote robot just as intuitively and transparently as they do without VR. This detail follows from the earlier discussed rejection of hypothesis H 0 as well.
- In all 3 conditions, participants were satisfied with their pace (satisfaction with your pace) and felt that the robot was slow (robot pace). Indeed, the robot was run only at 60% speed during the experiment for safety reasons. This is not a hard constraint and can be removed as desired depending on the user's expertise.
- For hurriedness, participants gave an average rating in all 3 conditions, implying that the visualization interface was not a factor in their sense of hurriedness in task completion. On the other hand, they felt comfortable in using the VR-based interface (no cyber-sickness and no stress).
Overall, based on the participants' performance and the post-questionnaire results, the Vicarios interface is evaluated positively, including for the teleporting feature. Nevertheless, one pertinent question concerns the placement of the remote cameras. Here, we used only 3 fixed camera locations, but the locations might differ depending on the goal and/or the remote environment itself. Put another way, how realistic would the teleporting feature be in a real teleoperation scenario, and would it be possible to extend this approach so that the remote camera locations can be easily changed to adapt to the user's desired viewpoint, the goal requirements, and the cluttered remote environment? These questions are addressed next.

Teleporting in VR-based Teleoperation with a Motion-Capable Remote Camera
In the current study, the VR-With Teleport condition used 2 fixed external remote cameras, the ZED-mini and the realsense, and 1 wrist-mounted camera (Fig. 5c) to obtain a sufficient perspective of the remote scene in the video and point-cloud data. Participants were asked to teleport only to predefined user viewpoints (four possible locations) in order to keep the function simple and to better control the experimental variables. The results from the task performance and the questionnaire provided some vital clues as to how the practical setup for this condition in Vicarios could be improved. Outside the VR context, the authors of [44] developed a method to improve the traditional teleoperation visual feedback by providing users with an effective real-time camera viewpoint using a second camera-in-hand robot arm. In the case of Vicarios, the viewpoint-independent motion mapping feature potentially allows a user to choose any user viewpoint in VR to visualize the remote scene. The limitation here would be the number of different camera viewpoints available for visual information coming from the real world, i.e., videos and point-clouds. Taking inspiration from [44] towards integrating the changing camera viewpoint idea, we propose the following robotic system that allows for the teleporting feature in a plausible teleoperation scenario (Fig. 8). The part that is added to the Vicarios platform includes a robotic arm (Franka Emika), with the external ZED-mini mounted on its end-effector. This makes the ZED-mini motion-capable (the camera-robot), allowing it to change its viewpoint in the remote scene and removing the need for the external realsense. The teleport locations can be easily modified based on the requirements. Moreover, different user modes (e.g., expert vs. regular) can be implemented, where only certain users can control the moving camera's locations. This setup has already been implemented (seen in the supplementary video); its evaluation is part of future work.
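One possible way to organize modifiable teleport locations together with user modes is sketched below. It is purely illustrative: the class names, pose fields, and coordinates are hypothetical and do not describe the actual Vicarios implementation. Each named teleport location maps to a camera-robot pose, any user can teleport among the existing locations, and only an expert user may redefine them.

```python
from dataclasses import dataclass
from enum import Enum


class UserMode(Enum):
    REGULAR = "regular"
    EXPERT = "expert"


@dataclass(frozen=True)
class CameraPose:
    """Hypothetical camera-robot pose (position in metres, yaw in degrees)."""
    x_m: float
    y_m: float
    z_m: float
    yaw_deg: float


class TeleportManager:
    """Maps named teleport locations to camera-robot poses."""

    def __init__(self, viewpoints):
        self._viewpoints = dict(viewpoints)
        self.active = None

    def teleport(self, name):
        """Select a predefined viewpoint; returns the pose for the camera-robot."""
        pose = self._viewpoints[name]  # raises KeyError for unknown locations
        self.active = name
        return pose

    def set_viewpoint(self, name, pose, mode):
        """Add or move a teleport location; restricted to expert users."""
        if mode is not UserMode.EXPERT:
            raise PermissionError("only expert users may change camera locations")
        self._viewpoints[name] = pose


manager = TeleportManager({
    "tp1": CameraPose(0.0, -1.0, 0.8, 90.0),
    "tp2": CameraPose(1.0, 0.0, 0.8, 180.0),
})
print(manager.teleport("tp2"))
manager.set_viewpoint("tp3", CameraPose(0.0, 1.0, 0.8, -90.0), UserMode.EXPERT)
```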

Summary
The Vicarios interface forms the basis of a software and control framework for intuitive remote robotic teleoperation for human-in-the-loop tasks. High-fidelity visualization and perception are key components of teleoperation, and an immersive interface provides inherent benefits in this regard. As stated earlier, the article builds on recent research approaches and contributes the integration of multiple features: viewpoint-independent motion mapping based on the current point-of-view, real-time video and point-cloud data from the remote environment to capture information (pose, dimensions, etc.) about dynamic elements, and the teleporting feature for intuitive and immersive execution. These features have been shown to improve users' performance in tele-manipulation tasks. The questionnaire results demonstrate the improvement to the operator's situational awareness that Vicarios offers, allowing consistent execution of remote teleoperation. Furthermore, we proposed an extension to the Vicarios framework to allow for an intuitive coupling between the user-driven teleporting in VR and a motion-capable remote camera, granting the possibility to overcome occlusions in the manipulation scene.
A note on ethical aspects of Vicarios: the key insight here is that the robotic platform in the remote environment is not autonomous, and is not interacting with other humans / living objects. Further, Vicarios neither requires nor gathers personal information about the human operators. Since the system is based on the human-in-the-loop paradigm, the robot cannot act on its own, i.e., it cannot do something entirely different from, or detrimental to, what the human is commanding it to do. The improvement in situational awareness envisaged with Vicarios is indeed for the human in the loop, and not for the robot. Vicarios is introduced here as a telerobotics interface for hazardous environments, where the human operator is always in charge. As noted in literature [31,41], replacing human workers with robots in hazardous operations can lead to improvements in workplace safety. We believe that robots replacing humans is not always unethical, especially when contrasted with the goal of saving worker lives.

Limitations at the Current Stage
Although Vicarios contributes several new features that are helpful to the operator, there are some limitations to the framework at this stage that we intend to address in our future work. For instance, Vicarios was not tested when rendering perception modalities other than vision (i.e., haptic and audio), to determine their impact on users' performance [10]. This would open new perspectives in transmitting users' commands, such as voice or gesture control, as well as sensory substitution in some cases. Another limitation is that we have tested the system with only a few selected viewpoints while visualizing the virtual environment. Greater freedom in viewpoint selection can help as well as hinder performance [43]. Finally, latency is an inevitable problem in teleoperation interfaces that profoundly alters the operators' performance. We have included an overall latency analysis in this article, and we intend to delve further to understand sub-component latency as well. Cloud-based solutions in telerobotics [16] can offer a promising alternative, allowing complex computation to be offloaded to cloud servers. The bandwidth and latency outcomes will need to be understood for such an approach.

Fig. 8 Vicarios extension with the additional camera-robot (Franka Emika) arm to achieve changes in camera viewpoints to adapt to the teleporting feature. a Camera-robot pose 1, b Camera-robot pose 2, c Camera-robot pose 3 show the hardware setup at the remote site. d VR-teleport view 1, e VR-teleport view 2, f show the actual user view within the interface. The red-bordered panel shows the view of the UR5-wrist-mounted camera, while the yellow-bordered panel shows the view from the camera-robot

Conclusions and Future Work
In this paper, the Vicarios VR interface was presented, forming the basis for an immersive and intuitive remote robotic teleoperation interface. It integrates: (i) the Unreal graphics engine for high-quality VR rendering of robotic platforms for manipulation in remote environments, viewed through the HTC system; (ii) real-time video and point-cloud streaming, with RGB-D cameras for real-time perception and remote environment update; (iii) user viewpoint changes using the teleporting feature in VR to overcome occlusions; and (iv) viewpoint-independent mapping between operator gestures and the remote robot manipulator. All these features of the Vicarios interface allow the user to freely explore the VR scene to better understand, view, locate, and interact with the remote environment, with the aim of making it adaptable for demanding telerobotic domains, e.g., disaster response, nuclear decommissioning, telesurgery, etc.
As next steps, Vicarios shall be evaluated with respect to users' performance and learning, when allowing them to control the manipulator and the motion-capable camera during task execution. With dynamic unstructured remote environments, obtaining a priori information can be difficult, if not impossible. To address this, further evolution of the real-time capture and rendering of the remote environment in VR shall also be investigated. Finally, we will aim to implement and test the Vicarios platform when communicating via a wireless connection (e.g., 5G).

Declarations
Ethical Approval The testing and experimental procedures were approved by the "Ethics Committee of Liguria Region" (Italy), in accordance with the guidelines of the Declaration of Helsinki for research involving human participants.

Consent to Participate Informed written consent was obtained from all participants involved in the study.

Consent to Publish Individuals appearing in the images/videos of the current submission gave their consent for publication.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.